Designing for API rate limits before they bite
Quick answer: rate-limit failures usually come from one of four sources: a backfill that hammers an endpoint that’s normally quiet, a thundering herd at the top of the hour, retries cascading after a partial outage, or another customer’s integration sharing the same vendor account quota. The design patterns that survive: respecting the rate-limit headers, exponential backoff with jitter, persistent queues that absorb spikes, and explicit budget management for shared quotas.
API rate limits are documented. They’re also the reason a meaningful share of integration outages happen. The gap between the two is mostly about designing as if the rate limit were a real constraint — which most integrations don’t.
The four ways rate limits actually bite
1. The backfill. An integration that normally handles 200 requests/hour suddenly needs to import three years of historical data. The endpoint that’s normally quiet is now seeing 50,000 requests in an afternoon. Either the integration gets blocked, or it succeeds — at the cost of every other integration sharing the same quota.
2. The thundering herd. A reporting job runs at the top of every hour. So does every other reporting job for every other customer of the same SaaS platform. At 9:00:00 sharp, the platform’s API is being hammered by everyone simultaneously, and the rate limit is hit collectively even though no single integration is misbehaving.
3. The cascading retry. The destination API has a brief 5xx outage. Your integration retries. So does every other integration. When the API recovers, it’s immediately overwhelmed by the catch-up traffic, returning 429 errors. Your retry logic, if naive, makes this worse.
4. The shared-account quota. Many SaaS platforms apply rate limits at the account level, not the integration level. If your business runs three integrations against the same HubSpot or Salesforce account, they share the quota. When one misbehaves, the others fail.
Pattern 1: respect the rate-limit headers
Most modern APIs return rate-limit information in response headers. The exact header names vary by vendor, but the canonical set is:
- X-RateLimit-Limit: total quota for the window
- X-RateLimit-Remaining: how many requests remain
- X-RateLimit-Reset or Retry-After: when the window resets, or how long to wait
GitHub’s rate-limit headers are a good reference; the patterns generalise across vendors.
The minimum integration behaviour:
- Read the headers from every response
- If Remaining falls below a threshold (say 10% of Limit), slow down voluntarily
- If you receive a 429, respect Retry-After exactly — don’t guess
Slowing yourself down before the platform forces you to is dramatically cheaper than getting blocked.
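A minimal sketch of that behaviour, assuming a requests-based client and the canonical header names above (the threshold and header names are illustrative; check your vendor’s exact conventions):

```python
import time
import requests

SLOWDOWN_THRESHOLD = 0.10  # start pacing when only 10% of the window remains

def call_with_header_awareness(url, session=None):
    """Make one request, reading the rate-limit headers before deciding what to do next."""
    session = session or requests.Session()
    response = session.get(url)

    # On a 429, honour Retry-After exactly rather than guessing.
    # (Retry-After is usually seconds; some vendors send an HTTP date instead.)
    if response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", "60"))
        time.sleep(retry_after)
        return session.get(url)

    # Voluntary slowdown: if remaining quota drops below the threshold,
    # wait out the rest of the window instead of forcing the platform to block us.
    limit = int(response.headers.get("X-RateLimit-Limit", 0))
    remaining = int(response.headers.get("X-RateLimit-Remaining", limit))
    reset_at = int(response.headers.get("X-RateLimit-Reset", 0))  # epoch seconds on many APIs

    if limit and remaining < limit * SLOWDOWN_THRESHOLD:
        time.sleep(max(0, reset_at - time.time()))

    return response
```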
Pattern 2: exponential backoff with jitter
When a request fails for a transient reason (5xx error, 429 rate limit, network timeout), retry — but with backoff that doesn’t make the problem worse.
The pattern:
- Retry 1: wait 1 second
- Retry 2: wait 2 seconds
- Retry 3: wait 4 seconds
- Retry 4: wait 8 seconds
- And so on, capped at some maximum
Add jitter: a random component to each wait. Without jitter, all your integrations retry at the same intervals after a shared outage, creating exactly the thundering-herd problem you’re trying to avoid.
wait = min(max_wait, base * 2^attempt) * (0.5 + random_0_to_1)
This is the boring, correct answer. It’s also the answer most naive retry implementations skip.
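A sketch of that formula in Python; `do_request` is a stand-in for whatever function fires the actual call:

```python
import random
import time

def backoff_with_jitter(attempt, base=1.0, max_wait=60.0):
    """Wait time for a retry attempt: capped exponential growth plus jitter.

    Mirrors the formula above, min(max_wait, base * 2^attempt) * (0.5 + random_0_to_1),
    so concurrent clients retrying after the same outage spread out instead of
    stampeding back at identical intervals.
    """
    return min(max_wait, base * (2 ** attempt)) * (0.5 + random.random())

def call_with_retries(do_request, max_attempts=5):
    """Retry transient failures (429s, 5xx, timeouts) with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            response = do_request()
            if response.status_code < 500 and response.status_code != 429:
                return response  # success, or a non-retryable client error
        except (ConnectionError, TimeoutError):
            pass  # treat as transient and fall through to the wait
        time.sleep(backoff_with_jitter(attempt))
    raise RuntimeError("request still failing after retries")
```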
Pattern 3: persistent queue between event and API call
The most robust pattern for any integration that fires API calls in response to events:
- Receive event → enqueue
- Worker dequeues at controlled rate
- Each call observes rate limits before firing
- On 429, the worker pauses, the queue holds
This pattern absorbs spikes. A backfill of 50,000 events doesn’t generate 50,000 simultaneous API calls — it produces a queue that drains at whatever rate the destination API tolerates. The same pattern handles the thundering herd: even if 50,000 events arrive at the top of the hour, the queue smooths the outflow to a sustainable rate.
Without a queue, every spike becomes a cascade of failed and retried calls. With a queue, every spike becomes a longer-than-usual processing window with no failures.
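A sketch of the worker side of that loop, using an in-memory queue as a stand-in for the durable queue (Redis, SQS, or similar) a production integration would use; `send` is a placeholder for the function that makes the API call:

```python
import queue
import time

def worker(events: "queue.Queue", send, max_per_second: float = 5.0):
    """Drain the queue at a controlled rate; on a 429, re-queue the event and pause."""
    interval = 1.0 / max_per_second
    while True:
        event = events.get()              # blocks until there is work
        response = send(event)            # `send` fires the actual API call
        if response.status_code == 429:
            events.put(event)             # hold the event; the queue absorbs the spike
            time.sleep(int(response.headers.get("Retry-After", "30")))
        events.task_done()
        time.sleep(interval)              # steady outflow regardless of inflow spikes
```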
Pattern 4: token-bucket budgeting for shared quotas
When multiple integrations share the same account-level rate limit, the right pattern is explicit budget management.
Each integration is assigned a portion of the total quota. A central rate-limiting layer (Redis-backed token bucket, or similar) enforces the allocation. No integration can consume more than its share, regardless of how badly it’s behaving.
The allocation can be:
- Static: integration A gets 60% of the quota, B gets 30%, C gets 10%
- Priority-weighted: critical integrations get more of the quota during contention
- Time-windowed: bulk jobs only run outside business-critical hours
This pattern is especially valuable when integrations are owned by different teams, or when one of them is operated by an external vendor that doesn’t care about your other integrations’ quota.
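A sketch of the idea with an in-process token bucket; the integration names and percentages are illustrative, and a real deployment would typically back the bucket with Redis so every worker shares one view of the budget:

```python
import threading
import time

class TokenBucket:
    """A simple in-process token bucket enforcing one integration's share of the quota."""

    def __init__(self, rate_per_second, capacity):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last call, up to capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Static allocation of a shared 100-requests/second account quota (hypothetical split).
buckets = {
    "crm_sync":  TokenBucket(rate_per_second=60, capacity=60),   # 60% of the quota
    "reporting": TokenBucket(rate_per_second=30, capacity=30),   # 30%
    "backfill":  TokenBucket(rate_per_second=10, capacity=10),   # 10%
}

def send_if_budgeted(integration, do_request):
    """Each integration can only spend its own share, however badly it misbehaves."""
    if buckets[integration].try_acquire():
        return do_request()
    return None  # caller re-queues the work rather than bursting past the allocation
```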
Pattern 5: scheduling to avoid clustering
Don’t schedule jobs at the top of the hour. Don’t schedule them at common round numbers (0, 15, 30, 45 minutes). Schedule them at deliberately offset times: 7 minutes past, 23 minutes past, 41 minutes past.
This is a tiny change with outsized impact. The default behaviour of most cron systems and most developers is round numbers. Avoiding the herd by even a few minutes puts you outside the contention window and avoids the worst of the rate-limit pressure.
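One way to pick the offset without coordinating it by hand is to derive it from the job’s name, so each job lands on a stable but effectively arbitrary minute (a sketch; the job name is hypothetical):

```python
import hashlib

def offset_minute(job_name: str) -> int:
    """Derive a stable, evenly spread minute-of-the-hour for a scheduled job,
    so the fleet of jobs avoids clustering at :00, :15, :30, :45."""
    digest = hashlib.sha256(job_name.encode()).hexdigest()
    return int(digest, 16) % 60

# Use the result when building the schedule entry for each job/customer,
# e.g. run the hourly report for this customer at that minute past the hour.
print(offset_minute("hourly-report-acme"))
```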
For backfills specifically: schedule them outside business hours wherever possible, and use a deliberately throttled worker rather than the default “process as fast as possible” behaviour.
Pattern 6: pre-flight checks for backfills
Before starting a backfill, check:
- The total number of records to be processed
- The relevant rate limit
- The expected duration at full speed
- Whether other integrations need quota during the backfill window
If the maths says “this backfill will use 80% of our daily quota and take 6 hours”, the right answer is to break it into chunks across multiple days, not to run it as a single operation and hope.
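A sketch of the arithmetic behind that check; the function name and the numbers in the usage example are illustrative, not drawn from any particular vendor’s limits:

```python
import math

def backfill_preflight(total_records, requests_per_record, quota_per_day, quota_share=0.5):
    """Estimate what a backfill costs before it starts, and whether it needs chunking.

    quota_share is the fraction of the daily quota the backfill is allowed to consume,
    leaving headroom for the other integrations on the same account.
    """
    total_requests = total_records * requests_per_record
    budget_per_day = quota_per_day * quota_share
    days_needed = math.ceil(total_requests / budget_per_day)
    return {
        "total_requests": total_requests,
        "daily_budget": budget_per_day,
        "days_needed": days_needed,
        "records_per_day": math.ceil(total_records / days_needed),
    }

# Three years of history, one API call per record, against a 100k-requests/day account quota:
print(backfill_preflight(total_records=250_000, requests_per_record=1, quota_per_day=100_000))
# -> 250k requests against a 50k/day budget: 5 days, roughly 50k records per day.
```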
The integration tooling we deploy includes a backfill-preview step that surfaces this calculation before the work starts. That single check has avoided more incidents than any retry logic.
What good observability looks like
For any integration with non-trivial volume, the rate-limit-relevant signals to monitor:
- Requests per minute, per endpoint
- 429 response rate
- Quota consumed as a percentage of allowance, ideally with the vendor’s reported limit headers
- Worker queue depth (a growing queue often means rate-limited writes upstream)
- Time since last successful call, alerted if it exceeds expected gap
These aren’t exotic metrics. They’re the difference between catching rate-limit pressure as it builds and discovering it after the integration has been blocked for a day.
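A sketch of how those signals might be derived from a single response plus worker state, before being handed to whatever metrics and alerting stack you already run (header names and the staleness threshold are assumptions):

```python
import time

def rate_limit_signals(response, queue_depth, last_success_ts, expected_gap_seconds=300):
    """Compute the rate-limit-relevant signals for one integration at one moment."""
    limit = int(response.headers.get("X-RateLimit-Limit", 0) or 0)
    remaining = int(response.headers.get("X-RateLimit-Remaining", 0) or 0)
    since_success = time.time() - last_success_ts
    return {
        "is_429": response.status_code == 429,
        "quota_used_pct": 100 * (limit - remaining) / limit if limit else None,
        "queue_depth": queue_depth,                       # growing depth often means throttled writes
        "seconds_since_last_success": since_success,
        "stale": since_success > expected_gap_seconds,    # alert when the expected gap is exceeded
    }
```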
What this means for vendor selection
Rate-limit pressure is one of several common ways integrations break in production; it tends to compound with the others.
When evaluating a SaaS platform that you’ll integrate with:
- Read the rate limit documentation carefully. Vendors that document limits clearly tend to enforce them predictably. Vendors who don’t publish limits often have undocumented ones that bite.
- Test at expected production volume. Many integrations work fine at the volume you tested with and break at the volume you’ll actually run.
- Ask about higher-tier limits. Most platforms offer raised limits on enterprise tiers. If you anticipate hitting limits, knowing the upgrade path matters.
- Understand the quota scope. Per-user? Per-account? Per-app? The scope dictates how much risk you carry from other integrations on the same account.
Common questions
What is API rate limiting? A constraint imposed by the API operator on how many requests a client can make in a given time window. Designed to prevent any single client from exhausting shared infrastructure capacity. When exceeded, the API returns 429 errors and refuses further requests until the window resets.
How do I avoid hitting API rate limits? Read the rate-limit headers, slow down voluntarily before the limit, use exponential backoff with jitter on errors, queue events between receipt and processing, schedule jobs to avoid clustering, and explicitly budget quota when multiple integrations share an account.
What should I do when a 429 error happens?
Respect the Retry-After header exactly. Don’t guess. Pause the worker, hold the event in the queue, and resume after the specified delay. If 429s become persistent rather than occasional, the integration design needs revisiting — not the retry logic.
How long does a rate-limit window last?
Vendor-dependent. Common windows are 1 minute, 1 hour, or 24 hours. The relevant header (Retry-After or X-RateLimit-Reset) tells you when the window resets for that specific endpoint.
Can I request a higher rate limit? Often yes, especially on paid tiers. Most platforms negotiate higher limits for legitimate use cases. The catch: don’t request a higher limit as a substitute for fixing a misbehaving integration. Vendors notice, and the conversation gets harder.
If you’ve got an integration that’s hitting rate limits and you’re not sure whether the fix is more quota or better design, start a project for an audit. The fix is usually design, and it’s usually cheap.
More reading
What AI actually costs to run in production
AI demos are cheap. Production is not. Where the money actually goes when you ship an AI feature, and how to size the engineering investment around the model.
Integrations: Why integrations break in production (and what to design for)
Every integration that "just calls an API" eventually breaks. The five places they fail first, and the design patterns that keep them running unattended.
Strategy: The hidden costs of SaaS once your business is established
The per-seat licence is the visible cost. Integration tax, lock-in, configuration drift, and the seat tax at scale are the SaaS costs no one quotes up front.