Integrations
Why integrations break in production (and what to design for)
Quick answer: integrations between business systems most commonly fail at five points — API contract drift, surprise rate-limit hits, expired auth tokens, unreliable webhooks, and the absence of observability to notice any of the above. Designing for these up front is most of the difference between an integration that runs unattended and one that needs babysitting.
The most fragile thing in most modern businesses isn’t a single piece of software. It’s the layer between them. When something goes wrong in a stack of ten SaaS tools, it’s rarely the tools that broke — it’s the integrations between them.
We’ve spent a lot of time cleaning up integrations that someone else built, and a fair bit building integrations that survive the day Stripe changes their API or Salesforce rotates their tokens. The same five things break first, every time. Designing for them up front is most of the difference between an integration that runs unattended for years and one that needs babysitting every week.
Why “it’s just an API call” isn’t
The phrase that precedes most broken integrations is “it’s just an API call.”
A real integration between two business systems is rarely a single API call. It’s a sequence of:
- Authenticating against the source system (with credentials that expire)
- Reading data (subject to pagination, rate limits, and consistency)
- Transforming it (often with non-trivial business logic)
- Authenticating against the destination system (with different credentials, also expiring)
- Writing data (idempotently, so retries don’t double-create)
- Handling partial failure (because something will always be missing or stale)
- Logging enough that you can debug it three months from now
- Alerting when it’s broken in a way you actually notice
Every step is a place to fail. The five most common are below.
1. API contract drift
APIs change. Sometimes vendors version the change cleanly (v1 stays, v2 is new). Sometimes they don’t.
What we see in practice:
- A field that used to be a string becomes an object with `{ value, currency }` — existing integrations now break on JSON parsing
- An optional field becomes required (or vice versa)
- A response shape that used to return one record returns an array
- Pagination behaviour changes — what used to be unbounded is now capped at 100
- Webhook payload structure changes without notice
The defence is to build integrations expecting the shape to change. In practice that means:
- Validate every response against a schema you control, not against the shape you assumed
- Fail loudly on schema mismatch, before bad data reaches the next system
- Log the actual response payload alongside any failure so you can see what changed
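To make that concrete, here is a minimal sketch of the pattern in a Python worker. The invoice shape and field names are hypothetical; the point is that the expected shape lives in code we control, and anything that stops matching it fails loudly with the raw payload attached.

```python
from dataclasses import dataclass

# Hypothetical shape we expect back from a vendor's invoices endpoint.
# The schema lives in our code, so a change on the vendor's side fails loudly
# here instead of silently flowing bad data into the next system.

@dataclass
class Invoice:
    id: str
    amount: float   # if the vendor turns this into {"value": ..., "currency": ...}, we want to know
    currency: str


class SchemaMismatch(Exception):
    pass


def parse_invoice(payload: dict) -> Invoice:
    try:
        return Invoice(
            id=str(payload["id"]),
            amount=float(payload["amount"]),   # TypeError if it's now an object, KeyError if renamed
            currency=payload["currency"],
        )
    except (KeyError, TypeError, ValueError) as exc:
        # Fail loudly, with the raw payload in the message so you can see exactly what changed.
        raise SchemaMismatch(f"unexpected invoice shape: {payload!r}") from exc
```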
Out-of-the-box no-code tools like Zapier and Make tend to silently accept whatever they get back. That’s fine for low-stakes flows. For anything where a wrong field can corrupt a customer record, it’s why we usually move clients to coded integrations once the flow becomes business-critical.
2. Rate limits, surprise edition
Most APIs publish their rate limits. Most integrations are designed assuming those limits won’t be hit.
The patterns where they get hit anyway:
- Backfills (importing historical data — suddenly you’re hammering an endpoint that’s normally quiet)
- Batch syncs (everyone runs them at the top of the hour, including the API’s other customers)
- Retries cascading after a partial outage (the system catches up by sending ten times the normal volume)
- A new feature launches and uses the same endpoints (your integration shares the limit with somebody else’s integration in the same vendor account)
What good integrations do about it:
- Respect the rate limit headers. Most APIs return `X-RateLimit-Remaining` and `Retry-After`. Use them. Slowing down is cheaper than getting blocked.
- Use exponential backoff. Linear retries amplify the problem. Exponential backoff with jitter is the boring, correct answer.
- Queue, don’t fire-and-hope. A persistent queue between “event happens” and “API call fires” absorbs spikes. Without it, every spike becomes a cascade of failed calls.
- Avoid top-of-hour clustering. Stagger scheduled jobs. Top-of-hour is a thundering herd.
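A minimal sketch of the first two points, assuming a vendor that answers rate-limited calls with a 429 and a `Retry-After` header in seconds (common, though not universal), using Python's requests library:

```python
import random
import time

import requests


def get_with_backoff(url: str, headers: dict, max_attempts: int = 6) -> requests.Response:
    """Fetch a URL, honouring Retry-After and backing off exponentially with jitter.

    Sketch only: assumes the vendor signals rate limiting with a 429 and a
    Retry-After header in seconds, which is common but not universal.
    """
    for attempt in range(max_attempts):
        response = requests.get(url, headers=headers, timeout=30)

        if response.status_code == 429:
            # Prefer the vendor's own instruction if it sends one.
            retry_after = response.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)
            else:
                # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of noise,
                # so retries from many workers don't all land at the same instant.
                delay = (2 ** attempt) + random.random()
            time.sleep(delay)
            continue

        response.raise_for_status()
        return response

    raise RuntimeError(f"gave up after {max_attempts} attempts: {url}")
```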
GitHub’s rate-limit documentation is a good reference for how mature platforms structure their limits and the headers good integrations watch for — the patterns generalise.
[INSERT: a real example — e.g. “We rebuilt a [client]’s [Salesforce/HubSpot/Klaviyo] integration after a backfill saturated their API quota for 14 hours, blocking every other customer-facing flow that depended on the same endpoint.”]
3. Auth tokens and the rotation problem
Most modern APIs use OAuth tokens. Tokens expire. Refresh tokens also expire, just less often. When either expires unattended, the integration silently stops working — usually at 2am on a long weekend.
What goes wrong specifically:
- Refresh tokens that haven’t been used in 90 days are revoked by the platform
- A user who authorised the integration leaves the company; their token is killed; the integration that depended on their identity dies
- The integration was built against a personal account “for testing” and never moved to a service account
- API key rotation policies get enforced retroactively and old keys stop working
What good integrations do:
- Run as a service account (or platform-level connection), not as a real human user
- Refresh tokens proactively before they expire, not reactively when a call fails
- Monitor token health as a first-class signal in observability, separate from API call success rate
- Document the renewal process so the next person to inherit the system can rotate credentials safely
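A minimal sketch of proactive refresh, assuming a standard OAuth 2.0 refresh_token grant. The token URL and credential fields are placeholders for whatever the vendor actually expects:

```python
import time

import requests

# Placeholder endpoint; in reality this is the vendor's documented token URL.
TOKEN_URL = "https://example-vendor.com/oauth/token"
REFRESH_MARGIN_SECONDS = 300  # refresh 5 minutes before expiry, not after a failed call


class TokenStore:
    def __init__(self, client_id: str, client_secret: str, refresh_token: str):
        self.client_id = client_id
        self.client_secret = client_secret
        self.refresh_token = refresh_token
        self.access_token = None
        self.expires_at = 0.0

    def get_valid_token(self) -> str:
        # Proactive refresh: anything inside the margin is treated as already expired.
        if self.access_token is None or time.time() > self.expires_at - REFRESH_MARGIN_SECONDS:
            self._refresh()
        return self.access_token

    def _refresh(self) -> None:
        response = requests.post(TOKEN_URL, data={
            "grant_type": "refresh_token",
            "refresh_token": self.refresh_token,
            "client_id": self.client_id,
            "client_secret": self.client_secret,
        }, timeout=30)
        response.raise_for_status()
        payload = response.json()
        self.access_token = payload["access_token"]
        self.expires_at = time.time() + payload.get("expires_in", 3600)
        # Some providers rotate the refresh token on every refresh; keep the new one if so.
        self.refresh_token = payload.get("refresh_token", self.refresh_token)
```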
4. Webhook unreliability
Webhooks are how most modern integrations get told that something happened. They’re also where a lot of integrations quietly lose data.
What goes wrong with webhooks specifically:
- The vendor sends, your endpoint is briefly down, the vendor doesn’t retry
- The vendor does retry, you process the same event twice, you double-create a record
- Out-of-order delivery: an “updated” event arrives before the “created” event for the same record
- Verification failures: the signature header changes, you reject all your own webhooks
- Vendor decides to deprecate a webhook event with 30 days’ notice; your integration silently goes blind to that event
The patterns that survive:
- Idempotency keys on every write. Receiving the same webhook twice should never double-create.
- A persistent queue between webhook receipt and processing — so even if your processor is briefly down, the event is held safely.
- Reconciliation jobs that periodically diff the source system against your local copy. Webhooks are the fast path; reconciliation is the safety net.
- Verification of webhook signatures in code, with monitoring on rejected webhooks (a sudden spike in rejections is usually the vendor changing the signing key).
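A minimal sketch of the idempotency point, assuming the vendor includes a unique event id with every delivery (most platforms do). SQLite stands in here for whatever database already backs the worker:

```python
import sqlite3

conn = sqlite3.connect("webhooks.db")
conn.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")
conn.commit()


def process(event: dict) -> None:
    # Placeholder for the real work: enqueue the event, write to the destination, etc.
    print("processing", event["id"])


def handle_webhook(event: dict) -> None:
    event_id = event["id"]
    try:
        # The PRIMARY KEY is the idempotency check: a redelivered event fails this
        # insert and is skipped instead of double-creating anything downstream.
        conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
    except sqlite3.IntegrityError:
        return  # duplicate delivery, already handled

    process(event)
    conn.commit()  # only mark the event as processed once the work has succeeded
```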
Stripe’s documentation on webhook best practices is worth a read regardless of which platform you’re integrating with — the patterns travel.
5. The observability gap
The single most common reason an integration runs broken for weeks before anyone notices is that nobody can see it.
The minimum observable surface for any integration we ship:
- Success rate of API calls, broken out by endpoint
- Latency percentiles, not just averages
- Token health (when does the next one expire?)
- Queue depth (is the integration keeping up?)
- Records reconciled vs records expected (is the integration producing the right output?)
- Alert thresholds on each, with a documented on-call runbook
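As one illustrative slice of that surface, here is a sketch of the structured log line we'd want every API call to emit. The field names are ours, not any particular tool's; the point is that success rate by endpoint and latency percentiles can only be computed later if each call records them now.

```python
import json
import logging
import time

logger = logging.getLogger("integration")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_api_call(endpoint: str, func):
    """Run one API call and emit a structured log line for it, success or failure."""
    started = time.monotonic()
    ok = False
    try:
        result = func()
        ok = True
        return result
    finally:
        logger.info(json.dumps({
            "event": "api_call",
            "endpoint": endpoint,
            "ok": ok,
            "duration_ms": round((time.monotonic() - started) * 1000, 1),
        }))


# Hypothetical usage: wrap each outbound call so every endpoint gets the same telemetry.
# log_api_call("GET /invoices", lambda: fetch_invoices(page=1))
```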
This sounds like overkill for a “simple” integration. It is, until the day it isn’t. The cost of building this is small compared to the cost of one quarter where revenue numbers were silently wrong because the HubSpot ↔ accounting integration was dropping every fifth event.
What we tend to default to
For business-critical integrations, the design we keep returning to looks like:
- A small dedicated worker (Node, Python, or a serverless function) handling the integration logic
- A persistent queue (a database-backed job queue is enough for most use cases) between event ingestion and execution
- Schema validation on every external response, controlled by us, not assumed
- Service-account credentials with proactive refresh
- Idempotent writes with explicit idempotency keys
- A reconciliation job running on a slower cadence
- Observability via Sentry (or equivalent) for errors, plus structured logs to a queryable store
- Documentation that someone other than the original builder can read and operate
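As a sketch of the queue piece, which is the part people most often assume needs heavy infrastructure: a database-backed job queue can be one table. SQLite keeps the example short; in production we'd use the application's existing database, with row-level locking once there is more than one worker.

```python
import json
import sqlite3

conn = sqlite3.connect("jobs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',   -- pending | done | failed
        attempts INTEGER NOT NULL DEFAULT 0
    )
""")
conn.commit()


def enqueue(payload: dict) -> None:
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (json.dumps(payload),))
    conn.commit()


def work_one(handler, max_attempts: int = 5) -> bool:
    """Claim the oldest pending job and run it; return False when the queue is empty."""
    row = conn.execute(
        "SELECT id, payload, attempts FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False

    job_id, payload, attempts = row
    try:
        handler(json.loads(payload))
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    except Exception:
        # Leave the job pending so it gets retried; give up after max_attempts.
        new_status = "failed" if attempts + 1 >= max_attempts else "pending"
        conn.execute(
            "UPDATE jobs SET attempts = ?, status = ? WHERE id = ?",
            (attempts + 1, new_status, job_id),
        )
    conn.commit()
    return True
```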
It’s not glamorous engineering. It’s the engineering that means an integration runs for three years without anyone needing to think about it.
When “simple” really is enough
Not every integration needs all of the above. A weekly export of completed orders into a finance team’s Google Sheet doesn’t need a queue and a reconciliation job. A one-time migration script doesn’t need observability infrastructure.
The honest call is: how much would it cost the business if this integration was silently wrong for a week? If the answer is “a coffee’s worth of time to fix,” build it simple. If the answer is “a meaningful slice of revenue or trust,” build it properly the first time.
If you’ve got an integration that’s been “mostly working” for a while and you’re not sure which category it’s in, start a project and we’ll do an honest review. The audit alone is usually worth more than the build.
Common questions
Why do API integrations keep breaking? Five reasons recur: the upstream API changed shape (contract drift), rate limits got hit unexpectedly, auth tokens expired or were revoked, webhook delivery dropped or duplicated events, or the integration was failing silently because nobody could see it. Most outages are one of these five.
Should we use Zapier or build custom integrations? Zapier (and Make, n8n) is the right answer for low-volume, low-stakes flows that change frequently. Custom code is the right answer once an integration becomes business-critical, performance-sensitive, or carries data whose accuracy matters. We do both kinds of work and the decision usually isn’t close once we list the failure modes.
What is idempotency in API integrations? It means a write that’s sent twice has the same effect as a write that’s sent once — no duplicates. Achieved by tagging every write with a unique idempotency key the destination system uses to deduplicate. Essential for any integration that processes payments, orders, or customer records.
How do you handle OAuth token expiry? Run integrations as service accounts (not real human users), refresh tokens proactively before they expire rather than reactively when calls fail, and monitor token health as a first-class signal independent of API call success. Personal-account auth is the most common avoidable failure mode we see.
Do webhooks deliver reliably? Mostly. Reliably enough to depend on as the fast path; never reliably enough to depend on as the only path. The pattern that survives: webhooks for speed, plus a periodic reconciliation job that diffs the source system against your local copy as a safety net.
How much does an API integration cost to build? For a simple one-way sync between two well-documented APIs, $5,000–$15,000 typically covers the build with proper error handling and observability. For a bidirectional, high-volume, or compliance-relevant integration, expect $20,000–$80,000+. Cost scales with the number of failure modes you have to design for, not with the apparent complexity of the data.
More reading
What AI actually costs to run in production
AI demos are cheap. Production is not. Where the money actually goes when you ship an AI feature, and how to size the engineering investment around the model.
The hidden costs of SaaS once your business is established
The per-seat licence is the visible cost. Integration tax, lock-in, configuration drift, and the seat tax at scale are the SaaS costs no one quotes up front.
Red flags to watch for when hiring a development agency
The signals that separate agencies who deliver from agencies who disappear after the deposit. Twelve practical red flags from twenty-plus years of seeing them.