
Webhook patterns that don't fall over at scale

Andrew Roper · 7 min read

Quick answer: webhooks fail in five common ways — missed deliveries, duplicate processing, out-of-order events, signature verification gaps, and silent data loss when the receiver is briefly down. The patterns that survive: idempotency keys on every write, a persistent queue between receipt and processing, signature verification with monitoring on rejections, and a periodic reconciliation job as the safety net.

If you’ve built more than a couple of API integrations, you’ve had the webhooks conversation. Webhooks are fast, simple in concept, and the right tool for “tell my system when something happens.” They’re also the place where a lot of integration data quietly goes missing. The patterns below sit alongside integration testing (the layer most teams skip) and API rate-limit design as the three things that turn a fragile integration into one that runs unattended.

We keep deploying these patterns because the alternatives keep failing. They aren’t novel. They’re just the ones that survive.

How webhooks actually fail

Five categories cover most webhook outages we’ve seen:

1. Missed deliveries. The vendor sends, your endpoint is briefly down (deploy, restart, network blip), the vendor doesn’t retry — or retries with a policy that gives up after three attempts. The event is gone.

2. Duplicate processing. The vendor does retry, your endpoint receives the same event twice, and your handler doesn’t deduplicate. Now the customer has two of whatever the event was — two charges, two emails, two records.

3. Out-of-order events. The “updated” webhook arrives before the “created” webhook for the same record (yes, this happens). Or events for the same entity arrive in mixed order across an outage. Your handler processes them in receipt order and ends up with state that doesn’t match the source system.

4. Signature verification gaps. The vendor signs webhook payloads with a secret. Your endpoint either doesn’t verify, or verifies with the wrong algorithm, or has a bug in the verification that lets through unsigned requests. An attacker who guesses your endpoint URL can now inject events.

5. Silent data loss. Your endpoint is up, but the downstream system that should have processed the event was down. The event was acknowledged to the vendor, then dropped. Nobody notices for weeks because the failure isn’t visible.

Each has a known fix. None of the fixes is exotic. All of them are skipped surprisingly often.

Pattern 1: idempotency keys on every write

The single most important pattern: every write triggered by a webhook should be idempotent.

In practice, that means tagging every write with a unique idempotency key (often the webhook’s own event ID). The destination system uses the key to deduplicate. If you receive the same webhook twice, you write once.

Most modern APIs support an Idempotency-Key header. Stripe’s implementation is the canonical reference. For systems that don’t natively support idempotency keys, build it yourself: store the event ID alongside each created record and reject inserts where the event ID already exists.

This single pattern removes most of the “duplicate processing” failure mode. It’s also the cheapest pattern to implement — you’re adding a column and a uniqueness check.
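
A minimal sketch of the roll-your-own version, assuming Postgres via node-postgres and an illustrative payments table with a unique source_event_id column:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings from the standard PG* env vars

// Idempotent write keyed by the webhook's event ID. A unique index on
// source_event_id means a retried delivery inserts zero rows instead of
// creating a duplicate charge/email/record.
async function recordPayment(
  eventId: string,
  customerId: string,
  amountCents: number,
): Promise<boolean> {
  const result = await pool.query(
    `INSERT INTO payments (source_event_id, customer_id, amount_cents)
     VALUES ($1, $2, $3)
     ON CONFLICT (source_event_id) DO NOTHING`,
    [eventId, customerId, amountCents],
  );
  return result.rowCount === 1; // true on first delivery, false on duplicates
}
```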

Pattern 2: persistent queue between receipt and processing

The next most important pattern: separate webhook receipt from webhook processing.

In a fragile setup, your webhook endpoint:

  1. Receives the request
  2. Validates it
  3. Processes the business logic
  4. Returns 200

If step 3 takes a long time, or fails, or the database is briefly slow, the vendor times out and retries. Now you’re in the duplicate-processing failure mode, and you’re processing the same event multiple times in parallel.

In a robust setup, your webhook endpoint:

  1. Receives the request
  2. Verifies the signature
  3. Stores the event payload in a persistent queue (a database table, a job queue, an SQS queue or Pub/Sub topic)
  4. Returns 200 immediately

A separate worker pulls events from the queue and processes them. If processing fails, the event stays in the queue and is retried. If your processing infrastructure is briefly down, the queue holds events until it’s back.

This pattern decouples the “tell me about the event” SLA (very fast, very reliable) from the “do something about the event” SLA (whatever your business logic needs).
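
A sketch of the receipt side, assuming Express with a Postgres table as the queue; verifySignature is the helper sketched under Pattern 3, and the route and table names are illustrative:

```typescript
import express from "express";
import { Pool } from "pg";
import { verifySignature } from "./verify"; // the helper sketched in Pattern 3

const app = express();
const pool = new Pool();

// Use the raw body: signatures are computed over the exact bytes sent.
app.post("/webhooks/vendor", express.raw({ type: "application/json" }), async (req, res) => {
  if (!verifySignature(req.body, req.header("x-signature") ?? "")) {
    res.status(401).send("invalid signature");
    return;
  }

  // Assumes the vendor payload carries a stable event id.
  const event = JSON.parse(req.body.toString("utf8"));

  // Persist first, process later. Event ID as primary key also
  // deduplicates vendor retries at the door.
  await pool.query(
    `INSERT INTO webhook_events (event_id, payload, status, next_attempt_at)
     VALUES ($1, $2, 'pending', now())
     ON CONFLICT (event_id) DO NOTHING`,
    [event.id, req.body.toString("utf8")],
  );

  res.sendStatus(200); // acknowledge immediately; a worker does the rest
});

app.listen(3000);
```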

Pattern 3: signature verification, with monitoring on rejection

Verifying webhook signatures isn’t optional for any production integration. The implementation is platform-specific (most vendors document it precisely) but the pattern is the same: compute a hash of the payload using a shared secret, compare to the signature header, reject if it doesn’t match.

Two things to add to the basic implementation:

  • Monitor rejected webhooks. A sudden spike in rejections is almost always one of: the vendor rotated their signing key, an attacker is probing your endpoint, or there’s a deploy bug that broke verification. All three are things you want to know about within minutes, not days.
  • Use constant-time comparison. A naive equality check is technically vulnerable to timing attacks. Most language runtimes have a crypto.timingSafeEqual (or equivalent) for exactly this purpose; the sketch below uses it.
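
A sketch of the helper, assuming a vendor that sends a hex-encoded HMAC-SHA256 of the raw body in a single header. Real schemes vary (Stripe, for instance, signs a timestamped string), so follow your vendor’s docs for the exact inputs:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const SECRET = process.env.WEBHOOK_SECRET ?? "";

export function verifySignature(rawBody: Buffer, signatureHeader: string): boolean {
  const expected = createHmac("sha256", SECRET).update(rawBody).digest();
  const received = Buffer.from(signatureHeader, "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first;
  // the length of a signature isn't secret, so this leaks nothing useful.
  const ok = received.length === expected.length && timingSafeEqual(received, expected);

  if (!ok) {
    // Wire this into your metrics/alerting: a spike in rejections means
    // key rotation, a probe, or a deploy that broke verification.
    console.warn("webhook signature rejected");
  }
  return ok;
}
```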

Pattern 4: reconciliation as the safety net

The pattern most easily skipped, and most often regretted.

Webhooks are the fast path: low-latency, but not 100% reliable. Reconciliation is the slow path that catches everything the fast path missed.

The pattern: a scheduled job, running daily or hourly depending on the use case, that polls the source system’s API for everything that’s changed since the last run, compares it to your local copy, and processes any differences.

For a customer-record sync, the reconciliation job runs nightly, fetches all customers updated in the last 25 hours (the extra hour of overlap covers clock skew and job-start jitter), diffs against your database, and processes any records the webhook layer missed. Most reconciliations find nothing — that’s the point. The job exists to catch the small percentage of events that fell through the webhook layer.

This pattern is the difference between “mostly working” and “works correctly even when something went wrong.” It costs a few hours of engineering time and an extra cron job. It saves you from quarter-end emergencies where revenue numbers don’t reconcile.
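
A sketch of the nightly job, assuming an illustrative vendor listing endpoint that filters by updated_since. The two details that matter are the overlap window and that misses are fed back through the same queue the webhook path uses:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Nightly cron entry point. The 25-hour window deliberately overlaps the
// previous run so clock skew and job jitter can't open a gap.
async function reconcileCustomers(): Promise<void> {
  const since = new Date(Date.now() - 25 * 60 * 60 * 1000).toISOString();
  const resp = await fetch(
    `https://api.vendor.example/customers?updated_since=${encodeURIComponent(since)}`,
    { headers: { authorization: `Bearer ${process.env.VENDOR_API_KEY}` } },
  );
  const remote: Array<{ id: string; updated_at: string }> = await resp.json();

  for (const customer of remote) {
    const local = await pool.query(
      `SELECT source_updated_at FROM customers WHERE source_id = $1`,
      [customer.id],
    );
    const missedByWebhooks =
      local.rowCount === 0 ||
      new Date(local.rows[0].source_updated_at) < new Date(customer.updated_at);

    if (missedByWebhooks) {
      // Feed the miss through the same queue the webhook path uses, so
      // both paths share one idempotent processing pipeline.
      await pool.query(
        `INSERT INTO webhook_events (event_id, payload, status, next_attempt_at)
         VALUES ($1, $2, 'pending', now())
         ON CONFLICT (event_id) DO NOTHING`,
        [
          `reconcile:${customer.id}:${customer.updated_at}`,
          JSON.stringify({ type: "reconcile", id: customer.id }),
        ],
      );
    }
  }
}
```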

Pattern 5: dead-letter handling

Some events will fail processing no matter how many times you retry — malformed payloads, references to records that no longer exist, business-logic violations.

After a reasonable number of retries (typically 3–5), failed events should go to a dead-letter queue rather than disappearing. The DLQ is checked regularly (manually or by alerting), and each event is either:

  • Replayable (the upstream issue has been fixed; reprocess it)
  • Discardable (the event was a duplicate or no-op; ignore it)
  • Investigation-worthy (something genuinely unexpected; debug)

A DLQ isn’t glamorous. It’s the difference between an integration that quietly drops 0.1% of events and an integration whose failures are visible and recoverable.
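
A sketch of the retry bookkeeping, assuming the webhook_events queue table carries an attempts counter and a next_attempt_at timestamp, with a separate dead_letters table:

```typescript
import { Pool } from "pg";

const pool = new Pool();
const MAX_ATTEMPTS = 5;

// Called when processing an event throws: schedule a retry with backoff,
// or, after MAX_ATTEMPTS, park the event with full context for a human.
async function handleFailure(eventId: string, err: Error): Promise<void> {
  const { rows } = await pool.query(
    `UPDATE webhook_events SET attempts = attempts + 1
     WHERE event_id = $1
     RETURNING attempts, payload`,
    [eventId],
  );
  const { attempts, payload } = rows[0];

  if (attempts >= MAX_ATTEMPTS) {
    await pool.query(
      `INSERT INTO dead_letters (event_id, payload, last_error, failed_at)
       VALUES ($1, $2, $3, now())`,
      [eventId, payload, err.message],
    );
    await pool.query(`DELETE FROM webhook_events WHERE event_id = $1`, [eventId]);
  } else {
    // Exponential backoff: 1, 2, 4, 8 minutes between attempts.
    const delayMs = 60_000 * 2 ** (attempts - 1);
    await pool.query(
      `UPDATE webhook_events
       SET status = 'pending', next_attempt_at = now() + $2 * interval '1 millisecond'
       WHERE event_id = $1`,
      [eventId, delayMs],
    );
  }
}
```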

Pattern 6: explicit ordering when it matters

When ordering matters, webhooks alone aren’t enough. The patterns that work:

  • Use the source system’s timestamps, not your receipt time. Order processing by source timestamp: an event stamped earlier at the source is the older event, even if it arrived at your endpoint later.
  • Idempotent updates that consider state. When processing an “updated” event, check the source record’s current state and apply the update only if your local copy is older. This makes ordering largely irrelevant.
  • Version stamps. Some platforms include an incrementing version number with each event. Use it to detect and skip stale events.

For most use cases, idempotent updates with timestamp comparison handle ordering well enough. For genuinely order-dependent workflows (sequential state machines, say), you’ll need stronger guarantees, often via a stream-processing platform rather than raw webhooks.
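
A sketch of the timestamp-guarded update, assuming the source record carries an updated_at field stored locally as source_updated_at:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Apply an update only if it's newer than the local copy. A stale event
// arriving after a newer one updates zero rows and becomes a safe no-op,
// so arrival order stops mattering.
async function applyCustomerUpdate(customer: {
  id: string;
  name: string;
  updated_at: string;
}): Promise<boolean> {
  const result = await pool.query(
    `UPDATE customers
     SET name = $2, source_updated_at = $3
     WHERE source_id = $1 AND source_updated_at < $3`,
    [customer.id, customer.name, customer.updated_at],
  );
  return result.rowCount === 1; // false means the event was stale (or unknown)
}
```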

What this looks like in code

The skeleton of a robust webhook handler is roughly:

1. Receive POST request
2. Verify signature → reject if invalid (with monitoring)
3. Parse payload, extract event ID
4. INSERT into events queue (with event ID as primary key for dedup)
5. Return 200 immediately

Separate worker:
6. Pull next event from queue
7. Look up source record fresh from API (don't trust webhook payload to be current)
8. Apply business logic
9. Mark event as processed
10. On failure → retry with exponential backoff
11. After N failures → move to dead-letter queue with full context
12. Periodic reconciliation job runs independently
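
A condensed sketch of the worker side (steps 6 through 11), assuming the webhook_events table from earlier; fetchFreshRecord and applyBusinessLogic are hypothetical stand-ins for your integration-specific code, and handleFailure is the Pattern 5 helper:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Integration-specific stand-ins (hypothetical):
declare function fetchFreshRecord(event: { id: string }): Promise<unknown>;
declare function applyBusinessLogic(record: unknown): Promise<void>;
declare function handleFailure(eventId: string, err: Error): Promise<void>; // Pattern 5

async function workOnce(): Promise<void> {
  // Claim one due event. FOR UPDATE SKIP LOCKED lets multiple workers
  // run in parallel without grabbing the same row.
  const { rows } = await pool.query(
    `UPDATE webhook_events SET status = 'processing'
     WHERE event_id = (
       SELECT event_id FROM webhook_events
       WHERE status = 'pending' AND next_attempt_at <= now()
       ORDER BY next_attempt_at
       LIMIT 1
       FOR UPDATE SKIP LOCKED
     )
     RETURNING event_id, payload`,
  );
  if (rows.length === 0) return; // queue is empty; sleep and poll again

  const { event_id, payload } = rows[0];
  try {
    const event = JSON.parse(payload);
    // Re-fetch from the source API rather than trusting the payload,
    // which may be stale by the time it's processed.
    const record = await fetchFreshRecord(event);
    await applyBusinessLogic(record);
    await pool.query(`UPDATE webhook_events SET status = 'done' WHERE event_id = $1`, [event_id]);
  } catch (err) {
    await handleFailure(event_id, err as Error); // backoff, then dead-letter
  }
}
```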

That’s most of what we deploy. The complexity is appropriate to the failure modes. Skipping any of these patterns is what produces “our integration mostly works.”

When “simple” really is enough

Not every webhook integration needs the full pattern set. A Slack notification when a deal closes can be a 30-line endpoint that calls Slack’s API. There’s no reconciliation, no DLQ, no queue — and that’s fine, because the cost of a missed Slack notification is approximately zero.

The honest test: how much would the business lose if this integration silently dropped 1% of events for a quarter? If the answer is “nothing measurable,” build it simple. If the answer is “customers, money, or trust,” build it properly. The middle ground — building it “mostly properly” — is where most outages live.

Common questions

Are webhooks reliable? Reliable enough to depend on as the fast path; not reliable enough to depend on as the only path. The robust pattern is webhooks + a slower reconciliation job that catches anything the webhooks missed.

How do I handle duplicate webhooks? Use idempotency keys on every write triggered by a webhook. Store the event ID with the record you create or update. Reject duplicates at the database level. Most modern APIs support an Idempotency-Key header natively.

What happens if my webhook endpoint is down? Most platforms retry for some period (Stripe retries for up to 3 days; others vary widely). Some don’t retry at all. The robust pattern doesn’t depend on retries: a separate reconciliation job runs against the source system’s API on a slower cadence and catches anything that was missed.

How do I verify webhook signatures? Compute the expected signature from the payload using the vendor’s shared secret, compare to the signature header using constant-time comparison, reject the request if it doesn’t match. Each platform documents the exact algorithm. Monitor signature rejection rates — spikes indicate either rotation or attempted abuse.

What’s the difference between webhooks and polling? Webhooks are push-based: the source system tells you when something happens. Polling is pull-based: you ask the source system periodically what’s changed. Webhooks are faster and cheaper at scale. Polling is more reliable but higher latency and higher cost. Robust integrations typically use both: webhooks as the fast path, polling as the reconciliation safety net.

If you’ve got webhooks running in production and you’re not sure they’re built for the failure modes above, start a project and we’ll do an audit. The repair work is usually cheap compared to discovering the issue at quarter-end.
