Webhook Monitoring: How to Monitor Webhooks in Production

Monitor webhooks in production: the four signals to track, provider vs consumer-side monitoring, alerting on silence, and forensics with captured history.

Tags: Monitoring, Observability, Webhooks, Production
Ozer

Developer & Founder of HookSense

Monitoring webhooks in production means tracking four signals — delivery freshness, failure rate by status code, handler latency against the provider's timeout budget, and payload anomalies — from two vantage points: the provider's delivery logs and your own consumer-side telemetry. Neither sees the whole picture alone, which is why mature teams also keep an inspection endpoint in the loop: a retained, searchable history of every request that arrived, so when something breaks at 2am you can answer the question that matters — what exactly came in, and what did we do with it?

Why Webhooks Fail Silently in Production

Webhooks have a structural observability problem: sender and receiver belong to different organizations, and neither sees the whole pipeline. The provider knows what status code came back — but a 200 from a handler that swallowed an exception looks identical to a 200 from one that worked. The consumer knows what its handler did, but only for requests that arrived. If the provider silently disabled your endpoint after repeated failures, or a firewall rule started dropping traffic, your logs show nothing at all. No errors. Just silence.

There are no built-in delivery guarantees beyond the provider's retry policy, and those policies are finite. Stripe retries with exponential backoff for roughly 3 days, then gives up; GitHub does not retry automatically, offering manual redelivery from the webhook settings UI instead. Once the retry window closes, the event is gone. (See webhook retry strategies for a provider-by-provider breakdown and retry backoff for the mechanics.)

The result is a failure mode unique to webhooks: the system degrades invisibly until the data drifts. A renewal event gets dropped, and three weeks later a customer complains they were charged but their account shows expired. Nothing crashed. No alert fired. The first symptom was a support ticket.

The Four Signals to Monitor

1. Delivery Rate and Freshness

The most valuable webhook metric is also the simplest: time since the last event arrived. Every webhook stream has a natural rhythm, and when it stops, no error-based metric will tell you. Track events per minute, segmented by event type, and alert when the gap since the last event is longer than the stream's normal variation can explain. A payments integration that usually sees an event every few minutes and has been quiet for an hour is an incident, even though every dashboard shows zero errors.
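
A freshness check can be as small as a map of last-seen timestamps. The sketch below is one possible shape, not a prescribed implementation: `createFreshnessTracker`, the `invoice.paid` stream, and the 45-minute threshold are illustrative assumptions.

```javascript
// Hypothetical freshness tracker: names and thresholds are illustrative.
function createFreshnessTracker(maxGapMsByType) {
  const lastSeenMs = new Map(); // event type -> timestamp of the last arrival

  return {
    // Call from the webhook handler for every event that arrives.
    record(eventType, nowMs = Date.now()) {
      lastSeenMs.set(eventType, nowMs);
    },
    // Call on a timer; returns the event types that have gone quiet.
    staleTypes(nowMs = Date.now()) {
      const stale = [];
      for (const [eventType, maxGapMs] of Object.entries(maxGapMsByType)) {
        const last = lastSeenMs.get(eventType);
        // A type that has never been seen is reported as stale too.
        const gapMs = last === undefined ? Infinity : nowMs - last;
        if (gapMs > maxGapMs) stale.push({ eventType, gapMs });
      }
      return stale;
    },
  };
}

const tracker = createFreshnessTracker({ "invoice.paid": 45 * 60 * 1000 });
tracker.record("invoice.paid", 0);
console.log(tracker.staleTypes(60 * 60 * 1000)); // checked one hour later
```

In a real deployment the timestamps would live somewhere shared (a metrics backend or a database row) so the check survives restarts and works across multiple handler instances.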

2. Failure Rate by Status Code

Raw error counts are noisy; status code distribution is diagnostic:

  • 401 / 403: Signature verification is failing — a rotated secret, a body-parsing change, a bad deploy.
  • 404: A route moved or a reverse-proxy rule changed.
  • 422 / 400: The payload schema changed, or your validation got stricter.
  • 500: Your handler is throwing — the event arrived fine and your code broke.
  • 502 / 503 / 504: Infrastructure — your app is down, overloaded, or behind a broken load balancer.

A sudden shift in the distribution matters more than the absolute rate — going from scattered 500s to 100% 401s within a minute of a deploy is a signature you will see in the walkthrough below.
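
To make "a shift in the distribution" concrete, here is a minimal sketch that compares the status-code mix of two consecutive windows. The function names and the 0.5 default for `minShift` are assumptions chosen for illustration; a production version would more likely be a query in your metrics stack.

```javascript
// Share of each status code within one window of responses.
function statusDistribution(statusCodes) {
  const counts = new Map();
  for (const code of statusCodes) counts.set(code, (counts.get(code) || 0) + 1);
  const shares = {};
  for (const [code, count] of counts) shares[code] = count / statusCodes.length;
  return shares;
}

// Flags codes whose share moved by at least `minShift` between two windows.
function distributionShifts(previousCodes, currentCodes, minShift = 0.5) {
  const before = statusDistribution(previousCodes);
  const after = statusDistribution(currentCodes);
  const codes = new Set([...Object.keys(before), ...Object.keys(after)]);
  const shifts = [];
  for (const code of codes) {
    const delta = (after[code] || 0) - (before[code] || 0);
    if (Math.abs(delta) >= minShift) shifts.push({ code: Number(code), delta });
  }
  return shifts;
}

// Mostly healthy window followed by a wall of 401s.
console.log(distributionShifts([200, 200, 200, 500], [401, 401, 401, 401]));
```

The scattered 500 in the first window does not trip the check on its own; the jump of 401 from 0% to 100% does.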

3. Handler Latency vs the Provider's Timeout Budget

Providers enforce response budgets tighter than most teams assume. Stripe requires a 2xx within 20 seconds or it marks the delivery failed and schedules a retry. A handler that averages 2 seconds but spikes to 25 under load silently converts healthy deliveries into failures — then processes the retries too. Monitor p95 and p99 latency, not the average, and alert at half the provider's budget. The durable fix is architectural: acknowledge immediately, queue the work, process in the background.
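
The gap between average and tail latency is easy to see with a small sketch. It uses a nearest-rank percentile over a window of handler durations and alerts at half the budget; `percentile`, `latencyAlert`, and the sample numbers are illustrative assumptions, with the 20-second budget taken from the figure above.

```javascript
// Nearest-rank percentile over a window of handler durations (milliseconds).
function percentile(durationsMs, p) {
  if (durationsMs.length === 0) return null;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank arithmetic in integers.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Alerts when p99 crosses half of the provider's timeout budget.
function latencyAlert(durationsMs, providerBudgetMs) {
  const p95 = percentile(durationsMs, 95);
  const p99 = percentile(durationsMs, 99);
  return { p95, p99, alert: p99 !== null && p99 > providerBudgetMs / 2 };
}

// 98 requests at 2 s and 2 at 25 s: the mean is about 2.5 s, the tail is not.
const sampleDurations = [...Array(98).fill(2000), 25000, 25000];
console.log(latencyAlert(sampleDurations, 20000));
```

With this sample the average looks comfortable and p95 is still 2 seconds, but p99 already exceeds the whole budget, which is why the alert keys on the tail.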

4. Payload Anomalies

Providers evolve their schemas. Monitor for unknown event types reaching your catch-all branch, missing fields your handler depends on, payload sizes that jump an order of magnitude, and unexpected content types. A spike in unknown event types often means the provider shipped a new API version your integration is quietly ignoring.
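
The four checks above can be sketched as a single function that returns a list of anomaly labels to count in your metrics. Everything in `expectations` is an assumption for one hypothetical integration (the event types, required fields, typical size, and content type), and the request shape `{ headers, rawBody, event }` is likewise illustrative.

```javascript
// Hypothetical expectations for one integration; adjust per provider.
const expectations = {
  knownTypes: new Set(["invoice.paid", "customer.subscription.deleted"]),
  requiredFields: ["id", "type", "data"],
  typicalBytes: 2000,
  contentType: "application/json",
};

// `request.event` is the parsed body; `request.rawBody` is the body as received.
function payloadAnomalies(request, expected = expectations) {
  const anomalies = [];
  const { headers, rawBody, event } = request;

  // The header value may carry a charset suffix, so compare the media type only.
  const contentType = (headers["content-type"] || "").split(";")[0].trim();
  if (contentType !== expected.contentType) anomalies.push("unexpected_content_type");

  if (!expected.knownTypes.has(event.type)) anomalies.push("unknown_event_type");

  for (const field of expected.requiredFields) {
    if (!(field in event)) anomalies.push(`missing_field:${field}`);
  }

  // An order-of-magnitude jump over the typical size.
  if (Buffer.byteLength(rawBody) >= expected.typicalBytes * 10) {
    anomalies.push("payload_size_jump");
  }
  return anomalies;
}
```

Emitting each label as a counter, rather than rejecting the request, keeps the check observational: the handler still processes what it can while the anomaly rate shows up on a dashboard.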

Three Monitoring Patterns (and Why You Need More Than One)

Provider-Side Dashboards

Stripe's webhook dashboard shows every delivery attempt with status code, response body, and timing, plus manual resend; GitHub's settings page lists recent deliveries with a redeliver button. These are authoritative for "did the provider send it, and what did my server answer?" Their limits: retention is short and outside your control, search is weak, there is no cross-provider view, and crucially, nothing in a provider dashboard pages you.

Consumer-Side Logging and Metrics

Structured logs and metrics from your own handler answer the other half: "what did we do with the event?" Log the event ID, type, signature verification result, outcome, and duration — never the body if it contains sensitive data — and feed counters into your alerting stack. The blind spot is symmetric to the provider's: consumer-side telemetry only covers requests that arrived and code paths that emit logs. A crash before the logging line, a misrouted request, or a provider that stopped sending all produce the same output — nothing.
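
As a rough sketch of that log line, the function below picks out the fields listed above and deliberately leaves the body out. The field names and the `webhook_handled` message are assumptions; use whatever your logging pipeline already expects.

```javascript
// Builds one structured log line per handled webhook.
// The event body (event.data) is intentionally never included.
function buildLogEntry({ event, signatureValid, outcome, startedAtMs, finishedAtMs }) {
  return JSON.stringify({
    msg: "webhook_handled",
    eventId: event.id,
    eventType: event.type,
    signatureValid,
    outcome, // e.g. "processed", "duplicate", "rejected"
    durationMs: finishedAtMs - startedAtMs,
  });
}

console.log(
  buildLogEntry({
    event: { id: "evt_1", type: "invoice.paid", data: { customer: "redacted" } },
    signatureValid: true,
    outcome: "processed",
    startedAtMs: 100,
    finishedAtMs: 350,
  })
);
```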

An Inspection Endpoint in the Loop

The third pattern closes the gap: a capture layer in the delivery path that records every request independently of whether your handler succeeded. Most providers let you register multiple endpoint URLs for the same events — one for your production handler, one for the inspector.

This is the layer HookSense provides. Every request is captured with full headers, body, and timing, shown in a real-time UI, and retained for 14, 30, or 90 days by plan (Catch is free; Hook is $19/month with 30-day retention; Sense is $49/month with 90-day retention). Bodies are encrypted at rest with AES-256-GCM. Built-in signature verification for Stripe, GitHub, Shopify, and custom HMAC schemes tells you whether a failing request was malformed or your verification code is the problem. Any captured request can be replayed, with edits, against any URL, and the same setup doubles as a staging debugging environment.

To be clear about the division of labor: HookSense is the visibility and forensics layer — capture, history, verification, replay. Your alerting lives in your metrics stack; the inspector is what you open once the alert fires.

Dead Letters and Idempotent Re-Processing

Monitoring tells you something failed; a dead-letter strategy makes the failure recoverable. When your handler cannot process an event, persist the raw event to a dead-letter store instead of dropping it, and alert on dead-letter depth. Re-processing only works if your handlers are idempotent, because provider retries, your replays, and your dead-letter drain can all deliver the same event more than once. Key every side effect on the provider's event ID:

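async function handleEvent(event) {
  // Skip events already recorded under the provider's event ID.
  const seen = await db.processedEvents.findById(event.id);
  if (seen) return { status: "duplicate", id: event.id };

  await processBusinessLogic(event);
  // Record the ID only after the work succeeds, so a failed attempt stays retryable.
  // The check and the insert are not atomic: back them with a unique constraint
  // on the ID (or a transaction) so concurrent deliveries cannot both proceed.
  await db.processedEvents.insert({ id: event.id, at: new Date() });
  return { status: "processed", id: event.id };
}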

With idempotency in place, "replay everything from the incident window" turns from a risky operation into a routine one. The full pattern is covered in webhook idempotency: why and how.
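
The dead-letter side of this can be sketched in a few lines. The store below is an in-memory stand-in used only for illustration; `createDeadLetterStore` and `processOrDeadLetter` are hypothetical names, and a real store would be a durable table or queue.

```javascript
// In-memory stand-in for a durable dead-letter store (illustrative only).
function createDeadLetterStore() {
  const entries = []; // kept in arrival order, oldest first
  return {
    add(rawEvent, error, nowMs = Date.now()) {
      entries.push({ rawEvent, error: String(error), failedAtMs: nowMs });
    },
    // Depth and age are the two numbers worth alerting on.
    depth: () => entries.length,
    oldestAgeMs: (nowMs = Date.now()) =>
      entries.length === 0 ? 0 : nowMs - entries[0].failedAtMs,
    // Hands back everything for re-processing and empties the store.
    drain: () => entries.splice(0, entries.length),
  };
}

// Wraps a handler so a failure parks the raw event instead of dropping it.
async function processOrDeadLetter(rawEvent, handler, store) {
  try {
    return await handler(rawEvent);
  } catch (error) {
    store.add(rawEvent, error);
    return { status: "dead_lettered", id: rawEvent.id };
  }
}
```

Draining is then a loop over `store.drain()` that feeds each raw event back through the same idempotent handler, which is why the event-ID check matters.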

Retention Windows as Incident Forensics

During an incident, the two questions are "what arrived during the outage?" and "can we re-process it after the fix?" Provider dashboards answer the first partially, for one provider, within their retention limits; your logs answer it only for requests your code logged. A captured request history answers it completely: filter to the incident window and you have the exact set of events (headers, bodies, timestamps) that your handler failed to process.

Retention length determines which incidents you can recover from. A 14-day window covers a bad deploy noticed within days; slow-burn failures — a breakage found during monthly reconciliation, a complaint about something three weeks old — need 30 or 90 days. The question to ask when choosing retention: how long does it typically take you to notice a silent data drift? That number, not your incident response time, is your minimum window.
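
Both ideas in this section reduce to timestamp arithmetic. The sketch below assumes a hypothetical export of captured requests shaped as `{ receivedAt, headers, body }` with ISO 8601 timestamps; the function names and dates are illustrative.

```javascript
// Returns captured requests inside the incident window, oldest first.
function eventsInWindow(captured, startIso, endIso) {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  return captured
    .filter((request) => {
      const receivedMs = Date.parse(request.receivedAt);
      return receivedMs >= start && receivedMs <= end;
    })
    .sort((a, b) => Date.parse(a.receivedAt) - Date.parse(b.receivedAt));
}

// True if the start of the incident is still inside the retention window
// at the moment someone notices the problem.
function isRecoverable(incidentStartIso, noticedAtIso, retentionDays) {
  const ageMs = Date.parse(noticedAtIso) - Date.parse(incidentStartIso);
  return ageMs <= retentionDays * 24 * 60 * 60 * 1000;
}

// A drift noticed three weeks after it began.
console.log(isRecoverable("2024-01-01T00:00:00Z", "2024-01-22T00:00:00Z", 14)); // false
console.log(isRecoverable("2024-01-01T00:00:00Z", "2024-01-22T00:00:00Z", 30)); // true
```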

Alert Design: Alert on Silence, Not Just Errors

Most teams alert on error rates and stop there. The failure modes that hurt longest are the quiet ones:

  • Alert on absence of events. A freshness alert ("no payment events for 45 minutes") catches disabled endpoints, DNS breakage, and provider-side misconfiguration that error alerts structurally cannot see.
  • Alert on distribution shifts, not thresholds alone. 100% 401s is a different incident than 5% 500s.
  • Alert on latency at half the provider's budget. If Stripe's limit is 20 seconds, page at p99 above 10.
  • Alert on dead-letter depth and age. A growing dead-letter queue is a slow incident in progress.
  • Tune to the stream's rhythm. Per-event-type baselines beat one global threshold.
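
One simple way to get a per-event-type baseline for the freshness alert is to derive it from observed arrivals. The sketch below takes the largest historical gap and multiplies it by a safety factor; the function name and the default factor of 3 are assumptions, and a percentile of the gaps would be a reasonable alternative for noisy streams.

```javascript
// Derives a freshness threshold from observed arrival times for one event type.
function freshnessThresholdMs(arrivalTimesMs, multiplier = 3) {
  if (arrivalTimesMs.length < 2) return null; // not enough history for a gap
  const sorted = [...arrivalTimesMs].sort((a, b) => a - b);
  let largestGapMs = 0;
  for (let i = 1; i < sorted.length; i++) {
    largestGapMs = Math.max(largestGapMs, sorted[i] - sorted[i - 1]);
  }
  return largestGapMs * multiplier;
}

// Arrivals at 0, 1, 3, and 4 minutes: the largest gap is 2 minutes.
console.log(freshnessThresholdMs([0, 60000, 180000, 240000])); // 360000
```

Computing this separately for each event type means a chatty stream gets a tight threshold and a rare one gets a loose threshold, instead of both sharing a number that fits neither.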

Incident Walkthrough: The 2am Signature Break

At 1:54am, an automated deploy ships a refactor that moves JSON body parsing in front of the webhook route. Signature verification now runs against the re-serialized body instead of the raw bytes, and every Stripe request starts returning 401. What each layer sees:

  • Stripe's dashboard records every 401 and begins its retry schedule — exponential backoff, up to roughly 3 days. The information sits there, accurately, in a UI nobody is watching at night.
  • Consumer-side logs capture the 401s, but alerting is keyed on 500s — "401 means someone external sent a bad request" was the assumption. No page fires.
  • A freshness alert on processed payment events, had one existed, would have fired around 2:40am when the stream flatlined — the strongest catch, because it measures outcomes, not errors.
  • The inspection endpoint keeps capturing regardless: the requests are arriving fine; verification is what broke. The history fills with valid Stripe events the production handler rejected.

Detection happens at 7:30am when the on-call engineer sees the 401 wall on the morning dashboard. Running signature verification against the captured requests in HookSense shows the signatures are valid — ruling out a rotated secret and pointing at the consumer's verification code. The deploy diff surfaces the body-parsing change in minutes.
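
The root cause is easy to reproduce in isolation. The sketch below uses a generic HMAC-SHA256 over the request body, not any provider's exact signing scheme, and the secret and payload are placeholders. It shows that parsing and re-serializing JSON changes the bytes, so a signature computed over the original body no longer matches.

```javascript
const crypto = require("node:crypto");

function sign(secret, body) {
  return crypto.createHmac("sha256", secret).update(body).digest("hex");
}

function verify(secret, body, signatureHex) {
  const expected = Buffer.from(sign(secret, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check the length first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

const secret = "example-signing-secret"; // placeholder value
const rawBody = '{"id": "evt_1", "amount": 10.0}'; // bytes exactly as the sender wrote them
const signature = sign(secret, rawBody);

// What the handler sees once a JSON body parser has run in front of it.
const reserialized = JSON.stringify(JSON.parse(rawBody)); // {"id":"evt_1","amount":10}

console.log(verify(secret, rawBody, signature)); // true
console.log(verify(secret, reserialized, signature)); // false
```

Whitespace and number formatting are enough to break the match, which is why verification has to run against the raw request bytes before any parser touches them.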

Recovery is where retention pays for itself. The fix ships at 8:10am. Stripe's retries will eventually redeliver most failed events — but "eventually" means hours on the backoff schedule, and anything past its retry budget is gone. Instead, the team filters the captured history to the 1:54–8:10 window and replays every event against the fixed endpoint; idempotent handlers deduplicate whatever Stripe retries anyway. Data loss: zero. Time to consistency: minutes, not the tail of a 3-day retry schedule.

The postmortem actions write themselves: a freshness alert on processed events, 401-rate shifts treated as pages, and the inspection endpoint kept permanently in the loop.

Putting It Together

Webhook monitoring is a layered posture, not one tool. The provider's dashboard is the authoritative delivery record; your logs and metrics are the authoritative processing record and where alerts live. The inspection layer is the connective tissue: an independent, retained, replayable record of everything that actually arrived, turning incidents from data-loss events into replay exercises.

Start with the cheapest wins: a freshness alert per critical event type, status-code distribution on one dashboard, idempotent handlers keyed on event IDs, and a capture endpoint registered alongside production. All four fit in an afternoon — and the next 2am deploy becomes a 20-minute fix instead of a three-week data drift.
