Webhook Monitoring: How to Monitor Webhooks in Production
Monitor webhooks in production: the four signals to track, provider vs consumer-side monitoring, alerting on silence, and forensics with captured history.
Ozer
Developer & Founder of HookSense
Monitoring webhooks in production means tracking four signals — delivery freshness, failure rate by status code, handler latency against the provider's timeout budget, and payload anomalies — from two vantage points: the provider's delivery logs and your own consumer-side telemetry. Neither sees the whole picture alone, which is why mature teams also keep an inspection endpoint in the loop: a retained, searchable history of every request that arrived, so when something breaks at 2am you can answer the question that matters — what exactly came in, and what did we do with it?
Why Webhooks Fail Silently in Production
Webhooks have a structural observability problem: sender and receiver belong to different organizations, and neither sees the whole pipeline. The provider knows what status code came back — but a 200 from a handler that swallowed an exception looks identical to a 200 from one that worked. The consumer knows what its handler did, but only for requests that arrived. If the provider silently disabled your endpoint after repeated failures, or a firewall rule started dropping traffic, your logs show nothing at all. No errors. Just silence.
There are no built-in delivery guarantees beyond the provider's retry policy, and those policies are finite. Stripe retries with exponential backoff for roughly 3 days, then gives up; GitHub does not retry automatically, offering manual redelivery from the webhook settings UI instead. Once the retry window closes, the event is gone. (See webhook retry strategies for a provider-by-provider breakdown and retry backoff for the mechanics.)
The result is a failure mode unique to webhooks: the system degrades invisibly until the data drifts. A renewal event gets dropped, and three weeks later a customer complains they were charged but their account shows expired. Nothing crashed. No alert fired. The first symptom was a support ticket.
The Four Signals to Monitor
1. Delivery Rate and Freshness
The most valuable webhook metric is also the simplest: time since the last event arrived. Every webhook stream has a natural rhythm, and when it stops, no error-based metric will tell you. Track events per minute, segmented by event type, and alert when the gap since the last event exceeds the longest silence that stream normally shows. A payments integration that usually sees an event every few minutes and has been quiet for an hour is an incident, even though every dashboard shows zero errors.
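A freshness check can be very small. The sketch below assumes you already record the arrival time of the latest event per type; the function name, the field names, and the thresholds are illustrative and not part of any provider API.

```javascript
// Hypothetical freshness check: flags event types whose stream has gone quiet.
// lastSeen maps event type -> timestamp (ms) of the most recent event.
// maxGapMs maps event type -> longest silence considered normal for that stream.
function findStaleStreams(lastSeen, maxGapMs, now = Date.now()) {
  const stale = [];
  for (const [eventType, limit] of Object.entries(maxGapMs)) {
    const last = lastSeen[eventType];
    // A type that has never been seen is treated as stale as well.
    const gap = last === undefined ? Infinity : now - last;
    if (gap > limit) stale.push({ eventType, gapMs: gap, limitMs: limit });
  }
  return stale;
}

const MINUTE = 60 * 1000;
const now = Date.parse("2024-01-01T03:00:00Z");
const stale = findStaleStreams(
  { "invoice.paid": now - 60 * MINUTE, "customer.updated": now - 5 * MINUTE },
  { "invoice.paid": 45 * MINUTE, "customer.updated": 30 * MINUTE },
  now
);
console.log(stale.map((s) => s.eventType)); // [ 'invoice.paid' ]
```

Run on a schedule, a check like this fires even when no request reaches your handler at all, which is exactly the case error metrics miss.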
2. Failure Rate by Status Code
Raw error counts are noisy; status code distribution is diagnostic:
- 401 / 403: Signature verification is failing — a rotated secret, a body-parsing change, a bad deploy.
- 404: A route moved or a reverse-proxy rule changed.
- 422 / 400: The payload schema changed, or your validation got stricter.
- 500: Your handler is throwing — the event arrived fine and your code broke.
- 502 / 503 / 504: Infrastructure — your app is down, overloaded, or behind a broken load balancer.
A sudden shift in the distribution matters more than the absolute rate — going from scattered 500s to 100% 401s within a minute of a deploy is a signature you will see in the walkthrough below.
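One way to turn "distribution shift" into a number is to compare each status code's share in the current window against a baseline window. This is a rough sketch under that assumption; the function names and the sample windows are made up for illustration.

```javascript
// Share of each status code in a window of responses, e.g. { '401': 1 }.
function statusDistribution(statusCodes) {
  const counts = {};
  for (const code of statusCodes) counts[code] = (counts[code] || 0) + 1;
  const dist = {};
  for (const [code, n] of Object.entries(counts)) dist[code] = n / statusCodes.length;
  return dist;
}

// Largest increase in share for any single status code between two windows.
function largestShift(baseline, current) {
  let worst = { code: null, delta: 0 };
  for (const [code, share] of Object.entries(current)) {
    const delta = share - (baseline[code] || 0);
    if (delta > worst.delta) worst = { code, delta };
  }
  return worst;
}

// Baseline: mostly 200s with a scattered 500. Current: every response is a 401.
const before = statusDistribution([200, 200, 200, 500, 200, 200, 200, 200, 200, 200]);
const after = statusDistribution([401, 401, 401, 401, 401]);
const shift = largestShift(before, after);
console.log(shift); // { code: '401', delta: 1 }
```

A delta near 1 for a code that was absent from the baseline is the "wall of 401s" pattern; a small delta on 500 is ordinary noise.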
3. Handler Latency vs the Provider's Timeout Budget
Providers enforce response budgets tighter than most teams assume. Stripe requires a 2xx within 20 seconds or it marks the delivery failed and schedules a retry. A handler that averages 2 seconds but spikes to 25 under load silently converts healthy deliveries into failures — then processes the retries too. Monitor p95 and p99 latency, not the average, and alert at half the provider's budget. The durable fix is architectural: acknowledge immediately, queue the work, process in the background.
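To see why the average hides the problem, here is a small sketch that computes nearest-rank percentiles and applies the "half the budget" rule. The 20-second budget is the figure quoted above; the function names and sample data are illustrative.

```javascript
// Nearest-rank percentile of a list of durations (p in 0..100).
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Alert when p99 exceeds half of the provider's response budget.
function latencyAlert(durationsMs, providerBudgetMs) {
  const p95 = percentile(durationsMs, 95);
  const p99 = percentile(durationsMs, 99);
  return { p95, p99, alert: p99 > providerBudgetMs / 2 };
}

// 98 fast requests and two slow ones: the mean is about 2.3 s, the tail is not.
const durations = [...Array(98).fill(2000), 12000, 25000];
const result = latencyAlert(durations, 20000);
console.log(result); // { p95: 2000, p99: 12000, alert: true }
```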
4. Payload Anomalies
Providers evolve their schemas. Monitor for unknown event types reaching your catch-all branch, missing fields your handler depends on, payload sizes that jump an order of magnitude, and unexpected content types. A spike in unknown event types often means the provider shipped a new API version your integration is quietly ignoring.
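A lightweight anomaly check might look like the following. It is a sketch only: the table of expected fields is hypothetical and would come from whatever your own handlers actually read, and content-type checks are left out for brevity.

```javascript
// Hypothetical expectations for one integration: known event types and the
// fields each handler reads.
const EXPECTED = new Map([
  ["invoice.paid", ["id", "data.object.customer"]],
  ["customer.subscription.deleted", ["id", "data.object.id"]],
]);

// Resolve a dotted path such as "data.object.customer" against an object.
function getPath(obj, path) {
  return path.split(".").reduce((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function findAnomalies(event, rawSizeBytes, typicalSizeBytes) {
  const anomalies = [];
  const required = EXPECTED.get(event.type);
  if (!required) {
    anomalies.push(`unknown event type: ${event.type}`);
  } else {
    for (const path of required) {
      if (getPath(event, path) === undefined) anomalies.push(`missing field: ${path}`);
    }
  }
  // Flag an order-of-magnitude jump in payload size.
  if (rawSizeBytes > typicalSizeBytes * 10) anomalies.push("payload size jump");
  return anomalies;
}

// Known type, but a field the handler depends on is missing.
const missing = findAnomalies({ type: "invoice.paid", id: "evt_1", data: { object: {} } }, 2048, 2000);
// Unknown type arriving with a payload 25x the usual size.
const unknown = findAnomalies({ type: "invoice.finalized", id: "evt_2" }, 50000, 2000);
console.log(missing, unknown);
```

Counting these anomalies per event type gives you a metric to graph; a sustained rise in "unknown event type" is the signal described above.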
Three Monitoring Patterns (and Why You Need More Than One)
Provider-Side Dashboards
Stripe's webhook dashboard shows every delivery attempt with status code, response body, and timing, plus manual resend; GitHub's settings page lists recent deliveries with a redeliver button. These are authoritative for "did the provider send it, and what did my server answer?" Their limits: retention is short and outside your control, search is weak, there is no cross-provider view, and crucially, nothing in a provider dashboard pages you.
Consumer-Side Logging and Metrics
Structured logs and metrics from your own handler answer the other half: "what did we do with the event?" Log the event ID, type, signature verification result, outcome, and duration — never the body if it contains sensitive data — and feed counters into your alerting stack. The blind spot is symmetric to the provider's: consumer-side telemetry only covers requests that arrived and code paths that emit logs. A crash before the logging line, a misrouted request, or a provider that stopped sending all produce the same output — nothing.
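As an illustration, a log record with exactly those fields could be built like this. The field names are an assumption, not a standard; the point is that the body never enters the record.

```javascript
// Build one structured log line per delivery. The body is deliberately left
// out; only identifiers and outcomes are recorded. Field names are illustrative.
function buildLogRecord({ event, signatureValid, outcome, startedAt, finishedAt }) {
  return {
    eventId: event.id,
    eventType: event.type,
    signatureValid,
    outcome, // e.g. "processed", "duplicate", "failed"
    durationMs: finishedAt - startedAt,
  };
}

const record = buildLogRecord({
  event: { id: "evt_123", type: "invoice.paid", data: { card: "sensitive" } },
  signatureValid: true,
  outcome: "processed",
  startedAt: 1000,
  finishedAt: 1180,
});
console.log(JSON.stringify(record));
// {"eventId":"evt_123","eventType":"invoice.paid","signatureValid":true,"outcome":"processed","durationMs":180}
```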
An Inspection Endpoint in the Loop
The third pattern closes the gap: a capture layer in the delivery path that records every request independently of whether your handler succeeded. Most providers let you register multiple endpoint URLs for the same events — one for your production handler, one for the inspector.
This is the layer HookSense provides. Every request is captured with full headers, body, and timing, shown in a real-time UI, and retained for 14, 30, or 90 days depending on plan (Catch is free; Hook is $19/month with 30-day retention; Sense is $49/month with 90-day retention). Bodies are encrypted at rest with AES-256-GCM. Built-in signature verification for Stripe, GitHub, Shopify, and custom HMAC schemes tells you whether a failing request was malformed or your verification code is the problem. Any captured request can be replayed, with edits if needed, against any URL, and the same setup doubles as a staging debugging environment.
To be clear about the division of labor: HookSense is the visibility and forensics layer — capture, history, verification, replay. Your alerting lives in your metrics stack; the inspector is what you open once the alert fires.
Dead Letters and Idempotent Re-Processing
Monitoring tells you something failed; a dead-letter strategy makes the failure recoverable. When your handler cannot process an event, persist the raw event to a dead-letter store instead of dropping it, and alert on dead-letter depth. Re-processing only works if your handlers are idempotent, because provider retries, your replays, and your dead-letter drain can all deliver the same event more than once. Key every side effect on the provider's event ID:
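// Idempotent handler: each provider event ID is processed at most once.
async function handleEvent(event) {
  // Skip events already recorded as processed.
  const seen = await db.processedEvents.findById(event.id);
  if (seen) return { status: "duplicate", id: event.id };

  await processBusinessLogic(event);
  // Record the ID only after the work succeeds, so a failed attempt can be retried.
  // The check and the insert are not atomic: a unique constraint on id is
  // needed to stop two concurrent deliveries from both being processed.
  await db.processedEvents.insert({ id: event.id, at: new Date() });
  return { status: "processed", id: event.id };
}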
With idempotency in place, "replay everything from the incident window" turns from a risky operation into a routine one. The full pattern is covered in webhook idempotency: why and how.
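The dead-letter side can be sketched in the same spirit. This in-memory version only shows the shape: a real store would be a durable table or queue, and the class and method names are hypothetical.

```javascript
// Minimal in-memory dead-letter store (illustrative, not durable).
class DeadLetterStore {
  constructor() {
    this.items = new Map(); // event ID -> { rawBody, error, failedAt }
  }

  add(eventId, rawBody, error, failedAt = Date.now()) {
    // Keep the first failure time so age reflects how long the event has waited.
    if (!this.items.has(eventId)) {
      this.items.set(eventId, { rawBody, error: String(error), failedAt });
    }
  }

  depth() {
    return this.items.size;
  }

  oldestAgeMs(now = Date.now()) {
    let oldest = 0;
    for (const { failedAt } of this.items.values()) oldest = Math.max(oldest, now - failedAt);
    return oldest;
  }

  // Re-run each stored event through an idempotent handler; remove it on success.
  async drain(handler) {
    for (const [eventId, item] of [...this.items]) {
      try {
        await handler(JSON.parse(item.rawBody));
        this.items.delete(eventId);
      } catch (err) {
        // Leave the event in place so the next drain can retry it.
      }
    }
    return this.depth();
  }
}

const dlq = new DeadLetterStore();
dlq.add("evt_1", '{"id":"evt_1"}', new Error("db timeout"), 1000);
dlq.add("evt_2", '{"id":"evt_2"}', new Error("db timeout"), 4000);
console.log(dlq.depth(), dlq.oldestAgeMs(10000)); // 2 9000
```

Depth and oldest age are the two numbers worth exporting as metrics; `drain` is safe to run repeatedly only because the handler it calls is idempotent.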
Retention Windows as Incident Forensics
During an incident, the two questions are: what arrived during the outage? and can we re-process it after the fix? Provider dashboards answer the first partially, for one provider, within their retention limits; your logs answer it only for requests your code logged. A captured request history answers it completely: filter to the incident window and you have the exact set of events — headers, bodies, timestamps — that your handler failed to process.
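Filtering to the incident window is a simple operation once the history exists. The sketch below assumes each captured request carries a `receivedAt` ISO timestamp; that shape is illustrative and not the export format of any particular tool.

```javascript
// Select captured requests that arrived inside an incident window, oldest first.
function requestsInWindow(captured, startIso, endIso) {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  return captured
    .filter((r) => {
      const t = Date.parse(r.receivedAt);
      return t >= start && t <= end;
    })
    .sort((a, b) => Date.parse(a.receivedAt) - Date.parse(b.receivedAt));
}

const captured = [
  { id: "req_3", receivedAt: "2024-01-01T08:30:00Z" },
  { id: "req_2", receivedAt: "2024-01-01T05:00:00Z" },
  { id: "req_1", receivedAt: "2024-01-01T01:55:00Z" },
];
const affected = requestsInWindow(captured, "2024-01-01T01:54:00Z", "2024-01-01T08:10:00Z");
console.log(affected.map((r) => r.id)); // [ 'req_1', 'req_2' ]
```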
Retention length determines which incidents you can recover from. A 14-day window covers a bad deploy noticed within days; slow-burn failures — a breakage found during monthly reconciliation, a complaint about something three weeks old — need 30 or 90 days. The question to ask when choosing retention: how long does it typically take you to notice a silent data drift? That number, not your incident response time, is your minimum window.
Alert Design: Alert on Silence, Not Just Errors
Most teams alert on error rates and stop there. The failure modes that hurt longest are the quiet ones:
- Alert on absence of events. A freshness alert ("no payment events for 45 minutes") catches disabled endpoints, DNS breakage, and provider-side misconfiguration that error alerts structurally cannot see.
- Alert on distribution shifts, not thresholds alone. 100% 401s is a different incident than 5% 500s.
- Alert on latency at half the provider's budget. If Stripe's limit is 20 seconds, page at p99 above 10.
- Alert on dead-letter depth and age. A growing dead-letter queue is a slow incident in progress.
- Tune to the stream's rhythm. Per-event-type baselines beat one global threshold.
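Per-event-type baselines can be derived from each stream's own history. One simple heuristic, offered here as an assumption rather than a recommendation from any provider, is to take the largest gap seen between consecutive events and multiply it by a safety factor.

```javascript
// Derive a silence threshold for one event type from its own arrival history:
// a multiple of the largest gap seen between consecutive events.
function silenceThresholdMs(arrivalTimesMs, safetyFactor = 3) {
  if (arrivalTimesMs.length < 2) return null; // not enough history to judge
  const sorted = [...arrivalTimesMs].sort((a, b) => a - b);
  let maxGap = 0;
  for (let i = 1; i < sorted.length; i++) {
    maxGap = Math.max(maxGap, sorted[i] - sorted[i - 1]);
  }
  return maxGap * safetyFactor;
}

// One threshold per event type instead of a single global value.
function baselinesByType(history) {
  const out = {};
  for (const [type, times] of Object.entries(history)) out[type] = silenceThresholdMs(times);
  return out;
}

const MIN = 60 * 1000;
const baselines = baselinesByType({
  "invoice.paid": [0, 5 * MIN, 20 * MIN, 25 * MIN],
  "customer.updated": [0, 120 * MIN, 180 * MIN],
});
console.log(baselines["invoice.paid"] / MIN); // 45
console.log(baselines["customer.updated"] / MIN); // 360
```

A busy stream ends up with a tight threshold and a sparse one with a loose threshold, which avoids both missed incidents and nightly false pages.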
Incident Walkthrough: The 2am Signature Break
At 1:54am, an automated deploy ships a refactor that moves JSON body parsing in front of the webhook route. Signature verification now runs against the re-serialized body instead of the raw bytes, and every Stripe request starts returning 401. What each layer sees:
- Stripe's dashboard records every 401 and begins its retry schedule — exponential backoff, up to roughly 3 days. The information sits there, accurately, in a UI nobody is watching at night.
- Consumer-side logs capture the 401s, but alerting is keyed on 500s — "401 means someone external sent a bad request" was the assumption. No page fires.
- A freshness alert on processed payment events, had one existed, would have fired around 2:40am, roughly 45 minutes after the stream flatlined at 1:54am. It is the strongest catch, because it measures outcomes, not errors.
- The inspection endpoint keeps capturing regardless: the requests are arriving fine; verification is what broke. The history fills with valid Stripe events the production handler rejected.
Detection happens at 7:30am when the on-call engineer sees the 401 wall on the morning dashboard. Running signature verification against the captured requests in HookSense shows the signatures are valid — ruling out a rotated secret and pointing at the consumer's verification code. The deploy diff surfaces the body-parsing change in minutes.
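The failure itself is easy to reproduce. The sketch below uses a generic HMAC-SHA256 over the request body with Node's built-in `crypto` module; real provider schemes differ in detail (header format, timestamps), and the secret and payload here are placeholders.

```javascript
const crypto = require("node:crypto");

// Generic HMAC-SHA256 check over the request body.
function verifySignature(body, signatureHex, secret) {
  const expected = crypto.createHmac("sha256", secret).update(body).digest();
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

const secret = "whsec_example"; // placeholder secret
// Raw bytes as sent by the provider, including its own whitespace.
const rawBody = '{"id": "evt_1", "amount": 1000}';
const signature = crypto.createHmac("sha256", secret).update(rawBody).digest("hex");

// What the handler sees if a JSON parser ran first and the body was re-serialized.
const reserialized = JSON.stringify(JSON.parse(rawBody));

console.log(verifySignature(rawBody, signature, secret)); // true
console.log(verifySignature(reserialized, signature, secret)); // false
```

The parsed data is identical in both cases; only the bytes differ, which is why the signature is valid against the captured request and invalid inside the refactored handler.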
Recovery is where retention pays for itself. The fix ships at 8:10am. Stripe's retries will eventually redeliver most failed events — but "eventually" means hours on the backoff schedule, and anything past its retry budget is gone. Instead, the team filters the captured history to the 1:54–8:10 window and replays every event against the fixed endpoint; idempotent handlers deduplicate whatever Stripe retries anyway. Data loss: zero. Time to consistency: minutes, not the tail of a 3-day retry schedule.
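A bulk replay can be sketched as a loop over the captured requests in arrival order. The `send` function is injected so the sketch stays transport-agnostic; in practice it might wrap `fetch()` or a tool's own replay feature. All names here are illustrative.

```javascript
// Replay captured requests in arrival order and collect anything that failed.
// receivedAt is assumed to be a numeric timestamp in milliseconds.
async function replayAll(captured, send) {
  const results = { ok: 0, failed: [] };
  const ordered = [...captured].sort((a, b) => a.receivedAt - b.receivedAt);
  for (const req of ordered) {
    try {
      const status = await send(req);
      if (status >= 200 && status < 300) results.ok += 1;
      else results.failed.push({ id: req.id, status });
    } catch (err) {
      // Network-level failure: keep going and report it at the end.
      results.failed.push({ id: req.id, error: String(err) });
    }
  }
  return results;
}

// Stand-in sender for demonstration; a real one would POST to the fixed endpoint.
const fakeSend = async (req) => (req.id === "req_c" ? 500 : 200);

replayAll(
  [
    { id: "req_b", receivedAt: 2 },
    { id: "req_a", receivedAt: 1 },
    { id: "req_c", receivedAt: 3 },
  ],
  fakeSend
).then((r) => console.log(r.ok, r.failed.length)); // 2 1
```

Anything left in `failed` can be retried after a further fix, and because the handlers are idempotent, overlapping with the provider's own retries is harmless.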
The postmortem actions write themselves: a freshness alert on processed events, 401-rate shifts treated as pages, and the inspection endpoint kept permanently in the loop.
Putting It Together
Webhook monitoring is a layered posture, not one tool. The provider's dashboard is the authoritative delivery record; your logs and metrics are the authoritative processing record and where alerts live. The inspection layer is the connective tissue: an independent, retained, replayable record of everything that actually arrived, turning incidents from data-loss events into replay exercises.
Start with the cheapest wins: a freshness alert per critical event type, status-code distribution on one dashboard, idempotent handlers keyed on event IDs, and a capture endpoint registered alongside production. All four fit in an afternoon — and the next 2am deploy becomes a 20-minute fix instead of a three-week data drift.