How do I monitor webhooks in production?

Track four signals: delivery freshness (time since the last event arrived), failure rate broken down by HTTP status code, handler latency against the provider's timeout budget, and payload anomalies. Combine the provider's delivery dashboard with consumer-side structured logging, and keep an inspection endpoint like HookSense in the loop so you have a full request history to search and replay during incidents.

What metrics should I track for webhooks?

At minimum: events received per minute per event type, time since last event, response status code distribution, p95 handler latency, and a count of signature verification failures. Latency matters because providers enforce response budgets — Stripe marks a delivery failed if you do not return 2xx within 20 seconds.

How do I know if my webhook endpoint is down?

Errors alone will not tell you — a DNS misconfiguration, expired TLS certificate, or disabled provider endpoint produces zero traffic and zero error logs. Alert on absence of events: if a normally steady stream goes quiet for longer than its usual gap, page someone. Freshness alerts catch outages that error-rate alerts miss entirely.

Do webhook providers retry failed deliveries?

Most do, but only for a limited window. Stripe retries with exponential backoff for roughly 3 days and requires a 2xx response within 20 seconds. GitHub is an exception: it does not retry automatically and instead lets you redeliver individual events manually from the webhook settings UI. Once the retry window closes, the event is gone unless you captured it yourself, which is why a retained request history matters.

What is the difference between webhook monitoring and webhook debugging?

Debugging is reactive and local: inspect one payload, fix one handler, replay one request. Monitoring is continuous and production-wide: watch delivery rates, failure codes, latency, and freshness over time so you find out about breakage before your data drifts. The two converge during incidents — good monitoring tells you something broke, and captured request history lets you debug and recover.

Last updated: June 10, 2026 · 10 min read

Webhook Monitoring: How to Monitor Webhooks in Production

Monitor webhooks in production: the four signals to track, provider vs consumer-side monitoring, alerting on silence, and forensics with captured history.

Monitoring, Observability, Webhooks, Production

Ozer

Developer & Founder of HookSense

Monitoring webhooks in production means tracking four signals — delivery freshness, failure rate by status code, handler latency against the provider's timeout budget, and payload anomalies — from two vantage points: the provider's delivery logs and your own consumer-side telemetry. Neither sees the whole picture alone, which is why mature teams also keep an inspection endpoint in the loop: a retained, searchable history of every request that arrived, so when something breaks at 2am you can answer the question that matters — what exactly came in, and what did we do with it?

Why Webhooks Fail Silently in Production

Webhooks have a structural observability problem: sender and receiver belong to different organizations, and neither sees the whole pipeline. The provider knows what status code came back — but a 200 from a handler that swallowed an exception looks identical to a 200 from one that worked. The consumer knows what its handler did, but only for requests that arrived. If the provider silently disabled your endpoint after repeated failures, or a firewall rule started dropping traffic, your logs show nothing at all. No errors. Just silence.

There are no built-in delivery guarantees beyond the provider's retry policy, and those policies are finite. Stripe retries with exponential backoff for roughly 3 days, then gives up; GitHub does not retry automatically, offering manual redelivery from the webhook settings UI instead. Once the retry window closes, the event is gone. (See webhook retry strategies for a provider-by-provider breakdown and retry backoff for the mechanics.)

The result is a failure mode unique to webhooks: the system degrades invisibly until the data drifts. A renewal event gets dropped, and three weeks later a customer complains they were charged but their account shows expired. Nothing crashed. No alert fired. The first symptom was a support ticket.

The Four Signals to Monitor

1. Delivery Rate and Freshness

The most valuable webhook metric is also the simplest: time since the last event arrived. Every webhook stream has a natural rhythm, and when it stops, no error-based metric will tell you. Track events per minute, segmented by event type, and alert when the gap since the last event exceeds the stream's normal variance. A payments integration that usually sees an event every few minutes and has been quiet for an hour is an incident — even though every dashboard shows zero errors.
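A freshness check can be sketched in a few lines. This is a minimal illustration, assuming an in-memory map and hypothetical names (`FreshnessTracker`, `maxGapMsByType`); a real deployment would more likely keep the last-seen timestamp in its metrics store and let the alerting system evaluate the gap.

```javascript
// Hypothetical in-memory freshness tracker; names and thresholds are illustrative.
class FreshnessTracker {
  constructor(maxGapMsByType) {
    this.maxGapMsByType = maxGapMsByType; // e.g. { "invoice.paid": 45 * 60 * 1000 }
    this.lastSeen = new Map();            // event type -> timestamp (ms) of last arrival
  }

  // Call once for every webhook that arrives.
  record(eventType, nowMs = Date.now()) {
    this.lastSeen.set(eventType, nowMs);
  }

  // Returns the event types whose silence exceeds their allowed gap.
  staleTypes(nowMs = Date.now()) {
    const stale = [];
    for (const [type, maxGapMs] of Object.entries(this.maxGapMsByType)) {
      const last = this.lastSeen.get(type);
      // A type that has never been seen is reported as stale too.
      if (last === undefined || nowMs - last > maxGapMs) {
        stale.push({ type, silentForMs: last === undefined ? null : nowMs - last });
      }
    }
    return stale;
  }
}
```

Calling `staleTypes()` on a timer and paging when the result is non-empty gives an alert that fires on silence rather than on errors. Because never-seen types count as stale, the tracker should be seeded at startup if that behaviour is too noisy.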

2. Failure Rate by Status Code

Raw error counts are noisy; status code distribution is diagnostic:

401 / 403: Signature verification is failing — a rotated secret, a body-parsing change, a bad deploy.
404: A route moved or a reverse-proxy rule changed.
422 / 400: The payload schema changed, or your validation got stricter.
500: Your handler is throwing — the event arrived fine and your code broke.
502 / 503 / 504: Infrastructure — your app is down, overloaded, or behind a broken load balancer.

A sudden shift in the distribution matters more than the absolute rate — going from scattered 500s to 100% 401s within a minute of a deploy is a signature you will see in the walkthrough below.
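One way to detect such a shift is to compare the status-code shares of two consecutive time windows. The sketch below is illustrative only: the function names and the 0.5 jump threshold are assumptions, not recommended values.

```javascript
// Share of each status code in a window of responses, e.g. [200, 200, 401].
function statusDistribution(statusCodes) {
  const counts = {};
  for (const code of statusCodes) counts[code] = (counts[code] || 0) + 1;
  const dist = {};
  for (const [code, n] of Object.entries(counts)) dist[code] = n / statusCodes.length;
  return dist;
}

// Flags non-2xx codes whose share grew by more than `minJump` between two windows.
// The 0.5 default is an arbitrary illustrative threshold.
function distributionShifts(previous, current, minJump = 0.5) {
  const before = statusDistribution(previous);
  const after = statusDistribution(current);
  return Object.keys(after)
    .filter((code) => !code.startsWith("2") && after[code] - (before[code] || 0) > minJump)
    .map((code) => ({ code: Number(code), from: before[code] || 0, to: after[code] }));
}
```

Scattered 500s followed by a window of nothing but 401s is reported as a single shift on code 401, while a steady low rate of 500s produces no result.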

3. Handler Latency vs the Provider's Timeout Budget

Providers enforce response budgets tighter than most teams assume. Stripe requires a 2xx within 20 seconds or it marks the delivery failed and schedules a retry. A handler that averages 2 seconds but spikes to 25 under load silently converts healthy deliveries into failures — then processes the retries too. Monitor p95 and p99 latency, not the average, and alert at half the provider's budget. The durable fix is architectural: acknowledge immediately, queue the work, process in the background.
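The gap between average and tail latency is easy to see with a small percentile helper. This is a sketch using the nearest-rank method; the function names are hypothetical, and a metrics backend would normally compute percentiles for you.

```javascript
// Nearest-rank percentile of a list of latencies in milliseconds. p is in (0, 100].
function percentile(latenciesMs, p) {
  if (latenciesMs.length === 0) return null;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True when the chosen percentile exceeds half of the provider's response budget.
function latencyBreachesHalfBudget(latenciesMs, budgetMs, p = 99) {
  const value = percentile(latenciesMs, p);
  return value !== null && value > budgetMs / 2;
}
```

With 98 requests at 2 seconds and 2 requests at 25 seconds, the mean stays under 2.5 seconds, yet the p99 check against a 20-second budget reports a breach.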

4. Payload Anomalies

Providers evolve their schemas. Monitor for unknown event types reaching your catch-all branch, missing fields your handler depends on, payload sizes that jump an order of magnitude, and unexpected content types. A spike in unknown event types often means the provider shipped a new API version your integration is quietly ignoring.
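These checks can run before the handler's business logic. The following is a rough sketch: the event types, required fields, anomaly labels, and the tenfold size rule are all assumptions chosen for illustration.

```javascript
// Illustrative expectations; the event types and required fields are assumptions.
const REQUIRED_FIELDS = new Map([
  ["invoice.paid", ["id", "data"]],
  ["customer.subscription.deleted", ["id", "data"]],
]);

// Returns a list of anomaly labels for one request (empty when nothing looks off).
function findAnomalies({ contentType, rawBody }, typicalBytes) {
  const anomalies = [];
  if (!/^application\/json\b/i.test(contentType || "")) {
    anomalies.push("unexpected_content_type");
  }
  // Order-of-magnitude jump relative to the stream's typical payload size.
  if (Buffer.byteLength(rawBody) > typicalBytes * 10) {
    anomalies.push("oversized_payload");
  }

  let event;
  try {
    event = JSON.parse(rawBody);
  } catch {
    return [...anomalies, "unparseable_body"];
  }
  if (event === null || typeof event !== "object") {
    return [...anomalies, "unparseable_body"];
  }

  const required = REQUIRED_FIELDS.get(event.type);
  if (!required) return [...anomalies, "unknown_event_type"];
  for (const field of required) {
    if (!(field in event)) anomalies.push(`missing_field:${field}`);
  }
  return anomalies;
}
```

Counting each label as a metric turns a quiet schema change into a visible spike.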

Three Monitoring Patterns (and Why You Need More Than One)

Provider-Side Dashboards

Stripe's webhook dashboard shows every delivery attempt with status code, response body, and timing, plus manual resend; GitHub's settings page lists recent deliveries with a redeliver button. These are authoritative for "did the provider send it, and what did my server answer?" Their limits: retention is short and outside your control, search is weak, there is no cross-provider view, and crucially, nothing in a provider dashboard pages you.

Consumer-Side Logging and Metrics

Structured logs and metrics from your own handler answer the other half: "what did we do with the event?" Log the event ID, type, signature verification result, outcome, and duration — never the body if it contains sensitive data — and feed counters into your alerting stack. The blind spot is symmetric to the provider's: consumer-side telemetry only covers requests that arrived and code paths that emit logs. A crash before the logging line, a misrouted request, or a provider that stopped sending all produce the same output — nothing.
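A log line along these lines might look like the sketch below. The field names and the `webhookLogEntry` helper are hypothetical; the point is that identifying metadata is logged and the payload body is not.

```javascript
// Builds one structured log line per webhook. Field names are illustrative.
// The payload body is deliberately left out; only identifying metadata is kept.
function webhookLogEntry({ provider, event, signatureValid, outcome, startedAtMs, finishedAtMs }) {
  return JSON.stringify({
    msg: "webhook_handled",
    provider,
    eventId: event.id,
    eventType: event.type,
    signatureValid,
    outcome, // e.g. "processed", "duplicate", "failed"
    durationMs: finishedAtMs - startedAtMs,
  });
}
```

Emitting the line in a `finally` block around the handler keeps failures from skipping the log statement, although it still cannot cover a crash that happens before the handler is reached.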

An Inspection Endpoint in the Loop

The third pattern closes the gap: a capture layer in the delivery path that records every request independently of whether your handler succeeded. Most providers let you register multiple endpoint URLs for the same events — one for your production handler, one for the inspector.

This is the layer HookSense provides. Every request is captured with full headers, body, and timing, signature-verified, decrypted, and retained for 14, 30, or 90 days depending on plan (Catch with 14-day retention; Hook at $29/month with 30-day; Sense at $99/month with 90-day; paid plans are early-access). Bodies are encrypted at rest with AES-256-GCM. Built-in signature verification for Stripe, GitHub, Shopify, and custom HMAC schemes tells you whether a failing request was malformed or your verification code is the problem. Any captured request can be replayed, with edits, against any URL, and the same setup doubles as a staging debugging environment.

To be clear about the division of labor: HookSense is the visibility and forensics layer — capture, history, verification, replay. Your alerting lives in your metrics stack; the inspector is what you open once the alert fires.

Dead Letters and Idempotent Re-Processing

Monitoring tells you something failed; a dead-letter strategy makes the failure recoverable. When your handler cannot process an event, persist the raw event to a dead-letter store instead of dropping it, and alert on dead-letter depth. Re-processing only works if your handlers are idempotent, because provider retries, your replays, and your dead-letter drain can all deliver the same event more than once. Key every side effect on the provider's event ID:

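async function handleEvent(event) {
  // Skip events already recorded as processed (provider retry, replay, or dead-letter drain).
  const seen = await db.processedEvents.findById(event.id);
  if (seen) return { status: "duplicate", id: event.id };

  // Run the side effects, then record the event ID so later deliveries are skipped.
  // This check-then-insert is not atomic: a unique constraint on the ID column
  // is what stops two concurrent deliveries of the same event from both running.
  await processBusinessLogic(event);
  await db.processedEvents.insert({ id: event.id, at: new Date() });
  return { status: "processed", id: event.id };
}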

With idempotency in place, "replay everything from the incident window" turns from a risky operation into a routine one. The full pattern is covered in webhook idempotency: why and how.
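The dead-letter side of this can be sketched as a small store that exposes depth and age for alerting. Everything here is hypothetical (`DeadLetterStore` and its methods); production code would use a durable table or queue rather than an in-memory map.

```javascript
// Hypothetical in-memory dead-letter store; production code would use durable storage.
class DeadLetterStore {
  constructor() {
    this.entries = new Map(); // event ID -> { rawBody, error, failedAtMs }
  }

  // Persist the raw event so it can be re-processed after a fix.
  add(eventId, rawBody, error, nowMs = Date.now()) {
    if (!this.entries.has(eventId)) {
      this.entries.set(eventId, { rawBody, error: String(error), failedAtMs: nowMs });
    }
  }

  // The two numbers worth alerting on: how many events are parked, and the age of the oldest.
  stats(nowMs = Date.now()) {
    let oldest = null;
    for (const { failedAtMs } of this.entries.values()) {
      if (oldest === null || failedAtMs < oldest) oldest = failedAtMs;
    }
    return { depth: this.entries.size, oldestAgeMs: oldest === null ? 0 : nowMs - oldest };
  }

  // Re-run a handler over every parked event; an entry stays parked if the handler throws again.
  async drain(handler) {
    for (const [eventId, entry] of [...this.entries]) {
      try {
        await handler(JSON.parse(entry.rawBody));
        this.entries.delete(eventId);
      } catch {
        // Leave the entry in place for the next drain attempt.
      }
    }
  }
}
```

Passing an idempotent handler to `drain` means a partially successful drain can simply be run again.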

Retention Windows as Incident Forensics

During an incident, the two questions are: what arrived during the outage? and can we re-process it after the fix? Provider dashboards answer the first partially, for one provider, within their retention limits; your logs answer it only for requests your code logged. A captured request history answers it completely: filter to the incident window and you have the exact set of events — headers, bodies, timestamps — that your handler failed to process.
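Filtering and replaying can be sketched as two small functions. The record shape (`receivedAt`, `headers`, `body`) and the `deliver` callback are assumptions for illustration and do not describe any particular tool's export format or API.

```javascript
// `captured` is assumed to be an array of { receivedAt: ISO string, headers, body } records
// exported from whatever capture layer is in use; the shape is hypothetical.
function eventsInWindow(captured, startIso, endIso) {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  return captured
    .filter((req) => {
      const t = Date.parse(req.receivedAt);
      return t >= start && t <= end;
    })
    .sort((a, b) => Date.parse(a.receivedAt) - Date.parse(b.receivedAt));
}

// Replays requests oldest-first through `deliver`, a caller-supplied async function
// (for example one that POSTs to the fixed endpoint). Returns a tally of outcomes.
async function replayAll(requests, deliver) {
  const tally = { ok: 0, failed: 0 };
  for (const req of requests) {
    try {
      await deliver(req);
      tally.ok += 1;
    } catch {
      tally.failed += 1;
    }
  }
  return tally;
}
```

Replaying sequentially and oldest-first preserves arrival order, which matters when later events depend on state created by earlier ones.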

Retention length determines which incidents you can recover from. A 14-day window covers a bad deploy noticed within days; slow-burn failures — a breakage found during monthly reconciliation, a complaint about something three weeks old — need 30 or 90 days. The question to ask when choosing retention: how long does it typically take you to notice a silent data drift? That number, not your incident response time, is your minimum window.

Alert Design: Alert on Silence, Not Just Errors

Most teams alert on error rates and stop there. The failure modes that hurt longest are the quiet ones:

Alert on absence of events. A freshness alert ("no payment events for 45 minutes") catches disabled endpoints, DNS breakage, and provider-side misconfiguration that error alerts structurally cannot see.
Alert on distribution shifts, not thresholds alone. 100% 401s is a different incident than 5% 500s.
Alert on latency at half the provider's budget. If Stripe's limit is 20 seconds, page when p99 exceeds 10 seconds.
Alert on dead-letter depth and age. A growing dead-letter queue is a slow incident in progress.
Tune to the stream's rhythm. Per-event-type baselines beat one global threshold.
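Per-event-type baselines can be derived from historical arrival times. The sketch below uses one simple heuristic, a multiple of the longest gap seen in the baseline period; the heuristic, the multiplier, and the function names are assumptions, and other baselines (a high percentile of gaps, time-of-day buckets) may suit a given stream better.

```javascript
// Derives a silence threshold for one event type from its historical arrival times (ms).
// The multiplier is an assumed tuning knob: alert when silence exceeds `multiplier` times
// the longest gap seen in the baseline period.
function silenceThresholdMs(arrivalTimesMs, multiplier = 2) {
  if (arrivalTimesMs.length < 2) return null; // not enough history for a baseline
  const sorted = [...arrivalTimesMs].sort((a, b) => a - b);
  let longestGap = 0;
  for (let i = 1; i < sorted.length; i++) {
    longestGap = Math.max(longestGap, sorted[i] - sorted[i - 1]);
  }
  return longestGap * multiplier;
}

// One threshold per event type, so a chatty stream and a sparse one are judged separately.
function thresholdsByType(historyByType, multiplier = 2) {
  const out = {};
  for (const [type, times] of Object.entries(historyByType)) {
    out[type] = silenceThresholdMs(times, multiplier);
  }
  return out;
}
```

A stream that normally fires every minute or two ends up with a threshold of a few minutes, while a stream that fires hourly gets a threshold of hours, which a single global value cannot express.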

Incident Walkthrough: The 2am Signature Break

At 1:54am, an automated deploy ships a refactor that moves JSON body parsing in front of the webhook route. Signature verification now runs against the re-serialized body instead of the raw bytes, and every Stripe request starts returning 401. What each layer sees:

Stripe's dashboard records every 401 and begins its retry schedule — exponential backoff, up to roughly 3 days. The information sits there, accurately, in a UI nobody is watching at night.
Consumer-side logs capture the 401s, but alerting is keyed on 500s — "401 means someone external sent a bad request" was the assumption. No page fires.
A freshness alert on processed payment events, had one existed with a 45-minute threshold, would have fired around 2:40am, some 45 minutes after the stream flatlined at 1:54am. It is the strongest catch, because it measures outcomes, not errors.
The inspection endpoint keeps capturing regardless: the requests are arriving fine; verification is what broke. The history fills with valid Stripe events the production handler rejected.

Detection happens at 7:30am when the on-call engineer sees the 401 wall on the morning dashboard. Running signature verification against the captured requests in HookSense shows the signatures are valid — ruling out a rotated secret and pointing at the consumer's verification code. The deploy diff surfaces the body-parsing change in minutes.
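The failure mechanism is easy to reproduce. The sketch below uses a generic HMAC-SHA256 over the request body, which is not any specific provider's scheme (real providers define their own header formats and signed-payload layouts); the secret and payload are placeholders.

```javascript
const crypto = require("crypto");

// Generic HMAC-SHA256 check over the exact bytes received. Real providers each define
// their own header format and signed-payload layout; this is not any provider's scheme.
function verifySignature(rawBody, signatureHex, secret) {
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

// Sender side: the body as transmitted, with the whitespace the sender happened to use.
const secret = "whsec_example"; // placeholder secret for the demo
const rawBody = '{"id": "evt_1", "amount": 10.0}';
const signature = crypto.createHmac("sha256", secret).update(rawBody).digest("hex");

// Receiver side after the refactor: the body was parsed, then serialized again.
const reserialized = JSON.stringify(JSON.parse(rawBody)); // '{"id":"evt_1","amount":10}'

const rawOk = verifySignature(rawBody, signature, secret);               // true
const reserializedOk = verifySignature(reserialized, signature, secret); // false
```

Parsing and re-serializing drops whitespace and normalizes numbers, so the bytes no longer match what the sender signed even though the JSON is semantically identical. Verification has to run on the raw request body, before any body-parsing middleware touches it.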

Recovery is where retention pays for itself. The fix ships at 8:10am. Stripe's retries will eventually redeliver most failed events — but "eventually" means hours on the backoff schedule, and anything past its retry budget is gone. Instead, the team filters the captured history to the 1:54–8:10 window and replays every event against the fixed endpoint; idempotent handlers deduplicate whatever Stripe retries anyway. Data loss: zero. Time to consistency: minutes, not the tail of a 3-day retry schedule.

The postmortem actions write themselves: a freshness alert on processed events, 401-rate shifts treated as pages, and the inspection endpoint kept permanently in the loop.

Putting It Together

Webhook monitoring is a layered posture, not one tool. The provider's dashboard is the authoritative delivery record; your logs and metrics are the authoritative processing record and where alerts live. The inspection layer is the connective tissue: an independent, retained, replayable record of everything that actually arrived, turning incidents from data-loss events into replay exercises.

Start with the cheapest wins: a freshness alert per critical event type, status-code distribution on one dashboard, idempotent handlers keyed on event IDs, and a capture endpoint registered alongside production. All four fit in an afternoon — and the next 2am deploy becomes a 20-minute fix instead of a three-week data drift.
