Webhook Retries

When a webhook delivery fails, Von Payments retries it on a fixed schedule. This page documents the schedule, the response codes that influence it, the per-endpoint circuit breaker that pauses delivery to a chronically-failing URL, and where to see delivery state in the dashboard.

This page documents the retry behavior for subscription-level webhooks (/dashboard/developers/webhooks). The session-level surface (Webhooks) follows the same retry schedule; see its overview for the surface-specific signature and event shape.

Retry schedule — 8 attempts over ~79 hours

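| Attempt | Delay before this attempt | Cumulative time (approx.) |
|---|---|---|
| 1 | immediate | 0s |
| 2 | 30s | 30s |
| 3 | 2 min | ~2.5 min |
| 4 | 10 min | ~12 min |
| 5 | 1 hour | ~1h 12 min |
| 6 | 6 hours | ~7h 12 min |
| 7 | 24 hours | ~31h |
| 8 | 48 hours | ~79h |
| (none) | | dead |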

After attempt 8 fails, the delivery is marked dead. The original event row stays in the dashboard's delivery history with status: dead and a final response code; no further attempts are made.
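The cumulative column can be reproduced from the base delays with a few lines of arithmetic. This is a minimal sketch, not Von Payments code; the names `BASE_DELAYS_S` and `cumulative_schedule` are made up for illustration, and jitter is ignored.

```python
from itertools import accumulate

# Base delays in seconds before attempts 1..8, copied from the schedule table.
BASE_DELAYS_S = [0, 30, 2 * 60, 10 * 60, 60 * 60, 6 * 3600, 24 * 3600, 48 * 3600]


def cumulative_schedule(delays=BASE_DELAYS_S):
    """Return (attempt_number, cumulative_seconds) pairs, ignoring jitter."""
    return list(enumerate(accumulate(delays), start=1))


if __name__ == "__main__":
    for attempt, total in cumulative_schedule():
        print(f"attempt {attempt}: {total / 3600:.2f} h after the first attempt")
```

The base delays sum to 285,150 seconds, which is where the "~79 hours" figure comes from.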

Full-jitter delays

The "delay before this attempt" column is the base delay. The actual delay between two attempts is uniform in [0, base] — full jitter. This spreads retries so a wave of failed deliveries (e.g., an endpoint coming back from a 30-minute outage) doesn't all retry at the same instant and flatten the endpoint again.

What this means for you:

  • A given delivery's attempt-2 might come anywhere from 0 to 30 seconds after attempt 1.
  • A given delivery's attempt-5 might come anywhere from 0 to 60 minutes after attempt 4.
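  • The cumulative-time column above sums the base delays, so it is an upper bound on the scheduled delays rather than a typical value. The mean of a uniform [0, base] delay is base/2, so a typical delivery works through its 8 attempts in roughly half the listed time.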
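Full jitter as described above can be sketched in a few lines. The function names here are hypothetical and this is only an illustration of the "uniform in [0, base]" rule, not the retry engine's actual code.

```python
import random


def full_jitter(base_s, rng=random):
    """Pick the actual delay uniformly from [0, base_s] (full jitter)."""
    return rng.uniform(0.0, base_s)


def mean_delay(base_s, samples=10_000, rng=random):
    """Estimate the average jittered delay by sampling."""
    return sum(full_jitter(base_s, rng) for _ in range(samples)) / samples


if __name__ == "__main__":
    rng = random.Random(7)  # seeded only so repeated runs match
    print("attempt-2 delay:", full_jitter(30, rng))
    print("average attempt-2 delay:", mean_delay(30, rng=rng))
```

Any single delay lands somewhere between 0 and the base value, and the average over many deliveries comes out close to half the base delay.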

How response codes drive retry

| Outcome | What it means | What happens next |
|---|---|---|
| 200–299 | Delivered | Terminal. No further attempts. |
| 410 Gone | Endpoint declared the URL permanently dead | Subscription is disabled; row marked dead. No further attempts to this subscription for any event. |
| 429 Too Many Requests | Endpoint rate-limited the request | Row marked dead immediately (see the 429 note below). Return 503 instead if you rate-limit inbound webhook traffic. |
| Other 4xx (400, 401, 403, 404, 422, …) | Endpoint rejected the request | Row marked dead immediately. No retry: a retry can't fix a misconfigured signature check or a route that doesn't exist. |
| 5xx (500, 502, 503, 504) | Endpoint had a transient error | Retry with backoff per the schedule above. |
| Network error / timeout (no response) | Connection refused, DNS failed, or your handler took longer than 10 seconds | Retry with backoff per the schedule above. |

The 10-second timeout is hard. If your handler does heavy work synchronously, queue it and return 200 immediately — see Best Practices.
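The response-code table above can be read as a small decision function. The sketch below is an illustration only: `Outcome` and `classify` are hypothetical names, `None` stands in for a network error or timeout, and statuses the table does not mention (1xx, 3xx) are rejected rather than guessed at.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    DEAD = "dead"
    DISABLE_SUBSCRIPTION = "disable_subscription"


def classify(status: Optional[int]) -> Outcome:
    """Map an HTTP status (None = network error or timeout) to the next step."""
    if status is None:
        return Outcome.RETRY                 # no response: retry with backoff
    if 200 <= status <= 299:
        return Outcome.DELIVERED             # terminal success
    if status == 410:
        return Outcome.DISABLE_SUBSCRIPTION  # row dead, subscription disabled
    if 400 <= status <= 499:
        return Outcome.DEAD                  # includes 429: no retry
    if 500 <= status <= 599:
        return Outcome.RETRY                 # transient: retry with backoff
    # The table does not say how other statuses are treated.
    raise ValueError(f"status {status} is not covered by the table")
```

Note that 429 falls into the same branch as every other non-410 4xx, which is why a rate-limited delivery is not retried.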

Why 4xx is non-retryable

The retry schedule exists to absorb transient failures (network blips, brief endpoint restarts, short load spikes). It can't paper over wrong code:

  • A 401 means your signature verification rejected the request. The next attempt carries the same signature; it'll fail the same way.
  • A 404 means the URL doesn't route. The next attempt hits the same URL.
  • A 422 means your handler parsed the body and explicitly rejected it. Retrying won't change the body.

Marking these dead immediately surfaces the misconfiguration to you faster (you see one failed row in the dashboard, not eight).

429 Too Many Requests is the exception worth flagging. Standard rate-limiting middleware returns 429 when an endpoint is over budget — that is a transient state, but the retry engine treats every 4xx outside 410 the same way and marks the row dead. If your handler may rate-limit inbound webhook traffic, return 503 from the rate-limiter instead so the retry engine backs off. The alternative is silent event loss whenever your rate-limit window overlaps a webhook delivery.

If you're returning any other 4xx on transient state (e.g., "I haven't seen this customer yet, come back later"), return 503 instead — that's the contract the retry engine speaks.
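One way to follow this advice is to make the rate limiter in front of your webhook route answer 503 where it would normally answer 429. The sketch below assumes a simple fixed-window limiter; `FixedWindowLimiter` and `rejection_status` are hypothetical names, and a real deployment would more likely configure this in existing middleware.

```python
import time
from typing import Optional


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds."""

    def __init__(self, limit, window_s=1.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.window_start = now  # start a fresh window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def rejection_status(limiter: FixedWindowLimiter) -> Optional[int]:
    """Return None if the webhook may proceed, else the status to reject with.

    Over budget is a transient state, so reject with 503 (retried with
    backoff) rather than 429 (marked dead immediately).
    """
    return None if limiter.allow() else 503
```

The handler would call `rejection_status` before doing any work and return the 503 straight away when it is not `None`.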

Per-endpoint circuit breaker

Layered on top of the retry schedule is a per-endpoint circuit breaker. It pauses outbound delivery to any one endpoint that's repeatedly returning 5xx so a single broken consumer doesn't queue up against itself — and so one merchant's failing endpoint doesn't slow delivery to every other merchant.

State machine

  closed ──5 failures within a 60s window──▶ open
    ▲                                         │
    │                                         │ cooldown elapsed
    │ success                                 ▼
    └──────────────────────────────────── half-open ──5xx──▶ open (longer cooldown)

  • Closed — normal delivery. Every attempt goes through.
  • Open — delivery to this endpoint is paused. No HTTP attempts are made; rows wait. After a cooldown, the breaker moves to half-open.
  • Half-open — one probe attempt goes through. If it succeeds, the breaker closes. If it fails, the breaker re-opens with a longer cooldown.

Cooldown progression

The cooldown doubles on each re-open until it caps at 5 minutes:

| Re-open count | Cooldown |
|---|---|
| 1 (first open) | 30s |
| 2 | 60s |
| 3 | 120s (2 min) |
| 4 | 240s (4 min) |
| 5+ | 300s (5 min) |

A streak of successful deliveries resets the counter; the next open starts over at 30s.
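The state machine and cooldown progression can be modeled together in a short sketch. This is an illustration under stated assumptions, not the delivery engine's implementation: the class and method names are hypothetical, the caller is assumed to report each attempt's result, and the length of the success streak that resets the counter (`reset_streak`) is a guess, since this page does not give a number.

```python
import time
from collections import deque

FAILURE_THRESHOLD = 5   # failures needed to open the breaker
FAILURE_WINDOW_S = 60   # counted within this many seconds


def cooldown_s(open_count: int) -> int:
    """30s on the first open, doubling per re-open, capped at 300s."""
    return min(30 * 2 ** (open_count - 1), 300)


class CircuitBreaker:
    def __init__(self, clock=time.monotonic, reset_streak=5):
        self.clock = clock
        self.reset_streak = reset_streak  # assumption: not specified in the docs
        self.state = "closed"
        self.failures = deque()           # timestamps of recent failures
        self.open_count = 0
        self.opened_at = 0.0
        self.success_streak = 0

    def allow(self) -> bool:
        """True if an HTTP attempt may be made right now."""
        if self.state == "closed":
            return True
        cooled = self.clock() - self.opened_at >= cooldown_s(self.open_count)
        if self.state == "open" and cooled:
            self.state = "half-open"
            return True                   # the single probe attempt
        return False                      # cooling down, or probe in flight

    def record_success(self) -> None:
        if self.state == "half-open":
            self.state = "closed"
            self.failures.clear()
        self.success_streak += 1
        if self.success_streak >= self.reset_streak:
            self.open_count = 0           # next open starts over at 30s

    def record_failure(self) -> None:
        now = self.clock()
        self.success_streak = 0
        if self.state == "open":
            return                        # late result from before the open
        if self.state == "half-open":
            self._open(now)               # probe failed: longer cooldown
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] >= FAILURE_WINDOW_S:
            self.failures.popleft()       # drop failures outside the window
        if len(self.failures) >= FAILURE_THRESHOLD:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.open_count += 1
        self.opened_at = now
        self.failures.clear()
```

A delivery loop would call `allow()` before each attempt, skip the HTTP call when it returns `False`, and report the result with `record_success()` or `record_failure()`.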

What this looks like in the delivery log

A row delivered through a closed breaker shows the normal attempt sequence. A row whose attempt was suppressed while the breaker was open appears in the dashboard with a circuit_open annotation on the attempt, so you can distinguish "endpoint didn't respond" from "delivery was suppressed because the endpoint was misbehaving moments ago."

The circuit breaker is per-subscription, not global. A breaker opening on Subscription A has no effect on Subscription B, even on the same merchant.

Dead-letter queue

When a delivery hits any of these terminal states, the row moves to the dead-letter queue (DLQ):

  • All 8 retry attempts exhausted
  • A 4xx (non-410) marked it dead immediately
  • A 410 Gone disabled the subscription
  • The subscription was deleted mid-flight

DLQ rows are retained for 30 days. They're visible in the dashboard at /dashboard/developers/webhooks/{id}/dlq with the final response code, the error message excerpt (response body or network error reason), and the full request payload.

Your endpoint's response body is stored

The error message excerpt is the response body your endpoint returned on the failing attempt. It's retained for the 30-day DLQ window and visible to anyone with merchant-dashboard access for your account. Keep error responses short and avoid including PII (customer emails, full request bodies, internal stack traces) — there's no need for them in a webhook reject path, and they'd outlive the failure itself in the DLQ.

You can manually retry a DLQ row from the dashboard — useful if you've fixed the handler and want to re-process events that died during the outage. After the 30-day retention window, the DLQ row is purged and the event can no longer be redelivered.

Where to see delivery state

| Surface | What you see |
|---|---|
| /dashboard/developers/webhooks/{id} → Deliveries | All recent deliveries with status (delivered / retrying / dead), attempt count, last response code, next-retry timestamp if still retrying. |
| /dashboard/developers/webhooks/{id}/dlq | Dead-lettered deliveries with the final response code, error message, full payload, and a manual "Retry" button. |
| Per-event view | Click any event to see its full attempt history (each attempt's timestamp, response code, response body excerpt, and the circuit_open annotation if applicable). |

Sending a test event

Use the CLI to fire a fully-signed test event at your handler — see Test your handler. Test events go through the same delivery code path that production events do, so the retry schedule, circuit breaker, and DLQ all work identically against your localhost handler.

Designing your handler around the retry contract

A few invariants worth coding against:

  • Idempotency. The same id (event ID) may arrive more than once. For example, if your handler processed an attempt but returned a 5xx by mistake (or timed out), the retry engine delivers the event again. Guard on id and return 200 for IDs you have already processed.
  • Order is not guaranteed. Out of two events emitted close together, the second can be delivered before the first if the first hits 5xx on attempt 1 and retries. Build your handler to be order-tolerant: read state from the API by ID rather than from the event payload alone where ordering matters.
  • Fast 200, async work. Return 200 as soon as you've persisted the event ID for idempotency. Do the heavy work (charge order, send email, update analytics) in a background job. A handler that takes >10s to respond hits the timeout and gets retried — at which point you're racing your own background work.
  • 503 over 4xx for transient. If you genuinely can't process the event right now but expect to be able to soon (e.g., a backing service is briefly down), return 503 — the retry engine will back off and try again. A 4xx marks the event dead immediately and you'll only see it again via a manual DLQ retry.
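The invariants above can be combined into a handler core that stays independent of any web framework. This is a minimal sketch under several assumptions: `handle_webhook` is a hypothetical name, the in-memory `set` stands in for a durable store of processed IDs, a bounded `queue.Queue` stands in for your job system, and signature verification is assumed to have happened before this function runs.

```python
import queue


def handle_webhook(event: dict, seen_ids: set, jobs: queue.Queue) -> int:
    """Return the HTTP status to send back for one delivery attempt."""
    event_id = event["id"]
    if event_id in seen_ids:
        return 200        # duplicate delivery: already accepted, acknowledge
    try:
        jobs.put_nowait(event)  # heavy work runs later in a background worker
    except queue.Full:
        return 503        # transient: let the retry engine back off and retry
    seen_ids.add(event_id)      # record the ID only once the event is queued
    return 200            # fast acknowledgement, well inside the 10s timeout


if __name__ == "__main__":
    seen, jobs = set(), queue.Queue(maxsize=100)
    print(handle_webhook({"id": "evt_example"}, seen, jobs))
    print(handle_webhook({"id": "evt_example"}, seen, jobs))
```

The ID is recorded only after the event is safely queued, so a 503 response leaves the event unrecorded and the next retry is processed normally. To stay order-tolerant, the background worker would fetch current state from the API by ID rather than trust the payload alone.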