Webhook Retries
When a webhook delivery fails, Von Payments retries it on a fixed schedule. This page documents the schedule, the response codes that influence it, the per-endpoint circuit breaker that pauses delivery to a chronically-failing URL, and where to see delivery state in the dashboard.
It covers subscription-level webhooks (/dashboard/developers/webhooks). The session-level surface (Webhooks) follows the same retry schedule; see its overview for the surface-specific signature and event shape.
Retry schedule — 8 attempts over ~79 hours
| Attempt | Delay before this attempt | Cumulative time (approx.) |
|---|---|---|
| 1 | immediate | 0s |
| 2 | 30s | 30s |
| 3 | 2 min | ~2.5 min |
| 4 | 10 min | ~12 min |
| 5 | 1 hour | ~1h 12 min |
| 6 | 6 hours | ~7h 12 min |
| 7 | 24 hours | ~31h |
| 8 | 48 hours | ~79h |
| (none) | dead | — |
After attempt 8 fails, the delivery is marked dead. The original event row stays in the dashboard's delivery history with status: dead and a final response code; no further attempts are made.
Full-jitter delays
The "delay before this attempt" column is the base delay. The actual delay between two attempts is uniform in [0, base] — full jitter. This spreads retries so a wave of failed deliveries (e.g., an endpoint coming back from a 30-minute outage) doesn't all retry at the same instant and flatten the endpoint again.
What this means for you:
- A given delivery's attempt-2 might come anywhere from 0 to 30 seconds after attempt 1.
- A given delivery's attempt-5 might come anywhere from 0 to 60 minutes after attempt 4.
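- The cumulative-time column above is an upper bound: it assumes every delay lands at its full base value. Because each delay is uniform in [0, base], the expected delay is half the base, so a typical delivery finishes its 8 attempts in roughly half the listed time, and some finish noticeably faster.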
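The jitter rule above can be sketched in a few lines. This is an illustration only, not the delivery engine's code: the base delays come from the schedule table, while the names `BASE_DELAYS_S`, `jittered_delay`, and `simulate_total_wait` are invented for this example.

```python
import random

# Base delays in seconds before attempts 2..8, taken from the schedule table.
BASE_DELAYS_S = [30, 120, 600, 3600, 21600, 86400, 172800]


def jittered_delay(base_s: float, rng: random.Random) -> float:
    """Full jitter: sample the actual delay uniformly from [0, base]."""
    return rng.uniform(0, base_s)


def simulate_total_wait(rng: random.Random) -> float:
    """Time from attempt 1 to attempt 8 for one simulated delivery."""
    return sum(jittered_delay(base, rng) for base in BASE_DELAYS_S)


upper_bound_s = sum(BASE_DELAYS_S)  # every delay at its maximum (~79 hours)
expected_s = upper_bound_s / 2      # the mean of uniform [0, base] is base / 2

rng = random.Random(42)  # seeded only so the sketch is reproducible
sample_s = simulate_total_wait(rng)

print(f"upper bound: {upper_bound_s / 3600:.1f} h")
print(f"expected:    {expected_s / 3600:.1f} h")
print(f"one sample:  {sample_s / 3600:.1f} h")
```

The last two delays (24 and 48 hours) dominate the total, so how quickly a given delivery reaches attempt 8 depends mostly on where those two samples land.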
How response codes drive retry
| Outcome | What it means | What happens next |
|---|---|---|
| 200–299 | Delivered | Terminal. No further attempts. |
| 410 Gone | Endpoint declared the URL permanently dead | Subscription is disabled; row marked dead. No further attempts to this subscription for any event. |
| 429 Too Many Requests | Endpoint rate-limited the request | Row marked dead immediately; see the 429 note below. Return 503 instead if you rate-limit inbound webhook traffic. |
| Other 4xx (400, 401, 403, 404, 422, …) | Endpoint rejected the request | Row marked dead immediately. No retry: a retry can't fix a misconfigured signature check or a route that doesn't exist. |
| 5xx (500, 502, 503, 504) | Endpoint had a transient error | Retry with backoff per the schedule above. |
| Network error / timeout (no response) | Connection refused, DNS failed, or your handler took longer than 10 seconds | Retry with backoff per the schedule above. |
The 10-second timeout is hard. If your handler does heavy work synchronously, queue it and return 200 immediately — see Best Practices.
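As a compact restatement of the table above, the decision can be written as one function. This is a sketch, not the retry engine's implementation; the `Action` enum and `classify` are names made up for illustration, and status codes the table does not mention (1xx, 3xx) are rejected rather than guessed at.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DELIVERED = "delivered"           # terminal success
    DISABLE_SUBSCRIPTION = "disable"  # 410: subscription disabled, row dead
    DEAD = "dead"                     # non-retryable failure
    RETRY = "retry"                   # back off per the retry schedule


def classify(status: Optional[int]) -> Action:
    """Map a delivery outcome to the next step.

    `status` is the HTTP status code, or None for a network error or a
    timeout (no response within 10 seconds).
    """
    if status is None:
        return Action.RETRY
    if 200 <= status <= 299:
        return Action.DELIVERED
    if status == 410:
        return Action.DISABLE_SUBSCRIPTION
    if 400 <= status <= 499:
        return Action.DEAD  # includes 429
    if 500 <= status <= 599:
        return Action.RETRY
    # The table does not say what happens for 1xx or 3xx responses.
    raise ValueError(f"status {status} is not covered by the table")


for code in (200, 410, 429, 404, 503, None):
    print(code, "->", classify(code).value)
```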
Why 4xx is non-retryable
The retry schedule exists to absorb transient failures (network blips, brief endpoint restarts, short load spikes). It can't paper over wrong code:
- A 401 means your signature verification rejected the request. The next attempt carries the same signature; it'll fail the same way.
- A 404 means the URL doesn't route. The next attempt hits the same URL.
- A 422 means your handler parsed the body and explicitly rejected it. Retrying won't change the body.
Marking these dead immediately surfaces the misconfiguration to you faster (you see one failed row in the dashboard, not eight).
429 Too Many Requests is the exception worth flagging. Standard rate-limiting middleware returns 429 when an endpoint is over budget — that is a transient state, but the retry engine treats every 4xx outside 410 the same way and marks the row dead. If your handler may rate-limit inbound webhook traffic, return 503 from the rate-limiter instead so the retry engine backs off. The alternative is silent event loss whenever your rate-limit window overlaps a webhook delivery.
If you're returning any other 4xx on transient state (e.g., "I haven't seen this customer yet, come back later"), return 503 instead — that's the contract the retry engine speaks.
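One way to apply this on the receiving side is to wrap the webhook route so that a 429 produced by shared rate-limiting code leaves as a 503. The sketch below is framework-agnostic and assumes a handler is a plain function returning a `(status, body)` pair; `webhook_safe` and the `Response` alias are hypothetical names, not part of any Von Payments SDK.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, short body)


def webhook_safe(handler: Callable[[bytes], Response]) -> Callable[[bytes], Response]:
    """Wrap a webhook handler so a 429 never reaches the sender.

    A 429 would mark the delivery dead; a 503 is retried with backoff.
    Every other status passes through unchanged.
    """
    def wrapped(raw_body: bytes) -> Response:
        status, body = handler(raw_body)
        if status == 429:
            return 503, "temporarily unavailable"
        return status, body
    return wrapped


def rate_limited_handler(raw_body: bytes) -> Response:
    """Stand-in for a handler sitting behind a rate limiter that is over budget."""
    return 429, "slow down"


print(webhook_safe(rate_limited_handler)(b"{}"))
```

Apply the wrapper only to webhook routes; for browser or API clients, 429 is still the more informative status.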
Per-endpoint circuit breaker
Layered on top of the retry schedule is a per-endpoint circuit breaker. It pauses outbound delivery to any one endpoint that's repeatedly returning 5xx so a single broken consumer doesn't queue up against itself — and so one merchant's failing endpoint doesn't slow delivery to every other merchant.
State machine
closed ──5 failures within a 60s window──▶ open
  ▲                                          │
  │                                          │ cooldown elapsed
  │ success                                  ▼
  └───────────────────────────────────── half-open ──5xx──▶ open (longer cooldown)
- Closed — normal delivery. Every attempt goes through.
- Open — delivery to this endpoint is paused. No HTTP attempts are made; rows wait. After a cooldown, the breaker moves to half-open.
- Half-open — one probe attempt goes through. If it succeeds, the breaker closes. If it fails, the breaker re-opens with a longer cooldown.
Cooldown progression
The cooldown doubles on each re-open until it caps at 5 minutes:
| Re-open count | Cooldown |
|---|---|
| 1 (first open) | 30s |
| 2 | 60s |
| 3 | 120s (2 min) |
| 4 | 240s (4 min) |
| 5+ | 300s (5 min) |
A streak of successful deliveries resets the counter; the next open starts over at 30s.
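The state machine and cooldown table can be modelled in a small class. This is a minimal sketch under stated assumptions, not the production breaker: time is passed in explicitly to keep it deterministic, the class and method names are invented, and `SUCCESS_STREAK_TO_RESET = 3` is a placeholder because the length of the success streak that resets the counter is not documented here.

```python
from collections import deque
from typing import Deque

FAILURE_THRESHOLD = 5        # failures that trip a closed breaker
FAILURE_WINDOW_S = 60.0      # ...within this many seconds
BASE_COOLDOWN_S = 30.0       # cooldown for the first open
MAX_COOLDOWN_S = 300.0       # cap from the cooldown table
SUCCESS_STREAK_TO_RESET = 3  # assumption: the real streak length is not documented


def cooldown_for(open_count: int) -> float:
    """Cooldown for the Nth open in a row: 30, 60, 120, 240, then 300 seconds."""
    return min(BASE_COOLDOWN_S * 2 ** (open_count - 1), MAX_COOLDOWN_S)


class CircuitBreaker:
    def __init__(self) -> None:
        self.state = "closed"
        self.failures: Deque[float] = deque()  # timestamps of recent failures
        self.open_count = 0
        self.opened_at = 0.0
        self.success_streak = 0

    def allow(self, now: float) -> bool:
        """Return True if an HTTP attempt may go out at time `now`."""
        if self.state == "closed":
            return True
        cooled = now - self.opened_at >= cooldown_for(self.open_count)
        if self.state == "open" and cooled:
            self.state = "half-open"  # cooldown elapsed: let one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        self.state = "closed"
        self.failures.clear()
        self.success_streak += 1
        if self.success_streak >= SUCCESS_STREAK_TO_RESET:
            self.open_count = 0  # the next open starts over at 30s

    def record_failure(self, now: float) -> None:
        self.success_streak = 0
        if self.state == "half-open":
            self._open(now)  # failed probe: re-open with a longer cooldown
            return
        self.failures.append(now)
        while now - self.failures[0] > FAILURE_WINDOW_S:
            self.failures.popleft()  # drop failures that fell out of the window
        if len(self.failures) >= FAILURE_THRESHOLD:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.open_count += 1
        self.opened_at = now
        self.failures.clear()
```

A caller would check `allow(now)` before each attempt, record a suppressed attempt when it returns False, and report the outcome of every attempt that does go out through `record_success` or `record_failure`.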
What this looks like in the delivery log
A row delivered through a closed breaker shows the normal attempt sequence. A row whose attempt was suppressed while the breaker was open appears in the dashboard with a circuit_open annotation on the attempt, so you can distinguish "endpoint didn't respond" from "delivery was suppressed because the endpoint was misbehaving moments ago."
The circuit breaker is per-subscription, not global. A breaker opening on Subscription A has no effect on Subscription B, even on the same merchant.
Dead-letter queue
When a delivery hits any of these terminal states, the row moves to the dead-letter queue (DLQ):
- All 8 retry attempts exhausted
- A 4xx (non-410) marked it dead immediately
- A 410 Gone disabled the subscription
- The subscription was deleted mid-flight
DLQ rows are retained for 30 days. They're visible in the dashboard at /dashboard/developers/webhooks/{id}/dlq with the final response code, the error message excerpt (response body or network error reason), and the full request payload.
The error message excerpt is the response body your endpoint returned on the failing attempt. It's retained for the 30-day DLQ window and visible to anyone with merchant-dashboard access for your account. Keep error responses short and avoid including PII (customer emails, full request bodies, internal stack traces) — there's no need for them in a webhook reject path, and they'd outlive the failure itself in the DLQ.
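One pattern that keeps error bodies short is to log the details on your side and return only a stable error code plus an opaque reference. The sketch below assumes a handler that returns a `(status, body)` pair; `short_error_response` and the `ref` field are illustrative names, not a format the dashboard requires.

```python
import json
import logging
import uuid
from typing import Tuple

logger = logging.getLogger("webhooks")


def short_error_response(status: int, code: str, exc: Exception) -> Tuple[int, str]:
    """Log the full failure locally and return a small, PII-free body.

    The body carries a stable error code and an opaque reference that can
    be matched against your own logs; the exception text stays out of it.
    """
    ref = uuid.uuid4().hex
    logger.error("webhook handler failed ref=%s", ref, exc_info=exc)
    return status, json.dumps({"error": code, "ref": ref})


status, body = short_error_response(
    500, "internal_error", ValueError("lookup failed for jane@example.com")
)
print(status, body)
```

The excerpt stored in the DLQ then contains only the code and the reference, while the stack trace and any customer data stay in your own logging system.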
You can manually retry a DLQ row from the dashboard — useful if you've fixed the handler and want to re-process events that died during the outage. After the 30-day retention window, the DLQ row is purged and the event can no longer be redelivered.
Where to see delivery state
| Surface | What you see |
|---|---|
| /dashboard/developers/webhooks/{id} → Deliveries | All recent deliveries with status (delivered / retrying / dead), attempt count, last response code, next-retry timestamp if still retrying. |
| /dashboard/developers/webhooks/{id}/dlq | Dead-lettered deliveries with the final response code, error message, full payload, and a manual "Retry" button. |
| Per-event view | Click any event to see its full attempt history (each attempt's timestamp, response code, response body excerpt, and the circuit_open annotation if applicable). |
Sending a test event
Use the CLI to fire a fully-signed test event at your handler — see Test your handler. Test events go through the same delivery code path that production events do, so the retry schedule, circuit breaker, and DLQ all work identically against your localhost handler.
Designing your handler around the retry contract
A few invariants worth coding against:
- Idempotency. The same id (event ID) may arrive more than once, for example if your handler processed a previous attempt but returned a 5xx by mistake or timed out, so the next attempt delivers it again. Idempotency-guard on id and return 200 for known-processed IDs.
- Order is not guaranteed. Of two events emitted close together, the second can be delivered before the first if the first hits 5xx on attempt 1 and retries. Build your handler to be order-tolerant: read state from the API by ID rather than from the event payload alone where ordering matters.
- Fast 200, async work. Return 200 as soon as you've persisted the event ID for idempotency. Do the heavy work (charge order, send email, update analytics) in a background job. A handler that takes >10s to respond hits the timeout and gets retried, at which point you're racing your own background work.
- 503 over 4xx for transient failures. If you genuinely can't process the event right now but expect to be able to soon (e.g., a backing service is briefly down), return 503; the retry engine will back off and try again. A 4xx marks the event dead immediately and you'll only see it again via a manual DLQ retry.
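The invariants above might combine into a handler shaped like the following. It is a sketch with in-memory stand-ins, not a reference implementation: `processed_ids` and `work_queue` take the place of a durable store and a real job queue, `TransientError` and `persist_event_id` are hypothetical, signature verification is assumed to have happened already, and only the `id` field of the event is taken from this page.

```python
import json
import queue
from typing import Callable, Set, Tuple

processed_ids: Set[str] = set()                  # stand-in for a durable store
work_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a job queue


class TransientError(Exception):
    """Raised when a backing service is briefly unavailable."""


def persist_event_id(event_id: str) -> None:
    """Hypothetical persistence step; a real one would write to a database."""
    processed_ids.add(event_id)


def handle_webhook(
    raw_body: bytes,
    persist: Callable[[str], None] = persist_event_id,
) -> Tuple[int, str]:
    """Return (status, body) for an already signature-verified request."""
    event = json.loads(raw_body)
    event_id = event["id"]
    if event_id in processed_ids:
        return 200, "already processed"        # duplicate delivery: acknowledge again
    try:
        persist(event_id)
    except TransientError:
        return 503, "temporarily unavailable"  # retried with backoff, not marked dead
    work_queue.put(event)                      # heavy work runs in a background worker
    return 200, "accepted"


print(handle_webhook(b'{"id": "evt_example"}'))
print(handle_webhook(b'{"id": "evt_example"}'))
```

The handler does nothing slow before responding, which keeps it well inside the 10-second timeout. A background worker draining `work_queue` should still read current state from the API by ID where ordering matters.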
Related
- Webhooks (session-level) — how the session.* webhook surface works
- Webhook Event Reference — subscription-level event catalog
- Webhook Signature Verification — verifying both signature formats
- Webhook Signing Secrets — create, view-once, rotate