Delivery Recovery
This guide is for teams responsible for keeping a live integration healthy.
It explains how to reason about delivery state, dead letters, and replay without turning recovery into guesswork.
What this page is for
Use this page when you need to answer questions like:
- Is this an authentication problem or a delivery problem?
- Are deliveries stuck, retrying, or dead-lettered?
- Should I replay one item or recover a larger batch?
- Is the delivery queue draining normally?
Start with diagnosis, not replay
The first step is always to inspect the app’s operational state.
That inspection should tell you whether the primary problem is:
- invalid app auth
- missing delegated authority
- delivery backlog
- dead-letter accumulation
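As a rough illustration, the triage above can be written as a small classifier. The snapshot fields (`auth_valid`, `grants_ok`, `queued`, `dead_lettered`) and the backlog threshold are hypothetical names and values chosen for this sketch, not fields of any real API. Map them to whatever your operational view actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical operational snapshot; all field names are assumptions."""
    auth_valid: bool     # app credentials are currently accepted
    grants_ok: bool      # delegated authority is present
    queued: int          # deliveries waiting or retrying
    dead_lettered: int   # deliveries that exhausted their retries


def primary_problem(s: Snapshot, backlog_threshold: int = 100) -> str:
    """Return the first matching problem, checked in the order listed above."""
    if not s.auth_valid:
        return "invalid app auth"
    if not s.grants_ok:
        return "missing delegated authority"
    if s.queued > backlog_threshold:  # threshold is an illustrative assumption
        return "delivery backlog"
    if s.dead_lettered > 0:
        return "dead-letter accumulation"
    return "healthy"
```

Checking auth and grants first reflects the idea that delivery symptoms are only worth chasing once the app is known to be allowed to deliver at all.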
When to replay a single delivery
Replay one delivery when:
- the underlying bug is fixed
- the failure was isolated
- you know exactly which delivery should be retried
This is the safest recovery path: only one delivery is re-sent, so the risk of duplicate downstream work is limited to that delivery.
When to replay dead letters in batch
Replay a batch only when the failure was systemic, for example:
- the receiver was down
- signature validation was broken
- a parser bug affected many deliveries
Batch replay is a recovery tool, not a first diagnostic step.
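One way to keep a batch replay scoped to the systemic failure is to select only the dead letters that failed during the incident window and for the known reason. The sketch below assumes a hypothetical `DeadLetter` record with `delivery_id`, `failed_at`, and `reason` fields; your dead-letter store may expose different names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DeadLetter:
    """Hypothetical dead-letter record; field names are assumptions."""
    delivery_id: str
    failed_at: datetime
    reason: str


def select_for_batch_replay(dead_letters, window_start, window_end, reason):
    """Keep only dead letters that failed inside the incident window
    for the known systemic reason; everything else is left untouched."""
    return [
        d for d in dead_letters
        if window_start <= d.failed_at <= window_end and d.reason == reason
    ]
```

Dead letters outside the window, or with a different failure reason, are likely unrelated to the incident and are better handled as single replays after their own diagnosis.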
When to inspect delivery health
Delivery health matters when you see:
- growing retry counts
- a rising dead-letter count
- queued deliveries aging instead of draining
- inconsistent downstream state after valid writes
Those are delivery signals, not application-state signals.
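The first three signals are trends, so they only show up when you compare two samples taken some time apart. A minimal sketch, assuming a hypothetical `DeliveryStats` record (the field names are illustrative, and `retry_count` is assumed to mean deliveries currently in retry rather than a lifetime total):

```python
from dataclasses import dataclass


@dataclass
class DeliveryStats:
    """Hypothetical delivery counters sampled at one point in time."""
    retry_count: int            # deliveries currently in retry (assumption)
    dead_letter_count: int      # deliveries currently dead-lettered
    oldest_queued_age_s: float  # age of the oldest queued delivery, in seconds


def delivery_warnings(earlier: DeliveryStats, later: DeliveryStats) -> list:
    """Compare two samples and list the delivery signals that got worse."""
    warnings = []
    if later.retry_count > earlier.retry_count:
        warnings.append("growing retry counts")
    if later.dead_letter_count > earlier.dead_letter_count:
        warnings.append("rising dead-letter count")
    if later.oldest_queued_age_s > earlier.oldest_queued_age_s:
        warnings.append("queued deliveries aging instead of draining")
    return warnings
```

An empty result between two samples suggests delivery is holding steady or draining. The fourth signal, inconsistent downstream state, has to be checked against the downstream system itself and is not covered here.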
Practical recovery order
- Inspect the operational snapshot.
- Decide whether the issue is app auth, delegated authority, or delivery.
- Fix the underlying cause.
- Replay one delivery if the failure was isolated.
- Replay dead letters in batch only when the failure was systemic.
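The last three steps reduce to one gate: nothing is replayed until the cause is fixed, and only then does the scope of the failure decide between a single and a batch replay. A small sketch, with the function name and return values chosen purely for illustration:

```python
def replay_scope(cause_fixed: bool, systemic: bool) -> str:
    """Decide what to replay, following the recovery order above."""
    if not cause_fixed:
        # Replaying before the fix would just fail again or duplicate work.
        return "none"
    return "batch" if systemic else "single"
```

For example, `replay_scope(cause_fixed=False, systemic=True)` returns `"none"`: even a clearly systemic failure is not replayed until the underlying cause is resolved.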