Retry Storms: How Well-Meaning Retries Turn a Blip Into an Outage

A downstream service hiccupped for two seconds. Every caller retried three times. Those retries put up to four times the normal load on a service that was already struggling, and the blip became a 20-minute outage. Backoff without jitter made it worse, not better.

June 21, 2026 · 9 min read · System Design · Resilience

An internal auth service had a two-second GC pause. That should have been a non-event: a few requests slow, then back to normal. Instead it turned into a 20-minute partial outage that paged four teams. The auth service never recovered on its own. We had to shed load manually to let it breathe. The two seconds of slowness wasn't the outage. The retries were.

How a blip becomes a storm

The auth service had maybe 200 callers. Each caller, sensibly, retried failed requests up to 3 times. So during the two-second pause, requests started failing or timing out, and every one of those failures turned into up to 3 more requests. The instant the service came back, it was hit with not the normal load but roughly 4x the normal load: the fresh requests plus the backlog of everyone's retries, all arriving at once.

That 4x spike pushed it back over the edge. More requests failed. Those failures generated more retries. The retries kept the service pinned above capacity, so it kept failing, so callers kept retrying. The system had found a stable equilibrium, and the equilibrium was "down". This is a retry storm, and the thing that makes it vicious is that it is self-sustaining: the retries cause the failures that cause the retries.
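To see why the loop feeds itself, here is a toy simulation, not a model of the real auth service. Every number in it (`fresh`, `capacity`, the tick length) is made up for illustration, and it rests on one loud assumption: once the service is overloaded, its useful throughput shrinks in proportion to the overload, because it spends capacity on requests that time out anyway.

```javascript
// Toy model of a retry storm. One tick is one retry interval.
// All parameters are illustrative assumptions, not measurements.
function simulate({ fresh = 100, capacity = 120, maxRetries = 3, pauseTicks = 2, ticks = 30 } = {}) {
  // cohorts[k] = requests arriving this tick on their k-th retry (k = 0 is a first attempt)
  let cohorts = new Array(maxRetries + 1).fill(0);
  cohorts[0] = fresh;
  const loadPerTick = [];

  for (let t = 0; t < ticks; t++) {
    const load = cohorts.reduce((sum, n) => sum + n, 0);
    const cap = t < pauseTicks ? 0 : capacity; // the service is paused for the first ticks

    // Assumption: under overload, goodput degrades as cap * (cap / load)
    // instead of staying flat at cap.
    const served = load <= cap ? load : cap * (cap / load);
    const failRate = 1 - served / load;
    loadPerTick.push(load);

    // Failed requests with retries left come back next tick, on top of fresh traffic.
    const next = new Array(maxRetries + 1).fill(0);
    next[0] = fresh;
    for (let k = 0; k < maxRetries; k++) next[k + 1] = cohorts[k] * failRate;
    cohorts = next;
  }
  return loadPerTick;
}

const withRetries = simulate({ maxRetries: 3 });
const noRetries = simulate({ maxRetries: 0 });
console.log("final load with retries:", Math.round(withRetries[withRetries.length - 1]));
console.log("final load without retries:", Math.round(noRetries[noRetries.length - 1]));
```

With retries, the load in this model stays far above the capacity of 120 long after the pause ends. Without retries, it drops straight back to the 100 fresh requests per tick. The exact numbers depend entirely on the assumed overload behavior; the shape is the point.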

Backoff alone doesn't save you

"Just add exponential backoff" is the standard advice, and it is necessary but not sufficient. Backoff spaces out one client's retries: wait 100ms, then 200ms, then 400ms. Good. But if 200 clients all failed at the same instant (and they did, since they all saw the same GC pause), then they all back off by the same amounts and all retry at the same future instants. You haven't spread the load; you've just moved the spike to the right and kept its shape. The herd retries in lockstep.

const BASE = 100;     // ms; first backoff window
const CAP = 10_000;   // ms; assumed ceiling, tune it to your own timeouts

// backoff WITHOUT jitter: everyone retries at t=100, 300, 700... (cumulative ms)
// the spike stays a spike, just delayed
function backoff(attempt) {           // attempt is 0-based
  return Math.min(BASE * 2 ** attempt, CAP);  // 100, 200, 400, ...
}

// backoff WITH full jitter: spread the herd across the whole window
function backoffJittered(attempt) {
  const window = Math.min(BASE * 2 ** attempt, CAP);
  return Math.random() * window;   // uniform in [0, window)
}

Full jitter is the fix that actually matters. Each client picks a random delay inside its backoff window, so 200 clients that failed together retry smeared across the whole window instead of all at one point. AWS published the numbers on this: full jitter dramatically cuts both the contention and the total work compared to plain exponential backoff. The randomness is the load-spreader; the exponential part just grows the window.
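A quick way to see the smearing is to count how many of 200 clients land in the same 10ms slice on their first retry. This sketch repeats the two backoff functions with assumed values for `BASE` and `CAP`; `peakPerBucket` and the 10ms bucket size are hypothetical helpers picked for the illustration.

```javascript
const BASE = 100;    // ms, matching the 100, 200, 400 progression
const CAP = 10_000;  // ms, assumed ceiling

function backoff(attempt) {
  return Math.min(BASE * 2 ** attempt, CAP);
}

function backoffJittered(attempt, rng = Math.random) {
  return rng() * backoff(attempt); // uniform in [0, window)
}

// Largest number of retries that fall into any single bucket of bucketMs.
function peakPerBucket(delays, bucketMs) {
  const counts = new Map();
  for (const delay of delays) {
    const bucket = Math.floor(delay / bucketMs);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return Math.max(...counts.values());
}

const CLIENTS = 200;
const lockstep = Array.from({ length: CLIENTS }, () => backoff(0));
const smeared = Array.from({ length: CLIENTS }, () => backoffJittered(0));

console.log("peak without jitter:", peakPerBucket(lockstep, 10)); // 200: all in one bucket
console.log("peak with full jitter:", peakPerBucket(smeared, 10)); // varies per run, far below 200
```

Without jitter the peak is always the full 200, because every client computes the same delay. With full jitter the 200 retries spread over ten 10ms buckets, so the peak is a small fraction of that, and the gap widens on later attempts as the window doubles.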

Cap the total retries with a budget

Jitter spreads the spike but doesn't bound it. If the downstream is genuinely down, retrying at all is just wasted load that makes recovery harder. The cleaner control is a retry budget: allow retries only as a small fraction of your real request rate, say 10%. Track successes and retries in a rolling window, and once retries exceed 10% of requests, stop retrying and fail fast until the ratio recovers.

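// retry budget: retries may be at most 10% of real traffic
// (retryTokens, callWithBackoffJitter and failFast are defined elsewhere)
async function retryWithinBudget(req) {
  if (retryTokens.tryAcquire()) {     // token bucket refilled by successes
    return await callWithBackoffJitter(req);
  }
  return failFast();                   // budget exhausted, don't pile on
}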

This is what turns a retry storm into a bounded event. When the downstream is healthy, retries are rare and the budget is never touched. When it is failing en masse, the budget caps the extra load at +10% instead of +300%, which is often the difference between a service that recovers and one that stays pinned down. It is the same idea as a circuit breaker, just expressed as a rate instead of an on/off state.
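For a sense of what could sit behind `retryTokens`, here is one minimal sketch of a success-funded token bucket. The class name `RetryBudget`, its options, and the choice to start with an empty bucket are all assumptions made for this example; it approximates the rolling-window ratio with a bucket rather than tracking a window explicitly. It counts in whole tokens (one per success, ten per retry) to avoid floating-point drift from adding 0.1 repeatedly.

```javascript
// Hypothetical retry budget: every success deposits 1 token,
// every retry costs `successesPerRetry` tokens (10 gives a ~10% budget).
class RetryBudget {
  constructor({ successesPerRetry = 10, maxBankedRetries = 10 } = {}) {
    this.cost = successesPerRetry;
    this.maxTokens = successesPerRetry * maxBankedRetries; // bounds the burst of retries
    this.tokens = 0; // assumption: start empty, so a cold client cannot retry yet
  }

  recordSuccess() {
    this.tokens = Math.min(this.tokens + 1, this.maxTokens);
  }

  tryAcquire() {
    if (this.tokens < this.cost) return false; // budget exhausted: fail fast
    this.tokens -= this.cost;
    return true;
  }
}

// Healthy period: 1000 successes fill the bucket up to its cap of 10 banked retries.
const budget = new RetryBudget();
for (let i = 0; i < 1000; i++) budget.recordSuccess();

// Outage: 1000 requests fail and each one would like up to 3 retries.
let retriesSent = 0;
for (let i = 0; i < 1000; i++) {
  for (let r = 0; r < 3; r++) {
    if (budget.tryAcquire()) retriesSent++;
  }
}
console.log(retriesSent); // 10, instead of 3000 unbudgeted retries
```

Because only successes refill the bucket, a dependency that is fully down earns no new retries at all. The cap on banked retries is a design choice: a larger cap tolerates short bursts of failures, a smaller one sheds retry load sooner.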

The retries that multiply through layers

The subtle killer is retry amplification across a call chain. If A calls B calls C, and each caller makes up to 3 attempts, then one user request can become 3 x 3 = 9 requests at C. Add a fourth service to the chain and it's 27. So the rule is: retry at one layer only, usually the one closest to the failure, and let the layers above it either pass the failure through or use a circuit breaker. Retrying at every layer is how a small storm becomes an exponential one.
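The multiplication can be checked with a small counting model. `countBottomCalls` is a hypothetical helper: it takes the number of attempts each caller in the chain makes, top of the chain first, assumes the bottom service fails every time, and counts how many requests reach it.

```javascript
// attemptsByHop[i] = attempts made by the i-th caller, top of the chain first.
// The bottom service is assumed to be down, so every attempt fails.
function countBottomCalls(attemptsByHop) {
  let bottomCalls = 0;

  function call(hop) {
    if (hop === attemptsByHop.length) {
      bottomCalls++;   // reached the failing service
      return false;
    }
    for (let i = 0; i < attemptsByHop[hop]; i++) {
      if (call(hop + 1)) return true; // stop retrying on success
    }
    return false;      // all attempts exhausted, propagate the failure
  }

  call(0);
  return bottomCalls;
}

console.log(countBottomCalls([3, 3]));    // 9:  A -> B -> C, both callers try 3 times
console.log(countBottomCalls([3, 3, 3])); // 27: one more retrying caller in the chain
console.log(countBottomCalls([1, 1, 3])); // 3:  retry only at the caller closest to the failure
```

The last line is the one-layer rule in numbers: the same chain produces 3 requests at the bottom instead of 27 when only the final caller retries.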

Rules of thumb

If a downstream blip causes a longer outage than the blip itself, suspect a retry storm. The retries, not the original fault, are keeping it down.
Exponential backoff without jitter just delays the spike and keeps its shape. Add full jitter so a synchronized herd retries spread across the window, not in lockstep.
Bound total retry load with a retry budget (retries capped at ~10% of real requests). Backoff spaces retries; a budget limits how many exist at all.
Retry at exactly one layer of a call chain. Retrying at every layer multiplies: three retrying hops at 3 attempts each is 27x load at the bottom.
Only retry idempotent operations, and only on errors worth retrying (timeouts, 503s), never on a 400 that will fail identically every time.
Pair retries with a circuit breaker so that when a dependency is clearly down, you stop sending load and give it room to recover.
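
The "what is worth retrying" rule can be written down as a small predicate. This is a sketch under assumptions: `shouldRetry` and its argument shape are hypothetical, the method list follows the HTTP definition of idempotent methods, and the status list contains only the 503 named above. Extend either set to match your own services.

```javascript
// HTTP methods defined as idempotent; POST and PATCH are deliberately absent.
const IDEMPOTENT_METHODS = new Set(["GET", "HEAD", "PUT", "DELETE", "OPTIONS"]);

// Assumption: only 503 is treated as transient here.
const RETRYABLE_STATUSES = new Set([503]);

function shouldRetry({ method, status, timedOut = false }) {
  if (!IDEMPOTENT_METHODS.has(method.toUpperCase())) return false; // repeating it may double-apply
  return timedOut || RETRYABLE_STATUSES.has(status);
}

console.log(shouldRetry({ method: "GET", status: 503 }));    // true
console.log(shouldRetry({ method: "GET", status: 400 }));    // false: fails identically every time
console.log(shouldRetry({ method: "POST", status: 503 }));   // false: not idempotent
console.log(shouldRetry({ method: "put", timedOut: true })); // true
```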

