ML
Redis

Cache Stampede: How One Expired Key Took Down the Database

A single popular cache key expired at peak traffic and three thousand requests all missed at once, all ran the same expensive query, and the database fell over. The fix isn't a longer TTL. It's making sure exactly one request rebuilds the value.

June 21, 20269 min readRedisCaching

We had a homepage feed cached in Redis with a 60-second TTL. Cheap to serve, expensive to build: a fan-out query that took about 900ms and hammered four tables. For months it was invisible. Then one evening at peak the database CPU spiked to 100%, p99 went from 30ms to nine seconds, and the on-call graph showed a perfectly periodic sawtooth, one spike per minute. The cache wasn't broken. It was doing exactly what we told it to. The problem was what happens at the instant a hot key expires.

The thundering herd

When that one popular key expires, every request that wanted it now misses at the same moment. At peak we were serving roughly 3000 req/s for that feed. The key's TTL lapses, and in the next few hundred milliseconds three thousand requests all check Redis, all get a miss, and all decide to rebuild the value themselves. Now three thousand copies of a 900ms query hit the database simultaneously. That is a cache stampede, also called the thundering herd or a dogpile.

The cruel part is that the cache was working. A longer TTL doesn't fix it; it just makes the stampede rarer and bigger. A shorter TTL makes it more frequent. The miss itself is the trigger, and the miss is unavoidable as long as keys expire.

// the naive read-through that stampedes
async function getFeed() {
  const hit = await redis.get("feed:home");
  if (hit) return JSON.parse(hit);
  // 3000 requests arrive here at once on expiry
  const data = await buildFeedFromDB();        // 900ms, 4 tables
  await redis.set("feed:home", JSON.stringify(data), "EX", 60);
  return data;
}

Fix 1: let exactly one request rebuild

The core idea is mutual exclusion on the miss. On a miss, a request tries to grab a short-lived lock. Whoever wins rebuilds and repopulates; everyone else either waits briefly and re-reads, or serves slightly stale data. Redis gives you the lock primitive directly with SET key value NX PX ttl, which sets only if the key is absent (NX) and auto-expires (PX) so a crashed holder can't deadlock the key forever.

async function getFeed() {
  const hit = await redis.get("feed:home");
  if (hit) return JSON.parse(hit);

  // only one winner: NX = set-if-absent, PX = lock expiry
  const got = await redis.set("feed:lock:home", "1", "NX", "PX", 5000);
  if (got) {
    try {
      const data = await buildFeedFromDB();
      await redis.set("feed:home", JSON.stringify(data), "EX", 60);
      return data;
    } finally {
      await redis.del("feed:lock:home");
    }
  }

  // lost the race: wait a beat and read the value the winner wrote
  await sleep(50);
  const second = await redis.get("feed:home");
  if (second) return JSON.parse(second);
  return buildFeedFromDB(); // last resort if winner is slow
}

This alone took the database from 3000 concurrent rebuilds down to one. But it has a weakness: the losers either stall waiting or fall through to a slow path, and the value is fully gone for the ~900ms the winner needs. You still get a latency cliff once a minute, just without the database fire.

Fix 2: early recompute so the value never actually expires

The better pattern keeps the value present and refreshes it before it dies. Store an explicit logical expiry alongside the data and let one request refresh probabilistically as that expiry approaches, while everyone keeps serving the still-valid cached value. This is the probabilistic early expiration from the "XFetch" paper, and it is shorter than it sounds:

// store value + when it was built + how long it took to build
const { value, builtAt, deltaMs } = await readCacheWithMeta("feed:home");
const ttlMs = 60_000;
const now = Date.now();

// as we near expiry, an increasing share of requests volunteer to refresh.
// beta>1 makes refreshes start earlier; deltaMs weights expensive rebuilds.
const shouldRefresh =
  now - builtAt + deltaMs * BETA * Math.log(Math.random()) >= ttlMs;

if (shouldRefresh) {
  // typically just ONE request crosses the threshold first
  triggerBackgroundRebuild();   // refresh without blocking this response
}
return value;                   // everyone still serves the valid value

Because the trigger is randomized and weighted by how expensive the rebuild is, in practice a single request refreshes early and silently while every other request keeps getting an instant hit. No miss, no cliff, no herd. We layered this on top of the lock so the one volunteer that refreshes is also the only one allowed to touch the database. After shipping it the per-minute sawtooth on the database flattened completely.

The failure you forgot: the miss that lasts forever

One more trap. If buildFeedFromDB() throws or returns empty for a key that doesn't exist, and you only cache successes, then every request for that key misses forever and pounds the database with the same failing query. Cache the negative result too, with a short TTL, so a missing user or a transient failure costs you one query a minute instead of all of them. A stampede on an error is still a stampede.

Rules of thumb

  • A periodic spike on your database that lines up with a cache TTL is a stampede, not a slow query. The fix is concurrency control on the miss, not a bigger TTL.
  • Use SET key val NX PX ttl so exactly one request rebuilds a hot key. The PX expiry is mandatory, or a crashed holder deadlocks the key.
  • Locking removes the database fire but leaves a latency cliff. Early probabilistic recompute removes the cliff too by refreshing before the value expires.
  • Weight the early-refresh trigger by rebuild cost: expensive values should start refreshing sooner, cheap ones can wait.
  • Cache negative results with a short TTL. An endlessly-missing key stampedes just as hard as a popular one.
  • Never give a whole class of keys the same exact TTL; jitter it so they don't all expire on the same tick and herd together.
SharePostLinkedIn

Reader Discussion

8 replies// weighed in

TopNewestAuthor
Add to the thread
Disagree, agree harder, or share your own experience…
Email instead →markdown okbe kind
  1. Highlighted by author
    Elena Ricci· Platform Eng · Booking infraFrom experience

    XFetch quietly killed our daily cache stampede. 6h TTL on a product catalog, three-instance API, used to brown-out for 90 seconds every refresh. Shipped XFetch on a Friday afternoon and forgot it existed. That's the highest praise I can give a fix.

    Jun 23, 2026·2 days later
  2. Luca Bianchi· Tech LeadPushback

    fwiw — hash tags bị overuse là footgun thật. Cluster bọn mình từng có 1 slot ăn 41% traffic vì ai đó nghĩ {tenant} làm key prefix là idea hay. Cluster slowlog từ 12ms lên 800ms trong 1 đêm. Cluster rebalance không cứu được vì cùng 1 slot.

    Jun 27, 2026·6 days later·edited
  3. Huyền Lê· Software EngineerAgrees

    viết postmortem tiêu đề 'WAIT did not wait' xong 1 tuần sau gặp đoạn này trong post. cười ra nước mắt. cái phần WAIT không phải consensus primitive cần tô đỏ trong docs official.

    Jun 24, 2026·3 days later
  4. Amir Shah· InfraAsks

    Q: pre-warm hot keys — internal cron inside the app vs external scheduler (k8s cronjob etc)? We've shipped both. Internal is simpler but you fight clock skew across replicas; external is reliable but adds a moving piece.

    Jun 25, 2026·4 days later
    • ML
      Minh LeAuthor

      External, every time. The number of "why is the warmup not running" tickets I've seen with internal crons is not funny anymore. Make it boring infra.

      Jun 26, 2026
    • Carla Pérez· Backend

      we do external + a redis lock so only one instance actually runs the warmup. simple and observable.

      Jun 27, 2026
  5. Mark Vandermeer· Infra EngineerPushback

    RDB + AOF on the same instance is not a 'belt and suspenders' move btw — fsync-on-rewrite collisions can make latency vibrate. Pick one and tune it.

    Jun 29, 2026·1 week later
  6. Rachel Gold· Staff SREAgrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Jun 24, 2026·3 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email