Redis at 3 AM: The Hot Key That Took Down Black Friday

How a single product page took down our Redis cluster on the busiest night of the year — and the one-line fix that came too late.

November 21, 2025 · 12 min read · Redis, Postmortem, Scaling

It was 23:47 on a Friday. Black Friday, technically: the sale was set to start at midnight. We had spent six weeks load-testing for this. We had a war room booked. We had three fridges of Red Bull. We were ready.

Redis was not ready.

The product page that ended it all

One product. A pair of Nike Air Jordans, marked down 70%. The marketing team had teased it on Instagram for a week. At 00:00:00 ICT, the site got hit with what our load balancer logs would later describe as "a wall of human beings."

The Air Jordan SKU lived in our product cache as product:sku:NIKE-AJ1-2025. A single Redis key. Roughly 4 KB, a serialized JSON blob with prices, images, stock counts, related items. We had cached it with a 60-second TTL.

Within 90 seconds, that one key was serving 180,000 GETs per second.

What broke (and what didn't)

Surprisingly, the product page kept working. Redis is fast. 180k QPS on a single key on a hot node is rough but survivable.

What broke was everything else. The shard that owned that key was at 100% CPU. Latency on neighboring keys — sessions, cart contents, checkout tokens — went from sub-millisecond to 80ms. The session middleware in our Node API was synchronously waiting on Redis on every request. So the API thread pool filled up. So the load balancer started rejecting connections. So the homepage went down. So the entire site went down — because of a single product page nobody could buy from anyway, because the cart service couldn't talk to Redis either.

The classic noisy neighbour failure mode, except the noisy neighbour was the front page.

The decisions (00:03 → 00:18)

The CTO called me at 00:03. "Site's down, what do you need." I had a Grafana dashboard open and one immediate signal — the heatmap showing one shard pegged at 100%. I knew the cause within 90 seconds. The harder question was the fix.

Option A: Pull the page. Take the Air Jordan offline, the rest of the site recovers. Costs ~$40k in immediate sales. Marketing will hate me.

Option B: Push a hot-key replication patch. Cache the key locally in each API instance with a 1-second TTL. We had the code lying around from a previous incident. Not deployed. Would take 8-12 minutes to roll out.

Option C: Add Redis read replicas, route the hot key. 30+ minutes. Not happening tonight.

I picked B. We had a deployable hotfix in 4 minutes. CI took another 6. By 00:18, every API instance was holding its own 1-second copy of the Air Jordan blob. Redis QPS on that shard dropped from 180k to about 1,200 (roughly one cache miss per second from each API instance, plus stragglers).
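
To make Option B concrete, here is a minimal sketch of a per-instance cache with a short TTL. It is not the hotfix we shipped: the class name `LocalTtlCache`, the injectable clock, and the `loader` callback (standing in for the Redis GET) are all assumptions for illustration, and it uses only the Java standard library.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical per-instance cache: each key is served locally until its TTL lapses. */
final class LocalTtlCache<K, V> {

    /** A cached value plus the clock reading at which it stops being fresh. */
    private static final class Entry<T> {
        final T value;
        final long expiresAtNanos;

        Entry(T value, long expiresAtNanos) {
            this.value = value;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final LongSupplier nanoClock; // injectable so expiry can be tested without sleeping

    LocalTtlCache(long ttlMillis, LongSupplier nanoClock) {
        this.ttlNanos = ttlMillis * 1_000_000L;
        this.nanoClock = nanoClock;
    }

    LocalTtlCache(long ttlMillis) {
        this(ttlMillis, System::nanoTime);
    }

    /**
     * Returns the local copy while it is fresh. Otherwise runs the loader
     * (assumed here to be the Redis GET) and stores the result. compute()
     * is atomic per key, so concurrent requests for one hot key on this
     * instance trigger a single load. The loader must not touch this cache.
     */
    V get(K key, Function<K, V> loader) {
        final long now = nanoClock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old != null && now - old.expiresAtNanos < 0) {
                return old; // still fresh: no Redis call
            }
            return new Entry<V>(loader.apply(k), now + ttlNanos);
        });
        return entry.value;
    }
}
```

With a 1000 ms TTL, each instance asks Redis for a given hot key about once per second no matter how many requests it serves, which is why the shard's load scales with fleet size instead of with traffic. The sketch never evicts expired entries for keys that stop being requested, so a real version would also need a size bound.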

The site came back at 00:19. We had been down for nineteen minutes, on the busiest night of the year.

The retro: what we got wrong

Three things, in increasing order of "I should have known better."

  1. We had no per-key QPS alerting. Redis ships hot key detection as the redis-cli --hotkeys flag (it needs an LFU maxmemory-policy to report anything). We had never run it in production. We do now.
  2. Our API was synchronously blocked on cache lookups. The whole point of a cache is that a failed cache call should be a soft failure. Ours was a hard failure because we never set a Redis client timeout. Default was "forever." We now run with a 50ms timeout and a stampede-safe fallback to the DB.
  3. We had no local in-process cache at all. A Caffeine cache in front of Redis would have made this incident a non-event. The cost is RAM. The benefit is sleeping through Black Friday.
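
The second item is the one most easily shown in code. What follows is a sketch under stated assumptions, not our production client: `SoftCacheReader`, `cacheGet`, and `dbLoad` are hypothetical names, the timeout is enforced around the call with a plain `Future` instead of through a Redis client setting, and "stampede-safe" is interpreted as "at most one DB load in flight per key per instance."

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical reader: a slow or failed cache call degrades to a DB read. */
final class SoftCacheReader<V> {
    private final ExecutorService pool; // runs the cache call so the caller can abandon it
    private final long timeoutMillis;
    private final ConcurrentHashMap<String, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    SoftCacheReader(ExecutorService pool, long timeoutMillis) {
        this.pool = pool;
        this.timeoutMillis = timeoutMillis;
    }

    V read(String key, Function<String, V> cacheGet, Function<String, V> dbLoad) {
        Callable<V> task = () -> cacheGet.apply(key);
        Future<V> attempt = pool.submit(task);
        try {
            V cached = attempt.get(timeoutMillis, TimeUnit.MILLISECONDS);
            if (cached != null) {
                return cached; // cache hit within the deadline
            }
        } catch (TimeoutException | ExecutionException e) {
            attempt.cancel(true); // soft failure: stop waiting, fall through to the DB
        } catch (InterruptedException e) {
            attempt.cancel(true);
            Thread.currentThread().interrupt();
        }
        return loadOnce(key, dbLoad);
    }

    /** Only one caller per key loads from the DB; the others wait for its result. */
    private V loadOnce(String key, Function<String, V> dbLoad) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> leader = inFlight.putIfAbsent(key, mine);
        if (leader != null) {
            return leader.join(); // another thread is already loading this key
        }
        try {
            V loaded = dbLoad.apply(key);
            mine.complete(loaded);
            return loaded;
        } catch (RuntimeException | Error e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key, mine);
        }
    }
}
```

In practice you would set the timeout on the Redis client itself, so a stuck call also frees its connection. The wrapper above only shows the shape of the behaviour: the request thread waits a bounded 50ms, then moves on.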

The one-line fix

The actual long-term fix in our codebase was twenty-one characters:

@Cacheable("product")
public Product get(String sku) { … }

We added a @Cacheable in front of the Redis call, backed by Caffeine with a 30-second local TTL and a 5,000-entry max. Twenty-one characters. Six weeks of load testing by an entire team didn't catch what twenty-one characters of caching would have prevented.

We've since shipped that pattern as a default in our internal SDK. Every read-heavy endpoint gets a small in-process LRU cache for free. The architecture team called it "cache layering." I called it "the line that lets me sleep on Black Friday."
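
For readers without Caffeine on the classpath, the size-bounded half of that idea can be sketched with the standard library alone. This is an illustration, not what our SDK does: `LruCache.create` is a hypothetical helper, it implements plain least-recently-used eviction (Caffeine's own policy is more sophisticated), and it has no TTL.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: a small, thread-safe, size-bounded LRU map. */
final class LruCache {
    private LruCache() {}

    static <K, V> Map<K, V> create(final int maxEntries) {
        // accessOrder = true makes iteration order least-recently-used first,
        // so the "eldest" entry is always the one to evict.
        LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
        // In access-order mode even get() reorders the map, so every
        // operation needs the lock that synchronizedMap provides.
        return Collections.synchronizedMap(lru);
    }
}
```

A single lock is fine for a sketch but becomes a contention point on a hot endpoint, which is one reason to reach for a purpose-built library for the real thing.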

The deeper lesson

Hot keys are an emergent property, not a known unknown. You can't load-test for the SKU your marketing team will Instagram in three weeks. The defense isn't "identify all hot keys ahead of time." The defense is to make a single hot key not be capable of taking down everything else. That means timeouts, that means local caches, that means circuit breakers, and most of all that means treating Redis like a cache and not a database.
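
Of the defenses listed, the circuit breaker is the one this post has not shown yet. Here is a deliberately small sketch, assuming a consecutive-failure threshold and a fixed cool-down. `SimpleCircuitBreaker` and its parameters are hypothetical, and a production service would more likely use an existing resilience library than hand-roll this.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical breaker: after N consecutive failures, skip the primary call for a while. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long coolDownNanos;
    private final LongSupplier nanoClock; // injectable so the cool-down can be tested
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0L;

    SimpleCircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.coolDownNanos = coolDownMillis * 1_000_000L;
        this.nanoClock = nanoClock;
    }

    /** Runs primary (e.g. a Redis read) unless the breaker is open; otherwise uses fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // open: do not add load to a struggling dependency
        }
        try {
            T result = primary.get(); // called outside the lock so requests are not serialized
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    // Simplification: once the cool-down has elapsed, every caller is let
    // through until one of them fails again (no single-trial half-open state).
    private synchronized boolean allowRequest() {
        return !open || nanoClock.getAsLong() - openedAtNanos >= coolDownNanos;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = nanoClock.getAsLong();
        }
    }
}
```

Combined with a client timeout, this turns "Redis is slow" into "a few requests are slow, then everyone reads from the fallback," instead of every request queueing behind the same pegged shard.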

The day you rely on a single Redis key serving 200k QPS, you have already lost. You just don't know which Friday it'll be.

Reader Discussion

6 replies

  1. Highlighted by author
    Elena Ricci · Platform Eng · Booking infra · From experience

    XFetch quietly killed our daily cache stampede. 6h TTL on a product catalog, three-instance API, used to brown-out for 90 seconds every refresh. Shipped XFetch on a Friday afternoon and forgot it existed. That's the highest praise I can give a fix.

    Nov 23, 2025 · 2 days later
  2. Mark Vandermeer · Infra Engineer · Pushback

    RDB + AOF on the same instance is not a 'belt and suspenders' move btw — fsync-on-rewrite collisions can make latency vibrate. Pick one and tune it.

    Nov 29, 2025 · 1 week later
  3. Yuki Tanaka · Senior Engineer · Agrees

    pipelining is so cheap and so under-used. converted a hot ticker loop from 30k cmd/sec to 30k cmd/sec but in 800 round-trips/sec instead of 30k. p99 dropped 4x. should be the first optimisation people reach for.

    Nov 24, 2025 · 3 days later
  4. Anya Sokolova · Backend · Asks

    any thoughts on Lua vs MULTI/EXEC for the deduct-and-check pattern? been using Lua for 2 years and the script-cache cliff bites us when we redeploy. but MULTI feels chattier

    Nov 27, 2025 · 6 days later
  5. Sơn Nguyễn 🇻🇳 Hà Nội · Senior Backend · Story

    +1 for UNLINK. FLUSHDB SYNC froze prod for 39s and the pager alert rang through the whole house. After that I switched to UNLINK + chunked SCAN and never saw the spike again. Every junior dev on my team is required to read this incident.

    Nov 26, 2025 · 5 days later
  6. Léa Dubois · SRE · Asks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    Nov 27, 2025 · 6 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
