Redis at 3 AM: The Hot Key That Took Down Black Friday
How a single product page took down our Redis cluster on the busiest night of the year — and the one-line fix that came too late.
It was 23:47 on a Friday. Black Friday, technically: the sale would open at midnight. We had spent six weeks load-testing this. We had a war room booked. We had three fridges of Red Bull. We were ready.
Redis was not ready.
The product page that ended it all
One product. A pair of Nike Air Jordans, marked down 70%. The marketing team had teased it on Instagram for a week. At 00:00:00 ICT, the site got hit with what our load balancer logs would later describe as "a wall of human beings."
The Air Jordan SKU lived in our product cache as product:sku:NIKE-AJ1-2025. A single Redis key. Roughly 4 KB, a serialized JSON blob with prices, images, stock counts, related items. We had cached it with a 60-second TTL.
Within 90 seconds, that one key was serving 180,000 GETs per second.
What broke (and what didn't)
Surprisingly, the product page kept working. Redis is fast. 180k QPS on a single key on a hot node is rough but survivable.
What broke was everything else. The shard that owned that key was at 100% CPU. Latency on neighboring keys — sessions, cart contents, checkout tokens — went from sub-millisecond to 80ms. The session middleware in our Node API was synchronously waiting on Redis on every request. So the API thread pool filled up. So the load balancer started rejecting connections. So the homepage went down. So the entire site went down — because of a single product page nobody could buy from anyway, because the cart service couldn't talk to Redis either.
The classic noisy neighbour failure mode, except the noisy neighbour was the front page.
The decisions (00:03 → 00:18)
The CTO called me at 00:03. "Site's down, what do you need." I had a Grafana dashboard open and one immediate signal — the heatmap showing one shard pegged at 100%. I knew the cause within 90 seconds. The harder question was the fix.
Option A: Pull the page. Take the Air Jordan offline, the rest of the site recovers. Costs ~$40k in immediate sales. Marketing will hate me.
Option B: Push a hot-key replication patch. Cache the key locally in each API instance with a 1-second TTL. We had the code lying around from a previous incident. Not deployed. Would take 8-12 minutes to roll out.
Option C: Add Redis read replicas, route the hot key. 30+ minutes. Not happening tonight.
I picked B. We had a deployable hotfix in 4 minutes. CI took another 6. By 00:18, every API instance was holding its own 1-second copy of the Air Jordan blob. Redis QPS on that shard dropped from 180k to about 1,200 (the 1-per-second cache miss across our fleet plus stragglers).
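For readers who want to picture the hotfix, here is a minimal sketch of a per-instance cache with a short TTL. It is not our production patch. The class name ShortTtlCache, the loader function (standing in for the real Redis GET), and the injectable clock are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Illustrative per-process cache: each value is reused until its TTL expires. */
final class ShortTtlCache<V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtNanos;
        Entry(T value, long expiresAtNanos) {
            this.value = value;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<String, V> loader; // assumption: wraps the remote Redis GET
    private final long ttlNanos;
    private final LongSupplier nanoClock;     // injectable so expiry can be tested

    ShortTtlCache(Function<String, V> loader, long ttlMillis, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttlMillis * 1_000_000L;
        this.nanoClock = nanoClock;
    }

    V get(String key) {
        long now = nanoClock.getAsLong();
        // compute() is atomic per key, so concurrent callers in this process
        // cause at most one reload when the entry is missing or expired.
        // The loader runs inside compute(), so it should be quick and must
        // not touch this map.
        Entry<V> entry = entries.compute(key, (k, old) ->
            (old != null && now - old.expiresAtNanos < 0)
                ? old
                : new Entry<V>(loader.apply(k), now + ttlNanos));
        return entry.value;
    }
}
```

Constructed with a 1,000 ms TTL and System::nanoTime as the clock, each instance asks the backing store for a given key roughly once per second, however many requests it serves in between. That is the whole trick: the shard sees one read per instance per second instead of one read per page view.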
The site came back at 00:19. We had been down for nineteen minutes, on the busiest night of the year.
The retro: what we got wrong
Three things, in increasing order of "I should have known better."
- We had no per-key QPS alerting. Hot key detection in Redis is a --hotkeys flag on redis-cli. We hadn't run it in production. We do now.
- Our API was synchronously blocked on cache lookups. The whole point of a cache is that a cache call that fails should be a soft failure. Ours was a hard failure because we never set a Redis client timeout. Default was "forever." We now run with a 50ms timeout and a stampede-safe fallback to the DB.
- We had no concept of an LRU local cache. A Caffeine cache in front of Redis would have made this incident a non-event. The cost is RAM. The benefit is sleeping through Black Friday.
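The second item is the easiest to get subtly wrong, so here is a rough sketch of what "a timeout plus a stampede-safe fallback" can look like. It uses only the JDK. The names SoftFailCacheReader, cacheGet (a stand-in for an async Redis client call), and dbLoad (a stand-in for the database read) are hypothetical, and the real client wiring is left out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Illustrative reader: bounded wait on the cache, single-flight fallback to the DB. */
final class SoftFailCacheReader<V> {
    private final Function<String, CompletableFuture<V>> cacheGet; // assumption: async Redis GET
    private final Function<String, V> dbLoad;                      // assumption: source-of-truth read
    private final long timeoutMillis;
    private final ConcurrentHashMap<String, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    SoftFailCacheReader(Function<String, CompletableFuture<V>> cacheGet,
                        Function<String, V> dbLoad,
                        long timeoutMillis) {
        this.cacheGet = cacheGet;
        this.dbLoad = dbLoad;
        this.timeoutMillis = timeoutMillis;
    }

    V get(String key) {
        try {
            // Never wait on the cache longer than the budget.
            V cached = cacheGet.apply(key).get(timeoutMillis, TimeUnit.MILLISECONDS);
            if (cached != null) {
                return cached;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException | TimeoutException e) {
            // Soft failure: a slow or broken cache is treated like a miss.
        }
        return loadOnce(key);
    }

    /** Single-flight: concurrent callers for one key share a single DB read. */
    private V loadOnce(String key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> leader = inFlight.putIfAbsent(key, mine);
        if (leader != null) {
            return leader.join(); // follower: reuse the leader's result
        }
        try {
            V value = dbLoad.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key, mine);
        }
    }
}
```

With timeoutMillis set to 50, a hung cache costs each request at most about 50 ms before it falls through, and a burst of requests for the same key turns into one database read per instance rather than one per request. A fuller version would also write the loaded value back to the cache; that step is omitted here.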
The one-line fix
The actual long-term fix in our codebase was one annotation:
@Cacheable("product")
public Product get(String sku) { … }
We added a @Cacheable in front of the Redis call, backed by Caffeine with a 30-second local TTL and a 5,000-entry max. One line. Six weeks of load testing for an entire team didn't catch what one line of caching prevented.
We've since shipped that pattern as a default in our internal SDK. Every read-heavy endpoint gets a small in-process LRU cache for free. The architecture team called it "cache layering." I called it "the line that lets me sleep on Black Friday."
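If "small in-process LRU cache" sounds abstract, the size bound is the part worth seeing. The sketch below is a JDK-only stand-in, not what our SDK ships: Caffeine does this job in our stack and uses a smarter eviction policy than plain LRU. BoundedLruCache is a made-up name, and the TTL side is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative size-bounded LRU cache built on LinkedHashMap. */
final class BoundedLruCache<K, V> {
    private final Map<K, V> map;

    BoundedLruCache(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insert: drop the least recently used
                // entry once the cache grows past its bound.
                return size() > maxEntries;
            }
        };
    }

    // get() reorders entries in an access-ordered map, so reads need the lock too.
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int entryCount() {
        return map.size();
    }
}
```

The bound is what keeps the RAM cost predictable: with a 5,000-entry max and blobs of roughly 4 KB, the payload tops out at around 20 MB per instance, before object overhead.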
The deeper lesson
Hot keys are an emergent property, not a known unknown. You can't load-test for the SKU your marketing team will Instagram in three weeks. The defense isn't "identify all hot keys ahead of time." The defense is to make a single hot key not be capable of taking down everything else. That means timeouts, that means local caches, that means circuit breakers, and most of all that means treating Redis like a cache and not a database.
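Of the defenses in that list, the circuit breaker is the one people most often wave at without building. A bare-bones sketch, assuming a consecutive-failure threshold and a fixed cool-down, might look like the following. SimpleCircuitBreaker and its parameters are illustrative names, not a library API. In practice you would likely reach for an existing resilience library instead of hand-rolling this.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative circuit breaker: opens after N consecutive failures, retries after a cool-down. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clockMillis; // injectable so the cool-down can be tested

    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast: leave the struggling dependency alone
        }
        try {
            T result = primary.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    // Closed, or open but past the cool-down (trial calls are allowed through).
    private synchronized boolean allowRequest() {
        return !open || clockMillis.getAsLong() - openedAtMillis >= openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtMillis = clockMillis.getAsLong(); // (re)start the cool-down
        }
    }
}
```

Wrapped around the session lookup, something like this would have let the API stop queueing behind a saturated shard after a handful of timeouts and serve the fallback path until Redis recovered.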
The day you rely on a single Redis key serving 200k QPS, you have already lost. You just don't know which Friday it'll be.