The Night Kafka Ate Itself: A 6-Hour Outage Postmortem
A real outage, hour by hour. Three brokers, one bad config, six hours of pain, and the seven-line fix that should have been there from day one.
This is a real incident. Names are changed; the timeline isn't.
02:14 ICT. The pager goes off. Not the polite "warning" pager — the one with the siren ringtone you set after the third 3am wake-up of the year, the one that tells you something is genuinely on fire.
I'm staring at the ceiling. The Slack message reads: orders-consumer lag > 1.2M, growing 40k/s. A million-two doesn't sound apocalyptic. The 40k per second part does.
1. The setup (or: how we got here)
We ran a 3-broker Kafka cluster on bare metal. ~140 topics. The big one — orders — had 24 partitions, RF=3, and was the heart of basically every workflow. min.insync.replicas=2, acks=all, the canonical durability story.
The producer was a Spring Boot app pushing 6-8k messages/second steady. Consumers: a fleet of 12 pods running a transformation pipeline. We'd been running this stack for three years without incident. Three years. That's the prologue of every postmortem.
2. The blast radius (02:14 → 02:31)
Logging in, three signals were lit up like Christmas:
- Broker 2 was throwing
NotEnoughReplicasExceptionon every partition it led. - The Java client metric
last-poll-seconds-agowas climbing past 20 across most consumers. - Disk on broker 2 was at 96% full.
That last one was the smoking gun. Broker 2 had run out of disk. Closed segments couldn't be deleted because the cleanup thread couldn't take a write lock. So the broker started rejecting appends. min.insync.replicas=2 meant the leader couldn't ack writes from any partition that had broker 2 in its ISR. Effectively half the cluster went read-only.
# [broker-2] log dir: /var/kafka/data is 96% full
# [broker-1] failed to expand ISR for orders-12: NotEnoughReplicasException
# [producer] retries exhausted, dropped 1,847 records to dead letter
3. The first wrong move (02:31 → 03:05)
I did what every panicked SRE does: I tried to free disk by shortening retention.
kafka-configs.sh --alter --entity-type topics --entity-name orders \
--add-config retention.ms=21600000 # 6 hours
This did not help. Why? Because log.cleaner.dedupe.buffer.size on this cluster was tuned for compaction, but orders was a retention-by-time topic, and the deletion thread couldn't roll a new segment to delete the old one. Retention is a segment-level operation. Until a segment closes, it cannot be deleted. We had a 1 GiB open segment and no way to close it without producing more data — which we couldn't, because the broker was rejecting writes.
I had built a deadlock with my own hands. Beautiful.
4. The escalation (03:05 → 04:20)
By now the on-call manager was awake. Three of us on a call. The clock was a brick on my chest. Every minute meant another 360k messages backed up on the producer side and another postmortem paragraph for me to write.
Options on the table:
- Add disk to broker 2. Bare metal. Datacenter ticket. ETA: hours.
- Reduce ISR to 2 by removing broker 2. Risky, could trigger unclean leader election.
- Forcibly delete old segment files on disk. Don't do this. We did not do this. (We almost did this.)
- Drain log dir of compacted topics we didn't care about, freeing emergency space.
We went with option 4. __consumer_offsets alone was 18 GiB on broker 2 because we had a setting where retention was effectively infinite. Two minutes of kafka-delete-records.sh against a few internal topics, and we had 9 GiB of breathing room.
5. The actual fix (04:20 → 06:14)
With 9 GiB free, broker 2 started cleaning up. ISR healed. Producers stopped retrying. Consumers chewed through the backlog. We sat there watching the lag chart fall. It's a strangely meditative experience — watching a million-message queue drain at 30k/sec, knowing every notch downward is one fewer angry message in your inbox tomorrow.
By 06:14 the lag was zero. Birds were starting to chirp outside. I made the strongest coffee of my life.
6. The seven-line fix that should have been there
The actual root cause was so dumb it hurts. Disk monitoring on the cluster was set up — but the alert threshold was 95%. By the time we got paged, we had twenty-eight minutes of buffer before an outage was guaranteed. Not enough.
- alert: KafkaBrokerDiskHigh
expr: node_filesystem_avail_bytes{mountpoint="/var/kafka/data"}
/ node_filesystem_size_bytes < 0.20
for: 5m
severity: warning
- alert: KafkaBrokerDiskCritical
expr: ... < 0.10
for: 1m
severity: page
Two thresholds. 20% warns the team. 10% pages someone. We'd have had two-plus hours of warning, in business hours, with engineers awake. Instead I got twenty-eight minutes at 2am.
7. What I changed afterward
- Disk thresholds split into warn / page, with the warn going to a Slack channel that humans actually read.
- retention.bytes set on every topic, not just
retention.ms. Retention by size is a hard ceiling; retention by time is a hope. - A runbook for "broker out of disk" with the exact
kafka-delete-records.shcommands, target topics, and freeable bytes per topic — pre-computed. - An emergency 50 GiB sparse file on each broker that the runbook can
rmto instantly buy space. Ugly. Effective.
What I learned, the unvarnished version
Kafka does not protect you from running out of disk. It cannot. Once the cleanup thread can't take a lock, the system enters a doom loop where the only way out is more disk or fewer files. Both of those decisions are operational, not architectural. No amount of min.insync.replicas magic helps when the underlying block device is at 100%.
The other thing: I had read about this exact failure mode in a Confluent blog post probably two years before this incident. I had filed it under "won't happen to us." The single highest-leverage thing I do now, when I read about a postmortem, is ask my team: could this happen to us, and what's the runbook?
Most of the time the answer is "yes" and "nothing." Both are fixable.