The Night Kafka Ate Itself: A 6-Hour Outage Postmortem

A real outage, hour by hour. Three brokers, one bad config, six hours of pain, and the seven-line fix that should have been there from day one.

November 04, 202514 min readKafkaPostmortemReliability

This is a real incident. Names are changed; the timeline isn't.

02:14 ICT. The pager goes off. Not the polite "warning" pager — the one with the siren ringtone you set after the third 3am wake-up of the year, the one that tells you something is genuinely on fire.

I'm staring at the ceiling. The Slack message reads: orders-consumer lag > 1.2M, growing 40k/s. A million-two doesn't sound apocalyptic. The 40k per second part does.

1. The setup (or: how we got here)

We ran a 3-broker Kafka cluster on bare metal. ~140 topics. The big one — orders — had 24 partitions, RF=3, and was the heart of basically every workflow. min.insync.replicas=2, acks=all, the canonical durability story.

The producer was a Spring Boot app pushing 6-8k messages/second steady. Consumers: a fleet of 12 pods running a transformation pipeline. We'd been running this stack for three years without incident. Three years. That's the prologue of every postmortem.

2. The blast radius (02:14 → 02:31)

Logging in, three signals were lit up like Christmas:

Broker 2 was throwing NotEnoughReplicasException on every partition it led.
The Java client metric last-poll-seconds-ago was climbing past 20 across most consumers.
Disk on broker 2 was at 96% full.

That last one was the smoking gun. Broker 2 had run out of disk. Closed segments couldn't be deleted because the cleanup thread couldn't take a write lock. So the broker started rejecting appends. min.insync.replicas=2 meant the leader couldn't ack writes from any partition that had broker 2 in its ISR. Effectively half the cluster went read-only.

# [broker-2] log dir: /var/kafka/data is 96% full
# [broker-1] failed to expand ISR for orders-12: NotEnoughReplicasException
# [producer] retries exhausted, dropped 1,847 records to dead letter

3. The first wrong move (02:31 → 03:05)

I did what every panicked SRE does: I tried to free disk by shortening retention.

kafka-configs.sh --alter --entity-type topics --entity-name orders \
  --add-config retention.ms=21600000  # 6 hours

This did not help. Why? Because log.cleaner.dedupe.buffer.size on this cluster was tuned for compaction, but orders was a retention-by-time topic, and the deletion thread couldn't roll a new segment to delete the old one. Retention is a segment-level operation. Until a segment closes, it cannot be deleted. We had a 1 GiB open segment and no way to close it without producing more data — which we couldn't, because the broker was rejecting writes.

I had built a deadlock with my own hands. Beautiful.

4. The escalation (03:05 → 04:20)

By now the on-call manager was awake. Three of us on a call. The clock was a brick on my chest. Every minute meant another 360k messages backed up on the producer side and another postmortem paragraph for me to write.

Options on the table:

Add disk to broker 2. Bare metal. Datacenter ticket. ETA: hours.
Reduce ISR to 2 by removing broker 2. Risky, could trigger unclean leader election.
Forcibly delete old segment files on disk. Don't do this. We did not do this. (We almost did this.)
Drain log dir of compacted topics we didn't care about, freeing emergency space.

We went with option 4. __consumer_offsets alone was 18 GiB on broker 2 because we had a setting where retention was effectively infinite. Two minutes of kafka-delete-records.sh against a few internal topics, and we had 9 GiB of breathing room.

5. The actual fix (04:20 → 06:14)

With 9 GiB free, broker 2 started cleaning up. ISR healed. Producers stopped retrying. Consumers chewed through the backlog. We sat there watching the lag chart fall. It's a strangely meditative experience — watching a million-message queue drain at 30k/sec, knowing every notch downward is one fewer angry message in your inbox tomorrow.

By 06:14 the lag was zero. Birds were starting to chirp outside. I made the strongest coffee of my life.

6. The seven-line fix that should have been there

The actual root cause was so dumb it hurts. Disk monitoring on the cluster was set up — but the alert threshold was 95%. By the time we got paged, we had twenty-eight minutes of buffer before an outage was guaranteed. Not enough.

- alert: KafkaBrokerDiskHigh
  expr: node_filesystem_avail_bytes{mountpoint="/var/kafka/data"}
        / node_filesystem_size_bytes < 0.20
  for: 5m
  severity: warning

- alert: KafkaBrokerDiskCritical
  expr: ... < 0.10
  for: 1m
  severity: page

Two thresholds. 20% warns the team. 10% pages someone. We'd have had two-plus hours of warning, in business hours, with engineers awake. Instead I got twenty-eight minutes at 2am.

7. What I changed afterward

Disk thresholds split into warn / page, with the warn going to a Slack channel that humans actually read.
retention.bytes set on every topic, not just retention.ms. Retention by size is a hard ceiling; retention by time is a hope.
A runbook for "broker out of disk" with the exact kafka-delete-records.sh commands, target topics, and freeable bytes per topic — pre-computed.
An emergency 50 GiB sparse file on each broker that the runbook can rm to instantly buy space. Ugly. Effective.

What I learned, the unvarnished version

Kafka does not protect you from running out of disk. It cannot. Once the cleanup thread can't take a lock, the system enters a doom loop where the only way out is more disk or fewer files. Both of those decisions are operational, not architectural. No amount of min.insync.replicas magic helps when the underlying block device is at 100%.

The other thing: I had read about this exact failure mode in a Confluent blog post probably two years before this incident. I had filed it under "won't happen to us." The single highest-leverage thing I do now, when I read about a postmortem, is ask my team: could this happen to us, and what's the runbook?

Most of the time the answer is "yes" and "nothing." Both are fixable.

The Night Kafka Ate Itself: A 6-Hour Outage Postmortem

1. The setup (or: how we got here)

2. The blast radius (02:14 → 02:31)

3. The first wrong move (02:31 → 03:05)

4. The escalation (03:05 → 04:20)

5. The actual fix (04:20 → 06:14)

6. The seven-line fix that should have been there

7. What I changed afterward

What I learned, the unvarnished version

8 replies// weighed in

More from this topic

Kafka Internals: Log Segments, Offsets & The Commit Protocol

Exactly-Once Semantics in Kafka: Idempotence & Transactions

Consumer Group Rebalancing: Eager, Cooperative, and Static