The Night Kafka Ate Itself: A 6-Hour Outage Postmortem

A real outage, hour by hour. Three brokers, one bad config, six hours of pain, and the seven-line fix that should have been there from day one.

November 04, 2025 · 14 min read · Kafka · Postmortem · Reliability

This is a real incident. Names are changed; the timeline isn't.

02:14 ICT. The pager goes off. Not the polite "warning" pager — the one with the siren ringtone you set after the third 3am wake-up of the year, the one that tells you something is genuinely on fire.

I'm staring at the ceiling. The Slack message reads: orders-consumer lag > 1.2M, growing 40k/s. A million-two doesn't sound apocalyptic. The 40k per second part does.

1. The setup (or: how we got here)

We ran a 3-broker Kafka cluster on bare metal. ~140 topics. The big one — orders — had 24 partitions, RF=3, and was the heart of basically every workflow. min.insync.replicas=2, acks=all, the canonical durability story.

The producer was a Spring Boot app pushing 6-8k messages/second steady. Consumers: a fleet of 12 pods running a transformation pipeline. We'd been running this stack for three years without incident. Three years. That's the prologue of every postmortem.

2. The blast radius (02:14 → 02:31)

When I logged in, three signals were lit up like Christmas:

  • Broker 2 was throwing NotEnoughReplicasException on every partition it led.
  • The Java client metric last-poll-seconds-ago was climbing past 20 across most consumers.
  • Disk on broker 2 was at 96% full.

That last one was the smoking gun. Broker 2 had run out of disk. Closed segments couldn't be deleted because the cleanup thread couldn't take a write lock, so the broker started rejecting appends. With min.insync.replicas=2, the leader couldn't ack writes to any partition that had broker 2 in its ISR. Effectively half the cluster went read-only.

# [broker-2] log dir: /var/kafka/data is 96% full
# [broker-1] failed to expand ISR for orders-12: NotEnoughReplicasException
# [producer] retries exhausted, dropped 1,847 records to dead letter
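The relationship between ISR size and min.insync.replicas can be sketched as a small model. This is a simplification for illustration, not the broker's actual code; the `PartitionState` class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Simplified view of one partition as seen by its leader."""
    replicas: set[int]  # broker ids assigned to the partition
    isr: set[int]       # replicas currently in sync with the leader


def can_accept_acks_all(p: PartitionState, min_insync_replicas: int) -> bool:
    """An acks=all write is accepted only while the ISR has at least
    min.insync.replicas members; otherwise the producer sees
    NotEnoughReplicasException."""
    return len(p.isr) >= min_insync_replicas


healthy = PartitionState(replicas={1, 2, 3}, isr={1, 2, 3})
degraded = PartitionState(replicas={1, 2, 3}, isr={1, 3})
shrunk = PartitionState(replicas={1, 2, 3}, isr={1})

for name, p in [("healthy", healthy), ("degraded", degraded), ("shrunk", shrunk)]:
    verdict = "accept" if can_accept_acks_all(p, 2) else "NotEnoughReplicas"
    print(f"{name}: ISR={sorted(p.isr)} -> {verdict}")
```

The model only captures the size check. It says nothing about how a broker with a full disk drops out of, or stays in, the ISR.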

3. The first wrong move (02:31 → 03:05)

I did what every panicked SRE does: I tried to free disk by shortening retention.

kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --alter \
  --entity-type topics --entity-name orders \
  --add-config retention.ms=21600000  # 6 hours; $BOOTSTRAP = your broker host:port

This did not help. Why? The log cleaner on this cluster (log.cleaner.dedupe.buffer.size) was tuned for compaction, but that tuning is irrelevant to orders, which was a retention-by-time topic. The real problem was that the deletion thread couldn't roll a new segment to delete the old one. Retention is a segment-level operation. Until a segment closes, it cannot be deleted. We had a 1 GiB open segment and no way to close it without producing more data, which we couldn't do because the broker was rejecting writes.
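The segment-level rule can be sketched as a toy model. It assumes, as described above, that only closed segments are ever eligible for deletion; the `Segment` class and its field names are made up for illustration and are not Kafka's internals.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    last_modified_ms: int  # simplified stand-in for the newest record time
    active: bool           # True for the segment still being written


def deletable_segments(segments, now_ms, retention_ms):
    """Return closed segments older than retention_ms.

    Simplification: the active segment is never eligible, however old.
    """
    return [
        s for s in segments
        if not s.active and now_ms - s.last_modified_ms > retention_ms
    ]


HOUR_MS = 3_600_000
now = 100 * HOUR_MS
log = [
    Segment(0, now - 30 * HOUR_MS, active=False),
    Segment(5000, now - 8 * HOUR_MS, active=False),
    Segment(9000, now - 7 * HOUR_MS, active=True),  # old, but still open
]

eligible = deletable_segments(log, now, retention_ms=6 * HOUR_MS)
print([s.base_offset for s in eligible])  # prints [0, 5000]
```

Shortening retention.ms widens the set of closed segments that qualify, but it does nothing for a segment that is still open.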

I had built a deadlock with my own hands. Beautiful.

4. The escalation (03:05 → 04:20)

By now the on-call manager was awake. Three of us on a call. The clock was a brick on my chest. Every minute meant another 360k messages backed up on the producer side and another postmortem paragraph for me to write.

Options on the table:

  1. Add disk to broker 2. Bare metal. Datacenter ticket. ETA: hours.
  2. Reduce ISR to 2 by removing broker 2. Risky, could trigger unclean leader election.
  3. Forcibly delete old segment files on disk. Don't do this. We did not do this. (We almost did this.)
  4. Drain log dir of compacted topics we didn't care about, freeing emergency space.

We went with option 4. __consumer_offsets alone was 18 GiB on broker 2 because its retention was configured to be effectively infinite. Two minutes of kafka-delete-records.sh against a few internal topics, and we had 9 GiB of breathing room.
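For reference, kafka-delete-records.sh reads its targets from a JSON file passed with --offset-json-file. A minimal sketch of generating that file follows; the topic name and offsets are hypothetical, not the ones used that night.

```python
import json


def build_delete_records_spec(topic, offsets_by_partition):
    """Build the JSON document for --offset-json-file.

    Records with offsets below each given offset are deleted.
    """
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": p, "offset": o}
            for p, o in sorted(offsets_by_partition.items())
        ],
    }


# Hypothetical topic and offsets, for illustration only.
spec = build_delete_records_spec("scratch-topic", {0: 120000, 1: 98000})
print(json.dumps(spec, indent=2))
```

Saved as spec.json, this would be passed along the lines of `kafka-delete-records.sh --bootstrap-server <host:port> --offset-json-file spec.json`. Pre-computing these files per topic is what makes them usable from a runbook at 3am.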

5. The actual fix (04:20 → 06:14)

With 9 GiB free, broker 2 started cleaning up. ISR healed. Producers stopped retrying. Consumers chewed through the backlog. We sat there watching the lag chart fall. It's a strangely meditative experience — watching a million-message queue drain at 30k/sec, knowing every notch downward is one fewer angry message in your inbox tomorrow.

By 06:14 the lag was zero. Birds were starting to chirp outside. I made the strongest coffee of my life.

6. The seven-line fix that should have been there

The actual root cause was so dumb it hurts. Disk monitoring on the cluster was set up — but the alert threshold was 95%. By the time we got paged, we had twenty-eight minutes of buffer before an outage was guaranteed. Not enough.

- alert: KafkaBrokerDiskHigh
  expr: node_filesystem_avail_bytes{mountpoint="/var/kafka/data"}
        / node_filesystem_size_bytes < 0.20
  for: 5m
  labels:
    severity: warning

- alert: KafkaBrokerDiskCritical
  expr: ... < 0.10
  for: 1m
  labels:
    severity: page

Two thresholds. 20% warns the team. 10% pages someone. We'd have had two-plus hours of warning, in business hours, with engineers awake. Instead I got twenty-eight minutes at 2am.
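The same two-tier logic, sketched as a plain function in case you want to reuse it in a script or a health check. The thresholds come from the rules above; the function name is hypothetical, and the `for:` hold-down durations are deliberately not modelled.

```python
WARN_FREE_FRACTION = 0.20  # matches KafkaBrokerDiskHigh
PAGE_FREE_FRACTION = 0.10  # matches KafkaBrokerDiskCritical


def disk_severity(avail_bytes: int, size_bytes: int) -> str:
    """Map free space on a volume to 'ok', 'warning', or 'page'."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    free = avail_bytes / size_bytes
    if free < PAGE_FREE_FRACTION:
        return "page"
    if free < WARN_FREE_FRACTION:
        return "warning"
    return "ok"


# A volume like broker 2's at 96% full leaves 4% free.
print(disk_severity(avail_bytes=4, size_bytes=100))   # prints page
print(disk_severity(avail_bytes=15, size_bytes=100))  # prints warning
print(disk_severity(avail_bytes=50, size_bytes=100))  # prints ok
```

On a live host, the two inputs could come from `shutil.disk_usage(path)`, using its `free` and `total` fields.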

7. What I changed afterward

  • Disk thresholds split into warn / page, with the warn going to a Slack channel that humans actually read.
  • retention.bytes set on every topic, not just retention.ms. Retention by size is a hard ceiling; retention by time is a hope.
  • A runbook for "broker out of disk" with the exact kafka-delete-records.sh commands, target topics, and freeable bytes per topic — pre-computed.
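  • An emergency 50 GiB ballast file on each broker that the runbook can rm to instantly buy space. It has to be fully allocated rather than sparse, because a sparse file reserves no blocks and deleting it frees nothing. Ugly. Effective.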
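A minimal sketch of the emergency-file idea, assuming a POSIX-like filesystem without transparent compression. It writes real zero bytes so the blocks are actually allocated; a file created with truncate or seek would be sparse and reserve nothing. The function names and the demo path are hypothetical.

```python
import os
import tempfile


def create_ballast(path: str, size_bytes: int, chunk_bytes: int = 1 << 20) -> None:
    """Write size_bytes of zeros to path so the space is really consumed."""
    chunk = b"\0" * chunk_bytes
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data has reached the disk


def release_ballast(path: str) -> None:
    """Delete the ballast file, handing its space back to the filesystem."""
    os.remove(path)


# Demo with a tiny 4 MiB file in a temp directory instead of 50 GiB.
demo_dir = tempfile.mkdtemp()
ballast = os.path.join(demo_dir, "ballast.bin")
create_ballast(ballast, 4 * 1024 * 1024)
print(os.path.getsize(ballast))  # prints 4194304
release_ballast(ballast)
```

On filesystems that compress or deduplicate data, a file of zeros may occupy far less space than its size suggests, so check actual block usage (for example with `du`) before relying on it.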

What I learned, the unvarnished version

Kafka does not protect you from running out of disk. It cannot. Once the cleanup thread can't take a lock, the system enters a doom loop where the only way out is more disk or fewer files. Both of those decisions are operational, not architectural. No amount of min.insync.replicas magic helps when the underlying block device is at 100%.

The other thing: I had read about this exact failure mode in a Confluent blog post probably two years before this incident. I had filed it under "won't happen to us." The single highest-leverage thing I do now, when I read about a postmortem, is ask my team: could this happen to us, and what's the runbook?

Most of the time the answer is "yes" and "nothing." Both are fixable.


Reader Discussion

8 replies

  1. Highlighted by author
    Tuấn Phạm 🇻🇳 HCMC · Staff Engineer · Tiki Data Platform · Story

    min.insync.replicas=2 with acks=all is the standard combo. We used to leave min.insync.replicas at its default of 1, and a single broker rolling restart lost us 4 messages. As luck would have it, they were exactly the payment-confirm events. I sat writing the postmortem from 2am to 7am.

    Nov 05, 2025 · 1 day later
    • Minh Le · Author

      By the moment you realize the default is 1, it's already too late. Thanks for sharing, bro. I'll add a warning box to the post.

      Nov 05, 2025
  2. Derek Okonkwo · Principal Engineer · Fintech · Agrees

    The Kafka→anything-else caveat is the single most important paragraph in this post. I cannot count how many "exactly once" pipelines I've reviewed that write to Postgres without an outbox and the team still calls it EOS. It's at-least-once with extra steps and a worse mental model.

    Nov 05, 2025 · 1 day later
  3. Sven Bergström · Senior SWE · Story

    fresh transactional.id on each pod restart is THE silent killer. We had it for ~14 months. Realised when a zombie producer wrote 80k duplicates after a node went catatonic and came back.

    Nov 09, 2025 · 5 days later
  4. Quốc Anh · Backend Lead · Finhay · Agrees

    The segment files section explains it really concisely. We often forget that log.segment.bytes and log.retention.bytes are two different things. Retention wasn't kicking in because the segment hadn't rolled yet, which is exactly the rule of thumb at the end of the post.

    Nov 06, 2025 · 2 days later
  5. Amelia Brooks · Distributed Systems · Pushback

    tiny nit but acks=0 is not literally fire-and-forget at the protocol layer — the producer still writes to its socket buffer. The 'forget' is the broker side. Pedantic but bites people in metrics dashboards.

    Nov 13, 2025 · 1 week later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Nov 07, 2025 · 3 days later
  7. Omar Khalil · Senior SWE · Kind words

    this is the third article from this blog I've sent to my team this month. you're cooking. don't switch to crypto.

    Nov 09, 2025 · 5 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email