The Night Kafka Ate Itself: A 6-Hour Outage Postmortem
A real outage, hour by hour. Three brokers, one bad config, six hours of pain, and the seven-line fix that should have been there from day one.
Production-grade notes on Apache Kafka — the log format on disk, the durability contract, the partition strategy that decides whether your cluster scales or melts, and the operational gotchas that only show up under real load.
6 articles · updated regularly
A real outage, hour by hour. Three brokers, one bad config, six hours of pain, and the seven-line fix that should have been there from day one.
Three legitimate ways to process a Kafka topic — which one fits your shape of problem.
Ordering guarantees in Kafka are per-partition. Getting the key right is the whole game.
Why your consumer hangs for 30s every deploy — and how cooperative rebalance + static membership fixes it.
How producer IDs, sequence numbers, and the transaction coordinator combine to give you exactly-once — and when they don't.
A tour of how Kafka persists records — segment files, index files, high watermark, and the acks=all contract.