Kafka Internals: Log Segments, Offsets & The Commit Protocol
A tour of how Kafka persists records — segment files, index files, high watermark, and the acks=all contract.
Kafka is often described as a distributed commit log, but the phrase hides a lot of detail. Under the hood, every partition is a directory of append-only segment files, each paired with two index files. Understanding that layout is the fastest route to reasoning about throughput, retention, and failure modes.
1. Segments on disk
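Each partition log is split into segments capped at a maximum size (log.segment.bytes, default 1 GiB). Only the newest segment, the active one, accepts appends. It is closed and a new segment is rolled once it hits the size limit or once the time limit log.roll.ms elapses (when log.roll.ms is unset, log.roll.hours applies, default 168). Closed segments are immutable and every write is an append to the active segment, which is what lets Kafka push sequential I/O so hard.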
/var/kafka/data/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000004823104.log
├── 00000000000004823104.index
└── 00000000000004823104.timeindex
The filename is the segment's base offset (the offset of its first record), zero-padded to 20 digits. The .index maps offset → physical byte position; the .timeindex maps timestamp → offset. Both are sparse (one entry every log.index.interval.bytes, default 4 KiB), so to serve a fetch at a specific offset the broker binary-searches the index for the nearest entry at or below that offset, then does a short linear scan of the log from there.
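The lookup path can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Kafka's on-disk format: the index is a sorted list of (offset, byte position) pairs, the log is a list with one (offset, byte position) pair per record, and the function names segment_filename, find_segment, and locate are hypothetical.

```python
from bisect import bisect_right


def segment_filename(base_offset: int) -> str:
    """Segment files are named after their base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.log"


def find_segment(base_offsets: list[int], target: int) -> int:
    """Return the base offset of the segment that would hold `target`.

    `base_offsets` must be sorted ascending.
    """
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise ValueError(f"offset {target} is below the first segment")
    return base_offsets[i]


def locate(index: list[tuple[int, int]], log: list[tuple[int, int]], target: int):
    """Find the byte position of the first record with offset >= target.

    index: sparse, sorted (offset, position) entries
    log:   sorted (offset, position) pair for every record (toy stand-in for the .log file)
    """
    # Binary search: last index entry whose offset is <= target.
    i = bisect_right(index, (target, float("inf"))) - 1
    start_pos = index[i][1] if i >= 0 else 0
    # Short linear scan forward from the indexed position.
    for offset, pos in log:
        if pos >= start_pos and offset >= target:
            return pos
    return None  # target is past the end of this segment
```

With records of 100 bytes each and an index entry every ten records, locate jumps to the entry for offset 10 and scans seven records forward to reach offset 17, rather than scanning from the start of the file.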
2. LEO, HW, and the ISR
Every replica tracks a Log End Offset (LEO). The leader tracks the minimum LEO across the in-sync replicas (ISR) and publishes that as the High Watermark (HW). Consumers can only read up to HW — this is how Kafka guarantees that data returned to a consumer will survive a leader failure.
Leader LEO=120 HW=118
Follower A LEO=120
Follower B LEO=118 ← gates the HW
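As a minimal sketch, the HW calculation for the example above is just a minimum over the ISR. The replica names and the high_watermark function are illustrative only; a real broker tracks this state internally.

```python
def high_watermark(leo_by_replica: dict[str, int], isr: set[str]) -> int:
    """HW is the minimum LEO across the in-sync replicas (leader included)."""
    return min(leo_by_replica[replica] for replica in isr)


leos = {"leader": 120, "follower-a": 120, "follower-b": 118}
isr = {"leader", "follower-a", "follower-b"}

print(high_watermark(leos, isr))                   # 118: follower-b gates the HW
print(high_watermark(leos, isr - {"follower-b"}))  # 120: once it leaves the ISR, the HW can advance
```

The second call shows why a slow follower is eventually dropped from the ISR: while it remains a member, it holds back what consumers are allowed to read.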
3. The acks contract
- acks=0: fire and forget. The producer does not wait for any response from the broker.
- acks=1: the leader has appended the record to its local log. You lose data if the leader dies before the followers replicate it.
- acks=all: the leader waits for every replica in the ISR to append. Combined with min.insync.replicas=2, this is the correct setting for durable pipelines.
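The interaction between acks and min.insync.replicas can be modelled roughly as follows. This is a simplified sketch, not broker code: leader_accepts and copies_before_ack are hypothetical names, and the model ignores timeouts and retries. It relies on the fact that min.insync.replicas is only enforced for acks=all requests, which the broker rejects with a NotEnoughReplicas error when the ISR is too small.

```python
def leader_accepts(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Toy model: would the leader accept this produce request?

    min.insync.replicas is only checked when acks=all; with acks=0 or
    acks=1 the leader appends regardless of how small the ISR is.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


def copies_before_ack(acks: str, isr_size: int) -> int:
    """How many replicas hold the record before the producer is answered."""
    return {"0": 0, "1": 1, "all": isr_size}[acks]


# Replication factor 3, min.insync.replicas=2
print(leader_accepts("all", isr_size=2, min_insync_replicas=2))  # True: one broker down is tolerated
print(leader_accepts("all", isr_size=1, min_insync_replicas=2))  # False: write rejected
# With min.insync.replicas=1, acks=all can be satisfied by the leader alone
print(copies_before_ack("all", isr_size=1))                      # 1
```

The last line is the case to watch for: acks=all only promises "all of the current ISR", so without a floor on the ISR size it can degrade to a single copy.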
4. Retention deletes whole segments, not individual records
When retention expires, Kafka deletes whole closed segments. That's why setting log.retention.ms to a small value but log.roll.ms (segment.ms at the topic level) to a huge one will appear to "not work": the active segment simply hasn't rolled yet, so the old records inside it are not yet eligible for deletion.
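The effect is easy to see in a small model. The sketch below assumes the simplified rule described above: a segment is eligible for time-based deletion only if it is closed and its newest record is older than the retention period. The Segment class and the deletable function are hypothetical and stand in for the broker's log cleaner.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment
    active: bool = False   # True for the segment still being appended to


def deletable(segments: list[Segment], now_ms: int, retention_ms: int) -> list[Segment]:
    """Return the closed segments whose newest record has outlived retention."""
    return [
        s for s in segments
        if not s.active and now_ms - s.max_timestamp_ms > retention_ms
    ]


HOUR = 60 * 60 * 1000
now = 100 * HOUR
segments = [
    Segment(base_offset=0, max_timestamp_ms=now - 10 * HOUR),
    # Low-traffic topic: the active segment holds old records but has not rolled.
    Segment(base_offset=4823104, max_timestamp_ms=now - 5 * HOUR, active=True),
]

expired = deletable(segments, now_ms=now, retention_ms=1 * HOUR)
print([s.base_offset for s in expired])  # [0]
```

Both segments hold records older than the one-hour retention, but only the closed one is removed; the records in the active segment stay readable until it rolls.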
Rules of thumb
- Keep segments small on low-traffic topics so retention kicks in on time.
- On throughput-critical topics, keep segments large to minimise file-handle churn.
- Always pair acks=all with min.insync.replicas. One without the other is a footgun.