Apache Kafka

Kafka Internals: Log Segments, Offsets & The Commit Protocol

A tour of how Kafka persists records — segment files, index files, high watermark, and the acks=all contract.

August 11, 2025 · 9 min read · Kafka, Architecture

Kafka is often described as a distributed commit log, but the phrase hides a lot of detail. Under the hood, every partition is a directory of append-only segment files, each paired with two index files. Understanding that layout is the fastest route to reasoning about throughput, retention, and failure modes.

1. Segments on disk

Each partition log is split into segments capped at a configured size (default log.segment.bytes=1 GiB). The active segment is closed, and a new one started, once it hits the size limit or the time limit log.roll.ms expires. Closed segments are immutable, which is what lets Kafka lean so heavily on sequential I/O.

/var/kafka/data/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000004823104.log
├── 00000000000004823104.index
└── 00000000000004823104.timeindex

The filename is the segment's base offset, zero-padded to 20 digits. The .index maps offset → physical byte position; the .timeindex maps timestamp → offset. Both are sparse (roughly one entry every log.index.interval.bytes of log data, default 4 KiB). To serve a fetch at a specific offset, the broker binary-searches the index for the nearest preceding entry, then does a short linear scan of the log.
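The lookup path can be sketched in a few lines of Python. This is a simplified in-memory model, not Kafka's actual implementation: `segment_filename`, `find_segment`, and `find_position` are hypothetical helper names, the index entries are made up, and the sketch stores absolute offsets where the real index files store offsets relative to the base offset.

```python
from bisect import bisect_right

def segment_filename(base_offset: int, suffix: str = "log") -> str:
    """Segment files are named after their base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.{suffix}"

def find_segment(base_offsets: list[int], target: int) -> int:
    """Return the base offset of the segment that would hold `target`.

    `base_offsets` must be sorted ascending.
    """
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise ValueError(f"offset {target} is below the first segment")
    return base_offsets[i]

def find_position(sparse_index: list[tuple[int, int]], target: int) -> int:
    """Return the byte position to start scanning from.

    `sparse_index` holds (offset, byte_position) pairs sorted by offset.
    We pick the last entry whose offset is <= target; a reader would then
    scan the log forward from that position until it reaches `target`.
    """
    offsets = [off for off, _ in sparse_index]
    i = bisect_right(offsets, target) - 1
    return sparse_index[i][1] if i >= 0 else 0

# Base offsets taken from the directory listing; index entries are invented.
base_offsets = [0, 4823104]
index = [(4823104, 0), (4823150, 4210), (4823197, 8433)]

base = find_segment(base_offsets, 4823160)
start = find_position(index, 4823160)
print(segment_filename(base), start)  # 00000000000004823104.log 4210
```

Both steps are binary searches over small sorted structures, which is why a seek stays cheap even when a partition holds many segments.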

2. LEO, HW, and the ISR

Every replica tracks a Log End Offset (LEO). The leader tracks the minimum LEO across the in-sync replicas (ISR) and publishes that as the High Watermark (HW). Consumers can only read up to HW — this is how Kafka guarantees that data returned to a consumer will survive a leader failure.

Leader      LEO=120  HW=118
Follower A  LEO=120
Follower B  LEO=118  ← gates the HW
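As a rough illustration, the HW calculation for the example above comes down to a `min` over the ISR. The function name `high_watermark` and the replica labels are hypothetical, and the sketch ignores how the HW is propagated to followers.

```python
def high_watermark(leo_by_replica: dict[str, int], isr: set[str]) -> int:
    """HW is the smallest LEO among the in-sync replicas (leader included)."""
    return min(leo_by_replica[r] for r in isr)

leo = {"leader": 120, "follower_a": 120, "follower_b": 118}
isr = {"leader", "follower_a", "follower_b"}

print(high_watermark(leo, isr))                   # 118

# If follower B drops out of the ISR, it no longer gates the HW.
print(high_watermark(leo, isr - {"follower_b"}))  # 120
```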

3. The acks contract

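  • acks=0: fire and forget. The producer does not wait for any acknowledgement from the broker; a record counts as sent as soon as it is handed to the socket buffer.
  • acks=1: the leader has appended the record to its local log. You lose data if the leader dies before followers replicate it.
  • acks=all: the leader waits for every replica currently in the ISR to append. Combined with min.insync.replicas=2 (on a topic with at least three replicas, so one broker can be down), this is the usual setting for durable pipelines.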
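The three settings can be modelled as a small decision function. This is a toy model of the contract, not client or broker code: `produce_outcome` is a hypothetical name, and `isr_size` counts the leader itself. The rejection branch mirrors the broker refusing an acks=all write when the ISR is smaller than min.insync.replicas.

```python
def produce_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Describe what the producer learns for one write under each acks mode."""
    if acks == "0":
        return "no acknowledgement requested"
    if acks == "1":
        return "acked after leader append"
    if acks == "all":
        if isr_size < min_insync_replicas:
            # Refuse the write rather than accept it under-replicated.
            return "rejected: not enough in-sync replicas"
        return f"acked after {isr_size} in-sync replica(s) appended"
    raise ValueError(f"unknown acks value: {acks!r}")

# ISR has shrunk to just the leader, e.g. during a rolling restart.
print(produce_outcome("all", isr_size=1, min_insync_replicas=1))
# acked after 1 in-sync replica(s) appended

print(produce_outcome("all", isr_size=1, min_insync_replicas=2))
# rejected: not enough in-sync replicas
```

The first call shows why acks=all alone is not enough: with min.insync.replicas=1 and a shrunken ISR, the write is acknowledged with only one copy on disk, which is effectively acks=1.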

4. Retention deletes whole segments, not individual records

When retention expires, Kafka deletes whole closed segments, never individual records. That's why setting log.retention.ms to a small value but log.roll.ms (segment.ms at the topic level) to a huge one will appear to "not work": the segment simply hasn't rolled yet, so the old records inside it are not yet eligible.
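A minimal sketch of that rule, following the description above rather than the broker source: the `Segment` class and `deletable` function are hypothetical, and real brokers handle more cases than this. A segment qualifies only if it is closed and its newest record is older than the retention limit.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int  # largest record timestamp in the segment
    active: bool           # True for the segment still being appended to

def deletable(segments: list[Segment], now_ms: int, retention_ms: int) -> list[int]:
    """Return base offsets of segments that time-based retention may delete."""
    return [
        s.base_offset
        for s in segments
        if not s.active and now_ms - s.max_timestamp_ms > retention_ms
    ]

now_ms = 10_000_000
retention_ms = 60_000

# One closed segment and one active segment, both holding expired records.
rolled = [Segment(0, 1_000_000, active=False), Segment(4823104, 2_000_000, active=True)]
print(deletable(rolled, now_ms, retention_ms))    # [0]

# A low-traffic topic whose only segment has never rolled: nothing is eligible.
unrolled = [Segment(0, 1_000_000, active=True)]
print(deletable(unrolled, now_ms, retention_ms))  # []
```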

Rules of thumb

  • Keep segments small on low-traffic topics so retention kicks in on time.
  • On throughput-critical topics, keep segments large to minimise file-handle churn.
  • Always pair acks=all with min.insync.replicas of at least 2. Either one without the other is a footgun.

Reader Discussion

9 replies

  1. Highlighted by author
    Tuấn Phạm 🇻🇳 HCMC · Staff Engineer · Tiki Data Platform · Story

    min.insync.replicas=2 with acks=all is the standard combo. We used to leave min.insync.replicas at its default of 1, and a single broker rolling restart cost us 4 messages. As luck would have it, they were payment confirmation events. I sat writing the postmortem from 2 a.m. to 7 a.m.

    Aug 12, 2025·1 day later
    • Minh Le · Author

      By the time you realize the default is 1, it's already too late. Thanks for sharing, bro. I'll add a warning box to the post.

      Aug 12, 2025
  2. Jakub Nowak · Backend Engineer · Pushback

    small pushback — cooperative sticky is great in theory but mixing it with a 2.4 broker we still had on legacy gave us a 6h partial outage where some consumers thought they owned 0 partitions and others owned everything. compatibility matrix is not a footnote, it's load bearing

    Aug 18, 2025·1 week later·edited
  3. Mai Tran · Full Stack · Asks

    real q: composite key = customerId + (ts % 8) sounds clean but what's the play when one customer goes whale-mode and one of the 8 buckets gets super hot retroactively? do you re-key in a side topic or just suffer until the next quarter?

    Aug 14, 2025·3 days later
    • Derek Okonkwo · Principal Engineer

      We do a side compaction topic keyed by (customer, partition). Cheap to scan, cheap to re-route. The sin is trying to live-migrate keys.

      Aug 15, 2025
  4. Derek Okonkwo · Principal Engineer · Fintech · Agrees

    The Kafka→anything-else caveat is the single most important paragraph in this post. I cannot count how many "exactly once" pipelines I've reviewed that write to Postgres without an outbox and the team still calls it EOS. It's at-least-once with extra steps and a worse mental model.

    Aug 12, 2025·1 day later
  5. Sven Bergström · Senior SWE · Story

    fresh transactional.id on each pod restart is THE silent killer. We had it for ~14 months. Realised when a zombie producer wrote 80k duplicates after a node went catatonic and came back.

    Aug 16, 2025·5 days later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Aug 14, 2025·3 days later
  7. Omar Khalil · Senior SWE · Kind words

    this is the third article from this blog I've sent to my team this month. you're cooking. don't switch to crypto.

    Aug 16, 2025·5 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email