Apache Kafka

Consumer Group Rebalancing: Eager, Cooperative, and Static

Why your consumer hangs for 30s every deploy — and how cooperative rebalance + static membership fixes it.

September 18, 2025 · 8 min read · Kafka, Operations

A rebalance is the process by which partitions are re-assigned among consumers in a group. If you've seen processing latency spike whenever a pod is recycled, you've seen the default eager protocol in action.

1. Eager rebalance (the old default)

When a member joins or leaves, every consumer revokes all of its partitions and rejoins the group, the group leader (one of the consumers) recomputes the assignment, and the coordinator hands it out before anyone picks up new work. During that window, which can last tens of seconds, the group processes zero records. It's called "stop-the-world" for a reason.

2. Cooperative rebalance

Enable it by setting partition.assignment.strategy to CooperativeStickyAssignor. Only the partitions that need to move are revoked; every other partition keeps being processed.

props.put("partition.assignment.strategy",
  "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

The trade-off: the rebalance happens in two rounds (revoke-only, then assign). Total wall time is similar, but partitions that don't move are never paused.
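To make the difference concrete, here is a small plain-JDK sketch (no Kafka client involved) that counts which partitions are paused when a group moves from one assignment to another. The class name RevocationSketch, the consumer names, and the example assignments are all hypothetical; the real assignors compute the target assignment themselves.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of which partitions stop being processed during a rebalance. Not Kafka client code. */
final class RevocationSketch {

    /** Eager: every member gives up everything it owns before the new assignment is handed out. */
    static Set<Integer> eagerRevoked(Map<String, List<Integer>> before) {
        Set<Integer> revoked = new HashSet<>();
        for (List<Integer> owned : before.values()) {
            revoked.addAll(owned);
        }
        return revoked;
    }

    /** Cooperative: a member gives up only the partitions it owns now but will not own afterwards. */
    static Set<Integer> cooperativeRevoked(Map<String, List<Integer>> before,
                                           Map<String, List<Integer>> after) {
        Set<Integer> revoked = new HashSet<>();
        for (Map.Entry<String, List<Integer>> entry : before.entrySet()) {
            List<Integer> target = after.getOrDefault(entry.getKey(), List.of());
            for (Integer partition : entry.getValue()) {
                if (!target.contains(partition)) {
                    revoked.add(partition);
                }
            }
        }
        return revoked;
    }

    public static void main(String[] args) {
        // Hypothetical group: two consumers own six partitions, then a third consumer joins.
        Map<String, List<Integer>> before = Map.of(
                "c1", List.of(0, 1, 2),
                "c2", List.of(3, 4, 5));
        Map<String, List<Integer>> after = Map.of(
                "c1", List.of(0, 1),
                "c2", List.of(3, 4),
                "c3", List.of(2, 5));

        System.out.println("eager revokes:       " + eagerRevoked(before).size() + " partitions");
        System.out.println("cooperative revokes: " + cooperativeRevoked(before, after).size() + " partitions");
    }
}
```

With these example assignments the eager path pauses all six partitions, while the cooperative path pauses only the two that change owner.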

3. Static membership

Give each consumer a stable group.instance.id. A short restart (within session.timeout.ms) no longer triggers a rebalance at all — the returning member reclaims its partitions.

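// podName must be identical across restarts (for example a StatefulSet pod name);
// a Deployment's generated pod names change on every restart and defeat static membership.
props.put("group.instance.id", "order-worker-" + podName);
props.put("session.timeout.ms", "60000"); // 60 s window for the member to come back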

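This is the single highest-leverage change for deployments on Kubernetes. Size session.timeout.ms to exceed the worst-case restart time (roughly terminationGracePeriodSeconds plus scheduling and startup), so a rolling update brings each member back before its session expires and no rebalance is triggered.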
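Static membership only helps if the id really is the same after a restart. The sketch below is one way to guard against that, assuming the pod hostname follows StatefulSet-style naming (name plus ordinal, such as order-worker-3). The class StaticMembershipConfig and its methods are hypothetical helpers, and the name check is a heuristic, not something the Kafka client does for you.

```java
import java.util.Properties;

/** Builds consumer properties with a restart-stable group.instance.id. Plain JDK; no Kafka client needed. */
final class StaticMembershipConfig {

    /**
     * Assumption: a stable hostname looks like "<name>-<ordinal>", e.g. "order-worker-3".
     * Anything else is rejected rather than silently producing an id that changes on every restart.
     */
    static String instanceIdFor(String hostname) {
        if (hostname == null || !hostname.matches(".+-\\d+")) {
            throw new IllegalArgumentException("hostname is not a stable <name>-<ordinal>: " + hostname);
        }
        return hostname;
    }

    /** Returns only the two static-membership settings; merge them into the rest of the consumer config. */
    static Properties consumerProps(String hostname, int sessionTimeoutMs) {
        Properties props = new Properties();
        props.put("group.instance.id", instanceIdFor(hostname));
        props.put("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Inside a pod the hostname would typically come from the HOSTNAME environment variable;
        // a literal is used here so the example runs anywhere.
        System.out.println(consumerProps("order-worker-3", 60_000));
    }
}
```

Failing fast on an unexpected hostname is a deliberate choice in this sketch: a consumer that starts with a fresh id on every restart behaves like a dynamic member, and the old id lingers until its session times out.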

Diagnosing a slow rebalance

  • kafka-consumer-groups.sh --describe --group <group> --members shows the current members and their assignments; --state shows the group's state and coordinator.
  • Watch the coordinator log for Preparing to rebalance group X in state PreparingRebalance.
  • If you see frequent rebalances without obvious cause, the consumer is likely taking longer than max.poll.interval.ms between calls to poll(), usually because processing a batch is slow.
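To check the last point, it can help to measure the gap between consecutive poll() calls directly. The following is a minimal plain-JDK sketch; PollGapMonitor is a hypothetical helper, not part of the Kafka client, and it assumes you call onPoll() with the current time right before each poll().

```java
/** Tracks the longest gap between consecutive poll() calls. Plain JDK; not part of the Kafka client. */
final class PollGapMonitor {
    private final long maxPollIntervalMs;
    private long lastPollAtMs = -1;   // -1 means no poll has been recorded yet
    private long worstGapMs = 0;

    PollGapMonitor(long maxPollIntervalMs) {
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    /** Records a poll at the given time; returns true if the gap since the previous poll exceeded the limit. */
    boolean onPoll(long nowMs) {
        boolean exceeded = false;
        if (lastPollAtMs >= 0) {
            long gap = nowMs - lastPollAtMs;
            worstGapMs = Math.max(worstGapMs, gap);
            exceeded = gap > maxPollIntervalMs;
        }
        lastPollAtMs = nowMs;
        return exceeded;
    }

    long worstGapMs() {
        return worstGapMs;
    }

    public static void main(String[] args) {
        // 300000 ms is the client default for max.poll.interval.ms.
        PollGapMonitor monitor = new PollGapMonitor(300_000);
        long[] simulatedPollTimesMs = {0, 10_000, 400_000};
        for (long t : simulatedPollTimesMs) {
            if (monitor.onPoll(t)) {
                System.out.println("gap exceeded max.poll.interval.ms at t=" + t + " ms");
            }
        }
        System.out.println("worst gap: " + monitor.worstGapMs() + " ms");
    }
}
```

In a real loop you would pass a monotonic clock reading (for example System.nanoTime() converted to milliseconds) and export the worst gap as a metric, so a slow batch shows up before it causes a rebalance.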

Checklist

  1. Switch to cooperative sticky.
  2. Set group.instance.id for every consumer.
  3. Keep each poll() iteration well under max.poll.interval.ms, or hand work off to a worker pool.
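
For item 3, one simple lever is capping the batch size with max.poll.records so that even a slow batch finishes in time. Here is a rough sizing sketch in plain Java; PollBudget, the safety fraction, and the per-record estimate are assumptions you would replace with your own measurements.

```java
/** Rough sizing of max.poll.records from a per-record processing estimate. Plain JDK. */
final class PollBudget {

    /**
     * Largest batch size whose worst-case processing time stays within
     * safetyFraction of max.poll.interval.ms. Always returns at least 1.
     */
    static int maxPollRecords(long maxPollIntervalMs, long worstCaseMsPerRecord, double safetyFraction) {
        if (maxPollIntervalMs <= 0 || worstCaseMsPerRecord <= 0
                || safetyFraction <= 0 || safetyFraction > 1) {
            throw new IllegalArgumentException("arguments must be positive and safetyFraction in (0, 1]");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFraction);
        long records = budgetMs / worstCaseMsPerRecord;
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, records));
    }

    public static void main(String[] args) {
        // Hypothetical numbers: default 300000 ms interval, 2 s worst case per record, use half the budget.
        int cap = maxPollRecords(300_000, 2_000, 0.5);
        System.out.println("max.poll.records=" + cap);
    }
}
```

If the result comes out at 1 and a single record can still blow the budget, batch sizing alone is not enough and handing work to a worker pool is the remaining option.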

Reader Discussion

8 replies

  1. Highlighted by author
    Tuấn Phạm · 🇻🇳 HCMC · Staff Engineer · Tiki Data Platform · Story

    min.insync.replicas=2 with acks=all is the standard combo. We used to leave min.insync.replicas at its default of 1, and a single broker rolling restart lost 4 messages; unluckily they were payment-confirm events. I sat writing the postmortem from 2 a.m. to 7 a.m.

    Sep 19, 2025·1 day later
    • Minh Le · Author

      That exact moment when you realize the default is 1 and it's already too late. Thanks for sharing, bro; I'll add a warning box to the post.

      Sep 19, 2025
  2. Derek Okonkwo · Principal Engineer · Fintech · Agrees

    The Kafka→anything-else caveat is the single most important paragraph in this post. I cannot count how many "exactly once" pipelines I've reviewed that write to Postgres without an outbox and the team still calls it EOS. It's at-least-once with extra steps and a worse mental model.

    Sep 19, 2025·1 day later
  3. Sven Bergström · Senior SWE · Story

    fresh transactional.id on each pod restart is THE silent killer. We had it for ~14 months. Realised when a zombie producer wrote 80k duplicates after a node went catatonic and came back.

    Sep 23, 2025·5 days later
  4. Quốc Anh · Backend Lead · Finhay · Agrees

    The segment files section explains it very concisely. We often forget that log.segment.bytes and log.retention.bytes are two different things; retention wasn't kicking in because the segment hadn't rolled yet, which is exactly the rule of thumb at the end of the post.

    Sep 20, 2025·2 days later
  5. Amelia Brooks · Distributed Systems · Pushback

    tiny nit but acks=0 is not literally fire-and-forget at the protocol layer — the producer still writes to its socket buffer. The 'forget' is the broker side. Pedantic but bites people in metrics dashboards.

    Sep 27, 2025·1 week later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Sep 20, 2025·2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Sep 22, 2025·4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email