Consumer Group Rebalancing: Eager, Cooperative, and Static
Why your consumer hangs for 30s every deploy — and how cooperative rebalance + static membership fixes it.
A rebalance is the process by which partitions are re-assigned among consumers in a group. If you've seen processing latency spike whenever a pod is recycled, you've seen the default eager protocol in action.
1. Eager rebalance (the old default)
When any member joins or leaves, every consumer revokes all of its partitions, the group leader recomputes the assignment, the coordinator distributes it, and every consumer picks up its new partitions. During that window, which can last tens of seconds, the group processes zero records. It's called "stop-the-world" for a reason.
2. Cooperative rebalance
Enable it by setting partition.assignment.strategy to CooperativeStickyAssignor. Only the partitions that need to move are revoked; every other partition keeps being processed by its current owner.
props.put("partition.assignment.strategy",
"org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
The trade-off: the rebalance happens in two rounds (revoke-only, then assign). Total wall time is similar, but partitions that don't move are never paused.
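To make the difference concrete, here is a toy model in plain Java. It is not the Kafka client and does not run a real assignor; it only compares which partitions each protocol would pause for a given change in ownership. The class name RevocationModel and the member and partition values in the usage note are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model: which partitions stop processing during a rebalance. */
final class RevocationModel {

    /** Eager: every owned partition is revoked before anything is reassigned. */
    static Set<Integer> eagerRevoked(Map<Integer, String> before) {
        return new TreeSet<>(before.keySet());
    }

    /** Cooperative: only partitions whose owner changes are revoked. */
    static Set<Integer> cooperativeRevoked(Map<Integer, String> before,
                                           Map<Integer, String> after) {
        Set<Integer> moved = new TreeSet<>();
        for (Map.Entry<Integer, String> entry : before.entrySet()) {
            String newOwner = after.get(entry.getKey());
            // A partition is paused only if it ends up with a different owner.
            if (!entry.getValue().equals(newOwner)) {
                moved.add(entry.getKey());
            }
        }
        return moved;
    }
}
```

With six partitions split between members a and b, and a new member c taking over partitions 2 and 5, eagerRevoked returns all six partitions while cooperativeRevoked returns only 2 and 5.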
3. Static membership
Give each consumer a stable group.instance.id. A short restart (within session.timeout.ms) no longer triggers a rebalance at all — the returning member reclaims its partitions.
props.put("group.instance.id", "order-worker-" + podName);
props.put("session.timeout.ms", "60000");
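This is the single highest-leverage change for deployments on Kubernetes. The ID has to survive the restart, so derive it from a stable name such as a StatefulSet pod name; Deployment pod names change whenever a pod is replaced. Size session.timeout.ms to exceed the worst-case restart (graceful shutdown, bounded by terminationGracePeriodSeconds, plus scheduling and startup) so the member is back before the coordinator evicts it and a rolling update does not set off a rebalance storm.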
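One way to keep these settings honest is to build them together and reject a configuration in which the expected restart would outlast the session timeout, since static membership only helps when the member returns within session.timeout.ms. The sketch below uses only java.util.Properties. StaticMembershipConfig, stableName, and worstCaseRestart are hypothetical names, and the restart estimate is something you would have to measure for your own pods.

```java
import java.time.Duration;
import java.util.Properties;

/** Sketch: build static-membership settings and sanity-check the restart budget. */
final class StaticMembershipConfig {

    static Properties build(String stableName,
                            Duration sessionTimeout,
                            Duration worstCaseRestart) {
        if (stableName == null || stableName.isBlank()) {
            throw new IllegalArgumentException("group.instance.id needs a stable, non-empty name");
        }
        // The member must be back before the session expires, or a rebalance still happens.
        if (worstCaseRestart.compareTo(sessionTimeout) >= 0) {
            throw new IllegalArgumentException(
                "worst-case restart " + worstCaseRestart
                + " must be shorter than session timeout " + sessionTimeout);
        }
        Properties props = new Properties();
        props.put("group.instance.id", "order-worker-" + stableName);
        props.put("session.timeout.ms", Long.toString(sessionTimeout.toMillis()));
        return props;
    }
}
```

Calling build("0", Duration.ofSeconds(60), Duration.ofSeconds(40)) yields group.instance.id order-worker-0 and session.timeout.ms 60000, while a 90-second restart estimate against the same timeout is rejected.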
Diagnosing a slow rebalance
- kafka-consumer-groups.sh --describe shows the current members and their generation.
- Watch the coordinator log for Preparing to rebalance group X in state PreparingRebalance.
- If you see frequent rebalances without an obvious cause, the consumer is likely taking longer than max.poll.interval.ms between calls to poll().
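To put a number on "frequent", you can count how often that coordinator message appears per group. The sketch below is a rough helper, not a Kafka tool: it matches only the message fragment quoted above and ignores whatever else the log line carries. The class name RebalanceLogScan is hypothetical, and the pattern assumes group names contain no whitespace.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count "Preparing to rebalance" messages per group in coordinator log lines. */
final class RebalanceLogScan {

    // Captures the group name and the state it was in when the rebalance started.
    private static final Pattern PREPARING =
        Pattern.compile("Preparing to rebalance group (\\S+) in state (\\S+)");

    static Map<String, Integer> countByGroup(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = PREPARING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it the lines of a coordinator log (for example via Files.readAllLines) gives a per-group tally you can compare against your deploy times.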
Checklist
- Switch to cooperative sticky.
- Set group.instance.id for every consumer.
- Keep each poll() iteration well under max.poll.interval.ms, or hand work off to a worker pool.
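The worker-pool hand-off in the last item can be sketched without any Kafka dependency. The idea is that the polling thread only submits work and returns, so the time between poll() calls stays short regardless of how long each record takes. WorkerPoolHandoff is a hypothetical name and records are plain strings here. In a real consumer you would also have to decide when to commit offsets and whether to pause() partitions while work is in flight; this sketch leaves both out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Sketch: hand a polled batch to a fixed pool so the polling thread is not blocked. */
final class WorkerPoolHandoff {
    private final ExecutorService workers;
    private final Consumer<String> handler;

    WorkerPoolHandoff(int threads, Consumer<String> handler) {
        this.workers = Executors.newFixedThreadPool(threads);
        this.handler = handler;
    }

    /** Submits every record and returns immediately; the future completes when all are done. */
    CompletableFuture<Void> submitBatch(List<String> batch) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String record : batch) {
            pending.add(CompletableFuture.runAsync(() -> handler.accept(record), workers));
        }
        return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]));
    }

    /** Stops accepting new work; already submitted records still finish. */
    void shutdown() {
        workers.shutdown();
    }
}
```

The returned future is the natural place to hook offset commits: commit only after it completes, so a crash mid-batch leads to reprocessing rather than data loss.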