How Kafka Elects Leaders and Distributes Partitions: Inside the Mechanics
This post explains Kafka's internal mechanisms for leader election, partition assignment, and file storage: how the Controller uses ZooKeeper, how leaders are selected from the ISR, how partitions are distributed across brokers, how segment files are structured, and how offset management evolved from ZooKeeper to the __consumer_offsets topic.
1. Leader Election
Leader is a per‑partition role: when a broker fails, every partition whose leader resided on that broker must go through leader election to choose a new leader.
Kafka first uses ZooKeeper to elect a Controller, which then handles partition assignment and leader election. The Controller registers a watch on the ZooKeeper path /brokers/ids; when a broker goes down, its ephemeral node disappears and the Controller receives a notification.
Partition allocation:
Sort all brokers (assume n brokers) and the partitions to be assigned.
Assign the i‑th partition to broker (i mod n); this broker becomes the leader.
Assign the j‑th replica of the i‑th partition to broker ((i + j) mod n).
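The allocation steps above can be sketched in Python (the function name and broker ids are illustrative, not Kafka's actual code):

```python
def assign_partitions(brokers, num_partitions, num_replicas):
    """Round-robin assignment as described above: the leader of partition i
    goes to broker (i mod n), and replica j of partition i goes to broker
    ((i + j) mod n). Returns {partition: [leader, follower, ...]}."""
    brokers = sorted(brokers)
    n = len(brokers)
    assignment = {}
    for i in range(num_partitions):
        # j = 0 is the leader itself; j = 1..num_replicas-1 are the followers
        assignment[i] = [brokers[(i + j) % n] for j in range(num_replicas)]
    return assignment

# Example: 3 brokers, 4 partitions, replication factor 2
print(assign_partitions([0, 1, 2], 4, 2))
# → {0: [0, 1], 1: [1, 2], 2: [2, 0], 3: [0, 1]}
```

Note how partition 3 wraps back around to broker 0; the modulo keeps leaders evenly spread no matter how many partitions there are.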
Leader selection: the Controller reads the ISR (in‑sync replica) list for the partition from the ZooKeeper path /brokers/topics/[topic]/partitions/[partition]/state and picks one replica from it as the new leader. ISR membership is governed by replica.lag.time.max.ms; older versions also evicted replicas that lagged by too many messages.
After a leader is chosen, the Controller updates ZooKeeper and sends a LeaderAndIsrRequest to the affected brokers, notifying them of the new leader. Subsequent requests for that partition are handled by the leader, while followers pull messages from it to stay in sync.
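A minimal sketch of the selection step, assuming the common policy of taking the first ISR replica that is still alive (the function name is illustrative):

```python
def choose_leader(isr, live_brokers):
    """Pick the new leader: the first replica in the ISR list that is
    still alive. Returns None if no in-sync replica survives (in real
    Kafka, this is where unclean leader election would come into play)."""
    for replica in isr:
        if replica in live_brokers:
            return replica
    return None

# Broker 1 (the old leader) has just died:
print(choose_leader(isr=[1, 2, 3], live_brokers={2, 3}))  # → 2
```

Because only ISR members are eligible, the chosen leader is guaranteed to have all committed messages, so no acknowledged data is lost.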
2. Partition Distribution Strategy
If the producer specifies a partition, the message is written to that partition.
If no partition is specified but a key is set, the key is hashed to determine the target partition.
If neither partition nor key is provided, the producer round‑robin distributes messages across all partitions.
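The three rules above can be sketched as follows. Kafka's real default partitioner hashes keys with murmur2, and newer producers replace pure round‑robin with sticky partitioning; md5 and a simple counter here are illustrative stand‑ins:

```python
import hashlib
from itertools import count

_counter = count()  # round-robin state for keyless messages

def choose_partition(num_partitions, partition=None, key=None):
    """Mirror the producer's decision order: an explicit partition wins,
    then the key hash, then round-robin. (md5 stands in for Kafka's
    murmur2 key hash; the behavior, not the hash, is the point.)"""
    if partition is not None:
        return partition
    if key is not None:
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return next(_counter) % num_partitions

print(choose_partition(6, partition=3))          # → 3
print(choose_partition(6, key="user-42"))        # same partition every time
print(choose_partition(6), choose_partition(6))  # → 0 1 (round-robin)
```

The key‑hash rule is what gives per‑key ordering: all messages with the same key land in the same partition, and each partition is consumed in order.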
3. Partition Files
In the operating system, each partition is a directory containing multiple segment files. Each segment consists of three files: .log (the actual messages), .index (the offset index), and .timeindex (the time index, present in newer versions). The files are named after the smallest offset in the segment; e.g., 368796.index covers offsets 368796‑1105813.
Kafka combines ordered offsets, segmented files, and sparse index files to look up data efficiently: a binary search locates the right segment and the nearest index entry, and a short sequential scan from that entry finds the exact message.
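The two‑step lookup can be sketched in Python, with segment base offsets standing in for file names and (relative offset, file position) pairs standing in for sparse .index entries (all values here are illustrative):

```python
import bisect

def locate(segments, sparse_index, target_offset):
    """Sketch of Kafka's lookup path.
    segments: sorted list of segment base offsets (the file names).
    sparse_index: {base_offset: sorted [(relative_offset, file_position)]}.
    Returns (segment_base, file_position) to start the sequential scan."""
    # Step 1: binary search for the segment whose base offset <= target.
    i = bisect.bisect_right(segments, target_offset) - 1
    base = segments[i]
    # Step 2: binary search the sparse index for the nearest entry at or
    # before the target; a sequential scan of the .log file from that
    # position then finds the exact message.
    entries = sparse_index[base]
    rel = target_offset - base
    j = bisect.bisect_right([e[0] for e in entries], rel) - 1
    return base, entries[j][1]

segments = [0, 368796, 1105814]
index = {368796: [(0, 0), (2000, 4096), (5000, 12288)]}
print(locate(segments, index, 371000))  # → (368796, 4096)
```

Because the index is sparse, it stays small enough to keep in memory; the cost is a short scan of the .log file between index entries, which is cheap for sequential disk reads.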
4. Offset Management Evolution
Before version 0.10, consumer offsets were stored in ZooKeeper and reported periodically, which could cause duplicate consumption and performance issues.
Since version 0.10, offsets are stored directly in the internal __consumer_offsets topic within the Kafka cluster.
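As a sketch of how a consumer group maps to a partition of __consumer_offsets, assuming the common scheme of hashing the group id Java‑style modulo the topic's partition count (50 by default; the helper names below are illustrative):

```python
def java_string_hashcode(s):
    """Java's String.hashCode(): h = 31*h + ch in 32-bit signed arithmetic.
    Python's built-in hash() is not stable across runs, so we emulate
    Java's to keep the mapping deterministic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h  # to signed 32-bit

def offsets_partition_for(group_id, num_partitions=50):
    """All commits for a group land in one partition of __consumer_offsets,
    chosen by hashing the group id modulo the topic's partition count."""
    return abs(java_string_hashcode(group_id)) % num_partitions

print(offsets_partition_for("my-consumer-group"))
```

Pinning each group to a single partition means one broker (that partition's leader) serves as the group's offset coordinator, so commits and fetches for a group never need cluster‑wide coordination.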
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.