How Kafka Elects Leaders and Distributes Partitions: Inside the Mechanics
This post explains Kafka's internal mechanisms for leader election, partition assignment, and file storage: how the Controller uses ZooKeeper, how leaders are selected from the ISR, how partitions are distributed across brokers, how segment files are structured, and how offset management evolved from ZooKeeper to the __consumer_offsets topic.
1. Leader Election
Leader is a per‑partition role: when a broker fails, every partition whose leader resided on that broker must go through leader election to choose a new leader.
Kafka first uses ZooKeeper to elect a Controller, which then handles partition assignment and leader election. The Controller registers a watch on the ZooKeeper path /brokers/ids; when a broker goes down, its ephemeral node disappears and the Controller receives a notification.
Partition allocation:
Sort all brokers (assume n brokers) and the partitions to be assigned.
Assign the i‑th partition to broker (i mod n); this broker becomes the leader.
Assign the j‑th replica of the i‑th partition to broker ((i + j) mod n).
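The allocation steps above can be sketched in Python (the function name and broker ids are illustrative, not Kafka's actual code):

```python
def assign_partitions(brokers, num_partitions, num_replicas):
    """Round-robin assignment as described above: the leader of partition i
    goes to broker (i mod n), and replica j of partition i goes to broker
    ((i + j) mod n). Returns {partition: [leader, follower, ...]}."""
    brokers = sorted(brokers)
    n = len(brokers)
    assignment = {}
    for i in range(num_partitions):
        # j = 0 is the leader itself; j = 1..num_replicas-1 are the followers
        assignment[i] = [brokers[(i + j) % n] for j in range(num_replicas)]
    return assignment

# Example: 3 brokers, 4 partitions, replication factor 2
print(assign_partitions([0, 1, 2], 4, 2))
# → {0: [0, 1], 1: [1, 2], 2: [2, 0], 3: [0, 1]}
```

Note how partition 3 wraps back around to broker 0; the modulo keeps leaders evenly spread no matter how many partitions there are.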
Leader selection: the Controller reads the ISR (in‑sync replica) list for the partition from the ZooKeeper path /brokers/topics/[topic]/partitions/[partition]/state and picks one replica from it as the new leader. ISR membership is governed by replica.lag.time.max.ms; older versions also evicted replicas that lagged by too many messages.
After a leader is chosen, the Controller updates ZooKeeper and sends a LeaderAndIsrRequest to the affected brokers, notifying them of the new leader. Subsequent requests for that partition are handled by the leader, while followers pull messages from it to stay in sync.
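A minimal sketch of the selection step, assuming the common policy of taking the first ISR replica that is still alive (the function name is illustrative):

```python
def choose_leader(isr, live_brokers):
    """Pick the new leader: the first replica in the ISR list that is
    still alive. Returns None if no in-sync replica survives (in real
    Kafka, this is where unclean leader election would come into play)."""
    for replica in isr:
        if replica in live_brokers:
            return replica
    return None

# Broker 1 (the old leader) has just died:
print(choose_leader(isr=[1, 2, 3], live_brokers={2, 3}))  # → 2
```

Because only ISR members are eligible, the chosen leader is guaranteed to have all committed messages, so no acknowledged data is lost.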
2. Partition Distribution Strategy
If the producer specifies a partition, the message is written to that partition.
If no partition is specified but a key is set, the key is hashed to determine the target partition.
If neither partition nor key is provided, the producer round‑robin distributes messages across all partitions.
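The three rules above can be sketched as follows. Kafka's real default partitioner hashes keys with murmur2, and newer producers replace pure round‑robin with sticky partitioning; md5 and a simple counter here are illustrative stand‑ins:

```python
import hashlib
from itertools import count

_counter = count()  # round-robin state for keyless messages

def choose_partition(num_partitions, partition=None, key=None):
    """Mirror the producer's decision order: an explicit partition wins,
    then the key hash, then round-robin. (md5 stands in for Kafka's
    murmur2 key hash; the behavior, not the hash, is the point.)"""
    if partition is not None:
        return partition
    if key is not None:
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return next(_counter) % num_partitions

print(choose_partition(6, partition=3))          # → 3
print(choose_partition(6, key="user-42"))        # same partition every time
print(choose_partition(6), choose_partition(6))  # → 0 1 (round-robin)
```

The key‑hash rule is what gives per‑key ordering: all messages with the same key land in the same partition, and each partition is consumed in order.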
3. Partition Files
In the operating system, each partition is a directory containing multiple segment files. Each segment consists of three files: .log (the actual messages), .index (the offset index), and .timeindex (the time index, present in newer versions). The files are named after the smallest offset in the segment; e.g., 368796.index covers offsets 368796‑1105813.
Kafka combines ordered offsets, segmented files, and sparse index files to look up data efficiently: a binary search locates the right segment and the nearest index entry, and a short sequential scan from that entry finds the exact message.
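The two‑step lookup can be sketched in Python, with segment base offsets standing in for file names and (relative offset, file position) pairs standing in for sparse .index entries (all values here are illustrative):

```python
import bisect

def locate(segments, sparse_index, target_offset):
    """Sketch of Kafka's lookup path.
    segments: sorted list of segment base offsets (the file names).
    sparse_index: {base_offset: sorted [(relative_offset, file_position)]}.
    Returns (segment_base, file_position) to start the sequential scan."""
    # Step 1: binary search for the segment whose base offset <= target.
    i = bisect.bisect_right(segments, target_offset) - 1
    base = segments[i]
    # Step 2: binary search the sparse index for the nearest entry at or
    # before the target; a sequential scan of the .log file from that
    # position then finds the exact message.
    entries = sparse_index[base]
    rel = target_offset - base
    j = bisect.bisect_right([e[0] for e in entries], rel) - 1
    return base, entries[j][1]

segments = [0, 368796, 1105814]
index = {368796: [(0, 0), (2000, 4096), (5000, 12288)]}
print(locate(segments, index, 371000))  # → (368796, 4096)
```

Because the index is sparse, it stays small enough to keep in memory; the cost is a short scan of the .log file between index entries, which is cheap for sequential disk reads.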
4. Offset Management Evolution
Before version 0.10, consumer offsets were stored in ZooKeeper and reported periodically, which could cause duplicate consumption and performance issues.
Since version 0.10, offsets are stored directly in the internal __consumer_offsets topic within the Kafka cluster.
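As a sketch of how a consumer group maps to a partition of __consumer_offsets, assuming the common scheme of hashing the group id Java‑style modulo the topic's partition count (50 by default; the helper names below are illustrative):

```python
def java_string_hashcode(s):
    """Java's String.hashCode(): h = 31*h + ch in 32-bit signed arithmetic.
    Python's built-in hash() is not stable across runs, so we emulate
    Java's to keep the mapping deterministic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h  # to signed 32-bit

def offsets_partition_for(group_id, num_partitions=50):
    """All commits for a group land in one partition of __consumer_offsets,
    chosen by hashing the group id modulo the topic's partition count."""
    return abs(java_string_hashcode(group_id)) % num_partitions

print(offsets_partition_for("my-consumer-group"))
```

Pinning each group to a single partition means one broker (that partition's leader) serves as the group's offset coordinator, so commits and fetches for a group never need cluster‑wide coordination.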
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.