Kafka Outage and High Availability Mechanisms
This article examines a Kafka outage at a fintech company, explains Kafka's multi-replica redundancy design, leader-follower architecture, and ISR mechanism, shows how misconfiguration of the built-in __consumer_offsets topic can cause cluster-wide consumer failures, and provides fixes to ensure true high availability.
The article starts with a real‑world incident where one of three Kafka broker nodes crashed, causing all consumers to stop receiving messages despite the remaining two nodes being up.
It then describes Kafka's multi-replica redundancy design, covering the physical model (brokers) and the logical model (topics, partitions, offsets). Each broker is a Kafka server; a topic groups related messages; a partition is a subdivision of a topic; and an offset identifies a message's position within a partition.
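To make the partition/offset model concrete, here is a minimal sketch of how a keyed message is routed to a partition. Kafka's default partitioner actually uses a murmur2 hash of the serialized key; `String.hashCode()` is used here purely for illustration, and the key and partition count are made-up examples.

```java
// Simplified sketch: a keyed message always lands in the same partition,
// and within that partition each message gets a monotonically increasing offset.
// Kafka's DefaultPartitioner uses murmur2; hashCode() here is illustrative only.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // Note: Kafka masks the sign bit instead of Math.abs; this sketch
        // is fine for ordinary keys but not for Integer.MIN_VALUE hashes.
        return Math.abs(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        // Same key -> same partition, so per-key ordering is preserved.
        int p1 = partitionFor("order-42", 6);
        int p2 = partitionFor("order-42", 6);
        System.out.println("key order-42 -> partition " + p1 + " (stable: " + (p1 == p2) + ")");
    }
}
```

Because routing is deterministic per key, all messages for one key share a partition and are consumed in order by offset.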
From version 0.8 onward, Kafka supports replication: every partition has one leader replica and, when the replication factor is greater than one, one or more follower replicas. Clients interact only with the leader, while followers stay in sync via the ISR (In-Sync Replicas) mechanism, which tracks the set of replicas that are sufficiently caught up with the leader.
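The ISR rule can be sketched as a toy model: a follower stays in the ISR as long as it has fully caught up with the leader within `replica.lag.time.max.ms` (10 seconds by default). The broker names, timestamps, and data structure below are illustrative, not Kafka's actual internals.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy model of ISR membership: a follower is in sync if its last
// full catch-up with the leader happened within replica.lag.time.max.ms.
public class IsrSketch {
    static final long LAG_TIME_MAX_MS = 10_000; // Kafka's default

    static List<String> inSyncReplicas(Map<String, Long> lastCaughtUpMs, long nowMs) {
        return lastCaughtUpMs.entrySet().stream()
                .filter(e -> nowMs - e.getValue() <= LAG_TIME_MAX_MS)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> followers = Map.of(
                "broker-1", 100_000L,  // caught up 2s ago -> stays in ISR
                "broker-2", 85_000L);  // caught up 17s ago -> dropped from ISR
        System.out.println(inSyncReplicas(followers, 102_000L));
    }
}
```

A follower dropped from the ISR is no longer counted toward `acks=all`, which is what keeps a slow replica from stalling producers indefinitely.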
The article also explains the producer acknowledgment setting request.required.acks (named acks in the modern client), with values 0, 1, and all/-1. A value of 0 provides no durability guarantee, 1 guarantees only that the leader received the message, and all (or -1) requires every ISR member to acknowledge, offering the strongest durability guarantee.
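A durability-oriented producer configuration might look like the sketch below. The broker addresses are placeholders, and this only builds the configuration; actually sending messages would require the Kafka client library and a running cluster.

```java
import java.util.Properties;

// Sketch of producer durability settings. "acks" is the modern client's
// name for the old request.required.acks; broker addresses are placeholders.
public class AcksConfigSketch {
    static Properties durableProducerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=0: fire-and-forget; acks=1: leader only; acks=all: every ISR member
        props.put("acks", "all");
        props.put("retries", "3");
        return props;
    }

    public static void main(String[] args) {
        System.out.println("acks = " + durableProducerConfig().getProperty("acks"));
    }
}
```

Note that `acks=all` only means "all current ISR members"; pairing it with a broker-side `min.insync.replicas` of 2 or more is what prevents the ISR from shrinking to the leader alone.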
A critical issue identified is the built-in __consumer_offsets topic, which has 50 partitions by default and, in older versions or clusters bootstrapped with a single broker, can be created with a replication factor of 1. If its partitions then all reside on one broker, that broker's failure prevents every consumer group from committing or fetching offsets, creating a single point of failure.
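This failure mode follows from how offsets are stored: each consumer group's offsets live in exactly one partition of __consumer_offsets, chosen by hashing the group ID modulo the partition count. The sketch below mirrors that mapping in simplified form (Kafka masks the hash's sign bit rather than using `Math.abs`); the group name is a made-up example.

```java
// Each consumer group's committed offsets live in one partition of
// __consumer_offsets, selected by hashing the group ID. If that single
// partition has one replica and its broker dies, the whole group stalls.
public class OffsetsPartitionSketch {
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        // Simplified: Kafka uses (hash & 0x7fffffff) % numPartitions.
        return Math.abs(groupId.hashCode()) % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        int p = partitionFor("payments-service", 50); // 50 = default partition count
        System.out.println("group payments-service -> __consumer_offsets partition " + p);
    }
}
```

So even with healthy data-topic replicas on the surviving brokers, a group whose offsets partition lived only on the dead broker cannot make progress, which matches the incident described above.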
To resolve this, the article suggests two steps: (1) delete the misconfigured __consumer_offsets topic by removing its log files (with consumers stopped, since this discards committed offsets), and (2) set offsets.topic.replication.factor to 3 so that the offsets topic also benefits from multi-replica redundancy, matching the number of brokers.
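On a three-broker cluster, the relevant broker settings in server.properties might look like the fragment below; the values are illustrative and must be set before the offsets topic is (re)created, since changing them later does not alter an existing topic.

```properties
# server.properties (illustrative values for a three-broker cluster)
offsets.topic.replication.factor=3
offsets.topic.num.partitions=50
# Keep at least two in-sync copies so acks=all survives one broker loss
min.insync.replicas=2
```

With these settings, losing any single broker leaves two in-sync replicas of every offsets partition, and consumer groups keep committing offsets through the outage.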
Finally, it emphasizes that with proper replication (e.g., three replicas on three brokers) and correct acks settings, Kafka can achieve true high availability, while noting that the coordination layer (ZooKeeper, or the KRaft controller quorum) still requires a majority of its nodes to be alive; if more than half are down, the cluster halts operations.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps.