Why a Single Kafka Broker Failure Can Halt the Entire Cluster
This article explains Kafka's high‑availability architecture, covering multi‑replica redundancy, ISR synchronization, producer ACK settings, and the critical role of the internal `__consumer_offsets` topic, and shows how to configure replication factors so a single‑node outage cannot stop consumption.
1. Kafka Outage Triggers High‑Availability Issues
The problem starts with a Kafka outage in a fintech company that uses Kafka instead of RabbitMQ. Although the cluster runs stably most of the time, occasional consumer failures occur when one of the three broker nodes goes down, causing the entire consumer group to stop receiving messages.
2. Kafka's Multi‑Replica Redundancy Design
High availability in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS is typically achieved through redundancy. Key Kafka concepts include:
Broker (node): a Kafka server, i.e., a physical node.
Topic: a logical category for messages; producers send to a topic name, consumers read from it.
Partition: each topic is split into one or more partitions; each partition belongs to a single broker.
Offset: the position of a message within a partition, used by consumers to track progress.
Before version 0.8, Kafka had no replication; a broker failure meant loss of all partitions on that broker. Since 0.8, each partition has a leader and one or more followers. Producers and consumers interact only with the leader; followers replicate data from the leader.
When a broker crashes, a new leader for each of its partitions is elected from the ISR (in‑sync replica) list. If the ISR is empty, a leader can be chosen from any surviving replica only when unclean leader election is enabled, which risks losing messages the old leader had already accepted.
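The election preference described above can be sketched in a few lines of plain Python. This is an illustration of the logic, not Kafka source code; the names `Partition` and `elect_leader` are invented for this example.

```python
# Illustrative sketch of leader election for one partition after its
# leader's broker fails. All names are invented for this example.

class Partition:
    def __init__(self, replicas, isr, unclean_election=False):
        self.replicas = replicas              # all replica broker IDs
        self.isr = isr                        # in-sync replica broker IDs
        self.unclean_election = unclean_election

def elect_leader(partition, live_brokers):
    """Prefer an in-sync replica; fall back to any live replica only
    if unclean leader election is enabled (risking data loss)."""
    for broker in partition.isr:
        if broker in live_brokers:
            return broker
    if partition.unclean_election:
        for broker in partition.replicas:
            if broker in live_brokers:
                return broker                 # may lag the old leader
    return None                               # partition goes offline

# Broker 1 (old leader) dies; broker 2 is still in the ISR.
p = Partition(replicas=[1, 2, 3], isr=[1, 2])
print(elect_leader(p, live_brokers={2, 3}))   # -> 2
```

With an empty surviving ISR and unclean election disabled, `elect_leader` returns `None`, which corresponds to the partition going offline rather than risking data loss.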
How Many Replicas Are Sufficient?
Three replicas are generally enough to guarantee high availability; more replicas increase resource consumption and may degrade performance.
What If Followers Are Not Fully Synchronized with the Leader?
Kafka uses the ISR mechanism. The leader maintains an ISR list of followers that are sufficiently up‑to‑date; a follower that has not caught up within `replica.lag.time.max.ms` is removed from the ISR. This ensures only synchronized replicas are considered for leader election.
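The pruning rule can be sketched as a pure function: drop any follower whose last fully-caught-up timestamp is older than the allowed lag. This is a simplified illustration; the function name and data layout are invented for this example, and the 30-second default mirrors Kafka's `replica.lag.time.max.ms`.

```python
# Illustrative sketch of ISR shrinking on the leader. Followers that
# have not caught up within max_lag_ms are removed from the ISR.

def shrink_isr(isr, last_caught_up_ms, now_ms, max_lag_ms=30_000):
    """Return the ISR with laggy followers removed.
    last_caught_up_ms maps follower ID -> last time (ms) it was fully
    caught up with the leader's log end offset."""
    return [f for f in isr
            if now_ms - last_caught_up_ms.get(f, 0) <= max_lag_ms]

now = 100_000
caught_up = {2: 95_000, 3: 40_000}   # follower 3 is 60 s behind
print(shrink_isr([2, 3], caught_up, now))   # -> [2]
```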
Leader Election After a Broker Failure
Kafka’s controller selects a new leader from the ISR list and propagates the change with a bumped leader epoch, so a deposed leader that comes back cannot keep acting as leader; this fencing prevents split‑brain scenarios.
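The fencing idea reduces to a monotonically increasing epoch check: replicas reject requests stamped with an epoch older than the one they have seen. A minimal sketch (class and method names invented for this example):

```python
# Illustrative sketch of leader-epoch fencing: a replica rejects
# writes from a deposed leader carrying a stale epoch number.

class Replica:
    def __init__(self):
        self.leader_epoch = 0

    def handle_write(self, epoch, record):
        if epoch < self.leader_epoch:
            raise ValueError("fenced: stale leader epoch")
        self.leader_epoch = epoch        # remember the newest epoch seen
        return "ok"

r = Replica()
r.handle_write(1, "a")       # leader elected at epoch 1
r.handle_write(2, "b")       # re-election bumps the epoch to 2
# r.handle_write(1, "c")     # the old leader would now be rejected
```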
3. ACK Settings Determine Reliability
The producer configuration `request.required.acks` (commonly written as `acks`) controls how many replicas must acknowledge a write before it is considered successful:
0: the producer does not wait for any acknowledgment; messages may be lost.
1: only the leader's acknowledgment is required; if the leader fails before followers replicate, data can be lost. This is the default in many client versions.
all (or -1): every in‑sync replica must acknowledge the write, giving the strongest durability guarantee. However, if the ISR has shrunk to just the leader, `all` behaves like `1`, so it is usually paired with `min.insync.replicas`.
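As a concrete example, the durable end of these settings maps onto producer configuration roughly like this, shown as a config dict in the style of Python Kafka clients. The broker address is a placeholder, not from the article.

```python
# Hedged sketch: producer settings for the strongest durability.
# "localhost:9092" is a placeholder bootstrap address.
producer_config = {
    "bootstrap_servers": "localhost:9092",
    "acks": "all",    # wait for every in-sync replica (equivalent: -1)
    "retries": 5,     # retry transient failures instead of dropping
}
print(producer_config["acks"])   # -> all
```

With `acks="all"`, also set `min.insync.replicas` on the topic or broker (e.g. 2 in a three-replica setup) so that "all" cannot silently degrade to a single-replica acknowledgment when the ISR shrinks.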
4. Solving the Consumer‑Offset Problem
In the test environment, the cluster has three brokers, a topic with replication factor 3 and six partitions, and `acks=1`. When one broker fails, leaders for the data topic are re‑elected, but the internal `__consumer_offsets` topic had been created with a replication factor of 1, making it a single point of failure: if the broker holding its partitions dies, consumers can no longer commit or fetch offsets, and all consumers stop.
To fix this:
1. Delete the existing `__consumer_offsets` topic. Internal topics cannot be deleted with the standard command, so the corresponding log directories were cleared instead.
2. Set `offsets.topic.replication.factor=3` in the broker configuration so the topic is recreated with three replicas.
After `__consumer_offsets` is replicated, consumer groups keep working even when a broker goes down.
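The broker-side part of the fix can be captured in `server.properties`. The values below mirror the article's three-broker setup; `default.replication.factor` and `min.insync.replicas` are extra safeguards not mentioned in the original, shown here as suggestions.

```properties
# Recreate __consumer_offsets with three replicas
offsets.topic.replication.factor=3
# Give newly auto-created topics three replicas as well
default.replication.factor=3
# With acks=all, require at least two in-sync replicas per write
min.insync.replicas=2
```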
One question remains: why the `__consumer_offsets` partitions were initially placed on a single broker instead of being distributed across the cluster.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations transformation, accompanying you through your operations career.