Kafka Crash and High‑Availability Issues: Replica Design, ISR, and Consumer Offset Problems
The article explains why a single Kafka broker failure can render the whole cluster unavailable, detailing Kafka's multi‑replica architecture, ISR mechanism, leader election, producer acknowledgment settings, and the special handling required for the internal __consumer_offsets topic.
Kafka Crash Leads to High‑Availability Issues
The story starts with a Kafka node outage in a fintech company that uses Kafka (instead of RabbitMQ) for log processing, where a broker failure caused all consumers to stop receiving messages despite the remaining two brokers being up.
Kafka's Multi‑Replica Redundancy Design
Like many distributed systems (e.g., ZooKeeper, Redis, HDFS), Kafka achieves high availability through replica redundancy. Key concepts include:
(Diagrams in the original: the cluster's physical model and logical model.)
Broker (node): a single Kafka server.
Topic: a logical channel identified by a Topic Name.
Partition : each topic is split into one or more partitions; a partition belongs to exactly one broker.
Offset : the position of a message within a partition, used by consumers to read.
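The relationship between topics, partitions, and offsets can be sketched with a hypothetical in‑memory model (illustration only, not the real broker implementation; the `Partition` class and its methods are invented for this example):

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """Toy model: a partition is an append-only log; the offset is a
    message's index within that one partition."""
    topic: str
    id: int
    messages: list = field(default_factory=list)  # list index == offset

    def append(self, msg: str) -> int:
        """Append a message and return its offset within this partition."""
        self.messages.append(msg)
        return len(self.messages) - 1

    def read(self, offset: int) -> str:
        """Consumers read by (partition, offset)."""
        return self.messages[offset]

# A topic named "orders" split into 2 partitions.
p0 = Partition("orders", 0)
p1 = Partition("orders", 1)

assert p0.append("order-1") == 0   # first message in partition 0 gets offset 0
assert p0.append("order-2") == 1   # offsets grow per partition, not per topic
assert p1.append("order-3") == 0   # partition 1 starts again at offset 0
assert p0.read(1) == "order-2"
```

Note the design point the sketch makes concrete: offsets are meaningful only within a single partition, which is why a consumer must track one offset per partition.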
Before Kafka 0.8 there was no replication; a broker crash meant loss of all partitions on that broker. Since 0.8, each partition has a Leader replica and one or more Follower replicas. Producers and consumers interact only with the leader; followers replicate data from the leader.
When a broker (and thus its leader) fails, Kafka selects a new leader from the ISR (In‑Sync Replica) list. If the ISR list is empty, a leader may be chosen from any surviving replica (controlled by the broker setting unclean.leader.election.enable), which risks losing messages the old leader had but the new one never copied.
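The election decision can be sketched as follows (a simplified illustration; the real controller logic lives inside the Kafka broker, and `elect_leader` is an invented function, not a Kafka API):

```python
def elect_leader(replicas, isr, alive_brokers, allow_unclean=False):
    """Pick a new leader for a partition after its current leader fails.

    replicas      - all broker ids holding a replica, in assignment order
    isr           - broker ids currently in the In-Sync Replica list
    alive_brokers - broker ids still reachable
    """
    # Preferred path: the first surviving replica that is still in the ISR.
    for broker in replicas:
        if broker in isr and broker in alive_brokers:
            return broker, False  # clean election, no data loss
    # ISR exhausted: optionally fall back to any surviving replica.
    # This "unclean" election may lose messages the dead leader held
    # that the chosen replica never replicated.
    if allow_unclean:
        for broker in replicas:
            if broker in alive_brokers:
                return broker, True
    return None, False  # partition stays offline

# Broker 1 (the leader) dies; broker 2 is in the ISR, broker 3 lags behind.
leader, unclean = elect_leader(replicas=[1, 2, 3], isr={1, 2}, alive_brokers={2, 3})
assert (leader, unclean) == (2, False)

# Only the lagging replica survives: electing it risks data loss.
leader, unclean = elect_leader([1, 2, 3], isr={1}, alive_brokers={3}, allow_unclean=True)
assert (leader, unclean) == (3, True)
```

The trade-off is availability versus durability: with unclean election disabled, the partition stays offline until an ISR member returns, but no acknowledged data is lost.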
Acknowledgment (acks) Parameter Determines Reliability
The producer setting request.required.acks (named simply acks in the newer Java producer) controls when a send is considered successful:
0 : fire‑and‑forget; messages may be lost.
1 : only the leader must acknowledge (default); if the leader crashes before followers sync, messages can be lost.
all (or -1): the leader waits until every replica in the ISR has acknowledged; this provides the strongest durability, but it only guarantees no data loss when min.insync.replicas is set to at least 2.
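The three modes above can be captured in a toy simulation (illustration only; `send` is an invented helper, not the real client API):

```python
def send(acks, leader_ok, follower_acks, isr_size):
    """Return True if the producer considers the send successful.

    acks          - "0", "1", or "all"
    leader_ok     - whether the leader persisted the message
    follower_acks - number of ISR followers that replicated it
    isr_size      - current ISR size, leader included
    """
    if acks == "0":
        return True            # fire-and-forget: success is assumed
    if acks == "1":
        return leader_ok       # only the leader's write matters
    # acks=all: the leader plus every other ISR member must have the message
    return leader_ok and follower_acks >= isr_size - 1

assert send("0", leader_ok=False, follower_acks=0, isr_size=3) is True   # lost silently
assert send("1", leader_ok=True,  follower_acks=0, isr_size=3) is True   # lost if leader dies now
assert send("all", leader_ok=True, follower_acks=1, isr_size=3) is False # a follower is lagging
assert send("all", leader_ok=True, follower_acks=2, isr_size=3) is True
```

Note the subtlety the last two asserts show: acks=all only waits for the *current* ISR, so if the ISR has shrunk to the leader alone, acks=all degenerates to acks=1 unless min.insync.replicas forces the send to fail instead.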
Resolving the Real Problem
In the author's test environment there are 3 brokers, a topic with replication factor 3, 6 partitions, and acks=1. When one broker goes down, the cluster elects new leaders for the topic's partitions, but the internal __consumer_offsets topic had been created with a replication factor of 1 (common in default quickstart configurations), making it a single point of failure. Consumers stop because their committed offset information becomes unavailable.
Two solutions are proposed:
Delete the problematic __consumer_offsets topic (it cannot be removed directly; the author deleted its log files on disk instead).
Set offsets.topic.replication.factor to 3 so that __consumer_offsets also has three replicas, eliminating the single point of failure.
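For the second fix, the relevant broker settings in server.properties would look roughly like this (a sketch matching the article's 3‑broker setup; the two companion settings are included as assumptions worth checking rather than part of the author's fix):

```properties
# server.properties on each of the 3 brokers
# Replicate the internal offsets topic across all brokers so that
# losing one broker does not take consumer offsets offline.
offsets.topic.replication.factor=3
# Related safety nets worth reviewing in the same pass:
transaction.state.log.replication.factor=3   # internal transactions topic
default.replication.factor=3                 # newly auto-created topics
```

Note that offsets.topic.replication.factor only takes effect when the topic is first created; for an already-existing __consumer_offsets topic, the replication factor has to be raised with a partition reassignment instead.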
Understanding why the __consumer_offset partitions were placed on a single broker requires deeper investigation of the cluster’s partition assignment strategy.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.