Kafka Crash and High‑Availability Issues: Replica Design, ISR, and Consumer Offset Problems

This article explains why a single Kafka broker failure can render the whole cluster unavailable, covering Kafka's multi‑replica architecture, the ISR mechanism, leader election, producer acknowledgment settings, and the special handling required for the internal __consumer_offsets topic.


Kafka Crash Leads to High‑Availability Issues

The story starts with a Kafka node outage at a fintech company that uses Kafka (rather than RabbitMQ) for log processing: a single broker failure caused all consumers to stop receiving messages, even though the remaining two brokers were still up.

Kafka's Multi‑Replica Redundancy Design

Like many distributed systems (e.g., ZooKeeper, Redis, HDFS), Kafka achieves high availability through replica redundancy. Key concepts include:

Physical model

Broker (node): a single Kafka server.

Logical model

Topic: a logical channel identified by a topic name.

Partition: each topic is split into one or more partitions; a partition belongs to exactly one broker.

Offset: the position of a message within a partition, used by consumers to track their read progress.
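The relationship between partitions and offsets can be sketched with a toy in‑memory model (illustrative only; the `Partition` class here is invented for this sketch, and real Kafka persists the log on disk):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log.

    Each appended message receives the next sequential offset.
    """

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message

    def read(self, offset):
        return self.log[offset]


# A topic is just a named set of partitions.
topic = {0: Partition(), 1: Partition()}

off_a = topic[0].append("payment-created")   # offset 0
off_b = topic[0].append("payment-settled")   # offset 1
print(topic[0].read(off_b))                  # payment-settled
```

A consumer remembers only the last offset it has processed per partition, which is exactly the bookkeeping the __consumer_offsets topic exists to store.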

Before Kafka 0.8 there was no replication; a broker crash meant loss of all partitions on that broker. Since 0.8, each partition has a Leader replica and one or more Follower replicas. Producers and consumers interact only with the leader; followers replicate data from the leader.

When a broker (and thus the leaders it hosts) fails, Kafka selects a new leader for each affected partition from its ISR (In‑Sync Replica) list. If the ISR list is empty and unclean leader election is enabled, a leader is chosen from any surviving replica, which risks data loss.
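The failover rule above can be sketched as a small function (a simplification of the controller's real logic; the `allow_unclean` flag stands in for the broker setting unclean.leader.election.enable):

```python
def elect_leader(isr, surviving_replicas, allow_unclean=False):
    """Pick a new partition leader after the old leader's broker fails.

    Prefer a surviving in-sync replica (no data loss); optionally fall
    back to any surviving replica (possible data loss), mirroring
    Kafka's unclean leader election.
    """
    for replica in isr:
        if replica in surviving_replicas:
            return replica
    if allow_unclean and surviving_replicas:
        return surviving_replicas[0]  # may lag behind the old leader
    return None  # partition goes offline


# Broker 1 (old leader) died; broker 2 is still in sync.
print(elect_leader(isr=[1, 2], surviving_replicas=[2, 3]))  # 2
# No in-sync replica survived: offline unless unclean election is allowed.
print(elect_leader(isr=[1], surviving_replicas=[3]))        # None
print(elect_leader(isr=[1], surviving_replicas=[3], allow_unclean=True))  # 3
```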

Acknowledgment (acks) Parameter Determines Reliability

The producer setting request.required.acks controls when a send is considered successful:

0 : fire‑and‑forget; messages may be lost.

1 : only the leader must acknowledge (default); if the leader crashes before followers sync, messages can be lost.

all (or -1): the leader waits for every replica in the ISR to acknowledge; this provides the strongest durability, but it is only meaningful when combined with min.insync.replicas of at least 2, since the ISR can otherwise shrink to the leader alone.
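As an illustration, the durable end of this trade‑off looks like the following producer configuration (shown for the modern Java producer, where the setting is simply named acks; request.required.acks is the older name, and exact defaults vary by Kafka version):

```properties
# Producer side: wait for every in-sync replica to acknowledge.
acks=all

# Broker or topic side (not a producer setting): refuse writes unless
# at least this many replicas are in sync, so acks=all has teeth.
# min.insync.replicas=2
```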

Resolving the Real Problem

In the author's test environment there are 3 brokers and a topic with replication factor 3, 6 partitions, and acks=1. When one broker goes down, the cluster elects new leaders for that topic's partitions, but the internal __consumer_offsets topic in this cluster has a replication factor of 1, making it a single point of failure. Consumers stop because their committed offsets become unavailable.

Two solutions are proposed:

Delete the problematic __consumer_offsets topic (it cannot be removed directly; the author instead removed its log files from disk).

Set offsets.topic.replication.factor to 3 so that __consumer_offsets is also created with three replicas, eliminating the single point of failure.
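A minimal broker‑side sketch of the second fix (in server.properties on each broker; note that this setting only applies when the offsets topic is first created, so an already‑created single‑replica topic must be reassigned or recreated):

```properties
# server.properties
offsets.topic.replication.factor=3
```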

Understanding why the __consumer_offsets partitions ended up on a single broker requires deeper investigation of the cluster's partition assignment strategy.
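For background, each consumer group maps to one partition of __consumer_offsets deterministically: Kafka computes abs(groupId.hashCode()) % offsets.topic.num.partitions (50 by default). Re‑implementing Java's String.hashCode makes the mapping reproducible outside the JVM (a sketch; the helper names are invented here):

```python
def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode() with 32-bit wrap-around."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h > 0x7FFFFFFF else h  # to signed 32-bit


def offsets_partition_for_group(group_id: str, num_partitions: int = 50) -> int:
    """Which __consumer_offsets partition stores this group's commits.

    Kafka masks to a non-negative value (n & 0x7FFFFFFF) before the
    modulo, so the result is always a valid partition index.
    """
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % num_partitions


print(offsets_partition_for_group("a"))  # hashCode("a") = 97 -> 97 % 50 = 47
```

So a group's offsets live entirely on whichever broker leads that one partition, which is why a single‑replica offsets topic can take out every group hashed onto the dead broker.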

Tags: distributed systems, high availability, Kafka, replication, ISR, consumer offsets
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.