Big Data · 8 min read

Understanding Kafka High Availability and Resolving Consumer Offset Issues

This article explains Kafka's high‑availability architecture, including multi‑replica design, ISR synchronization, leader election, acks configuration, and how misconfigured __consumer_offsets replication can cause consumer outages, offering practical steps to ensure reliable message delivery.


The article begins with a real‑world incident where a Kafka broker failure caused all consumers to stop receiving messages, prompting an investigation into Kafka's high‑availability mechanisms.

It introduces core Kafka concepts—Broker (node), Topic, Partition, and Offset—illustrating how messages are stored and consumed across the cluster.
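To make these terms concrete, here is a minimal, broker‑free Python sketch (the class and method names are illustrative, not Kafka client APIs) modeling a topic as a set of append‑only partitions, with a message's offset being its position in the partition log:

```python
# Illustrative model only -- not the Kafka client API.
# A topic is split into numbered partitions; each partition is an
# append-only log, and a message's position in that log is its offset.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = {p: [] for p in range(num_partitions)}

    def append(self, partition, message):
        """Append a message; its offset is its index in the partition log."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def read(self, partition, offset):
        """Consumers pull by (partition, offset) and advance independently."""
        return self.partitions[partition][offset]

orders = Topic("orders", num_partitions=3)
first = orders.append(0, "order-1001")   # offset 0 in partition 0
second = orders.append(0, "order-1002")  # offset 1 in partition 0
```

Because each consumer group only remembers its own offsets, two groups can read the same partition at different positions without interfering.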

Kafka achieves HA through multi‑replica redundancy: each partition has a leader and one or more followers; if a broker crashes, a follower from the ISR (In‑Sync Replica) list is elected as the new leader.
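The failover rule can be sketched in a few lines of Python (an illustrative simulation, not Kafka's internals): when the leader's broker dies, the controller promotes the first ISR member that is still hosted on a live broker.

```python
# Illustrative sketch of leader failover -- not Kafka source code.
# The ISR lists the brokers whose replicas are caught up with the leader.

def elect_leader(isr, live_brokers):
    """Pick the first in-sync replica hosted on a live broker."""
    for broker in isr:
        if broker in live_brokers:
            return broker
    return None  # no eligible replica left: the partition goes offline

isr = [1, 2, 3]   # broker 1 currently leads this partition
live = {2, 3}     # broker 1 has just crashed
new_leader = elect_leader(isr, live)  # broker 2 takes over
```

Note the degenerate case: if every ISR replica lived on the failed broker, no leader can be elected and the partition becomes unavailable, which foreshadows the offsets‑topic problem discussed below.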

While more replicas improve resilience, a replication factor of three is generally sufficient; increasing it further consumes more network and disk resources.

The ISR mechanism ensures that only followers that are sufficiently synchronized with the leader remain in the ISR list, preventing data loss from lagging replicas.
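The eviction rule hinges on the broker setting replica.lag.time.max.ms (30 seconds by default): a follower that has not caught up with the leader within that window is dropped from the ISR. A hedged sketch of that check, again as a simulation rather than Kafka source:

```python
# Illustrative sketch, not Kafka internals: a follower stays in the ISR
# only if it has caught up with the leader within replica.lag.time.max.ms.

REPLICA_LAG_TIME_MAX_MS = 30_000  # Kafka's default lag window (30 s)

def in_sync(last_caught_up_ms, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """A follower is in sync if its last catch-up was recent enough."""
    return (now_ms - last_caught_up_ms) <= max_lag_ms

# A follower 5 s behind stays in the ISR; one 40 s behind is evicted.
fresh = in_sync(last_caught_up_ms=5_000, now_ms=10_000)
stale = in_sync(last_caught_up_ms=0, now_ms=40_000)
```

Evicted followers can rejoin the ISR once they catch up again, so a slow replica degrades redundancy only temporarily.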

Leader election follows rules similar to Zookeeper's Zab or Raft, selecting the first ISR replica as the new leader and using a controller to avoid split‑brain scenarios.

The article also covers the producer request.required.acks setting (acks in modern clients), describing its three possible values (0, 1, all/-1) and how they trade durability off against throughput.
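The trade‑off can be summarized as "how many replica confirmations does the producer wait for before treating a send as successful". A small illustrative helper (not the producer API) makes the three settings explicit:

```python
# Illustrative simulation of the acks trade-off -- not the producer API.
# acks=0: fire and forget; acks=1: wait for the leader's local write;
# acks="all" (or -1): wait for every current in-sync replica.

def required_acks(acks, isr_size):
    """Number of replica confirmations the producer waits for."""
    if acks == 0:
        return 0            # highest throughput, loss possible in transit
    if acks == 1:
        return 1            # leader only; lost if the leader dies unsynced
    if acks in ("all", -1):
        return isr_size     # strongest durability, lowest throughput
    raise ValueError(f"unknown acks setting: {acks!r}")

waits_for = required_acks("all", isr_size=3)  # wait for all 3 ISR members
```

A subtlety worth remembering: with acks=all the guarantee is only as strong as the current ISR, which is why acks=all is usually paired with min.insync.replicas to enforce a floor on ISR size.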

A critical issue is identified: the internal __consumer_offsets topic often has a replication factor of 1, creating a single point of failure that can halt every consumer group when the broker hosting it goes down.

To resolve this, the article recommends deleting the faulty __consumer_offsets log directories (the internal topic cannot be removed with the usual topic commands) and setting offsets.topic.replication.factor=3 so the offsets topic is replicated across brokers, restoring consumer availability after a broker failure.
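The fix reduces to a broker‑side setting applied before the offsets topic is (re)created. A sketch of the relevant server.properties entries (the values shown are commonly recommended defaults, not taken from the article's cluster):

```properties
# Replicate the internal offsets topic across three brokers so a single
# broker failure no longer takes every consumer group offline.
offsets.topic.replication.factor=3

# Optional hardening: require all offset-topic replicas to acknowledge
# an offset commit before it is considered successful (broker default).
offsets.commit.required.acks=-1
```

Note that this setting only takes effect when the offsets topic is created; an existing single‑replica __consumer_offsets topic must be removed (as the article describes) or have its partitions reassigned for the new replication factor to apply.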

Tags: distributed systems, high availability, Streaming, Kafka, Replication, consumer offset
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
