Big Data · 8 min read

Understanding Kafka High Availability and Resolving Consumer Offset Issues

This article explains Kafka's high‑availability architecture, including multi‑replica design, ISR synchronization, leader election, acks configuration, and how misconfigured __consumer_offsets replication can cause consumer outages, offering practical steps to ensure reliable message delivery.


The article begins with a real‑world incident where a Kafka broker failure caused all consumers to stop receiving messages, prompting an investigation into Kafka's high‑availability mechanisms.

It introduces core Kafka concepts—Broker (node), Topic, Partition, and Offset—illustrating how messages are stored and consumed across the cluster.
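To make these terms concrete, here is a minimal, broker‑free Python sketch (the class and method names are illustrative, not Kafka client APIs) modeling a topic as a set of append‑only partitions, with a message's offset being its position in the partition log:

```python
# Illustrative model only -- not the Kafka client API.
# A topic is split into numbered partitions; each partition is an
# append-only log, and a message's position in that log is its offset.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = {p: [] for p in range(num_partitions)}

    def append(self, partition, message):
        """Append a message; its offset is its index in the partition log."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def read(self, partition, offset):
        """Consumers pull by (partition, offset) and advance independently."""
        return self.partitions[partition][offset]

orders = Topic("orders", num_partitions=3)
first = orders.append(0, "order-1001")   # offset 0 in partition 0
second = orders.append(0, "order-1002")  # offset 1 in partition 0
```

Because each consumer group only remembers its own offsets, two groups can read the same partition at different positions without interfering.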

Kafka achieves HA through multi‑replica redundancy: each partition has a leader and one or more followers; if a broker crashes, a follower from the ISR (In‑Sync Replica) list is elected as the new leader.
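The failover rule can be sketched in a few lines of Python (an illustrative simulation, not Kafka's internals): when the leader's broker dies, the controller promotes the first ISR member that is still hosted on a live broker.

```python
# Illustrative sketch of leader failover -- not Kafka source code.
# The ISR lists the brokers whose replicas are caught up with the leader.

def elect_leader(isr, live_brokers):
    """Pick the first in-sync replica hosted on a live broker."""
    for broker in isr:
        if broker in live_brokers:
            return broker
    return None  # no eligible replica left: the partition goes offline

isr = [1, 2, 3]   # broker 1 currently leads this partition
live = {2, 3}     # broker 1 has just crashed
new_leader = elect_leader(isr, live)  # broker 2 takes over
```

Note the degenerate case: if every ISR replica lived on the failed broker, no leader can be elected and the partition becomes unavailable, which foreshadows the offsets‑topic problem discussed below.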

While more replicas improve resilience, a replication factor of three is generally sufficient; increasing it further consumes more network and disk resources.

The ISR mechanism ensures that only followers that are sufficiently synchronized with the leader remain in the ISR list, preventing data loss from lagging replicas.
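The eviction rule hinges on the broker setting replica.lag.time.max.ms (30 seconds by default): a follower that has not caught up with the leader within that window is dropped from the ISR. A hedged sketch of that check, again as a simulation rather than Kafka source:

```python
# Illustrative sketch, not Kafka internals: a follower stays in the ISR
# only if it has caught up with the leader within replica.lag.time.max.ms.

REPLICA_LAG_TIME_MAX_MS = 30_000  # Kafka's default lag window (30 s)

def in_sync(last_caught_up_ms, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """A follower is in sync if its last catch-up was recent enough."""
    return (now_ms - last_caught_up_ms) <= max_lag_ms

# A follower 5 s behind stays in the ISR; one 40 s behind is evicted.
fresh = in_sync(last_caught_up_ms=5_000, now_ms=10_000)
stale = in_sync(last_caught_up_ms=0, now_ms=40_000)
```

Evicted followers can rejoin the ISR once they catch up again, so a slow replica degrades redundancy only temporarily.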

Leader election follows rules similar to Zookeeper's Zab or Raft, selecting the first ISR replica as the new leader and using a controller to avoid split‑brain scenarios.

The article also covers the producer request.required.acks setting (acks in modern clients), describing its three possible values (0, 1, all/-1) and how they trade durability off against throughput.
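The trade‑off can be summarized as "how many replica confirmations does the producer wait for before treating a send as successful". A small illustrative helper (not the producer API) makes the three settings explicit:

```python
# Illustrative simulation of the acks trade-off -- not the producer API.
# acks=0: fire and forget; acks=1: wait for the leader's local write;
# acks="all" (or -1): wait for every current in-sync replica.

def required_acks(acks, isr_size):
    """Number of replica confirmations the producer waits for."""
    if acks == 0:
        return 0            # highest throughput, loss possible in transit
    if acks == 1:
        return 1            # leader only; lost if the leader dies unsynced
    if acks in ("all", -1):
        return isr_size     # strongest durability, lowest throughput
    raise ValueError(f"unknown acks setting: {acks!r}")

waits_for = required_acks("all", isr_size=3)  # wait for all 3 ISR members
```

A subtlety worth remembering: with acks=all the guarantee is only as strong as the current ISR, which is why acks=all is usually paired with min.insync.replicas to enforce a floor on ISR size.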

A critical issue is identified: the internal __consumer_offsets topic often has a replication factor of 1, creating a single point of failure that can halt every consumer group when the broker hosting it goes down.

To resolve this, the article recommends deleting the faulty __consumer_offsets log directories (the internal topic cannot be removed with the usual topic commands) and setting offsets.topic.replication.factor=3 so the offsets topic is replicated across brokers, restoring consumer availability after a broker failure.
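The fix reduces to a broker‑side setting applied before the offsets topic is (re)created. A sketch of the relevant server.properties entries (the values shown are commonly recommended defaults, not taken from the article's cluster):

```properties
# Replicate the internal offsets topic across three brokers so a single
# broker failure no longer takes every consumer group offline.
offsets.topic.replication.factor=3

# Optional hardening: require all offset-topic replicas to acknowledge
# an offset commit before it is considered successful (broker default).
offsets.commit.required.acks=-1
```

Note that this setting only takes effect when the offsets topic is created; an existing single‑replica __consumer_offsets topic must be removed (as the article describes) or have its partitions reassigned for the new replication factor to apply.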

Tags: distributed systems, high availability, Streaming, Kafka, Replication, consumer offset
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
