
Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism

A Kafka cluster at vivo saw traffic drop to zero across an entire resource group after a single broker's disk failed. The default producer partitioner kept hashing keyed messages to partitions on the failed broker, which exhausted the client buffer and blocked sends to all healthy partitions. The recommended fixes are to avoid message keys where possible or to use a custom partitioner.

vivo Internet Technology

This article details the analysis and resolution of a Kafka cluster fault at vivo, in which traffic on multiple topics dropped to zero after a single broker's disk failed.

Deployment Architecture: The Kafka deployment handles trillions of messages daily and is split into multiple clusters by business dimension. Each cluster is divided into logical "resource groups": nodes within a group share resources, while groups are isolated from one another so that a failure in one group cannot cascade into the others.

Fault Symptoms: When a disk failed on one Kafka broker node, traffic on nearly all topics in that resource group dropped to zero. This was unexpected: a topic's partitions are distributed across multiple brokers, so the failure of one broker should affect only the partitions it hosts, not all of them.

Root Cause Analysis: The investigation revealed that the problem was not the disk failure itself but a cascading effect inside the Kafka producer client. For keyed messages, the default partitioner takes a hash of the key modulo the count of ALL partitions, including those on the failed broker, rather than routing only to healthy partitions.

Technical Deep Dive: The Kafka producer batches messages in a client-side buffer before sending them. When a broker becomes unavailable, the default partitioner keeps assigning keyed messages to its partitions, where they sit in the buffer until they time out. These stalled messages gradually consume the buffer pool, which is shared by all partitions, so sends to healthy partitions can no longer acquire buffer space and traffic collapses completely.
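
The buffer exhaustion can be illustrated with a toy model. The sketch below is not Kafka code: the function name `simulate_shared_buffer`, the slot-based pool, and the pluggable `hash_fn` (defaulting to `zlib.crc32` as a stand-in for the client's real key hash) are all assumptions made for illustration. It also simplifies timing: a send to a healthy partition frees its slot immediately, while a slot held for a failed partition is never released during the run.

```python
import zlib


def simulate_shared_buffer(keys, num_partitions, failed_partitions,
                           pool_slots, hash_fn=zlib.crc32):
    """Toy model of one buffer pool shared by every partition.

    Each record needs one slot. Records for healthy partitions are
    "delivered" and give their slot back at once; records for failed
    partitions keep their slot (they are waiting to time out).
    """
    free_slots = pool_slots
    stats = {"delivered": 0, "stuck": 0, "blocked": 0}
    for key in keys:
        # Keyed routing: hash over ALL partitions, healthy or not.
        partition = hash_fn(key) % num_partitions
        if free_slots == 0:
            # Pool exhausted: even records for healthy partitions are blocked.
            stats["blocked"] += 1
            continue
        if partition in failed_partitions:
            free_slots -= 1          # slot is held and never returned
            stats["stuck"] += 1
        else:
            stats["delivered"] += 1  # slot acquired and released immediately
    return stats


if __name__ == "__main__":
    keys = [f"user-{i}".encode() for i in range(10_000)]
    # Hypothetical numbers: 6 partitions, partition 2 on the failed broker.
    print(simulate_shared_buffer(keys, num_partitions=6,
                                 failed_partitions={2}, pool_slots=100))
```

In the Java producer the pool is sized by `buffer.memory`, and a `send()` call that cannot obtain memory blocks for up to `max.block.ms` before failing. The model collapses that wait into the `blocked` counter.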

Key Findings from Source Code Analysis:

If a partition is explicitly specified, messages go directly to that partition

If a key is specified, the partition is hash(key) % partition_count, computed over ALL partitions, so keyed messages can still be routed to partitions on a failed broker

If no key is specified, messages are distributed round-robin across the available partitions only
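
These three rules can be sketched as follows. This is a simplified model rather than the client's source: the class name `SketchPartitioner`, its method signature, and the use of `zlib.crc32` as the key hash are assumptions for illustration (the Java client's default partitioner hashes the key bytes with murmur2).

```python
import itertools
import zlib


class SketchPartitioner:
    """Simplified model of the three partition-selection rules."""

    def __init__(self, hash_fn=zlib.crc32):
        self._hash = hash_fn               # stand-in for the client's key hash
        self._counter = itertools.count()  # drives round-robin for keyless sends

    def partition(self, all_partitions, available, key=None, explicit=None):
        # Rule 1: an explicitly chosen partition is used as-is.
        if explicit is not None:
            return explicit
        # Rule 2: a key is hashed over ALL partitions, healthy or not.
        if key is not None:
            return all_partitions[self._hash(key) % len(all_partitions)]
        # Rule 3: no key, so rotate over the available partitions
        # (falling back to all partitions if none is reported available).
        candidates = available or all_partitions
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    all_partitions = [0, 1, 2, 3, 4, 5]
    available = [0, 1, 3, 4, 5]  # hypothetical: partition 2 is on the failed broker
    p = SketchPartitioner()
    keyed = {p.partition(all_partitions, available, key=f"k{i}".encode())
             for i in range(100)}
    keyless = {p.partition(all_partitions, available) for _ in range(100)}
    print("keyed sends reached:", sorted(keyed))
    print("keyless sends reached:", sorted(keyless))
```

Only the keyless path consults the `available` list, which is why keyless traffic steers around the failed broker while keyed traffic does not.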

Recommendations:

Avoid specifying message keys unless they are necessary: keyed messages can still be routed to partitions on a failed broker, which can trigger a cascading failure across all partitions

If keys are required, implement a custom partitioner that excludes partitions on failed brokers from routing
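
One possible shape for such a custom partitioner is sketched below. It is an illustration, not the approach vivo deployed: the function name `availability_aware_partition`, its three-argument signature, and the `zlib.crc32` hash are hypothetical. In the Java client the equivalent would be a class implementing the `org.apache.kafka.clients.producer.Partitioner` interface, registered through the `partitioner.class` setting.

```python
import zlib


def availability_aware_partition(key, all_partitions, available_partitions,
                                 hash_fn=zlib.crc32):
    """Pick a partition for a keyed record, avoiding unavailable ones."""
    h = hash_fn(key)
    # Normal case: the key's usual "home" partition, hashed over all partitions.
    home = all_partitions[h % len(all_partitions)]
    if home in available_partitions or not available_partitions:
        # Home is healthy, or nothing is reported available: keep the default.
        return home
    # Home is unavailable: re-hash over the available partitions only.
    return available_partitions[h % len(available_partitions)]


if __name__ == "__main__":
    all_partitions = [0, 1, 2, 3, 4, 5]
    available = [0, 1, 3, 4, 5]  # hypothetical: partition 2 is on the failed broker
    chosen = {availability_aware_partition(f"k{i}".encode(),
                                           all_partitions, available)
              for i in range(1000)}
    print("partitions used while 2 is down:", sorted(chosen))
```

The trade-off is that a key whose home partition is down is temporarily sent to a different partition, so per-key ordering and key-to-partition affinity do not hold across the failover. Whether that is acceptable depends on why the keys were needed in the first place.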
