Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism
A Kafka cluster at vivo suffered a total traffic drop across a resource group when a single broker's disk failed. The default producer partitioner kept hashing keyed messages to partitions on the failed broker, which exhausted the client buffer and blocked sends to all healthy partitions. The resulting recommendations are to avoid message keys where possible or to use a custom partitioner.
This article details the analysis and resolution of a Kafka cluster fault at vivo, in which multiple topics lost all traffic after a single broker disk failure.
Deployment Architecture: The Kafka cluster handles trillions of messages daily, split into multiple clusters by business dimension. Each cluster contains logical "resource groups" where nodes within a group share resources while groups are isolated from each other to prevent cascading failures.
Fault Symptoms: When a disk failed on one Kafka broker node, traffic for nearly all topics in that resource group dropped to zero. This was unexpected: Kafka partitions are distributed across multiple brokers, so the failure of one broker should affect only the partitions it hosts, not all of them.
Root Cause Analysis: The investigation revealed that the issue was not the disk failure itself, but rather a cascading effect in the Kafka producer client. The default partitioner routes messages with specified keys using hash-based modulo operations across ALL partitions (including the failed one), rather than routing only to healthy partitions.
Technical Deep Dive: The Kafka producer buffers messages on the client side and batches them before sending. All partitions draw from one shared buffer pool. When a broker becomes unavailable, the default partitioner keeps assigning keyed messages to its partitions, and those messages sit in the buffer until they time out. The stuck batches gradually exhaust the shared pool, so sends to healthy partitions can no longer acquire buffer memory and traffic collapses completely.
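The buffer exhaustion described above can be illustrated with a toy model. This is a minimal sketch, not Kafka client code: BufferPool, simulate, and every parameter are hypothetical names chosen for illustration, rounds stand in for wall-clock time, and healthy partitions are assumed to send and free their batch within the same round.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BufferPool:
    """Toy stand-in for a producer's shared, fixed-size buffer (hypothetical)."""
    capacity: int
    used: int = 0

    def try_allocate(self, size: int) -> bool:
        # A real producer would block for a while here; the model just refuses.
        if self.used + size > self.capacity:
            return False
        self.used += size
        return True

    def release(self, size: int) -> None:
        self.used -= size


def simulate(num_partitions: int, failed_partition: Optional[int], rounds: int,
             batch_size: int, capacity: int, timeout_rounds: int) -> List[int]:
    """Return how many batches reach healthy partitions in each round."""
    pool = BufferPool(capacity)
    stuck = []  # (expiry_round, size) for batches waiting on the failed partition
    delivered_per_round = []
    for r in range(rounds):
        # Free batches whose timeout has elapsed.
        remaining = []
        for expiry, size in stuck:
            if expiry <= r:
                pool.release(size)
            else:
                remaining.append((expiry, size))
        stuck = remaining

        delivered = 0
        for p in range(num_partitions):
            if not pool.try_allocate(batch_size):
                continue  # no buffer left: this send cannot proceed
            if p == failed_partition:
                # Batch holds its memory until the timeout fires.
                stuck.append((r + timeout_rounds, batch_size))
            else:
                # Healthy partition: batch is sent and its memory is returned.
                pool.release(batch_size)
                delivered += 1
        delivered_per_round.append(delivered)
    return delivered_per_round


if __name__ == "__main__":
    # Partition 0 is down; timeout is much longer than the simulated window.
    print(simulate(4, 0, rounds=5, batch_size=1, capacity=3, timeout_rounds=100))
    # No failure, for comparison.
    print(simulate(4, None, rounds=5, batch_size=1, capacity=3, timeout_rounds=100))
```

With these illustrative numbers the first call prints [3, 3, 0, 0, 0] and the second prints [4, 4, 4, 4, 4]: once the stuck batches fill the pool, healthy partitions receive nothing even though they are fully functional.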
Key Findings from Source Code Analysis:
If a partition is explicitly specified, the message goes directly to that partition.
If a key is specified, the partition is hash(key) % partition_count. The modulo is taken over ALL partitions of the topic, including those on failed brokers.
If no key is specified, messages are distributed round-robin across the currently available partitions.
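The three branches above can be sketched as a small selection function. This is an illustrative model only, not the Kafka source: ToyDefaultPartitioner and its parameters are hypothetical, and zlib.crc32 is swapped in as a stand-in hash rather than the hash the real client uses.

```python
import itertools
import zlib
from typing import Callable, Optional, Sequence


class ToyDefaultPartitioner:
    """Toy model of the three routing branches described above (hypothetical)."""

    def __init__(self, hash_fn: Callable[[bytes], int] = zlib.crc32):
        self._hash_fn = hash_fn          # stand-in hash, an assumption
        self._counter = itertools.count()

    def partition(self, num_partitions: int, available: Sequence[int],
                  key: Optional[bytes] = None,
                  explicit: Optional[int] = None) -> int:
        # Branch 1: the caller named a partition, so use it as-is.
        if explicit is not None:
            return explicit
        # Branch 2: keyed message. The modulo runs over ALL partitions,
        # so the result may be a partition on a failed broker.
        if key is not None:
            return self._hash_fn(key) % num_partitions
        # Branch 3: no key. Rotate over the available partitions only,
        # falling back to all partitions if none are available.
        n = next(self._counter)
        if available:
            return available[n % len(available)]
        return n % num_partitions


if __name__ == "__main__":
    partitioner = ToyDefaultPartitioner()
    available = [0, 1, 3, 4, 5]  # partition 2 sits on the failed broker
    keyed = {partitioner.partition(6, available, key=f"user-{i}".encode())
             for i in range(1000)}
    keyless = {partitioner.partition(6, available) for _ in range(1000)}
    print("keyed messages can hit:", sorted(keyed))
    print("keyless messages hit:  ", sorted(keyless))
```

The point of the comparison is that only the keyless branch consults the availability list; the keyed branch never looks at it.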
Recommendations:
Avoid specifying keys in messages unless necessary, since keyed messages continue to be routed to failed partitions and can trigger a cascading failure across all partitions.
If keys are required, implement a custom partitioner that excludes partitions on failed brokers from routing.
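One possible shape for such a custom partitioner is sketched below. It is an assumption-laden illustration, not the implementation used at vivo: availability_aware_partition and its parameters are hypothetical, and zlib.crc32 again stands in for the real hash. In the Java client the equivalent logic would live in a class implementing the Partitioner interface and be registered through the partitioner.class producer setting.

```python
import zlib
from typing import Callable, Sequence


def availability_aware_partition(key: bytes,
                                 all_partitions: Sequence[int],
                                 available_partitions: Sequence[int],
                                 hash_fn: Callable[[bytes], int] = zlib.crc32) -> int:
    """Route a keyed message, avoiding unavailable partitions (hypothetical sketch)."""
    h = hash_fn(key)
    # Normal case: same mapping as plain hash-modulo over all partitions.
    preferred = all_partitions[h % len(all_partitions)]
    if preferred in available_partitions or not available_partitions:
        # Either the target is healthy, or nothing is healthy and
        # there is no better choice than the usual target.
        return preferred
    # The usual target is down: re-hash over the healthy partitions only.
    candidates = sorted(available_partitions)
    return candidates[h % len(candidates)]


if __name__ == "__main__":
    all_parts = [0, 1, 2, 3, 4, 5]
    healthy = [0, 1, 3, 4, 5]  # partition 2 is on the failed broker
    for i in range(5):
        key = f"order-{i}".encode()
        print(key, "->", availability_aware_partition(key, all_parts, healthy))
```

The design choice here is to keep the key-to-partition mapping unchanged while the cluster is healthy and to reroute only keys whose usual partition is down. The trade-off is that rerouted keys temporarily land on a different partition, so per-key ordering is not preserved across the failure window; whether that is acceptable depends on why keys were required in the first place.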
vivo Internet Technology