
Kafka Cluster Deployment Architecture, Fault Analysis, and Default Partitioner Behavior

This article walks through the deployment architecture of a multi-tenant Kafka cluster and its business onboarding process, describes the symptoms and monitoring metrics of a production incident, analyzes the root cause of a topic-wide traffic drop, and examines the default partitioner's rules to arrive at mitigation recommendations.


1. Kafka Cluster Deployment Architecture – The Kafka service is split into multiple clusters by business domain to handle tens of trillions of messages daily. Each cluster contains logical “resource groups” that isolate broker nodes while sharing resources within the group, preventing cascade failures.

2. Business Access Process – Projects register on the Kafka platform, optionally create an independent resource group for critical data, bind the project to a resource group, create topics via the platform API ensuring partitions reside on brokers within the bound group, and obtain read/write permissions.
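The platform-side constraint described above, that a new topic's partitions land only on brokers of the bound resource group, can be sketched as a round-robin replica assignment. The class and method names below are illustrative, not the platform's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceGroupAssigner {
    /**
     * Assigns each partition's replicas round-robin over the brokers of a
     * single resource group, so no replica is placed outside the group.
     *
     * @param groupBrokers      broker ids belonging to the bound resource group
     * @param partitions        number of partitions for the new topic
     * @param replicationFactor replicas per partition (must not exceed group size)
     * @return per-partition list of broker ids
     */
    public static List<List<Integer>> assign(List<Integer> groupBrokers,
                                             int partitions,
                                             int replicationFactor) {
        if (replicationFactor > groupBrokers.size()) {
            throw new IllegalArgumentException("replication factor exceeds group size");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Shift the starting broker per partition so leaders spread evenly.
                replicas.add(groupBrokers.get((p + r) % groupBrokers.size()));
            }
            assignment.add(replicas);
        }
        return assignment;
    }
}
```

In the real platform this assignment would be passed to Kafka's topic-creation API as an explicit replica assignment rather than letting the broker pick brokers cluster-wide.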

3. Fault Situation – During a night‑time incident, all topics in the affected resource group experienced near‑zero traffic. Disk metrics (READ, WRITE, IO.UTIL, AVG.WAIT, READ.REQ, WRITE.REQ) triggered alerts, and the issue persisted for an extended period.

4. Monitoring Metrics

Network idle rate dropped to zero on the faulty node, matching production traffic patterns.

Grafana showed topic production traffic falling to zero.

Kafka platform monitoring confirmed the same for multiple topics.

Disk IO.UTIL on the SDF disk reached 100 % and AVG.WAIT rose to minute‑level delays.

Controller logs reported Input/Output errors; Linux system logs showed Buffer I/O errors.

5. Fault Speculation and Analysis – The immediate hypothesis points to a failed SDF disk on one broker, but a single disk failure should only affect the partitions hosted there. Since Kafka distributes a topic's partitions across brokers, one broker's failure should not silence the entire topic; the complete traffic drop therefore suggests an additional amplification mechanism, an "avalanche", on the producer side.

6. Default Partitioner Rules

If a partition is explicitly specified, the message is sent directly to that partition.

If a key is provided, the key’s hash modulo the number of partitions determines the target partition (corresponds to the “second” speculation).

If neither partition nor key is set, an incrementing counter modulo the available partitions is used (corresponds to the “first” speculation).
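The three rules above can be condensed into a sketch. Kafka's real DefaultPartitioner hashes the serialized key with murmur2; the simple array hash and plain counter below are stand-ins to keep the example self-contained:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DefaultPartitionerSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    /**
     * @param explicitPartition   partition set on the record, or null
     * @param keyBytes            serialized key, or null
     * @param numPartitions       total partitions of the topic
     * @param availablePartitions partitions whose leader is currently reachable
     */
    public int partition(Integer explicitPartition, byte[] keyBytes,
                         int numPartitions, int[] availablePartitions) {
        // Rule 1: an explicitly specified partition wins.
        if (explicitPartition != null) {
            return explicitPartition;
        }
        // Rule 2: keyed messages hash over ALL partitions, even unavailable ones.
        if (keyBytes != null) {
            return (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }
        // Rule 3: key-less messages round-robin over AVAILABLE partitions only.
        int i = counter.getAndIncrement();
        return availablePartitions[(i & 0x7fffffff) % availablePartitions.length];
    }
}
```

The detail that matters for this incident is the asymmetry between rules 2 and 3: only the key-less path consults partition availability, so keyed traffic keeps targeting a partition whose leader is down.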

7. Summary

When a key is used with Kafka's default partitioner, messages are hashed over all partitions, including those whose leader broker is down. Batches destined for the dead partitions cannot be sent and accumulate in the producer's instance-level client buffer until it is exhausted, at which point sends to healthy partitions block as well, causing a topic-wide avalanche.

The investigated system indeed used keys with the default partitioner.

The hypothesis was validated.
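The avalanche mechanism, one stuck partition draining a buffer shared by the whole producer instance, can be simulated in a few lines. Sizes and names here are illustrative; in a real producer the shared buffer is sized by buffer.memory, and a blocked send waits up to max.block.ms:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedBufferSimulation {
    private final long capacity;  // analogue of the producer's buffer.memory
    private long used = 0;
    // Bytes buffered per partition; released only when that partition's leader acks.
    private final Map<Integer, Long> perPartition = new HashMap<>();

    public SharedBufferSimulation(long capacity) {
        this.capacity = capacity;
    }

    /** Try to buffer a batch for a partition; false means the send would block. */
    public boolean trySend(int partition, long bytes) {
        if (used + bytes > capacity) {
            return false;  // buffer exhausted: ALL partitions are now blocked
        }
        used += bytes;
        perPartition.merge(partition, bytes, Long::sum);
        return true;
    }

    /** A partition's leader acked: its buffered bytes are released. */
    public void drain(int partition) {
        used -= perPartition.getOrDefault(partition, 0L);
        perPartition.remove(partition);
    }
}
```

With keyed traffic, a fixed share of batches is pinned to the dead partition; those bytes never drain, `used` climbs to capacity, and `trySend` starts failing even for partitions on healthy brokers.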

8. Recommendations

Avoid specifying a key unless absolutely necessary.

If a key is required, replace the default partitioner with a custom one to prevent buffer exhaustion.
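One way to implement the recommended custom partitioner is to hash the key over the currently available partitions instead of all partitions. In a real producer this logic would live inside an implementation of the org.apache.kafka.clients.producer.Partitioner interface; it is shown here as a standalone method (with a stand-in hash) so the example stays self-contained:

```java
public class AvailabilityAwarePartitioner {
    /**
     * Hashes the key over the available partitions only, so a dead broker's
     * partitions stop receiving keyed traffic and cannot fill the shared buffer.
     *
     * @param keyBytes            serialized key (non-null for keyed messages)
     * @param availablePartitions partition ids whose leader is reachable
     */
    public static int partition(byte[] keyBytes, int[] availablePartitions) {
        if (availablePartitions.length == 0) {
            throw new IllegalStateException("no available partitions");
        }
        int hash = java.util.Arrays.hashCode(keyBytes) & 0x7fffffff;
        return availablePartitions[hash % availablePartitions.length];
    }
}
```

The trade-off is that the key-to-partition mapping shifts while a broker is down, so strict per-key ordering is weakened during the outage; for the incident described here, that is usually preferable to a topic-wide write stall.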

9. Extended Questions

Why does the default partitioner treat keyed messages differently from non‑keyed ones?

Can the producer buffer granularity be changed from instance‑level to partition‑level?

Further articles will explore these questions in depth.

Tags: monitoring, big data, Kafka, cluster, fault analysis, partitioner
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
