Comprehensive Governance and Optimization Strategies for Large‑Scale Kafka Clusters
To tame a petabyte‑scale Kafka deployment of over 1,000 brokers, the team built a Raft‑based federation controller (Guardian) that adds per‑partition I/O throttling, disk‑aware automatic balancing, multi‑tenant isolation, cross‑IDC migration, request‑queue splitting, tiered storage, auditing, and fully automated rolling upgrades, enabling stable, self‑healing operations.
1. Background
Kafka is a critical piece of data middleware in the company, used for reporting, buffering, and distributing massive amounts of data. The deployment consists of more than 1,000 Kafka machines forming over 20 clusters, with heterogeneous hardware (HDD, SSD, NVMe) and daily traffic at the petabyte level. As the cluster size grew, a series of stability and operational challenges emerged.
2. Challenges and Pain Points
Unpredictable client read/write patterns lead to I/O saturation and resource contention.
Multi‑tenant workloads interfere with each other, expanding the “blast radius” of failures.
Open‑source rate‑limiting is coarse‑grained and cannot react to real‑time disk metrics.
Cluster scaling and machine up/down processes are cumbersome and slow.
Partition assignment ignores disk‑level load, causing imbalance across machines and disks.
Lack of automatic rebalancing and migration‑rate control harms real‑time performance.
Single IDC cannot meet the company’s growing demand; cross‑IDC coordination is required.
Only one request‑handling thread pool exists; slow requests can block the whole pool.
3. Thinking and Solutions
3.1 Guardian – Kafka Federation (Cluster Controller)
Guardian is a self‑developed federation controller that uses Raft for high availability and consistency. It collects metrics from Kafka brokers, performs analysis, and executes governance plans, including:
Metadata management for federation clusters.
Remote storage metadata handling.
UUID (topicId, segmentId) allocation.
Cluster‑wide scheduling based on collected metrics.
Multi‑tenant label isolation.
Failure alerting and self‑healing.
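The collect-analyze-execute cycle described above can be sketched as a simple control loop. The metric names, thresholds, and action labels below are illustrative assumptions, not Guardian's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class BrokerMetrics:
    """Hypothetical per-broker metrics Guardian might collect."""
    broker_id: int
    disk_util: float       # disk utilization, 0.0 - 1.0
    p99_latency_ms: float  # request tail latency

def plan_actions(metrics, util_limit=0.85, latency_limit_ms=200.0):
    """Analysis step: flag brokers breaching thresholds and emit governance actions."""
    actions = []
    for m in metrics:
        if m.disk_util > util_limit:
            actions.append(("throttle", m.broker_id))
        if m.p99_latency_ms > latency_limit_ms:
            actions.append(("rebalance", m.broker_id))
    return actions

metrics = [
    BrokerMetrics(1, 0.92, 120.0),  # disk saturated -> throttle
    BrokerMetrics(2, 0.40, 350.0),  # latency breach -> rebalance
    BrokerMetrics(3, 0.50, 80.0),   # healthy
]
print(plan_actions(metrics))  # [('throttle', 1), ('rebalance', 2)]
```

In the real system this loop would run continuously, with the execution step applying the plan through the throttling and balancing mechanisms described in the following sections.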
3.2 Cluster‑Level Governance
3.2.1 Partition‑Level Throttling Protection
Kafka is I/O‑intensive; uncontrolled reads/writes can saturate disks. The open‑source throttling is too coarse. The new logic adds per‑partition throttling based on real‑time disk I/O and latency metrics. An estimation algorithm calculates the time a data segment can stay in PageCache (T) and decides whether a read will hit disk. If a disk exceeds thresholds, throttling is applied to keep I/O and latency within acceptable ranges.
Key concepts:
Six I/O behaviors: user read/write, replica sync read/write, and intra‑disk migration read/write.
Abnormal‑behavior queue sorted by current traffic volume.
Illustrative flow diagram is shown in the original slides.
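The PageCache residency estimate (T) can be modeled minimally: under steady write pressure, newly written data is evicted after roughly `cache size / write rate` seconds, so a read for data older than T likely hits disk. This is a simplified FIFO-eviction sketch of the idea, not the production estimation algorithm:

```python
def pagecache_residency_seconds(cache_bytes: float, write_bytes_per_sec: float) -> float:
    """Estimate T: how long newly written data stays in PageCache before eviction,
    assuming FIFO-like eviction under steady write pressure (illustrative model)."""
    return cache_bytes / write_bytes_per_sec

def read_hits_disk(data_age_seconds: float, T: float) -> bool:
    # Data older than the residency window has likely been evicted -> disk read.
    return data_age_seconds > T

# 8 GiB of PageCache, 256 MiB/s of sustained writes -> T = 32 seconds.
T = pagecache_residency_seconds(cache_bytes=8 * 1024**3, write_bytes_per_sec=256 * 1024**2)
print(T)                          # 32.0
print(read_hits_disk(10.0, T))    # False: still in cache
print(read_hits_disk(120.0, T))   # True: likely a disk read
```

Classifying each fetch this way lets the throttler charge only cache-missing reads against the disk's I/O budget.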
3.2.2 Automatic Partition Balancing
To avoid hotspot formation, a balancing plan is generated based on disk‑level load, topic distribution, and per‑partition traffic. The plan selects target disks with the lowest historical median load and moves partitions accordingly. Incremental task submission prevents long‑tail blocking, and dynamic speed adjustment ensures migration does not impact cluster stability.
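Target-disk selection by lowest historical median load can be sketched in a few lines. The data shape (load samples per disk) is an assumption for illustration:

```python
import statistics

def pick_target_disk(disk_load_history: dict[str, list[float]]) -> str:
    """Pick the migration target: the disk with the lowest historical median load.
    Using the median rather than the latest sample avoids chasing transient spikes."""
    return min(disk_load_history, key=lambda d: statistics.median(disk_load_history[d]))

history = {
    "disk-a": [0.70, 0.80, 0.75],  # median 0.75
    "disk-b": [0.30, 0.90, 0.35],  # median 0.35 <- lowest
    "disk-c": [0.50, 0.50, 0.50],  # median 0.50
}
print(pick_target_disk(history))  # disk-b
```

Note how disk-b wins despite a 0.90 spike in its history; that robustness to outliers is the reason a median (rather than mean or latest value) is a sensible choice here.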
3.2.3 Multi‑Tenant Resource Isolation
Each business domain can request exclusive resources. Topics can be created with dedicated resource pools, supporting dynamic scaling, automatic throttling, and automatic balancing. Resource shrinkage returns machines to the shared pool after migration.
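Label-based pool selection for tenant topics might look like the following sketch; the label scheme and fallback-to-shared behavior are assumptions for illustration:

```python
def brokers_for_tenant(brokers: dict[int, str], tenant_label: str,
                       shared_label: str = "shared") -> list[int]:
    """Return brokers labeled for the tenant's dedicated pool; tenants without
    a dedicated pool fall back to the shared pool (hypothetical scheme)."""
    dedicated = [b for b, label in brokers.items() if label == tenant_label]
    return dedicated or [b for b, label in brokers.items() if label == shared_label]

brokers = {1: "ads", 2: "ads", 3: "shared", 4: "shared"}
print(brokers_for_tenant(brokers, "ads"))     # [1, 2]  - dedicated pool
print(brokers_for_tenant(brokers, "search"))  # [3, 4]  - falls back to shared
```

Returning a machine to the shared pool is then just a relabel after its partitions have been migrated away.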
3.2.4 Multi‑IDC Management
Cross‑IDC migration is automated: when a business moves to a new IDC, the system generates a migration plan, applies throttling, and ensures offset continuity without user impact. The system also provides IDC‑aware read routing.
3.2.5 Request Queue Splitting
The thread model is refined to separate slow requests from normal ones. A ChannelMarker monitors per‑channel latency and tags slow channels. Slow requests are dispatched to dedicated thread pools (e.g., a SlowRequest pool), isolating them so they cannot block the fast path.
Relevant source files:
core/src/main/scala/kafka/network/SocketServer.scala
clients/src/main/java/org/apache/kafka/common/network/Selector.java
clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java
core/src/main/scala/kafka/server/KafkaRequestHandler.scala
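The marking-and-dispatch idea can be sketched as follows. This is an illustrative stand-in, not the actual ChannelMarker implementation; the threshold, class shape, and pool names are assumptions:

```python
from collections import defaultdict

class ChannelMarker:
    """Illustrative stand-in for the ChannelMarker described above: tracks the
    most recent per-channel request latency and tags channels that exceed a threshold."""
    def __init__(self, slow_threshold_ms: float = 100.0):
        self.slow_threshold_ms = slow_threshold_ms
        self.latency = defaultdict(float)

    def record(self, channel_id: str, latency_ms: float) -> None:
        self.latency[channel_id] = latency_ms

    def is_slow(self, channel_id: str) -> bool:
        return self.latency[channel_id] > self.slow_threshold_ms

def dispatch(marker: ChannelMarker, channel_id: str) -> str:
    # Requests from slow channels go to a dedicated pool so they cannot
    # exhaust the threads serving the fast path.
    return "SlowRequestPool" if marker.is_slow(channel_id) else "NormalPool"

marker = ChannelMarker()
marker.record("ch-1", 250.0)
marker.record("ch-2", 12.0)
print(dispatch(marker, "ch-1"))  # SlowRequestPool
print(dispatch(marker, "ch-2"))  # NormalPool
```

A production version would use a decayed or windowed latency statistic rather than only the last sample, so a single slow request does not permanently tag a channel.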
3.2.6 Tiered Storage
The goal is to decouple partition data from a single disk and store parts on remote HDFS storage. A Raft‑based meta service (HA) replaces the Kafka meta‑topic, using gRPC for communication and RocksDB for metadata persistence. New offset fetching strategies (local‑start‑fetch‑offset) reduce unnecessary remote reads. Batch meta writes and leader fencing improve consistency. Remote segments are cached locally to lower HDFS load and latency.
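The local-start-fetch-offset routing decision reduces to a boundary check: offsets at or beyond the earliest locally retained offset are served from local segments, and only older offsets fall back to remote storage. A minimal sketch, with hypothetical names:

```python
def resolve_fetch_source(fetch_offset: int, local_start_offset: int) -> str:
    """Route a fetch request: offsets at or beyond the local start offset are
    served from local segments; older offsets fall back to remote (HDFS)
    storage. Illustrative sketch of the local-start-fetch-offset idea."""
    return "local" if fetch_offset >= local_start_offset else "remote"

# Broker retains offsets >= 4000 locally; everything older lives on HDFS.
print(resolve_fetch_source(fetch_offset=5000, local_start_offset=4000))  # local
print(resolve_fetch_source(fetch_offset=1000, local_start_offset=4000))  # remote
```

Since most consumers read near the log head, the common case stays on the local path and never touches HDFS.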
3.2.7 Kafka Auditing
An audit layer records detailed produce/consume requests into ClickHouse, enabling real‑time troubleshooting, cost management, and permission cleanup. The audit data also supports security enhancements.
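One plausible shape for such an audit row is a flat, wide record that serializes cleanly into a columnar store like ClickHouse. The field names below are hypothetical, not the system's actual schema:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AuditRecord:
    """Hypothetical shape of one audit row destined for ClickHouse."""
    timestamp_ms: int
    client_id: str
    topic: str
    partition: int
    action: str        # "produce" or "consume"
    payload_bytes: int

record = AuditRecord(
    timestamp_ms=int(time.time() * 1000),
    client_id="reporting-svc",
    topic="events",
    partition=3,
    action="produce",
    payload_bytes=4096,
)
print(json.dumps(asdict(record)))  # one flat JSON row, ready for batch insert
```

Keeping every field scalar (no nesting) is what makes per-client and per-topic aggregation cheap on the ClickHouse side.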
3.3 Operations‑Level Governance
3.3.1 Smooth Cluster Rolling Release
A service automates batch machine up/down operations. When a machine is taken offline, its leader partitions are first migrated; when it returns, partitions are restored. This reduces a 15‑person‑day rollout to a one‑hour automated process.
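The leader-drain step before taking a machine offline can be sketched as generating a reassignment plan that spreads the affected leaders over the remaining brokers. This is a simplified round-robin sketch; the real planner would also weigh disk and traffic load:

```python
def drain_plan(leader_partitions: dict[str, int], offline_broker: int,
               brokers: list[int]) -> dict[str, int]:
    """For each partition led by the broker being taken offline, assign a new
    leader round-robin over the remaining brokers (simplified sketch)."""
    targets = [b for b in brokers if b != offline_broker]
    plan = {}
    i = 0
    for partition, leader in sorted(leader_partitions.items()):
        if leader == offline_broker:
            plan[partition] = targets[i % len(targets)]
            i += 1
    return plan

leaders = {"t0-0": 1, "t0-1": 2, "t1-0": 1}
print(drain_plan(leaders, offline_broker=1, brokers=[1, 2, 3]))
# {'t0-0': 2, 't1-0': 3}
```

When the machine returns, the inverse plan restores its original leaders, which is what makes the batch up/down cycle fully automatable.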
4. Future Outlook
Support minute‑level migration scheduling.
Implement minute‑level self‑healing for hardware failures.
Achieve fully automated dynamic scaling based on real‑time cluster analysis.
Bilibili Tech