Comprehensive Governance and Optimization Strategies for Large‑Scale Kafka Clusters
To tame a petabyte‑scale Kafka deployment of over 1,000 brokers, the team built a Raft‑based federation controller (Guardian) that adds per‑partition I/O throttling, disk‑aware automatic balancing, multi‑tenant isolation, cross‑IDC migration, request‑queue splitting, tiered storage, auditing, and fully automated rolling upgrades, enabling stable, self‑healing operations.
1. Background
Kafka is a critical piece of data middleware in the company, used for reporting, buffering, and distributing massive amounts of data. The deployment consists of more than 1,000 Kafka machines forming over 20 clusters, with heterogeneous hardware (HDD, SSD, NVMe) and daily traffic at the petabyte level. As the cluster size grew, a series of stability and operational challenges emerged.
2. Challenges and Pain Points
Unpredictable client read/write patterns lead to I/O saturation and resource contention.
Multi‑tenant workloads interfere with each other, expanding the “blast radius” of failures.
Open‑source rate‑limiting is coarse‑grained and cannot react to real‑time disk metrics.
Cluster scaling and machine up/down processes are cumbersome and slow.
Partition assignment ignores disk‑level load, causing imbalance across machines and disks.
Lack of automatic rebalancing and migration‑rate control harms real‑time performance.
Single IDC cannot meet the company’s growing demand; cross‑IDC coordination is required.
Only one request‑handling thread pool exists; slow requests can block the whole pool.
3. Thinking and Solutions
3.1 Guardian – Kafka Federation (Cluster Controller)
Guardian is a self‑developed federation controller that uses Raft for high availability and consistency. It collects metrics from Kafka brokers, performs analysis, and executes governance plans, including:
Metadata management for federation clusters.
Remote storage metadata handling.
UUID (topicId, segmentId) allocation.
Cluster‑wide scheduling based on collected metrics.
Multi‑tenant label isolation.
Failure alerting and self‑healing.
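The collect-analyze-execute cycle described above can be sketched as a simple control loop. The metric names, thresholds, and action labels below are illustrative assumptions, not Guardian's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class BrokerMetrics:
    """Hypothetical per-broker metrics Guardian might collect."""
    broker_id: int
    disk_util: float       # disk utilization, 0.0 - 1.0
    p99_latency_ms: float  # request tail latency

def plan_actions(metrics, util_limit=0.85, latency_limit_ms=200.0):
    """Analysis step: flag brokers breaching thresholds and emit governance actions."""
    actions = []
    for m in metrics:
        if m.disk_util > util_limit:
            actions.append(("throttle", m.broker_id))
        if m.p99_latency_ms > latency_limit_ms:
            actions.append(("rebalance", m.broker_id))
    return actions

metrics = [
    BrokerMetrics(1, 0.92, 120.0),  # disk saturated -> throttle
    BrokerMetrics(2, 0.40, 350.0),  # latency breach -> rebalance
    BrokerMetrics(3, 0.50, 80.0),   # healthy
]
print(plan_actions(metrics))  # [('throttle', 1), ('rebalance', 2)]
```

In the real system this loop would run continuously, with the execution step applying the plan through the throttling and balancing mechanisms described in the following sections.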
3.2 Cluster‑Level Governance
3.2.1 Partition‑Level Throttling Protection
Kafka is I/O‑intensive; uncontrolled reads/writes can saturate disks. The open‑source throttling is too coarse. The new logic adds per‑partition throttling based on real‑time disk I/O and latency metrics. An estimation algorithm calculates the time a data segment can stay in PageCache (T) and decides whether a read will hit disk. If a disk exceeds thresholds, throttling is applied to keep I/O and latency within acceptable ranges.
Key concepts:
Six I/O behaviors: user read/write, replica sync read/write, and intra‑disk migration read/write.
Abnormal‑behavior queue sorted by current traffic volume.
Illustrative flow diagram is shown in the original slides.
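The PageCache residency estimate (T) can be modeled minimally: under steady write pressure, newly written data is evicted after roughly `cache size / write rate` seconds, so a read for data older than T likely hits disk. This is a simplified FIFO-eviction sketch of the idea, not the production estimation algorithm:

```python
def pagecache_residency_seconds(cache_bytes: float, write_bytes_per_sec: float) -> float:
    """Estimate T: how long newly written data stays in PageCache before eviction,
    assuming FIFO-like eviction under steady write pressure (illustrative model)."""
    return cache_bytes / write_bytes_per_sec

def read_hits_disk(data_age_seconds: float, T: float) -> bool:
    # Data older than the residency window has likely been evicted -> disk read.
    return data_age_seconds > T

# 8 GiB of PageCache, 256 MiB/s of sustained writes -> T = 32 seconds.
T = pagecache_residency_seconds(cache_bytes=8 * 1024**3, write_bytes_per_sec=256 * 1024**2)
print(T)                          # 32.0
print(read_hits_disk(10.0, T))    # False: still in cache
print(read_hits_disk(120.0, T))   # True: likely a disk read
```

Classifying each fetch this way lets the throttler charge only cache-missing reads against the disk's I/O budget.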
3.2.2 Automatic Partition Balancing
To avoid hotspot formation, a balancing plan is generated based on disk‑level load, topic distribution, and per‑partition traffic. The plan selects target disks with the lowest historical median load and moves partitions accordingly. Incremental task submission prevents long‑tail blocking, and dynamic speed adjustment ensures migration does not impact cluster stability.
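Target-disk selection by lowest historical median load can be sketched in a few lines. The data shape (load samples per disk) is an assumption for illustration:

```python
import statistics

def pick_target_disk(disk_load_history: dict[str, list[float]]) -> str:
    """Pick the migration target: the disk with the lowest historical median load.
    Using the median rather than the latest sample avoids chasing transient spikes."""
    return min(disk_load_history, key=lambda d: statistics.median(disk_load_history[d]))

history = {
    "disk-a": [0.70, 0.80, 0.75],  # median 0.75
    "disk-b": [0.30, 0.90, 0.35],  # median 0.35 <- lowest
    "disk-c": [0.50, 0.50, 0.50],  # median 0.50
}
print(pick_target_disk(history))  # disk-b
```

Note how disk-b wins despite a 0.90 spike in its history; that robustness to outliers is the reason a median (rather than mean or latest value) is a sensible choice here.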
3.2.3 Multi‑Tenant Resource Isolation
Each business domain can request exclusive resources. Topics can be created with dedicated resource pools, supporting dynamic scaling, automatic throttling, and automatic balancing. Resource shrinkage returns machines to the shared pool after migration.
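Label-based pool selection for tenant topics might look like the following sketch; the label scheme and fallback-to-shared behavior are assumptions for illustration:

```python
def brokers_for_tenant(brokers: dict[int, str], tenant_label: str,
                       shared_label: str = "shared") -> list[int]:
    """Return brokers labeled for the tenant's dedicated pool; tenants without
    a dedicated pool fall back to the shared pool (hypothetical scheme)."""
    dedicated = [b for b, label in brokers.items() if label == tenant_label]
    return dedicated or [b for b, label in brokers.items() if label == shared_label]

brokers = {1: "ads", 2: "ads", 3: "shared", 4: "shared"}
print(brokers_for_tenant(brokers, "ads"))     # [1, 2]  - dedicated pool
print(brokers_for_tenant(brokers, "search"))  # [3, 4]  - falls back to shared
```

Returning a machine to the shared pool is then just a relabel after its partitions have been migrated away.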
3.2.4 Multi‑IDC Management
Cross‑IDC migration is automated: when a business moves to a new IDC, the system generates a migration plan, applies throttling, and ensures offset continuity without user impact. The system also provides IDC‑aware read routing.
3.2.5 Request Queue Splitting
The thread model is refined to separate slow requests from normal ones. A ChannelMarker monitors per‑channel latency and tags slow channels. Slow requests are dispatched to dedicated thread pools (e.g., a SlowRequest pool), isolating them so they cannot block the fast path.
Relevant source files:
core/src/main/scala/kafka/network/SocketServer.scala
clients/src/main/java/org/apache/kafka/common/network/Selector.java
clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java
core/src/main/scala/kafka/server/KafkaRequestHandler.scala
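The marking-and-dispatch idea can be sketched as follows. This is an illustrative stand-in, not the actual ChannelMarker implementation; the threshold, class shape, and pool names are assumptions:

```python
from collections import defaultdict

class ChannelMarker:
    """Illustrative stand-in for the ChannelMarker described above: tracks the
    most recent per-channel request latency and tags channels that exceed a threshold."""
    def __init__(self, slow_threshold_ms: float = 100.0):
        self.slow_threshold_ms = slow_threshold_ms
        self.latency = defaultdict(float)

    def record(self, channel_id: str, latency_ms: float) -> None:
        self.latency[channel_id] = latency_ms

    def is_slow(self, channel_id: str) -> bool:
        return self.latency[channel_id] > self.slow_threshold_ms

def dispatch(marker: ChannelMarker, channel_id: str) -> str:
    # Requests from slow channels go to a dedicated pool so they cannot
    # exhaust the threads serving the fast path.
    return "SlowRequestPool" if marker.is_slow(channel_id) else "NormalPool"

marker = ChannelMarker()
marker.record("ch-1", 250.0)
marker.record("ch-2", 12.0)
print(dispatch(marker, "ch-1"))  # SlowRequestPool
print(dispatch(marker, "ch-2"))  # NormalPool
```

A production version would use a decayed or windowed latency statistic rather than only the last sample, so a single slow request does not permanently tag a channel.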
3.2.6 Tiered Storage
The goal is to decouple partition data from a single disk and store parts on remote HDFS storage. A Raft‑based meta service (HA) replaces the Kafka meta‑topic, using gRPC for communication and RocksDB for metadata persistence. New offset fetching strategies (local‑start‑fetch‑offset) reduce unnecessary remote reads. Batch meta writes and leader fencing improve consistency. Remote segments are cached locally to lower HDFS load and latency.
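The local-start-fetch-offset routing decision reduces to a boundary check: offsets at or beyond the earliest locally retained offset are served from local segments, and only older offsets fall back to remote storage. A minimal sketch, with hypothetical names:

```python
def resolve_fetch_source(fetch_offset: int, local_start_offset: int) -> str:
    """Route a fetch request: offsets at or beyond the local start offset are
    served from local segments; older offsets fall back to remote (HDFS)
    storage. Illustrative sketch of the local-start-fetch-offset idea."""
    return "local" if fetch_offset >= local_start_offset else "remote"

# Broker retains offsets >= 4000 locally; everything older lives on HDFS.
print(resolve_fetch_source(fetch_offset=5000, local_start_offset=4000))  # local
print(resolve_fetch_source(fetch_offset=1000, local_start_offset=4000))  # remote
```

Since most consumers read near the log head, the common case stays on the local path and never touches HDFS.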
3.2.7 Kafka Auditing
An audit layer records detailed produce/consume requests into ClickHouse, enabling real‑time troubleshooting, cost management, and permission cleanup. The audit data also supports security enhancements.
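One plausible shape for such an audit row is a flat, wide record that serializes cleanly into a columnar store like ClickHouse. The field names below are hypothetical, not the system's actual schema:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AuditRecord:
    """Hypothetical shape of one audit row destined for ClickHouse."""
    timestamp_ms: int
    client_id: str
    topic: str
    partition: int
    action: str        # "produce" or "consume"
    payload_bytes: int

record = AuditRecord(
    timestamp_ms=int(time.time() * 1000),
    client_id="reporting-svc",
    topic="events",
    partition=3,
    action="produce",
    payload_bytes=4096,
)
print(json.dumps(asdict(record)))  # one flat JSON row, ready for batch insert
```

Keeping every field scalar (no nesting) is what makes per-client and per-topic aggregation cheap on the ClickHouse side.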
3.3 Operations‑Level Governance
3.3.1 Smooth Cluster Rolling Release
A service automates batch machine up/down operations. When a machine is taken offline, its leader partitions are first migrated; when it returns, partitions are restored. This reduces a 15‑person‑day rollout to a one‑hour automated process.
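The leader-drain step before taking a machine offline can be sketched as generating a reassignment plan that spreads the affected leaders over the remaining brokers. This is a simplified round-robin sketch; the real planner would also weigh disk and traffic load:

```python
def drain_plan(leader_partitions: dict[str, int], offline_broker: int,
               brokers: list[int]) -> dict[str, int]:
    """For each partition led by the broker being taken offline, assign a new
    leader round-robin over the remaining brokers (simplified sketch)."""
    targets = [b for b in brokers if b != offline_broker]
    plan = {}
    i = 0
    for partition, leader in sorted(leader_partitions.items()):
        if leader == offline_broker:
            plan[partition] = targets[i % len(targets)]
            i += 1
    return plan

leaders = {"t0-0": 1, "t0-1": 2, "t1-0": 1}
print(drain_plan(leaders, offline_broker=1, brokers=[1, 2, 3]))
# {'t0-0': 2, 't1-0': 3}
```

When the machine returns, the inverse plan restores its original leaders, which is what makes the batch up/down cycle fully automatable.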
4. Future Outlook
Support minute‑level migration scheduling.
Implement minute‑level self‑healing for hardware failures.
Achieve fully automated dynamic scaling based on real‑time cluster analysis.
Bilibili Tech