Operations 21 min read

Inside Bilibili’s Kafka: Challenges, Guardian Federation, and Future Automation

The article details how Bilibili operates over 1,000 Kafka nodes across 20+ clusters, outlines the scalability and stability challenges they faced, and explains the design and implementation of their self‑built Guardian federation controller, partition‑level throttling, automatic balancing, multi‑tenant isolation, tiered storage, audit, and automated ops workflows.

Smart Era Software Development
Smart Era Software Development
Smart Era Software Development
Inside Bilibili’s Kafka: Challenges, Guardian Federation, and Future Automation

Background

Bilibili runs Kafka as a core data middleware for batch and online workloads, deploying over 1,000 machines across more than 20 clusters with HDD, SSD and NVMe disks. Daily traffic reaches petabytes, and rapid cluster growth introduced stability and governance problems.

Challenges and Pain Points

Unpredictable client read/write patterns cause I/O saturation, degrading user experience.

Shared clusters lead to interference between core and ordinary business workloads.

Open‑source Kafka throttling is coarse‑grained (client‑ID level) and cannot react to real‑time disk conditions.

Machine up/down procedures are cumbersome and inefficient.

Partition assignment ignores disk I/O load and inter‑topic traffic, causing load imbalance.

Lack of automatic balancing and migration‑rate control hampers real‑time reads.

Single IDC cannot support expanding services, requiring multi‑IDC coordination.

Only one request‑handling thread pool means slow requests can block others.

Solution Overview

Guardian – Kafka Federation Cluster Controller

Guardian is a self‑developed federation controller that uses Raft for high availability and consistency. It collects metrics from Kafka brokers, performs analysis, and executes governance plans. Key functions:

Metadata management for federation clusters.

Remote storage metadata handling.

UUID (topicId, segmentId) allocation.

Cluster information collection and scheduling.

Multi‑tenant management with label isolation.

Fault alerting and self‑healing.

JMX‑based metric collection requires one request per MBean and can consume up to 20 % CPU. Guardian’s built‑in Kafka Reporter pulls all metrics in a single gRPC/HTTP RPC.

Cluster‑Level Governance

Partition‑Level Throttling Protection

Kafka’s I/O intensity makes unpredictable reads a risk. Open‑source throttling limits by client ID, which is too coarse. Guardian adds partition‑granular throttling based on real‑time disk I/O and latency metrics. An estimation algorithm computes the time T a data segment can stay in PageCache (available memory ÷ disk speed). Using MessageInRate * T, it estimates how many messages a partition can cache. If LEO - MessageInRate * T exceeds the fetch offset, the read is considered a disk read and is throttled.

An “abnormal behavior queue” ranks I/O anomalies by current traffic volume.

Six I/O behaviors are identified: user read/write, replica sync read/write, and migration read/write.

Activating partition‑level throttling enables automatic disk‑migration tasks and improves cluster stability.

Automatic Partition Balancing

Guardian analyzes per‑disk load and per‑topic traffic, then generates a migration plan that places new replicas on disks with the lowest historical load median. Migration tasks are submitted incrementally to avoid long‑tail blocking, and speeds are adjusted dynamically based on cluster load. Features include:

Configurable concurrent migrations.

Pre‑allocation of partitions for new topics based on expected traffic.

Leader balancing to avoid leader‑host hotspots.

Automatic cancellation of migrations when nodes fail or when topic expansion is needed.

Multi‑Tenant Resource Isolation

Guardian supports exclusive resources per tenant. Topics can be created with specified exclusive resources, which can be dynamically expanded or shrunk. The system automatically applies throttling and balancing to isolated resources.

Multi‑IDC Management

When a single IDC cannot meet demand, Guardian coordinates cross‑IDC topic migration with one‑click operations, preserving offset continuity and providing near‑IDC read routing.

Request Queue Splitting

Kafka instances run many partitions on a single broker. Slow‑disk or fail‑slow scenarios can affect all requests. Guardian splits request queues by type (Produce, Fetch, Default) and introduces a dedicated slow‑request thread pool. ChannelMarker monitors per‑channel latency, tags slow channels, and routes them to the slow pool, achieving thread isolation and preventing slow requests from blocking fast ones.

Tiered Storage

Standard Kafka tiered storage (S3‑based) lacks HDFS support and incurs metadata overhead. Guardian implements a Raft‑based meta service with gRPC communication, storing meta in RocksDB. It adds a custom offset fetch mode that prefers local data when possible and falls back to remote HDFS storage, reducing latency and write‑amplification. Additional improvements:

Raft‑based meta server with HA, using RocksDB key‑value store.

Offset fetch logic that returns local‑start‑fetch‑offset (max of local start and calculated threshold) to limit remote reads.

Batch meta writes with Leader Fence for strong consistency.

Segment download to local cache reduces HDFS load.

Reduced write latency because SSD capacity is no longer the bottleneck.

Kafka Audit

Guardian adds an audit layer that records production and consumption requests into ClickHouse, enabling detailed troubleshooting, cost management, and permission cleanup.

Operations‑Level Governance

Smooth Cluster Rolling

Guardian automates batch machine up/down operations. When a node is taken offline, its leader partitions are migrated first; when the node returns, its partitions are restored. This reduced a manual rolling upgrade from 15 person‑days to 1 person‑hour.

Future Outlook

Achieve minute‑level migration scheduling (currently hour‑to‑day).

Implement minute‑level self‑healing for hardware failures.

Fully automate dynamic scaling based on real‑time cluster analysis.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

automationKafkaTiered StorageCluster GovernanceAuditMulti‑Tenant IsolationPartition Balancing
Smart Era Software Development
Written by

Smart Era Software Development

Committed to openness and connectivity, we build frontline engineering capabilities in software, requirements, and platform engineering. By integrating digitalization, cloud computing, blockchain, new media and other hot tech topics, we create an efficient, cutting‑edge tech exchange platform and a diversified engineering ecosystem. Provides frontline news, summit updates, and practical sharing.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.