
Design and Architecture of Bilibili's High‑Performance Distributed KV Store

Bilibili’s high‑performance distributed KV store combines hash and range partitioning, Raft‑based multi‑replica consistency, and a Metaserver‑managed topology of pools, zones, nodes, tables, shards and replicas, offering features such as partition splitting, binlog streaming, multi‑active replication, bulk loading, KV‑storage separation, and automated load, leader and health balancing for reliable, scalable data services.

Bilibili Tech

Background: Bilibili’s business scenarios involve many different data models, ranging from complex relational data (accounts, video metadata) to simple key‑value (kv) models. Early solutions used MySQL for persistence and Redis for caching, which introduced consistency and development complexity issues. The need emerged for a persistent, high‑performance kv store positioned between Redis and MySQL.

The system is designed to be highly reliable, available, performant, and scalable. Reliability is achieved through multi‑replica disaster recovery using the Raft consensus protocol.

Partitioning: Two partitioning strategies are supported – hash‑based and range‑based. Hash partitioning prevents hotspots but does not preserve global order; range partitioning preserves order but may cause write hotspots. Bilibili defaults to hash partitioning for most use‑cases, while range partitioning is available when global ordering is required.
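As a sketch of the trade-off, hash routing might look like the following (the hash function and shard count here are illustrative assumptions, not details from the article):

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard by hashing: load spreads evenly, but
    lexically adjacent keys land on unrelated shards, so there are
    no efficient global range scans."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % num_shards
```

Range partitioning would instead map contiguous key intervals to shards, preserving order at the cost of potential write hotspots on the most active interval.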

Overall Architecture

The core consists of three components:

Metaserver – manages cluster metadata, node health monitoring, failover, and load balancing.

Node – stores kv data as shard replicas and ensures consistency via Raft. Nodes can serve reads from followers when consistency requirements allow.

Client – provides two access methods: a proxy mode and a native SDK (implemented in C++). The SDK obtains shard metadata from Metaserver and routes requests to the appropriate Node, with retry and back‑off mechanisms for high availability.

Cluster Topology

The topology includes Pool, Zone, Node, Table, Shard, and Replica concepts:

Pool – a resource pool containing multiple zones, used for business isolation.

Zone – a fault‑isolated network segment (e.g., a data center or switch).

Node – a physical host that stores data and runs a Raft group.

Table – a logical table similar to a MySQL table.

Shard – a logical partition of a table, distributed across nodes.

Replica – a copy of a shard; replicas are placed in different zones for fault tolerance. Each replica contains an engine (RocksDB or SparrowDB).

Core Features

1. Partition Splitting: Both range and hash partitions support splitting. For hash splitting, the system doubles the number of shards, updates Metaserver metadata, and marks the old shard as splitting until the new shard catches up.
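One reason doubling the shard count is convenient: if keys are placed with shard = hash % n, then after the split each key either stays in shard s or moves to exactly shard s + n, so only the migrating half of each old shard needs to be copied. A minimal sketch (the modulo placement scheme is an assumption for illustration):

```python
def shard_before(h: int, n: int) -> int:
    # Placement with n shards: shard = hash % n (illustrative scheme).
    return h % n


def shard_after_split(h: int, n: int) -> int:
    # After doubling to 2n shards, old shard s splits into s and s + n.
    return h % (2 * n)


n = 8
for h in range(1000):
    old, new = shard_before(h, n), shard_after_split(h, n)
    assert new in (old, old + n)  # a key never crosses to an unrelated shard
```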

2. Binlog Support: Raft logs are reused as a binlog, enabling real‑time event streaming and cold backup to object storage for long‑term replay.

3. Multi‑Active Replication: Learner modules replicate writes across data‑center clusters. Read‑only multi‑active allows one cluster to serve reads while another writes; write‑active multi‑active enables bidirectional writes with unit‑level write isolation.

4. Bulk Load: Offline tools generate SST files that are uploaded to object storage; Nodes ingest these files directly, reducing write amplification and speeding up data ingestion.

5. KV Storage Separation (SparrowDB): Data values are stored in append‑only data files while indexes reside in RocksDB. Small values are inlined to avoid extra I/O, balancing read latency and write amplification.
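A toy model of this separation, assuming an inline-size cutoff (the threshold, class name, and in-memory dict standing in for the RocksDB index are all illustrative):

```python
INLINE_THRESHOLD = 256  # bytes; illustrative cutoff for inlining small values


class KvSeparatedStore:
    """Toy kv-storage separation: large values are appended to a data
    log and the index keeps only a pointer; small values are inlined
    in the index so a read needs no second I/O."""

    def __init__(self):
        self.log = bytearray()  # append-only value file
        self.index = {}         # key -> ("inline", value) or ("ptr", offset, length)

    def put(self, key: bytes, value: bytes) -> None:
        if len(value) <= INLINE_THRESHOLD:
            self.index[key] = ("inline", value)           # one lookup, no extra I/O
        else:
            offset = len(self.log)
            self.log += value                             # sequential append only
            self.index[key] = ("ptr", offset, len(value))

    def get(self, key: bytes) -> bytes:
        record = self.index[key]
        if record[0] == "inline":
            return record[1]
        _, offset, length = record
        return bytes(self.log[offset:offset + length])    # second read for big values
```

The write-amplification savings come from compaction rewriting small pointer entries rather than full values.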

6. Load Balancing: Balances disk space, CPU, memory, and network I/O across nodes. Replica placement prefers zones with the fewest replicas and respects node load thresholds.

7. Leader Balancing: Calculates the expected number of leader replicas per node as expected_leader = node_replica_count / shard_replica_count and migrates leaders to achieve a balanced distribution.
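The target works out simply: with R replicas per Raft group, roughly one in R of a node's replicas should be a leader. A quick illustration of the formula:

```python
def expected_leaders(node_replica_count: int, shard_replica_count: int) -> int:
    # With R replicas per Raft group, about 1 in R of a node's
    # replicas should be the leader of its group.
    return node_replica_count // shard_replica_count


# A node holding 30 replicas of 3-replica shards should lead about
# 10 groups; the balancer migrates leadership toward that target.
print(expected_leaders(30, 3))  # prints 10
```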

8. Health Monitoring & Failure Recovery: Metaserver sends heartbeats to Nodes; failed Nodes are marked and their replicas are migrated. Heartbeat forwarding mitigates false positives caused by network partitions. Disk health is monitored via dmesg logs.
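A minimal sketch of how forwarded heartbeats can suppress false positives (the timeout value, class, and method names are assumptions for illustration):

```python
HEARTBEAT_TIMEOUT = 5.0  # seconds; illustrative


class FailureDetector:
    """Toy Metaserver-side detector: a node is suspected only if
    neither a direct heartbeat nor one relayed through a peer has
    arrived recently, so a partition between Metaserver and the node
    alone does not trigger replica migration."""

    def __init__(self):
        self.last_direct = {}     # node -> time of last direct heartbeat
        self.last_forwarded = {}  # node -> time of last peer-relayed heartbeat

    def on_direct(self, node: str, now: float) -> None:
        self.last_direct[node] = now

    def on_forwarded(self, node: str, now: float) -> None:
        self.last_forwarded[node] = now

    def is_failed(self, node: str, now: float) -> bool:
        freshest = max(self.last_direct.get(node, 0.0),
                       self.last_forwarded.get(node, 0.0))
        return now - freshest > HEARTBEAT_TIMEOUT
```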

9. Failure Repair: Upon node or disk failure, Metaserver creates new replicas on healthy nodes, which join the existing Raft group via snapshot replication, restoring redundancy.

Practical Experience

RocksDB Tuning:

Expired data eviction using periodic_compaction_seconds to trigger compaction.

Slow scan mitigation with CompactOnDeletionCollector and deletion_trigger.

Compaction rate limiting via NewGenericRateLimiter with auto_tuned and RateLimiter::Mode settings.

Disabling WAL for RocksDB writes because Raft log already guarantees durability.

Raft Optimizations:

Reducing replica count during severe failures to maintain availability.

Log aggregation for high‑throughput writes (batching every 5 ms or 4 KB).

Avoiding synchronous fsync (via vm.dirty_ratio tuning) and flushing Raft logs asynchronously to prevent I/O spikes.
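The 5 ms / 4 KB aggregation rule for Raft log writes can be sketched like this (class and method names are illustrative, not from the article):

```python
MAX_DELAY_S = 0.005  # flush after 5 ms...
MAX_BYTES = 4096     # ...or once 4 KB is buffered, whichever comes first


class LogBatcher:
    """Toy Raft-log aggregator: buffer entries and flush on whichever
    fires first, the 5 ms timer or the 4 KB size threshold, trading a
    little latency for far fewer fsyncs and replication round trips."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.buffer, self.size, self.first_at = [], 0, None

    def append(self, entry: bytes, now: float) -> None:
        if self.first_at is None:
            self.first_at = now
        self.buffer.append(entry)
        self.size += len(entry)
        if self.size >= MAX_BYTES:
            self.flush()

    def tick(self, now: float) -> None:
        # Called periodically: flush if the oldest entry has waited 5 ms.
        if self.first_at is not None and now - self.first_at >= MAX_DELAY_S:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(b"".join(self.buffer))
        self.buffer, self.size, self.first_at = [], 0, None
```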

Future Directions

Transparent multi‑tier storage with automatic hot‑cold separation.

Integration of SPDK and PMEM for accelerated I/O.

Written by Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
