
Design and Architecture of Bilibili's High‑Performance KV Storage System

This article presents the background, overall architecture, partitioning strategies, raft‑based replication, binlog support, multi‑active deployment, bulk‑load mechanisms, storage‑engine optimizations, load‑balancing policies, and failure‑detection & recovery techniques of a high‑reliability, high‑throughput key‑value store used at Bilibili.


Background – Bilibili’s services generate diverse data models, some requiring complex relational handling while others fit simple KV patterns; high‑throughput scenarios previously used MySQL + Redis, which introduced cache‑consistency and development‑complexity issues.

Design Goal – Build a self‑developed KV system that is highly reliable, available, performant, and scalable, using multi‑replica disaster recovery via the Raft consensus protocol.

Overall Architecture – The system consists of three core components: the Metaserver (manages cluster metadata, health checks, failover, and load balancing), Nodes (store the KV data; each shard replica lives on a Node, with Raft keeping a shard's replicas consistent and electing a leader to serve reads and writes), and the Client (the access point, offering proxy-based or native-SDK access; it obtains shard placement from the Metaserver and implements retry with back-off).
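The Client's two responsibilities above can be sketched in a few lines. This is a toy model, not Bilibili's actual SDK: `ShardRouter` and `with_backoff` are illustrative names, the placement table would normally be fetched from the Metaserver, and CRC32 stands in for whatever hash the real system uses.

```python
import random
import time
import zlib

class ShardRouter:
    """Toy client-side router: maps a key to a shard, then to that shard's leader."""

    def __init__(self, shard_count: int, placement: dict):
        self.shard_count = shard_count   # shards in the table
        self.placement = placement       # shard id -> leader node address (from Metaserver)

    def shard_for(self, key: bytes) -> int:
        # Hash partitioning: a stable hash of the key, mod the shard count.
        return zlib.crc32(key) % self.shard_count

    def leader_of(self, key: bytes) -> str:
        return self.placement[self.shard_for(key)]

def with_backoff(op, retries: int = 3, base_delay: float = 0.01):
    """Retry `op` on transient failure with exponential back-off plus jitter."""
    for attempt in range(retries):
        try:
            return op()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

A call path would then look like `with_backoff(lambda: send(router.leader_of(key), request))`, refreshing placement from the Metaserver when a retry reveals stale routing.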

Cluster Topology – The topology defines Pool, Zone, Node, Table, Shard, and Replica. Pools group zones for resource isolation; zones are network-connected fault domains; nodes are physical hosts; tables map to business tables; shards split tables into partitions; each replica is a member of its shard's Raft group, backed by a storage engine (RocksDB or SparrowDB).

Core Features

Partition Splitting – Both range and hash partitioning are supported. Hash partitioning avoids hotspots but loses global ordering; range partitioning preserves order but may cause write hotspots. The splitting logic (illustrated for hash) involves the Metaserver calculating the target shard count, instructing Nodes to create new replicas, updating shard state from splitting to normal, and handling client routing during the transition.
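The split arithmetic for hash partitioning can be sketched as follows. This is a toy model assuming the shard count doubles on each split (an assumption, not stated in the article); the useful property is that shard i splits only into shards i and i + old_count, so client routing during the splitting state only has to consider that pair.

```python
import zlib

def shard_of(key: bytes, shard_count: int) -> int:
    # Hash partitioning: stable hash of the key, mod the shard count.
    return zlib.crc32(key) % shard_count

def split_targets(old_count: int) -> dict:
    """When the shard count doubles, shard i splits into shards i and
    i + old_count. Because crc32(k) % (2n) is congruent to crc32(k) % n,
    every key in old shard i lands in exactly one of those two new shards,
    and no other shard is touched during the transition."""
    return {i: (i, i + old_count) for i in range(old_count)}
```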

Binlog Support – Raft logs are reused as KV binlogs, enabling real‑time event streaming and cold‑storage backup on object storage for long‑term replay.

Multi‑Active Deployment – Learner modules replicate writes across data‑center clusters, allowing read‑only access to the nearest cluster and optional write‑active mode with unit‑based writes to avoid conflicts.

Bulk Load – Offline‑generated SST files are uploaded to object storage and directly ingested by KV nodes, reducing write amplification and offloading compaction work.

KV Storage Separation – Inspired by Bitcask, SparrowDB stores values in append‑only data files while keeping indexes in RocksDB; small values are inlined to balance read latency and write amplification.
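The Bitcask-inspired layout above can be sketched with an in-memory stand-in. This is a toy model: `SparrowSketch` is an illustrative name, a dict stands in for the RocksDB index, a byte buffer stands in for the append-only data file, and the 64-byte inline threshold is an assumed value.

```python
import io

INLINE_THRESHOLD = 64  # assumption: values at or below this size are inlined in the index

class SparrowSketch:
    """Toy Bitcask-style KV separation: large values are appended to a data
    file and the index stores only (offset, length); small values are inlined
    in the index to avoid a second read for them."""

    def __init__(self):
        self.data = io.BytesIO()   # stands in for the append-only data file
        self.index = {}            # stands in for the RocksDB index

    def put(self, key: bytes, value: bytes):
        if len(value) <= INLINE_THRESHOLD:
            self.index[key] = ("inline", value)
        else:
            offset = self.data.seek(0, io.SEEK_END)  # append at end of file
            self.data.write(value)
            self.index[key] = ("file", offset, len(value))

    def get(self, key: bytes) -> bytes:
        entry = self.index[key]
        if entry[0] == "inline":
            return entry[1]
        _, offset, length = entry
        self.data.seek(offset)
        return self.data.read(length)
```

The pay-off is that RocksDB compaction only rewrites small index entries, not the large values, which is what cuts write amplification; the cost is one extra read for large values, which inlining small ones avoids.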

Load Balancing – Balances disk, CPU, memory, and network load; replica placement prefers zones with fewer replicas and lower node load; primary-replica distribution is balanced using expected-master calculations: expected_master = node_replica_count / shard_replica_count.
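The expected-master formula from the article, with a worked example: each shard has exactly one leader among its replicas, so a node's fair share of leaderships is its replica count divided by the replication factor.

```python
def expected_masters(node_replica_count: int, shard_replica_count: int) -> float:
    # Each shard has `shard_replica_count` replicas, exactly one of which is
    # the Raft leader (master). If leadership is spread evenly, a node hosting
    # `node_replica_count` replicas should lead this many of them.
    return node_replica_count / shard_replica_count

# Example: a node holding 30 replicas in a 3-replica deployment
# should lead about 10 shards; the balancer transfers leaderships
# toward nodes below their expected count.
```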

Failure Detection & Recovery – Metaserver sends heartbeats to Nodes; failed Nodes are marked and their replicas migrated. Heartbeat forwarding mitigates false positives in network partitions. Disk failures trigger replica migration and snapshot‑based recovery, with snapshots copied at the file level to minimize downtime.
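The heartbeat-forwarding decision can be reduced to one predicate. The exact semantics are an assumption on my part (the article only says forwarding mitigates false positives): here the Metaserver declares a Node dead only when its own heartbeat fails and no peer Node can reach it either, which filters out a partition that cuts off only the Metaserver–Node path.

```python
def node_is_down(direct_ok: bool, forwarded_oks: list) -> bool:
    """Sketch of heartbeat forwarding (assumed semantics): a Node is marked
    failed only when the Metaserver's direct heartbeat fails AND every peer
    asked to forward a heartbeat also reports failure."""
    if direct_ok:
        return False
    return not any(forwarded_oks)
```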

RocksDB Tuning – Implements TTL‑based compaction filters, CompactOnDeletionCollector with deletion_trigger to mitigate delete‑induced write amplification, periodic compaction via periodic_compaction_seconds , and rate‑limiting using NewGenericRateLimiter with auto_tuned and RateLimiter::Mode settings.
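The CompactOnDeletionCollector idea from the list above can be sketched as a sliding window over entries written into an SST file; this is a behavioral model of the RocksDB feature, not its API (the real collector is configured in C++ with a window size and `deletion_trigger`).

```python
from collections import deque

class DeletionTriggerSketch:
    """Toy model of RocksDB's CompactOnDeletionCollector: while an SST file is
    built, track a sliding window of entries; if the window ever holds at
    least `deletion_trigger` tombstones, flag the file for early compaction
    so delete-heavy files are rewritten before they inflate reads."""

    def __init__(self, window_size: int, deletion_trigger: int):
        self.window = deque(maxlen=window_size)  # True = tombstone, False = value
        self.deletion_trigger = deletion_trigger
        self.need_compaction = False

    def observe(self, is_delete: bool):
        self.window.append(is_delete)
        if sum(self.window) >= self.deletion_trigger:
            self.need_compaction = True
```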

Raft Optimizations – Reduces replica count during severe failures, aggregates log submissions (e.g., every 5 ms or 4 KB) to improve throughput, and supports asynchronous flushing of Raft logs to avoid fsync bottlenecks.
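The log-submission aggregation can be sketched as a size-or-time batcher using the article's example thresholds (5 ms or 4 KB). `LogBatcher` is an illustrative name; a real implementation would also need a timer to force the time-based flush rather than checking only on append.

```python
import time

class LogBatcher:
    """Sketch of Raft log aggregation: buffer proposals and submit them as one
    batch once 4 KB accumulates or 5 ms elapses, trading a little latency for
    far fewer fsync/replication round trips."""
    MAX_BYTES = 4 * 1024
    MAX_DELAY = 0.005  # seconds

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # called with the list of buffered entries
        self.buffer = []
        self.size = 0
        self.first_at = None       # time the oldest buffered entry arrived

    def append(self, entry: bytes):
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.buffer.append(entry)
        self.size += len(entry)
        if self.size >= self.MAX_BYTES or (
            time.monotonic() - self.first_at >= self.MAX_DELAY
        ):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
        self.buffer, self.size, self.first_at = [], 0, None
```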

Future Directions – Multi‑tier storage with automatic hot‑cold separation, integration of SPDK and PMEM for I/O acceleration, and further improvements to load‑balancing and failure‑recovery mechanisms.

References: Bitcask, Lethe, and internal Bilibili engineering documents.

Tags: distributed systems, load balancing, KV storage, Raft, bulk load, partitioning
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
