Evolution, Design, and Implementation of Bilibili's Distributed KV Storage System
This article details how Bilibili's KV storage system evolved from early solutions like Redis and MySQL to a highly scalable, high‑availability distributed architecture, describing its overall design, node components, data splitting, multi‑active disaster recovery, typical use cases, and operational challenges.
The talk, presented by senior Bilibili engineer Lin Tanghui, introduces the rapid growth of Bilibili's business and the need for a storage system that can handle exponential traffic peaks while providing high availability through multi‑active replication.
Storage Evolution – Early KV solutions (Redis/Memcached, Redis+MySQL, HBase) could not keep pace: a single MySQL instance became impractical beyond roughly 10 TB, Redis Cluster's Gossip-based membership protocol added communication overhead as the cluster grew, and HBase suffered from high tail latency and memory cost. The team therefore defined requirements for a new system: 100× horizontal scalability, low latency, high QPS, fault tolerance, low cost, and zero data loss.
Design Implementation
1. Overall Architecture – Consists of three parts: Client (SDK access), Metaserver (metadata of tables, shards, and node placement), and Node (stores replicas). Nodes use Raft for replica synchronization, ensuring high availability.
2. Cluster Topology – Resources are organized into Pools (online/offline), Zones (fault isolation), Nodes (contain multiple disks and replicas), and Shards (range or hash split).
3. Metaserver – Manages resources, metadata distribution, health checks, load monitoring, load balancing, split management, Raft leader election, and persists metadata with RocksDB.
4. Node – Contains background threads (Binlog management, garbage collection, health checks, compaction), RPC interface with quota‑based throttling, and an abstract engine layer to handle different workloads (e.g., large‑value optimization).
5. Data Splitting & Balancing – Automatic shard splitting (range or hash) once a shard exceeds its 24 GB capacity, with transparent request redirection during the split and millisecond‑level rebalancing built on RocksDB checkpoints.
6. Multi‑Active Disaster Recovery – Binlog‑based cross‑region replication (e.g., between cloud‑cube and Jiading data centers) enables automatic failover when an entire region fails.
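The quota‑based throttling mentioned in point 4 can be illustrated with a token bucket. This is a minimal sketch, not Bilibili's implementation; the `TokenBucket` class and its parameters are hypothetical:

```python
import time

class TokenBucket:
    """Quota-based throttle sketch: a table gets a QPS budget (rate) plus a
    burst allowance; requests beyond the budget are rejected, not queued."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens refilled per second (the quota)
        self.capacity = burst       # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Rejecting over-quota RPCs at the node boundary keeps one hot table from starving the other replicas on the same disk.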
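Transparent redirection during a split (point 5) relies on the client and server computing the same shard for a key, and on the client refreshing its routing table when a node reports that a key has moved. A minimal sketch under those assumptions, with hypothetical `shard_of` and `Router` names:

```python
import hashlib

def shard_of(key: bytes, num_shards: int) -> int:
    # Stable hash so client and server always agree on the placement.
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return h % num_shards

class Router:
    """Client-side routing table sketch. After a hash split the Metaserver
    publishes a new shard count; a node holding a moved key answers with a
    redirect, the client updates its table and retries, so the split is
    invisible to the application."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards

    def route(self, key: bytes) -> int:
        return shard_of(key, self.num_shards)

    def on_redirect(self, new_num_shards: int) -> None:
        # Called when a node rejects a request because the shard map changed.
        self.num_shards = new_num_shards
```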
Scenarios & Problems
Typical use cases include user profiling for recommendation, dynamic feeds, object storage, and danmaku (bullet comments). Optimizations such as Bulkload cut full‑load time from hours to minutes, and a custom fixed‑length list engine handles high‑volume history data.
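The idea behind a fixed‑length list engine can be sketched as a per‑key ring of bounded size: appends silently evict the oldest entry, which bounds per‑user storage for history data no matter how active the user is. The class below is a hypothetical illustration, not the actual engine:

```python
from collections import deque

class FixedLengthListEngine:
    """Sketch of a fixed-length list engine: each key holds at most
    `max_len` entries, and an append beyond that evicts the oldest,
    so per-key storage stays bounded for high-volume history data."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self.store = {}  # key -> deque of entries, newest last

    def append(self, key: str, value: bytes) -> None:
        lst = self.store.setdefault(key, deque(maxlen=self.max_len))
        lst.append(value)  # deque drops the oldest entry automatically

    def range(self, key: str) -> list:
        return list(self.store.get(key, ()))
```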
Challenges encountered:
Compaction delays for expired keys – solved with periodic compaction checks.
SCAN slowdown due to delete markers – mitigated by delete‑threshold‑triggered compaction and delayed deletion.
Large‑value write amplification – addressed by KV‑storage separation.
Raft replica loss – when a majority of replicas is lost, the group is automatically downgraded to single‑replica mode and then recovered by script.
Log flushing overhead – batch flushing after reaching size or count thresholds improves throughput 2–3×.
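The delete‑threshold‑triggered compaction in the SCAN item above can be sketched as a counter maintained while scanning: when one scan skips more tombstones than a threshold, the touched range is queued for compaction so later scans stop paying for dead keys. Names and structure here are hypothetical:

```python
class TombstoneTracker:
    """Sketch of delete-threshold-triggered compaction: a SCAN counts the
    delete markers it has to skip; if the count crosses a threshold, the
    scanned range is queued for compaction."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.pending = []  # ranges queued for compaction, as (start, end)

    def scan(self, entries, start: bytes, end: bytes):
        # `entries` is an iterable of (key, value); value None is a tombstone.
        live, tombstones = [], 0
        for key, value in entries:
            if not (start <= key < end):
                continue
            if value is None:
                tombstones += 1  # dead key slows the scan but yields nothing
            else:
                live.append((key, value))
        if tombstones >= self.threshold:
            self.pending.append((start, end))
        return live
```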
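The batched log flushing above can be sketched as a buffer that flushes once either a byte‑size or an entry‑count threshold is hit, amortizing one write/fsync over many entries (the talk cites a 2–3× throughput gain). The `BatchedLog` class and its thresholds are illustrative assumptions:

```python
class BatchedLog:
    """Sketch of batched log flushing: buffer writes and flush once a
    size or count threshold is reached, trading a little latency for
    far fewer fsync calls."""

    def __init__(self, flush_fn, max_bytes=64 * 1024, max_entries=128):
        self.flush_fn = flush_fn      # sink that persists one batch
        self.max_bytes = max_bytes
        self.max_entries = max_entries
        self.buf = []
        self.buf_bytes = 0

    def write(self, entry: bytes) -> None:
        self.buf.append(entry)
        self.buf_bytes += len(entry)
        if len(self.buf) >= self.max_entries or self.buf_bytes >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.flush_fn(b"".join(self.buf))  # one write for the whole batch
            self.buf.clear()
            self.buf_bytes = 0
```

A real implementation would also flush on a timer so a half‑full batch is not held indefinitely.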
Summary & Reflections
Future directions include tighter KV‑cache integration, Sentinel mode for replica cost reduction, improved slow‑node detection, automated disk‑balance mechanisms, and performance tuning via SPDK/DPDK to further boost KV throughput.
Thank you for listening.
DataFunTalk