ByteDance’s Enhancements to RocksDB: LazyBuffer, Adaptive Map, KV Separation, Multi‑Index, Extreme Compression, and New Hardware Support
This article describes ByteDance’s extensive improvements to the RocksDB storage engine—including LazyBuffer, Adaptive Map‑based lazy compaction, KV separation, adaptive multi‑index support, extreme compression techniques, and hardware acceleration—to reduce amplification, improve performance, and lower costs for large‑scale database workloads.
Background
RocksDB is a widely used LSM‑tree storage engine and is the foundation of many ByteDance database products. However, it suffers from read/write amplification, limited compression, indexing inefficiency, and other performance and cost issues.
Shortcomings of RocksDB
Severe read‑write amplification
Insufficient peak‑traffic handling
Limited compression ratio
Low index efficiency
…
Our Improvements
LazyBuffer
Replaces PinnableSlice with a lazy buffer that postpones I/O until the value is actually needed, reducing unnecessary reads and copy overhead, especially for scans and random seeks.
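A minimal sketch of the lazy-fetch idea (all names here are hypothetical, not the actual ByteDance implementation): instead of eagerly copying the value into the result buffer, the reader hands back an object holding a fetch callback, and the I/O runs only if and when the caller actually asks for the value.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical sketch of a lazy value buffer: the fetcher callback stands in
// for the read path that PinnableSlice would have executed eagerly.
class LazyBuffer {
 public:
  explicit LazyBuffer(std::function<std::string()> fetcher)
      : fetcher_(std::move(fetcher)) {}

  // Materialize the value on first use; later calls return the cached copy.
  const std::string& fetch() {
    if (!value_) value_ = fetcher_();  // the only point where I/O happens
    return *value_;
  }

  bool fetched() const { return value_.has_value(); }

 private:
  std::function<std::string()> fetcher_;
  std::optional<std::string> value_;
};
```

Under this scheme, a scan or seek that only inspects keys never triggers the value read at all, which is where the savings come from.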
Lazy Compaction (Adaptive Map)
Introduces an Adaptive Map (aMap) that builds a virtual SST overlay, tracks overlap and coverage, and lets the GC thread prioritize low‑overlap layers, achieving lower write amplification while keeping read amplification controllable.
Key functions include inheriting Level Compaction, constructing aMap before compaction, splitting into overlapping segments (R1‑R5), and GC selection based on overlap quality.
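The GC selection step above can be sketched as follows (a simplified illustration, with made-up names; the real aMap tracks much richer overlap metadata): each virtual segment carries an overlap ratio against newer layers, and GC prefers the segment with the lowest overlap, since rewriting it reclaims the most space per byte written.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of a virtual SST segment as tracked by the aMap overlay.
struct Segment {
  std::string begin_key, end_key;
  double overlap_ratio;  // fraction of this range shadowed by newer data
};

// Pick the GC victim: the segment with the lowest overlap ratio, i.e. the one
// whose rewrite causes the least write amplification. Returns -1 if empty.
int PickGcVictim(const std::vector<Segment>& segs) {
  int best = -1;
  for (int i = 0; i < static_cast<int>(segs.size()); ++i) {
    if (best < 0 || segs[i].overlap_ratio < segs[best].overlap_ratio)
      best = i;
  }
  return best;
}
```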
KV Separation
Adopts a design similar to WiscKey, where values are stored in a separate log and SST entries contain only pointers, reducing compaction and seek costs. ByteDance’s implementation leverages the Adaptive Map to delay GC and absorb traffic spikes, at the cost of range-query performance, since separated values lose locality on disk.
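The core mechanic can be sketched in a few lines (an illustrative toy, not the production format; `ValueLog` here uses an in-memory string to stand in for an on-disk blob file): large values are appended to a value log, and the LSM tree keeps only a small fixed-size pointer per key.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical WiscKey-style pointer record stored in the SST in place of
// the value itself.
struct ValuePointer {
  uint64_t file_id;
  uint64_t offset;
  uint32_t size;
};

// Toy append-only value log; a real implementation writes to blob files.
class ValueLog {
 public:
  ValuePointer Append(const std::string& value) {
    ValuePointer ptr{0, log_.size(), static_cast<uint32_t>(value.size())};
    log_ += value;
    return ptr;
  }
  std::string Read(const ValuePointer& ptr) const {
    return log_.substr(ptr.offset, ptr.size);
  }

 private:
  std::string log_;
};
```

Because compaction now moves only pointers, its cost no longer scales with value size, which is why the benefit grows with large values.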
Multi‑Index Support
Extends the default RocksDB index block with adaptive index selection, supporting compressed trie for strings, descending integer bitmap indexes, and the possibility of embedding B+‑tree indexes directly in data blocks.
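The adaptive part can be illustrated with a deliberately simple heuristic (hypothetical names and logic; the actual selector would examine key distribution in more depth): inspect the keys of a block and pick an index representation accordingly, e.g. integer-like keys can use a bitmap index while arbitrary strings fall back to a compressed trie.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical index representations from the article: a bitmap index for
// integer keys, a compressed trie for general strings.
enum class IndexType { kIntegerBitmap, kCompressedTrie };

// Toy selector: choose the bitmap index only if every key is a nonempty
// decimal string; otherwise use the trie.
IndexType ChooseIndex(const std::vector<std::string>& keys) {
  for (const auto& k : keys) {
    bool numeric = !k.empty() &&
                   std::all_of(k.begin(), k.end(), [](unsigned char c) {
                     return std::isdigit(c) != 0;
                   });
    if (!numeric) return IndexType::kCompressedTrie;
  }
  return IndexType::kIntegerBitmap;
}
```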
Extreme Compression
Implements a global compression pipeline that builds a dictionary, applies sliding‑window compression, optional entropy compression (ANS, high‑order Huffman), and stores per‑record offsets using pfordelta‑style techniques.
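The per-record offset idea rests on a simple observation: record offsets within a block are monotonically increasing, so storing a base plus small deltas (the frame-of-reference core of pfordelta) is far cheaper than raw 64-bit offsets. A minimal sketch of just that delta step, with pfordelta's bit-packing and outlier-exception handling omitted:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode monotonically increasing offsets as successive differences.
std::vector<uint32_t> DeltaEncode(const std::vector<uint64_t>& offsets) {
  std::vector<uint32_t> deltas;
  uint64_t prev = 0;
  for (uint64_t off : offsets) {
    deltas.push_back(static_cast<uint32_t>(off - prev));
    prev = off;
  }
  return deltas;
}

// Recover the absolute offsets by prefix-summing the deltas.
std::vector<uint64_t> DeltaDecode(const std::vector<uint32_t>& deltas) {
  std::vector<uint64_t> offsets;
  uint64_t acc = 0;
  for (uint32_t d : deltas) {
    acc += d;
    offsets.push_back(acc);
  }
  return offsets;
}
```

In a full pfordelta scheme the deltas would then be packed at a fixed small bit width, with rare large deltas stored separately as exceptions.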
New Hardware Support
Offloads compression and CRC checks to QAT accelerators and explores persistent memory and FPGA usage to shift CPU bottlenecks and improve I/O efficiency.
Benchmark
Using RocksDB’s db_bench tool on a 256 GB dataset (48‑core CPU, 128 GB RAM, Intel NVMe), ByteDance compared RocksDB, TitanDB, and TerarkDB, focusing on KV‑separation performance for large values (4 KB and 32 KB). Results show significant improvements in write amplification and throughput.
Future Work
Plans include distributed compaction, SPDK integration, AI‑driven I/O tuning, and deeper adoption of persistent memory and FPGA.
SST File Layout (Reference)
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block] (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block] (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block] (see section: "range deletion" Meta Block)
[meta block 5: stats block] (see section: "properties" Meta Block)
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[Footer] (fixed size; starts at file_size - sizeof(Footer))

Images illustrating the Adaptive Map structure, KV-separation concept, compression flow, and benchmark results appear in the original article.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.