
ByteDance’s Enhancements to RocksDB: LazyBuffer, Adaptive Map, KV Separation, Multi‑Index, Extreme Compression, and New Hardware Support

This article describes ByteDance’s extensive improvements to the RocksDB storage engine—including LazyBuffer, Adaptive Map‑based lazy compaction, KV separation, adaptive multi‑index support, extreme compression techniques, and hardware acceleration—to reduce amplification, improve performance, and lower costs for large‑scale database workloads.

DataFunTalk

Background

RocksDB is a widely used LSM‑tree storage engine and is the foundation of many ByteDance database products. However, it suffers from read/write amplification, limited compression, indexing inefficiency, and other performance and cost issues.

Shortcomings of RocksDB

Severe read‑write amplification

Insufficient peak‑traffic handling

Limited compression ratio

Low index efficiency

Our Improvements

LazyBuffer

Replaces PinnableSlice with a lazy buffer that postpones I/O until the value is actually needed, reducing unnecessary reads and copy overhead, especially for scans and random seeks.
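The deferred-read idea can be illustrated with a small sketch. This is not the LazyBuffer API: `LazyValue`, `make_fetch`, and the `reads` list are hypothetical names, and the "disk read" is simulated by a callback. It only shows that a scan filtering on keys pays the I/O cost for the values it actually uses.

```python
from typing import Callable, List, Optional


class LazyValue:
    """Holds a fetch callback instead of the bytes; runs it at most once."""

    def __init__(self, fetch: Callable[[], bytes]) -> None:
        self._fetch = fetch
        self._value: Optional[bytes] = None
        self._loaded = False

    def get(self) -> bytes:
        # The (simulated) I/O happens here, on first access only.
        if not self._loaded:
            self._value = self._fetch()
            self._loaded = True
        return self._value


reads: List[str] = []  # records which keys triggered a simulated disk read


def make_fetch(key: str) -> Callable[[], bytes]:
    def fetch() -> bytes:
        reads.append(key)
        return ("value-of-" + key).encode()
    return fetch


# A scan that filters on keys only touches the values it keeps.
scan = {k: LazyValue(make_fetch(k)) for k in ["a1", "a2", "b1", "b2"]}
wanted = [v.get() for k, v in scan.items() if k.startswith("b")]
```

After the scan, `reads` contains only the two `b` keys; the values for `a1` and `a2` were never loaded.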

Lazy Compaction (Adaptive Map)

Introduces an Adaptive Map (aMap) that builds a virtual SST overlay, tracks overlap and coverage, and lets the GC thread prioritize low‑overlap layers, achieving lower write amplification while keeping read amplification controllable.

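The design inherits from Level Compaction. Before a compaction runs, it constructs an aMap over the input SSTs, splits the key space into segments according to how the files overlap (R1‑R5), and leaves the actual merging to the GC thread, which selects segments based on overlap quality.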
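The splitting step can be sketched as a sweep over SST boundaries. This is a simplified illustration, not the aMap implementation: the file names, integer keys, and half-open ranges are assumptions made to keep the example short, and `build_segments` is a hypothetical helper.

```python
from typing import List, Tuple

# (file name, start key, end key); ranges are treated as half-open [start, end).
SstRange = Tuple[str, int, int]
Segment = Tuple[int, int, List[str]]


def build_segments(files: List[SstRange]) -> List[Segment]:
    """Cut the key space at every file boundary and list the files covering each piece."""
    bounds = sorted({b for _, start, end in files for b in (start, end)})
    segments: List[Segment] = []
    for lo, hi in zip(bounds, bounds[1:]):
        covering = [name for name, start, end in files if start <= lo and hi <= end]
        if covering:  # skip gaps that no file covers
            segments.append((lo, hi, covering))
    return segments


files = [("sst_A", 0, 50), ("sst_B", 30, 80), ("sst_C", 40, 60)]
segments = build_segments(files)

# Overlap depth per segment: how many files a read in that range must consult.
depths = [len(covering) for _, _, covering in segments]
worst_depth = max(depths)
```

With these three files the sweep yields five segments, and the deepest one (keys 40 to 50) is covered by all three files. A GC policy could rank segments by this depth; which direction to rank in is a tuning choice that this sketch does not decide.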

KV Separation

Adopts a design similar to WiscKey: values are stored in a separate log and SST entries contain only pointers, which reduces compaction and seek costs. ByteDance’s implementation leverages the Adaptive Map to delay GC and absorb traffic spikes. The trade-off is slower range queries, since each scanned key needs an extra lookup in the value log.
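A minimal sketch of the pointer-in-index layout follows. It assumes an in-memory dictionary standing in for the LSM tree and an in-memory buffer standing in for the value log; `ValueLog`, `SeparatedStore`, and the `threshold` parameter are hypothetical names, not part of any RocksDB or TerarkDB API.

```python
import io
import struct
from typing import Dict, Tuple


class ValueLog:
    """Append-only log of length-prefixed values."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()

    def append(self, value: bytes) -> Tuple[int, int]:
        self._buf.seek(0, io.SEEK_END)
        offset = self._buf.tell()
        self._buf.write(struct.pack("<I", len(value)))  # 4-byte length prefix
        self._buf.write(value)
        return offset, len(value)

    def read(self, offset: int, length: int) -> bytes:
        self._buf.seek(offset + 4)  # skip the length prefix
        return self._buf.read(length)


class SeparatedStore:
    """Keeps small values inline and large values in the log behind a pointer."""

    def __init__(self, threshold: int = 1024) -> None:
        self.index: Dict[bytes, tuple] = {}  # stands in for the LSM tree
        self.log = ValueLog()
        self.threshold = threshold

    def put(self, key: bytes, value: bytes) -> None:
        if len(value) >= self.threshold:
            self.index[key] = ("ptr", self.log.append(value))
        else:
            self.index[key] = ("inline", value)

    def get(self, key: bytes) -> bytes:
        kind, payload = self.index[key]
        if kind == "inline":
            return payload
        offset, length = payload
        return self.log.read(offset, length)  # the extra lookup a scan also pays


store = SeparatedStore(threshold=16)
store.put(b"small", b"x")
store.put(b"big", b"y" * 4096)
```

Only the small `(offset, length)` pair for `b"big"` lives in the index, so rewriting the index during compaction no longer moves the 4 KB value.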

Multi‑Index Support

Extends the default RocksDB index block with adaptive index selection, supporting compressed tries for strings, descending integer bitmap indexes, and potentially B+‑tree indexes embedded directly in data blocks.

Extreme Compression

Implements a global compression pipeline that builds a dictionary, applies sliding‑window compression and optional entropy coding (ANS, high‑order Huffman), and stores per‑record offsets using PForDelta‑style techniques.
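The shape of such a pipeline can be sketched with the standard library. This is a stand-in, not the pipeline described above: it uses zlib's preset-dictionary support (DEFLATE) in place of a trained global dictionary with ANS or Huffman coding, and it keeps record lengths in a plain list rather than bit-packing them as PForDelta would. The sample records and `ZDICT` are made up for illustration.

```python
import zlib
from itertools import accumulate

# Hypothetical records that share most of their bytes.
records = [b'{"user_id": %d, "country": "US", "plan": "free"}' % i for i in range(100)]

# A real pipeline would build this by sampling the data; here it is hand-written.
ZDICT = b'{"user_id": , "country": "US", "plan": "free"}'


def compress_record(rec: bytes, zdict: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(rec) + c.flush()


def decompress_record(blob: bytes, zdict: bytes) -> bytes:
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


blobs = [compress_record(r, ZDICT) for r in records]
data = b"".join(blobs)

# Store the gaps between consecutive offsets (small integers) instead of the
# offsets themselves; a PForDelta-style codec would pack these into few bits.
lengths = [len(b) for b in blobs]
offsets = [0] + list(accumulate(lengths))


def read_record(i: int) -> bytes:
    """Random access to one record without decompressing its neighbours."""
    return decompress_record(data[offsets[i]:offsets[i + 1]], ZDICT)
```

Because every record is compressed against the same shared dictionary, each one stays independently readable while still benefiting from redundancy across records.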

New Hardware Support

Offloads compression and CRC checks to QAT accelerators and explores persistent memory and FPGA usage to shift CPU bottlenecks and improve I/O efficiency.

Benchmark

Using RocksDB’s db_bench tool on a 256 GB dataset (48‑core CPU, 128 GB RAM, Intel NVMe), ByteDance compared RocksDB, TitanDB, and TerarkDB, focusing on KV‑separation performance for large values (4 KB and 32 KB). Results show significant improvements in write amplification and throughput.

Future Work

Plans include distributed compaction, SPDK integration, AI‑driven I/O tuning, and deeper adoption of persistent memory and FPGA.

SST File Layout (Reference)

[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]          (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]      (see section: "range deletion" Meta Block)
[meta block 5: stats block]          (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                (fixed size; starts at file_size - sizeof(Footer))

Images illustrating the Adaptive Map structure, KV‑separation concept, compression flow, and benchmark results are included in the original article.
