
Optimizing Flink State Performance with RocksDB KV Separation and BlobDB

In large‑scale Flink double‑stream joins, terabyte‑sized RocksDB state caused severe compaction latency and CPU spikes. Enabling RocksDB BlobDB KV separation (together with an inner‑compaction patch) dramatically shrank SST files, brought read/write latencies down to sub‑millisecond levels, and roughly halved CPU spikes.

Bilibili Tech

Flink SQL often encounters large-scale double‑stream join scenarios. When both left and right streams have high traffic, even a join waiting time of one hour can cause the Flink keyed state (RocksDBStateBackend) to grow to terabyte scale.

Metrics collected on state size, latency, and value length reveal that as state reaches TB size, read/write request latencies increase dramatically. The root cause is RocksDB compaction, which becomes a bottleneck during peak traffic.

Two types of double‑stream joins are identified:

1. Regular Join without time interval constraints, where keys and values are relatively small.

2. Interval Join (including latency‑join implementations), where keys are timestamps and values are large lists of RowData, leading to very long serialized values.

The join operator maintains two MapState&lt;Long, List&lt;RowData&gt;&gt; structures, one for each stream. Each timestamp key maps to a list of RowData records containing all projected fields. In high‑traffic jobs, the projected fields are both numerous and long, so the serialized values become very long.
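As a rough illustration of why these values get so long, consider that the whole per‑timestamp list is serialized as a single RocksDB value, so its size grows linearly with the number of buffered rows. The row count and per‑row size below are invented for illustration, not taken from the article:

```java
import java.util.ArrayList;
import java.util.List;

public class ValueGrowth {
    // Total serialized length of one MapState value: a list of `rows` rows,
    // each carrying `bytesPerRow` bytes of projected fields (illustrative sizes).
    static int serializedBytes(int rows, int bytesPerRow) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            list.add("x".repeat(bytesPerRow)); // stand-in for one serialized RowData
        }
        return list.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        // 1,000 rows sharing one join timestamp, ~200 bytes of fields each
        System.out.println(serializedBytes(1_000, 200)); // prints 200000
    }
}
```

At these (hypothetical) rates a single hot timestamp key already carries a ~200 KB value, which is exactly the kind of payload that later motivates KV separation.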

RocksDB write flow: a KV pair is first written to the WAL (disabled in Flink), then to the MemTable. When the MemTable reaches its size limit (64 MB) or a checkpoint triggers a flush, data is persisted to L0 SST files. Compaction is triggered when L0 file count exceeds a threshold, merging files into higher levels (L1, L2, …) according to configured size limits.
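The flush and compaction triggers above correspond to per‑column‑family RocksDB options. A minimal sketch using the RocksDB Java API (a config fragment requiring the rocksdbjni dependency; the values mirror the defaults mentioned here, not Bilibili's production settings):

```java
import org.rocksdb.ColumnFamilyOptions;

// Sketch of the write-path knobs discussed above.
ColumnFamilyOptions cfOpts = new ColumnFamilyOptions()
        .setWriteBufferSize(64 * 1024 * 1024)   // flush the MemTable to an L0 SST at 64 MB
        .setLevel0FileNumCompactionTrigger(4);  // start L0 -> L1 compaction once L0 holds 4 files
```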

In TB‑scale double‑stream join jobs, the SST file distribution often becomes [2, 4, 41, 98, 0, 0, 0] across levels L0–L6, with a large number of files in L3. This is caused by massive KV volume and long values, which increase compaction time and CPU spikes.

Read flow: a client first checks the MemTable, then L0 (scanning all files because L0 is not globally ordered), and finally one file per higher level until the key is found. Because join keys are timestamps and values are large lists, many Get requests result in ReadNull operations that traverse from L0 to L3, inflating read latency.

Both Get and iterator operations (seek/next) trigger reads that may reach the deepest SST levels, leading to 99th‑percentile latencies of tens of milliseconds—unacceptable for real‑time stream processing.
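The read amplification behind those latencies can be sketched from the file distribution quoted earlier: a Get that ultimately misses (ReadNull) probes every L0 file plus one candidate file per non‑empty deeper level. This is a simplified model that ignores bloom filters and the block cache:

```java
public class ReadNullProbes {
    // Files probed by a Get that misses, under the simplified read path above:
    // all of L0, then one candidate file per non-empty deeper level.
    static int probesForMiss(int[] filesPerLevel) {
        int probes = filesPerLevel[0];              // L0 is not globally ordered: check every file
        for (int i = 1; i < filesPerLevel.length; i++) {
            if (filesPerLevel[i] > 0) probes++;     // binary search picks one candidate file
        }
        return probes;
    }

    public static void main(String[] args) {
        int[] dist = {2, 4, 41, 98, 0, 0, 0};       // the TB-scale distribution reported above
        System.out.println(probesForMiss(dist));    // prints 5
    }
}
```

Five file probes per miss may sound small, but each probe can be a disk read, and every miss must descend all the way to L3 before concluding the key is absent.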

Optimization strategy: enable RocksDB BlobDB (KV separation) so that values larger than a threshold are stored in separate blob files, while SST files keep only indexes. This dramatically reduces SST size and level depth.
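KV separation is enabled per column family. A sketch with the RocksDB Java API (a config fragment requiring the rocksdbjni dependency; the 1 KB threshold is illustrative, not the article's production value):

```java
import org.rocksdb.ColumnFamilyOptions;

// Sketch: route large values into blob files, keeping only keys and
// blob references in the SSTs.
ColumnFamilyOptions cfOpts = new ColumnFamilyOptions()
        .setEnableBlobFiles(true)               // turn on BlobDB-style KV separation
        .setMinBlobSize(1024)                   // values >= 1 KB go to blob files (illustrative)
        .setEnableBlobGarbageCollection(true);  // reclaim space from overwritten blob values
```

With values out of the SSTs, compaction rewrites only keys plus blob references, which is why both SST size and level depth shrink.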

After upgrading to RocksDB 7.8.3 (which supports BlobDB) and enabling KV separation, SST file sizes shrank, and most keys stayed in L0/L1. ReadNull latency dropped to sub‑millisecond levels, and seek/next latencies fell to around 1 ms.

CPU spikes were also reduced by ~50% in large‑state jobs. Remaining spikes stem from BlobDB garbage collection, which cannot be disabled but is less frequent than previous compaction‑induced spikes.

For smaller states that still exhibit periodic spikes, an InnerCompaction patch was applied to pre‑compact L0 files before they merge with L1, further mitigating I/O amplification.

Future work includes upgrading the high‑version Flink RocksDB backend, decoupling checkpoint snapshots from compaction, and making KV separation enabled by default with adaptive value‑size thresholds.

Tags: performance optimization, Flink, Streaming, RocksDB, KV Separation, State Backend
Written by Bilibili Tech, which provides introductions and tutorials on Bilibili‑related technologies.
