
Remote StateBackend for Flink: Design, Optimizations, and Cloud‑Native Migration

To support Bilibili’s cloud‑native migration, the team built a RemoteStateBackend that moves Flink’s keyed state into the Taishan KV store. The design combines deterministic KeyGroup placement, per‑shard snapshots, asynchronous write batching, off‑heap caching with Bloom‑filter read filtering, and a fixed‑size memory model. Together these reduce checkpoint overhead, improve disk utilization, and accelerate rescaling for more than one hundred production jobs.

Bilibili Tech

Amid industry‑wide cost‑reduction and efficiency efforts, Bilibili is migrating its real‑time and online services to a cloud‑native architecture with unified resource pools and scheduling. Workloads differ in their resource requirements: the existing online resource pool lacks storage and I/O capacity, while Flink, as a stateful compute engine, depends on strong storage support.

The main pain points are low disk utilization on Flink machines (most tasks have little or no keyed state) and slow rescaling of large‑state jobs due to costly state redistribution.

To address these issues, a RemoteStateBackend is proposed, moving Flink state to a remote KV store (Taishan) that provides snapshot capabilities and aligns with cloud‑native goals. Taishan is a high‑reliability, high‑performance storage system built on RocksDB and SparrowDB with Raft consensus.

State Switching Guarantee: KeyGroupId is computed as MathUtils.murmurHash(key.hashCode()) % maxParallelism, ensuring deterministic placement. During a rescale, only KeyGroup metadata moves, not the entire state.
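The placement rule can be sketched as follows. This is not the exact body of Flink’s MathUtils.murmurHash; the finalizer below is a standard MurmurHash3‑style mixer used here only to illustrate why the assignment is deterministic and uniform.

```java
public class KeyGroupAssignment {
    // MurmurHash-style finalization (fmix32): scrambles the key's
    // hashCode so KeyGroup placement is uniform. Sketch only, not the
    // exact implementation of Flink's MathUtils.murmurHash.
    static int murmurHash(int code) {
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        return code;
    }

    // Deterministic placement: the same key always maps to the same
    // KeyGroup for a fixed maxParallelism, independent of the current
    // parallelism, so rescaling only re-assigns KeyGroups to subtasks.
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return (murmurHash(key.hashCode()) & 0x7fffffff) % maxParallelism;
    }
}
```

Because the mapping depends only on the key and maxParallelism, a rescale never rehashes state: it only changes which subtask owns which KeyGroup.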

Backend Topology: The KeyedStateBackend manages various internal state types (InternalValueState, InternalMapState, etc.) and priority‑queue state. Checkpointing creates one snapshot per shard, with snapshot IDs matching Flink checkpoint IDs.

TaishanStateBackend Architecture: The design maps each KeyGroupId to a Taishan shard, keeps the shard count equal to Flink’s maxParallelism, creates one Taishan table per operator, prefixes keys with operator IDs, and performs per‑shard snapshot and create‑restore operations during checkpoints and restores.
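A minimal sketch of that key layout, assuming hypothetical names (TaishanKeyLayout, the '#' separator) that are not Bilibili’s actual API:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class TaishanKeyLayout {
    // One Taishan shard per KeyGroup: shard count equals Flink's
    // maxParallelism, so the KeyGroupId doubles as the shard index.
    static int shardFor(int keyGroupId) {
        return keyGroupId;
    }

    // Keys are prefixed with the operator identifier so that one table
    // per operator stays self-describing; the '#' separator is purely
    // illustrative.
    static byte[] encode(String operatorId, byte[] userKey) {
        byte[] prefix = operatorId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(prefix.length + 1 + userKey.length);
        buf.put(prefix).put((byte) '#').put(userKey);
        return buf.array();
    }
}
```

Because shards align one-to-one with KeyGroups, a per‑shard snapshot at checkpoint time covers exactly the state one KeyGroup owns.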

Optimizations:

1) Write Optimization: Asynchronous batching of put requests via a BlockingQueue reduces network calls; a flush occurs when the batch‑size or latency threshold is reached, or at checkpoint.

2) Read Optimization: A cache layer (initially Caffeine, later off‑heap OHC) stores recently accessed state to cut RPC traffic. Reads that would return null are filtered out with an off‑heap Bloom‑filter‑like structure (OffHeapBloomFilter) that also handles TTL.
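The filtering decision can be illustrated with an on‑heap sketch: every written key is added to a Bloom filter, and a read whose key is definitely absent returns null without a remote lookup. The real OffHeapBloomFilter lives off‑heap and handles TTL; this sketch omits both and shows only the core idea.

```java
import java.util.Arrays;
import java.util.BitSet;

public class ReadNullFilter {
    private final BitSet bits;
    private final int size;

    ReadNullFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Three bit positions derived from two hashes (double hashing).
    private int[] positions(byte[] key) {
        int h1 = Arrays.hashCode(key);
        int h2 = h1 * 0x9e3779b1;  // second hash via multiplicative mixing
        return new int[] {
            Math.floorMod(h1, size),
            Math.floorMod(h1 + h2, size),
            Math.floorMod(h1 + 2 * h2, size)
        };
    }

    void add(byte[] key) {
        for (int p : positions(key)) bits.set(p);
    }

    // false => the key was never written: skip the remote read entirely.
    // true  => the key may exist (false positives possible), so do the RPC.
    boolean mightContain(byte[] key) {
        for (int p : positions(key)) if (!bits.get(p)) return false;
        return true;
    }
}
```

The filter trades a small false‑positive rate (a wasted RPC) for never returning a false negative, which is what makes the short‑circuit safe.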

3) Memory Model Optimization: The off‑heap OHC cache replaces on‑heap Caffeine and is shared across sub‑tasks within a slot, eliminating excessive GC overhead. The hash table size is fixed, rehashing is disabled, and eviction logic respects state expiration.
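The cache policy can be modeled on‑heap as a fixed‑capacity LRU whose reads drop expired entries, mirroring eviction that respects state TTL. This is only a sketch of the policy: the production backend keeps the structure off‑heap (OHC) and shares it across sub‑tasks in a slot, neither of which is modeled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedSizeStateCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expireAt;
        Entry(V value, long expireAt) { this.value = value; this.expireAt = expireAt; }
    }

    private final LinkedHashMap<K, Entry<V>> map;
    private final long ttlMs;

    FixedSizeStateCache(int capacity, long ttlMs) {
        this.ttlMs = ttlMs;
        // accessOrder=true gives LRU ordering; removeEldestEntry caps the
        // size, so the table never grows or rehashes past its budget.
        this.map = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > capacity;
            }
        };
    }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMs));
    }

    // Expired entries are dropped on read: eviction respects state TTL.
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expireAt) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

Fixing the table size up front is what makes the memory footprint predictable per slot, at the cost of eviction pressure instead of growth.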

These changes make checkpoints lighter (metadata‑only upload), achieve true compute‑storage separation, and dramatically speed up task rescaling.

Current Status & Future Work: Over 100 production Flink jobs have been switched to TaishanStateBackend, achieving storage‑compute separation and faster rescaling. Remaining challenges include high‑QPS sparse‑key scenarios, large key/value cases causing write stalls, and the need for tiered state storage that combines off‑heap cache, local SSD, and remote KV store.

Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.