Design and Evolution of WTable’s Scaling Process Using RocksDB
This article describes how the WTable distributed key‑value store redesigned its scaling workflow around RocksDB's LSM‑tree architecture and slot‑based data distribution. By separating full data migration from incremental migration, the new workflow reduces compaction overhead and achieves high‑speed, low‑impact cluster expansion.
RocksDB, Facebook's open‑source high‑performance key‑value storage engine, is widely used in industry; it also serves as the storage layer of WTable, 58's internally developed distributed KV system.
WTable achieves scalability by partitioning data into slots, each being the smallest unit of migration. A key's slot is determined by hashing ( SlotId = Hash(Key) % N, where N is the number of slots ), and the slot ID is prefixed to the physical key, which ensures that keys belonging to the same slot are stored contiguously and that slots remain ordered within RocksDB's SST files.
The initial scaling approach added new nodes, created a migration plan, and migrated each slot in three phases—full data transfer using a RocksDB iterator snapshot, incremental transfer of binlog‑recorded writes, and finally switching routing information in etcd so the new node serves the slot.
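The original per‑slot flow can be sketched as follows, with in‑memory maps standing in for the RocksDB iterator snapshot, the binlog stream, and the etcd routing table (all function and type names here are illustrative, not WTable's actual API):

```go
package main

import "fmt"

type kv map[string]string

// entry is one binlog record captured while the full transfer was running.
type entry struct{ key, val string }

// fullMigrate copies the slot's snapshot data to the new node (phase 1).
func fullMigrate(src, dst kv) {
	for k, v := range src {
		dst[k] = v
	}
}

// replayBinlog sends writes that arrived during phase 1 (phase 2).
func replayBinlog(dst kv, log []entry) {
	for _, e := range log {
		dst[e.key] = e.val
	}
}

// switchRouting points the slot at the new node, standing in for the
// etcd routing update (phase 3).
func switchRouting(routing map[int]string, slot int, newNode string) {
	routing[slot] = newNode
}

func main() {
	oldNode := kv{"42/user:a": "1", "42/user:b": "2"}
	newNode := kv{}
	routing := map[int]string{42: "old"}

	fullMigrate(oldNode, newNode)                      // phase 1: snapshot copy
	replayBinlog(newNode, []entry{{"42/user:b", "3"}}) // phase 2: catch-up writes
	switchRouting(routing, 42, "new")                  // phase 3: flip routing

	fmt.Println(newNode["42/user:b"], routing[42]) // prints "3 new"
}
```

Note that phases 1 and 2 run back‑to‑back for each slot, so the target node receives ordered bulk writes interleaved with unordered catch‑up writes; this interleaving is exactly what the next section identifies as the compaction problem.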
This method caused problems in production: when migration speed was high, the mixture of ordered full‑data writes and unordered incremental writes triggered heavy compaction across levels, consuming I/O and leading to write stalls, which degraded service performance.
To address these issues, the process was re‑engineered: all full data from every slot is transferred together without interleaved incremental writes, reducing compaction to simple file moves; incremental data is sent afterward at a slower pace; and routing changes are applied only after all slots have completed migration, resulting in three clear stages—full data migration, incremental data migration, and unified routing switch.
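The reordered workflow can be sketched as an orchestration loop; the stage log it returns makes the key change visible, namely that no incremental write reaches the target until every slot's full data has landed (names are illustrative, not WTable's actual API):

```go
package main

import "fmt"

type slotPlan struct{ id int }

// expand runs the redesigned workflow and returns the ordered stage log:
// every slot's full data first (the target node sees only ordered bulk
// writes, so compaction degrades to cheap file moves), then rate-limited
// incremental catch-up per slot, then one unified routing switch at the end.
func expand(plans []slotPlan) []string {
	var log []string
	for _, s := range plans {
		log = append(log, fmt.Sprintf("full:%d", s.id)) // stage 1, all slots
	}
	for _, s := range plans {
		log = append(log, fmt.Sprintf("incr:%d", s.id)) // stage 2, throttled
	}
	log = append(log, "switch-routing:all") // stage 3, applied once
	return log
}

func main() {
	for _, step := range expand([]slotPlan{{1}, {2}, {3}}) {
		fmt.Println(step)
	}
}
```

Compared with the per‑slot loop, the only structural change is moving the stage boundaries outside the slot loop, yet that is what lets the full‑data stage run at full speed while the incremental stage is throttled independently.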
After these improvements, WTable can expand at 50–100 MB/s without impacting online services. Alternative approaches such as transferring SST files via IngestExternalFile were evaluated but rejected because they would break the existing binlog‑based replication mechanism and require extensive additional work.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.