Evolution and Performance Optimization of Tencent Cloud Block Storage (CBS)
Tencent Cloud Block Storage (CBS) has evolved through three generations (apllo, atlas, and HiSTOR), adopting a client-direct distributed architecture, SPDK, RDMA, and a user-space TCP stack to cut latency to the hundred-microsecond level while delivering exabyte-scale capacity, high throughput and IOPS, and reliable multi-copy replication for cloud VM workloads.
This article summarizes a talk by Tencent Cloud storage expert Wang Yinhu on the architecture evolution and high-performance practices of Tencent Cloud Block Storage (CBS). It introduces CBS as a cloud-native distributed block storage service that provides persistent storage for cloud virtual machines.
Definition of CBS: CBS (Cloud Block Storage) is built on distributed storage technology, offering the same functionality as a local hard disk but with higher throughput, reliability (three-copy replication), and availability through automatic failover.
Architecture Evolution: CBS has gone through three generations: CBS apllo, CBS atlas, and the latest, HiSTOR. The first generation reused an existing storage platform not designed for block storage. In 2014, CBS apllo replaced it to handle PB-scale workloads. In 2016, CBS atlas was designed to reduce cost and improve performance, and it has been in production since 2017, supporting exabyte-scale data. The current HiSTOR architecture focuses on extreme latency reduction.
System Components: The storage system consists of three parts: the client layer, the storage cluster, and the control cluster. The client accesses storage directly over the network, eliminating traditional middle-layer bottlenecks.
Performance Dimensions: Three key metrics are discussed: throughput (bandwidth), IOPS, and latency. High-throughput workloads (big data, AI training) require >260 MiB/s bandwidth, while latency is critical for database workloads.
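The three metrics are linked: for a fixed queue depth, lower latency directly means higher IOPS, and bandwidth follows from IOPS times I/O size. A minimal sketch of this relationship (Little's law; the numbers are illustrative, not from the talk):

```python
# Sketch of how throughput, IOPS, and latency relate (Little's law).
# Illustrative only; the figures below are assumptions, not CBS specs.

def iops(queue_depth: int, latency_s: float) -> float:
    """Steady-state IOPS for a given queue depth and per-I/O latency."""
    return queue_depth / latency_s

def bandwidth_mib_s(iops_value: float, io_size_bytes: int) -> float:
    """Throughput in MiB/s implied by an IOPS rate and I/O size."""
    return iops_value * io_size_bytes / (1024 * 1024)

# One outstanding 4 KiB I/O at 100 microseconds of latency:
q1 = iops(queue_depth=1, latency_s=100e-6)   # 10,000 IOPS
bw = bandwidth_mib_s(q1, 4096)               # ~39 MiB/s
```

This is why hundred-microsecond latency matters so much for database workloads: at queue depth 1, IOPS is entirely latency-bound.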
Latency Optimization:
Distributed layer: use of consistent‑hash routing with a “lazy route sync” mechanism that pushes routing updates to storage nodes first, then to clients on demand.
Client side: replace iSCSI with SPDK (Intel’s Storage Performance Development Kit) to achieve zero‑copy, single‑threaded data paths.
Storage engine: CEDA event-driven framework and a data pool that avoids data copies, enabling microsecond-level processing.
Network layer: two products, the Extreme-Speed Cloud Disk (极速型云盘, RDMA-based) and the Enhanced Cloud Disk (增强型云盘, built on the user-space TCP stack ZTCP). RDMA provides hardware-offloaded data transfer, while ZTCP moves TCP processing to user space with zero-copy.
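The consistent-hash routing described at the distributed layer can be sketched in a few lines. This is a generic illustration, not Tencent's implementation; the node names and virtual-node count are hypothetical:

```python
# Hedged sketch of consistent-hash routing, as a stand-in for the
# CBS routing layer described above. Names are illustrative.
import hashlib
from bisect import bisect_right

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets several virtual points on the ring
        # to smooth the distribution of keys across nodes.
        self.ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        # First 8 bytes of MD5 as an integer position on the ring.
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, block_id: str) -> str:
        # Walk clockwise to the first virtual point at or after the key.
        i = bisect_right(self.keys, self._h(block_id)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("vol42:block7")
```

The "lazy route sync" idea layers on top of this: storage nodes learn new ring versions first, and a client discovers a stale route only when a node rejects its request, at which point it pulls the update on demand.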
The talk also covers the trade‑offs between RDMA and TCP, the use of different I/O paths for small (send/recv) and large (read/write) I/O, and the handling of congestion control in massive clusters.
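The size-based path split mentioned above can be illustrated with a tiny dispatcher: small I/O rides inline in a message (send/recv-style), while large I/O uses a zero-copy transfer (read/write-style). The threshold value here is an assumption for illustration, not a CBS parameter:

```python
# Illustrative sketch of choosing a transfer path by I/O size.
# SMALL_IO_THRESHOLD is a hypothetical cutoff, not from the talk.

SMALL_IO_THRESHOLD = 16 * 1024  # bytes

def choose_path(io_size: int) -> str:
    """Pick the transfer path for one I/O request."""
    if io_size <= SMALL_IO_THRESHOLD:
        # Payload is carried inside the message itself (send/recv),
        # saving a round trip for latency-sensitive small I/O.
        return "inline"
    # The remote side pulls or pushes the buffer directly
    # (RDMA read/write), avoiding copies for bulk data.
    return "zero-copy"
```

The trade-off is the usual one: inline transfer minimizes round trips for small requests, while zero-copy avoids CPU and memory-bandwidth cost for large ones.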
Q&A Highlights:
Cost and quality are the main bottlenecks when scaling from PB to EB levels.
Consistent‑hash migration can be minimized by splitting only the new partition.
High‑concurrency small I/O does increase client CPU usage, but resources are isolated from the user’s VM.
Cloud disks can eventually replace physical disks once latency is sufficiently low.
Data replication uses a proprietary algorithm, not Raft.
Disaster recovery relies on multi‑copy metadata, dispersed placement, and snapshot capabilities.
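The Q&A point about minimizing consistent-hash migration can be made concrete with a small range-split example. This is a generic illustration of the principle (only keys in the carved-out sub-range move), not Tencent's actual scheme:

```python
# Sketch of "split only the new partition": dividing one hash range
# migrates only the keys in the carved-out half. Names are illustrative.

def owner(h: int, partitions: list[tuple[int, str]]) -> str:
    """partitions: sorted (upper_bound, node) ranges over the hash space."""
    for upper, node in partitions:
        if h < upper:
            return node
    raise ValueError("hash out of range")

# Hash space [0, 100) split between two nodes.
before = [(50, "node-a"), (100, "node-b")]
# node-a's range [0, 50) is split; the new node-c takes [25, 50).
after = [(25, "node-a"), (50, "node-c"), (100, "node-b")]

moved = [h for h in range(100) if owner(h, before) != owner(h, after)]
# Only hashes 25..49 migrate; node-b's data is untouched.
```

Because only the split sub-range changes owner, a cluster can grow incrementally without mass data movement, which is what keeps rebalancing cost bounded at EB scale.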
Overall, the presentation provides a comprehensive view of how Tencent Cloud has engineered CBS to achieve hundred‑microsecond latency, high throughput, and reliable storage for large‑scale cloud workloads.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.