
Technical Overview of Tencent Cloud CBS Data Scheduling System

The Tencent Cloud CBS data scheduling system has evolved from a simple snapshot service into a highly concurrent, low‑latency platform for backups, image creation, and cloud‑disk migration. It relies on COW/ROW mechanisms, multi‑version snapshots, rapid rollback, hot‑data caching, horizontal scaling, fault‑tolerant task switching, cross‑region replication, and seamless disk migration to keep storage reliable and fast. Planned work includes AI‑driven scheduling and ultra‑low‑latency online migration.

Tencent Cloud Developer

This document summarizes the technical sharing of Yang Guangchao, a Tencent Cloud storage expert, on the CBS (Cloud Block Storage) data scheduling system.

1. Evolution of CBS Data Scheduling System – The system started in 2015 as a simple snapshot service. As business scale grew and latency requirements became stricter, it evolved to support data protection, cloud‑server image production, and online cloud‑disk migration.

2. Typical Business Scenarios and Challenges

Data protection – daily backups, manual and periodic snapshots.

Image production for cloud servers – creating images from snapshots and batch deploying servers.

Cloud‑disk migration – moving disks between storage warehouses without affecting user workloads.

Each scenario faces challenges such as latency sensitivity, high concurrency, and fault tolerance.

3. Key Technologies

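COW and ROW – Two snapshot mechanisms. COW (Copy‑On‑Write) copies the original block to the snapshot area before overwriting it, so the first write to a block after a snapshot costs an extra copy (write amplification). ROW (Redirect‑On‑Write) writes the new data to a new location and updates a pointer, which avoids that extra copy and favors write‑intensive workloads.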
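The difference in write cost can be illustrated with a minimal in-memory sketch. The class names, the `physical_writes` counter, and the list-based "disk" are all hypothetical and are not taken from the CBS implementation; the ROW volume here redirects every write, which is a simplification.

```python
class CowVolume:
    """Copy-on-write: the live block is overwritten in place, but its old
    content is first copied into the snapshot area."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = None          # block index -> preserved original data
        self.physical_writes = 0

    def take_snapshot(self):
        self.snapshot = {}

    def write(self, index, data):
        # First write to a block after the snapshot: preserve the old data.
        if self.snapshot is not None and index not in self.snapshot:
            self.snapshot[index] = self.blocks[index]
            self.physical_writes += 1  # the extra copy (write amplification)
        self.blocks[index] = data
        self.physical_writes += 1

    def read_snapshot(self, index):
        return self.snapshot.get(index, self.blocks[index])


class RowVolume:
    """Redirect-on-write: new data goes to a new location and only the
    pointer of the live volume changes."""

    def __init__(self, blocks):
        self.store = list(blocks)                  # append-only physical store
        self.live_map = list(range(len(blocks)))   # logical -> physical
        self.snapshot_map = None
        self.physical_writes = 0

    def take_snapshot(self):
        self.snapshot_map = list(self.live_map)    # freeze the pointers

    def write(self, index, data):
        self.store.append(data)
        self.live_map[index] = len(self.store) - 1
        self.physical_writes += 1

    def read(self, index):
        return self.store[self.live_map[index]]

    def read_snapshot(self, index):
        return self.store[self.snapshot_map[index]]
```

In this sketch, writing the same block twice after a snapshot costs three physical writes on the COW volume (one copy plus two writes) and two on the ROW volume, while both still return the original data when the snapshot is read.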

Multi‑Version ROW – Assigns version numbers to snapshots, enabling incremental backup and precise data reconstruction.
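One way to picture version numbering is the sketch below, which assumes that every write is tagged with the volume's current version and that taking a snapshot seals that version and starts the next one. `VersionedVolume` and its methods are hypothetical names chosen for illustration, not the CBS data structures.

```python
class VersionedVolume:
    """Toy multi-version volume: each write is stored under the version
    that was current when it happened."""

    def __init__(self):
        self.version = 1
        self.writes = {}   # block index -> {version: data}

    def write(self, index, data):
        self.writes.setdefault(index, {})[self.version] = data

    def take_snapshot(self):
        """Seal the current version and return its number."""
        sealed = self.version
        self.version += 1
        return sealed

    def incremental(self, snap):
        """Blocks written during exactly this version (the incremental part)."""
        return {i: v[snap] for i, v in self.writes.items() if snap in v}

    def reconstruct(self, snap):
        """Full view at a snapshot: per block, the newest write whose
        version is not later than the snapshot's version."""
        view = {}
        for index, versions in self.writes.items():
            eligible = [v for v in versions if v <= snap]
            if eligible:
                view[index] = versions[max(eligible)]
        return view


vol = VersionedVolume()
vol.write(0, "a1")
vol.write(1, "b1")
s1 = vol.take_snapshot()
vol.write(1, "b2")
s2 = vol.take_snapshot()
vol.write(0, "a3")   # written after s2, so invisible to both snapshots
```

Here `vol.incremental(s2)` contains only block 1, while `vol.reconstruct(s2)` combines block 0 from the first version with block 1 from the second.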

Snapshot Overview – Combines full and incremental backup; snapshot creation is designed to be completed in seconds.

Rollback Process – Uses bitmap metadata to identify which blocks need restoration, merging data from the target snapshot and its dependencies.
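The merge step can be sketched as a walk from the target snapshot back through its dependencies, taking each block from the newest snapshot whose bitmap marks it as present. The `Snapshot` record and the `rollback` function are hypothetical, and a real bitmap would be a packed bit array rather than a Python list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    bitmap: list                          # bitmap[i] is True if block i is stored here
    data: dict                            # block index -> bytes, only for set bits
    parent: Optional["Snapshot"] = None   # the snapshot this one depends on


def rollback(target, num_blocks, empty=b""):
    """Rebuild the disk content as of `target` by merging it with its ancestors."""
    restored = [empty] * num_blocks
    resolved = [False] * num_blocks
    snap = target
    while snap is not None and not all(resolved):
        for i in range(num_blocks):
            # Newer snapshots are visited first, so they win over older ones.
            if not resolved[i] and snap.bitmap[i]:
                restored[i] = snap.data[i]
                resolved[i] = True
        snap = snap.parent
    return restored


full = Snapshot(bitmap=[True, True, False], data={0: b"A", 1: b"B"})
incr = Snapshot(bitmap=[False, True, False], data={1: b"B2"}, parent=full)
disk = rollback(incr, num_blocks=3)
```

Block 1 comes from the incremental snapshot, block 0 from the full snapshot it depends on, and block 2 was never written, so it stays empty.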

Image Production via Rollback – Leverages snapshot rollback to create images quickly without a full download, so servers can boot within seconds.

Hot Data Access Strategy – Caches frequently accessed image blocks in the transmission layer to reduce latency for batch server provisioning.
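A cache of this kind could look like the sketch below. The talk does not state the eviction policy, so least-recently-used eviction is an assumption here, and `HotBlockCache` and `fetch_from_storage` are hypothetical names.

```python
from collections import OrderedDict


class HotBlockCache:
    """LRU cache for image blocks, keyed by (image_id, block_index)."""

    def __init__(self, capacity, fetch_from_storage):
        self.capacity = capacity
        self.fetch = fetch_from_storage   # callable(image_id, block_index) -> bytes
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, image_id, block_index):
        key = (image_id, block_index)
        if key in self.blocks:
            self.blocks.move_to_end(key)       # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        data = self.fetch(image_id, block_index)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block
        return data


backend_reads = []

def slow_backend(image_id, block_index):
    backend_reads.append((image_id, block_index))
    return b"block-%d" % block_index

cache = HotBlockCache(capacity=2, fetch_from_storage=slow_backend)
for _ in range(10):                 # ten servers booting from the same image
    cache.read("img-1", 0)
```

When ten servers read the same boot block, only the first read reaches the storage backend; the other nine are served from the cache.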

Horizontal Scaling – Splits large regional deployments into smaller zones, uses static (heartbeat) and dynamic (load‑aware) balancing for both control and transmission layers, and replicates data to avoid hotspots.
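The two balancing modes might be contrasted as follows. This sketch assumes that static balancing considers only liveness (as reported by heartbeats) and that dynamic balancing additionally looks at a reported load figure; the node dictionaries and both function names are hypothetical.

```python
import itertools


def static_balancer(nodes):
    """Round-robin over nodes that are alive; load is ignored."""
    alive = [n for n in nodes if n["alive"]]
    return itertools.cycle(alive)


def dynamic_pick(nodes):
    """Pick the alive node with the lowest reported load.
    Raises ValueError if no node is alive."""
    alive = [n for n in nodes if n["alive"]]
    return min(alive, key=lambda n: n["load"])


nodes = [
    {"name": "n1", "alive": True, "load": 70},
    {"name": "n2", "alive": False, "load": 5},
    {"name": "n3", "alive": True, "load": 20},
]
round_robin = static_balancer(nodes)
static_choices = [next(round_robin)["name"] for _ in range(3)]
dynamic_choice = dynamic_pick(nodes)["name"]
```

The static balancer alternates between `n1` and `n3` regardless of load, while the dynamic pick prefers `n3`; the dead node `n2` is skipped by both even though its load is lowest.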

Task Smooth Switching – Detects node failures via heartbeat, failure rate, and monitoring; switches tasks to healthy nodes to maintain I/O continuity.
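A simplified version of this detection and switch-over logic is sketched below. It covers only two of the signals mentioned, heartbeat age and failure rate, and all names and thresholds (`TaskScheduler`, `heartbeat_timeout`, `max_failure_rate`) are assumptions made for illustration.

```python
class TaskScheduler:
    """Tracks node health and moves tasks away from unhealthy nodes."""

    def __init__(self, heartbeat_timeout, max_failure_rate):
        self.heartbeat_timeout = heartbeat_timeout
        self.max_failure_rate = max_failure_rate
        self.nodes = {}   # node name -> {"last_beat", "ok", "failed"}
        self.tasks = {}   # task id -> node name

    def heartbeat(self, node, now):
        stats = self.nodes.setdefault(node, {"last_beat": now, "ok": 0, "failed": 0})
        stats["last_beat"] = now

    def report(self, node, success):
        self.nodes[node]["ok" if success else "failed"] += 1

    def is_healthy(self, node, now):
        stats = self.nodes[node]
        if now - stats["last_beat"] > self.heartbeat_timeout:
            return False
        total = stats["ok"] + stats["failed"]
        return total == 0 or stats["failed"] / total <= self.max_failure_rate

    def assign(self, task, node):
        self.tasks[task] = node

    def switch_failed_tasks(self, now):
        """Reassign every task on an unhealthy node to the healthy node
        that currently holds the fewest tasks. Returns the moves made."""
        healthy = [n for n in self.nodes if self.is_healthy(n, now)]
        moved = {}
        if not healthy:
            return moved
        for task, node in list(self.tasks.items()):
            if node not in healthy:
                target = min(
                    healthy,
                    key=lambda n: sum(1 for t in self.tasks.values() if t == n),
                )
                self.tasks[task] = target
                moved[task] = target
        return moved


sched = TaskScheduler(heartbeat_timeout=5.0, max_failure_rate=0.5)
sched.heartbeat("node-a", now=0.0)
sched.heartbeat("node-b", now=0.0)
sched.assign("copy-1", "node-a")
sched.heartbeat("node-b", now=10.0)          # node-a has gone silent
moved = sched.switch_failed_tasks(now=10.0)
```

After `node-a` misses its heartbeats, the task `copy-1` is moved to `node-b`. A production system would also need to resume the task from its last confirmed offset, which this sketch leaves out.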

Cross‑Region Image Replication – Uses separate control planes in each region to transfer image metadata, ensuring data safety by separating transfer and verification.

Seamless Disk Migration – Introduces an I/O access layer that isolates user I/O from background migration I/O. The layer tracks three block states (unmigrated, migrating, migrated) and writes to both source and destination during migration to guarantee data integrity.
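The state handling can be sketched as a small routing layer. The exact routing rules are not given in the talk, so this sketch assumes that unmigrated blocks are written to the source only, that blocks in the migrating or migrated state are written to both sides, and that reads switch to the destination once a block is migrated. `MigrationAccessLayer` and its methods are hypothetical, and the sketch is single-threaded; a real system must also order user writes against the background copy.

```python
from enum import Enum


class BlockState(Enum):
    UNMIGRATED = 1
    MIGRATING = 2
    MIGRATED = 3


class MigrationAccessLayer:
    """Routes user I/O while a background job copies blocks to the destination."""

    def __init__(self, source):
        self.source = list(source)
        self.dest = [None] * len(source)
        self.state = [BlockState.UNMIGRATED] * len(source)

    # Background migration I/O
    def start_block(self, i):
        self.state[i] = BlockState.MIGRATING
        self.dest[i] = self.source[i]          # background copy

    def finish_block(self, i):
        self.state[i] = BlockState.MIGRATED

    # User I/O
    def write(self, i, data):
        self.source[i] = data                  # source stays a valid fallback
        if self.state[i] is not BlockState.UNMIGRATED:
            self.dest[i] = data                # dual write once copying has begun

    def read(self, i):
        if self.state[i] is BlockState.MIGRATED:
            return self.dest[i]
        return self.source[i]


layer = MigrationAccessLayer(["a", "b"])
layer.start_block(0)
layer.write(0, "A")      # arrives while block 0 is being migrated
layer.finish_block(0)
layer.write(1, "B")      # block 1 has not been touched by the migration yet
```

The write that arrives while block 0 is in the migrating state lands on both disks, so the destination is not left with stale data when the block is marked migrated. Block 1 is written to the source only and will be picked up by the background copy later.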

Data Reliability – MD5 checksums, cross‑region backup, version‑based write protection, and careful reclamation of old snapshot data ensure data correctness.
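The checksum part of this list can be illustrated with Python's standard `hashlib` module. The `transfer_block` helper and the `send` callable are hypothetical stand-ins for whatever actually moves a block between nodes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a block."""
    return hashlib.md5(data).hexdigest()


def transfer_block(data: bytes, send) -> bytes:
    """Send a block and verify what arrived against the digest computed
    before sending. `send` is any callable that returns the received bytes."""
    expected = md5_hex(data)
    received = send(data)
    if md5_hex(received) != expected:
        raise ValueError("checksum mismatch: block corrupted in transit")
    return received


def flip_first_byte(data: bytes) -> bytes:
    """Simulates corruption on the wire."""
    return bytes([data[0] ^ 0xFF]) + data[1:]


ok = transfer_block(b"snapshot-block", lambda d: d)

try:
    transfer_block(b"snapshot-block", flip_first_byte)
    corrupted_detected = False
except ValueError:
    corrupted_detected = True
```

An intact transfer returns the block unchanged, while the corrupted transfer is rejected instead of being silently stored.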

Future Directions

Support for block, file, and database storage scenarios.

Ultra‑low‑latency online migration for high‑performance disks.

RPO (recovery point objective) measured in seconds, for finer‑grained data protection.

AI‑driven intelligent scheduling for resource risk detection and balanced storage pool utilization.

Q&A Highlights

Empty blocks are not migrated during a full snapshot.

Deployment has shifted from large‑region to small‑region models.

Migration speed is dynamically controlled based on available bandwidth.

Cross‑region image transfer has no special network requirements beyond internal bandwidth limits.

The control plane has active‑standby nodes; the scheduling and transmission clusters are stateless.
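The answer about migration speed could translate into a throttle like the one below. The formula, the 20% reserve, and the function name `migration_rate_mbps` are assumptions for illustration; the talk only states that speed follows available bandwidth.

```python
def migration_rate_mbps(link_capacity_mbps, user_traffic_mbps,
                        reserve_percent=20, floor_mbps=0):
    """Bandwidth budget for background migration.

    Takes what user traffic leaves unused, keeps `reserve_percent` of it
    free as headroom, and never drops below `floor_mbps`.
    """
    available = max(link_capacity_mbps - user_traffic_mbps, 0)
    budget = available * (100 - reserve_percent) // 100
    return max(budget, floor_mbps)


quiet = migration_rate_mbps(1000, 400)                 # plenty of spare bandwidth
busy = migration_rate_mbps(1000, 1200)                 # link is saturated
trickle = migration_rate_mbps(1000, 990, floor_mbps=10)
```

Recomputing this budget periodically lets migration speed up when the link is quiet and back off when user traffic rises; the optional floor keeps a migration from stalling completely.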

Tags: migration, High Availability, snapshot, cloud storage, block storage, data scheduling