Architecture and Design of Tencent Cloud CynosDB and CynosFS
Tencent Cloud’s CynosDB is built on a shared‑storage, primary‑multiple‑replica architecture atop CynosFS, a user‑space distributed file system. It separates compute from elastic block storage, sinks the write‑ahead log to the storage layer for asynchronous replay, and uses MVCC over Raft‑replicated segments to deliver highly available, scalable, and cost‑effective MySQL/PostgreSQL‑compatible performance.
CynosDB is Tencent Cloud's next‑generation distributed database, 100% compatible with MySQL and PostgreSQL, supporting elastic storage expansion, a primary‑multiple‑replica shared‑data architecture, and performance that surpasses native MySQL and PostgreSQL. CynosDB adopts a shared‑storage architecture, whose elastic scaling and cost‑effectiveness are built on CynosFS – a user‑space distributed file system developed by Tencent Cloud.
Challenges and Responses
CynosDB follows a public‑cloud‑native design. The core challenge is achieving efficient, stable elastic scheduling for storage, which is the foundation of high availability in public‑cloud products. While compute and network already have mature elastic scheduling, storage lacks such mechanisms because data scheduling costs are high. To pool resources for a database product, two requirements must be met:
Separation of storage and compute: Compute resources (CPU and memory) can be pooled using containers or VMs.
Distributed storage: Data is split into standardized blocks and managed by a distributed scheduler, enabling elastic capacity and I/O provisioning.
The traditional VM + cloud‑disk architecture suffers from heavy network I/O and non‑shared storage between primary and replicas, leading to high latency and cost. CynosDB addresses these issues with two key designs:
Log sinking: The Write‑Ahead Log (WAL) is logically the complete record of data changes. CynosDB sinks the WAL to the storage layer, so the database instance only writes the WAL, eliminating page writes and reducing network I/O. The WAL also serves as the Raft log for multi‑replica synchronization.
Shared storage for primary and replicas: Primary and replica instances share a single storage dataset, further reducing network traffic and storage consumption.
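The effect of log sinking can be sketched in a few lines: the database instance ships only WAL records over the network, and pages are never written by the instance at all. All class and field names below are illustrative assumptions, not CynosDB's actual interfaces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WalRecord:
    lsn: int          # log sequence number
    page_id: int      # page the change applies to
    payload: bytes    # physical change description

class StorageService:
    """Sketch: receives WAL only; pages are materialized later by replay."""
    def __init__(self) -> None:
        self.log: List[WalRecord] = []

    def append_wal(self, rec: WalRecord) -> None:
        self.log.append(rec)

class DbInstance:
    """Sketch of the compute side: writes WAL, never pages."""
    def __init__(self, storage: StorageService) -> None:
        self.storage = storage
        self.next_lsn = 1

    def update_page(self, page_id: int, payload: bytes) -> int:
        rec = WalRecord(self.next_lsn, page_id, payload)
        self.next_lsn += 1
        self.storage.append_wal(rec)  # only the log crosses the network
        return rec.lsn
```

Because the WAL is logically a complete record of all changes, the storage layer can reconstruct any page from it, which is what makes the page write path removable.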
The optimized architecture is illustrated in the following diagram:
Key components of the architecture include:
DB Engine: The database engine supporting one primary and multiple replicas.
Distributed File System (CynosFS): A user‑space distributed file system that translates file read/write requests into block operations.
LOG/BLOCK API: Storage Service provides separate write‑ahead‑log and block read/write interfaces.
DB Cluster Manager: Handles HA management for the primary‑multiple‑replica DB cluster.
Storage Service: Manages log processing, asynchronous block replay, multi‑version reads, and backs up WAL to cold storage.
Segment (Seg): The smallest unit (≈10 GB) managed by Storage Service, replicated across nodes via Raft (forming a Segment Group).
Pool: A logical collection of Segment Groups presented as a continuous block device to the Distributed File System.
Storage Cluster Manager: Oversees HA scheduling for Storage Service and Segment Groups, maintaining the Pool‑to‑SegmentGroup mapping.
Cold Backup Service: Performs incremental backup of WAL logs, enabling generation of full and differential backups.
All modules except the DB Engine and DB Cluster Manager constitute the user‑space distributed file system named CynosFS.
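The relationship between Pools and Segments can be illustrated with a small address‑translation sketch: a Pool maps byte offsets on its virtual block device onto ~10 GB Segment Groups. The names here are assumptions for illustration only.

```python
SEG_SIZE = 10 * 1024**3  # ~10 GB per Segment, as described above

class Pool:
    """Sketch: presents an ordered list of Segment Groups as one
    continuous block device."""
    def __init__(self, segment_groups):
        self.segment_groups = segment_groups  # e.g. ["sg0", "sg1", ...]

    def locate(self, offset: int):
        """Translate a byte offset on the virtual device into
        (segment group, offset within that segment)."""
        idx = offset // SEG_SIZE
        return self.segment_groups[idx], offset % SEG_SIZE
```

The Storage Cluster Manager maintains exactly this Pool‑to‑SegmentGroup mapping, so growing a Pool is just appending Segment Groups to the list.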
Log Sinking and Asynchronous Replay
CynosDB adopts the “log‑as‑database” concept from the AWS Aurora paper. WAL is sunk to the Storage Service, which asynchronously applies the log to the corresponding data blocks, reducing write I/O and providing MVCC read capability. The write flow consists of:
1. Receive the update log and persist it to disk, triggering step 2.
2. Initiate Raft log replication.
3. Raft majority commits, confirming write success.
4. Asynchronously attach the log to the update chain of the target data block.
5. Merge the update chain into the data block.
6. Back up the log to the cold‑backup system.
7. Reclaim obsolete logs.
This process creates a per‑block update chain that supports multiple versions, enabling MVCC. Storage Service also offloads CRC calculations to its CPUs.
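Steps 4 and 5 above, plus the multi‑version read they enable, can be sketched as a per‑block update chain. This is a minimal illustration under the assumption of simple last‑writer‑wins payloads; the real Storage Service applies physical log records.

```python
from collections import defaultdict

class BlockStore:
    """Sketch: per-block update chains supporting multi-version reads."""
    def __init__(self):
        self.base = {}                   # page_id -> (base_lsn, payload)
        self.chains = defaultdict(list)  # page_id -> [(lsn, payload)]

    def attach(self, page_id, lsn, payload):
        """Step 4: asynchronously attach a log to the block's chain."""
        self.chains[page_id].append((lsn, payload))

    def merge(self, page_id, up_to_lsn):
        """Step 5: merge chain entries <= up_to_lsn into the base block."""
        keep = []
        for lsn, payload in self.chains[page_id]:
            if lsn <= up_to_lsn:
                self.base[page_id] = (lsn, payload)
            else:
                keep.append((lsn, payload))
        self.chains[page_id] = keep

    def read(self, page_id, read_lsn):
        """Return the newest version of the page visible at read_lsn."""
        version = self.base.get(page_id)
        for lsn, payload in self.chains[page_id]:
            if lsn <= read_lsn:
                version = (lsn, payload)
        return version
```

A reader at an older LSN keeps seeing the older version even after newer logs are attached, which is exactly the property the replica read path relies on.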
MVCC Implementation
CynosFS’s MVCC underpins CynosDB’s primary‑multiple‑replica architecture. Key concepts include:
Mini‑transaction (MTR): The smallest unit guaranteeing ACID for page modifications (e.g., a B+‑tree insert that modifies up to three pages).
Write‑Ahead Log (WAL): Physical log of binary modifications to data blocks.
Log Sequence Number (LSN): Monotonically increasing identifier for each log record.
Consistency Point LSN (CPL): The LSN of the last log record of a completed MTR, representing a consistent data state.
Segment Group Complete LSN (SGCL): The LSN up to which all log records in a Segment Group have been durably persisted (its Raft CommitIndex).
Pool Complete LSN (PCL): The LSN up to which all log records in the Pool have been persisted across its Segment Groups.
Pool Consistency Point LSN (PCPL): The greatest CPL ≤ PCL, defining the MVCC read point.
The write path aggregates page modifications into Pool‑level logs with increasing LSNs, distributes them to Segment Groups, and advances PCL/PCPL as SGCLs are committed. Figures 4‑5 (omitted) illustrate PCL and PCPL progression.
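How PCL and PCPL advance from the definitions above can be sketched as follows. This is an illustrative simplification that assumes every issued LSN is known and persistence is reported per record; the real system derives the contiguous prefix from each Segment Group's SGCL.

```python
def compute_pcl(issued_lsns, persisted_lsns):
    """PCL sketch: the largest LSN L such that every issued record
    with LSN <= L has been persisted by its Segment Group."""
    pcl = 0
    for lsn in sorted(issued_lsns):
        if lsn in persisted_lsns:
            pcl = lsn
        else:
            break  # a gap stops the contiguous persisted prefix
    return pcl

def compute_pcpl(pcl, cpls):
    """PCPL: the greatest consistency point (CPL) not exceeding the PCL."""
    eligible = [c for c in cpls if c <= pcl]
    return max(eligible) if eligible else 0
```

Note that a persisted record beyond a gap (LSN 5 below) does not advance the PCL, and the PCPL can trail the PCL when the newest persisted logs sit mid‑MTR.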
The read path on replica instances uses the PCPL to determine the latest consistent snapshot (the Read Point LSN). Replicas apply incoming logs only to pages resident in their buffer pools; logs for other pages are discarded. The minimum Read Point LSN across all replicas (Min‑RPL) determines how old a version Storage Service must continue to serve.
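The replica-side behavior can be sketched as a buffer pool that applies only the logs it can use and tracks its read point from the PCPL carried with the log stream. Class and method names are assumptions for illustration.

```python
class ReplicaBufferPool:
    """Sketch: a replica applies incoming logs only to cached pages."""
    def __init__(self, cached_pages):
        self.pages = dict.fromkeys(cached_pages, b"")
        self.read_point = 0  # latest consistent snapshot (Read Point LSN)

    def on_log(self, lsn, page_id, payload, pcpl):
        if page_id in self.pages:
            self.pages[page_id] = payload  # apply to the cached page
        # logs for uncached pages are discarded: those pages will be
        # read from Storage Service at the read point when needed
        self.read_point = pcpl
```

Storage Service must retain versions back to the minimum read point across replicas, since a lagging replica may still issue reads at that snapshot.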
Transaction Commit
A transaction consists of multiple MTRs; the LSN of the final log record (a CPL) is the Commit LSN (CLSN). Logs are pushed to Storage Service asynchronously, grouped by Segment Group. When PCPL exceeds a transaction’s CLSN, the transaction is considered committed.
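The commit rule reduces to a comparison: a transaction is acknowledged once the PCPL passes its CLSN. A minimal group‑commit sketch (names are illustrative):

```python
def ack_committed(pending_clsns, pcpl):
    """Sketch of asynchronous commit acknowledgement: acknowledge every
    pending transaction whose commit LSN (the CPL of its last MTR) the
    PCPL has reached or passed."""
    done = sorted(c for c in pending_clsns if c <= pcpl)
    remaining = [c for c in pending_clsns if c > pcpl]
    return done, remaining
```

Because logs are pushed to Storage Service asynchronously and in batches, many transactions are typically acknowledged together each time the PCPL advances.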
Crash Recovery
CynosDB continuously records the primary’s PCPL as Last‑PCPL (L‑PCPL). Upon a crash, a new primary reads all logs with LSN > L‑PCPL from each Segment Group, reconstructs the log stream, determines the new PCPL, and pushes it to replicas, restoring the system to a consistent state.
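The recovery step can be sketched as rebuilding the contiguous log prefix past the L‑PCPL and taking the greatest consistency point inside it. This sketch assumes consecutive integer LSNs for clarity; logs beyond the recovered point would be truncated.

```python
def recover_pcpl(l_pcpl, group_logs, cpls):
    """Sketch: gather all logs with LSN > L-PCPL from each Segment Group,
    find the contiguous persisted prefix, and take the greatest CPL
    within it as the new PCPL."""
    lsns = sorted(lsn for logs in group_logs for lsn in logs if lsn > l_pcpl)
    new_pcl = l_pcpl
    for lsn in lsns:
        if lsn == new_pcl + 1:   # assumption: LSNs are consecutive integers
            new_pcl = lsn
        else:
            break                # gap: later logs never became durable
    eligible = [c for c in cpls if l_pcpl < c <= new_pcl]
    return max(eligible, default=l_pcpl)
```

In the test below, LSN 14 was lost in the crash, so LSN 15 (and the CPL at 15) cannot be part of the recovered consistent state.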
Continuous Backup
Full backups are unnecessary; as long as WAL logs are incrementally saved to cold storage before reclamation, they can be used to generate full or differential backups offline. Instance configuration data is also backed up for recovery.
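Why a periodic full dump is unnecessary can be sketched directly: given the archived WAL, a full image at any target LSN is just a replay. The function below is an illustrative assumption using last‑writer‑wins payloads, not the real replay format.

```python
def materialize_backup(archived_wal, target_lsn):
    """Sketch: rebuild a full page image at target_lsn by replaying the
    incrementally archived WAL records (lsn, page_id, payload)."""
    pages = {}
    for lsn, page_id, payload in sorted(archived_wal):
        if lsn <= target_lsn:
            pages[page_id] = payload  # later records overwrite earlier ones
    return pages
```

A differential backup is the same replay restricted to an LSN interval, which is why archiving the WAL before reclamation is the only online obligation.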
Note: At the time of writing, CynosDB is in public beta, offering three months of free usage.