
Practices and Exploration of Disaster Recovery in Tencent Cloud‑Native Database TDSQL‑C (formerly CynosDB)

This article examines the architecture differences between cloud‑native TDSQL‑C and traditional MySQL, outlines TDSQL‑C’s elastic, serverless, low‑latency features, compares MySQL disaster‑recovery models, and details the multi‑dimensional disaster‑recovery system and its cross‑AZ/Region challenges and solutions.

Tencent Architect

TDSQL‑C, Tencent Cloud's cloud‑native database, provides high performance, low cost, large storage, low latency, rapid scaling, fast backup and restore, and serverless capabilities for enterprise (ToB) users. This article introduces the practice and exploration of its disaster‑recovery mechanisms.

1. Cloud‑native vs. traditional database architecture – Traditional MySQL relies on binlog replication, which causes heavy I/O, unbounded master‑slave latency, long recovery times, and poor scalability. In contrast, TDSQL‑C adopts a "log‑as‑database" design: it separates compute from storage, replicates physically via the redo log, and keeps compute nodes stateless.
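The "log‑as‑database" idea can be illustrated with a toy sketch (hypothetical class and method names, not TDSQL‑C's actual interfaces): the compute node only appends redo records, and any page state is materialized by replaying the log up to a chosen log sequence number (LSN), which is what lets compute stay stateless and replicas read consistently.

```python
class LogStructuredStore:
    """Toy 'log-as-database' store: the redo log is the source of truth."""

    def __init__(self):
        self.redo_log = []  # ordered redo records: the database itself

    def append_redo(self, page_id, value):
        """Compute node ships only a redo record; no page write-back."""
        lsn = len(self.redo_log)  # log sequence number of this record
        self.redo_log.append((lsn, page_id, value))
        return lsn

    def read_page(self, page_id, upto_lsn):
        """Replica read: replay the log up to a consistent LSN."""
        value = None
        for lsn, pid, val in self.redo_log:
            if lsn > upto_lsn:
                break
            if pid == page_id:
                value = val
        return value

store = LogStructuredStore()
lsn = store.append_redo("p1", "v1")
store.append_redo("p1", "v2")
print(store.read_page("p1", lsn))  # reads as of the older LSN -> v1
```

A replica reading at an older LSN sees the older page version, while the latest LSN yields the newest write; real systems materialize pages incrementally rather than replaying from the start.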

Key advantages of TDSQL‑C include:

Extreme elasticity: adding a read‑only node completes in about 20 seconds.

Serverless mode: compute nodes can be paused when idle, with recovery under 2 seconds.

Sub‑20 ms master‑slave latency and global consistency thanks to redo‑based replication.

Fast backup and restore: snapshot‑based backups complete in seconds, and restore runs in parallel at GB scale.
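Why a snapshot backup can complete in seconds comes down to copy‑on‑write: taking a snapshot only records a reference to the current (immutable) page state, and later writes create new copies instead of mutating it. A toy sketch under that assumption (the class and method names are illustrative, not TDSQL‑C's API):

```python
class COWStore:
    """Toy copy-on-write store: a snapshot is just a reference to the
    current page map, so taking one is O(1) regardless of data size."""

    def __init__(self):
        self.pages = {}      # current page map
        self.snapshots = {}  # name -> frozen page map

    def write(self, page_id, value):
        # Copy-on-write at the page-map level: never mutate a map
        # that a snapshot might still reference.
        self.pages = dict(self.pages)
        self.pages[page_id] = value

    def snapshot(self, name):
        self.snapshots[name] = self.pages  # O(1): share the frozen map

    def restore(self, name):
        self.pages = self.snapshots[name]

store = COWStore()
store.write("p1", "v1")
store.snapshot("before-change")
store.write("p1", "v2")
store.restore("before-change")
print(store.pages["p1"])  # v1: the snapshot preserved the old page map
```

Copying the whole map per write is deliberately naive; production engines copy at page or extent granularity, which is also what enables parallel restore across extents.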

2. MySQL disaster‑recovery deployment models – Two common patterns are cross‑AZ (two or three availability zones with multiple replicas) and cross‑Region (disaster‑recovery instances in a separate region, typically read‑only primary). These models involve asynchronous or semi‑synchronous replication and rely on external systems for consistency.
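The asynchronous vs. semi‑synchronous distinction mentioned above can be sketched as a toy commit path (hypothetical names; this is not MySQL's implementation): async commit acknowledges the client after the local write, while semi‑sync waits for at least one replica ack, degrading back to async on timeout, much like MySQL's semi‑sync plugin behaves.

```python
import queue
import threading

class Replica:
    """Toy replica that applies a transaction and acknowledges it."""
    def __init__(self):
        self.log = []

    def apply(self, txn):
        self.log.append(txn)
        return True

def commit(txn, replicas, semi_sync=True, timeout=1.0):
    """Toy commit: ship the txn to all replicas, then decide how long to wait."""
    acks = queue.Queue()
    for r in replicas:
        threading.Thread(target=lambda r=r: acks.put(r.apply(txn))).start()
    if not semi_sync:
        return "committed (async, replicas may lag)"
    try:
        acks.get(timeout=timeout)  # one replica ack is enough
        return "committed (semi-sync, durable on >=1 replica)"
    except queue.Empty:
        return "committed (degraded to async after timeout)"

replicas = [Replica(), Replica()]
print(commit("txn-1", replicas, semi_sync=True))
```

This also shows why such deployments need external systems for consistency: on timeout the commit still succeeds locally, so replicas can silently fall behind.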

3. TDSQL‑C multi‑dimensional disaster‑recovery system – The system consists of agents co‑located with database instances that collect status and send heartbeats to a Scheduler, with ZooKeeper providing leader election. The Scheduler decides failover actions based on agent health, lease information, and probing results.

Failover steps:

When the primary AZ fails, ZooKeeper triggers a leader switch and the Scheduler elects a new primary.

The Scheduler detects missing heartbeats from agents in the failed AZ.

It double‑checks lease information before proceeding.

After lease timeout, the Scheduler initiates the failover.
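The steps above can be condensed into a toy decision function (hypothetical names and timeouts; the real Scheduler also consults external probing): failover proceeds only when heartbeats from the primary's agent have stopped *and* the primary's lease has expired, so a merely slow network cannot trigger a premature switch.

```python
HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before an agent is suspect
LEASE_TTL = 5.0          # how long a lease renewal remains valid

class Scheduler:
    """Toy failover decision logic mirroring the steps above."""

    def __init__(self):
        self.last_heartbeat = {}  # agent -> last heartbeat timestamp
        self.lease_expiry = 0.0   # primary's lease deadline

    def on_heartbeat(self, agent, now):
        self.last_heartbeat[agent] = now

    def renew_lease(self, now):
        self.lease_expiry = now + LEASE_TTL

    def should_failover(self, primary_agent, now):
        # Step 1: heartbeats from the primary AZ must have stopped.
        last = self.last_heartbeat.get(primary_agent, 0.0)
        if now - last < HEARTBEAT_TIMEOUT:
            return False
        # Step 2: double-check the lease; the old primary may still hold it.
        if now < self.lease_expiry:
            return False
        # Step 3: heartbeat gone and lease expired -> safe to fail over.
        return True

s = Scheduler()
s.on_heartbeat("az1-agent", now=0.0)
s.renew_lease(now=0.0)
print(s.should_failover("az1-agent", now=4.0))  # False: lease still held
print(s.should_failover("az1-agent", now=6.0))  # True: heartbeat gone, lease expired
```

Requiring both conditions is the key design choice: either signal alone can be a false alarm, but a node that has lost its heartbeat path and let its lease lapse is safe to replace.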

4. Challenges and mitigation strategies – To prevent double‑writes (split brain), ZooKeeper degrades to read‑only and agents stop renewing leases, which forces the database into read‑only mode. To avoid accidental switchovers, a third‑party lease system and an external probing service add further safety layers, ensuring decisions are based on comprehensive health data.
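The self‑fencing side of this can be sketched from the agent's perspective (again with hypothetical names): when ZooKeeper is degraded the agent simply cannot renew its lease, and once the lease lapses the agent forces its database read‑only, so an old primary and a newly promoted one can never both accept writes.

```python
class Agent:
    """Toy self-fencing agent: no valid lease means no writes."""

    def __init__(self, lease_ttl=5.0):
        self.lease_ttl = lease_ttl
        self.lease_expiry = 0.0
        self.read_only = True

    def try_renew(self, zk_writable, now):
        # Renewal only succeeds while ZooKeeper accepts writes; when ZK
        # is degraded to read-only, the existing lease just runs out.
        if zk_writable:
            self.lease_expiry = now + self.lease_ttl

    def enforce(self, now):
        # Fence: force the database read-only once the lease has expired.
        self.read_only = now >= self.lease_expiry
        return self.read_only

agent = Agent()
agent.try_renew(zk_writable=True, now=0.0)
print(agent.enforce(now=1.0))   # False: lease valid, writes allowed
agent.try_renew(zk_writable=False, now=4.0)
print(agent.enforce(now=6.0))   # True: lease lapsed, self-fenced to read-only
```

Because the fence is local and time-based, it works even when the agent is partitioned from both the Scheduler and ZooKeeper, which is exactly the scenario in which double‑writes would otherwise occur.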

The article concludes that TDSQL‑C's compute‑layer disaster recovery is robust, while storage‑layer resilience relies on the HiStor block‑storage engine; a global database based on redo replication is planned for the future.

Tags: High Availability, MySQL, disaster recovery, elastic scaling, TDSQL-C, cloud-native database