Mastering High Availability: From Cold Backups to Multi‑Active Disaster Recovery
This article traces the evolution of high‑availability strategies for stateful backend services, from cold backups through active/standby and same‑city active‑active setups to cross‑city active‑active and multi‑active architectures, and discusses the trade‑offs, design considerations, and real‑world implementations along the way.
Preface
Backend services can be classified as stateless or stateful. High availability is straightforward for stateless applications, which can rely on load balancers like F5, but the following discussion focuses on stateful services.
State is typically persisted on disk (e.g., in MySQL) or held in memory (e.g., in Redis, or in JVM memory, which has a short lifecycle).
High Availability
1. Some HA Solutions
High availability has evolved through several stages:
Cold backup
Active/standby (dual‑machine hot backup)
Same‑city active‑active
Cross‑city active‑active
Cross‑city multi‑active
Before discussing cross‑city multi‑active, it helps to understand earlier solutions.
Cold Backup
Cold backup copies data files while the database is offline, often using simple file copy commands such as cp on Linux. Its benefits include:
Simple
Fast backup compared to other methods
Fast recovery – copy the files back or adjust configuration; even two mv commands can restore service almost instantly
Point‑in‑time recovery – useful for incidents like coupon exploits
However, cold backup has drawbacks:
Requires service downtime, which is unacceptable for 24/7 global services
Potential data loss between backup and restoration; manual log replay is labor‑intensive
Full copies consume excessive disk space and time
Impractical for large data volumes (multiple terabytes) and lacks selective backup
Balancing these pros and cons is essential for each business.
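The cold‑backup flow described above (stop the service, copy the data files out, copy them back to restore) can be sketched in a few lines of Python; the directory layout and function names here are illustrative, not from the article:

```python
import shutil
import time
from pathlib import Path

def cold_backup(data_dir: str, backup_root: str) -> Path:
    """Copy the (offline) data directory into a timestamped folder.

    The service must be stopped first: copying live database files
    would capture an inconsistent snapshot.
    """
    dest = Path(backup_root) / time.strftime("backup-%Y%m%d-%H%M%S")
    shutil.copytree(data_dir, dest)
    return dest

def restore(backup_dir: str, data_dir: str) -> None:
    """Point-in-time restore: replace the current data dir with the backup."""
    data = Path(data_dir)
    if data.exists():
        shutil.rmtree(data)  # discard the corrupted/exploited state
    shutil.copytree(backup_dir, data)
```

Note that both functions assume nothing is writing to the data directory while they run, which is exactly why the article lists point‑in‑time recovery as a benefit and downtime as a cost.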
Active/Standby (Dual‑Machine Hot Backup)
Hot backup allows continuous service while backing up data, but restoration still requires downtime. This discussion excludes shared‑disk approaches.
Active/Standby Mode
One primary node serves traffic while a secondary node acts as backup. Data is synchronized via software (e.g., MySQL master/slave binlog, SQL Server replication) or hardware (disk mirroring). Software‑level sync is often called application‑level disaster recovery; hardware‑level sync is data‑level disaster recovery.
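The failover decision at the heart of Active/Standby mode can be sketched minimally, assuming an external health check; the class and node names below are illustrative, not part of any real replication product:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ActiveStandbyPair:
    """Route to the primary while it is healthy; promote the standby
    when a health check fails."""
    primary: str
    standby: str
    is_healthy: Callable[[str], bool]

    def route(self) -> str:
        if self.is_healthy(self.primary):
            return self.primary
        # Failover: promote the standby. In a real MySQL master/slave
        # setup this step must also wait for the slave to apply any
        # remaining binlog events before accepting writes.
        self.primary, self.standby = self.standby, self.primary
        return self.primary
```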
Dual‑Machine Mutual Backup
Essentially two Active/Standby pairs with the roles crossed: each machine is primary for one service and standby for the other, which improves resource utilization and enables read‑write separation when different services are deployed on each machine.
Other HA options include various MySQL deployment modes (master‑slave, master‑master, MHA) and Redis setups (master‑slave, Sentinel, Cluster).
Same‑City Active‑Active
This extends previous solutions across an entire data center, protecting against a single IDC failure (e.g., power outage). It resembles dual‑machine hot backup but with greater distance; latency remains low due to dedicated links.
Some applications achieve true active‑active with conflict resolution, though not all workloads can support it.
Industry practice often adopts a “two‑site three‑center” model: two local data centers provide primary service, while a remote center serves as disaster‑recovery only. Traffic is load‑balanced, and failover switches to the remote center when a local site fails, though latency may increase.
In the “two‑site three‑center” diagram, traffic is distributed via load balancers to IDC1 and IDC2; both sync data to IDC3. If any IDC fails, traffic is redirected to the remaining site.
The diagram shows a master‑slave based three‑center architecture, where two local sites act as master‑slave and the remote site as backup.
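The routing rule behind the "two‑site three‑center" model reduces to a few lines: hash traffic across the two serving sites, and fall back to the disaster‑recovery center only when both are down. The IDC names mirror the diagram description; the hashing scheme is an assumption for illustration:

```python
def route_request(user_id: int, healthy: set) -> str:
    """Two-site three-center routing: IDC1 and IDC2 serve traffic,
    IDC3 is disaster-recovery only."""
    serving = [idc for idc in ("IDC1", "IDC2") if idc in healthy]
    if serving:
        # Normal operation: load-balance across the healthy local sites.
        return serving[user_id % len(serving)]
    if "IDC3" in healthy:
        # DR takeover: service continues, but latency may increase.
        return "IDC3"
    raise RuntimeError("all data centers are down")
```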
Cross‑City Active‑Active
Same‑city active‑active handles most disaster scenarios, but large‑scale outages (e.g., natural disasters) still cause service interruption. Extending the architecture across cities allows traffic to fail over to another city, albeit with degraded user experience.
Many large internet companies adopt cross‑city active‑active.
The simple cross‑city active‑active diagram shows load balancers directing traffic to two city clusters, each with its own local database cluster. Failover occurs only when the local databases become unavailable.
Cross‑city synchronization introduces higher latency, reducing throughput and increasing conflict risk. Solutions include distributed locks, eventual consistency, sharding, and intermediate states with retries.
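Of the conflict‑avoidance techniques listed above, sharding is the simplest to illustrate: give each record a single "home" city that owns its writes, so two cities never write the same row concurrently. A minimal sketch, with city names and the modulo scheme as illustrative assumptions:

```python
CITIES = ["city-a", "city-b"]  # illustrative; real deployments map shards explicitly

def home_city(user_id: int) -> str:
    """Each user's writes go only to their home city; the other city
    receives the data asynchronously, so cross-city write conflicts
    cannot occur for that user's rows."""
    return CITIES[user_id % len(CITIES)]
```

The trade‑off is that a user routed far from their home city pays the cross‑city latency on writes, which is why real systems usually shard by the user's geography rather than a plain hash.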
For strict consistency requirements, Ele.me uses a “Global Zone” design: writes are directed to a single master data center, while reads can be served locally, ensuring strong consistency.
For applications demanding strong consistency, we provide a Global Zone solution that centralizes writes to a master data center while allowing reads from any slave, based on our Database Access Layer (DAL), making the process transparent to business logic. —《Ele.me Cross‑Region Multi‑Active Technical Implementation (Part 1) Overview》
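The Global Zone idea, writes always routed to the single master data center and reads served locally, reduces to a small routing rule in the data access layer. A sketch with hypothetical region names and endpoints (the real DAL is far more involved):

```python
MASTER_REGION = "region-a"  # hypothetical name for the single write master

def choose_endpoint(operation: str, local_region: str) -> str:
    """Global Zone routing: all writes go to the master region, so
    strong consistency is preserved; reads stay local for low latency."""
    if operation == "write":
        return f"db.{MASTER_REGION}.internal"
    return f"db.{local_region}.internal"
```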
Cross‑city active‑active is essentially a transitional step toward cross‑city multi‑active, which offers better scalability but introduces more complexity.
Cross‑City Multi‑Active
The diagram illustrates a mesh topology where each node connects to four others, providing resilience against any single node failure. However, the increased distance for write operations leads to higher latency and more conflicts.
Optimizing the mesh into a star topology reduces synchronization overhead:
In this star layout, each city can fail without affecting data integrity; traffic is rerouted to the nearest city. The central node bears higher reliability requirements (fast recovery, complete backups).
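The advantage of the star over the mesh shows up in the replication paths: every change travels through the central node instead of over O(n²) pairwise links. A sketch with illustrative city names:

```python
def star_sync(origin: str, center: str, cities: list) -> list:
    """Return the replication hops a change travels in a star topology:
    one hop from the origin city to the center (unless the origin is
    the center itself), then a fan-out from the center to every other
    city. A full mesh would instead need a direct link per city pair."""
    hops = [] if origin == center else [(origin, center)]
    hops += [(center, c) for c in cities if c not in (origin, center)]
    return hops
```

This also makes the article's point about the central node concrete: since every path runs through it, the center needs stricter reliability guarantees (fast recovery, complete backups) than the edge cities.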
Alibaba’s envisioned multi‑active architecture places writes in a single city while reads are distributed, similar to the “Global Zone” concept.
Large e‑commerce platforms like Taobao adopt a unit‑based split: transactional units synchronize bidirectionally with a central unit, while non‑transactional data syncs unidirectionally, allowing elastic scaling for business units and robust stability for the central unit.
Implementing such architectures requires extensive code refactoring, distributed transaction handling, cache invalidation, and sophisticated testing and operations pipelines.
In summary, cross‑city multi‑active demands strong foundational capabilities such as data transfer, verification, and a simplified client‑side write/sync layer.
Source: https://blog.dogchao.cn/?p=299
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.