How Dada Built a Dual‑Cloud Active‑Active Disaster Recovery Platform
This article details Dada's journey of designing and implementing a dual‑cloud active‑active architecture, covering high‑availability vs. disaster‑recovery concepts, Phase 1 and Phase 2 solutions, challenges faced, multi‑data‑center Consul deployment, bidirectional database replication, precise load‑balancing, capacity elasticity, and future plans.
1. Why Build Dual‑Active
High Availability (HA) handles single‑component failures within a cluster; Disaster Recovery (DR) addresses large‑scale failures that require shifting traffic between data centers. HA operates at the LAN level and ensures service continuity within a single cloud, while DR operates at the WAN level and ensures continuity across clouds, with the goals of minimizing data loss (Recovery Point Objective, RPO) and recovery time (Recovery Time Objective, RTO).
DR can be implemented at network, application, host, and storage layers. Industry examples include banks' "two‑site three‑center" setups, Ant Financial's "three‑site five‑center" multi‑active architecture, and Ele.me's dual‑data‑center design.
Major incidents such as the Alipay 527 outage, Ctrip 528 database crash, AWS Beijing cable cut, Tencent Cloud disk failure, Google Cloud GKE outage, and Huawei Hong Kong data‑center collapse highlight the critical need for robust DR capabilities.
Dada experienced severe network failures in 2017‑2018, prompting the adoption of a dual‑cloud architecture for same‑city active‑active resilience.
2. Dual‑Cloud Phase 1
Phase 1 focused on deploying services across two clouds, traffic splitting, and validating interface functionality and performance.
Phase 1 Solution
Cross‑Cloud Dedicated Lines: Four high‑availability links (4 Gbps, 3‑4 ms latency) provided by two providers.
Service Registry: Consul was used for service registration, link isolation, data‑source discovery, and HA failover. Consul’s Raft algorithm ensures strong consistency, though leader election can cause temporary unavailability.
Configuration Center: Shared the existing Config center for service parameters, cache, and database connection strings.
Service Deployment: Only core services were deployed on the J‑cloud, while the Config center, cache, queue, and database remained on the U‑cloud.
Traffic Distribution: Load balancer built on OpenResty + Consul with custom traffic‑control logic for production, gray, and stress‑test traffic.
Granular traffic control allowed domain‑level routing based on CityId tags, directing a configurable percentage of external traffic to J‑cloud nodes.
Monitoring & Logging: Shared monitoring, logging, APM, and release systems with the U‑cloud production environment.
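The article describes the traffic-control logic above but does not show it; the production implementation is an OpenResty Lua module, so the following is only a minimal Python sketch of the split decision, assuming a hypothetical per-domain config of city whitelists and gray percentages (`ROUTE_CONFIG` and `pick_upstream` are illustrative names, not from the source):

```python
import hashlib

# Hypothetical routing config: cities pinned to J-cloud, plus the
# percentage of remaining traffic sent to J-cloud for gray validation.
ROUTE_CONFIG = {
    "api.example.com": {"j_cloud_cities": {131, 152}, "j_cloud_percent": 10},
}

def pick_upstream(domain: str, city_id: int, request_id: str) -> str:
    """Return 'J' or 'U', mirroring a production/gray traffic split."""
    cfg = ROUTE_CONFIG.get(domain)
    if cfg is None:
        return "U"  # unknown domains stay on the primary cloud
    if city_id in cfg["j_cloud_cities"]:
        return "J"  # the whole city is pinned to J-cloud
    # Hash the request id so the same request is always routed consistently.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "J" if bucket < cfg["j_cloud_percent"] else "U"
```

Hashing rather than random sampling keeps routing deterministic per request, which simplifies debugging latency differences between the two clouds.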
Problems Encountered in Phase 1
J‑cloud interface response times were high (≈500 ms, versus ≈200 ms on U‑cloud) because each J‑cloud request still accessed the cache and database on U‑cloud, so multiple cross‑cloud round trips accumulated the 3‑4 ms dedicated‑line latency.
Gray‑release users experienced noticeable latency differences when requests were routed to different clouds.
Sharing a single Consul cluster across clouds caused occasional gossip‑protocol disruptions, leading to node isolation and bandwidth spikes.
Consul's gossip parameters were tuned to be less sensitive to network latency (see diagram).
3. Dual‑Cloud Phase 2
Phase 2 tackled Phase 1's latency, Consul instability, and traffic‑consistency issues with three core improvements: intra‑cloud service interaction, bidirectional database replication, and fine‑grained traffic control. Services and data were also classified under an RCG model according to access frequency, latency tolerance, and consistency requirements.
Additional enhancements covered unified configuration, service consistency deployment, observability, tool adaptation, capacity planning, and cost control.
Consul Multi‑DataCenter Solution
Each cloud runs its own Consul server cluster. Consul clients join the local cluster via LAN gossip, while server clusters interconnect via WAN gossip, achieving intra‑cloud service isolation and cross‑cloud service discovery.
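The article does not include agent configuration; as a minimal sketch under assumed names, a Consul server in one cloud could be wired for WAN federation roughly like this (the `u-cloud` datacenter name and all addresses are illustrative):

```json
{
  "datacenter": "u-cloud",
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["10.0.0.11", "10.0.0.12"],
  "retry_join_wan": ["10.1.0.11", "10.1.0.12"]
}
```

`retry_join` joins the local LAN gossip pool, while `retry_join_wan` federates the server cluster with its peer in the other cloud, which is what enables cross-cloud service discovery without sharing one gossip pool.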
Bidirectional Database Replication
Primary keys were partitioned by parity across clouds (U‑cloud odd, J‑cloud even) with an auto‑increment step of 2, so the two sides can never generate conflicting IDs. Alibaba's open‑source Otter provided stable bidirectional sync, with an average replication latency of 0.9 s (max 2.2 s).
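The parity split maps directly onto two standard MySQL system variables; a sketch of the per-side settings implied by the odd/even assignment described above:

```sql
-- U-cloud MySQL: generates 1, 3, 5, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 1;

-- J-cloud MySQL: generates 2, 4, 6, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 2;
```

With disjoint key spaces, Otter can replicate inserts in both directions without primary-key collisions.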
Precise LB Traffic Distribution
A custom OpenResty Lua module extracts CityId from request headers and TransporterID/ShopID from the body, maps them to a unified CityID, and routes traffic to J‑cloud based on domain + URI configuration. This enables per‑city and per‑rider traffic steering.
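The real extraction logic lives in Lua inside OpenResty; purely as an illustration of the mapping step, here is a Python sketch that prefers the explicit header and falls back to body lookups (the lookup tables and field names are hypothetical stand-ins for whatever the LB layer actually caches):

```python
import json
from typing import Optional

# Hypothetical lookup tables, normally backed by a cache in the LB layer.
TRANSPORTER_CITY = {7001: 131}   # rider id -> city id
SHOP_CITY = {88001: 152}         # shop id -> city id

def resolve_city_id(headers: dict, body: bytes) -> Optional[int]:
    """Map a request to a unified CityID, preferring the explicit header."""
    if "CityId" in headers:
        return int(headers["CityId"])
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # non-JSON bodies cannot be mapped
    if "TransporterID" in payload:
        return TRANSPORTER_CITY.get(payload["TransporterID"])
    if "ShopID" in payload:
        return SHOP_CITY.get(payload["ShopID"])
    return None
```

Once every request resolves to a single CityID, the domain + URI routing table can steer whole cities (or individual riders, via their city) to either cloud.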
RCG Model Classification
Services and data were classified according to access frequency, latency tolerance, and consistency needs, forming an RCG (Read‑Critical‑Group) model for targeted replication and routing strategies.
Phase 2 Architecture Diagram
Tool/System Adaptation for Dual‑Active
Configuration Center: Migrated from Config to Apollo, with cluster names aligned to Consul datacenters.
Release System: Enabled one‑click dual‑cloud consistent releases, rollbacks, and restarts.
Business Monitoring: Integrated cloud tags to view metrics per cloud.
APM: Deployed Pinpoint clusters in each cloud for intra‑cloud service interaction monitoring.
NTP Service: Synchronized time across clouds via a primary NTP server.
Image Registry: Harbor instances deployed in both clouds with incremental sync.
Capacity Elasticity and Cost Control
Database clusters are mirrored across clouds, and MySQL is migrating to MGR (MySQL Group Replication) to reduce cold‑standby costs. Stateless services rely on Kubernetes HPA plus a custom auto‑scaling system, with Kata containers for faster startup, achieving elastic capacity while controlling expenses.
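For the stateless side, a standard Kubernetes `autoscaling/v2` HPA manifest gives a sense of the mechanism; the `order-service` Deployment name, replica bounds, and CPU threshold below are hypothetical, not taken from the article:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service        # hypothetical stateless service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 4             # floor sized for steady-state traffic
  maxReplicas: 40            # ceiling for peak or failover load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

A custom scaler layered on top of HPA, as described above, would typically handle signals HPA does not, such as scheduled peaks or cross-cloud failover headroom.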
4. Current Status
The dual‑cloud solution is now stable in production, with the ability to shift traffic for any city to a chosen cloud.
5. Future Plans and Summary
Future work includes adding more sharding keys (city, rider, merchant, order IDs), introducing an API router for easier integration, converting fallback jobs to Pulsar delayed messages, enhancing monitoring of traffic switch ratios, adopting TiDB for account services, and extending sharding from CityID to a broader ShardingID dimension.
The project demonstrates that systematic architecture and operations can solve same‑city active‑active challenges with minimal business impact, providing a foundation for unit‑level and multi‑cloud active‑active strategies.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.