
Dada Dual-Cloud Active-Active Disaster Recovery: Architecture, Practices, and Lessons Learned

This article details Dada's dual-cloud active-active disaster-recovery implementation: it distinguishes high availability from disaster recovery, describes the first-phase architecture and its challenges, and outlines the second-phase enhancements, including multi-data-center Consul, bidirectional database replication, precise load balancing, supporting-tool adaptations, capacity elasticity, and future plans.

Dada Group Technology

Over the past six years, Dada has continuously upgraded its technical capabilities to keep its on-demand delivery business highly available and recoverable. The company distinguishes high availability (HA), meaning failover within a single data center, from disaster recovery (DR), meaning business continuity across data centers.

The first phase of the dual-cloud active-active project deployed core services in a secondary cloud (J-cloud) while sharing Consul, Config, caches, queues, and databases with the primary cloud (U-cloud); traffic was routed through an OpenResty-based load balancer with city-based percentage controls. Three issues emerged: increased latency over the 3-4 ms cross-cloud link, inconsistent user experience during gray releases, and occasional Consul cluster instability caused by network jitter.
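City-based percentage routing of this kind can be sketched as follows. The production module runs as Lua inside OpenResty; this Python sketch only illustrates the decision logic, and the names `pick_cloud` and `j_cloud_percent` are hypothetical, not taken from the source. Hashing a stable request attribute such as the user id keeps each user pinned to one cloud during a gray release, which is one way to avoid the inconsistent-experience problem described above.

```python
import hashlib

# Hypothetical sketch of per-city percentage routing between two clouds.
# Dada's actual implementation is a Lua module in OpenResty; the names
# here (pick_cloud, j_cloud_percent) are illustrative assumptions.

def pick_cloud(city_id: int, user_id: int, j_cloud_percent: dict) -> str:
    """Decide whether a request goes to U-cloud or J-cloud.

    j_cloud_percent maps city_id -> share of that city's traffic
    (0-100) that should be sent to the secondary J-cloud.
    """
    percent = j_cloud_percent.get(city_id, 0)
    # Hash the user id into a bucket 0..99 so the same user always
    # lands in the same bucket, keeping the experience consistent
    # while the percentage is dialed up or down.
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return "j-cloud" if bucket < percent else "u-cloud"
```

Because the bucket depends only on the user id, raising a city's percentage moves whole buckets of users over at once rather than bouncing individual users between clouds on each request.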

To address these problems, the second phase introduced three core improvements: (1) a multi‑data‑center Consul deployment where each cloud runs its own Consul servers and clients join locally via LAN Gossip, while WAN Gossip links the clouds; (2) bidirectional MySQL replication using Alibaba’s Otter, with odd/even primary‑key allocation and sub‑second sync latency; (3) fine‑grained traffic control using a custom OpenResty Lua module that extracts CityId/TransporterID/ShopID to direct requests to the appropriate cloud.
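One common way to implement the odd/even primary-key allocation is MySQL's native auto-increment settings, so that rows inserted in either cloud can never collide when Otter replays them on the other side. The values below are an illustrative sketch under that assumption, not Dada's actual configuration:

```ini
# U-cloud my.cnf: this side generates odd primary keys (1, 3, 5, ...)
auto_increment_increment = 2
auto_increment_offset    = 1

# J-cloud my.cnf: this side generates even primary keys (2, 4, 6, ...)
auto_increment_increment = 2
auto_increment_offset    = 2
```

With disjoint key spaces, bidirectional replication becomes conflict-free for inserts, and each cloud can keep accepting writes even while the sync link is degraded.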

Additional system adaptations include migrating configuration management from Config to Apollo, synchronizing image repositories with Harbor, deploying Pinpoint for APM in each cloud, establishing NTP master‑slave time sync, and implementing auto‑scaling based on K8S HPA and Kata containers to achieve capacity elasticity while controlling costs.
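The HPA-based elasticity can be illustrated with a minimal manifest; the service name and thresholds below are hypothetical assumptions, not taken from the source:

```yaml
# Illustrative HPA manifest (names and numbers are assumptions):
# scale a delivery service between 4 and 40 replicas based on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Pairing an HPA like this with lightweight Kata-isolated pods lets capacity follow demand peaks without paying for peak capacity around the clock, which matches the cost-control goal stated above.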

Currently the dual‑cloud setup runs stably, supporting city‑level traffic switching, and future work will add more sharding keys, API routing, Pulsar‑based delayed jobs, TiDB for account services, and further horizontal database partitioning to enhance scalability and resilience.

Tags: cloud computing, load balancing, disaster recovery, Consul, database replication, active-active, dual-cloud
Written by

Dada Group Technology

Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.
