Multi-Cloud Active‑Active Architecture: Design, Benefits, and Challenges
The article examines why multi‑cloud active‑active (multi‑active) deployments are essential for high availability, outlines common disaster‑recovery patterns such as primary‑backup and active‑active, details the technical workflow of traffic routing, business and storage layers, and discusses the practical advantages and drawbacks of this approach.
When an internet company reaches a certain scale, high availability becomes critical, and many adopt a "multi‑active" strategy to limit the impact of unexpected failures. The author, an Apache Dubbogo committer, shares experience from implementing a dual‑cloud solution.
Why multi‑active matters – Real‑world incidents such as Bilibili’s 2021 server outage and Futu Securities’ IDC network failure show how a failure in basic infrastructure can severely reduce availability. Multi‑active deployment is a powerful remedy for this class of failure.
Disaster‑recovery patterns
Primary‑Backup
A primary‑backup setup is common in small companies, but the standby cluster is rarely exercised, so its code, configuration, and data may turn out to be unverified when a failover actually happens.
Active‑Active (Multi‑Active)
All clusters serve traffic under normal conditions; traffic is split across them, and if one cluster fails, traffic is shifted to the remaining healthy clusters. Variants include same‑city dual‑active, cross‑region dual‑active, and multi‑center designs, each requiring more resources as the level rises.
Technical details of multi‑cloud active‑active
Two cloud providers host duplicate services. Under normal operation both clouds serve users; if one cloud experiences an issue, all traffic is switched to the other.
The workflow includes:
Clients access services through an entry layer.
The entry layer distributes traffic to business layers according to routing rules.
The business layer processes logic and writes data to storage.
Traffic distribution / switching
The capacity of the clusters in both clouds is evaluated first, and traffic is typically split evenly between them. When a cloud fails, the entry layer redirects all traffic to the healthy cloud, which is why the entry component itself must be highly reliable.
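The split‑and‑switch behavior described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name pick_cloud, the cloud names, and the choice of a stable hash on the user ID are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Set


def pick_cloud(user_id: str, weights: Dict[str, int], healthy: Set[str]) -> str:
    """Pick a cloud for a user by stable hash, skipping unhealthy clouds."""
    # Only healthy clouds with a positive weight can receive traffic.
    candidates = {c: w for c, w in weights.items() if c in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy cloud available")

    total = sum(candidates.values())
    # A stable hash keeps the same user on the same cloud across requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the clouds in a fixed order and find the weight range the bucket falls in.
    for cloud in sorted(candidates):
        if bucket < candidates[cloud]:
            return cloud
        bucket -= candidates[cloud]
    raise AssertionError("unreachable: bucket is always below the total weight")


# Hypothetical even split between two clouds.
weights = {"cloud1": 50, "cloud2": 50}
print(pick_cloud("user-42", weights, {"cloud1", "cloud2"}))  # one of the two clouds
print(pick_cloud("user-42", weights, {"cloud2"}))            # cloud1 is down: cloud2
```

Removing a cloud from the healthy set is the whole failover step in this sketch. In practice that set would be maintained by health checks in the entry layer, and the surviving cloud must have enough capacity to absorb the full load.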
Business layer dual‑active
Deploy identical code to both clouds and keep the two deployments isolated, so that services in Cloud 1 never call services in Cloud 2. CI/CD pipelines enable rapid rollbacks. Even so, core services should run in isolation, and changes should be validated in a non‑core cluster before being promoted to them.
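One way to enforce this isolation is to filter service discovery results so that a caller only ever sees instances in its own cloud. The sketch below assumes a hypothetical Instance record carrying a cloud label. The names and addresses are illustrative and are not taken from any particular registry.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    service: str
    cloud: str    # label of the cloud this instance runs in
    address: str


def same_cloud_instances(instances: List[Instance], service: str,
                         local_cloud: str) -> List[Instance]:
    """Return only the instances of `service` that run in the caller's cloud."""
    return [i for i in instances
            if i.service == service and i.cloud == local_cloud]


# Hypothetical registry contents.
registry = [
    Instance("order", "cloud1", "10.0.0.11:8080"),
    Instance("order", "cloud2", "10.1.0.11:8080"),
    Instance("user", "cloud1", "10.0.0.21:8080"),
]

# A caller in cloud1 never sees the cloud2 instance of the order service.
for inst in same_cloud_instances(registry, "order", "cloud1"):
    print(inst.address)  # prints 10.0.0.11:8080
```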
Storage layer
The design typically uses classic primary‑replica setups for MySQL and Redis, with one cloud hosting the primary and some replicas, and the other cloud hosting the remaining replicas, synchronized via master‑slave mechanisms over a dedicated line.
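The consequence of this layout for request routing can be sketched as follows: reads can be served by a replica in the caller's own cloud, while every write has to reach the single primary, crossing the dedicated line when the caller sits in the other cloud. The Endpoint type, the function names, and the addresses are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Endpoint:
    cloud: str
    address: str


def route_storage(op: str, local_cloud: str, primary: Endpoint,
                  replicas: Dict[str, List[Endpoint]]) -> Endpoint:
    """Choose a storage endpoint for a read or write issued from `local_cloud`."""
    if op == "write":
        return primary  # there is only one primary, wherever the caller is
    if op != "read":
        raise ValueError(f"unknown operation: {op}")
    local = replicas.get(local_cloud, [])
    # Prefer a replica in the same cloud; fall back to the primary if none exists.
    return local[0] if local else primary


def crosses_link(local_cloud: str, target: Endpoint) -> bool:
    """True when the request has to travel over the inter-cloud dedicated line."""
    return target.cloud != local_cloud


# Hypothetical topology: primary in cloud1, one replica in each cloud.
primary = Endpoint("cloud1", "10.0.0.1:3306")
replicas = {
    "cloud1": [Endpoint("cloud1", "10.0.0.2:3306")],
    "cloud2": [Endpoint("cloud2", "10.1.0.2:3306")],
}

write_target = route_storage("write", "cloud2", primary, replicas)
print(crosses_link("cloud2", write_target))  # prints True
```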
Pros
The architecture is simple and relies on the built‑in data‑sync mechanisms of Redis and MySQL. Each cloud can serve reads from its local replicas, while writes go to the single primary.
Cons
The approach depends heavily on the stability of the primary cloud and of the dedicated inter‑cloud link. If the link saturates or fails, the system can be crippled. If the primary cloud goes down, write operations may fail and have to be compensated manually afterwards.
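To make later compensation possible, failed writes have to be recorded somewhere. The sketch below shows one possible shape for that record, assuming a hypothetical primary_write callable that raises ConnectionError when the primary is unreachable. It keeps the pending writes in memory only, so a real system would need durable storage and an idempotency check before replaying anything.

```python
from typing import Callable, List, Tuple

WriteFn = Callable[[str, str], None]


class CompensationLog:
    """Records writes that failed so they can be reviewed and replayed later."""

    def __init__(self) -> None:
        self.pending: List[Tuple[str, str, str]] = []  # (key, value, reason)

    def write(self, primary_write: WriteFn, key: str, value: str) -> bool:
        """Try the write; on connection failure, record it and return False."""
        try:
            primary_write(key, value)
            return True
        except ConnectionError as exc:
            self.pending.append((key, value, str(exc)))
            return False

    def replay(self, primary_write: WriteFn) -> int:
        """Retry all pending writes; return how many succeeded."""
        remaining: List[Tuple[str, str, str]] = []
        for key, value, _reason in self.pending:
            try:
                primary_write(key, value)
            except ConnectionError as exc:
                remaining.append((key, value, str(exc)))
        replayed = len(self.pending) - len(remaining)
        self.pending = remaining
        return replayed
```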
To truly achieve active‑active, multi‑master replication for both Redis and MySQL is needed, but implementing this reliably is extremely challenging.
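A small example shows why multi‑master writes are hard. Suppose both clouds accept writes to the same key and the copies are later merged with a last‑write‑wins rule. That rule is only one common conflict‑resolution strategy, chosen here for illustration, and is not a description of how Redis or MySQL behave. The function name and the data are hypothetical.

```python
from typing import Dict, Tuple

Versioned = Dict[str, Tuple[int, int]]  # key -> (timestamp, value)


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Merge two copies, keeping the value with the newer timestamp per key."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Both masters start from balance = 100 and accept a deduction concurrently.
cloud1 = {"balance": (1, 100 - 30)}  # Cloud 1 deducts 30 at t=1
cloud2 = {"balance": (2, 100 - 50)}  # Cloud 2 deducts 50 at t=2

merged = lww_merge(cloud1, cloud2)
print(merged["balance"][1])  # prints 50, although applying both deductions gives 20
```

The deduction made in Cloud 1 is silently lost. Avoiding this requires routing each key to a single writer, merging operations rather than values, or some other conflict‑handling scheme, all of which add considerable complexity.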
Conclusion
Many companies end up with “pseudo‑active‑active” systems in which the storage layer remains a single point of failure. For companies below the scale of BAT (Baidu, Alibaba, Tencent), it is advisable to first ensure that core data (transactions, users) is backed up across multiple centers, so that service can be restored quickly when a cloud runs into problems.
High Availability Architecture
Official account for High Availability Architecture.