How We Built a Resilient Multi‑Cloud Network: Lessons from Three Evolution Phases
This article details the step‑by‑step evolution of a multi‑cloud network at Zuoyebang, covering three construction phases, quality‑improvement measures such as fault prevention and rapid recovery, and ongoing operational governance that together deliver a flexible, high‑availability cloud infrastructure.
Multi‑Cloud Network Journey
Zuoyebang has been built on public clouds since its inception, without a private data center, and over 95% of its services are cloud‑deployed. Through multiple architectural redesigns, the team developed a dual‑ring star topology with CPE control to achieve multi‑cloud network interconnection.
Network Construction
Phase 1 – Dual‑Cloud Network Centered on the Work Area
In late 2019, after several cloud provider outages, Zuoyebang moved from a single‑cloud to a dual‑cloud architecture by reusing the work‑area’s dedicated lines and IT expertise, making the work‑area network the core of the dual‑cloud link.
Inter‑cloud traffic now took a long path over links the infrastructure team did not control, which made the network fragile: the work‑area network, originally designed for office use, had become a production‑critical dependency.
Strong coupling between the work‑area and dual‑cloud networks blurred team responsibilities, causing IP conflicts, routing chaos, and change‑impact issues.
Phase 2 – Hybrid‑Line Dual‑Cloud Network
To eliminate reliance on the work‑area network, direct redundant lines were built between the two clouds, using the work‑area only as a fallback.
Static routing on the redundant lines prevented true hot‑standby; failover required manual route changes, leading to long outage durations.
The lines had no traffic visibility or rate limiting, so when bandwidth saturated, troubleshooting was time‑consuming.
Adding new cloud providers was costly because the existing architecture lacked extensibility.
Phase 3 – Dual‑Ring Star Topology for Multi‑Cloud
The team evaluated four topology options (linear, ring, mesh, and star), comparing them on normalized scores for cost, quality, performance, and efficiency. The star topology was chosen for its scalability, and two line‑provider POPs in Beijing were deployed with ECMP for link redundancy.
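As a rough illustration of how ECMP can spread flows across the two POPs, the sketch below hashes a flow's 5‑tuple and picks one of the currently healthy next hops. The function name, POP labels, and addresses are hypothetical, and real routers use their own vendor‑specific hash functions; this only shows the idea.

```python
import hashlib

def pick_next_hop(flow, next_hops):
    """Map a flow's 5-tuple onto one of the equal-cost next hops."""
    if not next_hops:
        raise ValueError("no healthy next hop available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # The same flow always hashes to the same index, so its packets
    # stay on one path and are not reordered.
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.1.0.5", "10.2.0.9", 6, 51514, 443)
pops = ["pop-a", "pop-b"]  # assumed labels for the two Beijing POPs

chosen = pick_next_hop(flow, pops)

# If the chosen POP fails and is withdrawn, the flow rehashes onto
# whatever next hops remain.
survivors = [p for p in pops if p != chosen]
fallback = pick_next_hop(flow, survivors)
```

With only two POPs, losing one simply moves every flow onto the survivor, which is why each link needs headroom for the full load.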
Additional improvements included:
Deploying CPE devices (rented from line providers) to gain traffic visibility and control (routing, ACL, QoS, BGP, BFD).
Switching all inter‑cloud links from static routes to BGP with BFD, achieving sub‑second automatic failover.
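To see why BFD makes sub‑second failover plausible, the sketch below computes the asynchronous‑mode detection time defined in RFC 5880: the remote detect multiplier times the negotiated interval. The 100 ms timers and multiplier of 3 are illustrative assumptions, not values reported in the article, and total failover also includes BGP route withdrawal and forwarding‑table updates.

```python
def bfd_detection_time_ms(local_min_rx_ms, remote_min_tx_ms, remote_detect_mult):
    """Detection time for BFD asynchronous mode (RFC 5880, section 6.8.4).

    The negotiated interval is the larger of what the local side is
    willing to receive and what the remote side wants to send.
    """
    negotiated_interval_ms = max(local_min_rx_ms, remote_min_tx_ms)
    return negotiated_interval_ms * remote_detect_mult

# Assumed timers: 100 ms in both directions, multiplier 3.
detection_ms = bfd_detection_time_ms(100, 100, 3)
is_sub_second = detection_ms < 1000
```

Under these assumed timers a dead link is declared after 300 ms, compared with the minutes a manual static‑route change would take.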
Quality Improvement
The multi‑cloud network became a critical service, so stability was paramount. Fault handling was divided into five stages: prevention, detection, localization, mitigation, and recovery.
Fault Prevention
Architectural high‑availability through redundant design.
Capacity control: keep dual‑link utilization below 50% and prioritize core traffic with QoS.
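The 50% rule follows from simple arithmetic: if either of two equal links fails, the survivor must carry the combined load. A minimal sketch of that check, with hypothetical loads and capacities in Gbps:

```python
def survives_single_link_failure(loads_gbps, capacities_gbps):
    """Return True if the total load still fits after losing any one link."""
    total_load = sum(loads_gbps)
    total_capacity = sum(capacities_gbps)
    return all(total_load <= total_capacity - cap for cap in capacities_gbps)

# Two 10 Gbps links at 40% each: the survivor carries 8 Gbps, which fits.
ok = survives_single_link_failure([4, 4], [10, 10])

# Two 10 Gbps links at 60% each: the survivor would need 12 Gbps.
overloaded = survives_single_link_failure([6, 6], [10, 10])
```

QoS then covers the remaining risk: if load does exceed the surviving capacity, core traffic is served first.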
Fault Detection
Line monitoring: status and bandwidth of line gateways.
CPE monitoring via SNMP/syslog for CPU, memory, fan, optics, BGP, BFD.
Link monitoring using a full‑mesh ping system across clouds and providers.
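A full‑mesh ping system probes every endpoint from every other endpoint, in both directions. The sketch below only builds the probe schedule and computes a loss rate; the endpoint labels are hypothetical, and sending the actual ICMP probes is left out.

```python
from itertools import permutations

def build_probe_pairs(probes):
    """Return every ordered (source, target) pair so both directions are tested."""
    return list(permutations(probes, 2))

def loss_rate(sent, received):
    """Fraction of probes that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent

# Assumed naming scheme: one probe host per cloud and line provider.
probes = ["cloud-a/provider-1", "cloud-a/provider-2", "cloud-b/provider-1"]
pairs = build_probe_pairs(probes)
```

The number of pairs grows as n × (n − 1), so three probe hosts already yield six directed paths to watch.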
Fault Localization
Traffic analysis via NetFlow stored in ELK, providing Top‑N, protocol, and cross‑cloud dashboards.
Quality dashboards aggregating ping‑mesh data.
Automatic diagnostics: packet loss triggers a trace that is posted to DingTalk for rapid root‑cause identification.
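The Top‑N view in the traffic analysis item above boils down to summing bytes per source and destination pair and sorting. A minimal sketch follows; it uses plain dictionaries with hypothetical field names in place of real NetFlow records and does not model the ELK storage layer.

```python
from collections import Counter

def top_talkers(flows, n=3):
    """Sum bytes per (src, dst) pair and return the n heaviest pairs."""
    bytes_by_pair = Counter()
    for flow in flows:
        bytes_by_pair[(flow["src"], flow["dst"])] += flow["bytes"]
    return bytes_by_pair.most_common(n)

# Hypothetical flow records; field names are assumptions.
flows = [
    {"src": "10.1.0.5", "dst": "10.2.0.9", "bytes": 500},
    {"src": "10.1.0.7", "dst": "10.2.0.9", "bytes": 200},
    {"src": "10.1.0.5", "dst": "10.2.0.9", "bytes": 700},
    {"src": "10.1.0.8", "dst": "10.2.0.3", "bytes": 50},
]
heaviest = top_talkers(flows, n=2)
```

During a saturation event, a ranking like this points directly at the flows worth throttling or rerouting.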
Fault Mitigation
When rollback fails, traffic diversion is used. A pre‑defined runbook platform automates SOP execution, enabling one‑click, sub‑5‑minute mitigation.
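The article does not describe the runbook platform's internals. As one possible shape, the sketch below runs a list of named steps in order, logs each result, and stops at the first failure so an operator can take over. All step names and actions are hypothetical placeholders.

```python
def run_runbook(steps):
    """Run (name, action) steps in order; stop at the first failure."""
    log = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            return False, log
        log.append((name, "ok"))
    return True, log

def fail_verification():
    # Placeholder for a check that does not pass.
    raise RuntimeError("loss still above threshold")

# Placeholder actions; a real step would call a device or cloud API.
divert = [
    ("lower priority of degraded link", lambda: None),
    ("verify traffic moved to backup link", lambda: None),
]
succeeded, log = run_runbook(divert)

failing = [("verify traffic moved", fail_verification), ("notify", lambda: None)]
failed_ok, failed_log = run_runbook(failing)
```

Encoding the SOP as data is what makes one‑click execution possible: the same steps run in the same order every time.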
Fault Recovery
Post‑mortems capture lessons, close improvement loops, and prevent recurrence.
Continuous Operation
Service Governance
Work‑area control: restrict published routes and add firewalls for audit.
Route summarization: reduce more than 400 routes to fewer than 100 by aggregating them on the CPE devices.
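The route aggregation described above can be illustrated with Python's standard `ipaddress` module, which merges adjacent prefixes into the smallest covering set. The prefixes are made up for the example; on the CPE devices the same effect would come from the routing configuration, not from a script.

```python
import ipaddress

def summarize(prefixes):
    """Collapse adjacent or overlapping prefixes into the fewest networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]

# Hypothetical per-subnet routes announced by one cloud.
routes = [
    "10.1.0.0/24",
    "10.1.1.0/24",
    "10.1.2.0/24",
    "10.1.3.0/24",
    "10.2.0.0/24",
]
summary = summarize(routes)
# The four contiguous /24s merge into 10.1.0.0/22; 10.2.0.0/24 stays as is.
```

This kind of merging only works when address space is allocated contiguously per cloud, which is one reason a unified addressing plan matters.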
Traffic Governance
Measure cross‑cloud traffic, tag it, and assign each flow an owner so it can be reviewed in weekly operations.
Enforce policies on CPEs to block non‑essential cross‑cloud requests, achieving near‑zero cross‑cloud traffic for non‑core services.
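The blocking policy amounts to an allowlist: cross‑cloud requests pass only if they come from a core service. A minimal sketch of that decision, with hypothetical address ranges standing in for the remote cloud and the core services:

```python
import ipaddress

# Assumed ranges, for illustration only.
REMOTE_CLOUD = ipaddress.ip_network("10.2.0.0/16")
CORE_SOURCES = [ipaddress.ip_network("10.1.10.0/24")]

def is_allowed(src, dst):
    """Allow local traffic always; allow cross-cloud traffic only from core sources."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip not in REMOTE_CLOUD:
        return True  # not a cross-cloud request
    return any(src_ip in net for net in CORE_SOURCES)

core_cross_cloud = is_allowed("10.1.10.7", "10.2.3.4")     # core service
non_core_cross_cloud = is_allowed("10.1.50.7", "10.2.3.4")  # non-core service
local_only = is_allowed("10.1.50.7", "10.1.60.1")           # stays in one cloud
```

On a CPE this would be expressed as ACL entries; the point here is that the rule depends on ownership tagging to know which sources count as core.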
Summary
After three construction phases, the network moved from reactive firefighting to proactive operation, supporting a mature multi‑cloud architecture.
Experience Summary
Goal: build a minimal, reliable, and scalable multi‑cloud network.
Principles: unify architecture, protocols, links, and rules.
Future Outlook
Edge scenarios such as RTC streaming, nationwide work‑area cloud, third‑party private clouds, and edge‑cloud integration demand a distributed cloud architecture that extends the central network to the edge.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang