
How We Built a Resilient Multi‑Cloud Network: Lessons from Three Evolution Phases

This article details the step‑by‑step evolution of a multi‑cloud network at Zuoyebang, covering three construction phases, quality‑improvement measures such as fault prevention and rapid recovery, and ongoing operational governance that together deliver a flexible, high‑availability cloud infrastructure.

Zuoyebang Tech Team

Multi‑Cloud Network Journey

Zuoyebang has been built on public clouds since its inception, without a private data center, and over 95% of its services are cloud‑deployed. Through multiple architectural redesigns, the team developed a dual‑ring star topology with CPE control to achieve multi‑cloud network interconnection.

Network Construction

Phase 1 – Dual‑Cloud Network Centered on the Work Area

In late 2019, after several cloud provider outages, Zuoyebang moved from a single‑cloud to a dual‑cloud architecture. To get there quickly, the team reused the dedicated lines and IT expertise of the work area (the office network), which made the work‑area network the hub of the dual‑cloud link.

This design had two drawbacks. First, the dual‑cloud links were long and outside the team's control, and the work‑area network, originally designed for office use, became a production‑critical dependency. Together these made the network fragile and reduced reliability.

Second, strong coupling between the work‑area and dual‑cloud networks blurred team responsibilities, causing IP conflicts, routing chaos, and changes in one network that affected the other.

Phase 2 – Hybrid‑Line Dual‑Cloud Network

To eliminate reliance on the work‑area network, direct redundant lines were built between the two clouds, using the work‑area only as a fallback.

This phase left three problems. First, static routes on the redundant lines ruled out true hot standby: failover required manual route changes, so outages lasted a long time.

Second, there was no visibility into traffic on the lines and no way to throttle it, so bandwidth saturation was time‑consuming to troubleshoot.

Third, the architecture was hard to extend, so adding a new cloud provider was costly.

Phase 3 – Dual‑Ring Star Topology for Multi‑Cloud

The team evaluated four topology options (linear, ring, mesh, and star), comparing normalized scores for cost, quality, performance, and efficiency. The star topology was chosen for its scalability, and two line‑provider POPs were deployed in Beijing, with ECMP across them for link redundancy.
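To illustrate how ECMP can spread traffic across two POPs while keeping each flow on a single path, here is a minimal Python sketch. It is not the hashing used by any particular router or CPE; the POP names and the choice of SHA‑256 over the 5‑tuple are assumptions made for illustration only.

```python
import hashlib

# Hypothetical names for the two Beijing POPs; the article does not name them.
POP_PATHS = ["pop-a", "pop-b"]

def pick_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    """Choose one equal-cost path for a flow by hashing its 5-tuple."""
    if not paths:
        raise ValueError("no healthy path available")
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and map it onto a path.
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]

flow = ("10.0.0.5", "10.1.0.9", 40000, 443, "tcp")

# The same flow always hashes to the same path, so its packets stay in order.
print(pick_path(*flow, POP_PATHS) == pick_path(*flow, POP_PATHS))

# If one POP is withdrawn, every flow lands on the remaining path.
print(pick_path(*flow, ["pop-b"]))
```

Hashing per flow rather than per packet is the usual design choice because it avoids reordering within a connection, at the price of uneven load when a few flows dominate.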

Additional improvements included:

Deploying CPE devices (rented from line providers) to gain traffic visibility and control (routing, ACL, QoS, BGP, BFD).

Switching all inter‑cloud links from static routes to BGP with BFD, achieving sub‑second automatic failover.
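The sub‑second figure follows from how BFD timers work: in asynchronous mode, a peer is declared down after a detection time equal to the negotiated packet interval multiplied by the remote detect multiplier (RFC 5880). The sketch below works through that arithmetic. The 100 ms timers and the multiplier of 3 are assumed values, not settings taken from the article.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Time without BFD packets after which the local side declares the session down."""
    # The remote peer transmits no faster than the local side agrees to receive.
    negotiated_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return negotiated_interval_ms * remote_detect_mult

# Assumed timers: 100 ms in both directions, detect multiplier 3.
print(bfd_detection_time_ms(100, 100, 3))  # 300
```

Once BFD reports the session down, BGP can withdraw the affected routes immediately instead of waiting for its own hold timer, which is typically measured in tens of seconds.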

Quality Improvement

The multi‑cloud network became a critical service, so stability was paramount. Fault handling was divided into five stages: prevention, detection, localization, mitigation, and recovery.

Fault Prevention

Architectural high‑availability through redundant design.

Capacity control: keep dual‑link utilization below 50% and prioritize core traffic with QoS.
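The reasoning behind the 50% ceiling is that, with two links of equal capacity, either link can then carry all traffic if the other fails. A minimal Python sketch of that check follows; the link names and traffic figures are made up for illustration.

```python
def over_budget(links, limit=0.5):
    """Return the names of links whose utilization exceeds the limit."""
    return sorted(
        name for name, (traffic_mbps, capacity_mbps) in links.items()
        if traffic_mbps / capacity_mbps > limit
    )

def survives_loss_of(failed, links):
    """True if the remaining links have capacity for all current traffic."""
    total_traffic = sum(traffic for traffic, _ in links.values())
    remaining_capacity = sum(
        capacity for name, (_, capacity) in links.items() if name != failed
    )
    return total_traffic <= remaining_capacity

# Hypothetical links: (current traffic in Mbps, capacity in Mbps).
links = {"line-1": (4000.0, 10000.0), "line-2": (4500.0, 10000.0)}

print(over_budget(links))                 # []
print(survives_loss_of("line-1", links))  # True
```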

Fault Detection

Line monitoring: status and bandwidth of line gateways.

CPE monitoring via SNMP/syslog for CPU, memory, fan, optics, BGP, BFD.

Link monitoring using a full‑mesh ping system across clouds and providers.
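A full‑mesh ping system probes every ordered pair of endpoints, so the number of probe pairs grows as n × (n − 1). The sketch below shows only the pair generation and the loss calculation, without sending any packets. The probe names and addresses are hypothetical.

```python
from itertools import permutations

# Hypothetical probe hosts, one per cloud or line provider.
PROBES = {
    "cloud-a": "10.0.0.10",
    "cloud-b": "10.1.0.10",
    "cloud-c": "10.2.0.10",
}

def mesh_pairs(probes):
    """Every ordered (source, target) pair, so each direction is measured."""
    return list(permutations(sorted(probes), 2))

def loss_percent(sent, received):
    """Packet loss for one probe pair, as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent

pairs = mesh_pairs(PROBES)
print(len(pairs))             # 6
print(loss_percent(100, 97))  # 3.0
```

Measuring both directions matters because cross‑cloud paths are often asymmetric, so loss from A to B says little about loss from B to A.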

Fault Localization

Traffic analysis via NetFlow stored in ELK, providing Top‑N, protocol, and cross‑cloud dashboards.

Quality dashboards aggregating ping‑mesh data.

Automatic diagnostics: packet loss triggers a trace that is posted to DingTalk for rapid root‑cause identification.
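The automatic diagnostic step can be pictured as a small trigger: when loss on a probe pair crosses a threshold, run a trace and send the result to the on‑call channel. The Python sketch below takes the trace runner and the notifier as parameters, so it makes no assumption about the team's actual tooling or about the DingTalk API. The 1% threshold, the function names, and the stand‑in trace are all illustrative.

```python
def diagnose(source, target, loss_percent, run_trace, notify, threshold=1.0):
    """Run a trace and send it out when loss on a probe pair crosses the threshold."""
    if loss_percent < threshold:
        return False
    hops = run_trace(source, target)
    lines = [f"Packet loss {loss_percent:.1f}% from {source} to {target}"]
    lines += [f"  hop {i}: {hop}" for i, hop in enumerate(hops, start=1)]
    notify("\n".join(lines))
    return True

# Stand-ins so the sketch runs without a network or a chat webhook.
def fake_trace(source, target):
    return ["10.0.0.1", "172.16.0.1", target]

sent = []
diagnose("cloud-a", "10.1.0.10", 3.0, fake_trace, sent.append)
print(sent[0])
```

In a real deployment, `run_trace` would wrap a traceroute or MTR run from the source probe, and `notify` would post to the chat webhook.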

Fault Mitigation

When rollback fails, traffic is diverted instead. A runbook platform with predefined SOPs automates their execution, so mitigation takes one click and less than five minutes.
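As a rough picture of what one‑click SOP execution involves, the sketch below runs an ordered list of named steps and stops at the first failure, returning a log an operator can read. It does not describe the team's actual runbook platform; the step names and the `state` dictionary standing in for router configuration are hypothetical.

```python
def run_runbook(steps):
    """Run (name, action) steps in order and stop at the first failure."""
    log = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            return False, log
        log.append((name, "ok"))
    return True, log

# Hypothetical state standing in for the real routing preference.
state = {"preferred_line": "line-1"}

def divert_to_backup():
    state["preferred_line"] = "line-2"

def verify_diversion():
    if state["preferred_line"] != "line-2":
        raise RuntimeError("traffic still on line-1")

ok, log = run_runbook([
    ("divert traffic", divert_to_backup),
    ("verify diversion", verify_diversion),
])
print(ok, log)
```

Ending the runbook with a verification step keeps the "one click" from reporting success when the diversion did not actually take effect.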

Fault Recovery

Post‑mortems capture lessons, close improvement loops, and prevent recurrence.

Continuous Operation

Service Governance

Work‑area control: restrict published routes and add firewalls for audit.

Route aggregation: reduce more than 400 routes to fewer than 100 by summarizing them on the CPE devices.
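Route summarization can be illustrated with Python's standard `ipaddress` module, which merges adjacent prefixes into the smallest covering set. The prefixes below are made up, and a real CPE would do this with its own aggregation configuration rather than a script.

```python
import ipaddress

def summarize(prefixes):
    """Collapse contiguous prefixes into the fewest equivalent networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]

# Hypothetical routes: four contiguous /24s plus one isolated /24.
routes = [
    "10.0.0.0/24",
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
    "10.0.8.0/24",
]
print(summarize(routes))  # ['10.0.0.0/22', '10.0.8.0/24']
```

`collapse_addresses` merges only prefixes that are exactly contiguous. Operators often summarize more aggressively, advertising a covering prefix that includes unused space, which is why a deliberate address plan makes this kind of reduction much easier.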

Traffic Governance

Measure cross‑cloud traffic, tag it, and assign ownership for weekly operations.

Enforce policies on CPEs to block non‑essential cross‑cloud requests, achieving near‑zero cross‑cloud traffic for non‑core services.
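The blocking policy amounts to an allowlist check: traffic that crosses clouds is permitted only when it belongs to a core service. The sketch below expresses that rule in Python. The cloud prefixes and the core‑service prefix are hypothetical, and on a real CPE the same logic would be written as ACL entries.

```python
import ipaddress

# Hypothetical address plan: one prefix per cloud.
CLOUD_PREFIXES = {
    "cloud-a": ipaddress.ip_network("10.0.0.0/16"),
    "cloud-b": ipaddress.ip_network("10.1.0.0/16"),
}
# Hypothetical prefix of the core services allowed to cross clouds.
CORE_SERVICE_PREFIXES = [ipaddress.ip_network("10.0.1.0/24")]

def cloud_of(ip):
    """Name of the cloud that owns this address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for name, prefix in CLOUD_PREFIXES.items():
        if addr in prefix:
            return name
    return None

def is_allowed(src_ip, dst_ip):
    """Permit same-cloud traffic; permit cross-cloud traffic only from core services."""
    if cloud_of(src_ip) == cloud_of(dst_ip):
        return True  # stays within one cloud, so the policy does not apply
    src = ipaddress.ip_address(src_ip)
    return any(src in prefix for prefix in CORE_SERVICE_PREFIXES)

print(is_allowed("10.0.1.5", "10.1.0.9"))  # True: core service crossing clouds
print(is_allowed("10.0.2.5", "10.1.0.9"))  # False: non-core cross-cloud request
```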

Summary

After three construction phases, the network moved from reactive firefighting to proactive operation, supporting a mature multi‑cloud architecture.

Experience Summary

Goal: build a minimal, reliable, and scalable multi‑cloud network.

Principles: unify architecture, protocols, links, and rules.

Future Outlook

Edge scenarios such as RTC streaming, nationwide work‑area cloud, third‑party private clouds, and edge‑cloud integration demand a distributed cloud architecture that extends the central network to the edge.
