
Building High‑Availability Architecture with Service Mesh (ASM) Across Availability Zones and Regions

This article explains how to design a highly available business system on Alibaba Cloud by leveraging multi‑availability‑zone deployments, ASM circuit‑breaking and rate‑limiting, and multi‑region multi‑cluster service‑mesh strategies to ensure resilience against both AZ‑level and region‑level failures.

Alibaba Cloud Infrastructure

As a business iterates, stability becomes the most critical foundation of its digital systems, and high availability is the top design priority on cloud platforms, which are inherently organized into multiple regions and availability zones (AZs).

A complete high‑availability architecture must address two aspects: (1) deploying resources across multiple geographic locations to avoid single points of failure and ensuring the scheduler can operate across those locations; (2) providing robust service protection (e.g., circuit breaking and rate limiting) to shield critical services from burst or malicious traffic. In the cloud‑native era, the standard solution is Kubernetes combined with a service mesh, where Kubernetes handles resource scheduling and the mesh manages advanced security, traffic, and observability.

Building Availability‑Zone‑Level High Availability

- Deploy Across Multiple AZs -

In a single Kubernetes cluster, Alibaba Cloud ACK distributes all managed components, worker nodes, and elastic container instances across multiple AZs with multi‑replica, balanced placement. If an AZ fails (e.g., power outage or network loss), the healthy AZ continues to serve traffic.
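You can also make this spreading explicit for your own workloads using the standard Kubernetes `topologySpreadConstraints` field. A minimal sketch (the `web` name, labels, replica count, and image are placeholders for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Keep the Pod count per AZ within a skew of 1, so losing one AZ
      # never takes out more than its fair share of replicas.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25
```

With `whenUnsatisfiable: DoNotSchedule`, the scheduler refuses placements that would concentrate replicas in one AZ; use `ScheduleAnyway` if you prefer best-effort spreading over scheduling guarantees.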

- Use ASM Circuit Breaking and Rate Limiting to Boost Global Availability -

Beyond physical isolation, software‑level protection is essential. Circuit breaking and rate limiting significantly improve overall service availability by limiting the blast radius of local issues and preventing cascading failures.

Rate limiting protects backend services from overload, automatically degrading when traffic exceeds thresholds. ASM supports local and global rate limiting as well as custom rules (e.g., per‑user QPS).
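ASM exposes rate limiting through its own console and CRDs; under the hood the mechanism builds on Envoy's local rate limit filter, which in open-source Istio can be attached with an `EnvoyFilter`. A hedged sketch capping one workload at 100 requests per second (the `app: product` selector is a placeholder):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: product-local-ratelimit
spec:
  workloadSelector:
    labels:
      app: product   # placeholder workload label
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          stat_prefix: http_local_rate_limiter
          # Token bucket: refill 100 tokens every second → ~100 req/s.
          token_bucket:
            max_tokens: 100
            tokens_per_fill: 100
            fill_interval: 1s
          filter_enabled:
            runtime_key: local_rate_limit_enabled
            default_value:
              numerator: 100
              denominator: HUNDRED
          filter_enforced:
            runtime_key: local_rate_limit_enforced
            default_value:
              numerator: 100
              denominator: HUNDRED
```

Requests beyond the bucket capacity receive HTTP 429 at the sidecar, before they ever reach the application.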

Circuit breaking temporarily cuts off calls to upstream services that are failing or overloaded, preventing the issue from affecting the whole system. Unlike traditional libraries that require code changes, a service mesh provides circuit breaking transparently.
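In Istio-compatible meshes such as ASM, circuit breaking is typically declared in a `DestinationRule`: a connection pool caps concurrent load, and outlier detection ejects failing endpoints. A sketch with an illustrative host name and thresholds:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews.default.svc.cluster.local   # illustrative service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 10  # queue depth before rejecting
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5        # eject an endpoint after 5 straight 5xx
      interval: 10s                  # how often endpoints are scanned
      baseEjectionTime: 30s          # minimum ejection duration
      maxEjectionPercent: 50         # never eject more than half the pool
```

Because the sidecar enforces these limits, the application needs no Hystrix/Sentinel-style library changes.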

ASM’s circuit‑breaking and rate‑limiting capabilities are richer than the open‑source Istio version, offering more granular conditions to ensure optimal global performance.

- AZ Traffic Retention -

When workloads are spread across AZs, Kubernetes Service load balancing distributes traffic evenly among Pods in different AZs.

Because cross‑AZ calls increase latency, it is ideal to keep traffic within the same AZ under normal conditions. The service mesh’s locality‑aware routing keeps calls in the same AZ unless a failure forces a failover.
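With Istio-compatible APIs, this locality preference can be enabled per service via `localityLbSetting` in a `DestinationRule`. Note that locality load balancing only takes effect when outlier detection is also configured, since the mesh needs endpoint health to decide when to fail over to another AZ. A sketch with illustrative names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-locality
spec:
  host: web.default.svc.cluster.local   # illustrative service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true      # prefer endpoints in the caller's own AZ
    outlierDetection:      # required for locality failover to activate
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

Under normal conditions traffic stays in-zone; when a zone's endpoints are ejected as unhealthy, calls spill over to the next-closest locality automatically.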

- ASM High Availability -

The ASM control plane is deployed across multiple AZs by default, so the loss of a single AZ does not interrupt configuration delivery to the data plane. The data plane additionally caches its configuration, allowing sidecars to keep serving traffic for a time even if the control plane becomes entirely unreachable.

Summary (AZ Level)

Deploying a single cluster across multiple AZs simplifies disaster recovery and reduces operational cost, but it cannot tolerate region‑level failures (e.g., a natural disaster or a region‑wide network outage). To achieve higher availability, multi‑region deployment is required.

Building Region‑Level High Availability

- Multi‑Region Multi‑Cluster Disaster Recovery -

Multi‑region deployments require multiple clusters. Traffic is split between two entry points; under normal conditions DNS can route users to the nearest healthy cluster. Alibaba Cloud Smart DNS or Global Traffic Manager (GTM) can perform health‑checked DNS routing, while ASM gateways provide advanced circuit‑breaking and rate‑limiting for each cluster.

However, a global traffic switch faces two challenges:

A global switch abruptly redirects all traffic to a single surviving cluster, putting sudden pressure on its capacity scaling and cache warm‑up.

Complex partial failures (e.g., application A fails in one cluster while application B fails in the other) cannot be resolved by DNS switching alone, which can only redirect entire entry points.

A service mesh offers a finer‑grained alternative to global failover. ASM's multi‑cluster service discovery shares service information across clusters, enabling sub‑second, seamless failover for AZ‑, node‑, or service‑level faults, and it can keep the business available even when both clusters suffer partial failures.

- Building a More Complete Multi‑Cluster Disaster Recovery with ASM -

In each region, create an ASM instance (or use the nearest region for non‑Alibaba clusters) and add the remote cluster in “service‑discovery‑only” mode. Both sides can discover each other’s services, providing the same automatic failback and fault‑tolerance as a single‑cluster setup.

For cross‑cluster communication, use Alibaba Cloud CEN to bridge physical networks. When physical connectivity is impossible (e.g., hybrid cloud or non‑Alibaba clusters), ASM’s cross‑cluster gateway can establish a secure mTLS tunnel over the public internet.
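ASM's cross‑cluster gateway plays a role analogous to the east‑west gateway in open‑source Istio multi‑network deployments, where cross‑cluster traffic is tunneled over mTLS and passed through to the destination sidecar. A hedged sketch of the open‑source equivalent (gateway name and selector follow Istio's multi‑cluster samples):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
  namespace: istio-system
spec:
  selector:
    istio: eastwestgateway     # the dedicated east-west gateway deployment
  servers:
  - port:
      number: 15443            # conventional port for cross-network mTLS
      name: tls
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH   # terminate nothing; SNI routes to the target sidecar
    hosts:
    - "*.local"                # expose all in-mesh services to the peer network
```

`AUTO_PASSTHROUGH` means the gateway never decrypts traffic; the sidecar‑to‑sidecar mTLS channel stays end‑to‑end, which is what makes the public‑internet path acceptable.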

Final Summary

To tolerate region‑level failures, multi‑cluster deployment is essential. Simple entry‑point switching is insufficient for many scenarios; using a service mesh to interconnect clusters dramatically improves disaster‑recovery capabilities across regions. Alibaba Cloud Service Mesh can manage any Kubernetes cluster, granting robust high‑availability for complex, multi‑region architectures.


Tags: High Availability, Kubernetes, Service Mesh, ASM, multi-AZ, multi-region