
Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM

This guide explains how to achieve zone‑level disaster recovery on Alibaba Cloud: deploy multi‑AZ ACK clusters, configure Service Mesh (ASM) for observability and traffic shifting, and use Prometheus‑based metrics and alerts to detect and isolate failures. Step‑by‑step instructions and sample YAML manifests are included.


Zone‑level failures can render workloads in an entire availability zone (AZ) unavailable, causing service disruption or data errors. Common causes include power outages, infrastructure faults, resource exhaustion, and human error. To mitigate these risks, both the cloud infrastructure and the application itself must be designed for resilience.

1. Multi‑AZ High Availability with Alibaba Cloud Managed Components

Alibaba Cloud Container Service for Kubernetes (ACK) and Service Mesh (ASM) run all managed components with multiple replicas spread across AZs. Because the control plane, worker nodes, and elastic container instances are distributed evenly, the failure of a single AZ does not take down the cluster as a whole.

2. Deploying Applications Across AZs

When creating an ACK cluster, select node pools that span multiple AZs and use balanced scaling policies. Use topology spread constraints or node selectors (e.g., topology.kubernetes.io/zone) to distribute workloads evenly. Refer to the ACK high‑availability architecture guide for details.
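For example, a Deployment can use topologySpreadConstraints to keep its replicas balanced across zones. A minimal sketch, assuming the mocka service name from this article and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # allow at most 1 replica of imbalance
          topologyKey: topology.kubernetes.io/zone    # spread across AZs
          whenUnsatisfiable: DoNotSchedule            # hard constraint; use ScheduleAnyway for best-effort
          labelSelector:
            matchLabels:
              app: mocka
      containers:
        - name: mocka
          image: registry.example.com/mocka:latest    # placeholder image
```

With two replicas and two AZs, DoNotSchedule guarantees one replica lands in each zone, so a single‑AZ outage leaves one copy running.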

3. Observability and Fault Detection

ASM’s sidecar proxies expose metrics such as request counts, latency, and error codes. By adding the locality dimension (e.g., xds.node.locality.zone) to Prometheus metrics, you can monitor each AZ separately. Sample PromQL queries are provided to view request rates and latency per AZ.
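Assuming the locality dimension has been added to the standard Istio metrics as a label (the label name below, pod_zone, is illustrative and must match your ASM metric configuration), per‑AZ queries look like this:

```promql
# Requests per second, broken down by zone
sum by (pod_zone) (rate(istio_requests_total[1m]))

# 90th-percentile request latency (ms) per zone
histogram_quantile(0.9,
  sum by (pod_zone, le) (rate(istio_request_duration_milliseconds_bucket[1m])))
```

A healthy multi‑AZ deployment shows comparable rates and latencies per zone; a diverging series usually points at the zone to investigate.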

4. Alerting Based on Metrics

Custom Prometheus alerts can be defined for latency or non‑200 response codes, grouped by service and AZ. Example alerts trigger when mockb latency exceeds 3 ms or when mocka returns any status other than 200.
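A PrometheusRule along these lines captures both conditions (metric and label names are illustrative and must match the labels your ASM setup actually emits):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: asm-az-alerts
spec:
  groups:
    - name: az-health
      rules:
        - alert: MockbHighLatency
          # Fires when mockb's p90 latency in any zone exceeds 3 ms for 1 minute
          expr: |
            histogram_quantile(0.9,
              sum by (pod_zone, le)
                (rate(istio_request_duration_milliseconds_bucket{destination_service_name="mockb"}[1m]))) > 3
          for: 1m
          labels:
            severity: warning
        - alert: MockaNon200Responses
          # Fires on any non-200 response from mocka, grouped by zone
          expr: |
            sum by (pod_zone)
              (rate(istio_requests_total{destination_service_name="mocka", response_code!="200"}[1m])) > 0
          labels:
            severity: critical
```

Grouping by the zone label lets the alert itself tell you which AZ is degraded.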

5. Traffic Isolation During an AZ Failure

When an AZ becomes unhealthy, isolate its nodes by adding a taint to make them unschedulable, and use NLB/ALB DNS removal to stop inbound traffic. ASM’s AZ‑traffic‑transfer feature can also redirect east‑west traffic away from the affected zone. After isolation, verify that all traffic originates from healthy AZs using Prometheus queries.
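The node‑isolation step can be sketched with kubectl (the zone name cn-hangzhou-a and the taint key az-failure are placeholders):

```shell
# Cordon and taint every node in the failed zone so no new pods are scheduled there
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=cn-hangzhou-a -o name); do
  kubectl cordon "$node"
  kubectl taint nodes "${node#node/}" az-failure=true:NoSchedule
done
```

Cordoning and tainting only stops new scheduling; combine it with the NLB/ALB DNS removal and ASM traffic transfer described above to drain traffic that is already flowing.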

6. Recovery Procedure

To restore service, remove node taints, delete the custom service‑discovery range, and re‑enable DNS for the NLB. This returns traffic to the repaired AZ and resumes normal operation.
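The taint removal mirrors the isolation step (same placeholder zone and taint key as before):

```shell
# Remove the taint and uncordon once the zone is healthy again
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=cn-hangzhou-a -o name); do
  kubectl taint nodes "${node#node/}" az-failure=true:NoSchedule-
  kubectl uncordon "$node"
done
```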

The article includes a complete YAML manifest that creates three services (mocka, mockb, mockc) with two replicas each, distributed across two AZs, along with corresponding Istio Gateway and VirtualService definitions.

kubectl apply -f - <<EOF
# ... (the article's full manifest goes here)
EOF
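The north‑south routing portion of that manifest can be sketched as follows; the host, port, and path values are illustrative, and the real manifest repeats the routing pattern for mockb and mockc:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mock-gateway
spec:
  selector:
    istio: ingressgateway     # bind to the mesh's ingress gateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mocka-vs
spec:
  hosts:
    - "*"
  gateways:
    - mock-gateway
  http:
    - match:
        - uri:
            prefix: /mock     # illustrative path
      route:
        - destination:
            host: mocka.default.svc.cluster.local
            port:
              number: 8000    # illustrative service port
```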

By following these steps, you can build a robust multi‑AZ deployment on Alibaba Cloud, detect AZ‑level incidents quickly, isolate the affected zone, and recover services with minimal impact.

Tags: Kubernetes, Prometheus, disaster recovery, Service Mesh, Alibaba Cloud, multi-AZ
Written by

Alibaba Cloud Infrastructure

For uninterrupted computing services
