
Ele.me’s Secret to Seamless Multi-Region Active-Active Architecture

This article details how Ele.me engineered a cross‑region active‑active system that scales elastically, tolerates whole‑data‑center failures, and maintains real‑time food‑delivery performance through geographic sharding, intelligent routing, and robust data‑replication middleware.

Efficient Ops

Background: Why Build Cross‑Region Active‑Active?

Ele.me’s rapid growth exhausted the capacity of a single data center, prompting the need to distribute services across multiple sites and to survive whole‑data‑center outages without disrupting order flow.

Goals

Enable services to expand across several data centers.

Maintain availability during an entire data‑center failure.

Challenges of Network Latency

Beijing‑to‑Shanghai round‑trip latency is about 30 ms, roughly 60 times the 0.5 ms round trip within a single data center. The article lists typical latency figures for caches, locks, memory, disks, and network hops to show how quickly cross‑region calls add up when a request makes many of them in sequence.
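The arithmetic behind this gap can be checked with a short sketch. The two round-trip figures come from the text; the count of 20 sequential calls per request is an assumption chosen only for illustration.

```python
# Round-trip figures quoted in the text.
CROSS_REGION_RTT_MS = 30.0  # Beijing <-> Shanghai
INTRA_DC_RTT_MS = 0.5       # inside one data center

# Assumption for illustration: one request makes 20 calls, one after another.
SEQUENTIAL_CALLS = 20


def network_time_ms(sequential_calls: int, rtt_ms: float) -> float:
    """Time spent waiting on the network when calls cannot overlap."""
    return sequential_calls * rtt_ms


ratio = CROSS_REGION_RTT_MS / INTRA_DC_RTT_MS
local_ms = network_time_ms(SEQUENTIAL_CALLS, INTRA_DC_RTT_MS)
remote_ms = network_time_ms(SEQUENTIAL_CALLS, CROSS_REGION_RTT_MS)

print(f"ratio: {ratio:.0f}x")                    # ratio: 60x
print(f"all calls local:  {local_ms:.0f} ms")    # all calls local:  10 ms
print(f"all calls remote: {remote_ms:.0f} ms")   # all calls remote: 600 ms
```

Under that assumption, a request that costs 10 ms of network time inside one data center costs 600 ms when every call crosses regions, which is why the design keeps an order's whole call chain local.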

Design Principles

Business Cohesion: An order’s entire lifecycle (user, merchant, rider) must stay within a single “eZone” to guarantee low latency.

Availability First: In a failover, keep the system usable even if data briefly diverges; each eZone holds a full data copy.

Data Correctness: Lock orders with inconsistent states during a switch to prevent corruption.

Business Awareness: Services must recognize their eZone and process only local data, using state machines to detect and correct inconsistencies.

Service Sharding

The sharding key is geographic location. Users, merchants, and riders that are close are placed in the same eZone, ensuring that an order’s processing stays within one data center. Custom geographic fences divide the country into shards, which are grouped into eZones.
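A minimal sketch of fence-based sharding follows. The fence coordinates, shard IDs, and eZone names are hypothetical, and the ray-casting point-in-polygon test is a standard technique picked for illustration; the article does not say which algorithm Ele.me uses.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)

# Hypothetical fences: each shard is a polygon given as a list of vertices.
SHARD_FENCES: Dict[int, List[Point]] = {
    1: [(116.0, 39.5), (117.0, 39.5), (117.0, 40.5), (116.0, 40.5)],
    2: [(121.0, 30.8), (122.0, 30.8), (122.0, 31.6), (121.0, 31.6)],
}

# Hypothetical grouping of shards into eZones.
SHARD_TO_EZONE: Dict[int, str] = {1: "ezone-north", 2: "ezone-south"}


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges whose endpoints lie on opposite sides of y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate_shard(p: Point) -> Optional[int]:
    """Return the first shard whose fence contains the point, if any."""
    for shard_id, fence in SHARD_FENCES.items():
        if point_in_polygon(p, fence):
            return shard_id
    return None


def locate_ezone(p: Point) -> Optional[str]:
    """Map a location to its eZone via the shard that contains it."""
    shard_id = locate_shard(p)
    return None if shard_id is None else SHARD_TO_EZONE[shard_id]


print(locate_ezone((116.4, 39.9)))    # ezone-north
print(locate_ezone((121.47, 31.23)))  # ezone-south
```

Because the user, merchant, and rider for one order are all located through the same fences, they resolve to the same shard and therefore the same eZone.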

Traffic Routing

An API Router deployed in public‑cloud regions receives a routing tag from the front‑end app, maps it to a Shard ID, then to an eZone, and forwards the request accordingly. A layered routing scheme also supports high‑level keys beyond geography.
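The lookup chain (routing tag to Shard ID to eZone) might look like the sketch below. The tag format `shard=<id>`, the merchant-to-shard table, the upstream URLs, and the rule that a high-level key takes priority over the geographic tag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical stand-ins for centrally managed routing data.
SHARD_TO_EZONE: Dict[int, str] = {1: "ezone-a", 2: "ezone-a", 3: "ezone-b"}
EZONE_UPSTREAM: Dict[str, str] = {
    "ezone-a": "https://ezone-a.example.internal",
    "ezone-b": "https://ezone-b.example.internal",
}
MERCHANT_TO_SHARD: Dict[str, int] = {"m-1001": 3}  # a non-geographic key
DEFAULT_EZONE = "ezone-a"


@dataclass
class Request:
    path: str
    routing_tag: Optional[str] = None  # assumed format: "shard=<id>"
    merchant_id: Optional[str] = None


def shard_from_tag(tag: Optional[str]) -> Optional[int]:
    """Parse the geographic routing tag sent by the front-end app."""
    if not tag:
        return None
    key, _, value = tag.partition("=")
    if key == "shard" and value.isdigit():
        return int(value)
    return None


def resolve_shard(req: Request) -> Optional[int]:
    """Layered lookup: try the high-level key first, then the geographic tag."""
    if req.merchant_id is not None and req.merchant_id in MERCHANT_TO_SHARD:
        return MERCHANT_TO_SHARD[req.merchant_id]
    return shard_from_tag(req.routing_tag)


def route(req: Request) -> str:
    """Return the upstream address of the eZone that owns the request."""
    shard_id = resolve_shard(req)
    ezone = DEFAULT_EZONE
    if shard_id is not None:
        ezone = SHARD_TO_EZONE.get(shard_id, DEFAULT_EZONE)
    return EZONE_UPSTREAM[ezone]


print(route(Request("/orders", routing_tag="shard=3")))
# https://ezone-b.example.internal
```

Keeping the shard-to-eZone step as a separate table means a failover only has to remap shards to another eZone; clients and tags stay unchanged.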

Data Replication

All eZones keep full data copies. A Data Replication Center (DRC) synchronizes MySQL bidirectionally within 1 s, resolves primary‑key conflicts via timestamps, and broadcasts changes for cache invalidation. Separate tools replicate ZooKeeper, message queues, and Redis.
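Timestamp-based conflict resolution can be sketched as a last-write-wins rule applied when a remote change arrives. The `RowChange` fields, the in-memory table, and the tie-break on eZone name are hypothetical; the article only states that conflicts are resolved via timestamps.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RowChange:
    primary_key: int
    payload: dict
    updated_at_ms: int   # timestamp carried with the row
    origin_ezone: str    # eZone where the write happened


class ReplicaTable:
    """In-memory stand-in for one table in one eZone."""

    def __init__(self) -> None:
        self.rows: Dict[int, RowChange] = {}

    def apply_remote(self, change: RowChange) -> bool:
        """Apply a replicated change unless the local row is newer.

        Ties on the timestamp are broken by eZone name (an assumption)
        so that every eZone converges on the same winner.
        """
        current = self.rows.get(change.primary_key)
        if current is not None:
            current_version = (current.updated_at_ms, current.origin_ezone)
            incoming_version = (change.updated_at_ms, change.origin_ezone)
            if current_version >= incoming_version:
                return False  # keep the local row; the incoming one is stale
        self.rows[change.primary_key] = change
        return True


table = ReplicaTable()
table.apply_remote(RowChange(7, {"status": "paid"}, 2000, "ezone-a"))
applied = table.apply_remote(RowChange(7, {"status": "created"}, 1000, "ezone-b"))
print(applied, table.rows[7].payload)  # False {'status': 'paid'}
```

A rule like this depends on reasonably synchronized clocks across data centers, which is one reason the design also tries to keep each order's writes inside a single eZone.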

Strong Consistency (Global Zone)

For workloads requiring strict consistency, writes are directed to a master data center while reads can be served from any eZone’s replica, achieved through a dedicated data‑access layer.
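The read/write split for Global Zone data can be sketched as a small routing decision inside the data-access layer. The class name, the eZone names, and the verb-based classification of statements are assumptions; a real layer would also have to handle transactions and reads that must see the latest write.

```python
# Statements treated as reads in this sketch (an assumption, not a full SQL parser).
READ_VERBS = {"SELECT", "SHOW"}


class GlobalZoneRouter:
    """Decide which eZone's database should run a statement on Global Zone data."""

    def __init__(self, local_ezone: str, master_ezone: str) -> None:
        self.local_ezone = local_ezone
        self.master_ezone = master_ezone

    def target_for(self, sql: str) -> str:
        words = sql.split(None, 1)
        if not words:
            raise ValueError("empty SQL statement")
        verb = words[0].upper()
        # Reads use the local replica; everything else goes to the master.
        return self.local_ezone if verb in READ_VERBS else self.master_ezone


router = GlobalZoneRouter(local_ezone="ezone-b", master_ezone="ezone-a")
print(router.target_for("SELECT * FROM city WHERE id = 1"))      # ezone-b
print(router.target_for("UPDATE city SET name = 'x' WHERE id = 1"))  # ezone-a
```

Reads served this way can lag the master by the replication delay, so the pattern fits data that is read far more often than it is written.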

Failover Protection

Do not switch traffic when only the network between eZones is down; each eZone can keep serving its own traffic independently.

Lock orders during a switch until replication catches up.

Reject writes that target a different eZone.

DRC reports illegal writes for investigation.
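Two of the protections above, locking orders during a switch and rejecting writes that belong to another eZone, can be sketched as a guard that runs before every write. The class, its method names, and the shard table are hypothetical.

```python
from typing import Dict, Set


class WriteRejected(Exception):
    """Raised when a write must not be executed in this eZone."""


class WriteGuard:
    def __init__(self, local_ezone: str, shard_to_ezone: Dict[int, str]) -> None:
        self.local_ezone = local_ezone
        self.shard_to_ezone = shard_to_ezone
        self.locked_orders: Set[str] = set()

    def lock(self, order_id: str) -> None:
        """Lock an order whose state may be inconsistent during a switch."""
        self.locked_orders.add(order_id)

    def unlock(self, order_id: str) -> None:
        """Release the lock once replication has caught up."""
        self.locked_orders.discard(order_id)

    def check_write(self, order_id: str, shard_id: int) -> None:
        """Raise WriteRejected unless this eZone may write the order now."""
        owner = self.shard_to_ezone.get(shard_id)
        if owner != self.local_ezone:
            raise WriteRejected(f"shard {shard_id} belongs to {owner}")
        if order_id in self.locked_orders:
            raise WriteRejected(f"order {order_id} is locked during switchover")


guard = WriteGuard("ezone-a", {1: "ezone-a", 2: "ezone-b"})
guard.check_write("order-1", shard_id=1)  # allowed, returns None
try:
    guard.check_write("order-2", shard_id=2)
except WriteRejected as exc:
    print(exc)  # shard 2 belongs to ezone-b
```

Rejecting the write outright, rather than forwarding it, keeps a misrouted request from creating a row that would later conflict during replication.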

Cache Refresh Across eZones

Data‑change events broadcast by DRC trigger cache invalidation in all eZones, keeping caches consistent.
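The invalidation flow can be sketched as a publish/subscribe loop in which each eZone's cache drops the key named by a change event. The classes, the `table:pk` key format, and the in-process broadcaster are hypothetical stand-ins for the real messaging path.

```python
from typing import Callable, Dict, List, Optional, Tuple

ChangeEvent = Tuple[str, int]  # (table name, primary key)


class EZoneCache:
    """In-memory stand-in for the cache of one eZone."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: Dict[str, object] = {}

    @staticmethod
    def key(table: str, pk: int) -> str:
        return f"{table}:{pk}"

    def put(self, table: str, pk: int, value: object) -> None:
        self._store[self.key(table, pk)] = value

    def get(self, table: str, pk: int) -> Optional[object]:
        return self._store.get(self.key(table, pk))

    def on_change(self, event: ChangeEvent) -> None:
        """Drop the cached entry; the next read reloads it from the database."""
        table, pk = event
        self._store.pop(self.key(table, pk), None)


class ChangeBroadcaster:
    """Delivers each data-change event to every subscribed eZone."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: ChangeEvent) -> None:
        for handler in self._subscribers:
            handler(event)


north, south = EZoneCache("ezone-north"), EZoneCache("ezone-south")
bus = ChangeBroadcaster()
bus.subscribe(north.on_change)
bus.subscribe(south.on_change)

north.put("merchant", 42, {"open": True})
south.put("merchant", 42, {"open": True})
bus.publish(("merchant", 42))
print(north.get("merchant", 42), south.get("merchant", 42))  # None None
```

Deleting the entry instead of writing the new value into the cache avoids ordering problems when events arrive late or more than once.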

Overall Architecture

The system consists of five core middleware components:

APIRouter: HTTP reverse proxy and load balancer that routes API traffic based on sharding keys.

Global Zone Service (GZS): Central routing table and coordination service, pushing updates to SDK caches.

SOA Proxy: Internal gateway for inter‑eZone SOA calls, applying the same routing logic.

Data Replication Center: Handles real‑time replication for MySQL, ZooKeeper, MQ, and Redis.

Data Access Layer: Enforces routing rules and protects against incorrect writes, supporting Global Zone semantics.

Future Plans

Ele.me plans to expand from two data centers to three or four and to add a cloud‑based eZone, using public‑cloud elasticity to improve availability and scalability.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
