Ele.me’s Secret to Seamless Multi-Region Active-Active Architecture
This article details how Ele.me engineered a cross‑region active‑active system that scales elastically, tolerates whole‑data‑center failures, and maintains real‑time food‑delivery performance through geographic sharding, intelligent routing, and robust data‑replication middleware.
Background: Why Build Cross‑Region Active‑Active?
Ele.me’s rapid growth exhausted the capacity of a single data center, prompting the need to distribute services across multiple sites and to survive whole‑data‑center outages without disrupting order flow.
Goals
Enable services to expand across several data centers.
Maintain availability during an entire data‑center failure.
Challenges of Network Latency
Beijing‑to‑Shanghai round‑trip latency is about 30 ms, roughly 60× slower than intra‑datacenter latency (0.5 ms). The article lists typical latency figures for caches, locks, memory, disks, and network hops, illustrating the performance impact of cross‑region calls.
Design Principles
Business Cohesion: An order’s entire lifecycle (user, merchant, rider) must stay within a single “eZone” to guarantee low latency.
Availability First: In a failover, keep the system usable even if data briefly diverges; each eZone holds a full data copy.
Data Correctness: Lock orders with inconsistent states during a switch to prevent corruption.
Business Awareness: Services must recognize their eZone and process only local data, using state machines to detect and correct inconsistencies.
Service Sharding
The sharding key is geographic location. Users, merchants, and riders that are close are placed in the same eZone, ensuring that an order’s processing stays within one data center. Custom geographic fences divide the country into shards, which are grouped into eZones.
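The fence-to-shard-to-eZone mapping can be sketched as follows. This is a minimal illustration, not Ele.me's actual implementation: the fences are hypothetical bounding boxes standing in for real geographic polygons, and all IDs and names are invented.

```python
# Hypothetical sketch of geographic sharding: each fence is an
# illustrative bounding box (shard_id, min_lat, min_lng, max_lat, max_lng).
FENCES = [
    (1, 30.8, 121.0, 31.5, 122.0),   # e.g. a Shanghai-area fence
    (2, 39.6, 115.9, 40.3, 117.0),   # e.g. a Beijing-area fence
]

# Shards are grouped into eZones; nearby users, merchants, and riders
# therefore land in the same eZone, keeping an order local.
SHARD_TO_EZONE = {1: "ezone-south", 2: "ezone-north"}

def shard_for(lat: float, lng: float):
    """Return the shard whose fence contains the coordinate, else None."""
    for shard_id, lo_lat, lo_lng, hi_lat, hi_lng in FENCES:
        if lo_lat <= lat <= hi_lat and lo_lng <= lng <= hi_lng:
            return shard_id
    return None

def ezone_for(lat: float, lng: float):
    """Resolve a coordinate to its owning eZone."""
    return SHARD_TO_EZONE.get(shard_for(lat, lng))
```

Because an order's user, merchant, and rider coordinates fall inside the same fence, every party resolves to the same eZone and the order never crosses a data-center boundary.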
Traffic Routing
An API Router deployed in public‑cloud regions receives a routing tag from the front‑end app, maps it to a Shard ID, then to an eZone, and forwards the request accordingly. A layered routing scheme also supports high‑level keys beyond geography.
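The router's lookup chain (routing tag → Shard ID → eZone → upstream) can be sketched roughly as below; the header name, tags, and upstream addresses are all hypothetical placeholders, not the actual APIRouter configuration.

```python
# Illustrative lookup chain for the API Router (all names hypothetical):
# routing tag from the front-end app -> shard ID -> eZone -> upstream.
TAG_TO_SHARD = {"shanghai-pudong": 1, "beijing-chaoyang": 2}
SHARD_TO_EZONE = {1: "ezone-south", 2: "ezone-north"}
EZONE_UPSTREAM = {
    "ezone-south": "10.1.0.1:8080",
    "ezone-north": "10.2.0.1:8080",
}

def route(request_headers: dict) -> str:
    """Resolve the upstream address for a request carrying a routing tag."""
    tag = request_headers.get("x-routing-tag")
    shard = TAG_TO_SHARD.get(tag)
    ezone = SHARD_TO_EZONE.get(shard)
    if ezone is None:
        raise LookupError(f"no eZone for routing tag {tag!r}")
    return EZONE_UPSTREAM[ezone]
```

A layered scheme would simply consult higher-priority keys (for example, an explicit override tag) before falling back to the geographic mapping.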
Data Replication
All eZones keep full data copies. A Data Replication Center (DRC) synchronizes MySQL bidirectionally within 1 s, resolves primary‑key conflicts via timestamps, and broadcasts changes for cache invalidation. Separate tools replicate ZooKeeper, message queues, and Redis.
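The timestamp-based conflict rule amounts to last-write-wins on colliding primary keys. A minimal sketch, assuming hypothetical row dictionaries with an `updated_at` field:

```python
# Sketch of DRC-style conflict resolution during bidirectional replication:
# when two eZones write the same primary key, keep the newer version.
# Field names are hypothetical.

def resolve_conflict(local_row: dict, replicated_row: dict) -> dict:
    """Last-write-wins: keep the row with the newer modification timestamp.

    Ties favor the local row, so resolution stays deterministic per eZone.
    """
    if replicated_row["updated_at"] > local_row["updated_at"]:
        return replicated_row
    return local_row
```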
Strong Consistency (Global Zone)
For workloads requiring strict consistency, writes are directed to a master data center while reads can be served from any eZone’s replica, achieved through a dedicated data‑access layer.
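In the data-access layer, Global Zone semantics reduce to read/write splitting. A rough sketch, with hypothetical connection strings standing in for the real master and replica endpoints:

```python
# Sketch of Global Zone routing in a data-access layer (endpoints are
# hypothetical): writes always target the master data center, while
# reads are served from the local eZone's replica.
MASTER_DSN = "mysql://master-dc/global"
LOCAL_REPLICA_DSN = "mysql://local-ezone/global-replica"

WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

def pick_dsn(sql: str) -> str:
    """Route a statement: writes to the master, reads to the local replica."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    return MASTER_DSN if verb in WRITE_VERBS else LOCAL_REPLICA_DSN
```

Reads may therefore lag the master slightly, which is the usual trade-off for keeping read latency local while preserving a single write authority.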
Failover Protection
Avoid switching if the network is down; each eZone can serve independently.
Lock orders during a switch until replication catches up.
Reject writes that target a different eZone.
DRC reports illegal writes for investigation.
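The third protection, rejecting writes that belong to another eZone, can be sketched as a guard in the data-access layer. All names here are invented for illustration:

```python
# Sketch of the "reject foreign writes" guard (names hypothetical):
# the data-access layer refuses any write whose shard is not owned by
# the current eZone, and would report it for investigation.
CURRENT_EZONE = "ezone-south"
SHARD_TO_EZONE = {1: "ezone-south", 2: "ezone-north"}

class ForeignWriteError(Exception):
    """Raised when a write targets a shard owned by a different eZone."""

def check_write(shard_id: int) -> None:
    owner = SHARD_TO_EZONE.get(shard_id)
    if owner != CURRENT_EZONE:
        # In production, DRC would also surface this write for investigation.
        raise ForeignWriteError(
            f"shard {shard_id} belongs to {owner}, not {CURRENT_EZONE}")
```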
Cache Refresh Across eZones
Data‑change events broadcast by DRC trigger cache invalidation in all eZones, keeping caches consistent.
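A subscriber in each eZone can translate broadcast change events into local evictions. A minimal sketch, with a hypothetical event shape and an in-process dict standing in for the real cache:

```python
# Sketch of DRC-driven cache invalidation (event shape hypothetical):
# every eZone subscribes to data-change events and evicts affected keys,
# so the next read repopulates the cache from freshly replicated data.
cache = {"orders:42": {"status": "PAID"}}

def on_data_change(event: dict) -> None:
    """Evict the cache entry named by a broadcast change event."""
    cache.pop(f"{event['table']}:{event['pk']}", None)
```

Invalidating rather than updating avoids racing the replication stream: the cache is simply refilled on the next read.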
Overall Architecture
The system consists of five core middleware components:
APIRouter: HTTP reverse proxy and load balancer that routes API traffic based on sharding keys.
Global Zone Service (GZS): Central routing table and coordination service, pushing updates to SDK caches.
SOA Proxy: Internal gateway for inter‑eZone SOA calls, applying the same routing logic.
Data Replication Center: Handles real‑time replication for MySQL, ZooKeeper, MQ, and Redis.
Data Access Layer: Enforces routing rules and protects against incorrect writes, supporting Global Zone semantics.
Future Plans
Ele.me plans to expand from two data centers to three or four, and to add a cloud‑based eZone that leverages public‑cloud elasticity for global high availability and scalability.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.