
How Ele.me Achieved Cross‑Region Active‑Active MySQL: Architecture, Challenges & Lessons

This article details Ele.me's practical experience building a cross‑region active‑active database system, covering latency challenges, architectural design, extensive database refactoring, DBA operational hurdles, consistency verification tools, and future scalability plans.


1. Challenges in Active‑Active

Ele.me needed to implement cross‑region (Beijing‑Shanghai) active‑active databases while confronting a network round trip of about 30 ms between the two regions. For call chains that cross regions repeatedly, that 30 ms compounds into hundreds of milliseconds, more delay than many applications can tolerate.
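The amplification is simple arithmetic: every sequential cross‑region call pays the full round trip. A back‑of‑envelope sketch (the call count is illustrative):

```python
# Back-of-envelope for the latency amplification described above:
# N sequential cross-region calls each pay the ~30 ms round trip.
RTT_MS = 30

def total_latency_ms(sequential_calls: int) -> int:
    """Lower bound on added latency for a chain of serial cross-region calls."""
    return sequential_calls * RTT_MS

print(total_latency_ms(10))  # 300 ms of pure network time for ten serial calls
```

This is why the design below routes a user's whole request chain to one data center instead of letting individual calls cross regions.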

Key difficulties include:

Distinguishing between same‑city and cross‑city active‑active; same‑city latency is negligible, but cross‑city latency requires careful design.

Ensuring data safety with multiple write points, avoiding conflicts, circular replication, and data loops.

Maintaining consistency despite multiple write sources.

To mitigate latency impact, Ele.me groups user traffic so that a single user’s requests are routed to the same data center and classifies services as either active‑active capable or globally shared (e.g., user data).

Traffic routing relies on geographic fences (POI) and a virtual ShardingKey that maps logical shards to physical locations, with APIRouter directing traffic accordingly.
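The virtual ShardingKey adds a level of indirection between users and data centers. A minimal sketch of the idea, assuming a modulo‑derived shard and a hypothetical shard‑to‑DC mapping table (the names `NUM_SHARDS` and `shard_to_dc` are illustrative, not Ele.me's actual API):

```python
NUM_SHARDS = 32  # virtual shards; remapped to physical locations as needed

# Hypothetical mapping maintained by the routing layer (APIRouter would
# consult something like this): even shards in Shanghai, odd in Beijing.
shard_to_dc = {s: ("shanghai" if s % 2 == 0 else "beijing")
               for s in range(NUM_SHARDS)}

def route(user_id: int) -> str:
    """Pin every request from the same user to the same data center."""
    shard = user_id % NUM_SHARDS  # stable virtual ShardingKey
    return shard_to_dc[shard]     # physical placement resolved via indirection
```

The indirection is the point: rebalancing a shard to another region only updates `shard_to_dc`; callers and the key derivation never change.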

To handle write conflicts, a DRC timestamp column is added to every table; when the same row is modified in both regions, the version carrying the newest timestamp wins.
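This newest‑record‑wins rule can be sketched as a last‑write‑wins merge on the timestamp column. The row shape and the `drc_ts` field name here are assumptions for illustration:

```python
from datetime import datetime

def merge(local_row: dict, remote_row: dict) -> dict:
    """Keep whichever version of the row carries the newer DRC timestamp."""
    if remote_row["drc_ts"] > local_row["drc_ts"]:
        return remote_row  # remote write is newer: apply it locally
    return local_row       # local write is newer: drop the replicated change

local = {"id": 1, "status": "paid",
         "drc_ts": datetime(2018, 5, 1, 12, 0, 0)}
remote = {"id": 1, "status": "shipped",
          "drc_ts": datetime(2018, 5, 1, 12, 0, 3)}
print(merge(local, remote)["status"])  # shipped: the newer write wins
```

Note that last write wins silently discards the older write, which is acceptable here only because user traffic is pinned to one data center, making true concurrent writes to the same row rare.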

2. Active‑Active Architecture

The architecture consists of entry‑traffic routing, flow control, and cross‑data‑center synchronization components. A crucial component is DRC, which includes three services: Replicator (collects changes), Applier (writes changes to the remote data center), and Manager (controls the process).

Two main DB deployment models are used:

ShardingZone: both reads and writes are served locally; failover only switches traffic without changing underlying data placement.

GlobalZone: writes are centralized in one data center while reads are served locally, suitable for low‑write, high‑read workloads that tolerate higher latency.
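A data‑access layer can route by deployment model with a few lines of logic. This sketch is illustrative; the zone labels and the choice of Beijing as the GlobalZone master are assumptions, not Ele.me's actual configuration:

```python
GLOBALZONE_MASTER = "beijing"  # assumed single write point for GlobalZone clusters

def route_query(zone_type: str, op: str, local_dc: str) -> str:
    """Pick the data center that should serve a query."""
    if zone_type == "sharding":
        return local_dc           # ShardingZone: reads and writes stay local
    if op == "read":
        return local_dc           # GlobalZone: reads stay local
    return GLOBALZONE_MASTER      # GlobalZone: writes cross to the master DC

print(route_query("global", "write", "shanghai"))  # beijing
```

Only the GlobalZone write path ever pays the cross‑region round trip, which is why the model suits low‑write workloads.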

3. Database Refactoring

The migration required full data transfers of several hundred terabytes across clusters, adding DRC timestamp columns, converting primary keys from INT to BIGINT, and adjusting foreign keys, all of which involved massive DDL operations.
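The per‑table schema changes can be expressed as a pair of ALTER statements. This generator is a sketch under assumed names (the `drc_ts` column, a `TIMESTAMP(6)` type, and an `id` primary key are placeholders, not Ele.me's actual schema):

```python
def migration_ddl(table: str, pk: str = "id") -> list:
    """Emit the two schema changes described above for one table."""
    return [
        # 1) add the DRC timestamp used for conflict resolution,
        #    auto-maintained on every write
        f"ALTER TABLE {table} ADD COLUMN drc_ts TIMESTAMP(6) NOT NULL "
        f"DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)",
        # 2) widen the primary key before the ID space overflows INT
        f"ALTER TABLE {table} MODIFY {pk} BIGINT UNSIGNED NOT NULL AUTO_INCREMENT",
    ]

for stmt in migration_ddl("orders"):
    print(stmt)
```

In practice each statement would run through an online schema‑change tool (see section 5) rather than as a blocking ALTER.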

Business‑type segregation forced the split of over 50 databases into separate instances, and network‑segment adjustments were needed to broaden IP ranges for accounts.

HA configurations were duplicated across data centers, increasing failure‑handling capacity but also raising operational load.

4. DBA Challenges

DBAs faced consistency verification, HA management, configuration drift, capacity planning, and massive DDL workloads. To address consistency, Ele.me built the DCP platform, which performs full and incremental data checks, supports black‑/white‑list rules, and can compare table structures and multi‑dimensional data.

DCP also provides automated repair tools and scripts, handling millions of records daily across hundreds of clusters.
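The core of a full check is comparing per‑table digests computed independently in each data center. A minimal sketch in the spirit of DCP (the helper and sample rows are hypothetical; the real platform also does incremental, row‑level, and structural comparisons):

```python
import hashlib

def table_digest(rows) -> str:
    """Hash every row in primary-key order into one digest for the table."""
    h = hashlib.sha256()
    for row in rows:                    # rows must arrive in the same PK order
        h.update(repr(row).encode())
    return h.hexdigest()

beijing  = [(1, "paid"), (2, "shipped")]
shanghai = [(1, "paid"), (2, "refunded")]   # row 2 has diverged
print(table_digest(beijing) == table_digest(shanghai))  # False: repair needed
```

When digests differ, a second pass narrows the mismatch down to row ranges so the repair tooling only touches divergent records.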

For HA, the EMHA system automatically detects node changes, updates MHA configurations, notifies DRC of master switches, and synchronizes Proxy settings, reducing manual intervention.

5. DDL Automation and Tools

Traditional PT‑based DDL caused high TPS spikes and replication latency. Ele.me developed mm‑ost, a fork of gh‑ost that enables cross‑data‑center DDL while holding replication latency to 3‑5 seconds, with support for pausing, throttling, and peak‑aware scheduling.
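The throttling idea is a simple control loop: stop copying chunks whenever cross‑DC replication lag exceeds the budget, resume when it recovers. This is a sketch of the concept only; `check_lag` and the chunk loop are hypothetical stand‑ins for the real tool's internals:

```python
import time

LAG_BUDGET_SECONDS = 3.0   # the latency ceiling mentioned above

def run_migration(chunks, check_lag):
    """Copy table chunks, backing off whenever replication lag is too high."""
    copied = 0
    for chunk in chunks:
        while check_lag() > LAG_BUDGET_SECONDS:
            time.sleep(0.5)        # back off until replicas catch up
        # copy_next_chunk(chunk) would go here in a real tool
        copied += 1
    return copied

print(run_migration(range(4), check_lag=lambda: 0.2))  # 4
```

Pausing is the same mechanism driven by an operator flag instead of measured lag, which is what lets the release platform schedule heavy DDL away from traffic peaks.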

The release platform orchestrates mm‑ost, enforcing safety checks (DDL space, latency limits, lock handling) and can auto‑execute low‑risk changes, achieving an 8:2 ratio of automated to manual DDL deployments.

6. Benefits and Outlook

Active‑active eliminated single‑data‑center capacity bottlenecks, allowed dynamic traffic shifting during incidents, and improved overall availability. Over 20 traffic cut‑overs (including drills) have demonstrated resilience.

Future work includes adding a third data center to spread cost, implementing data sharding across regions, automating dynamic scaling, and pursuing strong consistency guarantees for critical data.

Tags: high availability, data consistency, databases, active‑active, DBA, DDL, multi‑region
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
