
High‑Availability Architecture for a Billion‑User Membership System: ES Dual‑Center Cluster, Traffic Isolation, Redis Caching, and MySQL Migration

This article details how a large‑scale membership system achieves high performance and fault tolerance through a dual‑center Elasticsearch cluster, traffic‑isolated three‑cluster design, deep ES optimizations, Redis caching with distributed locks, dual‑center MySQL partitioning, and seamless migration from SQL Server, while also outlining future fine‑grained flow‑control and degradation strategies.

Architecture Digest

Feb 27, 2023


The membership system is a core service for all business lines; any outage blocks order placement across the company, so it must deliver ultra‑high performance, high availability, and stable service under peak traffic exceeding 20,000 TPS.

1. Elasticsearch high‑availability solution: a dual‑center primary‑backup cluster is deployed across two data centers (A and B). The primary cluster handles reads and writes; data is asynchronously replicated to the backup via MQ. If the primary fails, a configuration change switches the service to the backup cluster within seconds, and after recovery the data is synchronized back to the primary.
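The switch-over described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the class name `DualCenterRouter`, the `active` flag, and the plain dictionaries standing in for the two ES clusters are all assumptions, and the MQ replication is modeled as a direct copy for brevity.

```python
class DualCenterRouter:
    """Sends ES traffic to the primary cluster, or to the backup after a switch."""

    def __init__(self, primary: dict, backup: dict):
        self.clusters = {"primary": primary, "backup": backup}
        self.active = "primary"   # hypothetical config flag flipped on failure
        self.pending_sync = []    # ids written to the backup while primary was down

    def write(self, doc_id: str, doc: dict) -> None:
        self.clusters[self.active][doc_id] = doc
        if self.active == "primary":
            # Stand-in for asynchronous replication via MQ.
            self.clusters["backup"][doc_id] = doc
        else:
            self.pending_sync.append(doc_id)

    def read(self, doc_id: str):
        return self.clusters[self.active].get(doc_id)

    def fail_over(self) -> None:
        """Switch all reads and writes to the backup cluster."""
        self.active = "backup"

    def recover(self) -> None:
        """Copy writes made during the outage back, then return to the primary."""
        for doc_id in self.pending_sync:
            self.clusters["primary"][doc_id] = self.clusters["backup"][doc_id]
        self.pending_sync.clear()
        self.active = "primary"
```

Tracking the ids written during the outage keeps the sync-back step small; a real deployment would more likely replay the MQ backlog.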

2. Traffic isolation three‑cluster architecture: to protect the main ES cluster from marketing‑spike traffic, a third ES cluster, separate from the primary and backup clusters, is dedicated to high‑TPS marketing requests. Requests are classified into high‑priority order‑flow queries and lower‑priority marketing queries, so that marketing bursts cannot affect the core ordering process.
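As a rough sketch of the classification step, the snippet below maps a request's caller to a cluster. The caller names and the cluster names `es-main` and `es-marketing` are hypothetical, and sending unknown callers to the marketing cluster is an assumption made here to keep the main cluster protected by default.

```python
# Hypothetical caller names; the article does not list the real ones.
ORDER_FLOW_CALLERS = {"order-service", "checkout-service"}

MAIN_CLUSTER = "es-main"            # serves high-priority order-flow queries
MARKETING_CLUSTER = "es-marketing"  # absorbs high-TPS marketing bursts


def pick_cluster(caller: str) -> str:
    """Return the ES cluster that should serve a request from this caller."""
    if caller in ORDER_FLOW_CALLERS:
        return MAIN_CLUSTER
    # Anything not known to be on the ordering path is treated as lower priority.
    return MARKETING_CLUSTER
```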

3. Deep ES optimizations include balancing shard distribution across nodes, limiting the thread‑pool size to cpu_core * 3 / 2 + 1, keeping each shard under 50 GB, removing duplicate text fields, using filter clauses instead of scoring queries where relevance scores are not needed, moving result sorting to the application JVM, and adding routing keys so that a query targets a specific shard instead of fanning out to all of them. These changes dramatically reduced CPU usage and improved query latency.
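Two of these optimizations are easy to show concretely. The sketch below computes the thread-pool cap from the formula in the text and builds a search request that uses a `bool` filter with a `routing` parameter. The function names and the field name used in the example are assumptions, and routing by the lookup value only works if documents were indexed with the same routing value.

```python
def search_thread_pool_size(cpu_cores: int) -> int:
    """Thread-pool cap from the text: cpu_core * 3 / 2 + 1, with integer division."""
    return cpu_cores * 3 // 2 + 1


def build_member_lookup(field: str, value: str) -> dict:
    """Build a request that filters (no scoring) and targets a single shard."""
    return {
        # Request parameter: ES hashes the routing value to pick one shard.
        "params": {"routing": value},
        # Filter context skips relevance scoring and its results can be cached.
        "body": {"query": {"bool": {"filter": [{"term": {field: value}}]}}},
    }
```

For example, `search_thread_pool_size(8)` gives 13, and `build_member_lookup("mobile", "13800000000")` produces a request that only touches the shard holding that routing value.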

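4. Redis cache scheme: because ES is near‑real‑time (≈1 s delay before a write becomes searchable), a read that misses the cache right after an update could fetch the old document from ES and write it back to Redis. The solution takes a 2‑second distributed lock on the member when updating ES and then deletes the related Redis entry; a read that finds the lock held returns the ES result without writing it to the cache, so outdated data never overwrites the cache. After implementation, the cache hit rate exceeds 90 % and overall system latency drops sharply.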
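The locking rule can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: `FakeRedis`, the key prefixes `lock:` and `member:`, and the function names are assumptions, and the ES index is modeled as a plain dictionary (so, unlike real ES, it is immediately consistent).

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in for Redis with key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl=None):
        expires = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # lazily drop expired keys
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)


LOCK_TTL_SECONDS = 2  # longer than the ~1 s ES refresh delay


def update_member(redis, es, member_id, doc):
    """Lock the member, update ES, then invalidate the cached copy."""
    redis.set(f"lock:{member_id}", "1", ttl=LOCK_TTL_SECONDS)
    es[member_id] = doc
    redis.delete(f"member:{member_id}")


def get_member(redis, es, member_id):
    """Read through the cache, skipping write-back while an update is settling."""
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.get(member_id)
    # While the lock is held, ES may still return the old document,
    # so the result is returned to the caller but not cached.
    if doc is not None and redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```

The lock here is only a marker with a time-to-live; it does not block readers, it just tells them not to populate the cache for the next two seconds.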

5. Redis dual‑center multi‑cluster: each data center runs a full Redis cluster; writes are performed to both clusters, and reads are served locally, guaranteeing availability even if one data center fails.
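A minimal sketch of the dual-write, local-read pattern follows. The class name `DualCenterCache` is an assumption, each Redis cluster is modeled as a dictionary, and handling of a failed write to one cluster is deliberately left out.

```python
class DualCenterCache:
    """Writes go to every data center's cluster; reads stay in the local one."""

    def __init__(self, clusters: dict, local_dc: str):
        self.clusters = clusters  # e.g. {"A": {...}, "B": {...}}
        self.local_dc = local_dc  # data center this application instance runs in

    def set(self, key: str, value) -> None:
        # Dual write: both clusters hold a full copy of the cached data.
        for cluster in self.clusters.values():
            cluster[key] = value

    def get(self, key: str):
        # Local read: no cross-data-center round trip on the hot path.
        return self.clusters[self.local_dc].get(key)
```

Because both clusters hold the full data set, an instance in data center B keeps serving reads even if the cluster in A is lost.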

6. MySQL high‑availability: a dual‑center partitioned MySQL cluster with over 1,000 shards (≈1 M rows per shard) is deployed; the master resides in data center A and the slaves in B, with sub‑millisecond replication. Reads are routed to the local data center and writes to the master, achieving more than 20,000 TPS at roughly 10 ms average latency.
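The routing rule can be illustrated as below. The shard count of 1024 and the modulo-based sharding key are assumptions (the text only says "over 1,000 shards" and does not name the sharding function); the function names are hypothetical.

```python
SHARD_COUNT = 1024  # assumption: the text only says "over 1,000 shards"
MASTER_DC = "A"     # the master lives in data center A, slaves in B


def shard_for(member_id: int) -> int:
    """Pick a shard by member id (modulo sharding is assumed for illustration)."""
    return member_id % SHARD_COUNT


def data_center_for(operation: str, local_dc: str) -> str:
    """Writes always go to the master's data center; reads stay local."""
    if operation == "write":
        return MASTER_DC
    return local_dc
```

With roughly one million rows per shard, 1,024 shards would hold on the order of a billion member rows, which matches the scale in the title.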

7. Migration from SQL Server to MySQL: a three‑phase plan (full data sync, real‑time dual‑write, and incremental sync) ensures zero‑downtime migration. During the gray‑scale rollout, read traffic is gradually shifted from SQL Server to MySQL (1 % → 100 %) while an asynchronous verification thread compares the query results from both databases and logs any inconsistencies.
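A sketch of the gray-scale switch with background verification is shown below. The bucketing by CRC32 of the member id, the function names, and the dictionaries standing in for the two databases are all assumptions; the text only states that traffic is shifted by percentage and that results are compared asynchronously.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

verifier = ThreadPoolExecutor(max_workers=1)  # the asynchronous verification thread
mismatches = []  # stand-in for an inconsistency log


def use_mysql(member_id: str, rollout_percent: int) -> bool:
    """Deterministically place a member in one of 100 buckets."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < rollout_percent


def _compare(member_id, served_result, shadow_store):
    # Runs off the request path; only records the difference.
    if shadow_store.get(member_id) != served_result:
        mismatches.append(member_id)


def query_member(member_id, rollout_percent, sqlserver, mysql):
    """Serve from the store chosen by the rollout, verify against the other."""
    if use_mysql(member_id, rollout_percent):
        serving, shadow = mysql, sqlserver
    else:
        serving, shadow = sqlserver, mysql
    result = serving.get(member_id)
    verifier.submit(_compare, member_id, result, shadow)
    return result
```

Hashing the member id keeps each member on the same side of the switch as the percentage grows, so a given user does not flip back and forth between databases.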

8. MySQL & ES primary‑backup scheme: if the DAL component or MySQL fails, reads/writes can be switched to ES, and after recovery the data is synchronized back, providing an additional safety net.

9. Abnormal member relationship governance: complex logic identifies and fixes rare cases where user accounts become incorrectly linked, preventing cross‑account data leakage and order manipulation.

10. Future work: finer‑grained flow control and degradation: implement hotspot throttling, per‑account limits, global traffic caps, response‑time based circuit breaking, and anomaly‑ratio based degradation to ensure the system remains resilient under extreme load.
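As one illustration of anomaly-ratio based degradation, the sketch below trips once the share of failed calls in a recent window crosses a threshold. The class name `RatioBreaker` and all parameter values are assumptions, since the article lists this only as planned work, and recovery (a half-open state) is omitted.

```python
from collections import deque


class RatioBreaker:
    """Opens when the failure ratio over the last `window` calls is too high."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.5,
                 min_calls: int = 10):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio
        self.min_calls = min_calls           # avoid tripping on a tiny sample

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def is_open(self) -> bool:
        """True means callers should degrade instead of calling the dependency."""
        if len(self.results) < self.min_calls:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_ratio
```

A caller would check `is_open()` before each request and return a degraded response (for example, cached or default data) while the breaker is open.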
