High‑Availability Architecture for a Billion‑Scale Membership System: ES Dual‑Center, Redis Caching, MySQL Migration, and Flow‑Control Strategies
This article details how a membership system serving billions of users achieves high performance and high availability through a dual‑center Elasticsearch cluster, traffic‑isolated ES clusters, Redis cache with distributed locks, MySQL dual‑center partitioning, and fine‑grained flow‑control and degradation mechanisms, all while ensuring zero‑downtime migrations and consistent data.
Background
The membership system is a core service for all business lines; any outage blocks order placement across the company. After the merger of Tongcheng and eLong, the system must handle cross‑platform member queries (APP, WeChat mini‑programs) and sustain peak traffic exceeding 20,000 TPS.
1. Elasticsearch High‑Availability Solution
Two data‑centers (A and B) host a primary‑backup ES cluster. The primary cluster runs in data‑center A, the backup in B. Writes go to the primary; data are replicated to the backup via MQ. In case of primary failure, traffic is switched to the backup with minimal downtime, then synchronized back.
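The failover behavior on the read path can be sketched as follows. This is a minimal in-process illustration, not the production switch (which in practice is driven by configuration or a service-discovery layer); the function names and the one-way "flip once" policy are assumptions for brevity.

```python
class FailoverReader:
    """Serve reads from the primary ES cluster; on failure, flip all traffic
    to the backup cluster and stay there until an operator restores the primary."""

    def __init__(self, primary, backup):
        # primary/backup are callables standing in for ES cluster queries
        self.primary, self.backup = primary, backup
        self.use_backup = False

    def query(self, member_id):
        if not self.use_backup:
            try:
                return self.primary(member_id)
            except Exception:
                # Fail over once for all subsequent requests,
                # not per request, to avoid hammering a down cluster.
                self.use_backup = True
        return self.backup(member_id)
```

After the primary recovers and is resynchronized from the backup, `use_backup` would be reset so reads return to data-center A.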
To isolate high‑TPS marketing traffic, a third ES cluster is dedicated to flash‑sale requests, keeping the primary ES cluster free for order‑critical queries.
Further ES optimizations include balancing shard distribution across nodes, capping the search thread‑pool size at (CPU cores × 1.5 + 1), keeping each shard under 50 GB, using filter context instead of query context to benefit from caching, removing unnecessary text fields, and adding routing keys to avoid cross‑shard requests.
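The routing-key optimization works because Elasticsearch maps a routing value to exactly one primary shard, so a query that supplies the key touches a single shard instead of fanning out to all of them. The sketch below mirrors the idea; note that real Elasticsearch uses murmur3 hashing, while CRC32 is substituted here for brevity, and the shard count and key format are illustrative.

```python
import zlib

NUM_PRIMARY_SHARDS = 8  # illustrative; fixed at index creation time in ES

def shard_for(routing_key: str) -> int:
    """Map a routing key to one primary shard (hash(routing) % shard count).
    Elasticsearch uses murmur3; CRC32 stands in here."""
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_PRIMARY_SHARDS

# All documents indexed with the same routing key land on the same shard,
# so a search that passes the key is executed on one shard, not all eight.
shard = shard_for("member:123456")
```

Choosing the member ID as the routing key fits this system because member queries are almost always keyed by a single member.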
2. Redis Cache Strategy
Initially the system avoided caching because of real‑time consistency concerns, but a sudden traffic spike during a ticket blind‑box promotion prompted the introduction of a Redis cache, which now achieves a 90%+ hit rate. When ES is updated, a distributed lock with a roughly 2‑second TTL is taken so that concurrent cache misses cannot write stale ES data back to Redis.
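The lock-guarded cache-fill pattern can be sketched as below. This is an in-process stand-in, assuming the described semantics: writers hold a short lock while updating ES and invalidating the cache, and readers only populate the cache when no update is in flight. In production the lock would be a Redis key with a 2 s TTL (e.g. SET NX PX); the dictionaries and helper names here are illustrative.

```python
import time
import threading

class ShortLock:
    """In-process stand-in for a distributed lock with a ~2 s TTL."""
    def __init__(self, ttl: float = 2.0):
        self._ttl, self._until = ttl, 0.0
        self._guard = threading.Lock()

    def acquire(self) -> bool:
        with self._guard:
            now = time.monotonic()
            if now >= self._until:          # lock free or TTL expired
                self._until = now + self._ttl
                return True
            return False

    def release(self):
        with self._guard:
            self._until = 0.0

cache, store = {}, {}   # stand-ins for Redis and the primary ES cluster
lock = ShortLock()

def update_member(key, value):
    """Writer: hold the lock while updating ES and invalidating the cache."""
    while not lock.acquire():
        time.sleep(0.01)
    try:
        store[key] = value
        cache.pop(key, None)
    finally:
        lock.release()

def read_member(key):
    """Reader: serve from cache; on a miss, fill the cache only if no update
    is in flight, so a stale ES read is never written back to Redis."""
    if key in cache:
        return cache[key]
    value = store.get(key)
    if lock.acquire():                      # no writer currently holds the lock
        try:
            cache[key] = value
        finally:
            lock.release()
    return value
```

The 2 s TTL bounds the window in which readers skip cache fills, trading a brief dip in hit rate for consistency.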
Redis is deployed in a dual‑center multi‑cluster mode: both data‑centers host a Redis cluster, writes are performed to both, and reads are served locally to reduce latency. This ensures service continuity even if one data‑center fails.
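A minimal sketch of the write-both, read-local pattern, with dictionaries standing in for the two Redis clusters. The class name and the choice to swallow a remote-write failure (leaving repair to an out-of-band sync) are assumptions for illustration.

```python
class DualCenterRedis:
    """Write to the Redis cluster in both data-centers; read from the local
    one to keep latency low. Dicts stand in for real cluster clients."""

    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def set(self, key, value):
        self.local[key] = value
        try:
            self.remote[key] = value
        except Exception:
            # A failed cross-DC write is repaired out of band; the local
            # cluster still serves this data-center's reads.
            pass

    def get(self, key):
        return self.local.get(key)   # local read: no cross-DC round trip
```

Because each data-center reads its own cluster, losing one data-center leaves the other fully self-sufficient for both reads and writes.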
3. High‑Availability Primary Database Migration
The original SQL Server instance reached its physical limits at over 10 billion rows. A dual‑center MySQL partitioned cluster (1,000+ shards, each with 1 master and 3 slaves) replaces it, with masters in data‑center A and slaves in B. Real‑time dual‑write, retry logic, and asynchronous incremental sync keep the two databases consistent during migration.
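The dual-write-with-retry step can be sketched as follows. The function names, retry count, and backoff values are illustrative; the key idea from the migration is that the old database remains the source of truth, the new-database write is retried, and a final failure is handed to the asynchronous incremental sync rather than failing the request.

```python
import time

def dual_write(primary_write, secondary_write, record,
               retries: int = 3, backoff: float = 0.05) -> bool:
    """Write to SQL Server (primary, source of truth until cut-over) and to
    MySQL (secondary). The secondary write is retried with exponential
    backoff; if it still fails, the async incremental sync repairs it later.
    Returns True if the secondary write landed synchronously."""
    primary_write(record)
    for attempt in range(retries):
        try:
            secondary_write(record)
            return True
        except Exception:
            time.sleep(backoff * (2 ** attempt))
    return False   # left to the asynchronous compensation job
```

Note the asymmetry: a primary failure aborts the request, while a secondary failure only degrades to eventual consistency.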
A gradual gray‑release (A/B‑style validation) verifies read consistency between SQL Server and MySQL before the full cut‑over.
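A common shape for this validation, sketched under assumed names, is to shadow-read from MySQL for a configurable fraction of traffic, compare against the SQL Server result, and record any mismatch while still returning the trusted result to the caller:

```python
import random

def gray_read(member_id, read_old, read_new, ratio: float, mismatches: list):
    """Double-read a `ratio` fraction of traffic: compare the MySQL result
    (read_new) against SQL Server (read_old) and log divergence. The old
    result is always returned, so a discrepancy never reaches callers."""
    old = read_old(member_id)
    if random.random() < ratio:
        new = read_new(member_id)
        if new != old:
            mismatches.append(member_id)   # fed into migration dashboards
    return old
```

Ramping `ratio` from a few percent toward 100% while the mismatch list stays empty is the signal that the cut-over is safe.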
If the DAL (data access layer) component fails, reads and writes can fall back to Elasticsearch, then be synchronized back to MySQL once the DAL recovers.
4. Abnormal Member Relationship Governance
Complex logic identifies and fixes abnormal member bindings that could cause cross‑account data leakage, preventing severe customer complaints.
5. Future Fine‑Grained Flow‑Control and Degradation
Three levels of flow control are planned: hotspot throttling for abusive accounts, per‑caller limits to guard against buggy integrations, and a global limit to protect the system when traffic exceeds its roughly 30,000 TPS capacity.
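The three layers can be composed from token buckets checked in order, hotspot first, then caller, then global. This is a sketch under assumed limits (the rates and burst sizes below are placeholders, not the production values), and a real deployment would back the buckets with Redis so all application nodes share the same counters.

```python
import time

class TokenBucket:
    """Classic token bucket: allows `burst` immediate requests, then refills
    at `rate` tokens per second."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = float(rate), float(burst)
        self.tokens, self.stamp = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class LayeredLimiter:
    """Hotspot (per member), per-caller, and global limits, checked in order.
    Short-circuiting means a hotspot rejection consumes no caller or global
    tokens. All limits here are illustrative placeholders."""
    def __init__(self):
        self.global_bucket = TokenBucket(rate=30000, burst=30000)  # ~30k TPS
        self.caller_buckets: dict = {}
        self.hotspot_buckets: dict = {}

    def allow(self, caller: str, member_id: str) -> bool:
        hot = self.hotspot_buckets.setdefault(member_id, TokenBucket(50, 50))
        per = self.caller_buckets.setdefault(caller, TokenBucket(5000, 5000))
        return hot.allow() and per.allow() and self.global_bucket.allow()
```

Ordering the checks from most to least specific keeps one abusive account from consuming caller-wide or system-wide budget.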
Degradation strategies include response‑time‑based circuit breaking and error‑rate‑based fallback, with plans to tighten caller‑account management for more precise control.
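An error-rate-driven breaker with a fallback path can be sketched as below. The thresholds and the consecutive-failure counting are simplifying assumptions (production breakers often use a sliding error-rate window); a response-time-based breaker would trip on latency instead of exceptions but share the same open/half-open state machine.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; while open, serve the
    fallback without touching the backend; after `reset_after` seconds,
    go half-open and probe the real call. Thresholds are illustrative."""
    def __init__(self, threshold: int = 5, reset_after: float = 10.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()            # degraded path while open
            self.opened_at = None            # half-open: probe the real call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                    # success closes the breaker
        return result
```

In this system the fallback would typically return a cached or default member view rather than failing the order flow outright.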