High‑Availability Architecture for a Billion‑Scale Membership System: ES Dual‑Center, Redis Caching, MySQL Migration, and Flow‑Control Strategies
This article details how a membership system serving billions of users achieves high performance and high availability through a dual‑center Elasticsearch cluster, traffic‑isolated ES clusters, Redis cache with distributed locks, MySQL dual‑center partitioning, and fine‑grained flow‑control and degradation mechanisms, all while ensuring zero‑downtime migrations and consistent data.
Background
The membership system is a core service for all business lines; any outage blocks order placement across the company. After the merger of Tongcheng and eLong, the system must handle cross‑platform member queries (APP, WeChat mini‑programs) and sustain peak traffic exceeding 20,000 TPS.
1. Elasticsearch High‑Availability Solution
Two data‑centers (A and B) host a primary‑backup ES cluster. The primary cluster runs in data‑center A, the backup in B. Writes go to the primary; data are replicated to the backup via MQ. In case of primary failure, traffic is switched to the backup with minimal downtime, then synchronized back.
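The failover behavior on the read path can be sketched as follows. This is a minimal in-process illustration, not the production switch (which in practice is driven by configuration or a service-discovery layer); the function names and the one-way "flip once" policy are assumptions for brevity.

```python
class FailoverReader:
    """Serve reads from the primary ES cluster; on failure, flip all traffic
    to the backup cluster and stay there until an operator restores the primary."""

    def __init__(self, primary, backup):
        # primary/backup are callables standing in for ES cluster queries
        self.primary, self.backup = primary, backup
        self.use_backup = False

    def query(self, member_id):
        if not self.use_backup:
            try:
                return self.primary(member_id)
            except Exception:
                # Fail over once for all subsequent requests,
                # not per request, to avoid hammering a down cluster.
                self.use_backup = True
        return self.backup(member_id)
```

After the primary recovers and is resynchronized from the backup, `use_backup` would be reset so reads return to data-center A.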
To isolate high‑TPS marketing traffic, a third ES cluster is dedicated to flash‑sale requests, keeping the primary ES cluster free for order‑critical queries.
Further ES optimizations include balancing shard distribution across nodes, capping the search thread‑pool size at (CPU cores × 1.5 + 1), keeping each shard under 50 GB, using filter context instead of query context to benefit from caching, removing unnecessary text fields, and adding routing keys to avoid cross‑shard requests.
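The routing-key optimization works because Elasticsearch maps a routing value to exactly one primary shard, so a query that supplies the key touches a single shard instead of fanning out to all of them. The sketch below mirrors the idea; note that real Elasticsearch uses murmur3 hashing, while CRC32 is substituted here for brevity, and the shard count and key format are illustrative.

```python
import zlib

NUM_PRIMARY_SHARDS = 8  # illustrative; fixed at index creation time in ES

def shard_for(routing_key: str) -> int:
    """Map a routing key to one primary shard (hash(routing) % shard count).
    Elasticsearch uses murmur3; CRC32 stands in here."""
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_PRIMARY_SHARDS

# All documents indexed with the same routing key land on the same shard,
# so a search that passes the key is executed on one shard, not all eight.
shard = shard_for("member:123456")
```

Choosing the member ID as the routing key fits this system because member queries are almost always keyed by a single member.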
2. Redis Cache Strategy
Initially the system avoided caching because of real‑time consistency concerns, but a sudden traffic spike during a ticket blind‑box promotion prompted the introduction of a Redis cache, which now achieves a 90%+ hit rate. When ES is updated, a distributed lock with a roughly 2‑second TTL is taken so that concurrent cache misses cannot write stale ES data back to Redis.
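The lock-guarded cache-fill pattern can be sketched as below. This is an in-process stand-in, assuming the described semantics: writers hold a short lock while updating ES and invalidating the cache, and readers only populate the cache when no update is in flight. In production the lock would be a Redis key with a 2 s TTL (e.g. SET NX PX); the dictionaries and helper names here are illustrative.

```python
import time
import threading

class ShortLock:
    """In-process stand-in for a distributed lock with a ~2 s TTL."""
    def __init__(self, ttl: float = 2.0):
        self._ttl, self._until = ttl, 0.0
        self._guard = threading.Lock()

    def acquire(self) -> bool:
        with self._guard:
            now = time.monotonic()
            if now >= self._until:          # lock free or TTL expired
                self._until = now + self._ttl
                return True
            return False

    def release(self):
        with self._guard:
            self._until = 0.0

cache, store = {}, {}   # stand-ins for Redis and the primary ES cluster
lock = ShortLock()

def update_member(key, value):
    """Writer: hold the lock while updating ES and invalidating the cache."""
    while not lock.acquire():
        time.sleep(0.01)
    try:
        store[key] = value
        cache.pop(key, None)
    finally:
        lock.release()

def read_member(key):
    """Reader: serve from cache; on a miss, fill the cache only if no update
    is in flight, so a stale ES read is never written back to Redis."""
    if key in cache:
        return cache[key]
    value = store.get(key)
    if lock.acquire():                      # no writer currently holds the lock
        try:
            cache[key] = value
        finally:
            lock.release()
    return value
```

The 2 s TTL bounds the window in which readers skip cache fills, trading a brief dip in hit rate for consistency.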
Redis is deployed in a dual‑center multi‑cluster mode: both data‑centers host a Redis cluster, writes are performed to both, and reads are served locally to reduce latency. This ensures service continuity even if one data‑center fails.
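A minimal sketch of the write-both, read-local pattern, with dictionaries standing in for the two Redis clusters. The class name and the choice to swallow a remote-write failure (leaving repair to an out-of-band sync) are assumptions for illustration.

```python
class DualCenterRedis:
    """Write to the Redis cluster in both data-centers; read from the local
    one to keep latency low. Dicts stand in for real cluster clients."""

    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def set(self, key, value):
        self.local[key] = value
        try:
            self.remote[key] = value
        except Exception:
            # A failed cross-DC write is repaired out of band; the local
            # cluster still serves this data-center's reads.
            pass

    def get(self, key):
        return self.local.get(key)   # local read: no cross-DC round trip
```

Because each data-center reads its own cluster, losing one data-center leaves the other fully self-sufficient for both reads and writes.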
3. High‑Availability Primary Database Migration
The original SQL Server instance reached its physical limits at over 10 billion rows. A dual‑center MySQL partitioned cluster (1,000+ shards, each with 1 master and 3 slaves) replaces it, with masters in data‑center A and slaves in B. Real‑time dual‑write, retry logic, and asynchronous incremental sync keep the two databases consistent during migration.
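The dual-write-with-retry step can be sketched as follows. The function names, retry count, and backoff values are illustrative; the key idea from the migration is that the old database remains the source of truth, the new-database write is retried, and a final failure is handed to the asynchronous incremental sync rather than failing the request.

```python
import time

def dual_write(primary_write, secondary_write, record,
               retries: int = 3, backoff: float = 0.05) -> bool:
    """Write to SQL Server (primary, source of truth until cut-over) and to
    MySQL (secondary). The secondary write is retried with exponential
    backoff; if it still fails, the async incremental sync repairs it later.
    Returns True if the secondary write landed synchronously."""
    primary_write(record)
    for attempt in range(retries):
        try:
            secondary_write(record)
            return True
        except Exception:
            time.sleep(backoff * (2 ** attempt))
    return False   # left to the asynchronous compensation job
```

Note the asymmetry: a primary failure aborts the request, while a secondary failure only degrades to eventual consistency.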
A gradual gray‑release (A/B‑style validation) verifies read consistency between SQL Server and MySQL before the full cut‑over.
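A common shape for this validation, sketched under assumed names, is to shadow-read from MySQL for a configurable fraction of traffic, compare against the SQL Server result, and record any mismatch while still returning the trusted result to the caller:

```python
import random

def gray_read(member_id, read_old, read_new, ratio: float, mismatches: list):
    """Double-read a `ratio` fraction of traffic: compare the MySQL result
    (read_new) against SQL Server (read_old) and log divergence. The old
    result is always returned, so a discrepancy never reaches callers."""
    old = read_old(member_id)
    if random.random() < ratio:
        new = read_new(member_id)
        if new != old:
            mismatches.append(member_id)   # fed into migration dashboards
    return old
```

Ramping `ratio` from a few percent toward 100% while the mismatch list stays empty is the signal that the cut-over is safe.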
If the DAL (data access layer) component fails, reads and writes can fall back to Elasticsearch, then be synchronized back to MySQL once the DAL recovers.
4. Abnormal Member Relationship Governance
Complex logic identifies and fixes abnormal member bindings that could cause cross‑account data leakage, preventing severe customer complaints.
5. Future Fine‑Grained Flow‑Control and Degradation
Three levels of flow control are planned: hotspot throttling for abusive accounts, per‑caller limits to guard against buggy integrations, and a global limit to protect the system when traffic exceeds its roughly 30,000 TPS capacity.
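The three layers can be composed from token buckets checked in order, hotspot first, then caller, then global. This is a sketch under assumed limits (the rates and burst sizes below are placeholders, not the production values), and a real deployment would back the buckets with Redis so all application nodes share the same counters.

```python
import time

class TokenBucket:
    """Classic token bucket: allows `burst` immediate requests, then refills
    at `rate` tokens per second."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = float(rate), float(burst)
        self.tokens, self.stamp = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class LayeredLimiter:
    """Hotspot (per member), per-caller, and global limits, checked in order.
    Short-circuiting means a hotspot rejection consumes no caller or global
    tokens. All limits here are illustrative placeholders."""
    def __init__(self):
        self.global_bucket = TokenBucket(rate=30000, burst=30000)  # ~30k TPS
        self.caller_buckets: dict = {}
        self.hotspot_buckets: dict = {}

    def allow(self, caller: str, member_id: str) -> bool:
        hot = self.hotspot_buckets.setdefault(member_id, TokenBucket(50, 50))
        per = self.caller_buckets.setdefault(caller, TokenBucket(5000, 5000))
        return hot.allow() and per.allow() and self.global_bucket.allow()
```

Ordering the checks from most to least specific keeps one abusive account from consuming caller-wide or system-wide budget.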
Degradation strategies include response‑time‑based circuit breaking and error‑rate‑based fallback, with plans to tighten caller‑account management for more precise control.
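An error-rate-driven breaker with a fallback path can be sketched as below. The thresholds and the consecutive-failure counting are simplifying assumptions (production breakers often use a sliding error-rate window); a response-time-based breaker would trip on latency instead of exceptions but share the same open/half-open state machine.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; while open, serve the
    fallback without touching the backend; after `reset_after` seconds,
    go half-open and probe the real call. Thresholds are illustrative."""
    def __init__(self, threshold: int = 5, reset_after: float = 10.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()            # degraded path while open
            self.opened_at = None            # half-open: probe the real call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                    # success closes the breaker
        return result
```

In this system the fallback would typically return a cached or default member view rather than failing the order flow outright.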