High‑Availability Architecture and Performance Optimization for a Large‑Scale Membership System

This article describes how a unified membership system serving billions of users across multiple platforms achieves high performance and high availability through dual‑center Elasticsearch clusters, traffic isolation, deep ES optimizations, Redis caching, dual‑center MySQL partitioning, seamless migration, and fine‑grained flow‑control and degradation strategies.

Top Architect

Jun 17, 2024

Background

The membership system is a core service on the order path of every business line: if it fails, users cannot place orders. After the merger of two companies, it has to serve billions of members with high performance, high availability, and stable service.

ES High‑Availability Solution

The system uses a dual‑center primary‑backup Elasticsearch cluster: the primary cluster runs in data center A, the backup in data center B. Reads/writes go to the primary, and data is synchronized to the backup via MQ. In case of primary failure, traffic is switched to the backup with minimal downtime.
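The primary/backup arrangement can be sketched as a small router. This is a minimal illustration, not the system's actual code: the class `ClusterRouter`, its in-memory dictionaries (standing in for the two ES clusters), and the list standing in for the MQ topic are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ClusterRouter:
    """Sends reads/writes to the active cluster and queues changes for the backup."""

    primary: Dict[str, dict] = field(default_factory=dict)  # stand-in for the ES cluster in data center A
    backup: Dict[str, dict] = field(default_factory=dict)   # stand-in for the ES cluster in data center B
    sync_queue: List[Tuple[str, dict]] = field(default_factory=list)  # stand-in for the MQ topic
    active: str = "primary"

    def _store(self) -> Dict[str, dict]:
        return self.primary if self.active == "primary" else self.backup

    def write(self, member_id: str, doc: dict) -> None:
        self._store()[member_id] = doc
        if self.active == "primary":
            # Publish the change so the backup can catch up asynchronously.
            self.sync_queue.append((member_id, doc))

    def read(self, member_id: str) -> Optional[dict]:
        return self._store().get(member_id)

    def drain_sync_queue(self) -> None:
        """Consumer side: apply queued changes to the backup cluster."""
        while self.sync_queue:
            member_id, doc = self.sync_queue.pop(0)
            self.backup[member_id] = doc

    def fail_over(self) -> None:
        """Switch all traffic to the backup cluster."""
        self.active = "backup"
```

In this sketch, any change still sitting in the queue at the moment of `fail_over()` is not yet visible on the backup, which is the usual trade-off of asynchronous replication.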

To further protect against large traffic spikes (e.g., holiday promotions), a third isolated ES cluster handles high‑TPS marketing requests, preventing them from affecting the primary order‑flow cluster.

ES Deep Optimization

Balanced shard distribution to avoid hot nodes.

Thread‑pool size limited to cpu_core * 3 / 2 + 1 to prevent CPU thrashing.

Shard size kept below 50 GB.

Removed duplicate text fields, keeping only keyword to save storage.

Used filter context instead of query context for lookups that need no relevance score, so ES skips scoring and can cache the filter results.

Performed sorting in the application layer to reduce ES load.

Added routing keys to limit queries to relevant shards.

These optimizations reduced CPU usage dramatically and improved query latency.
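Several of the items above can be illustrated together. The following is a rough sketch under stated assumptions: the field names `mobile` and `register_time` and all function names are hypothetical, while `bool`, `filter`, and `term` are standard Elasticsearch query DSL and `routing` is a standard search request parameter.

```python
from typing import Dict, List


def search_thread_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the formula in the list above: cpu_core * 3 / 2 + 1."""
    return cpu_cores * 3 // 2 + 1


def build_member_lookup(member_id: str, mobile: str) -> Dict[str, dict]:
    """Build a non-scoring lookup that targets a single shard.

    The term clause sits in filter context, so no relevance score is
    computed, and the routing value restricts the search to the shard
    that holds this member's documents.
    """
    return {
        "params": {"routing": member_id},
        "body": {
            "query": {
                "bool": {
                    "filter": [{"term": {"mobile": mobile}}]
                }
            }
        },
    }


def sort_in_application(hits: List[dict]) -> List[dict]:
    """Sort results on the client instead of asking ES to sort them."""
    return sorted(hits, key=lambda hit: hit["register_time"], reverse=True)
```

With 8 cores the formula gives 13 threads. Routing only helps if documents were indexed with the same routing value that the query uses.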

Member Redis Cache Scheme

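Initially the system did not cache member data, because ES latency was already low and data consistency was critical. After occasional traffic bursts were observed, a write‑through Redis cache was introduced and reached a hit rate above 90%. Because ES is near‑real‑time, a write becomes searchable only after the next index refresh, so a reader that misses the cache right after an update can read the old document from ES and put it back into Redis. A distributed lock was added to prevent this kind of stale backfill.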
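The article does not spell out the locking protocol, so the following is only one possible arrangement. It models ES visibility with two dictionaries (refreshed and not yet refreshed) and uses an in-memory dictionary in place of Redis. The class `MemberStore`, the 2-second lock TTL, and the injected clock are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional


class MemberStore:
    LOCK_TTL_SECONDS = 2.0  # assumed value, chosen to outlast one ES refresh cycle

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.es_visible: Dict[str, dict] = {}  # what an ES search returns right now
        self.es_pending: Dict[str, dict] = {}  # indexed, but not yet refreshed
        self.cache: Dict[str, dict] = {}       # stand-in for Redis values
        self.locks: Dict[str, float] = {}      # member_id -> lock expiry time

    def refresh(self) -> None:
        """Simulate an ES refresh: pending writes become searchable."""
        self.es_visible.update(self.es_pending)
        self.es_pending.clear()

    def _locked(self, member_id: str) -> bool:
        return self.locks.get(member_id, 0.0) > self.clock()

    def update(self, member_id: str, doc: dict) -> None:
        # Take the lock first so readers stop backfilling from ES.
        self.locks[member_id] = self.clock() + self.LOCK_TTL_SECONDS
        self.es_pending[member_id] = doc
        self.cache[member_id] = doc  # write-through: the cache gets the new value

    def get(self, member_id: str) -> Optional[dict]:
        cached = self.cache.get(member_id)
        if cached is not None:
            return cached
        doc = self.es_visible.get(member_id)
        # While the lock is held, ES may still return the old document,
        # so the result is served but not written back to the cache.
        if doc is not None and not self._locked(member_id):
            self.cache[member_id] = doc
        return doc
```

A real deployment would take the lock in Redis itself (for example with an expiring key) so that every application instance sees it; the dictionary here only keeps the sketch self-contained.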

High‑Availability Redis Architecture

A dual‑center multi‑cluster Redis deployment writes to both data centers synchronously; reads are served locally to minimize latency. If one data center fails, the other continues to provide full member services.
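A minimal sketch of the dual-write, local-read pattern follows. The class `DualCenterCache`, the data center labels, and the choice to skip a cluster that is marked down are assumptions for the example; the article does not describe how a failed write to one side is handled.

```python
from typing import Dict, Optional, Set


class DualCenterCache:
    """Writes go to every healthy data center; reads prefer the local one."""

    def __init__(self, local_dc: str) -> None:
        # In-memory stand-ins for the Redis clusters in each data center.
        self.stores: Dict[str, Dict[str, str]] = {"A": {}, "B": {}}
        self.local_dc = local_dc
        self.down: Set[str] = set()  # data centers currently marked unavailable

    def set(self, key: str, value: str) -> None:
        healthy = [dc for dc in self.stores if dc not in self.down]
        if not healthy:
            raise RuntimeError("no Redis cluster available")
        for dc in healthy:
            self.stores[dc][key] = value

    def get(self, key: str) -> Optional[str]:
        # Try the local data center first, then fall back to the remote one.
        order = [self.local_dc] + [dc for dc in self.stores if dc != self.local_dc]
        for dc in order:
            if dc not in self.down:
                return self.stores[dc].get(key)
        raise RuntimeError("no Redis cluster available")
```

Because both sides hold the full data set, marking data center A as down leaves reads and writes working against B alone.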

MySQL Dual‑Center Partition Cluster

Member registration data was migrated from a single SQL Server instance to a 1000‑shard MySQL cluster (1 primary + 3 replicas) spread across two data centers, achieving more than 20k TPS with about 10 ms latency.
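Routing a member to one of the 1000 shards could look like the sketch below. The article gives only the shard count; the CRC32 hash, the function names, and the `member_NNN` table naming are assumptions made for illustration.

```python
import zlib

SHARD_COUNT = 1000  # shard count stated in the article


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard index in [0, SHARD_COUNT).

    CRC32 is used because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_for(member_id: str) -> str:
    """Hypothetical table name for the shard that holds this member."""
    return f"member_{shard_for(member_id):03d}"
```

The same member ID always resolves to the same table, which is what lets a lookup by ID touch a single shard.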

Seamless Migration Strategy

Data migration combined a full data sync, incremental sync, and real‑time dual‑write. During a trial period, SQL Server remained the write target while each write was also applied to MySQL asynchronously; failed MySQL writes were retried and investigated manually. Once dual‑write was stable, traffic was shifted gradually (gray release) from SQL Server to MySQL using A/B testing, with a consistency check between the two stores on each request.
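The trial-period dual-write and the gradual cutover can be sketched as follows. Everything named here (`DualWriter`, `use_new_store`, `read_with_check`, the retry limit of 3, and the percentage-based bucketing) is a hypothetical illustration of the pattern, with dictionaries standing in for SQL Server and MySQL.

```python
import zlib
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class DualWriter:
    """Writes to the old store synchronously and to the new store via a retry queue."""

    def __init__(self, write_new: Callable[[str, dict], None], max_retries: int = 3) -> None:
        self.old: Dict[str, dict] = {}  # stand-in for SQL Server
        self.write_new = write_new      # writes to MySQL; may raise on failure
        self.max_retries = max_retries
        self.retry_queue: Deque[Tuple[str, dict, int]] = deque()
        self.needs_review: List[Tuple[str, dict]] = []  # left for manual investigation

    def write(self, member_id: str, row: dict) -> None:
        self.old[member_id] = row
        self.retry_queue.append((member_id, row, 0))  # applied asynchronously

    def process_queue(self) -> None:
        """One pass of the async worker; failed items are retried on a later pass."""
        for _ in range(len(self.retry_queue)):
            member_id, row, attempts = self.retry_queue.popleft()
            try:
                self.write_new(member_id, row)
            except Exception:
                if attempts + 1 >= self.max_retries:
                    self.needs_review.append((member_id, row))
                else:
                    self.retry_queue.append((member_id, row, attempts + 1))


def use_new_store(member_id: str, rollout_percent: int) -> bool:
    """Stable bucketing: the same member always lands in the same bucket."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < rollout_percent


def read_with_check(member_id: str, old: Dict[str, dict], new: Dict[str, dict],
                    rollout_percent: int, mismatches: List[str]) -> Optional[dict]:
    """Read both stores, record any disagreement, and serve the old store on mismatch."""
    old_row, new_row = old.get(member_id), new.get(member_id)
    if old_row != new_row:
        mismatches.append(member_id)
        return old_row
    return new_row if use_new_store(member_id, rollout_percent) else old_row
```

Raising `rollout_percent` step by step from 0 to 100 moves traffic over while the mismatch list shows whether the two stores still agree.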

MySQL & ES Primary‑Backup Scheme

To guard against failures of the data access layer (DAL) component, writes are also duplicated to Elasticsearch. If MySQL or the DAL fails, the system can switch reads and writes to ES and resynchronize the data once MySQL recovers.

Abnormal Member Relationship Governance

Complex bugs that caused cross‑account binding were identified and fixed at the code‑logic layer to prevent data leakage and unauthorized order manipulation.

Fine‑Grained Flow‑Control and Degradation

Three levels of flow control are applied: hotspot throttling for abusive accounts, per‑caller limits to prevent buggy code from generating massive traffic, and global TPS caps to protect the system. Degradation strategies include response‑time‑based circuit breaking, error‑rate thresholds, and gradual traffic gray‑scaling with automated verification.
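The three levels of flow control can be sketched with simple fixed-window counters. This is an illustrative assumption, not the system's implementation: the article does not say which limiting algorithm is used, and the class names, limits, and one-second window are hypothetical. The current time is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict
from typing import DefaultDict


class WindowCounter:
    """Counts requests in a fixed time window."""

    def __init__(self, limit: int, window_seconds: float = 1.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0

    def has_room(self, now: float) -> bool:
        # Start a new window when the current one has elapsed.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        return self.count < self.limit

    def record(self) -> None:
        self.count += 1


class FlowController:
    """Applies per-account, per-caller, and global limits to each request."""

    def __init__(self, per_account: int, per_caller: int, global_limit: int) -> None:
        self.accounts: DefaultDict[str, WindowCounter] = defaultdict(lambda: WindowCounter(per_account))
        self.callers: DefaultDict[str, WindowCounter] = defaultdict(lambda: WindowCounter(per_caller))
        self.global_counter = WindowCounter(global_limit)

    def allow(self, account_id: str, caller: str, now: float) -> bool:
        counters = [self.accounts[account_id], self.callers[caller], self.global_counter]
        # Check every level before recording, so a rejected request
        # does not consume quota at the levels that had room.
        if not all(counter.has_room(now) for counter in counters):
            return False
        for counter in counters:
            counter.record()
        return True
```

A hot account hits its own limit first, a misbehaving caller is capped next, and the global counter protects the system even when every individual account and caller stays within bounds.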

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
