
High‑Availability Architecture Design for the Integrated Membership System of Tongcheng and eLong

This article details the design and implementation of a high‑performance, highly available membership system for the merged Tongcheng‑eLong platform, covering Elasticsearch dual‑center clusters, traffic‑isolated three‑cluster architecture, deep ES optimizations, Redis caching and dual‑center clusters, MySQL dual‑center partitioning, migration strategies, and future fine‑grained flow‑control and degradation measures.

Tongcheng Travel Technology Center

The membership system is a core service tightly coupled with the order flow of all business lines; any failure blocks user orders across the entire company, so it must provide high performance and high availability.

After the merger of Tongcheng and eLong, multiple platforms (Tongcheng APP, eLong APP, WeChat mini‑programs, etc.) needed a unified view of member identity, producing massive request volumes (over 20,000 TPS during peak holidays). The sections below describe how these challenges were addressed.

1. Elasticsearch High‑Availability Solution

Given more than a billion member records and diverse query dimensions (phone number, WeChat unionid, eLong card number), Elasticsearch (ES) is used to store the unified member relationships. A dual‑center primary‑backup ES deployment is used: the primary cluster in Data Center A and the backup cluster in Data Center B, kept in sync via MQ. If the primary fails, read/write traffic can be switched to the backup cluster instantly, and data is synchronized back once the primary recovers.
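The failover behavior described above can be sketched as a thin client wrapper; this is a hypothetical illustration (class and method names are assumptions, and a fake cluster stands in for a real ES client):

```python
# Sketch of the dual-center failover: reads and writes go to the primary
# cluster, and on a connection failure the client flips all traffic to the
# backup cluster in the other data center.

class DualCenterEsClient:
    def __init__(self, primary, backup):
        self.primary = primary    # ES cluster in Data Center A
        self.backup = backup      # ES cluster in Data Center B
        self.use_backup = False   # flipped by the switchover

    def _active(self):
        return self.backup if self.use_backup else self.primary

    def search(self, query):
        try:
            return self._active().search(query)
        except ConnectionError:
            # Primary unreachable: route all subsequent traffic to backup.
            self.use_backup = True
            return self.backup.search(query)

# Dict-returning fake for demonstration; real code would use an ES client.
class FakeCluster:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def search(self, query):
        if not self.healthy:
            raise ConnectionError(self.name + " is down")
        return {"cluster": self.name, "query": query}
```

In production the switchover would typically be driven by health checks or a configuration push rather than a single caught exception, but the routing logic is the same.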

2. ES Traffic‑Isolation Three‑Cluster Architecture

To protect the main order flow from marketing‑driven traffic spikes, requests are classified into high‑priority (order‑related) and lower‑priority (marketing) groups. A dedicated ES cluster handles high‑TPS marketing “flash‑sale” traffic, isolating it from the primary ES cluster.
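The routing decision amounts to classifying each request source and picking a cluster; a minimal sketch, with cluster names and source tags as illustrative assumptions:

```python
# Traffic isolation: order-related queries go to the main ES cluster,
# marketing "flash-sale" traffic to a dedicated cluster, so a promotion
# spike cannot starve the order flow.

MAIN_CLUSTER = "es-main"
MARKETING_CLUSTER = "es-marketing"

def route_request(source: str) -> str:
    """Return the ES cluster that should serve a request from `source`."""
    high_priority = {"order", "payment", "refund"}
    return MAIN_CLUSTER if source in high_priority else MARKETING_CLUSTER
```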

3. Deep ES Cluster Optimizations

Balanced shard distribution to avoid hotspot nodes.

Search thread‑pool size capped at cpu_cores * 3 / 2 + 1 to prevent CPU overload.

Shard size capped at 50 GB per shard.

Removed unnecessary text fields, keeping only keyword for member queries.

Used filter instead of query to avoid relevance scoring.

Performed result sorting in the member service JVM.

Added routing keys to direct queries to specific shards.

These optimizations dramatically reduced CPU usage and improved query latency, as shown in the following charts.
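Two of the optimizations above, filtering without scoring and shard routing, can be illustrated with a request body; the field names here are assumptions, not the production schema:

```python
# Illustrative ES request construction: a bool/filter clause avoids
# relevance scoring (and its results are cacheable by ES), and a routing
# key sends the query to the single shard that owns the key.

def member_query(phone: str) -> dict:
    return {
        "query": {
            "bool": {
                # `filter` skips scoring entirely, unlike `must`/`match`.
                "filter": [{"term": {"phone": phone}}]
            }
        }
    }

def search_params(phone: str) -> dict:
    # The routing key directs the request to one shard instead of
    # fanning out across every shard in the index.
    return {"routing": phone, "body": member_query(phone)}
```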

4. Member Redis Cache Solution

Initially the system avoided caching due to real‑time consistency concerns, but a sudden “blind‑box” ticket promotion forced the adoption of a Redis cache. To solve the 1‑second ES write‑delay inconsistency, a 2‑second distributed lock is acquired when updating ES, the related Redis entry is deleted, and concurrent reads respect the lock to avoid stale writes.
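The lock-guarded invalidation can be sketched as follows; in-memory dicts stand in for Redis and ES so the example runs anywhere, and the lock would be a Redis distributed lock (e.g. SET NX PX) in the real system:

```python
import time

# LOCK_TTL covers the ~1s ES write visibility delay, with headroom.
LOCK_TTL = 2.0
locks = {}   # key -> lock expiry timestamp (simulated distributed lock)
cache = {}   # simulated Redis cache
store = {}   # simulated ES member store

def update_member(key, value, now=None):
    now = time.monotonic() if now is None else now
    locks[key] = now + LOCK_TTL   # take the 2-second lock first
    store[key] = value            # write ES (visible ~1s later in reality)
    cache.pop(key, None)          # invalidate the cached entry

def read_member(key, now=None):
    now = time.monotonic() if now is None else now
    if key in cache:
        return cache[key]
    value = store.get(key)
    # Repopulate the cache only after the lock expires; this prevents a
    # concurrent reader from writing a stale ES result back into Redis.
    if now >= locks.get(key, 0.0):
        cache[key] = value
    return value
```

The key property is that within the lock window a cache miss still serves the read, but does not poison the cache with a possibly stale value.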

After implementation, cache hit rates exceeded 90 %, greatly relieving ES pressure.

5. Redis Dual‑Center Multi‑Cluster Architecture

Two Redis clusters are deployed in Data Centers A and B. Writes are performed to both clusters (dual‑write) and reads are served locally, ensuring high availability even if one data center fails.
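A minimal sketch of the dual-write / local-read pattern, with dict-backed fakes standing in for real Redis clients:

```python
# Every write goes to the Redis clusters in both data centers; each data
# center reads only from its own cluster, so reads survive a remote outage.

class DualCenterCache:
    def __init__(self, local, remote):
        self.local = local    # Redis cluster in this data center
        self.remote = remote  # Redis cluster in the other data center

    def set(self, key, value):
        self.local[key] = value   # dual-write: both centers get the value
        self.remote[key] = value

    def get(self, key):
        return self.local.get(key)  # reads stay in the local data center
```

A production version would also need to handle a failed remote write (e.g. queue it for replay), which this sketch omits.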

6. MySQL Dual‑Center Partition Cluster

Member data (over one billion records) is sharded into more than 1,000 partitions, each holding roughly one million rows. The cluster uses a 1‑master‑3‑slave topology with the master in Data Center A and slaves in Data Center B, synchronized over a dedicated link with sub‑millisecond latency.
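A shard-routing helper matching that layout might look like this; the shard count, grouping, and database/table naming scheme are illustrative assumptions:

```python
# Hash-style routing: member rows land in one of ~1,000 partitions so each
# table stays around a million rows. 32 tables per database is assumed.

SHARD_COUNT = 1024  # assumed power-of-two count ("more than 1,000")

def shard_for(member_id: int) -> str:
    """Return the database.table that owns this member id."""
    shard = member_id % SHARD_COUNT
    return f"member_{shard // 32:02d}.member_{shard:04d}"
```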

Stress tests showed >20 k TPS with average latency under 10 ms.

7. Smooth Migration from SQL Server to MySQL

The migration follows a “full sync → incremental sync → real‑time gray‑scale switch” strategy, using dual‑write to both databases, retry mechanisms, and extensive verification before fully cutting over traffic.
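The gray-scale switch step can be sketched as percentage-based routing by member-id bucket; the dial mechanism here is a hypothetical illustration, not the team's actual rollout tooling:

```python
# Gray-scale cutover: while both databases are kept in sync by dual-write,
# a rollout dial decides what share of reads the new MySQL path serves.

def read_source(member_id: int, mysql_percent: int) -> str:
    """Route a read to 'mysql' or 'sqlserver' based on the rollout dial."""
    assert 0 <= mysql_percent <= 100
    bucket = member_id % 100  # stable per-member bucket
    return "mysql" if bucket < mysql_percent else "sqlserver"
```

Because the bucket is derived from the member id, a given member is routed consistently at any dial setting, which makes verification of the new path deterministic.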

8. MySQL and ES Primary‑Backup Cluster Scheme

To guard against DAL component failures, member data is also written to an ES backup cluster, allowing a quick switch to ES if MySQL becomes unavailable.

9. Abnormal Member Relationship Governance

Complex logic identifies and rectifies cases where a user’s APP account is incorrectly bound to another’s WeChat account, preventing cross‑account data leakage and order manipulation.

10. Outlook: More Fine‑Grained Flow‑Control and Degradation Strategies

Future work includes hotspot throttling, per‑account flow rules, global traffic limiting, response‑time‑based degradation, and exception‑rate‑based circuit breaking, along with comprehensive account governance.
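As a toy illustration of the exception-rate circuit breaking mentioned above (window size and threshold are assumptions):

```python
from collections import deque

# Once the recent failure rate crosses a threshold, calls are short-
# circuited and a degraded response is returned instead of hitting the
# struggling dependency.

class CircuitBreaker:
    def __init__(self, window=100, max_failure_rate=0.5):
        self.results = deque(maxlen=window)  # recent success/failure flags
        self.max_failure_rate = max_failure_rate

    def failure_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def call(self, fn, fallback):
        if self.failure_rate() >= self.max_failure_rate:
            return fallback()          # breaker open: degrade the response
        try:
            result = fn()
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            return fallback()
```

A production breaker would also add a half-open probe state so traffic can recover automatically; this sketch only shows the tripping side.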

In conclusion, the team will continue to refine and evolve the system to ensure reliability, performance, and scalability as business demands grow.

distributed systems · system architecture · Elasticsearch · High Availability · Redis · MySQL · traffic isolation
Written by Tongcheng Travel Technology Center

Pursue excellence, start again with Tongcheng! More technical insights to help you along your journey and make development enjoyable.
