
Design and Architecture of Ctrip Service Registration Center

The article explains Ctrip's service registration center architecture, including its two‑layer Data and Session design, multi‑sharding, fault‑tolerance mechanisms, Redis‑based cluster discovery, design trade‑offs such as proxy versus Smart SDK, hashing strategy, and operational considerations for burst traffic and future scaling.

Ctrip Technology

Author : Siegfried, a Ctrip software technology expert responsible for the development of the Ctrip registration center.

Introduction : Most Ctrip services have been migrated to micro‑services, requiring each instance to register with a central registry and discover other services. Recent business growth has increased the number of instances dramatically, putting pressure on the registry's performance and stability.

The registry must handle rapid instance registration, heartbeat monitoring, and subscription notifications while ensuring that failed or offline instances are quickly removed from service discovery.

Overall Architecture : The registry uses a two‑layer structure – Data and Session. The Data layer stores service metadata and instance health, while the Session layer communicates with SDKs, aggregates heartbeats, and forwards queries.

Registration – Periodic Heartbeat : Service instances send a heartbeat every 5 seconds to the Session layer, which forwards it to the appropriate Data shard. Continuous heartbeats keep the instance marked as healthy; missing heartbeats cause the entry to expire.
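The heartbeat-and-expiry contract above can be sketched as a lease table on the Data side: an instance stays healthy while 5-second heartbeats keep arriving, and expires once none has been seen within the TTL. This is a minimal illustration — class and field names are assumptions, not Ctrip's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the Data-layer lease bookkeeping implied by the
// 5-second heartbeat: healthy while beats arrive, expired once the TTL
// elapses without one. The TTL value is an assumption (e.g. 3 missed beats).
public class LeaseTable {
    private final long ttlMillis;
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    public LeaseTable(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Record a heartbeat for an instance at the given time. */
    public void beat(String instanceId, long nowMillis) {
        lastBeat.put(instanceId, nowMillis);
    }

    /** An instance is healthy only if a heartbeat arrived within the TTL. */
    public boolean isAlive(String instanceId, long nowMillis) {
        Long t = lastBeat.get(instanceId);
        return t != null && nowMillis - t < ttlMillis;
    }
}
```

Passing the clock in explicitly keeps the expiry logic deterministic and easy to test; a real implementation would read `System.currentTimeMillis()` and sweep expired entries periodically.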

Discovery – Event Push / Fallback Polling : The first heartbeat of a new instance generates a NEW event; expiration generates a DELETE event. Events are pushed to SDKs, and SDKs also perform periodic full queries to compensate for possible lost push notifications.
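The push-plus-polling scheme can be sketched as an SDK-side cache: pushed NEW/DELETE events update the local view immediately, and the periodic full query overwrites it to repair any lost pushes. Names here are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the SDK's local instance cache: events give fast
// updates; the fallback full query is authoritative and heals lost pushes.
public class InstanceCache {
    public enum EventType { NEW, DELETE }

    private final Set<String> instances = new HashSet<>();

    /** Apply a pushed event: NEW on first heartbeat, DELETE on expiry. */
    public void onEvent(EventType type, String instanceId) {
        if (type == EventType.NEW) {
            instances.add(instanceId);
        } else {
            instances.remove(instanceId);
        }
    }

    /** Fallback polling: replace the view with the authoritative snapshot. */
    public void onFullQuery(Set<String> snapshot) {
        instances.clear();
        instances.addAll(snapshot);
    }

    public Set<String> current() {
        return new HashSet<>(instances);
    }
}
```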

Multi‑Sharding Scheme : Data is split into multiple shards to avoid vertical bottlenecks. Sessions hash the service ID to route heartbeats, subscriptions, and queries to the correct shard, aggregating responses for the client.
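The routing step reduces to hashing the service ID modulo a fixed shard count, so heartbeats, subscriptions, and queries for the same service always land on the same shard. A minimal sketch, with illustrative names:

```java
// Illustrative sketch of Session-side shard routing: ordinary (fixed)
// hashing over the service ID, per the article's design. shardCount is
// fixed at deployment time.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Same service ID always maps to the same shard. */
    public int shardFor(String serviceId) {
        // floorMod keeps the result non-negative even for negative hashCodes
        return Math.floorMod(serviceId.hashCode(), shardCount);
    }
}
```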

Single‑Point Failure – Data : Heartbeat requests are replicated to multiple Data nodes within a shard, so any single Data failure is covered by the remaining replicas without manual failover.

Data writes are eventually consistent; each Session sticks to a particular Data node within a shard to reduce the impact of slight timing differences.
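Together, the two paragraphs above describe a write-to-all, read-from-one pattern: heartbeats fan out to every replica in the shard, while each Session's queries stick to a single replica so eventual consistency stays invisible to that Session. A sketch under those assumptions — node naming and the send callback are illustrative:

```java
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative sketch: heartbeats are replicated to every Data node in the
// shard (any single failure is covered), while queries stick to one replica
// per Session to mask slight timing differences between replicas.
public class ShardReplicas {
    private final List<String> replicas;   // Data nodes in this shard
    private final int stickyIndex;         // the replica this Session reads from

    public ShardReplicas(List<String> replicas, int sessionId) {
        this.replicas = replicas;
        this.stickyIndex = Math.floorMod(sessionId, replicas.size());
    }

    /** Writes (heartbeats) go to every replica in the shard. */
    public void replicateHeartbeat(String payload, BiConsumer<String, String> send) {
        for (String node : replicas) {
            send.accept(node, payload);
        }
    }

    /** Reads (queries) always go to the same replica for this Session. */
    public String queryTarget() {
        return replicas.get(stickyIndex);
    }
}
```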

Single‑Point Failure – Session : Any Session can access all Data shards, so if a Session node fails, SDKs simply switch to another Session.

Cluster Self‑Discovery : The registry uses Redis for bootstrapping. New instances retrieve the list of existing nodes from Redis, send internal heartbeats, and become part of the cluster. Runtime operation does not depend on Redis, so Redis outages only affect scaling, not core functionality.
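The bootstrap-only dependency can be sketched as follows: a joining node reads the member list from the seed store (Redis, e.g. via SMEMBERS) once at startup, then relies on internal heartbeats; if the seed store is down, joining fails but already-running nodes are unaffected. The `Supplier` stands in for the Redis call — this abstraction and all names are assumptions for illustration.

```java
import java.util.Set;
import java.util.function.Supplier;

// Illustrative sketch of bootstrap-only Redis use: the seed store is read
// once when a node joins; runtime operation never touches it again.
public class ClusterBootstrap {
    private Set<String> peers = Set.of();

    /** Fetch the peer list from the seed store at startup only. */
    public boolean join(Supplier<Set<String>> seedStore) {
        try {
            peers = seedStore.get();
            return true;
        } catch (RuntimeException e) {
            // Seed store down: scaling is blocked, but the formed cluster
            // keeps its previous view and continues operating.
            return false;
        }
    }

    public Set<String> peers() {
        return peers;
    }
}
```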

Design Trade‑offs :

Proxy vs. Smart SDK : Adding a Session (proxy) layer isolates heavy connection handling, allows independent scaling, and keeps SDKs lightweight.

Hashing Strategy : Ordinary (fixed) hashing is used instead of consistent hashing because service counts are relatively stable; each shard has multiple replicas, simplifying the design.

Redis Dependency : Runtime does not rely on Redis; the registry can continue operating during Redis failures, though automatic scaling is limited.

Implementation Language : The Data layer is written in Java for better control and extensibility, rather than implementing the whole registry directly on Redis.

Operational Scenarios :

Burst Traffic : During holidays or promotions, traffic spikes cause registration/discovery overload; the system aggregates and deduplicates requests to reduce load.
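Aggregation and deduplication can be sketched as request coalescing: discovery requests for the same service arriving within one batch window collapse into a single query to the Data layer. The window mechanics and names are illustrative assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of burst-traffic deduplication: duplicate requests
// for the same service within a batch window are collapsed, so the Data
// layer is queried at most once per service per window.
public class RequestCoalescer {
    private final Set<String> pending = new LinkedHashSet<>();

    /** Enqueue a request; duplicates within the window are dropped. */
    public void submit(String serviceId) {
        pending.add(serviceId);
    }

    /** Drain one batch: each service appears at most once. */
    public Set<String> drainBatch() {
        Set<String> batch = new LinkedHashSet<>(pending);
        pending.clear();
        return batch;
    }
}
```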

Traffic Imbalance : Because each Session sticks to the Data node it initially picked at random, newly added Data nodes receive little traffic while older nodes stay overloaded; a global controller is introduced to rebalance Session stickiness across a shard's replicas.

Global Risk : Hash‑based routing can affect all business lines if a shard fails; grouping Data by business unit isolates failures and enables gray‑release testing.

Future Plans : Optimize single‑machine performance, simplify mechanisms, reduce node count, and add elastic scaling to automatically expand or shrink the registry based on load.

Recommended Reading :

Domain‑centric, middle‑platform and multi‑Region evolution of Ctrip account system

Unlocking potential value: Ctrip log governance practice

Ctrip ticket final‑itinerary system architecture evolution

Ctrip hotel ranking recommendation ad data platform – filling engine

Tags: distributed systems, microservices, scalability, fault tolerance, service registry, Redis discovery, session layer