
Design and Evolution of Ctrip's Hermes Message Queue System

This article presents a detailed overview of Ctrip's Hermes message queue system, covering its architectural evolution from a simple Mongo‑based design to a broker‑centric, multi‑storage solution with meta‑server coordination, and discusses practical techniques for building high‑performance, scalable messaging infrastructure.


Gu Qing introduces Hermes, the new message system being deployed at Ctrip, focusing on its architecture and implementation details.

The advantages of message queues are highlighted, including decoupling of services, asynchronous processing, handling traffic spikes, and supporting fan‑out scenarios.

The basic MQ model is explained with queue and topic patterns, consumer groups, and message delivery semantics.
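The queue/topic semantics described above can be sketched in a few lines. This is a toy in-memory model, not Hermes's API: every consumer group receives every message (topic fan-out), while members within one group share the stream queue-style, each message going to exactly one member.

```python
from collections import defaultdict

class MiniTopic:
    """Toy model of topic semantics: fan-out across consumer groups,
    queue-style load sharing within each group."""

    def __init__(self):
        self.groups = {}                    # group name -> list of members
        self.counters = defaultdict(int)    # group name -> messages published
        self.delivered = defaultdict(list)  # (group, member) -> messages

    def subscribe(self, group, member):
        self.groups.setdefault(group, []).append(member)

    def publish(self, message):
        for group, members in self.groups.items():
            # round-robin within the group: exactly one member per message
            member = members[self.counters[group] % len(members)]
            self.counters[group] += 1
            self.delivered[(group, member)].append(message)

# Two groups: "billing" has two members splitting the load,
# "audit" has one member that sees everything.
topic = MiniTopic()
topic.subscribe("billing", "b1")
topic.subscribe("billing", "b2")
topic.subscribe("audit", "a1")
for m in ["m1", "m2", "m3", "m4"]:
    topic.publish(m)
```

After the four publishes, the single `audit` member holds all four messages, while the two `billing` members hold two each, which is exactly the fan-out-plus-load-sharing behavior the model describes.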

Version 1.0 of Ctrip's MQ stored messages directly in MongoDB without a broker, leading to high client upgrade costs, heavy DB coordination, limited features, and poor elasticity.

Version 2.0 added a master‑slave broker layer that coordinates via MongoDB heartbeats, allowing clients to communicate only with brokers, simplifying upgrades and reducing client complexity.

The current architecture (Fig 4) consists of producers, brokers, and storage back‑ends (MySQL for critical data and Kafka for high‑throughput logs), with a meta‑server handling cluster coordination and lease management.

Two message storage types are used: Kafka provides high throughput but lacks features like replay and priority, while MySQL offers richer queue capabilities for important business data.
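The storage split could be expressed as a simple routing rule at topic-registration time. The field names below are hypothetical, purely to illustrate the idea that feature-rich topics go to MySQL while plain high-volume streams go to Kafka:

```python
def choose_storage(topic_config):
    # Hypothetical config fields: topics that need replay or priority
    # delivery get the richer MySQL-backed queue; everything else gets
    # the high-throughput Kafka channel.
    if topic_config.get("needs_replay") or topic_config.get("has_priority"):
        return "mysql"
    return "kafka"

critical = choose_storage({"needs_replay": True})   # "mysql"
logs = choose_storage({})                           # "kafka"
```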

Efficient MQ construction starts with single‑machine optimizations: fast insert‑only tables, batch writes, minimal indexing, and low‑latency delivery via push or long‑polling pull mechanisms.
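The long-polling idea mentioned above can be sketched with a condition variable: a consumer's fetch parks until a message arrives or the timeout expires, giving push-like latency without busy polling. This is a minimal single-process sketch, not the broker's actual implementation:

```python
import threading
import time
from collections import deque

class LongPollQueue:
    """Long-polling pull: poll() blocks until a message arrives or the
    timeout expires, avoiding repeated empty fetches."""

    def __init__(self):
        self._messages = deque()
        self._cond = threading.Condition()

    def put(self, msg):
        with self._cond:
            self._messages.append(msg)
            self._cond.notify_all()   # wake any parked consumers

    def poll(self, timeout):
        deadline = time.monotonic() + timeout
        with self._cond:
            while not self._messages:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None       # timed out: empty response
                self._cond.wait(remaining)
            return self._messages.popleft()

q = LongPollQueue()
# Producer delivers a message shortly after the consumer starts waiting.
threading.Timer(0.05, q.put, args=["hello"]).start()
msg = q.poll(timeout=2.0)   # returns as soon as the message lands
```

The same pattern maps onto an HTTP long-poll endpoint: the broker holds the request open until data is available or the timeout fires, then the client immediately re-polls.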

Scaling to a cluster involves adding brokers, partitioning topics, and ensuring ordering within partitions; load balancing is achieved by assigning partitions to specific brokers.
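Two pieces of that scaling story can be sketched concretely: keyed partitioning (same key always lands in the same partition, preserving per-key order) and assigning each partition to exactly one broker. Names and the round-robin policy are illustrative assumptions, not Hermes's actual scheme:

```python
import hashlib

def partition_for(key, num_partitions):
    # Same key -> same partition, so ordering is preserved per key
    # even though the topic as a whole is spread across brokers.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def assign_partitions(partitions, brokers):
    # Round-robin ownership: each partition is served by exactly one
    # broker, which keeps that partition's writes on one machine.
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}

owners = assign_partitions(range(8), ["broker-a", "broker-b", "broker-c"])
p = partition_for("order-42", num_partitions=8)
```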

Cluster management relies on a lease‑based approach where the meta‑server grants time‑limited leases to brokers and consumers, using ZooKeeper for broker coordination and HTTP/TCP protocols for meta‑server communication, enabling simple HA and dynamic rebalancing.
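The lease mechanism can be illustrated with a small sketch of the meta-server side. A holder that stops renewing silently loses the resource once the TTL elapses, so failover needs no separate failure detector. The `now` parameter is injected here only to make the example deterministic:

```python
import time

class LeaseManager:
    """Time-limited leases on named resources (e.g. a topic partition),
    granted and renewed by a coordinator."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.leases = {}   # resource -> (holder, expires_at)

    def acquire(self, resource, holder, now=None):
        now = time.monotonic() if now is None else now
        current = self.leases.get(resource)
        if current is not None:
            owner, expires_at = current
            if expires_at > now and owner != holder:
                return False          # a live lease is held by someone else
        self.leases[resource] = (holder, now + self.ttl)  # grant or renew
        return True

mgr = LeaseManager(ttl_seconds=10)
grant1 = mgr.acquire("partition-0", "broker-a", now=0)   # granted
grant2 = mgr.acquire("partition-0", "broker-b", now=5)   # rejected: lease live
grant3 = mgr.acquire("partition-0", "broker-b", now=11)  # granted after expiry
```

In practice the holder renews well before expiry (e.g. at half the TTL), so ownership only moves when a broker actually stops heartbeating.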

The summary emphasizes the ingredients of a robust, scalable message queue: fast writes, high-throughput channels, low delivery latency, partition stickiness, long-polling, and lease-based coordination.

Tags: distributed systems, Architecture, Kafka, Message Queue, Cluster Management, Hermes, Ctrip
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
