Design and Implementation of Qunar's High‑Concurrency Instant Messaging System
The article details Qunar's in‑house instant messaging platform, covering its purpose, protocol choices (XMPP with protocol‑buffer optimization), architecture based on ejabberd, message flow, reliability mechanisms, extensions such as bots and HTTP APIs, as well as extensive system‑level tuning for high‑concurrency TCP connections.
1. What is IM
Instant Messaging (IM) is a network‑based technology that enables real‑time text communication between users, typically consisting of a client, a server, and the messages exchanged between them.
2. Common IM Implementation Schemes
XMPP Protocol
XMPP is an open XML‑based protocol designed for near‑real‑time messaging, presence, and request‑response services. Its main components are Presence, Message, and IQ. It offers rich extensions but generates relatively large packets, consuming more bandwidth and battery.
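As a rough illustration of the three stanza types, the sketch below builds one of each with Python's standard xml.etree.ElementTree module. The addresses, IDs, and helper names are made up for the example; real XMPP streams also carry stream headers and namespaces that are omitted here.

```python
import xml.etree.ElementTree as ET


def message_stanza(sender, recipient, body, msg_id):
    """Build a one-to-one chat <message/> stanza."""
    msg = ET.Element(
        "message",
        {"from": sender, "to": recipient, "type": "chat", "id": msg_id},
    )
    ET.SubElement(msg, "body").text = body
    return ET.tostring(msg, encoding="unicode")


def presence_stanza(sender, show):
    """Build a <presence/> stanza announcing availability (e.g. 'away')."""
    pres = ET.Element("presence", {"from": sender})
    ET.SubElement(pres, "show").text = show
    return ET.tostring(pres, encoding="unicode")


def roster_iq(sender, iq_id):
    """Build an <iq/> request asking the server for the contact list."""
    iq = ET.Element("iq", {"from": sender, "type": "get", "id": iq_id})
    ET.SubElement(iq, "query", {"xmlns": "jabber:iq:roster"})
    return ET.tostring(iq, encoding="unicode")


if __name__ == "__main__":
    print(message_stanza("alice@example.com", "bob@example.com", "hi", "m1"))
    print(presence_stanza("alice@example.com", "away"))
    print(roster_iq("alice@example.com", "r1"))
```

Even this short chat message spends most of its bytes on element and attribute names, which is the packet-size overhead mentioned above.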
MQTT Protocol
MQTT is a lightweight publish/subscribe protocol built on TCP/IP, ideal for low‑bandwidth, low‑power devices. It is simple to implement but requires custom development for chat‑specific features such as friend lists.
3. Qunar's Implementation
Protocol Selection
After evaluating common IM protocols, Qunar chose XMPP as the core protocol because it allows rapid implementation of basic chat functions and can reuse existing extensions. To mitigate XMPP’s large XML payloads, the XML encoding used on the wire was replaced with Protocol Buffers.
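To give a feel for the size difference, here is a minimal sketch that hand-encodes a chat message in the Protocol Buffers wire format (varint tags plus length-delimited strings) and compares it with an equivalent XML stanza. The field numbers and the four-field message layout are assumptions for illustration only; the article does not publish Qunar's actual schema, and a real system would use generated protobuf classes rather than a hand-written encoder.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, value: str) -> bytes:
    """Encode one string field: tag, byte length, then UTF-8 payload."""
    data = value.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data


def encode_chat(sender: str, recipient: str, body: str, msg_id: str) -> bytes:
    # Field numbers 1-4 are hypothetical, not Qunar's real schema.
    return (
        encode_string_field(1, sender)
        + encode_string_field(2, recipient)
        + encode_string_field(3, body)
        + encode_string_field(4, msg_id)
    )


def xml_chat(sender: str, recipient: str, body: str, msg_id: str) -> str:
    """The same message as a plain XMPP-style XML stanza."""
    return (
        f'<message from="{sender}" to="{recipient}" type="chat" id="{msg_id}">'
        f"<body>{body}</body></message>"
    )


if __name__ == "__main__":
    args = ("alice@example.com", "bob@example.com", "hi", "m1")
    print(len(encode_chat(*args)), "bytes as protobuf")
    print(len(xml_chat(*args).encode("utf-8")), "bytes as XML")
```

Each string field costs two bytes of framing here (one tag byte and one length byte), whereas the XML form repeats element and attribute names in every message.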
Open‑Source Project Choice
Qunar adopted ejabberd, an Erlang/OTP‑based XMPP server, for its cross‑platform, fault‑tolerant, clustered architecture and its GPLv2 license.
Architecture Design
The system uses two types of connections: a long‑lived TCP connection (or a WebSocket connection for web clients) for stateful, multi‑device synchronization, and an HTTP connection for stateless requests.
Load balancing: TCP connections are balanced with LVS/HA, HTTP with Nginx.
Data handling: messages are stored in a database or pushed to a message queue; high‑frequency data is cached in Redis.
Management: internal APIs expose IM functionality to other services, and monitoring tools aid maintenance.
ejabberd Process Flow
The client sends a message over a long connection to ejabberd_c2s.
ejabberd_c2s forwards the stanza to ejabberd_router for routing.
ejabberd_router processes common logic and passes the message to ejabberd_local if it is destined for a local user.
ejabberd_local determines the target user and forwards the message to the appropriate ejabberd_sm process.
ejabberd_sm looks up all online devices of the recipient and sends the message to each corresponding ejabberd_c2s process.
The ejabberd_c2s process delivers the message to the client via its TCP connection.
This completes the end‑to‑end delivery of a message from user A to user B.
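The flow above can be modeled with a small toy in Python. This is only a sketch of the hand-offs between the modules, not ejabberd's Erlang code: the class names mirror the module names for readability, stanzas are plain dictionaries, and the single local domain is an assumption of the example.

```python
from dataclasses import dataclass, field


@dataclass
class C2SSession:
    """Stands in for one ejabberd_c2s process: one client connection."""
    user: str
    device: str
    outbox: list = field(default_factory=list)

    def deliver(self, stanza):
        # In the real server this would write to the client's TCP socket.
        self.outbox.append(stanza)


class SessionManager:
    """Stands in for ejabberd_sm: tracks every online session per user."""

    def __init__(self):
        self._sessions = {}

    def register(self, session):
        self._sessions.setdefault(session.user, []).append(session)

    def dispatch(self, stanza):
        targets = self._sessions.get(stanza["to"], [])
        for session in targets:
            session.deliver(stanza)
        return len(targets)


class Router:
    """Stands in for ejabberd_router plus ejabberd_local."""

    def __init__(self, local_domain, session_manager):
        self.local_domain = local_domain
        self.sm = session_manager

    def route(self, stanza):
        domain = stanza["to"].split("@", 1)[1]
        if domain != self.local_domain:
            raise ValueError("remote domains are out of scope for this sketch")
        return self._local(stanza)

    def _local(self, stanza):
        # The recipient is a local user, so hand over to the session manager.
        return self.sm.dispatch(stanza)


if __name__ == "__main__":
    sm = SessionManager()
    phone = C2SSession("bob@example.com", "phone")
    laptop = C2SSession("bob@example.com", "laptop")
    sm.register(phone)
    sm.register(laptop)
    router = Router("example.com", sm)
    count = router.route(
        {"from": "alice@example.com", "to": "bob@example.com", "body": "hi"}
    )
    print(count, "sessions received the message")
```

Because the session manager fans out to every registered session of the recipient, a user logged in on two devices receives the message on both.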
Feature Extensions
Protocol Buffer Transport: Replaces XML with Protocol Buffers on the wire to reduce payload size and power consumption; the server converts back to XML for internal processing.
Message Reliability: Implements ACKs for online devices, HTTP pull for offline devices, and unique message IDs for idempotency.
Message Confirmation Hook: Adds a hook in ejabberd_c2s to generate server‑side ACKs with timestamps.
Multi‑Device Synchronization: Ensures messages sent from one device are propagated to all other logged‑in devices.
Message Queue Publishing: Publishes all IM messages and timestamps to Kafka for downstream analytics and storage.
HTTP Messaging API: Provides an HTTP endpoint that external systems can call to simulate user messages.
IM Authentication Token: Issues a token after successful long‑connection authentication; the token is used for subsequent HTTP calls and for cross‑system authentication.
Incremental Pull: Clients request only data updated after their last known timestamp, reducing bandwidth.
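Three of the ideas in this list (unique message IDs for idempotency, server-side ACKs with timestamps, and incremental pull) fit together naturally, and the following in-memory sketch shows one way they could interact. The MessageStore class, its method names, and the use of a counter as the timestamp are all assumptions made for the example; the article does not describe Qunar's storage layer at this level of detail.

```python
import itertools


class MessageStore:
    """Toy server-side store keyed by client-generated message ID."""

    def __init__(self):
        self._by_id = {}  # msg_id -> (timestamp, recipient, body)
        # A logical counter stands in for the server clock in this sketch.
        self._clock = itertools.count(1)

    def accept(self, msg_id, recipient, body):
        """Store a message once and return an ACK with the server timestamp.

        A retransmission with the same msg_id is not stored again; it simply
        gets the original ACK back, which makes client retries safe.
        """
        if msg_id not in self._by_id:
            self._by_id[msg_id] = (next(self._clock), recipient, body)
        timestamp = self._by_id[msg_id][0]
        return {"ack": msg_id, "timestamp": timestamp}

    def pull_since(self, recipient, last_timestamp):
        """Return (timestamp, msg_id, body) rows newer than last_timestamp."""
        rows = [
            (ts, msg_id, body)
            for msg_id, (ts, rcpt, body) in self._by_id.items()
            if rcpt == recipient and ts > last_timestamp
        ]
        return sorted(rows)


if __name__ == "__main__":
    store = MessageStore()
    print(store.accept("m1", "bob", "hello"))
    print(store.accept("m1", "bob", "hello"))  # retry: same ACK, no duplicate
    print(store.accept("m2", "bob", "are you there?"))
    print(store.pull_since("bob", 1))  # only messages after timestamp 1
```

A client that remembers the timestamp of the last message it saw can reconnect after being offline and fetch only what it missed.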
Bot and Customer Service Extensions
By subscribing to the message queue, a bot service can process specific messages and reply via the IM API, enabling self‑service and intelligent responses. The customer service system queues incoming user messages, performs routing, and converts them to messages for specific agents.
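The bot pattern can be sketched as a consumer loop. In this example a standard-library queue.Queue stands in for the Kafka topic and a plain callable stands in for the HTTP messaging API; both substitutions, along with the bot address and the echo-style reply, are assumptions chosen to keep the sketch self-contained.

```python
import queue


def run_bot(inbox, send_reply, bot_id="bot@example.com"):
    """Drain the queue, replying to every message addressed to the bot.

    inbox      -- queue.Queue of message dicts (stand-in for a Kafka consumer)
    send_reply -- callable taking a message dict (stand-in for the HTTP API)
    Returns the number of messages the bot answered.
    """
    handled = 0
    while True:
        try:
            msg = inbox.get_nowait()
        except queue.Empty:
            return handled
        if msg["to"] != bot_id:
            continue  # not for the bot; another consumer may handle it
        send_reply(
            {
                "from": bot_id,
                "to": msg["from"],
                "body": f"You said: {msg['body']}",
            }
        )
        handled += 1


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put({"from": "alice@example.com", "to": "bot@example.com", "body": "help"})
    inbox.put({"from": "alice@example.com", "to": "bob@example.com", "body": "hi"})
    sent = []
    print(run_bot(inbox, sent.append), "message(s) answered")
    print(sent)
```

Keeping the bot outside the IM server means it can be deployed, scaled, and restarted without touching the long-connection layer.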
4. Data Metrics
Concurrent online users: ~200,000
TCP connection establishment QPS: ~30,000
Incoming message QPS: ~30,000
Outgoing message QPS: ~30,000
5. System Parameter Optimization
Linux OS Parameters
Increase the maximum number of file descriptors, per‑process limits, and adjust sysctl settings such as fs.file-max, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog, and TCP socket buffers to support millions of concurrent connections.
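Before changing any of these settings it helps to see what a host currently uses. The read-only sketch below prints the process file-descriptor limit and the three sysctl keys named above by reading them from /proc/sys. It assumes a Linux host; the helper names are made up for the example, and no target values are suggested because the article does not list the ones Qunar chose.

```python
import resource
from pathlib import Path

# The sysctl keys mentioned in the text.
KEYS = [
    "fs.file-max",
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
]


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys.

    Simple dot-to-slash mapping; keys containing interface names with
    dots would need special handling that this sketch does not attempt.
    """
    return Path(root) / key.replace(".", "/")


def read_sysctl(key, root="/proc/sys"):
    """Return the current value as a string, or None if it is unavailable."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        return None


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def report():
    soft, hard = nofile_limits()
    print(f"open files: soft={soft} hard={hard}")
    for key in KEYS:
        print(f"{key} = {read_sysctl(key)}")


if __name__ == "__main__":
    report()
```

Each long-lived TCP connection holds one file descriptor, so the open-file limits bound how many clients a single server process can keep connected.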
TCP Stack Tuning
Configure backlog sizes, port ranges, socket buffer sizes, connection tracking limits, and TIME‑WAIT handling to improve throughput and reduce latency.
6. Summary
By implementing core IM capabilities and a series of extensions, Qunar achieved a stable, high‑performance messaging platform that provides reliable long‑lived TCP connections, unified authentication, and efficient message publishing/subscription for downstream services.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.