Design and Architecture of a High‑Availability Instant Messaging System
This article explains the overall architecture, data structures, login and messaging flows, common real‑time, reliability and consistency challenges, and high‑availability and high‑concurrency techniques used in a production instant‑messaging service.
Overview
The article uses the Zhuanzhuan IM architecture as a case study to introduce IM components, their relationships, data flow during login and message sending, common problems, and practical tips for achieving high availability and high concurrency.
1. Architecture
1.1 Application Layer
Upstream business services that use the IM service, including iOS/Android apps, mini‑programs/PC/web pages, push services, and other business systems.
1.2 Access Layer
TCP entry – long‑connection maintenance, session management, protocol parsing.
HTTP entry – long‑polling that provides the same session functions for clients that cannot hold a persistent TCP connection.
MQ – receives system messages such as e‑commerce promotions and smooths traffic spikes.
RPC server – queries user chat data and sends real‑time system messages.
1.3 Logic Layer
logic – core service handling login information, online/offline messages, and push management.
ext‑logic – extended service for sub‑account push, login statistics, system message management, etc.
1.4 Data Layer
MySQL – stores contacts, messages, and system messages.
Redis – stores login information and related transient data.
2. Data Flow
2.1 Scenario
Illustrates a conversation between user A (uid = 1) and user B (uid = 2) with accompanying diagrams.
2.2 Data Structures (simplified)
2.2.1 Login Information
key: uid
value: {entryIp:"127.0.0.1", entryPort:5000, loginTime:23443233}
2.2.2 Contacts
A table with columns: id, uid_a, uid_b, recent_msg_content, recent_read_time, is_del. The recent_msg_content stores the latest message for display in the contact list, and recent_read_time is used to determine read status.
2.2.3 Messages
A table with columns: id, big_uid, small_uid, msg_content, create_time, client_msg_id, direction. client_msg_id provides client‑side idempotency, and direction indicates which user sent the message.
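The two tables can be sketched with sqlite3. The article gives only column names, so the types, the soft-delete semantics of is_del, and the direction encoding are assumptions; the inserted rows mirror the article's "hello world" example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (
    id INTEGER PRIMARY KEY,
    uid_a INTEGER,              -- owner of this contact-list row
    uid_b INTEGER,              -- the peer
    recent_msg_content TEXT,    -- latest message, shown in the contact list
    recent_read_time INTEGER,   -- used to compute read/unread status
    is_del INTEGER DEFAULT 0    -- soft-delete flag (assumed)
);
CREATE TABLE message (
    id INTEGER PRIMARY KEY,
    big_uid INTEGER,            -- the two participants in a fixed order,
    small_uid INTEGER,          -- so one row serves both directions
    msg_content TEXT,
    create_time INTEGER,
    client_msg_id INTEGER,      -- client-generated id for idempotency
    direction INTEGER           -- which side of the pair sent it (encoding assumed)
);
""")

# One message from uid 1 to uid 2 produces two contact rows (one per user)
# but a single message row keyed by the (big_uid, small_uid) pair.
conn.executemany(
    "INSERT INTO contact (uid_a, uid_b, recent_msg_content, recent_read_time) "
    "VALUES (?, ?, ?, ?)",
    [(1, 2, "hello world", 1), (2, 1, "hello world", 1)],
)
conn.execute(
    "INSERT INTO message (big_uid, small_uid, msg_content, create_time, "
    "client_msg_id, direction) VALUES (1, 2, 'hello world', 1, 1, 0)"
)
```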
2.3 Main Processes
2.3.1 Login
Connection: app connects to entry via VIP.
Forwarding: entry forwards login info to logic, which obtains the uid and manages the connection.
Persistence: logic records login info in Redis.
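The persistence step can be sketched as below. A plain dict stands in for Redis here (a production logic service would SET/GET against a real Redis cluster); the field names come from the sample entry, the function names are hypothetical.

```python
import json

# Stand-in for Redis: uid -> serialized login info.
login_store = {}

def on_login(uid, entry_ip, entry_port, login_time):
    """Record which entry node holds this user's long connection."""
    login_store[str(uid)] = json.dumps(
        {"entryIp": entry_ip, "entryPort": entry_port, "loginTime": login_time}
    )

def lookup_entry(uid):
    """Used when pushing a message: find the receiver's entry node."""
    raw = login_store.get(str(uid))
    return json.loads(raw) if raw else None  # None means the user is offline

on_login(1, "127.0.0.1", 5000, 23443233)
```

Looking up a uid that has never logged in returns None, which is exactly the offline case that triggers the push/SMS fallback described later.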
Sample Redis entry:
key: 1
value: {entryIp:"127.0.0.1", entryPort:5000, loginTime:23443233}
2.3.2 Send Message
Send: user A sends text "hello world" through the long‑connection to entry.
Forward: entry forwards the text to logic.
Persist: logic updates the contact table (recent_msg_content) and inserts a new row into the message table.
Push: logic retrieves user B’s login entry from Redis; if offline, triggers push, WeChat, or SMS.
Delivery: user B receives the message.
Ack: client sends ack to entry.
Complete: logic receives the ack and cancels the retransmission timer; if no ack arrives before the timer fires, logic retransmits.
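The ack-and-retransmit behavior in the last two steps can be sketched as follows. The send and wait_for_ack callables are hypothetical stand-ins for the entry connection; the article describes only the behavior, not an API.

```python
MAX_RETRIES = 3

def send_with_retransmit(send, wait_for_ack, payload, timeout=0.01):
    """Send payload, retransmitting until an ack arrives or retries run out.

    Returns the attempt number on which the ack was received.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        send(payload)                  # push the message downstream
        if wait_for_ack(timeout):      # ack received: stop the retry timer
            return attempt
    raise TimeoutError("no ack after %d attempts" % MAX_RETRIES)

# Simulate a flaky link: the ack is lost twice, then arrives.
attempts = []
acks = iter([False, False, True])
delivered_on = send_with_retransmit(attempts.append, lambda t: next(acks), "hello world")
```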
Contact table example:

id | uid_a | uid_b | recent_msg_content | recent_read_time | is_del
1  | 1     | 2     | hello world        | 1                | 0
2  | 2     | 1     | hello world        | 1                | 0
Message table example:

id | big_uid | small_uid | msg_content | create_time | client_msg_id | direction
1  | 1       | 2         | hello world | 1           | 1             | 0
2.3.3 Data‑Related Questions
Why no sharding? TiDB handles scaling without explicit sharding.
Why two rows in the contact table per message? Storing a row per user enables efficient indexing on uid_a.
How to query a conversation from either side? A composite index on (big_uid, small_uid) allows a single row to serve both directions.
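The pair normalization behind that composite index can be sketched in a few lines. The max/min convention is an assumption consistent with the big_uid/small_uid column names.

```python
def pair_key(uid_x, uid_y):
    """Normalize a conversation's two uids into (big_uid, small_uid).

    Whichever side issues the query, the same key is produced, so a single
    message row under a composite index on (big_uid, small_uid) serves
    both directions of the conversation.
    """
    return (max(uid_x, uid_y), min(uid_x, uid_y))
```

For example, user A querying the chat with B and user B querying the chat with A both resolve to the same key and therefore the same index range.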
3. Common IM Issues
3.1 Real‑time
Real‑time delivery is achieved by using long‑connections and an epoll‑like model: entry maintains connections (epoll_create), manages sessions (epoll_ctl), and logic waits for messages and pushes them (epoll_wait).
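The epoll analogy can be illustrated with Python's selectors module, which wraps epoll on Linux. A socketpair stands in for one client long-connection; this is an illustration of the pattern, not the service's actual event loop.

```python
import selectors
import socket

# "maintains connections": create the poller (epoll_create analogue)
sel = selectors.DefaultSelector()

# one client long-connection, simulated with a connected socket pair
client, server_side = socket.socketpair()

# "manages sessions": register the connection for reads (epoll_ctl analogue)
sel.register(server_side, selectors.EVENT_READ)

client.send(b"hello world")

# "waits for messages": block until a connection is readable (epoll_wait analogue)
events = sel.select(timeout=1)
received = events[0][0].fileobj.recv(1024) if events else b""

sel.unregister(server_side)
client.close()
server_side.close()
```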
3.2 Reliability
Reliability is ensured through retransmission on failure and explicit ack confirmation. If any step fails, the client or server retries the operation.
3.3 Consistency
Duplicate messages can arise from retransmission. Using a unique client_msg_id (like an ID card) allows the client to deduplicate messages.
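Deduplication by client_msg_id can be sketched as below; keying on (sender, client_msg_id) is an assumption, since ids generated by different clients need not be globally unique.

```python
def make_receiver():
    """Deliver each message at most once, keyed by (sender_uid, client_msg_id)."""
    seen = set()
    delivered = []

    def receive(sender_uid, client_msg_id, content):
        key = (sender_uid, client_msg_id)
        if key in seen:           # retransmitted duplicate: drop silently
            return False
        seen.add(key)
        delivered.append(content)
        return True

    return receive, delivered

receive, delivered = make_receiver()
first = receive(1, 1, "hello world")   # original send
second = receive(1, 1, "hello world")  # retransmission of the same message
```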
4. High Availability & High Concurrency
Scaling: Docker‑based rapid scaling of the logic service based on metrics (CPU, QPS, JVM, SQL).
Circuit‑breaker: When non‑critical downstream services fail (e.g., sub‑account service), the IM service degrades gracefully instead of becoming unavailable.
Rate limiting: In traffic spikes, fast‑fail strategies limit incoming requests to protect the database and keep the service partially available.
The IM service therefore achieves high availability, flexible degradation, and partial availability depending on the situation.
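The fast-fail rate limiting above is commonly implemented as a token bucket; the article does not name the algorithm, so this is one plausible sketch of the idea.

```python
import time

class TokenBucket:
    """Fast-fail limiter: requests beyond the refill rate are rejected
    immediately instead of queuing, protecting the database during spikes."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # fast fail: caller returns a degraded reply

bucket = TokenBucket(rate=0, capacity=2)   # rate=0: no refill, for demonstration
results = [bucket.allow() for _ in range(3)]
```

With a burst capacity of 2 and no refill, the third request is rejected immediately rather than queued, which is the partial-availability behavior described above.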
5. Conclusion
The article provides a concise overview of IM system architecture, core business logic, typical challenges, and practical solutions for service governance, while noting many additional challenges such as unread counts, group chat, multi‑device login, and massive data storage choices.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.