
Design and Architecture of a High‑Availability Instant Messaging System

This article explains the overall architecture, data structures, login and messaging flows, common real‑time, reliability and consistency challenges, and high‑availability and high‑concurrency techniques used in a production instant‑messaging service.

Zhuanzhuan Tech

Overview

The article uses the Zhuanzhuan IM architecture as a case study to introduce IM components, their relationships, data flow during login and message sending, common problems, and practical tips for achieving high availability and high concurrency.

1. Architecture

1.1 Application Layer

Upstream business services that use the IM service, including iOS/Android apps, mini‑programs/PC/web pages, push services, and other business systems.

1.2 Access Layer

TCP entry – long‑connection maintenance, session management, protocol parsing.

HTTP entry – serves the same purposes over long‑polling for clients that cannot hold a persistent TCP connection.

MQ – receives system messages such as e‑commerce promotions and smooths traffic spikes.

RPC server – queries user chat data and sends real‑time system messages.

1.3 Logic Layer

logic – core service handling login information, online/offline messages, and push management.

ext‑logic – extended service for sub‑account push, login statistics, system message management, etc.

1.4 Data Layer

MySQL – stores contacts, messages, and system messages.

Redis – stores login information and related transient data.

2. Data Flow

2.1 Scenario

Illustrates a conversation between user A (uid = 1) and user B (uid = 2) with accompanying diagrams.

2.2 Data Structures (simplified)

2.2.1 Login Information

key: uid
value: {entryIp:"127.0.0.1", entryPort:5000, loginTime:23443233}

2.2.2 Contacts

A table with columns: id, uid_a, uid_b, recent_msg_content, recent_read_time, is_del. The recent_msg_content stores the latest message for display in the contact list, and recent_read_time is used to determine read status.

2.2.3 Messages

A table with columns: id, big_uid, small_uid, msg_content, create_time, client_msg_id, direction. client_msg_id provides client‑side idempotency, and direction indicates which user sent the message.
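The (big_uid, small_uid, direction) scheme can be sketched as follows. The function name and the exact direction encoding are illustrative assumptions; the article only says direction records which user sent the message.

```python
def msg_row(sender_uid, receiver_uid, content, client_msg_id, create_time):
    """Normalize a message into the (big_uid, small_uid) form used by the
    message table, so both directions of a conversation share one key
    (assuming big_uid is the larger of the two uids)."""
    big_uid = max(sender_uid, receiver_uid)
    small_uid = min(sender_uid, receiver_uid)
    # Assumed encoding: 0 when the bigger uid sent the message, 1 otherwise.
    direction = 0 if sender_uid == big_uid else 1
    return {
        "big_uid": big_uid,
        "small_uid": small_uid,
        "msg_content": content,
        "create_time": create_time,
        "client_msg_id": client_msg_id,
        "direction": direction,
    }

# A (uid 1) sends "hello world" to B (uid 2)
row = msg_row(1, 2, "hello world", client_msg_id=1, create_time=1)
```

Because the pair is normalized, a retransmitted or reverse-direction lookup always resolves to the same conversation key.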

2.3 Main Processes

2.3.1 Login

Connection: app connects to entry via VIP.

Forwarding: entry forwards login info to logic, which obtains the uid and manages the connection.

Persistence: logic records login info in Redis.

Sample Redis entry:

key:1
value:{entryIp:"127.0.0.1", entryPort:5000, loginTime:23443233}
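The login-persistence step can be sketched with a plain dict standing in for Redis. The field names follow the article's sample entry; the class and method names are illustrative.

```python
import json
import time

class LoginStore:
    """Step 3 of login: logic records which entry node holds the user's
    long connection, keyed by uid (a dict stands in for Redis)."""
    def __init__(self):
        self._redis = {}  # stand-in for a Redis key/value store

    def record_login(self, uid, entry_ip, entry_port):
        self._redis[uid] = json.dumps({
            "entryIp": entry_ip,
            "entryPort": entry_port,
            "loginTime": int(time.time()),
        })

    def lookup(self, uid):
        # Returns None when the user has no login record, i.e. is offline.
        raw = self._redis.get(uid)
        return json.loads(raw) if raw is not None else None

store = LoginStore()
store.record_login(1, "127.0.0.1", 5000)
info = store.lookup(1)     # user A's entry node
offline = store.lookup(2)  # user B has not logged in yet
```

This lookup is exactly what the push step of message delivery uses to decide between real-time delivery and offline push.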

2.3.2 Send Message

Send: user A sends text "hello world" through the long‑connection to entry.

Forward: entry forwards the text to logic.

Persist: logic updates the contact table (recent_msg_content) and inserts a new row into the message table.

Push: logic retrieves user B’s login entry from Redis; if offline, triggers push, WeChat, or SMS.

Delivery: user B receives the message.

Ack: client sends ack to entry.

Complete: logic receives ack and cancels the retransmission timer; if no ack, logic retries.
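The ack-and-retransmit loop above can be sketched as follows. The retry count and the synchronous shape are assumptions for illustration; the real service cancels a per-message timer when the ack arrives.

```python
def deliver_with_retry(send, max_retries=3):
    """Server-side retransmission: resend until the client acks or
    retries are exhausted. `send` returns True when an ack came back."""
    for attempt in range(1, max_retries + 1):
        if send(attempt):
            return attempt  # ack received: stop retransmitting
    return None  # give up; the message stays persisted for a later pull

# Simulate a client whose ack only arrives on the second attempt.
acks = iter([False, True])
attempts = deliver_with_retry(lambda n: next(acks))

# A client that never acks exhausts the retries.
failed = deliver_with_retry(lambda n: False)
```

Note that retransmission is what makes deduplication (section 3.3) necessary: the client may receive the same message more than once.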

Contact table example:

id | uid_a | uid_b | recent_msg_content | recent_read_time | is_del
--- | --- | --- | --- | --- | ---
1 | 1 | 2 | hello world | 1 | 0
2 | 2 | 1 | hello world | 1 | 0

Message table example:

id | big_uid | small_uid | msg_content | create_time | client_msg_id | direction
--- | --- | --- | --- | --- | --- | ---
1 | 1 | 2 | hello world | 1 | 1 | 0

2.3.3 Data‑Related Questions

Why no database sharding? The service runs on TiDB, which scales horizontally without application‑level sharding.

Why two rows in the contact table per conversation? Storing one row per user lets each user's contact list be served by a simple index on uid_a.

How to query a conversation from either side? A composite index on (big_uid, small_uid) allows a single row to serve both directions.
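The single-row, both-directions query can be demonstrated with an in-memory SQLite database standing in for TiDB/MySQL. The schema follows the article's message table, the composite index mirrors (big_uid, small_uid), and it is assumed that big_uid holds the larger of the two uids.

```python
import sqlite3

# In-memory SQLite stands in for TiDB/MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE message (
        id INTEGER PRIMARY KEY,
        big_uid INTEGER, small_uid INTEGER,
        msg_content TEXT, create_time INTEGER,
        client_msg_id INTEGER, direction INTEGER
    )""")
conn.execute("CREATE INDEX idx_pair ON message (big_uid, small_uid)")
# A (uid 1) -> B (uid 2): stored once under the normalized pair.
conn.execute("INSERT INTO message VALUES (1, 2, 1, 'hello world', 1, 1, 1)")

def conversation(uid_x, uid_y):
    """Either party can query with the same normalized key."""
    big, small = max(uid_x, uid_y), min(uid_x, uid_y)
    return conn.execute(
        "SELECT msg_content FROM message WHERE big_uid = ? AND small_uid = ?",
        (big, small)).fetchall()

hits_ab = conversation(1, 2)  # A asking for the chat with B
hits_ba = conversation(2, 1)  # B asking for the chat with A
```

Both calls hit the same index prefix and the same row, which is why the message table does not need one row per direction.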

3. Common IM Issues

3.1 Real‑time

Real‑time delivery is achieved by using long‑connections and an epoll‑like model: entry maintains connections (epoll_create), manages sessions (epoll_ctl), and logic waits for messages and pushes them (epoll_wait).
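The epoll analogy can be sketched with Python's selectors module, which uses epoll on Linux. The miniature echo-style loop below is illustrative, not the production entry service.

```python
import selectors
import socket

# entry's role in miniature: one selector watches many long connections.
sel = selectors.DefaultSelector()        # ~ epoll_create

def accept(server):
    conn, _ = server.accept()            # new long connection from a client
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read)  # ~ epoll_ctl: track session

def read(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)               # push the message straight back
    else:
        sel.unregister(conn)             # session ended
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

def run_once(timeout=1.0):
    for key, _ in sel.select(timeout):   # ~ epoll_wait: block until ready
        key.data(key.fileobj)

# One client round trip through the loop:
port = server.getsockname()[1]
client = socket.create_connection(("127.0.0.1", port))
run_once()                  # accept: register the new session
client.sendall(b"hello")
run_once()                  # read: echo the message back
echo = client.recv(4096)
client.close()
```

A single thread multiplexing thousands of idle connections this way is what makes long-connection delivery cheap enough for real-time IM.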

3.2 Reliability

Reliability is ensured through retransmission on failure and explicit ack confirmation. If any step fails, the client or server retries the operation.

3.3 Consistency

Duplicate messages can arise from retransmission. Using a unique client_msg_id (like an ID card) allows the client to deduplicate messages.
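Client-side deduplication by client_msg_id can be sketched as follows; the class name is illustrative.

```python
class DedupInbox:
    """Client-side dedup: client_msg_id acts like an ID card, so a
    retransmitted copy of an already-seen message is dropped."""
    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, client_msg_id, content):
        if client_msg_id in self.seen:
            return False  # duplicate from a retransmission; ignore it
        self.seen.add(client_msg_id)
        self.messages.append(content)
        return True

inbox = DedupInbox()
first = inbox.receive(1, "hello world")
dup = inbox.receive(1, "hello world")  # retransmitted copy, same id
```

The same id also gives the server idempotent inserts: writing the retransmitted message again under an existing client_msg_id can be detected and skipped.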

4. High Availability & High Concurrency

Scaling: the logic service scales out rapidly on Docker, driven by metrics such as CPU, QPS, JVM health, and SQL latency.

Circuit‑breaker: When non‑critical downstream services fail (e.g., sub‑account service), the IM service degrades gracefully instead of becoming unavailable.

Rate limiting: In traffic spikes, fast‑fail strategies limit incoming requests to protect the database and keep the service partially available.
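The fast-fail strategy can be sketched as a token bucket; the parameters and class are illustrative, not the service's actual limiter.

```python
import time

class TokenBucket:
    """Fast-fail rate limiter: requests beyond the bucket's capacity are
    rejected immediately instead of queuing, protecting the database
    during spikes while keeping the service partially available."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # fast fail: caller returns an error, nothing blocks

bucket = TokenBucket(rate=10, capacity=2)
results = [bucket.allow() for _ in range(5)]  # a burst of 5 instant requests
```

Only the first requests within the burst budget succeed; the rest fail immediately, so the database never sees the full spike.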

The IM service therefore achieves high availability, flexible degradation, and partial availability depending on the situation.

5. Conclusion

The article provides a concise overview of IM system architecture, core business logic, typical challenges, and practical solutions for service governance, while noting many additional challenges such as unread counts, group chat, multi‑device login, and massive data storage choices.

Tags: backend, architecture, scalability, High Availability, IM, Data Flow
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
