
How to Ensure Reliability, Ordering, and Security in Billion‑User IM Systems

This article explores the key challenges of building a large‑scale instant‑messaging service (message reliability, ordering, read‑sync, data security, avalanche effects, and weak‑network handling) and presents practical architectural and algorithmic solutions for each problem.

Architecture & Thinking

1. Introduction

This article builds on the outlines of Deng Yunzhe’s "Large‑Scale Concurrent IM Service Architecture Design" and "IM Weak‑Network Scenario Optimization" (see references at the end). It focuses on several crucial topics of a billion‑user IM architecture such as message reliability, ordering, data security, and weak‑network issues.

2. Series Articles

The content is split into two parts. This is the second part, which examines in detail the most important and most frequently discussed issues in IM architecture.

3. Message Reliability Issues

Reliability is a core metric for any IM system; users must trust that their messages will not be lost. From a product perspective, a lack of reliability leads to rapid user churn. The reliability solution consists of two logical parts:

Uplink message reliability

Downlink message reliability

Uplink reliability: The client assigns a local ID to each message and waits for a server acknowledgment (PIMSendAck). If the ACK does not arrive within a timeout, the SDK resends the message.
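The uplink flow can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the class name UplinkSender, the transport callback, and the retry budget max_attempts are all assumptions. Only the local ID, the ACK, and the retry-on-timeout behavior come from the text.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PendingMessage:
    local_id: int
    payload: str
    attempts: int = 0


class UplinkSender:
    """Tracks unacknowledged messages by local ID and resends them on timeout."""

    def __init__(self, transport: Callable[[int, str], None], max_attempts: int = 3):
        self._transport = transport        # hypothetical "write to socket" callback
        self._max_attempts = max_attempts  # assumed retry budget
        self._ids = itertools.count(1)     # local ID generator
        self.pending: Dict[int, PendingMessage] = {}
        self.failed: Dict[int, PendingMessage] = {}

    def send(self, payload: str) -> int:
        msg = PendingMessage(next(self._ids), payload)
        self.pending[msg.local_id] = msg
        self._transmit(msg)
        return msg.local_id

    def on_ack(self, local_id: int) -> None:
        # The server acknowledged the message (PIMSendAck): stop tracking it.
        self.pending.pop(local_id, None)

    def on_timeout(self, local_id: int) -> None:
        # Called by a timer when no ACK arrived in time.
        msg = self.pending.get(local_id)
        if msg is None:
            return  # already acknowledged, nothing to do
        if msg.attempts >= self._max_attempts:
            self.failed[local_id] = self.pending.pop(local_id)
        else:
            self._transmit(msg)

    def _transmit(self, msg: PendingMessage) -> None:
        msg.attempts += 1
        self._transport(msg.local_id, msg.payload)
```

Because a retry may reach the server after the original was already stored (only the ACK was lost), the server would need to deduplicate on the sender and local ID; that part is not shown here.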

Downlink reliability: When the server pushes a message to multiple recipients, it must cache the push request. The message is written to each recipient’s offline‑message list, and the entry is removed only after that recipient's client acknowledges receipt. This ensures both real‑time and offline message reliability.
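A rough sketch of the downlink side, under the assumption that the offline list is a per-user map keyed by message ID. DownlinkDispatcher, the push callback, and the in-memory dictionaries are hypothetical stand-ins for a real connection layer and a persistent store.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


class DownlinkDispatcher:
    """Stores every message in the recipient's offline list until it is ACKed."""

    def __init__(self, push: Callable[[str, str, str], None]):
        self._push = push  # hypothetical push(recipient, msg_id, body)
        self.offline: Dict[str, Dict[str, str]] = defaultdict(dict)
        self.online: Set[str] = set()

    def deliver(self, msg_id: str, body: str, recipients: Iterable[str]) -> None:
        for user in recipients:
            # Write to the offline list first, so a lost push is never a lost message.
            self.offline[user][msg_id] = body
            if user in self.online:
                self._push(user, msg_id, body)

    def on_client_ack(self, user: str, msg_id: str) -> None:
        # The client confirmed receipt: the entry can be removed.
        self.offline[user].pop(msg_id, None)

    def on_reconnect(self, user: str) -> None:
        # Replay everything that is still unacknowledged.
        self.online.add(user)
        for msg_id, body in list(self.offline[user].items()):
            self._push(user, msg_id, body)
```

Writing to the offline list before pushing means the same code path covers both cases: an online user who misses a push and an offline user who reconnects later.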

Further reading:

"IM Message Delivery Guarantee (Part 1): Reliable Real‑Time Delivery"

"IM Message Delivery Guarantee (Part 2): Reliable Offline Delivery"

4. Message Ordering Issues

Distributed IM systems face ordering challenges because client and server clocks may diverge, leading to out‑of‑order delivery. The proposed solution includes:

Server‑time alignment (handled by operations)

Client‑side time calibration against the server

Including both local and server timestamps in each message and applying an interpolation sort: messages from the same sender are ordered by local time, while messages from different senders are ordered by server time.
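The article does not spell out the interpolation sort, so the following is one possible reading of it. Server time decides which sender occupies each position in the timeline; each sender's own messages are then placed into that sender's positions in local-time order. The Message fields and the function name are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    sender: str
    local_ts: int   # sender's (calibrated) clock when the message was composed
    server_ts: int  # server clock when the message was accepted
    text: str


def interpolation_sort(messages: List[Message]) -> List[Message]:
    # 1. Server time fixes the interleaving between different senders.
    by_server = sorted(messages, key=lambda m: m.server_ts)

    # 2. Group each sender's messages and order them by local time.
    per_sender = defaultdict(list)
    for m in by_server:
        per_sender[m.sender].append(m)
    queues = {
        sender: iter(sorted(msgs, key=lambda m: m.local_ts))
        for sender, msgs in per_sender.items()
    }

    # 3. Fill each sender's slots with that sender's messages in local order.
    return [next(queues[m.sender]) for m in by_server]
```

For example, if a sender's first message is retried and reaches the server after their second one, the two swap back into composition order while messages from other senders keep their server-time positions.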

Additional resources on message‑ID ordering algorithms are also suggested.
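The client-side calibration step listed above is not detailed in the article. A common approach, shown here as an assumption rather than the author's method, is an NTP-style estimate. The client records its own clock when a time request leaves (t0) and when the reply arrives (t1), and assumes the server stamped the reply halfway through the round trip.

```python
from typing import Iterable, Tuple

Sample = Tuple[float, float, float]  # (t0 client send, server time, t1 client receive)


def estimate_offset(t0: float, server_ts: float, t1: float) -> float:
    """Offset to add to the local clock, assuming symmetric network delay."""
    return server_ts - (t0 + t1) / 2


def best_offset(samples: Iterable[Sample]) -> float:
    """Use the sample with the smallest round trip, which has the least delay error."""
    t0, server_ts, t1 = min(samples, key=lambda s: s[2] - s[0])
    return estimate_offset(t0, server_ts, t1)


def calibrated_now(local_now: float, offset: float) -> float:
    """Local clock reading translated into approximate server time."""
    return local_now + offset
```

The estimate is only as good as the symmetric-delay assumption, which is why the sketch prefers the sample with the shortest round trip.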

5. Message Read‑Sync Issues

Read‑receipt functionality becomes complex when a user is logged in on multiple devices. The synchronization logic relies on two mechanisms:

Maintain a timestamp per session indicating the last read message.

When a session becomes active on one device (the user opens it and reads the messages), broadcast a PIMSyncRead message to the user's other devices.
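The two mechanisms can be combined in a small server-side sketch. The class ReadSyncServer, the send callback, and the payload field names are hypothetical; only the per-session last-read timestamp and the PIMSyncRead message type come from the text.

```python
from collections import defaultdict
from typing import Callable, Dict, Set, Tuple


class ReadSyncServer:
    """Keeps one last-read timestamp per (user, session) and fans out updates."""

    def __init__(self, send: Callable[[str, dict], None]):
        self._send = send  # hypothetical send(device_id, message)
        self.devices: Dict[str, Set[str]] = defaultdict(set)  # user -> device IDs
        self.last_read: Dict[Tuple[str, str], int] = {}

    def mark_read(self, user: str, from_device: str, session_id: str, read_ts: int) -> bool:
        key = (user, session_id)
        if read_ts <= self.last_read.get(key, 0):
            return False  # stale or duplicate report: the read position never moves back
        self.last_read[key] = read_ts
        # Notify every other device of the same user.
        for device in sorted(self.devices[user] - {from_device}):
            self._send(device, {
                "type": "PIMSyncRead",
                "session": session_id,  # assumed field name
                "read_ts": read_ts,     # assumed field name
            })
        return True
```

Treating the timestamp as monotonic keeps late reports from an older device from marking already-read messages as unread again.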

6. Data Security Issues

6.1 Basic

IM security involves both communication security (socket long‑connections and HTTP short‑connections) and content security. Balancing security, performance, traffic, and user experience is challenging.

6.2 Communication Security

Typical IM services consist of:

Socket long‑connection services (TCP/UDP)

HTTP short‑connection services (REST APIs)

Recommended reading includes articles on TLS 1.3‑based MMTLS, combination encryption algorithms, and HTTPS fundamentals.

6.3 Content Security

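Cryptography provides encryption, authentication, and identification. End‑to‑end encryption (E2EE) is essential for protecting message content, as exemplified by Telegram's Secret Chats (Telegram's default cloud chats are encrypted between client and server rather than end to end).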

Further reading:

"Mobile End‑to‑End Encryption (E2EE) Technical Details"

"Real‑Time Audio/Video Chat E2EE Working Principles"

7. Avalanche Effect Issue

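In a distributed IM architecture, a failure in one data center can cause a cascade of overloads in other centers: the disconnected clients all reconnect at roughly the same time, and the surviving centers must absorb the surge on top of their normal load. Mitigation strategies include server‑side rate limiting and client‑side reconnection back‑off or load‑balancer‑assisted server selection.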
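Two of these mitigations can be sketched briefly. The article does not say which back-off or rate-limiting algorithm is used, so exponential back-off with full jitter and a token bucket are assumptions chosen here as common options. All names and default values are illustrative.

```python
import random
from typing import Callable


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Callable[[], float] = random.random) -> float:
    """Client side: random delay below min(cap, base * 2**attempt) seconds.

    The randomness spreads reconnects out so clients do not retry in lockstep.
    """
    return rng() * min(cap, base * (2 ** attempt))


class TokenBucket:
    """Server side: admits at most `rate` new connections per second on average."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject; the client is expected to back off and retry
```

The two pieces are meant to work together: the bucket protects the surviving servers from the initial surge, and the jittered delay keeps rejected clients from returning all at once.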

8. Weak‑Network Issues

8.1 Causes of Weak Networks

Mobile IM frequently encounters weak‑network scenarios (elevators, trains, cars, subways) due to signal fluctuation, interference, uneven base‑station distribution, and high mobility.

8.2 IM Handling of Weak Networks

The core handling consists of:

Automatic message retransmission

Offline message reception

Resend ordering

Offline command processing

8.3 Automatic Message Retransmission

Clients should maintain a state machine for each message (initial, sending, failed, timeout) and automatically retry a few times before notifying the user of failure.
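A minimal sketch of such a per-message state machine follows. The four states named in the text are used as given. The SENT state, the transition rules, and the default of three retries are assumptions added to make the example complete.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    SENDING = "sending"
    TIMEOUT = "timeout"
    FAILED = "failed"
    SENT = "sent"  # assumption: terminal success state, not listed in the article


class OutgoingMessage:
    def __init__(self, text: str, max_retries: int = 3):
        self.text = text
        self.max_retries = max_retries  # assumed value for "a few times"
        self.retries = 0
        self.state = State.INITIAL

    def start(self) -> None:
        if self.state is not State.INITIAL:
            raise ValueError("message was already started")
        self.state = State.SENDING

    def on_ack(self) -> None:
        if self.state is State.SENDING:
            self.state = State.SENT

    def on_timeout(self) -> State:
        """Retry silently while the budget lasts, then fail so the UI can notify the user."""
        if self.state is not State.SENDING:
            return self.state  # late timer for a finished message: ignore
        self.state = State.TIMEOUT
        if self.retries < self.max_retries:
            self.retries += 1
            self.state = State.SENDING  # automatic resend
        else:
            self.state = State.FAILED   # surface the failure to the user
        return self.state
```

With max_retries=2, the first two timeouts return State.SENDING and the third returns State.FAILED, which is the point at which the user would see a resend prompt.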

8.4 Offline Message Reception

Detecting offline status can be done via long‑connection heartbeat loss, repeated request failures, or device network‑status APIs. Once connectivity is restored, the client pulls missed messages from the server’s offline queue.
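The heartbeat-based variant can be sketched like this. OfflineSync, the fetch_after callback, and the sequence-number cursor are assumptions; the article only states that the client pulls missed messages from the server's offline queue once it is back online.

```python
from typing import Callable, List, Tuple


class OfflineSync:
    """Marks the client offline after missed heartbeats and pulls on recovery."""

    def __init__(self, fetch_after: Callable[[int], List[Tuple[int, str]]],
                 heartbeat_interval: float = 30.0, max_missed: int = 3):
        self._fetch_after = fetch_after  # hypothetical: fetch_after(seq) -> [(seq, body)]
        self._deadline = heartbeat_interval * max_missed
        self.last_heartbeat = 0.0
        self.last_seq = 0  # highest message sequence number already received
        self.online = True
        self.inbox: List[str] = []

    def tick(self, now: float) -> bool:
        # Periodic check: too long without a heartbeat reply means offline.
        if self.online and now - self.last_heartbeat > self._deadline:
            self.online = False
        return self.online

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now
        if not self.online:
            self.online = True
            self._pull()  # connectivity restored: catch up

    def _pull(self) -> None:
        for seq, body in self._fetch_after(self.last_seq):
            if seq > self.last_seq:  # skip anything already delivered
                self.inbox.append(body)
                self.last_seq = seq
```

Pulling by cursor rather than by "everything in the queue" makes the catch-up step safe to repeat if the connection drops again halfway through.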

8.5 Resend Message Ordering

When a message is retried after a network glitch, the final receive order should follow the interpolation sort described in section 4 (local time for the same sender, server time for different senders).

8.6 Offline Command Processing

Operations performed while offline (e.g., deleting a contact) must be queued and synchronized with the server once the network recovers.
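One way to queue and replay offline operations is sketched below. The class name, the send callback, and the command shape are hypothetical. The contact-deletion example is the one from the text.

```python
from collections import deque
from typing import Callable, Deque


class OfflineCommandQueue:
    """Buffers user operations while offline and replays them in order."""

    def __init__(self, send: Callable[[dict], bool]):
        self._send = send  # hypothetical: returns True if the server accepted the command
        self._queue: Deque[dict] = deque()
        self.online = False

    def submit(self, command: dict) -> None:
        # Send directly only when online and nothing older is still waiting,
        # so commands always reach the server in the order the user issued them.
        if self.online and not self._queue and self._send(command):
            return
        self._queue.append(command)

    def on_network_restored(self) -> int:
        """Replay queued commands; returns how many are still waiting."""
        self.online = True
        while self._queue:
            if not self._send(self._queue[0]):
                self.online = False  # connection dropped again: keep the rest queued
                break
            self._queue.popleft()
        return len(self._queue)
```

For example, a submit of {"op": "delete_contact", "id": "bob"} made in an elevator stays in the queue and is sent by on_network_restored once the connection returns. A production client would also persist the queue so that it survives an app restart.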

8.7 Summary

Weak‑network handling for IM is relatively straightforward: automatic retries combined with proper message state tracking solve most problems. More complex scenarios, such as video conferencing under high packet loss, require additional techniques.

9. Article Summary

The two‑part series on large‑scale IM architecture covers overall design, service splitting, and deep dives into reliability, ordering, read‑sync, security, avalanche effects, and weak‑network optimization. Beginners are encouraged to read the curated “from zero to IM” guide for a systematic learning path.

10. References

Large‑Scale Concurrent IM Service Architecture Design

IM Weak‑Network Scenario Optimization

Zero‑Basis IM Development Intro (3): What Is IM Reliability?

IM Message Delivery Guarantee (Part 1): Reliable Real‑Time Delivery

IM Message Delivery Guarantee (Part 2): Reliable Offline Delivery

Instant Messaging Security (Part 2): Combined Encryption Algorithms in IM

WeChat Next‑Gen Communication Security: MMTLS Based on TLS 1.3

Zero‑Basis Mobile IM Development Guide

Written by

Architecture & Thinking
