
Optimizing Online IM Channel Stability in Xianyu

Xianyu improved its online IM channel stability by adding heartbeat‑based connection monitoring, rapid reconnection, ACK‑driven retry queues stored in Alibaba Cloud Table Store, and adaptive delay strategies, which together cut ACC compensation arrival time by 75% (from 60 to 15 minutes) and dramatically reduced user complaints.

Xianyu Technology

Background: IM messages are a critical communication tool for Xianyu users. The core goals are to prevent message loss and ensure timely delivery. More than half of daily IM traffic goes through the online push channel, so its stability directly affects user experience.

Problems: The online channel faces two main issues – long‑connection interruptions and messages that fail to reach the client after being pushed.

Long‑connection interruption: Xianyu uses ACCS, a full‑duplex, low‑latency TCP long‑connection service. Network anomalies (disconnections, network switches, weak signals, NAT timeouts) can break the connection and prevent timely message delivery, so detecting a break and reconnecting quickly is essential.

Undelivered messages: Even with reconnection, a pushed message can fail to reach the client because the channel breaks during transmission, the server’s view of the device’s online status lags, or the client fails while processing the message (e.g., a storage error). The ACCS downlink success rate is about 97%; the remaining 3% of messages are not lost permanently, but they may arrive late.

Compensation metric: A metric called “ACC compensation arrival time” measures the interval from a successful ACCS downlink to the client pulling the message. Historically this was ~60 minutes; after optimization it dropped to ~15 minutes.

Long‑connection reconnection strategies: Connections break because of device network loss, network switching, weak networks, and NAT timeout. Heartbeat detection is used to notice a break: the client sends lightweight heartbeat packets and considers the channel healthy when it receives the server’s ACK.
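The heartbeat check described above can be sketched as a small health tracker. This is a minimal illustration, not Xianyu's or ACCS's implementation: the class name `HeartbeatMonitor`, the 30-second interval, and the "two missed beats" threshold are all assumptions made for the example.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks heartbeat ACKs and reports when the channel looks broken.

    All names and defaults here are hypothetical; the article does not
    give concrete intervals or thresholds.
    """

    def __init__(self, interval_s: float = 30.0, max_missed: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval_s = interval_s
        self.max_missed = max_missed
        self._clock = clock
        self._last_ack = clock()

    def on_ack(self) -> None:
        """Call when the server ACKs a heartbeat packet."""
        self._last_ack = self._clock()

    def missed_beats(self) -> int:
        """Number of full heartbeat intervals elapsed since the last ACK."""
        return int((self._clock() - self._last_ack) // self.interval_s)

    def is_healthy(self) -> bool:
        """The channel counts as healthy until max_missed intervals pass without an ACK."""
        return self.missed_beats() < self.max_missed
```

A client would call `on_ack()` from its receive path and trigger a reconnect as soon as `is_healthy()` returns `False`.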

Heartbeat strategies: Several heartbeat strategies are considered: short‑burst detection, fixed‑interval heartbeats (with the interval adjustable by app state), adaptive heartbeats that react to network conditions, and redundant heartbeats triggered when the app moves from background to foreground.
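As a rough sketch of how three of these strategies might fit together, the functions below pick a base interval from the app state, adjust it after each heartbeat result, and send an extra heartbeat on return to the foreground. The constants and the "lengthen on success, halve on failure" rule are assumptions chosen for illustration; the article does not specify the actual policy.

```python
from typing import Callable

# Hypothetical tuning values; the article does not state real ones.
FOREGROUND_S = 30.0
BACKGROUND_S = 120.0
MIN_S = 10.0
MAX_S = 270.0


def base_interval(app_state: str) -> float:
    """Fixed-interval strategy: a shorter interval while the app is in the foreground."""
    return FOREGROUND_S if app_state == "foreground" else BACKGROUND_S


def adapt_interval(current_s: float, ack_received: bool) -> float:
    """Adaptive strategy (assumed rule): back off while the network is
    healthy, probe more often after a missed ACK."""
    if ack_received:
        return min(current_s * 1.5, MAX_S)
    return max(current_s / 2, MIN_S)


def on_enter_foreground(send_heartbeat: Callable[[], None]) -> float:
    """Redundant heartbeat: probe immediately when the app returns to the
    foreground, then fall back to the foreground interval."""
    send_heartbeat()
    return FOREGROUND_S
```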

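Message ACK and retransmission: When the server pushes a message over ACCS, it places the message in a retry queue. After the client has received and processed the message, it sends an ACK, and the server removes the message from the queue. Messages that are never ACKed remain in the queue and are retransmitted.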
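The ACK bookkeeping can be illustrated with an in-memory queue. This is only a sketch under simplifying assumptions: `RetryQueue` and its method names are hypothetical, and the real queue is persisted rather than held in a process-local dictionary.

```python
from typing import Dict, List


class RetryQueue:
    """In-memory stand-in for the server-side retry queue (hypothetical API)."""

    def __init__(self) -> None:
        self._pending: Dict[str, bytes] = {}

    def on_sent(self, msg_id: str, payload: bytes) -> None:
        """Record a message as awaiting ACK right after it is pushed."""
        self._pending[msg_id] = payload

    def on_ack(self, msg_id: str) -> bool:
        """Drop the message once the client ACKs it.

        Returns False for unknown or duplicate ACKs, which are ignored.
        """
        if msg_id in self._pending:
            del self._pending[msg_id]
            return True
        return False

    def unacked(self) -> List[str]:
        """IDs of messages that are still candidates for retransmission."""
        return list(self._pending)
```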

Retry‑queue storage design: Alibaba Cloud Table Store’s Timeline model is used. Each device has a timeline identified by userId_deviceId with a sequence ID representing the message offset. Successful downlinks insert a record; ACKs update the status. A short TTL prevents data bloat.
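The data layout can be pictured with a small in-memory model. This sketch deliberately does not call the Table Store SDK; `TimelineStore`, the status strings, and the 60-second TTL used in the usage example are assumptions, and only the `userId_deviceId` key and the per-timeline sequence ID come from the description above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Record:
    payload: str
    status: str        # "pending" until the client ACKs, then "acked"
    expires_at: float  # absolute time after which the row is treated as gone


class TimelineStore:
    """In-memory model of one timeline per device (not the real SDK)."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.ttl_s = ttl_s
        self._clock = clock
        self._rows: Dict[str, Dict[int, Record]] = {}
        self._next_seq: Dict[str, int] = {}

    @staticmethod
    def timeline_id(user_id: str, device_id: str) -> str:
        return f"{user_id}_{device_id}"

    def append(self, tid: str, payload: str) -> int:
        """Insert a record after a successful downlink; returns its sequence ID."""
        seq = self._next_seq.get(tid, 0) + 1
        self._next_seq[tid] = seq
        expires = self._clock() + self.ttl_s
        self._rows.setdefault(tid, {})[seq] = Record(payload, "pending", expires)
        return seq

    def get(self, tid: str, seq: int) -> Optional[Record]:
        """Return the record, or None if it is missing or past its TTL."""
        rec = self._rows.get(tid, {}).get(seq)
        if rec is None or rec.expires_at <= self._clock():
            return None
        return rec

    def mark_acked(self, tid: str, seq: int) -> bool:
        """Update the status when the client's ACK arrives."""
        rec = self.get(tid, seq)
        if rec is None:
            return False
        rec.status = "acked"
        return True
```

For example, `store = TimelineStore(ttl_s=60.0)` followed by `store.append(store.timeline_id("u1", "d1"), "hi")` returns sequence ID 1 for that device's timeline. In this model expired rows are only hidden on read; a real store with TTL support would also reclaim the space.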

Delayed retry design: When a message is sent, it is stored with a “pending” state and a delayed retry task is scheduled. The task checks the message status; if not ACKed, it retries (if the device is online) and reschedules another delayed task, up to a maximum number of attempts.
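The check-retry-reschedule loop might look like the sketch below. It simulates time with an explicit `now` argument and a heap of due tasks instead of a real delay queue; `DelayedRetrier`, the status strings, and the default values are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class DelayedRetrier:
    """Simulated delayed-retry scheduler (illustrative names and defaults)."""

    def __init__(self, push: Callable[[str], None],
                 is_online: Callable[[], bool],
                 delay_s: float = 10.0, max_attempts: int = 3) -> None:
        self.push = push
        self.is_online = is_online
        self.delay_s = delay_s
        self.max_attempts = max_attempts
        self.status: Dict[str, str] = {}
        # Heap of (due_time, tiebreak, msg_id, attempt_number).
        self._tasks: List[Tuple[float, int, str, int]] = []
        self._counter = 0

    def send(self, msg_id: str, now: float) -> None:
        """Push the message, store it as pending, and schedule the first check."""
        self.status[msg_id] = "pending"
        self.push(msg_id)
        self._schedule(msg_id, attempt=1, now=now)

    def ack(self, msg_id: str) -> None:
        self.status[msg_id] = "acked"

    def _schedule(self, msg_id: str, attempt: int, now: float) -> None:
        self._counter += 1
        heapq.heappush(self._tasks,
                       (now + self.delay_s, self._counter, msg_id, attempt))

    def run_due(self, now: float) -> None:
        """Run every task whose delay has elapsed."""
        while self._tasks and self._tasks[0][0] <= now:
            _, _, msg_id, attempt = heapq.heappop(self._tasks)
            if self.status.get(msg_id) != "pending":
                continue  # already ACKed, nothing to do
            if attempt > self.max_attempts:
                self.status[msg_id] = "gave_up"
                continue
            if self.is_online():
                self.push(msg_id)  # retransmit only to an online device
            self._schedule(msg_id, attempt + 1, now)
```

In this sketch an attempt is counted even when the device is offline, so a message is eventually dropped from the retry loop; whether the production system counts offline checks the same way is not stated in the article.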

Delay strategies:
• Fixed delay (e.g., 10 s) with a limited retry count.
• Fixed delay plus an incremental step increase after several attempts.
• Adaptive delay that adjusts based on observed network recovery, using per‑device N values and a maximum cap.
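The three strategies can be compared side by side in a short sketch. The 10-second base comes from the example above; the step size, the cap, and the smoothing rule in `AdaptiveDelay` are assumptions, and treating the per-device N value as the device's current delay in seconds is one possible reading, since the article does not define N.

```python
from typing import Dict


def fixed_delay(attempt: int, base_s: float = 10.0) -> float:
    """Strategy 1: the same delay for every attempt."""
    return base_s


def stepped_delay(attempt: int, base_s: float = 10.0,
                  flat_attempts: int = 3, step_s: float = 5.0) -> float:
    """Strategy 2: fixed delay at first, then grow by step_s per extra attempt."""
    extra = max(0, attempt - flat_attempts)
    return base_s + extra * step_s


class AdaptiveDelay:
    """Strategy 3: a per-device delay N, nudged toward observed recovery times.

    The equal-weight smoothing is an assumption made for illustration.
    """

    def __init__(self, default_n_s: float = 10.0, cap_s: float = 60.0) -> None:
        self.default_n_s = default_n_s
        self.cap_s = cap_s
        self._n: Dict[str, float] = {}

    def observe_recovery(self, device_id: str, recovery_s: float) -> None:
        """Blend the latest observed recovery time into the device's N, up to the cap."""
        prev = self._n.get(device_id, self.default_n_s)
        self._n[device_id] = min(0.5 * prev + 0.5 * recovery_s, self.cap_s)

    def delay_for(self, device_id: str) -> float:
        return self._n.get(device_id, self.default_n_s)
```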

Compatibility: Older app versions do not send ACKs. Before adding a message to the retry queue, the server checks the device’s app version (reported after the ACCS connection is established) and only includes compatible versions.
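A version gate of this kind could be written as below. The minimum version `7.0.0` is a placeholder, not the real cutoff, and the dotted numeric version format is an assumption about what the client reports.

```python
from typing import Optional, Tuple

# Placeholder cutoff; the article does not say which version added ACK support.
MIN_ACK_VERSION = "7.0.0"


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse 'major.minor.patch' into a tuple, padding missing parts with 0."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def supports_ack(app_version: Optional[str],
                 min_version: str = MIN_ACK_VERSION) -> bool:
    """True if the device's reported app version is new enough to send ACKs.

    Unknown or malformed versions are treated as incompatible, so such
    devices are never added to the retry queue.
    """
    if not app_version:
        return False
    try:
        return parse_version(app_version) >= parse_version(min_version)
    except ValueError:
        return False
```

Comparing parsed tuples rather than raw strings avoids ordering `10.0.0` before `7.0.0`.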

Effectiveness: After the reconnection and retransmission scheme was deployed, the ACC compensation arrival time fell by 75% (from 60 min to 15 min), and user complaints about message delay dropped to fewer than two per week.

Future outlook: The team plans to keep improving Xianyu’s IM experience, both with new features (message recall, drafts, location sharing, conversation grouping, search) and with performance optimizations (CPU, memory, network, battery).

