Design and Architecture of a Scalable Live‑Streaming Message Service
The article outlines the challenges of real‑time messaging in live‑streaming education and presents a backend architecture that evolved in stages, built around AccessServer, MessageServer, and specialized services. It also covers caching, clustering, and future enhancements such as connection migration and QUIC, all aimed at high reliability, low latency, and massive concurrency.
In the era of internet services, instant messaging has become essential for products like WeChat, DingTalk, and QQ, and is also critical in live‑streaming classrooms where interactive features such as quizzes, doodles, and likes demand high reliability and immediacy.
The main challenges identified include frequent user churn in live rooms, high QPS for message forwarding (e.g., 500 × 500 = 250,000), real‑time latency constraints, user‑experience limits on how many messages can be shown on screen, historical message storage for replay, and maintaining message order across users and rooms.
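The forwarding figure above is a fan-out product. A minimal sketch of that arithmetic follows; the function name and the reading of the two factors (senders and recipients, at one message per sender per second) are assumptions for illustration, not definitions from the article.

```python
def fanout_per_second(senders: int, msgs_per_sender_per_sec: int, recipients: int) -> int:
    """Deliveries per second when every message is forwarded to every recipient."""
    return senders * msgs_per_sender_per_sec * recipients


# Assumed reading of "500 x 500": 500 senders, one message each per second,
# every message forwarded to 500 recipients in the room.
print(fanout_per_second(500, 1, 500))  # 250000
```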
To address these, the system assigns priority levels to messages, stores historical data in Pika using a read‑diffusion (fan‑out‑on‑read) model, and uses consistent hashing with Kafka‑like queues to preserve order while minimizing latency.
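One way message priorities can be used under load is to shed the least important messages first. The sketch below illustrates that idea only; the priority names, their ordering, and the `BoundedPriorityBuffer` class are hypothetical and are not taken from the article.

```python
import itertools

# Hypothetical priority levels (lower number = more important).
PRIORITY_CONTROL = 0  # e.g. quiz start/stop
PRIORITY_DOODLE = 1
PRIORITY_CHAT = 2
PRIORITY_LIKE = 3


class BoundedPriorityBuffer:
    """Holds at most `capacity` pending messages and sheds the least important."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._seq = itertools.count()  # arrival order, used to break ties
        self._items = []               # entries are (priority, seq, payload)

    def push(self, priority: int, payload: str) -> bool:
        """Return True if the message was accepted, False if it was dropped."""
        entry = (priority, next(self._seq), payload)
        if len(self._items) < self.capacity:
            self._items.append(entry)
            return True
        # Buffer full: find the least important (and, among equals, newest) entry.
        worst = max(self._items, key=lambda e: e[:2])
        if entry[:2] >= worst[:2]:
            return False               # new message is no more important: drop it
        self._items.remove(worst)      # evict to make room for the new message
        self._items.append(entry)
        return True

    def drain(self) -> list:
        """Return payloads by priority, then arrival order, and clear the buffer."""
        ordered = [payload for _, _, payload in sorted(self._items, key=lambda e: e[:2])]
        self._items.clear()
        return ordered


buf = BoundedPriorityBuffer(capacity=2)
buf.push(PRIORITY_CHAT, "hi")
buf.push(PRIORITY_LIKE, "like-1")
buf.push(PRIORITY_CONTROL, "quiz-start")  # evicts "like-1"
print(buf.drain())  # ['quiz-start', 'hi']
```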
The architecture evolves through three versions:
Architecture 1.0: Consists of AccessServer (handling TCP connections, async I/O, and user‑room mapping) and MessageServer (interacting with Redis and Pika, processing login, room entry/exit, and message routing). Consistent hashing directs a room’s traffic to a specific MessageServer.
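The room-to-MessageServer routing described for Architecture 1.0 can be sketched with a small consistent-hash ring. This is a generic illustration, not the article's implementation: the `HashRing` class, the virtual-node count, the SHA-256 hash, and the server names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes: int = 100) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        # Each physical node is placed on the ring at `vnodes` positions.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an unsigned 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def node_for(self, room_id: str) -> str:
        """Return the first node clockwise from the room's position on the ring."""
        idx = bisect.bisect_right(self._keys, self._hash(room_id)) % len(self._keys)
        return self._ring[idx][1]


# Hypothetical server names; every message for a room lands on the same server.
ring = HashRing(["message-server-1", "message-server-2", "message-server-3"])
print(ring.node_for("room-1001"))
```

Because all traffic for a room reaches one server, that server can process the room's messages in arrival order. When a server is removed, only the rooms it owned are remapped; rooms on the other servers keep their assignment.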
Architecture 2.0: Splits MessageServer into three services: MessageServer (room logic), BinMsgServer (doodle handling), and PeerMsgServer (one‑to‑one chat). Caching strategies are refined, and cache synchronization is optimized to reduce unnecessary RPC calls.
Architecture 3.0: Introduces TcpProxyServer as a layer‑7 proxy supporting multiple business sessions (chat, IM, push, etc.) over a single TCP connection, enabling dynamic routing policies and reducing client resource consumption.
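Carrying several business sessions over one TCP connection generally requires each frame to say which session it belongs to. The sketch below shows one possible framing; the header layout (1-byte business id, 4-byte session id, 4-byte payload length), the business id values, and all names are assumptions, since the article does not describe TcpProxyServer's wire format.

```python
import struct

# Assumed header: business id (uint8), session id (uint32), payload length (uint32),
# all in network byte order. This layout is hypothetical.
HEADER = struct.Struct("!BII")

BIZ_CHAT, BIZ_IM, BIZ_PUSH = 1, 2, 3  # hypothetical business ids


def encode_frame(biz: int, session_id: int, payload: bytes) -> bytes:
    """Prefix the payload with the multiplexing header."""
    return HEADER.pack(biz, session_id, len(payload)) + payload


class FrameDecoder:
    """Reassembles frames from a TCP byte stream that may split or merge them."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add received bytes and return every complete (biz, session_id, payload)."""
        self._buf.extend(data)
        frames = []
        while len(self._buf) >= HEADER.size:
            biz, session_id, length = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # payload not fully received yet
            frames.append((biz, session_id, bytes(self._buf[HEADER.size:end])))
            del self._buf[:end]
        return frames


stream = encode_frame(BIZ_CHAT, 7, b"hello") + encode_frame(BIZ_PUSH, 9, b"notice")
decoder = FrameDecoder()
for biz, session_id, payload in decoder.feed(stream):
    print(biz, session_id, payload)
```

A proxy could then pick a backend from the business id, which is one way the dynamic routing policies mentioned above might be keyed.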
Additional components include a DispatchServer for multi‑cluster IP/port allocation, secondary caches in both AccessServer and MessageServer to alleviate Redis pressure, and a cluster management approach that isolates workloads per business line.
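A secondary cache of this kind can be sketched as an in-process cache with a short time-to-live that is consulted before the shared store. The example is a generic illustration: the `LocalTTLCache` class, the TTL value, the key format, and the dictionary standing in for Redis are all assumptions.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class LocalTTLCache:
    """In-process cache checked before falling back to the shared store."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader      # called on a miss, e.g. a Redis lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.misses = 0            # number of times the loader was called

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh local hit, shared store untouched
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a key, e.g. after a cache-synchronization notice."""
        self._entries.pop(key, None)


# A plain dict stands in for Redis here; the key format is made up.
shared_store = {"room:42:teacher": "user-1001"}
cache = LocalTTLCache(loader=shared_store.get, ttl_seconds=2.0)
cache.get("room:42:teacher")
cache.get("room:42:teacher")
print(cache.misses)  # 1
```

The TTL bounds how stale a local entry can be, which is the trade-off for sending fewer reads to the shared store.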
Future plans focus on connection migration to seamlessly recover from AccessServer overload or restart, and the adoption of QUIC (UDP‑based) to improve latency in weak‑network environments.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.