Design and Optimization of a High‑Throughput Long‑Connection Service for Live Streaming
The article details a Golang‑based high‑throughput long‑connection service for live‑streaming, describing its five‑layer architecture, multi‑protocol support, load‑balancing, message‑queue decoupling, aggregation with brotli compression, multi‑region deployment, priority channels, and future enhancements for observability and intelligent endpoint selection.
In the digital entertainment era, bullet‑screen comments (danmu) have become an essential interactive element on live‑streaming platforms. Real‑time interactions such as sending danmu or gifts require a persistent network channel, i.e., a long‑connection, through which the server can push information to the client instantly.
A long‑connection is a network data channel that stays alive for the whole application lifecycle, supporting full‑duplex data transfer. Unlike short‑connection request/response models, it enables the server to push data to users proactively.
This article introduces a Golang‑based long‑connection service, covering its framework design and the optimizations made for stability and high throughput.
Framework Design
The service is shared by multiple business lines, so the design must accommodate diverse requirements while keeping the service boundaries clear to avoid coupling with business logic.
The long‑connection service consists of three main aspects:
Connection establishment, maintenance, and management.
Downstream data push.
Upstream data forwarding (currently only heartbeat).
Overall Architecture
The architecture is divided into five layers:
Control Layer : Pre‑connection checks, authentication, token generation, and routing control.
Access Layer : Core long‑connection handling – TLS certificate offloading, protocol adaptation, connection‑ID/room‑ID mapping, and upstream/downstream message processing.
Logic Layer : Business‑level functions such as online user reporting and connection‑attribute recording.
Message Distribution Layer : Message packaging, compression, aggregation, and dispatch to edge nodes.
Service Layer : Business service entry point for downstream push, permission control, message validation, and rate‑limiting.
Core Processes
The long‑connection follows three core processes:
Establishing the connection: The client obtains a valid token and access point configuration via the control layer.
Maintaining the connection: The client sends periodic heartbeats to keep the connection alive (a sketch of this loop follows the list).
Downstream push: Business servers trigger a push, the service layer determines the target connection, the distribution layer forwards the message to the appropriate access node, which finally delivers it to the client.
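To make the keep‑alive step concrete, below is a minimal client‑side sketch in Go; the 30‑second interval, the "HB" payload, and the use of a raw net.Conn are illustrative assumptions rather than the service's actual values.

```go
package conn

import (
	"net"
	"time"
)

// keepAlive sends a heartbeat at a fixed interval until stop is closed.
// Interval and payload are assumed values, used here for illustration only.
func keepAlive(c net.Conn, stop <-chan struct{}) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// A failed write means the channel is broken; the caller goes back
			// through the control layer to obtain a new token and reconnect.
			if _, err := c.Write([]byte("HB")); err != nil {
				return err
			}
		case <-stop:
			return nil
		}
	}
}
```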
Feature List
Based on Bilibili’s live‑streaming scenarios, the service provides the following generic push capabilities:
User‑level messages (e.g., invitation for PK).
Device‑level messages (e.g., log‑upload commands for unauthenticated devices).
Room‑level messages (e.g., danmu broadcast to all users in a room).
Region‑level messages (e.g., promotional activity to all rooms in a specific region).
Global messages (e.g., platform‑wide notifications).
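As a rough illustration of how these push scopes could be modeled in Go, the sketch below defines a scope enum and a message envelope; all names are assumptions, not the service's real types.

```go
package push

// Scope identifies how widely a downstream message is fanned out.
type Scope int

const (
	ScopeUser   Scope = iota // a single authenticated user
	ScopeDevice              // a single device, possibly unauthenticated
	ScopeRoom                // every connection in one live room
	ScopeRegion              // every room in a geographic region
	ScopeGlobal              // the whole platform
)

// Message is the envelope a business service hands to the push entry point.
type Message struct {
	Scope   Scope
	Target  string // user ID, device ID, room ID, or region, depending on Scope
	Payload []byte
}
```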
High‑Throughput Optimizations
With millions of concurrent users during peak events (e.g., the S‑Series finals), the system faces message rates of over 100 million per second. The following measures were taken to sustain performance.
1. Network Protocol
Three protocols are supported:
TCP – reliable, suitable for high‑reliability scenarios.
UDP – unreliable but low‑latency, used where occasional loss is acceptable.
WebSocket – bidirectional communication for web clients with moderate overhead.
The access layer separates protocol handling from connection management, allowing new protocols to be added without affecting core business logic.
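One way to picture this separation is a small connection interface that each protocol implements; this is a sketch with assumed names, not the actual access‑layer code.

```go
package access

import "context"

// Conn abstracts a single client connection, independent of the transport.
type Conn interface {
	ReadFrame(ctx context.Context) ([]byte, error)  // upstream data (currently heartbeats)
	WriteFrame(ctx context.Context, p []byte) error // downstream push
	Close() error
}

// Listener accepts connections for one concrete protocol (TCP, UDP, WebSocket).
// Connection management only deals with Conn, so adding a protocol means adding
// another Listener implementation without touching core business logic.
type Listener interface {
	Accept(ctx context.Context) (Conn, error)
}
```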
2. Load Balancing
The control layer provides HTTP short‑connection endpoints that dynamically select the optimal access node based on client location and edge‑node health. Horizontal scaling of the access layer and dynamic node addition/removal keep CPU and memory usage stable even when concurrent online users approach ten million.
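A simplified sketch of such an endpoint is shown below: it picks the least‑loaded healthy node in the client's region and falls back to any healthy node otherwise. The field names, the 0.9 load threshold, and the selection policy are assumptions made for illustration.

```go
package control

import (
	"encoding/json"
	"net/http"
)

// Node describes an access-layer edge node as seen by the control layer.
type Node struct {
	Addr   string  `json:"addr"`
	Region string  `json:"region"`
	Load   float64 `json:"-"` // 0..1, reported by health checks
}

// Selector picks an access node for a connecting client.
type Selector struct{ nodes []Node }

// pick returns the least-loaded healthy node in the client's region, falling
// back to the least-loaded node overall when the region has none.
func (s *Selector) pick(region string) *Node {
	pickFrom := func(sameRegionOnly bool) *Node {
		var best *Node
		for i := range s.nodes {
			n := &s.nodes[i]
			if n.Load >= 0.9 || (sameRegionOnly && n.Region != region) {
				continue // skip overloaded or out-of-region nodes
			}
			if best == nil || n.Load < best.Load {
				best = n
			}
		}
		return best
	}
	if n := pickFrom(true); n != nil {
		return n
	}
	return pickFrom(false)
}

// ServeHTTP is the short-connection endpoint the client calls before connecting.
func (s *Selector) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	n := s.pick(r.URL.Query().Get("region"))
	if n == nil {
		http.Error(w, "no access node available", http.StatusServiceUnavailable)
		return
	}
	_ = json.NewEncoder(w).Encode(n)
}
```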
3. Message Queue
Introducing a message queue and a dedicated distribution layer decouples business push from edge‑node delivery, improving concurrency and preventing bottlenecks in the service layer.
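The decoupling can be sketched as follows: the service layer only publishes jobs, and distribution‑layer workers consume them and deliver to edge nodes. A buffered Go channel stands in for the real message queue here, and all names are assumptions.

```go
package dispatch

import "sync"

// PushJob is what the service layer publishes after validation and rate-limiting.
type PushJob struct {
	RoomID  string
	Payload []byte
}

// queue stands in for the real message-queue topic; a buffered channel keeps
// the sketch self-contained.
var queue = make(chan PushJob, 1024)

// Publish is called by the service layer; a slow edge node never blocks it,
// because delivery happens asynchronously in the consumers.
func Publish(job PushJob) { queue <- job }

// RunConsumers starts distribution-layer workers that forward each job to the
// access nodes responsible for its room.
func RunConsumers(workers int, deliver func(PushJob)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				deliver(job)
			}
		}()
	}
	return &wg
}
```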
4. Message Aggregation
During hot events, a single room may generate millions of identical messages. By aggregating messages per room and sending them in batches, the QPS from the distribution layer to the access layer drops by about 60%.
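A possible shape of this aggregation is sketched below: messages are buffered per room and flushed either when a batch fills up or when a short timer fires. The 50 ms window and 40‑message cap are illustrative assumptions, not the production parameters.

```go
package dispatch

import (
	"sync"
	"time"
)

// Aggregator batches messages destined for the same room so the distribution
// layer issues one request per batch instead of one per message.
type Aggregator struct {
	mu      sync.Mutex
	pending map[string][][]byte                 // roomID -> queued payloads
	flush   func(roomID string, batch [][]byte) // sends one aggregated request
}

func NewAggregator(flush func(string, [][]byte)) *Aggregator {
	a := &Aggregator{pending: make(map[string][][]byte), flush: flush}
	go a.loop()
	return a
}

// Add queues a payload and flushes early if the room's batch is full.
func (a *Aggregator) Add(roomID string, payload []byte) {
	a.mu.Lock()
	a.pending[roomID] = append(a.pending[roomID], payload)
	full := len(a.pending[roomID]) >= 40
	a.mu.Unlock()
	if full {
		a.flushRoom(roomID)
	}
}

// loop flushes every pending room on a fixed time window.
func (a *Aggregator) loop() {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		a.mu.Lock()
		rooms := make([]string, 0, len(a.pending))
		for r := range a.pending {
			rooms = append(rooms, r)
		}
		a.mu.Unlock()
		for _, r := range rooms {
			a.flushRoom(r)
		}
	}
}

func (a *Aggregator) flushRoom(roomID string) {
	a.mu.Lock()
	batch := a.pending[roomID]
	delete(a.pending, roomID)
	a.mu.Unlock()
	if len(batch) > 0 {
		a.flush(roomID, batch)
	}
}
```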
5. Compression Algorithms
After aggregation, message payloads become larger, so compression is applied. Two widely used algorithms were evaluated: zlib and brotli.
Test results (average compressed sizes) are shown below:

| Scenario | Original Size | zlib Size | zlib Ratio | brotli Size | brotli Ratio | brotli vs zlib Savings |
| --- | --- | --- | --- | --- | --- | --- |
| 2 messages | 1126 | 468 | 42% | 390 | 35% | 17% |
| 10 messages | 4706 | 1728 | 37% | 1423 | 30% | 18% |
| 20 messages | 9505 | 2674 | 28% | 2172 | 23% | 19% |
| 40 messages | 19387 | 3161 | 16% | 2488 | 13% | 20% |
Brotli consistently outperformed zlib, so it was adopted. Compression is performed at the distribution layer to avoid repeated work on edge nodes, improving throughput and reducing bandwidth costs.
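As a sketch of the distribution‑layer compression step, the snippet below compresses an aggregated batch once so every edge node can forward the same compressed bytes; the github.com/andybalholm/brotli package and the default compression level are assumptions made for illustration.

```go
package dispatch

import (
	"bytes"

	"github.com/andybalholm/brotli" // example Go brotli binding; an assumption, not necessarily the production choice
)

// compressBatch compresses an aggregated payload once in the distribution
// layer, so edge nodes do not repeat the work per connection.
func compressBatch(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := brotli.NewWriterLevel(&buf, brotli.DefaultCompression)
	if _, err := w.Write(payload); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil { // Close flushes the remaining bits
		return nil, err
	}
	return buf.Bytes(), nil
}
```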
Service Guarantees
Because many business flows depend on reliable push, the following safeguards were implemented.
1. Multi‑Active Deployment
Identical service instances are deployed across East, South, and North China, as well as Singapore for overseas users. Automatic failover ensures continuity when a region experiences a fault.
2. High/Low Priority Channels
Messages are classified by importance. Critical messages (e.g., PK invitations) use a high‑priority channel, while less critical ones (e.g., regular danmu) use a low‑priority channel, providing physical isolation and preferential delivery.
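In Go terms, the preferential delivery can be approximated with two channels where the high‑priority one is always drained first; the real system uses physically separate channels, so this is only an in‑process illustration.

```go
package dispatch

// nextMessage always serves the high-priority channel before the low-priority
// one, so critical messages are never stuck behind a backlog of regular danmu.
func nextMessage(high, low <-chan []byte) []byte {
	// Take a high-priority message immediately if one is ready.
	select {
	case m := <-high:
		return m
	default:
	}
	// Otherwise wait on both, still preferring high when it arrives.
	select {
	case m := <-high:
		return m
	case m := <-low:
		return m
	}
}
```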
3. "High‑Reach" (高达) Function
To guarantee end‑to‑end delivery, each message carries a unique msgID. The client performs idempotent deduplication and ACKs receipt, and the server retries undelivered messages within a configurable window. With a single‑attempt delivery success rate r and n retries, the final delivery rate is 1 − (1 − r)^(n+1). For example, with r = 97% and n = 2, the delivery rate reaches 99.9973%.
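A minimal retry‑until‑ACK loop matching that formula might look like the sketch below; the 3‑second ACK window, the function signatures, and the names are assumptions.

```go
package push

import (
	"errors"
	"time"
)

// sendReliably pushes a message up to retries+1 times, stopping as soon as the
// client acknowledges the msgID. With per-attempt success rate r, the overall
// delivery rate is 1-(1-r)^(retries+1).
func sendReliably(msgID string, payload []byte, retries int,
	deliver func(id string, p []byte) error,
	waitAck func(id string, timeout time.Duration) bool) error {

	for attempt := 0; attempt <= retries; attempt++ {
		if err := deliver(msgID, payload); err != nil {
			continue // transport error: count the attempt and retry
		}
		if waitAck(msgID, 3*time.Second) {
			return nil // client ACKed; dedup on msgID makes retries idempotent
		}
	}
	return errors.New("message " + msgID + " not acknowledged after all attempts")
}
```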
Other Optimizations
Enter/Exit Room Messages : To avoid loss of room‑join/leave notifications, a state‑machine driven by heartbeats is used, providing idempotent handling and compensation mechanisms.
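A rough sketch of such a heartbeat‑driven state machine is given below; the 90‑second expiry window and the method names are illustrative assumptions.

```go
package logic

import "time"

// roomState tracks one connection's presence in a room. Enter/exit events are
// idempotent, and a lost exit message is compensated for when heartbeats stop.
type roomState struct {
	inRoom   bool
	lastBeat time.Time
}

func (s *roomState) OnEnter(now time.Time)     { s.inRoom, s.lastBeat = true, now }
func (s *roomState) OnExit()                   { s.inRoom = false }
func (s *roomState) OnHeartbeat(now time.Time) { s.lastBeat = now }

// Expired reports whether the connection should be treated as having left the
// room because no heartbeat arrived within the window.
func (s *roomState) Expired(now time.Time) bool {
	return s.inRoom && now.Sub(s.lastBeat) > 90*time.Second
}
```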
Future Plans
The service is stable after several iterations, and future work will focus on:
Data‑driven observability: full‑link network quality metrics and high‑value message tracing.
Intelligence: automatic endpoint selection and connection establishment based on environment.
Performance: sharing goroutines in the access layer’s connection module to reduce goroutine count and increase per‑node capacity.
Feature expansion: adding offline message support and other capabilities.