Design and Implementation of Baidu's Unified Long‑Connection Service
Baidu’s Go‑based unified long‑connection service delivers secure, high‑concurrency, low‑latency connections for multiple Baidu apps through a four‑layer architecture (SDK, control, access, routing). It employs goroutine pooling, a two‑layer connection model, and binary routing to support tens of millions of concurrent users at million‑level QPS, while simplifying integration and reducing maintenance costs.
In the mobile‑Internet era, user expectations for real‑time and interactive services have driven the need for high‑performance long‑connection capabilities. This article presents Baidu’s internal unified long‑connection service, implemented in Go, and discusses its functional design, performance optimizations, and operational experience.
Abstract
The unified long‑connection service provides a secure, high‑concurrency, low‑latency, and easy‑to‑integrate solution for multiple Baidu apps (live streaming, messaging, push, cloud control, etc.). It eliminates duplicated development, reduces maintenance cost, and ensures professional, stable long‑connection capabilities across business lines.
Key Goals
Support the major Baidu APP scenarios with a unified, secure long‑connection capability.
Guarantee high concurrency, high stability, and low latency.
Enable multi‑business reuse of a single connection to reduce resource consumption.
Provide a simple, clear integration process for downstream services.
Functional Overview
Connection establishment, maintenance, and management.
Upstream request forwarding.
Downstream data push (unicast, batch‑unicast, broadcast).
Challenges
The service must meet low‑latency, high‑concurrency, and high‑stability requirements while supporting many business scenarios. Maintaining separate long‑connection implementations for each business would cause duplicated effort and hinder rapid feature iteration.
Architecture
The system consists of four layers:
Unified Long‑Connection SDK (client side) – obtains token and endpoint from the control layer, establishes and maintains the connection, forwards business SDK requests, and receives server‑pushed data.
Control Layer – validates device legitimacy, issues tokens, selects appropriate access points and protocols, and performs traffic control.
Access Layer – core long‑connection service handling connection admission, maintenance, request forwarding, and downstream push. It manages connection‑ID ↔︎ connection‑info mapping, group‑ID ↔︎ connection‑info mapping, and separates read/write goroutine pools.
Routing Layer – maintains device‑ID ↔︎ connection‑info mapping to enable targeted push.
Core Process
Connection establishment – SDK obtains token & endpoint from the control layer, then connects to the access layer.
Connection maintenance – periodic heartbeat from the SDK keeps the connection alive.
Upstream request – business SDK sends a request, the SDK packages it, and the access layer forwards it to the appropriate business server.
Downstream push – business server sends a push request, the routing layer resolves the target connection, the access layer writes the data, and the SDK delivers it to the business SDK.
Performance Optimizations
The service is designed to sustain millions of concurrent connections and tens of thousands of QPS each for connection establishment, upstream requests, and downstream pushes.
Introduction of a request‑forwarding group and a downstream‑task group to avoid a single goroutine becoming a bottleneck and to reduce the total number of goroutines per instance.
Two‑layer connection model (connection layer + session layer) isolates business logic from the underlying transport (TCP, TLS, WebSocket, QUIC), allowing seamless protocol upgrades.
State‑machine based connection lifecycle management ensures reliable reconnection and clear state transitions.
Multi‑Business Support
A private binary protocol is used, consisting of a header, common fields (device ID, app ID, business ID, metadata), and business‑specific payload. By parsing the business ID, the service can route data to the correct backend without interpreting business logic.
Deployment
Access points are deployed in East, North, and South China, plus a Hong Kong node for overseas traffic.
Clusters are sized per business importance; critical services have dedicated clusters, while secondary services share resources.
Each instance caps active connections at 100k‑200k to limit goroutine count and GC pressure.
Business Integration
Assess required capabilities (unicast, batch‑unicast, group‑cast, upstream support).
Estimate user scale to plan resources.
Integrate the client SDK.
Adapt server‑side interfaces according to the selected capabilities.
Request resources and launch the service.
Summary & Future Plans
The unified long‑connection service now handles tens of millions of concurrent connections and sustains million‑level upstream QPS and downstream pushes per second. It has proven stable during large‑scale events. Future work focuses on finer‑grained network quality metrics, intelligent adaptive connection parameters, and expanding to new business scenarios.
Baidu Geek Talk