Design and Performance Considerations of 360's Long‑Connection Push System Using Go
This article details the architecture, key performance metrics, and optimization strategies of 360’s high‑availability long‑connection push system built with Go, covering connection capacity, memory usage, throughput, deployment components, client SDK considerations, and operational challenges such as GC pauses and load balancing.
This article is based on a presentation by Zhou Yang, a technical manager and architect at 360 Mobile Assistant, who is responsible for the 360 long‑connection message (push) system.
360's message system is a long‑connection push platform serving multiple internal products, supporting thousands of apps, providing upstream data, and offering user‑state callbacks. The system consists of nine functional clusters deployed across several IDC sites, handling billions of online users.
Key performance indicators discussed:
1. Connection capacity per instance – In stable conditions a single instance can maintain up to 3 million concurrent TCP connections, but real‑world network variability limits practical usage.
2. Memory consumption – Go’s goroutine model adds overhead; full‑duplex designs use two goroutines per connection, while half‑duplex can reduce memory usage.
3. Message throughput – Depends on QoS, push‑pull models, logging, and buffer strategies; a 24‑core, 64 GB server can sustain 20,000–50,000 QPS with about 25 GB of memory in use and GC pauses of 200–800 ms.
The system typically caps a single instance at 800 k users and runs at most two instances per machine to avoid DDoS‑like spikes.
System architecture overview
All services are written in Go. Major components include:
Dispatcher service: Returns a set of IPs for the client to connect to the appropriate long‑connection server.
Room service: Holds client connections, registers them, and enforces security policies.
Register service: Global session store indexing user information.
Coordinator service: Forwards upstream data, handles callbacks, and coordinates asynchronous operations such as kicking users offline.
Saver service: Access layer for Redis and MySQL; caches broadcast data and implements dead‑message handling strategies.
Center service: Provides internal APIs for unicast, broadcast, status queries, and operational management.
Deployd/Agent service: Manages process deployment and collects component health via Zookeeper/keeper.
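To make the Dispatcher's role concrete, here is a sketch of returning the least‑loaded Room‑service addresses to a client; the `serverLoad` type, `pickServers` function, and addresses are illustrative assumptions, not details from the talk.

```go
package main

import (
	"fmt"
	"sort"
)

// serverLoad pairs a long-connection server address with its current
// connection count. All names and values here are illustrative.
type serverLoad struct {
	Addr  string
	Conns int
}

// pickServers sketches one plausible dispatcher policy: return the n
// least-loaded Room-service addresses for the client to try in order.
func pickServers(servers []serverLoad, n int) []string {
	sorted := append([]serverLoad(nil), servers...) // don't mutate the input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Conns < sorted[j].Conns })
	if n > len(sorted) {
		n = len(sorted)
	}
	out := make([]string, 0, n)
	for _, s := range sorted[:n] {
		out = append(out, s.Addr)
	}
	return out
}

func main() {
	servers := []serverLoad{
		{"10.0.0.1:7000", 800000},
		{"10.0.0.2:7000", 120000},
		{"10.0.0.3:7000", 450000},
	}
	fmt.Println(pickServers(servers, 2)) // least-loaded addresses first
}
```

Returning several candidates rather than one lets the client SDK fail over locally without another dispatcher round trip, which matters in the weak‑network scenarios discussed below.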
The push model is primarily server‑initiated (push‑only), with optional pull for QoS‑critical scenarios. SDK routing strategies, heartbeat tuning, and client‑side reconnection logic are crucial for reliability in weak network environments.
Go development challenges and solutions
Key issues encountered include excessive goroutine creation, un‑reused I/O buffers, inefficient RPC frameworks, and long GC pauses (up to 3‑6 seconds). Solutions involved limiting goroutine spawns, using task pools, adopting connection‑pooled RPC with pipelining, and employing memory/object pools where beneficial.
Operational practices include multi‑instance deployment, selective use of SO_REUSEPORT, and careful evaluation of pooling versus lock overhead.
Operations and testing
Architecture evolves through instance splitting and business‑type resource segregation. Visualized pressure tests on idle servers assess long‑connection stability, while Go’s built‑in profiling aids performance tuning.
Frequently asked questions cover protocol timeout settings, message persistence (Redis + MySQL), storm mitigation, Go toolchain debugging, TCP‑based protocol stack, upstream data routing, SDK multi‑app reuse, profiling activation, consumer grouping, choice of Go over Erlang, load‑balancing via Zookeeper vs. Raft, security (TLS + custom RSA/DES), and plans to open‑source the keeper component.
Published via the High Availability Architecture official account.