
Design and Implementation of a WebSocket Long‑Connection Gateway for iQIYI Services

iQIYI built a Netty‑based WebSocket long‑connection gateway that centralizes session management, uses RocketMQ for cluster‑wide broadcast, provides a simple HTTP push API, scales horizontally to millions of connections, integrates Prometheus/Grafana monitoring, and dramatically cuts development effort for real‑time features such as comments and authentication.

iQIYI Technical Product Team

HTTP is a stateless request/response protocol based on TCP, where only the client can initiate a request and the server responds. While this pull model satisfies most scenarios, real‑time push use cases such as message notifications require the server to actively push data to the client.

Server-side push has evolved from short polling and long polling, both of which suffer from latency and wasted resources, to the WebSocket protocol (standardized as RFC 6455 and introduced alongside HTML5), which is now the mainstream solution.

Integrating WebSocket into a single service is straightforward, but there is still no mature, general-purpose WebSocket push gateway. Existing cloud push services focus on mobile push (iOS/Android) and lack WebSocket support. This article shares the design and implementation experience of a WebSocket long-connection gateway built on Netty.

iQIYI Service Scenarios

iQIYI’s content creation platform uses WebSocket for real‑time features such as:

User comments – push comments instantly to browsers.

Real‑name authentication – after scanning a QR code, the authentication result is asynchronously notified to the browser.

Liveness detection – similar to authentication, results are pushed after completion.

Problems identified in existing implementations:

Inconsistent WebSocket stacks (Netty vs servlet containers) increase development and maintenance difficulty.

WebSocket logic is scattered across multiple projects and tightly coupled with business systems, leading to duplicated effort.

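WebSocket connections are stateful: each client is connected to exactly one node, so in a cluster a push request may arrive at a node that does not hold the target connection. Sessions therefore have to be shared across the cluster.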

Therefore, a clustered WebSocket solution must address session sharing, horizontal scalability, and monitoring integration.

Design Goals of the Long‑Connection Gateway

Centralized long‑connection management and push capability using a unified technology stack.

Decoupling of business logic from communication details to avoid repeated development.

Simple integration via an HTTP push API, usable by any programming language.

Distributed architecture for horizontal scaling and high availability.

Multi‑device message synchronization.

Multi‑dimensional monitoring and alerting integrated with existing micro‑service observability tools.
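The HTTP push API goal above means a business system only needs an HTTP client to send a message. The article does not publish the real endpoint or payload, so the sketch below is an illustration only: the `/push` path, the `uid` and `message` JSON fields, and the `PushRequests` helper are all hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Builds push requests for a hypothetical gateway endpoint (path and fields are assumptions). */
final class PushRequests {
    private PushRequests() {}

    /** Serializes the target user and message as a small JSON object. */
    static String payload(String uid, String message) {
        return "{\"uid\":\"" + escape(uid) + "\",\"message\":\"" + escape(message) + "\"}";
    }

    /** Creates a POST request against the assumed "/push" path of the gateway. */
    static HttpRequest build(String baseUrl, String uid, String message) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/push"))
                .timeout(Duration.ofSeconds(3))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload(uid, message)))
                .build();
    }

    /** Minimal escaping of backslashes and quotes; a real client would use a JSON library. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A caller would pass the result to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Because the contract is plain HTTP plus JSON, the same call can be made from any language, which is the point of this design goal.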

Technical Selection

Netty was chosen for its high performance and its event-driven, asynchronous, non-blocking I/O model.

For cluster‑wide session sharing, two schemes were evaluated:

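| Scheme | Advantages | Disadvantages |
| --- | --- | --- |
| Registry | Clear session-to-node mapping; better suited to large clusters | Complex to implement; strong dependency on the registry; extra operational cost |
| Event broadcast | Simple to implement and more lightweight | With many nodes, every node receives every broadcast, which wastes resources |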

The gateway uses event broadcast. For the broadcast mechanism itself, three options were compared:

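| Scheme | Advantages | Disadvantages |
| --- | --- | --- |
| RocketMQ | High throughput, high availability, reliable delivery | Higher latency than Redis |
| Redis | Low latency, simple to implement | No delivery guarantee |
| ZooKeeper | Simple to implement | Poor write performance; unsuitable for write-heavy scenarios |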

Considering throughput, reliability, and implementation cost, RocketMQ was selected as the broadcast backbone.

System Architecture

The overall architecture is illustrated in the diagram below:

Workflow

The client completes a WebSocket handshake with any gateway node; the node adds the connection to its in-memory connection pool and monitors heartbeats.

When business systems need to push data, they call the gateway’s HTTP API.

The gateway writes the push request into RocketMQ.

All gateway nodes consume the message in broadcast mode.

Each node checks whether the target client is present in its local connection pool; if so, it pushes the data, otherwise it discards the message.

The cluster provides load balancing and horizontal scaling; if a node fails, clients reconnect to other nodes, ensuring high availability.
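The workflow above can be sketched in a few lines of JDK-only Java. This is a simplified model, not the gateway's code: `BroadcastTopic` stands in for a RocketMQ topic consumed in broadcast mode, `GatewayNode` stands in for a Netty node, and both class names and their methods are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Stand-in for a gateway node: it only knows the connections opened against it. */
final class GatewayNode {
    final String name;
    // uid -> messages written to that user's local connection; a real node would hold channels.
    private final Map<String, List<String>> localConnections = new ConcurrentHashMap<>();

    GatewayNode(String name) { this.name = name; }

    /** Records that the user completed a handshake with this node. */
    void connect(String uid) {
        localConnections.putIfAbsent(uid, new CopyOnWriteArrayList<>());
    }

    /** Invoked for every broadcast message; returns true if delivered, false if discarded. */
    boolean onBroadcast(String uid, String body) {
        List<String> outbox = localConnections.get(uid);
        if (outbox == null) {
            return false; // target is not connected to this node: discard
        }
        outbox.add(body);
        return true;
    }

    List<String> delivered(String uid) {
        return localConnections.getOrDefault(uid, List.of());
    }
}

/** Stand-in for a topic consumed in broadcast mode: every subscriber sees every message. */
final class BroadcastTopic {
    private final List<GatewayNode> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(GatewayNode node) { subscribers.add(node); }

    /** Hands the message to all nodes and returns how many of them delivered it. */
    int publish(String uid, String body) {
        int delivered = 0;
        for (GatewayNode node : subscribers) {
            if (node.onBroadcast(uid, body)) {
                delivered++;
            }
        }
        return delivered;
    }
}
```

In this model the publisher never needs to know which node holds a user's connection. The price, as the scheme comparison notes, is that every node inspects every message even when it has nothing to deliver.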

Session Management

Sessions are kept in memory on each node. The SessionManager component maintains a hash map of UID → UserSession. A UserSession may contain multiple ChannelSession objects (one per browser tab). When the number of channels per user exceeds a threshold, the oldest channel is closed to limit resource usage.
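A minimal sketch of this structure follows. The class names SessionManager, UserSession, and ChannelSession come from the article; every field, method, and the choice of `ConcurrentHashMap.compute` for atomic updates are assumptions made for illustration, and a real ChannelSession would wrap a Netty channel rather than a plain ID.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** One WebSocket connection, for example one browser tab. */
final class ChannelSession {
    final String channelId;
    private volatile boolean open = true;

    ChannelSession(String channelId) { this.channelId = channelId; }
    void close() { open = false; }
    boolean isOpen() { return open; }
}

/** All connections of one user, ordered oldest first. */
final class UserSession {
    private final Deque<ChannelSession> channels = new ArrayDeque<>();

    /** Adds a channel; if the limit is exceeded, closes and returns the oldest one, else null. */
    synchronized ChannelSession add(ChannelSession channel, int maxChannels) {
        channels.addLast(channel);
        if (channels.size() <= maxChannels) {
            return null;
        }
        ChannelSession oldest = channels.removeFirst();
        oldest.close();
        return oldest;
    }

    /** Removes a channel and reports whether the user has no channels left. */
    synchronized boolean removeAndCheckEmpty(ChannelSession channel) {
        channels.remove(channel);
        return channels.isEmpty();
    }

    synchronized int size() { return channels.size(); }
}

/** Node-local map of UID -> UserSession. */
final class SessionManager {
    private final Map<String, UserSession> sessions = new ConcurrentHashMap<>();
    private final int maxChannelsPerUser;

    SessionManager(int maxChannelsPerUser) { this.maxChannelsPerUser = maxChannelsPerUser; }

    /** Registers a channel; returns the evicted (closed) channel, or null if none was evicted. */
    ChannelSession register(String uid, ChannelSession channel) {
        ChannelSession[] evicted = new ChannelSession[1];
        sessions.compute(uid, (key, existing) -> {
            UserSession user = (existing == null) ? new UserSession() : existing;
            evicted[0] = user.add(channel, maxChannelsPerUser);
            return user;
        });
        return evicted[0];
    }

    /** Drops a channel and removes the user entry once its last channel is gone. */
    void unregister(String uid, ChannelSession channel) {
        sessions.computeIfPresent(uid, (key, user) -> user.removeAndCheckEmpty(channel) ? null : user);
    }

    int userCount() { return sessions.size(); }

    int channelCount(String uid) {
        UserSession user = sessions.get(uid);
        return user == null ? 0 : user.size();
    }
}
```

With a limit of 2, registering a third channel for the same user closes and returns the first one, so a user who keeps opening tabs cannot grow the node's memory without bound.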

Monitoring & Alerting

The gateway integrates Micrometer to expose custom metrics (connection count, user count) for Prometheus scraping. Grafana dashboards display connection numbers, JVM, CPU, memory, and other key indicators. Alert rules can be configured in Grafana to trigger internal alarm platforms when anomalies occur.
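To make the two custom metrics concrete, the sketch below keeps a connection gauge and a user gauge and renders them in the Prometheus text exposition format. It deliberately does not use Micrometer, which would normally handle registration and rendering, so that the example runs on the JDK alone; the `GatewayMetrics` class and the metric names `gateway_connections` and `gateway_users` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hand-rolled stand-in for the gauges that a metrics library would register. */
final class GatewayMetrics {
    private final AtomicInteger connections = new AtomicInteger();
    private final AtomicInteger users = new AtomicInteger();

    void connectionOpened() { connections.incrementAndGet(); }
    void connectionClosed() { connections.decrementAndGet(); }
    void userOnline() { users.incrementAndGet(); }
    void userOffline() { users.decrementAndGet(); }

    /** Renders the current values in the Prometheus text exposition format. */
    String scrape() {
        return gauge("gateway_connections", "Open WebSocket connections on this node", connections.get())
             + gauge("gateway_users", "Distinct users connected to this node", users.get());
    }

    private static String gauge(String name, String help, int value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }
}
```

Gauges fit these values because both counts go up and down as clients connect and disconnect. Summing the per-node gauges in a dashboard query gives the cluster-wide total.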

Performance Testing

Two 4‑core/16 GB VMs were used as server and client. The gateway opened 20 ports, each handling 20 clients; each client created 5 000 connections, achieving a total of 1 million concurrent connections. Memory usage and connection counts are shown in the figure below:

Sending a single message to all 1 million connections took about 10 seconds (single‑threaded sender). The latency chart is shown below:

With 10 connections per user, 600 concurrent requests, and a 120 s test duration, the push API achieved a TPS of over 1 600, as illustrated below:

These metrics satisfy iQIYI’s current business requirements and provide headroom for future growth.
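The reported figures can be put into rough per-second terms. The arithmetic below is a back-of-the-envelope reading, not a measured result: the broadcast rate assumes the 10 seconds were spent sending at a steady pace, and the response-time bound applies Little's law under the assumption that all 600 concurrent requests were kept in flight for the whole test.

```latex
% Broadcast: 10^6 connections written in about 10 s by one sender thread
R_{\text{broadcast}} \approx \frac{10^{6}}{10\ \mathrm{s}} = 10^{5}\ \text{writes/s}

% Push API: Little's law, L = \lambda W, with L = 600 and \lambda > 1600\ \mathrm{s}^{-1}
W = \frac{L}{\lambda} < \frac{600}{1600\ \mathrm{s}^{-1}} = 0.375\ \mathrm{s}

% Requests completed during the 120 s run
N > 1600\ \mathrm{s}^{-1} \times 120\ \mathrm{s} = 192\,000
```

Under those assumptions the average push request completed in under 0.375 s, and the run processed more than 192,000 requests.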

Business Case: Image Filter Notification

When a creator uploads a video cover, they can choose a filter effect. After the asynchronous processing finishes, the filtered image is pushed to the browser via the WebSocket gateway:

Integrating the gateway reduces development time from 1‑2 days to a few minutes, improves code maintainability, and lowers operational costs.

Conclusion

WebSocket is the mainstream technology for server‑push. A well‑designed long‑connection gateway abstracts communication details, decouples business logic, provides simple HTTP push interfaces, supports distributed deployment, and offers built‑in monitoring and alerting. iQIYI has already applied this gateway in multiple scenarios (image filter notifications, MCN electronic signatures, etc.) and plans to explore features such as message retransmission, binary data support, and multi‑tenant isolation.
