Design and Architecture of QLive Large‑Scale Live Streaming Service
The QLive service powers iQIYI’s massive live‑streaming events—such as the Spring Festival Gala—by combining vertical and horizontal scaling, a three‑layer architecture with dual data‑center isolation, multi‑level caching, circuit‑breaker/degradation controls, and a Flume‑Kafka‑Hive monitoring pipeline to sustain over 400,000 QPS and 99.9999 % availability.
During the Spring Festival, iQIYI’s live‑streaming platform experiences traffic spikes several times higher than normal, comparable to the 12306 ticket‑purchasing rush. To support this massive concurrency, the QLive service provides a set of APIs for large‑scale live events such as the Spring Festival Gala, concerts, and variety shows.
Two main scaling approaches are discussed: vertical scaling (enhancing a single machine’s concurrency) and horizontal scaling (adding more servers). The team combines both by deploying a multi‑level cache and a dual‑data‑center architecture, achieving over 400,000 QPS in production tests.
The overall system is divided into three layers:
1. Access Layer (Business Middleware): Handles load balancing, degradation, and service encapsulation. It is deployed in multiple external data centers to provide the shortest path for users of different ISPs and to avoid single‑point failures.
2. Application Layer (Dual‑Data‑Center Architecture): Isolates data centers, services, and hotspots to improve cluster availability and horizontal scalability. It also addresses inter‑data‑center latency (≈1 ms within the same city, up to 30 ms across cities) and data‑sync challenges for MySQL/Redis.
3. Basic Service Layer: Includes monitoring dashboards, CMDB, release management, a Quartz‑based distributed task scheduler, trace logging, and a data center that collects user‑side ping‑backs and generates real‑time reports.
To ensure reliability, the system implements circuit‑breaker and degradation mechanisms. The circuit‑breaker has three states (Closed, Open, Half‑Open) with thresholds based on failure counts and MTTR. Degradation strategies include automatic and manual switches that downgrade non‑critical features (e.g., video play counts) based on load, timeout rates, and failure metrics.
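The three‑state circuit breaker described above can be sketched as a small state machine. This is a minimal illustration with hypothetical names and thresholds, not iQIYI’s actual implementation; time is passed in explicitly so the transitions are easy to follow.

```java
/** Minimal circuit-breaker sketch: Closed -> Open -> Half-Open -> Closed. */
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final long openTimeoutMillis; // how long to stay OPEN before probing
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    /** Returns true if the protected call may proceed. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN; // let a single probe request through
                return true;
            }
            return false; // still open: fail fast / serve degraded response
        }
        return true; // CLOSED or HALF_OPEN
    }

    synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED; // probe succeeded (or normal operation)
    }

    synchronized void recordFailure(long nowMillis) {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // trip the breaker
            openedAt = nowMillis;
            failures = 0;
        }
    }

    synchronized State state() { return state; }
}
```

In a real deployment the threshold would be derived from the failure counts and MTTR metrics mentioned above, and the fast‑fail path would return the degraded response (e.g. a cached or stubbed play count).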
A multi‑level cache architecture is employed: a local cache at the application tier, a distributed Redis cache, and a Redis Pub/Sub mechanism to keep caches consistent across machines. This design keeps the average response time of core APIs under 10 ms.
The monitoring data pipeline uses Flume agents to collect Nginx logs, aggregates them via Kafka, stores them in Hive, and finally loads aggregated metrics into a MySQL database for dashboard visualization.
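The first hop of that pipeline—a Flume agent tailing Nginx access logs into Kafka—might be configured along these lines (a hypothetical fragment; hostnames, paths, and topic names are placeholders):

```properties
# Hypothetical Flume agent: tail Nginx access logs into a Kafka topic
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/lib/flume/nginx-position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/nginx/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sinks.k1.kafka.topic = nginx-access
a1.sinks.k1.channel = c1
```

Downstream, Kafka consumers land the events in Hive for aggregation, and the rolled‑up metrics are written to MySQL for the dashboards.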
Future work focuses on further reducing request latency, simplifying business integration, and modularizing live‑stream components while maintaining a 99.9999 % availability rate.
iQIYI Technical Product Team