Evolution of iQIYI Real-Time Big Data Collection System
iQIYI’s big‑data collection system has progressed through three generations: from simple HTTP log uploads, to a Flume‑Kafka pipeline, and finally to a custom Venus‑Agent architecture with centralized configuration, persistent offsets, a dual‑Kafka stream layer, and Flink processing. The platform now handles peaks of tens of millions of queries per second and over three hundred billion records daily, powering iQIYI’s AI‑driven services.
Author: Weichen, head of iQIYI Cloud Platform big data collection business. Joined iQIYI Cloud Platform in 2014 and has experience in big‑data production, collection, transmission and analysis.
Big data has become a strategic resource comparable to land in the agricultural era or energy in the industrial era. Massive user and machine logs are generated at iQIYI, with peak production exceeding ten million log lines per second and more than 3 × 10¹¹ lines per day.
First‑generation data collection (2013‑2014): client devices uploaded event‑tracking logs over simple HTTP; Nginx collected the logs and wrote them to HDFS for offline analysis. This architecture suffered from high latency (10‑15 minutes from user action to availability in HDFS) and could only support low‑timeliness use cases such as daily reports.
Second‑generation data collection (2015‑2016): Introduced Apache Flume and Kafka to achieve second‑level latency and to collect both user behavior logs and backend program logs. Real‑time data enabled more responsive recommendation and security use cases. However, as data volume grew, several problems emerged:
Different business lines required separate collection scripts and Flume configurations, leading to high maintenance cost.
Any change in real‑time collection required large‑scale Flume restarts.
Flume’s functionality could not satisfy advanced needs such as random sampling, data transformation, or format conversion.
Third‑generation data collection (2016‑2017): iQIYI developed a custom collection agent, Venus‑Agent, to replace Flume. Key innovations:
File selection based on inode numbers instead of tail‑style scripts, allowing dynamic updates of the file list and seamless log collection in elastic Docker containers.
Persistent offset tracking for each file, enabling the client to resume from the last position after network glitches or Kafka failures, thus guaranteeing no data loss.
All collection configurations are stored centrally; agents pull the latest config via heartbeat, eliminating manual configuration management.
These changes dramatically reduced operational overhead, improved robustness, and allowed the system to scale to millions of QPS.
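The two core agent mechanisms above — identifying files by inode rather than by name, and checkpointing a read offset per inode — can be sketched in a few lines of Python. This is an illustrative toy, not the actual Venus‑Agent implementation; the names `OffsetStore` and `collect_new_lines` and the JSON checkpoint format are assumptions for the demo.

```python
import json
import os

# Sketch of Venus-Agent-style collection: each file is identified by its
# inode (so log rotation/renames do not lose track of it), and the read
# offset per inode is persisted, so collection resumes from the last
# position after a crash or restart. Names and formats are illustrative.

class OffsetStore:
    """Persists per-inode read offsets in a JSON checkpoint file."""

    def __init__(self, path):
        self.path = path
        try:
            with open(path) as f:
                # JSON object keys are strings; restore them as int inodes.
                self.offsets = {int(k): v for k, v in json.load(f).items()}
        except FileNotFoundError:
            self.offsets = {}

    def get(self, inode):
        return self.offsets.get(inode, 0)

    def set(self, inode, offset):
        self.offsets[inode] = offset
        with open(self.path, "w") as f:
            json.dump(self.offsets, f)


def collect_new_lines(log_path, store):
    """Read only the lines appended since the last checkpoint for this inode."""
    inode = os.stat(log_path).st_ino
    start = store.get(inode)
    with open(log_path, "rb") as f:
        f.seek(start)
        data = f.read()
    store.set(inode, start + len(data))   # checkpoint the new position
    return data.decode().splitlines()
```

A restart simply rebuilds the `OffsetStore` from the checkpoint file and continues where the previous run stopped, which is what lets the real agent survive network glitches or Kafka outages without re-reading or dropping data.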
Real‑time data processing layer: A dual‑Kafka architecture is used – the first layer receives raw logs from agents, the second layer delivers processed streams to downstream services. Apache Flink performs real‑time data cleaning, filtering, and enrichment. Flink job logic is stored in a backend management system and fetched by tasks through heartbeats, making logic updates instantaneous and centrally manageable.
The combined system now supports a peak of tens of millions of QPS and daily throughput of over three hundred billion records. It provides both raw and cleaned data, supports diverse data sources (including binlog and container logs), and forms a solid foundation for iQIYI’s AI and analytics initiatives.
In summary, iQIYI’s big‑data collection architecture has evolved from a simple batch‑oriented pipeline to a highly automated, real‑time, and scalable platform that underpins the company’s data‑driven products and AI strategies.
iQIYI Technical Product Team