Real-Time Log Monitoring and Alerting for iQIYI Membership Services
To support more than 100 million iQIYI members, the membership service team rebuilt its real-time log monitoring platform. A Venus-Agent gathers access, exception, Nginx, and front-end logs and streams them through Kafka to Spark Streaming and Flink; the resulting metrics are stored in Druid and drive minute-level host and business alerts. The team reports incident investigation that is more than 80% faster, more than 90 member-related complaints caught before impact, and more than 4,800 actionable alerts.
In June 2019 iQIYI’s membership base exceeded 100 million, leading to rapid growth in service traffic and a corresponding expansion of the machine cluster. The existing monitoring system showed limitations, prompting the membership service team to redesign the real‑time log monitoring architecture.
The new system collects four types of logs (access, exception, Nginx, and front-end delivery) through Venus-Agent, a custom collector based on Filebeat that is deployed on virtual machines. Logs are streamed to a Kafka cluster and then processed by Spark Streaming (micro-batch) and Flink (native streaming). The goals are exactly-once semantics and low latency; Flink's record-at-a-time model can reach millisecond-level latency, whereas a micro-batch job adds at least one batch interval. Processed metrics are stored in Druid, an OLAP-oriented real-time analytical database, which serves dashboards and alerting pipelines.
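The per-minute metrics that such a streaming job writes to Druid can be illustrated without a cluster. The following is a minimal sketch in plain Python, not the team's Spark or Flink code: it groups already-parsed events into one-minute tumbling windows per service. The event keys (ts, service, status, rt_ms) and the rule that a status of 500 or above counts as an error are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MinuteMetrics:
    """Aggregates for one (service, minute) bucket."""
    requests: int = 0
    errors: int = 0
    total_rt_ms: float = 0.0

    @property
    def success_rate(self) -> float:
        # An empty bucket is treated as fully successful.
        return 1.0 - self.errors / self.requests if self.requests else 1.0

    @property
    def avg_rt_ms(self) -> float:
        return self.total_rt_ms / self.requests if self.requests else 0.0


def aggregate_by_minute(events):
    """Group parsed log events into one-minute tumbling windows.

    Each event is a dict with hypothetical keys:
    ts (epoch seconds), service, status (HTTP code), rt_ms.
    """
    buckets = defaultdict(MinuteMetrics)
    for e in events:
        minute_start = int(e["ts"]) // 60 * 60  # floor to the minute boundary
        m = buckets[(e["service"], minute_start)]
        m.requests += 1
        m.total_rt_ms += e["rt_ms"]
        if e["status"] >= 500:  # assumption: only 5xx counts as a failure
            m.errors += 1
    return dict(buckets)
```

A real job would also have to handle late and out-of-order events (for example with watermarks in Flink), which this sketch ignores.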
Monitoring is divided into two layers: a basic layer that tracks host‑level metrics (CPU, memory, threads) via shell scripts, and an upper layer that evaluates business‑level indicators such as request success rate, response time, error codes, and traffic volume. Alerts are generated at minute granularity and can trigger automated mitigation actions (degradation, traffic switching, rate limiting).
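A minute-level alert check over both layers can be sketched as a list of threshold rules evaluated against one minute of metrics. This is an illustration only: the metric names, the threshold values, and the mapping from a breached rule to a mitigation action are all hypothetical, since the article does not give the team's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One threshold rule; names and limits are illustrative only."""
    metric: str     # key to look up in the per-minute metrics dict
    limit: float    # threshold value
    direction: str  # "below" fires when value < limit, otherwise value > limit
    action: str     # mitigation hint, e.g. "rate_limit"


def evaluate(metrics: dict, rules: list) -> list:
    """Return (metric, value, action) for each rule breached in this minute."""
    fired = []
    for r in rules:
        value = metrics.get(r.metric)
        if value is None:
            continue  # metric not reported this minute; skip rather than guess
        breached = value < r.limit if r.direction == "below" else value > r.limit
        if breached:
            fired.append((r.metric, value, r.action))
    return fired


# Hypothetical rules covering one business metric pair and one host metric.
RULES = [
    Rule("success_rate", 0.99, "below", "traffic_switch"),
    Rule("avg_rt_ms", 500.0, "above", "degrade"),
    Rule("cpu_pct", 90.0, "above", "rate_limit"),
]
```

Calling evaluate once per minute with the latest metrics returns the breached rules in rule order; a production system would usually add debouncing (for example, requiring several consecutive breached minutes) before triggering an action.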
Key functional modules include:
Nginx logs: capture network-level data (status codes, response time (RT), client IP, region) for real-time aggregation and fine-grained alerts.
Front‑end delivery logs: report client‑side performance (page load, API latency, static resource time) to identify region‑ or ISP‑specific issues.
Business access logs: monitor service status codes and extract error details for rapid fault isolation.
Business exception logs: surface runtime exceptions (e.g., ResourceAccessException, NullPointerException) with contextual information for quick triage.
Network operation data: analyze Nginx traffic peaks to guide capacity planning and improve machine utilization.
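To make the Nginx module concrete, here is a minimal parsing sketch. It assumes the standard "combined" access log format with $request_time (in seconds) appended as the last field; the article does not state the team's actual log_format, so the regular expression is an assumption. Region is not derived here because that would need a separate IP-to-region lookup.

```python
import re
from collections import Counter

# Assumed format: combined log format followed by $request_time in seconds.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "[^"]*" (?P<rt>\d+\.\d+)'
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "ip": m["ip"],
        "path": m["path"],
        "status": int(m["status"]),
        "rt_ms": float(m["rt"]) * 1000.0,  # seconds to milliseconds
    }


def status_class_counts(lines):
    """Count requests per status class (2xx, 4xx, 5xx, ...), skipping bad lines."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[f"{rec['status'] // 100}xx"] += 1
    return counts
```

Lines that fail to match are dropped silently in this sketch; in practice it is worth counting them, since a rising parse-failure rate usually means the log format changed.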
The team faced several challenges:
Data standardization: unifying log formats across virtual machines and QAE deployments.
Collection performance bottlenecks: Venus-Agent instances run with limited resources, so the team simplified parsing rules, sampled logs, and tuned cgroup limits carefully.
Resource cost optimization: Druid task scaling, partitioning, and traffic-splitting strategies saved roughly 120 CPU cores.
Spark Streaming / Flink latency: the team tuned Kafka partition and Druid task counts and increased Spark Streaming job concurrency to reduce batch-processing delays.
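Log sampling, mentioned above as one way to relieve the collection bottleneck, can be sketched as follows. The article does not describe the team's sampling rule, so this is one plausible approach under stated assumptions: error records are always kept, and the remaining records are sampled deterministically by hashing a request identifier. The field names status and request_id are hypothetical.

```python
import zlib


def keep_record(record: dict, sample_rate: float) -> bool:
    """Always keep error records; keep a deterministic fraction of the rest.

    record uses hypothetical keys: status (HTTP code) and request_id (str).
    sample_rate is the fraction of non-error records to keep, from 0.0 to 1.0.
    """
    if record["status"] >= 400:
        return True
    # crc32 is stable across processes, unlike the built-in hash() on str,
    # so every agent makes the same keep/drop decision for a given request.
    bucket = zlib.crc32(record["request_id"].encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)
```

Hashing the request ID rather than drawing a random number means all log lines for one request are kept or dropped together, which keeps sampled traces complete. Downstream counts of non-error traffic then need to be scaled by 1 / sample_rate.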
Results include an 80%+ improvement in incident investigation efficiency, detection of over 90 member‑related complaints before impact, coverage of more than 400 exception types, and delivery of 4,800+ actionable alerts. Future work will focus on intelligent threshold management, traffic‑prediction models, and automated root‑cause analysis to further enhance reliability.
iQIYI Technical Product Team