Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform
To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus, and CAT, and selected CAT. The team deployed separate mainland and overseas clusters and added configurable client access, a health‑check module, and integrated alert channels. The result is five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.
System monitoring is a fundamental requirement for ensuring the integrity and reliability of large‑scale microservice projects. iQIYI’s "LEDAO" content platform, which manages video, audio, subtitles, and images, has grown to more than 100 microservices, making the existing monitoring approach insufficient.
The platform identified several urgent monitoring needs: rapid detection and localization of anomalies, observable deployment processes with immediate rollback, comprehensive performance metrics for services, interfaces and databases, stable health‑status monitoring of services, timely detection of host/container resource changes, and clear business‑level monitoring.
Based on these requirements, three open‑source monitoring solutions were evaluated:
1. OpenFalcon – easy to integrate, low intrusion, but focuses mainly on host metrics and lacks deep system monitoring.
2. Prometheus – pull‑based data collection, powerful PromQL query engine, but requires additional components (e.g., Grafana) and has a steep learning curve.
3. CAT (by Meituan‑Dianping) – comprehensive real‑time monitoring and alerting, but higher integration cost and more intrusive.
Considering the need for comprehensive features, stability, mature UI, reporting, and alerting, the team selected CAT as the monitoring backbone for the platform.
Deployment and iteration: CAT was initially deployed on a few VMs for pilot testing, then upgraded and adapted for the LEDAO platform. Separate clusters were set up for overseas and mainland China; the mainland cluster serves over 100 microservices, handles more than 10,000 TPS, and processes about 1.5 TB of data daily.
Key enhancements:
1. Access method refactoring – introduced three configuration options (XML file, environment variables, properties file) to decouple client configuration from host machines, reducing operational overhead.
2. Health‑check module – added a component that periodically probes services from multiple data centers; repeated failures trigger alerts, filling a gap where CAT could not detect client‑side crashes.
3. Alerting integration – merged CAT’s native alerts with iQIYI’s internal notification channels (email, instant messaging, SMS), improving alert reachability and response speed.
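The health‑check behavior described in item 2 (repeated failures trigger alerts) can be illustrated with a minimal sketch. It is not iQIYI's implementation: the class name `HealthChecker`, the `failureThreshold` parameter, and the choice to alert only once when the threshold is first reached are assumptions made for the example. The probe is passed in as a function so that the logic stays independent of any network call, and probing from multiple data centers is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/**
 * Hypothetical sketch: counts consecutive probe failures per service and
 * signals an alert when the count first reaches a configured threshold.
 */
class HealthChecker {
    private final int failureThreshold;
    // Assumed probe contract: returns true when the service responds as healthy.
    private final Predicate<String> probe;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    HealthChecker(int failureThreshold, Predicate<String> probe) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
        this.probe = probe;
    }

    /**
     * Runs one probe for the given service.
     * Returns true only on the probe that brings the consecutive-failure
     * count up to the threshold, so a long outage yields a single alert.
     */
    boolean checkOnce(String service) {
        if (probe.test(service)) {
            // A successful probe resets the failure streak.
            consecutiveFailures.remove(service);
            return false;
        }
        int failures = consecutiveFailures.merge(service, 1, Integer::sum);
        return failures == failureThreshold;
    }
}
```

In a running system, `checkOnce` would typically be driven on a fixed interval (for example by a `ScheduledExecutorService`), and a `true` result would be handed to whatever notification channel is configured.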
Practice outcomes:
• New services can be onboarded to CAT within five minutes.
• The proxy‑based client library enables near‑zero‑intrusion instrumentation.
• Full‑stack observability now covers hardware metrics, service health, exceptions, performance, and business‑level KPIs.
• Robust alert configurations provide real‑time notifications, enabling rapid issue resolution.
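To make the idea of proxy‑based, near‑zero‑intrusion instrumentation concrete, here is a small sketch using the JDK's dynamic proxy (`java.lang.reflect.Proxy`). It is an illustration only, not the LEDAO client library: `CallRecorder`, `MonitoringProxy`, and `Greeter` are hypothetical names, and a real client would report to the monitoring backend rather than to an in‑memory recorder. The point is that the business interface and its implementation carry no monitoring code; timing and success or failure are captured by the wrapper.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Hypothetical sink for call records (stands in for a monitoring client). */
interface CallRecorder {
    void record(String name, long elapsedNanos, boolean success);
}

/** Example business interface; it contains no monitoring code. */
interface Greeter {
    String greet(String name);
}

/** Wraps any interface implementation and records each call's duration and outcome. */
final class MonitoringProxy implements InvocationHandler {
    private final Object target;
    private final CallRecorder recorder;

    private MonitoringProxy(Object target, CallRecorder recorder) {
        this.target = target;
        this.recorder = recorder;
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target, CallRecorder recorder) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                new MonitoringProxy(target, recorder));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        long start = System.nanoTime();
        boolean success = false;
        try {
            Object result = method.invoke(target, args);
            success = true;
            return result;
        } catch (InvocationTargetException e) {
            // Rethrow the exception raised by the wrapped method itself.
            throw e.getCause();
        } finally {
            String name = method.getDeclaringClass().getSimpleName() + "." + method.getName();
            recorder.record(name, System.nanoTime() - start, success);
        }
    }
}
```

A caller would obtain the instrumented object with something like `Greeter g = MonitoringProxy.wrap(Greeter.class, realGreeter, recorder);` and use `g` exactly as it would use the original implementation.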
The implementation demonstrates a complete monitoring pipeline: quick business onboarding → automated instrumentation → alert configuration → alert delivery → incident handling, effectively closing the monitoring gap for the LEDAO platform.
Future work includes support for distributed transactions, richer business dashboards, and deeper integration with service discovery (e.g., Nacos) to further enhance the observability ecosystem.
iQIYI Technical Product Team