Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus, and CAT, and selected CAT. The team deployed separate mainland and overseas clusters and added configurable client access, a health‑check module, and integrated alert channels. The result is five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

iQIYI Technical Product Team

Mar 12, 2021

System monitoring is a fundamental requirement for ensuring the integrity and reliability of large‑scale microservice projects. iQIYI’s "LEDAO" content platform, which manages video, audio, subtitles, and images, has grown to more than 100 microservices, making the existing monitoring approach insufficient.

The platform identified several urgent monitoring needs: rapid detection and localization of anomalies, observable deployment processes with immediate rollback, comprehensive performance metrics for services, interfaces and databases, stable health‑status monitoring of services, timely detection of host/container resource changes, and clear business‑level monitoring.

Based on these requirements, three open‑source monitoring solutions were evaluated:

1. OpenFalcon – easy to integrate, low intrusion, but focuses mainly on host metrics and lacks deep system monitoring.

2. Prometheus – pull‑based data collection, powerful PromQL query engine, but requires additional components (e.g., Grafana) and has a steep learning curve.

3. CAT (by Meituan‑Dianping) – comprehensive real‑time monitoring and alerting, but higher integration cost and more intrusive.

Considering the need for comprehensive features, stability, mature UI, reporting, and alerting, the team selected CAT as the monitoring backbone for the platform.

Deployment and iteration: CAT was first deployed on a few VMs as a pilot, then upgraded and adapted for the LEDAO platform. Separate clusters serve mainland China and overseas; the mainland cluster covers more than 100 microservices, handles >10k TPS, and processes ~1.5 TB of data daily.

Key enhancements:

1. Access method refactoring – introduced three configuration options (XML file, environment variables, properties file) to decouple client configuration from host machines, reducing operational overhead.

2. Health‑check module – added a component that periodically probes services from multiple data centers; repeated failures trigger alerts, filling a gap where CAT could not detect client‑side crashes.

3. Alerting integration – merged CAT’s native alerts with iQIYI’s internal notification channels (email, instant messaging, SMS), improving alert reachability and response speed.
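
The health‑check idea in item 2 (probe periodically, alert on repeated failures) can be illustrated with a minimal sketch. This is not iQIYI's implementation: the class name `HealthCheckMonitor`, the `failureThreshold` parameter, and the `probe` function are all hypothetical, and the sketch leaves out the multi‑data‑center probing and the scheduling itself. It only shows the counting logic, assuming an alert should fire once when a service's failure streak first reaches the threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Tracks consecutive probe failures per service and records an alert when a threshold is reached. */
final class HealthCheckMonitor {
    private final int failureThreshold;       // hypothetical: failures in a row before alerting
    private final Predicate<String> probe;    // hypothetical: returns true if the service answered
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();
    private final List<String> alerts = new ArrayList<>();

    HealthCheckMonitor(int failureThreshold, Predicate<String> probe) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
        this.probe = probe;
    }

    /** Runs one probe for a service; returns true if this probe raised an alert. */
    boolean check(String service) {
        if (probe.test(service)) {
            consecutiveFailures.remove(service);   // any success resets the streak
            return false;
        }
        int failures = consecutiveFailures.merge(service, 1, Integer::sum);
        if (failures == failureThreshold) {        // alert once, when the streak first hits the threshold
            alerts.add(service + " failed " + failures + " consecutive probes");
            return true;
        }
        return false;
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        // Simulated probe: "svc-a" is down, every other service is healthy.
        HealthCheckMonitor monitor = new HealthCheckMonitor(3, s -> !s.equals("svc-a"));
        for (int round = 0; round < 5; round++) {
            monitor.check("svc-a");
            monitor.check("svc-b");
        }
        System.out.println(monitor.alerts());
    }
}
```

Running `main` prints `[svc-a failed 3 consecutive probes]`. In a real deployment `check` would be driven by a scheduler (for example a `ScheduledExecutorService`) and the probe would be an actual network call; both are left out here to keep the counting logic visible.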

Practice outcomes:

• New services can be onboarded to CAT within five minutes.

• The proxy‑based client library enables near‑zero‑intrusion instrumentation.

• Full‑stack observability now covers hardware metrics, service health, exceptions, performance, and business‑level KPIs.

• Robust alert configurations provide real‑time notifications, enabling rapid issue resolution.
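
To make "proxy‑based, near‑zero‑intrusion instrumentation" concrete, the following sketch wraps a service interface in a JDK dynamic proxy so that every call is timed and reported without touching the business code. It is an illustration of the general technique, not the LEDAO client library: `InstrumentingProxy`, `CallRecorder`, and `VideoService` are hypothetical names, and a real client would forward the records to CAT rather than to an in‑memory list.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class InstrumentingProxy {
    /** Hypothetical sink; a real client would hand these records to the monitoring backend. */
    interface CallRecorder {
        void record(String name, long elapsedNanos, boolean success);
    }

    /** Hypothetical business interface, used only for the demo. */
    interface VideoService {
        String title(long id);
    }

    /** Wraps target so that every call made through iface is timed and reported. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target, CallRecorder recorder) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = iface.getSimpleName() + "." + method.getName();
            long start = System.nanoTime();
            boolean success = false;
            try {
                Object result = method.invoke(target, args);
                success = true;
                return result;
            } catch (InvocationTargetException e) {
                throw e.getCause();   // rethrow the business exception unchanged
            } finally {
                recorder.record(name, System.nanoTime() - start, success);
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        VideoService plain = id -> "video-" + id;   // uninstrumented business code
        VideoService monitored = wrap(VideoService.class, plain,
                (name, nanos, ok) -> log.add(name + " ok=" + ok));
        System.out.println(monitored.title(7));
        System.out.println(log);
    }
}
```

Running `main` prints `video-7` followed by `[VideoService.title ok=true]`. The caller only swaps `plain` for `monitored`; the service implementation itself is unchanged, which is what keeps the intrusion low.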

The implementation demonstrates a complete monitoring pipeline: quick business onboarding → automated instrumentation → alert configuration → alert delivery → incident handling, effectively closing the monitoring gap for the LEDAO platform.
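
The alert‑delivery step of this pipeline, which goes out over email, instant messaging, and SMS, can be sketched as a simple fan‑out. The names `AlertDispatcher` and `Channel` are hypothetical, and the senders below are stand‑ins for iQIYI's internal gateways, whose APIs the article does not describe. The one design assumption shown is that a failing channel should not prevent delivery on the others.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AlertDispatcher {
    /** Hypothetical channel abstraction; real senders would call the email, IM, or SMS gateways. */
    interface Channel {
        void send(String recipient, String message) throws Exception;
    }

    // LinkedHashMap keeps registration order, so channels are tried in a predictable sequence.
    private final Map<String, Channel> channels = new LinkedHashMap<>();

    AlertDispatcher register(String name, Channel channel) {
        channels.put(name, channel);
        return this;
    }

    /** Sends the alert on every registered channel; returns the names of channels that failed. */
    List<String> dispatch(String recipient, String message) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Channel> entry : channels.entrySet()) {
            try {
                entry.getValue().send(recipient, message);
            } catch (Exception e) {
                failed.add(entry.getKey());   // one broken channel must not block the others
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        AlertDispatcher dispatcher = new AlertDispatcher()
                .register("email", (to, msg) -> System.out.println("email to " + to + ": " + msg))
                .register("sms", (to, msg) -> { throw new IllegalStateException("gateway down"); })
                .register("im", (to, msg) -> System.out.println("im to " + to + ": " + msg));
        List<String> failed = dispatcher.dispatch("on-call", "svc-a error rate above threshold");
        System.out.println("failed channels: " + failed);
    }
}
```

Returning the failed channel names lets the caller decide whether to retry or escalate, rather than silently losing an alert when one gateway is down.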

Future work includes support for distributed transactions, richer business dashboards, and deeper integration with service discovery (e.g., Nacos) to further enhance the observability ecosystem.
