How AIOps Transforms Zero‑Downtime Operations at China Aviation Information
This talk explains how China Aviation Information applies practical AIOps techniques—such as automated configuration management, cluster analysis, fault prediction, anomaly detection, and event compression—to achieve near‑zero downtime in a complex, legacy‑heavy ticketing and travel system.
When we first approached AIOps, it seemed like a technology with a high barrier to entry, so we even recruited data scientists with deep-learning experience. In practice, AIOps success depends heavily on the scenario, so we focus on pragmatic, problem-oriented solutions for availability assurance and zero-downtime operations.
We divided the presentation into five parts:
Business characteristics and challenges
Non‑standard architecture
Fault prediction and identification
Rapid fault resolution
Our overall understanding of AIOps
1. Business Characteristics of China Aviation Information
China Aviation Information (CAI) provides the core ticketing and travel services for all domestic airlines. Its transaction volume is comparable to that of a major state bank, at roughly half the scale of the largest banking systems. The airline reservation system is highly complex, supporting multi‑seat bookings, wheelchair requests, and other special cases.
CAI’s legacy systems date back to the 1980s, making the platform one of the world’s earliest paper‑less ticketing systems. Many services have been running for ten to fifteen years, and the architecture consists of numerous small clusters with intricate inter‑dependencies.
Reliability is non‑negotiable: a ten‑minute delay at major airports can trigger public unrest, so CAI enforces a zero‑tolerance policy toward service unavailability, regardless of cost.
2. AIOps vs Non‑Standard Architecture
2.1 Architecture Overview
We often draw a cluster diagram to expose two key problems: single points of failure and asymmetric clusters that appear as a single cluster but are not. Unreasonable upstream/downstream dependencies also arise, especially when a seemingly minor system becomes a critical failure point.
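One simple, automatable check behind such a diagram is flagging depended‑on clusters that contain only a single host. The sketch below illustrates the idea; the data shapes (`clusters`, `dependencies`) and the function name are hypothetical, not CAI's actual tooling.

```python
def find_spofs(clusters, dependencies):
    """Flag single points of failure in a service topology.

    clusters:     {cluster_name: [hosts]} -- which hosts back each cluster.
    dependencies: {consumer: [cluster_name, ...]} -- who depends on what.
    A cluster that something depends on, but that has at most one host,
    has no redundancy and is flagged as a SPOF.
    """
    depended_on = {c for targets in dependencies.values() for c in targets}
    return sorted(c for c in depended_on if len(clusters.get(c, [])) <= 1)
```

In practice a real check would also consider asymmetric clusters, where hosts nominally back the same service but differ in configuration or capacity.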
2.2 Configuration Management
Our first step was to automate configuration delivery on a cloud platform, achieving 100% automated configuration updates. We built a discovery tool that automatically identifies connection pools, database dependencies, and other configuration items. However, the tool cannot discover resources that are not yet part of the managed inventory, such as newly introduced services or C++‑based components.
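To make the discovery idea concrete, here is a minimal sketch that scans configuration files for database connection strings. The JDBC‑style regex, the `.properties` file format, and the function name are illustrative assumptions, not the actual discovery tool.

```python
import re
from pathlib import Path

# Hypothetical pattern: jdbc:<driver>://<host>:<port>/<database>
# A real discovery tool would cover many more config formats.
JDBC_RE = re.compile(r"jdbc:(\w+)://([\w.\-]+):(\d+)/(\w+)")

def discover_db_dependencies(config_root):
    """Scan config files under config_root and list discovered DB endpoints."""
    found = []
    for path in Path(config_root).rglob("*.properties"):
        text = path.read_text(errors="ignore")
        for driver, host, port, db in JDBC_RE.findall(text):
            found.append({"file": str(path), "driver": driver,
                          "host": host, "port": int(port), "database": db})
    return found
```

This file‑scanning approach shares the limitation noted above: it can only see what is already in the managed inventory, so services that keep their endpoints elsewhere (or in compiled C++ components) stay invisible.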
2.3 Cluster Analysis
We model each server by extracting static and dynamic features (process names, ports, CPU load, etc.). After standardizing and dimensionality‑reducing the data, we apply clustering algorithms to separate servers into four typical behavior groups. The resulting clusters are then matched against a CMDB to automatically label dependencies.
In a pilot with 1,500 servers, the pipeline produced about 100 clusters in three days, a task that would have been infeasible with manual labeling at that scale.
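The standardize‑then‑cluster pipeline can be sketched as follows. This is a simplified stand‑in for the actual system: the feature matrix, the deterministic k‑means initialization, and the function names are all assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling per feature (column)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid division by zero
    return (X - mu) / sigma

def kmeans(X, k, iters=50):
    """Plain k-means with a deterministic, evenly spaced init (fine for a sketch)."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centroids = X[idx].astype(float).copy()
    for _ in range(iters):
        # Distance from every point to every centroid, then nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Each row of `X` would hold one server's extracted features (CPU load, port counts, process fingerprints, and so on); the resulting labels are what gets matched against the CMDB.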
3. AIOps vs Fault Prediction and Identification
3.1 Fault Prediction
We explored early fault prediction using historical metrics (e.g., disk health curves). The main challenge is the scarcity of negative samples; we synthesize negatives by interpolating between positive and random points. With this approach, we achieved around 85% prediction accuracy on network devices.
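The interpolation idea can be sketched in a few lines. The function name, the interpolation weights, and the feature‑vector representation are illustrative assumptions; only the core idea, blending a known sample with a random one to synthesize scarce training samples, comes from the text above.

```python
import random

def synthesize_samples(known, randoms, n, alpha_range=(0.3, 0.7), seed=0):
    """Synthesize n feature vectors by linearly interpolating between a
    known sample and a randomly drawn one, to pad out a scarce class."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        p = rng.choice(known)       # a sample from the scarce class
        r = rng.choice(randoms)     # a random point from the rest of the data
        a = rng.uniform(*alpha_range)
        out.append([a * pi + (1 - a) * ri for pi, ri in zip(p, r)])
    return out
```

The synthetic vectors land between the two source points, so they stay plausible while enlarging the scarce class enough to train a classifier.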
3.2 Anomaly Detection
Data volume is limited, so we first classify time series into distinct patterns (e.g., weekday vs. weekend load). By separating these patterns, dynamic baselines become more accurate. We also aggregate similar servers to increase sample size, but avoid mixing heterogeneous roles within a single cluster.
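A minimal sketch of pattern‑separated dynamic baselines: history is bucketed by (pattern, hour), and a point is anomalous if it falls outside mean ± k·σ of its own bucket. The bucketing key and the 3σ threshold are illustrative choices, not CAI's actual parameters.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baselines(samples):
    """samples: iterable of (is_weekend, hour, value).
    Returns {(is_weekend, hour): (mean, std)} dynamic baselines,
    so weekday and weekend patterns never share a bucket."""
    buckets = defaultdict(list)
    for is_weekend, hour, value in samples:
        buckets[(is_weekend, hour)].append(value)
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for k, v in buckets.items()}

def is_anomalous(baselines, is_weekend, hour, value, k=3.0):
    """Flag a value more than k standard deviations from its pattern's mean."""
    mu, sigma = baselines[(is_weekend, hour)]
    return abs(value - mu) > k * max(sigma, 1e-9)
```

Aggregating similar servers (from the cluster analysis above) into the same bucket increases the sample size per bucket, which is exactly why mixing heterogeneous roles would poison the baseline.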
3.3 Event Compression
Our event platform compresses and filters millions of monitoring alerts. Rules combine cluster, business, and severity information to collapse redundant alerts during large‑scale incidents. For frequent low‑impact alerts, custom compression rules reduce noise without sacrificing critical information.
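A toy version of such a compression rule might collapse alerts that share cluster, business, and severity within a time window into one representative event with a repeat count. The tuple key and the window length are assumptions for illustration.

```python
def compress_events(alerts, window=300):
    """alerts: iterable of (ts, cluster, business, severity, message).
    Alerts sharing (cluster, business, severity) within `window` seconds
    collapse into one event carrying a repeat count."""
    out = []
    open_events = {}  # key -> index into out of the currently open event
    for ts, cluster, business, severity, message in sorted(alerts):
        key = (cluster, business, severity)
        i = open_events.get(key)
        if i is not None and ts - out[i]["first_ts"] <= window:
            out[i]["count"] += 1      # redundant repeat: just bump the count
            out[i]["last_ts"] = ts
        else:
            open_events[key] = len(out)
            out.append({"key": key, "first_ts": ts, "last_ts": ts,
                        "count": 1, "message": message})
    return out
```

During a large‑scale incident, thousands of identical alerts from one cluster reduce to a single event with a high count, which preserves the signal while removing the noise.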
4. AIOps vs Rapid Fault Resolution
Following the 80/20 principle, we prioritize the alerts responsible for roughly 80% of outages and automate their handling. With a fault‑handling system that applies predefined rules, manual intervention is now required in only about 24% of incidents.
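The rule‑driven dispatch can be sketched as a simple table mapping alert types to automated actions, with everything unmatched escalated to a human. The rule names and actions below are hypothetical; real rules would invoke runbooks or orchestration APIs.

```python
# Hypothetical rule table: alert type -> automated remediation.
RULES = {
    "disk_full": lambda alert: f"cleaned logs on {alert['host']}",
    "service_down": lambda alert: f"restarted {alert['service']} on {alert['host']}",
}

def handle(alert):
    """Apply a predefined rule if one matches; otherwise escalate to a human."""
    action = RULES.get(alert["type"])
    if action is None:
        return ("manual", None)       # the ~24% that still needs people
    return ("auto", action(alert))    # the high-frequency, well-understood cases
```

The point of the table structure is that adding automation for a newly understood failure mode is a one‑line change, so coverage grows incident by incident.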
5. Our Understanding of AIOps
AIOps is a first‑step tool for operations, not a silver bullet. It works best when data is virtualized and shared across teams, allowing both data scientists and domain experts to experiment quickly. Expectations should be realistic: AIOps currently assists humans rather than replacing them, and its value grows as more data becomes available.
In summary, by automating configuration discovery, applying cluster analysis, building fault prediction models, and compressing events, CAI has created a practical AIOps pipeline that significantly improves reliability while keeping costs manageable.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you throughout your operations career so we can grow together.