
How AIOps Transforms Zero‑Downtime Operations at China Aviation Information

This talk explains how China Aviation Information applies practical AIOps techniques—such as automated configuration management, cluster analysis, fault prediction, anomaly detection, and event compression—to achieve near‑zero downtime in a complex, legacy‑heavy ticketing and travel system.


When we first approached AIOps, it seemed like a technology with a high barrier to entry, so we even recruited data scientists with deep-learning experience. In practice, AIOps success depends heavily on the scenario, and we focus on pragmatic, problem-oriented solutions for availability assurance and zero-downtime operations.

We divided the presentation into five parts:

Business characteristics and challenges

Non‑standard architecture

Fault prediction and identification

Rapid fault resolution

Our overall understanding of AIOps

1. Business Characteristics of China Aviation Information

China Aviation Information (CAI) provides the core ticketing and travel services for all domestic airlines. Its transaction volume is on the order of a major state bank's, roughly half the scale of the largest banking systems. The airline reservation system is highly complex, supporting multi-seat bookings, wheelchair requests, and other special cases.

CAI's legacy systems date back to the 1980s, making the platform one of the world's earliest paperless ticketing systems. Many services have been running for ten to fifteen years, and the architecture consists of numerous small clusters with intricate interdependencies.

Reliability is non-negotiable: a ten-minute delay at major airports can trigger public unrest, so CAI pursues essentially zero downtime for service availability, regardless of cost.

2. AIOps vs Non‑Standard Architecture

2.1 Architecture Overview

We often draw a cluster diagram to expose two key problems: single points of failure and asymmetric clusters that appear as a single cluster but are not. Unreasonable upstream/downstream dependencies also arise, especially when a seemingly minor system becomes a critical failure point.

2.2 Configuration Management

Our first step was to automate configuration delivery on a cloud platform, achieving 100% automated configuration updates. We built a discovery tool that automatically identifies connection pools, database dependencies, and other configuration items. However, the tool cannot discover resources that are not yet part of the managed inventory, such as newly introduced services or C++‑based components.
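The discovery idea can be sketched as a simple mapping from observed process connections to configuration items. This is a minimal illustration of the approach, not CAI's actual tool; all process names, hosts, and ports below are hypothetical.

```python
# Hypothetical illustration: derive configuration items (database
# dependencies, downstream services) from the connections each
# process is observed to hold.
def discover_config_items(processes):
    """Map observed (host, port) connections to typed configuration items."""
    items = []
    for proc in processes:
        for host, port in proc["connections"]:
            # common database ports; anything else is treated as a service
            kind = "database" if port in (1521, 3306, 5432) else "service"
            items.append({"owner": proc["name"],
                          "target": f"{host}:{port}",
                          "type": kind})
    return items

procs = [
    {"name": "ticketing-api", "connections": [("10.0.0.5", 3306), ("10.0.0.9", 8080)]},
    {"name": "report-batch", "connections": [("10.0.0.5", 3306)]},
]
inventory = discover_config_items(procs)
```

A real discovery tool would source the process and connection data from the hosts themselves, which is exactly why components outside the managed inventory stay invisible to it.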

2.3 Cluster Analysis

We model each server by extracting static and dynamic features (process names, ports, CPU load, etc.). After standardizing and dimensionality‑reducing the data, we apply clustering algorithms to separate servers into four typical behavior groups. The resulting clusters are then matched against a CMDB to automatically label dependencies.
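The pipeline above (feature extraction, standardization, clustering) can be sketched with a tiny deterministic k-means. This is an illustrative stand-in for whatever clustering algorithm runs in production, and the per-server feature vectors are invented.

```python
from statistics import mean, stdev

def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    stats = [(mean(c), stdev(c) or 1.0) for c in zip(*rows)]
    return [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]

def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point seeding."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # seed with the first point, then repeatedly the point farthest
    # from all chosen centers (a simple k-means++-style init)
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        centers = [[mean(col) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return [min(range(k), key=lambda i: d2(p, centers[i])) for p in points]

# hypothetical per-server features: [normalized CPU load, open connections]
servers = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
           [5.0, 5.1], [5.2, 4.9], [5.1, 5.0]]
labels = kmeans(standardize(servers), k=2)
```

Servers that land in the same cluster can then be matched against the CMDB in bulk instead of one at a time.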

In a pilot with 1,500 servers, the pipeline produced about 100 clusters in three days, a labeling job that would be infeasible to do manually at that scale.

3. AIOps vs Fault Prediction and Identification

3.1 Fault Prediction

We explored early fault prediction using historical metrics (e.g., disk health curves). The main challenge is the scarcity of negative samples; we synthesize negatives by interpolating between positive and random points. With this approach, we achieved around 85% prediction accuracy on network devices.
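The interpolation trick can be sketched as SMOTE-style augmentation: synthetic samples are drawn on the line segment between a scarce-class point and a randomly chosen reference point. This is a minimal sketch of the idea, and the feature vectors are hypothetical.

```python
import random

def synthesize(scarce, reference, n, seed=0):
    """Draw n synthetic samples on segments between a scarce-class
    point and a randomly chosen reference point (SMOTE-style)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = rng.choice(scarce)
        b = rng.choice(reference)
        t = rng.random()  # one interpolation factor per synthetic sample
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

# hypothetical 2-D feature vectors (e.g., summarized disk health metrics)
synthetic = synthesize([[0.0, 0.0]], [[1.0, 1.0]], n=50)
```

Each synthetic point lies strictly between its two parents, so the augmented set stays inside the region the real data spans.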

3.2 Anomaly Detection

Data volume is limited, so we first classify time series into distinct patterns (e.g., weekday vs. weekend load). By separating these patterns, dynamic baselines become more accurate. We also aggregate similar servers to increase sample size, but avoid mixing heterogeneous roles within a single cluster.
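A per-pattern dynamic baseline can be sketched as a mean/standard-deviation band computed separately for each traffic pattern. The weekday/weekend split, the sample values, and the 3-sigma threshold below are illustrative assumptions, not CAI's production settings.

```python
from statistics import mean, stdev

def build_baselines(samples):
    """samples: iterable of (pattern, value); returns {pattern: (mean, std)}."""
    groups = {}
    for pattern, value in samples:
        groups.setdefault(pattern, []).append(value)
    return {p: (mean(vs), stdev(vs)) for p, vs in groups.items()}

def is_anomaly(baselines, pattern, value, k=3.0):
    """Flag a point that falls outside its pattern's k-sigma band."""
    m, s = baselines[pattern]
    return abs(value - m) > k * s

# hypothetical load samples: weekday traffic is much heavier than weekend
samples = [("weekday", v) for v in (98, 101, 100, 99, 102)]
samples += [("weekend", v) for v in (39, 41, 40, 42, 38)]
baselines = build_baselines(samples)
```

The point of splitting first is visible here: a weekday-level load of 100 is normal on a weekday but a clear anomaly against the weekend baseline, which a single mixed baseline would blur.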

3.3 Event Compression

Our event platform compresses and filters millions of monitoring alerts. Rules combine cluster, business, and severity information to collapse redundant alerts during large‑scale incidents. For frequent low‑impact alerts, custom compression rules reduce noise without sacrificing critical information.
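One such rule can be sketched as grouping alerts by (cluster, check) and collapsing each group into a single representative that carries a count and the highest severity seen. Field names are hypothetical; production rules would also weigh time windows and business context.

```python
def compress(alerts):
    """Collapse alerts sharing (cluster, check) into one event with a count."""
    merged = {}
    for a in alerts:
        key = (a["cluster"], a["check"])
        if key in merged:
            m = merged[key]
            m["count"] += 1
            # keep the worst severity seen for the group
            m["severity"] = max(m["severity"], a["severity"])
        else:
            merged[key] = dict(a, count=1)
    return list(merged.values())

# hypothetical alert stream during an incident
alerts = [
    {"cluster": "web", "check": "cpu_high", "severity": 2},
    {"cluster": "web", "check": "cpu_high", "severity": 3},
    {"cluster": "db", "check": "disk", "severity": 1},
    {"cluster": "web", "check": "cpu_high", "severity": 2},
    {"cluster": "db", "check": "disk", "severity": 1},
]
compressed = compress(alerts)
```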

4. AIOps vs Rapid Fault Resolution

Following the 80/20 principle, we prioritize the small set of alert types responsible for most outages and automate their handling. Our fault-handling system applies predefined rules, which has reduced manual intervention to about 24% of incidents.
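The rule-driven handler can be sketched as a playbook table keyed by alert type; anything without a matching rule falls through to a manual queue, which is where the remaining ~24% of incidents land. The playbook actions below are placeholders, not CAI's real remediation steps.

```python
# hypothetical playbook table: alert type -> automated remediation
PLAYBOOKS = {
    "disk_full": lambda alert: f"cleaned tmp on {alert['host']}",
    "service_down": lambda alert: f"restarted {alert['service']} on {alert['host']}",
}

def handle(alert, manual_queue):
    """Run the matching playbook, or queue the alert for a human."""
    action = PLAYBOOKS.get(alert["type"])
    if action:
        return action(alert)
    manual_queue.append(alert)
    return None

manual_queue = []
auto_result = handle({"type": "disk_full", "host": "web-01"}, manual_queue)
manual_result = handle({"type": "unknown_alarm", "host": "web-02"}, manual_queue)
```

Keeping the table explicit makes the 80/20 split auditable: every automated action maps to a reviewed rule, and everything else is visibly human-owned.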

5. Our Understanding of AIOps

AIOps is a first‑step tool for operations, not a silver bullet. It works best when data is virtualized and shared across teams, allowing both data scientists and domain experts to experiment quickly. Expectations should be realistic: AIOps currently assists humans rather than replacing them, and its value grows as more data becomes available.

In summary, by automating configuration discovery, applying cluster analysis, building fault prediction models, and compressing events, CAI has created a practical AIOps pipeline that significantly improves reliability while keeping costs manageable.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
