
Applying AIOps for Zero‑Downtime Operations at China Aviation Information

The talk by chief architect Luo Hao explains how China Aviation Information tackles heavy legacy systems, non‑standard architectures, and zero‑downtime requirements by using AIOps techniques such as automated configuration discovery, cluster analysis, fault prediction, anomaly detection, event compression and rapid root‑cause automation.

Qunar Tech Salon

1. Business Characteristics of China Aviation Information

China Aviation Information (CAI) provides the core ticketing and flight‑information services for all domestic airlines in China, handling transaction volumes comparable to a major state bank. Its systems are large, heterogeneous, and contain many legacy components accumulated since the 1980s.

Key constraints include strict zero‑downtime requirements (e.g., passengers must still be able to board in the final minutes before a flight departs) and a mix of modern micro‑service workloads and decades‑old monolithic applications.

2. AIOps vs. Non‑Standard Architecture

2.1 Current Architecture Issues

Unstandardized cluster deployments and inconsistent upstream/downstream dependencies.

Heavy legacy baggage that makes systematic refactoring difficult.

Frequent configuration‑driven failures caused by inconsistent updates across nodes.

2.2 Configuration Management

CAI built an automated configuration‑delivery platform that discovers and synchronises configuration items (e.g., connection pools, database dependencies) across clusters. Limitations remain for newly introduced services and for C++‑based components where automatic discovery is hard.
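The drift problem behind this platform (nodes in one cluster ending up with different values for the same configuration item) can be illustrated with a minimal sketch. It is not CAI's implementation: the function name `find_config_drift`, the node names, and the configuration keys are all hypothetical, and the "majority value wins" rule is an assumption made for the example.

```python
from collections import Counter


def find_config_drift(node_configs):
    """Report configuration keys whose value differs between nodes.

    node_configs maps a node name to a {key: value} dict (hypothetical shape).
    For each inconsistent key, the most common value is treated as the
    expected one, and every node holding a different (or missing) value is
    listed as drifted.
    """
    all_keys = set()
    for cfg in node_configs.values():
        all_keys.update(cfg)

    report = {}
    for key in sorted(all_keys):
        # A node that lacks the key is recorded with the value None.
        values = {node: cfg.get(key) for node, cfg in node_configs.items()}
        counts = Counter(values.values())
        if len(counts) == 1:
            continue  # every node agrees on this key
        expected, _ = counts.most_common(1)[0]
        drifted = {n: v for n, v in values.items() if v != expected}
        report[key] = {"expected": expected, "drifted": drifted}
    return report


cluster = {
    "node-a": {"db.pool.max": "50", "db.host": "db1"},
    "node-b": {"db.pool.max": "50", "db.host": "db1"},
    "node-c": {"db.pool.max": "20", "db.host": "db1"},
}
print(find_config_drift(cluster))
# {'db.pool.max': {'expected': '50', 'drifted': {'node-c': '20'}}}
```

A majority vote is only a heuristic: when most nodes carry the stale value, the report points at the wrong side, so a real platform would more likely compare against an authoritative source of truth.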

2.3 Cluster Analysis

Data from each server (process names, ports, connections) is collected, cleaned, normalized, and reduced to feature vectors. Simple dimensionality reduction followed by clustering reveals distinct behavior groups (stable, slightly fluctuating, highly volatile, etc.). Manual labeling is infeasible for thousands of servers, so CAI leverages CMDB data to auto‑label clusters and infer dependency relationships.
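As a rough illustration of grouping servers by what runs on them, the sketch below represents each server as a set of `process:port` features and groups servers greedily by Jaccard similarity. This is a deliberately simple stand-in: the talk does not say which dimensionality-reduction or clustering algorithm CAI uses, and the function names, the 0.6 threshold, and the sample servers are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_servers(features, threshold=0.6):
    """Greedily group servers whose feature sets are similar.

    features maps a server name to a set of "process:port" strings.
    Each server joins the first group whose representative (the first
    member's feature set) is at least `threshold` similar; otherwise it
    starts a new group. Servers are visited in name order so the result
    is deterministic.
    """
    groups = []  # list of (representative feature set, member names)
    for name in sorted(features):
        feats = features[name]
        for rep, members in groups:
            if jaccard(feats, rep) >= threshold:
                members.append(name)
                break
        else:
            groups.append((feats, [name]))
    return [members for _, members in groups]


servers = {
    "web-1": {"nginx:80", "nginx:443"},
    "web-2": {"nginx:80", "nginx:443", "sshd:22"},
    "db-1": {"oracle:1521", "sshd:22"},
}
print(group_servers(servers))
# [['db-1'], ['web-1', 'web-2']]
```

Group labels could then be looked up from CMDB records for any one member, which is the kind of auto-labeling the paragraph above describes.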

3. AIOps for Fault Prediction and Anomaly Detection

3.1 Fault Prediction

Using historical metrics (e.g., disk health curves), CAI creates synthetic negative samples to balance the training data, then trains models that reach around 85% accuracy on network‑device failures. Features include device vendor, model, and position in the network topology.
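The talk does not describe how the synthetic samples are generated. One common way to balance a rare class is to interpolate between existing samples of that class (the idea behind SMOTE); the sketch below shows that technique in its simplest form, picking random pairs rather than nearest neighbours. The function name and the two-feature sample data are hypothetical.

```python
import random


def oversample_by_interpolation(samples, target_count, seed=0):
    """Grow a list of feature vectors to target_count by interpolation.

    Each synthetic vector lies on the line segment between two randomly
    chosen original vectors. The originals are kept, in order, at the
    front of the result.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)  # seeded for reproducibility
    out = [list(s) for s in samples]
    while len(out) < target_count:
        a, b = rng.sample(samples, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out


# Hypothetical rare-class vectors: [error rate, temperature]
rare = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
balanced = oversample_by_interpolation(rare, target_count=10)
print(len(balanced))  # 10
```

Interpolated points never leave the range spanned by the originals, so this balances class counts without inventing feature values outside what was observed.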

3.2 Anomaly Detection

Data is first segmented by patterns (workday vs. weekend, holiday vs. non‑holiday) to improve baseline stability. Anomalies are detected within each homogeneous segment, avoiding the dilution caused by mixing dissimilar workloads.
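A minimal sketch of per-segment baselining follows. It assumes daily metric values, a three-way split into workday, weekend, and holiday, and a z-score test; the talk confirms only the segmentation idea, so the detector, the threshold of 3, and the sample numbers are illustrative assumptions.

```python
from datetime import date
from statistics import mean, pstdev


def segment_of(day, holidays=frozenset()):
    """Classify a date as 'holiday', 'weekend', or 'workday'."""
    if day in holidays:
        return "holiday"
    return "weekend" if day.weekday() >= 5 else "workday"


def detect_anomalies(history, observations, holidays=frozenset(), z_threshold=3.0):
    """Flag observations that deviate from their own segment's baseline.

    history and observations are lists of (date, value) pairs. A baseline
    (mean, population standard deviation) is built per segment from
    history; an observation is anomalous when its z-score within its
    segment exceeds z_threshold. Observations whose segment has no
    history, or a zero-variance baseline, are skipped.
    """
    by_segment = {}
    for day, value in history:
        by_segment.setdefault(segment_of(day, holidays), []).append(value)
    baselines = {seg: (mean(v), pstdev(v)) for seg, v in by_segment.items()}

    anomalies = []
    for day, value in observations:
        seg = segment_of(day, holidays)
        if seg not in baselines:
            continue
        mu, sigma = baselines[seg]
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            anomalies.append((day, value, seg))
    return anomalies


history = [
    (date(2024, 1, 1), 990), (date(2024, 1, 2), 1000), (date(2024, 1, 3), 1010),
    (date(2024, 1, 4), 1000), (date(2024, 1, 5), 1005),   # Mon to Fri
    (date(2024, 1, 6), 290), (date(2024, 1, 7), 310),     # Sat, Sun
    (date(2024, 1, 13), 300), (date(2024, 1, 14), 305),   # Sat, Sun
]
new = [(date(2024, 1, 20), 1000), (date(2024, 1, 22), 1000)]  # Sat, Mon
print(detect_anomalies(history, new))
# [(datetime.date(2024, 1, 20), 1000, 'weekend')]
```

The same value of 1000 is normal on a Monday and anomalous on a Saturday. A single baseline pooled over all days would have a much wider spread and would be far less sensitive to the Saturday spike.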

3.3 Event Compression

CAI’s event platform aggregates millions of raw alerts into higher‑level incidents using rule‑based compression and frequent‑itemset mining. Rules are refined by removing redundant subsets and adjusting support thresholds to keep only truly informative patterns.
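The two steps named above (mining alert types that frequently fire together, then discarding patterns that a larger pattern already explains) can be sketched as follows. This is a brute-force illustration suitable only for small inputs, not CAI's platform: the alert names, the time-window grouping, and the rule "a subset is redundant when a superset has the same support" are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(windows, min_support, max_size=3):
    """Count alert-type combinations that co-occur in at least min_support windows.

    windows is a list of alert-type lists, one per time window. Every
    combination of 2..max_size distinct alert types is counted once per
    window in which it appears.
    """
    counts = Counter()
    for window in windows:
        items = sorted(set(window))
        for size in range(2, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def drop_redundant_subsets(itemsets):
    """Keep only itemsets with no strict superset of identical support."""
    kept = {}
    for s, c in itemsets.items():
        redundant = any(s < other and c == oc for other, oc in itemsets.items())
        if not redundant:
            kept[s] = c
    return kept


windows = [
    ["link_down", "bgp_flap", "app_timeout"],
    ["link_down", "bgp_flap", "app_timeout"],
    ["link_down", "bgp_flap", "app_timeout"],
    ["disk_full", "app_timeout"],
    ["link_down", "bgp_flap"],
]
patterns = drop_redundant_subsets(frequent_itemsets(windows, min_support=3))
for itemset, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
# ['bgp_flap', 'link_down'] 4
# ['app_timeout', 'bgp_flap', 'link_down'] 3
```

Raising `min_support` trades recall for precision: fewer, more reliable patterns survive, which matches the threshold tuning the paragraph describes. A production system would use an algorithm such as Apriori or FP-Growth instead of enumerating every combination.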

4. Rapid Fault Resolution

By analysing the top 20% of alerts that cause 80% of downtime, CAI built an automated remediation system that handles the majority of recurring incidents without human intervention, achieving a 76% reduction in manual fault handling.
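The 80/20 selection and the hand-off to automation can be sketched as below. Both functions, the downtime figures, and the playbook are hypothetical; the talk gives the principle and the 76% result but not the mechanics.

```python
def pareto_alert_types(downtime_by_type, coverage=0.8):
    """Return the alert types that together reach `coverage` of total downtime.

    Types are taken in descending order of downtime (ties broken by name)
    until the accumulated share reaches the coverage target.
    """
    total = sum(downtime_by_type.values())
    if total <= 0:
        return []
    selected, running = [], 0.0
    ranked = sorted(downtime_by_type.items(), key=lambda kv: (-kv[1], kv[0]))
    for alert_type, minutes in ranked:
        if running / total >= coverage:
            break
        selected.append(alert_type)
        running += minutes
    return selected


def remediate(alert_type, playbooks):
    """Run the automated playbook for an alert type, or escalate to a human."""
    handler = playbooks.get(alert_type)
    if handler is None:
        return "escalate_to_oncall"
    return handler()


# Hypothetical downtime minutes attributed to each alert type
downtime = {
    "disk_full": 520,
    "conn_pool_exhausted": 300,
    "cert_expiry": 100,
    "ntp_drift": 50,
    "log_rotate": 30,
}
print(pareto_alert_types(downtime))
# ['disk_full', 'conn_pool_exhausted']
```

Here two of five alert types account for 82% of downtime, so they would be the first candidates for automated playbooks; anything without a playbook still falls back to a person.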

5. Understanding of AIOps

AIOps is viewed as an auxiliary capability that enhances, rather than replaces, traditional operations. Success depends on data virtualization, cross‑team data sharing, and a pragmatic approach that applies mature statistical and machine‑learning methods without over‑promising.

The presentation concludes that while AIOps is not a silver bullet, its systematic application can significantly improve reliability and efficiency in large‑scale, zero‑downtime environments.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
