Intelligent Anomaly Detection for Ctrip Operations: LSTM Forecasting, Trend Analysis, Adaptive Thresholds, and Periodic Anomaly Filtering
The article describes Ctrip's AIOps approach to improving alert quality by combining statistical methods with machine learning: LSTM forecasting, trend analysis, adaptive threshold calculation, and periodic anomaly detection based on dynamic time warping (DTW). The authors report significant gains in alert precision and fault recall.
Background Ctrip, a large online travel platform, faces stability challenges due to traffic spikes, code releases, and operational changes. To meet a "1‑5‑10" fault‑handling goal (detect in 1 min, locate in 5 min, resolve in 10 min), a robust, low‑cost, high‑accuracy anomaly detection system is needed for key metrics like order volume.
2.1 More Accurate Prediction Time-series anomaly detection predicts a metric's value and flags large deviations from that prediction. Several models (ARIMA, Holt-Winters, LSTM) were evaluated; LSTM performed best on Ctrip's strongly periodic order data. A sliding window of the latest 10 points feeds the LSTM. To keep the prediction from drifting along with a metric that is slowly declining, a hypothesis test (Mann-Whitney U) checks for a short-term trend; if a trend is detected, the previous window is retained rather than updated, which lowers the mean absolute error (MAE).
Table 1 – Model Prediction Errors (MAE) shows LSTM‑Adjust achieving the lowest error across three business lines (AA, BB, CC).
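The trend check in section 2.1 could be sketched as follows. This is a minimal illustration, not Ctrip's implementation: the split of the recent points into an older and a newer half, the 0.05 significance level, and the function names are all assumptions. The p-value uses the normal approximation without continuity or tie correction, which is rough for samples this small.

```python
import math

def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties count as one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p

def has_short_term_trend(window, alpha=0.05):
    """Compare the older half of the window with the newer half."""
    half = len(window) // 2
    older, newer = window[:half], window[-half:]
    _, p = mann_whitney_u(older, newer)
    return p < alpha

# Hypothetical 10-point windows: one slowly declining, one flat with jitter.
declining = [100, 99, 98, 97, 96, 90, 89, 88, 87, 86]
flat = [100, 101, 99, 100, 102, 101, 99, 100, 101, 100]
print(has_short_term_trend(declining), has_short_term_trend(flat))
```

When `has_short_term_trend` reports a trend, the caller would keep feeding the previous window to the forecaster instead of the newest points.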
2.2 Adaptive Threshold Calculation Manual rule-based thresholds are overly sensitive and costly to maintain. Instead, the system computes thresholds adaptively from the metric's own volatility. A statistic Z = (actual − prediction)/σ is defined, and Z is treated as a stationary time series. Non-parametric kernel density estimation (KDE) fits the distribution of Z, and its 99.99th percentile serves as the anomaly cutoff. High- and low-volatility periods (separated by the coefficient of variation) receive distinct thresholds, which reduces false alarms during low-traffic periods.
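As a rough sketch of the thresholding step, the snippet below fits a Gaussian KDE to Z and reads off its 99.99th percentile by bisection on the KDE's cumulative distribution. Several details are assumptions rather than anything stated in the article: σ is taken to be the standard deviation of the residuals, the bandwidth uses the normal-reference rule of thumb, the data is synthetic, and the helper names are hypothetical.

```python
import random
from statistics import NormalDist, stdev

def z_scores(actual, predicted):
    """Z = (actual - prediction) / sigma for each point."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sigma = stdev(residuals)  # assumption: sigma is the residual std. dev.
    return [r / sigma for r in residuals]

def kde_quantile(samples, q=0.9999, tol=1e-6):
    """q-th quantile of a Gaussian KDE fitted to `samples`."""
    n = len(samples)
    h = 1.06 * stdev(samples) * n ** -0.2  # normal-reference bandwidth
    kernel = NormalDist()

    def cdf(x):
        # The CDF of a Gaussian KDE is the average of the kernel CDFs.
        return sum(kernel.cdf((x - s) / h) for s in samples) / n

    lo, hi = min(samples) - 10 * h, max(samples) + 10 * h
    while hi - lo > tol:  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Synthetic data, for illustration only.
random.seed(7)
predicted = [1000.0] * 500
actual = [p + random.gauss(0, 20) for p in predicted]
z = z_scores(actual, predicted)
upper = kde_quantile(z, 0.9999)
print(f"upper Z threshold: {upper:.2f}")
```

A lower cutoff can be obtained the same way with a small `q`, and running the fit separately on high- and low-volatility periods would give each regime its own threshold.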
2.3 Business Trend Analysis A single anomaly detector is insufficient for many metrics. Robust linear regression (Huber regression) models the short-term trend, and the distance of each point's residual from the fitted line measures volatility. Combining this with the LSTM-based predictor filters out metric jitter, boosting alert precision by ~30%.
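To illustrate the idea, here is a hand-rolled Huber-style line fit using iteratively reweighted least squares, followed by the residual distance of a new point from the fitted trend. It is a sketch under stated assumptions, not the article's code: the tuning constant 1.345, the MAD-based scale estimate, the iteration count, and both function names are illustrative choices.

```python
from statistics import median

def huber_line(y, delta=1.345, iters=20):
    """Fit y ~ slope * t + intercept with Huber weights (IRLS)."""
    t = list(range(len(y)))
    w = [1.0] * len(y)
    slope = intercept = 0.0
    for _ in range(iters):
        # Weighted least squares for a straight line.
        sw = sum(w)
        t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
        sxy = sum(wi * (ti - t_bar) * (yi - y_bar)
                  for wi, ti, yi in zip(w, t, y))
        slope = sxy / sxx
        intercept = y_bar - slope * t_bar
        resid = [yi - (slope * ti + intercept) for ti, yi in zip(t, y)]
        # Robust scale: median absolute residual, rescaled for normal data.
        scale = 1.4826 * median(abs(r) for r in resid) or 1e-9
        # Huber weights: full weight near the line, down-weighted outliers.
        w = [1.0 if abs(r) <= delta * scale else delta * scale / abs(r)
             for r in resid]
    return slope, intercept

def residual_distance(history, latest):
    """Distance of the next point from the extrapolated robust trend."""
    slope, intercept = huber_line(history)
    expected = slope * len(history) + intercept
    return abs(latest - expected)

# Hypothetical window: a clean upward trend with one jitter spike.
window = [5.0 + 2.0 * i for i in range(20)]
window[10] += 100.0
print(huber_line(window))
print(residual_distance(window, 20.0))
```

One plausible way to combine the two signals is to alert only when both the forecast deviation and this residual distance are large; the article does not spell out the exact rule.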
2.4 Periodic Anomaly Detection Periodic anomalies (regular but unexpected spikes) are filtered using Dynamic Time Warping (DTW), which aligns the current window with historical windows. Features (period, amplitude, phase) are then extracted and classified by a supervised model. This reduces periodic false alarms by ~80%.
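The alignment step can be illustrated with the classic dynamic-programming form of DTW. This sketch covers only the alignment cost; the feature extraction and the supervised classifier described above are not shown, and the toy windows and function name are hypothetical.

```python
def dtw_distance(a, b):
    """DTW cost between two sequences using absolute difference."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

# Toy windows: the same spike, shifted by one step, versus no spike.
today = [0, 0, 5, 0, 0, 0]
yesterday = [0, 0, 0, 5, 0, 0]
quiet_day = [0, 0, 0, 0, 0, 0]
print(dtw_distance(today, yesterday), dtw_distance(today, quiet_day))
```

A spike that recurs at roughly the same time in historical windows yields a small DTW cost even when it is shifted by a step or two, whereas a pointwise comparison would penalize the shift. That low cost is the kind of signal a periodic-anomaly filter can use.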
Conclusion The intelligent anomaly detection system consists of offline training (using 14 days of pre‑processed data) and online real‑time detection (baseline prediction, unsupervised methods such as Boxplot, K‑sigma, KDE). Over three years of deployment, alert accuracy and recall have improved markedly, with most faults discovered within one minute.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.