How 360’s DoctorStarange Boosts Ops with AI‑Driven Prediction, Correlation, and Resource Optimization
This article explains how 360’s DoctorStarange system combines time‑series forecasting, neural‑network predictions, alarm correlation, and a machine‑health scoring model to reduce false alerts, automate remediation, and maximize resource utilization across thousands of production servers.
DoctorStarange Background
To keep 360’s private‑cloud platform stable and reliable, the team built a monitoring system called Wonder. As traffic grew, however, simple threshold alerts proved insufficient: alarms multiplied, and each one had to be handled manually and reactively.
DoctorStarange was created as an intelligent prediction and handling system that forecasts alerts, correlates different alarm items, and optimizes machine resources, thereby dramatically reducing alarm frequency and speeding up root‑cause analysis.
Intelligent Prediction and Processing System
Historical monitoring data shows two patterns: stable trends (e.g., disk usage) and highly volatile trends (e.g., CPU, network traffic). Different models are applied to each.
For stable metrics, the team uses an ARIMA time‑series model. The ARIMA parameters (p, d, q) are selected with the AIC/BIC criteria, and the model is evaluated on two measures: prediction accuracy and alarm‑reduction rate.
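The order‑selection step can be illustrated with a small standard‑library sketch. It is not the team's implementation: it drops the moving‑average term (q = 0), fits ARIMA(p, d, 0) by least squares on the differenced series, and keeps the p with the lowest AIC. The function names, the synthetic disk‑usage series, and the `max_p` limit are all assumptions made for the example.

```python
import math
import random

def difference(series, d=1):
    """Apply d rounds of first-order differencing (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ar(series, p, start):
    """Least-squares AR(p) with intercept, fitted on points from index `start` on.
    Returns (coefficients, AIC)."""
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)]
            for t in range(start, len(series))]
    y = series[start:]
    k = p + 1  # number of fitted parameters
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    n = len(y)
    aic = n * math.log(max(rss / n, 1e-12)) + 2 * k  # Gaussian AIC up to a constant
    return beta, aic

def select_order(series, d=1, max_p=3):
    """Pick the AR order with the lowest AIC on the d-times differenced series.
    Every candidate is fitted on the same points so the AIC values are comparable."""
    diffed = difference(series, d)
    fits = {p: fit_ar(diffed, p, start=max_p) for p in range(max_p + 1)}
    best_p = min(fits, key=lambda p: fits[p][1])
    return best_p, fits[best_p][0]

# Synthetic disk-usage series (percent): slow upward drift plus small noise.
random.seed(0)
usage = [40 + 0.05 * t + random.gauss(0, 0.1) for t in range(200)]
best_p, coefficients = select_order(usage)
```

A production system would more likely rely on a maintained time‑series library and search over q as well; the sketch only shows how an information criterion trades fit quality against the number of parameters.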
In a month‑long test on over 20,000 machines, the ARIMA‑based predictions achieved near‑100% accuracy and reduced alerts by about 70%.
For volatile metrics, a neural‑network model is employed. The input layer takes 24 hourly features, the hidden layer has twice the input size plus one (49 nodes), and the output layer predicts a single future point. This approach yields roughly 80% prediction accuracy and a 50% alarm‑reduction rate for CPU‑related alerts.
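The layer sizes described here can be made concrete with a short sketch. It only builds the training windows and runs an untrained forward pass; the tanh activation, the weight initialization, and the synthetic CPU series are assumptions, and the code does not use the library the team worked with.

```python
import math
import random

WINDOW = 24  # 24 hourly samples feed the input layer (from the article)

def hidden_size(n_inputs):
    """Hidden layer width as described: twice the input size plus one."""
    return 2 * n_inputs + 1

def make_windows(series, window=WINDOW):
    """Turn an hourly series into (inputs, target) pairs:
    `window` consecutive points are used to predict the next one."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]

def init_network(n_inputs, seed=0):
    """Random weights for an n_inputs -> hidden -> 1 network (untrained)."""
    rng = random.Random(seed)
    n_hidden = hidden_size(n_inputs)
    # Each weight row carries one extra entry at the end for the bias.
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(network, inputs):
    """One forward pass: tanh hidden layer (an assumption), linear output."""
    w_hidden, w_out = network
    hidden = [math.tanh(w[-1] + sum(wi * x for wi, x in zip(w, inputs)))
              for w in w_hidden]
    return w_out[-1] + sum(wi * h for wi, h in zip(w_out, hidden))

# Synthetic CPU usage in [0, 1] with a daily cycle; three days of hourly data.
cpu = [0.5 + 0.3 * math.sin(2 * math.pi * h / 24) for h in range(72)]
samples = make_windows(cpu)
network = init_network(WINDOW)
prediction = forward(network, samples[0][0])
```

Training (for example by backpropagation on the `samples` pairs) is left out to keep the sketch short.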
When a prediction indicates an imminent issue, the system can either automatically clean up log files or send a notification email with details for manual handling.
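A minimal sketch of that decision step follows. The article does not say how the system chooses between automatic cleanup and email, so the policy table, the `Prediction` fields, and the action names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    host: str
    metric: str             # e.g. "disk_usage"
    predicted_value: float  # forecast value, in percent
    threshold: float        # alert threshold, in percent

# Hypothetical policy: metrics that have a safe automatic fix.
AUTO_REMEDIATION = {"disk_usage": "clean_logs"}

def choose_action(pred):
    """Return 'none', the name of an automatic fix, or 'notify_email'."""
    if pred.predicted_value < pred.threshold:
        return "none"  # no imminent issue predicted
    # Fall back to a notification email when no automatic fix is registered.
    return AUTO_REMEDIATION.get(pred.metric, "notify_email")

disk = Prediction("web-01", "disk_usage", predicted_value=93.0, threshold=90.0)
cpu = Prediction("web-01", "cpu_usage", predicted_value=97.0, threshold=95.0)
actions = [choose_action(disk), choose_action(cpu)]
```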
Alarm Correlation Analysis
As the number of monitored items grows, alarm volume explodes. The correlation module first merges alarms with high positive correlation to cut down duplicate alerts, then performs real‑time analysis to uncover causal relationships between metrics.
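The merging step can be sketched as grouping firing alarm items that are linked by a strong positive correlation, so each group raises one combined alert. The union‑find grouping and the input format are assumptions; the article does not describe the team's actual merging algorithm.

```python
def merge_alarms(alarms, correlated_pairs):
    """Group firing alarm items so that correlated items share one alert.

    alarms: names of the alarm items currently firing.
    correlated_pairs: (item_a, item_b) tuples known to be strongly
    positively correlated (hypothetical input from the correlation model).
    """
    parent = {a: a for a in alarms}

    def find(a):
        # Follow parent links to the group representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in correlated_pairs:
        if a in parent and b in parent:  # ignore pairs that are not firing
            parent[find(a)] = find(b)

    groups = {}
    for a in alarms:
        groups.setdefault(find(a), []).append(a)
    return sorted(sorted(group) for group in groups.values())

firing = ["cpu_load", "net_in", "net_out", "disk_usage"]
pairs = [("net_in", "net_out"), ("net_out", "cpu_load")]
merged = merge_alarms(firing, pairs)
```

In this example four firing items collapse into two alerts: one for the traffic‑related group and one for disk usage.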
Correlation is derived from cross‑correlation coefficients and slope‑based volatility scores, weighted to produce positive and negative association lists. A self‑learning component continuously refines the model using historical cases and expert knowledge.
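One way to read this scoring is sketched below. The lag‑0 correlation coefficient is standard; the slope‑based score, the equal weights, and the 0.8 cut‑off are assumptions made for illustration, since the article does not define them.

```python
import math

def pearson(x, y):
    """Correlation coefficient at lag 0 between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def slope_agreement(x, y):
    """Hypothetical slope-based score in [-1, 1]: share of steps where both
    series move the same way minus the share where they move oppositely."""
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    same = sum(1 for a, b in zip(dx, dy) if a * b > 0)
    opposite = sum(1 for a, b in zip(dx, dy) if a * b < 0)
    return (same - opposite) / len(dx) if dx else 0.0

def association(x, y, w_corr=0.5, w_slope=0.5):
    """Weighted combination of the two scores (weights are assumptions)."""
    return w_corr * pearson(x, y) + w_slope * slope_agreement(x, y)

def classify(metrics, threshold=0.8):
    """Build positive and negative association lists over all metric pairs."""
    names = sorted(metrics)
    positive, negative = [], []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = association(metrics[a], metrics[b])
            if score >= threshold:
                positive.append((a, b))
            elif score <= -threshold:
                negative.append((a, b))
    return positive, negative

metrics = {
    "net_in": [1, 2, 3, 4, 5],
    "net_out": [2, 4, 6, 8, 10],
    "cpu_idle": [90, 80, 70, 60, 50],
}
positive, negative = classify(metrics)
```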
Machine Resource Optimization
The team introduces a “machine health score” ranging from -1 (under‑utilized) to 1 (over‑utilized). It is calculated from six key indicators, including CPU idle, memory usage, inbound and outbound network traffic, and connection count, each evaluated against historical and predicted upper and lower bounds.
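The article does not give the scoring formula, so the following is only one plausible sketch: each indicator is mapped linearly onto [-1, 1] between its lower and upper bound, CPU idle is inverted (high idle means low load), and the indicator scores are averaged. The bounds, indicator names, and the averaging rule are assumptions.

```python
def indicator_score(value, lower, upper, inverted=False):
    """Map a reading to [-1, 1]: -1 at or below the lower bound,
    +1 at or above the upper bound, linear in between."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    position = 2 * (value - lower) / (upper - lower) - 1
    score = max(-1.0, min(1.0, position))
    return -score if inverted else score

def health_score(readings, bounds, inverted=("cpu_idle",)):
    """Average the per-indicator scores into one machine health score."""
    scores = [indicator_score(value, *bounds[name], inverted=name in inverted)
              for name, value in readings.items()]
    return sum(scores) / len(scores)

# Hypothetical bounds (lower, upper), e.g. derived from history and forecasts.
bounds = {
    "cpu_idle": (10.0, 90.0),       # percent
    "memory_usage": (20.0, 90.0),   # percent
    "connections": (100.0, 5000.0),
}
busy = {"cpu_idle": 5.0, "memory_usage": 95.0, "connections": 6000.0}
idle = {"cpu_idle": 95.0, "memory_usage": 10.0, "connections": 50.0}
busy_score = health_score(busy, bounds)
idle_score = health_score(idle, bounds)
```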
Scenarios include dynamic scaling (using health scores to decide when to add or remove machines) and visualizing data‑center topology, where colors indicate health status.
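Both scenarios reduce to simple rules on top of the health score. The thresholds, action names, and color choices below are hypothetical; the article only states that scores drive scaling decisions and that colors indicate health status.

```python
def scaling_decision(scores, scale_out_at=0.5, scale_in_at=-0.5):
    """Decide on capacity from the mean health score of a machine pool."""
    if not scores:
        return "hold"
    mean = sum(scores) / len(scores)
    if mean >= scale_out_at:
        return "add_machines"     # pool is over-utilized
    if mean <= scale_in_at:
        return "remove_machines"  # pool is under-utilized
    return "hold"

def health_color(score, band=0.5):
    """Pick a display color for one machine in the topology view."""
    if score >= band:
        return "red"    # over-utilized
    if score <= -band:
        return "blue"   # under-utilized
    return "green"      # within the normal band

pool = {"web-01": 0.9, "web-02": 0.8, "web-03": 0.7}
decision = scaling_decision(list(pool.values()))
colors = {host: health_color(score) for host, score in pool.items()}
```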
Q&A
Typical questions cover examples of automatic disk‑space handling, details of the alarm‑merging algorithm, the Python PyBrain library used for the neural networks, recommended monitoring tools (e.g., Open‑Falcon), and future directions such as business‑level topology mapping and automated root‑cause discovery.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.