How 360 Detects Network Anomalies with AI‑Powered Time‑Series Algorithms
This article explains how 360’s network operations team uses time‑series analysis, statistical thresholds, EWMA, dynamic limits, and machine‑learning models such as K‑Means and Isolation Forest to automatically detect, locate, and remediate traffic anomalies across massive data‑center exits.
Foreword
Thanks to the Efficient Operations community for the platform. I was a network engineer at 360 and lived through the company's architectural transformation, which shifted my focus to network monitoring, automation, visualization, and AI applications. This presentation is divided into four parts: project background, time‑series algorithms, machine learning, and future outlook.
1. Project Background
The project targets ISP‑level traffic anomalies at DC exits, aiming to automatically discover, locate, and notify the responsible business when abnormal traffic occurs.
360’s services span search, smart hardware, mobile, dashcams, children’s watches, robots, and cloud, supporting 865 million monthly active users across 122 data‑center sites with 3.5 Tbps of ISP bandwidth.
Zero tolerance for service interruption requires real‑time insight into DC‑exit traffic, detection of any abnormal patterns, and immediate response.
The challenges are twofold: attributing an anomaly to the responsible business when an alert arrives as a black box, and moving beyond fixed‑threshold monitoring, which misses subtle deviations.
We stored traffic from hundreds of thousands of ports as time‑series data, extracting dozens of features (server, domain, business owner, region, etc.) to enable downstream anomaly detection.
2. Time‑Series Algorithms
After data collection we verify stationarity and apply differencing when needed.
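Differencing is the standard way to turn a trending series into a stationary one before threshold-based checks. A minimal sketch (the function name is illustrative):

```python
def difference(series, lag=1):
    """First-order differencing: y[t] = x[t] - x[t-lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A steadily rising (non-stationary) series becomes constant after differencing.
trend = [10, 12, 14, 16, 18, 20]
print(difference(trend))  # -> [2, 2, 2, 2, 2]
```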
2.1 3‑Sigma
Assuming normal distribution, data points beyond three standard deviations from the mean are flagged as anomalies.
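In practice the mean and standard deviation come from a baseline window and each new point is tested against them. A minimal sketch (the baseline values are illustrative):

```python
from statistics import mean, stdev

def is_three_sigma_anomaly(history, point):
    """Flag a point more than three standard deviations from the
    mean of the baseline window."""
    mu, sigma = mean(history), stdev(history)
    return abs(point - mu) > 3 * sigma

baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
print(is_three_sigma_anomaly(baseline, 500))  # -> True
print(is_three_sigma_anomaly(baseline, 101))  # -> False
```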
2.2 EWMA (Exponentially Weighted Moving Average)
EWMA weights recent data more heavily (parameter λ between 0 and 1). We compute a 15‑minute window EWMA over seven days, using the latest EWMA value as the mean for 3‑sigma comparison, which captures recent trends while smoothing noise.
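The recursion behind this is simple: each new observation is blended with the running average, and the result replaces the plain mean in the 3‑sigma check above. A sketch with an illustrative λ:

```python
def ewma(series, lam=0.3):
    """Exponentially weighted moving average; lam (0 < lam < 1)
    is the weight given to the newest observation."""
    s = series[0]
    for x in series[1:]:
        s = lam * x + (1 - lam) * s
    return s

# The EWMA responds to a level shift faster than a plain mean,
# while still smoothing single-point noise.
window = [100, 100, 100, 100, 200]
print(round(ewma(window), 1))  # -> 130.0
```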
2.3 Dynamic Threshold
Using the second‑smallest and second‑largest values from the past 14 days, multiplied by 0.6 and 1.2 respectively, we create adaptive lower and upper bounds that track each port's own traffic pattern.
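The rule maps directly to code; taking the *second* extremes rather than the absolute min/max discards a single outlier day. A sketch with made-up daily values:

```python
def dynamic_bounds(history):
    """Adaptive bounds: second-smallest value * 0.6 as the lower bound,
    second-largest value * 1.2 as the upper bound. Using the second
    extremes discards a single-day spike or dip."""
    s = sorted(history)
    return s[1] * 0.6, s[-2] * 1.2

# 14 days of traffic; day 1 (dip) and day 14 (spike) are ignored.
daily = [50, 120, 110, 115, 105, 118, 112, 108, 111, 114, 109, 113, 116, 300]
low, high = dynamic_bounds(daily)
print(round(low, 1), round(high, 1))  # -> 63.0 144.0
```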
2.4 Small‑Flow Monitoring Optimization
We apply a logarithmic function to give higher sensitivity to low‑volume flows, allowing a steep curve for small traffic and a gentler slope for larger volumes.
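One way to realize this (the exact function 360 uses is not given in the article) is to compare flows in log space, where the curve is steep near zero and flattens for large values, so the same absolute deviation stands out far more on a low-volume flow:

```python
import math

def log_scale(x):
    """Map traffic volume into log space; steep for small flows,
    gentle for large ones."""
    return math.log1p(x)

small_shift = log_scale(15) - log_scale(10)      # +5 units on a small flow
large_shift = log_scale(1005) - log_scale(1000)  # +5 units on a large flow
print(small_shift > large_shift)  # -> True
```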
3. Machine Learning
When statistical methods struggle, we turn to machine learning.
3.1 Architecture
An automatically updating model handles evolving traffic patterns; offline training produces a model, while online inference classifies real‑time traffic as normal or abnormal.
3.2 Supervised vs Unsupervised Learning
Supervised learning requires balanced, manually labeled samples; unsupervised learning discovers patterns without labels but needs parameter tuning.
3.3 Feature Extraction
Features include normalized traffic volume, quantile‑based distributions, period‑over‑period ratios, and coefficient of variation; these are fed to the model for classification.
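The feature list above can be sketched as a single vector-builder per traffic window; the exact production feature set is not spelled out in the article, so the selection here is illustrative:

```python
from statistics import mean, stdev, quantiles

def extract_features(window, prev_window):
    """Build one feature vector per traffic window."""
    peak = max(window) or 1
    normalized = mean(x / peak for x in window)        # normalized volume
    q25, q50, q75 = quantiles(window, n=4)             # quantile distribution
    ratio = mean(window) / (mean(prev_window) or 1)    # period-over-period
    cv = stdev(window) / (mean(window) or 1)           # coefficient of variation
    return [normalized, q25, q50, q75, ratio, cv]

features = extract_features([10, 12, 11, 13, 12, 11], [10, 11, 10, 12, 11, 10])
print(len(features))  # -> 6
```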
3.4 Model Selection
We evaluated K‑Means clustering (with distance‑based anomaly thresholds) and Isolation Forest (tree‑based anomaly scoring). Isolation Forest proved easier to use and performed better with limited features, so we adopted it.
Each port‑direction pair receives its own model, updated daily, with a 10‑minute sliding window for inference.
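The offline-train / online-infer split can be sketched with scikit-learn's `IsolationForest`; the article does not name a specific library, and the feature values and `contamination` setting here are assumptions:

```python
# Sketch assuming scikit-learn; feature vectors are made up for illustration.
from sklearn.ensemble import IsolationForest

# Offline training: one model per port-direction pair, refit daily
# on that port's recent feature vectors (e.g. [normalized volume, CV]).
train = [[1.0, 0.02], [1.1, 0.03], [0.9, 0.02], [1.0, 0.04]] * 25
model = IsolationForest(contamination=0.01, random_state=0).fit(train)

# Online inference over the latest 10-minute window: 1 = normal, -1 = anomaly.
preds = model.predict([[1.0, 0.03], [9.0, 0.9]])
print(preds)  # the far-off second point is flagged as -1
```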
Ensembling multiple algorithms (four statistical methods plus the ML model) and voting improves detection accuracy to over 98%.
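The vote itself reduces to counting detector verdicts; the quorum of 3 below is an assumed value, not stated in the article:

```python
def ensemble_vote(flags, quorum=3):
    """Majority vote over detector verdicts (True = anomaly). flags holds
    the outputs of the four statistical methods plus the ML model."""
    return sum(flags) >= quorum

# Order: 3-sigma, EWMA, dynamic threshold, small-flow check, Isolation Forest
print(ensemble_vote([True, True, False, True, False]))   # -> True
print(ensemble_vote([False, True, False, False, False])) # -> False
```

Voting trades a little sensitivity for far fewer false positives: a single noisy detector can no longer page anyone on its own.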
4. Present and Future
Detection alone is insufficient without business‑level attribution. We built a C‑based tool to map abnormal traffic to IP, protocol, and business owner, sending automated alerts.
Correlation analysis using Pearson coefficients helps identify related traffic curves; a similarity-search tool (“千里眼”, literally “thousand‑mile eye”) automatically finds the DC ports most correlated with a given anomaly.
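The ranking step can be sketched in a few lines; the port names and curves below are illustrative:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length curves."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def most_correlated_port(anomaly_curve, port_curves):
    """Rank candidate DC ports by similarity to the anomalous curve."""
    return max(port_curves, key=lambda p: pearson(anomaly_curve, port_curves[p]))

curves = {"port-a": [2, 4, 6, 8], "port-b": [8, 6, 4, 2]}
print(most_correlated_port([1, 2, 3, 4], curves))  # -> port-a
```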
Future work focuses on linking diverse monitoring metrics, suppressing redundant alerts, pinpointing root causes, and automating remediation through pre‑defined playbooks.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.