AIOps Implementation Practice at 360: Architecture, Models, and Automation
The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.
On September 22, the 18th session of the 360 Internet Technology Training Camp titled “AIOps Landing Practice Exploration” was held at the 360 Building in Beijing, with a summary shared by senior operations engineer Wang Baoping.
The event featured four talks, two internal and two external. The first internal talk, "AIOps in 360 – You Can Quickly Deploy AIOps," introduced 360’s intelligent operations framework and offered suggestions for replacing its components. The second internal talk, "AI Operations Platform Based on StackStorm – Fault Self‑Healing Practice," demonstrated scenario‑driven fault detection, prediction, and automated remediation. Of the two external speakers, the first, from Yixin, discussed building a next‑generation intelligent CMDB that uses knowledge graphs to encode operational expertise. The second, from LogEasy, presented "Intelligent Operations and Security Based on Log Big Data," describing how large‑scale log ingestion, NLP keyword analysis, and time‑series anomaly detection enable AI‑driven monitoring.
Following the external talks, the article outlines 360’s own AIOps implementation. Since early 2018, 360 has identified three high‑frequency operational scenarios—resource reclamation, alarm false‑positive/negative reduction, and alarm correlation—and applied AI models such as classification, time‑series forecasting, and anomaly detection to each. The overall workflow is described as "Operations Big Data → AI Center → Alarm Self‑Healing → Operations Dashboard," likened to a human body where data collection is the eyes, the AI center the brain, self‑healing the hands/feet, and the dashboard the face.
The operations dashboard aggregates resource reclamation costs, efficiency gains, core network link metrics, and real‑time push notifications for large‑scale alarms, with suggestions for future interactive features such as gaze tracking and gesture control.
The underlying architecture relies on custom‑built agents that collect hardware, log, process, and external‑network quality data, forwarding them to various storage back‑ends (Elasticsearch, MongoDB, InfluxDB). A lightweight gateway handles data ingestion and alarm dispatch without heavy RPC frameworks, and an Nginx front‑end simplifies logging.
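The gateway's role of accepting agent data and routing it to a storage back‑end can be illustrated with a minimal sketch. The article names the back‑ends but does not say which data type goes to which store, so the routing table below, the `Gateway` class, and the record fields are all hypothetical, and in‑memory lists stand in for real storage clients.

```python
from collections import defaultdict

# Hypothetical routing table: the mapping of data type to back-end is an
# assumption made for illustration, not something the article specifies.
ROUTES = {
    "log": "elasticsearch",
    "hardware": "mongodb",
    "process": "mongodb",
    "network_quality": "influxdb",
}


class Gateway:
    """Toy ingestion gateway: picks a back-end by record type."""

    def __init__(self, routes):
        self.routes = routes
        self.sinks = defaultdict(list)  # in-memory stand-ins for real clients
        self.rejected = []              # records with an unknown type

    def ingest(self, record):
        """Store the record under its back-end and return the back-end name."""
        backend = self.routes.get(record.get("type"))
        if backend is None:
            self.rejected.append(record)
            return None
        self.sinks[backend].append(record)
        return backend


gateway = Gateway(ROUTES)
gateway.ingest({"type": "log", "host": "web-01", "message": "disk full"})
gateway.ingest({"type": "network_quality", "link": "bj-sh", "loss": 0.02})
gateway.ingest({"type": "unknown", "payload": "?"})
```

Keeping the routing decision in a plain lookup table is one way to stay lightweight, since adding a data type means adding a table entry rather than a new service.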
Three core AI models are deployed: time‑series forecasting for resource reclamation, time‑series anomaly detection for external‑network quality, and alarm correlation analysis for I/O alerts. For anomaly detection across 100,000 links, the system clusters the links into roughly 200 groups, applies several models to each (Isolation Forest, EWMA + 3σ, and other statistical methods), and uses a voting mechanism to decide whether a point is anomalous. Each incoming data point is processed within one minute.
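The voting idea can be sketched with standard‑library Python. This is an illustration under stated assumptions, not 360's implementation: the EWMA + 3σ detector follows the technique named above, while the median/MAD and IQR detectors are stand‑ins chosen here for "other statistical methods". Isolation Forest is omitted to avoid a third‑party dependency. The function names, the smoothing factor, and the two‑vote threshold are all hypothetical.

```python
import statistics


def ewma_three_sigma(series, alpha=0.3):
    """Flag the last point if it is more than 3 sigma from the EWMA of history."""
    history, latest = series[:-1], series[-1]
    ewma = history[0]
    for x in history[1:]:
        ewma = alpha * x + (1 - alpha) * ewma
    sigma = statistics.pstdev(history)
    return abs(latest - ewma) > 3 * sigma


def median_mad(series, threshold=3.5):
    """Flag the last point using the modified z-score (median and MAD)."""
    history, latest = series[:-1], series[-1]
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return latest != med
    return 0.6745 * abs(latest - med) / mad > threshold


def iqr_rule(series, k=1.5):
    """Flag the last point if it falls outside the k * IQR fences of history."""
    history, latest = series[:-1], series[-1]
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return latest < q1 - k * iqr or latest > q3 + k * iqr


def vote(series, detectors, min_votes=2):
    """Declare an anomaly when at least min_votes detectors agree."""
    return sum(1 for detect in detectors if detect(series)) >= min_votes


DETECTORS = [ewma_three_sigma, median_mad, iqr_rule]
latency_ms = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20]
spike = vote(latency_ms + [95], DETECTORS)   # all three detectors fire
normal = vote(latency_ms + [21], DETECTORS)  # no detector fires
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms, which matches the article's stated goal of reducing false positives.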
Model serving is optimized by pre‑loading models into memory, using short‑lived TCP connections, and batching requests. A map‑reduce style pipeline distributes 100,000 data points across a 10‑node high‑performance cluster, achieving sub‑second response times for anomaly flags.
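The map‑reduce style fan‑out described above can be sketched on a single machine. In this sketch a thread pool stands in for the 10‑node cluster, a fixed threshold stands in for the pre‑loaded model, and `chunk`, `score_batch`, `detect`, and the batch size are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_batch(batch, threshold=100.0):
    """Map step: return the link ids in this batch whose value exceeds the threshold.

    The threshold is a placeholder for a real model loaded once into memory.
    """
    return [link_id for link_id, value in batch if value > threshold]


def detect(points, workers=10, batch_size=10_000):
    """Fan batches out to workers, then merge the partial results (reduce step)."""
    batches = chunk(points, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(score_batch, batches)  # results keep batch order
        return [link_id for part in partials for link_id in part]


# 100,000 synthetic points, four of which are abnormal.
points = [(i, 150.0 if i % 25_000 == 0 else 20.0) for i in range(100_000)]
flagged = detect(points)
```

Batching amortizes per‑request overhead, and because each batch is scored independently the same shape carries over to separate worker processes or nodes.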
Detected anomalies are fed into a labeling platform where operators confirm or reject results, providing feedback that continuously refines offline model training.
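A minimal sketch of that feedback loop follows, assuming each operator verdict is stored alongside the flagged link and timestamp. The `Verdict` and `LabelStore` names and fields are hypothetical, and the article does not describe the platform's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Verdict:
    link_id: str
    timestamp: int
    confirmed: bool  # True when the operator agrees the point was anomalous


@dataclass
class LabelStore:
    verdicts: List[Verdict] = field(default_factory=list)

    def review(self, link_id: str, timestamp: int, confirmed: bool) -> None:
        """Record an operator's decision on one detected anomaly."""
        self.verdicts.append(Verdict(link_id, timestamp, confirmed))

    def precision(self) -> Optional[float]:
        """Share of detected anomalies that operators confirmed."""
        if not self.verdicts:
            return None
        return sum(v.confirmed for v in self.verdicts) / len(self.verdicts)

    def training_rows(self) -> List[Tuple[str, int, int]]:
        """Labeled rows (link, timestamp, 0/1 label) for offline retraining."""
        return [(v.link_id, v.timestamp, int(v.confirmed)) for v in self.verdicts]


store = LabelStore()
store.review("bj-sh", 1537600000, confirmed=True)
store.review("bj-gz", 1537600060, confirmed=False)
store.review("sh-gz", 1537600120, confirmed=True)
```

Rejected detections are kept rather than discarded because they are the negative examples that help the next training run suppress similar false positives.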
The self‑healing platform, built on a customized StackStorm framework, abstracts common remediation actions (machine reboot, process restart, ticket generation) into atomic actions that can be composed into workflows, enabling non‑technical users to design new self‑healing scenarios via a UI.
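The composition of atomic actions into workflows can be sketched in plain Python. This does not use or imitate the StackStorm API: the registry, the `action` decorator, and `run_workflow` are hypothetical, and the actions only record what they would do instead of touching a real host.

```python
from typing import Callable, Dict, List

# Registry of atomic actions, keyed by the name a workflow refers to.
ACTIONS: Dict[str, Callable[[dict], dict]] = {}


def action(name: str):
    """Decorator that registers a function as a named atomic action."""
    def register(fn):
        ACTIONS[name] = fn
        return fn
    return register


@action("restart_process")
def restart_process(ctx: dict) -> dict:
    ctx["log"].append(f"restart {ctx['process']} on {ctx['host']}")
    return ctx


@action("reboot_machine")
def reboot_machine(ctx: dict) -> dict:
    ctx["log"].append(f"reboot {ctx['host']}")
    return ctx


@action("create_ticket")
def create_ticket(ctx: dict) -> dict:
    ctx["log"].append(f"ticket for {ctx['host']}: {ctx['alarm']}")
    return ctx


def run_workflow(steps: List[str], ctx: dict) -> dict:
    """Run the named actions in order, passing a shared context between them."""
    ctx = dict(ctx, log=[])
    for name in steps:
        ctx = ACTIONS[name](ctx)
    return ctx


# A workflow is just an ordered list of action names, which is the kind of
# structure a UI could let a user assemble without writing code.
result = run_workflow(
    ["restart_process", "create_ticket"],
    {"host": "web-01", "process": "nginx", "alarm": "process down"},
)
```

Because a workflow is data rather than code, new self‑healing scenarios can be defined by recombining existing actions.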
In summary, the end‑to‑end pipeline—Operations Big Data → AI Center → Labeling Platform → Alarm Self‑Healing → Operations Dashboard—constitutes 360’s current AIOps framework, which the team aims to generalize and offer as a SaaS solution for other teams and external customers.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.