Applying AIOps to Game Operations: Roadmap, Anomaly Detection, and Fault Localization
This article describes NetEase's AIOps journey for game operations: Gartner's definition of intelligent operations, the implementation roadmap, anomaly-detection techniques for business metrics, performance metrics, and log data, and a fault-localization workflow that combines resource, code, change, and historical analysis.
According to Gartner, AIOps (Artificial Intelligence for IT Operations) integrates big data and machine‑learning capabilities to extract and analyze the growing volume, variety, and velocity of IT data, supporting quality assurance, cost management, and efficiency improvement in operational scenarios such as anomaly detection, fault diagnosis, prediction, self‑healing, resource optimization, capacity planning, intelligent change, and decision making.
1. NetEase Game AIOps Implementation Roadmap
Since 2016, NetEase Game has continuously explored AIOps, transitioning from manual to intelligent operations. The team built a smart monitoring platform and has deployed functions such as anomaly detection, prediction, correlation analysis, drill‑down analysis, log analysis, operation robots, fault location, and fault warning, as well as flame‑graph analysis, hardware prediction, and CDN file publishing.
2. Anomaly Detection
Anomaly detection is the foundation of AIOps, using AI algorithms to automatically and accurately identify abnormal patterns in monitoring data, offering advantages over traditional threshold‑based methods such as easier configuration, higher accuracy, broader coverage, and automatic updates.
Business Golden Metrics
These metrics (e.g., online player count) show strong periodicity, low volatility, and small scale, and demand high precision and recall, so a supervised learning framework is used. The pipeline comprises sample construction from historical KPI data and online user annotations, preprocessing, feature engineering (≈500 features), ensemble model training (RF, XGB, GBDT, LR), and visualization via graphic alerts and quick annotation.
Sample Construction: Combine historical KPI samples and user‑labeled data, using unsupervised IForest to generate anomaly scores, followed by stratified sampling and manual labeling.
Preprocessing: Curve classification (LSTM+CNN), missing‑value filling, max‑min normalization, and feature extraction.
Algorithm Model: Ensemble of RF, XGB, GBDT, and LR.
Visualization: Graphic alerts, quick annotation links, and anomaly view.
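The labeling-plus-ensemble flow above can be sketched as follows. This is a minimal illustration on synthetic data, not NetEase's implementation: the feature set is reduced to three columns (the article mentions ≈500), XGBoost is omitted to stay within scikit-learn, and the "manual labeling" step is stubbed with injected ground truth.

```python
import numpy as np
from sklearn.ensemble import (IsolationForest, RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic KPI: a daily-periodic curve with 40 injected anomalies.
t = np.arange(2000)
kpi = 100 + 10 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 1, t.size)
anom_idx = rng.choice(t.size, 40, replace=False)
kpi[anom_idx] += rng.choice([-30, 30], anom_idx.size)

# Step 1: unsupervised IForest scores candidates for stratified sampling;
# in practice an operator then annotates the sampled points.
X = kpi.reshape(-1, 1)
scores = -IsolationForest(random_state=0).fit(X).score_samples(X)

# Here labels come from the injected ground truth (stand-in for annotation).
labels = np.zeros(t.size, dtype=int)
labels[anom_idx] = 1

# Step 2: simple feature engineering (value, first difference, rolling delta).
diff = np.diff(kpi, prepend=kpi[0])
roll_mean = np.convolve(kpi, np.ones(12) / 12, mode="same")
feats = np.column_stack([kpi, diff, kpi - roll_mean])

# Step 3: soft-voting ensemble of RF / GBDT / LR.
clf = VotingClassifier([
    ("rf", RandomForestClassifier(random_state=0)),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="soft")
clf.fit(feats, labels)
pred = clf.predict(feats)
print("recall on injected anomalies:", (pred[anom_idx] == 1).mean())
```

The IForest scores are only used to decide *which* points are worth labeling; the ensemble is then trained on the labeled set, which is what makes the approach supervised.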
Performance Metrics
Performance metrics (e.g., CPU usage) are large in scale, complex in type, and irregular in cycle, making supervised labeling impractical, so unsupervised models are applied. Anomalies are classified into spike, drift, high-frequency, and linear-trend types, each with dedicated detectors such as differencing, spectral residual (SR), STL decomposition, mean shift, robust regression, multi-step differencing, and linear regression plus the Mann–Kendall (MK) test for trends. Periodic suppression is added on top to reduce false positives from recurring patterns.
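As one example of a per-type detector, spike detection can be sketched with first differencing plus a robust (median/MAD) threshold. The threshold `k` and the synthetic CPU series are illustrative assumptions, not values from the article.

```python
import numpy as np

def detect_spikes(series, k=6.0):
    """Flag points whose first difference deviates k robust sigmas."""
    d = np.diff(series, prepend=series[0])
    med = np.median(d)
    mad = np.median(np.abs(d - med)) or 1e-9   # guard against zero MAD
    robust_z = 0.6745 * (d - med) / mad        # MAD-based z-score
    return np.abs(robust_z) > k

rng = np.random.default_rng(1)
cpu = 40 + rng.normal(0, 0.5, 500)   # flat metric with mild noise
cpu[250] = 95                        # injected spike
flags = detect_spikes(cpu)
print(np.flatnonzero(flags))
```

Using the median and MAD instead of mean and standard deviation keeps the threshold itself from being distorted by the spike it is trying to catch; a spike typically flags twice (the jump up and the drop back).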
Text Data (Log Analysis)
Massive daily log volumes hide anomalies that are hard to spot manually. Intelligent log analysis applies big data and AI to classify logs and detect anomalies in real time: templates are extracted with the Drain algorithm and refined by a secondary Spell classification, and anomaly detection compares per-template counts across time windows, with machine-learning models flagging sudden changes.
Train models on two‑day historical log distributions to learn normal fluctuations.
Analyze overall log distribution to suppress minor noise.
Automatically select top‑N log categories with the greatest impact.
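The three strategies above can be sketched as a windowed template-count check: learn each template's count distribution from two days of history, then flag windows where a count is a large statistical surprise and surface the top-N by impact. Template IDs, window size, and the `k` threshold are illustrative assumptions; real template extraction would use Drain.

```python
import numpy as np

def fit_baseline(history):
    """history: {template_id: [count per 5-min window over ~2 days]}."""
    return {t: (np.mean(c), np.std(c) + 1e-9) for t, c in history.items()}

def flag_window(baseline, window_counts, k=4.0):
    """Return (template, count, z) for templates with a k-sigma surprise,
    sorted by |z| so the highest-impact categories come first."""
    hits = []
    for t, cnt in window_counts.items():
        mu, sd = baseline.get(t, (0.0, 1e-9))
        z = (cnt - mu) / sd
        if abs(z) > k:                 # suppress minor fluctuations
            hits.append((t, cnt, round(z, 1)))
    return sorted(hits, key=lambda h: -abs(h[2]))

# 576 five-minute windows ≈ two days of history per template.
history = {"login_ok": [100 + i % 5 for i in range(576)],
           "db_timeout": [1, 0, 2, 1] * 144}
base = fit_baseline(history)
print(flag_window(base, {"login_ok": 103, "db_timeout": 40}))
```

A `login_ok` count of 103 sits inside its learned fluctuation band and is suppressed, while 40 `db_timeout` lines in one window is far outside its history and is surfaced first.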
3. Fault Localization
Fault localization follows a two‑stage process: pre‑mitigation (quickly obtain information for immediate damage control) and post‑mitigation (deep root‑cause analysis to restore normal service). In complex game architectures, alarms are numerous and scattered, making manual correlation difficult.
Resources
Resource‑level analysis includes machines, network channels, and SaaS services.
Machines: Run anomaly detection over each machine's recent metrics, score the anomalies, and rank machines by criteria such as earliest onset and severity to produce a top-N list of abnormal machines.
Network/Channel: Apply the Adtributor algorithm to drill down by region and carrier, yielding top‑N abnormal dimensions.
SaaS: Directly aggregate existing SaaS alerts.
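The Adtributor drill-down mentioned for network/channel data can be sketched as follows: for each element of a dimension (here, region), compare forecast and actual proportions using explanatory power (the element's share of the total change) and surprise (a pointwise Jensen–Shannon term), then report elements by surprise until cumulative explanatory power passes a threshold. This is a simplified single-dimension sketch with illustrative numbers, not the production implementation.

```python
import math

def adtributor(forecast, actual, teps=0.8):
    """forecast/actual: {element: value} for one dimension. Returns
    elements ordered by surprise until cumulative explanatory power
    exceeds `teps`. Assumes all values are positive."""
    f_tot, a_tot = sum(forecast.values()), sum(actual.values())
    rows = []
    for e in forecast:
        p, q = forecast[e] / f_tot, actual[e] / a_tot
        m = (p + q) / 2
        # Pointwise JS-divergence term: how "surprising" this element is.
        surprise = 0.5 * (p * math.log(p / m) + q * math.log(q / m))
        # Explanatory power: this element's share of the overall change.
        ep = (actual[e] - forecast[e]) / (a_tot - f_tot)
        rows.append((e, ep, surprise))
    rows.sort(key=lambda r: -r[2])
    out, cum = [], 0.0
    for e, ep, _ in rows:
        out.append(e)
        cum += ep
        if cum >= teps:
            break
    return out

forecast = {"north": 500, "south": 500, "east": 300, "west": 200}
actual   = {"north": 500, "south": 200, "east": 300, "west": 200}
print(adtributor(forecast, actual))   # south explains the drop
```

Because "south" accounts for the entire 300-unit drop, it alone exceeds the explanatory-power threshold and is returned as the abnormal dimension element.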
Code
Code‑related issues are identified through log classification and anomaly detection, presenting top‑N abnormal log templates.
Human Operations
Human‑initiated changes are linked with change‑management systems to correlate pre‑fault change events and issue alerts.
Historical Faults
Historical fault similarity is measured using the Tanimoto coefficient; top‑N similar past faults and their root causes are recommended when the similarity exceeds a threshold.
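One way to realize this matching, sketched below, is to encode each fault as a binary symptom vector (which alarms and anomalies fired) and rank history by Tanimoto coefficient. The symptom encoding, fault names, and the 0.6 threshold are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two equal-length 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(a) + sum(b) - dot)

current = [1, 1, 0, 1, 0]               # current fault's symptom vector
history = {
    "2023-07 network outage": [1, 1, 0, 1, 1],
    "2023-05 db overload":    [0, 1, 1, 0, 0],
}
ranked = sorted(history.items(),
                key=lambda kv: -tanimoto(current, kv[1]))
top, sim = ranked[0][0], tanimoto(current, ranked[0][1])
if sim > 0.6:                            # recommend only above a threshold
    print(top, round(sim, 2))
```

On binary vectors the Tanimoto coefficient reduces to Jaccard similarity: shared symptoms divided by symptoms present in either fault, so unrelated past incidents fall below the recommendation threshold.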
The overall fault-localization workflow first detects a fault, then analyzes resources, code, human actions, and historical incidents to pinpoint the root cause. For example, a drop in online player count triggers machine-level network anomaly detection, which identifies a network outage.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.