Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture
The 58 Intelligent Monitoring System provides a flexible, 24/7, multi‑dimensional monitoring solution that covers network, server, system, application and business layers, incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis and self‑healing, and offers both PC and WeChat interfaces for operators.
The 58 Intelligent Monitoring System aims to deliver a flexible, easy‑to‑use monitoring product for all business lines of the group, providing 24/7 real‑time monitoring without blind spots across the network, server, system, application, and business layers.
Core Functions
Data collection (e.g., server resource usage, service status)
Configurable alarm policies
Accurate, low‑volume alarm delivery via multiple channels
Multi‑dimensional data visualization
The system acts as the guardian of online services. It helps operations, development, and testing teams quickly detect and troubleshoot faults and visualize operational data, and it provides intelligent insights such as alarm correlation, root‑cause analysis, and optimization suggestions.
Three‑Dimensional Monitoring Architecture
Vertical coverage includes:
Network layer – device status, bandwidth, QoS, etc.
Server layer – downtime, login failures, hardware faults
System layer – CPU, memory, disk, network usage
Application layer – port/process status, QPS
Business layer – PV, UV, order volume, revenue
Horizontal coverage includes:
User side – page performance, DNS hijacking, errors, timeouts
Data‑center network exit – VIP connectivity, page and interface monitoring
Traffic ingress – total traffic and per‑client (APP, mobile, PC) traffic, Nginx‑level metrics
Business cluster – single‑machine and cluster‑level monitoring of availability and response time
Cluster‑Based Monitoring Model
Nodes providing the same service are grouped into a cluster; all monitoring configuration (node list, templates, alarm recipients) is associated with the cluster, enabling easy scaling, node removal, and alarm rule updates without touching other settings.
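The cluster-centric model can be sketched as a small data structure. This is a minimal illustration, not the system's actual schema: the class name, field names, and sample values (`search-web`, `oncall-search`, the IP addresses) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A group of nodes providing the same service (all names are illustrative)."""
    name: str
    nodes: set[str] = field(default_factory=set)         # node IPs or hostnames
    templates: list[str] = field(default_factory=list)   # monitoring templates bound to the cluster
    recipients: list[str] = field(default_factory=list)  # alarm recipients

    def add_node(self, node: str) -> None:
        # Scaling out: the new node picks up templates and recipients from the cluster.
        self.nodes.add(node)

    def remove_node(self, node: str) -> None:
        # Removing a node leaves every other setting untouched.
        self.nodes.discard(node)

    def effective_config(self) -> dict[str, dict[str, list[str]]]:
        # Each node resolves its monitoring configuration through the cluster.
        return {
            node: {"templates": list(self.templates), "recipients": list(self.recipients)}
            for node in sorted(self.nodes)
        }


cluster = Cluster("search-web", templates=["basic", "port-8080"], recipients=["oncall-search"])
cluster.add_node("10.0.0.1")
cluster.add_node("10.0.0.2")
cluster.remove_node("10.0.0.1")
config = cluster.effective_config()
```

Because nothing is stored per node, changing an alarm rule means editing the cluster's template list once, and every current and future node sees the change.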
User Experience
The PC UI consists of three areas: menu, service tree, and business display. Selecting a node in the service tree defines the scope of data shown in the display area.
A lightweight WeChat version provides alarm details, metric views, alarm silencing, and progress remarks for collaborative handling.
Multi‑Dimensional Monitoring Methods
Basic monitoring – server downtime, resource usage, network quality
Service monitoring – port and process status
Custom monitoring – user‑defined metrics
Functional monitoring – page and interface checks
Availability monitoring – cluster and domain level availability, response time
Business‑level intelligent monitoring – prediction and anomaly detection of key business metrics
Implementation details:
Data is collected by agents on each server, stored, and evaluated for anomalies before being visualized.
Page and interface monitoring validates DNS resolution, connectivity, HTTP status, response time, content length, and specific keywords.
Cluster‑level probing detects server‑level issues even when Nginx retries mask them from end users.
Cluster and domain availability are derived from real‑time Nginx log aggregation using a Storm cluster.
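The page and interface checks listed above can be sketched as a pure evaluation function over a probe result. The record layout, field names, and thresholds below are hypothetical and chosen only for illustration. The probe itself (DNS lookup, HTTP request) is left out so the logic can be read and run without network access.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one probe; field names are hypothetical."""
    dns_resolved: bool
    connected: bool
    http_status: int
    response_ms: float
    body: str


@dataclass
class CheckPolicy:
    """Expected values for a page or interface; defaults are illustrative."""
    expected_status: int = 200
    max_response_ms: float = 1000.0
    min_content_length: int = 1
    required_keywords: tuple[str, ...] = ()


def evaluate(result: ProbeResult, policy: CheckPolicy) -> list[str]:
    """Return the names of the failed checks (an empty list means healthy)."""
    if not result.dns_resolved:
        return ["dns"]            # nothing further can be checked
    if not result.connected:
        return ["connectivity"]
    failures = []
    if result.http_status != policy.expected_status:
        failures.append("http_status")
    if result.response_ms > policy.max_response_ms:
        failures.append("response_time")
    if len(result.body) < policy.min_content_length:
        failures.append("content_length")
    if any(keyword not in result.body for keyword in policy.required_keywords):
        failures.append("keywords")
    return failures


policy = CheckPolicy(max_response_ms=500.0, min_content_length=10, required_keywords=("</html>",))
healthy = evaluate(ProbeResult(True, True, 200, 120.0, "<html>hello world</html>"), policy)
degraded = evaluate(ProbeResult(True, True, 200, 900.0, "<html>"), policy)
```

Returning every failed check rather than stopping at the first lets one alarm say that a page was both slow and truncated, which is usually more useful than a bare "check failed".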
Intelligent Monitoring – Machine‑Learning Workflow
The workflow follows four steps: problem definition, data processing, model training, and model deployment. Regression models predict daily traffic trends; classification models detect real‑time anomalies.
The prediction results closely match actual data, while anomaly detection classifies anomalies into normal, severe, and abrupt categories, enabling differentiated alarm channels (voice for abrupt, SMS/WeChat for severe).
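The routing step can be sketched as follows. The channel mapping (voice for abrupt, SMS/WeChat for severe) comes from the text. Everything else is an assumption: the `classify` function is a simple threshold rule standing in for the trained classifier, its ratios are invented for the example, and the text does not state a channel for the normal category, so the sketch leaves it empty.

```python
def classify(actual: float, predicted: float, previous: float,
             severe_ratio: float = 0.3, abrupt_ratio: float = 0.5) -> str:
    """Stand-in rule, NOT the trained classifier; both ratios are assumptions."""
    # Abrupt: a sudden jump or drop relative to the previous sample.
    if previous > 0 and abs(actual - previous) / previous >= abrupt_ratio:
        return "abrupt"
    # Severe: a large deviation from the value the regression model predicted.
    if predicted > 0 and abs(actual - predicted) / predicted >= severe_ratio:
        return "severe"
    return "normal"


# Mapping from the text: voice for abrupt, SMS/WeChat for severe.
# The channel for "normal" is not given in the text, so it is left empty here.
CHANNELS = {"abrupt": ["voice"], "severe": ["sms", "wechat"], "normal": []}


def route(actual: float, predicted: float, previous: float) -> tuple[str, list[str]]:
    """Classify one sample and look up the channels its alarm should use."""
    label = classify(actual, predicted, previous)
    return label, CHANNELS[label]


steady = route(actual=1000, predicted=1020, previous=990)
sudden_drop = route(actual=400, predicted=1000, previous=950)
slow_drift = route(actual=650, predicted=1000, previous=700)
```

Keeping the label-to-channel table separate from the classifier means the channel policy can change without retraining or redeploying the model.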
Smart Alarm Merging
To avoid alarm flooding, alarms are merged within a 1‑minute window based on user, status, channel, and dimension (cluster, IP, subnet, exception type, host/VM relationship). A custom Gini‑value‑based algorithm iteratively selects merging dimensions and partitions the dataset until a stop condition is met.
Post‑merge, alarm volume decreased by 76.65% while preserving high merge quality, providing concise aggregated information for rapid decision‑making.
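One way to read the Gini-based selection is sketched below. It is an interpretation under stated assumptions, not the custom algorithm itself: it uses the standard Gini impurity (1 minus the sum of squared value frequencies), picks the dimension whose values are most concentrated, and stops when the best remaining dimension is too scattered or a group is too small. The `max_impurity` and `min_group` parameters and the alarm fields are hypothetical, and the input is assumed to already be one 1-minute window for a single user, status, and channel.

```python
from collections import Counter, defaultdict


def gini(values):
    """Gini impurity: 0 when all values are equal, near 1 when all differ."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def merge(alarms, dimensions, max_impurity=0.5, min_group=2):
    """Recursively partition alarms; returns a list of (shared_values, alarms)."""
    # Stop condition 1: group too small, or no dimension left to split on.
    if len(alarms) < min_group or not dimensions:
        return [({}, alarms)]
    # Select the dimension whose values are most concentrated.
    best = min(dimensions, key=lambda d: gini([a[d] for a in alarms]))
    # Stop condition 2: even the best dimension is too scattered to merge on.
    if gini([a[best] for a in alarms]) > max_impurity:
        return [({}, alarms)]
    remaining = [d for d in dimensions if d != best]
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm[best]].append(alarm)
    merged = []
    for value, members in groups.items():
        for shared, leaf in merge(members, remaining, max_impurity, min_group):
            merged.append(({best: value, **shared}, leaf))
    return merged


alarms = [{"cluster": "search-web", "exception": "cpu_high", "ip": f"10.0.0.{i}"}
          for i in range(1, 6)]
alarms.append({"cluster": "search-web", "exception": "disk_full", "ip": "10.0.0.6"})
merged = merge(alarms, ["cluster", "exception", "ip"])
```

In this toy window, six raw alarms collapse into two merged alarms: five `cpu_high` alarms on one cluster, and a single `disk_full` alarm. The IP dimension is never used as a merge key because every alarm has a different IP, so its impurity stays above the threshold.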
Smart Alarm Correlation Analysis
Correlation analysis uses Pearson coefficients to compute relationships among large numbers of metrics, automatically presenting root‑cause analysis and visualized correlation graphs in WeChat alerts.
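As a minimal sketch of the correlation step, the Pearson coefficient between an alarming metric and each candidate metric can be computed and ranked as shown below. The metric names, sample series, and the 0.8 threshold are assumptions for illustration. How the system selects candidate metrics and turns the ranking into a root-cause graph is not described in the text and is not modeled here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant series: coefficient undefined, treated as unrelated
    return cov / sqrt(var_x * var_y)


def rank_related(target, candidates, threshold=0.8):
    """Return (name, r) pairs with |r| >= threshold, strongest first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    related = [(name, r) for name, r in scored if abs(r) >= threshold]
    return sorted(related, key=lambda item: abs(item[1]), reverse=True)


response_time = [100, 120, 150, 200, 260]
candidates = {
    "cpu_usage": [30, 36, 45, 60, 78],       # rises in step with response time
    "disk_free": [80, 80, 80, 80, 80],       # constant, carries no signal
    "qps": [500, 480, 510, 495, 505],        # fluctuates without a clear trend
}
related = rank_related(response_time, candidates)
```

A high coefficient only shows that two metrics moved together in the window. It narrows the search for a root cause but does not by itself establish which metric caused the other.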
Traditional vs. Intelligent Monitoring
Traditional monitoring relies on static thresholds and manual analysis, whereas intelligent monitoring adds automation, three‑dimensional coverage, productization for better UX, and AI‑driven features such as prediction, anomaly detection, alarm merging, and self‑healing.
Summary
The system evolved through four stages: automation (auto‑sync from CMDB, template binding), three‑dimensional coverage (vertical and horizontal layers), productization (enhanced UI for internal users), and intelligence (integration of AI techniques to achieve predictive, self‑healing monitoring).
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.