How Wonder Transformed 360’s Monitoring: From Open‑Falcon to Scalable Ops
This article details the evolution of Wonder, 360’s internal monitoring platform built on Open‑Falcon, covering its architecture, key features, large‑scale deployment, data collection, alerting mechanisms, custom plugins, and future development plans for smarter, more flexible operations.
Preface
Wonder is a monitoring system developed by the ADDOPS and HULK‑dev teams based on the open‑source project Open‑Falcon. Launched in April 2016, it now monitors over 40,000 nodes and collects tens of millions of metrics.
Key Features
Powerful and flexible data collection
Efficient alert policy management
User‑friendly alert configuration
Fast historical data queries
High availability
Improvements Over Open‑Falcon
Agent automatic updates
Alive, port, and log monitoring
Alert queue control
Re‑alert after exceeding max alert count
Automatic disabling of alerts via hardware repair interface
Data‑center alert shielding
Persistent storage of LastEvent state
Current Architecture
Alive Component (Sniffer)
The Sniffer is an independently developed component deployed across multiple data centers to monitor network and port availability of machines.
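A port-availability probe of this kind usually reduces to a TCP connect with a timeout. The following is a minimal sketch of that idea, not the Sniffer's actual implementation; the function name `port_is_open` and the default timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper: the real Sniffer's probing logic is not described
    in the article.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Running the same probe from several data centers, as the Sniffer deployment suggests, helps separate a dead host from a network problem local to one vantage point.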
[Figure: two sets of Sniffer‑Agent status graphs]
Scale
Transfer_QPS: 200k/s, ~60 M items reported every 5 minutes
Collected metrics: >12 M
Storage used: 2.4 TB
RRD archive retention: 2 years
Data Reporting Example
{
  "metric": "df.bytes.used",
  "endpoint": "w01v.add.bjyt.qihoo.net",
  "tags": "fstype=ext4,mount=/",
  "value": 1.5,
  "timestamp": `date +%s`,
  "counterType": "GAUGE",
  "step": 60
}
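As a sketch of how a client might assemble such a data point, the helper below builds a dictionary with the same field names and serializes a batch as a JSON array. The names `build_metric` and `encode_batch` are hypothetical. Upstream Open‑Falcon agents accept an array like this over HTTP on the local agent's `/v1/push` endpoint; whether Wonder keeps that endpoint is an assumption, so the sketch stops short of sending anything.

```python
import json
import time


def build_metric(metric, endpoint, value, tags="", counter_type="GAUGE",
                 step=60, timestamp=None):
    """Build one data point using the field names from the reporting example."""
    if counter_type not in ("GAUGE", "COUNTER"):
        raise ValueError(f"unsupported counterType: {counter_type}")
    return {
        "metric": metric,
        "endpoint": endpoint,
        "tags": tags,
        "value": value,
        # Unix time in seconds, equivalent to `date +%s` in the example.
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "counterType": counter_type,
        "step": step,
    }


def encode_batch(points):
    """Serialize a list of data points as a UTF-8 JSON array."""
    return json.dumps(points).encode("utf-8")
```

Keeping `tags` as a comma-separated `key=value` string mirrors the example; a structured mapping would be easier to validate but would no longer match the reported format.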
Counter: monotonically increasing values (e.g., request count).
Gauge: instantaneous values (e.g., CPU usage).
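The practical difference is that a gauge can be stored as reported, while a counter is normally turned into a rate from two consecutive samples. The sketch below shows that conversion under simple assumptions: the function name `counter_rate` is hypothetical, and a decrease is treated as a counter reset rather than as a negative rate.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a monotonically increasing counter.

    Returns None when no rate can be computed: no time has elapsed, or the
    counter went backwards (treated here as a reset, e.g. a process restart).
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        return None
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / elapsed
```

For example, a request counter that moves from 1000 to 1600 over 60 seconds corresponds to 10 requests per second.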
sum(df_bytes_used{fstype="ext4",mount="/"}) by (fstype,mount,hulkid)
RRD Archive Policies
// 1-minute points for 3 days
c.RRA("AVERAGE", 0.5, 1, RRA1PointCnt)
// 5-minute points for 7 days
c.RRA("AVERAGE", 0.5, 5, RRA5PointCnt)
// 20-minute points for 15 days
c.RRA("AVERAGE", 0.5, 20, RRA20PointCnt)
// 3-hour points for 6 months
c.RRA("AVERAGE", 0.5, 180, RRA180PointCnt)
// 12-hour points for 2 years
c.RRA("AVERAGE", 0.5, 720, RRA720PointCnt)
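The row count of each archive follows from its retention period divided by the time span of one consolidated point. The sketch below works through that arithmetic, assuming a 60-second base step (as in the reporting example), 180 days for "6 months", and 730 days for "2 years". The resulting numbers are illustrative and are not the actual values of the `RRA*PointCnt` constants in Wonder.

```python
BASE_STEP_SECONDS = 60  # assumption: matches "step: 60" in the reporting example
DAY = 86400

# (base steps consolidated into one point, retention in seconds)
ARCHIVES = [
    (1, 3 * DAY),
    (5, 7 * DAY),
    (20, 15 * DAY),
    (180, 180 * DAY),  # assumption: 6 months taken as 180 days
    (720, 730 * DAY),  # assumption: 2 years taken as 730 days
]


def rows_needed(steps_per_point, retention_seconds, base_step=BASE_STEP_SECONDS):
    """Number of consolidated points needed to cover the retention period."""
    return retention_seconds // (steps_per_point * base_step)


ROWS = [rows_needed(steps, retention) for steps, retention in ARCHIVES]
```

Under these assumptions the five archives need 4320, 2016, 1080, 1440, and 1460 rows, so two years of history cost roughly 10,000 stored points per metric.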
Agent Automatic Updates
Wonder manages nearly 40k hosts, so deploying and upgrading agents by hand would be a major effort.
Deployment: one‑click installation via Qcmd.
Version upgrade: agents support automatic updates.
CMDB Integration
Wonder inherits the business hierarchy (node → main business → sub‑business → role) from the CMDB. Policies, custom monitoring, and log monitoring all follow this hierarchy, with fine‑grained permission control consistent with HULK.
Alert Configuration
Alert group settings apply to all members of the group; individual users can also set their own alert methods.
Custom Monitoring (Plugin)
Users can define custom monitoring items, including name, command, unit, etc.
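A custom item of this shape can be modeled as a small record plus a runner that executes the command and treats its output as the metric value. This is a sketch under stated assumptions: `PluginItem`, `run_plugin`, and the rule that the command prints a single number are hypothetical, not Wonder's plugin contract.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class PluginItem:
    name: str      # metric name
    command: list  # argv list to execute
    unit: str      # display unit; kept as metadata, not sent with the point


def run_plugin(item, endpoint, step=60, timeout=10.0):
    """Run the item's command and wrap its numeric stdout as a data point."""
    result = subprocess.run(item.command, capture_output=True, text=True,
                            timeout=timeout, check=True)
    value = float(result.stdout.strip())  # assumption: output is one number
    return {
        "metric": item.name,
        "endpoint": endpoint,
        "value": value,
        "timestamp": int(time.time()),
        "counterType": "GAUGE",
        "step": step,
    }


# Demo item that uses the current Python interpreter as a portable command.
DEMO_ITEM = PluginItem(name="demo.answer",
                       command=[sys.executable, "-c", "print(42)"],
                       unit="count")
```

Passing the command as an argument list rather than a shell string avoids shell-injection issues when item definitions come from users.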
Log Monitoring
Monitors application logs, processes them, and reports; log paths support date‑function matching.
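The article does not give the syntax of the date functions, so the sketch below assumes `strftime`-style placeholders in the path and a regular expression as the match rule. Both function names are hypothetical.

```python
import re
from datetime import datetime


def resolve_log_path(template, now):
    """Expand strftime-style date placeholders in a log path template.

    Assumption: Wonder's actual placeholder syntax may differ.
    """
    return now.strftime(template)


def count_matches(lines, pattern):
    """Count the lines that match `pattern`; the count becomes the metric value."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))
```

With the template `/data/logs/app/error-%Y%m%d.log`, the monitored file rolls over to a new path each day without any change to the configuration.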
Alive Monitoring
Built‑in basic monitoring; users only need to configure policies to trigger alerts.
Port Monitoring
Nginx Monitoring
Integrated into the agent, providing access statistics, request times, 4xx/5xx error counts, etc.
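Statistics like these can be derived from the access log. The sketch below counts requests and 4xx/5xx responses from lines in nginx's default `combined` log format; it is an illustration, not the agent's actual collector, and it omits request times because the default format does not log them.

```python
import re
from collections import Counter

# Matches the quoted request line, status code, and body size of the
# "combined" format: ... "GET /path HTTP/1.1" 200 612 ...
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarize(lines):
    """Count total requests, 4xx and 5xx responses, and unparsed lines."""
    stats = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            stats["unparsed"] += 1
            continue
        stats["requests"] += 1
        status = int(match.group("status"))
        if 400 <= status < 500:
            stats["4xx"] += 1
        elif 500 <= status < 600:
            stats["5xx"] += 1
    return stats
```

Collecting request times would require adding a variable such as `$request_time` to a custom `log_format` and extending the pattern to match it.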
Alert Statistics
Users can view alert status and history, and manually acknowledge events (similar to Zabbix).
Application Monitoring
Periodically runs scripts on servers and VIPs; users write scripts that return structured results, which the system evaluates to trigger alerts.
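The evaluation step can be sketched as follows, assuming the script prints a JSON object with `status`, `message`, and `latency_ms` fields. That result schema and the name `evaluate_check` are hypothetical; the article does not specify the structure Wonder expects.

```python
import json


def evaluate_check(raw_output, max_latency_ms):
    """Decide whether a script result should trigger an alert.

    Returns a (should_alert, reason) tuple.
    """
    try:
        result = json.loads(raw_output)
    except json.JSONDecodeError:
        # A script that prints garbage is itself worth alerting on.
        return True, "unparseable script output"
    if not isinstance(result, dict):
        return True, "unexpected result structure"
    if result.get("status") != "ok":
        return True, result.get("message", "script reported failure")
    latency = result.get("latency_ms")
    if isinstance(latency, (int, float)) and latency > max_latency_ms:
        return True, f"latency {latency} ms exceeds {max_latency_ms} ms"
    return False, "ok"
```

Treating malformed output as a failure keeps a broken check from silently passing.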
Ongoing Work
LVS traffic spike monitoring
Data filtering and caching modules
Integration with Prometheus
Custom data push API
Historical charts with year‑over‑year and period‑over‑period comparison
Future Directions
Provide comprehensive base data and flexible custom data push
Leverage data mining for overall business health assessment
Advance intelligent monitoring: dynamic thresholds, alert correlation, prediction
Expand service coverage and development support