How Wonder Transformed 360’s Monitoring: From Open‑Falcon to Scalable Ops

This article details the evolution of Wonder, 360’s internal monitoring platform built on Open‑Falcon, covering its architecture, key features, large‑scale deployment, data collection, alerting mechanisms, custom plugins, and future development plans for smarter, more flexible operations.

360 Zhihui Cloud Developer

Sep 21, 2017

Preface

Wonder is a monitoring system developed by the ADDOPS and HULK‑dev teams based on the open‑source project Open‑Falcon. Launched in April 2016, it now monitors over 40,000 nodes and collects tens of millions of metrics.

Key Features

Powerful and flexible data collection

Efficient alert policy management

User‑friendly alert configuration

Fast historical data queries

High availability

Improvements Over Open‑Falcon

Agent automatic updates

Alive, port, and log monitoring

Alert queue control

Re‑alert after exceeding max alert count

Automatic disabling of alerts via hardware repair interface

Data‑center alert shielding

Persistent storage of LastEvent state

Current Architecture

Alive Component (Sniffer)

The Sniffer is an independently developed component deployed across multiple data centers to monitor network and port availability of machines.
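A port availability probe of the kind the Sniffer performs can be sketched as a TCP connect with a timeout. This is only an illustration: the function name and timeout are assumptions, and the real Sniffer implementation is not described in this article.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Running the same probe from several data centers, as the Sniffer deployment does, helps distinguish a dead host from a network problem local to one probe.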

[Figure: two sets of Sniffer‑Agent status graphs]

Scale

Transfer_QPS: 200k/s (about 60 M items reported every 5 minutes)

Collected metrics: more than 12 M

Storage used: 2.4 TB

RRD archive retention: 2 years
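The throughput figures can be cross-checked with a little arithmetic. The 60-second reporting step used in the second half is an assumption taken from the reporting example in this article, not a stated property of every metric.

```python
TRANSFER_QPS = 200_000    # items per second, from the figures above
WINDOW_SECONDS = 5 * 60   # the 5-minute window

items_per_window = TRANSFER_QPS * WINDOW_SECONDS
print(items_per_window)   # 60000000

# Assumption: every collected metric reports once per 60-second step.
ASSUMED_STEP_SECONDS = 60
COLLECTED_METRICS = 12_000_000
implied_qps = COLLECTED_METRICS / ASSUMED_STEP_SECONDS
print(implied_qps)        # 200000.0
```

Under that assumption, 12 M metrics reporting once a minute account for the 200k/s rate.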

Data Reporting Example

{
    metric: df.bytes.used,
    endpoint: w01v.add.bjyt.qihoo.net,
    tags: fstype=ext4,mount=/,
    value: 1.5,
    timestamp: `date +%s`,
    counterType: GAUGE,
    step: 60
}
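As a minimal sketch, the same item can be built programmatically. The helper name `build_item` is hypothetical, and wrapping items in a JSON array is an assumption based on upstream Open‑Falcon, whose agent accepts such an array via HTTP POST to its `/v1/push` endpoint. Whether Wonder keeps that endpoint is not stated here.

```python
import json
import time


def build_item(metric, endpoint, value, tags="", counter_type="GAUGE", step=60):
    """Build one report item with the fields shown in the example above."""
    return {
        "metric": metric,
        "endpoint": endpoint,
        "tags": tags,
        "value": value,
        "timestamp": int(time.time()),  # same value as `date +%s`
        "counterType": counter_type,
        "step": step,
    }


item = build_item("df.bytes.used", "w01v.add.bjyt.qihoo.net", 1.5,
                  tags="fstype=ext4,mount=/")
payload = json.dumps([item])  # assumption: items are sent as a JSON array
```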

counterType takes one of two values:

COUNTER: monotonically increasing values (e.g., request count).

GAUGE: instantaneous values (e.g., CPU usage).
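The practical difference is that a gauge is stored as reported, while a counter is usually turned into a rate between two consecutive samples. The function below is an illustrative sketch of that conversion. Returning `None` when the counter resets is an assumption here, not documented Wonder behavior.

```python
def counter_rate(prev_value, prev_ts, cur_value, cur_ts):
    """Per-second rate between two samples of a monotonically increasing counter.

    Returns None when no rate can be derived: no elapsed time, or the
    counter went backwards (for example after a process restart).
    """
    elapsed = cur_ts - prev_ts
    if elapsed <= 0 or cur_value < prev_value:
        return None
    return (cur_value - prev_value) / elapsed


rate = counter_rate(1000, 0, 1600, 60)  # 600 requests over 60 seconds
```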

Tags can then be used to filter and aggregate series, for example in a PromQL‑style query:

sum(df_bytes_used{fstype="ext4",mount="/"}) by (fstype,mount,hulkid)

RRD Archive Policies

// 1‑minute points for 3 days
c.RRA("AVERAGE", 0.5, 1, RRA1PointCnt)
// 5‑minute points for 7 days
c.RRA("AVERAGE", 0.5, 5, RRA5PointCnt)
// 20‑minute points for 15 days
c.RRA("AVERAGE", 0.5, 20, RRA20PointCnt)
// 3‑hour points for 6 months
c.RRA("AVERAGE", 0.5, 180, RRA180PointCnt)
// 12‑hour points for 2 years
c.RRA("AVERAGE", 0.5, 720, RRA720PointCnt)
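The row count each archive needs follows from its resolution and retention. The sketch below derives those counts under three stated assumptions: a 60-second base step, 6 months taken as 180 days, and 2 years taken as 730 days. The results are derived values, not the constants used in Wonder's code.

```python
STEP_SECONDS = 60  # assumption: base step, matching "step: 60" in the report example
DAY_SECONDS = 24 * 3600

# name: (base steps per consolidated point, retention in days)
archives = {
    "RRA1PointCnt": (1, 3),
    "RRA5PointCnt": (5, 7),
    "RRA20PointCnt": (20, 15),
    "RRA180PointCnt": (180, 180),  # assumption: 6 months ~ 180 days
    "RRA720PointCnt": (720, 730),  # assumption: 2 years = 730 days
}


def rows_needed(steps_per_point, retention_days, step=STEP_SECONDS):
    """Number of consolidated points required to cover the retention period."""
    return retention_days * DAY_SECONDS // (steps_per_point * step)


point_counts = {name: rows_needed(*spec) for name, spec in archives.items()}
```

Each coarser archive covers a longer period with a similar number of rows, which keeps the size of every RRD file fixed and small.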

Agent Automatic Updates

Wonder manages roughly 40,000 hosts, so deploying and upgrading agents by hand would be a major effort.

Deployment: one‑click installation via Qcmd.

Version upgrade: agents support automatic updates.

CMDB Integration

Wonder inherits the business hierarchy (node → main business → sub‑business → role). Policies, custom monitoring, and log monitoring all follow this hierarchy, and fine‑grained permission control is consistent with HULK.
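One way to picture hierarchy-following policies is to collect them from every ancestor level down to a role. The policy table, path names, and expressions below are all hypothetical, and the sketch assumes that a policy set at a higher level applies to everything beneath it.

```python
# Hypothetical policy table keyed by hierarchy path
# (node, main business, sub-business, role).
POLICIES = {
    ("web",): ["cpu.idle < 10"],
    ("web", "search"): ["df.bytes.used.percent > 90"],
    ("web", "search", "frontend", "nginx"): ["nginx.5xx > 100"],
}


def effective_policies(path):
    """Collect policies from every ancestor level down to the given path."""
    collected = []
    for depth in range(1, len(path) + 1):
        collected.extend(POLICIES.get(tuple(path[:depth]), []))
    return collected
```

With this table, `effective_policies(("web", "search", "frontend", "nginx"))` returns all three policies, while a host under `("web", "mail")` only receives the node-level one.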

Alert Configuration

Alert groups affect all members; individual users can set personal alert methods.

Custom Monitoring (Plugin)

Users can define custom monitoring items, including name, command, unit, etc.
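A custom item of this kind boils down to running the configured command and reporting its output. The sketch below assumes the command prints a single number and carries the unit as a tag. Both choices, and the `run_plugin` name, are illustrative rather than Wonder's actual plugin contract.

```python
import subprocess
import sys
import time


def run_plugin(name, command, unit, endpoint, step=60, timeout=10):
    """Run a plugin command and turn its numeric stdout into one report item."""
    result = subprocess.run(command, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return {
        "metric": name,
        "endpoint": endpoint,
        "tags": f"unit={unit}",  # assumption: the unit is carried as a tag
        "value": float(result.stdout.strip()),
        "timestamp": int(time.time()),
        "counterType": "GAUGE",
        "step": step,
    }


# Stand-in command that prints a number; a real plugin would run a user script.
item = run_plugin("queue.length", [sys.executable, "-c", "print(42)"],
                  "items", "host-01")
```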

Log Monitoring

The agent reads application logs, processes them, and reports the results. Log paths support date functions, so date‑rotated files can be matched.
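The article does not give the exact date-function syntax. As a sketch, assume strftime-style placeholders in the path and a regular expression for the lines of interest. Both helper names are hypothetical.

```python
import re
from datetime import datetime


def expand_log_path(pattern, now=None):
    """Expand strftime-style placeholders such as %Y%m%d in a log path."""
    return (now or datetime.now()).strftime(pattern)


def count_matches(lines, regex):
    """Count the lines that match a regular expression."""
    compiled = re.compile(regex)
    return sum(1 for line in lines if compiled.search(line))


path = expand_log_path("/data/logs/app-%Y%m%d.log", datetime(2017, 9, 21))
errors = count_matches(["ERROR db timeout", "INFO ok", "ERROR disk full"],
                       r"\bERROR\b")
```

Here `path` resolves to `/data/logs/app-20170921.log`, and `errors` is the kind of per-interval count that could be reported as a metric.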

Alive Monitoring

Built‑in basic monitoring; users only need to configure policies to trigger alerts.

Port Monitoring

Nginx Monitoring

Integrated into the agent, providing access statistics, request times, 4xx/5xx error counts, etc.
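To illustrate the statistics involved, the sketch below summarizes access-log lines. It assumes a log format that ends with `"$request" $status $body_bytes_sent $request_time`. The agent's real data source and format are not described in this article.

```python
import re
from collections import Counter

# Assumed line tail: "METHOD PATH PROTO" status bytes request_time
LINE_RE = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) \d+ (?P<rt>\d+\.\d+)$')


def summarize(lines):
    """Count requests by status class and average the request time."""
    classes = Counter()
    total_time = 0.0
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        classes[match.group("status")[0] + "xx"] += 1
        total_time += float(match.group("rt"))
    requests = sum(classes.values())
    return {
        "requests": requests,
        "4xx": classes["4xx"],
        "5xx": classes["5xx"],
        "avg_request_time": total_time / requests if requests else 0.0,
    }


sample = [
    '10.0.0.1 - - [21/Sep/2017:10:00:00 +0800] "GET / HTTP/1.1" 200 512 0.010',
    '10.0.0.2 - - [21/Sep/2017:10:00:01 +0800] "GET /x HTTP/1.1" 404 162 0.002',
    '10.0.0.3 - - [21/Sep/2017:10:00:02 +0800] "POST /y HTTP/1.1" 502 0 0.030',
]
stats = summarize(sample)
```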

Alert Statistics

Users can view alert status and history, and manually acknowledge events (similar to Zabbix).

Application Monitoring

Periodically runs scripts on servers and VIPs; users write scripts that return structured results, which the system evaluates to trigger alerts.
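The evaluation step might look like the sketch below. The JSON schema (`status`, `message`, `latency_ms`) and the latency threshold are hypothetical, chosen only to show how a structured result can be turned into an alert decision.

```python
import json


def evaluate(script_output, max_latency_ms=500):
    """Decide whether a structured script result should raise an alert.

    Returns a (should_alert, reason) tuple.
    """
    result = json.loads(script_output)
    if result.get("status") != "ok":
        return True, f"check failed: {result.get('message', 'no message')}"
    if result.get("latency_ms", 0) > max_latency_ms:
        return True, f"latency {result['latency_ms']} ms exceeds {max_latency_ms} ms"
    return False, "healthy"
```

Keeping the script output structured lets the platform apply thresholds centrally instead of burying them in each user script.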

Ongoing Work

LVS traffic spike monitoring

Data filtering and caching modules

Integration with Prometheus

Custom data push API

Historical charts with same/period‑over‑period comparison

Future Directions

Provide comprehensive base data and flexible custom data push

Leverage data mining for overall business health assessment

Advance intelligent monitoring: dynamic thresholds, alert correlation, prediction

Expand service coverage and development support
