Design and Evolution of Vivo Server Monitoring System

This article covers the business background, the basic monitoring workflow, usage guidelines, OpenTSDB fundamentals, a floating-point precision issue, the vmonitor collector architecture, the old and new system designs, core alerting metrics, demo illustrations, and a comparison with mainstream monitoring solutions, offering insights for technology selection.

Architecture Digest

Business background: as the volume of information grows, platforms become increasingly complex, so an effective monitoring system is needed to promptly detect core business issues, CPU spikes, and disk saturation.

Basic monitoring workflow: data collection (JVM metrics, system resources, business logs), transmission (message queue or HTTP), storage (MySQL, OpenTSDB on HBase, or HBase directly), visualization (charts), and flexible alerting via email, SMS, or IM.

Guidelines for using the monitoring system: understand JVM memory structure and GC, define metric states, set reasonable thresholds, handle alerts quickly, and establish a fault‑handling process.
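To make the guideline about metric states and thresholds concrete, here is a minimal sketch that reads heap and GC figures through the standard java.lang.management API and maps heap usage to a state. The class name, the three states, and the 75% / 90% thresholds are illustrative assumptions, not values from vmonitor.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: names and thresholds are assumptions for illustration.
class JvmThresholdCheck {
    enum State { OK, WARNING, CRITICAL }

    // Map a usage ratio in [0, 1] to a metric state.
    static State classify(double ratio, double warn, double critical) {
        if (ratio >= critical) return State.CRITICAL;
        if (ratio >= warn) return State.WARNING;
        return State.OK;
    }

    // Heap usage as a fraction of the maximum heap; falls back to committed
    // memory when the JVM reports no maximum (getMax() returns -1).
    static double heapUsedRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        double ratio = heapUsedRatio();
        System.out.printf("heap used: %.1f%% -> %s%n", ratio * 100, classify(ratio, 0.75, 0.90));
        // Per-collector GC counts and accumulated pause time since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Keeping the classification separate from the measurement lets the thresholds be tuned per metric without touching the collection code.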

OpenTSDB introduction: a distributed, scalable time‑series database built on HBase, storing Data Points (Metric, Tags, Value, Timestamp) in tables tsdb and tsdb-uid, with high throughput and extensibility.
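As a small illustration of the Data Point model, the sketch below formats one point as a line for OpenTSDB's telnet-style put command (put, metric, timestamp, value, then tagk=tagv pairs). The class name, metric name, and tags are made up for the example; how vmonitor actually writes points is not described in this article.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
class OpenTsdbPutLine {
    // Builds "put <metric> <timestamp> <value> <tagk=tagv> ...".
    // Tags are sorted by key so the output is deterministic.
    static String format(String metric, long epochSeconds, double value, Map<String, String> tags) {
        if (tags.isEmpty()) {
            throw new IllegalArgumentException("OpenTSDB requires at least one tag per data point");
        }
        StringJoiner line = new StringJoiner(" ");
        line.add("put").add(metric).add(Long.toString(epochSeconds)).add(Double.toString(value));
        new TreeMap<>(tags).forEach((k, v) -> line.add(k + "=" + v));
        return line.toString();
    }

    public static void main(String[] args) {
        // Example metric and tags are assumptions, not vmonitor's real naming.
        System.out.println(format("sys.cpu.usage", 1700000000L, 0.51,
                Map.of("host", "web01", "idc", "dc1")));
    }
}
```

Each distinct combination of metric and tags forms its own time series, which is what makes tag-based, multi-dimensional queries possible.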

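Precision issue when storing floating‑point values in OpenTSDB: a value such as 0.51 that is encoded as a 4-byte single-precision float no longer equals the original decimal once it is decoded and widened to a double, as the following code illustrates: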

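// Requires the OpenTSDB and asynchbase jars on the classpath.
import net.opentsdb.uid.UniqueId;
import org.hbase.async.Bytes;

public class FloatPrecisionDemo {
    public static void main(String[] args) {
        String value = "0.51";
        float f = Float.parseFloat(value);            // nearest 32-bit float to 0.51
        int raw = Float.floatToRawIntBits(f);         // IEEE 754 bit pattern of that float
        byte[] floatBytes = Bytes.fromInt(raw);       // the 4 bytes that get stored
        int rawBack = Bytes.getInt(floatBytes, 0);    // read the same 4 bytes back
        // Widening the decoded float to double exposes the rounding error.
        double decoded = Float.intBitsToFloat(rawBack);
        System.out.println("Parsed Float: " + f);
        System.out.println("Encode Raw: " + raw);
        System.out.println("Encode Bytes: " + UniqueId.uidToString(floatBytes));
        System.out.println("Decode Raw: " + rawBack);
        System.out.println("Decoded Float: " + decoded); // prints 0.5099999904632568, not 0.51
    }
}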
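The same effect can be reproduced with the JDK alone, which may help when the OpenTSDB classes are not on the classpath. The sketch below uses java.nio.ByteBuffer in place of Bytes for the 4-byte round trip. The widenViaDecimal helper shows one possible way to recover the short decimal form for display; it is an assumption added for illustration, not something the article says vmonitor does.

```java
import java.math.BigDecimal;
import java.nio.ByteBuffer;

// Hypothetical JDK-only reproduction of the precision issue.
class FloatRoundTrip {
    // Encode as a 4-byte big-endian IEEE 754 float, then decode it again.
    static float roundTrip(float f) {
        byte[] bytes = ByteBuffer.allocate(4).putFloat(f).array();
        return ByteBuffer.wrap(bytes).getFloat();
    }

    // Widen through the float's shortest decimal string instead of its binary value.
    static double widenViaDecimal(float f) {
        return new BigDecimal(Float.toString(f)).doubleValue();
    }

    public static void main(String[] args) {
        float decoded = roundTrip(Float.parseFloat("0.51"));
        double naive = decoded; // implicit widening keeps the float's rounding error
        System.out.println("float value:      " + decoded);
        System.out.println("widened directly: " + naive);
        System.out.println("widened via text: " + widenViaDecimal(decoded));
    }
}
```

The round trip itself is lossless. The discrepancy appears only when the 32-bit value is interpreted as a 64-bit double, so comparisons against thresholds are best done at a consistent precision.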

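Aggregation function limitation: most OpenTSDB aggregators fill missing points by linear interpolation, which produces synthesized values in slots where a series actually has no data; vmonitor adds a custom nimavg function that works alongside the built-in zimsum so that empty slots are handled explicitly.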
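The difference between interpolating and treating a slot as empty can be sketched as follows. zimsum is shown with its documented behavior of substituting zero for a missing point. The article does not define nimavg, so the version here (average only the series that have a point at that timestamp, and return no value when none do) is an assumption made purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical sketch: each series maps a timestamp to a value.
class MissingValueAggregation {
    // Sum across series at one timestamp; a missing point counts as zero.
    static double zimsum(List<Map<Long, Double>> series, long ts) {
        double sum = 0.0;
        for (Map<Long, Double> s : series) {
            sum += s.getOrDefault(ts, 0.0);
        }
        return sum;
    }

    // ASSUMED semantics for nimavg: average over series that have a point at ts,
    // empty when no series has one (nothing is interpolated).
    static OptionalDouble nimavg(List<Map<Long, Double>> series, long ts) {
        return series.stream()
                .filter(s -> s.containsKey(ts))
                .mapToDouble(s -> s.get(ts))
                .average();
    }

    public static void main(String[] args) {
        List<Map<Long, Double>> series = List.of(
                Map.of(60L, 1.0, 120L, 2.0),
                Map.of(60L, 3.0)); // second host has no point at t=120
        System.out.println("zimsum@120 = " + zimsum(series, 120L));
        System.out.println("nimavg@120 = " + nimavg(series, 120L));
        System.out.println("nimavg@180 = " + nimavg(series, 180L));
    }
}
```

Returning an empty result instead of an interpolated number lets the alerting layer distinguish "no data reported" from "value is low".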

vmonitor collector architecture: three collectors (OS, JVM, business) run every minute, aggregate data, and push three packaged reports to RabbitMQ; business metric collectors can also push real‑time data.
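The once-a-minute collection cycle can be sketched with a ScheduledExecutorService. Everything named here is hypothetical: the collectors are plain Supplier<String> instances, and the publisher is a Consumer<String> standing in for the RabbitMQ producer, whose real client code is not shown in the article.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch of a periodic collect-and-publish loop.
class MinuteCollectors {
    private final List<Supplier<String>> collectors;
    private final Consumer<String> publisher; // stand-in for the message-queue producer

    MinuteCollectors(List<Supplier<String>> collectors, Consumer<String> publisher) {
        this.collectors = collectors;
        this.publisher = publisher;
    }

    // One cycle: each collector packages its report, which is then published.
    // A failing collector is logged and skipped so the others still report.
    void runOnce() {
        for (Supplier<String> collector : collectors) {
            try {
                publisher.accept(collector.get());
            } catch (RuntimeException e) {
                System.err.println("collector failed: " + e);
            }
        }
    }

    // Schedules runOnce every minute; the caller shuts the scheduler down.
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::runOnce, 0, 1, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        List<Supplier<String>> collectors = List.of(
                () -> "os: processors=" + Runtime.getRuntime().availableProcessors(),
                () -> "jvm: freeMemory=" + Runtime.getRuntime().freeMemory(),
                () -> "business: orders=0"); // placeholder business metric
        new MinuteCollectors(collectors, System.out::println).runOnce();
    }
}
```

Catching exceptions inside runOnce matters because scheduleAtFixedRate silently cancels all future runs if a task throws.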

Old version architecture: data collection via vmonitor‑agent, RabbitMQ transport, aggregation in OpenTSDB/HBase, Redis for task distribution, MySQL for configuration, Zookeeper for coordination, and alert detection with distributed computation.

New version redesign: HTTP reporting replaces RabbitMQ and CDN, using vmonitor‑gateway for authentication, back‑pressure handling, Redis queue buffering, and final storage to OpenTSDB, improving resilience to network or service failures.
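The gateway's role of authenticating reports, buffering them, and pushing back when the buffer is full can be sketched as below. This is a simplified in-memory model under stated assumptions: a bounded BlockingQueue stands in for the Redis queue, a single shared token stands in for whatever authentication vmonitor-gateway really uses, and the HTTP-style status codes are chosen for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical in-memory model of a reporting gateway with back-pressure.
class GatewayBuffer {
    private final BlockingQueue<String> queue; // stand-in for the Redis queue
    private final String expectedToken;        // stand-in for real authentication

    GatewayBuffer(int capacity, String expectedToken) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.expectedToken = expectedToken;
    }

    // Returns an HTTP-style status: 401 bad token, 429 buffer full, 202 accepted.
    int accept(String token, String payload) {
        if (!expectedToken.equals(token)) {
            return 401;
        }
        // offer() never blocks: a full queue is reported to the client at once,
        // so the client can back off and retry instead of piling up requests.
        return queue.offer(payload) ? 202 : 429;
    }

    // Called by the storage writer; returns null when nothing is buffered.
    String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        GatewayBuffer gateway = new GatewayBuffer(2, "secret");
        System.out.println(gateway.accept("secret", "report-1"));
        System.out.println(gateway.accept("secret", "report-2"));
        System.out.println(gateway.accept("secret", "report-3"));
    }
}
```

Because the writer drains the buffer independently of incoming requests, a slow or unavailable storage layer shows up as rejected reports rather than as a stalled gateway.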

Core alerting metrics and formulas: maximum, minimum, fluctuation (up/down/interval), day‑over‑day, week‑over‑week, and hour‑day comparisons, each with explicit float‑based calculation expressions.
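The article lists the alert types without reproducing their expressions, so the sketch below only shows one plausible shape for them. The formulas are assumptions: threshold checks compare the current value with a fixed bound, and fluctuation or period-over-period checks compare the percentage change against a reference value (the previous interval, the same time yesterday, or the same time last week) with a configured limit.

```java
// Hypothetical float-based alert checks; formulas are assumed, not taken from vmonitor.
class AlertFormulas {
    static boolean exceedsMax(float current, float max) {
        return current > max;
    }

    static boolean belowMin(float current, float min) {
        return current < min;
    }

    // Signed percentage change of current relative to a reference value.
    static float changePercent(float current, float reference) {
        if (reference == 0f) {
            throw new IllegalArgumentException("reference value must be non-zero");
        }
        return (current - reference) * 100f / Math.abs(reference);
    }

    // Fires when the change in either direction exceeds limitPercent.
    // Use changePercent directly for up-only or down-only rules.
    static boolean fluctuationAlert(float current, float reference, float limitPercent) {
        return Math.abs(changePercent(current, reference)) > limitPercent;
    }

    public static void main(String[] args) {
        float now = 120f, sameTimeYesterday = 100f;
        System.out.println("day-over-day change %: " + changePercent(now, sameTimeYesterday));
        System.out.println("alert at 15% limit: " + fluctuationAlert(now, sameTimeYesterday, 15f));
    }
}
```

A zero reference is rejected explicitly because a percentage change is undefined there; a real rule engine would need its own policy for that case.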

Demo effects: business metric query interface, system/JVM monitoring with auto‑refresh, red‑highlighted machines missing data, detailed view buttons, and configurable business metric definitions (log filter or invasive code reporting).

Comparison with mainstream monitoring tools: Zabbix (relational DB, no tag support, limited SDK), Open‑Falcon (Go/Python, proxy‑gateway, easy custom metrics), Prometheus (built-in TSDB with no dependency on external storage, simple architecture), and vmonitor (customized OpenTSDB with nimavg, multi‑dimensional alerts, SDK for easy integration).

Conclusion: the article outlines the design and evolution of Vivo's server monitoring platform, built on a Java stack, and provides a reference for selecting appropriate monitoring technologies.
