
Design and Implementation of a Hundred‑Billion‑Scale Real‑Time Monitoring System

This article presents the design and deployment of a hundred‑billion‑scale real‑time monitoring platform that meets stringent data‑collection, analysis, storage, alerting, and visualization requirements. It compares an Oceanus + Elastic Stack solution against a Zabbix + Prometheus + Grafana stack, selects the former, and details the performance and cost optimizations that enable massive, low‑latency monitoring while maintaining high availability.

Tencent Cloud Developer

Why Build a Monitoring System

In the post‑mobile‑Internet era, a good user experience is the foundation of growth, and a stable experience is the basis of user satisfaction. Large Internet companies, especially consumer‑facing (to‑C) businesses, require high system stability and minute‑level response to online issues. For example, a 10‑minute outage of a ride‑hailing service can trigger massive complaints and significant economic loss.

Because business systems are large‑scale distributed systems with complex dependencies, any node failure can render the system unavailable. Therefore, monitoring is crucial to detect and intervene in problems promptly, reducing incident impact.

Monitoring System Requirements

The goal of monitoring is to discover and locate system anomalies efficiently, shortening the mean time to repair (MTTR) and minimizing loss. Required capabilities include:

Data collection: comprehensive, accurate, low‑loss acquisition of logs and metrics.

Data aggregation: consolidating related data for processing.

Data analysis & processing: anomaly detection, fault diagnosis, filtering, sampling, transformation.

Data storage: high‑performance storage for massive logs and metrics.

Alerting: rule‑based notifications via phone, email, WeChat, SMS, etc.

Visualization: dashboards for rapid problem localization.

High availability: the monitoring system itself must remain operational.
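Among these capabilities, rule‑based alerting is the most mechanical: a rule fires when a metric stays beyond a threshold long enough. A minimal sketch, with a hypothetical rule shape (the `AlertRule` fields and the consecutive‑period semantics are illustrative, not from the article):

```python
from dataclasses import dataclass

# Hypothetical alert rule: fire when a metric breaches its threshold
# for `for_periods` consecutive evaluation rounds.
@dataclass
class AlertRule:
    metric: str
    threshold: float
    for_periods: int = 3

def evaluate(rule: AlertRule, samples: list[float]) -> bool:
    """Return True when the last `for_periods` samples all exceed the threshold."""
    window = samples[-rule.for_periods:]
    return len(window) == rule.for_periods and all(s > rule.threshold for s in window)

rule = AlertRule(metric="error_rate", threshold=0.05, for_periods=3)
print(evaluate(rule, [0.01, 0.06, 0.07, 0.09]))  # → True
print(evaluate(rule, [0.06, 0.07, 0.01]))        # → False
```

Requiring several consecutive breaches before firing is a common way to suppress flapping alerts from transient spikes.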

Overall Architecture

The data flow can be abstracted as: Collection → Aggregation → Processing → Storage → Analysis → Visualization → Alerting. The typical architecture consists of:

Data collection layer (agents, RPC tracing, HTTP push).

Unified aggregation layer (e.g., message queues).

Three parallel downstream paths: real‑time processing, analysis, and storage.

Alerting layer based on defined rules.

Dashboard layer for monitoring and alert panels.
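The abstract data flow above can be sketched as a chain of stage functions. Everything here is illustrative (stage names, the `service level message` record format, the list used as a storage sink); it only shows how each layer hands structured records to the next:

```python
# Minimal sketch of the abstract data flow; all record fields are assumptions.
def collect(raw_lines):
    # Collection layer: wrap each raw line in a record.
    return [{"raw": line} for line in raw_lines]

def aggregate(records):
    # Unified aggregation layer (e.g., a message queue): here it simply
    # batches the records for downstream consumers.
    return list(records)

def process(records):
    # Processing layer: parse "service level message" into structured fields.
    out = []
    for r in records:
        service, level, message = r["raw"].split(" ", 2)
        out.append({"service": service, "level": level, "message": message})
    return out

def store(records, sink):
    # Storage layer: append structured records to the sink.
    sink.extend(records)
    return sink

sink: list = []
store(process(aggregate(collect(["pay ERROR timeout", "auth INFO ok"]))), sink)
print(len(sink), sink[0]["level"])  # → 2 ERROR
```

In the real system each stage is a distributed component (agents, Kafka, Flink jobs, Elasticsearch), but the hand‑off contract between layers is the same idea.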

Technical Solutions

Solution 1: Oceanus Stream Computing + Elastic Stack

Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) is widely used for log collection and analysis. However, Beats only collect data and lack processing capabilities, while Logstash’s processing is limited and can cause data loss.

To address these gaps, a message queue (Kafka) is used to buffer Beats output, and Oceanus (built on Apache Flink) performs real‑time cleaning, transformation, and aggregation before feeding results into Elasticsearch for search and Kibana for log analysis. Grafana is employed as the monitoring dashboard.

Solution 2: Zabbix + Prometheus + Grafana

Zabbix provides distributed system monitoring with customizable alerts. Prometheus offers a lightweight time‑series database and alerting engine, while Grafana visualizes metrics from multiple sources. This stack is simple to deploy and suitable for containerized environments, but it struggles with ultra‑large scale data volumes.

Selection Summary

The combination of Oceanus and Elastic Stack delivers a distributed, scalable, real‑time monitoring system capable of handling massive log and metric volumes, albeit with higher deployment complexity and resource consumption. The Zabbix‑Prometheus‑Grafana stack is easier to deploy but faces performance bottlenecks at extreme scale.

System Optimizations

Oceanus Optimizations

SQL performance: native Flink SQL is enhanced for large‑scale jobs.

Data skew mitigation: automatic Local‑Global aggregation and mini‑batch processing reduce hotspot pressure.

UDF reuse: repeated UDF calls are cached to avoid redundant execution.

Dimension‑table join: bucket‑based loading minimizes memory consumption and back‑pressure.

Intelligent job diagnostics: visual alerts for job restarts, snapshot failures, and resource anomalies.

Automatic scaling: dynamic parallelism adjustment based on CPU, memory, and back‑pressure.
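The Local‑Global aggregation with mini‑batching mentioned above can be sketched as follows: each parallel subtask pre‑aggregates its own mini‑batch, so the downstream global task receives one partial result per key per batch instead of every record of a hot key. The batch contents and key names are illustrative:

```python
from collections import Counter, defaultdict

# Sketch of Local-Global aggregation with mini-batching: local subtasks
# pre-aggregate, the global task merges the partial counts.
def local_aggregate(mini_batch):
    # Per-subtask pre-aggregation over one mini-batch: key -> partial count.
    return Counter(mini_batch)

def global_aggregate(partials):
    # Global task merges the partial counts from all subtasks.
    total = defaultdict(int)
    for partial in partials:
        for key, count in partial.items():
            total[key] += count
    return dict(total)

# A skewed stream: the "hot" key dominates both subtasks' batches.
subtask_batches = [["hot"] * 4 + ["cold"], ["hot"] * 5]
partials = [local_aggregate(b) for b in subtask_batches]
print(global_aggregate(partials))  # → {'hot': 9, 'cold': 1}
```

With skewed data, the hot key's nine records collapse into two partial counts before reaching the global task, which is exactly how this technique relieves hotspot pressure.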

Elasticsearch Service Optimizations

Storage model: time‑based tiered merge reduces file fragmentation and improves query pruning, boosting search performance by ~40% and write throughput by 2×.

Cost reduction: hot‑warm‑cold data tiering moves older data to HDD and COS, cutting storage cost by up to 10×.

Memory optimization: off‑heap storage of FST structures and multi‑level caching lower heap usage and GC overhead.
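The hot‑warm‑cold tiering policy can be sketched as a routing function over index age. The 7‑day and 30‑day cutoffs below are illustrative assumptions, not figures from the article:

```python
from datetime import date, timedelta

# Hypothetical tiering policy mirroring the hot-warm-cold scheme:
# recent indices stay on SSD, older ones move to HDD, the oldest to
# object storage (COS). Cutoffs are illustrative.
def tier_for(index_date: date, today: date) -> str:
    age = (today - index_date).days
    if age <= 7:
        return "hot"    # SSD, frequently queried
    if age <= 30:
        return "warm"   # HDD, occasional queries
    return "cold"       # COS object storage, archival access

today = date(2024, 6, 30)
print(tier_for(today - timedelta(days=2), today))   # → hot
print(tier_for(today - timedelta(days=15), today))  # → warm
print(tier_for(today - timedelta(days=90), today))  # → cold
```

Since most monitoring queries hit recent data, keeping only the hot tier on SSD is what drives the storage‑cost reduction cited above.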

Conclusion

The article outlines the design, architecture, and implementation of a hundred‑billion‑scale real‑time monitoring system. It first defines monitoring requirements, derives an architecture, evaluates two technical stacks, and selects Oceanus + Elastic Stack. Subsequent sections detail performance and cost optimizations for both Oceanus and Tencent Cloud Elasticsearch Service, demonstrating that this combination can meet real‑time monitoring demands while significantly reducing user costs.

Tags: monitoring, real-time, architecture, big data, Flink, Elasticsearch, Oceanus
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
