
Scaling Real‑Time Monitoring for Billion‑Call Billing with Prometheus

Jiangsu Mobile’s IT operations team partnered with Newland to build a high‑availability, real‑time performance management platform on Prometheus. The platform delivers billion‑level call‑record monitoring, low‑latency queries, compressed storage, and forecasting, which substantially improves visibility into system health and operational efficiency.

Efficient Ops

Background

With the rapid growth of traffic services and the arrival of the 5G era, business support systems have expanded dramatically, and the volume of performance metric data has grown exponentially. Daily billing call‑record volume has reached the hundred‑billion level, and keeping system monitoring both real‑time and accurate has become a bottleneck for operations.

Guided by SRE principles, the Jiangsu Mobile IT operations team worked with Newland on three requirements: high‑concurrency writes, low‑latency queries, and lightweight storage. After researching time‑series databases, the teams built a performance management platform capable of real‑time, panoramic monitoring of hundred‑billion‑level call processing.

Time‑Series Database Selection

Popular time‑series databases such as Prometheus, Graphite, InfluxDB, and OpenTSDB were evaluated for their usage scope, strengths, and weaknesses.

The comparison showed Prometheus to be the best fit for the BOSS operations monitoring system. A single Prometheus instance can handle millions of samples per second and supports fast queries. Its compression stores a 16‑byte sample in 1.37 bytes on average, greatly reducing storage consumption, and real‑time queries keep disk I/O load below 1%.
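To make the storage figures concrete, the short calculation below works through the compression ratio and the resulting daily disk footprint. It is a back‑of‑the‑envelope sketch only: the ingest rate of 100,000 samples per second is an assumption borrowed from the throughput quoted in the summary, and the function names are illustrative.

```python
# Back-of-the-envelope storage arithmetic for the figures quoted above.
# Assumption: ingest rate of 100,000 samples/s (taken from the summary figure).

RAW_BYTES_PER_SAMPLE = 16           # 8-byte timestamp + 8-byte float value
COMPRESSED_BYTES_PER_SAMPLE = 1.37  # average size quoted in the article
SECONDS_PER_DAY = 86_400


def compression_ratio(raw=RAW_BYTES_PER_SAMPLE,
                      compressed=COMPRESSED_BYTES_PER_SAMPLE):
    """How many times smaller a compressed sample is than a raw one."""
    return raw / compressed


def daily_storage_bytes(samples_per_second, bytes_per_sample):
    """Bytes written per day at a constant ingest rate."""
    return samples_per_second * SECONDS_PER_DAY * bytes_per_sample


if __name__ == "__main__":
    rate = 100_000  # assumed samples per second
    raw_gb = daily_storage_bytes(rate, RAW_BYTES_PER_SAMPLE) / 1e9
    packed_gb = daily_storage_bytes(rate, COMPRESSED_BYTES_PER_SAMPLE) / 1e9
    print(f"compression ratio: {compression_ratio():.2f}x")
    print(f"raw: {raw_gb:.2f} GB/day, compressed: {packed_gb:.2f} GB/day")
```

Under these assumptions the ratio is a little under 12x, which takes a day of data from roughly 138 GB down to roughly 12 GB.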

Performance Management Platform Architecture

The solution places Prometheus at the core, collecting, cleaning, and storing all real‑time monitoring data related to applications, and visualizing overall system health, component performance, trend prediction, and intelligent analysis.

System architecture diagram:

Deploy a Prometheus cluster in each of the two data centers.

Collect system and application logs, as well as Java metrics, by having Prometheus pull them directly. Applications, business components, and services push their performance metrics to a Pushgateway, which Prometheus then pulls from.

Store recent short‑term data in Prometheus for high‑performance real‑time queries, while writing a copy to a remote historical time‑series store.

Visualization and real‑time alerting read from both Prometheus and the historical store through load balancing.
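The push path in the design above can be sketched as follows. This is a minimal illustration that only builds the request an application would send to a Pushgateway and does not send anything. The host name, job, instance, and metric name are hypothetical; the URL layout (`/metrics/job/<job>/instance/<instance>`) and the text exposition body follow the public Pushgateway conventions.

```python
import urllib.request
from urllib.parse import quote


def render_metric(name, value, labels=None, metric_type="gauge"):
    """Render one sample in the Prometheus text exposition format.

    Label values are not escaped here; this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return f"# TYPE {name} {metric_type}\n{name}{label_str} {value}\n"


def build_push_request(base_url, job, instance, body):
    """Build (but do not send) a PUT that replaces the job/instance group."""
    url = (f"{base_url}/metrics/job/{quote(job, safe='')}"
           f"/instance/{quote(instance, safe='')}")
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


if __name__ == "__main__":
    # Hypothetical metric and endpoint, for illustration only.
    body = render_metric("boss_cdr_backlog", 42, {"app": "rating"})
    req = build_push_request("http://pushgateway.example:9091",
                             "billing", "node-1", body)
    print(req.get_method(), req.full_url)
    print(body, end="")
    # urllib.request.urlopen(req) would perform the actual push.
```

Prometheus then scrapes the Pushgateway like any other target, so short‑lived or batch components do not need to expose their own endpoint.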

Adaptation Enhancements

Strengthening High Availability: Native Prometheus runs as a single node. We introduced service registration and lock‑based leader election so that one node becomes the primary executor; if it fails, a standby node acquires the lock and takes over.
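The failover behavior can be illustrated with a small in‑memory model of a lease‑style lock. This is a sketch under stated assumptions, not the platform's implementation: the `LeaseLock` class, the node names, and the 10‑second time‑to‑live are all hypothetical, and a real deployment would hold the lock in an external coordination service rather than in process memory.

```python
class LeaseLock:
    """A lock that expires unless its holder keeps renewing it."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, now):
        """Acquire or renew the lease; return True if node_id is now leader.

        The lease is granted when it is free, has expired, or is already
        held by the caller (a renewal).
        """
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.ttl
            return True
        return False


if __name__ == "__main__":
    lock = LeaseLock(ttl_seconds=10)
    print(lock.try_acquire("primary", now=0))   # primary becomes leader
    print(lock.try_acquire("standby", now=5))   # standby is refused
    # The primary stops renewing (simulated crash); the lease runs out.
    print(lock.try_acquire("standby", now=12))  # standby takes over
    print(lock.holder)
```

The time‑to‑live sets the trade‑off: a shorter lease means faster takeover after a failure, but also more renewal traffic and a higher chance of a spurious failover during a brief stall.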

Optimizing Data Storage: Short‑term data remains in Prometheus for real‑time alerts, while InfluxDB stores long‑term historical data, ensuring continuity and supporting downstream data mining.

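Custom Pushgateway Component: A Pushgateway keeps serving the last pushed values until they are replaced or deleted, so the same data can be ingested more than once. To avoid this, we added a cleanup step that runs after each Prometheus pull, guaranteeing that every metric is collected exactly once.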
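The idea behind the cleanup step can be shown with an in‑memory stand‑in for the gateway. This is a simplified model, not the team's modified component: `InMemoryGateway` and `scrape_and_cleanup` are hypothetical names, and against a real Pushgateway the delete would correspond to an HTTP `DELETE` on the group URL (`/metrics/job/<job>`).

```python
class InMemoryGateway:
    """Toy stand-in for a Pushgateway: remembers the last push per job."""

    def __init__(self):
        self._groups = {}

    def push(self, job, metrics):
        self._groups[job] = dict(metrics)

    def scrape(self):
        # Return a copy so callers cannot mutate the gateway's state.
        return {job: dict(m) for job, m in self._groups.items()}

    def delete(self, job):
        self._groups.pop(job, None)


def scrape_and_cleanup(gateway):
    """Pull everything once, then delete what was pulled.

    Caveat: a push that lands between scrape() and delete() would be
    lost in this simple version; a real implementation needs to guard
    against that race.
    """
    snapshot = gateway.scrape()
    for job in snapshot:
        gateway.delete(job)
    return snapshot


if __name__ == "__main__":
    gw = InMemoryGateway()
    gw.push("billing", {"cdr_processed": 1200})
    print(scrape_and_cleanup(gw))  # the pushed sample, collected once
    print(scrape_and_cleanup(gw))  # nothing left to re-ingest
```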

Extending Visualization: The default Grafana plugins lacked flexibility for multi‑dimensional metric correlation, so we developed a custom visualization tool that presents system, application, and business performance across multiple dimensions.

Timezone Adjustment: Prometheus originally displayed metric timestamps in GMT (UTC), which is 8 hours behind Beijing time. We modified the source to use the local system time, fixing the offset.
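The 8‑hour offset is easy to reproduce outside Prometheus. The snippet below is a minimal sketch of the conversion from an epoch timestamp to Beijing time (UTC+8); it illustrates the arithmetic only and is not the source patch described above. The helper name `to_beijing` is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Beijing time is a fixed UTC+8 offset with no daylight saving.
BEIJING = timezone(timedelta(hours=8), name="CST")


def to_beijing(epoch_seconds):
    """Convert a Unix timestamp (seconds) to an aware Beijing-time datetime."""
    utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc_time.astimezone(BEIJING)


if __name__ == "__main__":
    ts = 0  # the Unix epoch, midnight UTC
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
    print(to_beijing(ts).isoformat())
```

Keeping timestamps in UTC internally and converting only at display time, as this sketch does, avoids ambiguity when data from several sites is merged.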

Metric Collection Scope

Metrics are divided into performance metrics and business metrics, covering system health, application throughput, service call counts, and response times.

Real‑Time Dashboard

Aggregated metric data forms a unified health view of the BOSS system, displaying application performance, business volume, service call counts, and response times. Users can drill down to specific applications or processes, giving operations a “single‑pane” view that greatly improves monitoring efficiency and reduces the staff effort required.

Trend Prediction and Anomaly Detection

The massive time‑series data serves as a valuable asset for modeling and analysis. By applying algorithms for forecasting and anomaly detection, the platform supports several operational scenarios:

Performance Forecasting: By monitoring processing speed in real time and comparing it with historical data, the platform automatically computes the maximum throughput and predicts the time required to finish pending call records.
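A minimal sketch of this kind of estimate, assuming the forecast is simply the backlog divided by the average of recent processing rates. The article does not describe the actual model, so the function names and the sample figures are illustrative.

```python
def max_throughput(historical_rates):
    """Highest processing rate (records/s) observed in the history."""
    if not historical_rates:
        raise ValueError("historical_rates must not be empty")
    return max(historical_rates)


def estimate_completion_seconds(pending_records, recent_rates):
    """Estimate time to drain the backlog at the recent average rate."""
    if not recent_rates:
        raise ValueError("recent_rates must not be empty")
    average_rate = sum(recent_rates) / len(recent_rates)
    if average_rate <= 0:
        return float("inf")  # processing has stalled; no finish time
    return pending_records / average_rate


if __name__ == "__main__":
    recent = [4000, 5000, 6000]  # records per second, hypothetical
    eta = estimate_completion_seconds(1_000_000, recent)
    print(f"estimated time to finish: {eta:.0f} s")
    print(f"observed peak: {max_throughput(recent)} records/s")
```

Comparing the current average with the historical peak also shows how much headroom remains before the backlog starts to grow.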

Business Trend Forecasting: Statistical analyses (daily, weekly, monthly averages, weighted moving averages, etc.) on stored metrics predict future call‑processing trends and resource utilization, guiding capacity planning.
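Of the statistics listed, the weighted moving average is the least self‑explanatory, so here is a small sketch of it. The weights and sample values are made up for illustration; the article does not state which weights the platform uses.

```python
def weighted_moving_average(values, weights):
    """Forecast the next point as a weighted mean of the latest points.

    weights[-1] applies to the most recent value, so larger trailing
    weights make the forecast react faster to recent changes.
    """
    if not weights or sum(weights) == 0:
        raise ValueError("weights must be non-empty and not sum to zero")
    if len(values) < len(weights):
        raise ValueError("need at least as many values as weights")
    window = values[-len(weights):]
    weighted_sum = sum(v * w for v, w in zip(window, weights))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    daily_volume = [90, 95, 100, 110, 120]  # hypothetical daily totals
    forecast = weighted_moving_average(daily_volume, weights=[1, 2, 3])
    print(f"next-day forecast: {forecast:.1f}")
```

With equal weights this reduces to a plain moving average, which is the daily, weekly, or monthly average mentioned above, depending on the window length.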

Anomaly Detection: Techniques such as period‑over‑period, year‑over‑year, mean‑shift, standard‑deviation, local fluctuation, and periodic feature analysis promptly identify abnormal business behavior.
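Two of these checks are simple enough to sketch directly: a standard‑deviation rule and a period‑over‑period comparison. The threshold of three standard deviations and the sample data are assumptions chosen for illustration, not values taken from the platform.

```python
import statistics


def is_sigma_outlier(history, value, k=3.0):
    """Flag value if it lies more than k standard deviations from the mean."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return value != mean  # flat history: any deviation is unusual
    return abs(value - mean) > k * std


def period_over_period_change(current, previous):
    """Relative change against the previous period (e.g. -0.2 is a 20% drop)."""
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (current - previous) / previous


if __name__ == "__main__":
    calls_per_minute = [100, 102, 98, 101, 99]  # hypothetical history
    print(is_sigma_outlier(calls_per_minute, 110))  # far outside the band
    print(is_sigma_outlier(calls_per_minute, 103))  # within normal variation
    print(period_over_period_change(current=80, previous=100))
```

The same comparison function covers year‑over‑year checks when `previous` is taken from the same period one year earlier.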

Summary and Outlook

The platform currently handles up to 100k metrics per second, supporting real‑time monitoring of hundred‑billion‑level call processing. Analyzing this data enables precise capacity, performance, and fault prediction, so that operators can act before issues occur.

Having been successfully applied to the BOSS system, the platform will be further refined and gradually extended to other business support and pipeline domains.
