How Octopux Achieves 99.9% Bandwidth Monitoring Accuracy at Scale
Octopux is an open‑source bandwidth monitoring platform designed by Baishan Cloud to deliver 99.9% data integrity, cross‑operator and cross‑country coverage, minute‑level granularity, and horizontal scalability for tens of thousands of devices, addressing the limitations of traditional tools like Cacti.
Introduction
Bandwidth monitoring is essential for carrier settlement and network quality monitoring, and it is a must‑have system for any internet company with self‑built resources.
Baishan Cloud operates thousands of devices across dozens of countries and many carriers. In such a complex environment, ensuring precise bandwidth data collection, flexible integration for different scenarios, and scalability to tens of thousands of devices poses significant technical challenges.
In March, Baishan Cloud open‑sourced Octopux, its self‑developed bandwidth monitoring system, to share its solution with the community.
Conventional Systems Fall Short
Like many companies, we initially used Cacti, but as the network grew to about 800 devices, Cacti showed serious problems:
Insufficient poller concurrency
With 800 devices, the poller could only sustain a 5‑minute polling interval, while the business required 1‑minute granularity.
Frequent cross‑carrier monitoring failures
Even with the Cacti server hosted in a three‑carrier BGP data center, cross‑carrier data loss persisted.
Server I/O bottlenecks
Updating 8,000 RRD files every 5 minutes caused severe disk I/O issues.
Low data extraction efficiency
Bandwidth data stored in binary RRD files could only be extracted with rrdtool, making flexible data aggregation impossible.
Octopux Breaks Technical Bottlenecks
We abandoned further Cacti modifications and, after studying many open‑source projects, built a new bandwidth monitoring system.
The core design goals are:
99.9% data completeness
Perfect cross‑carrier and cross‑country monitoring
Horizontal scalability to support tens of thousands of servers
Second‑level granularity
Simple and efficient data query interface
Architecture diagram:
1. swcollector Data Collection Module
swcollector runs as a background process on every server, collecting data and sending it to the swtfr component of the data collection center.
In cross‑network or cross‑country environments, if swcollector cannot reach swtfr directly, it retries via multiple gateways to ensure delivery.
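The gateway fallback described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not Octopux's actual code: the function name `deliver`, the `send` callback, and the endpoint strings are all hypothetical, and the source does not say how many attempts are made per route.

```python
from typing import Callable, Iterable, Optional

# Hypothetical transport: takes (endpoint, payload), raises OSError on failure.
Sender = Callable[[str, bytes], None]


def deliver(payload: bytes, direct: str, gateways: Iterable[str],
            send: Sender, attempts_per_route: int = 2) -> Optional[str]:
    """Try the direct swtfr endpoint first, then each gateway in order.

    Returns the endpoint that accepted the payload, or None if every
    route failed (the caller may then buffer the payload and retry later).
    """
    for endpoint in [direct, *gateways]:
        for _ in range(attempts_per_route):
            try:
                send(endpoint, payload)
                return endpoint
            except OSError:
                # Retry the same route; once its attempts are used up,
                # the outer loop moves on to the next route.
                continue
    return None
```

Passing the transport in as a callback keeps the routing logic separate from the wire protocol, so the same loop works whether the payload travels over plain TCP, RPC, or HTTP.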
2. Data Writing and Querying
Three sets of swtfr + influxdb + flow‑api components are deployed globally. Each swtfr replicates incoming data to three InfluxDB instances.
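One way to picture the three‑way replication is a fan‑out write that records which replicas accepted a batch. This is a sketch only: `replicate`, the `write` callback, and the replica names are hypothetical, and the source does not describe how swtfr handles a replica that is temporarily down.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Hypothetical writer: takes (replica, points), raises OSError on failure.
Writer = Callable[[str, List[dict]], None]


def replicate(points: List[dict], replicas: Sequence[str],
              write: Writer) -> Dict[str, bool]:
    """Write the same batch to every replica in parallel.

    Returns a map of replica -> True/False so the caller can decide
    whether to queue a retry for the replicas that missed the batch.
    """
    def attempt(replica: str) -> bool:
        try:
            write(replica, points)
            return True
        except OSError:
            return False

    with ThreadPoolExecutor(max_workers=max(1, len(replicas))) as pool:
        outcomes = list(pool.map(attempt, replicas))
    return dict(zip(replicas, outcomes))
```

Because each replica is written independently, one slow or unreachable instance does not block the other two, and a reader can still find the point on any replica that accepted it.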
flow‑api handles queries and aggregation: it splits a query into minimal‑granularity events, queries the three InfluxDB nodes in parallel, then aggregates the results.
flow‑api supports common aggregations such as max, min, average, and group‑by.
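The query path can be sketched as follows, assuming the "minimal‑granularity events" are fixed time slots (one per minute here) and that a slot missing on one replica can be filled from another. The names `merged_series`, `aggregate`, and the `query` callback are hypothetical and do not reflect the real flow‑api interface.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, Optional, Sequence

Series = Dict[int, float]                  # slot start (epoch seconds) -> value
Query = Callable[[str, int, int], Series]  # hypothetical: (replica, start, end)


def merged_series(replicas: Sequence[str], start: int, end: int,
                  query: Query, step: int = 60) -> Dict[int, Optional[float]]:
    """Query all replicas in parallel and merge the answers slot by slot."""
    with ThreadPoolExecutor(max_workers=max(1, len(replicas))) as pool:
        answers = list(pool.map(lambda r: query(r, start, end), replicas))
    merged: Dict[int, Optional[float]] = {}
    for ts in range(start, end, step):
        # Take the first replica that has this slot; None marks a real gap.
        merged[ts] = next((a[ts] for a in answers if ts in a), None)
    return merged


def aggregate(series: Dict[int, Optional[float]], how: str) -> float:
    """Apply max, min, or average over the slots that have data."""
    values = [v for v in series.values() if v is not None]
    return {"max": max, "min": min, "average": mean}[how](values)
```

Merging per slot rather than per replica means a point is lost only if all replicas missed it, which is one plausible way redundancy translates into higher data completeness.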
3. Monitoring Efficiency
The system can monitor up to 150,000 data points per minute (about 2,500 per second), with over 90% of data written within 3 seconds and query latency under 3 seconds, which meets business requirements.
When the monitoring scale grows, horizontal expansion of the swtfr + influxdb + flow‑api stack further boosts performance; InfluxDB itself also scales horizontally for larger storage and higher read/write throughput.
4. Service Capabilities
Bandwidth monitoring is self‑discovering. It supports per‑NIC inbound/outbound monitoring, server internal/external bandwidth separation, and per‑port inbound/outbound monitoring.
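Per‑NIC and per‑port bandwidth is commonly derived from cumulative byte counters sampled at a fixed interval. The sketch below shows that calculation under the assumption that the collector reads such counters; the source does not say how swcollector obtains its readings, and `rate_bps` is a hypothetical helper.

```python
def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             counter_bits: int = 64) -> float:
    """Convert two cumulative octet-counter samples into bits per second.

    The modulo handles a single counter wrap between samples. It cannot
    tell a wrap from a counter reset (for example after a device reboot),
    so a real collector would also sanity-check the result against the
    link speed.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s
```

For example, 750,000 octets transferred in 60 seconds gives `rate_bps(1_000, 751_000, 60) == 100000.0`, that is, 100 kbit/s. The same function covers inbound and outbound traffic by feeding it the respective counter.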
Switch bandwidth data is collected by multiple swcollector instances, aggregated by flow‑api, and output with 1‑minute granularity for higher precision.
Integration with the CMDB enables automatic hierarchical display, with data merged by topology or usage or compared against billing figures, and queries return within seconds.
Example data visualization:
Giving Back to the Community Through Open Source
Bandwidth monitoring looks simple, but it becomes complex at large scale. We have open‑sourced Octopux as a reference and a source of ideas for others.
octopux‑swcollector https://github.com/baishancloud/octopux-swcollector
octopux‑swtfr https://github.com/baishancloud/octopux-swtfr
octopus‑gateway https://github.com/baishancloud/octopus-gateway
Thanks to Xiaomi’s open‑source open‑falcon project for foundational components.
Postscript
As the InfluxDB ecosystem matures (it is lightweight and dependency‑free, with rich visualization components and built‑in aggregation), we are exploring a third‑generation monitoring system to achieve better aggregation analysis and more complex alerting.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.