How ByteDance Built a Scalable Compute Monitoring System for Million‑Node Data Centers
This article details ByteDance's end‑to‑end design and engineering of a low‑overhead, continuous profiling system that collects, processes, and analyzes resource usage across a million‑scale data center, enabling precise performance optimization and significant CPU cost savings.
Background
With ByteDance's rapid business growth, the scale of data‑center servers has expanded quickly to meet increasing compute demand. When the scale reaches a certain size, balancing machine cost, efficiency, and resources becomes critical, prompting targeted performance optimization to reduce compute costs.
Since 2019, the STE team has built a compute monitoring system to establish a comprehensive resource‑usage analysis framework from a data‑center‑wide perspective, supporting hardware‑software co‑design and optimization. After four years of iteration, the system now has a mature product form and a stable user base, providing strong support for performance analysis across business lines.
This article shares the overall design and engineering practice of ByteDance's compute monitoring solution, helping readers understand scientific, accurate performance analysis and profiling of data centers for rapid compute optimization.
Overall Design
The design focuses on four key points:
Data‑center perspective: Unlike single‑machine tools, the system provides a holistic view across thousands of machines, avoiding the "trees but no forest" problem of per‑host analysis.
Profiling emphasis: Continuous profiling of workloads and resource usage, combined with visualization and analysis, to deepen domain knowledge.
Comprehensive analysis service: Standardized and customized performance analysis that incorporates business dimensions such as micro‑services, clusters, and application scenarios, with proactive push of insights.
Continuous operation: Support for scheduled, continuous profiling that generates ongoing analysis reports and long‑term issue tracking.
Data Collection Phase
Key requirements for the agent:
Lightweight and full‑machine coverage, preferably aggregating in the kernel before exporting to user space.
Support for continuous, scheduled collection with dispersed timing to avoid peaks.
Collect multi‑dimensional data (per‑machine, PSM service, rack, language, libraries, CPU micro‑architecture, etc.) for versatile analysis.
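The "dispersed timing" requirement above is commonly met with deterministic per-host jitter; here is a minimal sketch of one such scheme (the function name and hashing approach are assumptions, not ByteDance's actual agent logic):

```python
import hashlib

def collection_offset(host_id: str, interval_s: int = 3600) -> int:
    """Map a host ID to a stable offset in [0, interval_s).

    Hashing the host ID spreads scheduled collections evenly across
    the fleet, so a million agents never fire at the same instant.
    (Sketch only; the real agent's scheduling scheme is not public.)
    """
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % interval_s
```

Because the offset is derived from the host ID rather than a random seed, it stays stable across agent restarts, keeping each host's collection cadence predictable.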
Data Processing Phase
Implement rich operator functions on the processing side to offload work from collectors.
Handle massive data volumes with high concurrency, supporting millions of machines.
Data Analysis Phase
The analysis aims to answer:
What workloads run in the data center, how to characterize them, and whether they fully utilize compute resources?
What workload characteristics can guide targeted code and operations optimization?
Are there common patterns that enable company‑wide system optimizations?
Engineering Practice
Data Collection: Lightweight Instrumentation
To support million‑scale data centers, the team designed a low‑overhead kernel timer that fires every 10 seconds (configurable, i.e., at roughly 0.1 Hz). On each tick, the kernel module records the PID, TID, instruction pointer (EIP), and CPU ID into a 64‑byte buffer and exports it to user space via /sys/kernel/debug. This approach incurs far less overhead than Linux's perf tool.
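A user-space consumer of such a buffer might decode it as below; note the exact field layout of ByteDance's 64-byte record is not public, so this fixed-width, little-endian layout is purely an assumption:

```python
import struct

# Hypothetical layout of one 64-byte sample record exported via debugfs:
# 32-bit PID, 32-bit TID, 64-bit instruction pointer, 32-bit CPU ID,
# padded with 44 bytes to the stated 64-byte record size.
SAMPLE_FMT = "<IIQI44x"
SAMPLE_SIZE = struct.calcsize(SAMPLE_FMT)  # 64 bytes

def parse_samples(buf: bytes):
    """Yield (pid, tid, ip, cpu) tuples from a raw debugfs read."""
    for off in range(0, len(buf) - SAMPLE_SIZE + 1, SAMPLE_SIZE):
        yield struct.unpack_from(SAMPLE_FMT, buf, off)
```

A fixed-size binary record like this keeps the kernel-side write path to a single memcpy, which is part of why the overhead stays below that of full perf sampling.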
User‑space agents periodically read the kernel data, map PIDs to workload binaries, and filter for hot processes (identified via top) to limit upload bandwidth. A blacklist mechanism excludes sensitive workloads from collection.
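The hot-process filter with a blacklist could look like this sketch (function name, thresholds, and parameters are illustrative assumptions):

```python
def select_hot_processes(cpu_pct_by_pid, blacklist, top_n=20, min_pct=1.0):
    """Pick the hottest PIDs, as a `top`-style reading would rank them.

    Drops PIDs below a minimum CPU share and any PID on the sensitive-
    workload blacklist, then keeps only the top_n busiest processes so
    the agent uploads a bounded amount of data per interval.
    """
    candidates = [
        (pid, pct) for pid, pct in cpu_pct_by_pid.items()
        if pct >= min_pct and pid not in blacklist
    ]
    candidates.sort(key=lambda kv: kv[1], reverse=True)
    return [pid for pid, _ in candidates[:top_n]]
```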
Data Processing & Real‑Time Pipeline
Collected data is streamed into Kafka, then processed in real time with Flink, performing high‑throughput cleaning and correlation. Intermediate results are cached in Redis for further analysis.
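The cleaning-and-correlation step in the Flink stage amounts to a keyed aggregation over samples. A minimal pure-Python sketch of that logic (the function name and the `symbolize` callback are assumptions, not ByteDance's operator API):

```python
from collections import Counter

def aggregate_hot_functions(samples, symbolize):
    """Flink-style keyed aggregation, emulated as a pure function.

    Maps each (pid, tid, ip, cpu) sample to a function symbol via the
    caller-supplied `symbolize(pid, ip)` resolver, then counts hits per
    symbol -- the core of a data-center-wide hot-function statistic.
    """
    counts = Counter()
    for pid, _tid, ip, _cpu in samples:
        counts[symbolize(pid, ip)] += 1
    return counts
```

In the real pipeline this logic would run per key inside Flink with Redis holding intermediate state; expressing it as a pure function keeps the operator easy to test and to scale out.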
Both stream and batch analyses extract high‑value intermediate datasets for downstream insight generation.
Data Analysis: Multi‑Dimensional Profiling
Profiling operates at two levels:
Hot function statistics: Lightweight collection provides a global view of CPU‑intensive functions across the data center.
Process and stack information: Collected via top, ps, and an enhanced Linux perf tool (using CPU‑cycle PMU events, or timer interrupts for VMs). Sampling runs hourly for 5 seconds at 10 Hz per CPU, yielding 2,000–4,000 samples on a 100‑vCPU server.
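The sample-count figure follows from simple arithmetic; a hedged sketch (assuming the observed 2,000–4,000 range reflects roughly 40–80% CPU utilization, since idle CPUs contribute no on-CPU samples, which the article does not state explicitly):

```python
def expected_samples(duration_s=5, rate_hz=10, vcpus=100, utilization=1.0):
    """Upper bound on on-CPU samples for one profiling window.

    At 10 Hz x 5 s x 100 vCPUs the ceiling is 5,000 samples; the
    reported 2,000-4,000 is consistent with 40-80% busy CPUs
    (the utilization interpretation is an assumption).
    """
    return int(duration_s * rate_hz * vcpus * utilization)
```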
Project Benefits
The monitoring system delivers up‑to‑date metrics for targeted optimization. In one case, optimizing the zlib library in a specific region saved at least 50k physical CPU cores (equivalent to 100k vCPU hyper‑threads). Ongoing work continues to reduce compute consumption across services.
The system also uncovers data‑center usage patterns, enabling “data‑center tax” analysis and guiding hardware‑software co‑design.
Conclusion
The compute monitoring system implements a lightweight instrumentation and real‑time data pipeline, producing global hot‑function statistics and detailed analyses that drive continuous performance improvements for both applications and the underlying data‑center infrastructure.
By pinpointing bottlenecks at the business level and optimizing libraries, compilers, and micro‑architectural features, the system has saved nearly a million vCPU cores and provides actionable guidance for hardware selection and system design.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.