Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance
This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.
1. Overview
With the rapid growth of micro‑service architectures, distributed tracing has become a critical component of observability. After years of development, ByteDance’s tracing system now covers most online services, handling tens of thousands of micro‑services and millions of instances. The next challenge is extracting higher‑level insights from massive trace data to support architecture optimization, service governance, and cost reduction.
2. Observability and Tracing
2.1 Basic Concepts
Observability tools collect data such as traces, logs, metrics, profiling, events, and CMDB metadata, enabling operators to diagnose issues quickly by correlating alerts with trace details.
2.2 ByteDance Tracing System
The system evolved from Trace 1.0 (2019) to the unified observability platform Argos (2020), and now covers more than 50,000 micro‑services, with roughly 3 PB of trace storage and an ingestion rate of 20 million spans per second.
3. Trace‑Analysis Technical Practice
3.1 Scenarios
Beyond single‑trace debugging, higher‑level questions include stability (which services can be degraded), capacity planning (which services need scaling), and cost‑performance (identifying inefficiencies). These require automated aggregation of massive trace datasets.
3.2 Core Principle
Trace analysis follows a MapReduce‑style aggregation, optionally combined with subscription rules, to produce results for downstream applications.
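As a minimal sketch of this MapReduce‑style aggregation: map each span to a (caller, callee) edge, then reduce into per‑edge call counts and latency totals. The span schema and field names below are illustrative assumptions, not ByteDance's actual data model.

```python
from collections import defaultdict

# Hypothetical simplified span records (field names are assumptions).
spans = [
    {"trace_id": "t1", "service": "gateway", "parent_service": None, "latency_ms": 30},
    {"trace_id": "t1", "service": "user", "parent_service": "gateway", "latency_ms": 12},
    {"trace_id": "t2", "service": "user", "parent_service": "gateway", "latency_ms": 18},
]

def map_span(span):
    """Map phase: emit an (edge, latency) pair for each span with a known caller."""
    if span["parent_service"] is not None:
        yield (span["parent_service"], span["service"]), span["latency_ms"]

def reduce_edges(pairs):
    """Reduce phase: aggregate call count and total latency per call edge."""
    agg = defaultdict(lambda: {"calls": 0, "latency_ms": 0})
    for edge, latency in pairs:
        agg[edge]["calls"] += 1
        agg[edge]["latency_ms"] += latency
    return dict(agg)

pairs = [p for s in spans for p in map_span(s)]
edge_stats = reduce_edges(pairs)
# edge_stats[("gateway", "user")] -> {"calls": 2, "latency_ms": 30}
```

Subscription rules would slot in as a filter before the map phase, restricting which traces feed a given downstream application.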
3.3 Architecture Options
Three computation modes are evaluated:
Streaming computation – near‑real‑time results, high data completeness, but limited to predefined time windows.
Ad‑hoc (sampling) computation – flexible queries with low extra cost, but reduced completeness.
Offline batch computation – high completeness and low operational cost, but with hour‑ or day‑level latency.
Based on requirements such as real‑time needs, data completeness, and ad‑hoc flexibility, ByteDance adopted an integrated solution that supports all three modes using a unified data model and logical operators.
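One way such a unified design can work is to define each logical aggregation operator once, with an incremental `add` for streaming and a `merge` of partial results for batch, so all modes produce identical answers. The operator below is a sketch under that assumption, not ByteDance's actual interface.

```python
class EdgeCountOperator:
    """A logical operator usable in both streaming (incremental add) and
    batch (merge of per-partition partials) modes. Names are illustrative."""

    def __init__(self):
        self.counts = {}

    def add(self, edge):
        self.counts[edge] = self.counts.get(edge, 0) + 1

    def merge(self, other):
        for edge, n in other.counts.items():
            self.counts[edge] = self.counts.get(edge, 0) + n
        return self

edges = [("a", "b"), ("a", "b"), ("b", "c")]

# Streaming mode: process one record at a time.
stream_op = EdgeCountOperator()
for e in edges:
    stream_op.add(e)

# Batch mode: aggregate each partition, then merge the partials.
partials = []
for part in (edges[:1], edges[1:]):
    op = EdgeCountOperator()
    for e in part:
        op.add(e)
    partials.append(op)
batch_op = EdgeCountOperator()
for p in partials:
    batch_op.merge(p)

# Same operator, same result, regardless of execution mode.
```

Ad‑hoc sampling mode would reuse the same operator over a sampled subset, trading completeness for query flexibility as described above.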
4. Real‑World Applications
4.1 Precise Topology Calculation
By storing per‑node topology graphs in a graph database, ByteDance can retrieve exact upstream/downstream dependencies for any service, with flexible depth and granularity.
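Flexible-depth dependency retrieval amounts to a bounded graph traversal. The sketch below uses a plain adjacency map and breadth-first search; in production the graph would live in a graph database, and the service names here are made up.

```python
from collections import deque

# Illustrative service-call graph (a graph DB would hold this in practice).
calls = {
    "gateway": ["user", "feed"],
    "feed": ["rank", "user"],
    "rank": ["model"],
}

def downstream(service, depth):
    """Breadth-first walk of downstream dependencies up to the given depth."""
    seen, frontier = set(), deque([(service, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand past the requested depth
        for nxt in calls.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

# downstream("gateway", 1) -> {"user", "feed"}
# downstream("gateway", 2) -> {"user", "feed", "rank"}
```

Upstream queries are the same traversal over the reversed edge set.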
4.2 Full‑Link Traffic Estimation
Using streaming aggregation of trace counts and sampling rates, the system estimates traffic flow and proportion across the entire call graph, supporting capacity planning and cost governance.
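The core arithmetic is inverse-probability scaling: a sampled edge count divided by its sampling rate estimates true traffic. The sketch below assumes head-based sampling decided at the entry service; the numbers and field names are illustrative.

```python
# Sampled edge counts and the entry service's sampling rate (illustrative
# values; assumes head-based sampling at the trace root).
sampled_counts = {("gateway", "user"): 120, ("gateway", "feed"): 30}
sample_rate = {"gateway": 0.01}

def estimate_traffic(edge):
    """Scale a sampled edge count by the inverse of the root sampling rate."""
    caller, _ = edge
    return sampled_counts[edge] / sample_rate[caller]

total = sum(estimate_traffic(e) for e in sampled_counts)
shares = {e: estimate_traffic(e) / total for e in sampled_counts}
# estimated ("gateway", "user") traffic: 12,000 calls, an 80% share
```

With per-service or dynamic sampling rates, each edge would carry its own rate, but the scaling principle is the same.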
4.3 Strong/Weak Dependency Analysis
Streaming computation identifies whether a downstream service is a strong or weak dependency based on error propagation, aiding downgrade plans, timeout configuration, and automated root‑cause analysis.
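A simple way to operationalize this classification: over many traces, measure how often a downstream failure co-occurs with a failure of the overall request. The records, service names, and 90% threshold below are all illustrative assumptions.

```python
# Trace-derived observations: did the downstream call fail, and did the
# calling request as a whole fail? (Illustrative samples.)
observations = [
    {"downstream": "payment", "downstream_err": True, "request_err": True},
    {"downstream": "payment", "downstream_err": True, "request_err": True},
    {"downstream": "recommend", "downstream_err": True, "request_err": False},
]

def classify(dep, obs, threshold=0.9):
    """Strong dependency if downstream failures usually fail the whole request."""
    fails = [o for o in obs if o["downstream"] == dep and o["downstream_err"]]
    if not fails:
        return "unknown"
    propagated = sum(o["request_err"] for o in fails) / len(fails)
    return "strong" if propagated >= threshold else "weak"

# classify("payment", observations)   -> "strong"
# classify("recommend", observations) -> "weak"
```

A weak verdict suggests the dependency is a safe downgrade candidate; a strong verdict argues for tighter timeout budgets and fallback plans.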
4.4 Performance Anti‑Pattern Detection
The platform automatically discovers patterns such as call amplification, duplicate calls, read‑write amplification, and serial loops, providing worst‑case samples and traffic context for remediation.
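Taking duplicate calls as one concrete anti-pattern: within a single trace, identical requests to the same callee and method are flagged as candidates for caching or batching. The call-tuple schema below is an assumption for illustration.

```python
from collections import Counter

# Calls within one trace as (callee, method, request_key) tuples
# (an illustrative schema, not the platform's real one).
trace_calls = [
    ("user", "GetUser", "uid=42"),
    ("user", "GetUser", "uid=42"),   # duplicate: same request in the same trace
    ("feed", "ListFeed", "uid=42"),
]

def find_duplicate_calls(calls):
    """Flag identical call tuples that repeat within a single trace."""
    counts = Counter(calls)
    return {call: n for call, n in counts.items() if n > 1}

# -> {("user", "GetUser", "uid=42"): 2}
```

Call amplification is the same counting idea keyed only by callee, with a threshold on fan-out per trace rather than exact duplication.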
4.5 Full‑Link Performance Bottleneck Analysis
Aggregated trace data reveals systemic latency patterns and worst‑case paths, supporting both ad‑hoc and offline analysis modes.
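One common building block for bottleneck analysis is self-time: a span's duration minus the time spent in its children, which attributes latency to the service that actually consumed it. The sketch assumes serial child calls and uses made-up spans.

```python
# Spans of one trace with parent links; durations in ms. Assumes child
# calls are serial, so self-time = duration - sum(child durations).
spans = {
    "s1": {"service": "gateway", "parent": None, "ms": 100},
    "s2": {"service": "feed", "parent": "s1", "ms": 70},
    "s3": {"service": "rank", "parent": "s2", "ms": 50},
}

def self_time(spans):
    """Attribute latency to the service that actually spent it."""
    child_ms = {}
    for s in spans.values():
        if s["parent"]:
            child_ms[s["parent"]] = child_ms.get(s["parent"], 0) + s["ms"]
    out = {}
    for sid, s in spans.items():
        svc = s["service"]
        out[svc] = out.get(svc, 0) + s["ms"] - child_ms.get(sid, 0)
    return out

# -> {"gateway": 30, "feed": 20, "rank": 50}: rank dominates this trace
```

Aggregating self-time across many traces, then sorting, surfaces systemic bottlenecks rather than one-off slow requests; with parallel child calls the subtraction would use wall-clock overlap instead.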
4.6 Error Propagation Chain Analysis
By aggregating error traces, the system uncovers common error sources, propagation paths, and impact scopes, useful for long‑term stability improvements.
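A minimal sketch of this aggregation, assuming each failed trace has been reduced to the call path from the entry service down to the span where the error originated (the paths below are invented): counting identical paths surfaces the most common propagation chains.

```python
from collections import Counter

# Per-failed-trace error paths, entry service first, error origin last
# (illustrative trace-derived records).
failed_traces = [
    ["gateway", "feed", "rank"],   # error originated in rank, propagated up
    ["gateway", "feed", "rank"],
    ["gateway", "user"],
]

def top_error_chains(traces):
    """Count identical propagation paths to rank common error sources."""
    return Counter(tuple(p) for p in traces).most_common()

# -> [(("gateway", "feed", "rank"), 2), (("gateway", "user"), 1)]
```

The last element of each frequent chain points at a recurring error source; the chain length and breadth indicate its blast radius.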
5. Summary and Outlook
The article outlines how ByteDance moved from building basic tracing capabilities to a comprehensive trace‑analysis platform that supports real‑time, ad‑hoc, and offline scenarios, delivering actionable insights for architecture governance, capacity planning, fault isolation, and performance optimization.
Future work includes continuous data‑quality improvement, expanding scenario‑specific APIs, increasing automation through AI‑driven analysis, and deeper integration with cloud‑native observability standards such as OpenTelemetry.
ByteDance Terminal Technology
Official account of ByteDance Terminal Technology, sharing technical insights and team updates.