Design and Implementation of a Distributed Tracing System at Qunar: Architecture, Technical Selection, and Performance Optimizations
This article describes the background, technology selection, architecture design, data flow, and trace collection mechanisms of Qunar's self‑built distributed tracing system. It analyzes the major performance problems encountered, including Flume interruptions, Kafka throughput bottlenecks, and Flink back‑pressure, and presents concrete solutions such as sliding‑window throttling, CGroup resource limits, and JavaAgent instrumentation, which together improved trace connectivity and overall system observability.
Background – As distributed systems grow in scale, Qunar needed a unified observability solution covering monitoring, logging, and tracing. Existing Watcher, Radar, and ELK components lacked a comprehensive distributed tracing capability, prompting the development of a custom APM system based on JavaAgent.
Technical Selection – The observability stack follows the three pillars of cloud‑native monitoring: Prometheus + Grafana for metrics, ELK/Loki for logs, and SkyWalking/Jaeger for tracing. Data ingestion uses Apache Flume and Kafka, processing with Flink, and storage in HBase (with auxiliary MySQL). The UI is built with React.
Architecture Design – Trace collection is achieved via custom middleware instrumentation for critical services and JavaAgent‑based automatic instrumentation for open‑source components. The data pipeline flows from agents → Flume → Kafka → Flink → HBase/MySQL, where aggregated results feed the web UI.
Data Flow Diagram – Shows the end‑to‑end path of trace logs, metrics, and log events through the aforementioned components.
Trace Logging and Reporting – Agents handle trace log generation and upload; Flume is customized to avoid log loss and support per‑line collection. Kafka transports logs to Flink, which aggregates failures, timeouts, and topology information. Metrics are sampled by Watcher agents and linked to trace IDs for joint queries.
UI Presentation – The web UI visualizes call topologies, error rates, slow spans, and related logs by querying HBase and MySQL.
Issues and Solutions
Trace interruption caused by Flume performance limits – solved by expanding memory buffers, converting sinks to asynchronous mode, and applying sliding‑window rate limiting.
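The sliding‑window throttling idea can be sketched as follows. This is an illustrative stand‑in, not Qunar's actual Flume customization: events are admitted only while the count inside the most recent window stays under a limit, and excess log events are dropped or deferred so the sink is never overwhelmed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if the event may pass, false if it should be throttled.
    public synchronized boolean tryAcquire(long nowMillis) {
        // Evict timestamps that have slid out of the window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(nowMillis);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Allow at most 2 events per 1000 ms window.
        SlidingWindowLimiter limiter = new SlidingWindowLimiter(2, 1000);
        System.out.println(limiter.tryAcquire(0));    // admitted
        System.out.println(limiter.tryAcquire(10));   // admitted
        System.out.println(limiter.tryAcquire(20));   // throttled: window full
        System.out.println(limiter.tryAcquire(1500)); // admitted: window slid
    }
}
```

A production variant would typically count bytes rather than events and expose the drop count as a metric, so that throttling itself is observable.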
Kafka throughput bottleneck – mitigated by increasing partition count, upgrading disks to SSD, and tuning producer/consumer settings.
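The producer‑side tuning can be illustrated with a configuration sketch. The values below are hypothetical examples of the kind of settings involved, not Qunar's actual numbers; the keys are standard Kafka producer configuration properties.

```java
import java.util.Properties;

public class KafkaProducerTuning {
    // Illustrative throughput-oriented producer settings (values are examples).
    public static Properties tunedProducerConfig() {
        Properties p = new Properties();
        p.setProperty("batch.size", "131072");       // 128 KB batches: fewer, larger requests
        p.setProperty("linger.ms", "20");            // wait briefly so batches fill up
        p.setProperty("compression.type", "lz4");    // cheap CPU, large network savings
        p.setProperty("acks", "1");                  // leader-only acks for lower latency
        p.setProperty("buffer.memory", "67108864");  // 64 MB send buffer absorbs bursts
        return p;
    }

    public static void main(String[] args) {
        System.out.println(tunedProducerConfig().getProperty("compression.type"));
    }
}
```

Partition count sets the upper bound on consumer parallelism, so increasing it is usually the first lever; batching and compression then raise the per‑partition throughput.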
Flink back‑pressure due to high QPS (≈3 M) – addressed by balancing sub‑tasks, enlarging JVM heap, using in‑memory maps instead of window aggregations, and sharing JVMs across operators.
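The "in‑memory map instead of window aggregation" technique can be sketched in plain Java. This is not Qunar's actual Flink job; it only illustrates the idea: counts per key accumulate in a concurrent map and are emitted and reset on a periodic flush, avoiding per‑window keyed state that contributes to back‑pressure at high QPS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MapAggregator {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Hot path: one cheap increment per event, no window state.
    public void record(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Called periodically (e.g. from a processing-time timer) to emit and reset.
    public Map<String, Long> flush() {
        Map<String, Long> snapshot = new HashMap<>();
        counts.forEach((k, v) -> snapshot.put(k, v.sumThenReset()));
        return snapshot;
    }

    public static void main(String[] args) {
        MapAggregator agg = new MapAggregator();
        agg.record("order-service:TIMEOUT");
        agg.record("order-service:TIMEOUT");
        agg.record("pay-service:FAIL");
        System.out.println(agg.flush().get("order-service:TIMEOUT"));
    }
}
```

The trade‑off is weaker time semantics than event‑time windows: results are bucketed by flush interval, which is acceptable for failure and timeout counters but not for exact windowed analytics.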
Trace connectivity loss across threads/processes – resolved with JavaAgent automatic instrumentation (QTracer.wrap) that propagates context through Runnable, Callable, ExecutorService, RxJava, Reactor, etc.
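The essence of a wrap‑style helper can be sketched with a thread‑local trace ID. TraceContext handling below is an illustrative stand‑in for QTracer.wrap, not its real implementation; the actual JavaAgent injects equivalent capture‑and‑restore logic via bytecode instrumentation so application code need not call wrap explicitly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TracePropagation {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Capture the caller's trace ID now; restore it on the worker thread.
    static Runnable wrap(Runnable task) {
        String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore whatever the worker thread had, avoiding context leaks
                // when pool threads are reused.
                TRACE_ID.set(previous);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        TRACE_ID.set("trace-123");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Without wrap, the pool thread would see a null trace ID.
        pool.submit(wrap(() -> System.out.println(TRACE_ID.get()))).get();
        pool.shutdown();
    }
}
```

The same capture‑and‑restore pattern generalizes to Callable, ExecutorService decorators, and the scheduler hooks exposed by RxJava and Reactor.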
JavaAgent Performance – Benchmarks show that for HTTP requests longer than 50 ms, the agent adds at most 4 % latency and costs at most 4 % throughput; for cross‑thread scenarios the overhead is similar (≈3 %).
Conclusion – The self‑built APM system, after iterative optimization, raised trace connectivity from ~20 % to >80 %, providing a solid foundation for full‑stack observability, chaos engineering, and performance testing.
Code Example
CompletableFuture<Integer> future = CompletableFuture.supplyAsync(new QTraceSupplier<>(() -> {
    // QTraceSupplier carries the caller's context: the trace ID matches the parent's.
    LOG.info("supplyAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
    return 1;
}));
Integer i = future.get();
LOG.info(String.valueOf(i));

CompletableFuture<Void> future1 = CompletableFuture.runAsync(QTracer.wrap(() -> {
    LOG.info("runAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));
future1.get();

executor.submit(QTracer.wrap(() -> {
    LOG.info("in lambda------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));

executor.submit(new Runnable() {
    @Override
    public void run() {
        // No explicit wrapping: the JavaAgent instruments Runnable automatically.
        LOG.info("in runnable------" + QTraceClientGetter.getClient().getCurrentTraceId());
    }
});

Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.