Design and Implementation of a Low‑Impact Distributed Tracing System for Service Calls
This article describes the background, design goals, architecture, implementation details, and lessons learned from building a low‑overhead, low‑intrusion distributed tracing system using Kafka, Elasticsearch, and OpenTracing to monitor microservice interactions and support performance analysis and DevOps decision‑making.
Background
As the company's services grew rapidly, the call relationships between services became increasingly complex, making it critical to trace and monitor request flows across multiple microservices, databases, and caches for troubleshooting and process optimization.
Design Goals
Low overhead: tracing should have minimal impact on highly optimized services.
Low intrusion: the tracing component should be transparent and require little developer effort.
Timeliness: data collection, processing, and visualization must be fast.
Decision support: provide useful metrics for DevOps decisions.
Data visualization: enable visual filtering without reading raw logs.
Implemented Functions
Fault location: the full trace of a request is displayed, so the failing call can be pinpointed.
Performance analysis: per‑segment latency exposes bottlenecks.
Data analysis: complete business logs support aggregation of user behavior paths.
Design Approach
The solution follows the distributed tracing model popularized by Google Dapper and implemented in open‑source projects such as Twitter Zipkin and Alibaba EagleEye. By linking all spans of a request, the system provides end‑to‑end visibility.
Typical Distributed Call Process
A request originates from a client, passes through a front‑end service (A), then intermediate services (B, C), and finally reaches back‑ends (D, E). Each RPC is instrumented to emit four trace annotations:
cs - CLIENT_SEND, the client issues the request
sr - SERVER_RECEIVE, the server receives the request
ss - SERVER_SEND, the server finishes processing and returns the result to the client
cr - CLIENT_RECEIVE, the client receives the response
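These four annotations are enough to decompose a call's latency. The sketch below (in Java, with illustrative names rather than any real SDK API) shows the usual derivations: server processing time, total round trip, and network overhead.

```java
// Sketch: deriving latencies from the four annotations above.
// Timestamps are epoch microseconds; names are illustrative, not an SDK API.
public class SpanTimings {
    long cs; // CLIENT_SEND: client issues the request
    long sr; // SERVER_RECEIVE: server receives the request
    long ss; // SERVER_SEND: server sends the response
    long cr; // CLIENT_RECEIVE: client receives the response

    long serverProcessingTime() { return ss - sr; }               // time spent inside the service
    long totalRoundTripTime()   { return cr - cs; }               // latency seen by the caller
    long networkOverhead()      { return (cr - cs) - (ss - sr); } // transit time in both directions
}
```

Note that cs/cr and sr/ss are taken on different machines; only differences measured on the same host (ss - sr, cr - cs) are clock-safe, so the network term is an approximation when clocks drift.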
Technical Selection
Considering the company's HTTP‑centric scenario, the design adopts the Zipkin implementation philosophy and follows the OpenTracing standard for multi‑language compatibility.
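The article does not show SDK code, but instrumentation against the OpenTracing API typically looks like the minimal sketch below (assuming the Java opentracing-api/opentracing-util artifacts, roughly version 0.32 or later; the operation name and tags are illustrative):

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.tag.Tags;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class HttpClientInstrumentation {
    public void tracedCall(String url) {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan("http-call")
                .withTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
                .withTag(Tags.HTTP_URL.getKey(), url)
                .start();
        try {
            // Propagate the trace context downstream as HTTP headers,
            // so the next service joins the same trace.
            Map<String, String> headers = new HashMap<>();
            tracer.inject(span.context(), Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
            // ... attach 'headers' to the outgoing request and execute it ...
        } finally {
            span.finish(); // records the span's duration
        }
    }
}
```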
System Design
Overall Architecture
The tracing system consists of four main components: data instrumentation, data transmission, data storage, and a query UI.
Data Instrumentation
Integrate an SDK into the unified development framework for low‑intrusion data collection.
Use AOP to store trace data in a ThreadLocal variable, keeping instrumentation transparent to the application (see the sketch after this list).
Record TraceId, service name, endpoint, start time, and duration.
Send data asynchronously to a Kafka queue to minimize impact on business logic.
Supported middleware includes HTTP, MySQL, and RabbitMQ.
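As a rough illustration of the points above, here is a hypothetical aspect that assigns a TraceId at the entry point, keeps context in a ThreadLocal, records the listed fields, and hands the span to Kafka without blocking the caller. Class and field names (TraceContext, InstrumentedCall, the trace-spans topic) are invented for this sketch; only the Kafka producer API is real.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical names; the article does not show the real SDK's classes.
public class TraceAspect {
    // Per-thread trace context keeps instrumentation invisible to business code.
    private static final ThreadLocal<TraceContext> CONTEXT = new ThreadLocal<>();
    private final KafkaProducer<String, String> producer;

    public TraceAspect(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Invoked around each instrumented call by the AOP framework.
    public Object around(InstrumentedCall call) throws Exception {
        TraceContext ctx = CONTEXT.get();
        if (ctx == null) {
            // New TraceId at the entry point of the request.
            ctx = new TraceContext(java.util.UUID.randomUUID().toString());
            CONTEXT.set(ctx);
        }
        long start = System.currentTimeMillis();
        try {
            return call.proceed();
        } finally {
            long duration = System.currentTimeMillis() - start;
            // TraceId, service name, endpoint, start time, duration, as listed above.
            String span = String.format(
                "{\"traceId\":\"%s\",\"service\":\"%s\",\"endpoint\":\"%s\",\"start\":%d,\"duration\":%d}",
                ctx.traceId, call.serviceName(), call.endpoint(), start, duration);
            // send() buffers and returns immediately (it only blocks if the
            // in-memory buffer is full), so the business thread is not held up on I/O.
            producer.send(new ProducerRecord<>("trace-spans", ctx.traceId, span));
        }
    }

    static final class TraceContext {
        final String traceId;
        TraceContext(String traceId) { this.traceId = traceId; }
    }

    interface InstrumentedCall {
        Object proceed() throws Exception;
        String serviceName();
        String endpoint();
    }
}
```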
Data Transmission
A Kafka layer between the SDK and backend services decouples components and buffers data, preventing loss during traffic spikes at the cost of some latency.
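A producer configured along these lines makes the buffering-versus-latency trade-off explicit. The values below are illustrative, not the article's actual settings:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TraceProducerFactory {
    public static KafkaProducer<String, String> create(String brokers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Trade a little latency for throughput: batch spans before sending.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // A large in-memory buffer absorbs traffic spikes between SDK and brokers.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);
        // acks=1 is a reasonable middle ground for trace data:
        // losing an occasional span is tolerable, blocking business threads is not.
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        return new KafkaProducer<>(props);
    }
}
```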
Data Storage
Spans and annotations are stored in Elasticsearch, retaining the most recent month of data to balance storage cost and query performance.
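One common way to implement a rolling one-month window is an index per day plus a scheduled deletion of the index that ages out. The sketch below assumes a hypothetical trace-spans-yyyy.MM.dd naming scheme and uses the Elasticsearch low-level REST client; the article does not specify its actual retention mechanism.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class TraceIndexRetention {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    public static void dropExpiredIndex(RestClient client) throws Exception {
        // One index per day; drop the one that just fell outside the 30-day window.
        // A 404 (already deleted) surfaces as a ResponseException; handle it in real code.
        String expired = "trace-spans-" + LocalDate.now().minusDays(30).format(DAY);
        client.performRequest(new Request("DELETE", "/" + expired));
    }

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            dropExpiredIndex(client); // e.g. run once a day from a scheduler
        }
    }
}
```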
Query Interface
A web UI visualizes the distributed call graph, offering trace trees, dependency analysis, and project‑level aggregation.
Challenges Encountered
Web Page Load Timeout
Loading all spans at once caused timeouts for projects with millions of spans; the UI was rewritten to lazy‑load the latest ten spans and support dynamic search.
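The lazy-loading fix boils down to asking Elasticsearch for one small, sorted page instead of every span of the project. A sketch of such a query, with hypothetical field names (startTime, serviceName) and index pattern:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SpanPageQuery {
    // Fetch only the newest page of spans instead of millions of documents.
    public static Response latestSpans(RestClient client, String project) throws Exception {
        Request request = new Request("GET", "/trace-spans-*/_search");
        // Real code should escape or parameterize the term value.
        request.setJsonEntity(
            "{"
          + "  \"size\": 10,"                              // one small page
          + "  \"sort\": [ { \"startTime\": \"desc\" } ]," // newest first
          + "  \"query\": { \"term\": { \"serviceName\": \"" + project + "\" } }"
          + "}");
        return client.performRequest(request);
    }
}
```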
Span Accumulation
When HTTP client timeouts were not intercepted, spans remained in ThreadLocal, leading to thousands of entries; the SDK was updated to catch timeout exceptions and clean up the thread‑local storage.
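The essence of the fix is guaranteed cleanup on every exit path. A minimal sketch with illustrative names: the article says the SDK catches the timeout exception, and a finally block gives the same guarantee.

```java
public class TracedHttpClient {
    // Per-thread storage of the span for the in-flight call.
    private static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    public String execute(java.util.concurrent.Callable<String> httpCall) throws Exception {
        CURRENT_SPAN.set("span-for-this-call"); // placeholder span handle
        try {
            return httpCall.call();
        } finally {
            // Runs on success, on timeout, on any exception. Before the fix,
            // an uncaught client timeout skipped cleanup and spans piled up
            // on the pooled worker thread.
            CURRENT_SPAN.remove();
        }
    }
}
```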
Conclusion
By generating a globally unique TraceID for each request and linking all participating services, the tracing system enables call‑path analysis, performance bottleneck identification, and rapid fault isolation, providing valuable support for DevOps and operational decision‑making.
References
Google Dapper – http://bigbully.github.io/Dapper-translation/
Twitter Zipkin – http://zipkin.io/
Tracing article – http://www.cnblogs.com/zhengyun_ustc/p/55solution2.html