
Understanding Distributed Tracing and SkyWalking: Principles, Architecture, and Performance

This article explains the concept of distributed tracing, its importance in micro‑service architectures, the OpenTracing standard, and how SkyWalking implements automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance optimizations to provide low‑overhead observability for backend systems.

Top Architect

Jan 6, 2023

In distributed and micro-service systems, a single external request often traverses multiple modules, middleware components, and machines. That makes it hard to know which applications, modules, and nodes were involved and how each one performed. This is the problem that distributed tracing (rendered literally in this article as "link tracing") solves.

What Is Link Tracing?

Link tracing reconstructs a distributed request as a call chain that shows, for each service node, its latency, the machine the request reached, and the request status.

Principles of Link Tracing

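The key things to capture are request response time (RT), error responses, and the location of the bottleneck. In a monolithic architecture, aspect-oriented programming (AOP) can collect these metrics with minimal intrusion. In a micro-service system no single process sees the whole request, so the measurements have to be stitched together into a full distributed call chain.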
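For the monolithic case, the AOP idea can be sketched with a Python decorator that wraps a function and records its response time and any exception. This is only an illustration of the interception pattern: `traced`, `METRICS`, and `load_order` are hypothetical names, and the in-memory list stands in for whatever store a real system would use.

```python
import functools
import time

# Hypothetical in-memory store; a real system would ship these records elsewhere.
METRICS = []


def traced(func):
    """Record response time and failure status around each call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Remember the exception type, then let it propagate unchanged.
            error = type(exc).__name__
            raise
        finally:
            METRICS.append({
                "name": func.__name__,
                "rt_ms": (time.perf_counter() - start) * 1000.0,
                "error": error,
            })

    return wrapper


@traced
def load_order(order_id):
    # Hypothetical business function used only to exercise the decorator.
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"id": order_id}
```

The business function stays free of timing code, which is the "minimal intrusion" property. What this sketch cannot do is follow the request once it leaves the process.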

Two challenges stand out in micro-service systems: finding out why a page is slow when its request fans out across many services and machines, and reproducing an issue, which requires the complete call chain.

OpenTracing Standard

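OpenTracing defines a vendor-agnostic API for instrumenting applications. Much as JDBC lets application code target one interface while the database driver varies, OpenTracing lets the tracing implementation be swapped without changing the instrumentation.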

Its data model consists of:

Trace: the complete request chain.

Span: a single call with start and end times.

SpanContext: global context (e.g., traceId) passed between spans.
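The three concepts above can be sketched as plain data classes. This is a simplified illustration, not the OpenTracing API: the class layout, the field names, and the use of a random UUID as the trace ID are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that travel with a request from one span to the next."""
    trace_id: str
    span_id: int
    parent_span_id: Optional[int] = None


@dataclass
class Span:
    """A single call, with a start time and (once finished) an end time."""
    operation: str
    context: SpanContext
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.time()


class Trace:
    """Collects every span that shares one trace_id."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex  # placeholder ID scheme for the sketch
        self.spans: List[Span] = []

    def start_span(self, operation: str, parent: Optional[Span] = None) -> Span:
        context = SpanContext(
            trace_id=self.trace_id,
            span_id=len(self.spans),
            parent_span_id=parent.context.span_id if parent is not None else None,
        )
        span = Span(operation, context)
        self.spans.append(span)
        return span


# Usage: a root span for the incoming request and one child for a database call.
trace = Trace()
root = trace.start_span("GET /orders")
child = trace.start_span("SELECT orders", parent=root)
child.finish()
root.finish()
```

The parent link in each `SpanContext` is what allows the flat list of spans to be rebuilt into a call tree later.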

SkyWalking Tracing System

SkyWalking uses a plugin-based Java agent to collect span data automatically, with no changes to application code. It propagates context between services in request headers or RPC attachments. Trace IDs are globally unique and are generated with a Snowflake-like algorithm. Because such algorithms depend on the system clock, SkyWalking falls back to a random ID when the clock moves backwards.
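The trace ID scheme can be illustrated with a small generator. This is a sketch of the general idea only, not SkyWalking's implementation: the ID layout (instance, thread, millisecond timestamp, sequence), the class name, and the injectable `clock_ms` parameter are assumptions chosen to keep the example testable.

```python
import random
import threading
import time
import uuid


class TraceIdGenerator:
    """Snowflake-like IDs: instance + thread + millisecond timestamp + sequence."""

    def __init__(self, clock_ms=None):
        # A random per-process identifier keeps IDs from different processes apart.
        self._instance = uuid.uuid4().hex
        # The clock is injectable so that a rollback can be simulated.
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = 0
        self._seq = 0

    def next_id(self) -> str:
        prefix = f"{self._instance}.{threading.get_ident()}"
        with self._lock:
            now_ms = self._clock_ms()
            if now_ms < self._last_ms:
                # Clock moved backwards: a timestamp-based ID could repeat,
                # so fall back to a random component instead.
                return f"{prefix}.r{random.getrandbits(63)}"
            if now_ms == self._last_ms:
                self._seq += 1  # same millisecond: bump the sequence
            else:
                self._last_ms = now_ms
                self._seq = 0
            return f"{prefix}.{now_ms}.{self._seq}"


# Usage with a simulated clock: the fourth reading jumps back in time.
readings = iter([10_000, 10_000, 10_001, 9_000])
generator = TraceIdGenerator(clock_ms=lambda: next(readings))
ids = [generator.next_id() for _ in range(4)]
```

In this sketch the generator keeps returning random IDs until the clock catches up with the last timestamp it saw, trading ordering for uniqueness during the rollback window.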

Sampling is rate-limited: at most three traces are collected every three seconds. If an upstream service sampled a request, downstream services are forced to collect it as well, so every sampled chain is complete.
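The sampling rule can be sketched as a fixed-window counter with an override for requests that were already sampled upstream. The names are hypothetical, and letting forced samples bypass the local budget is a design choice made for this sketch rather than documented SkyWalking behavior; `upstream_sampled` stands in for whatever flag the propagated context carries.

```python
import time


class RateLimitedSampler:
    """Allow at most max_samples new traces per window, plus forced ones."""

    def __init__(self, max_samples=3, window_seconds=3.0, clock=time.monotonic):
        self._max_samples = max_samples
        self._window_seconds = window_seconds
        self._clock = clock  # injectable so the window can be tested
        self._window_start = clock()
        self._count = 0

    def should_sample(self, upstream_sampled=False):
        if upstream_sampled:
            # The upstream service already collected this request, so this
            # service must collect it too or the chain would have a gap.
            return True
        now = self._clock()
        if now - self._window_start >= self._window_seconds:
            # A new window starts: reset the budget.
            self._window_start = now
            self._count = 0
        if self._count < self._max_samples:
            self._count += 1
            return True
        return False


# Usage with a simulated clock frozen at t = 0.
current_time = [0.0]
sampler = RateLimitedSampler(clock=lambda: current_time[0])
decisions = [sampler.should_sample() for _ in range(5)]
```

With the clock frozen, only the first three of the five requests are sampled. A request marked as sampled upstream is accepted regardless of the remaining budget.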

SkyWalking Architecture

Data collected on each node is reported periodically and persisted in a storage back end such as Elasticsearch or MySQL, where it can be visualized and analyzed.
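Periodic reporting usually means buffering records locally and shipping them in batches rather than one network call per span. The sketch below shows that pattern under stated assumptions: `BufferedReporter` and its `sink` callback are hypothetical, and the sink stands in for whatever transport and storage a real deployment uses.

```python
from typing import Callable, Dict, List


class BufferedReporter:
    """Buffer records and hand them to a sink in batches."""

    def __init__(self, sink: Callable[[List[Dict]], None], max_batch: int = 100):
        self._sink = sink
        self._max_batch = max_batch
        self._buffer: List[Dict] = []

    def record(self, segment: Dict) -> None:
        self._buffer.append(segment)
        if len(self._buffer) >= self._max_batch:
            self.flush()  # buffer is full: report early

    def flush(self) -> None:
        """Send everything buffered so far; a timer would call this periodically."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._sink(batch)


# Usage: the sink here only collects batches in a list.
reported = []
reporter = BufferedReporter(sink=reported.append, max_batch=2)
reporter.record({"span": 1})
reporter.record({"span": 2})  # reaches max_batch and triggers a flush
reporter.record({"span": 3})
reporter.flush()              # what a periodic timer would do
```

Batching keeps the reporting cost off the request path, which is one common way a tracing agent keeps its overhead low.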

Performance Evaluation

In the benchmarks cited, SkyWalking adds negligible CPU, memory, and response-time overhead at 5000 TPS. Its measured latency was 22 ms, compared with 117 ms for Zipkin and 201 ms for Pinpoint. Unlike Zipkin, it also requires no instrumentation in application code.

Additional advantages include multi-language support (Java, .NET Core, PHP, Node.js, Go, Lua) and a rich plugin ecosystem for extensibility.

This article focuses on SkyWalking, but other tracing solutions may be a better fit for a given scenario.

