Observability: Logging, Metrics, and Tracing in Distributed Systems
Observability in distributed systems combines event logging, aggregated metrics, and request tracing, each with distinct trade-offs in detail, storage, and overhead. The ELK stack dominates log and metric handling, whereas tracing solutions such as EagleEye and SkyWalking vary with protocol and language. As a result, many teams adopt unified, cloud-native platforms such as Alibaba Cloud's Log Service for lower cost, real-time analysis, and simpler management.
Observability is neither a rigid theoretical framework nor a prescriptive technical specification. Its core is to encourage teams to internalize observability principles so that the applications developers build are observable. Academic work divides observability into three research directions: event logging, distributed tracing, and aggregated metrics. These areas overlap and complement one another.
Logging records the events an application generates as it runs. It gives detailed runtime state but consumes significant storage and query resources, a cost usually mitigated by filtering. Metrics are aggregated numeric values with a small storage footprint. They suit observing system state and trends but lack the fine-grained detail needed to localize problems; multi-dimensional structures such as histograms recover some of that detail. Tracing follows individual requests, which makes anomalous points easy to find, but it shares logging's high resource consumption, typically reduced through sampling.
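The storage trade-offs above can be made concrete with a small sketch. It is illustrative only: the `Histogram` class, the bucket bounds, and the `keep_trace` helper are hypothetical names, and hashing the trace ID is just one common way to sample, not a method the text prescribes.

```python
import bisect
import hashlib


class Histogram:
    """Fixed-bucket histogram: keeps one count per bucket, not every sample."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Sampling decision derived only from the trace ID, so every service
    that sees the same trace reaches the same verdict (assumed scheme)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # 0 <= value < 2**64
    return value < int(sample_rate * 2**64)


latency_ms = Histogram([10, 50, 100, 500])
for sample in (3, 12, 48, 75, 220, 900):
    latency_ms.observe(sample)
print(latency_ms.counts)
```

Six observations collapse into five counters regardless of how many more arrive, which is why metrics stay cheap; the price is that the individual requests behind each bucket are gone.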
In industry, the log and metric domains have converged on the Elastic Stack (ELK) as the dominant solution. Tracing has followed a different path because it depends heavily on specific network protocols and programming languages. The choice of transport (HTTP vs gRPC) and language (Java, Go, Node.js) shapes the tracing implementation, which often requires deep integration through agents or plugins and therefore makes instrumentation intrusive. Consequently, no single tracing vendor dominates; the market offers diverse products tailored to different technology stacks.
A trace is the complete call path of one entry request through an IT system, identified by a globally unique TraceId that correlates span data from different services. A span is a logical unit of execution; spans are linked by nesting or sequencing to express causal relationships. Each span carries an operation name, SpanId and ParentSpanId, start and finish times, a status code, and optional tags and events. Tags are key-value pairs that add semantic context (e.g., user data for full-link stress testing).
[Figure: span tree of one trace, with hierarchical span IDs 0, 0.1, 0.1.1, 0.1.2, 0.1.2.1, 0.2, 0.2.1, 0.3, 0.3.1, 0.3.1.1, 0.3.2]
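A minimal sketch of the span model described above, using the span IDs listed in the text. The field names, the `Span` class, and the `parent_of` helper are hypothetical; the rule that a dotted ID's parent is the ID minus its last segment is an assumption inferred from the shape of those IDs, not something the text states.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    start_time_ms: int
    finish_time_ms: int
    status_code: str = "OK"
    tags: Dict[str, str] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)


def parent_of(span_id: str) -> Optional[str]:
    """Assumed convention: drop the last dotted segment to get the parent."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None


def children_by_parent(spans):
    """Group span IDs under their parent to rebuild the call tree."""
    tree = {}
    for span in spans:
        tree.setdefault(span.parent_span_id, []).append(span.span_id)
    return tree


SPAN_IDS = ["0", "0.1", "0.1.1", "0.1.2", "0.1.2.1", "0.2",
            "0.2.1", "0.3", "0.3.1", "0.3.1.1", "0.3.2"]

# All spans share one TraceId; names and timestamps are placeholder values.
spans = [
    Span("trace-abc", sid, parent_of(sid), f"op-{sid}", 0, 1)
    for sid in SPAN_IDS
]
tree = children_by_parent(spans)
print(tree["0"])
```

Because every span records both the shared TraceId and its parent, a backend can reassemble the tree even when spans arrive out of order from different services.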
Tracing implementation hinges on two critical points. The first is low-cost, high-quality instrumentation, which yields trace data rich enough for rapid root-cause analysis. The second is guaranteed transparent propagation of trace context across heterogeneous environments, so that traces do not break midway. Solutions differ in how they buffer data on the application side. Alibaba's EagleEye stores trace data in a concurrent ring buffer with atomic read and write pointers and handles overflow with either a discard or an overwrite policy. SkyWalking uses a partitioned QueueBuffer with two implementations: one based on JDK blocking queues, and a non-blocking one built on an array plus an atomic index, optimized for agent-side performance.
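The ring-buffer idea can be sketched as follows. This is not EagleEye's code: the `SpanRingBuffer` class and its method names are hypothetical, and because Python has no atomic compare-and-set primitive, a lock stands in for the atomic pointer updates the text describes.

```python
import threading


class SpanRingBuffer:
    """Bounded buffer with read/write pointers and a configurable overflow policy."""

    def __init__(self, capacity, overwrite=False):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._overwrite = overwrite
        self._read = 0    # monotonically increasing read pointer
        self._write = 0   # monotonically increasing write pointer
        self._lock = threading.Lock()  # stand-in for atomic pointer updates
        self.dropped = 0

    def offer(self, item):
        with self._lock:
            if self._write - self._read == self._capacity:
                self.dropped += 1
                if not self._overwrite:
                    return False   # discard policy: reject the new item
                self._read += 1    # overwrite policy: give up the oldest item
            self._slots[self._write % self._capacity] = item
            self._write += 1
            return True

    def poll(self):
        with self._lock:
            if self._read == self._write:
                return None        # buffer is empty
            index = self._read % self._capacity
            item, self._slots[index] = self._slots[index], None
            self._read += 1
            return item


discarding = SpanRingBuffer(capacity=2)
results = [discarding.offer(s) for s in ("a", "b", "c")]

overwriting = SpanRingBuffer(capacity=2, overwrite=True)
for s in ("a", "b", "c"):
    overwriting.offer(s)
```

Either policy keeps the application thread from blocking on tracing: the discard policy loses the newest span, the overwrite policy loses the oldest, and the `dropped` counter makes the loss visible.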
Data collection and transmission also differ between systems. EagleEye writes to local log files that an agent gathers and ships to the backend, whereas SkyWalking provides real-time transport over gRPC or Kafka. Cross-thread propagation of trace context relies on thread-local mechanisms that carry values into pooled threads (e.g., TransmittableThreadLocal). A plain inheritable thread-local copies values only when a thread is created, so context is lost when a thread pool reuses existing threads.
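The thread-pool problem and its fix can be illustrated with Python's standard `contextvars` module, used here as an analogue of the Java mechanism rather than a port of it. The `trace_id_var` variable and the `submit_with_context` helper are hypothetical names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

trace_id_var = contextvars.ContextVar("trace_id", default=None)


def current_trace_id():
    return trace_id_var.get()


def submit_with_context(executor, fn, *args):
    """Capture the caller's context at submit time and run fn inside it."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Warm up the pool so its single worker thread exists before the
    # trace ID is set, as it would in a long-running service.
    pool.submit(lambda: None).result()

    trace_id_var.set("trace-abc")

    # Plain submit: the reused worker knows nothing about the caller's context.
    lost = pool.submit(current_trace_id).result()

    # Wrapped submit: the context captured at submit time travels with the task.
    kept = submit_with_context(pool, current_trace_id).result()

print(lost, kept)
```

The key design point is that the context is captured when the task is submitted, not when the worker thread is created, which is what makes the approach work with reused threads.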
For log storage and analysis, the open-source ELK stack (Elasticsearch, Logstash, Kibana) offers full-text search, structured parsing via Grok, and rich visualization, but it incurs higher operational overhead and cost. Alibaba Cloud's Log Service (SLS) provides a unified, cloud-native platform for logs, metrics, and traces. It features Logtail for efficient, low-resource log collection, unified storage with fast writes and queries, built-in alerting, and extensive machine-learning-enabled analytics. The comparison reports that SLS delivers lower latency, costs about 44% of ELK's total at hundred-TB scale, is easier to use, and offers richer aggregation and ML functions.
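Grok-style structured parsing amounts to naming the fields of a regular expression. The sketch below mimics the `%{PATTERN:field}` syntax in Python; it is not the Logstash implementation, the `compile_grok` and `parse_line` helpers are hypothetical, and the pattern table holds simplified stand-ins for the real Grok definitions.

```python
import re

# Simplified stand-ins for a few Grok patterns; the real definitions are stricter.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "URIPATH": r"/\S*",
    "NUMBER": r"-?\d+(?:\.\d+)?",
}

GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def compile_grok(expression):
    """Translate %{PATTERN:field} tokens into named regex groups."""
    def to_group(match):
        pattern_name, field_name = match.group(1), match.group(2)
        return f"(?P<{field_name}>{PATTERNS[pattern_name]})"
    return re.compile(GROK_TOKEN.sub(to_group, expression))


def parse_line(compiled, line):
    """Return the extracted fields, or None if the line does not match."""
    match = compiled.fullmatch(line)
    return match.groupdict() if match else None


access_log = compile_grok(
    "%{IP:client} %{WORD:method} %{URIPATH:request} "
    "%{NUMBER:bytes} %{NUMBER:duration}"
)
record = parse_line(access_log, "55.3.244.1 GET /index.html 15824 0.043")
print(record)
```

Once a line is turned into named fields, it can be indexed and aggregated like any structured record; lines that do not match return `None` and can be routed to a fallback path.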
Choosing an observability stack depends on factors such as technology‑stack homogeneity, performance requirements, operational overhead, and need for real‑time data ingestion. Unified platforms like SLS simplify management for large, cloud‑native environments, while ELK remains attractive for organizations seeking maximal flexibility and extensive plugin ecosystems.
DaTaobao Tech
Official account of DaTaobao Technology