Observability: Concepts, Challenges, and Didi’s Implementation

The article explains observability as the ability to infer any system state from external data, contrasts it with traditional monitoring, outlines challenges of high‑dimensional, high‑cardinality data and storage costs, and describes Didi’s hybrid MTL architecture that separates low‑ and high‑cardinality logs and metrics while linking them via TraceIDs to provide detailed, cost‑effective insight and streamlined debugging.

Didi Tech

Sep 12, 2023

Observability has become a hot topic in recent years. The article opens with a typical on-call scenario: a developer is alerted that an interface’s error count has exceeded a threshold, struggles to find the root cause with traditional tools (tail, grep, etc.), and eventually discovers that the problem originates in a dependent service rather than in the recent deployment.

From this scenario several pain points are identified:

Lack of deeper analysis capability: After obtaining monitoring charts, engineers must resort to low‑level tools to investigate.

Complex micro‑service architecture makes source identification difficult: It is hard to know whether the issue lies in the service itself or a dependency.

Unclear alert rules: It is hard to judge whether a rule such as if len(error) > 30 then alert() is reasonable.

Reliance on historical experience: Ad‑hoc filters like a specific error code (e.g., 9527) persist even when the underlying issue never recurs.
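
To make the alert-rule pain point concrete, here is a minimal sketch of the kind of static rule quoted above (more than 30 errors triggers an alert). The class name, the sliding 60-second window, and the timestamps are assumptions made for illustration; the article does not describe how Didi evaluates its rules.

```python
from collections import deque


class ErrorCountRule:
    """Hypothetical static rule: alert when more than `threshold`
    errors fall inside the last `window_seconds` seconds."""

    def __init__(self, threshold=30, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # error timestamps, oldest first

    def record_error(self, ts):
        self._timestamps.append(ts)

    def should_alert(self, now):
        # Drop errors that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

The rule only answers "did the count cross 30?". It carries no information about which dependency failed or which requests were affected, which is why the engineer in the scenario still ends up running tail and grep.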

Observability is defined as the ability to understand any state of a system from the outside, without needing predefined metrics. When new states appear, no additional instrumentation or code changes are required.

The article compares traditional monitoring with observability:

Monitoring focuses on aggregated values (average, max, min); observability emphasizes raw details such as logs and metric distributions.

Monitoring relies on static thresholds and run‑books; observability encourages engineers to explore detailed data for more accurate problem detection.

Monitoring is typically for experienced engineers; observability aims to be accessible to all engineers.
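
The difference between aggregated values and distributions can be shown with a small worked example. The latency numbers below are invented for illustration, and the nearest-rank percentile is just one common way to read a distribution, not something the article prescribes.

```python
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 1..100) over a pre-sorted list."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    ordered = sorted(latencies_ms)
    return {
        "avg": statistics.fmean(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }


# Hypothetical data: 98 fast requests and 2 very slow ones.
latencies = [20] * 98 + [2000] * 2
summary = summarize(latencies)
# summary["avg"] is 59.6, summary["p50"] is 20, summary["p99"] is 2000
```

An alert on "average latency above 100 ms" would stay silent here even though 2% of requests take two seconds. The tail only becomes visible once the distribution, or the raw samples behind it, is kept.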

The goal of observability is to provide a comprehensive, detailed view of system behavior, enabling faster issue discovery and improved stability.

Implementing observability faces two practical challenges when following the “high‑dimensionality, high‑cardinality” approach advocated by many SaaS vendors:

The sheer data volume and number of dimensions make costs for users grow explosively.

Existing storage solutions struggle to handle such massive, high‑cardinality datasets.
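
A rough calculation shows why high-cardinality dimensions are so expensive. In a label-based metrics store, the upper bound on the number of distinct time series is the product of the distinct values of each label. The label names and counts below are made up for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on distinct time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical low-cardinality label set.
low = {"service": 50, "endpoint": 20, "status_code": 10}

# The same metric with one high-cardinality label added.
high = dict(low, user_id=1_000_000)

low_bound = max_series(low)    # 10_000
high_bound = max_series(high)  # 10_000_000_000
```

Adding a single label with a million values multiplies the bound by a million, which is the cost and storage pressure the two challenges above describe.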

Open-source communities often take a more indirect route: correlating three primary signals, Metrics, Traces, and Logs (the MTL model). By linking high-level metric abstractions, cross-service trace contexts, and human-readable logs, observability can be approached without overwhelming storage.

Didi’s implementation combines both approaches, detailed high-cardinality data and MTL correlation, to balance cost, efficiency, and observability goals. It redesigns log and metric collection, sending low-cardinality and high-cardinality dimensions to different back-ends while maintaining the relationships between them. Users can view original log lines and trace IDs directly from alerts.

Specifically, log collectors sample raw log entries and upload them at regular intervals together with the metric curves they belong to. Metric collection asks developers to pass one extra value, typically a TraceID, that is not part of the metric’s label set (so it does not multiply the number of time series) but is stored alongside the sampled data, which links metrics to traces.
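
The collection scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Didi’s actual collector: the names SampledCounter, inc, and flush, the per-interval cap on samples, and the in-memory storage are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Series:
    count: int = 0
    # Sampled (trace_id, log_line) pairs kept next to the counter.
    samples: list = field(default_factory=list)


class SampledCounter:
    """Hypothetical counter: low-cardinality labels identify the
    series; the TraceID is stored beside it, never as a label."""

    def __init__(self, max_samples=3):
        self.max_samples = max_samples
        self._series = defaultdict(Series)

    def inc(self, labels, trace_id=None, log_line=None):
        key = tuple(sorted(labels.items()))  # series identity
        series = self._series[key]
        series.count += 1
        # Keep only a few samples per interval to bound storage.
        if trace_id is not None and len(series.samples) < self.max_samples:
            series.samples.append((trace_id, log_line))

    def flush(self):
        """Return this interval's snapshot and reset the counters."""
        snapshot = {
            key: (s.count, list(s.samples))
            for key, s in self._series.items()
        }
        self._series.clear()
        return snapshot
```

Because the TraceID never enters the series key, five requests with five different trace IDs still produce one time series; the snapshot simply carries a handful of trace IDs and log lines that an alert or chart can link to.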

Product features highlighted include:

Directly linking alert notifications to the original log text in IM, eliminating the need for manual tail / grep operations.

Drilling from chart views to raw logs and automatically jumping to the trace platform when a TraceID is detected.
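
Both features can be illustrated with a short sketch that renders an alert message containing the sampled log lines and adds a trace link whenever a TraceID is detected. The log format (trace_id=<hex>), the regular expression, and the URL template are placeholders invented for this example; the article does not specify Didi’s log format or trace platform address.

```python
import re

# Hypothetical log convention: "trace_id=<16 to 32 hex characters>".
TRACE_ID_PATTERN = re.compile(r"\btrace_?id=([0-9a-fA-F]{16,32})\b")

# Placeholder address, not a real trace platform.
TRACE_URL_TEMPLATE = "https://trace.example.invalid/trace/{trace_id}"


def trace_link(log_line):
    """Return a trace URL if the line carries a TraceID, else None."""
    match = TRACE_ID_PATTERN.search(log_line)
    if match is None:
        return None
    return TRACE_URL_TEMPLATE.format(trace_id=match.group(1))


def render_alert(rule_name, log_lines):
    """Build an IM-style alert body with raw logs and trace links."""
    parts = [f"[ALERT] {rule_name}"]
    for line in log_lines:
        parts.append(f"  log: {line}")
        link = trace_link(line)
        if link is not None:
            parts.append(f"  trace: {link}")
    return "\n".join(parts)
```

With the raw log text and the trace link already in the notification, the first debugging step no longer requires logging in to a machine.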

In conclusion, Didi’s internal “MTL” architecture, which integrates Metrics, Traces, and Logs, gives developers and operations teams a richer understanding of system state and supports more accurate fault detection and resolution. The article shares these experiences as a reference for other teams.
