Observability and Stability Engineering in Didi Ride‑Hailing Platform
At Didi, observability and stability engineering combine AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article closes with reflections on systematic investment and the evolution toward AIOps, plus a recruitment call for backend and test engineers.
Didi’s ride‑hailing business has grown rapidly and its technology has continuously evolved. The platform must support rigorous transaction logic, complex business systems, and strong consistency guarantees while handling massive traffic spikes during peak hours and holidays across cities of various sizes.
Technically, Didi has migrated its core services to micro‑services, fully moved to the cloud, and built multi‑active architectures (same‑city and cross‑region). This evolution brings challenges for service communication reliability, service governance, fault tolerance, distributed tracing, log management, and metric integration, making observability extremely difficult.
The article introduces the concept of observability—originating from control theory—as the ability to infer internal system state from external outputs (logs, metrics, traces). In Didi, observability is applied to improve performance and stability of distributed systems.
Current observability practice includes:
Standard business logs, exception logs, component logs (cache, DB, RPC).
Key business metrics such as success rate, error codes, latency, and custom metrics.
Complete trace chains for request flows.
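As a concrete illustration of the key business metrics above, the sketch below aggregates a window of request records into success rate, p99 latency, and observed error codes. The record shape and field names are illustrative assumptions, not Didi's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record shape; field names are illustrative, not Didi's schema.
@dataclass
class RequestRecord:
    ok: bool          # did the request succeed?
    error_code: int   # 0 on success
    latency_ms: float

def summarize(records):
    """Aggregate a non-empty window of requests into key business metrics:
    success rate, p99 latency, and the set of error codes seen."""
    total = len(records)
    successes = sum(1 for r in records if r.ok)
    latencies = sorted(r.latency_ms for r in records)
    p99 = latencies[min(total - 1, int(total * 0.99))]
    return {
        "success_rate": successes / total,
        "p99_latency_ms": p99,
        "error_codes": sorted({r.error_code for r in records if not r.ok}),
    }
```

In practice such summaries would be computed per time window and per dimension (city, product line) before feeding the alarm system.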
To address the growing complexity, Didi built an automated alarm system that:
Mass‑covers business scenarios using Cartesian product modeling of multiple dimensions (city, product, time slot, etc.).
Generates alarm rules automatically by computing thresholds from historical data with statistical or machine‑learning models (moving average, N‑point moving average, regression checked against R² goodness of fit).
Combines AI/ML predictions, statistical variance analysis, and handcrafted rules to reduce false positives.
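The two mechanisms above can be sketched together: a Cartesian product enumerates the scenario combinations to cover, and a moving‑average‑minus‑variance band is one simple stand‑in for the statistical threshold models the article mentions. Dimension values, window size, and the `k` multiplier are illustrative assumptions.

```python
import itertools
import statistics

# Illustrative dimension values; the real system models many more.
CITIES = ["city_a", "city_b"]
PRODUCTS = ["express", "premium"]
SLOTS = ["peak", "off-peak"]

def scenarios():
    """Cartesian product of alarm dimensions: one alarm rule per combination."""
    return list(itertools.product(CITIES, PRODUCTS, SLOTS))

def ma_threshold(history, window=7, k=3.0):
    """Lower alarm bound for a success-rate series: moving average of the
    last `window` points minus k standard deviations. Values falling below
    this bound would fire an alarm."""
    recent = history[-window:]
    sd = statistics.stdev(recent) if len(recent) > 1 else 0.0
    return statistics.fmean(recent) - k * sd
```

A nightly job could loop over `scenarios()`, pull each combination's history, and persist one threshold per scenario, which is how mass coverage and automatic rule generation compose.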
Implementation details:
Distributed tasks replace serial batch processing: a scheduled master job dispatches metric IDs to workers, each worker computes alarm strategies and writes them back.
Scheduled tasks run during low‑traffic windows (e.g., 02:00‑04:00) and support automatic threshold updates.
A management platform simplifies creation, modification, and bulk disabling of alarm rules.
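The master/worker dispatch described above can be sketched as follows: the master shards the full metric‑ID list into batches, workers compute a strategy per metric in parallel, and results are collected. Function names and the worker count are hypothetical; the real workers would write rules back to storage rather than return them.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # illustrative worker count

def shard(metric_ids, n):
    """Master side: split the full metric-ID list into n roughly equal batches."""
    return [metric_ids[i::n] for i in range(n)]

def worker(batch):
    """Worker side: compute an alarm strategy per metric ID. The real system
    would write each rule back to a rule store instead of returning it."""
    return [{"metric_id": m, "rule": f"auto-threshold for {m}"} for m in batch]

def run_job(metric_ids):
    """Scheduled master job: dispatch batches to workers in parallel and
    gather the generated rules."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        results = pool.map(worker, shard(metric_ids, NUM_WORKERS))
    return [rule for batch in results for rule in batch]
```

Replacing serial batch processing with this fan‑out is what lets the job fit inside the low‑traffic window even as the metric count grows.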
For fault handling, Didi introduced an automated “East Sea Dragon King” robot (ChatOps) that pushes SOPs, real‑time alarm cards, and escalation notifications to the internal chat system (D‑Chat). It integrates logs, metrics, traces, and change events (deployment, configuration, traffic injection) to quickly locate root causes and guide mitigation actions.
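One core step of such a bot is correlating an alarm with recent change events. The sketch below, under assumed field names and a hypothetical lookback window, lists changes that landed shortly before the alarm (newest first) and renders a minimal plain‑text stand‑in for a chat card.

```python
def suspect_changes(alarm_ts, events, lookback_s=1800):
    """Return change events (deploys, config pushes, etc.) that occurred
    within `lookback_s` seconds before the alarm, newest first."""
    recent = [e for e in events if 0 <= alarm_ts - e["ts"] <= lookback_s]
    return sorted(recent, key=lambda e: e["ts"], reverse=True)

def alarm_card(alarm, suspects):
    """Render a minimal text card (a stand-in for a real D-Chat card payload)."""
    lines = [f"[ALARM] {alarm['metric']} at t={alarm['ts']}"]
    lines += [f"  suspect change: {e['type']} at t={e['ts']}" for e in suspects]
    return "\n".join(lines)
```

Ranking suspects by recency is a deliberately naive heuristic; a production bot would also weigh the blast radius of each change and whether it touched the alarming service.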
The article also shares reflections on stability engineering:
Stability requires systematic, long‑term investment rather than ad‑hoc fixes.
Combining technical, organizational, and operational measures yields sustainable reliability.
Continuous evolution toward AIOps, chaos engineering, and automated fault injection is the future direction.
Finally, the piece includes a recruitment section inviting backend and test engineers to join Didi’s stability team.
Didi Tech
Official Didi technology account