Observability and Stability Engineering in Didi Ride‑Hailing Platform
At Didi, observability and stability engineering combine AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article closes with reflections on systematic investment and the evolution toward AIOps, plus a recruitment call for backend and test engineers.
Didi’s ride‑hailing business has grown rapidly and its technology has continuously evolved. The platform must support rigorous transaction logic, complex business systems, and strong consistency guarantees while handling massive traffic spikes during peak hours and holidays across cities of various sizes.
Technically, Didi has migrated its core services to micro‑services, fully moved to the cloud, and built multi‑active architectures (same‑city and cross‑region). This evolution brings challenges for service communication reliability, service governance, fault tolerance, distributed tracing, log management, and metric integration, making observability extremely difficult.
The article introduces the concept of observability—originating from control theory—as the ability to infer internal system state from external outputs (logs, metrics, traces). In Didi, observability is applied to improve performance and stability of distributed systems.
Current observability practice includes:
Standard business logs, exception logs, component logs (cache, DB, RPC).
Key business metrics such as success rate, error codes, latency, and custom metrics.
Complete trace chains for request flows.
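As a concrete illustration of the key business metrics above, the sketch below aggregates a window of request records into success rate, p99 latency, and observed error codes. The record shape and field names are illustrative assumptions, not Didi's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record shape; field names are illustrative, not Didi's schema.
@dataclass
class RequestRecord:
    ok: bool          # did the request succeed?
    error_code: int   # 0 on success
    latency_ms: float

def summarize(records):
    """Aggregate a non-empty window of requests into key business metrics:
    success rate, p99 latency, and the set of error codes seen."""
    total = len(records)
    successes = sum(1 for r in records if r.ok)
    latencies = sorted(r.latency_ms for r in records)
    p99 = latencies[min(total - 1, int(total * 0.99))]
    return {
        "success_rate": successes / total,
        "p99_latency_ms": p99,
        "error_codes": sorted({r.error_code for r in records if not r.ok}),
    }
```

In practice such summaries would be computed per time window and per dimension (city, product line) before feeding the alarm system.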
To address the growing complexity, Didi built an automated alarm system that:
Mass‑covers business scenarios using Cartesian product modeling of multiple dimensions (city, product, time slot, etc.).
Generates alarm rules automatically by computing thresholds from historical data with statistical or machine‑learning models (moving average, N‑point moving average, regression checked against R² goodness of fit).
Combines AI/ML predictions, statistical variance analysis, and handcrafted rules to reduce false positives.
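The two mechanisms above can be sketched together: a Cartesian product enumerates the scenario combinations to cover, and a moving‑average‑minus‑variance band is one simple stand‑in for the statistical threshold models the article mentions. Dimension values, window size, and the `k` multiplier are illustrative assumptions.

```python
import itertools
import statistics

# Illustrative dimension values; the real system models many more.
CITIES = ["city_a", "city_b"]
PRODUCTS = ["express", "premium"]
SLOTS = ["peak", "off-peak"]

def scenarios():
    """Cartesian product of alarm dimensions: one alarm rule per combination."""
    return list(itertools.product(CITIES, PRODUCTS, SLOTS))

def ma_threshold(history, window=7, k=3.0):
    """Lower alarm bound for a success-rate series: moving average of the
    last `window` points minus k standard deviations. Values falling below
    this bound would fire an alarm."""
    recent = history[-window:]
    sd = statistics.stdev(recent) if len(recent) > 1 else 0.0
    return statistics.fmean(recent) - k * sd
```

A nightly job could loop over `scenarios()`, pull each combination's history, and persist one threshold per scenario, which is how mass coverage and automatic rule generation compose.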
Implementation details:
Distributed tasks replace serial batch processing: a scheduled master job dispatches metric IDs to workers, each worker computes alarm strategies and writes them back.
Scheduled tasks run during low‑traffic windows (e.g., 02:00‑04:00) and support automatic threshold updates.
A management platform simplifies creation, modification, and bulk disabling of alarm rules.
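The master/worker dispatch described above can be sketched as follows: the master shards the full metric‑ID list into batches, workers compute a strategy per metric in parallel, and results are collected. Function names and the worker count are hypothetical; the real workers would write rules back to storage rather than return them.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # illustrative worker count

def shard(metric_ids, n):
    """Master side: split the full metric-ID list into n roughly equal batches."""
    return [metric_ids[i::n] for i in range(n)]

def worker(batch):
    """Worker side: compute an alarm strategy per metric ID. The real system
    would write each rule back to a rule store instead of returning it."""
    return [{"metric_id": m, "rule": f"auto-threshold for {m}"} for m in batch]

def run_job(metric_ids):
    """Scheduled master job: dispatch batches to workers in parallel and
    gather the generated rules."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        results = pool.map(worker, shard(metric_ids, NUM_WORKERS))
    return [rule for batch in results for rule in batch]
```

Replacing serial batch processing with this fan‑out is what lets the job fit inside the low‑traffic window even as the metric count grows.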
For fault handling, Didi introduced an automated “East Sea Dragon King” robot (ChatOps) that pushes SOPs, real‑time alarm cards, and escalation notifications to the internal chat system (D‑Chat). It integrates logs, metrics, traces, and change events (deployment, configuration, traffic injection) to quickly locate root causes and guide mitigation actions.
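One core step of such a bot is correlating an alarm with recent change events. The sketch below, under assumed field names and a hypothetical lookback window, lists changes that landed shortly before the alarm (newest first) and renders a minimal plain‑text stand‑in for a chat card.

```python
def suspect_changes(alarm_ts, events, lookback_s=1800):
    """Return change events (deploys, config pushes, etc.) that occurred
    within `lookback_s` seconds before the alarm, newest first."""
    recent = [e for e in events if 0 <= alarm_ts - e["ts"] <= lookback_s]
    return sorted(recent, key=lambda e: e["ts"], reverse=True)

def alarm_card(alarm, suspects):
    """Render a minimal text card (a stand-in for a real D-Chat card payload)."""
    lines = [f"[ALARM] {alarm['metric']} at t={alarm['ts']}"]
    lines += [f"  suspect change: {e['type']} at t={e['ts']}" for e in suspects]
    return "\n".join(lines)
```

Ranking suspects by recency is a deliberately naive heuristic; a production bot would also weigh the blast radius of each change and whether it touched the alarming service.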
The article also shares reflections on stability engineering:
Stability requires systematic, long‑term investment rather than ad‑hoc fixes.
Combining technical, organizational, and operational measures yields sustainable reliability.
Continuous evolution toward AIOps, chaos engineering, and automated fault injection is the future direction.
Finally, the piece includes a recruitment section inviting backend and test engineers to join Didi’s stability team.
Didi Tech
Official Didi technology account