Data‑Driven Risk Quantification Platform for SRE at Didi
Didi's data-driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts, and core services, turning the adoption of operational best practices into a competitive game. Scores have risen, incident rates have fallen despite higher change volume, and the approach paves the way for broader risk management across the organization.
In traditional enterprise IT, operations staff were often seen as a passive support role because there were few rapid product iterations or massive daily service releases. In the modern DevOps era, operations (SRE) must handle hundreds or thousands of releases and online changes each day. If they remain passive, they become overwhelmed, creating serious stability risks for the entire platform.
The author, formerly with IBM Cloud and now in Didi's operations department, shares insights from years of automation work.
1. Business Challenges
During the early stages of rapid expansion at Didi, traffic and user numbers grew exponentially, and service modules were iterated constantly.
Under business-first pressure, operations carries a heavy burden. For example, non-standard monitoring produces noisy alerts that drown out valuable signals and waste storage, while uncontrolled deployments during peak periods can cause service outages.
2. How to Respond
Operations should move from passive adaptation to proactively guiding developers: standardizing change procedures, rationalizing the use of monitoring resources, and promoting proper use of IT infrastructure.
The proposed solution is a data-driven approach: quantifying monitoring, deployments, and core components (MySQL, Codis, ZooKeeper) to provide numeric guidance for best-practice adoption.
The "Risk Quantification Platform" was launched, featuring a "Change Credit Score" (measuring change operations such as deployments and config updates) and a "Monitoring Health Score" (measuring the quality of alert usage). This creates a visible hand that drives business teams toward higher stability.
Three Difficulties of Data-Driven Practice
1) Data acquisition: capturing every user operation (e.g., pause duration in a gray release, peak-hour deployments, double-check steps, rollback occurrences) and converting these signals into measurable scores.
2) Defining standards: building a mathematical model from extensive ops experience. For monitoring health, the criteria include alive-metric presence, basic metrics (cpu.idle, mem.used, disk.used), upstream/downstream monitoring, alert effectiveness (avoiding over-notification), MTTA/MTTR, and dashboard availability. Each item receives a weighted score (e.g., alive metrics 40%, alert effectiveness 30%), and a total of 80 or above is considered passing; a sketch of this weighted model follows the list.
3) Driving adoption: using the scores to rank business lines, turning risk quantification into a competitive game that encourages teams to improve their practices.
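As a concrete illustration of points 2) and 3), the sketch below computes a weighted Monitoring Health Score and ranks teams by it. The 40% and 30% weights and the passing bar of 80 come from the article; the split of the remaining 30% across the other criteria, the team names, and the per-check grades are assumptions made for illustration.

```python
# Weighted Monitoring Health Score. The 40%/30% weights and the
# passing bar of 80 are from the article; the remaining weights
# are an assumed split for illustration.
WEIGHTS = {
    "alive_metrics": 0.40,        # alive-metric presence
    "alert_effectiveness": 0.30,  # avoiding over-notification
    "basic_metrics": 0.10,        # cpu.idle, mem.used, disk.used
    "upstream_downstream": 0.10,  # dependency monitoring
    "mtta_mttr": 0.05,            # response/recovery times
    "dashboard": 0.05,            # dashboard availability
}
PASSING = 80

def health_score(checks: dict[str, float]) -> float:
    """Each check is graded 0-100; the total is the weighted sum."""
    return sum(WEIGHTS[k] * checks.get(k, 0.0) for k in WEIGHTS)

# Hypothetical per-team grades for each criterion.
teams = {
    "order-api": {"alive_metrics": 100, "alert_effectiveness": 40,
                  "basic_metrics": 100, "upstream_downstream": 80,
                  "mtta_mttr": 90, "dashboard": 100},
    "payment":   {"alive_metrics": 100, "alert_effectiveness": 95,
                  "basic_metrics": 100, "upstream_downstream": 100,
                  "mtta_mttr": 80, "dashboard": 100},
}

# Rank business lines by score to drive the competitive game.
for name, checks in sorted(teams.items(),
                           key=lambda kv: health_score(kv[1]),
                           reverse=True):
    s = health_score(checks)
    print(f"{name}: {s:.1f} {'PASS' if s >= PASSING else 'FAIL'}")
# payment: 97.5 PASS
# order-api: 79.5 FAIL
```

Note how noisy alerting alone (a low alert-effectiveness grade) can drag an otherwise well-instrumented service below the passing bar, which is exactly the behavior the weighting is meant to discourage.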
3. Results
The platform has been in production for over a year, and the Change Credit Score has risen steadily over that period.
Correspondingly, the number of incident cases has declined despite an increase in change volume, indicating that higher scores correlate with improved service stability.
We hope that similar credit‑score mechanisms will bring positive stability outcomes across other business units.
4. Future Outlook
Future work includes extending risk quantification to more online operations such as middleware components, security‑related services, and potential password or data leakage risks.
We will continue to summarize operational experience, convert it into quantitative scores, and build best‑practice guidelines to drive wider adoption.
Finally, we aim to maximize the value of the collected data by turning it into actionable insights for the entire organization.