Data‑Driven Risk Quantification Platform for SRE at Didi
Didi's data-driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts, and core services, turning the adoption of operational best practices into a competitive game. Scores have risen, incident rates have fallen despite higher change volume, and the approach paves the way for broader risk management across the organization.
In traditional enterprise IT, operations staff were often seen as a passive support role because there were few rapid product iterations or massive daily service releases. In the modern DevOps era, operations (SRE) must handle hundreds or thousands of releases and online changes each day. If they remain passive, they become overwhelmed, creating serious stability risks for the entire platform.
The author, formerly with IBM Cloud and now in Didi's operations department, shares insights from years of automation work.
1. Business Challenges
During the early stages of rapid expansion at Didi, traffic and user numbers grew exponentially, and service modules were iterated constantly.
Under business-first pressure, operations carries a heavy burden. For example, non-standard monitoring produces noisy alerts that drown out valuable signals and waste storage, while uncontrolled deployments during peak periods can cause service outages.
2. How to Respond
Operations should move from passive adaptation to proactively guiding developers: standardizing change procedures, rationalizing the use of monitoring resources, and promoting proper use of IT infrastructure.
The proposed solution is a data-driven approach: quantifying monitoring, deployments, and core components (MySQL, Codis, ZooKeeper) to provide numeric guidance for best-practice adoption.
The "Risk Quantification Platform" was launched, featuring a "Change Credit Score" (measuring change operations such as deployments and config updates) and a "Monitoring Health Score" (measuring the quality of alert usage). This creates a visible hand that drives business teams toward higher stability.
Three Difficulties of Data-Driven Practice
1) Data acquisition: capturing every user operation (e.g., pause duration in a gray release, peak-hour deployments, double-check steps, rollback occurrences) and converting these signals into measurable scores.
2) Defining standards: building a mathematical model from extensive ops experience. For monitoring health, the criteria include alive-metric presence, basic metrics (cpu.idle, mem.used, disk.used), upstream/downstream monitoring, alert effectiveness (avoiding over-notification), MTTA/MTTR, and dashboard availability. Each item receives a weighted score (e.g., alive metrics 40%, alert effectiveness 30%), and a total of 80 or above is considered passing; a sketch of this weighted model follows the list.
3) Driving adoption: using the scores to rank business lines, turning risk quantification into a competitive game that encourages teams to improve their practices.
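As a concrete illustration of points 2) and 3), the sketch below computes a weighted Monitoring Health Score and ranks teams by it. The 40% and 30% weights and the passing bar of 80 come from the article; the split of the remaining 30% across the other criteria, the team names, and the per-check grades are assumptions made for illustration.

```python
# Weighted Monitoring Health Score. The 40%/30% weights and the
# passing bar of 80 are from the article; the remaining weights
# are an assumed split for illustration.
WEIGHTS = {
    "alive_metrics": 0.40,        # alive-metric presence
    "alert_effectiveness": 0.30,  # avoiding over-notification
    "basic_metrics": 0.10,        # cpu.idle, mem.used, disk.used
    "upstream_downstream": 0.10,  # dependency monitoring
    "mtta_mttr": 0.05,            # response/recovery times
    "dashboard": 0.05,            # dashboard availability
}
PASSING = 80

def health_score(checks: dict[str, float]) -> float:
    """Each check is graded 0-100; the total is the weighted sum."""
    return sum(WEIGHTS[k] * checks.get(k, 0.0) for k in WEIGHTS)

# Hypothetical per-team grades for each criterion.
teams = {
    "order-api": {"alive_metrics": 100, "alert_effectiveness": 40,
                  "basic_metrics": 100, "upstream_downstream": 80,
                  "mtta_mttr": 90, "dashboard": 100},
    "payment":   {"alive_metrics": 100, "alert_effectiveness": 95,
                  "basic_metrics": 100, "upstream_downstream": 100,
                  "mtta_mttr": 80, "dashboard": 100},
}

# Rank business lines by score to drive the competitive game.
for name, checks in sorted(teams.items(),
                           key=lambda kv: health_score(kv[1]),
                           reverse=True):
    s = health_score(checks)
    print(f"{name}: {s:.1f} {'PASS' if s >= PASSING else 'FAIL'}")
# payment: 97.5 PASS
# order-api: 79.5 FAIL
```

Note how noisy alerting alone (a low alert-effectiveness grade) can drag an otherwise well-instrumented service below the passing bar, which is exactly the behavior the weighting is meant to discourage.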
3. Results
The platform has been in production for over a year, and the Change Credit Score has risen steadily over that period.
Correspondingly, the number of incident cases has declined despite an increase in change volume, indicating that higher scores correlate with improved service stability.
We hope that similar credit‑score mechanisms will bring positive stability outcomes across other business units.
4. Future Outlook
Future work includes extending risk quantification to more online operations such as middleware components, security‑related services, and potential password or data leakage risks.
We will continue to summarize operational experience, convert it into quantitative scores, and build best‑practice guidelines to drive wider adoption.
Finally, we aim to maximize the value of the collected data by turning it into actionable insights for the entire organization.