Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop
This talk recounts how the Good Doctor platform responded to severe online incidents by launching the DOA project and, subsequently, a service risk governance initiative that identifies, quantifies, and mitigates latency-related risks through metrics-driven development, dependency analysis, middleware reliability work, and a dedicated risk-management platform.
Exploration
In early 2019 Good Doctor suffered several severe online outages caused by unstandardized SQL and slow interfaces, leading to site‑wide unavailability. To address this, the architecture team launched the “DOA (Dead or Alive)” project to stabilize middleware and then a service risk governance project to identify slow interfaces, unsafe SQL, and unreasonable dependencies.
The team classifies risk awareness into four categories: knowing what you know, knowing what you don’t know, not realizing you know, and not realizing you don’t know. Most engineers fall into the last category, prompting a deeper discussion of service risk governance.
The presentation is divided into three parts: Exploration (defining known and unknown risks), Adventure (identifying, quantifying, and tracking risks), and Expedition (hands‑on workshop).
Exploration – Risk Questions
Common questions include the meaning of p99, acceptable latency, service hierarchy, dependency loops, DB jitter, and whether front‑end aggregation can replace back‑end logic.
Why p99?
Latency does not follow a normal distribution; it is long-tailed, so average values are misleading. Using the 99th percentile (p99) as the Service Level Indicator (SLI) provides a more robust measure of stability; the target SLO is backend p99 < 100 ms and frontend p99 < 600 ms.
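To see why the mean hides tail latency, here is a minimal sketch with a simulated long-tail distribution (all numbers are illustrative, not Good Doctor's real data): most requests are fast, a small fraction are very slow, and the mean and p99 tell very different stories.

```python
import random

random.seed(42)
# Hypothetical latency sample in ms: 98% fast requests, 2% long-tail outliers.
latencies = [random.gauss(40, 10) for _ in range(980)] + \
            [random.uniform(300, 900) for _ in range(20)]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = int(round(pct / 100 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

mean = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)
print(f"mean={mean:.1f}ms p99={p99:.1f}ms")
```

The mean stays around 50 ms and looks healthy, while p99 lands in the slow tail and immediately flags the SLO breach, which is exactly why p99 is the chosen SLI.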
Finding Risks
A key risk factor is dependency latency. Monitoring upstream service latency helps locate the root of high-latency paths. Improper dependencies such as circular or bidirectional calls can cause cascading delays and difficult troubleshooting.
First Pitfall: Unreasonable Service Dependencies
Understanding service layers (high‑level vs low‑level) and upstream/downstream relationships is essential. Circular dependencies increase network overhead dramatically; a 50 ms call repeated ten times yields 500 ms latency.
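A circular dependency can be detected mechanically from the service call graph. The sketch below (service names are hypothetical) runs a depth-first search over a map of service-to-downstream calls and returns one cycle if any exists:

```python
def find_cycle(graph):
    """Return a list of services forming a call cycle, or None if acyclic.

    graph maps each service name to the list of services it calls.
    """
    visiting, visited = set(), set()
    path = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, ()):
            if dep in visiting:                       # back edge: a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# Hypothetical call graph: pricing calls back into order, closing a loop.
calls = {
    "order":     ["inventory", "user"],
    "inventory": ["pricing"],
    "pricing":   ["order"],
    "user":      [],
}
print(find_cycle(calls))
```

Every extra hop around such a loop adds a full network round trip, which is how a 50 ms call repeated ten times ends up costing 500 ms.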
Second Pitfall: Assuming Middleware Is 100% Available
Middleware latency, connection time, and retry counts can become bottlenecks, especially with short‑lived connections. Slow SQL, cache misses, and Redis lock contention also contribute to high latency.
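One way to defend against middleware hiccups is a per-attempt latency budget with bounded retries. This is a minimal sketch, not the platform's actual middleware client; function names, budgets, and backoff values are assumptions for illustration, and the "timeout" here is only detected after the call returns rather than cancelling it:

```python
import time

def call_with_retries(op, attempts=3, timeout_s=0.2, backoff_s=0.05):
    """Call a middleware operation with a latency budget and bounded retries.

    op: any zero-argument callable wrapping the middleware call.
    Retries on exceptions with exponential backoff; treats calls that
    overrun timeout_s as failures (checked after completion, not enforced).
    """
    last_err = None
    for i in range(attempts):
        start = time.monotonic()
        try:
            result = op()
        except Exception as err:
            last_err = err
            time.sleep(backoff_s * (2 ** i))  # exponential backoff between tries
            continue
        if time.monotonic() - start > timeout_s:
            last_err = TimeoutError(f"attempt {i + 1} exceeded {timeout_s}s budget")
            continue
        return result
    raise RuntimeError(f"middleware call failed after {attempts} attempts") from last_err
```

Reusing pooled long-lived connections inside `op` avoids paying connection-setup cost on every attempt, which is what makes short-lived connections a bottleneck in the first place.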
Third Pitfall: Ignoring Third‑Party Service Failures
Over‑reliance on external APIs without timeout or heartbeat mechanisms leads to hidden failures. Redundancy, disaster‑recovery, and monitoring of latency and success rates mitigate these risks.
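A common pattern for containing third-party failures is a circuit breaker: after repeated errors, stop calling the dependency and fail fast instead of waiting on a dead endpoint. Below is a minimal sketch with illustrative thresholds, not a production implementation:

```python
import time

class CircuitBreaker:
    """Fail fast on a broken third-party dependency (illustrative thresholds).

    After max_failures consecutive errors the circuit opens: calls raise
    immediately for reset_after seconds, then one probe call is allowed.
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one probe call through
            self.failures = 0
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0           # success resets the failure counter
        return result
```

Combined with monitoring of latency and success rates, this turns a hidden third-party outage into an explicit, fast, observable failure.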
Adventure – Platform Implementation
The risk‑governance platform aggregates trace logs, risk notifications, DBA slow‑SQL suggestions, and visual dashboards. It uses ClickHouse for raw logs, transforms them into metrics stored in GraphiteMergeTree, and supports OLAP queries for multi‑dimensional analysis.
Risk tasks are prioritized by latency impact and request volume; developers can set remediation plans, receive “rotten egg” alerts for upstream services, and view detailed service profiles with highlighted slow SQL.
Risks are screened with threshold expressions of the form sum(appslow_count > y) by (appname, method) and sum(appslow_p99 > x) by (appname, method), where x and y are the configurable latency and count thresholds.
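The screening and prioritization described above can be approximated in a few lines. This is a sketch, not the platform's code; the field names and sample numbers are assumptions, and impact is scored as p99 times slow-request count to reflect "latency impact and request volume":

```python
def risky_interfaces(metrics, p99_ms_threshold, slow_count_threshold):
    """Flag (appname, method) pairs breaching either threshold,
    sorted so the highest-impact remediation tasks come first.

    metrics: list of dicts with appname, method, p99_ms, slow_count.
    """
    flagged = [
        row for row in metrics
        if row["p99_ms"] > p99_ms_threshold
        or row["slow_count"] > slow_count_threshold
    ]
    # Priority = latency impact x request volume (illustrative scoring).
    return sorted(flagged, key=lambda r: r["p99_ms"] * r["slow_count"], reverse=True)

# Hypothetical per-interface metrics for one screening window.
sample = [
    {"appname": "order",  "method": "create", "p99_ms": 850, "slow_count": 1200},
    {"appname": "user",   "method": "login",  "p99_ms": 90,  "slow_count": 3},
    {"appname": "search", "method": "query",  "p99_ms": 400, "slow_count": 5000},
]
for row in risky_interfaces(sample, p99_ms_threshold=100, slow_count_threshold=100):
    print(row["appname"], row["method"])
```

Here the login interface stays under both thresholds and is dropped, while the two breaching interfaces are ranked by estimated impact.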
Expedition – Hands‑On Workshop
Participants view latency curves (p50, p75, p95, p99), explore task lists sorted by priority, and drill into task details that show interface portraits, slow‑SQL highlights, and optimization recommendations.
Conclusion
By adopting Metrics‑Driven Development, defining SLI/SLO, and continuously monitoring dependencies, middleware, and third‑party services, Good Doctor significantly improved overall service stability and created a repeatable process for long‑term risk mitigation.
HaoDF Tech Team
HaoDF Online tech practice and sharing. Join us to discuss and help create quality healthcare through technology.