Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop
This talk recounts how the Good Doctor platform responded to severe online incidents by launching the DOA project and, subsequently, a service risk governance initiative that identifies, quantifies, and mitigates latency-related risks through metrics-driven development, dependency analysis, middleware reliability work, and a dedicated risk-management platform.
Exploration
In early 2019 Good Doctor suffered several severe online outages caused by unstandardized SQL and slow interfaces, leading to site‑wide unavailability. To address this, the architecture team launched the “DOA (Dead or Alive)” project to stabilize middleware and then a service risk governance project to identify slow interfaces, unsafe SQL, and unreasonable dependencies.
The team classifies risk awareness into four categories: knowing what you know, knowing what you don’t know, not realizing you know, and not realizing you don’t know. Most engineers fall into the last category, prompting a deeper discussion of service risk governance.
The presentation is divided into three parts: Exploration (defining known and unknown risks), Adventure (identifying, quantifying, and tracking risks), and Expedition (hands‑on workshop).
Exploration – Risk Questions
Common questions include the meaning of p99, acceptable latency, service hierarchy, dependency loops, DB jitter, and whether front‑end aggregation can replace back‑end logic.
Why p99?
Latency does not follow a normal distribution; it is long-tailed, so average values are misleading. Using the 99th percentile (p99) as the Service Level Indicator (SLI) provides a more robust measure of stability; the target SLO is backend p99 < 100 ms and frontend p99 < 600 ms.
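To see why the mean hides tail latency, here is a minimal sketch with a simulated long-tail distribution (all numbers are illustrative, not Good Doctor's real data): most requests are fast, a small fraction are very slow, and the mean and p99 tell very different stories.

```python
import random

random.seed(42)
# Hypothetical latency sample in ms: 98% fast requests, 2% long-tail outliers.
latencies = [random.gauss(40, 10) for _ in range(980)] + \
            [random.uniform(300, 900) for _ in range(20)]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = int(round(pct / 100 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

mean = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)
print(f"mean={mean:.1f}ms p99={p99:.1f}ms")
```

The mean stays around 50 ms and looks healthy, while p99 lands in the slow tail and immediately flags the SLO breach, which is exactly why p99 is the chosen SLI.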
Finding Risks
A key risk factor is dependency latency. Monitoring upstream service latency helps locate the root of high-latency paths. Improper dependencies such as circular or bidirectional calls can cause cascading delays and difficult troubleshooting.
First Pitfall: Unreasonable Service Dependencies
Understanding service layers (high‑level vs low‑level) and upstream/downstream relationships is essential. Circular dependencies increase network overhead dramatically; a 50 ms call repeated ten times yields 500 ms latency.
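A circular dependency can be detected mechanically from the service call graph. The sketch below (service names are hypothetical) runs a depth-first search over a map of service-to-downstream calls and returns one cycle if any exists:

```python
def find_cycle(graph):
    """Return a list of services forming a call cycle, or None if acyclic.

    graph maps each service name to the list of services it calls.
    """
    visiting, visited = set(), set()
    path = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, ()):
            if dep in visiting:                       # back edge: a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# Hypothetical call graph: pricing calls back into order, closing a loop.
calls = {
    "order":     ["inventory", "user"],
    "inventory": ["pricing"],
    "pricing":   ["order"],
    "user":      [],
}
print(find_cycle(calls))
```

Every extra hop around such a loop adds a full network round trip, which is how a 50 ms call repeated ten times ends up costing 500 ms.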
Second Pitfall: Assuming Middleware Is 100% Available
Middleware latency, connection time, and retry counts can become bottlenecks, especially with short‑lived connections. Slow SQL, cache misses, and Redis lock contention also contribute to high latency.
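One way to defend against middleware hiccups is a per-attempt latency budget with bounded retries. This is a minimal sketch, not the platform's actual middleware client; function names, budgets, and backoff values are assumptions for illustration, and the "timeout" here is only detected after the call returns rather than cancelling it:

```python
import time

def call_with_retries(op, attempts=3, timeout_s=0.2, backoff_s=0.05):
    """Call a middleware operation with a latency budget and bounded retries.

    op: any zero-argument callable wrapping the middleware call.
    Retries on exceptions with exponential backoff; treats calls that
    overrun timeout_s as failures (checked after completion, not enforced).
    """
    last_err = None
    for i in range(attempts):
        start = time.monotonic()
        try:
            result = op()
        except Exception as err:
            last_err = err
            time.sleep(backoff_s * (2 ** i))  # exponential backoff between tries
            continue
        if time.monotonic() - start > timeout_s:
            last_err = TimeoutError(f"attempt {i + 1} exceeded {timeout_s}s budget")
            continue
        return result
    raise RuntimeError(f"middleware call failed after {attempts} attempts") from last_err
```

Reusing pooled long-lived connections inside `op` avoids paying connection-setup cost on every attempt, which is what makes short-lived connections a bottleneck in the first place.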
Third Pitfall: Ignoring Third‑Party Service Failures
Over‑reliance on external APIs without timeout or heartbeat mechanisms leads to hidden failures. Redundancy, disaster‑recovery, and monitoring of latency and success rates mitigate these risks.
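A common pattern for containing third-party failures is a circuit breaker: after repeated errors, stop calling the dependency and fail fast instead of waiting on a dead endpoint. Below is a minimal sketch with illustrative thresholds, not a production implementation:

```python
import time

class CircuitBreaker:
    """Fail fast on a broken third-party dependency (illustrative thresholds).

    After max_failures consecutive errors the circuit opens: calls raise
    immediately for reset_after seconds, then one probe call is allowed.
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one probe call through
            self.failures = 0
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0           # success resets the failure counter
        return result
```

Combined with monitoring of latency and success rates, this turns a hidden third-party outage into an explicit, fast, observable failure.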
Adventure – Platform Implementation
The risk‑governance platform aggregates trace logs, risk notifications, DBA slow‑SQL suggestions, and visual dashboards. It uses ClickHouse for raw logs, transforms them into metrics stored in GraphiteMergeTree, and supports OLAP queries for multi‑dimensional analysis.
Risk tasks are prioritized by latency impact and request volume; developers can set remediation plans, receive “rotten egg” alerts for upstream services, and view detailed service profiles with highlighted slow SQL.
Risks are screened with threshold expressions of the form sum(appslow_count > y) by (appname, method) and sum(appslow_p99 > x) by (appname, method), where x and y are the configurable latency and count thresholds.
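The screening and prioritization described above can be approximated in a few lines. This is a sketch, not the platform's code; the field names and sample numbers are assumptions, and impact is scored as p99 times slow-request count to reflect "latency impact and request volume":

```python
def risky_interfaces(metrics, p99_ms_threshold, slow_count_threshold):
    """Flag (appname, method) pairs breaching either threshold,
    sorted so the highest-impact remediation tasks come first.

    metrics: list of dicts with appname, method, p99_ms, slow_count.
    """
    flagged = [
        row for row in metrics
        if row["p99_ms"] > p99_ms_threshold
        or row["slow_count"] > slow_count_threshold
    ]
    # Priority = latency impact x request volume (illustrative scoring).
    return sorted(flagged, key=lambda r: r["p99_ms"] * r["slow_count"], reverse=True)

# Hypothetical per-interface metrics for one screening window.
sample = [
    {"appname": "order",  "method": "create", "p99_ms": 850, "slow_count": 1200},
    {"appname": "user",   "method": "login",  "p99_ms": 90,  "slow_count": 3},
    {"appname": "search", "method": "query",  "p99_ms": 400, "slow_count": 5000},
]
for row in risky_interfaces(sample, p99_ms_threshold=100, slow_count_threshold=100):
    print(row["appname"], row["method"])
```

Here the login interface stays under both thresholds and is dropped, while the two breaching interfaces are ranked by estimated impact.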
Expedition – Hands‑On Workshop
Participants view latency curves (p50, p75, p95, p99), explore task lists sorted by priority, and drill into task details that show interface portraits, slow‑SQL highlights, and optimization recommendations.
Conclusion
By adopting Metrics‑Driven Development, defining SLI/SLO, and continuously monitoring dependencies, middleware, and third‑party services, Good Doctor significantly improved overall service stability and created a repeatable process for long‑term risk mitigation.
HaoDF Tech Team
HaoDF Online tech practice and sharing. Join us to discuss and help create quality healthcare through technology.