
Why 100% Service Uptime Isn’t Worth the Cost: SRE Insights on Risk and ROI

The article explains why striving for perfect service availability is unnecessary, outlines the cost of high reliability, shows how to measure availability and SLOs, discusses who should set SLOs, and highlights the importance of ROI when improving reliability.

Ops Development Stories

Chapter 3 of "Site Reliability Engineering (Google)" discusses embracing risk and shares key insights with personal reflections.

Must service availability be 100%? Actually, no. A service's end users cannot perceive the difference between 99.99% and 99.999% reliability, because their own devices, networks, and ISPs introduce far larger sources of unreliability.

High reliability brings high cost. Moving from 99.99% to 99.999% reduces annual downtime by only about 47 minutes, yet the cost can be substantial:

Cost of redundant physical servers/computing resources

Opportunity cost of diverting engineering effort from feature development
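The diminishing returns of each extra "nine" are easy to see with a little arithmetic. A minimal Python sketch (function name is my own):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {annual_downtime_minutes(target):7.1f} min/year")
```

Going from three nines to four saves hours per year; going from four to five saves only about 47 minutes, which is where the cost argument bites.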

How to measure availability

Typical method uses unplanned downtime:

<code>Availability = System Uptime / (System Uptime + Unplanned Downtime)</code>

Unplanned downtime includes crashes, feature outages, or performance degradation. Planned downtime (e.g., scheduled upgrades) does not count against the SLA.
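The time-based formula above can be sketched in a few lines of Python (function name and units are my own choices):

```python
def availability_from_downtime(uptime_min: float, unplanned_downtime_min: float) -> float:
    """Availability = uptime / (uptime + unplanned downtime), in minutes.
    Planned maintenance is excluded by convention, so it is not passed in."""
    return uptime_min / (uptime_min + unplanned_downtime_min)

# A year with ~53 minutes of unplanned downtime lands at roughly four nines.
print(round(availability_from_downtime(525_547, 53), 4))  # → 0.9999
```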

For distributed systems where partial outages occur, availability can be measured by request success rate:

<code>Availability = Successful Requests / Total Requests</code>
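A request-based sketch of the same idea (the zero-traffic convention is my own assumption, not from the source):

```python
def availability_by_requests(successful: int, total: int) -> float:
    """Availability = successful requests / total requests."""
    if total == 0:
        return 1.0  # no traffic means nothing failed (assumed convention)
    return successful / total

# 2.5M requests with 250 failures -> four nines
print(availability_by_requests(2_499_750, 2_500_000))  # → 0.9999
```

The request-based view handles partial outages naturally: a degraded shard shows up as a lower success rate rather than an all-or-nothing downtime window.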

Large internet companies with thousands of micro‑services cannot report a single service‑level metric; instead they use business‑level indicators (e.g., ride‑order volume for a ride‑hailing app) as a “north‑star” metric.

Who defines SLOs? At Google, the product‑technical team that owns the commercial goals sets the SLO. For internal infrastructure services (e.g., BigTable), the service’s own engineering team collaborates with upstream service owners.

Different upstream services may have conflicting requirements (low latency vs. high throughput), so infrastructure teams often create separate clusters with distinct SLOs.

Improving SLOs must consider ROI.

Example: raising availability from 99.9% to 99.99% adds 0.09 percentage points of availability. If annual revenue is $1,000,000 and revenue is assumed proportional to availability, the incremental value is $900. If the improvement costs less than $900, it is worthwhile; otherwise, it is not.
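The ROI check above can be written as a small helper (names and the proportional-revenue assumption follow the example, not any standard API):

```python
def reliability_roi(annual_revenue: float, current: float, target: float,
                    improvement_cost: float) -> tuple[float, bool]:
    """Return (incremental value, worthwhile?) for an availability improvement.

    Assumes revenue scales linearly with availability, as in the example.
    """
    incremental_value = annual_revenue * (target - current)
    return incremental_value, incremental_value > improvement_cost

value, worthwhile = reliability_roi(1_000_000, 0.999, 0.9999, improvement_cost=500)
print(round(value, 2), worthwhile)  # → 900.0 True
```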

SLO and error‑budget construction process

Product management defines an SLO for a service each quarter.

Actual uptime is measured by an independent monitoring system.

The difference between measured uptime and the SLO is the remaining error budget.

As long as the error budget remains, new releases can be deployed.
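The release-gating step of this process can be sketched as follows (function names are mine; real error-budget policies are usually richer than a boolean gate):

```python
def remaining_error_budget(slo: float, measured_uptime: float) -> float:
    """Error budget = allowed failure (1 - SLO) minus failure already spent."""
    allowed_failure = 1 - slo
    spent_failure = 1 - measured_uptime
    return allowed_failure - spent_failure

def can_release(slo: float, measured_uptime: float) -> bool:
    """New releases are allowed only while error budget remains."""
    return remaining_error_budget(slo, measured_uptime) > 0

print(can_release(slo=0.999, measured_uptime=0.9995))  # budget left → True
print(can_release(slo=0.999, measured_uptime=0.998))   # budget spent → False
```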

Further reading

Fast Cat Cloud Observability product – focuses on fault localization and stability governance.

Nightingale Professional Edition – provides enhanced monitoring capabilities and expert observability guidance.

Unified On‑Call Center – addresses alert noise reduction, scheduling, escalation, and collaboration.

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
