Choosing Appropriate SLIs and Defining SLOs for Reliable Services
This guide explains how to select suitable service‑level indicators (SLIs), define customer‑centric service‑level objectives (SLOs), use error budgets, and iteratively improve reliability. It covers serving, data‑processing, and storage systems, with practical recommendations for Google Cloud environments.
Choosing the Right SLI
Selecting appropriate Service Level Indicators (SLIs) is essential for understanding service performance. For multi‑tenant SaaS applications, capture SLIs at the tenant level rather than only globally, propagating a tenant identifier through the stack so monitoring can aggregate per‑tenant metrics.
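As a minimal sketch of the tenant‑level idea above, the following Python aggregates an availability SLI per tenant and compares it with the global figure. The record shape (a tenant identifier paired with a success flag) and the tenant names are assumptions made for illustration, not part of any particular monitoring product.

```python
from collections import defaultdict


def availability_by_tenant(requests):
    """Return {tenant_id: fraction of successful requests}.

    `requests` is an iterable of (tenant_id, succeeded) pairs; this record
    shape is a hypothetical one chosen for the example.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for tenant_id, succeeded in requests:
        total[tenant_id] += 1
        if succeeded:
            good[tenant_id] += 1
    return {tenant: good[tenant] / total[tenant] for tenant in total}


# Hypothetical sample: tenant-a is healthy, tenant-b fails half the time.
sample = [
    ("tenant-a", True), ("tenant-a", True), ("tenant-a", True), ("tenant-a", True),
    ("tenant-b", True), ("tenant-b", False),
]

per_tenant = availability_by_tenant(sample)
global_availability = sum(ok for _, ok in sample) / len(sample)

print(per_tenant)
print(f"global availability: {global_availability:.3f}")
```

In this sample the global number looks tolerable while one tenant sees half of its requests fail, which is the situation a global‑only SLI would hide.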
Service Systems
Typical SLIs for systems that serve data include:
Availability – the fraction of time the service is usable, often measured as the fraction of well‑formed requests that succeed, e.g., 99%.
Latency – how quickly a certain percentile of requests is satisfied, e.g., the 99th‑percentile latency of 300 ms.
Quality – a service‑specific measure of how close the response is to the ideal, expressed either as a binary good/bad value or on a 0‑100% scale.
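The availability and latency SLIs above can be computed from raw request data roughly as follows. This is a sketch under stated assumptions: the sample latencies are invented, the 300 ms threshold is taken from the example above, and the nearest‑rank percentile method is one of several common definitions.

```python
import math


def availability(good_requests, total_requests):
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with a few outliers.
latencies_ms = [120] * 97 + [250, 280, 900]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
meets_latency_target = p99 <= 300  # threshold from the example above

print(f"availability: {availability(990, 1000):.2%}")
print(f"p50={p50} ms, p99={p99} ms, meets 300 ms target: {meets_latency_target}")
```

The median hides the slow requests entirely, while the 99th percentile exposes them, which is why percentile‑based latency SLIs are preferred over averages.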
Data Processing Systems
Typical SLIs for data‑processing workloads include:
Coverage – the proportion of data that has been processed, e.g., 99.9%.
Correctness – the proportion of output data that is considered correct, e.g., 99.99%.
Freshness – how recent the source or aggregated data is, e.g., updates within the last 20 minutes.
Throughput – the amount of data processed, e.g., 500 MiB/s or 1 000 requests per second.
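The data‑processing SLIs above reduce to simple ratios and time comparisons. The sketch below assumes hypothetical record counts and timestamps, and uses the 20‑minute freshness bound from the example as a default; none of these values come from a real pipeline.

```python
from datetime import datetime, timedelta, timezone


def ratio(good, total):
    """Generic good/total ratio, usable for coverage and correctness."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def is_fresh(last_update, now, max_age=timedelta(minutes=20)):
    """True if the data was updated within max_age of now."""
    return now - last_update <= max_age


def throughput_mib_per_s(bytes_processed, seconds):
    """Average throughput in MiB/s over the measured interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_processed / (1024 * 1024) / seconds


# Hypothetical measurements for one pipeline run.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
coverage = ratio(999_000, 1_000_000)      # records processed / records expected
correctness = ratio(998_900, 999_000)     # records verified correct / records processed
fresh = is_fresh(datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc), now)
stale = is_fresh(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now)
rate = throughput_mib_per_s(500 * 1024 * 1024 * 60, 60)

print(f"coverage={coverage:.3%} correctness={correctness:.3%}")
print(f"fresh={fresh} stale={stale} throughput={rate:.0f} MiB/s")
```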
Storage Systems
Typical SLIs for storage‑focused systems include:
Durability – the likelihood that written data can be retrieved later, e.g., 99.9999%.
Throughput and latency are also common storage SLIs.
Selecting SLIs Based on User Experience and Setting SLOs
Reliability is defined by the user's experience, so measure it as close to the user as possible. Use the first of the following options that is feasible:
Instrument mobile or Web clients when possible (e.g., using Firebase Performance Monitoring).
If that is not feasible, monitor at the external HTTP(S) load balancer with Cloud Monitoring.
As a last resort, monitor the servers themselves, e.g., Compute Engine instances with Cloud Monitoring (formerly Stackdriver Monitoring).
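Set the SLO high enough that almost all users are happy with the service, and no higher. Transient network and connectivity issues outside your control mean that users rarely notice brief problems in the service itself, so a target slightly below 100% is both realistic and cheaper to meet.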
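To see why each additional "nine" is costly, it helps to translate an availability target into the downtime it permits. The following is a small arithmetic sketch; the 30‑day window and the list of targets are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of full downtime an availability SLO permits per window."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes, while 99.99% leaves only about 4, which is often less than the time needed to detect and respond to an incident.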
Frequent small changes can accelerate feature delivery, while the error budget helps balance change velocity against reliability.
Iteratively Improving SLOs
SLOs should be revisited quarterly or at least annually to ensure they still reflect user happiness and correlate with service incidents, adjusting them as business needs evolve.
Using Strict Internal SLOs
Adopt internal SLOs stricter than external SLAs to catch problems before they incur financial penalties, and pair them with blameless post‑mortem processes.
Managing Development Pace with an Error Budget
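An error budget is the amount of unreliability the SLO permits over a defined window (e.g., 30 days); with a 99.9% SLO, up to 0.1% of requests may fail. Comparing actual failures against the budget shows whether the system is exceeding or falling short of the desired reliability. When the budget has headroom, rapid feature development is safe; when it nears zero, freeze or slow changes and focus on reliability work.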
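The error‑budget reasoning above can be sketched for a request‑based SLO as follows. The request counts are invented, and the 25% "slow down" threshold and the policy names are hypothetical choices for the example rather than a recommended standard.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the window's error budget still unspent.

    1.0 means nothing has been spent; a negative value means the SLO
    has been violated for this window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_policy(remaining, slow_down_below=0.25):
    """Map remaining budget to a change policy (thresholds are hypothetical)."""
    if remaining <= 0:
        return "freeze"
    if remaining < slow_down_below:
        return "slow"
    return "normal"


# Hypothetical 30-day window: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failures are allowed.
for failed in (400, 900, 1200):
    remaining = error_budget_remaining(0.999, 1_000_000, failed)
    print(f"{failed} failures: {remaining:+.0%} budget left -> {release_policy(remaining)}")
```

Tying the policy to a number computed from the SLI keeps the release decision mechanical instead of a negotiation during each incident.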
Google Cloud’s Operations suite provides SLO monitoring, a UI for manual configuration, an API for programmatic SLO creation, and dashboards for tracking error‑budget consumption.
Recommendations
Define and measure customer‑centric SLIs such as availability and latency.
Establish internal error budgets stricter than external SLAs, including consequences like production freezes.
Set latency SLIs that capture outliers (e.g., 90th or 99th percentile).
Review SLOs at least annually to confirm that they still correlate with user satisfaction and with actual service outages.