Tag

SLI

1 view collected around this technical thread.

Xiaokun's Architecture Exploration Notes
Jun 1, 2025 · Operations

Understanding SLA, SLO, and SLI: Key Metrics for High‑Availability Systems

This article explains the differences between SLA, SLO, and SLI, shows how to express user expectations as concrete service level agreements, and introduces essential high‑availability metrics such as availability percentages, MTBF, MTTR, RPO, RTO, WRT, and MTD for reliable system design.

High Availability · SLA · SLI
0 likes · 9 min read
Efficient Ops
Mar 4, 2025 · Operations

Mastering SRE: How to Define SLIs, SLOs, and Build Reliable Cloud‑Native Systems

This article explains how SRE teams should collaboratively define Service Level Indicators, Objectives, and Agreements, and then covers reliability, performance, observability signals, error budgeting, risk management, incident handling, and the engineering work needed to build robust cloud‑native platforms.

Error Budget · Observability · SLI
0 likes · 13 min read
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLIs/SLOs and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
0 likes · 14 min read
DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLO, SLI, SLA, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

DevOps · Google · SLI
0 likes · 4 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Observability · Reliability
0 likes · 12 min read
Architects Research Society
Aug 24, 2022 · Operations

Choosing Appropriate SLIs and Defining SLOs for Reliable Services

This guide explains how to select suitable service‑level indicators (SLIs), define customer‑centric service‑level objectives (SLOs), use error budgets, and iteratively improve reliability for various system types such as services, data processing, and storage, with practical recommendations for Google Cloud environments.

Error Budget · Reliability · SLI
0 likes · 10 min read
DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Oncall · Reliability · SLI
0 likes · 21 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · SLI · SLO
0 likes · 21 min read
IT Architects Alliance
Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden monitoring metrics, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · SLI · SLO
0 likes · 15 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Error Budget · SLI · SLO
0 likes · 7 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

SLA · SLI · SLO
0 likes · 11 min read