How Bilibili Implements SLO Engineering to Boost Service Reliability
This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application- and business-level SLIs, alerting strategies, SLO-driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery. Together, these show how reliability is systematically improved.
Overview of SLO Engineering
Bilibili has built a comprehensive SLO engineering practice that integrates basic components, metric collection, business dashboards, and SLO‑based alerts to monitor and improve service reliability.
Core Components
Basic Components: Underlying capabilities such as organization, business, language type, CI/CD metadata, unified authentication, and Prometheus‑based metric reporting, with SLO data stored in ClickHouse.
Metric Definition and Collection: Aggregates HTTP and gRPC data into service‑level and multi‑AZ metrics.
Business Dashboard: Visualizes overall availability and error‑budget burn‑down charts.
SLO Alerts: Triggered by error‑budget consumption and enriched with root‑cause analysis across application, downstream, middleware, and change events.
Choosing the Right SLI
An SLI is typically a ratio of good events to total events. Selection follows two principles: the metric should reflect stability of the target, and it should be strongly related to user experience (e.g., availability or latency rather than internal connection counts).
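The good-events-over-total-events definition can be expressed as a minimal sketch; the request counts and the 500 ms latency threshold below are illustrative, not Bilibili's actual figures:

```python
def sli(good_events: int, total_events: int) -> float:
    """An SLI as the ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # no traffic in the window: treat the target as healthy
    return good_events / total_events

# Availability SLI: requests that did not fail with a server error
availability_sli = sli(good_events=998_550, total_events=1_000_000)  # 0.99855
# Latency SLI: requests answered within an illustrative 500 ms threshold
latency_sli = sli(good_events=992_000, total_events=1_000_000)  # 0.992
```

Both examples follow the second selection principle: each counts an outcome a user directly experiences (a successful response, a fast response) rather than an internal state such as connection count.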
Application SLI
For request‑driven services, the simplest implementation uses load‑balancer metrics, but internal RPC services require additional instrumentation. The system is layered (SLB, gateway, application) to capture error count, availability, and latency at each level, ensuring full fault coverage.
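One way to use the layered metrics is to compare availability across layers and look first at the least healthy one; the structure and sample numbers below are a hypothetical sketch, not Bilibili's instrumentation:

```python
from dataclasses import dataclass

@dataclass
class LayerMetrics:
    layer: str   # "SLB", "gateway", or "application"
    total: int   # total requests observed at this layer
    errors: int  # error responses observed at this layer

    @property
    def availability(self) -> float:
        return 1.0 if self.total == 0 else 1.0 - self.errors / self.total

def least_available(layers: list[LayerMetrics]) -> LayerMetrics:
    """The layer with the lowest availability is the first place to look for a fault."""
    return min(layers, key=lambda m: m.availability)

# Illustrative counters for one measurement window
layers = [
    LayerMetrics("SLB", 1_000_000, 200),
    LayerMetrics("gateway", 995_000, 4_500),
    LayerMetrics("application", 990_000, 300),
]
worst = least_available(layers)  # the gateway stands out in this sample
```

Measuring at every layer is what gives the full fault coverage the text describes: an error that the SLB never sees (e.g. between gateway and application) still shows up in some layer's counters.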
Core Scenario SLI
Core‑scenario SLIs provide fine‑grained measurement of key business APIs, evaluating availability, error count, latency, and throughput. Metrics from multiple core scenarios are aggregated to produce business‑level indicators (e.g., App playback).
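Aggregating core scenarios into a business-level indicator can be done by summing the raw good/total counters, which implicitly weights each scenario by its traffic; the scenario names and counts below are hypothetical:

```python
def business_availability(scenarios: list[tuple[int, int]]) -> float:
    """Fold per-scenario (good, total) counters into one business-level SLI.
    Summing raw counters weights each core scenario by its request volume."""
    good = sum(g for g, _ in scenarios)
    total = sum(t for _, t in scenarios)
    return 1.0 if total == 0 else good / total

# Hypothetical "App playback" core scenarios: play-URL, view-page, danmaku APIs
playback = business_availability([
    (499_000, 500_000),
    (199_500, 200_000),
    (99_900, 100_000),
])  # 798,400 / 800,000 = 0.998
```

A deliberate alternative is to weight scenarios equally (averaging their ratios) so that a low-traffic but critical API is not drowned out by a high-traffic one; which choice fits depends on the business.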
Business SLI
Business SLIs complement technical metrics by capturing user‑visible outcomes such as order success rate or live stream count, often computed via real‑time big‑data pipelines or database binlog subscriptions.
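A business SLI such as order success rate reduces to folding a stream of state events (e.g. from a binlog subscription) into a ratio; the event shape and statuses here are assumptions for illustration:

```python
from collections import Counter

def order_success_rate(events: list[dict]) -> float:
    """Fold order-state events into a success-rate SLI.
    Each event is assumed to carry a terminal 'status' field."""
    counts = Counter(e["status"] for e in events)
    total = sum(counts.values())
    return 1.0 if total == 0 else counts["success"] / total

events = [
    {"order_id": 1, "status": "success"},
    {"order_id": 2, "status": "success"},
    {"order_id": 3, "status": "failed"},
    {"order_id": 4, "status": "success"},
]
rate = order_success_rate(events)  # 3 of 4 orders succeeded: 0.75
```

In production this fold would run continuously inside a streaming job over the binlog feed rather than over an in-memory list, but the aggregation logic is the same.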
Component SLI
Component SLIs monitor foundational services that can cause large‑scale outages: traffic‑ingress (DCDN, SLB, APIGW) for availability and latency; storage (MySQL, Redis, ES) for cluster health; and pipeline components (message queues, offline jobs) for task latency.
SLO Definition and Alerting
SLOs set target reliability levels and drive error‑budget calculations. Key decisions include the choice of time window (rolling vs. calendar‑aligned) and deriving target values from historical performance rather than aspiration. Six alerting rules are described, ranging from simple target‑error‑rate alerts to multi‑window alerts on the error‑budget consumption rate.
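The consumption-rate idea can be sketched with the standard burn-rate formula from SRE practice; the article does not give Bilibili's exact rules, so the 14× threshold and error rates below are illustrative:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the SLO's allowance.
    A burn rate of 1 exhausts the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def multiwindow_alert(long_window_error_rate: float,
                      short_window_error_rate: float,
                      slo_target: float,
                      threshold: float) -> bool:
    """Fire only when both windows burn above the threshold: the long window
    makes the alert significant, the short window confirms it is still ongoing."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold and
            burn_rate(short_window_error_rate, slo_target) >= threshold)

# Against a 99.9% SLO, a sustained 1.44% error rate burns budget ~14.4x too fast
fires = multiwindow_alert(0.0144, 0.02, slo_target=0.999, threshold=14.0)
```

The multi-window form suppresses two failure modes of simpler rules: a long-window-only alert keeps firing after the incident is over, while a short-window-only alert pages on brief blips that barely dent the budget.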
SLO‑Driven Quality Operation
Using SLOs as the primary metric unifies data collection, calculation, aggregation, and display, enabling SREs and developers to coordinate stability work more efficiently and at lower cost compared with traditional ad‑hoc monitoring.
GOC – Global Operations Center
Bilibili's GOC aims to prevent foreseeable issues, accelerate recovery of unexpected problems, and avoid recurrence. It targets a 1‑minute detection, 5‑minute localization, and 10‑minute resolution workflow for incidents.
Fault Discovery, Predefinition and Emergency Coordination
Fault discovery combines SLO alerts, business KPI drops, and customer‑feedback channels. Predefined fault scenarios group related applications and core scenarios, allowing a single alarm to trigger a coordinated response, with automatic escalation for higher‑severity alerts.
Fault Localization and Fast Recovery
Localization leverages the global availability dashboard to pinpoint affected services, drill down from business to application and core‑scenario SLIs, and use multi‑dimensional analysis (service metrics, change events, logs) to identify root causes. Fast recovery relies on multi‑active architecture across ingress, application, messaging, and storage layers, combined with auto‑scaling, rate‑limiting, and automated failover mechanisms.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.