How Bilibili Implements SLO Engineering to Boost Service Reliability
This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application- and business-level SLIs, alerting strategies, SLO-driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery. Together, these show how reliability is systematically improved.
Overview of SLO Engineering
Bilibili has built a comprehensive SLO engineering practice that integrates basic components, metric collection, business dashboards, and SLO‑based alerts to monitor and improve service reliability.
Core Components
Basic Components: Underlying capabilities such as organization, business, language type, CI/CD metadata, unified authentication, and Prometheus‑based metric reporting, with SLO data stored in ClickHouse.
Metric Definition and Collection: Aggregates HTTP and gRPC data into service‑level and multi‑AZ metrics.
Business Dashboard: Visualizes overall availability and error‑budget burn‑down charts.
SLO Alerts: Triggered by error‑budget consumption and enriched with root‑cause analysis across application, downstream, middleware, and change events.
Choosing the Right SLI
An SLI is typically a ratio of good events to total events. Selection follows two principles: the metric should reflect stability of the target, and it should be strongly related to user experience (e.g., availability or latency rather than internal connection counts).
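The good-events-over-total-events definition can be expressed as a minimal sketch; the request counts and the 500 ms latency threshold below are illustrative, not Bilibili's actual figures:

```python
def sli(good_events: int, total_events: int) -> float:
    """An SLI as the ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # no traffic in the window: treat the target as healthy
    return good_events / total_events

# Availability SLI: requests that did not fail with a server error
availability_sli = sli(good_events=998_550, total_events=1_000_000)  # 0.99855
# Latency SLI: requests answered within an illustrative 500 ms threshold
latency_sli = sli(good_events=992_000, total_events=1_000_000)  # 0.992
```

Both examples follow the second selection principle: each counts an outcome a user directly experiences (a successful response, a fast response) rather than an internal state such as connection count.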
Application SLI
For request‑driven services, the simplest implementation uses load‑balancer metrics, but internal RPC services require additional instrumentation. The system is layered (SLB, gateway, application) to capture error count, availability, and latency at each level, ensuring full fault coverage.
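One way to use the layered metrics is to compare availability across layers and look first at the least healthy one; the structure and sample numbers below are a hypothetical sketch, not Bilibili's instrumentation:

```python
from dataclasses import dataclass

@dataclass
class LayerMetrics:
    layer: str   # "SLB", "gateway", or "application"
    total: int   # total requests observed at this layer
    errors: int  # error responses observed at this layer

    @property
    def availability(self) -> float:
        return 1.0 if self.total == 0 else 1.0 - self.errors / self.total

def least_available(layers: list[LayerMetrics]) -> LayerMetrics:
    """The layer with the lowest availability is the first place to look for a fault."""
    return min(layers, key=lambda m: m.availability)

# Illustrative counters for one measurement window
layers = [
    LayerMetrics("SLB", 1_000_000, 200),
    LayerMetrics("gateway", 995_000, 4_500),
    LayerMetrics("application", 990_000, 300),
]
worst = least_available(layers)  # the gateway stands out in this sample
```

Measuring at every layer is what gives the full fault coverage the text describes: an error that the SLB never sees (e.g. between gateway and application) still shows up in some layer's counters.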
Core Scenario SLI
Core‑scenario SLIs provide fine‑grained measurement of key business APIs, evaluating availability, error count, latency, and throughput. Metrics from multiple core scenarios are aggregated to produce business‑level indicators (e.g., App playback).
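Aggregating core scenarios into a business-level indicator can be done by summing the raw good/total counters, which implicitly weights each scenario by its traffic; the scenario names and counts below are hypothetical:

```python
def business_availability(scenarios: list[tuple[int, int]]) -> float:
    """Fold per-scenario (good, total) counters into one business-level SLI.
    Summing raw counters weights each core scenario by its request volume."""
    good = sum(g for g, _ in scenarios)
    total = sum(t for _, t in scenarios)
    return 1.0 if total == 0 else good / total

# Hypothetical "App playback" core scenarios: play-URL, view-page, danmaku APIs
playback = business_availability([
    (499_000, 500_000),
    (199_500, 200_000),
    (99_900, 100_000),
])  # 798,400 / 800,000 = 0.998
```

A deliberate alternative is to weight scenarios equally (averaging their ratios) so that a low-traffic but critical API is not drowned out by a high-traffic one; which choice fits depends on the business.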
Business SLI
Business SLIs complement technical metrics by capturing user‑visible outcomes such as order success rate or live stream count, often computed via real‑time big‑data pipelines or database binlog subscriptions.
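A business SLI such as order success rate reduces to folding a stream of state events (e.g. from a binlog subscription) into a ratio; the event shape and statuses here are assumptions for illustration:

```python
from collections import Counter

def order_success_rate(events: list[dict]) -> float:
    """Fold order-state events into a success-rate SLI.
    Each event is assumed to carry a terminal 'status' field."""
    counts = Counter(e["status"] for e in events)
    total = sum(counts.values())
    return 1.0 if total == 0 else counts["success"] / total

events = [
    {"order_id": 1, "status": "success"},
    {"order_id": 2, "status": "success"},
    {"order_id": 3, "status": "failed"},
    {"order_id": 4, "status": "success"},
]
rate = order_success_rate(events)  # 3 of 4 orders succeeded: 0.75
```

In production this fold would run continuously inside a streaming job over the binlog feed rather than over an in-memory list, but the aggregation logic is the same.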
Component SLI
Component SLIs monitor foundational services that can cause large‑scale outages: traffic‑ingress (DCDN, SLB, APIGW) for availability and latency; storage (MySQL, Redis, ES) for cluster health; and pipeline components (message queues, offline jobs) for task latency.
SLO Definition and Alerting
SLOs set target reliability levels and drive error‑budget calculations. Key decisions include the choice of time window (rolling vs. calendar‑aligned) and deriving target values from historical performance rather than aspiration. Six alerting rules are described, ranging from simple target‑error‑rate alerts to multi‑window alerts on the error‑budget consumption rate.
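The consumption-rate idea can be sketched with the standard burn-rate formula from SRE practice; the article does not give Bilibili's exact rules, so the 14× threshold and error rates below are illustrative:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the SLO's allowance.
    A burn rate of 1 exhausts the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def multiwindow_alert(long_window_error_rate: float,
                      short_window_error_rate: float,
                      slo_target: float,
                      threshold: float) -> bool:
    """Fire only when both windows burn above the threshold: the long window
    makes the alert significant, the short window confirms it is still ongoing."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold and
            burn_rate(short_window_error_rate, slo_target) >= threshold)

# Against a 99.9% SLO, a sustained 1.44% error rate burns budget ~14.4x too fast
fires = multiwindow_alert(0.0144, 0.02, slo_target=0.999, threshold=14.0)
```

The multi-window form suppresses two failure modes of simpler rules: a long-window-only alert keeps firing after the incident is over, while a short-window-only alert pages on brief blips that barely dent the budget.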
SLO‑Driven Quality Operation
Using SLOs as the primary metric unifies data collection, calculation, aggregation, and display, enabling SREs and developers to coordinate stability work more efficiently and at lower cost compared with traditional ad‑hoc monitoring.
GOC – Global Operations Center
Bilibili's GOC aims to prevent foreseeable issues, accelerate recovery of unexpected problems, and avoid recurrence. It targets a 1‑minute detection, 5‑minute localization, and 10‑minute resolution workflow for incidents.
Fault Discovery, Predefinition and Emergency Coordination
Fault discovery combines SLO alerts, business KPI drops, and customer‑feedback channels. Predefined fault scenarios group related applications and core scenarios, allowing a single alarm to trigger a coordinated response, with automatic escalation for higher‑severity alerts.
Fault Localization and Fast Recovery
Localization leverages the global availability dashboard to pinpoint affected services, drill down from business to application and core‑scenario SLIs, and use multi‑dimensional analysis (service metrics, change events, logs) to identify root causes. Fast recovery relies on multi‑active architecture across ingress, application, messaging, and storage layers, combined with auto‑scaling, rate‑limiting, and automated failover mechanisms.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.