
Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

Bilibili’s SRE team, confronting rapid growth and complex systems, built a systematic stability operation that includes emergency response, incident handling, on‑call scheduling, and an Event Operations Center platform, using metrics like MTTR, MTTI and AI‑assisted automation to reduce downtime and improve reliability.

Bilibili Tech

Based on Liu Hao's presentation from the Deeplus live broadcast "Bilibili's SRE Practices for Business Stability", this article summarizes the key concepts and practical experiences of Bilibili's Site Reliability Engineering (SRE) team.

The talk begins with an overview of Bilibili's rapid growth and the increasing complexity of its systems, which leads to frequent incidents. To maintain a high baseline of stability, Bilibili has established an SRE organization that focuses on systematic stability operations, covering emergency response, incident handling, disaster recovery drills, and cultural awareness.

The presentation is divided into four main parts:

Case analysis – two real incidents (a live‑streaming event failure and a large‑scale alarm storm) are examined, highlighting problems such as insufficient alarm notification, lack of escalation, and fragmented post‑mortem processes.

Stability operation from the perspective of emergency response – defining emergency response, extending the concept to overall stability, and emphasizing the need to reduce MTTR and increase MTBF.

Core operational elements – focusing on the three pillars of people, process, and tools/platform, and describing the event lifecycle (pre‑event, event, fault, improvement).

Two operational carriers – the OnCall scheduling system and the Event Operations Center platform, which provide automated duty rotation, alarm routing, incident aggregation, noise reduction, online collaboration, and structured post‑mortems.

Key metrics introduced include MTTI (Mean Time to Identify), MTTK (Mean Time to Know), MTTF (Mean Time to Fix), and MTTV (Mean Time to Verify). Together they break the incident lifecycle into measurable phases, so that each phase can be quantified and improved separately.
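
As a minimal sketch of how such phase metrics might be computed from incident records, the Python below assumes each incident stores five timestamps. The `Incident` fields are hypothetical, not Bilibili's schema. It also treats the four metrics as consecutive phases whose means add up to MTTR, which is a common convention assumed here rather than something stated in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; each timestamp marks a phase boundary.
    started: datetime      # fault begins
    identified: datetime   # fault detected and acknowledged
    diagnosed: datetime    # cause located
    fixed: datetime        # mitigation applied
    verified: datetime     # recovery confirmed


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def phase_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Return the mean duration of each phase, in minutes."""
    result = {
        "MTTI": mean(_minutes(i.identified - i.started) for i in incidents),
        "MTTK": mean(_minutes(i.diagnosed - i.identified) for i in incidents),
        "MTTF": mean(_minutes(i.fixed - i.diagnosed) for i in incidents),
        "MTTV": mean(_minutes(i.verified - i.fixed) for i in incidents),
    }
    # Assumption: the phases are consecutive, so MTTR is their sum.
    result["MTTR"] = sum(result.values())
    return result
```

Splitting the total this way shows which phase dominates recovery time. For example, a large first phase points at alerting gaps rather than slow fixes.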

The OnCall system is built on a three‑dimensional model that links organization, business, and function so that the right duty personnel can be assigned accurately. It supports calendar‑based scheduling, primary/backup roles, and automatic shift generation, and it notifies staff over WeChat and phone, using virtual numbers to protect personal contact information.
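
The three‑dimensional lookup and automatic shift generation could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual OnCall implementation: `DutyKey`, `Rotation`, and `OnCallDirectory` are hypothetical names, shifts are assumed to be fixed‑length, and the backup is assumed to be the next person in the rotation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DutyKey:
    # The three dimensions named in the article.
    organization: str
    business: str
    function: str


@dataclass
class Rotation:
    members: list[str]    # ordered list of people in the rotation
    start: date           # first day of the first shift
    shift_days: int = 7   # assumed fixed shift length

    def on_duty(self, day: date) -> tuple[str, str]:
        """Return (primary, backup) for the given day."""
        if day < self.start:
            raise ValueError("day is before the rotation starts")
        shift_index = (day - self.start).days // self.shift_days
        n = len(self.members)
        primary = self.members[shift_index % n]
        # Assumption: the backup is whoever is primary in the next shift.
        backup = self.members[(shift_index + 1) % n]
        return primary, backup


class OnCallDirectory:
    def __init__(self) -> None:
        self._rotations: dict[DutyKey, Rotation] = {}

    def register(self, key: DutyKey, rotation: Rotation) -> None:
        self._rotations[key] = rotation

    def resolve(self, key: DutyKey, day: date) -> tuple[str, str]:
        """Find who is on duty for one (organization, business, function)."""
        return self._rotations[key].on_duty(day)
```

Because shifts are derived from the start date rather than stored, the schedule for any future day follows from the rotation order without manual entry.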

The Event Operations Center aggregates alerts, changes, complaints, and public opinion into a unified event model (base, who, when, where, what). It performs noise reduction (horizontal and vertical suppression), provides one‑click incident group creation, real‑time progress dashboards, and structured post‑mortem templates that automatically link related changes and generate actionable improvement items.
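
To make the unified event model and the two suppression directions concrete, here is a small sketch. The summary does not define these terms, so the interpretations are assumptions: "base" is read as the event source type, horizontal suppression as dropping repeats of the same event within a time window, and vertical suppression as dropping events on services whose dependency is already alerting. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    base: str       # assumed meaning: source type (alert, change, complaint, ...)
    who: str        # reporter or owning team
    when: datetime  # time the event was raised
    where: str      # affected service
    what: str       # short description, e.g. the alert name


def suppress_horizontal(events: list[Event], window: timedelta) -> list[Event]:
    """Drop repeats of the same (where, what) seen within `window`."""
    last_seen: dict[tuple[str, str], datetime] = {}
    kept = []
    for e in sorted(events, key=lambda ev: ev.when):
        key = (e.where, e.what)
        prev = last_seen.get(key)
        if prev is None or e.when - prev > window:
            kept.append(e)
        # Sliding window: a continuous storm stays suppressed.
        last_seen[key] = e.when
    return kept


def suppress_vertical(
    events: list[Event], depends_on: dict[str, set[str]]
) -> list[Event]:
    """Drop events on services whose direct dependency is also alerting."""
    alerting = {e.where for e in events}
    return [e for e in events if not (depends_on.get(e.where, set()) & alerting)]
```

Applying the horizontal pass first and the vertical pass second reduces an alarm storm to the events closest to the likely cause, which is the kind of reduction an incident group would start from.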

Challenges discussed include metadata unification across services and a shift in work mode from manual, ad‑hoc coordination to systematic, platform‑driven processes. Benefits reported are faster fault recovery, reduced on‑call fatigue, clearer responsibility assignment, and data‑driven continuous improvement.

Future directions involve leveraging AI to assist fault localization, early risk detection, and further automation of remediation.
