
Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework to cut emergency incidents and embed a reliability‑first culture. The framework comprises a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms.

Bilibili Tech

Dec 1, 2023


In the first half of 2023, Bilibili experienced multiple emergency incidents caused by changes. Industry data indicates that about 70% of incidents stem from change activities. The article emphasizes that technical debt does not disappear on its own and that unmanaged changes can lead to costly outages.

Typical change‑related incidents described include:

A low‑tier (L3) service released on a non‑working day depended on an L0 core service; a code bug made the core service unavailable in two availability zones, which broke multi‑active traffic shifting.

A configuration change deployed to multiple zones without a proper gray release or observation period, causing a service outage.

A gateway change rolled out without gray‑release observation, which also led to an outage.

A capacity‑quota change that caused overload during deployment.

From these cases, the root causes are categorized as:

Non‑standard changes (e.g., weekend releases, insufficient testing).

Lack of gray‑release (rapid, multi‑zone changes without observation windows).

Poor observability during change (ignoring metrics or lacking focus on key indicators).

No rollback or mitigation plan.

Chaotic emergency response (slow detection, disorganized communication).

To address these issues, Bilibili introduced a series of safety‑production measures: change‑management policies, a preventive change platform, a control platform, emergency response processes, and internal culture promotion.

Key requirements for safe production include defining prohibited release windows, ensuring that every release is gray‑released (rolled out gradually), observable, and recoverable, and clearly mapping environments to code branches.
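A prohibited release window and an environment‑to‑branch mapping can both be expressed as small, checkable rules. The sketch below is only an illustration under assumptions: the `ProhibitedWindow` class, the example windows (weekends and a weekday evening peak), and the branch names are hypothetical and do not come from Bilibili's platform.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ProhibitedWindow:
    """A recurring weekly window in which releases are blocked (hypothetical model)."""
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday, as in datetime.weekday()
    start: time
    end: time
    reason: str

    def contains(self, moment: datetime) -> bool:
        # Both bounds are inclusive so that time.max covers the whole day.
        return (moment.weekday() in self.weekdays
                and self.start <= moment.time() <= self.end)


# Illustrative policy only; the real windows are not given in the article.
WINDOWS = [
    ProhibitedWindow(frozenset({5, 6}), time.min, time.max, "non-working day"),
    ProhibitedWindow(frozenset({0, 1, 2, 3, 4}), time(19, 0), time(23, 0),
                     "evening traffic peak"),
]

# Hypothetical environment-to-branch mapping; names are illustrative.
ENV_BRANCH = {"test": "develop", "pre": "release", "prod": "main"}


def release_block_reason(moment, windows=WINDOWS):
    """Return why a release is blocked at `moment`, or None if it is allowed."""
    for window in windows:
        if window.contains(moment):
            return window.reason
    return None


def branch_matches_env(env, branch):
    """True only if `branch` is the one mapped to `env`."""
    return ENV_BRANCH.get(env) == branch
```

A change platform could call `release_block_reason(datetime.now())` before accepting a release and reject the request with the returned reason, leaving exceptions to an explicit escape channel rather than to ad hoc judgment.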

The overall design consists of several platforms:

Change Platform – the execution layer closest to users, handling containers, gateways, configurations, and physical machines. It performs pre‑release checks (resource capacity, SLOs, diff checks, custom rules) and enforces gray‑release, observability, and rapid recovery.

Change Control Platform – a central system that defines standard change metadata, aggregates change data, provides defensive checks, and offers emergency escape channels (green channel for immediate stop‑loss, whitelist for longer exemptions).

Quality Platform – measures business and application SLIs, defines SLOs, and sets detection and blocking thresholds that trigger alerts or automatic emergency handling.

Alert/Monitoring Platform – visualizes company‑wide stability dashboards.
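The interplay between standard change metadata, defensive checks, and the two escape channels can be sketched as follows. This is a minimal illustration, not the platform's actual data model: the `ChangeRequest` fields, the two example checks, and the `evaluate` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical standard change metadata; field names are illustrative."""
    service: str
    change_type: str             # e.g. "container", "gateway", "config"
    operator: str
    zones: List[str]
    has_rollback_plan: bool
    green_channel: bool = False  # one-off bypass for immediate stop-loss


# A check returns None when it passes, or a human-readable failure reason.
Check = Callable[[ChangeRequest], Optional[str]]


def single_zone_first(change):
    if len(change.zones) > 1:
        return "first batch must target a single zone"
    return None


def rollback_plan_required(change):
    if not change.has_rollback_plan:
        return "no rollback plan attached"
    return None


@dataclass
class Decision:
    allowed: bool
    failures: List[str] = field(default_factory=list)
    bypassed_by: Optional[str] = None  # which escape channel was used, if any


def evaluate(change, checks, whitelist=frozenset()):
    """Run every defensive check, then apply escape channels to any failures."""
    failures = [r for r in (check(change) for check in checks) if r is not None]
    if not failures:
        return Decision(True)
    if change.green_channel:
        return Decision(True, failures, "green_channel")
    if change.service in whitelist:
        return Decision(True, failures, "whitelist")
    return Decision(False, failures)
```

Recording the failed checks even when a change is let through keeps bypasses auditable, which matters if green‑channel and whitelist usage is to be reviewed afterwards.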

Each platform evolves through multiple versions to improve observability (showing baseline vs. current version trends, integrating alert information) and recovery (fast rollback, escape routes, regular fault‑drill exercises).
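The gray‑release loop implied here (deploy a batch, compare the current indicator with the baseline, then continue, alert, or roll back) can be sketched as below. Everything in it is an assumption made for illustration: the error‑rate indicator, the threshold values, and the `deploy`, `rollback`, and `observe` callbacks are hypothetical, and `observe` is assumed to wait out the observation window before returning.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    ALERT = "alert"  # detection threshold crossed: notify, keep going
    BLOCK = "block"  # blocking threshold crossed: stop and roll back


def judge(baseline_error_rate, current_error_rate,
          detect_delta=0.005, block_delta=0.02):
    """Compare the current error rate with the baseline (thresholds are illustrative)."""
    delta = current_error_rate - baseline_error_rate
    if delta >= block_delta:
        return Verdict.BLOCK
    if delta >= detect_delta:
        return Verdict.ALERT
    return Verdict.OK


def gray_release(batches, deploy, rollback, observe, baseline_error_rate):
    """Deploy batch by batch; on a BLOCK verdict, roll back every deployed batch.

    `observe(batch)` is assumed to block for the observation window and then
    return the error rate measured on that batch. Returns (succeeded, log).
    """
    done = []
    log = []
    for batch in batches:
        deploy(batch)
        done.append(batch)
        verdict = judge(baseline_error_rate, observe(batch))
        log.append(f"{batch}: {verdict.value}")
        if verdict is Verdict.BLOCK:
            # Undo in reverse order so the newest batch is reverted first.
            for deployed_batch in reversed(done):
                rollback(deployed_batch)
            log.append("rolled back: " + ", ".join(reversed(done)))
            return False, log
    return True, log
```

Separating the detection threshold from the blocking threshold lets a small regression raise an alert without stopping the rollout, while a large one halts and reverts it without waiting for a human decision.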

The implementation roadmap is divided into four stages:

Minimum viable product (MVP) – core change‑control and container release platforms.

Adding control conditions – richer observability, escape mechanisms, and SLO‑based thresholds.

Expanding coverage – supporting physical‑machine releases, gateway and configuration changes.

Continuous operation – internal promotion, user feedback loops, safety‑production exams for release qualification.

In summary, the systematic enforcement of change‑management policies, combined with platform‑level observability, defensive checks, and rapid recovery mechanisms, significantly reduces change‑induced failures and fosters a culture of reliability.
