Safe Production Practices: Change Management Platform Design and Implementation at Bilibili
After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework to cut emergency incidents and embed a reliability‑first culture. The framework comprises a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms.
In the first half of 2023, Bilibili experienced multiple emergency incidents caused by changes, in line with industry data indicating that about 70% of incidents stem from change activities. The article emphasizes that technical debt does not disappear on its own and that unmanaged changes can lead to costly outages.
Typical change‑related incidents described include:
A lower‑tier L3 service that depended on an L0 core service was released on a non‑working day; a code bug made the core service unavailable across two availability zones, breaking multi‑active traffic shifting.
A configuration change was deployed to multiple zones without a proper gray release or observation, causing a service outage.
A gateway change was rolled out without gray‑release observation, also leading to an outage.
A capacity‑quota change that caused overload during deployment.
From these cases, the root causes are categorized as:
Non‑standard changes (e.g., weekend releases, insufficient testing).
Lack of gray‑release (rapid, multi‑zone changes without observation windows).
Poor observability during change (ignoring metrics or lacking focus on key indicators).
No rollback or mitigation plan.
Chaotic emergency response (slow detection, disorganized communication).
To address these issues, Bilibili introduced a series of safety‑production measures: change‑management policies, a preventive change platform, a control platform, emergency response processes, and internal culture promotion.
Key requirements for safe production include defining prohibited release windows, ensuring that every release is gradual (gray), observable, and recoverable, and clearly mapping environments to code branches.
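As a rough illustration of how a prohibited release window and an environment‑to‑branch mapping might be enforced, here is a minimal Python sketch. The window hours, the branch names, and the function names are all assumptions made for the example; the article does not state Bilibili's actual policy values or implementation.

```python
from datetime import datetime, time

# Hypothetical policy values; the article does not state the actual windows.
ALLOWED_START = time(10, 0)
ALLOWED_END = time(18, 0)

# Hypothetical environment-to-branch mapping.
ENV_BRANCH = {"test": "develop", "staging": "release", "prod": "main"}


def release_allowed(now: datetime) -> tuple[bool, str]:
    """Return (allowed, reason) for a release attempted at `now`."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False, "weekend releases are prohibited"
    if not (ALLOWED_START <= now.time() < ALLOWED_END):
        return False, "outside the weekday release window"
    return True, "within the release window"


def branch_matches_env(env: str, branch: str) -> bool:
    """Check that the branch being released is the one mapped to `env`."""
    return ENV_BRANCH.get(env) == branch


if __name__ == "__main__":
    print(release_allowed(datetime(2023, 6, 3, 11, 0)))  # a Saturday
    print(branch_matches_env("prod", "main"))
```

A check like this would typically run before any other pre‑release check, so that a non‑compliant release is rejected before it consumes review or deployment capacity.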
The overall design consists of several platforms:
Change Platform – the execution layer closest to users, handling containers, gateways, configurations, and physical machines. It performs pre‑release checks (resource capacity, SLOs, diff checks, custom rules) and enforces gray‑release, observability, and rapid recovery.
Change Control Platform – a central system that defines standard change metadata, aggregates change data, provides defensive checks, and offers emergency escape channels (green channel for immediate stop‑loss, whitelist for longer exemptions).
Quality Platform – measures business/application SLI, defines SLOs, and sets detection and blocking thresholds to trigger alerts or automatic emergency handling.
Alert/Monitoring Platform – visualizes company‑wide stability dashboards.
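The admission logic of the Change Control Platform described above can be sketched as follows. This is only an illustrative model under stated assumptions: the metadata fields, the check functions, and the ControlPlatform class are hypothetical names, not Bilibili's actual schema or API. It shows standard change metadata, a list of defensive checks, a green channel for immediate stop‑loss, and a whitelist for longer exemptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Change:
    """Standard change metadata (field names are illustrative)."""
    change_id: str
    kind: str                 # e.g. "container", "gateway", "config", "host"
    service: str
    operator: str
    has_gray_plan: bool
    has_rollback_plan: bool
    green_channel: bool = False  # one-off bypass for immediate stop-loss


# A defensive check returns a failure reason, or None if the change passes.
Check = Callable[[Change], Optional[str]]


def require_gray_plan(change: Change) -> Optional[str]:
    return None if change.has_gray_plan else "no gray-release plan"


def require_rollback_plan(change: Change) -> Optional[str]:
    return None if change.has_rollback_plan else "no rollback plan"


@dataclass
class ControlPlatform:
    checks: list[Check]
    whitelist: set[str] = field(default_factory=set)   # longer-term exemptions
    audit_log: list[str] = field(default_factory=list)

    def admit(self, change: Change) -> tuple[bool, list[str]]:
        """Decide whether a change may proceed; return (allowed, failure reasons)."""
        if change.green_channel:
            # Escape channel: skip checks, but record the bypass for later review.
            self.audit_log.append(f"{change.change_id}: green channel by {change.operator}")
            return True, []
        if change.service in self.whitelist:
            self.audit_log.append(f"{change.change_id}: whitelisted service {change.service}")
            return True, []
        failures = [r for check in self.checks if (r := check(change)) is not None]
        return not failures, failures


if __name__ == "__main__":
    platform = ControlPlatform(checks=[require_gray_plan, require_rollback_plan])
    change = Change("c1", "config", "feed", "alice",
                    has_gray_plan=False, has_rollback_plan=True)
    print(platform.admit(change))
```

Recording every bypass in an audit log is a design choice of this sketch rather than something the article specifies; it reflects the idea that escape channels should remain reviewable after the emergency.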
Each platform evolves through multiple versions to improve observability (showing baseline vs. current version trends, integrating alert information) and recovery (fast rollback, escape routes, regular fault‑drill exercises).
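To make the interplay of gray release, baseline comparison, and fast rollback concrete, the following Python sketch rolls a change out batch by batch, compares the observed error rate with the baseline version, and rolls back every deployed batch once a blocking threshold is crossed. The threshold values, the choice of error rate as the indicator, and the callback names (deploy, observe, rollback, on_alert) are assumptions for illustration only.

```python
from typing import Callable, Sequence

# Hypothetical thresholds on the error-rate increase over the baseline version.
DETECT_DELTA = 0.005   # raise an alert
BLOCK_DELTA = 0.02     # stop the release and roll back


def evaluate(baseline_error_rate: float, current_error_rate: float) -> str:
    """Classify the new version against the baseline: 'ok', 'alert', or 'block'."""
    delta = current_error_rate - baseline_error_rate
    if delta >= BLOCK_DELTA:
        return "block"
    if delta >= DETECT_DELTA:
        return "alert"
    return "ok"


def gray_release(
    batches: Sequence[str],
    deploy: Callable[[str], None],
    observe: Callable[[str], float],
    rollback: Callable[[str], None],
    baseline_error_rate: float,
    on_alert: Callable[[str], None] = print,
) -> bool:
    """Deploy batch by batch; roll back all deployed batches if one is blocked."""
    done: list[str] = []
    for batch in batches:
        deploy(batch)
        done.append(batch)
        verdict = evaluate(baseline_error_rate, observe(batch))
        if verdict == "block":
            # Roll back in reverse order, most recently deployed batch first.
            for deployed_batch in reversed(done):
                rollback(deployed_batch)
            return False
        if verdict == "alert":
            on_alert(f"error rate elevated after batch {batch}")
    return True


if __name__ == "__main__":
    rates = {"canary": 0.011, "zone-a": 0.05, "zone-b": 0.01}
    result = gray_release(
        ["canary", "zone-a", "zone-b"],
        deploy=lambda b: print("deploy", b),
        observe=rates.get,
        rollback=lambda b: print("rollback", b),
        baseline_error_rate=0.01,
    )
    print("release succeeded:", result)
```

In a real system the observe step would wait for an observation window before sampling metrics; the sketch omits the wait to stay short.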
The implementation roadmap is divided into four stages:
Minimum viable product (MVP) – core change‑control and container release platforms.
Adding control conditions – richer observability, escape mechanisms, and SLO‑based thresholds.
Expanding coverage – supporting physical‑machine releases, gateway and configuration changes.
Continuous operation – internal promotion, user feedback loops, safety‑production exams for release qualification.
In summary, the systematic enforcement of change‑management policies, combined with platform‑level observability, defensive checks, and rapid recovery mechanisms, significantly reduces change‑induced failures and fosters a culture of reliability.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.