
Design and Implementation of Bilibili's Change Control Platform

Bilibili's Change Prevention Platform consolidates change data from more than 60 systems and proactively detects and blocks more than 100 risky changes every day. It reduces change-related incidents through a four-pillar framework (technical support, technical landing, cross-domain enablement, and organizational culture) and is evolving toward AI-driven, end-to-end change defense.

Bilibili Tech

Roughly 70% of incidents are caused by changes, and Bilibili is no exception. After several change-induced incidents, Bilibili built a Change Prevention Platform that covers technical support, implementation, cross-domain enablement, and organizational culture, with the goal of shifting from reactive to proactive defense.

The platform now integrates with more than 60 systems and 400 scenarios, executing over 1,000 change checks daily and intercepting more than 100 potential failures each day, resulting in a noticeable drop in change‑related incidents.

Background: Changes account for 60-75% of incidents. External pressures (a rising number of industry-wide failures, plus the complexity of cloud-native and microservice architectures) and internal pressures (the load on reliability teams) drive the need for proactive change control.

Strategic Questions: How should the strategy be formulated, and how should its value be measured? The answer is to reduce the number of changes and improve their quality, tracked with both qualitative and quantitative metrics.

Design Principles: The solution is structured into four pillars: technical support, technical landing (implementation), cross-domain enablement, and organizational culture.

Technical Support: Build a change meta-model and a change instance model on top of data sources such as the service tree, the CMDB, and observability data. Provide capabilities such as change hosting, perception, control, value visualization, and open APIs.
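To make the split between a meta-model and an instance model more concrete, here is a minimal sketch. It assumes the meta-model describes a kind of change registered once per scenario, and the instance model records one concrete change that conforms to it. All class names, field names, and sample values (`ChangeMeta`, `ChangeInstance`, `required_fields`, `config-center`, and so on) are hypothetical; the article does not publish the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ChangeStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    ROLLED_BACK = "rolled_back"


@dataclass(frozen=True)
class ChangeMeta:
    """Meta-model (hypothetical): describes a kind of change, registered once."""
    platform: str                 # system that originates the change
    scenario: str                 # scenario name within that system
    change_type: str              # e.g. "code", "config", "traffic"
    required_fields: tuple = ()   # attributes every instance must carry


@dataclass
class ChangeInstance:
    """Instance model (hypothetical): one concrete change event."""
    meta: ChangeMeta
    service_tree_path: str        # where the change lands in the service tree
    operator: str
    started_at: datetime
    status: ChangeStatus = ChangeStatus.PENDING
    attributes: dict = field(default_factory=dict)

    def missing_fields(self):
        """Return required attributes that this instance did not report."""
        return [name for name in self.meta.required_fields
                if name not in self.attributes]


# Register one scenario, then record a change that conforms to it.
config_publish = ChangeMeta(
    platform="config-center",
    scenario="config_publish",
    change_type="config",
    required_fields=("config_key", "batch"),
)
instance = ChangeInstance(
    meta=config_publish,
    service_tree_path="demo.department.app",
    operator="alice",
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    attributes={"config_key": "feature.flag"},
)
print(instance.missing_fields())  # ['batch']
```

Keeping the required fields on the meta-model means an integrating system can be validated at registration time, while each instance stays a plain record that is cheap to aggregate and search.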

Technical Landing: Apply the capabilities to code changes, configuration changes, traffic routing, resource control, job scheduling, and business configuration switches.

Cross-Domain Enablement: Use change capabilities for incident response, fault injection, and AI-assisted query/analysis.
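One plausible way change data supports incident response is a lookup of the changes that finished shortly before an incident on the affected services. The sketch below illustrates that idea only. The `ChangeRecord` type, the `suspect_changes` function, and the 30-minute lookback window are assumptions, not details from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical flattened change record, as a perception layer might store it."""
    change_id: str
    service: str
    finished_at: datetime


def suspect_changes(records, affected_services, incident_at,
                    lookback=timedelta(minutes=30)):
    """Return changes on affected services that finished within the lookback
    window before the incident, most recent first."""
    window_start = incident_at - lookback
    candidates = [
        r for r in records
        if r.service in affected_services
        and window_start <= r.finished_at <= incident_at
    ]
    return sorted(candidates, key=lambda r: r.finished_at, reverse=True)


records = [
    ChangeRecord("c1", "svc-a", datetime(2024, 1, 1, 11, 50)),
    ChangeRecord("c2", "svc-b", datetime(2024, 1, 1, 11, 58)),
    ChangeRecord("c3", "svc-a", datetime(2024, 1, 1, 10, 0)),   # too old
    ChangeRecord("c4", "svc-c", datetime(2024, 1, 1, 11, 59)),  # unaffected service
]
suspects = suspect_changes(records, {"svc-a", "svc-b"}, datetime(2024, 1, 1, 12, 0))
print([r.change_id for r in suspects])  # ['c2', 'c1']
```

Ordering by recency is a simple heuristic. A real system would likely also weigh dependency relationships and change type, which this sketch leaves out.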

Organizational Culture: Establish safety committees, assign stability owners, define production-safety requirements, and promote a red-line culture through training and events.

Core Modules:

Change Definition – unified description and process models.

Change Hosting – platform and scenario information registration.

Change Perception – aggregation, cleaning, and retrieval of change data.

Change Analysis – fault localization using change data.

Change Control – generic and custom check items (e.g., activity isolation, SLO checks, batch checks) with block, warning, and escalation strategies.
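The Change Control module can be pictured as a list of check items, each tied to a strategy that applies when the check is violated. The following is a minimal sketch under stated assumptions. The `Action` ordering (block is the most severe), the thresholds, the `ChangeRequest` fields, and the reading of "activity isolation" as protection during a major event are all hypothetical and are not taken from the platform itself.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Action(IntEnum):
    """Strategies ordered by assumed severity."""
    PASS = 0
    WARN = 1
    ESCALATE = 2
    BLOCK = 3


@dataclass(frozen=True)
class ChangeRequest:
    service: str
    batch_percent: int         # share of instances touched by this batch
    slo_error_rate: float      # current error rate of the target service
    in_activity_window: bool   # True while a protected event is running


@dataclass(frozen=True)
class CheckItem:
    name: str
    violated: Callable[[ChangeRequest], bool]
    on_violation: Action


def evaluate(change, checks):
    """Run every check and return the most severe action plus all findings."""
    findings = [(c.name, c.on_violation) for c in checks if c.violated(change)]
    decision = max((action for _, action in findings), default=Action.PASS)
    return decision, findings


# Illustrative thresholds only.
DEFAULT_CHECKS = [
    CheckItem("activity_isolation", lambda c: c.in_activity_window, Action.BLOCK),
    CheckItem("slo_check", lambda c: c.slo_error_rate > 0.01, Action.ESCALATE),
    CheckItem("batch_check", lambda c: c.batch_percent > 20, Action.WARN),
]

request = ChangeRequest("svc-a", batch_percent=50, slo_error_rate=0.001,
                        in_activity_window=False)
decision, findings = evaluate(request, DEFAULT_CHECKS)
print(decision.name, [name for name, _ in findings])  # WARN ['batch_check']
```

Modeling each check as data (a name, a predicate, and a strategy) is one way to support both generic and custom check items: a scenario can append its own `CheckItem` entries without touching the evaluation loop.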

Practice Cases: Three concrete scenarios illustrate how the platform reduces deployment errors, improves configuration management, and provides real-time visibility during releases.

Evolution: The system has gone through three major iterations: first change hosting and perception, then codified control measures, and finally an intelligent, data-driven defense mechanism.

Lessons Learned: Design thoroughly, choose the right entry points, balance efficiency with effectiveness, keep the platform practical, and pursue incremental progress.

Future Outlook: Aim for full business coverage, lower integration costs, introduce intelligent detection, ensure effective rollback and recovery, and build a sustainable, data- and AI-driven change defense mechanism.

Tags: platform engineering, Automation, DevOps, change management, Reliability, Bilibili, incident prevention