
Technical Practices for Ensuring Stability of Bilibili’s 2025 Spring Festival Gala Live Stream

Bilibili’s engineering team built a scenario‑metadata and one‑click fault‑drill platform, implemented multi‑tier degradation, dynamic capacity planning, and extensive automated fault‑injection testing to guarantee zero‑severity incidents during the high‑traffic 2025 Spring Festival Gala live stream.

Bilibili Tech

On January 9, 2025, Bilibili announced a partnership with China Central Television to become the exclusive bullet‑screen interaction platform for the 2025 Spring Festival Gala. The technical team’s primary goal was to guarantee stable operation of the seven‑day live broadcast, especially the four‑hour golden slot, where a single code bug or third‑party service interruption could cause a system failure.

Key Challenges

Very tight development schedule with frequent requirement changes and strict deadlines.

Expected user traffic far exceeding normal levels, leading to potential traffic spikes.

Minimal tolerance for faults during the live event; hundreds of services must remain stable under high concurrency.

To address these challenges, core engineers from across departments discussed in depth how event‑level guarantees differ from daily operational guarantees, and aligned on shared protection strategies.

System Architecture and Fault‑Scenario Construction

Before each large‑scale event, fault drills are performed based on user‑function scenarios. Historically, scenario mapping relied on manual documentation, which was error‑prone and inefficient. In 2024, a scenario‑metadata platform was built to collect real‑time user operation chains, generating over 75 core scenarios and 300+ sub‑scenarios for the live broadcast.
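The core of such a platform is turning raw user‑operation events into ordered scenario chains. A minimal sketch of that grouping step, with hypothetical event fields and action names (the article does not describe the actual schema):

```python
from collections import defaultdict

def build_scenario_chains(events):
    """Group raw user-operation events into per-session action chains.

    Each event is a (session_id, timestamp, action) tuple. The result maps
    each distinct ordered chain of actions to the number of sessions that
    produced it -- the raw material from which core scenarios and
    sub-scenarios can be extracted.
    """
    sessions = defaultdict(list)
    for session_id, ts, action in events:
        sessions[session_id].append((ts, action))

    chain_counts = defaultdict(int)
    for ops in sessions.values():
        # Order each session's operations by timestamp to form a chain.
        chain = tuple(action for _, action in sorted(ops))
        chain_counts[chain] += 1
    return dict(chain_counts)
```

In practice the event stream would come from client and gateway telemetry rather than an in‑memory list, but the chain‑extraction logic has this shape.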

With complete scenario data, the team linked the fault‑drill platform to the metadata platform, enabling one‑click fault creation and execution, reducing drill time from minutes to seconds. Results are annotated directly on the platform to identify strong/weak dependencies and necessary improvements.
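One‑click drill creation amounts to expanding a scenario’s dependency list into concrete fault jobs automatically, instead of writing drill configs by hand. A sketch under assumed field names (`name`, `dependencies`, and the fault types are illustrative, not Bilibili’s actual schema):

```python
def generate_drill_plan(scenario, fault_types=("timeout", "error", "latency")):
    """Expand one scenario's dependency list into a flat list of drill jobs,
    one job per (dependency, fault type) pair, so an entire scenario can be
    drilled with a single click."""
    return [
        {"scenario": scenario["name"], "target": dep, "fault": fault}
        for dep in scenario["dependencies"]
        for fault in fault_types
    ]
```

Because the plan is derived from live metadata, newly discovered dependencies automatically gain drill coverage on the next run.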

Dynamic Updates and Risk Visibility

Real‑time link monitoring: daily scans detect new dependencies and prompt users to add corresponding drill scenarios.

Progress dashboard: shows coverage rate, drill counts per scenario, and recent trigger times.

Multi‑Level Business Degradation Strategy

A three‑tier disaster‑recovery scheme is designed for core live‑room functions:

Level 1 – Playback must be guaranteed: local cache fallback, active‑active deployment across data centers, and client‑side fallback after three timeout attempts.

Level 2 – Secondary modules: multi‑level caching for the quiz system (memory → Redis cluster → KV cluster → MySQL) and rate‑limiting for the bullet‑screen system.

Level 3 – Optional modules: minimal static configuration for the bottom panel; when unavailable, only essential features remain.

Resource Assurance and Capacity Planning

Due to a short preparation window and limited hardware procurement during the Chinese New Year, the team faced challenges such as insufficient cloud provider resources, lack of elastic capacity in some data centers, and difficulty scaling stateful services (DB, KV). Strategies included:

VPA (Vertical Pod Autoscaling) based on application profiling and historical usage.

Elastic migration of UGC transcoding to the cloud.

Multi‑active traffic shifting to balance load across data centers.

Cross‑department resource borrowing and multi‑vendor CDN provisioning with DNS‑based load balancing.
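Profile‑driven VPA boils down to sizing a container’s resource request from its historical usage distribution. A minimal sketch of that recommendation step (the percentile and headroom values are illustrative assumptions, not Bilibili’s actual policy):

```python
def recommend_vpa_request(cpu_samples_millicores, target_percentile=0.95, headroom=1.3):
    """Derive a CPU request recommendation from historical usage samples:
    take a high percentile of observed usage and multiply by a headroom
    factor -- the basic shape of profile-driven vertical autoscaling.
    """
    if not cpu_samples_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(cpu_samples_millicores)
    # Clamp the percentile index so it stays within the sample list.
    idx = min(len(ordered) - 1, int(target_percentile * len(ordered)))
    return int(ordered[idx] * headroom)
```

Sizing from a high percentile rather than the mean keeps pods from being throttled during spikes, while the headroom factor absorbs event traffic beyond what history shows.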

Component Stability Assurance

High‑load stress testing of each component (DB, KV, MQ, etc.) at 1.5‑2× expected peak.

Isolation of fault domains per scenario (live, activity, payment).

Business degradation paths: active‑active retry via API gateway, dual‑cluster fallback, async conversion, and pre‑agreed recovery windows.
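The active‑active retry path can be sketched as a gateway that retries the primary cluster and then fails over to the secondary data center. The callable interface is an assumption for illustration:

```python
def call_with_active_active(request, primary, secondary, attempts=2):
    """Gateway-style degradation path: retry the primary cluster up to
    `attempts` times, then fail over to the secondary data center before
    surfacing the error to the caller."""
    last_error = None
    for backend in (primary, secondary):
        for _ in range(attempts):
            try:
                return backend(request)
            except Exception as exc:  # remember the failure, keep degrading
                last_error = exc
    raise last_error
```

A production gateway would add timeouts, backoff, and circuit breaking per backend; the sketch shows only the failover ordering.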

Testing and Fault‑Injection Practices

The team built a full‑link dependency map and executed automated fault‑injection drills covering interface timeouts, service latency spikes, traffic surges, network outages, and container tampering. Metrics such as functional availability, response time, alerting, and recovery time were collected.

Automated drills enable frequent, efficient fault simulation, automatic recovery, end‑to‑end testing, and intelligent fault analysis.
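At its core, an automated drill flips a fault on, probes the function under test, records per‑probe success, and flips the fault off again. A minimal runner under those assumptions (the injection hook and metric names are illustrative):

```python
def run_fault_drill(call, inject_fault, probe_count=20):
    """Run one minimal injection drill: enable the fault, probe the target
    call repeatedly while recording success/failure, then restore the
    system and report functional availability under fault."""
    inject_fault(True)
    successes = 0
    try:
        for _ in range(probe_count):
            try:
                call()
                successes += 1
            except Exception:
                pass  # a failed probe counts against availability
    finally:
        inject_fault(False)  # always recover, even if probing itself breaks
    return {"probes": probe_count, "availability": successes / probe_count}
```

A full platform layers scheduling, end‑to‑end checks, and automated analysis on top, but each drill reduces to this inject–probe–recover–report loop.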

Conclusion

Through systematic multi‑round fault drills covering over 900 core business interfaces and 9,000 downstream service fault scenarios, the team identified and mitigated hundreds of stability risks, achieving zero‑severity incidents during the live broadcast. The shared technical insights provide valuable references for large‑scale, high‑concurrency live streaming events.

Live Streaming · Operations · System Reliability · High Concurrency · Capacity Planning · Fault Injection
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.