Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures
In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.
Non‑gossip: Bilibili outage analysis and mitigation techniques
Hello everyone, many of you have heard that Bilibili recently went down. I was actually a victim of this incident, so I will review the whole event, rationally speculate on the causes, and share some prevention techniques and insights.
Event Timeline
When Bilibili first started to fail, I was live‑streaming while writing some code. At first I didn’t notice that the bullet chat had stopped; then the chat input box disappeared entirely, which made it clear something was wrong.
After I tried toggling the chat and restarting the stream, the screen showed a disconnected‑from‑server message.
Initially I suspected my own network, but other sites loaded fine while Bilibili’s service did not, which pointed to a platform‑wide issue.
Within a few hours, users could not access any Bilibili functionality. The site first returned 404 Not Found, then 502 Bad Gateway. After about an hour some features returned, and by early morning on the 14th the service was fully restored.
Cause Speculation
Having watched a talk by Bilibili’s former technical director on high‑availability architecture, I revisited that material and formed two main hypotheses.
Guess 1: Gateway Failure
Other sites (AcFun, Jinjiang, Douban) also experienced outages at the same time, pointing to a shared public service failure. The most likely culprit is the CDN.
A CDN (Content Delivery Network) caches content at edge nodes so that most requests never reach the origin servers. If the CDN goes down, all the traffic the edge nodes were absorbing floods the origin’s gateway instead.
The gateway, acting as the traffic manager, performs load balancing, flow control, and circuit breaking. If it cannot protect itself in time, it can be knocked out by a sudden traffic surge, causing downstream services to lose their entry point.
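The flow‑control part of such a gateway is often implemented as a token bucket: requests beyond a configured rate are rejected outright instead of being queued until the gateway itself falls over. The sketch below is a minimal illustration of that idea, not Bilibili’s actual gateway code; all class and method names are made up.

```java
// TokenBucket.java — minimal token-bucket rate limiter (illustrative names).
public class TokenBucket {
    private final long capacity;       // maximum burst size
    private final double refillPerMs;  // tokens added per millisecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefill = System.currentTimeMillis();
    }

    /** Returns true if the request may pass, false if it should be rejected. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill based on elapsed time, capped at the bucket capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMs);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // caller should return 429 / shed the request
    }
}
```

The key property is that rejection is cheap: a shed request costs one comparison, while an accepted request that the backends cannot serve costs a thread, a socket, and memory until it times out.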
Guess 2: Service Avalanche
Bilibili’s system contains many inter‑dependent services. If a downstream service becomes slow due to CDN or machine failure, upstream services experience increased latency, leading to a cascading slowdown. Accumulated requests eventually cause the entire call chain to collapse, similar to a clogged toilet that can no longer accept input.
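The standard defense against this cascade is a circuit breaker: after repeated failures, upstream services stop calling the sick dependency and fail fast to a fallback, so their own threads are not tied up waiting. Below is a deliberately simplified sketch (the half‑open state is collapsed into a single probe); names are illustrative, and production systems would use a library rather than hand‑rolled code like this.

```java
// CircuitBreaker.java — minimal circuit breaker sketch (illustrative names).
import java.util.function.Supplier;

public class CircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to stay open before a probe
    private int failures = 0;
    private State state = State.CLOSED;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get(); // fail fast: no thread waits on the sick service
            }
            state = State.CLOSED; // simplified half-open: allow one probe through
            failures = 0;
        }
        try {
            T result = remote.get();
            failures = 0; // any success resets the failure count
            return result;
        } catch (RuntimeException e) {
            if (++failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

The point is that once the breaker opens, the "clogged toilet" stops receiving input: requests are answered immediately from the fallback, giving the downstream service room to recover.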
Official Explanation
The official statement cited a data‑center failure. The earlier high‑availability presentation had not emphasized disaster‑recovery or multi‑active designs, focusing instead on rate‑limiting, degradation, circuit breaking, retries, and timeout handling at the service and application layers.
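Of the service‑layer techniques the talk did cover, retries and timeouts are the ones most often combined incorrectly: a retry without a per‑attempt timeout just multiplies the load on a slow dependency. A hedged sketch of the combination, assuming nothing about Bilibili’s real code:

```java
// RetryWithTimeout.java — bounded retries with a per-attempt timeout (illustrative).
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetryWithTimeout {
    /** Run task up to maxAttempts times, giving each attempt timeoutMillis to finish. */
    public static <T> T callWithRetry(Callable<T> task, int maxAttempts, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // don't let a stuck attempt pile up
                    last = e;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            }
            throw new RuntimeException("all " + maxAttempts + " attempts failed", last);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that the attempt count is bounded; unbounded retries against a dying service are themselves a form of traffic amplification.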
Analyses from other commentators suggest that failures in Bilibili’s proprietary components, combined with cloud‑provider issues, produced a large blast radius. Restarting containers and re‑balancing load then took several hours.
Prevention Techniques
To improve service resilience, I summarized several high‑availability techniques into a mind map (image omitted). These include robust CDN strategies, circuit breakers, graceful degradation, automated failover, multi‑region deployment, and thorough monitoring.
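To make the "graceful degradation" item concrete: one common pattern is to serve the last known good value when the live dependency fails, rather than propagating the error to the user. A minimal sketch, with made‑up names and an in‑memory cache standing in for whatever store a real system would use:

```java
// Degradation.java — serve stale data instead of an error (illustrative names).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class Degradation {
    // Last known good value per key; a real system might use Redis or local disk.
    private final Map<String, String> staleCache = new ConcurrentHashMap<>();

    /** Try the live source; on failure, fall back to the last good value, then a default. */
    public String getOrDegrade(String key, Supplier<String> liveSource, String defaultValue) {
        try {
            String fresh = liveSource.get();
            staleCache.put(key, fresh); // remember the last successful response
            return fresh;
        } catch (RuntimeException e) {
            return staleCache.getOrDefault(key, defaultValue);
        }
    }
}
```

A stale homepage feed is a far better user experience than a 502 page, which is exactly the trade‑off degradation makes.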
Takeaways
First, maintain a questioning spirit: when a system fails, consider not only your own code but also third‑party libraries, components, and infrastructure.
Second, avoid becoming a “template architect” who merely repeats textbook designs; instead, gain practical experience and continuously refine system architecture.
Finally, adopt defensive programming habits and design for fault tolerance from the start, because even minor issues on large platforms can have massive impact.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.