Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures
In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.
Non‑gossip: Bilibili outage analysis and mitigation techniques
Hello everyone, many of you have heard that Bilibili recently went down. I was actually a victim of this incident, so I will review the whole event, rationally speculate on the causes, and share some prevention techniques and insights.
Event Timeline
When Bilibili first started to fail, I was live‑streaming while writing some code. At first I didn’t notice that the bullet chat had stopped; then the chat input box disappeared entirely, which made it clear something was wrong.
After I tried toggling the chat and restarting the stream, the screen showed a disconnected‑from‑server message.
Initially I suspected my own network, but other sites loaded fine while Bilibili’s service did not, which pointed to a platform‑wide issue.
Within a few hours, users could not access any Bilibili functionality. The site first returned 404 Not Found, then 502 Bad Gateway. After about an hour some features returned, and by early morning on the 14th the service was fully restored.
Cause Speculation
Having watched a talk by Bilibili’s former technical director on high‑availability architecture, I revisited that material and formed two main hypotheses.
Guess 1: Gateway Failure
Other sites (AcFun, Jinjiang, Douban) also experienced outages at the same time, pointing to a shared public service failure. The most likely culprit is the CDN.
A CDN (Content Delivery Network) caches content at edge nodes so that most requests never reach the origin servers. If the CDN goes down, all the traffic the edge nodes were absorbing floods the origin’s gateway instead.
The gateway, acting as the traffic manager, performs load balancing, flow control, and circuit breaking. If it cannot protect itself in time, it can be knocked out by a sudden traffic surge, causing downstream services to lose their entry point.
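The flow‑control part of such a gateway is often implemented as a token bucket: requests beyond a configured rate are rejected outright instead of being queued until the gateway itself falls over. The sketch below is a minimal illustration of that idea, not Bilibili’s actual gateway code; all class and method names are made up.

```java
// TokenBucket.java — minimal token-bucket rate limiter (illustrative names).
public class TokenBucket {
    private final long capacity;       // maximum burst size
    private final double refillPerMs;  // tokens added per millisecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefill = System.currentTimeMillis();
    }

    /** Returns true if the request may pass, false if it should be rejected. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill based on elapsed time, capped at the bucket capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMs);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // caller should return 429 / shed the request
    }
}
```

The key property is that rejection is cheap: a shed request costs one comparison, while an accepted request that the backends cannot serve costs a thread, a socket, and memory until it times out.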
Guess 2: Service Avalanche
Bilibili’s system contains many inter‑dependent services. If a downstream service becomes slow due to CDN or machine failure, upstream services experience increased latency, leading to a cascading slowdown. Accumulated requests eventually cause the entire call chain to collapse, similar to a clogged toilet that can no longer accept input.
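The standard defense against this cascade is a circuit breaker: after repeated failures, upstream services stop calling the sick dependency and fail fast to a fallback, so their own threads are not tied up waiting. Below is a deliberately simplified sketch (the half‑open state is collapsed into a single probe); names are illustrative, and production systems would use a library rather than hand‑rolled code like this.

```java
// CircuitBreaker.java — minimal circuit breaker sketch (illustrative names).
import java.util.function.Supplier;

public class CircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to stay open before a probe
    private int failures = 0;
    private State state = State.CLOSED;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get(); // fail fast: no thread waits on the sick service
            }
            state = State.CLOSED; // simplified half-open: allow one probe through
            failures = 0;
        }
        try {
            T result = remote.get();
            failures = 0; // any success resets the failure count
            return result;
        } catch (RuntimeException e) {
            if (++failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

The point is that once the breaker opens, the "clogged toilet" stops receiving input: requests are answered immediately from the fallback, giving the downstream service room to recover.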
Official Explanation
The official statement cited a data‑center failure. The earlier high‑availability presentation had not emphasized disaster‑recovery or multi‑active designs, focusing instead on rate‑limiting, degradation, circuit breaking, retries, and timeout handling at the service and application layers.
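Of the service‑layer techniques the talk did cover, retries and timeouts are the ones most often combined incorrectly: a retry without a per‑attempt timeout just multiplies the load on a slow dependency. A hedged sketch of the combination, assuming nothing about Bilibili’s real code:

```java
// RetryWithTimeout.java — bounded retries with a per-attempt timeout (illustrative).
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetryWithTimeout {
    /** Run task up to maxAttempts times, giving each attempt timeoutMillis to finish. */
    public static <T> T callWithRetry(Callable<T> task, int maxAttempts, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // don't let a stuck attempt pile up
                    last = e;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            }
            throw new RuntimeException("all " + maxAttempts + " attempts failed", last);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that the attempt count is bounded; unbounded retries against a dying service are themselves a form of traffic amplification.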
Analyses from other commentators suggest that failures in Bilibili’s proprietary components, combined with cloud‑provider issues, produced a large blast radius. Restarting containers and re‑balancing load then took several hours.
Prevention Techniques
To improve service resilience, I summarized several high‑availability techniques into a mind map (image omitted). These include robust CDN strategies, circuit breakers, graceful degradation, automated failover, multi‑region deployment, and thorough monitoring.
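To make the "graceful degradation" item concrete: one common pattern is to serve the last known good value when the live dependency fails, rather than propagating the error to the user. A minimal sketch, with made‑up names and an in‑memory cache standing in for whatever store a real system would use:

```java
// Degradation.java — serve stale data instead of an error (illustrative names).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class Degradation {
    // Last known good value per key; a real system might use Redis or local disk.
    private final Map<String, String> staleCache = new ConcurrentHashMap<>();

    /** Try the live source; on failure, fall back to the last good value, then a default. */
    public String getOrDegrade(String key, Supplier<String> liveSource, String defaultValue) {
        try {
            String fresh = liveSource.get();
            staleCache.put(key, fresh); // remember the last successful response
            return fresh;
        } catch (RuntimeException e) {
            return staleCache.getOrDefault(key, defaultValue);
        }
    }
}
```

A stale homepage feed is a far better user experience than a 502 page, which is exactly the trade‑off degradation makes.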
Takeaways
First, maintain a questioning spirit: when a system fails, consider not only your own code but also third‑party libraries, components, and infrastructure.
Second, avoid becoming a “template architect” who merely repeats textbook designs; instead, gain practical experience and continuously refine system architecture.
Finally, adopt defensive programming habits and design for fault tolerance from the start, because even minor issues on large platforms can have massive impact.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.