Chaos Engineering and Fault Injection Practices at Bilibili: Architecture, Implementation, and Automation
Bilibili built a middleware‑based chaos engineering platform that injects faults into its Golang microservices via AOP. It covers server‑ and client‑side calls, databases, caches, and message queues; offers instance‑, request‑, target‑, and user‑level blast‑radius controls; and adds automated dependency collection, experiment orchestration, and CI integration to improve system reliability.
In the cloud‑native era, the number of microservices grows explosively, making service‑to‑service call graphs extremely complex and raising the bar for system reliability. This background has driven increasing interest in chaos engineering.
Chaos engineering is not new; its ideas originated around 2008 when Netflix suffered a three‑day outage due to a database failure, leading to the creation of the original ChaosMonkey project. Since then, tools such as SimianArmy, ChaosKong, Gremlin, ChaosMonkeyV2, ChaosBlade, ChaosMesh, and ChaosMeta have emerged.
Bilibili (B‑Station) started its chaos engineering journey later than many peers. In 2019, it began experimenting with fault injection using the open‑source ChaosBlade tool in offline environments. By 2021, additional platform features such as experiment management were added, but tool investment remained intermittent and business adoption uneven.
Repeated production incidents highlighted two key stability issues: disaster‑recovery failures and tangled dependencies. Bilibili therefore pursued two parallel tracks: disaster‑recovery drills (infrastructure‑level) and fault‑injection drills (business‑level). The following content focuses on the latter.
Fault injection at the business level is implemented via a middleware‑based approach built on Bilibili’s Golang microservice framework (Kratos). The design follows Aspect‑Oriented Programming (AOP) principles: request processing is intercepted by middleware that decides, based on experiment configuration, whether to inject faults such as timeouts, error codes, or HTTP status changes.
The platform supports both server‑side and client‑side injection. Server‑side injection simulates failures from the provider’s perspective, affecting downstream services, while client‑side injection simulates failures from the caller’s perspective, avoiding permission issues and limiting impact scope.
Supported component categories include:
Server: HTTP Server, gRPC Server (errors, timeouts, custom codes)
Client: HTTP Client, gRPC Client (errors, timeouts, custom codes)
Database: MySQL, TiDB, Taishan (errors, timeouts)
Cache: Redis, Memcached (errors, timeouts)
Message Queue: DataBus (send/receive errors, timeouts)
Special internal components (e.g., in‑memory aggregation tools)
The fault‑injection workflow is:
Application code imports the SDK and calls fault.Init().
The SDK registers middleware for each component.
The SDK establishes a gRPC connection with the Fault‑Service.
Users create experiment scenarios via the Fault‑Console.
Fault‑Admin stores experiment definitions in Fault‑DB/KV.
Fault‑Service pushes experiments to the SDK.
The SDK intercepts relevant flows and injects faults, reporting execution data.
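The SDK side of the workflow above might look roughly like the sketch below: Init registers one middleware per supported component and would then subscribe to experiment pushes from Fault‑Service. The registry, component names, and stubbed subscription are assumptions for illustration, not the real SDK internals.

```go
package main

// Illustrative sketch of the SDK initialization flow described above.
// The registry and component names are assumptions, not Bilibili's real SDK.

import "fmt"

// registry holds the fault-injection middleware for each component.
var registry = map[string]func(){}

// Register records a fault-injection middleware for one component.
func Register(component string, mw func()) { registry[component] = mw }

// Init wires up all supported components; in the real flow it would also
// open a gRPC stream to Fault-Service and apply pushed experiment
// definitions (stubbed out here).
func Init() []string {
	components := []string{
		"http.server", "grpc.server",
		"http.client", "grpc.client",
		"mysql", "redis", "databus",
	}
	for _, c := range components {
		c := c
		Register(c, func() { fmt.Println("middleware active:", c) })
	}
	return components
}

func main() {
	for _, c := range Init() {
		registry[c]()
	}
}
```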
To limit the blast radius of experiments, the platform offers four control granularities:
Instance level – experiments target specific service instances.
Request level – middleware decides based on request attributes (e.g., path).
Target level – detailed matching rules such as specific keys or operation types.
User‑account level – precise account matching or suffix‑based grouping.
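One way to read the four granularities is as a conjunction of match conditions, so that narrowing any dimension shrinks the blast radius. The sketch below is an assumption about how such matching could work; the field names and the suffix‑grouping rule are illustrative, not the platform's real schema.

```go
package main

// Hedged sketch of the four blast-radius checks described above.
// Field names and the suffix-grouping rule are illustrative assumptions.

import "strings"

// Rule combines the four control dimensions; empty fields match everything.
type Rule struct {
	Instances  []string // instance level: specific service instances
	PathPrefix string   // request level: e.g. request path
	TargetKey  string   // target level: e.g. a specific key or operation
	UIDSuffix  string   // user level: group accounts by ID suffix
}

// Request carries the attributes the middleware can match on.
type Request struct {
	Instance string
	Path     string
	Key      string
	UID      string
}

// Match returns true only when every configured dimension matches.
func (r Rule) Match(req Request) bool {
	if len(r.Instances) > 0 && !contains(r.Instances, req.Instance) {
		return false
	}
	if r.PathPrefix != "" && !strings.HasPrefix(req.Path, r.PathPrefix) {
		return false
	}
	if r.TargetKey != "" && req.Key != r.TargetKey {
		return false
	}
	if r.UIDSuffix != "" && !strings.HasSuffix(req.UID, r.UIDSuffix) {
		return false
	}
	return true
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	rule := Rule{PathPrefix: "/api/like", UIDSuffix: "7"}
	println(rule.Match(Request{Path: "/api/like/add", UID: "100237"})) // true: both dimensions match
}
```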
For asynchronous job scenarios (e.g., content review, like count aggregation) that lack a traditional request interface, Bilibili leverages its internally built CQRS‑based event‑governance framework (Railgun). Fault‑injection middleware is inserted at the topic‑consumer entry point, allowing fault context injection based on topic name and message content.
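Conceptually, the consumer‑side hook is the same wrapper pattern applied at the message entry point instead of an RPC handler. The sketch below is an assumption about the shape of such a hook; the topic names, handler signature, and payload filter are illustrative, not Railgun's real API.

```go
package main

// Sketch of inserting fault middleware at the topic-consumer entry point,
// as described for the Railgun framework. Handler signature and filter
// semantics are illustrative assumptions.

import (
	"errors"
	"strings"
)

// Message is a simplified consumed event.
type Message struct {
	Topic   string
	Payload string
}

// MsgHandler is the consumer's business handler.
type MsgHandler func(Message) error

// ConsumerFaultMiddleware injects an error for messages whose topic and
// payload match the experiment filter, leaving other traffic untouched.
func ConsumerFaultMiddleware(topic, contains string, next MsgHandler) MsgHandler {
	return func(m Message) error {
		if m.Topic == topic && strings.Contains(m.Payload, contains) {
			return errors.New("injected consume failure")
		}
		return next(m)
	}
}

func main() {
	h := ConsumerFaultMiddleware("like-count", "uid=42",
		func(m Message) error { return nil })
	err := h(Message{Topic: "like-count", Payload: "uid=42;op=incr"})
	println(err != nil) // true: fault injected for the matching message
}
```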
Dependency collection is integrated into the fault‑injection SDK. When a component makes an outbound call, the SDK can capture and report the dependency relationship, enabling automatic construction of dependency graphs without manual code inspection.
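Since the middleware already sits on every outbound call, dependency capture can piggyback on it: each call reports an edge that the platform aggregates into a graph. The sketch below stubs out the reporting sink; the real SDK would batch and ship these records asynchronously, and all names here are assumptions.

```go
package main

// Sketch of dependency collection piggybacked on the injection middleware:
// each outbound call reports a (component, target) edge. The in-memory
// graph stands in for the platform's real, asynchronous reporting pipeline.

import "fmt"

// Edge is one observed dependency.
type Edge struct{ Component, Target string }

// graph counts how often each edge was seen.
var graph = map[Edge]int{}

// Report records one observed dependency.
func Report(component, target string) {
	graph[Edge{component, target}]++
}

// WrapCall instruments an outbound call so dependencies are captured
// automatically, without manual code inspection.
func WrapCall(component, target string, call func() error) error {
	Report(component, target)
	return call()
}

func main() {
	_ = WrapCall("grpc.client", "account-service", func() error { return nil })
	_ = WrapCall("redis", "like:counter", func() error { return nil })
	fmt.Println(len(graph)) // two distinct edges observed
}
```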
Automation of experiments addresses the high manual effort required to run fault‑injection tests. The platform can automatically split dependencies into individual experiment tasks, trigger them, and infer strong versus weak dependencies based on error responses. OpenAPI endpoints allow CI pipelines to start/stop experiments, perform assertions, and collect results (including logs, screenshots, video recordings).
Multi‑application scenarios are supported by defining a “business scenario” that aggregates multiple “application scenarios,” each with its own set of fault targets. The platform can orchestrate combined fault injections across services with a single click.
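The scenario hierarchy might be modeled as below: a business scenario aggregates application scenarios, each carrying its own fault targets, so one trigger fans out to every target. The field names and the fan‑out logic are assumptions sketched for illustration.

```go
package main

// Sketch of the business-scenario / application-scenario hierarchy described
// above. All field names are illustrative assumptions.

import "fmt"

// FaultTarget is one fault to inject in one component.
type FaultTarget struct {
	Component string // e.g. "grpc.client"
	Fault     string // e.g. "timeout"
}

// AppScenario groups the fault targets for a single application.
type AppScenario struct {
	App     string
	Targets []FaultTarget
}

// BusinessScenario aggregates multiple application scenarios.
type BusinessScenario struct {
	Name string
	Apps []AppScenario
}

// Start fans out to every application scenario's targets and returns the
// number of injections triggered; the real platform would push each one to
// the corresponding app's SDK.
func (b BusinessScenario) Start() int {
	n := 0
	for _, a := range b.Apps {
		n += len(a.Targets)
	}
	return n
}

func main() {
	bs := BusinessScenario{
		Name: "playback-degrade",
		Apps: []AppScenario{
			{App: "player-api", Targets: []FaultTarget{{"grpc.client", "timeout"}}},
			{App: "danmaku", Targets: []FaultTarget{{"redis", "error"}}},
		},
	}
	fmt.Println("injections started:", bs.Start())
}
```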
In summary, Bilibili’s fault‑injection platform provides a comprehensive suite of capabilities—middleware‑based fault injection, fine‑grained blast‑radius control, automated dependency collection, and experiment automation—across a wide range of business lines (recommendation, playback, live streaming, etc.), significantly improving system stability.
Future work includes expanding fault‑injection coverage for additional framework components and closing the loop by integrating monitoring and logging to make chaos experiments safer and more automated.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.