ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions
This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.
Chaos engineering uses deliberate fault injection to expose weaknesses in distributed systems before they cause outages, improving stability; with the rise of microservices and cloud-native architectures, its importance has grown dramatically.
Industry leaders such as Netflix (Chaos Monkey), Alibaba (ChaosBlade), PingCAP (Chaos Mesh), and Gremlin have pioneered chaos engineering, publishing seminal works and open‑source tools.
ByteDance’s practice has evolved through three generations. The first generation was a simple disaster‑recovery platform focused on network‑level fault injection and basic threshold‑based metric analysis. The second generation introduced a fault‑center with a declarative fault model, separating fault implementation from experiment orchestration, and added a more extensible architecture. The third generation added automated metric observation, machine‑learning‑based anomaly detection, red‑blue gameday exercises, and strong/weak dependency analysis.
The fault model defines four elements: Target (the microservice under test), Scope Filter (the blast radius), Dependency (the downstream component to disturb), and Action (the concrete fault). Example specifications are shown below:
spec  // microservice A: 10% of instances in cluster1 experience CPU saturation
  .target("application A")
  .cluster_scope_filter("cluster1")
  .percent_scope_filter("10%")
  .dependency("cpu")
  .action("cpu_burn")
  .end_at("2020-04-19 13:36:23")
spec  // microservice B: downstream service C experiences a 200 ms delay
  .target("application B")
  .cluster_scope_filter("cluster2")
  .dependency("application C")
  .action("delay, 200ms")
  .end_at("2020-04-19 13:36:23")

The fault center consists of three core components, mirroring Kubernetes design principles: the API Server exposes the declarative interface, backed by etcd; the Scheduler parses fault declarations and discovers the target instances; and the Controller translates actions into agent commands or middleware API calls.
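To make the division of labor concrete, here is a minimal Python sketch of the declarative fault model and the Controller's translation step. This is illustrative only, not ByteDance's actual implementation; the class, field, and command names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class FaultSpec:
    """Declarative fault description: who, where, what, and until when."""
    target: str                    # microservice under test
    scope_filters: Dict[str, str]  # blast-radius constraints (cluster, percent, ...)
    dependency: str                # downstream component to disturb
    action: str                    # concrete fault, e.g. "cpu_burn" or "delay, 200ms"
    end_at: str                    # automatic recovery deadline

def translate(spec: FaultSpec) -> Dict[str, str]:
    """Toy Controller step: map a declarative action onto an agent command.

    The command strings are hypothetical placeholders for whatever the
    real agent protocol looks like.
    """
    commands = {
        "cpu_burn": "agent exec cpu-burn --all-cores",
        "delay": "agent exec tc-delay --ms {arg}",
    }
    name, _, arg = spec.action.partition(", ")
    cmd = commands[name].format(arg=arg.rstrip("ms"))
    return {"target": spec.target, "command": cmd, "until": spec.end_at}
```

In this sketch the Scheduler would resolve `scope_filters` into concrete instances, and the Controller would call `translate` per instance, which is why keeping the spec declarative decouples experiment orchestration from fault implementation.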
Experiment selection follows four principles: from offline to production environments; from small to large fault scopes; from past-incident faults to future-scenario faults; and from weekday to weekend (or any-time) execution.
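These principles describe a one-step-at-a-time escalation. The sketch below encodes them as an ordered rollout plan; the stage values are assumptions chosen for illustration, not ByteDance's actual schedule.

```python
# Each stage widens exactly one risk dimension relative to the previous one.
STAGES = [
    {"env": "offline",    "percent": 5,  "scenario": "past-incident", "window": "weekday"},
    {"env": "offline",    "percent": 50, "scenario": "past-incident", "window": "weekday"},
    {"env": "production", "percent": 5,  "scenario": "past-incident", "window": "weekday"},
    {"env": "production", "percent": 25, "scenario": "future",        "window": "weekend"},
]

def next_stage(current: int) -> dict:
    """Advance one step at a time; never jump straight to the riskiest stage."""
    return STAGES[min(current + 1, len(STAGES) - 1)]
```

The point of the structure is that an experiment only graduates to the next stage after the current one passes, so a regression is caught at the smallest blast radius that can reveal it.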
Future work focuses on infrastructure-level chaos, including building an IaaS layer with OpenStack to inject faults at the virtualization level, fully automated random experiments driven by red-blue defense targets, and intelligent fault diagnosis by correlating large-scale fault-metric datasets.
In conclusion, chaos engineering has become essential for building resilient services; ByteDance invites the community to collaborate and advance the practice toward more robust infrastructure and automated reliability engineering.