How KuJiaLe Built a Chaos Engineering Platform to Boost System Resilience
This article details KuJiaLe's journey from monolithic to micro‑service architecture, the stability challenges encountered, and how they designed and deployed a ChaosBlade‑based fault‑injection platform that improves fault tolerance, accelerates incident response, and enhances overall user experience.
Background
Since 2015, KuJiaLe has been transitioning its overall technical architecture from a monolithic application to a micro‑service, containerized, middle‑platform model. This evolution introduced three prominent problems:
Frequent stability failures caused by large‑scale architectural change.
Traditional process controls (code review, release gating) and stability checks that could no longer meet reliability needs.
A need to verify that service governance, monitoring and alerting, and the DevOps infrastructure actually work when incidents occur.
In the past two years, failures have caused direct or indirect losses exceeding one million RMB and damaged the company’s reputation.
Analysis
In complex distributed systems, failures cannot be fully prevented; the goal is therefore to identify risks before abnormal behaviour is triggered in production. A fault‑drill platform, driven by fault‑injection scenarios, uncovers hidden architectural risks, validates the completeness of the infrastructure, limits the blast radius of faults, and provides both pre‑emptive prevention and real‑time mitigation.
The platform aims to:
Improve system fault tolerance and robustness.
Increase development and operations emergency response efficiency.
Expose problems early to reduce online failure frequency and recurrence.
Enhance user experience.
Technology Selection
KuJiaLe’s stack is primarily Java‑based, emphasizing low cost, high ROI, and reusability. After evaluating open‑source solutions, the team selected ChaosBlade as the core infrastructure and integrated self‑developed standards and specifications to productise the fault‑drill platform.
ChaosBlade, an open‑source tool from Alibaba, follows chaos‑engineering principles and offers:
Rich applicable scenarios (basic resources, Docker containers, cloud‑native platforms, Java applications).
Active community and abundant documentation.
Low entry cost.
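As a rough illustration of that low entry cost, a typical ChaosBlade experiment is a single CLI call. The sketch below runs in dry-run mode by default (commands are printed, not executed); the fault parameters are illustrative, not taken from KuJiaLe's setup.

```shell
#!/bin/sh
# Dry-run sketch of the ChaosBlade CLI: BLADE defaults to "echo blade",
# so commands are printed rather than executed; set BLADE=blade on a host
# with the ChaosBlade agent installed to run them for real.
BLADE="${BLADE:-echo blade}"

# Inject a CPU load fault: drive CPU utilisation up and auto-recover after 60s.
$BLADE create cpu fullload --cpu-percent 80 --timeout 60

# A real run prints a JSON result containing an experiment UID;
# "blade destroy <uid>" ends the experiment early.
```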
Tool Comparison
| Capability | chaos-mesh | chaosmonkey | chaosblade | chaoskube | litmus |
| --- | --- | --- | --- | --- | --- |
| Platform Support | K8S | VM/Container | JVM/Container/K8S | K8S | K8S |
| CPU | No | No | Yes | No | No |
| MEM | No | No | Yes | No | No |
| Container | No | Yes | Yes | No | Yes |
| K8S Pod | Yes | No | Yes | Partial | Partial |
| Network | Yes | No | Yes | No | Yes |
| Disk | Yes | No | Yes | No | No |
Fault‑Drill Platform Design
Business Process
The fault‑drill workflow consists of three major stages:
Drill Review
Drill Execution
Drill Retrospective and Improvement
Architecture
The platform, named Ares, integrates with internal CMDB (for host authentication), ticket system (for workflow approval), SaltStack execution engine (for fault injection), and monitoring/alerting systems to form a product‑oriented, platform‑centric architecture.
Modules
Fault Types: Application‑level and infrastructure‑level faults, plus custom SOA and MicroTask faults specific to KuJiaLe.
Permission Management
Host permissions are verified against the CMDB.
Executor permissions are approved via corporate WeChat.
Drill Auditing records execution logs for security audit.
Task Orchestration automates multi‑step, cross‑application, cross‑service, and cross‑database workflows required for comprehensive fault drills.
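A multi-step drill of this kind can be sketched as a sequence of inject → observe → recover steps. The script below is a hypothetical illustration using ChaosBlade CLI commands, not the platform's actual orchestration engine; the fault targets are placeholders, and it runs in dry-run mode by default.

```shell
#!/bin/sh
# Hypothetical multi-step drill script: each step injects one fault,
# observes, then recovers before moving on. Dry-run by default (commands
# are echoed); set BLADE=blade to execute on a prepared host.
BLADE="${BLADE:-echo blade}"

run_step() {
  echo "== step: $1"; shift
  $BLADE create "$@"      # a real run captures the experiment UID from the JSON result
  sleep 1                 # observation window, shortened for the sketch
  # $BLADE destroy <uid>  # recover using the captured UID
}

# Cross-service sequence: a network fault on the app tier, then a disk
# fault on the database tier (targets are placeholders).
run_step "3s network delay on the app host" network delay --time 3000 --interface eth0
run_step "fill 10 GB on the DB data disk"   disk fill --path /data --size 10240
```

In a real orchestration engine each step would also gate on monitoring checks before proceeding, rather than a fixed sleep.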
Implementation and Practice
Landing Strategy
Adopt a "single‑point blast" approach, limiting experiments to the smallest scope: single service, single host, single cluster, or single data center, to gain team acceptance and uncover most issues early.
Dependency Governance
Dependencies are classified into infrastructure (KVM, containers), basic services (databases, message queues), and third‑party services. Accurate dependency graphs are built through architecture‑aware monitoring that captures process‑level call relationships and visualises them across servers, containers, and processes.
From Offline to Production
Experiments progress from test environment → pre‑release → preview with targeted traffic → production clusters, ensuring safety while validating real‑world impact.
Precise Full‑Link Fault Injection
By leveraging the full‑link observability map, faults can be injected at any node with fine‑grained control over affected customer IDs, regions, device types, or APIs, minimizing risk while testing resilience.
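At the JVM layer, ChaosBlade can already scope a fault down to a single class and method; the finer customer-ID, region, and device filters described above are KuJiaLe-internal extensions, not stock ChaosBlade flags. A minimal sketch (class, method, and process names are placeholders; dry-run by default):

```shell
#!/bin/sh
# Dry-run sketch (commands are echoed; set BLADE=blade to execute).
BLADE="${BLADE:-echo blade}"

# Delay one Java method by 3s inside a single service process. The class,
# method, and process names are illustrative placeholders.
$BLADE create jvm delay --time 3000 \
  --classname com.example.OrderService --methodname submit \
  --process order-service
```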
Results and Benefits
The platform has been in production for one year, executing over 100 drills and uncovering more than 50 issues.
Practice 1: Validate Infrastructure Completeness
Attack: Application hang.
Result: During the hang, monitoring dashboards showed clear data gaps but no alerts fired; after recovery, the dashboards did not flag the anomaly either, exposing a gap in alert coverage.
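A hang like this can be simulated by pausing the application's process; ChaosBlade's `process stop` scenario (SIGSTOP, resumed on destroy) is the stock experiment closest to what the article describes, with the process name below as a placeholder (dry-run by default):

```shell
#!/bin/sh
# Dry-run sketch (commands are echoed; set BLADE=blade to execute).
BLADE="${BLADE:-echo blade}"

# Pause (SIGSTOP) the application process to simulate a hang; destroying
# the experiment resumes it (SIGCONT). "order-service" is a placeholder.
$BLADE create process stop --process order-service
```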
Practice 2: Validate Application Behaviour Under Attack
Attack: Disk fill.
Observed metrics: application QPS, CPU, disk usage, and others; the drill surfaced abnormal spikes in seemingly unrelated indicators.
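The disk-fill attack maps to a stock ChaosBlade scenario; the path and size below are placeholders for the drill target (dry-run by default):

```shell
#!/bin/sh
# Dry-run sketch (commands are echoed; set BLADE=blade to execute).
BLADE="${BLADE:-echo blade}"

# Fill 10 GB (--size is in MB) under /data, auto-recovering after 10
# minutes; the path and size are placeholders for the drill target.
$BLADE create disk fill --path /data --size 10240 --timeout 600
```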
Vision
Integrate random fault drills into CI/CD pipelines to accumulate fault‑indicator data, validate emergency plans, refine architecture, and foster a chaos‑engineering culture.
Automate strong/weak dependency extraction by injecting faults, monitoring full‑link metric anomalies, and inferring dependency strength.
Simulate diverse real‑world failure events using historical data and big‑data analysis to generate realistic chain‑link faults.
Scale coverage to core online services with precise full‑link fault injection and red‑blue team exercises.
Collect fault‑pattern relationships to support future intelligent fault diagnosis.
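The strong/weak dependency extraction idea above could be sketched as a loop: inject a fault toward each downstream dependency in turn and classify it by whether the service's core metric stays healthy. Everything here is hypothetical — the dependency names are placeholders and `check_core_metric` is a stub standing in for a real query against the monitoring system.

```shell
#!/bin/sh
# Hypothetical strong/weak dependency probe. Dry-run by default (blade
# commands are echoed); check_core_metric is a stub for a monitoring query.
BLADE="${BLADE:-echo blade}"
check_core_metric() { [ "$1" != "payment-db" ]; }  # stub: only payment-db breaks the core flow

for dep in cache-cluster search-api payment-db; do
  $BLADE create network loss --percent 100 --interface eth0 >/dev/null
  if check_core_metric "$dep"; then kind=weak; else kind=strong; fi
  echo "$dep: $kind dependency"
  # a real run would "blade destroy <uid>" here before the next iteration
done
```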
Qunhe Technology Quality Tech