
How KuJiaLe Built a Chaos Engineering Platform to Boost System Resilience

This article details KuJiaLe's journey from monolithic to micro‑service architecture, the stability challenges encountered, and how they designed and deployed a ChaosBlade‑based fault‑injection platform that improves fault tolerance, accelerates incident response, and enhances overall user experience.

Qunhe Technology Quality Tech

Background

Since 2015, KuJiaLe has been transitioning its overall technical architecture from a monolithic application to a micro‑service, containerized, and middle‑platform model. The evolution introduced three prominent problems:

Frequent stability failures caused by large‑scale architectural change.

Traditional process controls (code review, release gating) and stability checks that could no longer meet reliability needs.

No way to verify that service governance, monitoring alerts, and DevOps infrastructure actually work when incidents occur.

In the past two years, failures have caused direct or indirect losses exceeding one million RMB and damaged the company’s reputation.

Analysis

In complex distributed systems, failures cannot be fully prevented; the goal is therefore to identify risks before abnormal behaviour is triggered. A fault‑drill platform that deliberately injects fault scenarios uncovers hidden risks in the architecture, validates infrastructure completeness, limits the blast radius of real faults, and provides both pre‑emptive prevention and real‑time mitigation.

The platform aims to:

Improve system fault tolerance and robustness.

Increase development and operations emergency response efficiency.

Expose problems early to reduce online failure frequency and recurrence.

Enhance user experience.

Technology Selection

KuJiaLe’s stack is primarily Java‑based, emphasizing low cost, high ROI, and reusability. After evaluating open‑source solutions, the team selected ChaosBlade as the core infrastructure and integrated self‑developed standards and specifications to productise the fault‑drill platform.

ChaosBlade, an open‑source tool from Alibaba, follows chaos‑engineering principles and offers:

Rich applicable scenarios (basic resources, Docker containers, cloud‑native platforms, Java applications).

Active community and abundant documentation.

Low entry cost.
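ChaosBlade is driven through its `blade` CLI. As an illustration of how a drill script might assemble an invocation, the helper below only builds the argument list; the `blade create cpu fullload` scenario and `--cpu-percent`/`--timeout` flags follow ChaosBlade's documentation, but exact flag support varies by version, so verify against your installed release.

```python
# Sketch: building a ChaosBlade CLI invocation from drill parameters.
# The `blade create cpu fullload` scenario exists in ChaosBlade; treat
# specific flag names as assumptions to check against your version.

def blade_command(target, action, **flags):
    """Build an argv list for `blade create <target> <action> --key value ...`."""
    cmd = ["blade", "create", target, action]
    for key, value in flags.items():
        # Python keyword args use underscores; blade flags use dashes.
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd

# Example: load the CPU to 60% for a bounded 300-second experiment.
cpu_drill = blade_command("cpu", "fullload", cpu_percent=60, timeout=300)
print(" ".join(cpu_drill))
```

Bounding every experiment with a `timeout` is a common safeguard: the fault self-destructs even if the operator loses access mid-drill.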

Tool Comparison

| Capability | chaos-mesh | chaosmonkey | chaosblade | chaoskube | litmus |
|---|---|---|---|---|---|
| Platform Support | K8S | VM/Container | JVM/Container/K8S | K8S | K8S |
| CPU | No | No | Yes | No | No |
| MEM | No | No | Yes | No | No |
| Container | No | Yes | Yes | No | Yes |
| K8S Pod | Yes | No | Yes | Partial | Partial |
| Network | Yes | No | Yes | No | Yes |
| Disk | Yes | No | Yes | No | No |

Fault‑Drill Platform Design

Business Process

The fault‑drill workflow consists of three major stages:

Drill Review

Drill Execution

Drill Retrospective and Improvement
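The three stages above form a strictly ordered pipeline: a drill cannot execute before its review is approved, and a retrospective follows every execution. A minimal sketch of that gating (class and method names are hypothetical, not the platform's actual API):

```python
# Sketch: the drill workflow as a linear, gated sequence of stages.
# Stage names mirror the article; the enforcement logic is illustrative.

STAGES = ["review", "execution", "retrospective"]

class Drill:
    def __init__(self, name):
        self.name = name
        self.completed = []

    def advance(self, stage):
        """Move to the next stage; reject out-of-order transitions."""
        if len(self.completed) == len(STAGES):
            raise ValueError("drill already complete")
        expected = STAGES[len(self.completed)]
        if stage != expected:
            raise ValueError(f"expected stage '{expected}', got '{stage}'")
        self.completed.append(stage)

drill = Drill("redis-master-kill")
drill.advance("review")
drill.advance("execution")
drill.advance("retrospective")
print(drill.completed)
```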

Architecture

The platform, named Ares, integrates with internal CMDB (for host authentication), ticket system (for workflow approval), SaltStack execution engine (for fault injection), and monitoring/alerting systems to form a product‑oriented, platform‑centric architecture.

Modules

Fault Types: Application‑level and infrastructure‑level faults, with custom SOA and MicroTask faults for KuJiaLe.

Permission Management

Host permissions are verified against the CMDB.

Executor permissions are approved via corporate WeChat.

Drill Auditing records execution logs for security audit.

Task Orchestration automates multi‑step, cross‑application, cross‑service, and cross‑database workflows required for comprehensive fault drills.
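A multi-step drill must also fail safely: if one injection step fails, the steps already applied need to be undone. A minimal sketch of that orchestration pattern (step names are hypothetical; a real implementation would call the execution engine rather than log strings):

```python
# Sketch: multi-step drill orchestration with rollback on failure.
# Runs injection steps in order; on failure, undoes completed steps
# in reverse order so the system is left clean.

def run_drill(steps, fail_at=None):
    """Execute steps in order; roll back completed steps if one fails."""
    log = []
    done = []
    for step in steps:
        if step == fail_at:  # simulate a failed injection
            for completed in reversed(done):
                log.append(f"rollback {completed}")
            return log
        log.append(f"inject {step}")
        done.append(step)
    return log

print(run_drill(["db-latency", "mq-block", "svc-kill"], fail_at="svc-kill"))
```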

Implementation and Practice

Landing Strategy

Adopt a "single‑point blast" approach, limiting experiments to the smallest scope: single service, single host, single cluster, or single data center, to gain team acceptance and uncover most issues early.
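The "single‑point blast" rule can be enforced mechanically before an experiment is scheduled: reject any drill whose scope names more than one target at any level. A sketch, with hypothetical field names:

```python
# Sketch: enforcing the single-point blast rule -- at most one service,
# host, cluster, or data center per experiment. Scope keys are
# hypothetical examples of the levels the article lists.

def within_blast_limit(scope):
    """True only if every scope level targets at most one unit."""
    return all(len(targets) <= 1 for targets in scope.values())

ok = within_blast_limit({"services": ["order-svc"], "hosts": ["host-17"]})
too_wide = within_blast_limit({"services": ["order-svc", "pay-svc"]})
print(ok, too_wide)
```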

Dependency Governance

Dependencies are classified into infrastructure (KVM, containers), basic services (databases, message queues), and third‑party services. Accurate dependency graphs are built through architecture‑aware monitoring that captures process‑level call relationships and visualises them across servers, containers, and processes.

From Offline to Production

Experiments progress from test environment → pre‑release → preview with targeted traffic → production clusters, ensuring safety while validating real‑world impact.
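The environment progression above is a fixed promotion order: an experiment only graduates to the next environment after passing the current one. A trivial sketch of that gate (the environment names follow the article; the logic is illustrative):

```python
# Sketch: promoting an experiment through environments in fixed order.

PIPELINE = ["test", "pre-release", "preview", "production"]

def next_environment(current):
    """Return the next environment in the rollout, or None at the end."""
    idx = PIPELINE.index(current)
    return PIPELINE[idx + 1] if idx + 1 < len(PIPELINE) else None

print(next_environment("test"))        # pre-release
print(next_environment("production"))  # None
```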

Precise Full‑Link Fault Injection

By leveraging the full‑link observability map, faults can be injected at any node with fine‑grained control over affected customer IDs, regions, device types, or APIs, minimizing risk while testing resilience.
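Fine-grained targeting amounts to a match rule evaluated per request: the injected fault only fires for traffic matching the drill's dimensions. A sketch, where the rule fields are hypothetical examples of the dimensions the article names (customer ID, region, device type, API):

```python
# Sketch: per-request fault targeting. The fault fires only when every
# dimension in the drill's rule matches the incoming request.

def matches(rule, request):
    """True if the request satisfies every dimension of the match rule."""
    return all(request.get(dim) in allowed for dim, allowed in rule.items())

rule = {"customer_id": {"c-1001"}, "region": {"cn-east"}}
print(matches(rule, {"customer_id": "c-1001", "region": "cn-east",
                     "api": "/v1/render"}))
print(matches(rule, {"customer_id": "c-2002", "region": "cn-east"}))
```

Because unmatched traffic is untouched, a drill can run against production while only a handful of whitelisted test accounts ever observe the fault.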

Results and Benefits

The platform has been in production for one year, executing over 100 drills and uncovering more than 50 issues.

Practice 1: Validate Infrastructure Completeness

Attack: Application hang.

Result: While the application was hung, monitoring dashboards showed obvious gaps in the data, yet no alerts fired; even after recovery, the dashboards did not flag the anomaly. The drill thus exposed a blind spot in the alerting infrastructure.

Practice 2: Validate Application Behaviour Under Attack

Attack: Disk fill.

Observation: application QPS, CPU, and disk-usage metrics all showed anomalies, including spikes in indicators that should have been unrelated to disk pressure.

Vision

Integrate random fault drills into CI/CD pipelines to accumulate fault‑indicator data, validate emergency plans, refine architecture, and foster a chaos‑engineering culture.

Automate strong/weak dependency extraction by injecting faults, monitoring full‑link metric anomalies, and inferring dependency strength.

Simulate diverse real‑world failure events using historical data and big‑data analysis to generate realistic chain‑link faults.

Scale coverage to core online services with precise full‑link fault injection and red‑blue team exercises.

Collect fault‑pattern relationships to support future intelligent fault diagnosis.

Tags: Microservices, Observability, DevOps, Chaos Engineering, Fault Injection