
Chaos Engineering Practices at iQIYI: Building Resilient Cloud‑Native Systems

iQIYI’s Little Deer Chaos Platform injects faults and runs red‑blue attack drills across production services so that teams can validate alerts, circuit breakers, and fail‑over mechanisms. Case studies from the video playback and membership services show how the practice fosters zero‑trust design, faster skill growth, and resilient cloud‑native operations.

iQIYI Technical Product Team

In the programmer community, sayings such as “If the code runs successfully, don’t touch it” and “Never touch a stable production system” are widely accepted. In reality, however, systems remain fragile: hardware failures, code bugs, architectural flaws, and unpredictable traffic spikes all take their toll, and the risk only grows as architectures migrate to cloud‑native environments.

Chaos engineering, introduced by Netflix engineers in 2010, deliberately injects failures into production systems to verify their behavior under adverse conditions, acting like a “vaccine” that reveals hidden weaknesses before they cause major outages.

iQIYI adopted chaos engineering early, first through individual business teams (e.g., the financial‑payment team). After a 2020 pandemic‑peak playback failure caused by a small network jitter that cascaded into a large outage, iQIYI launched a company‑wide, standardized chaos‑engineering platform called the Little Deer Chaos Platform (小鹿乱撞平台).

The platform serves two main roles:

Business self‑test: Service owners can inject faults into their own production or test environments to verify high‑availability mechanisms such as alerts, degradation, circuit‑breakers, and disaster‑recovery routing.

Red‑Blue attack: An independent architecture evaluation team performs randomized attack experiments from a third‑party perspective to validate the resilience of critical services.

Typical workflow on the platform includes four steps:

Select the target service.

Configure the fault injection method (e.g., network latency, service shutdown).

Orchestrate the attack plan.

Observe the execution, collect metrics, and generate a concise fault‑exercise report.
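The article does not publish the platform’s internals; as a rough, hypothetical sketch (all names invented here), the four steps above could be modeled as a tiny experiment runner that wraps a target service call with a fault injector and emits a concise report:

```python
import time


def inject_latency(call, delay_ms):
    """Fault injector: wrap a service call so it suffers an artificial delay."""
    def delayed(*args, **kwargs):
        time.sleep(delay_ms / 1000.0)
        return call(*args, **kwargs)
    return delayed


def run_experiment(target_name, call, fault, observe):
    """Steps 1-4: select the target, apply the fault, execute, report."""
    faulty_call = fault(call)                  # step 2-3: configure and orchestrate
    start = time.monotonic()
    try:
        faulty_call()                          # step 3: execute the attack plan
        outcome = "ok"
    except Exception as exc:
        outcome = f"error: {exc}"
    elapsed_ms = (time.monotonic() - start) * 1000
    report = {"target": target_name, "outcome": outcome,
              "elapsed_ms": round(elapsed_ms, 1)}
    observe(report)                            # step 4: collect metrics, report
    return report
```

For example, `run_experiment("playback-cache", fetch, lambda c: inject_latency(c, 1000), print)` would replay a 1000 ms latency fault against a cache lookup and print the resulting report.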

Two concrete case studies illustrate the platform’s impact:

Case 1 – Video Playback Service Couchbase Cache Fault: A network jitter caused a timeout when accessing Couchbase, triggering a circuit‑breaker that switched traffic to a backup KV store. The fault was reproduced via a 1000 ms latency injection, confirming the effectiveness of the circuit‑breaker.
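The playback service’s actual circuit‑breaker code is not shown in the article; the pattern it validates — cut traffic over to a backup KV store after repeated primary timeouts — can be sketched minimally as follows (timeouts are modeled as exceptions; the threshold, cooldown, and names are illustrative, not iQIYI’s values):

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    route calls straight to the fallback for `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None          # timestamp when the breaker opened

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()      # breaker open: skip the slow primary
            self.opened_at = None      # cooldown over: probe the primary again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()          # serve from the backup store
```

In the Case 1 exercise, the injected 1000 ms latency plays the role of the `primary()` timeout, and the backup KV store plays `fallback()`.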

Case 2 – Membership Service Redis Distributed‑Lock Fault: Three failure scenarios were tested: network disconnection, primary Redis failure with automatic fail‑over, and primary failure without fail‑over followed by a restart. The experiments identified necessary client‑side retry and back‑off strategies.
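The client‑side retry and back‑off strategy identified in Case 2 can be sketched like this; `try_set` is a hypothetical stand‑in for Redis’s `SET key token NX EX ttl` lock acquisition, so the sketch runs without a Redis server:

```python
import time
import uuid


def acquire_lock(try_set, key, ttl_s=10, retries=5, base_delay_s=0.05):
    """Try to take a distributed lock, retrying with exponential backoff.

    `try_set(key, token, ttl_s)` stands in for Redis `SET key token NX EX ttl`
    and must return True iff the lock was acquired.
    """
    token = str(uuid.uuid4())          # unique owner token for safe release
    delay = base_delay_s
    for _attempt in range(retries):
        if try_set(key, token, ttl_s):
            return token               # lock held; release must check the token
        time.sleep(delay)              # back off before retrying
        delay *= 2                     # exponential backoff between attempts
    return None                        # caller must handle acquisition failure
```

Bounded retries with backoff keep clients from hammering a failing or failing‑over Redis primary, while the TTL ensures a crashed holder cannot pin the lock forever.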

Across more than 20 business lines, the platform’s summary of common pitfalls includes stale database address caches after fail‑over, malformed error responses, undetected slow nodes, and alert channels that had been silently broken for long periods.

Key take‑aways for architects:

Zero trust: Assume any dependency can fail and design redundancy accordingly.

Exploration mindset: Actively use chaos experiments to discover hidden failure modes.

By continuously practicing chaos engineering, iQIYI’s technical leaders have built stronger confidence in their architectures, accelerated skill growth, and ensured stable business operations in the increasingly complex cloud‑native era.

Tags: Cloud Native, DevOps, Chaos Engineering, Reliability, Fault Injection, Resilience Testing