Understanding Service Avalanche and Circuit Breaker Mechanisms through the Red Cliffs Battle Analogy
This article uses the historic Battle of Red Cliffs as an analogy to explain service avalanche, its causes in micro‑service architectures, and how circuit‑breaker, rate‑limiting, and isolation techniques can prevent cascading failures in modern distributed systems.
Red Cliffs Battle
The Battle of Red Cliffs, as told in the novel Romance of the Three Kingdoms, shows the "service avalanche" problem taken to its extreme: one local failure spreads until the whole system is lost.
1. Reconstructing the Battle of Red Cliffs
After Cao Cao unified the north, he moved south, defeated Liu Bei, and occupied Jingxiang, intending to eliminate Sun Quan. Liu Bei and Sun Quan formed an alliance against Cao Cao's 800,000 troops. Cao Cao's northern troops lacked experience in naval warfare and suffered from seasickness, so he ordered the ships to be linked together with iron chains to reduce the impact of waves.
Dialogue among Zhou Yu, Huang Gai and Zhuge Liang:
Huang Gai: Cao Cao’s chained ships are a disaster; if one catches fire, the whole fleet will burn. We should use fire attacks.
Zhou Yu: How can we get close to their ships?
Huang Gai: I will feign surrender, bring a few ships loaded with oil‑soaked straw, and set them ablaze when near the enemy.
Zhou Yu: Brilliant! But where do we get the east wind?
Zhuge Liang: I will borrow the east wind.
The fire ships broke through the enemy formation, creating a sea of fire and leading to a decisive victory for the allied forces.
2. Analysis of the Battle Situation
Zhou Yu and Huang Gai identified the weakness of the chained ships: if one ship catches fire, the whole chain burns. This mirrors the "service avalanche" problem in distributed systems.
In a micro‑service architecture, each service calls others via interfaces. As the business grows, the number of services and their inter‑dependencies increases, and the overall call graph becomes more complex. If a service that others depend on becomes unavailable, the failure can cascade upstream through its callers and cause a complete outage, much as a small slide of snow grows into an avalanche.
3. Service Avalanche in Systems
Micro‑services typically use RPC or HTTP calls with timeout limits and retry mechanisms. Without circuit‑breaker or rate‑limiting, a single failure can trigger an avalanche. The following example illustrates this:
Three services are involved: Order Service, Product Service, and Inventory Service. Order Service calls Product Service, which in turn calls Inventory Service.
During a high‑traffic event (e.g., Double‑11), the Inventory Service becomes unavailable, so calls to it time out.
Product Service keeps retrying those calls, exhausts its own resources, and eventually crashes.
Order Service, which depends on Product Service, then fails as well, leading to a total outage.
4. Real‑World Scenarios Causing Avalanches
4.1 Service Provider Unavailability
Hardware failures (network, disk).
Software bugs that consume excessive CPU.
Cache breakdowns causing a sudden surge of database traffic.
Flash‑sale spikes overwhelming service capacity.
4.2 Retry Amplification
Users manually retrying after no response.
Application‑level retry logic that repeats failed calls multiple times.
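To see why layered retries are dangerous, the multiplication can be sketched in a few lines. The function name and the attempt counts below are illustrative assumptions, not figures from a real incident, and the sketch assumes the worst case in which every attempt at every hop fails and is retried to its limit.

```python
from math import prod


def worst_case_calls(attempts_per_hop: list[int]) -> int:
    """Worst-case calls reaching the deepest service for one user request.

    attempts_per_hop[i] is the total number of tries (first call plus
    retries) made at hop i, assuming every try fails.
    """
    return prod(attempts_per_hop)


# Hypothetical numbers: the user clicks twice, Order Service tries
# Product Service 3 times, Product Service tries Inventory Service 3 times.
print(worst_case_calls([2, 3, 3]))  # 18
```

A service that is already struggling therefore receives many times its normal load at exactly the moment it can least handle it.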
5. Preventing Service Avalanche
Pre‑emptive measures: rate limiting, active degradation, isolation.
Post‑failure recovery: circuit breaking, passive degradation.
The rest of this article focuses on circuit‑breaker mechanisms.
6. Circuit‑Breaker Principles and Algorithms
6.1 Concept
A circuit breaker works like an electrical fuse: when the current exceeds a threshold, the fuse blows to protect the rest of the circuit. In software, the "current" is a health signal such as request latency or error rate.
If a service becomes consistently slow or times out, the circuit opens and subsequent calls are rejected with a fast failure response, allowing the service time to recover.
6.2 How to Trip a Circuit
When the number of failures or the failure ratio within a time window exceeds a configured threshold, the circuit opens.
6.3 Request‑Counting Algorithm
Check if the circuit is open; if so, reject the request.
If closed, verify whether the time window is full.
If the window is not full, increment the request bucket.
On response, increment either the success or failure bucket.
When the window is full, evaluate whether to open the circuit.
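The steps above can be sketched as a small class. This is a minimal, single-threaded illustration under stated assumptions, not how Sentinel or Hystrix implement it: the class and attribute names are hypothetical, the "window" is taken to be a fixed number of completed calls rather than a span of time, and the circuit opens when the failure ratio exceeds the threshold.

```python
class CountingBreaker:
    """Count-based sketch: the window is a fixed number of completed calls."""

    def __init__(self, window_size: int = 10, failure_ratio: float = 0.5):
        self.window_size = window_size
        self.failure_ratio = failure_ratio
        self.is_open = False
        self._reset()

    def _reset(self) -> None:
        # One counter ("bucket") each for requests, successes and failures.
        self.requests = 0
        self.successes = 0
        self.failures = 0

    def allow(self) -> bool:
        """Steps 1-3: reject if open, otherwise count the request."""
        if self.is_open:
            return False
        self.requests += 1
        return True

    def record(self, ok: bool) -> None:
        """Steps 4-5: count the outcome, then evaluate a full window."""
        if self.is_open:
            return  # late responses do not change an open circuit
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        completed = self.successes + self.failures
        if completed >= self.window_size:
            self.is_open = self.failures / completed > self.failure_ratio
            self._reset()


breaker = CountingBreaker(window_size=4, failure_ratio=0.5)
for outcome in [True, False, False, False]:
    if breaker.allow():
        breaker.record(outcome)
print(breaker.is_open)  # True: 3 of 4 calls failed
```

Closing the circuit again is deliberately left out here; that is the job of the recovery logic.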
6.4 Recovery Algorithm
After a cooldown period, the circuit moves to a half‑open state, allowing a limited number of test requests.
If test requests succeed, the circuit closes; otherwise, it reopens.
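A rough sketch of this closed / open / half-open cycle follows. All names are hypothetical, the cooldown and probe count are arbitrary, and the clock is injected so the behaviour can be checked without waiting; the sketch also assumes that a single failed probe is enough to reopen the circuit.

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class RecoveringBreaker:
    def __init__(self, cooldown: float = 60.0, probes: int = 3,
                 clock=time.monotonic):
        self.cooldown = cooldown      # seconds to stay open
        self.probes = probes          # test requests allowed when half-open
        self.clock = clock
        self.state = CLOSED
        self.opened_at = 0.0
        self.probes_left = 0
        self.probe_successes = 0

    def trip(self) -> None:
        """Open the circuit and start the cooldown."""
        self.state = OPEN
        self.opened_at = self.clock()

    def allow(self) -> bool:
        if self.state == OPEN and self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: let a limited number of probes through.
            self.state = HALF_OPEN
            self.probes_left = self.probes
            self.probe_successes = 0
        if self.state == HALF_OPEN:
            if self.probes_left == 0:
                return False
            self.probes_left -= 1
            return True
        return self.state == CLOSED

    def record(self, ok: bool) -> None:
        if self.state != HALF_OPEN:
            return  # failure counting while closed is out of scope here
        if not ok:
            self.trip()  # a failed probe reopens the circuit
            return
        self.probe_successes += 1
        if self.probe_successes >= self.probes:
            self.state = CLOSED
```

In use, the caller invokes `allow()` before each call and `record()` after it, and calls `trip()` when the failure threshold is crossed.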
6.5 Failure‑Rate Time Window
Two types of windows are used:
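Fixed window: counts total traffic in a set interval and resets at each boundary; it cannot limit short bursts that straddle two adjacent intervals.
Sliding window: always covers the most recent interval, so old samples age out gradually and control is smoother.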
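As an illustration of the sliding variant, the sketch below keeps timestamped outcomes in a deque and drops those older than the window before computing the failure rate. The class name and the choice to pass `now` explicitly are assumptions made for testability; production libraries usually approximate this with a ring of small time buckets instead of storing every call.

```python
from collections import deque


class SlidingFailureWindow:
    def __init__(self, span_seconds: float):
        self.span = span_seconds
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop outcomes that have slid out of the most recent `span` seconds.
        while self.events and self.events[0][0] <= now - self.span:
            self.events.popleft()

    def record(self, now: float, ok: bool) -> None:
        self.events.append((now, ok))
        self._evict(now)

    def failure_rate(self, now: float) -> float:
        self._evict(now)
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)


window = SlidingFailureWindow(span_seconds=10)
window.record(0.0, ok=False)
window.record(5.0, ok=True)
print(window.failure_rate(5.0))   # 0.5
print(window.failure_rate(10.0))  # 0.0, the failure at t=0 has aged out
```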
6.6 Service‑Recovery Attempt Window
The circuit stays open for a configured period (e.g., 1 minute), then switches to half‑open to probe the service. If the probe succeeds, the circuit closes; otherwise, it reopens and the cycle repeats, possibly with increasing back‑off intervals.
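One possible way to grow the open period is exponential back-off with a ceiling. The doubling factor, the 60-second base, and the 600-second cap below are assumptions chosen for illustration; the source only says the intervals may increase.

```python
def open_duration(reopen_count: int, base: float = 60.0,
                  cap: float = 600.0) -> float:
    """Seconds to stay open after the circuit has reopened `reopen_count` times."""
    return min(cap, base * 2 ** reopen_count)


print([open_duration(n) for n in range(5)])
# [60.0, 120.0, 240.0, 480.0, 600.0]
```

The cap keeps a long outage from pushing the next probe unreasonably far into the future.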
7. Circuit‑Breaker Middleware
While you can implement your own circuit‑breaker, it is recommended to use proven open‑source solutions such as Alibaba's Sentinel or Netflix's Hystrix (now in maintenance mode).
8. Turning the Tide
To help Cao Cao avoid the chained‑ship disaster, possible strategies include:
Replace iron chains with ropes that are easy to cut, so a burning ship can be cut loose before the fire spreads (circuit‑breaker).
Segment the fleet into isolated zones so a fire in one zone does not spread (resource isolation).
Set up checkpoints to verify ships before they proceed (pre‑flight checks).
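The resource-isolation idea (segmenting the fleet) is often called a bulkhead in software. A minimal sketch, assuming a hypothetical `Bulkhead` class that caps concurrent calls to one dependency with a semaphore, might look like this:

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency's concurrency budget is used up."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing, so a slow dependency cannot
        # tie up every worker thread in the calling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


inventory_calls = Bulkhead(max_concurrent=20)  # hypothetical budget
print(inventory_calls.call(lambda: "in stock"))  # in stock
```

Giving each downstream dependency its own budget means that a stalled Inventory Service can exhaust only its own slots, not the whole process.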
9. Rate Limiting and Degradation
Rate limiting controls traffic by allowing only a portion of requests to pass, ensuring the service can handle the load. Common algorithms are fixed‑window, leaky‑bucket, and token‑bucket.
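The fixed-window approach is the simplest of the three, and a short sketch also shows its known weakness at window boundaries. The class name and the limit of 2 requests per second are illustrative assumptions.

```python
class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new interval has started: reset the counter.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowLimiter(limit=2, window_seconds=1.0)
print([limiter.allow(t) for t in (0.90, 0.95, 0.99, 1.00, 1.05)])
# [True, True, False, True, True]
```

Four requests pass within 0.15 seconds even though the limit is 2 per second, because the counter resets at t = 1.0.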
Leaky‑Bucket Algorithm
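Releases requests at a constant rate regardless of how fast they arrive. During a burst, excess requests wait in the bucket, which adds latency, or are dropped once the bucket is full.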
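A queue-style leaky bucket can be sketched by computing when each arriving request is allowed to leave. The class name, the `schedule` method, and the numbers are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class LeakyBucket:
    """Releases at most one request every 1/rate seconds."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.interval = 1.0 / rate_per_sec
        self.capacity = capacity   # max requests held in the bucket at once
        self.next_free = 0.0       # earliest time the next request may leave

    def schedule(self, now: float):
        """Return the departure time for a request arriving at `now`,
        or None if the bucket is full and the request is dropped."""
        start = max(now, self.next_free)
        if start - now >= self.capacity * self.interval:
            return None
        self.next_free = start + self.interval
        return start


bucket = LeakyBucket(rate_per_sec=2, capacity=2)
print([bucket.schedule(0.0) for _ in range(3)])  # [0.0, 0.5, None]
```

The second request of the burst is delayed by half a second and the third is dropped, which is the latency cost of smoothing traffic to a constant rate.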
Token‑Bucket Algorithm
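Tokens are added to the bucket at a steady rate of N per second, up to the bucket's capacity, and each request consumes one token. Saved‑up tokens let short bursts through while the long‑run rate stays at N requests per second. It is often implemented with Redis in distributed environments.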
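The refill-and-consume logic can be sketched in-process as follows. This is a single-node illustration with hypothetical names, not a Redis implementation; a distributed version would have to keep the token count and timestamp in shared storage and update them atomically.

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1, capacity=2)
print([bucket.allow(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(1.0))                      # True, one token refilled
```

Unlike the leaky bucket, a burst of up to `capacity` requests passes immediately; only the sustained rate is limited.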
Conclusion
The classic novel Romance of the Three Kingdoms provides a vivid analogy for understanding service avalanche and circuit‑breaker concepts, helping engineers design more resilient micro‑service systems.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.