Understanding and Preventing Cascading Failures in Distributed Systems
The article explains how cascading failures arise from positive feedback loops in distributed systems, illustrates real‑world incidents such as the 2015 DynamoDB outage, outlines anti‑patterns like unlimited retries and unchecked load, and presents practical mitigation techniques including load‑shedding, circuit breakers, exponential back‑off, and controlled replication to improve system resilience.
Cascading failures are outages driven by positive feedback loops that amplify an initial problem, often leading to degraded capacity, increased latency, and a chain of errors in distributed software systems.
Real‑world examples, such as the 2015 Amazon DynamoDB outage and Parse.ly's "Kafkapocalypse," show how a brief network glitch or a sudden load increase can trigger a cascade: overloaded services retry, which increases load further, eventually taking the entire system offline.
The article identifies several anti‑patterns that exacerbate cascades: unlimited resource scaling that cannot keep up with load, manual restarts as the only fix, dangerous client retry loops, "death queries" that crash services, and failure‑triggered workload spikes across data centers.
Mitigation strategies include:
Load‑shedding and limiting concurrent requests (e.g., Netflix's concurrency‑limits library).
Implementing exponential back‑off with jitter for retries. Example in Go (a fragment that assumes the `math/rand` and `time` imports and the article's `doServerRequest` helper):

```go
const MAX_RETRIES = 5
const JITTER_RANGE_MSEC = 200

steps_msec := []int{100, 500, 1000, 5000, 15000}
rand.Seed(time.Now().UTC().UnixNano())
for i := 0; i < MAX_RETRIES; i++ {
	_, err := doServerRequest()
	if err == nil {
		break
	}
	// Back off for the current step plus random jitter,
	// so synchronized clients do not retry in lockstep.
	time.Sleep(time.Duration(steps_msec[i]+rand.Intn(JITTER_RANGE_MSEC)) * time.Millisecond)
}
```
Using circuit breakers to stop repeated failing calls and periodically probe services.
Applying token‑bucket or similar rate‑limiting algorithms to control replication and background work.
Designing services to avoid long start‑up times and to handle overload gracefully.
Additional anti‑patterns and their fixes are presented as six concrete examples, each accompanied by a causal loop diagram that illustrates how the feedback loop can be balanced or broken.
The article concludes that while cascading failures are inherent to many distributed systems, understanding their root causes and applying the above patterns can significantly reduce their risk and impact.