
Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems must cope with unreliable networks: packets can be lost, reordered, or arbitrarily delayed, and a node failure can be indistinguishable from a network problem, which makes timeout settings and fault detection difficult. This article classifies these issues, compares synchronous and asynchronous networks, and discusses strategies for balancing latency against resource utilization.

Xiaokun's Architecture Exploration Notes

Unreliable Network Issue Classification

Previously we discussed partial failures in distributed systems caused by component or service faults; another major cause is network failures.

This section classifies the main categories of network unreliability in distributed systems.

Request/response loss: Packets may be lost outright due to physical link failures (e.g., a cut fiber) or protocol-level errors (e.g., TCP retransmission limits being exceeded).

Reordering and delay: Asynchronous networks cannot guarantee in-order delivery, and latency is affected by congestion-control algorithms (e.g., TCP BBR) and routing hops, producing long-tail delays in which the 99th-percentile latency can be ten times the average.

Ambiguity of node failure: It is difficult to distinguish a network partition from a node crash; a remote process may appear unresponsive because of network issues and be mistakenly declared dead.
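The long-tail effect described above is easy to reproduce with a quick simulation. In this sketch (the distribution parameters and the 2% slow-path fraction are illustrative assumptions, not measurements), most requests are fast but a small fraction hit congestion or retransmission, and the p99 ends up far above the mean:

```python
import random

random.seed(7)

# Simulate 10,000 request latencies: most take the fast path, but a
# small fraction hit queuing or retransmits and become very slow.
latencies = []
for _ in range(10_000):
    base = random.uniform(1.0, 5.0)          # normal path, milliseconds
    if random.random() < 0.02:               # ~2% hit the slow path
        base += random.uniform(50.0, 200.0)  # congestion / retransmits
    latencies.append(base)

latencies.sort()
mean = sum(latencies) / len(latencies)
p99 = latencies[int(0.99 * len(latencies))]

print(f"mean = {mean:.1f} ms, p99 = {p99:.1f} ms, ratio = {p99/mean:.1f}x")
```

Even though only 2% of requests are slow, the 99th percentile lands squarely inside the slow tail, which is why averaging hides the problem.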

Synchronous vs Asynchronous Networks

Synchronous network and bounded latency: In traditional telephone networks, a fixed bandwidth is reserved for the entire call path, forming a star topology with guaranteed, predictable latency.

Because the circuit is pre‑reserved and there is no queuing, end‑to‑end delay is fixed, which we call bounded latency.

Asynchronous network and unbounded latency: Data-center and mobile networks are asynchronous, similar to highways where vehicles of different sizes (data packets) share lanes and may experience congestion, causing unpredictable, unbounded latency.

Asynchronous networks use packet-switched transmission, which can deliver packets out of order; so that receivers can delimit messages and verify their integrity, packets carry headers and trailers (the classic "packet framing" mechanism).
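One common concrete form of framing is a length prefix: each message is preceded by a fixed-size header stating how many bytes follow, so a receiver can split a continuous byte stream back into messages. This is a minimal sketch of that scheme (real protocols typically add checksums, sequence numbers, and addressing as well):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length header."""
    return struct.pack(">I", len(payload)) + payload

def deframe(stream: bytes) -> list[bytes]:
    """Split a concatenated byte stream back into individual payloads."""
    messages, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        messages.append(stream[offset:offset + length])
        offset += length
    return messages

# Two framed messages sent back-to-back arrive as one byte stream;
# the length headers let the receiver recover the boundaries.
stream = frame(b"hello") + frame(b"world")
print(deframe(stream))  # [b'hello', b'world']
```

The length prefix is what allows the receiver to find message boundaries even when the transport delivers bytes in arbitrary chunk sizes.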

Common asynchronous network topologies include Fat‑Tree and Spine‑Leaf structures.

Ambiguity of Node Failure

Network latency unreliability adds complexity to fault detection in distributed systems, making it hard to tell whether a node is truly failed or merely unreachable due to network issues.

Fault detection differs for compute clusters and storage clusters:

Compute cluster

Load balancers must detect a dead node and stop routing requests to it.

Storage cluster

In a master‑slave setup, if the master fails, a slave must be elected as the new master.

Because network conditions are unpredictable, timeout settings must balance between being too long (delaying failure handling) and too short (causing false positives, whose retries shift load onto the remaining nodes and can cascade into overload). A common heuristic sets the timeout to 2d + r, where d is the maximum one-way network delay and r is the expected request processing time: the request takes at most d to arrive, at most r to process, and the response takes at most d to return. In practice neither bound holds reliably, forcing a trade-off between latency and resource utilization.
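The 2d + r heuristic is trivial to compute once you have estimates for the two quantities; the hard part, as noted above, is that real networks give you no firm bound on d. A sketch with illustrative numbers:

```python
def heuristic_timeout(max_one_way_delay_ms: float,
                      processing_time_ms: float) -> float:
    """Timeout = 2d + r: the request travels one way (d), is
    processed (r), and the response travels back (d)."""
    return 2 * max_one_way_delay_ms + processing_time_ms

# Illustrative estimates: 40 ms worst-case one-way delay,
# 100 ms request processing time.
print(heuristic_timeout(40.0, 100.0))  # 180.0 ms
```

If the true delay exceeds the 40 ms estimate even briefly, this timeout produces a false positive, which is why the formula is a starting point rather than a guarantee.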

Summary

In data‑center asynchronous networks, packet‑switching and congestion‑induced queuing create uncertainty in timeout configuration, requiring a trade‑off between latency and resource utilization.

Tags: distributed systems, fault tolerance, network reliability, asynchronous network, synchronous network
Written by: Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
