
Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems must cope with unreliable networks: packets can be lost, reordered, or arbitrarily delayed, and a node failure can be indistinguishable from a network problem, which makes timeout settings and fault detection difficult. This article classifies these issues, compares synchronous and asynchronous networks, and discusses strategies for balancing latency against resource utilization.

Xiaokun's Architecture Exploration Notes

Unreliable Network Issue Classification

Previously we discussed partial failures in distributed systems caused by component or service faults; another major cause is network failures.

This section classifies the main categories of network unreliability in distributed systems.

Request/response loss: Packets may be lost outright due to physical link failures (e.g., a cut fiber) or protocol-level errors (e.g., TCP retransmission limits being exceeded).

Reordering and delay: Asynchronous networks cannot guarantee in-order delivery, and latency is affected by congestion-control algorithms (e.g., TCP BBR) and routing hops, producing long-tail delays in which the 99th-percentile latency can be ten times the average.

Ambiguity of node failure: It is difficult to distinguish a network partition from a node crash; a remote process may appear unresponsive because of network issues and be mistakenly declared dead.
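The long-tail effect described above is easy to reproduce with a quick simulation. In this sketch (the distribution parameters and the 2% slow-path fraction are illustrative assumptions, not measurements), most requests are fast but a small fraction hit congestion or retransmission, and the p99 ends up far above the mean:

```python
import random

random.seed(7)

# Simulate 10,000 request latencies: most take the fast path, but a
# small fraction hit queuing or retransmits and become very slow.
latencies = []
for _ in range(10_000):
    base = random.uniform(1.0, 5.0)          # normal path, milliseconds
    if random.random() < 0.02:               # ~2% hit the slow path
        base += random.uniform(50.0, 200.0)  # congestion / retransmits
    latencies.append(base)

latencies.sort()
mean = sum(latencies) / len(latencies)
p99 = latencies[int(0.99 * len(latencies))]

print(f"mean = {mean:.1f} ms, p99 = {p99:.1f} ms, ratio = {p99/mean:.1f}x")
```

Even though only 2% of requests are slow, the 99th percentile lands squarely inside the slow tail, which is why averaging hides the problem.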

Synchronous vs Asynchronous Networks

Synchronous network and bounded latency: In traditional telephone networks, a fixed bandwidth is reserved for the entire call path, forming a star topology with guaranteed, predictable latency.

Because the circuit is pre‑reserved and there is no queuing, end‑to‑end delay is fixed, which we call bounded latency.

Asynchronous network and unbounded latency: Data-center and mobile networks are asynchronous, similar to highways where vehicles of different sizes (data packets) share lanes and may experience congestion, causing unpredictable, unbounded latency.

Asynchronous networks use packet-switched transmission, which can deliver packets out of order; so that receivers can delimit messages and verify their integrity, packets carry headers and trailers (the classic "packet framing" mechanism).
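One common concrete form of framing is a length prefix: each message is preceded by a fixed-size header stating how many bytes follow, so a receiver can split a continuous byte stream back into messages. This is a minimal sketch of that scheme (real protocols typically add checksums, sequence numbers, and addressing as well):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length header."""
    return struct.pack(">I", len(payload)) + payload

def deframe(stream: bytes) -> list[bytes]:
    """Split a concatenated byte stream back into individual payloads."""
    messages, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        messages.append(stream[offset:offset + length])
        offset += length
    return messages

# Two framed messages sent back-to-back arrive as one byte stream;
# the length headers let the receiver recover the boundaries.
stream = frame(b"hello") + frame(b"world")
print(deframe(stream))  # [b'hello', b'world']
```

The length prefix is what allows the receiver to find message boundaries even when the transport delivers bytes in arbitrary chunk sizes.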

Common asynchronous network topologies include Fat‑Tree and Spine‑Leaf structures.

Ambiguity of Node Failure

Network latency unreliability adds complexity to fault detection in distributed systems, making it hard to tell whether a node is truly failed or merely unreachable due to network issues.

Fault detection differs for compute clusters and storage clusters:

Compute cluster

Load balancers must detect a dead node and stop routing requests to it.

Storage cluster

In a master‑slave setup, if the master fails, a slave must be elected as the new master.

Because network conditions are unpredictable, timeout settings must balance between being too long (delaying failure handling) and too short (causing false positives, whose retries shift load onto the remaining nodes and can cascade into overload). A common heuristic sets the timeout to 2d + r, where d is the maximum one-way network delay and r is the expected request processing time: the request takes at most d to arrive, at most r to process, and the response takes at most d to return. In practice neither bound holds reliably, forcing a trade-off between latency and resource utilization.
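The 2d + r heuristic is trivial to compute once you have estimates for the two quantities; the hard part, as noted above, is that real networks give you no firm bound on d. A sketch with illustrative numbers:

```python
def heuristic_timeout(max_one_way_delay_ms: float,
                      processing_time_ms: float) -> float:
    """Timeout = 2d + r: the request travels one way (d), is
    processed (r), and the response travels back (d)."""
    return 2 * max_one_way_delay_ms + processing_time_ms

# Illustrative estimates: 40 ms worst-case one-way delay,
# 100 ms request processing time.
print(heuristic_timeout(40.0, 100.0))  # 180.0 ms
```

If the true delay exceeds the 40 ms estimate even briefly, this timeout produces a false positive, which is why the formula is a starting point rather than a guarantee.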

Summary

In data‑center asynchronous networks, packet‑switching and congestion‑induced queuing create uncertainty in timeout configuration, requiring a trade‑off between latency and resource utilization.

Tags: distributed systems, fault tolerance, network reliability, asynchronous network, synchronous network
Written by: Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
