
Why Unreliable Networks Threaten Distributed Systems—and How to Mitigate Them

The article explains how network failures such as packet loss, reordering, latency, and ambiguous node failures make distributed systems unreliable, compares synchronous and asynchronous networks, and discusses the trade‑off between timeout settings and resource utilization.

Xiaokun's Architecture Exploration Notes

Unreliable Network Issue Classification

Distributed systems suffer partial failures not only from faulty components but also from unreliable network conditions.

Request/Response Loss : Packets may be completely lost due to physical link failures (e.g., a cut fiber) or protocol errors (e.g., TCP retransmission limits).

Reordering and Delay : Asynchronous networks cannot guarantee in‑order delivery; congestion‑control backoff and queuing at each hop produce long‑tail latency, where the 99th‑percentile delay can be ten times the average.

Ambiguity of Node Failure : It is hard to distinguish a network partition from an actual node crash, leading to false failure detections.
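The third point is worth dwelling on. A minimal simulation (the failure modes and function names here are hypothetical, for illustration only) shows why a caller cannot distinguish these cases: every failure surfaces as the same silence.

```python
import random

# Hypothetical sketch: from the caller's view, a timeout is indistinguishable
# across failure modes -- the request may never have arrived, the node may
# have crashed mid-processing, or only the response may have been lost.
FAILURE_MODES = ["request_lost", "node_crashed", "response_lost", "response_delayed"]

def call_remote(timeout_s: float) -> str:
    """Simulate an RPC. Every failure mode surfaces as the same timeout."""
    outcome = random.choice(FAILURE_MODES + ["ok"])
    if outcome == "ok":
        return "response"
    # The caller observes only silence until the timeout fires; it cannot
    # tell whether the work was actually performed (e.g. response_lost).
    raise TimeoutError(f"no response within {timeout_s}s (hidden cause: {outcome})")

# Retrying after a timeout risks executing the operation twice, which is
# why retried operations need idempotency or server-side deduplication.
for attempt in range(3):
    try:
        print(call_remote(1.0))
        break
    except TimeoutError as e:
        print(f"attempt {attempt}: {e}")
```

Note that in the `response_lost` case the remote node *did* do the work; a naive retry would perform it twice.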

Synchronous Network and Bounded Latency

A synchronous network reserves a fixed bandwidth path for the entire communication session, similar to a dedicated telephone circuit. The topology is typically star‑shaped, providing predictable, fixed end‑to‑end latency, which we call bounded latency.

Asynchronous Network and Unbounded Latency

Data‑center and mobile networks operate as asynchronous networks, akin to a highway where vehicles of different sizes (data packets) share lanes and may change lanes or queue, causing congestion. Consequently, end‑to‑end latency is unpredictable (unbounded latency).
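The long‑tail effect of such shared queuing can be made concrete. This sketch samples latencies from a heavy‑tailed Pareto distribution (an assumption chosen only to mimic congestion‑induced tails, not measured data) and compares the mean with the 99th percentile:

```python
import random

# Illustrative only: a heavy-tailed distribution stands in for queuing delay
# on a shared asynchronous network. Real traces differ, but the shape holds.
random.seed(42)
samples = [random.paretovariate(1.5) for _ in range(100_000)]

def percentile(data, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    s = sorted(data)
    return s[int(p / 100 * (len(s) - 1))]

mean = sum(samples) / len(samples)
p99 = percentile(samples, 99)
print(f"mean={mean:.2f}  p99={p99:.2f}  ratio={p99 / mean:.1f}x")
```

The p99 lands several times above the mean, which is why tail latency, not average latency, drives timeout choices.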

Packet reordering arises from packet switching: messages are split into smaller packets that may arrive out of order, requiring sequence numbers or delimiters to reassemble them correctly.
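A minimal sketch of sequence‑number reassembly (the tuple layout is a simplification; real protocols such as TCP also handle duplicates and gaps):

```python
def reassemble(packets):
    """Reorder packets by sequence number and join their payloads.

    Each packet is a (seq, payload) pair; the network may deliver
    them in any order, so the receiver sorts on seq before joining.
    """
    return b"".join(payload for _, payload in sorted(packets))

# Out-of-order arrival, as packet switching permits:
arrived = [(2, b"lo wo"), (0, b"he"), (3, b"rld"), (1, b"l")]
assert reassemble(arrived) == b"hello world"
```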

Typical asynchronous network topologies include Fat‑Tree and Spine‑Leaf architectures.

Ambiguity of Node Failure

Network unreliability adds complexity to fault‑tolerance mechanisms because it is difficult to tell whether a node is truly down or merely unreachable due to network issues.

In compute clusters, load balancers must stop sending requests to a dead node (e.g., Node3). In storage clusters, a master‑slave setup requires electing a new master when the current master fails.
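A heartbeat‑based detector is the usual mechanism behind both cases. The class below is a hypothetical, simplified sketch: a node silent for longer than the timeout is treated as dead, even though a partitioned‑but‑alive node looks exactly the same.

```python
class FailureDetector:
    """Timeout-based failure detector (simplified, hypothetical API).

    A load balancer records each node's last heartbeat time and stops
    routing to any node silent for longer than `timeout_s` -- accepting
    that a partitioned-but-alive node is indistinguishable from a crash.
    """
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def alive_nodes(self, now: float) -> list[str]:
        return [n for n, seen in self.last_seen.items()
                if now - seen <= self.timeout_s]

# Usage: Node3 goes silent (crash or partition -- we cannot tell which).
fd = FailureDetector(timeout_s=5.0)
for node in ("Node1", "Node2", "Node3"):
    fd.heartbeat(node, now=0.0)
fd.heartbeat("Node1", now=8.0)
fd.heartbeat("Node2", now=8.0)
print(fd.alive_nodes(now=10.0))  # Node3's 10s gap exceeds the 5s timeout
```

The same primitive underlies master election in storage clusters: a node that misses heartbeats loses its lease, and the survivors elect a replacement.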

Timeout settings must balance two risks: a timeout long enough to avoid false positives against one short enough to detect failures quickly. A heuristic of 2d + r is often suggested, where d is an upper bound on one‑way network delay and r is an upper bound on a node's processing time, but real‑world variability makes such bounds, and hence precise tuning, difficult.
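The 2d + r arithmetic can be sketched as follows. The sample values are invented for illustration, and estimating d and r from observed maxima is itself an assumption: asynchronous networks offer no hard bound, which is precisely the difficulty the heuristic runs into.

```python
# Sketch of the 2d + r heuristic: d bounds one-way network delay, r bounds
# a node's processing time, so 2d + r bounds request + processing + response.
one_way_delays_ms = [3, 4, 5, 4, 6, 40, 5, 4]   # note the long-tail outlier
processing_ms = [10, 12, 11, 15, 80, 12]        # e.g. a GC pause

d = max(one_way_delays_ms)   # estimated bound on one-way delay
r = max(processing_ms)       # estimated bound on processing time
timeout_ms = 2 * d + r
print(f"timeout = 2*{d} + {r} = {timeout_ms} ms")
```

A single outlier in either sample inflates the timeout for every future request, which is the latency‑versus‑detection‑speed trade‑off the Summary below describes.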

Summary

In data‑center asynchronous networks, packet‑switching and congestion‑induced queuing introduce significant uncertainty in timeout configuration, forcing engineers to trade off between latency tolerance and resource utilization.

Tags: distributed systems, latency, node failure, network reliability, asynchronous network, synchronous network
Written by Xiaokun's Architecture Exploration Notes

10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
