
Fundamentals of Distributed Systems: Nodes, Replication, Consistency, and Core Protocols

This article provides a comprehensive overview of distributed‑system fundamentals, covering node concepts, failure types, replication models, consistency levels, performance and availability metrics, data‑distribution strategies, replica control protocols, lease mechanisms, quorum, two‑phase commit, MVCC, Paxos, and the CAP theorem.


The article begins by defining a node as an operating-system process. In the presented model a node is treated as an indivisible whole; if a process consists of independent parts, it is modeled as multiple nodes instead.

It then enumerates common anomalies in distributed environments, including machine crashes, network partitions, message loss, out-of-order delivery, unreliable TCP connections, and data loss, and states a golden rule of fault handling: any failure mode considered during design will eventually occur in a running system and must be handled.

Replication is introduced as redundancy for data or services. Data replicas are persisted on multiple nodes to survive storage loss, while service replicas provide the same functionality without relying on local storage. The article stresses that replication protocols are the theoretical core of distributed systems.

Various consistency models are described: strong consistency (all reads see the latest write), monotonic consistency (reads never regress), session consistency (monotonic within a user session), eventual consistency (updates eventually converge), and weak consistency (no guarantees on read freshness). The trade‑offs among these models are discussed.

Key system metrics are outlined: performance (throughput, latency, concurrency), availability (service correctness under failures), scalability (linear performance growth with added machines), and consistency (impact of replication on user experience).

The article surveys data‑distribution techniques: simple hashing (with poor scalability), range‑based partitioning, chunk‑based distribution, and consistent hashing (including virtual nodes) that improve load balancing and scalability while requiring metadata management.
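
The consistent-hashing idea with virtual nodes can be sketched as follows. This is an illustrative toy, not any particular system's implementation; the class name, the MD5 hash choice, and the 100-virtual-node default are all assumptions for the example.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = {}          # hash position -> physical node
        self.sorted_keys = []   # sorted hash positions
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node owns `vnodes` positions on the ring,
        # which smooths load when nodes join or leave.
        for i in range(self.vnodes):
            self.ring[self._hash(f"{node}#{i}")] = node
        self.sorted_keys = sorted(self.ring)

    def remove_node(self, node):
        self.ring = {pos: n for pos, n in self.ring.items() if n != node}
        self.sorted_keys = sorted(self.ring)

    def locate(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect_right(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```

The payoff over simple hashing is that removing a node only remaps the keys that node owned; every other key keeps its placement.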

Replica placement strategies are examined, contrasting machine‑based replicas (simple but limited scalability) with segment‑based replicas that decouple replicas from physical machines, enabling efficient recovery and balanced load.

Localized computation is explained as a technique to co‑locate computation with its data, reducing network traffic under the principle “move computation, not data”.

Replica control protocols are classified into centralized and decentralized approaches. Centralized protocols use a single coordinator (e.g., primary‑secondary) to order updates and enforce consistency, but suffer from single‑point‑of‑failure downtime. Decentralized protocols achieve consensus without a coordinator, offering higher fault tolerance at the cost of complexity.

The primary‑secondary protocol workflow (update, concurrency control, propagation, and failure handling) is detailed, including optimizations such as relay updates and read‑only primary reads for strong consistency.
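
The core of the workflow, a single primary assigning a global order to updates and pushing them to secondaries, can be sketched in a few lines. The class names are invented for the example, and propagation is shown synchronously for clarity, whereas real systems often push asynchronously and handle secondary failures.

```python
class Replica:
    """Applies updates in the order assigned by the primary."""
    def __init__(self):
        self.log = []    # ordered update log: (seq, key, value)
        self.data = {}

    def apply(self, seq, key, value):
        self.log.append((seq, key, value))
        self.data[key] = value

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.seq = 0

    def update(self, key, value):
        self.seq += 1                      # single point of ordering
        self.apply(self.seq, key, value)
        for s in self.secondaries:         # propagate in sequence order
            s.apply(self.seq, key, value)
```

Because every update flows through one sequencer, all replicas replay the same log and converge; the price, as the article notes, is downtime while a failed primary is replaced.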

Lease mechanisms are presented as a way to grant time‑bounded rights to cache data, ensuring cache validity without continuous coordination and providing strong fault tolerance even under network partitions or node crashes.
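
A minimal sketch of the lease idea, under simplifying assumptions (a single server, perfectly synchronized clocks, and tiny TTLs so the demo runs quickly): reads hand out a time-bounded right to cache the value, and a write must wait out every outstanding lease before it commits, so no client can serve stale cached data.

```python
import time

class LeaseServer:
    """Toy lease-based cache-consistency sketch (names are illustrative)."""
    def __init__(self, ttl=2.0):
        self.ttl = ttl
        self.value = None
        self.lease_expiry = 0.0  # latest lease issued so far

    def read(self):
        # Issue a lease: the client may cache the value until expiry.
        self.lease_expiry = max(self.lease_expiry, time.time() + self.ttl)
        return self.value, self.lease_expiry

    def write(self, value):
        # Block the write until every outstanding lease has expired,
        # so no cached copy can outlive the old value.
        wait = self.lease_expiry - time.time()
        if wait > 0:
            time.sleep(wait)
        self.value = value
```

The fault tolerance the article highlights comes from the time bound itself: even if a caching client crashes or is partitioned away, the server only has to wait for the lease to expire, with no coordination required.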

Quorum techniques are explained: write operations succeed after being acknowledged by W out of N replicas, and reads from R replicas guarantee seeing the latest committed value when W+R > N. Variants like write‑all‑read‑one (WARO) and general quorum configurations are discussed.
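
The W + R > N argument can be made concrete with a small sketch (function names and the version-number scheme are assumptions of the example): because any W-replica write set and any R-replica read set must overlap in at least one replica, picking the highest-versioned response is guaranteed to surface the latest committed value.

```python
def quorum_write(replicas, W, version, value):
    """Write succeeds once W replicas acknowledge; the rest lag behind."""
    acks = 0
    for r in replicas:
        r["version"], r["value"] = version, value
        acks += 1
        if acks == W:
            break  # remaining replicas would be updated asynchronously

def quorum_read(replicas, R):
    """Read R replicas and return the highest-versioned value.
    With W + R > N, at least one of the R replicas holds the latest
    committed write."""
    responses = replicas[-R:]  # any R replicas; the last R here shows overlap
    return max(responses, key=lambda r: r["version"])["value"]
```

WARO is just the degenerate configuration W = N, R = 1: writes are expensive but any single replica serves a correct read.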

Logging techniques (redo logs, checkpoints, 0/1 directory structures) are described for crash recovery, illustrating how ordered log replay and periodic snapshots restore system state.
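 
The redo-log-plus-checkpoint recovery pattern can be sketched as below; the in-memory list stands in for an append-only log file, and the class name is invented for the example. Recovery loads the last snapshot and replays only the log entries appended after it.

```python
class RedoLogStore:
    """Toy KV store with write-ahead redo log and checkpoints (sketch)."""
    def __init__(self):
        self.data = {}
        self.log = []               # stands in for an append-only log file
        self.checkpoint = ({}, 0)   # (snapshot, log position)

    def put(self, key, value):
        self.log.append((key, value))  # write-ahead: log first...
        self.data[key] = value         # ...then apply to memory

    def take_checkpoint(self):
        # Periodic snapshot of state plus the log position it covers.
        self.checkpoint = (dict(self.data), len(self.log))

    def recover(self):
        # Crash recovery: restore the snapshot, then replay in log order.
        snapshot, pos = self.checkpoint
        self.data = dict(snapshot)
        for key, value in self.log[pos:]:
            self.data[key] = value
```

The checkpoint bounds recovery time: without it, the whole log would have to be replayed from the beginning.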

Two‑phase commit (2PC) is outlined as a classic centralized consensus protocol, highlighting its poor fault tolerance and performance due to multiple round‑trips and the need for all participants to agree.
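
The two rounds of 2PC can be sketched as follows (a simplified model: participant names are invented, and timeouts and coordinator crashes, which cause the protocol's well-known blocking behavior, are not simulated). A single "no" vote in phase one aborts everyone.

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared"   # locks would be held from here on
        return self.can_commit    # vote yes or no

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator sketch: phase 1 collects votes, phase 2 broadcasts."""
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision
```

The cost the article criticizes is visible in the shape of the code: two full round-trips to every participant, unanimity required, and resources held from `prepare` until the decision arrives.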

Multi‑Version Concurrency Control (MVCC) is introduced, showing how versioned data enables concurrent transactions and conflict detection, similar to version control systems.
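
A minimal MVCC sketch, assuming a single-process store with integer timestamps (all names here are illustrative): each transaction reads the snapshot as of its start, and a commit is rejected if another transaction has committed to one of its keys in the meantime, much like a rejected push in a version-control system.

```python
class MVCCStore:
    """Toy multi-version store with snapshot reads and conflict detection."""
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ts ascending
        self.ts = 0

    def begin(self):
        return self.ts  # transaction's snapshot timestamp

    def read(self, key, snapshot_ts):
        # Latest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def commit(self, snapshot_ts, writes):
        # Conflict: someone committed to one of our keys since we began.
        for key in writes:
            hist = self.versions.get(key, [])
            if hist and hist[-1][0] > snapshot_ts:
                return False
        self.ts += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.ts, value))
        return True
```

Keeping old versions is what lets readers proceed without blocking writers: a long-running reader simply keeps seeing its snapshot.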

Paxos, a proven decentralized consensus algorithm, is explained with its roles (proposer, acceptor, learner), preparation and acceptance phases, and the guarantee that only a single value can be chosen per instance, providing strong consistency with high availability.
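
Single-decree Paxos can be sketched as below, a simplified in-process model with no network, no failures, and the learner role folded into the proposer. The key safety rule appears in the proposer: if any acceptor in the prepare quorum has already accepted a value, the proposer must adopt the highest-numbered such value instead of its own, which is what makes a chosen value stable across competing proposers.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest proposal number promised
        self.accepted = (0, None)  # (proposal number, value) accepted so far

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals below n; report prior accept.
        if n > self.promised:
            self.promised = n
            return self.accepted
        return None  # reject

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered prepare was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Proposer sketch: prepare phase, then accept phase."""
    granted = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promise
    prior = max(granted, key=lambda p: p[0])
    if prior[1] is not None:
        value = prior[1]  # must re-propose an already-accepted value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None
```

Because both phases require majorities and any two majorities intersect, at most one value can ever be chosen per instance, even with multiple proposers racing.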

Finally, the CAP theorem is presented: a distributed system can simultaneously provide at most two of the three properties of consistency, availability, and partition tolerance. The article positions lease, quorum, 2PC, and Paxos within this trade-off space.

Tags: distributed systems, CAP theorem, replication, consistency, consensus
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.