Mastering Distributed Consistency: Real‑World Patterns and Protocols

This article examines the challenges of consistency in large‑scale distributed service systems, presents real‑world case studies such as payment transfers and order processing, and outlines practical patterns—including ACID/BASE theory, two‑phase and three‑phase commit, TCC, query, compensation, periodic reconciliation, and reliable messaging—to help engineers design robust, eventually consistent architectures.

Efficient Ops

Apr 6, 2017

1. Background

Consistency is an abstract computer‑science term with multiple meanings. In the traditional IT era it often meant strong consistency, where all components appear as a single unified entity. In the Internet era, massive data volumes and high throughput requirements force horizontal scaling, leading to service node pooling and the need for logical splitting—horizontal (multiple nodes sharing the same function) and vertical (decomposing a complex function into simple, single‑purpose modules). Both types of splitting introduce consistency challenges across distributed services.

2. Problems

Case 1: Buying a house

When partners have different preferences for the size of a house, the inconsistency leads to dissatisfaction and conflict, illustrating how inconsistency can cause serious issues even in daily life.

Case 2: Transfer

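A bank must debit one account and credit another as a single atomic operation. If the debit succeeds but the credit fails, the customer's money disappears; if the credit succeeds but the debit fails, the bank pays out money it never collected. Either way, funds are lost, potentially bankrupting a company.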

Case 3: Order and inventory deduction

If an order succeeds but inventory deduction fails, overselling occurs; if inventory is deducted but the order fails, underselling and extra costs arise.

Case 4: Synchronous timeout

Service A calls Service B and times out; Service A cannot know whether B completed the operation, leading to uncertainty.

Case 5: Asynchronous callback timeout

Service A receives no callback from Service B, leaving the system state ambiguous and risking data loss.

Case 6: Lost orders

When one system records an order but another does not, the order is lost, potentially causing financial loss.

Case 7: Inconsistent status between systems

Both systems have the request, but their statuses differ, leading to errors.

Case 8: Cache‑database inconsistency

High‑read traffic forces the use of caches; keeping cache and database consistent (strong vs weak) becomes a key challenge.

Case 9: Inconsistent local cache nodes

Multiple nodes hold copies of semi‑static data; updates cause temporary inconsistency across nodes.

Case 10: Inconsistent cache data structures

Partial failures when building a complex cached structure can cause downstream errors.

3. Patterns

3.1 Solving inconsistency in daily life

Two approaches: prevent inconsistency (e.g., align expectations before marriage) or compensate gradually (e.g., start with a smaller house and upgrade later). The same idea applies to distributed systems: avoid inconsistency when possible, otherwise use compensation.

3.2 ACID vs. BASE theory

ACID (Atomicity, Consistency, Isolation, Durability) describes the guarantees of a database transaction and is the basis of strong consistency in relational databases. Most relational databases (Oracle, MySQL, DB2) provide ACID transactions, which makes them suitable for transaction-heavy services. Many NoSQL stores relax these guarantees in exchange for scalability, so they are generally a poor fit for core financial transactions.

Atomicity – operations are all-or-nothing.

Consistency – database rules are preserved.

Isolation – concurrent transactions do not interfere.

Durability – once committed, data persists.
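Atomicity is the property that matters most for the transfer case, and it can be illustrated with a minimal sketch. The example uses Python's built-in sqlite3 module; the `accounts` table, the account names, and the `transfer` helper are assumptions made for illustration, not part of the original article.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` inside one local ACID transaction."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")


def balances(conn):
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)  # both updates commit together
try:
    # The debit succeeds, but the credit target does not exist.
    transfer(conn, "alice", "carol", 50)
except ValueError:
    pass  # the debit was rolled back together with the failed credit
```

Because both updates share one transaction, the failed second transfer leaves no partial debit behind. This only works while both accounts live in the same database, which is the limitation the rest of this section addresses.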

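When data is sharded across many nodes, a single local ACID transaction can no longer cover all of it, and strong consistency across nodes may be impossible or too costly. In such cases systems fall back on BASE (Basically Available, Soft state, Eventual consistency): they tolerate temporary inconsistency as long as the data eventually converges to a consistent state.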

3.3 Distributed consistency protocols

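The Open Group (formerly X/Open) defines a reference model for distributed transactions, which this article calls the Distributed Transaction Service (DTS) and which is commonly known as the X/Open Distributed Transaction Processing (DTP) model. It has four roles: application, transaction manager, resource manager, and communication resource manager. The TX interface sits between the application and the transaction manager, and the XA interface sits between the transaction manager and the resource managers. J2EE supports this model through JTA, whose XAResource interface is a Java mapping of XA.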

Two-phase commit (2PC) consists of a prepare phase, in which the coordinator asks every participant to vote and each participant writes redo/undo logs and locks its resources, and a commit phase, in which participants commit or roll back according to the coordinator's decision.

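2PC guarantees strong consistency but has three well-known weaknesses: participants block while holding locks until the coordinator decides, the coordinator is a single point of failure, and if the commit message reaches only some participants the data can diverge (the split-brain problem).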
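The control flow of 2PC can be sketched in a few lines. This is a simplified in-memory illustration, not a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical names, and logging, locking, networking, and coordinator recovery are all left out.

```python
class Participant:
    """Hypothetical in-memory stand-in for a resource manager."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # A real participant would write redo/undo logs and lock resources here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return the coordinator's decision."""
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only if every vote was yes, otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"


orders, inventory = Participant("orders"), Participant("inventory")
happy_path = two_phase_commit([orders, inventory])

orders2 = Participant("orders")
inventory2 = Participant("inventory", can_commit=False)  # votes no in phase 1
failed_path = two_phase_commit([orders2, inventory2])
```

The blocking weakness is visible in the structure: between `prepare` and the phase-2 call, a participant can only wait for the coordinator.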

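Three-phase commit (3PC) adds an inquiry phase before prepare, giving the phases CanCommit, PreCommit, and DoCommit. The inquiry detects participants that cannot proceed before any resources are locked, and timeouts on both the coordinator and the participants avoid indefinite blocking.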

3PC reduces blocking but still cannot guarantee consistency under extreme failures.

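TCC (Try-Confirm-Cancel) splits a transaction into three business-level operations: Try reserves the resources, Confirm commits the reservation, and Cancel releases it. If every Try succeeds, Confirm is executed on all participants; otherwise Cancel rolls back the participants that were already tried. TCC simplifies 2PC logic and provides automatic compensation, though extreme failures may still require manual intervention.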
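A minimal sketch of the TCC flow follows, assuming each service can reserve a quantity under a transaction ID. The `Reservable` class and `run_tcc` coordinator are hypothetical and kept in memory; a real implementation would persist reservations and retry Confirm and Cancel until they succeed.

```python
class Reservable:
    """Hypothetical service holding a quantity (money, stock) that can be reserved."""

    def __init__(self, available):
        self.available = available
        self.reserved = {}  # tx_id -> reserved amount

    def try_reserve(self, tx_id, amount):
        # Try: check and set the resource aside without consuming it for good.
        if amount > self.available:
            return False
        self.available -= amount
        self.reserved[tx_id] = amount
        return True

    def confirm(self, tx_id):
        # Confirm: consume the reservation. Safe to call more than once.
        self.reserved.pop(tx_id, None)

    def cancel(self, tx_id):
        # Cancel: release the reservation. Safe to call more than once.
        self.available += self.reserved.pop(tx_id, 0)


def run_tcc(tx_id, steps):
    """steps is a list of (service, amount) pairs taking part in one transaction."""
    tried = []
    for service, amount in steps:
        if service.try_reserve(tx_id, amount):
            tried.append(service)
        else:
            for s in tried:  # undo only the reservations that were made
                s.cancel(tx_id)
            return "cancelled"
    for service, _ in steps:
        service.confirm(tx_id)
    return "confirmed"


wallet, stock = Reservable(100), Reservable(1)
first = run_tcc("tx-1", [(wallet, 40), (stock, 1)])   # both Try calls succeed
second = run_tcc("tx-2", [(wallet, 40), (stock, 1)])  # stock is exhausted
```

In the second transaction the wallet reservation succeeds and is then released by Cancel, so the wallet balance is unchanged by the failed attempt.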

3.4 Practical patterns

Query mode

Every service operation provides a query interface (by unique request ID or order number) so callers can check the current state and decide next steps. Single‑record queries are preferred; batch queries must be paginated and rate‑limited.
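The sketch below shows how a query interface resolves the synchronous-timeout case described earlier. `PaymentService`, its status strings, and the `simulate_timeout` flag are hypothetical and exist only to make the example runnable.

```python
class PaymentService:
    """Hypothetical downstream service exposing a query-by-request-ID interface."""

    def __init__(self):
        self._status = {}  # request_id -> status

    def pay(self, request_id, amount, simulate_timeout=False):
        # Idempotent: repeating a request ID does not pay twice.
        self._status.setdefault(request_id, "SUCCESS")
        if simulate_timeout:
            # The work was done, but the response never reaches the caller.
            raise TimeoutError("response lost")
        return self._status[request_id]

    def query(self, request_id):
        return self._status.get(request_id, "NOT_FOUND")


def pay_with_query(service, request_id, amount, **kwargs):
    """Call the service; on timeout, ask for the real outcome instead of guessing."""
    try:
        return service.pay(request_id, amount, **kwargs)
    except TimeoutError:
        return service.query(request_id)


service = PaymentService()
outcome = pay_with_query(service, "req-1", 10, simulate_timeout=True)
unknown = service.query("req-2")  # a request the service never received
```

The caller learns that `req-1` actually succeeded despite the timeout, and a `NOT_FOUND` answer tells it that a request can safely be sent again.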

Compensation mode

Based on query results, failed or unknown sub‑operations are either retried or rolled back, restoring overall consistency. Compensation can be automatic, operator‑driven, or require developer intervention.
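As a rough sketch of automatic compensation for the order and inventory case: query the inventory state, retry the deduction a bounded number of times, and roll the order back if retries are exhausted. All function names, status strings, and the retry limit are assumptions for illustration.

```python
def compensate(order_id, query_inventory, deduct_inventory, cancel_order, max_retries=3):
    """Bring order and inventory back in line, retrying before rolling back."""
    for _ in range(max_retries):
        if query_inventory(order_id) == "DEDUCTED":
            return "consistent"
        try:
            deduct_inventory(order_id)  # retry the failed sub-operation
        except Exception:
            # Sketch only: any failure counts as a failed attempt.
            continue
    if query_inventory(order_id) == "DEDUCTED":
        return "consistent"
    cancel_order(order_id)  # retries exhausted: roll the order back
    return "rolled_back"


inventory = {}
attempts = {"count": 0}
cancelled = []


def query(order_id):
    return inventory.get(order_id, "UNKNOWN")


def flaky_deduct(order_id):
    # Fails twice, then succeeds, to imitate a transient outage.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("inventory service unavailable")
    inventory[order_id] = "DEDUCTED"


def always_down(order_id):
    raise ConnectionError("inventory service unavailable")


recovered = compensate("order-1", query, flaky_deduct, cancelled.append)
rolled_back = compensate("order-2", query, always_down, cancelled.append)
```

If the rollback itself can fail, the case would need to be escalated to an operator, which corresponds to the operator-driven and developer-driven levels mentioned above.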

Asynchronous assurance mode

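Operations that are not latency-sensitive are off-loaded to asynchronous tasks: the request is accepted and queued immediately, processed in the background, and the result is reported to the caller later by notification or callback. This smooths traffic spikes in e-commerce and payment systems.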
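A minimal sketch of this mode uses an in-process queue and a worker thread from Python's standard library. In practice the queue would usually be a durable message broker; `settle`, `notify`, and `accept` are hypothetical names.

```python
import queue
import threading


def start_worker(tasks, handle, notify):
    """Process queued tasks in the background and report each result via notify."""

    def loop():
        while True:
            task = tasks.get()
            if task is None:  # sentinel: stop the worker
                tasks.task_done()
                break
            request_id, payload = task
            try:
                status, result = "SUCCESS", handle(payload)
            except Exception as exc:
                status, result = "FAILED", str(exc)
            notify(request_id, status, result)
            tasks.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker


def settle(amount):
    """Hypothetical slow business operation."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return f"settled {amount}"


results = {}
tasks = queue.Queue()


def notify(request_id, status, result):
    results[request_id] = (status, result)


def accept(request_id, payload):
    """Return immediately; the real work happens later on the worker."""
    tasks.put((request_id, payload))
    return "ACCEPTED"


worker = start_worker(tasks, settle, notify)
ack = accept("req-1", 21)
accept("req-2", -5)
tasks.join()     # wait until both tasks are processed
tasks.put(None)  # shut the worker down
worker.join(timeout=1)
```

The caller only ever sees `ACCEPTED` synchronously; the final status arrives through `notify`, so this mode is normally combined with the query mode for cases where a notification is lost.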

Periodic reconciliation mode

Background jobs periodically compare states across services using globally unique IDs (e.g., Snowflake) and reconcile differences via compensation.
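The comparison step of a reconciliation job can be sketched as a diff over two snapshots keyed by the shared ID. The snapshot shape (`{id: status}`) and the three result buckets are assumptions made for this example.

```python
def reconcile(orders, payments):
    """Compare two {id: status} snapshots and report every difference."""
    order_ids, payment_ids = set(orders), set(payments)
    return {
        # Recorded by the order system only: candidates for lost payments.
        "missing_in_payments": sorted(order_ids - payment_ids),
        # Recorded by the payment system only: candidates for lost orders.
        "missing_in_orders": sorted(payment_ids - order_ids),
        # Known to both systems, but with different statuses.
        "mismatched": sorted(
            i for i in order_ids & payment_ids if orders[i] != payments[i]
        ),
    }


orders = {"1001": "PAID", "1002": "PAID", "1003": "CANCELLED"}
payments = {"1001": "PAID", "1002": "REFUNDED", "1004": "PAID"}
report = reconcile(orders, payments)
```

Each bucket maps to one of the earlier cases (lost orders, inconsistent status), and each entry would then be handed to a compensation routine.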

Reliable messaging mode

Messages are persisted before sending (either in the business DB or a dedicated message store) and retried until acknowledged. This gives at-least-once delivery, which means a message may arrive more than once, so consumers must process messages idempotently.
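One way to sketch this mode is to store the message in the business database in the same transaction as the business write, then relay it until it is acknowledged. The example uses sqlite3 from the standard library; the table names, the `relay` function, and the `Consumer` class are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY);
    CREATE TABLE outbox (msg_id TEXT PRIMARY KEY, body TEXT, sent INTEGER DEFAULT 0);
""")


def place_order(order_id):
    # The business row and the message row commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?)", (order_id,))
        conn.execute(
            "INSERT INTO outbox (msg_id, body) VALUES (?, ?)",
            (f"order-created:{order_id}", order_id),
        )


def relay(send):
    """Send every unsent message; mark it sent only after an acknowledgement."""
    rows = conn.execute("SELECT msg_id, body FROM outbox WHERE sent = 0").fetchall()
    for msg_id, body in rows:
        if send(msg_id, body):
            with conn:
                conn.execute("UPDATE outbox SET sent = 1 WHERE msg_id = ?", (msg_id,))


class Consumer:
    """Idempotent consumer: a redelivered message ID is acknowledged but not reprocessed."""

    def __init__(self, drop_acks=0):
        self.seen = set()
        self.handled = []
        self.drop_acks = drop_acks  # number of acks to "lose", for the demo

    def receive(self, msg_id, body):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.handled.append(body)
        if self.drop_acks > 0:
            self.drop_acks -= 1
            return False  # processed, but the ack never reaches the sender
        return True


place_order("o1")
consumer = Consumer(drop_acks=1)
relay(consumer.receive)  # delivered, ack lost, message stays unsent
relay(consumer.receive)  # redelivered, deduplicated, acknowledged
sent_flags = conn.execute("SELECT sent FROM outbox").fetchall()
```

The lost acknowledgement causes a second delivery, and the consumer's deduplication by message ID keeps the order from being handled twice.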

Cache consistency model

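For high-read workloads, use a distributed cache, warm it fully before taking traffic, and keep it in step with the database by writing the database first and then updating or invalidating the cache (the article calls this write-through), or by loading from the database on a cache miss (read-through). This achieves weak consistency without sacrificing performance.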
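The database-first ordering can be sketched with two dictionaries standing in for the database and the cache. `CachedStore` is a hypothetical class, and invalidating on write (rather than updating the cached value) is one choice among several.

```python
class CachedStore:
    """Hypothetical store: a dict plays the database, another dict plays the cache."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.db_reads = 0  # counts how often the database is actually hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: load from the database and populate the cache.
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.db[key] = value       # write the database first
        self.cache.pop(key, None)  # then invalidate the cached copy


store = CachedStore({"sku-1": 10})
first_read = store.get("sku-1")   # miss: goes to the database
second_read = store.get("sku-1")  # hit: served from the cache
store.put("sku-1", 9)
after_write = store.get("sku-1")  # miss again: the new value is reloaded
```

Between the database write and the invalidation, a concurrent reader can still see the old cached value, which is the window of weak consistency this model accepts.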

4. Summary

The article distills practical experience from large‑scale service‑oriented systems, enumerates concrete inconsistency scenarios, and abstracts them into reusable patterns—query, compensation, periodic reconciliation, reliable messaging, and cache strategies—providing engineers with a toolbox to achieve eventual consistency in distributed architectures.
