Mastering Distributed Consistency: Real‑World Patterns and Protocols
This article examines the challenges of consistency in large‑scale distributed service systems, presents real‑world case studies such as payment transfers and order processing, and outlines practical patterns—including ACID/BASE theory, two‑phase and three‑phase commit, TCC, query, compensation, periodic reconciliation, and reliable messaging—to help engineers design robust, eventually consistent architectures.
1. Background
Consistency is an abstract computer‑science term with multiple meanings. In the traditional IT era it often meant strong consistency, where all components appear as a single unified entity. In the Internet era, massive data volumes and high throughput requirements force horizontal scaling, leading to service node pooling and the need for logical splitting—horizontal (multiple nodes sharing the same function) and vertical (decomposing a complex function into simple, single‑purpose modules). Both types of splitting introduce consistency challenges across distributed services.
2. Problems
Case 1: Buying a house
When partners have different preferences for the size of a house, the inconsistency leads to dissatisfaction and conflict, illustrating how inconsistency can cause serious issues even in daily life.
Case 2: Transfer
A bank must debit one account and credit another atomically; if either step fails, funds are lost, potentially bankrupting a company.
Case 3: Order and inventory deduction
If an order succeeds but inventory deduction fails, overselling occurs; if inventory is deducted but the order fails, underselling and extra costs arise.
Case 4: Synchronous timeout
Service A calls Service B and times out; Service A cannot know whether B completed the operation, leading to uncertainty.
Case 5: Asynchronous callback timeout
Service A receives no callback from Service B, leaving the system state ambiguous and risking data loss.
Case 6: Lost orders
When one system records an order but another does not, the order is lost, potentially causing financial loss.
Case 7: Inconsistent status between systems
Both systems have the request, but their statuses differ, leading to errors.
Case 8: Cache‑database inconsistency
High‑read traffic forces the use of caches; keeping cache and database consistent (strong vs weak) becomes a key challenge.
Case 9: Inconsistent local cache nodes
Multiple nodes hold copies of semi‑static data; updates cause temporary inconsistency across nodes.
Case 10: Inconsistent cache data structures
Partial failures when building a complex cached structure can cause downstream errors.
3. Patterns
3.1 Solving inconsistency in daily life
Two approaches: prevent inconsistency (e.g., align expectations before marriage) or compensate gradually (e.g., start with a smaller house and upgrade later). The same idea applies to distributed systems: avoid inconsistency when possible, otherwise use compensation.
3.2 ACID vs. BASE theory
ACID (Atomicity, Consistency, Isolation, Durability) guarantees strong consistency for relational databases. Most relational databases (Oracle, MySQL, DB2) provide ACID properties, making them suitable for transaction‑heavy services. NoSQL is generally unsuitable for core financial transactions.
Atomicity – operations are all‑or‑nothing Consistency – database rules are preserved Isolation – concurrent transactions do not interfere Durability – once committed, data persists
When data is sharded across many nodes, strong consistency may be impossible; in such cases, eventual consistency (BASE) is used.
3.3 Distributed consistency protocols
The Open Group defines a Distributed Transaction Service (DTS) with four roles: application, transaction manager, resource manager, and communication manager. J2EE implements DTS via XA and TX protocols.
Two‑phase commit (2PC) consists of a prepare phase (participants write redo/undo logs and lock resources) and a commit phase (participants finalize or abort based on coordinator decision). Diagram:
2PC guarantees strong consistency but suffers from blocking, single‑point‑of‑failure, and split‑brain problems.
Three‑phase commit (3PC) adds an inquiry phase to detect impossibility early and uses timeouts to avoid indefinite blocking. Diagram:
3PC reduces blocking but still cannot guarantee consistency under extreme failures.
TCC (Try‑Confirm‑Cancel) splits a transaction into Try, Confirm, and Cancel phases. If Try succeeds, Confirm is executed; otherwise Cancel rolls back. TCC simplifies 2PC logic and provides automatic compensation, though extreme failures may still require manual intervention. Diagram:
3.4 Practical patterns
Query mode
Every service operation provides a query interface (by unique request ID or order number) so callers can check the current state and decide next steps. Single‑record queries are preferred; batch queries must be paginated and rate‑limited.
Compensation mode
Based on query results, failed or unknown sub‑operations are either retried or rolled back, restoring overall consistency. Compensation can be automatic, operator‑driven, or require developer intervention.
Asynchronous assurance mode
Low‑latency operations are off‑loaded to asynchronous tasks; results are later notified to the caller. This smooths traffic spikes in e‑commerce and payment systems.
Periodic reconciliation mode
Background jobs periodically compare states across services using globally unique IDs (e.g., Snowflake) and reconcile differences via compensation.
Reliable messaging mode
Messages are persisted before sending (either in the business DB or a dedicated message store) and retried until acknowledged, ensuring at‑least‑once delivery while maintaining idempotency.
Cache consistency model
For high‑read workloads, use distributed caches, fully populate caches, and adopt a write‑through (DB first, cache later) or read‑through strategy to achieve weak consistency without sacrificing performance.
4. Summary
The article distills practical experience from large‑scale service‑oriented systems, enumerates concrete inconsistency scenarios, and abstracts them into reusable patterns—query, compensation, periodic reconciliation, reliable messaging, and cache strategies—providing engineers with a toolbox to achieve eventual consistency in distributed architectures.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.