Why Unreliable Clocks Threaten Distributed Systems—and How to Fix Them
This article examines the unreliability of physical clocks in distributed systems, compares synchronous and asynchronous network timing, explains the roles of wall and monotonic clocks, and explores logical clocks, snapshot isolation, and practical solutions such as Google Spanner's TrueTime to ensure data consistency.
Comparison of Synchronous and Asynchronous Network Clocks
A synchronous network shares a global clock, so message delay is bounded and known. In an asynchronous network, each node runs its own independent clock and relies on NTP to correct drift. Because NTP itself runs over a network with variable delay, the correction can only be as accurate as that delay allows, so some skew between nodes always remains.
In synchronous transmission, a synchronization pulse precedes the data. In asynchronous transmission, start and stop bits mark the boundaries of each unit of data so the receiver can frame it without a shared clock.
Monotonic Clock vs. Wall Clock
Modern computers expose two types of clocks:
Wall clock : Returns the current date and time (e.g., Java System.currentTimeMillis() ). It is suitable for timestamps but not for measuring durations, because hardware drift skews it and an NTP correction can step it forward or backward between two readings.
Monotonic clock : Used for measuring elapsed time (e.g., Java System.nanoTime() ). It only moves forward, so the difference between two readings is a reliable duration within a single process. Its absolute value is arbitrary and it is not synchronized across machines, so readings from different nodes cannot be compared.
Wall clocks can suffer from quartz drift, large NTP offsets, misconfiguration, or network latency, making them unreliable for ordering events in distributed systems.
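The difference between the two clocks can be sketched in Python, where time.monotonic() plays the role of the monotonic clock. The helper names measure_elapsed and unsafe_elapsed are hypothetical, and the wall-clock readings are made-up values that simulate NTP stepping the clock backward during the measured work.

```python
import time


def measure_elapsed(work):
    """Time a callable with the monotonic clock, which never moves backward."""
    start = time.monotonic()
    work()
    return time.monotonic() - start


def unsafe_elapsed(start_wall, end_wall):
    """Subtract two wall-clock readings; the result is negative if the
    clock was stepped backward between them."""
    return end_wall - start_wall


# Duration measured with the monotonic clock: always >= 0.
elapsed = measure_elapsed(lambda: sum(range(100_000)))

# Simulated wall-clock readings (illustrative values, not real measurements):
# the clock was stepped backward while the work was running.
skewed = unsafe_elapsed(start_wall=1_700_000_010.0, end_wall=1_700_000_008.5)
```

The monotonic measurement can never be negative, while the wall-clock subtraction reports a negative duration as soon as the clock is adjusted backward.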
Unreliable Clocks Cause Distributed System Problems
Using timestamps for event ordering can lead to out‑of‑order writes. For example, with a Last‑Write‑Wins (LWW) policy, differing node clocks may cause a newer write to be discarded, resulting in data loss.
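A minimal sketch of this failure, assuming a hypothetical LwwRegister that keeps only the write with the largest timestamp. The timestamps and the 5 ms skew are illustrative values chosen to make the effect visible.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_ms: int  # wall-clock timestamp assigned by the writing node


class LwwRegister:
    """Hypothetical last-write-wins register: the largest timestamp wins."""

    def __init__(self):
        self.current = None

    def apply(self, write):
        # Accept the write only if its timestamp is newer than the stored one;
        # otherwise it is dropped silently.
        if self.current is None or write.timestamp_ms > self.current.timestamp_ms:
            self.current = write
            return True
        return False


register = LwwRegister()

# Node A's clock runs 5 ms fast; node B's clock is accurate.
first = Write("x=1", timestamp_ms=1000 + 5)  # real time 1000, written on node A
second = Write("x=2", timestamp_ms=1002)     # real time 1002, written on node B

accepted_first = register.apply(first)
accepted_second = register.apply(second)  # later in real time, older by timestamp
```

The second write happened later in real time, yet it is discarded because node A's fast clock gave the earlier write a larger timestamp. No error is reported to the client.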
Logical clocks are based on incrementing counters rather than physical time. They do not report the time of day; they only capture the relative order of events, which makes them a safe way to order events without relying on unreliable wall clocks.
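One well-known form of logical clock is the Lamport clock. The article does not say which variant it has in mind, so the sketch below is an illustration of the general idea rather than the author's design. Each node keeps a counter, increments it on every local event, and on receiving a message advances past the counter carried in that message.

```python
class LamportClock:
    """Minimal Lamport clock: a per-node counter merged on message receipt."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # Local event: advance the counter by one.
        self.counter += 1
        return self.counter

    def send(self):
        # Sending is an event; the returned value travels with the message.
        return self.tick()

    def receive(self, remote_counter):
        # Jump past both the local and the remote counter so the receive
        # event is ordered after the send event.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


a, b = LamportClock(), LamportClock()
a.tick()                    # a.counter == 1
a.tick()                    # a.counter == 2
sent = a.send()             # a.counter == 3, attached to the message
received = b.receive(sent)  # b.counter == max(0, 3) + 1 == 4
```

The receive event always gets a larger counter than the send event that caused it, regardless of what either node's wall clock says.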
Logical Clock vs. Physical Clock in Distributed Databases
Logical clocks require persistence and causal tracking, making them more complex than physical clocks, but they avoid the pitfalls of clock drift.
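A stop-the-world (STW) pause, such as a garbage-collection pause, can make writes unsafe: a master node may confirm that its lease is still valid, pause, and resume only after the lease has expired, still believing it is the master and completing the write.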
Relying on wall‑clock timestamps for lease expiration can cause premature lease loss or split‑brain scenarios across regions.
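The check-then-act race behind this problem can be simulated deterministically. The sketch below uses a hypothetical FakeClock and Lease, not a real lease protocol, and the durations are arbitrary illustrative values.

```python
class FakeClock:
    """Hypothetical controllable clock so a pause can be simulated."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    def __init__(self, clock, duration):
        self.clock = clock
        self.expires_at = clock.now + duration

    def is_valid(self):
        return self.clock.now < self.expires_at


def write_if_leader(lease, pause):
    """Check the lease, then act. Returns (check passed, valid at write time)."""
    checked = lease.is_valid()
    pause()  # stands in for a stop-the-world pause between check and write
    return checked, lease.is_valid()


clock = FakeClock()
lease = Lease(clock, duration=10.0)
clock.advance(9.0)  # 1 second of lease remaining when the check runs

# The process pauses for 15 seconds after the check succeeds.
checked, valid_at_write = write_if_leader(lease, pause=lambda: clock.advance(15.0))
```

The check passes, but by the time the write would execute the lease has long expired. The node has no way to notice this unless it re-validates, or unless the storage layer rejects writes from stale lease holders.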
Global Snapshot and Synchronous Clocks
Achieving a globally monotonic counter across data centers is challenging; naive locking hurts performance. Snapshot isolation offers a balance, but transaction IDs must increase monotonically to maintain consistency.
Google Spanner solves this by using the TrueTime API, which reports a confidence interval for the wall clock, allowing the system to wait out the uncertainty before committing transactions.
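The idea of waiting out the uncertainty can be sketched with a simulated clock. SimulatedTrueTime below is a hypothetical stand-in and not Google's API; the fixed uncertainty of 4 ms and the integer-millisecond time base are assumptions made to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: int  # milliseconds
    latest: int    # milliseconds


class SimulatedTrueTime:
    """Hypothetical stand-in for a TrueTime-style clock with fixed uncertainty."""

    def __init__(self, epsilon_ms):
        self.epsilon_ms = epsilon_ms
        self.true_time_ms = 100_000  # simulated "real" time

    def now(self):
        # The real time is guaranteed to lie inside [earliest, latest].
        return TTInterval(self.true_time_ms - self.epsilon_ms,
                          self.true_time_ms + self.epsilon_ms)

    def sleep(self, ms):
        self.true_time_ms += ms


def commit(tt):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    waited_ms = 0
    # Commit wait: keep waiting while the timestamp could still be in the future.
    while tt.now().earliest <= commit_ts:
        tt.sleep(1)
        waited_ms += 1
    return commit_ts, waited_ms


tt = SimulatedTrueTime(epsilon_ms=4)
commit_ts, waited_ms = commit(tt)
```

In this simulation the wait is slightly more than twice the uncertainty, which shows why a tighter confidence interval translates directly into lower commit latency.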
Summary of Distributed System Clock Issues
When designing distributed systems, engineers must assume any node may pause at any time, and they must account for clock drift, synchronization delays, and lease expiration to ensure data reliability and consistency.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.