Why Unreliable Clocks Threaten Distributed Systems—and How to Fix Them
This article examines how unreliable physical clocks—both wall and monotonic—affect distributed systems, compares synchronous and asynchronous network timing, illustrates conflicts caused by timestamp drift, and presents logical clocks and Google’s TrueTime as robust solutions for achieving consistent ordering and data reliability.
Comparison of Synchronous and Asynchronous Network Clocks
In a synchronous network, nodes share a global clock that is kept coordinated (for example, via NTP), so timing is bounded. In an asynchronous network, each node runs its own independent clock, so message latency is unpredictable and clocks drift apart.
Unreliable Monotonic and Wall Clocks
Wall clocks (e.g., Java's System.currentTimeMillis()) report an absolute point in time and are kept in sync by NTP. Because they drift and NTP corrections can move them backward, they are unsuitable for measuring durations. Monotonic clocks (e.g., Java's System.nanoTime()) only move forward and are appropriate for measuring elapsed time within a single process, but their values are not comparable across machines.
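Physical clocks are unreliable for several reasons:
Quartz oscillators drift, so a node's clock gradually runs fast or slow; when NTP corrects the error, the wall clock can jump backward.
A large offset between the local clock and the NTP server can force a clock reset, which makes timeouts or leases appear to expire when they have not.
A misconfigured NTP service, or network latency on the path to it, further limits how accurately a node can synchronize.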
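The difference between the two clock types is easiest to see when timing a task. The following is a minimal sketch, not a benchmark harness: ElapsedTimer and measureNanos are names invented for this illustration, and only System.nanoTime() and System.currentTimeMillis() come from the article.

```java
import java.util.concurrent.TimeUnit;

// Illustrative helper (hypothetical name): times a task with the monotonic clock.
final class ElapsedTimer {

    // Runs the task and returns its duration in nanoseconds.
    // System.nanoTime() has an arbitrary origin, so only the difference
    // between two readings taken in the same JVM is meaningful.
    static long measureNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long wallStart = System.currentTimeMillis();

        long elapsedNanos = measureNanos(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        long wallElapsedMillis = System.currentTimeMillis() - wallStart;

        // Monotonic reading: unaffected by NTP stepping the wall clock.
        System.out.println("monotonic elapsed ms: "
                + TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
        // Wall-clock reading: can be too large, too small, or even negative
        // if the system clock is adjusted while the task runs.
        System.out.println("wall-clock elapsed ms: " + wallElapsedMillis);
    }
}
```

On a healthy machine both numbers will usually agree. The point is that only the monotonic figure stays trustworthy if the wall clock is adjusted mid-measurement.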
Problems Caused by Clock Dependence in Distributed Systems
Using timestamps for ordering can produce anomalies such as lost writes under the "last-write-wins" (LWW) rule when node clocks are not synchronized.
Example: two clients write conflicting values to different nodes, which stamp them 42.004 s and 42.003 s. If the node that stamped 42.003 s has a lagging clock, its write may actually be the more recent one, yet any replica that receives both keeps the 42.004 s write and silently discards the other, so data is lost.
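The anomaly can be reproduced in a few lines. This is a simplified sketch under stated assumptions: LwwRegisterDemo, Replica and Versioned are hypothetical names, timestamps are plain milliseconds (42.004 s becomes 42004), and the replica applies the simplest possible LWW rule.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of last-write-wins conflict resolution with skewed clocks.
final class LwwRegisterDemo {

    // A value tagged with the wall-clock timestamp of the node that accepted it.
    static final class Versioned {
        final String value;
        final long timestampMillis;

        Versioned(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // A replica that keeps, per key, only the write with the highest timestamp.
    static final class Replica {
        private final Map<String, Versioned> store = new HashMap<>();

        // Returns true if the write was applied, false if LWW discarded it.
        boolean apply(String key, Versioned incoming) {
            Versioned current = store.get(key);
            if (current != null && incoming.timestampMillis <= current.timestampMillis) {
                return false; // older (or equal) timestamp loses, silently
            }
            store.put(key, incoming);
            return true;
        }

        String get(String key) {
            Versioned v = store.get(key);
            return v == null ? null : v.value;
        }
    }

    public static void main(String[] args) {
        Replica replica = new Replica();

        // Client A's write, stamped 42.004 s by a node whose clock runs ahead.
        boolean firstApplied = replica.apply("x", new Versioned("1", 42_004L));
        // Client B's write happens later in real time, but its node's clock
        // lags, so it is stamped 42.003 s.
        boolean secondApplied = replica.apply("x", new Versioned("2", 42_003L));

        System.out.println("first applied:  " + firstApplied);   // true
        System.out.println("second applied: " + secondApplied);  // false
        System.out.println("x = " + replica.get("x"));           // x = 1
    }
}
```

The second write is dropped without any error being reported to client B, which is what makes this failure mode hard to detect in production.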
Logical Clocks as a Solution
Logical clocks replace physical time with monotonically increasing counters that capture the causal order of events, which makes them safer for conflict resolution.
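One well-known logical clock is the Lamport clock. The article does not say which variant it has in mind, so the following is only a minimal sketch of that one technique; the class and method names are chosen for this illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport clock sketch: a per-node counter, no physical time involved.
final class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // Local event or message send: advance the counter and return the new
    // value, which the sender attaches to the outgoing message.
    long tick() {
        return counter.incrementAndGet();
    }

    // Message receive: move past both the local counter and the sender's
    // timestamp, so the receive is ordered after the send.
    long receive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    long current() {
        return counter.get();
    }

    public static void main(String[] args) {
        LamportClock nodeA = new LamportClock();
        LamportClock nodeB = new LamportClock();

        nodeA.tick();                       // A: 1
        nodeA.tick();                       // A: 2
        long sent = nodeA.tick();           // A: 3, attached to a message

        long received = nodeB.receive(sent); // B: max(0, 3) + 1 = 4
        long next = nodeB.tick();            // B: 5

        System.out.println("sent=" + sent + " received=" + received + " next=" + next);
        // prints: sent=3 received=4 next=5
    }
}
```

If event a causally precedes event b, a's timestamp is smaller than b's. Two unrelated events can still receive the same number, so implementations commonly append a node ID as a tie-breaker to obtain a total order.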
TrueTime and Global Snapshot Clocks
Google Spanner uses the TrueTime API, which reports a confidence interval for the wall clock, allowing the system to wait out uncertainty and achieve globally consistent snapshot isolation.
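If two intervals do not overlap, the order of the two events is unambiguous.
If they overlap, the order is uncertain, so Spanner waits for the length of the uncertainty interval before committing. Google keeps that interval small, about 7 ms, by deploying GPS receivers or atomic clocks in its datacenters.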
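The interval comparison and the commit wait can be sketched as follows. This is not the TrueTime API, which is internal to Google: Interval, now and commitWait are hypothetical names, and the uncertainty bound is simply assumed around the local wall clock rather than derived from GPS or atomic clock hardware.

```java
// Hypothetical sketch of TrueTime-style uncertainty intervals and commit wait.
final class UncertaintyIntervalDemo {

    // A clock reading expressed as [earliest, latest] instead of a single instant.
    static final class Interval {
        final long earliestMillis;
        final long latestMillis;

        Interval(long earliestMillis, long latestMillis) {
            if (earliestMillis > latestMillis) {
                throw new IllegalArgumentException("earliest must not exceed latest");
            }
            this.earliestMillis = earliestMillis;
            this.latestMillis = latestMillis;
        }

        // True only when this interval ends strictly before the other begins.
        boolean definitelyBefore(Interval other) {
            return latestMillis < other.earliestMillis;
        }

        // Overlapping intervals mean the order of the two events is unknown.
        boolean overlaps(Interval other) {
            return !definitelyBefore(other) && !other.definitelyBefore(this);
        }
    }

    // Assumption: the true time lies within +/- uncertaintyMillis of the local clock.
    static Interval now(long uncertaintyMillis) {
        long t = System.currentTimeMillis();
        return new Interval(t - uncertaintyMillis, t + uncertaintyMillis);
    }

    // Commit wait: take the latest bound as the commit timestamp, then block
    // until even the earliest bound of a fresh reading has passed it.
    static long commitWait(long uncertaintyMillis) {
        long commitTimestamp = now(uncertaintyMillis).latestMillis;
        while (now(uncertaintyMillis).earliestMillis <= commitTimestamp) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during commit wait", e);
            }
        }
        return commitTimestamp;
    }

    public static void main(String[] args) {
        Interval a = new Interval(100, 107);
        Interval b = new Interval(110, 117);
        Interval c = new Interval(105, 112);

        System.out.println(a.definitelyBefore(b)); // true: no overlap, order is clear
        System.out.println(a.overlaps(c));         // true: order is uncertain

        long ts = commitWait(7); // blocks for roughly twice the uncertainty bound
        System.out.println("committed at " + ts);
    }
}
```

The wait costs about twice the uncertainty bound per commit in this sketch, which is why a small bound matters: the tighter the interval, the less latency the system pays for an unambiguous order.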
Summary of Distributed‑System Clock Issues
When designing distributed systems, one must assume that any node may pause, clocks may drift, and network partitions may occur; robust designs must account for these factors to ensure data reliability.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.