
Why Unreliable Clocks Threaten Distributed Systems—and How to Fix Them

This article examines the unreliability of physical clocks in distributed systems, compares synchronous and asynchronous network timing, explains the roles of wall and monotonic clocks, and explores logical clocks, snapshot isolation, and practical solutions such as Google Spanner's TrueTime to ensure data consistency.

Xiaokun's Architecture Exploration Notes

Comparison of Synchronous and Asynchronous Network Clocks

In a synchronous network, all nodes share a global clock. In an asynchronous network, each node runs its own independent clock and relies on NTP to correct drift. Because NTP itself runs over the network, variable network delay limits how accurately those clocks can be synchronized.

In synchronous transmission, the sender first transmits a synchronization pulse and then the data. In asynchronous transmission, each unit of data is framed by start and stop bits so the receiver can tell where it begins and ends.

Monotonic Clock vs. Wall Clock

Modern computers expose two types of clocks:

Wall clock: Returns the current date and time (e.g., Java System.currentTimeMillis()). It is suitable for timestamps but not for measuring durations, because hardware drift and NTP corrections can make it jump forward or backward.
Monotonic clock: Measures elapsed time (e.g., Java System.nanoTime()). It only moves forward, so it is reliable for measuring durations within a single process, but its absolute value has no meaning and it is not synchronized across machines.
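
The difference between the two clocks can be sketched in a few lines of Java. This is a minimal illustration, not a benchmark harness; the class name ClockKinds and its methods are hypothetical names chosen for this example.

```java
import java.time.Duration;
import java.time.Instant;

final class ClockKinds {
    // Wall clock: a point in calendar time. It may jump forward or backward
    // if the operating system clock is adjusted (for example by NTP).
    static Instant wallClockNow() {
        return Instant.ofEpochMilli(System.currentTimeMillis());
    }

    // Monotonic clock: only the difference between two readings is meaningful.
    // System.nanoTime() is not affected by wall-clock adjustments.
    static Duration timeIt(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }
}
```

Use wallClockNow() when a value must be shown to a human or stored as a timestamp, and timeIt(...) when measuring timeouts or latencies.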

Wall clocks can suffer from quartz drift, large NTP offsets, misconfiguration, or network latency, making them unreliable for ordering events in distributed systems.

Unreliable Clocks Cause Distributed System Problems

Using wall-clock timestamps to order events can produce out-of-order writes. For example, under a Last-Write-Wins (LWW) policy, if one node's clock lags behind another's, a write that actually happened later can carry an earlier timestamp and be silently discarded, resulting in data loss.
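
A small sketch makes the failure concrete. It assumes a single register that keeps whichever write carries the larger timestamp. The class names LwwRegister and LwwSkewDemo and the timestamps are hypothetical values invented for illustration.

```java
final class LwwRegister {
    private String value;
    private long timestampMillis = Long.MIN_VALUE;

    // Keeps the write only if its timestamp is newer than the stored one.
    // Returns true if the write was kept, false if it was discarded.
    boolean write(String newValue, long writerTimestampMillis) {
        if (writerTimestampMillis > timestampMillis) {
            value = newValue;
            timestampMillis = writerTimestampMillis;
            return true;
        }
        return false;
    }

    String read() {
        return value;
    }
}

final class LwwSkewDemo {
    static String run() {
        LwwRegister register = new LwwRegister();
        // Node A writes first, but its clock runs 50 ms fast.
        register.write("x=1", 100_050L);
        // Node B writes 20 ms later in real time with an accurate clock,
        // so its timestamp is smaller than A's and the write is dropped.
        register.write("x=2", 100_020L);
        return register.read();
    }
}
```

In this scenario LwwSkewDemo.run() returns "x=1": the later write from node B is lost without any error being reported.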

Logical clocks, based on incrementing counters rather than physical time, capture the relative order of events instead of the time of day. They provide a safe way to order events without relying on unreliable wall clocks.
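
One well-known logical clock is the Lamport clock. The article does not name a specific algorithm, so the following is only an illustrative sketch of that technique; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // Local event or message send: advance the counter and return the new timestamp.
    long tick() {
        return counter.incrementAndGet();
    }

    // Message receive: move past both the local counter and the sender's timestamp,
    // so the receive event is ordered after the send event.
    long receive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    long current() {
        return counter.get();
    }
}
```

Each node keeps its own LamportClock, attaches tick() to every outgoing message, and calls receive(...) with the attached value on arrival. If one event causally precedes another, it always gets the smaller timestamp, regardless of what the wall clocks say.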

Logical Clock vs. Physical Clock in Distributed Databases

Logical clocks require persistence and causal tracking, making them more complex than physical clocks, but they avoid the pitfalls of clock drift.

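A stop-the-world (STW) pause, such as a long garbage collection, can make writes unsafe: a master node may resume after its lease has already expired and keep writing as if it still held the lease. Clock inconsistencies between nodes make it harder to tell when the lease has really expired.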

Relying on wall‑clock timestamps for lease expiration can cause premature lease loss or split‑brain scenarios across regions.
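
One common mitigation is to track the lease locally with a monotonic clock and to stop using it slightly before it expires. The sketch below assumes that approach. LocalLease, its constructor parameters, and the safety margin are hypothetical names for illustration, and the time source is injected so it can be replaced in tests.

```java
import java.util.function.LongSupplier;

final class LocalLease {
    private final LongSupplier nanoClock;   // monotonic time source, e.g. System::nanoTime
    private final long acquiredAtNanos;
    private final long durationNanos;
    private final long safetyMarginNanos;

    LocalLease(LongSupplier nanoClock, long durationNanos, long safetyMarginNanos) {
        this.nanoClock = nanoClock;
        this.durationNanos = durationNanos;
        this.safetyMarginNanos = safetyMarginNanos;
        this.acquiredAtNanos = nanoClock.getAsLong();
    }

    // True only while the elapsed time is safely inside the lease duration.
    // Elapsed time comes from a monotonic clock, so wall-clock jumps do not affect it.
    boolean isProbablyValid() {
        long elapsed = nanoClock.getAsLong() - acquiredAtNanos;
        return elapsed < durationNanos - safetyMarginNanos;
    }
}
```

A holder would create the lease with new LocalLease(System::nanoTime, durationNanos, marginNanos) and call isProbablyValid() before each write. The check reduces the risk but does not remove it: a pause between the check and the write can still let a stale write through.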

Global Snapshot and Synchronous Clocks

Achieving a globally monotonic counter across data centers is challenging; naive locking hurts performance. Snapshot isolation offers a balance, but transaction IDs must increase monotonically to maintain consistency.

Google Spanner addresses this with the TrueTime API, which reports the current time as a confidence interval (an earliest and a latest possible value) rather than a single point. The system then waits out the uncertainty before committing a transaction.
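
The idea of waiting out the uncertainty can be sketched as follows. This is not Spanner's actual API or implementation: TimeInterval, UncertainClock, and CommitWait are hypothetical names, and the clock is assumed to return bounds that always contain the true time.

```java
final class TimeInterval {
    final long earliestMicros;
    final long latestMicros;

    TimeInterval(long earliestMicros, long latestMicros) {
        this.earliestMicros = earliestMicros;
        this.latestMicros = latestMicros;
    }
}

// Hypothetical clock that reports bounds on the true time instead of a single value.
interface UncertainClock {
    TimeInterval now();
}

final class CommitWait {
    // Picks the latest possible current time as the commit timestamp, then blocks
    // until even the earliest possible current time has passed that timestamp.
    static long commit(UncertainClock clock) {
        long commitTimestamp = clock.now().latestMicros;
        while (clock.now().earliestMicros <= commitTimestamp) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during commit wait", e);
            }
        }
        return commitTimestamp;
    }
}
```

Once commit(...) returns, the chosen timestamp is guaranteed to be in the past on every node whose clock honors the same bounds, so any transaction that starts afterward receives a larger timestamp. The cost is latency: the wait is roughly twice the clock uncertainty, which is why keeping that uncertainty small matters.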

Summary of Distributed System Clock Issues

When designing distributed systems, engineers must assume any node may pause at any time, and they must account for clock drift, synchronization delays, and lease expiration to ensure data reliability and consistency.

Written by

Xiaokun's Architecture Exploration Notes

10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
