How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services
This article explores how fencing tokens can provide safety and liveness guarantees in distributed lock services, illustrating fault scenarios, token-based conflict resolution, and abstract system models that help engineers prioritize correctness while tolerating temporary unavailability.
In many distributed systems we trade reliability for resource utilization and cost, building on cheap but unreliable components that introduce unbounded network latency, clock drift, and process pauses. A fault-tolerant distributed system needs mechanisms that can handle these realities.
Fencing Tokens Example
Consider a previous example where acquiring a lease ID caused a stop‑the‑world (STW) pause, leading to conflicting writes and file corruption. Two clients held write locks on the same file and both wrote successfully, but the storage layer lacked a way to decide which write should prevail.
To solve this, we introduce a fencing-token mechanism. When a client acquires the lock, the service also issues a unique, monotonically increasing token (implemented, for example, with Google's TrueTime timestamps or with logical clocks). Clients attach this token to every write; the storage layer accepts a write only if its token is greater than the last token it has persisted, and rejects writes carrying older tokens.
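To make the acceptance rule concrete, here is a minimal sketch of such a storage layer in Python. The class name, method names, and token values are illustrative assumptions, not any particular system's API; a real storage service would also persist the token durably.

```python
import threading

class FencedStorage:
    """Sketch of a storage layer that rejects writes carrying stale fencing tokens."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._last_token = -1   # highest token seen so far
        self._data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        """Accept the write only if its token is newer than the last persisted one."""
        with self._mutex:
            if token <= self._last_token:
                return False    # stale token: a newer lock holder has already written
            self._last_token = token
            self._data[key] = value
            return True

storage = FencedStorage()
storage.write(33, "file", "client A")   # accepted: first write
storage.write(34, "file", "client B")   # accepted: newer token
storage.write(33, "file", "client A")   # rejected: token 33 is now stale
```

This resolves the earlier scenario: even if a paused client wakes up still believing it holds the lock, its old token is rejected and the file cannot be corrupted.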
Safety & Liveness
From the fencing token example we can derive the following properties for a lock service:
Uniqueness: any two token requests return different values.
Monotonic ordering: if request X precedes request Y, then token Tx < Ty.
Availability: a non‑crashed node that requests a token will eventually receive a response.
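A single-node token issuer that satisfies uniqueness and monotonic ordering can be sketched with an atomic counter, as below. This is an illustrative assumption, not a production design: a real lock service must also replicate this counter so that availability survives node failures.

```python
import threading

class TokenIssuer:
    """Sketch of a lock service handing out fencing tokens.

    Uniqueness and monotonic ordering follow from incrementing
    a counter under a mutex: no two grants return the same value,
    and later grants always return larger values.
    """

    def __init__(self):
        self._mutex = threading.Lock()
        self._counter = 0

    def acquire(self) -> int:
        with self._mutex:
            self._counter += 1   # every grant gets a strictly larger token
            return self._counter

issuer = TokenIssuer()
t1 = issuer.acquire()
t2 = issuer.acquire()
# t1 != t2 (uniqueness) and t1 < t2 (monotonic ordering)
```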
Uniqueness and monotonic ordering are safety properties of the lock service, while availability is a liveness property. Safety means "nothing bad ever happens"; liveness means "something good eventually happens."
Design Implications
When designing a distributed lock service we must first guarantee safety—ensure tokens are unique and ordered—before considering liveness. The service may be unavailable for short periods (e.g., during a crash‑recovery), but as long as safety is preserved, temporary unavailability is tolerable.
Distributed Abstract System Models
To reason about fault tolerance we abstract distributed systems into models. Two common models are:
Time‑based system model
Node‑failure system model
Models help us formalize faults and design algorithms that satisfy safety and liveness under stated assumptions. For example, keeping state only in memory means a node crash loses it; persisting to disk (or to redundant storage such as RAID, SAN, or NAS) mitigates that, but replicating state across nodes introduces consistency challenges. Strong consistency maps to safety, while eventual consistency maps to liveness.
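The durability point applies directly to fencing tokens: if the storage layer tracks the last accepted token only in memory, a crash and restart could silently accept a stale write. A minimal sketch of persisting that token, under the assumption that a single local file write is durable enough for illustration (a real system would fsync and replicate):

```python
import os
import tempfile

class DurableTokenStore:
    """Sketch: persist the last accepted fencing token so a crash and
    restart cannot regress safety. In-memory state alone would be lost."""

    def __init__(self, path: str):
        self.path = path
        self.last_token = self._load()

    def _load(self) -> int:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return int(f.read())
        return -1

    def accept(self, token: int) -> bool:
        if token <= self.last_token:
            return False
        # Record the token on disk before acknowledging the write.
        with open(self.path, "w") as f:
            f.write(str(token))
        self.last_token = token
        return True

path = os.path.join(tempfile.mkdtemp(), "last_token")
store = DurableTokenStore(path)
store.accept(7)
restarted = DurableTokenStore(path)   # simulate a crash and restart
# restarted.accept(5) is rejected: safety survives the crash
```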
Conclusion
Distinguishing safety and liveness is essential when building fault‑tolerant distributed systems. Engineers must identify which failures are unacceptable at the business level (safety) and which can be tolerated temporarily (liveness), then make architectural trade‑offs accordingly.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.