How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services
This article explores how fencing tokens can provide safety and liveness guarantees in distributed lock services, illustrating fault scenarios, token-based conflict resolution, and abstract system models that help engineers prioritize correctness while tolerating temporary unavailability.
In many distributed systems we trade reliability for resource utilization and cost, building on cheap but unreliable components that introduce unbounded network latency, clock drift, and process pauses. A fault-tolerant distributed system needs mechanisms that can handle these realities.
Fencing Tokens Example
Consider a previous example where acquiring a lease ID caused a stop‑the‑world (STW) pause, leading to conflicting writes and file corruption. Two clients held write locks on the same file and both wrote successfully, but the storage layer lacked a way to decide which write should prevail.
To solve this, we introduce a fencing-token mechanism. When a client acquires the lock, the service also issues a unique, monotonically increasing token (implemented, for example, with Google's TrueTime timestamps or with logical clocks). Clients attach this token to every write; the storage layer accepts a write only if its token is greater than the last token it has persisted, and rejects writes carrying older tokens.
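To make the acceptance rule concrete, here is a minimal sketch of such a storage layer in Python. The class name, method names, and token values are illustrative assumptions, not any particular system's API; a real storage service would also persist the token durably.

```python
import threading

class FencedStorage:
    """Sketch of a storage layer that rejects writes carrying stale fencing tokens."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._last_token = -1   # highest token seen so far
        self._data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        """Accept the write only if its token is newer than the last persisted one."""
        with self._mutex:
            if token <= self._last_token:
                return False    # stale token: a newer lock holder has already written
            self._last_token = token
            self._data[key] = value
            return True

storage = FencedStorage()
storage.write(33, "file", "client A")   # accepted: first write
storage.write(34, "file", "client B")   # accepted: newer token
storage.write(33, "file", "client A")   # rejected: token 33 is now stale
```

This resolves the earlier scenario: even if a paused client wakes up still believing it holds the lock, its old token is rejected and the file cannot be corrupted.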
Safety & Liveness
From the fencing token example we can derive the following properties for a lock service:
Uniqueness: any two token requests return different values.
Monotonic ordering: if request X precedes request Y, then token Tx < Ty.
Availability: a non‑crashed node that requests a token will eventually receive a response.
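A single-node token issuer that satisfies uniqueness and monotonic ordering can be sketched with an atomic counter, as below. This is an illustrative assumption, not a production design: a real lock service must also replicate this counter so that availability survives node failures.

```python
import threading

class TokenIssuer:
    """Sketch of a lock service handing out fencing tokens.

    Uniqueness and monotonic ordering follow from incrementing
    a counter under a mutex: no two grants return the same value,
    and later grants always return larger values.
    """

    def __init__(self):
        self._mutex = threading.Lock()
        self._counter = 0

    def acquire(self) -> int:
        with self._mutex:
            self._counter += 1   # every grant gets a strictly larger token
            return self._counter

issuer = TokenIssuer()
t1 = issuer.acquire()
t2 = issuer.acquire()
# t1 != t2 (uniqueness) and t1 < t2 (monotonic ordering)
```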
Uniqueness and monotonic ordering are safety properties of the lock service, while availability is a liveness property. Safety means "nothing bad ever happens"; liveness means "something good eventually happens."
Design Implications
When designing a distributed lock service we must first guarantee safety—ensure tokens are unique and ordered—before considering liveness. The service may be unavailable for short periods (e.g., during a crash‑recovery), but as long as safety is preserved, temporary unavailability is tolerable.
Distributed Abstract System Models
To reason about fault tolerance we abstract distributed systems into models. Two common models are:
Time‑based system model
Node‑failure system model
Models help us formalize faults and design algorithms that satisfy safety and liveness under stated assumptions. For example, keeping state only in memory means a node crash loses it; persisting to disk (or to redundant storage such as RAID, SAN, or NAS) mitigates that, but replicating state across nodes introduces consistency challenges. Strong consistency maps to safety, while eventual consistency maps to liveness.
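The durability point applies directly to fencing tokens: if the storage layer tracks the last accepted token only in memory, a crash and restart could silently accept a stale write. A minimal sketch of persisting that token, under the assumption that a single local file write is durable enough for illustration (a real system would fsync and replicate):

```python
import os
import tempfile

class DurableTokenStore:
    """Sketch: persist the last accepted fencing token so a crash and
    restart cannot regress safety. In-memory state alone would be lost."""

    def __init__(self, path: str):
        self.path = path
        self.last_token = self._load()

    def _load(self) -> int:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return int(f.read())
        return -1

    def accept(self, token: int) -> bool:
        if token <= self.last_token:
            return False
        # Record the token on disk before acknowledging the write.
        with open(self.path, "w") as f:
            f.write(str(token))
        self.last_token = token
        return True

path = os.path.join(tempfile.mkdtemp(), "last_token")
store = DurableTokenStore(path)
store.accept(7)
restarted = DurableTokenStore(path)   # simulate a crash and restart
# restarted.accept(5) is rejected: safety survives the crash
```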
Conclusion
Distinguishing safety and liveness is essential when building fault‑tolerant distributed systems. Engineers must identify which failures are unacceptable at the business level (safety) and which can be tolerated temporarily (liveness), then make architectural trade‑offs accordingly.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.