Understanding ZAB: How ZooKeeper Guarantees Strong Consistency
This article explains the ZAB (ZooKeeper Atomic Broadcast) protocol, detailing its atomic broadcast mechanism, two‑phase commit style messaging, transaction IDs (ZXID), leader‑follower coordination, crash recovery principles, data synchronization, and the underlying ZXID design with illustrative code snippets.
ZooKeeper keeps transaction data consistent across all replicas through the ZAB protocol.
ZAB Protocol Overview
ZAB stands for ZooKeeper Atomic Broadcast.
The ZAB protocol is an atomic broadcast protocol with crash-recovery support, designed specifically for the distributed coordination service ZooKeeper. Based on this protocol, ZooKeeper implements a leader‑follower architecture that keeps data consistent across cluster replicas.
The message broadcast process of ZAB uses an atomic broadcast protocol similar to a two‑phase commit. For a client request, the Leader generates a transaction proposal and sends it to all Followers, collects votes, and then commits the transaction.
In ZAB’s two‑phase commit, the abort logic is removed. All Followers either acknowledge the Leader’s proposal or discard it. Once a majority of Followers have ACKed, the Leader begins committing the transaction.
The Leader assigns a globally, monotonically increasing transaction ID (ZXID) to each proposal. Because ZAB must preserve strict causal ordering, proposals are processed in ZXID order.
During broadcasting, the Leader creates a queue for each Follower and enqueues proposals in FIFO order.
Each Follower writes the received proposal to a local transaction log on disk; after a successful write it sends an ACK to the Leader.
When the Leader receives ACKs from a majority of Followers, it sends a COMMIT message, completes the transaction locally, and Followers commit upon receiving the COMMIT.
The use of an atomic broadcast protocol ensures distributed data consistency; a majority of nodes maintain consistent data.
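The majority rule at the heart of the steps above can be sketched in miniature. `QuorumTracker` is an illustrative name, not a class from the ZooKeeper source; it only models the point at which the Leader has enough ACKs to broadcast COMMIT.

```java
// Minimal sketch of ZAB's quorum rule: a proposal commits once a
// majority of the ensemble (leader included) has acknowledged it.
// QuorumTracker is an illustrative name, not a ZooKeeper class.
import java.util.HashSet;
import java.util.Set;

public class QuorumTracker {
    private final int ensembleSize;                 // total servers, leader included
    private final Set<Long> acks = new HashSet<>(); // server ids that have ACKed

    public QuorumTracker(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Record an ACK for the proposal; returns true once a majority is
    // reached, which is when the leader broadcasts COMMIT.
    public boolean ack(long serverId) {
        acks.add(serverId);
        return acks.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        QuorumTracker p = new QuorumTracker(5); // 5-node ensemble, quorum = 3
        System.out.println(p.ack(1));           // leader's own ack -> false
        System.out.println(p.ack(2));           // 2 of 5 -> false
        System.out.println(p.ack(3));           // 3 of 5 -> true: send COMMIT
    }
}
```

Note that a minority of slow Followers can never block a commit, which is why a separate recovery phase is needed to bring them back in line.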
Message Broadcast
You can think of the broadcast mechanism as a simplified 2PC protocol that guarantees transaction order consistency.
When a client submits a transaction request, the Leader generates a proposal and sends it to all Followers; after receiving ACKs from a majority, the transaction is committed. Because ZAB only requires a majority of ACKs, a minority of Followers may temporarily lag behind the Leader, and a Leader crash mid‑broadcast can leave replicas in different states; the crash‑recovery mode described below repairs these cases. Message broadcast runs over TCP, which preserves message order on each connection. The Leader assigns a globally increasing ZXID to each proposal, and proposals are processed in ZXID order.
The Leader assigns a queue to each Follower, enqueues proposals by ZXID, and sends them FIFO. Followers write each proposal to a local transaction log, ACK the Leader, and upon receiving a majority of ACKs the Leader broadcasts a COMMIT, after which Followers also commit.
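The per‑Follower queues described above can be sketched as follows. All names here are illustrative, not taken from the ZooKeeper source; the point is that each Follower has its own FIFO queue, and enqueuing in ZXID order plus FIFO draining over TCP preserves proposal order per Follower.

```java
// Sketch of the leader's per-follower dispatch queues: one FIFO queue
// per follower, proposals enqueued in ZXID order. Illustrative only.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ProposalQueues {
    // followerId -> queue of pending proposal ZXIDs, in FIFO order
    private final Map<Long, Queue<Long>> queues = new HashMap<>();

    public void addFollower(long followerId) {
        queues.put(followerId, new ArrayDeque<>());
    }

    // Enqueue a proposal (identified by its ZXID) for every follower.
    public void broadcast(long zxid) {
        for (Queue<Long> q : queues.values()) {
            q.add(zxid);
        }
    }

    // Drain one follower's queue in FIFO order
    // (stands in for that follower's sender thread).
    public long nextFor(long followerId) {
        return queues.get(followerId).remove();
    }

    public static void main(String[] args) {
        ProposalQueues pq = new ProposalQueues();
        pq.addFollower(1L);
        pq.broadcast(100L);
        pq.broadcast(101L);
        System.out.println(pq.nextFor(1L)); // 100
        System.out.println(pq.nextFor(1L)); // 101
    }
}
```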
Crash Recovery
During message broadcast, if the Leader crashes, can data consistency be guaranteed? When the Leader crashes, ZAB enters crash recovery to handle two situations:
The Leader crashes after replicating the proposal to Followers but before the transaction is committed.
The Leader crashes after receiving a majority of ACKs and committing locally, but before every Follower has received the COMMIT.
ZAB defines two principles:
The protocol must ensure that transactions already committed on the Leader are eventually committed on all servers.
The protocol must ensure that transactions that were only proposed on the Leader but never committed are discarded.
The implementation relies on ZXID: during recovery, the server whose log contains the largest ZXID is elected as the new Leader. Because every committed proposal is present on a majority of servers, that server is guaranteed to hold all committed transactions, which removes the need for explicit commit checks and makes recovery efficient.
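The selection rule can be sketched in a few lines. This is a simplification: the real election also compares epochs and server ids as tie‑breakers, and `pickLeader` is an illustrative name, not a ZooKeeper function.

```java
// Sketch of the recovery rule: elect the server whose log holds the
// largest ZXID, so every proposal a quorum acknowledged survives and
// uncommitted leader-only proposals are naturally discarded.
import java.util.List;

public class RecoveryPick {
    // Return the index of the server with the largest last-logged ZXID.
    public static int pickLeader(List<Long> lastZxids) {
        int best = 0;
        for (int i = 1; i < lastZxids.size(); i++) {
            if (lastZxids.get(i) > lastZxids.get(best)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Server 1 has the highest last ZXID, so it becomes the new leader.
        System.out.println(pickLeader(
                List.of(0x100000002L, 0x100000003L, 0x100000002L))); // 1
    }
}
```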
Data Synchronization
After leader election and before normal operation, the Leader verifies that all committed proposals have been acknowledged by a majority, ensuring data synchronization.
The Leader prepares a queue for each Follower and sends any unsynchronized proposals followed by a commit message.
When Followers have synchronized all missing proposals and applied them locally, the Leader adds those Followers to the usable follower list.
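The synchronization step can be sketched as a message sequence: the Leader replays every committed proposal beyond the Follower's last ZXID, then a COMMIT, and only afterwards adds the Follower to the usable list. All names below are illustrative, not from the ZooKeeper source.

```java
// Sketch of leader-to-follower synchronization: replay missed
// proposals in ZXID order, then COMMIT. Illustrative only.
import java.util.ArrayList;
import java.util.List;

public class SyncSketch {
    // Proposals the leader has committed, in ZXID order.
    static final List<Long> committedZxids = List.of(1L, 2L, 3L, 4L);

    // Build the message sequence for a follower whose log ends at lastZxid.
    static List<String> syncMessages(long lastZxid) {
        List<String> out = new ArrayList<>();
        for (long zxid : committedZxids) {
            if (zxid > lastZxid) {
                out.add("PROPOSAL-" + zxid); // transaction the follower missed
            }
        }
        out.add("COMMIT"); // follower applies everything, then joins the quorum
        return out;
    }

    public static void main(String[] args) {
        // A follower that logged up to ZXID 2 receives proposals 3 and 4.
        System.out.println(syncMessages(2L)); // [PROPOSAL-3, PROPOSAL-4, COMMIT]
    }
}
```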
Design of ZXID
ZXID is a 64‑bit number. The lower 32 bits are a simple monotonically increasing counter; the Leader increments it for each new proposal.
The upper 32 bits identify the Leader's epoch. When a new Leader is elected, it reads the largest ZXID from its local log, extracts the epoch from its upper 32 bits, and increments it to form the new epoch; the lower‑32‑bit counter then restarts from zero.
Using epoch numbers prevents different Leaders from generating identical ZXIDs.
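Before looking at the ZooKeeper source, the layout can be illustrated with a standalone sketch. `makeZxid` mirrors ZooKeeper's `ZxidUtils`; the decode helpers (`epochOf`, `counterOf`) are illustrative names added for this example.

```java
// Standalone illustration of the ZXID layout: epoch in the upper
// 32 bits, per-epoch counter in the lower 32 bits.
public class ZxidDemo {
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }

    // Recover the epoch from the upper 32 bits.
    public static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // Recover the counter from the lower 32 bits.
    public static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 42);
        System.out.println(Long.toHexString(zxid)); // 30000002a
        System.out.println(epochOf(zxid));          // 3
        System.out.println(counterOf(zxid));        // 42
    }
}
```

Because the epoch occupies the high bits, any ZXID from a newer epoch compares greater than every ZXID from an older one, which is exactly the ordering crash recovery depends on.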
<code>// Leader.java (abridged)
void lead() throws IOException, InterruptedException {
    // ....
    // Negotiate a new epoch with a quorum, then start the ZXID counter at 0.
    long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
    zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
    // ....
}

public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
    synchronized (connectingFollowers) {
        // ....
        // The new epoch must exceed every epoch already accepted by a participant.
        if (lastAcceptedEpoch >= epoch) {
            epoch = lastAcceptedEpoch + 1;
        }
        if (isParticipant(sid)) {
            // add this server to the set of connected followers
            connectingFollowers.add(sid);
        }
        QuorumVerifier verifier = self.getQuorumVerifier();
        // once a quorum (including the leader itself) has joined, the epoch is settled
        if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) {
            waitingForNewEpoch = false;
            self.setAcceptedEpoch(epoch);
            connectingFollowers.notifyAll();
        } else {
            long start = Time.currentElapsedTime();
            long cur = start;
            long end = start + self.getInitLimit() * self.getTickTime();
            // wait for more followers to connect, or time out
            while (waitingForNewEpoch && cur < end && !quitWaitForEpoch) {
                connectingFollowers.wait(end - cur);
                cur = Time.currentElapsedTime();
            }
            if (waitingForNewEpoch) {
                throw new InterruptedException("Timeout while waiting for epoch from quorum");
            }
        }
        return epoch;
    }
}

// ZxidUtils.java
public static long makeZxid(long epoch, long counter) {
    return (epoch << 32L) | (counter & 0xffffffffL);
}
</code>
ZAB Protocol Implementation
Write Data Process
In ZooKeeper's source, the write‑data process follows the path described above: the client request is forwarded to the Leader, which wraps it in a proposal with a fresh ZXID and dispatches it through the per‑Follower queues; each Follower logs the proposal and ACKs; once a majority of ACKs arrives, the Leader commits locally and broadcasts COMMIT to the Followers.