Introduction to ZooKeeper: Design Goals, Data Model, Sessions, Watches, Consistency Guarantees, Leader Election, and Deployment
This article provides a comprehensive overview of ZooKeeper, covering its purpose as a distributed coordination service, design objectives such as consistency and reliability, hierarchical data model, session and watch mechanisms, consistency guarantees, leader election and Zab protocol, as well as practical deployment details.
ZooKeeper Introduction
ZooKeeper is an open‑source distributed application coordination service that offers a simple set of primitives enabling developers to implement synchronization, configuration maintenance, and naming services.
Design Goals
Eventual Consistency: All clients eventually see the same view regardless of which server they connect to.
Reliability: If a message is accepted by one server, it will be accepted by all servers.
Timeliness: Clients receive updates or failure notifications within a bounded time interval; for the freshest data, call sync() before reading.
Wait‑free: Slow or failed clients do not block fast clients; every client's requests are served without waiting on the progress of others.
Atomicity: Updates either succeed completely or fail, with no intermediate state.
Ordering: Global ordering ensures that if message a precedes message b on one server, it does so on all servers; partial ordering guarantees that messages from the same sender preserve their order.
Data Model
ZooKeeper maintains a hierarchical namespace similar to a standard file system.
The key characteristics of this model are:
Each node (znode) is uniquely identified by its full path, e.g., /NameService/Server1 .
Znodes may have children and can store data; temporary (EPHEMERAL) znodes cannot have children.
Each znode is versioned: its metadata carries version counters that increment automatically on every update, and conditional operations such as setData() with an expected version use them for optimistic concurrency control.
Node types include:
Persistent: Remains after server restarts; may contain data and children.
Ephemeral: Deleted automatically when the client session ends; cannot have children.
Non‑sequential: The name is exactly as specified, so when multiple clients attempt to create the same path simultaneously, only one succeeds.
Sequential: The server appends a 10‑digit, zero‑padded decimal counter to the name, so all concurrent creators succeed with unique names.
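The sequential-suffix rule can be sketched in a few lines of Python. This is a simulation for illustration only, not ZooKeeper's actual server code; the class name and the per-parent counter are assumptions:

```python
# Sketch (not real server code): how SEQUENTIAL znode names are formed.
# ZooKeeper appends a monotonically increasing counter, zero-padded to
# 10 decimal digits, so concurrent creators get unique names.

class SequentialNamer:
    def __init__(self):
        self.counter = 0  # per-parent counter, maintained by the server

    def create_sequential(self, prefix):
        name = f"{prefix}{self.counter:010d}"  # 10-digit zero-padded suffix
        self.counter += 1
        return name

namer = SequentialNamer()
print(namer.create_sequential("/NameService/lock-"))  # /NameService/lock-0000000000
print(namer.create_sequential("/NameService/lock-"))  # /NameService/lock-0000000001
```

Because each creator receives a distinct suffix, sequential nodes are the usual building block for distributed locks and queues.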
Znodes can be watched for data changes or child‑list modifications; watches are a core feature used by many ZooKeeper functions.
Every state change generates a globally ordered transaction ID (zxid) that determines the order of operations across the ensemble.
Session
A client establishes a session by connecting to one of the servers in a ZooKeeper ensemble; the session then moves through a small set of states as the connection is made, lost, and re‑established.
If a client loses connection due to a timeout, it enters the CONNECTING state and automatically attempts to reconnect; if reconnection occurs within the session timeout, the client returns to CONNECTED . The server, not the client, decides when a session expires.
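Those transitions can be modeled with a minimal state machine. This is a simplified illustration with assumed state names and method names, not the real client library:

```python
# Toy model of client-side session states: CONNECTED -> CONNECTING on
# connection loss, back to CONNECTED if the reconnect lands within the
# session timeout, CLOSED once the server expires the session.

class Session:
    def __init__(self):
        self.state = "CONNECTING"

    def on_connected(self):
        self.state = "CONNECTED"

    def on_connection_loss(self):
        # The client keeps retrying on its own; it never declares
        # itself expired -- only the server can do that.
        self.state = "CONNECTING"

    def on_session_expired(self):
        # Reported by the server when a reconnect arrives after the
        # session timeout has elapsed.
        self.state = "CLOSED"

s = Session()
s.on_connected()
s.on_connection_loss()
assert s.state == "CONNECTING"   # still retrying, session not yet expired
s.on_connected()                 # reconnected within the timeout
assert s.state == "CONNECTED"
```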
Watch
A watch is a one‑time trigger sent to the client that set it, activated when the watched data changes.
Key points about watches:
One‑time trigger: After a change, the watch fires once; subsequent changes require the client to set a new watch.
Sent to the client: Watches are delivered asynchronously over the socket; ordering guarantees ensure a client sees the watch before the corresponding data change.
Set on specific data: data watches (set by getData() or exists()) fire on setData() or delete(); child watches (set by getChildren()) fire when a child is created or deleted. If a client disconnects around the time of a change, the corresponding watch event may be lost.
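The one-time-trigger behavior can be modeled in a short sketch. The Znode class and callback shape here are assumptions for illustration, not ZooKeeper's API:

```python
# Toy model of one-time-trigger watch semantics: a watch fires at most
# once; later changes are silent until a new watch is registered.

class Znode:
    def __init__(self, data=b""):
        self.data = data
        self.watchers = []  # callbacks, each consumed on first trigger

    def get_data(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)  # read + register watch
        return self.data

    def set_data(self, data):
        self.data = data
        fired, self.watchers = self.watchers, []  # one-time: clear on fire
        for callback in fired:
            callback("NodeDataChanged")

events = []
node = Znode(b"v1")
node.get_data(watch=events.append)
node.set_data(b"v2")   # fires the watch once
node.set_data(b"v3")   # no watch registered any more -> no event
assert events == ["NodeDataChanged"]
```

A client that wants continuous notifications must therefore re-read the data and re-register the watch inside its event handler.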
Consistency Guarantees
ZooKeeper is fast, particularly for read‑dominated workloads, and offers the following guarantees:
Sequential Consistency: Updates from a single client are applied in order.
Atomicity: Updates are all‑or‑nothing.
Single System Image: All clients see the same system state regardless of the server they connect to.
Reliability: Once an update is committed, it persists until overwritten.
Timeliness: Clients see a consistent view within a bounded time window.
How ZooKeeper Works
Each server in a ZooKeeper ensemble assumes one of three roles (leader, follower, observer) and can be in one of four states (LOOKING, LEADING, FOLLOWING, OBSERVING). The core of ZooKeeper is the atomic broadcast protocol (Zab), which ensures ordered state updates.
Leader Election
When the current leader fails, the ensemble enters recovery mode and elects a new leader using either a basic Paxos or a fast Paxos algorithm (fast Paxos is the default). The basic Paxos election proceeds as follows:
The election thread initiates the vote and collects responses.
It sends a query to all servers (including itself).
Responses are validated, and each server’s ID and proposed leader information are recorded.
The server with the highest zxid is selected as the candidate.
If the candidate obtains a majority (n/2 + 1) votes, it becomes the leader; otherwise the process repeats.
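The steps above reduce to a simple rule, sketched here in Python. This is a toy model: real elections also compare election epochs and use server ids as tie-breakers, and votes are exchanged over multiple rounds rather than computed in one function:

```python
# Toy model of the election rule: servers vote for the peer with the
# highest zxid, and the candidate wins only with a strict majority.

def elect(zxids, reachable):
    """zxids: server_id -> last seen zxid; reachable: ids that answered."""
    n = len(zxids)
    answered = {sid: zxids[sid] for sid in reachable}
    # Everyone adopts the peer with the highest (zxid, id) pair.
    candidate = max(answered, key=lambda sid: (answered[sid], sid))
    if len(answered) >= n // 2 + 1:   # quorum of votes for the candidate
        return candidate
    return None                       # no quorum: the round repeats

# Three-server ensemble, all reachable: server 2 has the highest zxid.
assert elect({1: 0x100, 2: 0x102, 3: 0x101}, {1, 2, 3}) == 2
# Only one server reachable: no majority, so the election must repeat.
assert elect({1: 0x100, 2: 0x102, 3: 0x101}, {1}) is None
```

Preferring the highest zxid ensures the new leader already holds every transaction that could have been committed, which is what makes recovery safe.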
The fast Paxos election has each server propose itself as leader, resolve epoch and zxid conflicts, and converge on a single leader.
Leader Workflow
The leader performs three main functions:
Recover data after a crash.
Maintain heartbeats with followers and process follower requests.
Handle follower messages such as PING, REQUEST, ACK, and REVALIDATE.
Follower Workflow
Followers:
Send PING, REQUEST, ACK, and REVALIDATE messages to the leader.
Receive and process messages from the leader.
Forward client write requests to the leader for voting.
Return results to clients.
Messages a follower handles include PING (heartbeat), PROPOSAL (the leader's proposal for a vote), COMMIT (notification of a finalized transaction), UPTODATE (synchronization complete), REVALIDATE (session validation), and SYNC (return of a client‑initiated state sync).
Zab: Broadcasting State Updates
When a follower receives a write request, it forwards the request to the leader, which turns it into a transaction and broadcasts it to the ensemble. Commitment follows a two‑phase commit:
Leader sends a PROPOSAL to all followers.
Each follower writes the proposal to disk and replies with an ACK.
Once the leader receives ACKs from a quorum, it sends a COMMIT.
The protocol guarantees total order of transactions across the ensemble and handles leader crashes by ensuring that any transaction committed by a crashed leader is re‑committed by the new leader.
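The PROPOSAL/ACK/COMMIT flow can be sketched as follows. This is a toy model with assumed class names that ignores persistence, leader recovery, and follower resync:

```python
# Toy model of Zab broadcast: the leader PROPOSEs, followers log and ACK,
# and the leader COMMITs once a quorum (counting itself) has acknowledged.

class Follower:
    def __init__(self, up=True):
        self.up = up
        self.log, self.committed = {}, []

    def propose(self, zxid, txn):
        if not self.up:
            return False        # unreachable follower: no ACK
        self.log[zxid] = txn    # write the proposal to disk (simulated)
        return True             # reply with ACK

    def commit(self, zxid):
        if self.up:
            self.committed.append(zxid)

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.zxid = 0           # monotonically increasing transaction id

    def broadcast(self, txn):
        self.zxid += 1
        acks = 1                # the leader counts as one vote
        for f in self.followers:
            if f.propose(self.zxid, txn):
                acks += 1
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks >= quorum:
            for f in self.followers:
                f.commit(self.zxid)   # phase 2: COMMIT
            return True
        return False            # no quorum: transaction not committed

followers = [Follower(), Follower(), Follower(up=False), Follower()]
leader = Leader(followers)
assert leader.broadcast("setData /x") is True  # 4 of 5 votes, quorum is 3
assert followers[0].committed == [1]
```

Because a transaction commits only after a quorum has logged it, any quorum formed during a later election necessarily contains at least one server that holds it, which is how a new leader can re‑commit a crashed leader's transactions.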
Deployment
Basic Information Table
| Hostname | OS Version | IP Address | Installed Software |
| --- | --- | --- | --- |
| zookeeper-230 | CentOS 7.7 | 192.168.15.230 | JDK 1.8, zookeeper‑3.6.2 |
| zookeeper-231 | CentOS 7.7 | 192.168.15.231 | JDK 1.8, zookeeper‑3.6.2 |
| zookeeper-232 | CentOS 7.7 | 192.168.15.232 | JDK 1.8, zookeeper‑3.6.2 |
System Information
Test VM configuration: 1 vCPU, 2 GB RAM, 25 GB disk.
[root@zookeeper-230 ~]# uname -a
Linux zookeeper-230 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@zookeeper-230 ~]# rpm -q centos-release
centos-release-7-7.1908.0.el7.centos.x86_64
Application Information
| Item | Path |
| --- | --- |
| Application Path | /usr/local/zookeeper3.6 |
| Configuration Path | /usr/local/zookeeper3.6/conf |
| Default Log Path | /usr/local/zookeeper3.6/logs |
| Custom Snapshot Log Path | /usr/local/zookeeper3.6/zkdata |
| Custom Transaction Log Path | /usr/local/zookeeper3.6/zklogs |
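Given these hosts and paths, a matching zoo.cfg might look like the following. This is a sketch: the tick/limit values and the 2181/2888/3888 ports are the common ZooKeeper defaults, assumed here rather than taken from the original article.

```properties
# /usr/local/zookeeper3.6/conf/zoo.cfg -- sketch for the three-node
# ensemble in the tables above; ports are the usual defaults (assumed).
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper3.6/zkdata
dataLogDir=/usr/local/zookeeper3.6/zklogs
clientPort=2181
server.1=192.168.15.230:2888:3888
server.2=192.168.15.231:2888:3888
server.3=192.168.15.232:2888:3888
```

Each server would additionally need a myid file under the dataDir containing its number (1, 2, or 3) to match the server.N entries.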
Author: 二价亚铁
Original link: https://www.cnblogs.com/xw-01/p/18263814
License: CC BY‑NC‑ND 2.5 China Mainland.