
Recovering a ZooKeeper Cluster with Codis: Diagnosis, Testing, and Migration Strategies

This article details a real‑world investigation of a ZooKeeper election‑port failure that prevented adding observer nodes to a Codis cache cluster. It walks through systematic connectivity checks, log analysis, and two candidate migration plans, then presents step‑by‑step procedures for the rolling upgrade, configuration changes, and successful cluster restoration.

Zhuanzhuan Tech

1 Problem Background

Our Codis cache depends on ZooKeeper; while attempting to add an observer node for cross‑datacenter disaster recovery, the new node could not join because the election port (3888) was unreachable, as shown by logs and telnet attempts.

Me: ??? Let me telnet the leader's port 3888 first.

Me: Port 3888 really is unreachable?!! The new node can't join the cluster!!! Yet reads and writes on the old cluster are still normal. This makes no sense...

Me: Restart an old node; the cluster should rediscover it and, along the way, restore the election port.

Me: Oh no... the old node's port is open again, but now it can't rejoin the cluster either??!

Images illustrate the failed connection attempts and error screenshots.

2 Exploration

2.1 Port Connectivity Statistics

Collected connectivity data for each ZooKeeper node, revealing that the election ports were blocked.
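The connectivity sweep can be scripted rather than run by hand; a minimal sketch, assuming bash with `/dev/tcp` support (the `ZK_HOSTS` variable and hostnames are placeholders, not from the article):

```shell
#!/usr/bin/env bash
# Probe the client (2181), sync (2888), and election (3888) ports of each
# ensemble member. Hosts are passed via ZK_HOSTS (placeholder names).
probe() {  # probe <host> <port> -> "<host>:<port> open|UNREACHABLE"
  if timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

for host in ${ZK_HOSTS:-}; do      # e.g. ZK_HOSTS="zk6 zk7 zk8 zk9 zk10"
  for port in 2181 2888 3888; do
    probe "$host" "$port"
  done
done
```

Running this from each node against every peer quickly builds the connectivity matrix described above.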

2.2 Unrestarted Node Status Check

Identified that many CLOSE_WAIT sockets originated from security‑scan IPs, but no successful connections were established.
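Grouping the CLOSE_WAIT sockets by remote IP is what surfaces the scanner addresses; a sketch over `ss -tan` output, where field 5 is the peer `addr:port`:

```shell
# Count CLOSE-WAIT sockets per remote IP to spot security-scan sources.
# Feed it the output of `ss -tan`; field 5 is the peer "addr:port".
count_close_wait() {
  awk '$1 == "CLOSE-WAIT" { sub(/:[^:]*$/, "", $5); print $5 }' \
    | sort | uniq -c | sort -rn
}
# Usage: ss -tan | count_close_wait
```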

2.3 Log File Investigation

Default logging configuration caused log files to grow beyond 20 GB without proper rotation, and no useful clues were found in the logs.

2.4 Online Port‑Probe Monitoring

After adding a monitoring probe, the new node’s 3888 port became unreachable again within an hour.
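A sketch of such a probe, suitable for running from cron every minute so the exact moment 3888 stops listening can be pinned down (the script path, target host, and log path are placeholders):

```shell
#!/usr/bin/env bash
# Append a timestamped reachability record for a port.
# Cron usage: * * * * * /usr/local/bin/probe_3888.sh zk-new 3888 /var/log/zk-probe.log
probe_log() {
  local host=$1 port=$2 log=$3 state
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    state=open
  else
    state=down
  fi
  echo "$(date '+%F %T') $host:$port $state" >> "$log"
}

if [ $# -ge 3 ]; then probe_log "$@"; fi
```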

2.5 Breakthrough

Comparing jstack traces with a peer team revealed the missing QuorumCnxManager$Listener thread, which is responsible for listening on the election port; its absence explained the election failure.
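This check can be automated: a healthy participant should show a `QuorumCnxManager$Listener` thread in its stack dump. A sketch, assuming the ZooKeeper JVM is findable by its `QuorumPeerMain` main class:

```shell
#!/usr/bin/env bash
# Report whether the election-port listener thread exists in the ZK JVM.
check_listener() {
  local pid
  pid=$(pgrep -f QuorumPeerMain | head -1)
  if [ -n "$pid" ] && jstack "$pid" | grep -q 'QuorumCnxManager\$Listener'; then
    echo "election listener thread present"
  else
    echo "election listener thread MISSING"
  fi
}
check_listener
```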

3 Cluster Recovery Plan Development

3.1 Re‑creating the Production Cluster in a Test Environment

Restored a ZooKeeper‑3.4.6 five‑node cluster from a production snapshot, then deliberately disabled the 3888 ports to simulate the failure.
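One way to disable the port in the test environment is to drop its traffic with iptables; a sketch with a dry-run switch (run the real rules as root, on a test node only):

```shell
#!/usr/bin/env bash
# Drop all traffic on the ZooKeeper election port to reproduce the
# production symptom. With DRY_RUN set, the commands are only printed.
block_election_port() {
  local run=${DRY_RUN:+echo}
  $run iptables -A INPUT  -p tcp --dport 3888 -j DROP
  $run iptables -A OUTPUT -p tcp --sport 3888 -j DROP
}

DRY_RUN=1 block_election_port   # print the rules; unset DRY_RUN to apply
```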

3.2 Verifying Codis‑Proxy Registration

Confirmed that Codis‑Proxy registers ephemeral nodes under /jodis/… and that clients of different versions can still read and write via the test cluster.
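Whether a /jodis registration is really ephemeral can be read off `zkCli.sh stat` output: a non‑zero `ephemeralOwner` means the znode disappears with the proxy's session. A small helper to interpret that field (the path segment under /jodis is elided in the article, so none is hard‑coded here):

```shell
# Classify a znode from `zkCli.sh stat <path>` output: ephemeralOwner is
# 0x0 for persistent nodes and a session id for ephemeral ones.
is_ephemeral() {
  awk -F' = ' '/^ephemeralOwner/ {
    print ($2 == "0x0" ? "persistent" : "ephemeral")
  }'
}
# Usage: echo "stat /jodis/<proxy-node>" | bin/zkCli.sh -server host:2181 | is_ephemeral
```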

3.3 Key‑Factor Tests

ZooKeeper version upgrade: evaluated compatibility of 3.4.13 with the existing data.

Codis‑Proxy temporary node registration: ensured registration completes without errors.

Multiple Codis client versions: checked that clients continue to resolve the ZooKeeper DNS name.

Log rotation: designed a daily log‑rotation policy.

3.4 Solution Selection

Option 1 – In‑place Rolling Upgrade: create a new ZooKeeper‑3.4.13 work directory, copy the leader's data, stop old nodes 6 and 8, and start the new nodes sequentially, resulting in a 3.4.13 cluster. Advantages: simple, continuous. Drawback: inevitable service interruption while nodes are shut down.

Option 2 – Offline Old Nodes, Build New Small Cluster, Then Expand: shut down nodes 6 and 8, build a fresh three‑node cluster with the same data, then add new nodes to reach five. Advantage: avoids interruption. Drawback: risk of data inconsistency during the split.

4 Implementation

4.1 Log Rotation Configuration

Updated bin/zkEnv.sh to point logs to a logs directory and enable daily rotation; modified conf/log4j.properties accordingly.
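A sketch of the rotation settings (the log directory path is a placeholder; the appender properties are standard log4j 1.x as shipped with ZooKeeper 3.4, switched from size‑based to daily rolling):

```properties
# conf/log4j.properties — roll the server log daily instead of letting it grow unbounded
zookeeper.root.logger=INFO, ROLLINGFILE
zookeeper.log.dir=/data/zookeeper/logs

log4j.rootLogger=${zookeeper.root.logger}
log4j.appender.ROLLINGFILE=org.apache.log4j.DailyRollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=INFO
log4j.appender.ROLLINGFILE.File=${zookeeper.log.dir}/zookeeper.log
log4j.appender.ROLLINGFILE.DatePattern='.'yyyy-MM-dd
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
```

For the appender to take effect, bin/zkEnv.sh must also export ZOO_LOG_DIR to the logs directory and set ZOO_LOG4J_PROP to "INFO,ROLLINGFILE".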

4.2 Preparing a Three‑Node Small Cluster

Adjusted conf/zoo.cfg dataDir and dataLogDir, reassigned myid values, and copied the full data from the production leader to the new nodes.
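A sketch of the per‑node configuration (IP addresses and paths are placeholders; server ids 4–6 follow the article's numbering for the new nodes):

```properties
# conf/zoo.cfg (identical on all three new nodes)
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/datalog
server.4=10.0.0.4:2888:3888
server.5=10.0.0.5:2888:3888
server.6=10.0.0.6:2888:3888
```

Each node then writes its own id before startup, e.g. `echo 4 > /data/zookeeper/data/myid`, and receives a copy of the production leader's data in dataDir.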

4.3 Adjusting Old Node Paths and Rolling Restart

Changed myid and server entries for nodes 7, 9, and 10, shut them down, switched to the new work directory, and performed a staged restart (starting new nodes 4, 5, and 6 first so the new ensemble could elect a leader, then bringing back the remaining nodes) until the cluster stabilized with node 5 as leader.

[root@XShellTerminal bin]# echo mntr | nc localhost 2181
zk_server_state leader
zk_synced_followers 4.0
......
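The `mntr` check above can be turned into a pass/fail probe; a sketch that expects a five‑node ensemble (a leader plus four synced followers):

```shell
# Decide cluster health from `echo mntr | nc <leader> 2181` output:
# the leader should report all four followers in sync.
check_mntr() {
  awk '
    $1 == "zk_server_state"     { state  = $2 }
    $1 == "zk_synced_followers" { synced = int($2) }
    END { print (state == "leader" && synced == 4 ? "healthy" : "degraded") }
  '
}
# Usage: echo mntr | nc localhost 2181 | check_mntr
```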

The cluster was fully restored, upgraded, and log rotation was operational.

5 Summary

Ensure ZooKeeper port monitoring (client, election, sync) to detect issues early.

Minor version upgrades can eliminate hidden bugs.

Thorough key‑factor testing is essential before rollout.

Always have a fallback strategy; never count on luck.

Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
