
Why Zookeeper Connections Fail After 1 MB and How to Fix Them

A new scheduled task in a staging environment kept failing because of Zookeeper disconnections caused by packets exceeding the default 1 MB jute.maxbuffer limit. This article explains the root cause, the client's heartbeat timing, and how adjusting -Djute.maxbuffer or upgrading Zookeeper resolves the issue.

Vipshop Quality Engineering

Problem Statement

The issue was first reported by the business team: a newly added scheduled task in the Staging environment could not be dispatched on time. Log analysis showed Zookeeper disconnection errors that repeated and kept growing in number.

Analysis Process

Initial checks revealed nothing unusual, and attempts to reproduce the problem failed: the ZK reconnection logic is well tested and normally recovers on its own. The logs, however, showed that the ZK session had still not been re-established after more than ten minutes, which raised the question of whether the problem was on the server side.

Reading the Zookeeper source code showed that when an incoming packet is larger than maxBuffer (default 1 MB), the server throws a "Len error 1733124" exception, where the number is the length in bytes of the rejected packet. On reconnection the client bundles all of its watches into a single packet; if the combined size exceeds the server's jute.maxbuffer setting, the server rejects the connection and the client keeps retrying.
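To get a feel for how many watches it takes to cross the limit, the following is a rough sketch that estimates the size of the single watch packet. It assumes a simple layout (an 8-byte request header, an 8-byte zxid, and three length-prefixed lists of length-prefixed UTF-8 paths), which is an approximation and not a byte-exact model of the wire format. The function names and the sample paths are hypothetical; the default limit used is 0xfffff bytes.

```python
# Rough estimate of the watch re-registration packet size.
# Assumption: header (xid + type), zxid, then three lists of paths,
# each list and each path prefixed with a 4-byte length.

DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # 1,048,575 bytes, roughly 1 MB


def estimate_set_watches_size(data_watches, exist_watches, child_watches):
    """Approximate the serialized size in bytes of the bundled watch packet."""
    size = 8   # request header: xid (int) + type (int)
    size += 8  # last seen zxid (long)
    for paths in (data_watches, exist_watches, child_watches):
        size += 4  # element count of this list
        for path in paths:
            size += 4 + len(path.encode("utf-8"))  # length prefix + bytes
    return size


def exceeds_limit(size, max_buffer=DEFAULT_JUTE_MAXBUFFER):
    """True if the server would reject a packet of this size."""
    return size > max_buffer


# Hypothetical workload: 40,000 data watches on 32-byte paths.
paths = [f"/jobs/staging/task-{i:06d}/status" for i in range(40_000)]
size = estimate_set_watches_size(paths, [], [])
print(f"estimated packet size: {size} bytes, rejected: {exceeds_limit(size)}")
```

Under these assumptions, a few tens of thousands of watches on moderately long paths are enough to exceed 1 MB, which matches the order of magnitude of the length reported in the error.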

The Zookeeper client's reconnection strategy works as follows: once 1/3 of the session timeout has passed, it sends a heartbeat; once 2/3 of the session timeout has passed without a response, it tries to connect to another server while keeping the old session ID. Because every reconnection attempt resends the same oversized watch packet, each attempt is rejected in turn and the session is never restored.
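The timing described above can be sketched as a small calculation. This is only an illustration of the 1/3 and 2/3 thresholds from the text, not the client's actual implementation; the function names and the returned strings are hypothetical.

```python
# Illustration of the 1/3 and 2/3 thresholds described in the text.
# All names here are hypothetical; this is not the real client code.


def client_timing(session_timeout_ms):
    """Return (heartbeat_after_ms, reconnect_after_ms) for a session timeout."""
    heartbeat_after_ms = session_timeout_ms // 3
    reconnect_after_ms = session_timeout_ms * 2 // 3
    return heartbeat_after_ms, reconnect_after_ms


def next_action(idle_ms, session_timeout_ms):
    """Decide what the client does after idle_ms without hearing from the server."""
    heartbeat_after, reconnect_after = client_timing(session_timeout_ms)
    if idle_ms >= reconnect_after:
        return "try another server (same session ID)"
    if idle_ms >= heartbeat_after:
        return "send heartbeat"
    return "wait"


# Example with a 30-second session timeout.
for idle in (5_000, 10_000, 20_000):
    print(idle, "->", next_action(idle, 30_000))
```

With a 30-second session timeout, the sketch sends a heartbeat at 10 seconds and switches servers at 20 seconds, so a client whose watch packet is always rejected cycles through servers without ever recovering.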

Solution

After reproducing the condition in a test environment, we resolved the problem by raising the server JVM parameter -Djute.maxbuffer to a value larger than the rejected packet. A more permanent fix is to upgrade Zookeeper to version 3.4.8 or later.
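One possible way to pick a "suitable larger value" is to start from the length reported in the "Len error" log line and add headroom. The sketch below does that and rounds up to a whole number of MiB. The 2x headroom factor and the helper names are assumptions for illustration, not a recommendation from the Zookeeper project.

```python
import math

MIB = 1024 * 1024


def suggest_max_buffer(observed_len, headroom=2.0):
    """Scale the observed packet length by a headroom factor (assumed 2x)
    and round up to the next whole MiB."""
    needed = math.ceil(observed_len * headroom)
    return math.ceil(needed / MIB) * MIB


def jvm_flag(max_buffer):
    """Format the value as the JVM system property used in the article."""
    return f"-Djute.maxbuffer={max_buffer}"


# The length taken from the "Len error 1733124" message in the server log.
value = suggest_max_buffer(1733124)
print(jvm_flag(value))
```

The Zookeeper administrator documentation notes that jute.maxbuffer should be set consistently on all servers and clients, so a value chosen this way would need to be rolled out everywhere, not on a single node.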

The issue was shared with other product teams using Zookeeper, leading to further discussions.
