Why ZooKeeper Is Not the Best Choice for Service Discovery: Design Considerations and Lessons from Alibaba
The article analyzes the limitations of using ZooKeeper for service discovery, compares it with Alibaba's ConfigServer, discusses CAP trade‑offs, scalability, health‑check design, disaster recovery, and provides practical guidance on when to choose alternative registration solutions.
Looking back at history from the present often invites counterfactuals: what would have happened if certain events had occurred earlier, or not at all? It is the same impulse behind wondering how history would have unfolded had Archduke Franz Ferdinand survived.
At the end of 2007, Taobao launched an internal refactoring project called "Five‑Color Stone," which later became the foundation for Taobao's service‑oriented architecture and led to the creation of the ConfigServer service registry.
Around 2008, Yahoo began publicly promoting its distributed coordination product ZooKeeper, which was inspired by Google's Chubby and Paxos papers.
In November 2010, ZooKeeper graduated from an Apache Hadoop sub‑project to a top‑level Apache project, officially becoming an industrial‑grade, stable product.
In 2011, Alibaba open‑sourced Dubbo and, to decouple it from internal systems, adopted ZooKeeper as its registry, establishing a typical Dubbo + ZooKeeper service‑oriented solution that boosted ZooKeeper's reputation as a registry.
By the 2015 Double‑11 shopping festival, ConfigServer had been in use for nearly eight years, supporting millions of services within Alibaba and driving the evolution from ConfigServer 2.0 to 3.0.
Fast‑forward to 2018: when evaluating service discovery, many wonder whether ZooKeeper truly remains the optimal choice.
Is ZooKeeper Really the Best Choice for Service Discovery?
Reflecting on history, we ask: what if ZooKeeper had been introduced before Alibaba's HSF ConfigServer? Would we have taken a detour of heavily modifying ZooKeeper to fit Alibaba's massive service‑oriented needs?
Today, we are convinced that ZooKeeper is not the best option for service discovery, a conclusion also reached by the authors of Eureka and by the article "Eureka! Why You Shouldn't Use ZooKeeper for Service Discovery".
---
Registry Center Requirements and Key Design Considerations
Let’s return to the requirements of a service‑discovery registry, combining Alibaba's real‑world practices to analyze why ZooKeeper may not be suitable.
Is the Registry a CP or AP System?
CAP and BASE theories are well‑known guiding principles for distributed systems; we now directly examine consistency and availability needs for a registry.
Data Consistency Requirement
The core function of a registry can be seen as a query function `Si = F(service-name)`, where `service-name` is the query key and the returned value `Si` is the list of endpoints (`ip:port`).
Note: In the following text, "service" is abbreviated as "svc".
Consider a service svcB with 10 replicas. If two instances of its caller svcA receive slightly different endpoint lists for svcB, the traffic distribution becomes mildly unbalanced, but eventual consistency can converge the data within the SLA (e.g., 1 s), making the inconsistency acceptable.
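The lookup contract above can be sketched as a plain map from service name to endpoint list. This is a minimal illustration of `Si = F(service-name)`, not ConfigServer's actual API; all names are invented:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal sketch of the registry's core contract: Si = F(service-name).
// Class and method names are hypothetical, for illustration only.
class Registry {
    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<>();

    // A provider registers one ip:port endpoint under its service name.
    void register(String serviceName, String ipPort) {
        endpoints.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>())
                 .add(ipPort);
    }

    // F(service-name): callers only need the current endpoint list;
    // a slightly stale list is acceptable as long as it converges quickly.
    List<String> lookup(String serviceName) {
        return endpoints.getOrDefault(serviceName, List.of());
    }
}
```

Two svcA instances reading this map at slightly different moments may see different lists, which is exactly the tolerable inconsistency described above.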
Partition Tolerance and Availability Requirement
Now examine the impact of a network partition on the registry. Imagine a typical three‑datacenter ZooKeeper deployment (2‑2‑1 nodes). When datacenter 3 becomes isolated, ZooKeeper node 5 is unwritable because it cannot contact the leader.
Consequently, services in datacenter 3 cannot register, restart, scale up, or down. Moreover, even though network connectivity between services inside datacenter 3 remains fine, they cannot call services in datacenters 1 or 2 because the registry refuses to serve them.
This situation violates the principle that a registry must never break service connectivity for any reason.
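The partition scenario above comes down to simple majority-quorum arithmetic, sketched here with the 2-2-1 node counts from the example (the class is illustrative, not ZooKeeper code):

```java
// Majority-quorum arithmetic behind the 2-2-1 example: a ZAB/Paxos-style
// ensemble of N nodes accepts writes only on the partition side that
// still holds a strict majority (N/2 + 1 nodes).
class QuorumCheck {
    static int quorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Can the given partition side still elect a leader and accept writes?
    static boolean canWrite(int ensembleSize, int reachableNodes) {
        return reachableNodes >= quorum(ensembleSize);
    }
}
```

With 5 nodes the quorum is 3, so datacenters 1 and 2 (4 nodes together) keep writing, while datacenter 3's single node cannot, even though every service inside datacenter 3 is healthy.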
In practice, we sometimes deliberately exploit temporary inconsistency to force same‑datacenter calls, improving latency.
Overall, in the CAP trade‑off, availability outweighs strong consistency for a registry; the design should favor AP rather than CP.
Service Scale, Capacity, and Connectivity
How large is your micro‑service landscape? Hundreds of services, thousands of nodes? Three years from now the scale could double, stressing the registry.
When the number of services exceeds a certain threshold, ZooKeeper quickly becomes a bottleneck, as its write throughput and connection count cannot scale horizontally.
While ZooKeeper excels at coarse‑grained coordination (distributed locks, leader election) where TPS demands are modest, it struggles with high‑frequency service registration, health‑check writes, and long‑lived connections required by large‑scale service discovery.
One workaround is to split business domains across multiple ZooKeeper clusters, but this contradicts the principle that a registry should not limit future service connectivity.
Does the Registry Need Persistent Storage and Transaction Logs?
ZooKeeper's ZAB protocol writes a transaction log for every write and periodically snapshots memory to disk, ensuring durability. However, for service discovery the most critical data—real‑time healthy service addresses—does not require persistence.
Historical address lists and health states are largely irrelevant; callers only need the current snapshot.
Nevertheless, metadata such as version, group, data‑center, weight, and auth policies does need persistent storage and must be searchable.
Service Health Check
When using ZooKeeper as a registry, health checks often rely on session activity and Ephemeral ZNodes, essentially tying health to TCP connection liveness, which is insufficient.
A robust registry should allow services to define custom health‑check logic rather than a one‑size‑fits‑all TCP probe.
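A registry that supports service-defined health semantics might expose a pluggable interface like the sketch below. The interface and class names are invented for illustration; the point is the contrast between connection liveness and actual serving capacity:

```java
// Sketch of pluggable health checks: the registry asks the service
// whether it can serve, instead of inferring health from TCP/session
// liveness alone. All types here are hypothetical.
interface HealthCheck {
    boolean isHealthy();
}

// TCP-style liveness: the process is reachable, nothing more.
class TcpAliveCheck implements HealthCheck {
    public boolean isHealthy() { return true; } // stands in for a successful connect
}

// Service-defined check: reachable AND actually able to serve,
// e.g. thread pool not exhausted and downstream dependencies up.
class BusinessHealthCheck implements HealthCheck {
    private final boolean threadPoolAvailable;
    private final boolean dependenciesUp;

    BusinessHealthCheck(boolean threadPoolAvailable, boolean dependenciesUp) {
        this.threadPoolAvailable = threadPoolAvailable;
        this.dependenciesUp = dependenciesUp;
    }

    public boolean isHealthy() { return threadPoolAvailable && dependenciesUp; }
}
```

A process with an exhausted thread pool still holds its ZooKeeper session open, so an ephemeral-znode check would report it healthy while the business check correctly fails it.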
Disaster Recovery Considerations for the Registry
If the registry itself crashes, service calls should remain unaffected; the registry should be a weak dependency, only needed during registration, scaling, or topology changes.
Clients must cache registry data (client snapshot) and handle full registry outages gracefully. ZooKeeper's native client lacks this capability, so additional mechanisms are required.
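One way to make the registry a weak dependency is a client-side snapshot cache that serves the last successfully fetched endpoint list when the registry is unreachable. The sketch below assumes a hypothetical `RegistryClient` interface; it is a pattern illustration, not ConfigServer's or ZooKeeper's actual client:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a client-side snapshot cache: lookups fall back to the last
// known endpoint list when the registry is down, so in-flight service
// calls survive a full registry outage.
class CachingDiscoveryClient {
    interface RegistryClient {
        List<String> fetchEndpoints(String serviceName) throws Exception;
    }

    private final RegistryClient registry;
    private final Map<String, List<String>> snapshot = new ConcurrentHashMap<>();

    CachingDiscoveryClient(RegistryClient registry) { this.registry = registry; }

    List<String> lookup(String serviceName) {
        try {
            List<String> fresh = registry.fetchEndpoints(serviceName);
            snapshot.put(serviceName, fresh);   // refresh the local snapshot
            return fresh;
        } catch (Exception registryDown) {
            // Registry outage: keep calling with the last known addresses.
            return snapshot.getOrDefault(serviceName, List.of());
        }
    }
}
```

The snapshot only goes stale for topology changes (scaling, migration) during the outage; existing call paths keep working, which is the weak-dependency property described above.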
Do You Have ZooKeeper Experts to Rely On?
ZooKeeper is simple in concept but complex in production at scale. Understanding its client/session state machine and handling exceptions like ConnectionLossException and SessionExpiredException are essential.
ConnectionLossException is recoverable but requires the application to determine request idempotency and possible state reconstruction.
SessionExpiredException forces a new session and invalidates any Ephemeral nodes or locks.
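The two exceptions demand different recovery paths, which the self-contained sketch below illustrates. The exception types here are stand-ins, not the real `org.apache.zookeeper` classes, and the retry helper is invented for illustration:

```java
// Sketch of the two recovery paths a ZooKeeper client must implement.
// RecoverableError plays the role of ConnectionLossException: retry the
// request, but only if the caller knows it is idempotent.
// FatalSessionError plays the role of SessionExpiredException: ephemeral
// nodes and locks are gone, so session state must be rebuilt first.
class RecoverableError extends Exception {}
class FatalSessionError extends Exception {}

class SessionRecovery {
    interface Op { String run() throws Exception; }

    static String execute(Op op, Runnable rebuildSession, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.run();
            } catch (RecoverableError e) {
                if (attempt >= maxRetries) throw e;   // bounded retries only
                // retry in place: same session, request assumed idempotent
            } catch (FatalSessionError e) {
                if (attempt >= maxRetries) throw e;
                rebuildSession.run(); // new session; re-create ephemeral nodes, locks
            }
        }
    }
}
```

The asymmetry is the whole point: blindly retrying after session expiry without `rebuildSession` would leave the application running with none of its ephemeral registrations in place.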
Where to Use ZooKeeper
Alibaba maintains one of the largest ZooKeeper clusters in China, with a custom high‑availability branch called TaoKeeper.
ZooKeeper shines in coarse‑grained coordination for big‑data and offline tasks, but it is ill‑suited for high‑throughput transaction scenarios, large‑scale service discovery, and health monitoring.
Thus, use ZooKeeper for coarse-grained big-data coordination, and avoid it for transaction-heavy, large-scale service discovery.
Conclusion
We are not wholly dismissing ZooKeeper; rather, based on a decade of Alibaba’s large‑scale service‑oriented practice, we summarize lessons for designing and using a service registry, hoping to guide the industry toward better choices.
Thank you for reading; if this article helped your architecture journey, feel free to share it and join our architect community.
Java Architect Essentials