Design and Implementation of a Zookeeper Operator for Kubernetes

This article outlines the design, functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault‑tolerance, and upgrade strategies of a Zookeeper operator on Kubernetes, including code examples, service configurations, and integration with Prometheus and OAM standards.

Manbang Technology Team

Introduction

At KubeCon 2018, Alibaba’s Chen Jun introduced the concept of a Node Operator, inspiring the development of a Zookeeper Operator to containerize NoSQL components and manage their lifecycle on Kubernetes.

Functional Requirements

The operator must provide rapid deployment, safe scaling, automated monitoring, self‑healing, and a visual operations interface.

CRD Definition

The first step is to define a declarative custom resource spec that includes node resources, monitoring components, replica count, and persistent storage.

Architecture

Deploy: Generates native resources such as StatefulSet, Service, ConfigMap, and PersistentVolume for fast Zookeeper cluster deployment.

Monitor: Creates ServiceMonitor and PrometheusRule resources to register the cluster with Prometheus and set alerting policies.

Scale: Controls scaling and rolling upgrades, keeping leader‑follower switches to a minimum during restarts.
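The three modules can be read as stages of a single reconcile pass. The Python sketch below is only an illustration under assumptions: the names `Observed`, `DEPLOY_KINDS`, `MONITOR_KINDS`, and `reconcile` are hypothetical, cluster state is modelled as a plain set of resource kinds, and a real operator would normally be a controller talking to the Kubernetes API. PersistentVolumes are left out of the list on the assumption that they are provisioned through the StatefulSet's volume claim templates.

```python
from dataclasses import dataclass, field


@dataclass
class Observed:
    """Hypothetical snapshot of what currently exists for one cluster."""
    resources: set = field(default_factory=set)  # e.g. {"Service", "StatefulSet"}
    ready_replicas: int = 0


# Resource kinds named in the article, grouped by the module that owns them.
DEPLOY_KINDS = ["ConfigMap", "Service", "StatefulSet"]
MONITOR_KINDS = ["ServiceMonitor", "PrometheusRule"]


def reconcile(desired_nodes: int, exporter_enabled: bool, observed: Observed) -> list:
    """Return the ordered actions one reconcile pass would take."""
    actions = []
    # Deploy: create any native resource that is missing.
    for kind in DEPLOY_KINDS:
        if kind not in observed.resources:
            actions.append(("create", kind))
    # Monitor: only when the exporter is enabled in the spec.
    if exporter_enabled:
        for kind in MONITOR_KINDS:
            if kind not in observed.resources:
                actions.append(("create", kind))
    # Scale: only when nothing is left to create in this pass.
    if not actions and observed.ready_replicas != desired_nodes:
        actions.append(("scale", f"{observed.ready_replicas}->{desired_nodes}"))
    return actions
```

Returning a list of actions instead of performing them keeps the pass idempotent and easy to test: running it again against an up-to-date snapshot yields an empty list.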

CRD Example

apiVersion: database.ymm-inc.com/v1beta1
kind: ZooKeeper
metadata:
  name: zookeeper-sample
spec:
  version: v3.5.6
  cluster:
    name: test
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 2Gi
    exporter:
      exporter: true
      exporterImage: harbor.ymmoa.com/monitoring/zookeeper_exporter
      exporterVersion: v3.5.6
    nodeCount: 3
    storage:
      size: 100Gi
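To make the shape of this spec concrete, here is a minimal Python sketch of how a controller might validate and flatten it once the YAML has been loaded into a dict. It assumes that `resources`, `exporter`, `nodeCount`, and `storage` all nest under `spec.cluster`. The `ClusterSpec` class, the `parse_spec` function, and the odd-node-count rule are illustrative assumptions, not the operator's actual validation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Flattened, validated view of the ZooKeeper custom resource spec."""
    name: str
    version: str
    node_count: int
    cpu_request: str
    memory_limit: str
    exporter_enabled: bool
    storage_size: str


def parse_spec(spec: dict) -> ClusterSpec:
    """Validate the `spec` section of a ZooKeeper CR given as a dict."""
    cluster = spec["cluster"]
    node_count = int(cluster["nodeCount"])
    # Assumed policy: an even-sized ensemble tolerates no more failures
    # than the next smaller odd one, so only odd sizes are accepted here.
    if node_count < 1 or node_count % 2 == 0:
        raise ValueError(f"nodeCount must be a positive odd number, got {node_count}")
    resources = cluster.get("resources", {})
    return ClusterSpec(
        name=cluster["name"],
        version=spec["version"],
        node_count=node_count,
        cpu_request=resources.get("requests", {}).get("cpu", ""),
        memory_limit=resources.get("limits", {}).get("memory", ""),
        exporter_enabled=bool(cluster.get("exporter", {}).get("exporter", False)),
        storage_size=cluster.get("storage", {}).get("size", ""),
    )
```

In practice much of this validation could also live in the CRD's OpenAPI schema, so that invalid objects are rejected by the API server before the operator sees them.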

Deployment Details

Labels applied to the StatefulSet and Service for selection and monitoring:

labels:
  app: zookeeper
  app.kubernetes.io/instance: zookeeper-sample
  component: zookeeper
  zookeeper: zookeeper-sample

InitContainer copies the Zookeeper configuration file into the pod’s working directory.

Main Containers include the Zookeeper process, a monitoring sidecar (exporter), and an agent container for health checks.

Environment Variables such as POD_IP, POD_NAME, and ZK_SERVER_HEAP are injected from the pod spec.

Readiness Probe uses the ruok command to verify the node is ready before updating the dynamic configuration file.
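The `ruok` check behind the readiness probe is one of ZooKeeper's four-letter commands: the client writes `ruok` to the client port and a healthy server replies `imok`. The following Python sketch shows that exchange over a plain TCP socket. The function name and the default port 2181 are assumptions for illustration, and the article does not say how the agent container actually issues the command.

```python
import socket


def ruok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Send ZooKeeper's `ruok` command and report whether the reply is `imok`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            reply = b""
            # Read until four bytes arrive or the server closes the connection.
            while len(reply) < 4:
                chunk = conn.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Refused connections and timeouts both count as "not ready".
        return False
    return reply == b"imok"
```

Note that ZooKeeper 3.5.3 and later only answer four-letter commands listed in `4lw.commands.whitelist`, so `ruok` would need to be enabled there for a probe like this to succeed.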

Monitoring Integration

ServiceMonitor registers the exporter port http-metrics with Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: zookeeper
    component: zookeeper
spec:
  endpoints:
  - interval: 30s
    port: http-metrics

PrometheusRule creates alerting policies, e.g., sending alerts to a DingTalk robot.
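As a rough sketch of what the Monitor module might generate, the Python functions below build both manifests as plain dicts. Several details are assumptions that the article does not show: the resource names, the `selector` block, the `job` label value in the alert expression, and the alert name `ZookeeperExporterDown`. The `up` series is the standard per-target health metric recorded by Prometheus.

```python
def service_monitor(cluster: str, namespace: str = "default") -> dict:
    """Build a ServiceMonitor that scrapes the exporter's http-metrics port."""
    labels = {"app": "zookeeper", "component": "zookeeper"}
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": cluster, "namespace": namespace, "labels": labels},
        "spec": {
            # Assumed selector: match the Service labels of this one cluster.
            "selector": {"matchLabels": {**labels, "zookeeper": cluster}},
            "endpoints": [{"port": "http-metrics", "interval": "30s"}],
        },
    }


def prometheus_rule(cluster: str, namespace: str = "default") -> dict:
    """Build a PrometheusRule with one example alert (hypothetical rule)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{cluster}-rules", "namespace": namespace},
        "spec": {
            "groups": [{
                "name": f"{cluster}.rules",
                "rules": [{
                    "alert": "ZookeeperExporterDown",
                    # Assumes the scrape job label equals the cluster name.
                    "expr": f'up{{job="{cluster}"}} == 0',
                    "for": "2m",
                    "labels": {"severity": "critical"},
                    "annotations": {"summary": f"No metrics from {cluster}"},
                }],
            }],
        },
    }
```

The rule only defines when an alert fires. Delivery to a receiver such as a DingTalk robot is configured separately in Alertmanager, typically through a webhook receiver.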

Scaling and Upgrade Strategy

Scaling updates spec.cluster.nodeCount in the Zookeeper CR and triggers the operator to add or remove nodes using the Zookeeper reconfiguration API.
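A hedged sketch of how a change in nodeCount could be turned into reconfiguration steps follows. It assumes one member is added or removed per step, that new members follow StatefulSet ordinal order, that the server id is the pod ordinal plus one, and that the default ports 2888, 3888, and 2181 are in use. None of these choices is confirmed by the article. The entry format matches ZooKeeper's dynamic configuration syntax (`server.<id>=<host>:<port1>:<port2>:<role>;<clientPort>`).

```python
def server_entry(statefulset: str, service: str, namespace: str, ordinal: int) -> str:
    """Dynamic-config line for one pod, addressed through a headless Service."""
    host = f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local"
    # Assumption: server id = ordinal + 1, because ZooKeeper ids must be positive.
    return f"server.{ordinal + 1}={host}:2888:3888:participant;2181"


def plan_scale(current: int, desired: int) -> list:
    """Return ordered single-member steps from `current` to `desired` nodes."""
    if desired < 1:
        raise ValueError("desired node count must be at least 1")
    steps = []
    if desired > current:
        # Scale out: lowest new ordinal first, as a StatefulSet creates pods.
        for ordinal in range(current, desired):
            steps.append(("add", ordinal))
    else:
        # Scale in: highest ordinal first, as a StatefulSet removes pods.
        for ordinal in range(current - 1, desired - 1, -1):
            steps.append(("remove", ordinal))
    return steps
```

For example, `plan_scale(3, 5)` yields `[("add", 3), ("add", 4)]`, and each step would be applied through the reconfiguration API and confirmed before the next one starts.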

Rolling upgrades are performed by updating the StatefulSet with an OnDelete strategy; the operator deletes pods in a controlled order, respecting MaxUnavailable and leader election.

Partitioned rolling updates allow selective pod replacement based on an index, ensuring minimal disruption.
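The ordering logic described above can be sketched as a small planning function. This is an illustration under assumptions rather than the operator's real algorithm: the function name, the `modes` mapping, and the choice to restart followers from the highest ordinal down are all hypothetical. The partition rule follows the StatefulSet convention that only pods with an ordinal greater than or equal to the partition are replaced.

```python
def restart_batches(modes: dict, max_unavailable: int = 1, partition: int = 0) -> list:
    """Plan pod deletions for an OnDelete rolling upgrade.

    `modes` maps pod ordinal -> "leader" or "follower".
    Followers are restarted first in batches of at most `max_unavailable`;
    the leader is restarted last and alone, so it changes hands only once.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    # Partition: only ordinals >= partition take part in this rollout.
    eligible = [o for o in modes if o >= partition]
    followers = sorted((o for o in eligible if modes[o] != "leader"), reverse=True)
    leaders = [o for o in eligible if modes[o] == "leader"]
    batches = [
        followers[i:i + max_unavailable]
        for i in range(0, len(followers), max_unavailable)
    ]
    batches += [[o] for o in leaders]
    return batches
```

A caller would delete one batch, wait for those pods to become ready again, and only then move on. `max_unavailable` should stay below the number of nodes the ensemble can lose while keeping quorum (one for a three-node cluster).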

Agent Sidecar API

/status – returns Zookeeper node metrics (sent/received, latency, mode, version, etc.).

/runok – checks if the node is running without errors.

/health – health check for the agent itself.

/get – retrieves the current dynamic configuration.

/add and /del – add or remove cluster members via Zookeeper reconfigure.
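The article does not show how the agent gathers the data behind /status. One plausible approach, sketched below in Python, is to run ZooKeeper's `srvr` four-letter command and parse its text output into a dict. The function name `parse_srvr` and the shape of the returned dict are assumptions for illustration.

```python
def parse_srvr(text: str) -> dict:
    """Parse the output of ZooKeeper's `srvr` command into a status dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "key: value" shape
        key, value = key.strip().lower(), value.strip()
        if key == "zookeeper version":
            # e.g. "3.5.6-<commit hash>, built on ..." -> "3.5.6"
            status["version"] = value.split("-", 1)[0]
        elif key == "latency min/avg/max":
            low, avg, high = value.split("/")
            status["latency"] = {"min": float(low), "avg": float(avg), "max": float(high)}
        elif key in ("received", "sent", "connections", "outstanding"):
            status[key] = int(value)
        elif key == "mode":
            status["mode"] = value  # leader, follower, or standalone
    return status
```

The `mode` field is the piece the operator would need most, since it tells the rolling-upgrade logic which pod currently holds the leader role.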

OAM Integration

The operator aligns with the Open Application Model (OAM) by defining reusable Components (e.g., the Zookeeper workload) and Traits (e.g., scaling and rolling‑update CRDs), enabling platform‑agnostic application description and management.

Conclusion

The Zookeeper operator demonstrates a cloud‑native approach to managing stateful services on Kubernetes, providing deployment, scaling, monitoring, fault‑tolerance, and upgrade capabilities, while offering extensibility for future features such as backup, migration, and advanced scheduling.
