How Vivo Built a Scalable Karmada Operator with Ansible for Multi‑Cluster Management
Vivo’s engineering team shares its practical experience building a Karmada‑Operator with the Operator SDK and Ansible, covering the project background, deployment challenges, design choices, API and architecture, etcd management, member‑cluster handling, the CI pipeline, and performance testing for robust multi‑cloud Kubernetes orchestration.
Background
Karmada is an open‑source cloud‑native multi‑cloud container orchestration project that has attracted many enterprises and is running in production. Multi‑cloud has become a foundational infrastructure for data‑center construction, driving rapid development of multi‑region disaster recovery, large‑scale multi‑cluster management, cross‑cloud elasticity, and migration scenarios.
Vivo migrated its business to Kubernetes, and the resulting rapid growth in cluster size and count increased operational difficulty. After an internally built multi‑cluster management solution still fell short, the team evaluated community projects and chose Karmada for the following reasons:
- Unified management of multiple Kubernetes clusters, reducing platform complexity.
- Cross‑cluster elastic scaling and scheduling to improve resource utilization and cut costs.
- Native Kubernetes APIs, lowering migration effort.
- Disaster recovery: a decoupled control plane and member clusters enable resource reallocation on failures.
- Extensibility: custom scheduling plugins and OpenKruise interpreter plugins can be added.
Karmada‑Operator Implementation
2.1 Operator SDK Overview
The Operator Framework provides a toolkit for building Kubernetes native applications (Operators) in an automated, scalable way. Operators simplify management of complex, stateful workloads by leveraging Kubernetes extensibility for provisioning, scaling, backup, and recovery.
Writing Operators can be challenging due to low‑level APIs, boilerplate code, and lack of modularity. The Operator SDK mitigates these challenges by offering high‑level APIs, scaffolding, code generation, and extensions for common use cases.
2.2 Solution Selection
- Option 1: Go‑based Operator – well suited for stateful services on Kubernetes, but limited for binary deployments, external etcd, and member‑cluster registration.
- Option 2: Ansible‑based Operator – supports both Kubernetes‑based and binary deployments, external etcd, and member‑cluster lifecycle management via SSH and Ansible modules.
- Option 3: Hybrid Go + Ansible Operator – combines the capabilities of Option 2 with Go‑level flexibility.
After evaluating the three options, Vivo selected the Ansible‑based Operator (Option 2) because it provides feature parity with the Go SDK, matches Karmada’s production requirements, is easy to learn for Ansible users, offers strong extensibility, and avoids the need for extensive Go code.
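With an Ansible‑based operator, the SDK maps each watched CRD to an Ansible role through a watches.yaml file. A minimal sketch for the KarmadaDeployment kind follows; the API group and role name are illustrative assumptions, not Vivo’s actual values:

```yaml
# watches.yaml — maps each watched GVK to the Ansible role that reconciles it.
# Group and role names are illustrative, not the project's real values.
- version: v1alpha1
  group: install.karmada.example.com
  kind: KarmadaDeployment
  role: karmada            # role bundled into the operator image
  reconcilePeriod: 60s     # how often to re-run reconciliation
```

The ansible‑operator binary watches the listed kinds and runs the named role with the CR’s spec fields injected as Ansible variables.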
2.3 API Design
The Operator SDK can generate a CRD named KarmadaDeployment. Additional CRDs, EtcdBackup and EtcdRestore, are defined for etcd data management. The spec fields are translated into Ansible variables, and the status field is populated by the Ansible runner or the k8s_status module.
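As an illustration, a KarmadaDeployment CR might look like the following; the field names are assumptions based on the design described above, not the exact schema:

```yaml
# Illustrative KarmadaDeployment CR — field names are assumptions.
apiVersion: install.karmada.example.com/v1alpha1
kind: KarmadaDeployment
metadata:
  name: karmada-demo
spec:
  mode: binary              # or "container" for the Kubernetes-based deployment
  etcd:
    external: true
    endpoints:
      - https://10.0.0.5:2379
  members:
    - name: member1
      kubeconfigSecret: member1-kubeconfig   # credentials for registration
```

Each spec field above would surface inside the Ansible roles as a variable, and the reconciliation result would be written back to the CR’s status.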
2.4 Architecture Design
The architecture supports both containerized and binary deployments. Containerized deployment relies solely on Kubernetes APIs, while binary deployment uses SSH to manage the Karmada control plane and member clusters. Member clusters are registered via provided kubeconfig and credentials defined in the CR.
2.5 Control Plane Management
- Standardized certificate management using OpenSSL, separating etcd and Karmada certificates.
- The karmada‑apiserver can use external load balancers instead of Kubernetes Services.
- Flexible upgrade strategies supporting component‑wise and full‑cluster upgrades.
- Rich global variable definitions enabling per‑component configuration changes.
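The separation of etcd and Karmada certificates can be sketched with plain OpenSSL: a dedicated etcd CA issues the etcd member certificates, independent of the Karmada CA. The CNs, key sizes, and validity periods below are illustrative:

```shell
# Dedicated CA for etcd, separate from the Karmada CA (names illustrative)
openssl genrsa -out etcd-ca.key 2048
openssl req -x509 -new -nodes -key etcd-ca.key -subj "/CN=etcd-ca" \
  -days 3650 -out etcd-ca.crt
# Server certificate for one etcd member, signed by the etcd CA
openssl genrsa -out etcd-server.key 2048
openssl req -new -key etcd-server.key -subj "/CN=etcd-server" -out etcd-server.csr
openssl x509 -req -in etcd-server.csr -CA etcd-ca.crt -CAkey etcd-ca.key \
  -CAcreateserial -days 3650 -out etcd-server.crt
# Confirm the member certificate chains only to the etcd CA
openssl verify -CAfile etcd-ca.crt etcd-server.crt
```

A Karmada CA would be generated the same way with its own key pair, so compromising or rotating one trust domain never touches the other.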
2.6 etcd Cluster Management
etcd is the metadata store for Karmada and must be highly available in production. The Operator provides Ansible plugins to manage etcd clusters, including adding/removing members, backup (e.g., to CephFS), recovery, and health checks.
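The backup step can be sketched as an Ansible task pair: snapshot etcd with etcdctl, then copy the snapshot to a CephFS mount. Paths, endpoints, and certificate locations here are assumptions, not Vivo’s actual layout:

```yaml
# Illustrative Ansible tasks — paths and endpoints are assumptions.
- name: Take an etcd snapshot
  command: >
    etcdctl snapshot save
    /var/backups/etcd/snapshot-{{ ansible_date_time.iso8601_basic_short }}.db
    --endpoints=https://127.0.0.1:2379
    --cacert=/etc/karmada/pki/etcd-ca.crt
    --cert=/etc/karmada/pki/etcd-server.crt
    --key=/etc/karmada/pki/etcd-server.key
  environment:
    ETCDCTL_API: "3"

- name: Copy the snapshot to the CephFS backup mount
  copy:
    src: /var/backups/etcd/
    dest: /mnt/cephfs/etcd-backups/
    remote_src: true
```

Recovery reverses the flow with etcdctl snapshot restore, which is what the EtcdRestore CRD would drive.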
2.7 Member Cluster Management
Member clusters are registered and deregistered through dynamic Ansible inventory generation based on the KarmadaDeployment spec. Two roles, add‑member and del‑member, handle join and unjoin operations, supporting concurrent processing and an optional SSH mode.
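The dynamic-inventory idea can be sketched as a small generator that turns the member list from the CR spec into Ansible’s inventory JSON. The field names (name, host, user) are illustrative, not the actual schema:

```python
import json


def build_inventory(members):
    """Build a minimal Ansible dynamic inventory from a list of member
    clusters (illustrative dicts with name/host and optional SSH user)."""
    return {
        "member_clusters": {"hosts": [m["name"] for m in members]},
        "_meta": {
            "hostvars": {
                m["name"]: {
                    "ansible_host": m["host"],
                    "ansible_user": m.get("user", "root"),  # assumed default
                }
                for m in members
            }
        },
    }


if __name__ == "__main__":
    members = [
        {"name": "member1", "host": "10.0.0.11"},
        {"name": "member2", "host": "10.0.0.12", "user": "ops"},
    ]
    # Ansible consumes this JSON when the script is used as an inventory source
    print(json.dumps(build_inventory(members), indent=2))
```

The add‑member and del‑member roles would then run against the member_clusters group, which is what makes concurrent joins and SSH-mode operation straightforward.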
CI Introduction
To improve developer experience, Vivo built a CI pipeline on GitHub self‑hosted runners and KubeVirt VMs. The pipeline runs syntax and unit tests, provisions VMs, deploys one host cluster and two member clusters, installs Karmada, and executes e2e and Bookinfo tests. Planned CI matrix tests include linting (ansible‑lint, shellcheck, yamllint, etc.), full deployment validation (karmadactl, charts, binary), member join/unjoin, Karmada upgrades, etcd backup/restore, and performance testing with 2,000‑node simulations.
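The lint stage of such a pipeline could be sketched as a GitHub Actions job; the runner labels and repository layout below are assumptions:

```yaml
# Illustrative workflow fragment — labels and paths are assumptions.
name: ci
on: [push, pull_request]
jobs:
  lint:
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v4
      - name: Lint Ansible roles
        run: ansible-lint roles/
      - name: Lint shell scripts
        run: shellcheck scripts/*.sh
      - name: Lint YAML
        run: yamllint .
```

The deployment, join/unjoin, upgrade, and backup/restore validations would be further jobs in the same matrix, gated on the lint stage.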
Conclusion
Through community research and Vivo’s practice, the Karmada‑Operator design was finalized. The Ansible‑based Operator offers high extensibility, reliability, intuitive logic authoring, and out‑of‑the‑box functionality, providing a robust foundation for managing Karmada at scale. Remaining challenges include adding webhook support and richer CRD scaffolding. Ongoing development will continue to enhance features and stability.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.