
Vivo’s Large‑Scale Kubernetes Operator Practice for Multi‑Data‑Center Cluster Management

Vivo replaced error‑prone manual Ansible playbooks with a custom Kubernetes Operator that uses declarative CRDs and modular Ansible scripts to automate the full lifecycle—deployment, scaling, upgrades, and recovery—of thousands of nodes across multiple data‑centers, supported by extensive CI testing and future kubeadm integration.

vivo Internet Technology

Background: Vivo's business has been migrating to Kubernetes (k8s) across multiple data centers. Managing many large-scale k8s clusters efficiently and reliably is a key challenge. Traditional deployment relied on manually run Ansible playbooks, leading to error-prone operations, a lack of version control, insufficient testing, and tangled component parameters.

Problems with the previous approach:

Manual, “black‑screen” operations cause mistakes and configuration drift.

No proper versioning of deployment scripts, hindering upgrades.

Long validation cycles without automated test cases or CI.

Monolithic Ansible tasks; no modularization for etcd, Docker, k8s, network plugins, and addons.

Binary‑only deployment requires a custom management framework, making the process cumbersome.

Component parameters are chaotic; a single k8s version may have over 100 configurable flags.

The article introduces a custom Kubernetes-Operator that uses declarative APIs: cluster administrators interact through CR (custom resource) objects, simplifying operations and reducing risk. A single admin can manage thousands of nodes.

Cluster Deployment Practice:

The deployment workflow consists of nine steps: Bootstrap OS → Pre-install → Install Docker → Install etcd → Install Kubernetes Master → Install Kubernetes Node → Configure network plugin → Install addons → Post-install setup.
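The steps above map naturally onto a top-level Ansible playbook that applies one role per phase. The following is a sketch only; the role names, host groups, and file layout are assumptions, not Vivo's actual structure:

```yaml
# site.yml — illustrative top-level playbook; role and group names are hypothetical.
- hosts: all
  roles:
    - bootstrap-os      # kernel params, packages, users
    - preinstall        # certificates, directories, sysctl
    - docker

- hosts: etcd
  roles:
    - etcd

- hosts: kube_master
  roles:
    - k8s-master

- hosts: kube_node
  roles:
    - k8s-node

- hosts: all
  roles:
    - network-plugin

- hosts: kube_master
  roles:
    - addons
    - postinstall       # smoke checks, labels, taints
```

Splitting each phase into its own role is what later allows individual components to be re-run without executing the whole pipeline.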

After clusters are provisioned, modular Ansible scripts handle individual components (Docker, etcd, k8s, network plugin, addons) to avoid full-scale re-execution. Component parameters are managed via the ComponentConfig API, improving maintainability, upgradeability, programmability, and configurability.
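As a concrete instance of the ComponentConfig approach, the kubelet's many command-line flags can be expressed as a structured KubeletConfiguration file (a real upstream API group, kubelet.config.k8s.io); the specific values below are examples only:

```yaml
# Example KubeletConfiguration; values are illustrative, not Vivo's settings.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
clusterDomain: cluster.local
clusterDNS:
  - 10.96.0.10
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
```

A versioned file like this can be diffed, reviewed, and validated in CI, which is far harder with a hundred-plus loose flags.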

Future plans include moving to kubeadm for lifecycle management, leveraging its certificate handling, kubeconfig generation, image management, and addon installation.
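For reference, kubeadm drives such a setup from a declarative configuration file of its own; a minimal sketch, where the version, VIP, and subnets are placeholders:

```yaml
# Minimal kubeadm ClusterConfiguration; addresses and version are placeholders.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.23.0
controlPlaneEndpoint: "10.0.0.100:6443"   # load-balancer VIP
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
```

Adopting kubeadm would offload certificate rotation, kubeconfig generation, and core addon installation to upstream-maintained code.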

Ansible Usage Guidelines:

Prefer built-in Ansible modules.

Avoid hostvars and delegate_to.

Use --limit to restrict execution to the target hosts.
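A sketch of these guidelines in practice, assuming a hypothetical node playbook: built-in modules keep tasks idempotent, and --limit confines a run to the hosts actually being changed.

```yaml
# node.yml — hypothetical playbook following the guidelines above.
- hosts: kube_node
  tasks:
    - name: Distribute kubelet configuration     # built-in module, idempotent
      ansible.builtin.copy:
        src: kubelet-config.yaml
        dest: /etc/kubernetes/kubelet-config.yaml
        mode: "0644"
      notify: restart kubelet
  handlers:
    - name: restart kubelet
      ansible.builtin.systemd:
        name: kubelet
        state: restarted
```

Invoked as, for example, ansible-playbook node.yml --limit new-node-01, the run touches only the node being added rather than the whole fleet.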

CI Matrix Testing:

Extensive CI tests are run for syntax (ansible‑lint, shellcheck, yamllint, etc.) and cluster functionality (deployment, scaling, upgrades, parameter changes). Performance and feature tests include API server health, network connectivity, node health, k8s e2e and conformance tests.

The CI pipeline uses GitLab, gitlab-runner, Ansible, and KubeVirt. The flow is:

Developer submits a PR.

CI triggers ansible syntax checks.

Ansible creates a namespace, PVC, and a KubeVirt VM template; the VM runs on k8s.

Ansible deploys the k8s cluster.

Functional verification and performance tests are executed.

Resources (KubeVirt VM, PVC) are destroyed.

Multiple PRs generate isolated jobs in the cluster, achieving a “k8s on k8s” architecture.
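The flow above could be expressed as a GitLab CI pipeline along these lines; stage names, playbook paths, and the when: always teardown are illustrative, not Vivo's actual configuration:

```yaml
# .gitlab-ci.yml sketch; stages and playbook paths are hypothetical.
stages: [lint, provision, deploy, verify, cleanup]

lint:
  stage: lint
  script:
    - ansible-lint playbooks/
    - yamllint .
    - shellcheck scripts/*.sh

provision:
  stage: provision
  script:
    - ansible-playbook ci/kubevirt-vms.yml   # namespace, PVC, VM template

deploy:
  stage: deploy
  script:
    - ansible-playbook -i ci/inventory cluster.yml

verify:
  stage: verify
  script:
    - ansible-playbook ci/e2e-tests.yml      # health, connectivity, conformance

cleanup:
  stage: cleanup
  when: always                               # tear down VMs/PVCs even on failure
  script:
    - ansible-playbook ci/teardown.yml
```

Because each pipeline provisions its own namespaced VMs, concurrent PRs stay isolated from one another.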

Kubernetes-Operator Overview:

Operators extend the k8s API to manage complex applications. The custom CRDs include:

ClusterDeployment: Top-level CR containing all configuration parameters (etcd, k8s, LB, version, network, addons).

MachineSet: Group of machines for control, compute, or etcd roles.

Machine: Individual machine details and status.

Cluster: Status subresource linked to ClusterDeployment.

Ansible executor: Jobs, ConfigMaps, and Secrets that run Ansible playbooks and store inventory and variables.

Extension controllers: Addons, cluster install, remoteMachineSet, cloud-provider integrations, etc.
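A ClusterDeployment CR might look roughly like the following; the API group, version, and every field name here are assumptions for illustration, since the article does not publish the schema:

```yaml
# Hypothetical ClusterDeployment CR; group/version and fields are illustrative.
apiVersion: cluster.vivo.example.com/v1alpha1
kind: ClusterDeployment
metadata:
  name: dc1-prod
spec:
  kubernetesVersion: v1.23.0
  etcd:
    replicas: 3
  loadBalancer:
    vip: 10.0.0.100
  network:
    plugin: calico
    podCIDR: 10.244.0.0/16
  addons:
    - coredns
    - metrics-server
```

The point of a single top-level CR is that an administrator declares the desired cluster once, and the controllers derive MachineSets, Machines, and executor Jobs from it.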

The architecture places the operator in a metadata PaaS platform that centrally manages multiple business clusters, providing multi-cloud management, unified scheduling, high availability, and disaster recovery.

Scenarios:

1. Cluster expansion: When a new application requires more capacity, admins approve physical resources and generate Machine CRs; the operator creates MachineSets that bind to idle machines, runs Ansible to provision them, and updates their statuses.

2. Failure recovery: If a business cluster fails, the operator either relies on other clusters to absorb the load or creates a new cluster from the standby pool, provisioning machines via Ansible and migrating workloads.
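The expansion flow above could be driven by a MachineSet such as the following; again, the API group and field names are hypothetical:

```yaml
# Hypothetical MachineSet for the expansion scenario; fields are illustrative.
apiVersion: cluster.vivo.example.com/v1alpha1
kind: MachineSet
metadata:
  name: dc1-prod-compute
spec:
  role: compute          # control / compute / etcd
  replicas: 5            # raising this binds 5 more idle Machines
  clusterDeploymentRef:
    name: dc1-prod
```

Scaling then becomes a one-field edit: bump replicas, and the operator selects idle machines and provisions them via Ansible.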

Operator Execution Flow:

Cluster admin or platform creates a ClusterDeployment CR.

ClusterDeployment controller detects the change.

MachineSet and Machine resources are created.

ClusterInstall controller generates ConfigMaps and Jobs that run the appropriate Ansible playbooks (scale, upgrade, install).

Scheduler schedules the Job pods.

Kubelet executes the Ansible playbooks inside the pods.

Job controller updates the ClusterDeployment status and cleans up resources.

NodeHealthy controller syncs node readiness back to Machine status.

Addons controller installs or upgrades addons once the cluster is ready.
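The Ansible-executor step in this flow amounts to the controller rendering a Job whose pod runs the playbook against an inventory mounted from a ConfigMap. A minimal sketch, where the image name, playbook, and ConfigMap name are placeholders:

```yaml
# Hypothetical Job produced by the ClusterInstall controller.
apiVersion: batch/v1
kind: Job
metadata:
  name: dc1-prod-install
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ansible
          image: ansible-runner:latest     # placeholder image
          command: ["ansible-playbook", "-i", "/inventory/hosts", "install.yml"]
          volumeMounts:
            - name: inventory
              mountPath: /inventory
      volumes:
        - name: inventory
          configMap:
            name: dc1-prod-inventory       # rendered by the controller
```

Running playbooks as Jobs gives the operator retries, logs, and cleanup through the standard Job controller rather than a bespoke execution framework.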

Conclusion: Vivo's large-scale k8s operation combines optimized deployment tooling, extensive CI matrix testing, and a Kubernetes-Operator that automates the full lifecycle of clusters across multiple data centers. The operator enables "K8s as a Service", ensuring safety, stability, and reduced operational overhead while supporting future expansion to public-cloud integration.
