Why Multi-Cluster Kubernetes Matters and How Vivo Tackles It
This article examines the motivations, benefits, and existing solutions for Kubernetes multi‑cluster management, then details Vivo's non‑federated and federated approaches, application‑centric continuous delivery, elastic scaling, unified scheduling, gray‑release strategies, and summarizes the current state and challenges.
Why Multi‑Cluster Is Needed
With the rapid growth of Kubernetes and cloud‑native technologies, containerized workloads have become standardized and decoupled from underlying infrastructure, providing a solid foundation for multi‑cluster and hybrid‑cloud deployments.
1. Single‑cluster capacity limits
A single cluster is tested to support at most 5,000 nodes and 150,000 Pods, and the practical maximum node count varies with deployment patterns and workload characteristics.
2. Multi‑cloud usage
Avoid vendor lock‑in and leverage the latest technologies across different clouds for cost or capability reasons.
3. Traffic bursts
During sudden traffic spikes, workloads can be expanded to public‑cloud clusters, requiring IaaS integration for automatic scaling of CPU‑ and memory‑intensive services.
4. High availability
Single clusters cannot survive network or data‑center failures; a primary‑backup model or read‑write separation across clusters ensures continuity.
5. Geo‑distributed active‑active
Real‑time data synchronization enables simultaneous reads and writes across clusters for critical data such as global user accounts.
6. Regional affinity
Placing services in the same region reduces bandwidth costs and balances load locally.
Multi‑Cluster Exploration
2.1 Community Projects
Federation v1: Deprecated; its annotation-driven API layer sat on top of Kubernetes and diverged from the native APIs.
Federation v2 (KubeFed): Also retired; it propagated resources through dedicated Federated* CRD types with placement and override policies rather than through native Kubernetes APIs.
Karmada : Builds on Federation v2 concepts, adding native API support, multi‑level HA, automatic fault‑migration, cross‑cluster autoscaling, and service discovery.
Clusternet : Open‑source platform for multi‑cluster management and cross‑cluster application orchestration, designed for hybrid‑cloud, distributed‑cloud, and edge scenarios.
OCM (Open Cluster Management) : Simplifies multi‑cloud cluster management, supports resource and workload orchestration, and offers an extensible addon framework.
2.2 Vivo’s Exploration
2.2.1 Non‑Federated Cluster Management
Vivo uses a unified web UI to import Kubernetes cluster credentials, view resources, and manage Deployments, Services, and LoadBalancers without adding federation complexity. CI/CD, monitoring, and alerting are integrated, and most workloads remain managed as independent clusters.
2.2.2 Federated Cluster Management
Federation unifies resource management and scheduling across clusters, supporting hybrid‑cloud, private‑cloud, and edge deployments. Although it adds architectural complexity and control‑plane overhead, it enables exciting capabilities such as transparent workload migration and cross‑cluster application orchestration.
Vivo’s federated direction focuses on four areas:
Resource distribution and orchestration
Elastic burst handling
Multi‑cluster scheduling
Service governance and traffic routing
Application‑Oriented Multi‑Cluster Practices
Elasticity : Ensures rapid deployment, scaling, and reliable service delivery.
Usability : Leverages Service Mesh for global governance of micro‑service applications.
Portability : Enables seamless migration across clusters and clouds.
3.1 Continuous Delivery
Vivo registers multiple Kubernetes clusters with Karmada, which handles resource scheduling and fault‑tolerance. The container platform manages K8s resources, Karmada policies, and configurations. CI/CD performs unit tests, security scans, image builds, and generates K8s objects via the platform API for unified delivery.
For complex scenarios such as in-place upgrades and gray releases, OpenKruise is used. With Karmada resources such as PropagationPolicy and OverridePolicy added on top of the workload objects, a single application can involve up to twelve configuration objects.
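As a concrete illustration, the following is a minimal sketch of the two Karmada policy objects named above, distributing a hypothetical Deployment called demo-app to two member clusters (the application, cluster, and registry names are assumptions, not Vivo's actual configuration):

```yaml
# Distribute the demo-app Deployment to clusters member1 and member2.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
---
# Rewrite the image registry only for the copy running in member2,
# e.g. to pull from a registry local to that cluster's region.
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: demo-app-override
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  overrideRules:
    - targetCluster:
        clusterNames:
          - member2
      overriders:
        imageOverrider:
          - component: Registry
            operator: replace
            value: registry-member2.example.com
```

Together with the Deployment, Service, HPA, and OpenKruise objects, it is easy to see how an application's configuration footprint grows toward the dozen objects mentioned above.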
3.2 Elastic Scaling
3.2.1 FedHPA (Cross‑Cluster HPA)
FedHPA uses native HPA objects; Karmada's FedHpaController distributes min/max replica settings across member clusters and keeps status synchronized.
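Since FedHPA builds on native HPA objects, the input is a standard autoscaling/v2 manifest like the sketch below; the global min/max bounds are what the controller splits across member clusters (the workload name and thresholds are illustrative assumptions):

```yaml
# A plain Kubernetes HPA; FedHPA divides minReplicas/maxReplicas
# across member clusters and aggregates status back.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 4    # global floor, split across clusters
  maxReplicas: 20   # global ceiling, split likewise
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```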
3.2.2 CronHPA (Scheduled Scaling)
CronHPA defines time-based scaling windows. The controller creates a CronHPA resource, which Karmada-scheduler translates into per-cluster replica allocations using the go-cron library.
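Vivo's CronHPA CRD is not public, so the group, version, and field names in this sketch are purely hypothetical; it only illustrates the shape of time-windowed scaling rules parsed by a cron library:

```yaml
# Hypothetical CronHPA resource: each rule pins the target replica
# count at a cron-scheduled time; the scheduler then divides the
# total across member clusters.
apiVersion: autoscaling.example.com/v1alpha1
kind: CronHPA
metadata:
  name: demo-app-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  rules:
    - name: evening-peak
      schedule: "0 0 19 * * *"   # cron expression with a seconds field
      targetReplicas: 30
    - name: overnight-trough
      schedule: "0 0 2 * * *"
      targetReplicas: 6
```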
3.2.3 Manual & Targeted Scaling
Users specify a workload and desired replica count; Karmada-scheduler distributes the change across clusters. Targeted scaling can delete specific Pods via ScaleStrategy.PodsToDelete and custom resource interpretation.
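The ScaleStrategy.PodsToDelete field referenced here comes from OpenKruise's CloneSet API. A minimal sketch of targeted scale-in (workload and Pod names are assumptions): lowering replicas while listing the Pods to remove lets the controller delete exactly those instances instead of arbitrary ones.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: demo-app
spec:
  replicas: 4                  # lowered from 5 together with podsToDelete
  scaleStrategy:
    podsToDelete:
      - demo-app-x7k2p         # hypothetical Pod chosen for scale-in
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
```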
3.3 Unified Scheduling
3.3.1 Multi‑Cluster Scheduling
Karmada’s scheduler and emulator estimate resources per cluster. Workloads generate ResourceBinding (RB) objects, which are pre‑selected and then optimally assigned to clusters using static or dynamic strategies.
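The static and dynamic strategies mentioned above correspond to Karmada's replica-scheduling settings in a PropagationPolicy. A sketch of the dynamic variant, which weights clusters by their estimated spare capacity (the application name is an assumption; a static strategy would instead list fixed weights under staticWeightList):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-dynamic
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided        # split replicas across clusters
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas    # weight by estimated free capacity
```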
3.3.2 Rescheduling
If a cluster fails or RB allocation deviates from expectations, Karmada re‑evaluates and redistributes workloads to healthy clusters.
3.3.3 Single‑Cluster Scheduler Simulation
Current simulators model four scheduling algorithms using a fake client; improvements are needed to match production schedulers.
3.4 Gray Release
3.4.1 Application Migration
Non‑federated applications are gradually migrated to Karmada via a whitelist, allowing seamless user experience while both management modes coexist.
3.4.2 Rollback
When migration errors occur, administrators remove the application from the whitelist, annotate workloads, and adjust Karmada interpreters to prevent further replica changes, effectively halting control‑plane actions.
3.4.3 Migration Strategy
Test → Pre‑release → Production
Batch gray rollout with a 1:2:7 ratio for major changes
Both parties verify and monitor for 5‑10 minutes
Proceed if no anomalies; otherwise trigger rollback
Summary
Vivo currently relies on non‑federated multi‑cluster management combined with CI/CD to provide rolling updates, gray releases, manual and targeted scaling, and elastic scaling. While non‑federated solutions lack unified resource management, fault‑tolerance, and cross‑cluster scheduling, Vivo is actively exploring these capabilities through federated approaches. Federation adds architectural complexity and control‑plane overhead, and the ecosystem is still evolving, so enterprises should align federation adoption with their specific needs and robust operational monitoring.
References
GitHub: kubernetes-retired/federation
GitHub: kubernetes-retired/kubefed
GitHub: karmada-io/karmada
GitHub: clusternet/clusternet
GitHub: open-cluster-management-io/ocm
GitHub: kubernetes-sigs/cluster-api
GitHub: clusterpedia-io/clusterpedia
GitHub: submariner-io/submariner
GitHub: karmada-io/multi-cluster-ingress-nginx
GitHub: istio/istio
GitHub: cilium/cilium