Vivo’s Cloud‑Native Container Practices: High‑Availability, Automation, and Platform Evolution
Vivo’s cloud‑native journey, detailed from its 2018 machine‑learning pilot to a large‑scale container ecosystem, showcases how high‑availability design, automated multi‑cluster operations, CI/CD pipelines, and unified traffic ingress have dramatically improved efficiency, reduced costs, and enabled rapid, scalable AI‑driven services across the business.
Based on Pan Liangbiao’s talk at the 2022 Vivo Developer Conference, this article summarizes Vivo’s exploration and implementation of cloud‑native container technologies, focusing on high‑availability, automated operations, platform upgrades, and ecosystem integration.
Since 2018, Vivo has built a one‑stop cloud‑native machine‑learning platform on top of containers, supporting algorithm middle‑platform services such as data management, model training, and deployment for advertising, recommendation, and search. The success of this pilot led to a strategic upgrade toward a large‑scale, cost‑effective, cloud‑native container ecosystem.
1. Container Technology and Cloud‑Native Concepts
Container technology traces its roots to Unix chroot (1979) and has since evolved through four stages: emergence, explosive growth, commercial exploration, and expansion. Compared with virtual machines, containers offer lower overhead, faster startup, better resource utilization, and superior scalability.
Two main definitions of cloud‑native prevail: Pivotal’s (DevOps, continuous delivery, microservices, containers) and the CNCF’s, which is anchored by core projects such as Kubernetes and Prometheus. The core technologies are containers, microservices, and service mesh; the core principles are immutable infrastructure and declarative APIs.
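The declarative principle is easiest to see in a minimal Kubernetes Deployment manifest: you state the desired end state and the control plane reconciles toward it. This is a generic sketch, not a Vivo configuration; the names and image are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service            # illustrative name
spec:
  replicas: 3                   # desired state: Kubernetes keeps 3 pods running
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
      - name: app
        image: registry.example.com/demo-service:v1   # illustrative image
        ports:
        - containerPort: 8080
```

If a pod dies, the controller recreates it without any imperative intervention, which is exactly what distinguishes a declarative API from scripted operations.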
2. Value Analysis
From efficiency, cost, and quality perspectives, cloud‑native and containers provide:
Efficiency: rapid continuous delivery, portable images, elastic scaling.
Cost: on‑demand resource allocation, high scheduler utilization, reduced fragmentation.
Quality: observability, self‑healing, manageable clusters.
3. Vivo’s Container Exploration and Practice
3.1 Pilot Exploration
Starting in 2018, Vivo built a cloud‑native machine‑learning platform on containers, delivering end‑to‑end capabilities for recommendation, advertising, and search. The platform offers five advantages: full‑scene coverage, short queue time (P99 < 45 min), low cost (CPU utilization > 45 %), high efficiency (training 830 M samples/hour), and superior results (training success rate > 95 %).
3.2 Value Mining
Containers helped reduce costs (CPU utilization improvement from ~25 % to industry‑level 40‑50 %) and increase efficiency (addressing middleware upgrades, migration, testing, traffic spikes, and global deployment consistency).
3.3 Strategic Upgrade
Vivo upgraded its internal strategy to build a first‑class container ecosystem based on cloud‑native principles, adding unified traffic ingress, container operation platforms, naming services, and monitoring.
3.4 Challenges
Key challenges include rapid cluster scale growth (10 k+ hosts, 10 k+ instances), operational standardization, monitoring pressure, and seamless Kubernetes version upgrades. Platform challenges involve IP changes, ecosystem compatibility, user habits, and quantifying operational benefits.
3.5 Best Practices
3.5.1 High Availability: Fault prevention (process tools, disaster recovery, infrastructure), fault detection (monitoring dashboards, inspections), and fault recovery (playbooks, post‑mortems).
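At the workload level, fault detection and self‑healing are commonly expressed with Kubernetes probes. The sketch below is a generic illustration, not Vivo’s actual settings; the endpoint, port, and thresholds are assumptions.

```yaml
containers:
- name: app
  image: registry.example.com/app:v1   # illustrative image
  readinessProbe:                      # gate traffic until the instance is healthy
    httpGet:
      path: /healthz                   # assumed health endpoint
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:                       # restart the container on sustained failure
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 3
    periodSeconds: 10
```

The readiness probe prevents traffic from reaching an unhealthy instance (fault prevention), while the liveness probe triggers an automatic restart (fault recovery), covering two of the three pillars above without human intervention.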
3.5.2 Automated Operations: Multi‑cluster management platform with standardized configuration, GUI‑based (“white‑screen”) operations, and audit logs.
3.5.3 Architecture Upgrade: Four‑layer architecture – container + Kubernetes base, IaaS integration, platform services (online services, middleware, big data, AI training), and business enablement.
3.5.4 Capability Enhancements: OpenKruise workload extensions, lossless (zero‑downtime) service release, Harbor image security, Dragonfly2 image‑distribution acceleration, fixed‑IP support, and Karmada multi‑cluster management.
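Karmada’s multi‑cluster management works by attaching propagation policies to ordinary Kubernetes resources. A minimal sketch, following the upstream Karmada API (the deployment and cluster names are illustrative, not Vivo’s):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-propagation        # illustrative name
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: demo-service          # the workload to propagate
  placement:
    clusterAffinity:
      clusterNames:             # illustrative member clusters
      - cluster-beijing
      - cluster-shenzhen
```

The control plane then distributes the selected Deployment to the named member clusters, so a single declarative definition drives consistent deployment across the fleet.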
3.5.5 CI/CD Integration: Jenkins + Spinnaker pipeline – code checkout, build, security scan, image push, and API‑driven deployment.
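The pipeline flow above can be modeled as a fail‑fast stage runner. This is an illustrative sketch of the control flow only, not Vivo’s Jenkins/Spinnaker code; the stage names and callables are assumptions, and a real pipeline would shell out to git, the build tool, a scanner, and the registry.

```python
from typing import Callable, List, Tuple

def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order; stop at the first failure (fail-fast, as CI does)."""
    results = []
    for name, stage in stages:
        if not stage():
            results.append(f"{name}: FAILED")
            break                      # later stages never run after a failure
        results.append(f"{name}: OK")
    return results

if __name__ == "__main__":
    # Stages mirroring the flow described above; each lambda stands in for
    # the real step (checkout -> build -> security scan -> push -> deploy).
    stages = [
        ("checkout", lambda: True),
        ("build", lambda: True),
        ("security-scan", lambda: True),
        ("push-image", lambda: True),
        ("deploy", lambda: True),
    ]
    for line in run_pipeline(stages):
        print(line)
```

The fail‑fast semantics matter here: a failed security scan must block the image push and deployment, which is exactly why the scan sits before the push in the pipeline order.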
3.5.6 Unified Traffic Ingress: Migration from Nginx to Apache APISIX to handle massive container‑driven traffic and frequent instance IP churn.
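One reason APISIX suits a container environment is that a route can resolve its upstream through service discovery instead of hard‑coded IPs, so pod IP churn never requires a gateway config change. A hedged sketch of such a route definition (the discovery type and service name are illustrative; APISIX supports several discovery backends):

```json
{
  "uri": "/api/*",
  "upstream": {
    "type": "roundrobin",
    "discovery_type": "nacos",
    "service_name": "demo-service"
  }
}
```

With `discovery_type` set, APISIX queries the registry for live instances on each resolution, which is what makes it viable at the scale and churn rate described above.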
3.6 Outcomes
The product capability matrix now covers four layers (basic services, core capabilities, platform CI/CD, and the business layer) and supports 600+ online services, 500+ algorithm services, 20+ big‑data clusters, and extensive AI training workloads.
3.7 Summary
Four dimensions of reflection: finding value, defining strategy, building platforms, and seeking breakthroughs. The overall message emphasizes technology serving business, with cost‑optimal, efficient solutions.
4. Future Outlook
Vivo envisions three directions: full containerization, deeper adoption of cloud‑native, and online–offline workload colocation (mixed deployment). The goal is “write once, run everywhere” with extreme efficiency and cost‑optimal operations.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.