Building a Private Cloud with Kubernetes: Architecture, Challenges, and the Wayne Platform
This article details 360 Search's journey of building a Kubernetes‑based private cloud: its evolution, network and storage designs, logging pipeline, the Wayne management platform, pitfalls encountered along the way, and future open‑source plans, offering practical insights for similar deployments.
With the rapid acceleration of containerization, demand for container orchestration has grown; the field narrowed from a three‑way competition among Kubernetes, Mesos, and Swarm to Kubernetes as the dominant standard. Many companies now build private clouds on Kubernetes. This article presents 360 Search's overall architecture, the problems encountered, and the solutions applied.
Development Background
Initially, Kubernetes was used only for stateless web services because its support for stateful workloads was limited. Since 2017, Kubernetes has become the de‑facto standard for private clouds, and its stateful‑service support has matured, allowing the platform to grow to tens of thousands of containers.
Architecture Design
The overall cloud platform architecture is shown below:
Network
Flannel era: Early deployments used Flannel in VXLAN mode, which produced an excessive number of forwarding‑table entries. Subsequent optimizations reduced those entries, and a DSR (Direct Server Return) optimization kept return traffic from traversing the Flannel interface.
Inbound traffic entered through ExternalIP edge nodes, which performed NAT via iptables.
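The edge‑node NAT described above can be sketched as a pair of iptables rules. This is a minimal illustration; the addresses and port are placeholders, not values from the actual deployment:

```shell
# Hypothetical addresses: 203.0.113.10 is the ExternalIP announced on the
# edge node; 10.96.0.20:80 is the in-cluster backend it fronts.

# DNAT: rewrite traffic arriving at the ExternalIP to the backend address.
iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 80 \
  -j DNAT --to-destination 10.96.0.20:80

# Masquerade the forwarded traffic so replies return through the edge node.
iptables -t nat -A POSTROUTING -d 10.96.0.20 -p tcp --dport 80 \
  -j MASQUERADE
```

These rules require root privileges on the edge node and are shown here only to make the traffic path concrete.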
Calico era: As the cluster grew, the network switched to Calico with BGP, eliminating the VXLAN encapsulation overhead.
To fit the IDC network model, Calico was customized: all servers share the same AS number, default routes are leveraged, pod routes are aggregated to /27 blocks, and annotations are used to pin specific pods to /32 routes.
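One way to express the /27 aggregation is through the blockSize field of a Calico IPPool, so each node announces a /27 block instead of per‑pod routes. A sketch, with an illustrative pool name and CIDR rather than the article's actual values:

```shell
# Define an IPPool whose per-node allocation blocks are /27, in pure BGP
# mode (no IP-in-IP encapsulation). Pool name and CIDR are illustrative.
calicoctl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool-27
spec:
  cidr: 10.244.0.0/16
  blockSize: 27        # each node advertises one /27 aggregate route
  ipipMode: Never      # rely on BGP routing, no tunneling overhead
  natOutgoing: false
EOF
```

With this layout a node's 32 pod addresses collapse into a single BGP advertisement, which is what keeps the upstream routing tables small.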
Storage
Initially, stateful services ran on a self‑maintained GlusterFS cluster. Later, the storage team provided Ceph RBD and CephFS, and most stateful workloads migrated to Ceph. A custom component named Robin (soon to be open‑sourced) was built to manage RBD images and CephFS paths.
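The RBD image lifecycle that a component like Robin automates looks roughly like the following. Pool, image, and mount-path names are placeholders, and the commands assume a reachable Ceph cluster, so this is an operational sketch rather than Robin's actual behavior:

```shell
# Create a 10 GiB RBD image (rbd sizes are in MiB by default).
rbd create k8s-pool/pvc-demo --size 10240

# Map the image to a block device on the node (appears as /dev/rbdN).
rbd map k8s-pool/pvc-demo

# One-time format, then mount where the kubelet expects the volume.
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/pvc-demo   # illustrative mount point
```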
Logging
Containers write logs to stdout; Docker captures them, and the kubelet symlinks them under /var/log/containers. The early pipeline was Logstash → Kafka → HDFS. As log volume grew, Logstash became a bottleneck, so Filebeat was adopted and extended with a custom Kubernetes processor that enriches each event with metadata such as the Deployment and Pod names; further performance optimizations dramatically increased single‑node throughput.
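The kubelet's symlink names under /var/log/containers encode Pod metadata, which is the kind of information a Kubernetes processor like the one described can recover. A minimal shell sketch of the parsing, using a made‑up filename:

```shell
# The kubelet names container log symlinks as
#   <pod>_<namespace>_<container>-<container-id>.log
# so the metadata can be recovered from the filename alone.
f="web-5d7c9b6f4-abcde_search_nginx-0123456789abcdef.log"

base="${f%.log}"                    # strip the .log extension
pod="${base%%_*}"                   # everything before the first underscore
rest="${base#*_}"
namespace="${rest%%_*}"             # between the two underscores
container_and_id="${rest#*_}"
container="${container_and_id%-*}"  # drop the trailing -<container-id>

echo "pod=$pod namespace=$namespace container=$container"
# → pod=web-5d7c9b6f4-abcde namespace=search container=nginx
```

The real Filebeat processor additionally queries the apiserver for labels such as the owning Deployment; the filename parse above is only the first step.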
Wayne Platform
Kubernetes alone is powerful but does not, by itself, provide a user‑friendly private‑cloud portal. Wayne is a web‑based, multi‑cluster, visual management platform built on top of Kubernetes, offering:
Visual operations to reduce learning cost and speed up deployments.
Graphical, JSON, and YAML editing modes.
Micro‑kernel architecture with plug‑in extensibility.
Multi‑cluster management.
Fine‑grained permission control by department and project.
Multiple authentication methods (LDAP, OAuth2, database).
Comprehensive audit logging.
Open API platform with API‑Key management.
Multi‑level monitoring of cluster health.
Wayne also integrates a WebShell for online log viewing and container debugging.
Pitfalls Encountered
Flannel failure: When the Flannel daemon on a node crashed, pods on that node became unreachable; Flannel status must therefore be monitored.
Deployment rolling‑update load‑balancer issue: iptables rules were not refreshed before terminating pods exited, so in‑flight requests were lost; a preStop hook combined with a graceful‑shutdown period resolves it.
Kubernetes 1.9 endpoint bug: After an apiserver crash, Endpoints objects stopped updating; upgrading to 1.10 and setting --endpoint-reconciler-type=lease fixes the problem.
iptables SNAT port conflict: The nf_conntrack SNAT design can assign the same local source port to two connections, causing SYN packets to be dropped; mitigations include enlarging the pool of SNAT source IPs or enabling fully random port selection via the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag (exposed as iptables' --random-fully option on newer kernels).
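The rolling‑update mitigation above (letting kube‑proxy drop a terminating pod from iptables before its process exits) can be sketched as a Deployment fragment. Names, image, and timings here are illustrative, not the production values:

```shell
# Sleep in preStop so endpoints/iptables converge before the container
# receives SIGTERM; keep the grace period longer than the sleep.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: nginx:1.25        # illustrative image
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 10"]
EOF
```

The sleep gives kube‑proxy on every node time to remove the pod's endpoint before the process begins shutting down, so no new connections land on a dying pod.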
Future Directions
Wayne will be open‑sourced (GitHub: https://github.com/Qihoo360/wayne) to give back to the CNCF community and address shortcomings of the official Dashboard, such as lack of multi‑cluster and multi‑tenant support.
More services will continue migrating to the private cloud, expanding beyond the current scale of tens of thousands of containers.
Q&A
Q: With Calico over BGP, are containers directly reachable, and when does the SYN‑drop issue occur? A: It occurs on edge nodes running in ExternalIP mode; we plan to switch to direct container connectivity.
Q: How can old and new services residing in different virtual networks call each other over RPC? A: Use LVS VIPs for cross‑network calls; inside the cluster, services are reachable via their Service domain names.
Architects' Tech Alliance
Sharing project experiences and insights into cutting‑edge architectures, focusing on cloud computing, microservices, big data, hyper‑convergence, storage, data protection, artificial intelligence, and industry practices and solutions.