Introduction to Kubeflow and Its Installation Process
This article introduces Kubeflow, explains the typical machine‑learning model lifecycle, outlines Kubeflow’s core components and Kubernetes advantages, provides detailed server and storage configuration, walks through ksonnet and Kubeflow installation steps, and shows how to verify deployments and access the Kubeflow UI.
Kubeflow is a Google‑initiated machine‑learning platform built on Kubernetes that enables definition of resources such as TFJob and allows distributed model training to be managed like regular application deployments.
A production ML workflow typically passes through data cleaning and validation, dataset splitting, training, model validation, large‑scale training, model export, service deployment, and monitoring; before Kubeflow, each of these stages required separate tooling for data collection, feature extraction, resource management, and logging.
Core Kubeflow components include:
Jupyter multi‑tenant Notebook service
TensorFlow/PyTorch as supported ML engines
Seldon for model deployment on Kubernetes
TF‑Serving for online TensorFlow model serving with version control
Argo workflow engine
Ambassador API gateway
Istio for service management and telemetry
Ksonnet for deploying Kubernetes resources
Kubeflow leverages Kubernetes advantages such as native resource isolation, automated cluster management, CPU/GPU scheduling, support for distributed storage, and mature monitoring and alerting.
Server configuration
GPU: Nvidia‑Tesla‑K80
Network: 1 GbE (note: may become a bottleneck for large datasets)
CephFS service configuration
Network: 10 GbE (Ceph clusters should be co‑located with Kubernetes to avoid high latency).
Kubeflow installation prerequisites
Kubernetes version: v1.12.2 (kube‑dns required)
Kubeflow version: v0.3.2
Jsonnet version: v0.11.2
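Before installing anything, it is worth confirming the environment matches the prerequisites above. A minimal check, assuming the standard `kubectl` and `ks` CLIs:

```shell
# Confirm the cluster and tooling versions match the prerequisites above.
kubectl version --short                                # server should report v1.12.2
kubectl -n kube-system get pods -l k8s-app=kube-dns    # kube-dns must be Running
ks version                                             # run after ksonnet is installed
```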
Install ksonnet
(Installation steps shown in the accompanying image.)
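The steps in the image follow the usual pattern of downloading a ksonnet release binary and placing it on the `PATH`. A hedged sketch (the release URL and version string are assumptions; substitute an available release for your platform):

```shell
# Download a ksonnet release tarball from GitHub and install the `ks` binary.
KS_VER=0.11.2   # assumption: match the prerequisite version above
wget https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/ks_${KS_VER}_linux_amd64.tar.gz
tar -xzf ks_${KS_VER}_linux_amd64.tar.gz
sudo mv ks_${KS_VER}_linux_amd64/ks /usr/local/bin/
ks version
```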
Install Kubeflow
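For Kubeflow v0.3.x, installation is driven by the release scripts, which generate a ksonnet app and apply it to the cluster. A sketch, assuming the default `kubeflow` namespace and a generic (non-cloud) platform; directory names here are placeholders:

```shell
# Download the Kubeflow v0.3.2 release scripts.
export KUBEFLOW_SRC=$HOME/kubeflow
export KUBEFLOW_TAG=v0.3.2
mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash

# Initialize an app, generate the ksonnet manifests, and apply them.
${KUBEFLOW_SRC}/scripts/kfctl.sh init kfapp --platform none
cd kfapp
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```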
After completing all installation steps, verify the deployment status of Kubeflow resources in the Kubernetes cluster. Check the status of each Deployment and its Pods to ensure they are running correctly.
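The checks above boil down to listing the Deployments and Pods in the install namespace (assumed here to be `kubeflow`):

```shell
# Every Deployment should be fully available, and every Pod Running.
kubectl -n kubeflow get deployments
kubectl -n kubeflow get pods

# For any Pod that is not Running, inspect its events:
kubectl -n kubeflow describe pod <pod-name>
```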
Use Ambassador as the unified external gateway; forward its service port with kubectl port-forward to access Kubeflow locally.
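Assuming the `ambassador` Service lives in the `kubeflow` namespace, the port-forward looks like:

```shell
# Forward Ambassador's service port 80 to localhost:8080; keep this running
# while you use the UI.
kubectl -n kubeflow port-forward svc/ambassador 8080:80
```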
Access the Kubeflow UIs at http://localhost:8080 in a browser. From there you can open Jupyter Notebook for end‑to‑end development, run code and view results, and submit TFJob resources that the TF operator runs as distributed TensorFlow training.
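A distributed training job can also be submitted from the command line as a TFJob manifest. A minimal sketch, using the `kubeflow.org/v1alpha2` API of the Kubeflow v0.3 era; the job name and container image are placeholders:

```shell
# Submit a two-worker TFJob; the TF operator schedules the replicas.
kubectl -n kubeflow apply -f - <<'EOF'
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: tfjob-example                      # placeholder name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                          # two distributed workers
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/mnist-train:latest   # placeholder image
EOF

# Watch the job's status:
kubectl -n kubeflow get tfjob tfjob-example
```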
Conclusion
Major cloud providers and hardware vendors are investing in Kubeflow to enable large‑scale, multi‑GPU training on Kubernetes, improving GPU utilization and streamlining the ML workflow, while presenting new challenges for DevOps teams.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.