Building an Efficient Machine Learning Training Platform on Kubernetes
This article describes how the Liulishuo algorithm team designed and implemented a Kubernetes‑based training platform that addresses the iterative, data‑intensive, and resource‑dynamic characteristics of machine learning workloads by pooling resources, enabling rapid provisioning, and optimizing scheduling and storage.
Introduction
This article shares how the Liulishuo algorithm team internally built an efficient training platform on top of the Kubernetes container orchestration system, tailored to the specific characteristics and workflow of machine‑learning training tasks.
Characteristics of Machine‑Learning Training
Iteration: Training is an iterative, exploratory process where algorithm engineers repeatedly try, compare, and debug different data, models, and parameters, requiring a highly interactive platform.
Data: Machine learning relies on massive datasets, and training jobs frequently read and write data, demanding flexible and efficient data access.
Compute Resources: Training often depends on industrial‑grade GPUs, which are costly and exhibit bursty, volatile demand, leading to potential resource waste during allocation and release.
Advantages of Kubernetes
Resource Pooling: Kubernetes leverages container technology and its scheduler to achieve resource pooling in three ways:
Fine‑grained resource allocation: Containers rely on cgroups for resource isolation, reducing interference between workloads sharing the same host. The REQUEST + LIMIT model allows over‑commitment of compressible resources such as CPU, improving utilization.
Rapid provisioning and release: Lightweight containers can be created and deleted quickly; pre‑built Docker images reduce start‑up time.
Decoupling workloads from underlying compute: Containers and images act as an intermediate layer, enabling reuse of the same compute resources across different training tasks.
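The REQUEST + LIMIT model above can be sketched in a pod spec. This is a minimal illustrative example, not the team's actual configuration; the image and resource values are placeholders. Note that CPU (compressible) can be over‑committed by setting the request below the limit, while GPUs must be requested whole:

```yaml
# Illustrative pod spec: low CPU REQUEST for scheduling, higher LIMIT for bursts.
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # hypothetical name
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        cpu: "1"                # what the scheduler reserves on the node
        memory: 4Gi
      limits:
        cpu: "8"                # burst ceiling; CPU is compressible, so over-commit is safe
        memory: 8Gi
        nvidia.com/gpu: 1       # GPUs cannot be over-committed; request equals limit
```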
Extensibility: Kubernetes follows a Resource + Operator architecture. Operators watch the apiserver for state changes and act to reconcile the cluster, making it easy to add new functionality without entangling existing logic. This design ensures high scalability for large, complex systems.
Declarative Interfaces
Declarative APIs are well‑suited for machine‑learning training because they allow templating of distributed compute clusters (e.g., TensorFlow). Once templates are defined, deployments become repeatable and intuitive using simple template languages.
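The article does not name a specific operator for templating TensorFlow clusters; the Kubeflow TFJob custom resource is one widely used example of the idea, sketched below with hypothetical names and images:

```yaml
# Hypothetical example of a declarative distributed-training template (Kubeflow TFJob CRD).
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train                         # illustrative job name
spec:
  tfReplicaSpecs:
    PS:                                     # parameter servers
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/mnist:latest # placeholder image
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/mnist:latest # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

Once such a template exists, scaling workers or swapping the training image is a one‑line change, which is what makes the declarative style repeatable and intuitive.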
Efficiency Improvements
Data Storage: Training data is organized into four categories—raw data, data set, workspace, and shared—to simplify management. The cluster uses a hybrid S3 + NFS storage model: S3 provides large capacity with internet‑wide access, while NFS compensates for S3's poor random‑read/write performance.
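One common way to expose an NFS export to many training pods in Kubernetes is a PersistentVolume with ReadWriteMany access; the sketch below is an assumption about the mechanism, with placeholder server and path values:

```yaml
# Sketch: mounting a shared NFS workspace into training pods (server/path are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: workspace-nfs
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]   # many training pods can mount the same workspace
  nfs:
    server: nfs.internal.example   # placeholder NFS server
    path: /exports/workspace       # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-claim
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
  volumeName: workspace-nfs
```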
Image Auto‑Build: Containerization changes the delivery artifact from an executable to an image. To lower the barrier, the platform adopts RISEML, which builds container images from local configuration files, uploads code, and launches tasks automatically.
Scheduling Optimization
Custom scheduling policies address several issues:
When CPU‑only workloads exhaust a node's CPU, its GPUs sit idle because pods requesting GPUs can no longer be scheduled there; labeling GPU‑capable nodes and steering CPU‑only workloads away with nodeSelector keeps sufficient CPU free for GPU jobs.
CPU request overestimation leads to low utilization; applying low REQUEST and high LIMIT for CPU improves sharing.
When the cluster is fully loaded, the default scheduler does not guarantee that pending pods are scheduled in submission order. Enabling the pod‑priority alpha feature and customizing the priority function restores order‑based scheduling.
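The policies above map onto standard Kubernetes primitives. A minimal sketch follows; names and label keys are illustrative, and at the time of the original article pod priority was an alpha feature that had to be enabled explicitly on the apiserver and scheduler:

```yaml
# Reserve GPU nodes for GPU workloads and use priority to influence scheduling order.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-priority          # illustrative name
value: 1000                        # higher value schedules first among pending pods
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer
spec:
  priorityClassName: training-priority
  nodeSelector:
    hardware: gpu                  # assumes nodes labeled, e.g. kubectl label node <node> hardware=gpu
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        cpu: "1"                   # low REQUEST so CPU is shared...
      limits:
        cpu: "8"                   # ...with a high LIMIT for bursts
        nvidia.com/gpu: 1
```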
Conclusion
Machine‑learning training has unique requirements for resource usage and interactivity. Kubernetes, as a cloud‑native container orchestration platform, pools resources, enhances compute efficiency, and offers extensibility that allows continuous improvement of the training platform to meet emerging challenges.
Liulishuo Tech Team
Help everyone become a global citizen!