Building an Efficient Machine Learning Training Platform on Kubernetes
This article describes how the Liulishuo algorithm team designed and implemented a Kubernetes‑based training platform that addresses the iterative, data‑intensive, and resource‑dynamic characteristics of machine learning workloads by pooling resources, enabling rapid provisioning, and optimizing scheduling and storage.
Introduction
This article shares how the Liulishuo algorithm team internally built an efficient training platform on top of the Kubernetes container orchestration system, tailored to the specific characteristics and workflow of machine‑learning training tasks.
Characteristics of Machine‑Learning Training
Iteration: Training is an iterative, exploratory process where algorithm engineers repeatedly try, compare, and debug different data, models, and parameters, requiring a highly interactive platform.
Data: Machine learning relies on massive datasets, and training jobs frequently read and write data, demanding flexible and efficient data access.
Compute Resources: Training often depends on industrial‑grade GPUs, which are costly and exhibit bursty, volatile demand, leading to potential resource waste during allocation and release.
Advantages of Kubernetes
Resource Pooling: Kubernetes leverages container technology and its scheduler to achieve resource pooling in three ways:
Fine‑grained resource allocation: Containers rely on cgroups for resource isolation, reducing interference between workloads sharing the same host. The REQUEST + LIMIT model allows over‑commitment of compressible resources such as CPU, improving utilization.
Rapid provisioning and release: Lightweight containers can be created and deleted quickly; pre‑built Docker images reduce start‑up time.
Decoupling workloads from underlying compute: Containers and images act as an intermediate layer, enabling reuse of the same compute resources across different training tasks.
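The REQUEST + LIMIT model above can be sketched in a pod spec. This is a minimal illustrative example, not the team's actual configuration; the image and resource values are placeholders. Note that CPU (compressible) can be over‑committed by setting the request below the limit, while GPUs must be requested whole:

```yaml
# Illustrative pod spec: low CPU REQUEST for scheduling, higher LIMIT for bursts.
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # hypothetical name
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        cpu: "1"                # what the scheduler reserves on the node
        memory: 4Gi
      limits:
        cpu: "8"                # burst ceiling; CPU is compressible, so over-commit is safe
        memory: 8Gi
        nvidia.com/gpu: 1       # GPUs cannot be over-committed; request equals limit
```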
Extensibility: Kubernetes follows a Resource + Operator architecture. Operators watch the apiserver for state changes and act to reconcile the cluster, making it easy to add new functionality without entangling existing logic. This design ensures high scalability for large, complex systems.
Declarative Interfaces
Declarative APIs are well‑suited for machine‑learning training because they allow templating of distributed compute clusters (e.g., TensorFlow). Once templates are defined, deployments become repeatable and intuitive using simple template languages.
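The article does not name a specific operator for templating TensorFlow clusters; the Kubeflow TFJob custom resource is one widely used example of the idea, sketched below with hypothetical names and images:

```yaml
# Hypothetical example of a declarative distributed-training template (Kubeflow TFJob CRD).
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train                         # illustrative job name
spec:
  tfReplicaSpecs:
    PS:                                     # parameter servers
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/mnist:latest # placeholder image
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/mnist:latest # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

Once such a template exists, scaling workers or swapping the training image is a one‑line change, which is what makes the declarative style repeatable and intuitive.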
Efficiency Improvements
Data Storage: Training data is organized into four categories—raw data, data set, workspace, and shared—to simplify management. The cluster uses a hybrid S3 + NFS storage model: S3 provides large capacity with internet‑wide access, while NFS compensates for S3's poor random‑read/write performance.
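One common way to expose an NFS export to many training pods in Kubernetes is a PersistentVolume with ReadWriteMany access; the sketch below is an assumption about the mechanism, with placeholder server and path values:

```yaml
# Sketch: mounting a shared NFS workspace into training pods (server/path are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: workspace-nfs
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]   # many training pods can mount the same workspace
  nfs:
    server: nfs.internal.example   # placeholder NFS server
    path: /exports/workspace       # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-claim
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
  volumeName: workspace-nfs
```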
Image Auto‑Build: Containerization changes the delivery artifact from an executable to an image. To lower the barrier, the platform adopts RISEML, which builds container images from local configuration files, uploads code, and launches tasks automatically.
Scheduling Optimization
Custom scheduling policies address several issues:
When CPU‑only workloads exhaust a node's CPU, its GPUs sit idle because pods requesting GPUs can no longer be scheduled there; labeling GPU‑capable nodes and steering CPU‑only workloads away with nodeSelector keeps sufficient CPU free for GPU jobs.
CPU request overestimation leads to low utilization; applying low REQUEST and high LIMIT for CPU improves sharing.
When the cluster is fully loaded, the default scheduler does not guarantee that pending pods are scheduled in submission order. Enabling the pod‑priority alpha feature and customizing the priority function restores order‑based scheduling.
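The policies above map onto standard Kubernetes primitives. A minimal sketch follows; names and label keys are illustrative, and at the time of the original article pod priority was an alpha feature that had to be enabled explicitly on the apiserver and scheduler:

```yaml
# Reserve GPU nodes for GPU workloads and use priority to influence scheduling order.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-priority          # illustrative name
value: 1000                        # higher value schedules first among pending pods
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer
spec:
  priorityClassName: training-priority
  nodeSelector:
    hardware: gpu                  # assumes nodes labeled, e.g. kubectl label node <node> hardware=gpu
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        cpu: "1"                   # low REQUEST so CPU is shared...
      limits:
        cpu: "8"                   # ...with a high LIMIT for bursts
        nvidia.com/gpu: 1
```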
Conclusion
Machine‑learning training has unique requirements for resource usage and interactivity. Kubernetes, as a cloud‑native container orchestration platform, pools resources, enhances compute efficiency, and offers extensibility that allows continuous improvement of the training platform to meet emerging challenges.
Liulishuo Tech Team
Help everyone become a global citizen!