
Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and ETCD for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

DataFunSummit

Huya’s AI platform evolved from a chaotic state before 2019, where AI development was isolated, hardware resources were not shared, and the technology stack was inconsistent. Starting in 2019, a cloud‑native Kubernetes platform was built to unify resource scheduling, development, training, and inference workflows, later adding AI CI/CD and visualized training tracking.

Elastic distributed training was introduced to address three main problems: (1) pronounced GPU usage peaks and valleys due to live‑stream traffic, leaving many GPUs idle during low‑traffic periods; (2) fragmented GPU resources that prevent multi‑GPU tasks from using free GPUs across different machines; and (3) training interruptions caused by node failures that require manual intervention.

Elasticity allows a training job to dynamically expand to idle GPUs and contract when higher‑priority jobs arrive, and to survive node crashes by automatically rescheduling remaining GPUs without losing training state.

The elastic training design relies on ETCD for node registration, leader election, and watch mechanisms. Each node registers its IP, port, and GPU information in ETCD, retrieves peer information, and participates in rank election (e.g., rank 0, rank 1, rank 2). Once Ring-AllReduce communication is established among the elected ranks, training proceeds.
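The rank-election step can be sketched as follows. A real deployment would read node records from ETCD; here the registry is simulated with a plain dict, and the election rule (sort nodes by address, lowest becomes rank 0 and leader) is an illustrative assumption, not the confirmed EFDL policy.

```python
def elect_ranks(registry: dict) -> dict:
    """Deterministically map each registered node to a rank.

    registry: {node_address: {"port": int, "gpus": int}, ...}
    Every node sorts the same registry snapshot, so all nodes arrive at
    the same assignment without extra coordination.
    """
    ordered = sorted(registry)
    return {addr: rank for rank, addr in enumerate(ordered)}

# Example: three nodes have registered their IP, port, and GPU count.
registry = {
    "10.0.0.3": {"port": 2379, "gpus": 4},
    "10.0.0.1": {"port": 2379, "gpus": 8},
    "10.0.0.2": {"port": 2379, "gpus": 4},
}
ranks = elect_ranks(registry)
# 10.0.0.1 -> rank 0 (leader), 10.0.0.2 -> rank 1, 10.0.0.3 -> rank 2
```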

When a new node joins, it registers to ETCD; existing nodes detect the change, finish the current training step, pause, fetch updated node information, and resume training with the expanded set of nodes. The reverse process handles node removal. This dynamic scaling is orchestrated by a custom Kubernetes operator that launches training pods, each containing a Rendezvous component that interacts with ETCD.
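The pause/resync/resume cycle on a membership change can be sketched as below. EFDL watches ETCD for this; here a `Membership` object with a version counter stands in for the watch, and "rebuilding the ring" is reduced to recording the new world size. All names are illustrative, not the actual EFDL API.

```python
class Membership:
    """Simulated node registry; a version bump stands in for an ETCD watch event."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.version = 0

    def join(self, node):
        self.nodes.append(node)
        self.version += 1

    def leave(self, node):
        self.nodes.remove(node)
        self.version += 1


def train(membership, total_steps, events=None):
    """Run a training loop that resyncs whenever membership changes.

    events: {step: callable(membership)} simulating external joins/leaves.
    Returns (final world size, number of resyncs performed).
    """
    events = events or {}
    seen = membership.version
    world_size = len(membership.nodes)
    resyncs = 0
    for step in range(total_steps):
        if step in events:
            events[step](membership)   # external join/leave arrives
        # ... run one training step across `world_size` workers ...
        if membership.version != seen:  # watch fired: finish step, pause,
            seen = membership.version   # fetch the new node set, and
            world_size = len(membership.nodes)  # rebuild the ring
            resyncs += 1
    return world_size, resyncs


m = Membership(["node-1", "node-2"])
final_ws, resyncs = train(m, 10, events={3: lambda m: m.join("node-3")})
# final_ws == 3, resyncs == 1
```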

The platform also includes a Remote Cache to store intermediate training data, enabling paused low‑priority jobs to resume from cached states when resources become available.
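The Remote Cache idea amounts to: a preempted low-priority job persists its training state under a job key, and the rescheduled job restores from that key. A minimal sketch, with the cache simulated by an in-memory dict (a real system would use remote storage) and all names hypothetical:

```python
class RemoteCache:
    """In-memory stand-in for the platform's remote state cache."""
    def __init__(self):
        self._store = {}

    def save(self, job_id, state):
        self._store[job_id] = dict(state)

    def load(self, job_id):
        return self._store.get(job_id)


cache = RemoteCache()

# A low-priority job is preempted at step 1200: persist its state.
cache.save("job-42", {"step": 1200, "epoch": 3, "lr": 0.01})

# Later, when GPUs free up, the rescheduled job resumes where it left off.
state = cache.load("job-42")
start_step = state["step"] if state else 0
```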

Performance tests on ResNet‑50 with ImageNet showed that elastic training achieved comparable accuracy and total GPU‑hours to single‑node multi‑GPU training, while significantly reducing training time by utilizing idle GPUs during low‑traffic periods.

Algorithm engineers only need to modify a few dozen lines of code to switch from traditional to elastic training using the EFDL framework, which supports PyTorch DDP, PyTorch Horovod, and TensorFlow Horovod (including Keras and Estimator APIs). They set the training mode to EFDL and specify minimum and maximum worker counts.
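The switch described above boils down to a small configuration change. A hypothetical sketch of such a job config, with a worker-count sanity check; the field names and framework identifiers are illustrative, not the real EFDL configuration schema:

```python
def make_elastic_config(framework: str, min_workers: int, max_workers: int) -> dict:
    """Build an elastic-training job config with bounded worker counts."""
    if not (1 <= min_workers <= max_workers):
        raise ValueError("need 1 <= min_workers <= max_workers")
    supported = {"pytorch-ddp", "pytorch-horovod", "tensorflow-horovod"}
    if framework not in supported:
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "mode": "EFDL",              # switch from traditional to elastic
        "framework": framework,
        "min_workers": min_workers,  # job keeps running at this floor
        "max_workers": max_workers,  # job may expand up to this ceiling
    }


cfg = make_elastic_config("pytorch-ddp", min_workers=2, max_workers=8)
```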

The deployment brings two major benefits: (1) shorter training times by automatically leveraging low‑peak and fragmented GPU resources, and (2) reduced operational costs by avoiding unnecessary machine provisioning and improving priority‑based scheduling across the company.

Future work focuses on making the system easier to use (reducing required code changes and abstracting elasticity concepts), more stable (enhancing fault tolerance), more efficient (optimizing distributed training performance), and more open (contributing back to the open‑source community).

kubernetes · GPU scheduling · distributed training · AI Platform · elastic-training · Huya
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
