
How Tencent Engineers Shattered the 128‑GPU ImageNet Training Record in 2m31s

Tencent engineers set a new world record by training ImageNet on 128 V100 GPUs in just 2 minutes 31 seconds. They detail the suite of optimizations behind the result: a new distributed training framework called Light, single-machine speed improvements, multi-machine communication enhancements, and large-batch convergence techniques that together dramatically cut training time while maintaining high accuracy.

Tencent Tech

Recently, Tencent engineers broke the world record for training ImageNet with 128 GPUs in 2 minutes 31 seconds, 7 seconds faster than the previous record.

Using a 25 Gbps VPC network, 128 V100 GPUs, and the new Light framework for large-scale distributed multi-machine, multi-GPU training, they completed 28 epochs with 93% top-5 accuracy.

Motivation: AI models are becoming increasingly complex, with massive data volumes, deeper networks, billions of parameters, and long training times, leading to high costs.

Tencent aimed to push the limits of AI model training frameworks.

They developed the Light framework, optimizing single‑machine training speed, multi‑machine communication, and batch convergence.

Single‑machine speed improvements

They addressed slow remote storage access, CPU contention from oversubscribed threads, and JPEG decoding bottlenecks by caching data on local SSD or in memory, auto-tuning thread counts, and pre-decoding images so each sample is decoded only once.
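The fetch-once, decode-once idea can be sketched as a small local cache. This is a minimal illustration, not Light's actual implementation (which is not public); the cache directory, the `fetch_remote` callback, and the placeholder decoder are all hypothetical:

```python
import os
import pickle
import tempfile

# Stands in for a local SSD cache path (hypothetical).
CACHE_DIR = tempfile.mkdtemp()

def decode_jpeg(raw_bytes):
    """Placeholder for a real JPEG decoder: pretend this is the
    expensive pixel-decoding step we want to pay only once."""
    return list(raw_bytes)

def load_sample(key, fetch_remote):
    """Fetch once, decode once: cache the decoded sample on local disk
    so later epochs skip both remote I/O and JPEG decoding."""
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):                    # cache hit: cheap local read
        with open(path, "rb") as f:
            return pickle.load(f), "cache"
    decoded = decode_jpeg(fetch_remote(key))    # cache miss: remote fetch + decode
    with open(path, "wb") as f:
        pickle.dump(decoded, f)
    return decoded, "remote"
```

The first epoch pays the remote-fetch and decode cost; every subsequent epoch reads pre-decoded data locally.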

Multi‑machine communication optimization

Implemented adaptive gradient fusion, 2D communication with multiple streams, and gradient compression to reduce communication volume and improve bandwidth utilization, raising throughput to 3100 samples/second.
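Gradient fusion packs many small per-layer gradients into a few large buffers so each all-reduce moves one big message instead of many small ones, which makes far better use of network bandwidth. A greedy packing pass might look like the sketch below; the 4 MB bucket threshold is illustrative (an adaptive scheme would tune it at runtime), and the `(name, num_elements)` gradient representation is an assumption for this example:

```python
def fuse_gradients(grads, bucket_bytes=4 * 1024 * 1024, elem_size=4):
    """Greedily pack per-layer gradients into fusion buckets.

    grads: list of (name, num_elements) pairs, in backward order.
    Returns a list of buckets, each a list of gradients whose combined
    size stays at or under bucket_bytes (one all-reduce per bucket).
    """
    buckets, current, current_bytes = [], [], 0
    for name, num_elements in grads:
        size = num_elements * elem_size
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)          # close the full bucket
            current, current_bytes = [], 0
        current.append((name, num_elements))
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets
```

With fp32 gradients (4 bytes per element), a 500k-element layer and a 600k-element layer land in separate 4 MB buckets, while tiny layers ride along with their neighbors instead of triggering their own all-reduce.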

Batch convergence techniques

Used large‑batch training with multi‑stage resolution, gradient‑compression precision compensation, and AutoML (TianFeng) for hyper‑parameter search, achieving 93% top‑5 accuracy after only 28 epochs.

All these optimizations enabled the new world record, now integrated into Tencent Cloud’s Intelligent‑Ti AI platform, offering a one‑stop service for data preprocessing, model building, training, evaluation, and deployment.

Future work will continue to improve usability, training and inference performance for the broader AI community.

Tags: machine learning, GPU, distributed training, Tencent Cloud, large-scale AI, ImageNet
Written by

Tencent Tech

Tencent's official tech account. Delivering quality technical content to serve developers.
