
WPipe: Group‑Based Interleaved Pipeline Parallelism for Large‑Scale DNN Training

The paper introduces WPipe, a group‑based interleaved pipeline parallelism method that reduces memory overhead and weight‑update latency compared with PipeDream‑2BW, achieving up to 1.4× speed‑up and 36% lower memory usage while preserving model accuracy on large‑scale DNNs.

AntTech

ICLR (International Conference on Learning Representations) is one of the three top conferences in machine learning, and the work described here was accepted to ICLR 2022. The recent trend toward massive deep neural networks (DNNs) has driven the development of pipeline-parallel training techniques such as GPipe, PipeDream, and PipeDream‑2BW, but even the most recent of these, PipeDream‑2BW, still suffers from excessive memory redundancy and delayed weight updates.

WPipe addresses these two drawbacks by partitioning the model into two groups and applying a weight‑movement operation that enables seamless weight updates while cutting both the number of cached weight versions and the update latency roughly in half. Experiments on the large‑scale language model BERT and the vision model ResNeXt show that WPipe achieves a 1.4× speed‑up and a 36% reduction in memory consumption without harming final accuracy.

The article reviews related work on model parallelism, distinguishing intra‑layer (data‑parallel) and inter‑layer approaches, and explains how pipeline parallelism (PP) can improve resource utilization. Synchronous PP (Sync‑PP) eliminates weight staleness but introduces idle bubbles that reduce throughput, whereas asynchronous PP (Async‑PP) removes bubbles at the cost of maintaining multiple weight versions.
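The cost of Sync‑PP's idle bubbles can be quantified with a standard back‑of‑envelope formula (this formula is a common approximation for GPipe‑style schedules, not a figure taken from the paper): with `p` pipeline stages and `m` microbatches per mini‑batch, the idle fraction is roughly `(p - 1) / (m + p - 1)`.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle-time fraction of a synchronous (GPipe-style)
    pipeline: (p - 1) / (m + p - 1). A common rule of thumb, assuming
    uniform stage times; not an exact model of any one system."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More microbatches amortize the pipeline fill/drain bubbles.
print(bubble_fraction(4, 4))   # ≈ 0.43: nearly half the time is idle
print(bubble_fraction(4, 32))  # ≈ 0.086: bubbles mostly amortized away
```

This is why Sync‑PP throughput improves with more microbatches, and why Async‑PP instead trades bubbles for the bookkeeping of multiple weight versions.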

WPipe’s design interleaves two groups of model partitions so that forward passes of the next update cycle are moved ahead of the backward passes of the current cycle, eliminating cross‑cycle weight conflicts and halving the number of required weight versions. This also reduces runtime activation memory by 50%.
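The interleaving idea can be sketched as a toy op ordering. Everything below is an illustrative assumption (the group names `G0`/`G1` and the exact sequence are invented for exposition; the paper's real schedule also pipelines microbatches within each cycle): the point is only that one group's next‑cycle forward is issued before that group's current‑cycle backward, so no two live cycles ever need different weight versions of the same group.

```python
def wpipe_cycle_ops(num_cycles: int) -> list[str]:
    """Toy illustration (not the paper's exact schedule) of WPipe's
    group-based interleaving. The model is split into two groups, G0
    and G1; G0's forward for cycle t+1 is moved ahead of G0's backward
    for cycle t, so each group keeps only ONE live weight version,
    versus the two versions PipeDream-2BW must cache per stage."""
    ops = ["F(G0, cycle 0)"]  # initial fill: first group's first forward
    for t in range(num_cycles):
        ops.append(f"F(G1, cycle {t})")      # forward, second group
        ops.append(f"F(G0, cycle {t + 1})")  # next cycle's G0 forward,
                                             # moved ahead of B(G0, cycle t)
        ops.append(f"B(G1, cycle {t})")      # backward, second group
        ops.append(f"B(G0, cycle {t})")      # backward + update, first group:
                                             # no cross-cycle version conflict
    return ops

for op in wpipe_cycle_ops(2):
    print(op)
```

In this toy trace, `F(G0, cycle 1)` always precedes `B(G0, cycle 0)`, which is the reordering that lets WPipe drop the second cached weight version.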

Memory analysis (Table 1) demonstrates that WPipe uses half the parameter cache of PipeDream‑2BW and significantly less activation memory, which is critical for training very large models where activation memory dominates.
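A hypothetical back‑of‑envelope model makes the comparison concrete (the coefficients below only restate the article's claims of halved weight caching and roughly halved activation memory; Table 1 in the paper has the precise per‑stage accounting):

```python
def est_memory_gb(weight_gb: float, act_gb: float, method: str) -> float:
    """Illustrative per-worker memory model, NOT the paper's exact
    formula: PipeDream-2BW double-buffers weights (2 versions) and
    holds a full set of in-flight activations; WPipe keeps a single
    weight version and roughly half the runtime activation memory."""
    if method == "pipedream-2bw":
        return 2 * weight_gb + act_gb
    if method == "wpipe":
        return 1 * weight_gb + 0.5 * act_gb
    raise ValueError(f"unknown method: {method}")

# Example: 4 GB of weights, 6 GB of activations per worker.
print(est_memory_gb(4, 6, "pipedream-2bw"))  # 14
print(est_memory_gb(4, 6, "wpipe"))          # 7.0
```

Because activation memory grows with batch size while weight memory does not, the activation term dominates for very large models, which is why halving it matters most there.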

Experimental evaluation covers convergence, throughput, and memory usage on ResNeXt (CV) and BERT (NLP) models. Convergence results show WPipe matches the final accuracy of PipeDream‑2BW and data parallelism. Throughput measurements reveal WPipe consistently outperforms competing methods, with the WPipe‑R variant supporting batch sizes up to 44× larger. Memory usage experiments confirm WPipe‑R achieves the greatest reduction, especially at larger batch sizes.

In conclusion, WPipe provides a memory‑efficient, high‑throughput pipeline parallelism solution that outperforms the state‑of‑the‑art PipeDream‑2BW, delivering 1.4× speed‑up, 36% lower memory consumption, and weight‑update semantics close to pure data parallelism.

Tags: deep learning, pipeline parallelism, large-scale DNN, memory efficiency, training throughput, WPipe
Written by

AntTech

Technology is the core driver of Ant's future creation.
