
WPipe: Group‑Based Interleaved Pipeline Parallelism for Large‑Scale DNN Training

The paper introduces WPipe, a group‑based interleaved pipeline parallelism method that reduces memory overhead and weight‑update latency compared with PipeDream‑2BW, achieving up to 1.4× speed‑up and 36% lower memory usage while preserving model accuracy on large‑scale DNNs.

AntTech

ICLR (International Conference on Learning Representations) is one of the three top conferences in machine learning, and the work described here was accepted to ICLR 2022. The recent trend toward massive deep neural networks (DNNs) has driven the development of pipeline-parallel training techniques such as GPipe, PipeDream, and PipeDream‑2BW, but even the most recent of these, PipeDream‑2BW, still suffers from excessive memory redundancy and delayed weight updates.

WPipe addresses these two drawbacks by partitioning the model into two groups and applying a weight‑movement operation that enables seamless weight updates while cutting both the number of cached weight versions and the update latency roughly in half. Experiments on the large‑scale language model BERT and the vision model ResNeXt show that WPipe achieves a 1.4× speed‑up and a 36% reduction in memory consumption without harming final accuracy.

The article reviews related work on model parallelism, distinguishing intra‑layer (data‑parallel) and inter‑layer approaches, and explains how pipeline parallelism (PP) can improve resource utilization. Synchronous PP (Sync‑PP) eliminates weight staleness but introduces idle bubbles that reduce throughput, whereas asynchronous PP (Async‑PP) removes bubbles at the cost of maintaining multiple weight versions.
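The cost of Sync‑PP's idle bubbles can be quantified with a standard back‑of‑envelope formula (this formula is a common approximation for GPipe‑style schedules, not a figure taken from the paper): with `p` pipeline stages and `m` microbatches per mini‑batch, the idle fraction is roughly `(p - 1) / (m + p - 1)`.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle-time fraction of a synchronous (GPipe-style)
    pipeline: (p - 1) / (m + p - 1). A common rule of thumb, assuming
    uniform stage times; not an exact model of any one system."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More microbatches amortize the pipeline fill/drain bubbles.
print(bubble_fraction(4, 4))   # ≈ 0.43: nearly half the time is idle
print(bubble_fraction(4, 32))  # ≈ 0.086: bubbles mostly amortized away
```

This is why Sync‑PP throughput improves with more microbatches, and why Async‑PP instead trades bubbles for the bookkeeping of multiple weight versions.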

WPipe’s design interleaves two groups of model partitions so that forward passes of the next update cycle are moved ahead of the backward passes of the current cycle, eliminating cross‑cycle weight conflicts and halving the number of required weight versions. This also reduces runtime activation memory by 50%.
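The interleaving idea can be sketched as a toy op ordering. Everything below is an illustrative assumption (the group names `G0`/`G1` and the exact sequence are invented for exposition; the paper's real schedule also pipelines microbatches within each cycle): the point is only that one group's next‑cycle forward is issued before that group's current‑cycle backward, so no two live cycles ever need different weight versions of the same group.

```python
def wpipe_cycle_ops(num_cycles: int) -> list[str]:
    """Toy illustration (not the paper's exact schedule) of WPipe's
    group-based interleaving. The model is split into two groups, G0
    and G1; G0's forward for cycle t+1 is moved ahead of G0's backward
    for cycle t, so each group keeps only ONE live weight version,
    versus the two versions PipeDream-2BW must cache per stage."""
    ops = ["F(G0, cycle 0)"]  # initial fill: first group's first forward
    for t in range(num_cycles):
        ops.append(f"F(G1, cycle {t})")      # forward, second group
        ops.append(f"F(G0, cycle {t + 1})")  # next cycle's G0 forward,
                                             # moved ahead of B(G0, cycle t)
        ops.append(f"B(G1, cycle {t})")      # backward, second group
        ops.append(f"B(G0, cycle {t})")      # backward + update, first group:
                                             # no cross-cycle version conflict
    return ops

for op in wpipe_cycle_ops(2):
    print(op)
```

In this toy trace, `F(G0, cycle 1)` always precedes `B(G0, cycle 0)`, which is the reordering that lets WPipe drop the second cached weight version.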

Memory analysis (Table 1) demonstrates that WPipe uses half the parameter cache of PipeDream‑2BW and significantly less activation memory, which is critical for training very large models where activation memory dominates.
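A hypothetical back‑of‑envelope model makes the comparison concrete (the coefficients below only restate the article's claims of halved weight caching and roughly halved activation memory; Table 1 in the paper has the precise per‑stage accounting):

```python
def est_memory_gb(weight_gb: float, act_gb: float, method: str) -> float:
    """Illustrative per-worker memory model, NOT the paper's exact
    formula: PipeDream-2BW double-buffers weights (2 versions) and
    holds a full set of in-flight activations; WPipe keeps a single
    weight version and roughly half the runtime activation memory."""
    if method == "pipedream-2bw":
        return 2 * weight_gb + act_gb
    if method == "wpipe":
        return 1 * weight_gb + 0.5 * act_gb
    raise ValueError(f"unknown method: {method}")

# Example: 4 GB of weights, 6 GB of activations per worker.
print(est_memory_gb(4, 6, "pipedream-2bw"))  # 14
print(est_memory_gb(4, 6, "wpipe"))          # 7.0
```

Because activation memory grows with batch size while weight memory does not, the activation term dominates for very large models, which is why halving it matters most there.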

Experimental evaluation covers convergence, throughput, and memory usage on ResNeXt (CV) and BERT (NLP) models. Convergence results show WPipe matches the final accuracy of PipeDream‑2BW and data parallelism. Throughput measurements reveal WPipe consistently outperforms competing methods, with the WPipe‑R variant supporting batch sizes up to 44× larger. Memory usage experiments confirm WPipe‑R achieves the greatest reduction, especially at larger batch sizes.

In conclusion, WPipe provides a memory‑efficient, high‑throughput pipeline parallelism solution that outperforms the state‑of‑the‑art PipeDream‑2BW, delivering 1.4× speed‑up, 36% lower memory consumption, and weight‑update semantics close to pure data parallelism.

Tags: deep learning, pipeline parallelism, large-scale DNN, memory efficiency, training throughput, WPipe
Written by

AntTech

Technology is the core driver of Ant's future creation.
