Tagged articles

14 articles

Page 1 of 1

May 17, 2026 · Artificial Intelligence

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

DeepSeekGPU Communication OverlapMixture of Experts

0 likes · 18 min read

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

Machine Heart

May 16, 2026 · Artificial Intelligence

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

In a deep interview, former Google TPU architect Reiner Pope explains that low‑concurrency fast‑mode services trade higher fees for faster streaming but are limited by memory‑bandwidth bottlenecks, that optimal concurrency balances compute and memory costs, and that pipeline‑parallel sparse expert models and reinforcement‑learning fine‑tuning introduce new inefficiencies and overtraining risks.

LLMMemory BandwidthOvertraining

0 likes · 7 min read

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

AI Architecture Hub

Jan 2, 2026 · Artificial Intelligence

How Manifold-Constrained Hyper-Connections Boost LLM Performance with Minimal Overhead

DeepSeek's new mHC architecture projects residual connections onto a manifold, enabling a 6.7% training cost increase for 27B models while delivering significant stability and downstream performance gains over traditional residual and hyper‑connection designs.

LLMManifold OptimizationPipeline Parallelism

0 likes · 13 min read

How Manifold-Constrained Hyper-Connections Boost LLM Performance with Minimal Overhead

Xiaohongshu Tech REDtech

Dec 11, 2025 · Artificial Intelligence

Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput

The article introduces a fine‑grained activation offloading technique implemented in Megatron‑Core that offloads module‑level activations to CPU, overlaps transfer with computation, and remains compatible with pipeline and virtual pipeline parallelism, dramatically reducing peak GPU memory for large language models while incurring minimal throughput loss.

LLMMegatronMemory Optimization

0 likes · 18 min read

Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput

IT Services Circle

Nov 28, 2025 · Artificial Intelligence

Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work

AI model training relies on parallel computing, and this guide explains the four main parallelism strategies—Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism—detailing their mechanisms, advantages, drawbacks, and how techniques like ZeRO and mixed 3D parallelism optimize memory and performance for massive models.

3D ParallelismAI parallelismData Parallelism

0 likes · 14 min read

Unlocking AI Model Speed: How Data, Pipeline, Tensor & Expert Parallelism Work

Architect

May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI trainingData ParallelismExpert Parallelism

0 likes · 14 min read

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

AI Algorithm Path

May 11, 2025 · Artificial Intelligence

How to Parallelize Ultra‑Large Model Training with PyTorch

The article explains the core concepts and trade‑offs of five parallelism techniques—data, tensor, context, pipeline, and expert parallelism—plus the ZeRO optimizer, showing when each method is appropriate for training ultra‑large PyTorch models and providing concrete code snippets and performance considerations.

Context ParallelismData ParallelismExpert Parallelism

0 likes · 21 min read

How to Parallelize Ultra‑Large Model Training with PyTorch

Alibaba Cloud Big Data AI Platform

Mar 7, 2025 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts Qwen2‑VL Multimodal Training Efficiency

This article explains how the Pai‑Megatron‑Patch toolkit enhances the usability and training performance of the Qwen2‑VL multimodal large model by introducing model‑parallel weight conversion, user‑friendly data loading, visual feature processing optimizations, optimizer offloading, and pipeline parallelism techniques, supported by extensive experimental analysis.

MegatronPipeline ParallelismQwen2-VL

0 likes · 25 min read

How Pai‑Megatron‑Patch Boosts Qwen2‑VL Multimodal Training Efficiency

AI Algorithm Path

Feb 10, 2025 · Artificial Intelligence

Understanding DualPipe: DeepDive into DeepSeek‑R1 Architecture (Part 5)

This article explains how the DualPipe scheduling mechanism in DeepSeek‑R1 improves GPU cluster compute‑communication efficiency by using fine‑grained pipeline stages and bidirectional data flow, comparing it with Zero Bubble pipeline parallelism and discussing the challenges of large‑scale distributed training.

DeepSeekDualPipeModel Parallelism

0 likes · 10 min read

Understanding DualPipe: DeepDive into DeepSeek‑R1 Architecture (Part 5)

DataFunTalk

Jul 8, 2024 · Artificial Intelligence

Challenges and Techniques for Distributed Training of Large Language Models

This article discusses the historical background, major challenges such as massive compute and memory demands, and the technical ecosystem—including data parallelism, pipeline parallelism, and optimization strategies like DeepSpeed and 1F1B—to enable efficient distributed training of large language models.

AI infrastructureDeepSpeedPipeline Parallelism

0 likes · 22 min read

Challenges and Techniques for Distributed Training of Large Language Models

DataFunSummit

Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI infrastructureData ParallelismDeepSpeed

0 likes · 23 min read

Challenges and Techniques in Distributed Training of Large Language Models

Alimama Tech

Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeedGPU optimizationLLaMA

0 likes · 10 min read

Megatron-LLaMA: High-Performance Large Language Model Training Framework

AntTech

May 24, 2022 · Artificial Intelligence

WPipe: Group‑Based Interleaved Pipeline Parallelism for Large‑Scale DNN Training

The paper introduces WPipe, a group‑based interleaved pipeline parallelism method that reduces memory overhead and weight‑update latency compared with PipeDream‑2BW, achieving up to 1.4× speed‑up and 36% lower memory usage while preserving model accuracy on large‑scale DNNs.

Pipeline ParallelismTraining ThroughputWPipe

0 likes · 13 min read

WPipe: Group‑Based Interleaved Pipeline Parallelism for Large‑Scale DNN Training

DataFunSummit

Apr 25, 2022 · Artificial Intelligence

Token‑Level Pipeline Parallelism for Transformer‑based Language Models (TeraPipe)

The article introduces a token‑level pipeline parallelism strategy that splits the sequence‑length dimension of Transformer‑based language models, explains why this approach is feasible, presents a dynamic‑programming formulation for optimal slicing, discusses engineering challenges, and evaluates its performance on large GPT models.

Performance OptimizationPipeline ParallelismToken-level

0 likes · 13 min read

Token‑Level Pipeline Parallelism for Transformer‑based Language Models (TeraPipe)