Tagged articles
2 articles
Data Party THU
May 17, 2026 · Artificial Intelligence

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

DeepSeek · GPU Communication Overlap · Mixture of Experts
18 min read
Machine Learning Algorithms & Natural Language Processing
May 6, 2026 · Artificial Intelligence

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.

DeepSeek V4 · Expert Parallel · GPU Distributed Training
19 min read