Tag

Pipeline Parallel


Alibaba Cloud Infrastructure
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article is a step-by-step guide to deploying and optimizing large-language-model inference across multiple GPU-enabled nodes. It covers ACK Gateway with the Inference Extension, vLLM's tensor- and pipeline-parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, and closes with performance benchmarking and analysis.

ACK Gateway · Distributed Inference · LLM
19 min read
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine-tuning, describes the collective communication primitives used in distributed training, and compares data parallelism, model parallelism, ZeRO, pipeline parallelism, mixed precision, and activation checkpointing as techniques for reducing memory consumption in large-scale AI training.

GPU Memory · Mixed Precision · Pipeline Parallel
9 min read