Tagged articles

8 articles

Page 1 of 1

May 17, 2026 · Artificial Intelligence

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

DeepSeekGPU Communication OverlapMixture of Experts

0 likes · 18 min read

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

Machine Heart

May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66 % on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inferenceDeepSeekGPU

0 likes · 12 min read

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

Machine Learning Algorithms & Natural Language Processing

May 7, 2026 · Artificial Intelligence

How TileLang Enables Efficient Small Operators in Large LLMs (DeepSeek V4 Report)

The article analyzes TileLang, the DSL behind DeepSeek V4, showing how its Fragment and Parallel abstractions, host‑side codegen via TVM‑FFI, and Z3 prover integration let developers implement fused small operators with hand‑written performance, faster development, and easier maintenance.

DSLDeepSeekGPU compiler

0 likes · 11 min read

How TileLang Enables Efficient Small Operators in Large LLMs (DeepSeek V4 Report)

Machine Learning Algorithms & Natural Language Processing

May 6, 2026 · Artificial Intelligence

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.

DeepSeek V4Expert ParallelGPU Distributed Training

0 likes · 19 min read

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

CodeTrend

Apr 26, 2026 · Artificial Intelligence

Why DeepSeek V4 Can Run on Huawei Ascend: A Deep Technical Breakdown

The article analyzes why most open‑source large models cannot run on Huawei Ascend NPU, detailing the CUDA‑centric ecosystem, Ascend's CANN stack, three core technical hurdles, and the deep collaboration and tooling that enabled DeepSeek V4’s successful adaptation.

AI model portingCANNDeepSeek V4

0 likes · 10 min read

Why DeepSeek V4 Can Run on Huawei Ascend: A Deep Technical Breakdown

Old Zhang's AI Learning

Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantizationGPU optimizationLLM training

0 likes · 9 min read

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

Tencent Technical Engineering

Jan 23, 2026 · Artificial Intelligence

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.

AI infrastructureAgent SystemsGPU communication

0 likes · 27 min read

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

Network Intelligence Research Center (NIRC)

Nov 24, 2025 · Artificial Intelligence

Simplifying AI Operator Development with TileLang DSL

TileLang is a Python‑style DSL built on TVM that separates algorithm logic from hardware scheduling, offers beginner to expert interfaces, supports multiple GPU and CPU backends, and delivers performance on par with or better than existing AI kernels, as demonstrated with GEMM, FlashAttention and other benchmarks.

AI operatorsDSLGEMM

0 likes · 10 min read

Simplifying AI Operator Development with TileLang DSL