How PD (Prefill‑Decode) Disaggregation Makes LLM Inference Faster and More Stable

The article explains PD (Prefill‑Decode) disaggregation, an architecture that separates the compute‑bound Prefill stage from the memory‑bound Decode stage onto different GPU pools, eliminating interference, enabling independent scaling, leveraging hardware specialization, and delivering up to 85% lower tail latency for large language model inference.

GPU scaling · KV cache transport · LLM inference

Alibaba Cloud Big Data AI Platform

Mar 25, 2026 · Artificial Intelligence

Scaling Multimodal Reinforcement Learning with NVIDIA Isaac Lab and TiledCamera

This article explains how to use NVIDIA Isaac Lab and the TiledCamera component to run large‑scale, multimodal reinforcement learning on GPU clusters, covering environment setup, noVNC visualization, command‑line execution, distributed training with torchrun, and performance analysis across multiple GPU configurations.

GPU scaling · NVIDIA Isaac Lab · TiledCamera

Alibaba Cloud Infrastructure

Oct 29, 2025 · Artificial Intelligence

How Alibaba Cloud’s Container Service Accelerates Enterprise LLM Inference

The article outlines how Alibaba Cloud’s container service has evolved to support large‑scale GPU clusters, AI data pipelines, and the new AI Serving Stack, enabling enterprises to deploy, scale, and manage LLM inference services efficiently while addressing Day0‑Day2 challenges.

AI infrastructure · Alibaba Cloud · GPU scaling

Baobao Algorithm Notes

Sep 18, 2024 · Artificial Intelligence

Why Training on 1,000 GPUs Is Harder Than You Think—and How to Tame It

Training deep learning models on a thousand GPUs brings steep communication overhead, a higher probability of failure, and scaling inefficiencies. By profiling each step, overlapping compute and communication, using gradient bucketing and accumulation, and employing elastic training techniques, practitioners can approach linear scaling while mitigating common pitfalls.

GPU scaling · Large Models · Performance Optimization

Architects' Tech Alliance

Jul 30, 2024 · Artificial Intelligence

Unlocking 10K‑GPU LLM Training: Inside MegaScale’s 55% MFU Breakthrough

This article translates and analyzes the MegaScale paper from ByteDance and Peking University, which describes a system for efficient, stable training of large language models on clusters of more than 10,000 GPUs; it reports 55.2% MFU and a 1.34× speedup over Megatron‑LM.

Distributed Systems · GPU scaling · LLM training

Kuaishou Tech

Jul 16, 2021 · Artificial Intelligence

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.

Bagua · GPU scaling · PyTorch
