Tagged articles

11 articles

Page 1 of 1

Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilizationModel QuantizationPerformance Evaluation

0 likes · 18 min read

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

Baidu Intelligent Cloud Tech Hub

Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graphcold-start optimizationlarge-model inference

0 likes · 16 min read

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

Baidu Intelligent Cloud Tech Hub

Jan 5, 2026 · Artificial Intelligence

How Baidu Tianchi Supernodes Supercharge Large‑Model Inference: Architecture, Deployment, and Optimization

This article details Baidu's Tianchi supernode design and software tuning—covering hardware scale‑up, deployment planning, Prefill and Decode stage optimizations, quantization strategies, and communication schemes—to dramatically boost large‑model inference throughput and latency while lowering token‑cost.

AI infrastructureParallelismPerformance Optimization

0 likes · 20 min read

How Baidu Tianchi Supernodes Supercharge Large‑Model Inference: Architecture, Deployment, and Optimization

Alibaba Cloud Infrastructure

May 14, 2025 · Artificial Intelligence

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

Mooncake, an open‑source large‑model inference platform, introduces a KVCache‑centric architecture that dramatically improves throughput, reduces latency and cuts inference costs by up to 20%, while integrating with frameworks like SGLang and vLLM and leveraging Alibaba Cloud’s eRDMA and GPUDirect technologies for scalable, high‑performance deployments.

AI PerformanceAlibaba CloudDistributed Systems

0 likes · 7 min read

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

DeWu Technology

Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as Deepseek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AIGPU AccelerationSpeculative Decoding

0 likes · 22 min read

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

Baidu Geek Talk

Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses via a Prefill stage and an autoregressive Decoder, measured by TTFT and TPOT, and Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and asynchronous execution, cutting token‑interval latency from ~35 ms to ~14 ms and boosting GPU utilization to about 75 % while also leveraging quantization and speculative execution for higher throughput.

AI accelerationGPU utilizationTPOT

0 likes · 10 min read

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Baidu Intelligent Cloud Tech Hub

Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAKGPU utilizationLLM Performance

0 likes · 10 min read

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

DataFunSummit

Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with Page Attention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPUPerformanceVirtual Memory

0 likes · 25 min read

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

Alibaba Cloud Infrastructure

Nov 29, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open‑source Mooncake architecture, a KVCache‑centered large‑model inference framework that boosts throughput, lowers cost, and standardizes resource‑pooling techniques for high‑performance AI inference across industry and academia.

KVCacheTsinghua Universitylarge-model inference

0 likes · 4 min read

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

Volcano Engine Developer Services

Jun 20, 2023 · Artificial Intelligence

Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture

Large-model offline (batch) inference, which processes massive data on billion-parameter models, faces GPU memory and distributed scheduling challenges; this article explains how Ray's cloud-native framework, model parallelism, and Ray Datasets pipelines address these issues, improve throughput, and enable elastic, efficient GPU utilization.

GPU utilizationRaycloud-native

0 likes · 16 min read

Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture

Baidu Geek Talk

Mar 9, 2023 · Industry Insights

How Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle Boost AI Painting Performance

This article analyzes Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle‑optimized Stable Diffusion models, presenting benchmark comparisons, hardware‑specific speed and memory gains, and the underlying inference optimizations that enable low‑cost, high‑throughput AI‑generated image creation.

AI paintingAIGCGPU Acceleration

0 likes · 9 min read

How Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle Boost AI Painting Performance