How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap—including custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques—that together boost throughput by up to 8% and narrow the gap with leading GPU hardware.

CUDA Graph · INT8 Quantization · Kunlun XPU

13 min read

Baidu Intelligent Cloud Tech Hub

Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graph · cold-start optimization · large-model inference

16 min read

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · TensorRT

19 min read

DataFunSummit

Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · GPU optimization · PyTorch

11 min read

DaTaobao Tech

Sep 7, 2022 · Artificial Intelligence

Online Deep Learning (ODL) Model Optimization for Real‑Time Recommendation

The team enhanced real‑time recommendation by redesigning TensorFlow graphs (using constant folding, a custom CallGraphOP cache, a simplified dense layer, and CUDA‑Graph compatibility), boosting single‑machine throughput by about 40%, raising GPU utilization from 30% to 43%, cutting latency, and saving roughly 30% of hardware resources.

CUDA Graph · GPU performance · Recommendation Systems

11 min read
