Tagged articles

10 articles

Page 1 of 1

Mar 23, 2026 · Artificial Intelligence

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap—including custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques—that together boost throughput by up to 8% and narrow the gap with leading GPU hardware.

CUDA GraphINT8 QuantizationKunlun XPU

0 likes · 13 min read

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

Bilibili Tech

Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution—operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, contiguous batching, PageAttention, and specialized frameworks like MindIE‑LLM—to dramatically boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Inference AccelerationLarge Language ModelsOperator fusion

0 likes · 21 min read

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

Baidu Tech Salon

Aug 20, 2024 · Artificial Intelligence

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance

The PaddlePaddle Neural Network Compiler (CINN) combines a PIR‑based frontend and a hardware‑specific backend to apply graph‑level optimizations, operator fusion, schedule transformations and automatic tuning, delivering up to 4× faster kernels and 30‑60% overall speed‑ups for deep‑learning and scientific workloads.

CINNGPU optimizationOperator fusion

0 likes · 19 min read

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance

Alibaba Cloud Big Data AI Platform

Mar 9, 2023 · Artificial Intelligence

How We Won the DeepRec CTR Contest: 36% Faster Training with Operator Tweaks

The NicePerf team, after clinching the top spot in the Tianchi DeepRec CTR model performance competition, shares a detailed walkthrough of their CPU‑only training optimizations—including operator selection, custom C++ kernels, and workflow tweaks—that cut overall training time by over a third.

CPU trainingDIENDeepRec

0 likes · 9 min read

How We Won the DeepRec CTR Contest: 36% Faster Training with Operator Tweaks

Baidu Geek Talk

Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performanceINT8 QuantizationNsight Profiling

0 likes · 23 min read

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

Baidu Intelligent Cloud Tech Hub

Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI PerformanceGPU optimizationOperator fusion

0 likes · 23 min read

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

Baidu Intelligent Cloud Tech Hub

Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Inference: End‑to‑End Acceleration Strategies and Baidu’s AIAK‑Inference

This article presents a comprehensive analysis of AI inference bottlenecks, explores industry acceleration techniques such as model simplification, operator fusion, and single‑operator optimization, and details Baidu Cloud's AIAK‑Inference suite with practical demos showing up to 90% latency reduction.

AI inferenceAIAK-InferenceBaidu Cloud

0 likes · 16 min read

How to Supercharge AI Inference: End‑to‑End Acceleration Strategies and Baidu’s AIAK‑Inference

DataFunSummit

Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Dynamic BatchingInference AccelerationModel Compression

0 likes · 12 min read

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

DataFunTalk

Apr 22, 2022 · Artificial Intelligence

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

This article presents a comprehensive overview of inference optimization methods—including model pruning, quantization, knowledge distillation, caching, instruction‑set acceleration, and operator fusion—and details a GPU‑centric parallel acceleration methodology with CUDA basics, performance‑analysis tools, theoretical limits, and practical case studies, all illustrated with real‑world examples from Tencent's intelligent dialogue products.

CachingGPU AccelerationKnowledge Distillation

0 likes · 18 min read

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

Alibaba Cloud Developer

Jul 2, 2019 · Artificial Intelligence

How MNN Powers Mobile AI: Inside Alibaba’s Open‑Source Inference Engine

Alibaba’s MNN (Mobile Neural Network) engine, now open‑sourced on GitHub, showcases how a lightweight, end‑side deep‑learning inference framework tackles fragmentation, optimizes model conversion, scheduling, and execution across diverse devices, delivering significant performance gains for mobile and IoT AI applications.

Inference EngineMNNOperator fusion

0 likes · 15 min read

How MNN Powers Mobile AI: Inside Alibaba’s Open‑Source Inference Engine