Collection size: 99 articles

Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

GPU optimization · OpenAI API Compatibility · PagedAttention

0 likes · 61 min read

MaGe Linux Operations

Feb 27, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference with vLLM on Kubernetes and GPU Scheduling

This guide explains how to deploy vLLM for large‑language‑model serving on Kubernetes, covering GPU resource management, tensor‑parallel configuration, continuous batching, quantization choices, autoscaling with HPA and KEDA, multi‑model routing, and best‑practice recommendations for performance, cost control, and high availability.

GPU · Kubernetes · LLM inference

0 likes · 48 min read

MaGe Linux Operations

Dec 26, 2025 · Operations

Taming vLLM OOM: Real‑World Causes and Proven Fixes for Production

This article examines why vLLM experiences out‑of‑memory errors in production, explains memory fragmentation caused by PagedAttention, outlines four typical OOM scenarios with concrete command‑line solutions, and provides deep analysis, configuration scripts, dynamic tuning, troubleshooting flowcharts, monitoring alerts, and best‑practice recommendations.

Deployment · GPU · Memory Fragmentation

0 likes · 24 min read

Baidu Intelligent Cloud Tech Hub

Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Kunlun P800 · LLaMA-Factory

0 likes · 32 min read

MaGe Linux Operations

Dec 19, 2025 · Artificial Intelligence

Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Only a few vLLM settings truly affect performance. This guide details how adjusting gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill can raise Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improve latency, and it provides comprehensive deployment, monitoring, and troubleshooting instructions.

Docker · GPU · Inference Optimization

0 likes · 30 min read

Raymond Ops

Apr 27, 2026 · Artificial Intelligence

vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM

This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.

CUDA · GPU Memory · LLM serving

0 likes · 22 min read

Ops Community

Dec 28, 2025 · Artificial Intelligence

Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, real‑world case studies, monitoring, troubleshooting, and backup strategies for production‑grade deployments.

Deployment · GPU optimization · LLM inference

0 likes · 44 min read

Alibaba Cloud Native

Mar 9, 2024 · Cloud Computing

Deploy Google Gemma LLM on Alibaba Cloud Function Compute GPU with Low‑Cost Idle Mode

This guide shows how to quickly and cheaply deploy the open‑source Google Gemma large language model on Alibaba Cloud Function Compute GPU using the new idle‑billing mode, covering prerequisites, Docker image creation, function setup, idle reservation, testing, monitoring, and cost estimation.

Cloud Computing · Function Compute · GPU

0 likes · 10 min read

Alibaba Cloud Native

Feb 13, 2025 · Artificial Intelligence

Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations

This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.

Alibaba Cloud · GPU Reserved Instances · deployment guide

0 likes · 14 min read

Alibaba Cloud Native

Jun 29, 2024 · Cloud Native

Deploy TensorRT‑LLM Optimized Llama‑2 on KServe with Alibaba Cloud ASM

This guide walks through enabling KServe on Alibaba Cloud ASM, preparing the Llama‑2‑7B model with TensorRT‑LLM, creating the necessary Kubernetes resources, and deploying a serverless AI inference service that can be queried via a simple curl request.

AI inference · KServe · Kubernetes

0 likes · 14 min read

Efficient Ops

Oct 14, 2025 · Artificial Intelligence

Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

This guide explains what vLLM is, how its PagedAttention architecture boosts LLM throughput, provides step‑by‑step installation commands, showcases core examples for text generation, chat, embedding and classification, and details advanced performance features such as quantization, LoRA support, and distributed parallelism.

GPU Acceleration · LLM inference · Python

0 likes · 8 min read

Ops Development Stories

Jun 15, 2025 · Artificial Intelligence

How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

This article walks through deploying the high‑performance vLLM LLM inference framework, covering GPU and CPU backend installation, environment setup, offline and online serving, API usage, and a performance comparison that highlights the ten‑fold speed advantage of GPU over CPU.

CPU deployment · GPU deployment · LLM inference

0 likes · 38 min read

Alibaba Cloud Infrastructure

Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · GPU

0 likes · 17 min read

Old Zhang's AI Learning

Apr 7, 2026 · Artificial Intelligence

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

CPU KV offload · GPU · Gemma 4

0 likes · 18 min read

Old Zhang's AI Learning

Mar 27, 2026 · Artificial Intelligence

vLLM’s Four Major 2026 Updates: Semantic Router Athena, Nemotron 3 Super, P‑EAGLE, and Model Runner V2

The March 2026 vLLM release bundle introduces four substantial upgrades—Semantic Router v0.2 Athena, NVIDIA Nemotron 3 Super, the parallel speculative decoding P‑EAGLE, and a completely re‑architected Model Runner V2—each backed by concrete benchmarks, architectural diagrams, and code examples that demonstrate how the engine evolves from a pure inference engine to a full‑stack AI serving platform.

GPU Acceleration · Model Runner V2 · Nemotron-3-Super

0 likes · 17 min read

Old Zhang's AI Learning

Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.

AI models · GGUF · Unsloth

0 likes · 19 min read

DeepHub IMBA

Apr 4, 2026 · Artificial Intelligence

Building Mini-vLLM from Scratch: KV‑Cache, Dynamic Batching, and Distributed Inference

This article walks through constructing Mini-vLLM, a from‑scratch LLM inference engine that tackles the O(N²) attention cost with KV‑cache, boosts throughput via dynamic batching, adds observability with Prometheus/Grafana, supports gRPC, and scales across multiple workers, with benchmark numbers demonstrating its CPU‑only performance.

Docker · Dynamic Batching · Inference Engine

0 likes · 12 min read

Alibaba Cloud Infrastructure

Dec 22, 2025 · Artificial Intelligence

Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE

This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.

Alibaba Cloud · KV Cache · Kubernetes

0 likes · 18 min read

AI Engineer Programming

Apr 22, 2026 · Artificial Intelligence

Free LLM API Tokens: Complete Provider List, Limits, and Usage Tips

This guide compiles free large‑language‑model APIs from official vendors and third‑party platforms, detailing each service's token quotas, rate limits, base URLs, usage restrictions, and available models, while offering practical advice on token optimization, multi‑platform rotation, rate‑limit handling, and key security.

AI · Free API · LLM

0 likes · 15 min read

Alibaba Cloud Native

Jun 28, 2025 · Cloud Native

Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster using llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and demonstrates end‑to‑end request testing.

Fallback · Higress · Observability

0 likes · 15 min read
