Tagged articles

63 articles

Page 1 of 1

May 31, 2026 · Artificial Intelligence

Can Low-Bit Models Cut Inference Costs Better Than Small Models?

The article analyzes how low‑bit quantization differs from simply using smaller LLMs, examines hardware‑level precision reduction, compares post‑training quantization with native low‑bit designs, and explains the runtime and testing requirements needed to achieve real inference cost savings.

LLM inferencecost optimizationhardware acceleration

0 likes · 7 min read

Can Low-Bit Models Cut Inference Costs Better Than Small Models?

Machine Learning Algorithms & Natural Language Processing

May 29, 2026 · Artificial Intelligence

RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.

Dynamic Token SelectionKernel OptimizationLLM inference

0 likes · 17 min read

RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

Baidu Intelligent Cloud Tech Hub

May 27, 2026 · Artificial Intelligence

Optimizing Large Model Inference Architecture for the Agent Era: Engineering Practices and Challenges

The article analyzes the architectural challenges of large‑model inference in the Agent era—such as memory‑intensive MLA structures, MoE communication overhead, exploding KV‑Cache size, and tool‑call accuracy—and presents a series of engineering solutions including hierarchical KV‑Cache pooling, sequence parallelism, offloading strategies, and chip‑level adaptations to achieve higher throughput and lower token costs.

AI InfraAgentDeepSeek

0 likes · 15 min read

Optimizing Large Model Inference Architecture for the Agent Era: Engineering Practices and Challenges

Tencent Technical Engineering

May 25, 2026 · Artificial Intelligence

vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

This article walks through a two‑month source‑code study of vLLM, explaining how token‑level scheduling, continuous batching, and the Paged Attention mechanism reshape tensor dimensions to turn large‑model inference into a compute‑bound, high‑throughput process while managing GPU memory efficiently.

FlashAttentionGPU optimizationLLM inference

0 likes · 29 min read

vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

Old Zhang's AI Learning

May 17, 2026 · Artificial Intelligence

Why DeepSeek V4 Flash’s Quantized Model Is Gaining Traction

The DeepSeek V4 Flash quantized GGUF model and the dedicated ds4 inference engine, both released by antirez, offer dramatically reduced activation parameters, massive 1‑million‑token context windows, aggressive KV‑cache compression and hardware‑specific quantizations that enable smooth local inference on high‑memory Macs and CUDA machines, while sacrificing generality for performance.

DeepSeek V4 FlashGGUFLLM inference

0 likes · 11 min read

Why DeepSeek V4 Flash’s Quantized Model Is Gaining Traction

360 Zhihui Cloud Developer

May 15, 2026 · Artificial Intelligence

How PD (Prefill‑Decode) Disaggregation Makes LLM Inference Faster and More Stable

The article explains PD (Prefill‑Decode) disaggregation, an architecture that separates the compute‑bound Prefill stage from the memory‑bound Decode stage onto different GPU pools, eliminating interference, enabling independent scaling, leveraging hardware specialization, and delivering up to 85% lower tail latency for large language model inference.

GPU scalingKV cache transportLLM inference

0 likes · 10 min read

How PD (Prefill‑Decode) Disaggregation Makes LLM Inference Faster and More Stable

Old Zhang's AI Learning

May 13, 2026 · Artificial Intelligence

Why vLLM Now Leads Open‑Source LLM Inference Benchmarks

vLLM tops the Artificial Analysis ranking by delivering the highest throughput for DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5 on identical NVIDIA Blackwell Ultra hardware, thanks to extensive kernel‑fusion optimizations that remain in the main branch.

DeepSeekLLM inferenceQwen

0 likes · 7 min read

Why vLLM Now Leads Open‑Source LLM Inference Benchmarks

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Which Inference Framework Maximizes Your GPU Performance in 2026?

This article compares six popular LLM inference frameworks—vLLM, TensorRT‑LLM, llama.cpp, ds4.c, Ollama, and Omlx—across performance, ease of use, and hardware compatibility, then provides a practical matrix to help users select the best fit for their GPU.

Apple SiliconGPU performanceLLM inference

0 likes · 10 min read

Which Inference Framework Maximizes Your GPU Performance in 2026?

Old Zhang's AI Learning

May 10, 2026 · Artificial Intelligence

DFlash Boosts Large Model Inference Up to 6× – Now Supporting DeepSeek-V4

DFlash replaces the speculative draft model with a block‑diffusion drafter, generating 16 tokens per forward pass and achieving up to 6× speedup over baseline (2.5× over EAGLE‑3) without quality loss, while supporting a wide range of open‑source LLMs and multiple back‑ends.

Block DiffusionDFlashLLM inference

0 likes · 12 min read

DFlash Boosts Large Model Inference Up to 6× – Now Supporting DeepSeek-V4

Machine Heart

May 7, 2026 · Artificial Intelligence

Nvidia Endorses TokenSpeed: A Light‑Speed Agent Inference Engine Built in Two Months

TokenSpeed, an open‑source LLM inference engine designed for agent workloads, delivers TensorRT‑LLM‑level performance and vLLM‑level ease of use, outperforms TensorRT‑LLM by up to 11% throughput and halves latency on speculative decoding, and has earned Nvidia’s public recommendation.

Agent workloadsLLM inferenceNVIDIA Blackwell

0 likes · 8 min read

Nvidia Endorses TokenSpeed: A Light‑Speed Agent Inference Engine Built in Two Months

Old Zhang's AI Learning

May 6, 2026 · Artificial Intelligence

Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support

Google’s new Multi‑Token Prediction (MTP) drafter for Gemma 4 delivers up to three‑fold inference speedups across hardware and frameworks—validated by official benchmarks and independent DGX Spark tests—while preserving identical output quality, and is immediately usable via Hugging Face, vLLM, MLX, Ollama and edge‑device runtimes.

Apple SiliconGemma 4LLM inference

0 likes · 9 min read

Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support

Old Zhang's AI Learning

Apr 21, 2026 · Artificial Intelligence

Prefill-as-a-Service Boosts LLM Inference Throughput by 54%

A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long Prefill work to a remote cluster and, with dual‑timescale scheduling, achieve a 54% throughput gain over homogeneous PD deployment and 32% over naive heterogeneous setups.

Hybrid AttentionKVCache optimizationLLM inference

0 likes · 12 min read

Prefill-as-a-Service Boosts LLM Inference Throughput by 54%

Old Zhang's AI Learning

Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV CacheLLM inferencePerformance Optimization

0 likes · 13 min read

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

Old Zhang's AI Learning

Apr 7, 2026 · Artificial Intelligence

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

CPU KV offloadGPUGemma 4

0 likes · 18 min read

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

Old Zhang's AI Learning

Apr 6, 2026 · Artificial Intelligence

Ollama 0.19 Boosts Apple Silicon LLM Inference with MLX Engine and NVFP4

Ollama 0.19 replaces its inference backend with Apple’s MLX framework and adopts NVIDIA’s NVFP4 4‑bit quantization, delivering up to a 93% speed increase on M5 chips while keeping accuracy comparable to cloud‑based deployments, and adds three cache upgrades for smoother agent interactions.

Apple SiliconLLM inferenceMLX

0 likes · 10 min read

Ollama 0.19 Boosts Apple Silicon LLM Inference with MLX Engine and NVFP4

ShiZhen AI

Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV CacheLLM inferenceMemory Optimization

0 likes · 9 min read

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

ShiZhen AI

Mar 31, 2026 · Artificial Intelligence

Google's TurboQuant Paper Triggers Storage Stock Drops, Community Implements It in 48 Hours

Google's TurboQuant paper shows KV cache compression up to 6.4× with minimal quality loss, causing DRAM and SSD stocks to tumble, while the open‑source community reproduces the method in under two days and Anthropic and OpenAI add powerful developer‑focused AI features.

AI toolchainClaude CodeKV Cache

0 likes · 9 min read

Google's TurboQuant Paper Triggers Storage Stock Drops, Community Implements It in 48 Hours

Old Zhang's AI Learning

Mar 28, 2026 · Artificial Intelligence

vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

The article reviews how the leading LLM inference frameworks—oMLX, mlx‑vlm, llama.cpp, and vLLM—are integrating Google’s TurboQuant compression, showing up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and detailed integration steps for each project.

KV CacheLLM inferenceTurboQuant

0 likes · 8 min read

vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

Shi's AI Notebook

Mar 27, 2026 · Artificial Intelligence

Decoding Prompt Caching: From PagedAttention Mechanics to Cost‑Saving Practices

The article explains how Prompt Caching leverages vLLM's PagedAttention and block‑level hashing to reuse KV cache across identical prefixes, dramatically cutting LLM inference latency and cost, and provides concrete engineering tips for maximizing cache hit rates.

HashingKV CacheLLM inference

0 likes · 7 min read

Decoding Prompt Caching: From PagedAttention Mechanics to Cost‑Saving Practices

Baidu Intelligent Cloud Tech Hub

Mar 23, 2026 · Artificial Intelligence

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap—including custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques—that together boost throughput by up to 8% and narrow the gap with leading GPU hardware.

CUDA GraphINT8 QuantizationKunlun XPU

0 likes · 13 min read

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

Old Zhang's AI Learning

Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.

Claude OpusDockerLLM inference

0 likes · 5 min read

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

Machine Learning Algorithms & Natural Language Processing

Mar 5, 2026 · Artificial Intelligence

Mamba’s SSD Framework Shatters Serial Bottleneck, Outperforms vLLM and SGLang

The new Speculative Speculative Decoding (SSD) framework, built by the Mamba and FlashAttention authors, eliminates the serial draft‑verification bottleneck in LLM inference by running the draft model asynchronously, introducing a speculation cache and the Saguaro algorithm, which together deliver up to 5× speedup over autoregressive baselines and up to 2× over optimized engines on Llama‑3 and Qwen‑3, reshaping the latency‑throughput trade‑off.

Asynchronous ParallelismLLM inferencePerformance Optimization

0 likes · 9 min read

Mamba’s SSD Framework Shatters Serial Bottleneck, Outperforms vLLM and SGLang

AI Explorer

Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

DockerKV CacheLLM inference

0 likes · 6 min read

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

DeepHub IMBA

Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV CacheLLM inferencePagedAttention

0 likes · 16 min read

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

MaGe Linux Operations

Feb 27, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference with vLLM on Kubernetes and GPU Scheduling

This guide explains how to deploy vLLM for large‑language‑model serving on Kubernetes, covering GPU resource management, tensor‑parallel configuration, continuous batching, quantization choices, autoscaling with HPA and KEDA, multi‑model routing, and best‑practice recommendations for performance, cost control, and high availability.

GPUKubernetesLLM inference

0 likes · 48 min read

How to Deploy Scalable LLM Inference with vLLM on Kubernetes and GPU Scheduling

PaperAgent

Feb 27, 2026 · Artificial Intelligence

How DualPath Eliminates Storage Bandwidth Bottlenecks in Agentic LLM Inference

This article analyzes the DualPath architecture that redesigns KV‑Cache data paths to overcome storage‑NIC saturation in Prefill‑Decode LLM systems, presenting theoretical proofs, detailed engineering solutions, and extensive offline and online benchmarks that demonstrate up to 2.25× performance gains.

DualPathLLM inferencePerformance Optimization

0 likes · 9 min read

How DualPath Eliminates Storage Bandwidth Bottlenecks in Agentic LLM Inference

AI Engineering

Jan 23, 2026 · Industry Insights

vLLM Core Team Launches Inferact, Secures $150M Seed Funding

The vLLM core maintainers have founded Inferact, raised a $150 million seed round led by Andreessen Horowitz and Lightspeed, and highlighted escalating inference challenges, the project's ecosystem dominance, and a continued commitment to open‑source development.

AI infrastructureInferactLLM inference

0 likes · 3 min read

vLLM Core Team Launches Inferact, Secures $150M Seed Funding

Alibaba Cloud Developer

Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse AttentionGPU memory optimizationHierarchical Storage

0 likes · 20 min read

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

AI Frontier Lectures

Jan 12, 2026 · Industry Insights

Why LLM Inference Hits a Memory Wall – Four Hardware Research Directions

The article analyses the challenges of large‑language‑model inference, highlighting memory bandwidth and interconnect as the primary bottlenecks, and presents four research opportunities—high‑bandwidth flash, processing‑near‑memory, 3D memory‑logic stacking, and low‑latency interconnect—while evaluating current Nvidia solutions and proposing integrated architectural approaches.

3D stackingAI hardware researchLLM inference

0 likes · 22 min read

Why LLM Inference Hits a Memory Wall – Four Hardware Research Directions

MaGe Linux Operations

Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInferGPU optimizationH800

0 likes · 19 min read

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

Alibaba Cloud Developer

Jan 6, 2026 · Artificial Intelligence

How Tair‑KVCache‑HiSim Simulates LLM Inference 390 000× Faster with <5% Error

This article explains the design, challenges, and high‑fidelity architecture of Tair‑KVCache‑HiSim, a simulation tool that models multi‑level KV‑Cache behavior for large‑language‑model inference, predicts latency, throughput and cost under SLO constraints, and validates its predictions against real GPU deployments with sub‑5% error.

AI infrastructureKVCacheLLM inference

0 likes · 32 min read

How Tair‑KVCache‑HiSim Simulates LLM Inference 390 000× Faster with <5% Error

Ops Community

Dec 28, 2025 · Artificial Intelligence

Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, real‑world case studies, monitoring, troubleshooting, and backup strategies for production‑grade deployments.

DeploymentGPU optimizationLLM inference

0 likes · 44 min read

Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

Alibaba Cloud Developer

Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI infrastructureKVCacheKubernetes

0 likes · 21 min read

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

AI2ML AI to Machine Learning

Dec 21, 2025 · Artificial Intelligence

Why KV Caching Is Critical for Efficient LLM Inference

The article breaks down the principles of KV caching in large language models, explaining how Q/K/V behavior differs between training and inference, the role of prompts, cache size trade‑offs, and the complexities of concurrent inference, all backed by concrete examples and references.

Concurrent InferenceKV CacheLLM inference

0 likes · 7 min read

Why KV Caching Is Critical for Efficient LLM Inference

Baidu Geek Talk

Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offloadGPU MemoryLLM inference

0 likes · 20 min read

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

Baidu Intelligent Cloud Tech Hub

Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offloadGPU‑CPU optimizationLLM inference

0 likes · 21 min read

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

Baidu Intelligent Cloud Tech Hub

Nov 25, 2025 · Artificial Intelligence

Why DeepSeek‑V3.2‑Exp Lost Performance and How a Simple RoPE Fix Restored It

The Baidu Baige team discovered that DeepSeek‑V3.2‑Exp’s long‑context performance lagged behind the official report, traced the issue to a subtle RoPE layout mismatch in the open‑source inference demo, collaborated with DeepSeek to fix it, and verified that the model’s speed and accuracy fully recovered across multiple benchmarks.

AI infrastructureDeepSeekLLM inference

0 likes · 9 min read

Why DeepSeek‑V3.2‑Exp Lost Performance and How a Simple RoPE Fix Restored It

Baidu Intelligent Cloud Tech Hub

Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch schedulingGPU utilizationLLM inference

0 likes · 9 min read

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Alibaba Cloud Infrastructure

Oct 29, 2025 · Artificial Intelligence

How Alibaba Cloud’s Container Service Accelerates Enterprise LLM Inference

The article outlines how Alibaba Cloud’s container service has evolved to support large‑scale GPU clusters, AI data pipelines, and the new AI Serving Stack, enabling enterprises to deploy, scale, and manage LLM inference services efficiently while addressing Day0‑Day2 challenges.

AI infrastructureAlibaba CloudGPU scaling

0 likes · 13 min read

How Alibaba Cloud’s Container Service Accelerates Enterprise LLM Inference

Efficient Ops

Oct 14, 2025 · Artificial Intelligence

Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

This guide explains what vLLM is, how its PagedAttention architecture boosts LLM throughput, provides step‑by‑step installation commands, showcases core examples for text generation, chat, embedding and classification, and details advanced performance features such as quantization, LoRA support, and distributed parallelism.

GPU AccelerationLLM inferencePython

0 likes · 8 min read

Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

Architects' Tech Alliance

Sep 30, 2025 · Artificial Intelligence

How KV Cache and CachedAttention Revolutionize LLM Inference Efficiency

This article explains how key‑value (KV) caching and the new CachedAttention technique dramatically reduce large‑language‑model inference costs by reusing stored attention data across dialogue turns, leveraging a three‑tier memory hierarchy of HBM, DRAM, and SSD to overcome bandwidth and capacity bottlenecks.

AI PerformanceCachedAttentionKV Cache

0 likes · 8 min read

How KV Cache and CachedAttention Revolutionize LLM Inference Efficiency

Volcano Engine Developer Services

Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI infrastructureKVCacheLLM inference

0 likes · 30 min read

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

AI Algorithm Path

Jul 13, 2025 · Artificial Intelligence

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

This article explains how to estimate the GPU memory required for running large language models of 3 B, 7 B, and 13 B parameters, walks through step‑by‑step calculations, shows how hardware limits affect feasibility, and offers practical optimization techniques such as quantization and CPU offloading.

AI model sizingCPU offloadingFP16

0 likes · 5 min read

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

Ops Development Stories

Jun 15, 2025 · Artificial Intelligence

How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

This article walks through deploying the high‑performance vLLM LLM inference framework, covering GPU and CPU backend installation, environment setup, offline and online serving, API usage, and a performance comparison that highlights the ten‑fold speed advantage of GPU over CPU.

CPU deploymentGPU deploymentLLM inference

0 likes · 38 min read

How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

Alibaba Cloud Infrastructure

May 1, 2025 · Artificial Intelligence

Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling

This article demonstrates how to use ACK AI Profiling, built on eBPF and dynamic process injection, to perform non-intrusive, low‑overhead profiling of Kubernetes‑deployed large‑language‑model inference services, identify GPU memory growth causes, and apply optimization recommendations to prevent OOM issues.

AI profilingGPU MemoryKubernetes

0 likes · 10 min read

Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling

Ops Development & AI Practice

Apr 2, 2025 · Artificial Intelligence

How Cache‑Augmented Generation (CAG) Supercharges LLM Inference

Cache‑Augmented Generation (CAG) speeds up large language model text generation by caching the Transformer attention layer’s key‑value states, dramatically reducing the quadratic compute cost of autoregressive decoding while keeping the model’s knowledge unchanged.

AI PerformanceCAGCache‑augmented generation

0 likes · 9 min read

How Cache‑Augmented Generation (CAG) Supercharges LLM Inference

Baidu Intelligent Cloud Tech Hub

Mar 7, 2025 · Artificial Intelligence

Deploy DeepSeek R1 with Prefill‑Decode Separation on Baidu Baige

This guide explains how to set up Baidu Baige's PD‑separated deployment for the DeepSeek R1 large‑language model, covering resource preparation, data acquisition, Prefill and Decode service configuration, and API invocation to achieve lower latency and higher throughput.

Baidu BaigeDeepSeekGPU deployment

0 likes · 7 min read

Deploy DeepSeek R1 with Prefill‑Decode Separation on Baidu Baige

Meituan Technology Team

Mar 6, 2025 · Artificial Intelligence

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting weights to BF16 and then applying block‑wise or channel‑wise quantization, which preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs, and they released the SGLang‑based inference scripts and quantized weights publicly, enabling deployment on older NVIDIA hardware without accuracy loss.

DeepSeek-R1GPU deploymentINT8 Quantization

0 likes · 11 min read

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Architect

Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inferencePerformance OptimizationSpeculative Decoding

0 likes · 23 min read

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

AI Algorithm Path

Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source GPU kernel optimized for Nvidia Hopper GPUs that compresses the KV cache of multi‑head attention, cutting memory usage by up to 93.3% and delivering 580 TFLOPS compute, thereby dramatically accelerating large‑language‑model inference while lowering cost.

DeepSeekFlash-MLAGPU optimization

0 likes · 8 min read

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

DaTaobao Tech

Oct 16, 2024 · Artificial Intelligence

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details MNN’s CPU backend dynamic quantization for Transformer‑type models, describing runtime int8 conversion, block‑wise matrix‑multiply optimizations using ARM SMMLA/SDOT and AVX‑512 VNNI, weight‑group and batch‑wise quantization techniques, and reports up to three‑fold speed‑ups on Snapdragon 8 Gen 3.

CPU optimizationDynamic QuantizationINT8

0 likes · 19 min read

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

Baobao Algorithm Notes

Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM inferenceOpenAI o1Reward Model

0 likes · 34 min read

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

Alibaba Cloud Big Data AI Platform

Sep 16, 2024 · Artificial Intelligence

How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

With the growing complexity of LLM architectures like GQA, MLA, and MoE, runtime overhead has become a bottleneck; this article analyzes Python performance, communication costs, and synchronous execution in current inference frameworks, introduces the fully asynchronous TAG architecture, and demonstrates its superior throughput and latency through benchmarks.

GPU utilizationLLM inferenceRuntime Optimization

0 likes · 12 min read

How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

Alibaba Cloud Infrastructure

Sep 5, 2024 · Artificial Intelligence

Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide

This guide explains how to quickly build a high‑performance, observable, and elastically scalable LLM inference service by deploying NVIDIA NIM on an Alibaba Cloud ACK cluster using the Cloud‑Native AI Suite, KServe, Prometheus, Grafana, and custom autoscaling based on request‑queue metrics.

Alibaba Cloud ACKAutoscalingGrafana

0 likes · 15 min read

Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide

Baobao Algorithm Notes

Jul 11, 2024 · Artificial Intelligence

Why Separate Prefill and Decode? A Deep Dive into DistServe’s Split Inference Architecture

This article explores the two‑stage LLM inference pipeline, introduces TTFT and TPOT metrics, explains the motivation for prefilling‑decoding separation, presents experimental comparisons between split and merged architectures, and details optimization techniques and parallel‑strategy modeling for DistServe.

DistServeGoodputLLM inference

0 likes · 28 min read

Architect

Jul 2, 2024 · Artificial Intelligence

Mooncake: A Separated Architecture for Large‑Language‑Model Inference

The article presents Mooncake, a split‑architecture inference platform for the Kimi LLM assistant, detailing its three elastic resource pools, the rationale for using Time‑Between‑Tokens over TPOT, and design choices for Prefill, KVCache, and Decode stages to improve latency and throughput.

AI SystemsDecodeKVCache

0 likes · 9 min read

Mooncake: A Separated Architecture for Large‑Language‑Model Inference

DataFunSummit

Apr 14, 2024 · Artificial Intelligence

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

This article presents a comprehensive overview of NVIDIA’s TensorRT-LLM, detailing its product positioning as a scalable LLM inference solution, key features such as model support, low-precision and quantization techniques, parallelism strategies, the end-to-end usage workflow, performance highlights, future roadmap, and answers to common technical questions.

LLM inferenceNVIDIAParallelism

0 likes · 13 min read

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

Baobao Algorithm Notes

Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU MemoryLLM inferencePagedAttention

0 likes · 25 min read

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

Architects' Tech Alliance

Mar 22, 2024 · Industry Insights

Can Groq’s LPU Outsmart Nvidia GPUs in AI Inference?

The article examines Groq’s new LPU AI chip, comparing its inference speed and architecture to Nvidia GPUs, discusses the company’s market positioning, recent CEO statements, and the broader AI‑hardware race, while questioning whether Groq can become the go‑to accelerator for startups by the end of 2024.

AI chipsAI hardwareGroq

0 likes · 9 min read

Can Groq’s LPU Outsmart Nvidia GPUs in AI Inference?

NewBeeNLP

Feb 8, 2024 · Artificial Intelligence

How Speculative Decoding Supercharges Large Language Model Inference

This survey examines speculative decoding—a draft‑then‑verify technique that parallelizes token generation to cut LLM inference latency, outlines its core components, compares independent and self‑drafting methods, discusses verification strategies, and highlights open research challenges.

Artificial IntelligenceLLM inferenceParallelism

0 likes · 15 min read

How Speculative Decoding Supercharges Large Language Model Inference

DataFunTalk

Jan 31, 2024 · Artificial Intelligence

Introduction to NVIDIA TensorRT-LLM Inference Framework

TensorRT-LLM is NVIDIA's scalable inference framework for large language models that combines TensorRT compilation, fast kernels, multi‑GPU parallelism, low‑precision quantization, and a PyTorch‑like API to deliver high‑performance LLM serving with extensive customization and future‑focused enhancements.

GPU AccelerationLLM inferenceNVIDIA

0 likes · 12 min read

Introduction to NVIDIA TensorRT-LLM Inference Framework

Alibaba Cloud Native

Jan 17, 2024 · Artificial Intelligence

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service (ACK) with environment setup, model compilation, benchmarking, and performance comparison.

Cloud Native AIIn‑Flight BatchingLLM inference

0 likes · 13 min read

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

Alibaba Cloud Big Data AI Platform

Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI infrastructureAutoTunerLLM inference

0 likes · 11 min read

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner