Tagged articles

60 articles

May 31, 2026 · Artificial Intelligence

vLLM 0.22 Release: Production-Ready DeepSeek V4 and Extreme KV Cache Compression

The vLLM 0.22 stable release introduces production‑grade DeepSeek V4 support, massive kernel fusions, up to 10‑20× speedups, Batch Invariance with 28.9% latency gain, a Rust front‑end, multi‑level KV cache offload that can double context length, and broad hardware coverage across NVIDIA, AMD, CPU and RISC‑V, making it a pivotal upgrade for inference infrastructure teams.

Batch Invariance · DeepSeek V4 · Inference Optimization

0 likes · 13 min read

Machine Heart

May 30, 2026 · Artificial Intelligence

Can MIT’s Attention Matching Cut LLM Memory 50× Without Accuracy Loss?

MIT researchers introduce Attention Matching, a latent‑space KV‑cache compaction technique that reduces large‑language‑model memory usage up to 50‑fold with negligible precision loss, outperforming token‑pruning, summarization, and prior compaction methods across benchmarks like QuALITY, LongHealth, and AIME‑2025.

Attention Matching · KV Cache · LLM

0 likes · 13 min read

Machine Heart

May 29, 2026 · Artificial Intelligence

Beyond TurboQuant: Introducing a True 2‑bit KV Quantization for Long‑Context LLM Inference

OSCAR, a new attention‑aware 2‑bit KV cache quantization method, cuts KV memory by up to 8×, delivers up to 3× decode speedup and 7× throughput gain, and matches BF16 accuracy across 4B‑32B models on diverse long‑context reasoning tasks, surpassing TurboQuant.

2-bit compression · KV Cache · LLM Quantization

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

May 27, 2026 · Artificial Intelligence

Optimizing Large Model Inference Architecture for the Agent Era: Engineering Practices and Challenges

The article analyzes the architectural challenges of large‑model inference in the Agent era—such as memory‑intensive MLA structures, MoE communication overhead, exploding KV‑Cache size, and tool‑call accuracy—and presents a series of engineering solutions including hierarchical KV‑Cache pooling, sequence parallelism, offloading strategies, and chip‑level adaptations to achieve higher throughput and lower token costs.

AI Infra · Agent · DeepSeek

0 likes · 15 min read

DataFunTalk

May 26, 2026 · Industry Insights

Why DeepSeek’s Permanent Price Cut Aims at a $10 Trillion AI Market

DeepSeek’s 75% permanent API price reduction is analyzed as a strategic move to shrink KV‑cache memory, lower hardware dependence, trigger a demand surge, reshape the AI hardware ecosystem, and capture an estimated $10 trillion market opportunity.

AI hardware · AI infrastructure · AI pricing

0 likes · 13 min read

Architect

May 25, 2026 · Artificial Intelligence

From KV Cache to Harness: How DeepSeek Is Shifting Costs to the System Layer

DeepSeek’s recent V4 release shows that as model inference becomes cheaper, the dominant expenses are moving to system‑level components such as KV cache, memory, storage, compilers, scheduling, hardware adapters, and the emerging Agent Harness layer, reshaping AI infrastructure economics.

AI infrastructure · Agent Harness · DeepSeek

0 likes · 23 min read

Old Zhang's AI Learning

May 23, 2026 · Artificial Intelligence

The Underrated Lifesaving Template for Qwen Local Deployment

This article analyzes the hidden pitfalls of Qwen's official Jinja chat template, explains how the community‑maintained Qwen‑Fixed‑Chat‑Templates v19 fixes rendering errors, KV‑Cache loss, token waste and agent dead‑locks, and provides step‑by‑step installation instructions for LM Studio, llama.cpp, vLLM and MLX.

Agent Loop · Chat Template · KV Cache

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys recent open‑weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4—detailing how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections dramatically reduce memory and compute for ultra‑long contexts while preserving model quality.

Attention optimization · KV Cache · LLM

0 likes · 25 min read

Old Zhang's AI Learning

May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20 · KV Cache · Speculative Decoding

0 likes · 12 min read

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Flash Attention · KV Cache · Mixture of Experts

0 likes · 10 min read

AI Waka

May 12, 2026 · Artificial Intelligence

Is 3‑Bit KV Cache the Ultimate Solution? An In‑Depth Evaluation of Google’s TurboQuant

Through ten experiments on three LLMs, this study measures TurboQuant’s 3‑bit KV‑cache compression, revealing that while quality remains strong, speed gains vary by model, memory savings depend on implementation, and attention‑entropy analysis explains why 2‑bit compression degrades performance.

Attention Entropy · Inference Performance · KV Cache

0 likes · 14 min read

Node.js Tech Stack

May 9, 2026 · Artificial Intelligence

Redis Founder Crafts DeepSeek V4 AI Inference Engine, Node.js Star Applauds

Redis creator Salvatore Sanfilippo (antirez) released DS4, a Metal‑only C inference engine tailored for DeepSeek V4 Flash on high‑end Macs, featuring narrow model focus, 2‑bit quantization, disk‑based KV cache, benchmark speeds around 26 tokens/s, and a dual OpenAI/Anthropic compatible server.

2-bit quantization · AI inference engine · DeepSeek V4

0 likes · 13 min read

Lao Guo's Learning Space

May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large‑language‑model inference is bottlenecked by memory‑bandwidth, then details Google’s Gemma 4 MTP technique—using a small draft model with speculative decoding and shared KV‑Cache—to parallelize token prediction, achieving up to three‑fold speed gains without any loss in output quality, and provides step‑by‑step local deployment instructions.

Gemma 4 · Inference Optimization · KV Cache

0 likes · 11 min read

AI Tech Publishing

Apr 29, 2026 · Artificial Intelligence

Why Do AI Agents Forget and Hallucinate? A Complete Guide to KV‑Cache Memory Mechanisms

The article explains that AI agents’ forgetting and hallucinations stem from token‑level attention scores causing key‑value cache eviction before retrieval, then surveys KV‑cache basics, naive growth, StreamingLLM windowing, SnapKV’s attention‑guided compression, token‑retention studies, and Memory Sparse Attention, compares these methods, and discusses practical system pitfalls and design implications.

AI agents · KV Cache · Memory Sparse Attention

0 likes · 20 min read

Old Zhang's AI Learning

Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4’s local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make high‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek V4 · GPU Memory · KV Cache

0 likes · 15 min read

Architect

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: 1M‑Token Context’s Impact on Model, Inference, Cache & Agents

The DeepSeek V4 technical report shows how a 1 million‑token context forces a redesign of attention, KV‑cache, optimizer, quantization and inference budgeting, turning long‑context capability from a costly showcase into a production‑ready feature for agents, search and Chinese professional tasks.

1M context · Attention optimization · DeepSeek

0 likes · 28 min read

AI Tech Publishing

Apr 20, 2026 · Artificial Intelligence

How Claude Code Achieves 92% Prompt Cache Hit Rate and Cuts Costs by 81% – A Deep Dive

This article explains the mechanics of prompt‑caching for large language models, breaks down static versus dynamic context, details KV‑cache operation and its pricing, and shows how Claude Code’s 30‑minute programming session reached a 92% cache hit rate that reduced inference costs by 81%, concluding with three production‑grade design rules.

AI agents · Anthropic API · Claude Code

0 likes · 13 min read

Geek Labs

Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Flash Attention · Inference Optimization · KV Cache

0 likes · 5 min read

Old Zhang's AI Learning

Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.AI short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the per‑step computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV Cache · LLM inference · Performance Optimization

0 likes · 13 min read

AI Programming Lab

Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV Cache · LLM pricing · Large Language Model

0 likes · 13 min read

AI Tech Publishing

Apr 5, 2026 · Artificial Intelligence

Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.

Inference Optimization · KV Cache · LLM

0 likes · 7 min read

AI Step-by-Step

Apr 5, 2026 · Artificial Intelligence

How Context Engineering Powers Dynamic Business Data Assembly for LLM Agents

The article explains why relying solely on handcrafted prompts leads to hallucinations in LLM agents and presents six concrete context‑engineering practices—XML isolation, hierarchical ordering, KV caching, vector reranking, async memory compression, and minimal few‑shot examples—illustrated with a full e‑commerce refund‑handling case study.

Agent · Context Engineering · KV Cache

0 likes · 10 min read

DeepHub IMBA

Apr 4, 2026 · Artificial Intelligence

Building Mini-vLLM from Scratch: KV‑Cache, Dynamic Batching, and Distributed Inference

This article walks through constructing Mini-vLLM, a from‑scratch LLM inference engine that tackles the O(N²) attention cost with KV‑cache, boosts throughput via dynamic batching, adds observability with Prometheus/Grafana, supports gRPC, and scales across multiple workers, with benchmark numbers demonstrating its CPU‑only performance.

Docker · Dynamic Batching · Inference Engine

0 likes · 12 min read

ShiZhen AI

Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV Cache · LLM inference · Memory Optimization

0 likes · 9 min read

IT Services Circle

Mar 31, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts KV‑Cache Memory by 83% and Boosts LLM Speed

Google’s newly released TurboQuant algorithm compresses KV‑Cache from 16‑bit to 3‑bit, slashing memory usage to one‑sixth with no loss of accuracy, dramatically accelerating large‑language‑model inference on GPUs and reshaping the memory market.

AI inference · Google Research · KV Cache

0 likes · 7 min read

ShiZhen AI

Mar 31, 2026 · Artificial Intelligence

Google's TurboQuant Paper Triggers Storage Stock Drops, Community Implements It in 48 Hours

Google's TurboQuant paper shows KV cache compression up to 6.4× with minimal quality loss, causing DRAM and SSD stocks to tumble, while the open‑source community reproduces the method in under two days and Anthropic and OpenAI add powerful developer‑focused AI features.

AI toolchain · Claude Code · KV Cache

0 likes · 9 min read

Old Zhang's AI Learning

Mar 28, 2026 · Artificial Intelligence

vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

The article reviews how the leading LLM inference frameworks—oMLX, mlx‑vlm, llama.cpp, and vLLM—are integrating Google’s TurboQuant compression, showing up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and detailed integration steps for each project.

KV Cache · LLM inference · TurboQuant

0 likes · 8 min read

Shi's AI Notebook

Mar 27, 2026 · Artificial Intelligence

Decoding Prompt Caching: From PagedAttention Mechanics to Cost‑Saving Practices

The article explains how Prompt Caching leverages vLLM's PagedAttention and block‑level hashing to reuse KV cache across identical prefixes, dramatically cutting LLM inference latency and cost, and provides concrete engineering tips for maximizing cache hit rates.

Hashing · KV Cache · LLM inference

0 likes · 7 min read

SuanNi

Mar 26, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Cache Compression With Zero Accuracy Loss

TurboQuant, a new technique from Google Research, dramatically compresses key‑value caches by up to six times without precision loss, using PolarQuant and QJL algorithms to transform vectors into polar coordinates and apply quantized Johnson‑Lindenstrauss transforms, thereby boosting inference speed and enabling longer context handling for large language models.

AI compression · KV Cache · Performance

0 likes · 13 min read

Old Zhang's AI Learning

Mar 26, 2026 · Artificial Intelligence

Google’s TurboQuant Cuts KV‑Cache Memory 8× and Boosts LLM Inference Speed

Google’s TurboQuant reduces KV‑Cache memory by up to 4.6×, speeds 3‑bit attention computation up to 8× on H100, and delivers near‑zero accuracy loss across long‑context benchmarks, with open‑source implementations for Metal, vLLM and llama.cpp.

Google · Inference Optimization · KV Cache

0 likes · 10 min read

Shi's AI Notebook

Mar 16, 2026 · Artificial Intelligence

What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion

This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.

KV Cache · Transformer · attention

0 likes · 21 min read

MaGe Linux Operations

Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, provides a step‑by‑step diagnosis workflow (from static usage and KV‑Cache analysis to concurrency and K8s scheduling), and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU Memory · KV Cache · LLM OOM

0 likes · 28 min read

AI Explorer

Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

Docker · KV Cache · LLM inference

0 likes · 6 min read

DeepHub IMBA

Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV Cache · LLM inference · PagedAttention

0 likes · 16 min read

Machine Learning Algorithms & Natural Language Processing

Feb 28, 2026 · Artificial Intelligence

How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4

The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.

DeepSeek · DualPath · KV Cache

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Feb 27, 2026 · Artificial Intelligence

Can DeepSeek’s DualPath Break GPU Bottlenecks and Ignite an Agentic AI Surge?

DeepSeek’s new DualPath inference framework, co‑developed with leading Chinese universities, decouples compute from KV‑Cache memory access to eliminate I/O stalls in multi‑round agentic workloads, delivering up to nearly 2× higher throughput and dramatically reducing job‑completion time across several large‑scale LLMs.

AI infrastructure · Agentic Inference · DeepSeek

0 likes · 13 min read

Alibaba Cloud Developer

Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

Inference Optimization · KV Cache · LLM

0 likes · 15 min read

PaperAgent

Jan 21, 2026 · Artificial Intelligence

Inside DeepSeek’s FlashMLA Update: What’s New in the MODEL1 Architecture

DeepSeek’s recent FlashMLA update introduces the new MODEL1, featuring a tighter KV-Cache layout, an extra two-stage cache, and a fixed 512×512 head dimension, with four code changes detailed in a public GitHub commit and illustrated by comparative diagrams.

AI Architecture · DeepSeek · FlashMLA

0 likes · 3 min read

AI Frontier Lectures

Jan 5, 2026 · Artificial Intelligence

Why WeDLM Outpaces AR Models: Diffusion Decoding Meets KV Cache for 10× Faster Inference

Tencent WeChat AI introduces WeDLM, a diffusion language model that works with standard causal attention and KV caching, achieving up to ten‑fold speedups over autoregressive models while maintaining or improving generation quality across math reasoning and open‑ended tasks.

Diffusion Language Model · KV Cache · WeDLM

0 likes · 8 min read

AI2ML AI to Machine Learning

Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Draft-Target Model · Inference Acceleration · KV Cache

0 likes · 8 min read

Alibaba Cloud Infrastructure

Dec 22, 2025 · Artificial Intelligence

Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE

This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.

Alibaba Cloud · KV Cache · Kubernetes

0 likes · 18 min read

AI2ML AI to Machine Learning

Dec 21, 2025 · Artificial Intelligence

Why KV Caching Is Critical for Efficient LLM Inference

The article breaks down the principles of KV caching in large language models, explaining how Q/K/V behavior differs between training and inference, the role of prompts, cache size trade‑offs, and the complexities of concurrent inference, all backed by concrete examples and references.

Concurrent Inference · KV Cache · LLM inference

0 likes · 7 min read

ShiZhen AI

Dec 4, 2025 · Artificial Intelligence

What Is a Context Window? Explaining LLM Memory Capacity

The article explains that a context window defines an LLM's token‑level memory capacity, shows how longer windows cause quadratic computation growth, introduces KV Cache as a way to extend context without exploding resources, and covers advanced techniques like Ring Attention, NIAH benchmarking, and attention decay in long sequences.

KV Cache · LLM · NIAH benchmark

0 likes · 6 min read

Fighter's World

Nov 7, 2025 · Industry Insights

Is AI Triggering a Global Memory Shortage? Inside the Emerging Memory Supercycle

The article analyzes how generative AI workloads are reshaping the storage market into a multi‑year Memory Supercycle, detailing demand spikes from Mid‑Training checkpoints, synthetic data, KV‑Cache offload and multimodal video models, while supply is strained by HBM production and geopolitical factors.

AI · HBM · KV Cache

0 likes · 26 min read

AI2ML AI to Machine Learning

Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

KV Cache · LLM · MQA

0 likes · 9 min read

Data Party THU

Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

Efficient Attention · KV Cache · RoPE

0 likes · 11 min read

Architects' Tech Alliance

Sep 30, 2025 · Artificial Intelligence

How KV Cache and CachedAttention Revolutionize LLM Inference Efficiency

This article explains how key‑value (KV) caching and the new CachedAttention technique dramatically reduce large‑language‑model inference costs by reusing stored attention data across dialogue turns, leveraging a three‑tier memory hierarchy of HBM, DRAM, and SSD to overcome bandwidth and capacity bottlenecks.

AI Performance · CachedAttention · KV Cache

0 likes · 8 min read

Baobao Algorithm Notes

Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU Memory · KV Cache · LLM

0 likes · 18 min read

Data Party THU

Sep 3, 2025 · Artificial Intelligence

Unlocking Large Model Secrets: Transformers, MoE, Fine‑Tuning, RAG & KV Caching

This article provides a comprehensive technical overview of today’s large‑model ecosystem, covering the Transformer architecture, Mixture‑of‑Experts extensions, five fine‑tuning methods, the evolution from traditional RAG to agentic RAG, classic agent design patterns, diverse text‑chunking strategies, and the KV‑cache optimization that accelerates inference.

Fine‑tuning · KV Cache · Mixture of Experts

0 likes · 13 min read

Wu Shixiong's Large Model Academy

Aug 20, 2025 · Artificial Intelligence

Mastering Large‑Model Interview Questions: MHA, KV‑Cache, Scaled Dot‑Product, and Speculative Decoding

This guide walks through common large‑model interview challenges, including a hands‑on implementation of multi‑head attention with KV‑cache, the mathematical reason for scaling by sqrt(dₖ), a concise speculative decoding algorithm, and systematic debugging steps for NaN loss during training.

KV Cache · Large Model Interview · Multi‑Head Attention

0 likes · 14 min read

DataFunTalk

Jul 19, 2025 · Artificial Intelligence

Mastering Context Engineering for AI Agents: KV-Cache, Tool Management, and Error Handling

Peak, co‑founder of Manus, shares practical lessons on building AI agents through context engineering, emphasizing KV‑cache optimization, stable prompt prefixes, controlled tool selection, file‑system memory, attention‑directed todo lists, and preserving error traces to improve robustness and scalability.

AI agents · Context Engineering · Error Handling

0 likes · 14 min read

Tencent Cloud Developer

Jul 2, 2025 · Artificial Intelligence

Big Model Evolution: From Transformers to Enterprise Deployment

This article surveys the rapid evolution of large language models from the Transformer breakthrough to trillion‑parameter capabilities, explains key techniques such as self‑attention, MoE and KV‑Cache, explores practical aspects like temperature tuning, sales AI applications, and compares private versus cloud deployment strategies for enterprises.

Enterprise Deployment · KV Cache · Temperature

0 likes · 6 min read

Baidu Geek Talk

May 19, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built a 4 µs end‑to‑end low‑latency HPN cluster, optimized traffic management, adaptive routing, and custom Alltoall operators, resulting in throughput gains of up to 20% and reduced latency for both Prefill and Decode stages.

AI inference · Alltoall optimization · HPN

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

May 16, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference

Baidu Intelligent Cloud built a 4µs end-to-end low‑latency HPN cluster, optimized traffic management and communication operators, and introduced dynamic expert balancing to dramatically improve the performance of large‑scale PD‑separated inference services, showcasing the deep integration of network infrastructure with AI workloads.

AI inference · All-to-All · HPN

0 likes · 14 min read

Baobao Algorithm Notes

May 13, 2025 · Artificial Intelligence

Why Decoder‑Only Models Dominate AI Today: Beyond the Low‑Rank Myth

The article explains why the once‑popular low‑rank argument is outdated and how decoder‑only architectures have become mainstream thanks to KV‑cache efficiency, open‑source projects like vLLM and SGLang, and their impact on modern AI interview expectations.

KV Cache · decoder-only · open-source

0 likes · 3 min read

AI Algorithm Path

May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServe · FlexGen · Inference Optimization

0 likes · 18 min read

AIWalker

Apr 2, 2025 · Artificial Intelligence

EasyControl: Plug‑and‑Play DiT Control with Arbitrary Aspect Ratios and Accelerated Inference

EasyControl introduces a lightweight condition‑injection LoRA module, a position‑aware training paradigm, and causal attention with KV‑cache to enable plug‑and‑play multi‑condition control for DiT models, supporting arbitrary image resolutions while cutting inference latency by up to 30% and preserving high‑quality generation.

Conditional Generation · DiT · EasyControl

0 likes · 17 min read

NewBeeNLP

Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines techniques for compressing and accelerating the KV cache in transformer models, including MQA, GQA, MLA, sliding‑window and linear attention, FlashAttention, paged and ring attention, as well as mixed‑precision training and ZeRO parallelism, and provides code snippets, implementation details, and practical trade‑offs.

FlashAttention · KV Cache · Model Parallelism

0 likes · 17 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · KV Cache · LLM training

0 likes · 32 min read

Alibaba Cloud Developer

Feb 20, 2024 · Artificial Intelligence

Boost LLM Inference Speed with KV‑Cache Reuse and Speculative Sampling

This article explains two production‑grade optimization techniques for large language model inference—KV‑cache reuse across multi‑turn dialogues and speculative sampling with a small draft model—detailing their design, implementation, and performance impact.

AI · Inference Optimization · KV Cache

0 likes · 14 min read
