Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques
This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.
The presentation introduces the Ant Group team's focus on large‑model inference optimization, especially memory (VRAM) consumption, which is a critical bottleneck for models such as Llama‑65B. It outlines the fundamental memory demand formula, the impact of model precision, head count, and context length, and shows why current hardware growth cannot keep up with model size.
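The memory demand described above can be made concrete with a small estimator. The sketch below computes the KV‑cache footprint from the usual factors (layer count, head count, head dimension, context length, batch size, and dtype width); the LLaMA‑65B‑like configuration in the example (80 layers, 64 heads, head dimension 128, fp16) is illustrative, not a figure quoted from the talk.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """Estimate KV-cache size: 2 tensors (K and V) per layer,
    each of logical shape [batch, heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes)

# Illustrative LLaMA-65B-like config: 80 layers, 64 heads,
# head_dim 128, fp16 (2 bytes). Assumed for illustration only.
size = kv_cache_bytes(80, 64, 128, seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # -> 80.0 GiB
```

Even at a modest batch size, the KV cache alone can exceed a single accelerator's memory, which is the gap the techniques below target.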
To address these challenges, the team leverages NVIDIA's Virtual Memory Management (VMM) API introduced in CUDA 10.2. Instead of a single cudaMalloc call, VMM allows a two‑step allocation: first reserve a large virtual address space, then map physical memory chunks (typically 2 MiB) on demand via handles. This enables fine‑grained control, fragmentation reduction, and on‑the‑fly memory release.
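The two‑step reserve‑then‑map flow can be illustrated with a toy model. The pure‑Python sketch below stands in for the CUDA driver calls (`cuMemAddressReserve` for the virtual reservation, `cuMemCreate`/`cuMemMap` for on‑demand 2 MiB chunks, `cuMemUnmap`/`cuMemRelease` for shrinking); it only models the bookkeeping, not real GPU memory.

```python
import math

CHUNK = 2 * 1024 * 1024  # 2 MiB physical granularity, as in the VMM API

class VirtualRegion:
    """Toy model of CUDA VMM: reserve a large virtual range up front,
    then map 2 MiB physical chunks on demand. No real CUDA calls."""
    def __init__(self, virtual_size):
        self.virtual_size = virtual_size  # reserved address space (fixed)
        self.mapped = []                  # stand-ins for allocation handles

    def grow(self, needed_bytes):
        # Map just enough 2 MiB chunks to cover the new high-water mark.
        while len(self.mapped) * CHUNK < needed_bytes:
            if (len(self.mapped) + 1) * CHUNK > self.virtual_size:
                raise MemoryError("virtual reservation exhausted")
            self.mapped.append(object())  # real impl: cuMemCreate + cuMemMap

    def shrink(self, keep_bytes):
        # Unmap trailing chunks to return physical memory while the
        # virtual addresses (and any pointers into them) stay valid.
        del self.mapped[math.ceil(keep_bytes / CHUNK):]

region = VirtualRegion(virtual_size=1 << 30)  # reserve 1 GiB of addresses
region.grow(5 * 1024 * 1024)                  # needs 5 MiB -> 3 chunks
print(len(region.mapped))                     # -> 3
region.shrink(2 * 1024 * 1024)                # release back to 1 chunk
```

Because the virtual reservation never moves, a tensor can grow and shrink without the copy‑and‑repoint dance that a plain `cudaMalloc`/`cudaFree` cycle would require.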
The core contribution, Virtual Tensor, wraps the VMM workflow in a PyTorch‑compatible tensor that automatically handles KV‑cache allocation, pooling, pre‑allocation, and asynchronous map/release operations. It decouples attention‑kernel computation from memory management, allowing developers to integrate new kernels (e.g., FlashAttention 3) with only a few lines of code.
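The decoupling can be sketched as follows. This is a hypothetical interface, not the released Virtual Tensor code: the kernel only ever sees a contiguous logical view, while chunk mapping happens invisibly inside `append()`.

```python
class VirtualKVTensor:
    """Sketch of the Virtual Tensor idea (hypothetical API): the
    attention kernel sees one contiguous logical tensor, while the
    mapping of physical chunks happens behind append()."""
    CHUNK_TOKENS = 16  # tokens per mapped chunk (toy value)

    def __init__(self):
        self._tokens = []         # logical KV entries
        self._chunks_mapped = 0   # physical chunks currently mapped

    def append(self, kv_entry):
        self._tokens.append(kv_entry)
        needed = -(-len(self._tokens) // self.CHUNK_TOKENS)  # ceil div
        if needed > self._chunks_mapped:
            self._chunks_mapped = needed  # real impl: async map here

    def view(self):
        # A kernel (e.g. FlashAttention) consumes only this contiguous
        # view, so swapping kernels touches no memory-management code.
        return list(self._tokens)

kv = VirtualKVTensor()
for t in range(20):
    kv.append((f"k{t}", f"v{t}"))
print(kv._chunks_mapped)  # 20 tokens at 16/chunk -> 2 chunks
```

Contrast this with PagedAttention, where the kernel itself must translate block tables; here the translation lives entirely on the memory‑management side.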
Building on Virtual Tensor, the LayerKV project targets the prefill stage where the first token generation often stalls due to full‑memory checks. By allocating KV cache layer‑by‑layer and offloading completed layers asynchronously, the system hides memory transfer latency and dramatically lowers first‑token delay while keeping overall QPS stable.
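The overlap of per‑layer compute with asynchronous offload can be sketched with a background worker thread. This is a simulation of the scheduling idea, not the LayerKV implementation: layer `i+1` is computed while layer `i`'s finished KV cache is being offloaded.

```python
import queue
import threading

def prefill_with_layer_offload(num_layers, compute, offload):
    """Sketch of LayerKV-style scheduling (not the released code):
    compute the KV cache one layer at a time and offload each finished
    layer on a background thread, hiding transfer behind compute."""
    q = queue.Queue()
    offloaded = []

    def offloader():
        while True:
            layer = q.get()
            if layer is None:       # sentinel: prefill finished
                break
            offload(layer)          # e.g. async copy of layer KV to host
            offloaded.append(layer)

    worker = threading.Thread(target=offloader)
    worker.start()
    for layer in range(num_layers):
        compute(layer)              # prefill attention for this layer
        q.put(layer)                # hand the finished KV to the offloader
    q.put(None)
    worker.join()
    return offloaded

order = prefill_with_layer_offload(4, compute=lambda l: None,
                                   offload=lambda l: None)
print(order)  # -> [0, 1, 2, 3]
```

Because only the layer currently being computed needs resident KV memory, the scheduler no longer has to wait for a full‑model memory check before emitting the first token.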
Experimental results on various hardware (H100, A100, L20, H20) and model sizes (7B, 30B, 70B) show up to 2× speed‑up over vLLM's built‑in FlashAttention 2, 8× improvement in first‑token latency under high load, and up to 60× acceleration in extreme scenarios. The optimizations also limit QPS degradation to less than 5% for large models.
Compared with existing solutions such as DistServe, the proposed approach provides finer‑grained KV‑cache reservation, dynamic offloading, and adaptive scheduling, resulting in more flexible memory usage and better trade‑offs between latency and throughput.
Open‑source components include the partially released Virtual Tensor code and the GMLake training‑side memory manager. The LayerKV implementation will be released once the accompanying paper is published.
DataFunSummit