HyperOffload: A New Storage Paradigm Aiming to Break the AI Memory Wall

HyperOffload, a joint effort by Shanghai Jiao Tong University and Huawei’s MindSpore team, proposes a dynamic tensor offloading system that moves data between GPU memory, CPU RAM, and SSDs, aiming to overcome the “memory wall” that limits trillion‑parameter AI model training and deployment.

AI infrastructure · AI memory wall · GPU Memory Management

6 min read

AI2ML AI to Machine Learning

Dec 22, 2025 · Artificial Intelligence

The Core Ideas Behind Paged Attention for KV‑Caching

This article explains how Paged Attention, introduced by the vLLM team, applies virtual‑memory techniques, non‑contiguous block mapping, copy‑on‑write reuse, distributed scheduling, and hardware‑level optimizations to improve KV‑cache efficiency and reduce memory fragmentation in large language model serving.

Copy-on-Write · Distributed Scheduling · GPU Memory Management

6 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Large Models · Megatron

16 min read

Infra Learning Club

Nov 1, 2024 · Artificial Intelligence

Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

GPU Memory Management · cpu_offload_gb · large language model inference

9 min read
