PyTorch GPU Memory Profiling: Checkpointing, Mixed Precision, Optimizer Choice
The article explains the seven sources of GPU memory usage during PyTorch training, shows how to measure them with built‑in profiling APIs and the memory‑viz tool, and evaluates three effective optimizations—gradient checkpointing, mixed‑precision training, and optimizer selection—detailing their memory savings and performance costs.
GPU memory consumption factors
Model parameters – the weight tensors.
Gradients – one tensor per parameter, same size as the parameters.
Optimizer state – Adam stores two extra tensors (m and v) per parameter.
Activations – outputs of each layer that must be kept for back‑propagation.
Input batches – data loaded onto the GPU.
CUDA workspace – temporary kernel buffers and cuDNN caches.
Memory fragmentation – allocated blocks that cannot be reused because of gaps.
For a 200 million‑parameter fp32 model trained with Adam, the memory breakdown is roughly:
Parameters: 800 MB
Gradients: 800 MB (same as parameters)
Adam state (m + v): 1 600 MB (2 × parameters)
Activations: 2–10 × parameters (highly variable)
Input batches: depends on batch size
CUDA workspace: 500 MB – 1 GB
Fragmentation: 5 % – 20 % of total memory
Thus a model whose weights occupy only 800 MB can take 5–8 GB of GPU memory in practice.
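The breakdown above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the function name, its parameters, and the choice to treat fragmentation as a fraction of the final total are assumptions made for illustration, and input batches are left out because they depend on batch size.

```python
def estimate_training_memory_mb(
    num_params: int,
    bytes_per_param: int = 4,            # fp32
    optimizer_state_tensors: int = 2,    # Adam keeps m and v
    activation_multiplier: float = 2.0,  # activations as a multiple of parameter size
    workspace_mb: float = 500.0,         # CUDA workspace
    fragmentation_fraction: float = 0.05,
) -> dict:
    """Rough estimate in decimal MB (1 MB = 10**6 bytes), as in the list above."""
    if not 0.0 <= fragmentation_fraction < 1.0:
        raise ValueError("fragmentation_fraction must be in [0, 1)")
    params_mb = num_params * bytes_per_param / 1e6
    grads_mb = params_mb                              # one gradient per parameter
    optimizer_mb = optimizer_state_tensors * params_mb
    activations_mb = activation_multiplier * params_mb
    subtotal_mb = params_mb + grads_mb + optimizer_mb + activations_mb + workspace_mb
    # Fragmentation is modelled as a share of the *total*: total = subtotal / (1 - f)
    total_mb = subtotal_mb / (1.0 - fragmentation_fraction)
    return {
        "params_mb": params_mb,
        "grads_mb": grads_mb,
        "optimizer_mb": optimizer_mb,
        "activations_mb": activations_mb,
        "subtotal_mb": subtotal_mb,
        "fragmentation_mb": total_mb - subtotal_mb,
        "total_mb": total_mb,
    }


# Low end of every range in the list above
low = estimate_training_memory_mb(200_000_000)
# A middle-of-the-range scenario (values picked for illustration)
mid = estimate_training_memory_mb(
    200_000_000,
    activation_multiplier=4.0,
    workspace_mb=750.0,
    fragmentation_fraction=0.10,
)
print(f"low: {low['total_mb'] / 1000:.1f} GB, mid: {mid['total_mb'] / 1000:.1f} GB")
```

Under these assumptions the low-end scenario lands near 5.6 GB and the mid-range one near 7.9 GB; activation-heavy models at the top of the listed ranges would need more.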
Measuring actual usage
PyTorch provides precise memory‑visibility utilities. The key metrics are:
import torch
# GPU memory actually allocated for tensors (GB)
allocated = torch.cuda.memory_allocated() / 1024**3
# GPU memory reserved by the allocator, including unused portions (GB)
reserved = torch.cuda.memory_reserved() / 1024**3
# Peak allocated memory since the last reset (GB)
peak = torch.cuda.max_memory_allocated() / 1024**3
# Reset the peak‑memory counter
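torch.cuda.reset_peak_memory_stats()
The difference reserved - allocated is memory the caching allocator holds but no live tensor currently uses: cached free blocks, some of which may be unusable because of fragmentation. For example, if allocated is 5 GB and reserved is 8 GB, 3 GB are reserved but not backing any tensor.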
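The 5 GB versus 8 GB example can be turned into a small helper. This is a minimal sketch that works on plain byte counts, so it runs without a GPU; the name allocator_slack is hypothetical, and in a real training script the two inputs would come from torch.cuda.memory_allocated() and torch.cuda.memory_reserved().

```python
GIB = 1024**3


def allocator_slack(allocated_bytes: int, reserved_bytes: int):
    """Return (slack in GiB, slack as a fraction of reserved memory)."""
    if reserved_bytes < allocated_bytes:
        raise ValueError("reserved memory cannot be smaller than allocated memory")
    slack_bytes = reserved_bytes - allocated_bytes
    fraction = slack_bytes / reserved_bytes if reserved_bytes else 0.0
    return slack_bytes / GIB, fraction


# The example from the text: 5 GB allocated, 8 GB reserved
slack_gib, slack_fraction = allocator_slack(5 * GIB, 8 * GIB)
print(f"{slack_gib:.1f} GiB reserved but unused ({slack_fraction:.1%} of reserved)")
```

A persistently large fraction is a hint to look at allocation patterns in the snapshot viewer rather than proof of fragmentation on its own.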
Printing the full allocator summary shows current and peak usage for each category (allocated, active, and reserved memory), split by the large and small allocation pools:
print(torch.cuda.memory_summary())
Memory‑history visualization
PyTorch can record every allocation and dump a snapshot:
torch.cuda.memory._record_memory_history(max_entries=100_000)
# Run one training step
output = model(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()
# Save the snapshot
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Disable further recording
torch.cuda.memory._record_memory_history(enabled=None)
Upload the generated memory_snapshot.pickle to https://pytorch.org/memory_viz to view an interactive UI that shows each allocation, each release, and the full call stack that triggered it.
Optimization techniques
1. Gradient checkpointing (compute‑for‑memory)
Activations are usually the largest memory consumer. Gradient checkpointing recomputes activations during the backward pass instead of storing them.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
class MyBlock(nn.Module):
    def forward(self, x):
        # Do not keep the intermediates of _forward; recompute them during backward
        return checkpoint(self._forward, x, use_reentrant=False)
    def _forward(self, x):
        # Expensive computation here
        return x
Typical savings: 40 %–60 % reduction in activation memory, at the cost of a 20 %–30 % slower backward pass.
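To see where savings of this size can come from, here is a back-of-the-envelope model rather than a measurement. It assumes a chain of layers whose outputs are all the same size, split into equal checkpointed segments: only each segment's input is kept, and one segment's activations are recomputed at a time during backward. The function name and the 24-layer example are hypothetical.

```python
def peak_stored_activations(num_layers: int, num_segments: int = 0) -> int:
    """Peak number of layer outputs held at once under the simplified model.

    num_segments == 0 means no checkpointing: every layer output is stored.
    Otherwise one input per segment is stored, plus the activations of the
    largest segment while that segment is being recomputed for backward.
    """
    if num_segments == 0:
        return num_layers
    if not 1 <= num_segments <= num_layers:
        raise ValueError("num_segments must be between 1 and num_layers")
    largest_segment = -(-num_layers // num_segments)  # ceiling division
    return num_segments + largest_segment


baseline = peak_stored_activations(24)
for segments in (2, 4):
    peak = peak_stored_activations(24, segments)
    print(f"{segments} segments: {peak} stored vs {baseline} "
          f"({1 - peak / baseline:.0%} less)")
```

In this model, checkpointing every single layer as its own segment saves nothing, which is one reason checkpointing is usually applied to larger blocks.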
2. Mixed‑precision training
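import torch
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')
optimizer.zero_grad(set_to_none=True)
# Run the forward pass and the loss under autocast so eligible ops use fp16
with autocast('cuda', dtype=torch.float16):
    output = model(x)
    loss = criterion(output, y)
# Scale the loss so small fp16 gradients do not underflow
scaler.scale(loss).backward()
# Unscale the gradients; the step is skipped if they contain inf or NaN
scaler.step(optimizer)
scaler.update()
Under autocast, eligible operations and the activations they save for backward use fp16 (2 bytes per value), while parameters, their gradients, and optimizer state stay in fp32 for stability. Typical savings: 30 %–50 % total memory reduction, and fp16 operations often run faster on modern GPUs.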
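How much of the total is saved depends on how large activations are relative to everything else. The sketch below works through that arithmetic for the 200 million-parameter example, assuming that only activations shrink to 2 bytes per value while weights, weight gradients, and Adam state stay in fp32; the function name and the activation sizes are illustrative.

```python
def total_memory_mb(
    params_mb: float,
    activations_fp32_mb: float,
    optimizer_state_tensors: int = 2,  # Adam: m and v
    activation_bytes: int = 4,         # 4 = fp32 activations, 2 = fp16 under autocast
) -> float:
    """Weights + gradients + optimizer state + activations, in MB."""
    static_mb = params_mb * (1 + 1 + optimizer_state_tensors)  # all kept in fp32
    activations_mb = activations_fp32_mb * activation_bytes / 4
    return static_mb + activations_mb


params_mb = 800.0  # 200 million fp32 parameters
for multiplier in (2, 5, 10):  # activations as a multiple of parameter size
    acts = multiplier * params_mb
    full = total_memory_mb(params_mb, acts)
    mixed = total_memory_mb(params_mb, acts, activation_bytes=2)
    print(f"activations {multiplier}x params: {full:.0f} MB -> {mixed:.0f} MB "
          f"({1 - mixed / full:.0%} saved)")
```

Under these assumptions the saving grows with the activation share and can never exceed 50 %, so the upper part of the quoted range applies to activation-dominated workloads.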
3. Optimizer choice
Adam stores two extra tensors per parameter; for a 1‑billion‑parameter fp32 model the optimizer state alone consumes ~8 GB.
SGD with momentum: one extra tensor per parameter (half the Adam overhead).
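8‑bit AdamW (bnb.optim.AdamW8bit from bitsandbytes): stores optimizer state in 8 bits per value instead of 32, cutting optimizer‑state memory by roughly a factor of four with negligible accuracy loss.
Lion: keeps a single momentum tensor per parameter, so its memory is comparable to SGD with momentum; convergence is similar to Adam.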
For models exceeding one billion parameters, optimizer selection can be the deciding factor for whether training fits on the available hardware.
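The comparison above comes down to the number of state tensors per parameter times the bytes per value. The following sketch tabulates that for one billion parameters; the table entries are simplifications (for the 8-bit case it ignores the small extra constants that quantization needs), and the names are chosen for illustration.

```python
# (state tensors per parameter, bytes per stored value)
OPTIMIZER_STATE = {
    "Adam (fp32 state)": (2, 4),
    "SGD with momentum (fp32 state)": (1, 4),
    "AdamW 8-bit": (2, 1),
    "Lion (fp32 state)": (1, 4),
}


def optimizer_state_gb(num_params: int, state_tensors: int, bytes_per_value: int) -> float:
    """Optimizer-state size in decimal GB (1 GB = 10**9 bytes)."""
    return num_params * state_tensors * bytes_per_value / 1e9


num_params = 1_000_000_000
state_gb = {
    name: optimizer_state_gb(num_params, tensors, nbytes)
    for name, (tensors, nbytes) in OPTIMIZER_STATE.items()
}
for name, gb in state_gb.items():
    print(f"{name:32s} {gb:4.1f} GB")
```

These figures cover optimizer state only; weights and gradients add another 4 GB each for an fp32 model of this size.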
Conclusion
Measuring GPU memory with these PyTorch utilities shows where it actually goes; applying the optimizations above can then cut usage by 30 %–60 %, which allows larger batch sizes and, through them, faster training and lower‑variance gradient estimates.
DeepHub IMBA
