Tagged articles
5 articles
Infra Learning Club
Apr 27, 2025 · Cloud Native

Why containerd 2.x Fails to Find nvidia-smi with GPU-Operator and How to Fix It

When a Kubernetes cluster is deployed with kubespray and the NVIDIA runtime, containerd 2.x reports "nvidia-smi not found". The cause is that the go-toml v2 parser treats the "binaryName" key differently, so the wrong runtime wrapper is used. The article walks through inspecting the configuration, comparing versions, code demonstrations, and practical workarounds.

Runtime · containerd · go-toml
8 min read
Infra Learning Club
Mar 9, 2025 · Cloud Native

How to Fix nvidia-smi Missing GPU Process Info Inside Containers

The article explains why nvidia-smi cannot display GPU processes when run inside a container, tracing the problem to PID-namespace isolation and kernel-level restrictions. It offers three practical solutions (using hostPID, custom kernel interception modules, and the nvitop tool) plus a workaround for gpu-operator deployments.

GPU · Kernel Module · Kubernetes
8 min read
Infra Learning Club
Feb 12, 2025 · Fundamentals

Why Does Nvidia Report Less GPU Memory Than Specified?

The article investigates why NVIDIA L40S and RTX A6000 GPUs report less memory in nvidia-smi than their advertised 48 GB. It finds that enabling ECC reserves a few gigabytes, and demonstrates the effect by toggling ECC on a Tesla T4 card.

ECC · GPU Memory · L40S
4 min read