
Setting /dev/shm Size for Kubernetes Pods: A Production Troubleshooting Guide

During a production deployment of large language model training on Kubernetes, a pod failed because its /dev/shm shared memory was too small. This article traces the root cause, shows that the pod spec has no dedicated shared-memory parameter, and presents a complete solution: an emptyDir volume with medium: Memory and sizeLimit, mounted at /dev/shm.

Go Programming World

In a production environment where a large language model was being trained on Kubernetes, a pod repeatedly failed because the container's /dev/shm shared‑memory size was too small, causing NCCL communication errors.

Cause

The issue was first reported in an internal OA group, where the algorithm team noted that the vLLM framework could not allocate enough shared memory for NCCL. When run under Docker, the workload had been started with --shm-size set to 128 MiB, but the Kubernetes pod set no equivalent parameter, leaving /dev/shm at the container runtime's default size.

Assumption

The working hypothesis was that the missing /dev/shm size setting in the pod spec was the root cause. However, the output of kubectl explain pod.spec and kubectl explain pod.spec.containers showed no field for setting the shared-memory size directly.

$ kubectl explain pod.spec
$ kubectl explain pod.spec.containers

Searches on the official Kubernetes site yielded very few results, prompting a broader Google search.

Solution

Stack Overflow provided a solution: mount an emptyDir volume with medium: Memory at /dev/shm. The following pod spec fragment implements this:

spec:
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
  containers:
  - image: gcr.io/project/image
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm

Verification in the production environment confirmed that the pod now started without errors and the model training proceeded normally.
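
One way to confirm the new size is to check the filesystem from inside the container. The helper below is a minimal sketch, not the check used in the original incident; the function name fs_size_mib is hypothetical, and the sketch assumes a POSIX df and awk are available in the image.

```shell
# Report the total size of the filesystem at a given path, in MiB.
# Uses POSIX "df -Pk" (1024-byte blocks, one line per filesystem).
fs_size_mib() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
}

# Inside the container this would be run against /dev/shm;
# "/" is used here only so the sketch runs anywhere.
fs_size_mib /
```

From outside the pod, a quick equivalent is kubectl exec <pod-name> -- df -h /dev/shm, where <pod-name> is a placeholder for the actual pod.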

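To also limit the size of the emptyDir, set the sizeLimit field. The example below shows the field on a generic emptyDir mounted at /cache; for shared memory it would be combined with medium: Memory and a /dev/shm mount path: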

apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: registry.k8s.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
  volumes:
  - name: cache-volume
    emptyDir:
      sizeLimit: 500Mi
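
The two fragments above can be combined into a single manifest. The following is a sketch under stated assumptions rather than the manifest used in production: the function name render_shm_pod, the container name main, the pod name shm-demo, and the 2Gi limit are all hypothetical example values.

```shell
# Print a Pod manifest whose /dev/shm is a memory-backed emptyDir.
# Arguments (example values only): pod name, image, size limit.
render_shm_pod() {
  pod_name="$1"
  image="$2"
  shm_limit="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
spec:
  containers:
  - name: main
    image: ${image}
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: ${shm_limit}
EOF
}

# Render a manifest with a 2Gi limit (placeholder name and image).
render_shm_pod shm-demo registry.k8s.io/test-webserver 2Gi
# To create the pod, the output could be piped to: kubectl apply -f -
```

Both fields can be looked up with kubectl explain pod.spec.volumes.emptyDir. Note that files written to a memory-backed emptyDir count against the container's memory limit, and how sizeLimit is enforced for the Memory medium depends on the cluster version, so it is worth checking both against the cluster in use.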

Further research showed that Docker’s default /dev/shm size is 64 MiB, which can be changed with --shm-size:

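# Terminal 1: start a container with the default settings
$ docker run --rm -it --name ubuntu ubuntu
# Terminal 2: inspect the default shm size, reported in bytes
$ docker inspect ubuntu | grep -i shm
    "ShmSize": 67108864,
# After exiting the first container, run again with a larger shm
$ docker run --rm -it --name ubuntu --shm-size=2gb ubuntu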
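
The ShmSize value is reported in bytes, so a small conversion makes it easier to compare with the 64 MiB default. This is an illustrative sketch; the function name bytes_to_mib is hypothetical.

```shell
# Convert a byte count (as shown in Docker's "ShmSize" field) to MiB.
bytes_to_mib() {
  echo "$(( $1 / 1024 / 1024 ))"
}

bytes_to_mib 67108864    # prints 64
```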

The host’s /dev/shm can also be bind-mounted into a container with Docker's -v flag. The Kubernetes equivalent is a hostPath volume, though it is rarely needed:

$ docker run --rm -it --name ubuntu -v /dev/shm:/dev/shm ubuntu

Review of /dev/shm

/dev/shm is a tmpfs (in‑memory file system) used for fast inter‑process communication. Because it resides in RAM, it is ideal for frameworks that rely on shared memory, such as NCCL.
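
To see that /dev/shm really is a tmpfs on a given system, the mount table can be inspected. The sketch below assumes a Linux-style mounts table (/proc/mounts by default); the function name mount_fstype is hypothetical.

```shell
# Print the filesystem type of a mount point by reading a mounts table.
# Each line has the fields: device mountpoint fstype options ...
# If a path is mounted more than once, the last (topmost) entry wins.
mount_fstype() {
  mount_point="$1"
  mounts_file="${2:-/proc/mounts}"
  awk -v mp="$mount_point" '
    $2 == mp { fstype = $3 }
    END { if (fstype != "") print fstype }
  ' "$mounts_file"
}

# Only query the live table where it exists (Linux).
if [ -r /proc/mounts ]; then
  mount_fstype /dev/shm
fi
```

On a typical Linux host or container this is expected to report tmpfs, though the result depends on how the system is configured.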

Conclusion

The problem was resolved by mounting an emptyDir volume with medium: Memory (and optionally sizeLimit) at /dev/shm, which gives the pod sufficient shared memory. Because the pod spec has no counterpart to Docker's --shm-size flag, this volume-based approach is the usual way to size /dev/shm in Kubernetes.

Tags: Kubernetes, shared memory, Pod, emptyDir, shm
Written by

Go Programming World

Mobile version of tech blog https://jianghushinian.cn/, covering Golang, Docker, Kubernetes and beyond.
