
Deploying NVIDIA‑Docker 2.0 on Large‑Scale Kubernetes: A Step‑by‑Step Guide

This tutorial walks through installing NVIDIA‑Docker 2.0, configuring Docker’s runtime, deploying the NVIDIA device plugin on a Kubernetes 1.9 cluster, and testing GPU‑enabled pods, highlighting the advantages over the legacy nvidia‑docker 1.0 approach.

360 Zhihui Cloud Developer

1. Experiment Environment

CentOS Linux release 7.2.1511 (Core)

Kubernetes: 1.9

GPU: nvidia‑tesla‑k80

2. Installing NVIDIA‑Docker 2.0

Follow the official installation guide. Prerequisites:

GNU/Linux x86_64 with kernel version > 3.10

Docker >= 1.12

NVIDIA GPU with Architecture > Fermi (2.1)

NVIDIA drivers ~= 361.93 (untested on older versions)
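These prerequisites can be sanity-checked up front with a short script (a minimal sketch; the Docker and driver checks are skipped when the tools are not yet installed):

```shell
# Kernel version (must be > 3.10)
uname -r

# Docker version, if installed (must be >= 1.12)
if command -v docker >/dev/null 2>&1; then
  docker --version
fi

# NVIDIA driver version, if installed (~= 361.93 or newer)
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
fi
```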

<code># Remove existing nvidia-docker 1.0
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 and reload daemon
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
</code>

Configure Docker to use the NVIDIA container runtime by making it the default runtime in /etc/docker/daemon.json:

<code>{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
</code>
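The nvidia-docker2 package already drops a `runtimes` entry into /etc/docker/daemon.json, so back up any existing file before adding `default-runtime`. One way to stage the change safely (the /tmp path is only for illustration):

```shell
# Stage the config, then install it with:
#   sudo cp /tmp/daemon.json /etc/docker/daemon.json
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Confirm the default runtime is set before copying it into place
grep '"default-runtime"' /tmp/daemon.json
```

After Docker restarts, `docker info` should report `Default Runtime: nvidia`.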

Restart Docker:

<code>sudo systemctl restart docker</code>

3. GPU on Kubernetes

Kubernetes has supported NVIDIA GPUs since v1.6 and AMD GPUs since v1.9. Each container can request whole GPUs, but fractional requests or sharing a single GPU among multiple containers are not supported.
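GPUs surface to the scheduler as the extended resource `nvidia.com/gpu`, requested in whole units only. A quick way to see what each node advertises (assumes kubectl is configured against the cluster):

```shell
# Show the GPU resource in each node's Capacity/Allocatable sections
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```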

4. Deploying the NVIDIA Device Plugin

Before Kubernetes v1.10, GPU support must be enabled via the DevicePlugins feature gate on each kubelet; the NVIDIA drivers and device plugin must also be installed on every GPU node. Deploy the plugin with the following DaemonSet manifest:

<code>apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
</code>

Save the manifest as nvidia-docker-plugin.yml, then create the plugin resources:

<code>kubectl create -f nvidia-docker-plugin.yml</code>
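Once created, it is worth confirming that the plugin pods are running and that GPUs have been registered with the kubelet (assuming kubectl access; the label selector matches the DaemonSet above):

```shell
# One plugin pod should be Running per GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds

# Nodes should now list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu
```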

5. Test GPU Pod

Deploy a test pod that requests one GPU:

<code>apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
</code>

Run the pod and verify that the GPU device and CUDA libraries are available inside the container.
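The cuda-vector-add image runs the CUDA vectorAdd sample once and exits, so the verification amounts to watching the pod complete and reading its log (assuming kubectl access):

```shell
# Pod should reach status Completed
kubectl get pod cuda-vector-add

# On success the sample's log ends with "Test PASSED"
kubectl logs cuda-vector-add
```

If the pod stays Pending with an insufficient `nvidia.com/gpu` message, check that the device plugin is running and that the node carries the `accelerator=nvidia-tesla-k80` label used by the nodeSelector.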

6. Summary

With nvidia‑docker 1.0, GPU drivers and device files had to be mounted into each container by hand; nvidia‑docker 2.0 moves that work into the NVIDIA container runtime, and the Kubernetes device plugin then advertises GPUs to the scheduler, greatly simplifying GPU provisioning. The device‑plugin model and the pluggable container‑runtime interface are good examples of Kubernetes' extensibility for integrating external resources.

GPU device plugin diagram