
Deploying NVIDIA‑Docker 2.0 on Large‑Scale Kubernetes: A Step‑by‑Step Guide

This tutorial walks through installing NVIDIA‑Docker 2.0, configuring Docker’s runtime, deploying the NVIDIA device plugin on a Kubernetes 1.9 cluster, and testing GPU‑enabled pods, highlighting the advantages over the legacy nvidia‑docker 1.0 approach.

360 Zhihui Cloud Developer

1. Experiment Environment

CentOS Linux release 7.2.1511 (Core)

Kubernetes: 1.9

GPU: nvidia‑tesla‑k80

2. Installing NVIDIA‑Docker 2.0

Follow the official installation guide. Prerequisites:

GNU/Linux x86_64 with kernel version > 3.10

Docker >= 1.12

NVIDIA GPU with Architecture > Fermi (2.1)

NVIDIA drivers ~= 361.93 (untested on older versions)
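These prerequisites can be sanity-checked up front with a short script (a minimal sketch; the Docker and driver checks are skipped when the tools are not yet installed):

```shell
# Kernel version (must be > 3.10)
uname -r

# Docker version, if installed (must be >= 1.12)
if command -v docker >/dev/null 2>&1; then
  docker --version
fi

# NVIDIA driver version, if installed (~= 361.93 or newer)
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
fi
```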

<code># Remove existing nvidia-docker 1.0
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 and reload daemon
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
</code>

Configure Docker to use the NVIDIA container runtime by making it the default runtime in /etc/docker/daemon.json:

<code>{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
</code>
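The nvidia-docker2 package already drops a `runtimes` entry into /etc/docker/daemon.json, so back up any existing file before adding `default-runtime`. One way to stage the change safely (the /tmp path is only for illustration):

```shell
# Stage the config, then install it with:
#   sudo cp /tmp/daemon.json /etc/docker/daemon.json
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Confirm the default runtime is set before copying it into place
grep '"default-runtime"' /tmp/daemon.json
```

After Docker restarts, `docker info` should report `Default Runtime: nvidia`.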

Restart Docker:

<code>sudo systemctl restart docker</code>

3. GPU on Kubernetes

Kubernetes has supported NVIDIA GPUs since v1.6 and AMD GPUs since v1.9. Each container can request whole GPUs, but fractional requests or sharing a single GPU among multiple containers are not supported.
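GPUs surface to the scheduler as the extended resource `nvidia.com/gpu`, requested in whole units only. A quick way to see what each node advertises (assumes kubectl is configured against the cluster):

```shell
# Show the GPU resource in each node's Capacity/Allocatable sections
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```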

4. Deploying the NVIDIA Device Plugin

Before Kubernetes v1.10, GPU support must be enabled via the DevicePlugins feature gate on each kubelet; the NVIDIA drivers and device plugin must also be installed on every GPU node. Deploy the plugin with the following DaemonSet manifest:

<code>apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
</code>

Save the manifest as nvidia-docker-plugin.yml, then create the plugin resources:

<code>kubectl create -f nvidia-docker-plugin.yml</code>
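Once created, it is worth confirming that the plugin pods are running and that GPUs have been registered with the kubelet (assuming kubectl access; the label selector matches the DaemonSet above):

```shell
# One plugin pod should be Running per GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds

# Nodes should now list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu
```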

5. Test GPU Pod

Deploy a test pod that requests one GPU:

<code>apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
</code>

Run the pod and verify that the GPU device and CUDA libraries are available inside the container.
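The cuda-vector-add image runs the CUDA vectorAdd sample once and exits, so the verification amounts to watching the pod complete and reading its log (assuming kubectl access):

```shell
# Pod should reach status Completed
kubectl get pod cuda-vector-add

# On success the sample's log ends with "Test PASSED"
kubectl logs cuda-vector-add
```

If the pod stays Pending with an insufficient `nvidia.com/gpu` message, check that the device plugin is running and that the node carries the `accelerator=nvidia-tesla-k80` label used by the nodeSelector.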

6. Summary

With nvidia‑docker 1.0, GPU drivers and device files had to be mounted into each container by hand; nvidia‑docker 2.0 moves that work into the NVIDIA container runtime, and the Kubernetes device plugin then advertises GPUs to the scheduler, greatly simplifying GPU provisioning. The device‑plugin model and the pluggable container‑runtime interface are good examples of Kubernetes' extensibility for integrating external resources.

GPU device plugin diagram