
Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

360 Smart Cloud

1. Introduction

Since the release of ChatGPT at the end of 2022, a wave of large‑model research has emerged worldwide, and high‑performance networking has become a key factor for distributed training beyond powerful AI chips.

2. High‑Performance Network Overview

Traditional TCP/IP networks suffer from protocol‑stack latency and high CPU load, while RDMA‑based solutions such as RoCE (RDMA over Converged Ethernet) and InfiniBand provide low‑latency, high‑bandwidth, low‑CPU‑consumption communication.
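To make the stakes concrete, a back-of-the-envelope estimate of gradient synchronization time is useful. The model size, node count, and link speeds below are illustrative assumptions, not measurements from the article:

```python
# Illustrative arithmetic only: model size, node count, and link speeds
# are assumed values chosen for the example, not cluster measurements.

def ring_allreduce_seconds(payload_bytes: float, nodes: int, link_gbps: float) -> float:
    """Lower-bound time for a ring all-reduce: each node sends and receives
    2*(n-1)/n of the payload over its link (ignores latency and overlap)."""
    traffic = 2 * (nodes - 1) / nodes * payload_bytes
    return traffic / (link_gbps * 1e9 / 8)  # convert Gb/s to bytes/s

grads = 7e9 * 2          # 7B parameters in fp16 -> ~14 GB of gradients
nodes = 16

t_eth  = ring_allreduce_seconds(grads, nodes, 25)    # a 25 GbE TCP fabric
t_rdma = ring_allreduce_seconds(grads, nodes, 400)   # 4 x 100 Gb/s RDMA NICs

print(f"TCP/IP : {t_eth:.2f} s per all-reduce")
print(f"RDMA   : {t_rdma:.2f} s per all-reduce")
```

Even this idealized lower bound, which ignores protocol-stack latency and CPU copies entirely, shows why fabric bandwidth dominates iteration time at scale.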

Two main schemes are used in industry: RoCE v2 and InfiniBand. Both have been deployed inside 360 Group to support large‑model projects such as 360 智脑 (360 Zhinao) and 360 智绘 (360 Zhihui).

3. Building a RoCE v2 Network in a Cloud‑Native Environment

The cluster uses six NICs per host: two bonded Ethernet NICs for the management plane and four Mellanox NICs (mlx5) for the data plane. Cilium maintains the management network, while Multus CNI, macvlan, and whereabouts provide a second data‑plane network for pods.

Key components include NVIDIA’s network‑operator, whose Helm values configure the rdma-shared-device-plugin to expose each data‑plane mlx5 NIC as an allocatable extended resource:

rdmaSharedDevicePlugin:
  deploy: true
  image: k8s-rdma-shared-dev-plugin
  repository: ghcr.io/mellanox
  version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
  useCdi: false
  resources:
    - resourcePrefix: nvidia.com
      resourceName: mlx5_0
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan2]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_1
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan3]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_2
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan4]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_3
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan5]

A MacvlanNetwork object is created for each data‑plane NIC; the whereabouts IPAM plugin then allocates pod IPs from the corresponding per‑NIC pool.

apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan2
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.4.0/22",
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "gateway": "192.168.4.1"
    }
  master: lan2
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan3
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.8.0/22",
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "gateway": "192.168.8.1"
    }
  master: lan3
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan4
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.12.0/22",
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "gateway": "192.168.12.1"
    }
  master: lan4
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan5
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.16.0/22",
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "gateway": "192.168.16.1"
    }
  master: lan5
  mode: bridge
  mtu: 1500
  networkNamespace: prod

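Because the four whereabouts pools above must never hand out overlapping addresses, a quick pre-flight check with Python's standard ipaddress module can catch copy‑paste mistakes in the CIDRs (a sketch; the pool values are taken from the manifests above):

```python
# Sanity-check sketch: verify the four whereabouts IP pools are disjoint
# before applying the MacvlanNetwork objects. CIDRs copied from the manifests.
import ipaddress

pools = {
    "lan2": "192.168.4.0/22",
    "lan3": "192.168.8.0/22",
    "lan4": "192.168.12.0/22",
    "lan5": "192.168.16.0/22",
}

nets = {name: ipaddress.ip_network(cidr) for name, cidr in pools.items()}
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not nets[a].overlaps(nets[b]), f"{a} overlaps {b}"

# Each /22 provides 1024 addresses (1022 usable after network/broadcast).
print("all pools disjoint;", nets["lan2"].num_addresses, "addresses per pool")
```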
A sample Volcano job requests the four mlx5 resources and sets NCCL environment variables to enable RoCE communication.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: rdma-test
  namespace: prod
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    pytorch:
      - '--master=master'
      - '--worker=worker'
      - '--port=23456'
  policies:
    - action: RestartJob
      event: PodEvicted
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: master
      replicas: 1
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: rdma-net-ipam-lan2,rdma-net-ipam-lan3,rdma-net-ipam-lan4,rdma-net-ipam-lan5
        spec:
          containers:
            - command:
                - /bin/bash
                - '-c'
                - sleep 1440h
              env:
                - name: NCCL_DEBUG
                  value: INFO
                - name: NCCL_IB_DISABLE
                  value: '0'
                - name: NCCL_NET_GDR_READ
                  value: '1'
                - name: NCCL_IB_HCA
                  value: mlx5
                - name: NCCL_IB_GID_INDEX
                  value: '5'
                - name: NCCL_SOCKET_IFNAME
                  value: eth0
              image: torch
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: '8'
                  nvidia.com/mlx5_0: '1'
                  nvidia.com/mlx5_1: '1'
                  nvidia.com/mlx5_2: '1'
                  nvidia.com/mlx5_3: '1'
                requests:
                  nvidia.com/gpu: '8'
                  nvidia.com/mlx5_0: '1'
                  nvidia.com/mlx5_1: '1'
                  nvidia.com/mlx5_2: '1'
                  nvidia.com/mlx5_3: '1'
          schedulerName: volcano

4. Building an InfiniBand Network in a Cloud‑Native Environment

InfiniBand requires dedicated IB switches and NVIDIA UFM (Unified Fabric Manager) for management. Because the HCAs communicate natively over the IB fabric, no second network plane or MacvlanNetwork objects are needed, and the RoCE‑specific job annotations are omitted.
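As a rough sketch of the difference, an IB‑mode task template keeps the generic NCCL settings but drops the RoCE‑specific ones; the values below mirror the RoCE example above rather than a verified production configuration:

```yaml
# Sketch only: env fragment for an IB-mode task, mirroring the RoCE job.
env:
  - name: NCCL_DEBUG
    value: INFO
  - name: NCCL_IB_DISABLE
    value: '0'
  - name: NCCL_IB_HCA
    value: mlx5
  # NCCL_IB_GID_INDEX selects a RoCE GID and is not needed on native IB;
  # the k8s.v1.cni.cncf.io/networks annotation is likewise dropped.
```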

5. Performance Evaluation

Using the 360 AI development platform to launch MPI‑based distributed training jobs, all‑reduce performance tests show that both RoCE v2 and IB achieve far higher bandwidth than traditional Ethernet, confirming their suitability for training trillion‑parameter models.
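For readers reproducing such tests, note that all‑reduce benchmarks such as nccl-tests report both an algorithm bandwidth and a "bus" bandwidth; a sketch of the standard conversion from measured time (the sample numbers are placeholders, not the article's results):

```python
# Sketch of the standard nccl-tests bandwidth formulas for all-reduce:
# algbw = bytes / time; busbw = algbw * 2*(n-1)/n.
# The sample payload, time, and rank count are illustrative placeholders.

def allreduce_bandwidths(nbytes: float, seconds: float, ranks: int):
    algbw = nbytes / seconds                   # algorithm bandwidth, bytes/s
    busbw = algbw * 2 * (ranks - 1) / ranks    # per-link "bus" bandwidth
    return algbw / 1e9, busbw / 1e9            # report in GB/s

alg, bus = allreduce_bandwidths(nbytes=1 << 30, seconds=0.050, ranks=16)
print(f"algbw = {alg:.1f} GB/s, busbw = {bus:.1f} GB/s")
```

Comparing busbw against the NIC line rate is the usual way to judge how close a fabric runs to its theoretical limit.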

All capabilities are integrated into the 360 AI platform, allowing users to create GPU tasks that automatically leverage the high‑performance networks.

Tags: Cloud Native, Kubernetes, RDMA, AI training, high-performance networking, InfiniBand, RoCE
Written by 360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.