DGL Operator: A Kubernetes Native Controller for Distributed Graph Neural Network Training
DGL Operator is an open‑source Kubernetes controller that automates the lifecycle of distributed graph neural network training by handling configuration generation, graph partitioning, training execution, and resource cleanup, providing a cloud‑native solution for large‑scale GNN workloads.
DGL Operator is an open-source training controller developed by the 360 Intelligent Engineering AI Platform team. It is built on the Kubernetes cloud-native stack and on DGL, the graph neural network framework co-developed by Amazon and NYU. The project is hosted on GitHub.
Terminology
Workload: a logical concept roughly equivalent to a Pod; the unit that performs work (graph partitioning or training).
Job: a set of workloads (Pods) that together make up one DGL training lifecycle.
Pod: the smallest schedulable unit in Kubernetes; it may contain multiple containers.
initContainer: a container that runs to completion before the main containers start, used for pre-execution tasks.
Worker Pod: a Pod that actually performs distributed training.
Partitioner Pod: a Pod that performs graph partitioning.
Launcher Pod: a lightweight Pod that coordinates the DGLJob: it generates configuration, triggers partitioning and training, and releases resources.
dglrun: the workflow script inside the container that drives DGL partitioning and training.
ipconfig: a configuration file listing each worker's name, IP, port, and GPU requirements.
kubexec.sh: a helper script for remote command execution inside Kubernetes.
Single-machine partition: graph partitioning performed in a single Partitioner Pod.
Distributed partition: experimental ParMETIS-based partitioning spread across multiple Worker Pods.
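The ipconfig file above is the piece native DGL users must assemble by hand and DGL Operator generates automatically. As a rough sketch, assembling it is just string formatting over the worker records; the one-record-per-line "name ip port" layout used here is an illustrative assumption, since the exact format depends on the DGL version:

```python
# Sketch: assemble an ipconfig/hostfile like the one DGL Operator generates.
# The "name ip port" record layout is an illustrative assumption, not the
# guaranteed on-disk format of any particular DGL release.

def build_ipconfig(workers):
    """workers: list of (name, ip, port) tuples -> ipconfig file text."""
    return "\n".join(f"{name} {ip} {port}" for name, ip, port in workers) + "\n"

workers = [
    ("dgl-graphsage-worker-0", "10.0.0.11", 30050),
    ("dgl-graphsage-worker-1", "10.0.0.12", 30050),
]
print(build_ipconfig(workers), end="")
```

In native DGL this file is written by hand after provisioning machines; the Operator instead fills it in from the Pod IPs it observes.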
Background
In recent years graph neural networks (GNNs) have achieved great success in search, recommendation, and knowledge-graph applications, but building GNN models quickly remains difficult. While deep-learning frameworks such as TensorFlow, PyTorch, and MXNet provide many ready-to-use APIs for CNNs and RNNs, they lack convenient GNN APIs. DGL, co-developed by NYU and Amazon, fills this gap with graph-centric APIs and performance optimizations.
Industrial workloads often involve graphs with tens of millions to billions of nodes and edges, prompting Amazon to release a distributed training mode for DGL in 2020.
Challenges of native DGL distributed training
Many physical machines or VMs must be provisioned in advance, and their IPs and ports collected by hand into an ipconfig file.
Passwordless SSH trust must be set up across all machines.
Graph partitioning must be triggered manually on the master node; distributed partitioning additionally requires installing ParMETIS.
Training must be started manually from the master node; the process cannot be fully automated.
After training, resources must be released manually.
DGL Operator solution
DGL Operator addresses the above challenges by implementing a Kubernetes controller that manages the entire DGL training lifecycle. When a user submits a DGLJob, the Operator automatically schedules suitable hosts, creates Pods with the required images, generates the ipconfig file, performs graph partitioning, triggers fully distributed training, persists model outputs and finally releases compute and storage resources.
Kubernetes and Operator basics
Kubernetes automates deployment, scaling and management of containerized applications. An Operator extends Kubernetes with custom resources and controllers to manage stateful workloads. The Operator watches the desired state defined in a custom resource (DGLJob) and reconciles the actual cluster state accordingly.
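The reconcile pattern the paragraph describes can be sketched in a few lines. The actual controller is written in Go against the Kubernetes API; the Python below is a conceptual model only, with illustrative names:

```python
# Conceptual sketch of a level-triggered reconcile loop: compare the state
# declared in the DGLJob resource against the Pods that actually exist, and
# issue creates/deletes to close the gap. Names are illustrative.

def reconcile(desired_pods, actual_pods, create, delete):
    """Drive actual cluster state toward the desired DGLJob state."""
    for name in sorted(desired_pods - actual_pods):
        create(name)          # e.g. missing launcher/partitioner/worker Pods
    for name in sorted(actual_pods - desired_pods):
        delete(name)          # e.g. leftover Pods after cleanup

created, deleted = [], []
reconcile(
    {"launcher", "worker-0", "worker-1"},  # desired by the DGLJob spec
    {"launcher"},                          # currently running
    created.append,
    deleted.append,
)
print(created)  # → ['worker-0', 'worker-1']
```

The key property is idempotence: running reconcile again after the creates succeed produces no further actions.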
Using DGL Operator
Users only need to submit a YAML that defines a DGLJob custom resource. The Operator creates the necessary ConfigMap, RBAC objects, initContainers (e.g., kubectl‑download, watcher‑loop), Launcher, Partitioner and Worker Pods. The user’s Docker image must contain DGL, its dependencies and the dglrun script.
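The watcher-loop initContainers mentioned above gate Pod startup on shared state files (the generated Pod specs below pass them WATCHERFILE and WATCHERMODE environment variables). A minimal sketch of that polling behavior, with the retry details assumed rather than taken from the actual image:

```python
# Sketch of a watcher-loop initContainer: block until a shared file
# (e.g. /etc/dgl/partfile) reports the expected phase ("ready"/"finished").
# Polling interval and retry count here are illustrative assumptions.
import os
import tempfile
import time

def wait_for(path, mode, interval=0.1, retries=50):
    """Poll `path` until its stripped contents equal `mode`."""
    for _ in range(retries):
        try:
            with open(path) as f:
                if f.read().strip() == mode:
                    return True
        except FileNotFoundError:
            pass  # the file has not been written yet
        time.sleep(interval)
    return False

# Example: a partfile that already says "finished" unblocks immediately.
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("finished\n")
print(wait_for(f.name, "finished"))  # → True
os.unlink(f.name)
```

Only once the watched file reaches the expected state does the initContainer exit, letting the main container (and thus the next phase of the job) start.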
API example
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
  partitionMode: DGL-API
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: dgloperator/graphsage:v0.1.0
              name: dgl-graphsage
              command:
                - dglrun
              args:
                - --graph-name
                - graphsage
                # partition arguments
                - --partition-entry-point
                - code/load_and_partition_graph.py
                - --num-partitions
                - "2"
                # training arguments
                - --train-entry-point
                - code/train_dist.py
                - --num-epochs
                - "1"
                - --batch-size
                - "1000"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: dgloperator/graphsage:v0.1.0
              name: dgl-graphsage

Key fields:
cleanPodPolicy can be Running, None, or All, controlling which Pods are deleted after the job completes.
Launcher and Worker follow the standard Kubernetes PodTemplateSpec definition.
partitionMode can be ParMETIS or DGL-API; the latter uses DGL's native dgl.distributed.partition_graph API.
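The effect of cleanPodPolicy can be illustrated with a small selection function. The policy names come from the DGLJob spec; the selection logic below is an illustrative guess at the semantics, not the Operator's actual code:

```python
# Sketch of cleanPodPolicy semantics: given the job's Pods and their phases,
# decide which Pods to delete when the job finishes. Illustrative only.

def pods_to_clean(pods, policy):
    """pods: {name: phase}; policy: 'None' | 'Running' | 'All'."""
    if policy == "None":
        return []                     # keep every Pod, e.g. for debugging
    if policy == "Running":
        return [n for n, phase in pods.items() if phase == "Running"]
    if policy == "All":
        return list(pods)             # delete every Pod of the job
    raise ValueError(f"unknown cleanPodPolicy: {policy}")

pods = {"worker-0": "Running", "worker-1": "Succeeded"}
print(pods_to_clean(pods, "Running"))  # → ['worker-0']
```

With the default shown in the API example (Running), Pods that are still running when the job completes are reclaimed, while completed Pods are left for inspection.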
Generated Launcher Pod definition
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-launcher
spec:
  volumes:
    - name: kube-volume
      emptyDir: {}
    - name: dataset-volume
      emptyDir: {}
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
    - name: watcher-loop-partitioner
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/partfile
        - name: WATCHERMODE
          value: finished
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
    - name: watcher-loop-worker
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/hostfile
        - name: WATCHERMODE
          value: ready
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - dglrun
      args:
        - '--graph-name'
        - graphsage
        - '--partition-entry-point'
        - code/load_and_partition_graph.py
        - '--num-partitions'
        - '2'
        - '--balance-train'
        - '--balance-edges'
        - '--train-entry-point'
        - code/train_dist.py
        - '--num-epochs'
        - '1'
        - '--batch-size'
        - '1000'
        - '--num-trainers'
        - '1'
        - '--num-samplers'
        - '4'
        - '--num-servers'
        - '1'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
  restartPolicy: Never

Generated Partitioner and Worker Pod definitions
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-partitioner
spec:
  volumes:
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
    - name: kube-volume
      emptyDir: {}
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      env:
        - name: DGL_OPERATOR_PHASE_ENV
          value: Partitioner
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  restartPolicy: Never
---
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-worker-0
spec:
  volumes:
    - name: shm-volume
      emptyDir:
        medium: Memory
        sizeLimit: 10G
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - sleep
      args:
        - 365d
      ports:
        - name: dglserver
          containerPort: 30050
          protocol: TCP
      volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always

Architecture and workflow design
On the Operator side, the workflow creates the ConfigMap, RBAC resources, initContainers, and the Partitioner and Worker Pods, watches their status, and finally runs the Launcher container, which executes dglrun. On the dglrun side, the workflow performs graph partitioning (single-machine, or distributed via ParMETIS) and then launches the distributed training command. After training completes, the Operator cleans up all Pods and releases resources.
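The dglrun side of this workflow can be summarized as an ordered sequence of phases. The sketch below models only the ordering described above; the phase names and the dispatch step between partitioning and training are illustrative assumptions, not the script's actual interface:

```python
# Sketch of the dglrun phase ordering: partition (mode-dependent), then
# distribute the partitions, then train, then clean up. Names illustrative.

def dglrun_phases(partition_mode, run):
    steps = []
    if partition_mode == "DGL-API":
        steps.append("partition:single")       # dgl.distributed.partition_graph
    elif partition_mode == "ParMETIS":
        steps.append("partition:distributed")  # experimental ParMETIS path
    else:
        raise ValueError(f"unknown partitionMode: {partition_mode}")
    steps += ["dispatch-partitions", "train", "cleanup"]
    for step in steps:
        run(step)  # execute each phase in order
    return steps

trace = []
dglrun_phases("DGL-API", trace.append)
print(trace)  # → ['partition:single', 'dispatch-partitions', 'train', 'cleanup']
```

The partitionMode field in the DGLJob spec is what selects the first phase; everything after it is common to both modes.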
Conclusion
Since Kubernetes provides image reuse, isolation, declarative resource definitions and extensible scheduling, it has become the backbone for large‑scale machine‑learning systems. DGL Operator leverages this ecosystem to automate ipconfig generation, graph partitioning, distributed training execution and resource cleanup, embodying MLOps principles. It joins other Kubeflow operators (TensorFlow, PyTorch, MPI) and offers the community a simple tool for large‑scale GNN training in cloud‑native environments.
For more information and to try it out, visit the GitHub repository: https://github.com/Qihoo360/dgl-operator.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.