DGL Operator: A Kubernetes‑Native Solution for Distributed Graph Neural Network Training
This article introduces DGL Operator, an open-source Kubernetes-based controller that automates the lifecycle of distributed graph neural network training with DGL. It explains the project's terminology, the challenges of running native DGL distributed training by hand, and walks through the architecture, workflow, and YAML/CLI examples needed for deployment.
Introduction – DGL Operator, developed by the 360 AI Platform team, is an open-source, Kubernetes-native controller that manages the full lifecycle of distributed training for DGL (Deep Graph Library) graph neural networks. The project is hosted on GitHub: https://github.com/Qihoo360/dgl-operator.
Terminology – The article defines the key concepts used throughout: Workload (the logical training workload), Job (a training task), Pod, initContainer, Worker Pod, Partitioner Pod, Launcher Pod, ipconfig, kubexec.sh, single-machine partitioning, and distributed partitioning.
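To make the ipconfig concept concrete: DGL's distributed launcher conventionally reads a plain-text file with one line per worker, giving its IP address and server port. The addresses below are hypothetical; the exact file layout the Operator generates may differ:

```
# ipconfig / hostfile: one line per Worker Pod — <pod IP> <port>
10.244.1.12 30050
10.244.2.7 30050
```

The Operator writes this file into a ConfigMap so every Pod sees the same cluster membership without any manual SSH or hosts-file setup.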
Background and Challenges – While DGL provides powerful GNN APIs, training at industrial scale (graphs with tens of millions to billions of nodes and edges) must be distributed, and doing so by hand is painful: provisioning many machines, setting up SSH trust between them, partitioning the graph manually, triggering training scripts on each machine, and cleaning up resources afterwards.
DGL Operator Solution – By leveraging Kubernetes controllers, DGL Operator automates environment provisioning, ipconfig generation, graph partitioning, distributed training execution, and resource release, turning the entire process into a declarative workflow.
Kubernetes and Operator Basics – Kubernetes automates container deployment, scaling, and management. An Operator extends Kubernetes with custom resources and controllers to manage stateful applications, providing a feedback loop that reconciles desired and actual states.
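The reconcile feedback loop at the heart of any Operator can be sketched in a few lines of Python. This is a toy model, not DGL Operator's actual implementation: given the desired replica count from a DGLJob spec and the set of Pods that currently exist, it computes which Pods to create or delete so the actual state converges to the desired state.

```python
# Toy model of an Operator's reconcile loop: compare desired state
# (from the custom resource spec) with observed state (existing Pods)
# and return the actions needed to converge them.

def reconcile(desired_replicas: int, existing_pods: set, prefix: str):
    desired_pods = {f"{prefix}-{i}" for i in range(desired_replicas)}
    to_create = sorted(desired_pods - existing_pods)   # missing Pods
    to_delete = sorted(existing_pods - desired_pods)   # stale Pods
    return to_create, to_delete

# Two Workers desired; worker-0 exists, plus a stale worker-5 left over:
create, delete = reconcile(
    2,
    {"dgl-graphsage-worker-0", "dgl-graphsage-worker-5"},
    "dgl-graphsage-worker",
)
# create == ["dgl-graphsage-worker-1"], delete == ["dgl-graphsage-worker-5"]
```

A real controller runs this comparison every time the custom resource or one of its Pods changes, which is what makes the workflow declarative: users edit the spec, and the loop does the rest.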
How to Use DGL Operator – Users submit a DGLJob custom resource via a YAML file to an existing Kubernetes cluster. The Operator creates the necessary ConfigMap, RBAC resources, initContainers, and Pods (Launcher, Partitioner, Workers) to run the distributed training.
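Assuming the DGLJob manifest from the next section is saved as dgl-graphsage.yaml (the filename and the dgljobs resource name are illustrative), submission and inspection follow the usual kubectl pattern:

```shell
# Submit the DGLJob custom resource
kubectl apply -f dgl-graphsage.yaml

# Watch the Pods the Operator creates (launcher, partitioner, workers)
kubectl get pods -n dgl-operator -w

# Inspect the job through the custom resource itself
kubectl get dgljobs -n dgl-operator
kubectl describe dgljob dgl-graphsage -n dgl-operator
```

Everything after `kubectl apply` is driven by the Operator's reconcile loop; no SSH access to the training machines is needed.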
API Example – DGLJob YAML
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
  partitionMode: DGL-API
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: dgloperator/graphsage:v0.1.0
              name: dgl-graphsage
              command:
                - dglrun
              args:
                - --graph-name
                - graphsage
                - --partition-entry-point
                - code/load_and_partition_graph.py
                - --num-partitions
                - "2"
                - --train-entry-point
                - code/train_dist.py
                - --num-epochs
                - "1"
                - --batch-size
                - "1000"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: dgloperator/graphsage:v0.1.0
              name: dgl-graphsage

Generated Launcher Pod Definition
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-launcher
spec:
  volumes:
    - name: kube-volume
      emptyDir: {}
    - name: dataset-volume
      emptyDir: {}
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
    - name: watcher-loop-partitioner
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/partfile
        - name: WATCHERMODE
          value: finished
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - dglrun
      args:
        - '--graph-name'
        - graphsage
        - '--partition-entry-point'
        - code/load_and_partition_graph.py
        - '--num-partitions'
        - '2'
        - '--balance-train'
        - '--balance-edges'
        - '--train-entry-point'
        - code/train_dist.py
        - '--num-epochs'
        - '1'
        - '--batch-size'
        - '1000'
        - '--num-trainers'
        - '1'
        - '--num-samplers'
        - '4'
        - '--num-servers'
        - '1'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
  restartPolicy: Never

Generated Partitioner Pod Definition
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-partitioner
spec:
  volumes:
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
    - name: kube-volume
      emptyDir: {}
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      env:
        - name: DGL_OPERATOR_PHASE_ENV
          value: Partitioner
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  restartPolicy: Never

Generated Worker Pod Definition
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-worker-0
spec:
  volumes:
    - name: shm-volume
      emptyDir:
        medium: Memory
        sizeLimit: 10G
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - sleep
      args:
        - 365d
      ports:
        - name: dglserver
          containerPort: 30050
          protocol: TCP
      volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always

Architecture and Workflow – The Operator implements two layered workflows: the Operator side (creating the ConfigMap, RBAC resources, initContainers, and Pods, and orchestrating the overall job) and the dglrun side (handling graph partitioning, data transfer, and distributed training). Detailed step-by-step sequences for both single-machine and distributed partitioning scenarios are described, accompanied by diagrams.
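On the dglrun side, the training entry point (code/train_dist.py in the examples above) is ordinary DGL distributed-training code. A minimal sketch is shown below; the ipconfig path and the graph name mirror the examples above but are assumptions, and the details of the real entry point in the repository may differ:

```python
# Sketch of a distributed training entry point, assuming DGL's
# dgl.distributed API and the ipconfig generated by the Operator.
import dgl
import torch

dgl.distributed.initialize("/etc/dgl/ipconfig")       # join the DGL cluster
torch.distributed.init_process_group(backend="gloo")  # trainer process group

# Open the partitioned graph produced by the Partitioner Pod
g = dgl.distributed.DistGraph("graphsage")

# Each trainer takes its own slice of the training nodes
train_nids = dgl.distributed.node_split(g.ndata["train_mask"])
# ...build a neighbor-sampling dataloader over train_nids and run the
# usual GraphSAGE training loop for --num-epochs epochs...
```

The key point is that the script never deals with machine discovery or partition placement; the Operator's generated ipconfig and mounted volumes supply all of that.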
Conclusion – By integrating DGL training into the Kubernetes ecosystem, DGL Operator automates configuration generation, graph partitioning, distributed execution, and resource cleanup, embodying MLOps principles. It follows the broader trend of ML‑on‑Kubernetes operators (e.g., TF‑Operator, PyTorch‑Operator) and invites community contributions.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.