Koordinator v1.6 Release: Advanced Heterogeneous Device Scheduling and GPU Management Features
The Koordinator v1.6 release introduces a suite of innovations—including GPU topology‑aware scheduling, end‑to‑end GPU & RDMA joint allocation, strong GPU isolation, differentiated GPU scoring, fine‑grained resource reservation, mixed‑workload QoS, and extensive scheduler and rescheduler optimizations—to efficiently manage heterogeneous resources in Kubernetes clusters for AI and high‑performance computing workloads.
Background: With the rapid rise of large‑model AI and high‑performance computing, demand for efficient heterogeneous device scheduling (GPU, NPU, RDMA) has surged. Koordinator v1.6 responds by enhancing device topology awareness, GPU‑RDMA joint allocation, and GPU isolation to improve AI training and inference performance while boosting cluster utilization.
Core Feature Highlights
1. GPU Topology‑Aware Scheduling – Supports detailed GPU topology detection across a range of devices (e.g., NVIDIA L20/L40S GPUs, Huawei Ascend NPUs) and provides APIs for NUMA‑aligned GPU placement. Example:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: '{"numaTopologyPolicy":"Restricted", "singleNUMANodeExclusive":"Preferred"}'
spec:
  containers:
    - resources:
        limits:
          koordinator.sh/gpu: 200
          cpu: 64
          memory: 500Gi
        requests:
          koordinator.sh/gpu: 200
          cpu: 64
          memory: 500Gi

2. End‑to‑End GDR Support – Enables GPUDirect RDMA for cross‑node GPU communication, reducing CPU and memory‑copy overhead. Joint GPU & RDMA allocation example:
apiVersion: v1
kind: Pod
metadata:
  name: pod-vf01
  namespace: kubeflow
  annotations:
    scheduling.koordinator.sh/device-joint-allocate: |-
      {
        "deviceTypes": ["gpu","rdma"]
      }
    scheduling.koordinator.sh/device-allocate-hint: |-
      {
        "rdma": {
          "vfSelector": {}
        }
      }
spec:
  schedulerName: koord-scheduler
  containers:
    - name: container-vf
      resources:
        requests:
          koordinator.sh/gpu: 100
          koordinator.sh/rdma: 100
        limits:
          koordinator.sh/gpu: 100
          koordinator.sh/rdma: 100

3. Strong GPU Isolation (GPU Sharing) – Allows multiple Pods to share a single GPU with precise compute‑core and memory ratios, leveraging HAMi‑Core for isolation. Example manifests for deploying the HAMi‑Core DaemonSet and a shared‑GPU Pod are provided.
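As a sketch of what such a shared‑GPU Pod could look like — the `koordinator.sh/gpu-core` and `koordinator.sh/gpu-memory-ratio` resource names follow Koordinator's GPU‑share convention, while the isolation‑provider label value is an assumption and should be checked against the v1.6 documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo
  labels:
    # Assumed label selecting HAMi-Core as the isolation backend;
    # verify the exact key/value against the Koordinator v1.6 docs.
    koordinator.sh/gpu-isolation-provider: hami-core
spec:
  schedulerName: koord-scheduler
  containers:
    - name: worker
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      resources:
        requests:
          koordinator.sh/gpu-core: 50          # 50% of one GPU's compute cores
          koordinator.sh/gpu-memory-ratio: 50  # 50% of the GPU's memory
        limits:
          koordinator.sh/gpu-core: 50
          koordinator.sh/gpu-memory-ratio: 50
```

Two such Pods requesting 50/50 would then be packed onto the same physical GPU, with HAMi‑Core enforcing the core and memory limits inside each container.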
4. Differentiated GPU Scheduling Strategies – Introduces the NodeResourcesFitPlus and ScarceResourceAvoidance score plugins to apply distinct scoring policies to GPU versus CPU/memory resources, reducing GPU fragmentation and keeping CPU‑only workloads off GPU nodes.
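A sketch of a scheduler configuration wiring up these two plugins — the argument shapes here are an assumption based on the plugin names above, and `nvidia.com/gpu` stands in for whatever device resource the cluster actually exposes:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFitPlus
            weight: 2
          - name: ScarceResourceAvoidance
            weight: 2
    pluginConfig:
      - name: NodeResourcesFitPlus
        args:
          resources:
            nvidia.com/gpu:
              type: MostAllocated    # pack GPU pods together to cut fragmentation
              weight: 2
            cpu:
              type: LeastAllocated   # spread ordinary CPU load
              weight: 1
            memory:
              type: LeastAllocated
              weight: 1
      - name: ScarceResourceAvoidance
        args:
          resources:
            - nvidia.com/gpu         # steer non-GPU pods away from GPU nodes
```

The intent is that GPU requests bin‑pack (MostAllocated) while CPU/memory spread (LeastAllocated), and Pods that request no GPU score lower on nodes that hold the scarce resource.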
5. Fine‑Grained Resource Reservation – Enhances reservation APIs for exact‑match reservations, reservation‑ignored mode, and reservation affinity with taints/tolerations, enabling precise CPU‑GPU‑MEM alignment and pre‑emptive reservation handling.
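For illustration, a Reservation plus a consuming Pod might be sketched as follows; the `exact-match-reservation` annotation shape follows Koordinator's reservation API, and the names and selectors are hypothetical:

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: training-reservation   # hypothetical name
spec:
  owners:
    - labelSelector:
        matchLabels:
          app: ai-training     # only matching Pods may consume this reservation
  template:
    spec:
      schedulerName: koord-scheduler
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 16
              memory: 64Gi
              koordinator.sh/gpu: 100
---
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  labels:
    app: ai-training
  annotations:
    # Exact match: bind only to a reservation whose reserved quantities
    # for these resources equal the Pod's own requests.
    scheduling.koordinator.sh/exact-match-reservation: |-
      {"resourceNames": ["cpu", "memory", "koordinator.sh/gpu"]}
spec:
  schedulerName: koord-scheduler
  containers:
    - name: main
      resources:
        requests:
          cpu: 16
          memory: 64Gi
          koordinator.sh/gpu: 100
```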
6. Mixed‑Workload (Mid‑Tier) Enhancements – Improves resource over‑commit, node profiling, pod‑level QoS (Resctrl, CPU QoS), and metrics for better utilization of idle resources while preserving high‑priority task performance.
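A minimal sketch of a Mid‑tier Pod consuming over‑committed resources — the `koord-mid` priority class and `kubernetes.io/mid-*` resource names follow Koordinator's priority model, while the quantities and QoS label are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mid-tier-demo
  labels:
    koordinator.sh/qosClass: BE   # best-effort QoS; adjust to the workload's needs
spec:
  schedulerName: koord-scheduler
  priorityClassName: koord-mid    # Mid-tier priority class installed by Koordinator
  containers:
    - name: app
      resources:
        requests:
          kubernetes.io/mid-cpu: 2000     # mid-cpu is counted in milli-cores
          kubernetes.io/mid-memory: 4Gi
        limits:
          kubernetes.io/mid-cpu: 2000
          kubernetes.io/mid-memory: 4Gi
```

Such Pods run on the reclaimable headroom that node profiling predicts, and are throttled or evicted first when high‑priority tasks need the capacity back.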
7. Scheduler & Rescheduler Optimizations – Moves PodGroup checks earlier, refines plugin state handling, adds latency metrics, and upgrades LowNodeLoad, MigrationController, and global eviction limits to boost scheduling throughput and stability in large clusters.
Future Plans: Continue strengthening GPU management, introduce NPU scheduling, develop rescheduling plugins for resource imbalance, and evolve end‑to‑end device management solutions.