What’s New in Koordinator v0.7? Enhanced Coscheduling, ElasticQuota, and Fine‑Grained GPU Sharing
Koordinator v0.7 adds major cloud‑native scheduling features—including enhanced gang (coscheduling) with Strict/NonStrict modes, multi‑hierarchy ElasticQuota management, fine‑grained GPU resource protocols, richer diagnostic APIs, and safer descheduling—targeting machine‑learning and big‑data workloads on Kubernetes.
Version Highlights
This release targets machine-learning and big-data scenarios. The sections below cover each area in turn: enhanced coscheduling, multi-hierarchy ElasticQuota management, fine-grained GPU sharing, diagnostic APIs, and safer descheduling.
Enhanced Coscheduling
Gang (coscheduling) ensures that all related Pods of a job start together (All‑or‑Nothing). The new implementation adds two modes: Strict (default, rejects waiting Pods on failure) and NonStrict (allows retry until the MinMember requirement is met). Users enable NonStrict by adding the annotation gang.scheduling.koordinator.sh/mode=NonStrict to a PodGroup or Pod.
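Failure handling is improved with a ScheduleCycle-based retry mechanism: each retry increments a monotonic counter, so the scheduler no longer relies on timestamps to decide whether a retry is stale, which avoids the bugs that time-based bookkeeping can introduce.
Multiple PodGroups can be coordinated to achieve All-or-Nothing across roles by annotating each of them with the full list of groups, e.g.: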
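The interplay between the two modes and the retry counter can be pictured with a toy model. This is a minimal sketch, not Koordinator's implementation: the Gang class, its method names, and the choice to bump the counter on every failure are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gang:
    """Toy model of a gang; every name here is hypothetical."""
    min_member: int
    mode: str = "Strict"      # "Strict" or "NonStrict"
    schedule_cycle: int = 1   # monotonic retry counter
    waiting: set = field(default_factory=set)

    def on_pod_reserved(self, pod: str) -> bool:
        """Record a pod that found a node; True once the gang can start."""
        self.waiting.add(pod)
        return len(self.waiting) >= self.min_member

    def on_pod_failed(self, pod: str) -> list:
        """Handle a member that could not be scheduled.

        Returns the waiting pods that are rejected as a consequence.
        """
        self.waiting.discard(pod)
        self.schedule_cycle += 1  # assumption: each failure opens a new cycle
        if self.mode == "Strict":
            rejected = sorted(self.waiting)
            self.waiting.clear()
            return rejected
        return []  # NonStrict: waiting pods keep their place and retry
```

In this sketch a Strict gang throws away its partial progress when one member fails, while a NonStrict gang keeps the pods that already hold a node and waits for the rest.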
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: podGroupA
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: '["namespaceA/podGroupA", "namespaceB/podGroupB"]'
spec: ...
A lightweight gang protocol allows users to skip PodGroup creation by annotating individual Pods:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    gang.scheduling.koordinator.sh/name: "pod-group-a"
    gang.scheduling.koordinator.sh/min-available: "5"
  name: demo-pod
  namespace: default
spec: ...
ElasticQuota Scheduling
Traditional ResourceQuota limits resources per Namespace but can lead to low utilization. ElasticQuota adds max (upper bound) and min (guaranteed amount) fields, enabling borrowing and fair sharing.
Koordinator extends this with a multi-hierarchy model. A parent quota can have child quotas, and the sum of the children's min must be less than or equal to the parent's min. Each quota also carries a shared weight (default = max) that determines its share when quotas compete for resources. Example hierarchy:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: parentA
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 100
    memory: 200Gi
  min:
    cpu: 100
    memory: 200Gi
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: childA1
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parentA"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 40
    memory: 100Gi
  min:
    cpu: 20
    memory: 40Gi
Pods can be linked to a specific quota via a label:
apiVersion: v1
kind: Pod
metadata:
  labels:
    quota.scheduling.koordinator.sh/name: "elastic-quota-a"
  name: demo-pod
  namespace: default
spec: ...
Fairness is enforced by a shared-weight mechanism that allocates runtime resources between min and max. When total cluster capacity drops below the sum of min, the system proportionally scales down each quota's min to fit the available resources.
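The two rules in the paragraph above can be sketched for a single resource dimension. This is an illustration under stated assumptions rather than Koordinator's algorithm: the function names, the dictionary layout, and the use of a weighted water-filling loop are hypothetical, and each quota's "request" stands for what its pods currently ask for.

```python
def scale_min(mins, capacity):
    """Shrink every min proportionally when capacity < sum of mins."""
    total = sum(mins.values())
    if total <= capacity:
        return dict(mins)
    return {name: m * capacity / total for name, m in mins.items()}


def fair_runtime(quotas, capacity, eps=1e-9):
    """Weighted water-filling between min and max (illustrative only).

    quotas maps a name to {"min", "max", "request"} plus an optional
    "weight"; a missing weight defaults to max, as the text describes.
    """
    def limit(q):
        return min(q["request"], q["max"])

    def weight(q):
        return q.get("weight", q["max"])

    # Step 1: every quota first receives its guaranteed share.
    runtime = {n: min(q["min"], limit(q)) for n, q in quotas.items()}
    remaining = capacity - sum(runtime.values())
    active = {n for n, q in quotas.items() if limit(q) - runtime[n] > eps}

    # Step 2: hand out what is left in proportion to the weights,
    # re-sharing whatever a saturated quota could not use.
    while remaining > eps and active:
        total_weight = sum(weight(quotas[n]) for n in active)
        if total_weight <= 0:
            break
        granted = 0.0
        for n in sorted(active):
            q = quotas[n]
            share = remaining * weight(q) / total_weight
            grant = min(share, limit(q) - runtime[n])
            runtime[n] += grant
            granted += grant
            if limit(q) - runtime[n] <= eps:
                active.discard(n)
        remaining -= granted
    return runtime
```

With two quotas A (min 20, max 60, request 60) and B (min 20, max 40, request 30) on a capacity of 60, the sketch first grants each its min and then splits the remaining 20 by weight 60:40, giving A 32 and B 28.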
Fine‑Grained Device (GPU) Scheduling
Koordinator keeps compatibility with the standard nvidia.com/gpu resource but adds extended resources:
kubernetes.io/gpu-core – abstracted compute capacity (100 = one full GPU).
kubernetes.io/gpu-memory – memory in bytes.
kubernetes.io/gpu-memory-ratio – percentage of GPU memory.
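How these resource names relate to one another can be shown with a small normalization helper. This is a sketch under assumptions: the function name is hypothetical, memory is passed as an integer number of bytes rather than a Kubernetes quantity string, and mapping nvidia.com/gpu: N to N full GPUs is an assumed reading of "keeps compatibility", not a confirmed rule.

```python
FULL_GPU = 100  # per the text: gpu-core 100 equals one full GPU


def normalize_gpu_request(requests, device_memory_bytes):
    """Express a GPU request as core units plus memory in bytes.

    requests is a plain dict of resource name -> integer value.
    """
    if "nvidia.com/gpu" in requests:
        # Assumption: N whole GPUs means N x full compute and full memory.
        count = int(requests["nvidia.com/gpu"])
        return {"core": FULL_GPU * count,
                "memory": device_memory_bytes * count}

    core = int(requests.get("kubernetes.io/gpu-core", 0))
    if "kubernetes.io/gpu-memory" in requests:
        memory = int(requests["kubernetes.io/gpu-memory"])
    else:
        # gpu-memory-ratio is a percentage of one device's memory.
        ratio = int(requests.get("kubernetes.io/gpu-memory-ratio", 0))
        memory = device_memory_bytes * ratio // 100
    return {"core": core, "memory": memory}
```

For a 16 GiB device, a request of gpu-core 50 with gpu-memory-ratio 50 normalizes to 50 core units and 8 GiB.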
Example pod requesting a full GPU:
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"
      requests:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"
Requesting half a GPU:
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
      requests:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
Each node reports GPU devices through a Device CRD (minor ID, UUID, core, memory). The scheduler's DeviceShare plugin consumes this CRD, converts pod requests to the internal protocol, and records the selected device IDs in pod annotations during the PreBind phase. Future versions will add bin-packing and spread strategies.
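The selection step can be pictured as follows. This is a rough sketch, not the DeviceShare plugin: the first-fit policy, the device dictionary layout, the function names, and the annotation key example.com/device-allocated are all hypothetical placeholders.

```python
import json

# Hypothetical key; the real annotation name is not given in the text.
ANNOTATION_KEY = "example.com/device-allocated"


def pick_device(devices, core, memory):
    """First-fit: return the minor ID of the first device with room.

    devices is a list of dicts with "minor", "free_core", "free_memory".
    The chosen device's free amounts are reduced in place.
    """
    for dev in devices:
        if dev["free_core"] >= core and dev["free_memory"] >= memory:
            dev["free_core"] -= core
            dev["free_memory"] -= memory
            return dev["minor"]
    return None  # no single device can satisfy the request


def allocation_annotation(minor, core, memory):
    """Build the annotation that would record the chosen device."""
    payload = [{"minor": minor, "core": core, "memory": memory}]
    return {ANNOTATION_KEY: json.dumps(payload)}
```

First-fit is used here only because it is the simplest policy to read; a bin-packing or spread strategy would change nothing but the order in which devices are considered.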
Scheduler Diagnosis and Debugging
Koordinator provides a RESTful API to aid debugging. Users can raise the kube‑scheduler log level via:
$ curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5'
Score debugging can be enabled with:
$ curl -X PUT schedulerLeaderIP:10251/debug/flags/s --data '100'
Plugin internal state can be queried, e.g. CPU topology:
$ curl schedulerLeaderIP:10252/apis/v1/plugins/NodeNUMAResources/cpuTopologyOptions/node-1
The endpoint /apis/v1/__services__ lists all supported plugin APIs.
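The same calls can be scripted, for example from a debugging tool. The sketch below only builds the requests using the paths and ports from the curl commands above; the helper names are hypothetical, and plain HTTP is an assumption.

```python
from urllib.request import Request


def flag_request(host, flag, value, port=10251):
    """Build the PUT request for /debug/flags/<flag>."""
    url = f"http://{host}:{port}/debug/flags/{flag}"
    return Request(url, data=str(value).encode(), method="PUT")


def plugin_url(host, plugin, path, port=10252):
    """Build the URL of a plugin's diagnostic endpoint."""
    return f"http://{host}:{port}/apis/v1/plugins/{plugin}/{path}"


# Equivalent of: curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5'
verbosity = flag_request("schedulerLeaderIP", "v", 5)

# Equivalent of the CPU topology query for node-1.
topology = plugin_url("schedulerLeaderIP", "NodeNUMAResources",
                      "cpuTopologyOptions/node-1")
```

Passing verbosity to urllib.request.urlopen would send the request; that step is left out so the sketch runs without a live scheduler.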
Safer Descheduling
The koord-descheduler in v0.7 adds eviction rate limiting, namespace‑level gray‑scale control, node/namespace eviction caps, and workload awareness (Deployment, StatefulSet, Kruise CloneSet, AdvancedStatefulSet). Future work will improve fairness to avoid repeated evictions of the same workload.
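How eviction caps and namespace gray-scale control might combine can be sketched with a small gatekeeper. This is a hypothetical illustration, not koord-descheduler code: the class name, its parameters, and the simple counting scheme are assumptions.

```python
from collections import Counter


class EvictionLimiter:
    """Toy gatekeeper: allow an eviction only if every limit permits it."""

    def __init__(self, max_per_node, max_per_namespace,
                 enabled_namespaces=None):
        self.max_per_node = max_per_node
        self.max_per_namespace = max_per_namespace
        # Gray-scale control: None means every namespace is eligible.
        self.enabled_namespaces = enabled_namespaces
        self.per_node = Counter()
        self.per_namespace = Counter()

    def try_evict(self, node, namespace):
        """Return True and record the eviction if it is allowed."""
        if (self.enabled_namespaces is not None
                and namespace not in self.enabled_namespaces):
            return False
        if self.per_node[node] >= self.max_per_node:
            return False
        if self.per_namespace[namespace] >= self.max_per_namespace:
            return False
        self.per_node[node] += 1
        self.per_namespace[namespace] += 1
        return True
```

A real rate limiter would also reset or decay these counters over time; the sketch counts within a single window to keep the idea visible.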
Other Improvements
CPU fine-grained scheduling is now fully compatible with the kubelet static CPU manager policy (v1.18-v1.22).
Reservation objects support AllocateOnce semantics for single‑use reservations.
Batch resource declarations now allow limit > request, simplifying burstable workloads in mixed‑mode clusters.
Alibaba Cloud Native
