What’s New in Koordinator v0.7? Enhanced Coscheduling, ElasticQuota, and Fine‑Grained GPU Sharing

Koordinator v0.7 adds major cloud‑native scheduling features—including enhanced gang (coscheduling) with Strict/NonStrict modes, multi‑hierarchy ElasticQuota management, fine‑grained GPU resource protocols, richer diagnostic APIs, and safer descheduling—targeting machine‑learning and big‑data workloads on Kubernetes.

Alibaba Cloud Native

Oct 10, 2022

Version Highlights

Koordinator v0.7 introduces significant scheduling capabilities for machine‑learning and big‑data scenarios, focusing on enhanced coscheduling, ElasticQuota with multi‑hierarchy management, fine‑grained GPU sharing, richer diagnostic APIs, and safer descheduling.

Enhanced Coscheduling

Gang scheduling (coscheduling) ensures that all related Pods of a job start together, with All-or-Nothing semantics. The new implementation adds two modes. Strict, the default, rejects the Pods that are already waiting when a member fails to schedule. NonStrict lets scheduling be retried until the MinMember requirement is met. Users enable NonStrict by setting the annotation gang.scheduling.koordinator.sh/mode to NonStrict on a PodGroup or Pod.

Failure handling is improved with a ScheduleCycle-based retry mechanism: a monotonically increasing counter is incremented on each retry, which avoids the bugs that arise when retries are tracked with timestamps that may be stale.
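
The two modes and the retry counter can be illustrated with a small Python model. This is a minimal sketch under stated assumptions, not Koordinator's implementation: the Gang class, its field names, and the start_retry method are hypothetical, and real scheduling involves far more state.

```python
from dataclasses import dataclass, field


@dataclass
class Gang:
    """Simplified, hypothetical model of a gang's retry bookkeeping."""
    name: str
    min_member: int
    mode: str = "Strict"          # "Strict" or "NonStrict"
    schedule_cycle: int = 1       # monotonic counter, bumped on each retry
    cycle_valid: bool = True      # False once a member fails in this cycle
    waiting: set = field(default_factory=set)  # pods that found a node and are held

    def on_pod_passed(self, pod: str) -> bool:
        """Record a pod that found a node; return True when the gang may start."""
        if self.mode == "Strict" and not self.cycle_valid:
            return False          # a sibling already failed in this cycle
        self.waiting.add(pod)
        return len(self.waiting) >= self.min_member

    def on_pod_failed(self, pod: str) -> list:
        """Handle a scheduling failure; return the waiting pods that get rejected."""
        self.waiting.discard(pod)
        if self.mode == "NonStrict":
            return []             # waiting pods stay; the failed pod can retry
        rejected = sorted(self.waiting)
        self.waiting.clear()
        self.cycle_valid = False
        return rejected

    def start_retry(self) -> int:
        """Begin a new attempt by advancing the counter rather than comparing times."""
        self.schedule_cycle += 1
        self.cycle_valid = True
        return self.schedule_cycle
```

In this model a Strict gang that loses one member throws away its progress and waits for the next cycle, while a NonStrict gang keeps the Pods that already found a node.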

Multiple PodGroups can be coordinated to achieve All-or-Nothing across roles by annotating them, e.g.:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: podGroupA
  namespace: default
  annotations:
    # Annotation values must be strings, so the list is quoted.
    gang.scheduling.koordinator.sh/groups: '["namespaceA/podGroupA", "namespaceB/podGroupB"]'
spec: ...

A lightweight gang protocol allows users to skip PodGroup creation by annotating individual Pods:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    gang.scheduling.koordinator.sh/name: "pod-group-a"
    gang.scheduling.koordinator.sh/min-available: "5"
  name: demo-pod
  namespace: default
spec: ...
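
To make the annotation protocol concrete, the following Python sketch groups parsed Pod manifests by the gang annotations shown above and reports which gangs already have enough members. The function name ready_gangs and the fallback of 1 for a missing min-available annotation are assumptions of this example, not documented Koordinator behavior.

```python
from collections import defaultdict

GANG_NAME = "gang.scheduling.koordinator.sh/name"
GANG_MIN = "gang.scheduling.koordinator.sh/min-available"


def ready_gangs(pods):
    """Group pod manifests by gang annotation and check the member count.

    `pods` is a list of dicts shaped like parsed Pod manifests.
    Returns {(namespace, gang_name): bool}.
    """
    members = defaultdict(int)
    required = {}
    for pod in pods:
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        if GANG_NAME not in annotations:
            continue  # this pod does not belong to any gang
        key = (meta.get("namespace", "default"), annotations[GANG_NAME])
        members[key] += 1
        # Annotation values are strings, so min-available has to be parsed.
        # Falling back to "1" is an assumption made for this sketch.
        wanted = int(annotations.get(GANG_MIN, "1"))
        required[key] = max(required.get(key, 0), wanted)
    return {key: members[key] >= required[key] for key in members}
```

Pods without the name annotation are ignored, which mirrors the idea that the lightweight protocol is opt-in per Pod.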

ElasticQuota Scheduling

Traditional ResourceQuota limits resources per Namespace but can lead to low utilization. ElasticQuota adds max (upper bound) and min (guaranteed amount) fields, enabling borrowing and fair sharing.

Koordinator extends this with a multi-hierarchy model. A parent quota can have child quotas, and the sum of the children's min must not exceed the parent's min. A shared weight, which defaults to the quota's max, drives fairness. Example hierarchy:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: parentA
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 100
    memory: 200Gi
  min:
    cpu: 100
    memory: 200Gi
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: childA1
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parentA"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 40
    memory: 100Gi
  min:
    cpu: 20
    memory: 40Gi
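
The constraint between parent and child min values can be checked mechanically. The sketch below is a hypothetical validator over parsed manifests like the hierarchy above; it only looks at integer CPU values, and it flags a parent only when its children's total is larger than its own min (whether an exactly equal total is accepted is an assumption here).

```python
PARENT_LABEL = "quota.scheduling.koordinator.sh/parent"


def cpu_min_violations(quotas):
    """Return names of parents whose children's min.cpu adds up to more than their own.

    `quotas` is a list of dicts shaped like parsed ElasticQuota manifests.
    Only integer CPU values are handled to keep the sketch short.
    """
    by_name = {q["metadata"]["name"]: q for q in quotas}
    child_sum = {}
    for q in quotas:
        parent = q["metadata"].get("labels", {}).get(PARENT_LABEL)
        if parent is not None:
            child_sum[parent] = child_sum.get(parent, 0) + q["spec"]["min"]["cpu"]
    return sorted(
        name
        for name, total in child_sum.items()
        if name in by_name and total > by_name[name]["spec"]["min"]["cpu"]
    )
```

With the example hierarchy, childA1 contributes 20 CPUs against parentA's min of 100, so the function returns an empty list.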

Pods can be linked to a specific quota via a label:

apiVersion: v1
kind: Pod
metadata:
  labels:
    quota.scheduling.koordinator.sh/name: "elastic-quota-a"
  name: demo-pod
  namespace: default
spec: ...

Fairness is enforced by a shared‑weight mechanism that allocates runtime resources between min and max. When total cluster capacity drops below the sum of min, the system proportionally scales down each quota’s min to fit the available resources.
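
Both behaviors can be sketched in a few lines of Python. This is a simplified single-resource model, not Koordinator's algorithm: the function names, the dictionary layout, and the choice to first grant each quota min(request, min) before sharing the remainder by weight are assumptions made for illustration. Weights are assumed to be positive.

```python
def scale_min(mins, capacity):
    """Shrink every quota's min by the same factor when capacity < sum(min)."""
    total = sum(mins.values())
    if total <= capacity:
        return dict(mins)
    factor = capacity / total
    return {name: value * factor for name, value in mins.items()}


def share_by_weight(quotas, capacity):
    """Compute a runtime amount per quota.

    `quotas` maps name -> {"min", "max", "request", "weight"}.
    Each quota first receives min(request, min); what is left is offered
    in proportion to weight, never going past min(request, max).
    """
    def cap(name):
        return min(quotas[name]["request"], quotas[name]["max"])

    runtime = {n: min(q["request"], q["min"]) for n, q in quotas.items()}
    left = capacity - sum(runtime.values())
    active = {n for n in quotas if runtime[n] < cap(n)}
    while left > 1e-9 and active:
        weight_sum = sum(quotas[n]["weight"] for n in active)
        granted = 0.0
        for n in sorted(active):
            offer = left * quotas[n]["weight"] / weight_sum
            take = min(offer, cap(n) - runtime[n])
            runtime[n] += take
            granted += take
        left -= granted
        # Quotas that reached their cap drop out; the rest share the next round.
        active = {n for n in active if runtime[n] < cap(n) - 1e-9}
    return runtime
```

Each round either hands out everything that is left or removes at least one saturated quota, so the loop terminates after at most as many rounds as there are quotas.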

Fine‑Grained Device (GPU) Scheduling

Koordinator keeps compatibility with the standard nvidia.com/gpu resource and adds three extended resources:

kubernetes.io/gpu-core: abstracted compute capacity, where 100 represents one full GPU.

kubernetes.io/gpu-memory: GPU memory in bytes.

kubernetes.io/gpu-memory-ratio: GPU memory as a percentage of the device's total memory.

Example pod requesting a full GPU:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"
      requests:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"

Requesting half a GPU:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
      requests:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"

Each node reports GPU devices through a Device CRD (minor ID, UUID, core, memory). The scheduler’s DeviceShare plugin consumes this CRD, converts pod requests to the internal protocol, and records the selected device IDs in pod annotations during the PreBind phase. Future versions will add bin‑packing and spread strategies.
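
The request conversion and device selection can be pictured with the following Python sketch. It is a toy model only: the function names are hypothetical, the mapping of one nvidia.com/gpu to 100 gpu-core and 100 percent of device memory is an assumption of this example, and first-fit by minor ID stands in for whatever strategy the DeviceShare plugin actually applies. Only requests that fit on a single device are handled.

```python
GIB = 1024 ** 3


def normalize(requests, device_memory):
    """Translate a container's GPU requests into (core, memory_bytes).

    Assumption of this sketch: one nvidia.com/gpu equals 100 gpu-core
    plus the full memory of one device.
    """
    if "nvidia.com/gpu" in requests:
        count = requests["nvidia.com/gpu"]
        return 100 * count, device_memory * count
    core = requests.get("kubernetes.io/gpu-core", 0)
    if "kubernetes.io/gpu-memory" in requests:
        return core, requests["kubernetes.io/gpu-memory"]
    ratio = requests.get("kubernetes.io/gpu-memory-ratio", 0)
    return core, device_memory * ratio // 100


def pick_device(devices, core, memory):
    """Return the minor ID of the first device with enough free capacity, or None.

    `devices` is a list of dicts holding free capacity per GPU, loosely
    modelled on the Device CRD fields named in the text (minor, core, memory).
    The chosen device's free capacity is reduced in place.
    """
    for dev in sorted(devices, key=lambda d: d["minor"]):
        if dev["core"] >= core and dev["memory"] >= memory:
            dev["core"] -= core
            dev["memory"] -= memory
            return dev["minor"]
    return None
```

The returned minor ID plays the role of the value that the text says is written to the Pod's annotations during PreBind.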

Scheduler Diagnosis and Debugging

Koordinator provides a RESTful API to aid debugging. Users can raise the kube‑scheduler log level via:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5'

Score debugging can be enabled with:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/s --data '100'

Plugin internal state can be queried, e.g. CPU topology:

$ curl schedulerLeaderIP:10252/apis/v1/plugins/NodeNUMAResources/cpuTopologyOptions/node-1

The endpoint /apis/v1/__services__ lists all supported plugin APIs.
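
For scripted use, the same calls can be assembled in Python with the standard library. The helper names below are hypothetical, the ports are simply the ones used in the curl examples above, and the plain http scheme is an assumption (curl falls back to it when no scheme is given). The functions only build the request or URL; nothing is sent until the caller opens it.

```python
import urllib.request


def flag_request(host, flag, value, port=10251):
    """Build (but do not send) the PUT request that sets a debug flag."""
    url = f"http://{host}:{port}/debug/flags/{flag}"
    return urllib.request.Request(url, data=str(value).encode(), method="PUT")


def plugin_url(host, plugin, path, port=10252):
    """Build the URL of a plugin's diagnostic endpoint."""
    return f"http://{host}:{port}/apis/v1/plugins/{plugin}/{path}"
```

A caller would then run something like urllib.request.urlopen(flag_request(leader_ip, "v", 5), timeout=5), where leader_ip is the address of the scheduler leader.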

Safer Descheduling

The koord-descheduler in v0.7 adds eviction rate limiting, namespace‑level gray‑scale control, node/namespace eviction caps, and workload awareness (Deployment, StatefulSet, Kruise CloneSet, AdvancedStatefulSet). Future work will improve fairness to avoid repeated evictions of the same workload.
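
The idea behind eviction caps and namespace gating can be shown with a small Python guard. This is an illustrative sketch, not koord-descheduler code: the class, its parameters, and the counting scheme are hypothetical, and time-based rate limiting and workload awareness are left out.

```python
from collections import Counter


class EvictionLimiter:
    """Hypothetical guard that caps evictions per node and per namespace."""

    def __init__(self, max_per_node, max_per_namespace, allowed_namespaces=None):
        self.max_per_node = max_per_node
        self.max_per_namespace = max_per_namespace
        # None means every namespace is eligible; a set restricts the rollout.
        self.allowed = allowed_namespaces
        self.by_node = Counter()
        self.by_namespace = Counter()

    def try_evict(self, node, namespace):
        """Return True and record the eviction if no cap would be exceeded."""
        if self.allowed is not None and namespace not in self.allowed:
            return False
        if self.by_node[node] >= self.max_per_node:
            return False
        if self.by_namespace[namespace] >= self.max_per_namespace:
            return False
        self.by_node[node] += 1
        self.by_namespace[namespace] += 1
        return True
```

Restricting allowed_namespaces to a small set at first and widening it later is one way to picture the gray-scale control mentioned above.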

Other Improvements

Fine-grained CPU scheduling is now fully compatible with the kubelet static CPU manager policy (v1.18-v1.22).

Reservation objects support AllocateOnce semantics for single‑use reservations.

Batch resource declarations now allow limit > request, simplifying burstable workloads in mixed‑mode clusters.
