
How Volcano Engine’s New GPU Sharing Scheduler Boosts AI Workloads by 500%

This article explains Volcano Engine's next‑generation GPU sharing scheduling technology, detailing the two‑layer scheduler, card‑level Binpack/Spread strategies, system architecture, API definitions, and optimization algorithms that together increase GPU deployment density by more than 500% and improve utilization by more than 50% for AI workloads.

ByteDance Cloud Native

In the AI era, deploying large models requires infrastructure that can deliver massive AI compute. Modern cloud‑native platforms therefore need to manage heterogeneous devices such as GPUs and RDMA, with fine‑grained, device‑level control.

Problem Analysis

Native Kubernetes only supports whole‑GPU scheduling, which can waste expensive GPU resources in several scenarios:

AI inference often processes only a single input or a small batch at a time, so a dedicated GPU sits largely idle.

High‑performance computing may be CPU‑bound, leaving GPU utilization low.

Development environments (e.g., Jupyter notebooks) sometimes need only low‑spec machines.

CI/CD pipelines usually require limited GPU resources for test cases.

Existing GPU sharing solutions (time‑slicing, MPS, MIG) have limitations in memory and compute isolation, fault isolation, and flexibility.

Two‑Layer Scheduling

Volcano Engine VKE extends the Kubernetes Scheduling Framework with a custom GPUShare plugin that supports 1% compute granularity and 1 MiB memory granularity. This two‑layer scheduler first selects a suitable node, then assigns containers to specific GPU combinations on that node.

Card‑Level Binpack/Spread Strategy

The native scheduler offers node‑level Binpack (fill nodes to increase allocation rate) and Spread (distribute pods for fault isolation). After adding the second scheduling layer, GPU cards become a scheduling domain, requiring both node‑level and card‑level Binpack/Spread strategies to reduce fragmentation or improve fault isolation.
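The card‑level strategies can be sketched as a per‑card scoring rule. This is an illustrative sketch only; the function name and the 0–100 score range are assumptions, not VKE's actual implementation.

```python
def card_score(requested: int, free: int, capacity: int, strategy: str) -> float:
    """Score one GPU card for a container request on a single dimension.

    Binpack favors cards that end up fuller (less fragmentation);
    Spread favors cards that stay emptier (better fault isolation).
    """
    if requested > free:
        return -1.0  # infeasible: the card cannot host this request
    used_after = capacity - free + requested
    if strategy == "binpack":
        return 100.0 * used_after / capacity
    else:  # spread
        return 100.0 * (capacity - used_after) / capacity

# A 30%-compute request: Binpack rewards a card already 50% used,
# while Spread rewards a completely idle card.
binpack = card_score(30, 50, 100, "binpack")   # 80.0
spread = card_score(30, 100, 100, "spread")    # 70.0
```

The same rule applies at the node level, with node totals in place of card totals.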

System Architecture

The overall mGPU architecture consists of the following components:

[Figure: mGPU architecture diagram]

Scheduler: Central scheduler built on the Scheduling Framework with the GPUShare plugin. It (1) schedules Pods to appropriate nodes and (2) schedules each container to a suitable GPU combination, recording results in Pod annotations.

mGPU Device Plugin: Manages mGPU resources on each node. It (1) publishes mGPU resources to the node object and (2) injects environment variables into containers based on the scheduler’s allocation.

API Definition

Nodes report available mGPU resources as extended resources, with separate dimensions for compute and memory. Example Node object:

<code>apiVersion: v1
kind: Node
metadata:
  name: 10.xx.yy.zz
spec:
  ...
status:
  allocatable:
    vke.volcengine.com/mgpu-core: "400"   # compute, percent
    vke.volcengine.com/mgpu-memory: "130040"   # memory, MiB
  capacity:
    vke.volcengine.com/mgpu-core: "400"
    vke.volcengine.com/mgpu-memory: "130040"
  ...</code>

Pods request mGPU resources in .spec.containers[i].resources. Example Pod requesting 30% compute and 1 GiB memory:

<code>apiVersion: v1
kind: Pod
metadata:
  name: test-mgpu
  namespace: default
spec:
  containers:
  - name: app
    resources:
      limits:
        vke.volcengine.com/mgpu-core: "30"
        vke.volcengine.com/mgpu-memory: "1024"
      requests:
        vke.volcengine.com/mgpu-core: "30"
        vke.volcengine.com/mgpu-memory: "1024"
  ...</code>

After successful scheduling, the results are stored in Pod annotations; for example, the container app is assigned to GPU index 3 on node 10.xx.yy.zz.
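For illustration, such a scheduling result recorded in annotations might look like the fragment below. The annotation key is a hypothetical placeholder; the article does not show VKE's actual key names.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-mgpu
  annotations:
    # Hypothetical key for illustration only; the real VKE key name may differ.
    vke.volcengine.com/gpu-index-assignment: '{"app": [3]}'
```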

Scheduling Algorithm

The problem is formulated as an optimization problem. The scheduler evaluates each possible GPU combination on a node, applying both node‑level and card‑level Binpack/Spread strategies.

Objective Function

Score = 0.7 × memory‑dimension score + 0.3 × compute‑dimension score. The memory dimension carries the higher weight because memory, unlike compute, cannot be compressed.
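The weighted objective from the article can be written directly in code. Only the 0.7/0.3 weights come from the article; the function name and the example per‑dimension scores are illustrative assumptions.

```python
def combination_score(mem_score: float, core_score: float) -> float:
    """Combine per-dimension scores; memory is weighted higher (0.7)
    because it cannot be compressed."""
    return 0.7 * mem_score + 0.3 * core_score

# Example: a combination scoring 90 on memory and 60 on compute.
score = combination_score(90.0, 60.0)   # 0.7*90 + 0.3*60 = 81.0
```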

Constraints

All GPUs in a combination must reside on the same node.

The combination must satisfy each container’s compute and memory requests.

Other scheduling constraints are applied after the optimal node is selected.

Search Algorithm

A depth‑first search (DFS) with backtracking and pruning explores all feasible GPU combinations. The search tree’s depth equals the number of containers; each level represents assigning a container to a GPU. Pruning occurs when a partial assignment violates resource constraints. When a leaf node is reached, the combination is scored, and the best‑scoring combination is retained.

Example: a Pod with three containers on a node with three GPUs. The DFS explores candidate assignments, prunes infeasible paths, and finally selects the highest‑scoring GPU set.
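The search described above can be sketched as follows. This is a minimal illustration, not VKE's code: the data shapes, function names, and the toy scoring callback are all assumptions; only the DFS‑with‑backtracking‑and‑pruning structure comes from the article.

```python
from typing import Callable, List, Optional, Tuple

def best_assignment(
    requests: List[Tuple[int, int]],   # per-container (core %, memory MiB)
    free: List[Tuple[int, int]],       # per-GPU free (core %, memory MiB)
    score: Callable[[List[int], List[Tuple[int, int]]], float],
) -> Optional[List[int]]:
    """Return the GPU index chosen for each container that maximizes
    `score`, or None if no feasible combination exists."""
    best: Tuple[float, Optional[List[int]]] = (float("-inf"), None)
    assignment: List[int] = []

    def dfs(depth: int) -> None:
        nonlocal best
        if depth == len(requests):          # leaf: score the full combination
            s = score(assignment, free)
            if s > best[0]:
                best = (s, assignment.copy())
            return
        core, mem = requests[depth]
        for gpu, (fcore, fmem) in enumerate(free):
            if core > fcore or mem > fmem:
                continue                    # prune: violates resource constraints
            free[gpu] = (fcore - core, fmem - mem)
            assignment.append(gpu)
            dfs(depth + 1)
            assignment.pop()                # backtrack
            free[gpu] = (fcore, fmem)

    dfs(0)
    return best[1]

# Three containers, three GPUs; toy score prefers lower GPU indices.
reqs = [(30, 1024), (50, 2048), (100, 8192)]
gpus = [(100, 16384), (100, 16384), (60, 4096)]
plan = best_assignment(reqs, gpus, lambda a, f: -sum(a))   # [0, 0, 1]
```

The depth of the recursion equals the number of containers, matching the search‑tree description above; in practice the score callback would apply the Binpack/Spread objective rather than this toy preference.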

Summary and Outlook

The GPU sharing scheduler and mGPU virtualization are now available in Volcano Engine’s VKE service. Real‑world tests show a GPU deployment density increase of over 500% and a utilization improvement exceeding 50%.

VKE currently supports scheduling a single container across multiple GPUs and integrates with major batch schedulers. Future work includes GPU topology‑aware scheduling, mixed‑workload placement, and further enhancements to boost AI model training efficiency.

Tags: cloud-native, Kubernetes, resource optimization, GPU scheduling, mGPU
Written by ByteDance Cloud Native, sharing ByteDance's cloud-native technologies, technical practices, and developer events.
