Understanding the Kubernetes Scheduler: Architecture, Scoring Phases, and Custom Plugin Development
This article explains the inner workings of the Kubernetes scheduler, details its pre‑score, score, and normalize‑score phases, discusses zone handling and resource fragmentation, and demonstrates how to extend the scheduler using the Scheduling Framework with a custom Go plugin.
The Kubernetes scheduler acts as the brain that decides where each pod runs. Its output is deceptively simple, just filling in pod.spec.nodeName, but behind that assignment sit algorithms that weigh each workload's needs to choose the most suitable node.
It operates through two main control loops: the first watches the API server for unscheduled pods and enqueues them; the second dequeues a pod, applies Predicates to filter out infeasible nodes, runs Priorities to score the remaining nodes from 0 to 100, and finally binds the pod by setting its nodeName.
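The filter-then-score cycle can be sketched in a few lines of Go. The node, pod, and schedule names below are illustrative stand-ins, not the scheduler's real types, and the scoring rule is a toy substitute for the real priority functions:

```go
package main

import "fmt"

// Illustrative stand-ins for the scheduler's NodeInfo and Pod types.
type node struct {
    name          string
    allocatableMi int64 // free memory in MiB
}

type pod struct {
    requestMi int64
}

// schedule mirrors the two-phase cycle: filter out infeasible nodes
// (Predicates), then score the survivors 0-100 (Priorities) and pick
// the highest-scoring one.
func schedule(p pod, nodes []node) (string, bool) {
    best, bestScore := "", int64(-1)
    for _, n := range nodes {
        if n.allocatableMi < p.requestMi { // Filter: not enough memory
            continue
        }
        // Score: prefer nodes with more headroom after placement,
        // normalized to 0-100 (a toy stand-in for real priorities).
        score := (n.allocatableMi - p.requestMi) * 100 / n.allocatableMi
        if score > bestScore {
            best, bestScore = n.name, score
        }
    }
    return best, bestScore >= 0
}

func main() {
    nodes := []node{{"node-a", 512}, {"node-b", 4096}, {"node-c", 100}}
    name, ok := schedule(pod{requestMi: 256}, nodes)
    fmt.Println(name, ok) // node-b wins with the most headroom
}
```

In the real scheduler the chosen name is then written back to the API server in a bind operation rather than returned to the caller.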
To keep replicas of the same controller spread across nodes, the scheduler scores nodes in three stages. In the PreScore stage it gathers controller selector information and stores it in CycleState for later use.
func (pl *SelectorSpread) PreScore(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
    if skipSelectorSpread(pod) {
        return nil
    }
    selector := helper.DefaultSelector(
        pod,
        pl.services,
        pl.replicationControllers,
        pl.replicaSets,
        pl.statefulSets,
    )
    state := &preScoreState{selector: selector}
    cycleState.Write(preScoreStateKey, state)
    return nil
}

During the Score stage, the plugin counts the pods on each node that match the selector; each matching pod adds one point, and the total becomes the node's raw score.
func (pl *SelectorSpread) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    if skipSelectorSpread(pod) {
        return 0, nil
    }
    c, err := state.Read(preScoreStateKey)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("Error reading %q from cycleState: %v", preScoreStateKey, err))
    }
    s, ok := c.(*preScoreState)
    if !ok {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("%+v convert to preScoreState error", c))
    }
    nodeInfo, err := pl.sharedLister.NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
    }
    count := countMatchingPods(pod.Namespace, s.selector, nodeInfo)
    return int64(count), nil
}

The NormalizeScore stage then rescales these raw counts to the final 0-100 range, also taking zone information into account to spread pods across failure domains.
func (pl *SelectorSpread) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {
    if skipSelectorSpread(pod) {
        return nil
    }
    // Aggregate the raw scores per zone and find the per-node and
    // per-zone maxima.
    countsByZone := make(map[string]int64, 10)
    var maxCountByZone int64
    var maxCountByNodeName int64
    for i := range scores {
        if scores[i].Score > maxCountByNodeName {
            maxCountByNodeName = scores[i].Score
        }
        nodeInfo, err := pl.sharedLister.NodeInfos().Get(scores[i].Name)
        if err != nil {
            return framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", scores[i].Name, err))
        }
        zoneID := utilnode.GetZoneKey(nodeInfo.Node())
        if zoneID == "" {
            continue
        }
        countsByZone[zoneID] += scores[i].Score
    }
    for _, v := range countsByZone {
        if v > maxCountByZone {
            maxCountByZone = v
        }
    }
    haveZones := len(countsByZone) != 0
    maxNodeScoreF := float64(framework.MaxNodeScore)
    for i := range scores {
        // Invert the raw count: fewer matching pods means a higher score.
        fScore := maxNodeScoreF
        if maxCountByNodeName > 0 {
            fScore = maxNodeScoreF * (float64(maxCountByNodeName-scores[i].Score) / float64(maxCountByNodeName))
        }
        if haveZones {
            nodeInfo, err := pl.sharedLister.NodeInfos().Get(scores[i].Name)
            if err != nil {
                return framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", scores[i].Name, err))
            }
            zoneID := utilnode.GetZoneKey(nodeInfo.Node())
            if zoneID != "" {
                // Blend the node score with a zone score so pods also
                // spread across failure domains, weighted by zoneWeighting.
                zoneScore := maxNodeScoreF
                if maxCountByZone > 0 {
                    zoneScore = maxNodeScoreF * (float64(maxCountByZone-countsByZone[zoneID]) / float64(maxCountByZone))
                }
                fScore = fScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore
            }
        }
        scores[i].Score = int64(fScore)
    }
    return nil
}

A zone is defined by node labels such as failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region; GetZoneKey builds a composite key from these labels, falling back to their stable topology.kubernetes.io equivalents.
func GetZoneKey(node *v1.Node) string {
    labels := node.Labels
    if labels == nil {
        return ""
    }
    zone, ok := labels[v1.LabelZoneFailureDomain]
    if !ok {
        zone, _ = labels[v1.LabelZoneFailureDomainStable]
    }
    region, ok := labels[v1.LabelZoneRegion]
    if !ok {
        region, _ = labels[v1.LabelZoneRegionStable]
    }
    if region == "" && zone == "" {
        return ""
    }
    return region + ":\x00:" + zone
}

Resource fragmentation occurs when many nodes are each left with small slivers of free capacity, none large enough to satisfy a bigger pod request, even though the cluster's total free capacity would suffice. To mitigate this, the article suggests a bin-packing strategy: fill one node as completely as possible before scheduling onto the next, so that large contiguous chunks of allocatable resources remain available elsewhere.
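The "fill one node first" idea corresponds to a most-allocated scoring strategy (upstream kube-scheduler ships a NodeResourcesMostAllocated score plugin along these lines). A minimal sketch of such a scoring rule, with hypothetical parameter names:

```go
package main

import "fmt"

// mostAllocatedScore favors nodes that would be more heavily utilized
// after placement, so pods pack onto fewer nodes and large contiguous
// chunks of allocatable resources survive on the rest. This is a sketch
// of the bin-packing strategy, not the upstream implementation.
func mostAllocatedScore(requestedMi, allocatableMi int64) int64 {
    if allocatableMi == 0 || requestedMi > allocatableMi {
        return 0 // infeasible or degenerate: lowest score
    }
    // Higher post-placement utilization yields a higher score (0-100).
    return requestedMi * 100 / allocatableMi
}

func main() {
    // requestedMi includes the incoming pod plus pods already on the node.
    fmt.Println(mostAllocatedScore(3000, 4000)) // busy node: 75
    fmt.Println(mostAllocatedScore(1000, 4000)) // idle node: 25
}
```

Note this is the opposite of the default spreading behavior: least-allocated scoring spreads load for availability, while most-allocated packing reduces fragmentation, and the right choice depends on the workload.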
To extend the scheduler, the modern approach is to use the Scheduling Framework, which provides twelve extension points: QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind, and Unreserve, with NormalizeScore available as an extension of Score.
The article walks through a minimal demo plugin written in Go for kube‑scheduler 1.19, showing the implementation of the Name , Score , and ScoreExtensions methods, registration of the plugin in the scheduler command, and a sample configuration file that enables the plugin in the score extension point.
package demo

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"
)

type demoPlugin struct{}

// NewDemoPlugin is the plugin factory; it ignores its configuration and
// the framework handle because the plugin keeps no state.
func NewDemoPlugin(_ runtime.Object, _ framework.FrameworkHandle) (framework.Plugin, error) {
    return &demoPlugin{}, nil
}

func (d *demoPlugin) Name() string { return "demo-plugin" }

// Score gives every node the same score; a real plugin would rank
// nodes here.
func (d *demoPlugin) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    return 100, nil
}

// ScoreExtensions returns nil because this plugin does not implement
// NormalizeScore.
func (d *demoPlugin) ScoreExtensions() framework.ScoreExtensions { return nil }

The plugin is then registered in a main package that wraps the upstream kube-scheduler command:

package main
import (
    "math/rand"
    "os"
    "time"

    "k8s.io/component-base/logs"
    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    "git.cai-inc.com/devops/zcy-scheduler/pkg/demo"
)

func main() {
    rand.Seed(time.Now().UnixNano())
    // Register the demo plugin under its own name alongside the
    // in-tree plugins.
    command := app.NewSchedulerCommand(
        app.WithPlugin("demo-plugin", demo.NewDemoPlugin),
    )
    logs.InitLogs()
    defer logs.FlushLogs()
    if err := command.Execute(); err != nil {
        os.Exit(1)
    }
}

The sample configuration file enables the plugin in the score extension point under a dedicated profile:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: "/Users/zcy/newgo/zcy-scheduler/scheduler.conf"
profiles:
- schedulerName: zcy-scheduler
  plugins:
    score:
      enabled:
      - name: demo-plugin
      disabled:
      - name: "*"

The article concludes that understanding the scheduler's two-loop architecture, its mechanisms for spreading pod replicas and handling resource fragmentation, and mastering the Scheduling Framework are essential for building custom scheduling solutions in Kubernetes.
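As a usage note, a workload opts into the custom scheduler by naming the profile in its spec.schedulerName; a minimal pod manifest (pod name and image are illustrative) might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  schedulerName: zcy-scheduler  # must match the profile's schedulerName
  containers:
  - name: app
    image: nginx:1.19
```

Pods that omit schedulerName continue to be placed by the default scheduler.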
References: custom‑kube‑scheduler tutorial, scheduler‑plugins repository, and a related article on the Geekbang platform.
ZCY Technology
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.