
A Formal TLA+ Model of the Kubernetes Scheduler

This article presents a concise yet detailed formal model of the Kubernetes Scheduler, describing its control‑loop logic, scheduling and preemption processes, feasibility filters, viability scoring, and binding objects using TLA+ specifications and illustrative code snippets.


Kubernetes is a container‑orchestration engine that runs containerised applications on a cluster of nodes. This article series aims to provide a deep understanding of Kubernetes, focusing on the Scheduler component.

The Kubernetes Scheduler watches the object store for unbound Pods, selects a suitable node, and updates the Pod’s Spec.NodeName. The Scheduler operates as a multi‑step optimizer: it first determines feasible placements (those satisfying all constraints) and then chooses viable placements (those with the highest score), guaranteeing a local optimum rather than a global one.
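The two‑phase filter‑then‑score shape can be sketched in a few lines of Python. This is a minimal illustration of the idea, not real scheduler code; `filters`, `score`, and the dict‑shaped pod/node records are assumptions of this sketch:

```python
def schedule(pod, nodes, filters, score):
    """Phase 1: keep nodes passing every predicate; phase 2: pick the best-scoring one."""
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible placement: the pod stays pending
    return max(feasible, key=lambda n: score(pod, n))

# Toy example: one filter (enough free CPU), score = CPU headroom after placement.
nodes = [{"name": "a", "free_cpu": 2}, {"name": "b", "free_cpu": 8}]
has_cpu = lambda pod, n: n["free_cpu"] >= pod["cpu"]
best = schedule({"cpu": 4}, nodes, [has_cpu], lambda pod, n: n["free_cpu"] - pod["cpu"])
```

Note that the `max` call realises the "local optimum" claim above: the choice is only optimal among currently feasible nodes for this one pod, not across all pods jointly.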

Formal Model (TLA+)

BoundTo(Pod, Node, Snapshot) ≝ ∧ Pod ∈ Snapshot ∧ Pod.Kind = "Pod" ∧ Node ∈ Snapshot ∧ Node.Kind = "Node" ∧ Pod.Spec.NodeName = Node.Name
Bound(Pod, Snapshot) ≝ ∃ Node ∈ Snapshot: BoundTo(Pod, Node, Snapshot)
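These two predicates translate almost mechanically into executable form. Here is a hypothetical Python rendering, with the snapshot modelled as a list of dicts whose field names mirror the spec:

```python
def bound_to(pod, node, snapshot):
    # Direct transcription of BoundTo: both objects are in the snapshot,
    # have the right kinds, and the pod names this node.
    return (pod in snapshot and pod["Kind"] == "Pod"
            and node in snapshot and node["Kind"] == "Node"
            and pod["Spec"].get("NodeName") == node["Name"])

def bound(pod, snapshot):
    # Bound: some node in the snapshot satisfies BoundTo.
    return any(bound_to(pod, o, snapshot) for o in snapshot if o["Kind"] == "Node")
```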

The Scheduler’s control loop can be expressed as:

Scheduler ≝ LET Unbound ≝ {Pod ∈ Objects : Pod.Kind = "Pod" ∧ ¬Bound(Pod, Objects)} IN
  ∃ Pod ∈ {Pod ∈ Unbound : ∀ Other ∈ Unbound : Other.Spec.Priority ≤ Pod.Spec.Priority}:
    CASE SchedulingEnabled(Pod) ⟶ Scheduling(Pod)
    [] PreemptionEnabled(Pod) ⟶ Preemption(Pod)
    [] OTHER ⟶ UNCHANGED(Objects)
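One step of that loop can be sketched as a plain function: pick an unbound Pod of maximal priority, then dispatch on the same three cases as the CASE expression. The callback parameters (`bound`, `scheduling_enabled`, and so on) are assumptions standing in for the definitions elsewhere in the spec:

```python
def scheduler_step(objects, bound, scheduling_enabled, scheduling,
                   preemption_enabled, preemption):
    # Unbound ≝ Pods in the snapshot with no node assigned.
    unbound = [o for o in objects if o["Kind"] == "Pod" and not bound(o, objects)]
    if not unbound:
        return objects
    # Choose a Pod whose priority is maximal among unbound Pods.
    pod = max(unbound, key=lambda p: p["Spec"]["Priority"])
    if scheduling_enabled(pod):
        return scheduling(pod)
    if preemption_enabled(pod):
        return preemption(pod)
    return objects  # OTHER ⟶ UNCHANGED(Objects)
```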

Scheduling Process

SchedulingEnabled(Pod) ≝ ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"} : Feasibility(Pod, Node, Objects)
Scheduling(Pod) ≝ LET Feasible ≝ {Node ∈ Objects : Node.Kind = "Node" ∧ Feasibility(Pod, Node, Objects)} IN
  ∃ Node ∈ Feasible :
    ∧ ∀ Other ∈ Feasible : Viability(Pod, Other, Objects) ≤ Viability(Pod, Node, Objects)
    ∧ Objects' = { IF Pod = Object THEN [Pod EXCEPT !["Spec"] = [Pod.Spec EXCEPT !["NodeName"] = Node.Name]] ELSE Object : Object ∈ Objects }
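The `Objects'` update above is a functional, EXCEPT‑style rewrite: only the chosen Pod's Spec.NodeName changes, and every other object is carried over untouched. A hypothetical Python analogue, building the next snapshot without mutating the old one:

```python
def bind(objects, pod, node_name):
    """Return a new snapshot where `pod` has Spec.NodeName set to `node_name`."""
    def updated(o):
        if o is pod:
            # Mirrors [Pod EXCEPT !["Spec"] = [Pod.Spec EXCEPT !["NodeName"] = ...]]
            return {**o, "Spec": {**o["Spec"], "NodeName": node_name}}
        return o
    return [updated(o) for o in objects]
```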

Preemption Process

PreemptionEnabled(Pod) ≝ ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"}:
  ∃ Pods ⊆ Jeopardy(Pod, Node, Objects): Feasibility(Pod, Node, Objects \ Pods)
Preemption(Pod) ≝
  LET Preemptable ≝ {Node ∈ Objects : Node.Kind = "Node" ∧ ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) : Feasibility(Pod, Node, Objects \ Pods)} IN
    ∃ Node ∈ Preemptable:
      ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) :
        ∧ Feasibility(Pod, Node, Objects \ Pods)
        ∧ ∀ OtherNode ∈ Preemptable:
            ∀ OtherPods ⊆ Jeopardy(Pod, OtherNode, Objects) :
              Feasibility(Pod, OtherNode, Objects \ OtherPods) ⟹ Casualty(Pods) ≤ Casualty(OtherPods)
        ∧ Objects' = (Objects \ Pods)
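The victim‑selection logic above — among all eviction sets that would make the pending Pod feasible, pick one of minimal casualty — can be sketched by brute force over subsets. Exhaustive enumeration is fine for a model, not for a production scheduler; `jeopardy`, `feasible_after`, and `casualty` are assumed callbacks standing in for the spec's definitions:

```python
from itertools import combinations

def cheapest_victims(pod, nodes, jeopardy, feasible_after, casualty):
    """Return a minimal-casualty set of Pods whose eviction makes `pod` feasible."""
    best = None
    for node in nodes:
        pods = jeopardy(pod, node)
        for r in range(len(pods) + 1):
            for victims in combinations(pods, r):
                if feasible_after(pod, node, victims):
                    if best is None or casualty(victims) < casualty(best):
                        best = victims
    return best  # None if no eviction set helps on any node
```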

Feasibility Filters are a conjunction of predicate functions such as:

Node schedulability and lifecycle phase

Resource requirements vs. allocatable resources

Node selector label matching

Taints and tolerations

Node and pod affinity/anti‑affinity

Schedulable(Pod, Node) ≝ ∧ Node.Spec.Unschedulable = FALSE ∧ Node.Status.Phase = "Running"
Resources(Pod, Node) ≝ ∧ 1 ≤ Node.Status.Allocatable["pods"] ∧ Max({Pod.Spec.InitContainer[i].Resources.Required["cpu"] : i ∈ DOMAIN Pod.Spec.InitContainer}) ≤ Node.Status.Allocatable["cpu"] ∧ ...
NodeSelector(Pod, Node) ≝ ∀ Label ∈ DOMAIN(Pod.Spec.NodeSelector): Label ∈ DOMAIN(Node.Labels) ∧ Pod.Spec.NodeSelector[Label] = Node.Labels[Label]
Match(Toleration, Taint) ≝ ∧ Toleration.Effect = Taint.Effect
  ∧ CASE Toleration.Operator = "Exists" ⟶ Toleration.Key = Taint.Key
       [] Toleration.Operator = "Equal" ⟶ Toleration.Key = Taint.Key ∧ Toleration.Value = Taint.Value
       [] OTHER ⟶ FALSE
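The toleration/taint predicate is the easiest filter to make executable: with operator "Exists" only the key must match, with "Equal" the key and value must both match, and the effect must agree in every case. A hypothetical Python rendering with dict‑shaped tolerations and taints:

```python
def match(toleration, taint):
    # The effect must agree regardless of operator.
    if toleration["Effect"] != taint["Effect"]:
        return False
    if toleration["Operator"] == "Exists":
        return toleration["Key"] == taint["Key"]
    if toleration["Operator"] == "Equal":
        return (toleration["Key"] == taint["Key"]
                and toleration["Value"] == taint["Value"])
    return False  # unknown operator: no match
```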

Viability Scoring aggregates weighted ratings from node selector matches, pod affinity, and anti‑affinity:

Viability(Pod, Node, Snapshot) ≝ Sum(<< ... >>)
Rating(Pod, Node) ≝ Sum(<<
  Sum(LAMBDA Term: Term.Weight, {NodeSelectorTerm ∈ Pod.Spec.Affinity.NodeAffinity.Preferred.NodeSelectorTerms : Match_NS(NodeSelectorTerm, Node)}),
  Sum(LAMBDA Term: Term.Weight, {PodAffinityTerm ∈ Pod.Spec.Affinity.PodAffinity.Preferred : P_Affinity(PodAffinityTerm, Node)}),
  Sum(LAMBDA Term: Term.Weight, {PodAffinityTerm ∈ Pod.Spec.Affinity.PodAntiAffinity.Preferred : ¬P_Affinity(PodAffinityTerm, Node)})
>>)
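The aggregation pattern is the same for all three term kinds: sum the weights of the preferred node‑selector terms the node matches, the matched pod‑affinity terms, and — note the negation — the *unmatched* anti‑affinity terms. A minimal Python sketch, with `matches` standing in for the Match_NS/P_Affinity predicates:

```python
def rating(node, ns_terms, aff_terms, anti_terms, matches):
    """Weighted score of `node`: matched selector/affinity terms count positively,
    and an anti-affinity term contributes its weight only when it does NOT match."""
    return (sum(t["Weight"] for t in ns_terms if matches(t, node))
            + sum(t["Weight"] for t in aff_terms if matches(t, node))
            + sum(t["Weight"] for t in anti_terms if not matches(t, node)))
```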

An example case study shows a cluster with GPU‑enabled and non‑GPU nodes, demonstrating how the Scheduler respects resource constraints and affinity rules to place Pods appropriately.

Finally, the Scheduler does not update a Pod directly; instead it creates a Binding object whose Target.Name is the chosen node, and the API server updates the Pod’s Spec.NodeName accordingly.
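A Binding is an ordinary API object posted to the pod’s `binding` subresource; its shape is roughly the following (the pod and node names here are placeholders):

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: my-pod          # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: chosen-node     # the node the Scheduler selected
```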

The article originates from a KubeCon 2018 contributor summit and references the original Medium post for further reading.

Tags: cloud native, Kubernetes, Scheduler, container orchestration, scheduling algorithms, TLA+
Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
