A Formal TLA+ Model of the Kubernetes Scheduler
This article presents a concise yet detailed formal model of the Kubernetes Scheduler, describing its control-loop logic, scheduling and preemption processes, feasibility filters, viability scoring, and Binding objects, using TLA+ specifications and illustrative code snippets.
Kubernetes is a container‑orchestration engine that runs containerised applications on a cluster of nodes. This article series aims to provide a deep understanding of Kubernetes, focusing on the Scheduler component.
The Kubernetes Scheduler watches the object store for unbound Pods, selects a suitable node, and updates the Pod’s Spec.NodeName. The Scheduler operates as a multi-step optimizer: it first determines feasible placements (those satisfying all constraints) and then chooses the most viable placement (the one with the highest score), guaranteeing a local optimum rather than a global one.
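The two-phase filter-then-score structure can be sketched in Python. This is a hedged illustration of the idea, not the Scheduler's actual Go implementation; the predicate and scoring functions here are toy stand-ins for the real plugins:

```python
def schedule(pod, nodes, filters, scorers):
    """Pick a node for `pod`: keep only feasible nodes, then take the
    highest-scoring one. Returns None when no node is feasible."""
    # Phase 1: feasibility -- a node must pass every filter predicate.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None
    # Phase 2: viability -- sum the scorers and take the arg max.
    return max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))

# Toy predicate and scorer standing in for the real plugins.
fits = lambda pod, node: pod["cpu"] <= node["cpu_free"]    # filter
spare = lambda pod, node: node["cpu_free"] - pod["cpu"]    # scorer

nodes = [{"name": "a", "cpu_free": 2}, {"name": "b", "cpu_free": 8}]
best = schedule({"cpu": 1}, nodes, [fits], [spare])
```

Note that `max` returns one arg-max node even when several tie, mirroring the existential choice in the specification below.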
Formal Model (TLA+)
BoundTo(Pod, Node, Snapshot) ≝
  ∧ Pod ∈ Snapshot ∧ Pod.Kind = "Pod"
  ∧ Node ∈ Snapshot ∧ Node.Kind = "Node"
  ∧ Pod.Spec.NodeName = Node.Name

Bound(Pod, Snapshot) ≝ ∃ Node ∈ Snapshot : BoundTo(Pod, Node, Snapshot)

The Scheduler’s control loop can be expressed as:
Scheduler ≝
  LET Unbound ≝ {Pod ∈ Objects : Pod.Kind = "Pod" ∧ ¬Bound(Pod, Objects)} IN
  ∃ Pod ∈ {Pod ∈ Unbound : ∀ Other ∈ Unbound : Other.Spec.Priority ≤ Pod.Spec.Priority} :
    CASE SchedulingEnabled(Pod) ⟶ Scheduling(Pod)
      [] PreemptionEnabled(Pod)  ⟶ Preemption(Pod)
      [] OTHER                   ⟶ UNCHANGED(Objects)

Scheduling Process
SchedulingEnabled(Pod) ≝
  ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"} : Feasibility(Pod, Node, Objects)

Scheduling(Pod) ≝
  LET Feasible ≝ {Node ∈ Objects : Node.Kind = "Node" ∧ Feasibility(Pod, Node, Objects)} IN
  ∃ Node ∈ Feasible :
    ∧ ∀ Other ∈ Feasible : Viability(Pod, Other, Objects) ≤ Viability(Pod, Node, Objects)
    ∧ Objects' = { IF Object = Pod
                     THEN [Pod EXCEPT !["Spec"] = [Pod.Spec EXCEPT !["NodeName"] = Node.Name]]
                     ELSE Object : Object ∈ Objects }

Preemption Process
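Preemption, specified below, searches for a victim set with minimal casualty. As a hedged Python sketch, where Casualty is taken to be simply the number of evicted pods (only one possible cost function) and `frees_enough` stands in for re-checking feasibility after eviction:

```python
from itertools import combinations

def cheapest_victims(pod, node_pods, frees_enough):
    """Among all subsets of `node_pods` whose eviction makes `pod` fit
    (per `frees_enough`), return a smallest one, or None if none works."""
    for k in range(len(node_pods) + 1):        # try smaller sets first
        for victims in combinations(node_pods, k):
            if frees_enough(pod, victims):
                return list(victims)           # first hit is minimal in size
    return None

# Toy model: CPU freed by the evicted pods must cover the new pod's request.
node_pods = [{"name": "v1", "cpu": 1}, {"name": "v2", "cpu": 2}]
frees = lambda pod, victims: pod["cpu"] <= sum(v["cpu"] for v in victims)
victims = cheapest_victims({"cpu": 2}, node_pods, frees)
```

Enumerating subsets is exponential; it is written this way only to mirror the ∃ Pods ⊆ Jeopardy(…) quantifier in the specification.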
PreemptionEnabled(Pod) ≝
  ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"} :
    ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) : Feasibility(Pod, Node, Objects \ Pods)

Preemption(Pod) ≝
  LET Preemptable ≝ {Node ∈ Objects : Node.Kind = "Node" ∧
                     ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) : Feasibility(Pod, Node, Objects \ Pods)} IN
  ∃ Node ∈ Preemptable :
    ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) :
      ∧ ∀ OtherNode ∈ Preemptable :
          ∀ OtherPods ⊆ Jeopardy(Pod, OtherNode, Objects) :
            Casualty(Pods) ≤ Casualty(OtherPods)
      ∧ Objects' = Objects \ Pods

Feasibility Filters

Feasibility is a conjunction of filter predicates such as:
Node schedulability and lifecycle phase
Resource requirements vs. allocatable resources
Node selector label matching
Taints and tolerations
Node and pod affinity/anti‑affinity
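The taints-and-tolerations check in the list above can be sketched in Python. This is a direct transcription of the Match predicate formalized below, with lowercase field names; it is an illustration, not the kube-scheduler's actual code:

```python
def tolerates(toleration, taint):
    """A pod toleration matches a node taint when the keys (and, for the
    "Equal" operator, the values) agree and the effects coincide."""
    if toleration["operator"] == "Exists":
        keys_match = toleration["key"] == taint["key"]
    elif toleration["operator"] == "Equal":
        keys_match = (toleration["key"] == taint["key"]
                      and toleration["value"] == taint["value"])
    else:
        keys_match = False
    return keys_match and toleration["effect"] == taint["effect"]

taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
ok = tolerates({"operator": "Exists", "key": "gpu", "effect": "NoSchedule"}, taint)
```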
Schedulable(Node) ≝
  ∧ Node.Spec.Unschedulable = FALSE
  ∧ Node.Status.Phase = "Running"

Resources(Pod, Node) ≝
  ∧ 1 ≤ Node.Status.Allocatable["pods"]
  ∧ Max({Pod.Spec.InitContainer[i].Resources.Required["cpu"] : i ∈ DOMAIN Pod.Spec.InitContainer})
      ≤ Node.Status.Allocatable["cpu"]
  ∧ ...

NodeSelector(Pod, Node) ≝
  ∀ Label ∈ DOMAIN Pod.Spec.NodeSelector :
    ∧ Label ∈ DOMAIN Node.Labels
    ∧ Pod.Spec.NodeSelector[Label] = Node.Labels[Label]

Match(Toleration, Taint) ≝
  ∧ CASE Toleration.Operator = "Exists" ⟶ Toleration.Key = Taint.Key
      [] Toleration.Operator = "Equal"  ⟶ Toleration.Key = Taint.Key ∧ Toleration.Value = Taint.Value
      [] OTHER                          ⟶ FALSE
  ∧ Toleration.Effect = Taint.Effect

Viability Scoring

Viability scoring aggregates weighted ratings from node selector matches, pod affinity, and anti-affinity:
Viability(Pod, Node, Snapshot) ≝ Sum(<< ... >>)

Rating(Pod, Node) ≝ Sum(<<
  Sum(LAMBDA Term : Term.Weight,
      {NodeSelectorTerm ∈ Pod.Spec.Affinity.NodeAffinity.Preferred.NodeSelectorTerms :
         Match_NS(NodeSelectorTerm, Node)}),
  Sum(LAMBDA Term : Term.Weight,
      {PodAffinityTerm ∈ Pod.Spec.Affinity.PodAffinity.Preferred :
         P_Affinity(PodAffinityTerm, Node)}),
  Sum(LAMBDA Term : Term.Weight,
      {PodAffinityTerm ∈ Pod.Spec.Affinity.AntiPodAffinity.Preferred :
         ¬P_Affinity(PodAffinityTerm, Node)})
>>)

An example case study shows a cluster with GPU-enabled and non-GPU nodes, demonstrating how the Scheduler respects resource constraints and affinity rules to place Pods appropriately.
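The weighted aggregation in Rating can be sketched in Python: each preferred affinity term that matches the node contributes its weight to the score. The term structure and match predicate here are simplified stand-ins, not the real NodeSelectorTerm schema:

```python
def rating(node, preferred_terms, matches):
    """Sum the weights of the preferred affinity terms the node satisfies."""
    return sum(term["weight"] for term in preferred_terms if matches(term, node))

node = {"labels": {"zone": "us-east-1a"}}
terms = [
    {"weight": 50, "zone": "us-east-1a"},   # matches -> contributes 50
    {"weight": 20, "zone": "us-west-2b"},   # no match -> contributes 0
]
match = lambda term, node: node["labels"]["zone"] == term["zone"]
score = rating(node, terms, match)
```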
Finally, the Scheduler does not update a Pod directly; instead it creates a Binding object whose Target.Name is the chosen node, and the API server updates the Pod’s Spec.NodeName accordingly.
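A Binding object looks roughly like the following manifest (the pod and node names are hypothetical); the API server reacts by setting the named Pod's Spec.NodeName to Target.Name:

```yaml
# Hypothetical names for illustration.
apiVersion: v1
kind: Binding
metadata:
  name: nginx        # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: node-1       # the chosen node
```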
The article originates from a talk at the KubeCon 2018 contributor summit and points to the original Medium post for further reading.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.