A Formal TLA+ Model of the Kubernetes Scheduler
This article presents a concise yet detailed formal model of the Kubernetes Scheduler, describing its control-loop logic, scheduling and preemption processes, feasibility filters, viability scoring, and Binding objects, using TLA+ specifications and illustrative code snippets.
Kubernetes is a container‑orchestration engine that runs containerised applications on a cluster of nodes. This article series aims to provide a deep understanding of Kubernetes, focusing on the Scheduler component.
The Kubernetes Scheduler watches the object store for unbound Pods, selects a suitable node, and updates the Pod’s Spec.NodeName. The Scheduler operates as a multi-step optimizer: it first determines feasible placements (those satisfying all constraints) and then chooses the most viable placement (the one with the highest score), guaranteeing a local optimum rather than a global one.
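The two-phase filter-then-score structure can be sketched in Python. This is a hedged illustration of the idea, not the Scheduler's actual Go implementation; the predicate and scoring functions here are toy stand-ins for the real plugins:

```python
def schedule(pod, nodes, filters, scorers):
    """Pick a node for `pod`: keep only feasible nodes, then take the
    highest-scoring one. Returns None when no node is feasible."""
    # Phase 1: feasibility -- a node must pass every filter predicate.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None
    # Phase 2: viability -- sum the scorers and take the arg max.
    return max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))

# Toy predicate and scorer standing in for the real plugins.
fits = lambda pod, node: pod["cpu"] <= node["cpu_free"]    # filter
spare = lambda pod, node: node["cpu_free"] - pod["cpu"]    # scorer

nodes = [{"name": "a", "cpu_free": 2}, {"name": "b", "cpu_free": 8}]
best = schedule({"cpu": 1}, nodes, [fits], [spare])
```

Note that `max` returns one arg-max node even when several tie, mirroring the existential choice in the specification below.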
Formal Model (TLA+)
BoundTo(Pod, Node, Snapshot) ≝
  ∧ Pod ∈ Snapshot ∧ Pod.Kind = "Pod"
  ∧ Node ∈ Snapshot ∧ Node.Kind = "Node"
  ∧ Pod.Spec.NodeName = Node.Name

Bound(Pod, Snapshot) ≝ ∃ Node ∈ Snapshot : BoundTo(Pod, Node, Snapshot)

The Scheduler’s control loop can be expressed as:
Scheduler ≝
  LET Unbound ≝ {Pod ∈ Objects : Pod.Kind = "Pod" ∧ ¬Bound(Pod, Objects)} IN
  ∃ Pod ∈ {Pod ∈ Unbound : ∀ Other ∈ Unbound : Other.Spec.Priority ≤ Pod.Spec.Priority} :
    CASE SchedulingEnabled(Pod) ⟶ Scheduling(Pod)
      [] PreemptionEnabled(Pod)  ⟶ Preemption(Pod)
      [] OTHER                   ⟶ UNCHANGED(Objects)

Scheduling Process
SchedulingEnabled(Pod) ≝
  ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"} : Feasibility(Pod, Node, Objects)

Scheduling(Pod) ≝
  LET Feasible ≝ {Node ∈ Objects : Node.Kind = "Node" ∧ Feasibility(Pod, Node, Objects)} IN
  ∃ Node ∈ Feasible :
    ∧ ∀ Other ∈ Feasible : Viability(Pod, Other, Objects) ≤ Viability(Pod, Node, Objects)
    ∧ Objects' = { IF Object = Pod
                     THEN [Pod EXCEPT !["Spec"] = [Pod.Spec EXCEPT !["NodeName"] = Node.Name]]
                     ELSE Object : Object ∈ Objects }

Preemption Process
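Preemption, specified below, searches for a victim set with minimal casualty. As a hedged Python sketch, where Casualty is taken to be simply the number of evicted pods (only one possible cost function) and `frees_enough` stands in for re-checking feasibility after eviction:

```python
from itertools import combinations

def cheapest_victims(pod, node_pods, frees_enough):
    """Among all subsets of `node_pods` whose eviction makes `pod` fit
    (per `frees_enough`), return a smallest one, or None if none works."""
    for k in range(len(node_pods) + 1):        # try smaller sets first
        for victims in combinations(node_pods, k):
            if frees_enough(pod, victims):
                return list(victims)           # first hit is minimal in size
    return None

# Toy model: CPU freed by the evicted pods must cover the new pod's request.
node_pods = [{"name": "v1", "cpu": 1}, {"name": "v2", "cpu": 2}]
frees = lambda pod, victims: pod["cpu"] <= sum(v["cpu"] for v in victims)
victims = cheapest_victims({"cpu": 2}, node_pods, frees)
```

Enumerating subsets is exponential; it is written this way only to mirror the ∃ Pods ⊆ Jeopardy(…) quantifier in the specification.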
PreemptionEnabled(Pod) ≝
  ∃ Node ∈ {Node ∈ Objects : Node.Kind = "Node"} :
    ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) : Feasibility(Pod, Node, Objects \ Pods)

Preemption(Pod) ≝
  LET Preemptable ≝ {Node ∈ Objects : Node.Kind = "Node" ∧
                     ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) : Feasibility(Pod, Node, Objects \ Pods)} IN
  ∃ Node ∈ Preemptable :
    ∃ Pods ⊆ Jeopardy(Pod, Node, Objects) :
      ∧ ∀ OtherNode ∈ Preemptable :
          ∀ OtherPods ⊆ Jeopardy(Pod, OtherNode, Objects) :
            Casualty(Pods) ≤ Casualty(OtherPods)
      ∧ Objects' = Objects \ Pods

Feasibility Filters

Feasibility is a conjunction of filter predicates such as:
Node schedulability and lifecycle phase
Resource requirements vs. allocatable resources
Node selector label matching
Taints and tolerations
Node and pod affinity/anti‑affinity
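The taints-and-tolerations check in the list above can be sketched in Python. This is a direct transcription of the Match predicate formalized below, with lowercase field names; it is an illustration, not the kube-scheduler's actual code:

```python
def tolerates(toleration, taint):
    """A pod toleration matches a node taint when the keys (and, for the
    "Equal" operator, the values) agree and the effects coincide."""
    if toleration["operator"] == "Exists":
        keys_match = toleration["key"] == taint["key"]
    elif toleration["operator"] == "Equal":
        keys_match = (toleration["key"] == taint["key"]
                      and toleration["value"] == taint["value"])
    else:
        keys_match = False
    return keys_match and toleration["effect"] == taint["effect"]

taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
ok = tolerates({"operator": "Exists", "key": "gpu", "effect": "NoSchedule"}, taint)
```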
Schedulable(Node) ≝
  ∧ Node.Spec.Unschedulable = FALSE
  ∧ Node.Status.Phase = "Running"

Resources(Pod, Node) ≝
  ∧ 1 ≤ Node.Status.Allocatable["pods"]
  ∧ Max({Pod.Spec.InitContainer[i].Resources.Required["cpu"] : i ∈ DOMAIN Pod.Spec.InitContainer})
      ≤ Node.Status.Allocatable["cpu"]
  ∧ ...

NodeSelector(Pod, Node) ≝
  ∀ Label ∈ DOMAIN Pod.Spec.NodeSelector :
    ∧ Label ∈ DOMAIN Node.Labels
    ∧ Pod.Spec.NodeSelector[Label] = Node.Labels[Label]

Match(Toleration, Taint) ≝
  ∧ CASE Toleration.Operator = "Exists" ⟶ Toleration.Key = Taint.Key
      [] Toleration.Operator = "Equal"  ⟶ Toleration.Key = Taint.Key ∧ Toleration.Value = Taint.Value
      [] OTHER                          ⟶ FALSE
  ∧ Toleration.Effect = Taint.Effect

Viability Scoring

Viability scoring aggregates weighted ratings from node selector matches, pod affinity, and anti-affinity:
Viability(Pod, Node, Snapshot) ≝ Sum(<< ... >>)

Rating(Pod, Node) ≝ Sum(<<
  Sum(LAMBDA Term : Term.Weight,
      {NodeSelectorTerm ∈ Pod.Spec.Affinity.NodeAffinity.Preferred.NodeSelectorTerms :
         Match_NS(NodeSelectorTerm, Node)}),
  Sum(LAMBDA Term : Term.Weight,
      {PodAffinityTerm ∈ Pod.Spec.Affinity.PodAffinity.Preferred :
         P_Affinity(PodAffinityTerm, Node)}),
  Sum(LAMBDA Term : Term.Weight,
      {PodAffinityTerm ∈ Pod.Spec.Affinity.AntiPodAffinity.Preferred :
         ¬P_Affinity(PodAffinityTerm, Node)})
>>)

An example case study shows a cluster with GPU-enabled and non-GPU nodes, demonstrating how the Scheduler respects resource constraints and affinity rules to place Pods appropriately.
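The weighted aggregation in Rating can be sketched in Python: each preferred affinity term that matches the node contributes its weight to the score. The term structure and match predicate here are simplified stand-ins, not the real NodeSelectorTerm schema:

```python
def rating(node, preferred_terms, matches):
    """Sum the weights of the preferred affinity terms the node satisfies."""
    return sum(term["weight"] for term in preferred_terms if matches(term, node))

node = {"labels": {"zone": "us-east-1a"}}
terms = [
    {"weight": 50, "zone": "us-east-1a"},   # matches -> contributes 50
    {"weight": 20, "zone": "us-west-2b"},   # no match -> contributes 0
]
match = lambda term, node: node["labels"]["zone"] == term["zone"]
score = rating(node, terms, match)
```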
Finally, the Scheduler does not update a Pod directly; instead it creates a Binding object whose Target.Name is the chosen node, and the API server updates the Pod’s Spec.NodeName accordingly.
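A Binding object looks roughly like the following manifest (the pod and node names are hypothetical); the API server reacts by setting the named Pod's Spec.NodeName to Target.Name:

```yaml
# Hypothetical names for illustration.
apiVersion: v1
kind: Binding
metadata:
  name: nginx        # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: node-1       # the chosen node
```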
The article originates from a talk at the KubeCon 2018 contributor summit and points to the original Medium post for further reading.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.