Cloud Native

Mastering Kubernetes Descheduler: Strategies to Balance Your Cluster

Learn how to use Kubernetes Descheduler to rebalance uneven pod distribution across nodes by configuring various built‑in strategies, custom policies, filtering options, and deployment methods such as Jobs and CronJobs, with detailed examples and best‑practice guidelines for production clusters.


Kubernetes's kube-scheduler assigns Pods to Nodes, but clusters are highly dynamic, and pods can end up unevenly distributed:

Some nodes are under‑utilized or over‑utilized.

Changes in pod or node affinity break previous scheduling decisions.

Node failures cause pods to be rescheduled elsewhere.

New nodes are added to the cluster.

When such imbalances occur, the Descheduler can be used to rebalance the cluster by evicting Pods according to configurable strategies.

Descheduler

Descheduler applies a set of strategies to identify Pods that should be evicted so that the cluster reaches a more balanced state. All strategies are enabled by default but can be turned on or off individually.

RemoveDuplicates

LowNodeUtilization

RemovePodsViolatingInterPodAntiAffinity

RemovePodsViolatingNodeAffinity

RemovePodsViolatingNodeTaints

RemovePodsViolatingTopologySpreadConstraint

RemovePodsHavingTooManyRestarts

PodLifeTime
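All of these strategies are enabled by default; an individual strategy can be switched off by setting its enabled flag to false in the policy. A minimal sketch (disabling PodLifeTime here is just an illustrative choice):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: false
```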

Common configuration options include:

nodeSelector: restricts which nodes are processed.

evictLocalStoragePods: when set to true, Pods that use local storage may be evicted.

ignorePvcPods: when set to true, Pods with PVCs are ignored (default false).

maxNoOfPodsToEvictPerNode: maximum number of Pods that can be evicted from a node.

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: prod=dev
evictLocalStoragePods: true
maxNoOfPodsToEvictPerNode: 40
ignorePvcPods: false
strategies:
  ...
</code>

RemoveDuplicates

This strategy ensures that only one Pod from the same ReplicaSet, ReplicationController, Deployment, or Job runs on a node. Duplicate Pods are evicted to improve distribution, especially after a node recovers from failure.

excludeOwnerKinds (list of strings): owner kinds to exclude from eviction.

namespaces (list of strings): namespaces to consider.

thresholdPriority (int): priority threshold for eviction.

thresholdPriorityClassName (string): priority class name for eviction.

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
        - "ReplicaSet"
</code>

LowNodeUtilization

This strategy finds under-utilized nodes and evicts Pods from over-utilized nodes so that they can be rescheduled onto the under-utilized ones. Thresholds for CPU, memory, and pod count are defined under nodeResourceUtilizationThresholds: thresholds defines under-utilized nodes, while a separate targetThresholds defines over-utilized nodes from which Pods may be evicted.

thresholds (map): resource usage percentages that define a low-utilization node.

targetThresholds (map): percentages that define a high-utilization node.

numberOfNodes (int): minimum number of low-utilization nodes required to activate the strategy.

thresholdPriority (int) and thresholdPriorityClassName (string): priority filtering.

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
</code>

RemovePodsViolatingInterPodAntiAffinity

Evicts Pods that break inter‑pod anti‑affinity rules, ensuring that Pods with mutually exclusive placement constraints are not co‑located on the same node.

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
</code>

RemovePodsViolatingNodeAffinity

When enabled, requiredDuringSchedulingIgnoredDuringExecution node affinity is effectively enforced during execution as well: Pods that no longer satisfy it are evicted.

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

nodeAffinityType (list of strings): e.g., requiredDuringSchedulingIgnoredDuringExecution.

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
</code>

RemovePodsViolatingNodeTaints

Evicts Pods that do not tolerate a node's NoSchedule taint.

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
</code>

RemovePodsViolatingTopologySpreadConstraint

Ensures Pods are spread across topology domains within the maxSkew limit. Soft constraints can be included by setting includeSoftConstraints to true (requires Kubernetes ≥ 1.18).

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

includeSoftConstraints (bool)

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false
</code>

RemovePodsHavingTooManyRestarts

Evicts Pods that have exceeded a restart threshold, optionally considering init container restarts.

podRestartThreshold (int)

includingInitContainers (bool)

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true
</code>

PodLifeTime

Evicts Pods older than maxPodLifeTimeSeconds. The podStatusPhases field selects which Pod phases are subject to eviction.

maxPodLifeTimeSeconds (int)

podStatusPhases (list of strings)

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (list of strings)

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
        - "Pending"
</code>

Filter Pods

Descheduler allows selective eviction through namespace and priority filters.

Namespace filtering

Strategies can include or exclude specific namespaces using include or exclude lists.

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          include:
          - "namespace1"
          - "namespace2"
</code>
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          exclude:
          - "namespace1"
          - "namespace2"
</code>

Priority filtering

All strategies support priority thresholds; only Pods with a priority lower than the configured value are eligible for eviction. Use either thresholdPriority (numeric) or thresholdPriorityClassName (class name).

<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        thresholdPriority: 10000
</code>
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        thresholdPriorityClassName: "priorityclass1"
</code>
Note: thresholdPriority and thresholdPriorityClassName cannot be configured together. If the specified priority class does not exist, Descheduler will fail.

Pod Evictions

Critical system Pods (priority class system-cluster-critical or system-node-critical) are never evicted.

Pods not managed by a ReplicaSet, ReplicationController, Deployment, or Job are ignored.

DaemonSet Pods are never evicted.

Pods using LocalStorage are protected unless evictLocalStoragePods: true is set.

Pods with PVCs are evicted unless ignorePvcPods: true is set.

Under LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, Pods are evicted from low to high priority; within the same priority, BestEffort Pods are evicted before Burstable and Guaranteed Pods.

Pods annotated with descheduler.alpha.kubernetes.io/evict bypass these safeguards and become eligible for eviction.

If evictions are not happening as expected, increase log verbosity with --v=4 and inspect the Descheduler logs.

Pods protected by PodDisruptionBudgets (PDB) are not evicted.
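The descheduler.alpha.kubernetes.io/evict annotation mentioned above is set on the Pod itself. A minimal sketch (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: standalone-worker            # hypothetical name
  annotations:
    # opts this Pod in to descheduler eviction despite the safeguards
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
  - name: worker
    image: busybox:1.32
    command: ["sleep", "3600"]
```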

Version Compatibility

Descheduler v0.20 → Kubernetes v1.20

Descheduler v0.19 → Kubernetes v1.19

Descheduler v0.18 → Kubernetes v1.18

Descheduler v0.10 → Kubernetes v1.17

Descheduler v0.4‑v0.9 → Kubernetes v1.9+

Descheduler v0.1‑v0.3 → Kubernetes v1.7‑v1.8

Practice

1. Download the matching Descheduler version

<code>$ wget https://github.com/kubernetes-sigs/descheduler/archive/v0.18.0.tar.gz</code>

2. Create RBAC resources

<code>---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
- name: descheduler-sa
  kind: ServiceAccount
  namespace: kube-system
</code>

3. Create a ConfigMap with the policy

<code>---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
        enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 20
            targetThresholds:
              "cpu": 50
              "memory": 50
              "pods": 50
</code>

4. Run Descheduler as a Job

<code>---
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: descheduler
        image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.18.0
        volumeMounts:
        - mountPath: /policy-dir
          name: policy-volume
        command:
        - "/bin/descheduler"
        args:
        - "--policy-config-file"
        - "/policy-dir/policy.yaml"
        - "--v"
        - "3"
      restartPolicy: "Never"
      serviceAccountName: descheduler-sa
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap
</code>

5. Schedule periodic evictions with a CronJob

<code>---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          containers:
          - name: descheduler
            image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.18.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
            - "/bin/descheduler"
            args:
            - "--policy-config-file"
            - "/policy-dir/policy.yaml"
            - "--v"
            - "3"
          restartPolicy: "Never"
          serviceAccountName: descheduler-sa
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy-configmap
</code>
Reference: https://github.com/kubernetes-sigs/descheduler
kubernetes · K8s · Pod scheduling · Descheduler · Cluster balancing
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
