Mastering Kubernetes Descheduler: Strategies to Balance Your Cluster
Learn how to use Kubernetes Descheduler to rebalance uneven pod distribution across nodes by configuring various built‑in strategies, custom policies, filtering options, and deployment methods such as Jobs and CronJobs, with detailed examples and best‑practice guidelines for production clusters.
Kubernetes's kube-scheduler assigns Pods to Nodes, but the highly dynamic nature of clusters can lead to uneven pod distribution over time:
Some nodes are under‑utilized or over‑utilized.
Changes in pod or node affinity, labels, or taints break previous scheduling decisions.
Node failures cause pods to be rescheduled elsewhere.
New nodes are added to the cluster.
When such imbalances occur, the Descheduler can be used to rebalance the cluster by evicting Pods according to configurable strategies.
Descheduler
Descheduler applies a set of strategies to identify Pods that should be evicted so that the cluster reaches a more balanced state. All strategies are enabled by default but can be turned on or off individually.
RemoveDuplicates
LowNodeUtilization
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingNodeAffinity
RemovePodsViolatingNodeTaints
RemovePodsViolatingTopologySpreadConstraint
RemovePodsHavingTooManyRestarts
PodLifeTime
Common configuration options include:
nodeSelector: restricts which nodes are processed.
evictLocalStoragePods: allows Pods that use local storage to be evicted.
ignorePvcPods: when set to true, Pods with PVCs are ignored (default false).
maxNoOfPodsToEvictPerNode: maximum number of Pods that can be evicted from a node.
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: prod=dev
evictLocalStoragePods: true
maxNoOfPodsToEvictPerNode: 40
ignorePvcPods: false
strategies:
  ...
</code>
RemoveDuplicates
This strategy ensures that only one Pod from the same ReplicaSet, ReplicationController, Deployment, or Job runs on a node. Duplicate Pods are evicted to improve distribution, especially after a node recovers from failure.
excludeOwnerKinds (list of strings): owner kinds to exclude from eviction.
namespaces (list of strings): namespaces to consider.
thresholdPriority (int): priority threshold for eviction.
thresholdPriorityClassName (string): priority class name for eviction.
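The duplicate check can be sketched in Python. This is a simplified illustration, not the actual Descheduler implementation; the pod dicts and their fields (owner_kind, owner_name) are hypothetical stand-ins for owner references read from the API server.

```python
from collections import Counter

# Simplified sketch of RemoveDuplicates: on each node, pods sharing the
# same owner (kind + namespace + name) beyond the first are duplicates.
def find_duplicates(pods_on_node, exclude_owner_kinds=()):
    """Return names of pods that duplicate an earlier pod with the same owner."""
    seen = Counter()
    duplicates = []
    for pod in pods_on_node:
        kind = pod["owner_kind"]
        if kind in exclude_owner_kinds:
            continue  # owners listed in excludeOwnerKinds are skipped entirely
        key = (kind, pod["namespace"], pod["owner_name"])
        seen[key] += 1
        if seen[key] > 1:
            duplicates.append(pod["name"])
    return duplicates

pods = [
    {"name": "web-a", "owner_kind": "ReplicaSet", "namespace": "default", "owner_name": "web"},
    {"name": "web-b", "owner_kind": "ReplicaSet", "namespace": "default", "owner_name": "web"},
    {"name": "job-a", "owner_kind": "Job", "namespace": "default", "owner_name": "batch"},
]
print(find_duplicates(pods))                   # ['web-b']
print(find_duplicates(pods, ("ReplicaSet",)))  # []
```

With excludeOwnerKinds set to ReplicaSet, as in the policy example below, the duplicate ReplicaSet pod is left alone.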
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
        - "ReplicaSet"
</code>
LowNodeUtilization
This strategy finds under‑utilized nodes and evicts Pods from over‑utilized nodes in the hope that they will be rescheduled onto the under‑utilized ones. Thresholds for CPU, memory, and pod count are defined under nodeResourceUtilizationThresholds. A separate targetThresholds map defines the over‑utilized nodes from which Pods may be evicted.
thresholds (map): resource usage percentages that define a low‑utilization node.
targetThresholds (map): percentages that define a high‑utilization node.
numberOfNodes (int): minimum number of low‑utilization nodes required to activate the strategy.
thresholdPriority (int) and thresholdPriorityClassName (string): priority filtering.
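The classification logic can be sketched roughly as follows. This is illustrative only; the real Descheduler computes utilization from pod requests and node capacity via the API server, and a node counts as under‑utilized only when it is below all thresholds, but over‑utilized when above any targetThreshold.

```python
# Rough sketch of LowNodeUtilization node classification.
def classify(usage, thresholds, target_thresholds):
    """usage and both threshold maps are percentages for cpu, memory, pods."""
    if all(usage[r] < thresholds[r] for r in thresholds):
        return "underutilized"    # candidate destination for evicted pods
    if any(usage[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"     # pods may be evicted from this node
    return "appropriately utilized"

thresholds = {"cpu": 20, "memory": 20, "pods": 20}
targets = {"cpu": 50, "memory": 50, "pods": 50}
print(classify({"cpu": 10, "memory": 15, "pods": 5}, thresholds, targets))   # underutilized
print(classify({"cpu": 80, "memory": 40, "pods": 30}, thresholds, targets))  # overutilized
print(classify({"cpu": 30, "memory": 30, "pods": 30}, thresholds, targets))  # appropriately utilized
```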
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
</code>
RemovePodsViolatingInterPodAntiAffinity
Evicts Pods that break inter‑pod anti‑affinity rules, ensuring that Pods with mutually exclusive placement constraints are not co‑located on the same node.
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
</code>
RemovePodsViolatingNodeAffinity
When enabled, requiredDuringSchedulingIgnoredDuringExecution node affinity is effectively treated as required during execution as well: Pods that no longer satisfy their node affinity are evicted so the scheduler can place them on conforming nodes.
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
nodeAffinityType (list of strings): e.g., requiredDuringSchedulingIgnoredDuringExecution.
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
</code>
RemovePodsViolatingNodeTaints
Evicts Pods that do not tolerate a node's NoSchedule taints.
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
</code>
RemovePodsViolatingTopologySpreadConstraint
Ensures Pods are spread across topology domains within the maxSkew limit. Soft constraints can be included by setting includeSoftConstraints to true (requires Kubernetes ≥ 1.18).
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
includeSoftConstraints (bool)
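The skew check can be sketched as follows. This is a simplification for illustration: the real strategy only counts pods matching the constraint's label selector, and topology domains with zero matching pods also participate in the calculation.

```python
from collections import Counter

# Illustrative sketch of the maxSkew test behind
# RemovePodsViolatingTopologySpreadConstraint: skew is the difference
# between the most and least populated topology domains.
def violates_max_skew(pod_domains, max_skew):
    counts = Counter(pod_domains)
    skew = max(counts.values()) - min(counts.values())
    return skew > max_skew

# Five pods spread over zones a/b/c
print(violates_max_skew(["a", "a", "a", "b", "c"], 1))  # True  (skew = 2)
print(violates_max_skew(["a", "a", "b", "c"], 1))       # False (skew = 1)
```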
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false
</code>
RemovePodsHavingTooManyRestarts
Evicts Pods that have exceeded a restart threshold, optionally considering init container restarts.
podRestartThreshold (int)
includingInitContainers (bool)
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
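The eligibility test reduces to summing restart counts, roughly as below. The pod dict is a hypothetical stand-in for container statuses read from the API server; this is a sketch, not the actual implementation.

```python
# Sketch of the RemovePodsHavingTooManyRestarts eligibility test:
# sum container restarts, optionally adding init container restarts.
def too_many_restarts(pod, pod_restart_threshold, including_init_containers=False):
    restarts = sum(pod["container_restarts"])
    if including_init_containers:
        restarts += sum(pod["init_container_restarts"])
    return restarts >= pod_restart_threshold

pod = {"container_restarts": [60, 30], "init_container_restarts": [20]}
print(too_many_restarts(pod, 100))        # False (90 restarts)
print(too_many_restarts(pod, 100, True))  # True  (110 restarts)
```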
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true
</code>
PodLifeTime
Evicts Pods older than maxPodLifeTimeSeconds. The podStatusPhases field selects which Pod phases are subject to eviction.
maxPodLifeTimeSeconds (int)
podStatusPhases (list of strings)
thresholdPriority (int)
thresholdPriorityClassName (string)
namespaces (list of strings)
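The check amounts to an age-and-phase test, sketched below. In reality the age is derived from the pod's creation timestamp; the function and its arguments here are illustrative only.

```python
# Sketch of the PodLifeTime check: a pod is eligible when its age exceeds
# maxPodLifeTimeSeconds and its phase appears in podStatusPhases.
def pod_lifetime_eligible(age_seconds, phase, max_pod_lifetime_seconds,
                          pod_status_phases=("Running",)):
    return age_seconds > max_pod_lifetime_seconds and phase in pod_status_phases

print(pod_lifetime_eligible(90000, "Pending", 86400, ("Pending",)))  # True
print(pod_lifetime_eligible(3600, "Pending", 86400, ("Pending",)))   # False
print(pod_lifetime_eligible(90000, "Running", 86400, ("Pending",)))  # False
```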
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
        - "Pending"
</code>
Filter Pods
Descheduler allows selective eviction through namespace and priority filters.
Namespace filtering
Strategies can include or exclude specific namespaces using include or exclude lists.
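The filter semantics can be sketched simply: an include list limits eviction to exactly those namespaces, while an exclude list allows every namespace except the listed ones (a simplified illustration; the two lists are mutually exclusive per strategy).

```python
# Sketch of namespace filtering for a strategy.
def namespace_allowed(ns, include=None, exclude=None):
    if include:
        return ns in include       # only listed namespaces are considered
    if exclude:
        return ns not in exclude   # everything except listed namespaces
    return True                    # no filter: all namespaces considered

print(namespace_allowed("namespace1", include=["namespace1", "namespace2"]))  # True
print(namespace_allowed("kube-system", include=["namespace1"]))               # False
print(namespace_allowed("namespace1", exclude=["namespace1"]))                # False
```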
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        include:
        - "namespace1"
        - "namespace2"
</code>
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        exclude:
        - "namespace1"
        - "namespace2"
</code>
Priority filtering
All strategies support priority thresholds; only Pods with a priority lower than the configured value are eligible for eviction. Use either thresholdPriority (a numeric value) or thresholdPriorityClassName (a priority class name); the two cannot be set simultaneously.
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriority: 10000
</code>
<code>apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriorityClassName: "priorityclass1"
</code>
Note: thresholdPriority and thresholdPriorityClassName cannot be configured together. If the specified priority class does not exist, Descheduler will fail with an error.
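The threshold semantics can be sketched as follows. The PRIORITY_CLASSES mapping and the function are hypothetical, purely to illustrate that a class name resolves to a numeric priority and that the two options are mutually exclusive.

```python
# Sketch of priority filtering: only pods whose priority is strictly below
# the resolved threshold are eligible for eviction.
PRIORITY_CLASSES = {"priorityclass1": 10000}  # assumed class -> value mapping

def eligible_for_eviction(pod_priority, threshold_priority=None,
                          threshold_priority_class_name=None):
    if threshold_priority is not None and threshold_priority_class_name is not None:
        raise ValueError("cannot configure both threshold options")
    if threshold_priority_class_name is not None:
        # a missing class would be a configuration error, mirroring Descheduler
        threshold_priority = PRIORITY_CLASSES[threshold_priority_class_name]
    return pod_priority < threshold_priority

print(eligible_for_eviction(500, threshold_priority=10000))                          # True
print(eligible_for_eviction(20000, threshold_priority_class_name="priorityclass1"))  # False
```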
Pod Evictions
Critical system Pods (priority class system-cluster-critical or system-node-critical) are never evicted.
Pods not managed by a ReplicaSet, ReplicationController, Deployment, or Job are ignored.
DaemonSet Pods are never evicted.
Pods using local storage are protected unless evictLocalStoragePods: true is set.
Pods with PVCs are evicted unless ignorePvcPods: true is set.
Under LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, Pods are evicted in order of ascending priority; within the same priority, BestEffort Pods are evicted before Burstable and Guaranteed Pods.
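That ordering can be expressed as a simple sort key, sketched below with hypothetical (name, priority, qos) tuples standing in for real pod objects.

```python
# Sketch of eviction ordering: lower-priority pods first; within equal
# priority, BestEffort before Burstable before Guaranteed.
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def eviction_order(pods):
    return sorted(pods, key=lambda p: (p[1], QOS_ORDER[p[2]]))

pods = [
    ("db", 1000, "Guaranteed"),
    ("web", 1000, "BestEffort"),
    ("batch", 0, "Burstable"),
]
print([name for name, _, _ in eviction_order(pods)])  # ['batch', 'web', 'db']
```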
Pods annotated with descheduler.alpha.kubernetes.io/evict bypass these checks and can be evicted anyway.
If eviction fails, increase verbosity with --v=4 and inspect the Descheduler logs.
Pods protected by a PodDisruptionBudget (PDB) are not evicted.
Version Compatibility
Descheduler v0.20 → Kubernetes v1.20
Descheduler v0.19 → Kubernetes v1.19
Descheduler v0.18 → Kubernetes v1.18
Descheduler v0.10 → Kubernetes v1.17
Descheduler v0.4‑v0.9 → Kubernetes v1.9+
Descheduler v0.1‑v0.3 → Kubernetes v1.7‑v1.8
Practice
1. Download the matching Descheduler version
<code>$ wget https://github.com/kubernetes-sigs/descheduler/archive/v0.18.0.tar.gz</code>
2. Create RBAC resources
<code>---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
- kind: ServiceAccount
  name: descheduler-sa
  namespace: kube-system
</code>
3. Create a ConfigMap with the policy
<code>---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
        enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 20
            targetThresholds:
              "cpu": 50
              "memory": 50
              "pods": 50
</code>
4. Run Descheduler as a Job
<code>---
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: descheduler
        image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.18.0
        volumeMounts:
        - mountPath: /policy-dir
          name: policy-volume
        command:
        - "/bin/descheduler"
        args:
        - "--policy-config-file"
        - "/policy-dir/policy.yaml"
        - "--v"
        - "3"
      restartPolicy: "Never"
      serviceAccountName: descheduler-sa
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap
</code>
5. Schedule periodic evictions with a CronJob
<code>---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          containers:
          - name: descheduler
            image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.18.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
            - "/bin/descheduler"
            args:
            - "--policy-config-file"
            - "/policy-dir/policy.yaml"
            - "--v"
            - "3"
          restartPolicy: "Never"
          serviceAccountName: descheduler-sa
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy-configmap
</code>
Reference: https://github.com/kubernetes-sigs/descheduler
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.