
Meet a Kubernetes Descheduler


The kube-scheduler is the component responsible for scheduling in Kubernetes. But because of Kubernetes' dynamic nature, pods can sometimes end up on the wrong node. Maybe you edited existing resources to add node affinity or pod (anti) affinity, or some servers are overloaded while others run almost idle. Once a pod is running, kube-scheduler will not try to reschedule it again. Depending on the environment, you might have a lot of moving parts. In my case, the Kubernetes cluster was running on AWS with kops, multiple instance groups including spot instances, the cluster autoscaler, the k8s spot rescheduler, etc. If the cluster scales up by one more node, that node will probably end up running only the single pod that was pending before the cluster autoscaler kicked in. Not ideal, right?
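A quick illustration: with requiredDuringSchedulingIgnoredDuringExecution node affinity, the rule is only checked at scheduling time. If a node loses the matching label later, the pod just keeps running there, and that is exactly the gap the descheduler fills. A minimal sketch, with a made-up label and pod name:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # Enforced only when the pod is scheduled; if the node later loses
      # the disktype=ssd label, kube-scheduler will not move this pod.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx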


How does it work?

The descheduler checks for pods and evicts them based on the defined policies. It is not a replacement for the default scheduler; it depends on it. The project is currently in the Kubernetes incubator and not yet ready for production, but I found it very stable and it worked nicely. So, how do you run it?

You can run the descheduler in a pod, as a job or a cron job. I already created a dev image, referenced in the YAML files below as komljen/descheduler:v0.8.0, but this project is in active development and things are changing fast. You can build your own image with:

⚡ git clone https://github.com/kubernetes-incubator/descheduler
⚡ cd descheduler && make image
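
Then tag the image and push it to your own repository, for example (the repository name here is just a placeholder, and the local image name may differ depending on the Makefile version):

⚡ docker tag descheduler:latest <your-repo>/descheduler:v0.8.0
⚡ docker push <your-repo>/descheduler:v0.8.0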

For the descheduler deployment, if you prefer Helm, you can easily deploy it using the chart I created. It has support for RBAC, and I tested it on Kubernetes v1.9. Actually, everything in this blog post was tested with that version. Add the Helm repository and install the descheduler chart:

⚡ helm repo add akomljen-charts \
    https://raw.githubusercontent.com/komljen/helm-charts/master/charts/

⚡ helm install --name ds \
    --namespace kube-system \
    akomljen-charts/descheduler
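
With Helm v2, which was current at the time, you can check which resources the release created:

⚡ helm status ds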


If you prefer not to use Helm, you can of course do it manually. First, adjust the RBAC resources if needed:

# Create a cluster role
⚡ cat << EOF| kubectl create -n kube-system -f -
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: descheduler
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
EOF

# Create a service account
⚡ kubectl create sa descheduler -n kube-system

# Bind the cluster role to the service account
⚡ kubectl create clusterrolebinding descheduler \
    -n kube-system \
    --clusterrole=descheduler \
    --serviceaccount=kube-system:descheduler
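
To sanity check the binding, you can impersonate the service account and ask the API server; this is an optional check, not part of the original setup:

⚡ kubectl auth can-i delete pods \
    --as=system:serviceaccount:kube-system:descheduler \
    -n kube-system
yes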

Then you need to create a descheduler policy, which will live in a ConfigMap. At the moment four strategies are supported: RemoveDuplicates, LowNodeUtilization, RemovePodsViolatingInterPodAntiAffinity, and RemovePodsViolatingNodeAffinity. They are all enabled by default, but you can disable or adjust them as needed. With LowNodeUtilization, a node counts as underutilized only if its usage is below all of the thresholds values, and pods are evicted from nodes whose usage exceeds any of the targetThresholds values, in the hope they get rescheduled onto the underutilized nodes. Let's create the ConfigMap in the kube-system namespace:

⚡ cat << EOF| kubectl create -n kube-system -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler
data:
  policy.yaml: |-  
    apiVersion: descheduler/v1alpha1
    kind: DeschedulerPolicy
    strategies:
      RemoveDuplicates:
        enabled: false
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
      RemovePodsViolatingNodeAffinity:
        enabled: true
        params:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
EOF

You will run the descheduler in a pod as a cron job; in this case, it will run every 30 minutes. Let's create the cron job, also in the kube-system namespace:

⚡ cat << EOF| kubectl create -n kube-system -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    metadata:
      name: descheduler
    spec:
      template:
        metadata:
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: "true"
        spec:
          serviceAccountName: descheduler
          containers:
          - name: descheduler
            image: komljen/descheduler:v0.8.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
            - /bin/descheduler
            - --v=4
            - --max-pods-to-evict-per-node=10
            - --policy-config-file=/policy-dir/policy.yaml
          restartPolicy: "OnFailure"
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler
EOF

⚡ kubectl get cronjobs -n kube-system
NAME             SCHEDULE       SUSPEND   ACTIVE    LAST SCHEDULE   AGE
descheduler      */30 * * * *   False     0         2m              32m
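
If you don't want to wait for the next scheduled run, newer kubectl versions can trigger a one-off job from the cron job (this subcommand is not available in the kubectl v1.9 used throughout this post):

⚡ kubectl create job descheduler-manual \
    --from=cronjob/descheduler \
    -n kube-system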

You will be able to check for completed pods once the cron job starts working:

⚡ kubectl get pods -n kube-system -a | grep Completed
descheduler-1525520700-297pq          0/1       Completed   0          1h
descheduler-1525521000-tz2ch          0/1       Completed   0          32m
descheduler-1525521300-mrw4t          0/1       Completed   0          2m

You can also check for completed pod logs to see what was happening and to adjust descheduler policies if needed:

⚡ kubectl logs descheduler-1525521300-mrw4t -n kube-system
I0505 11:55:07.554195       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0505 11:55:07.554255       1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0505 11:55:07.767903       1 lownodeutilization.go:147] Node "ip-10-4-63-172.eu-west-1.compute.internal" is appropriately utilized with usage: api.ResourceThresholds{"cpu":41.5, "memory":1.3635487207675927, "pods":8.181818181818182}
I0505 11:55:07.767942       1 lownodeutilization.go:149] allPods:9, nonRemovablePods:9, bePods:0, bPods:0, gPods:0
I0505 11:55:07.768141       1 lownodeutilization.go:144] Node "ip-10-4-36-223.eu-west-1.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":48.75, "memory":61.05259502942694, "pods":30}
I0505 11:55:07.768156       1 lownodeutilization.go:149] allPods:33, nonRemovablePods:12, bePods:1, bPods:19, gPods:1
I0505 11:55:07.768376       1 lownodeutilization.go:144] Node "ip-10-4-41-14.eu-west-1.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":39.125, "memory":98.19259268881142, "pods":33.63636363636363}
I0505 11:55:07.768390       1 lownodeutilization.go:149] allPods:37, nonRemovablePods:8, bePods:0, bPods:29, gPods:0
I0505 11:55:07.768538       1 lownodeutilization.go:147] Node "ip-10-4-34-29.eu-west-1.compute.internal" is appropriately utilized with usage: api.ResourceThresholds{"memory":43.19826999287199, "pods":30.90909090909091, "cpu":35.25}
I0505 11:55:07.768552       1 lownodeutilization.go:149] allPods:34, nonRemovablePods:11, bePods:8, bPods:15, gPods:0
I0505 11:55:07.768556       1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 20, Mem: 20, Pods: 20
I0505 11:55:07.768571       1 lownodeutilization.go:69] No node is underutilized, nothing to do here, you might tune your thersholds further
I0505 11:55:07.768576       1 pod_antiaffinity.go:45] Processing node: "ip-10-4-63-172.eu-west-1.compute.internal"
I0505 11:55:07.779313       1 pod_antiaffinity.go:45] Processing node: "ip-10-4-36-223.eu-west-1.compute.internal"
I0505 11:55:07.796766       1 pod_antiaffinity.go:45] Processing node: "ip-10-4-41-14.eu-west-1.compute.internal"
I0505 11:55:07.813303       1 pod_antiaffinity.go:45] Processing node: "ip-10-4-34-29.eu-west-1.compute.internal"
I0505 11:55:07.829109       1 node_affinity.go:40] Executing for nodeAffinityType: requiredDuringSchedulingIgnoredDuringExecution
I0505 11:55:07.829133       1 node_affinity.go:45] Processing node: "ip-10-4-63-172.eu-west-1.compute.internal"
I0505 11:55:07.840416       1 node_affinity.go:45] Processing node: "ip-10-4-36-223.eu-west-1.compute.internal"
I0505 11:55:07.856735       1 node_affinity.go:45] Processing node: "ip-10-4-41-14.eu-west-1.compute.internal"
I0505 11:55:07.945566       1 request.go:480] Throttling request took 88.738917ms, request: GET:https://100.64.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-4-41-14.eu-west-1.compute.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0505 11:55:07.972702       1 node_affinity.go:45] Processing node: "ip-10-4-34-29.eu-west-1.compute.internal"
I0505 11:55:08.145559       1 request.go:480] Throttling request took 172.751657ms, request: GET:https://100.64.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-4-34-29.eu-west-1.compute.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0505 11:55:08.160964       1 node_affinity.go:72] Evicted 0 pods
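
The run above ends with "No node is underutilized, nothing to do here", because every node is above at least one of the 20/20/20 thresholds. If you wanted the LowNodeUtilization strategy to kick in on this cluster, you could, for example, raise the cpu threshold above the ~41.5% usage of the first node so that it counts as underutilized. An illustrative tweak, not a recommendation:

      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 45     # the first node's CPU usage is ~41.5%, so it now counts as underutilized
              memory: 20
              pods: 20
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50

With that change, the two nodes whose memory usage is above the 50% target threshold would have their evictable pods removed, ideally to be rescheduled onto the underutilized node.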

Voilà, you have a descheduler running in your cluster!

Summary

The default Kubernetes scheduler does a good job, but in dynamic environments pods can end up on the wrong nodes, or you might want to balance resources better. Combined with the other mechanisms mentioned above, the descheduler is a good companion that evicts pods when needed. I'm looking forward to a production-ready release. Stay tuned for the next one.