Background

We create multiple node pools to organize our workloads in a Kubernetes cluster. From time to time, we perform upgrades on those node pools. Some upgrades, such as changing the machine size/type or version, are disruptive; i.e. the node pool gets recreated. On a managed Kubernetes cluster such as GKE (Google Kubernetes Engine), such upgrade work on node pools can be performed automatically. GKE is able to automatically migrate the workload from the old node pool to the new one, and it manages this migration in multiple steps:

  • provision the new node pool and wait for the new pool to be ready
  • cordon the nodes in the old node pool to mark them as unschedulable
  • drain the nodes in the old node pool to migrate the workloads to the new node pool
  • delete the old node pool
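
For example, a disruptive version upgrade that kicks off this automated migration can be started with a command along these lines (cluster, node pool, and version names are placeholders):

gcloud container clusters upgrade CLUSTER-NAME --node-pool NODE-POOL-NAME --cluster-version NEW-VERSION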

By default, GKE uses the surge upgrade strategy with maxSurge (how many new nodes can be created) and maxUnavailable (how many existing nodes can be deleted) to upgrade a node pool. GKE waits for the surge (new) nodes to be ready before cordoning and draining the existing nodes. This way GKE avoids service interruption during the workload migration.
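
These surge values can be tuned per node pool; as a rough example (placeholder names, values of your choice):

gcloud container node-pools update NODE-POOL-NAME --cluster CLUSTER-NAME --max-surge-upgrade 1 --max-unavailable-upgrade 0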

It is important to note that:

If you use Spot VMs in your node pool, the surge upgrade values are ignored, because there is no availability guarantee for Spot VMs. During an upgrade, the old nodes are drained directly without waiting for the surge (new) nodes that use Spot VMs to be ready, which causes service disruption!

To avoid service disruption, we should manually upgrade a node pool that uses Spot VMs when the upgrade recreates the node pool.
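
If you are not sure whether a node pool uses Spot VMs, a describe call like the one below should show it (I am assuming the config.spot field here; adjust the format string if your output differs):

gcloud container node-pools describe NODE-POOL-NAME --cluster CLUSTER-NAME --format="value(config.spot)"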

Steps to manually migrate the workload between node pools:

  • Disable autoscaling on the current node pool. I used the gcloud command on a GKE cluster; you may use the console, CLI, or any tool of your choice

If you don't disable autoscaling, the drain operation you will perform later may create more nodes in your current node pool instead of migrating the workload to the new node pool

gcloud container node-pools update NODE-POOL-NAME --cluster CLUSTER-NAME --no-enable-autoscaling
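To double-check that autoscaling is really off, a describe call along these lines should do (autoscaling.enabled is the field I expect in the output):

gcloud container node-pools describe NODE-POOL-NAME --cluster CLUSTER-NAME --format="value(autoscaling.enabled)"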
  • Create the new node pool with your tool of choice, such as the cloud console, CLI, Terraform, Ansible, etc.
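For instance, a new node pool can be created with gcloud roughly like this (machine type, node count, and names are placeholders; drop --spot if the new pool should use regular VMs):

gcloud container node-pools create NEW-NODE-POOL-NAME --cluster CLUSTER-NAME --machine-type e2-standard-4 --num-nodes 3 --spot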
  • Cordon the nodes in the old node pool to mark them as unschedulable, so that no new pods can be scheduled onto them
kubectl cordon OLD-POOL-NODE-1 OLD-POOL-NODE-2 OLD-POOL-NODE-3
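If you don't have the node names at hand, they can be listed via the node pool label that GKE sets on each node:

kubectl get nodes -l cloud.google.com/gke-nodepool=OLD-NODE-POOL-NAME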
  • Drain each node in the old node pool to migrate the workload to the new node pool. Use --delete-emptydir-data and --ignore-daemonsets to avoid errors
kubectl drain OLD-POOL-NODE-1 --delete-emptydir-data --ignore-daemonsets

kubectl drain OLD-POOL-NODE-2 --delete-emptydir-data --ignore-daemonsets

kubectl drain OLD-POOL-NODE-3 --delete-emptydir-data --ignore-daemonsets
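
If the old node pool has many nodes, the same drain can be scripted with the node pool label instead of typing every node name; a minimal sketch:

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD-NODE-POOL-NAME -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --delete-emptydir-data --ignore-daemonsets
done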
  • Once all nodes in the old node pool are drained properly, you can delete the old node pool
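With gcloud, that would look roughly like this (once you have confirmed all pods are running happily on the new pool):

gcloud container node-pools delete OLD-NODE-POOL-NAME --cluster CLUSTER-NAME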

Conclusion

We know that Spot VMs make a significant difference in our cloud bill, but they come with their own challenges. In this post, I used GKE as the reference Kubernetes cluster, but I expect the scenario is similar on other cloud platforms. I would love to hear your observations about using Spot VMs in a Kubernetes cluster. Thanks for taking the time to read this article! Cheers!