Background
We create multiple node pools to organize our workloads in a Kubernetes cluster. From time to time, we perform upgrades on our node pools. Some upgrades, such as changing the machine size/type, version, etc., are disruptive; i.e. the node pool gets recreated. On a managed Kubernetes cluster, such as GKE (Google Kubernetes Engine), such upgrades of your node pools can be performed automatically. GKE has the capability to automatically migrate the workload from the old node pool to the new one. GKE manages this workload migration in multiple steps:
- provision the new node pool and wait for it to be ready
- cordon the nodes in the old node pool to mark them as unschedulable
- drain the nodes in the old node pool so that their workloads move to the new node pool
- delete the old node pool
By default, GKE uses the surge upgrade strategy with maxSurge (how many new nodes can be created) and maxUnavailable (how many existing nodes can be deleted) to upgrade a node pool. GKE waits for the surge (new) nodes to be ready before cordoning and draining the existing nodes. This way GKE avoids service interruption during the workload migration.
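For reference, the surge settings of an existing node pool can be tuned with gcloud; a minimal sketch, using placeholder names and example values:
gcloud container node-pools update NODE-POOL-NAME --cluster CLUSTER-NAME --max-surge-upgrade 1 --max-unavailable-upgrade 0
With these values GKE brings up one surge node at a time and does not take an existing node away before its replacement is ready.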
It is important to note that:
If you use Spot VMs in your node pool, the surge upgrade values are ignored, because there is no availability guarantee for Spot VMs. During an upgrade, the old nodes are drained directly, without waiting for the surge (new) nodes that use Spot VMs to be ready, which causes service disruption!
To avoid service disruption, we should manually migrate the workload whenever an upgrade would recreate a node pool that uses Spot VMs.
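If you are not sure whether a node pool uses Spot VMs, you can check its configuration first; a quick sketch, assuming the config.spot field exposed by the GKE API and placeholder names:
gcloud container node-pools describe NODE-POOL-NAME --cluster CLUSTER-NAME --format="value(config.spot)"
If this prints True, plan for the manual migration described below.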
Steps to migrate workload manually between node pools:
- Disable autoscaling on the current node pool. I used the gcloud command on the GKE cluster; you may use the console, the CLI, or anything of your choice
If you don't disable autoscaling, the drain operation you will perform later will create more nodes in your current node pool instead of migrating the workload to the new node pool
gcloud container node-pools update NODE-POOL-NAME --cluster CLUSTER-NAME --no-enable-autoscaling
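You can double-check that autoscaling is really off before moving on; a small sketch, again with placeholder names and assuming the autoscaling.enabled field:
gcloud container node-pools describe NODE-POOL-NAME --cluster CLUSTER-NAME --format="value(autoscaling.enabled)"
An empty output or False means the cluster autoscaler will no longer add nodes to this pool.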
- Create the new node pool with your tool of choice, such as the Cloud Console, the CLI, Terraform, Ansible, etc.
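For illustration, here is what creating such a pool might look like with gcloud; the machine type, node count, and names below are assumptions, adjust them to your setup:
gcloud container node-pools create NEW-POOL-NAME --cluster CLUSTER-NAME --machine-type e2-standard-4 --num-nodes 3 --spot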
- Cordon the nodes in the old node pool to mark them as unschedulable, so that no new pods can be scheduled onto them
kubectl cordon OLD-POOL-NODE-1 OLD-POOL-NODE-2 OLD-POOL-NODE-3
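If you don't have the node names at hand, you can list the nodes of a pool by its label; a sketch assuming the default cloud.google.com/gke-nodepool label that GKE puts on its nodes:
kubectl get nodes -l cloud.google.com/gke-nodepool=OLD-POOL-NAME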
- Drain each node in the old node pool to migrate the workload to the new node pool. Use --delete-emptydir-data --ignore-daemonsets to avoid errors for pods that use emptyDir volumes or are managed by DaemonSets
kubectl drain OLD-POOL-NODE-1 --delete-emptydir-data --ignore-daemonsets
kubectl drain OLD-POOL-NODE-2 --delete-emptydir-data --ignore-daemonsets
kubectl drain OLD-POOL-NODE-3 --delete-emptydir-data --ignore-daemonsets
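Before moving on, you may want to verify that nothing except DaemonSet pods is still running on the drained nodes; a sketch using a field selector, with a placeholder node name:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=OLD-POOL-NODE-1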
- Once all the nodes in the old node pool are drained properly, you can delete the old node pool
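For example, with gcloud and placeholder names:
gcloud container node-pools delete OLD-POOL-NAME --cluster CLUSTER-NAME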
Conclusion
We know that Spot VMs make a significant difference in our cloud bill, but Spot VMs come with their own challenges. In this post, I used GKE as the reference Kubernetes cluster, but I expect the scenario is similar on other cloud platforms. I would like to hear your observations about using Spot VMs in a Kubernetes cluster. Thanks for taking the time to read this article! Cheers!