New features for the Kubernetes scheduler

December 10, 2019

This article was contributed by Sean Kerner

The Kubernetes scheduler is being overhauled with a series of improvements that will introduce a new framework and enhanced capabilities that could help cluster administrators to optimize performance and utilization. Abdullah Gharaibeh, co-chair of the Kubernetes scheduling special interest group (SIG Scheduling), detailed what has been happening with the scheduler in recent releases and what's on the roadmap in a session at KubeCon + CloudNativeCon North America 2019.

The scheduler component of Kubernetes is a controller that assigns pods to nodes. A Kubernetes pod is a group of containers that is scheduled together, while a node is a worker machine (real or virtual) within the cluster that has all the services needed to run a pod.

Gharaibeh explained that, as pods are created, they are added to a scheduling queue that is sorted by priority and then processed in two phases. In the first, the pod is run through a filter that determines which nodes are feasible for that pod to run on. For example, the filter checks to make sure the node has enough resources (CPU and memory) to run the pod. The feasible nodes then go through the scoring phase, where they are ranked according to additional criteria such as node affinity. If a pod ends up with no feasible nodes, it is placed back into the queue so that it can be scheduled when that becomes possible.
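The queue-then-filter-then-score flow can be illustrated with a small simulation. This is only a sketch, not the scheduler's actual code: the Node and Pod classes, the label-matching score, and the function names are all hypothetical stand-ins chosen for the example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    name: str
    free_cpu: float                 # cores still unallocated
    free_mem: int                   # MiB still unallocated
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    priority: int
    cpu: float
    mem: int
    preferred_labels: dict = field(default_factory=dict)  # stand-in for node affinity


def filter_nodes(pod, nodes):
    """Filter phase: keep only nodes with enough free CPU and memory."""
    return [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem >= pod.mem]


def score_node(pod, node):
    """Scoring phase: one point per preferred label that the node carries."""
    return sum(1 for k, v in pod.preferred_labels.items() if node.labels.get(k) == v)


def schedule(pods, nodes):
    """Drain a priority-sorted queue; return (placements, unschedulable)."""
    tie = count()  # unique tie-breaker so Pod objects are never compared
    queue = [(-p.priority, next(tie), p) for p in pods]
    heapq.heapify(queue)
    placements, unschedulable = {}, []
    while queue:
        _, _, pod = heapq.heappop(queue)
        feasible = filter_nodes(pod, nodes)
        if not feasible:
            # A real scheduler would put the pod back in the queue for a retry.
            unschedulable.append(pod.name)
            continue
        best = max(feasible, key=lambda n: score_node(pod, n))
        best.free_cpu -= pod.cpu
        best.free_mem -= pod.mem
        placements[pod.name] = best.name
    return placements, unschedulable
```

In this toy version a pod with no feasible node is simply reported as unschedulable; the retry described by Gharaibeh is left out to keep the example short.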

Scheduling framework

There are several enhancements that are being worked on for the scheduler, with the most important from Gharaibeh's perspective being the new scheduling framework. The goal of the framework is to turn the scheduler into an engine that executes callbacks from different extension points. "We wanted to define a number of extension points that we've identified from our past experiences with the scheduler, and with those extension points you can register callback functions," Gharaibeh explained. "A collection of callback functions that define a specific behavior, we're going to call it a plugin."
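The idea of an engine that runs callbacks registered at extension points might look roughly like the following. It is a toy model under stated assumptions: the Framework class, its register() and run_filters() methods, and the two example plugins are invented for illustration and do not mirror the real framework's Go interfaces.

```python
from collections import defaultdict


class Framework:
    """Toy engine that runs callbacks registered at named extension points."""

    # Names taken from the extension points discussed in the article;
    # the real framework defines more of them.
    EXTENSION_POINTS = ("queue_sort", "pre_filter", "filter", "reserve", "permit")

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, plugin):
        """Register every method of `plugin` that is named after an extension point."""
        for point in self.EXTENSION_POINTS:
            callback = getattr(plugin, point, None)
            if callable(callback):
                self._callbacks[point].append(callback)

    def run_filters(self, pod, nodes):
        """Keep the nodes that every registered filter callback accepts."""
        return [n for n in nodes
                if all(cb(pod, n) for cb in self._callbacks["filter"])]


class EnoughCPU:
    """Hypothetical plugin: reject nodes without enough free CPU."""

    def filter(self, pod, node):
        return node["free_cpu"] >= pod["cpu"]


class RequiredZone:
    """Hypothetical plugin: if the pod names a zone, the node must be in it."""

    def filter(self, pod, node):
        zone = pod.get("zone")
        return zone is None or node["zone"] == zone
```

Here a plugin is just an object whose methods are collected by name, which matches Gharaibeh's description of a plugin as a collection of callback functions that together define a behavior.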

The Kubernetes Enhancement Proposal (KEP) for the scheduling framework outlines all of the different extension points for plugins. One of the key extension points is called "queue sort", which Gharaibeh said enables an administrator to supply a single plugin that defines how the scheduling queue will be sorted. That differs from the current default behavior, where pods are always sorted by priority.
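A queue-sort plugin boils down to a comparison function over two queued pods. The sketch below assumes a "less" style comparator and represents pods as plain dictionaries; the field names and the arrival-time tie-break in priority_less() are assumptions made for the example, not details from the talk.

```python
from functools import cmp_to_key


def priority_less(a, b):
    """Priority ordering: higher priority first; earlier arrival breaks ties (assumed)."""
    if a["priority"] != b["priority"]:
        return a["priority"] > b["priority"]
    return a["arrived"] < b["arrived"]


def fifo_less(a, b):
    """Alternative ordering a custom plugin might supply: strict arrival order."""
    return a["arrived"] < b["arrived"]


def sort_queue(pods, less):
    """Sort the queue with whichever single `less` function is plugged in."""
    def compare(a, b):
        if less(a, b):
            return -1
        if less(b, a):
            return 1
        return 0
    return sorted(pods, key=cmp_to_key(compare))
```

Only one comparator can be active at a time, since two different orderings of the same queue would contradict each other.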

Other extension points include "pre-filter", which checks to make sure that certain pod conditions are met, and "filter", which is used to rule out nodes that can't run a given pod. The "reserve" extension point is meant for plugins that keep their own runtime state and need to know when a node's resources are being set aside for a pod. According to the KEP: "Plugins which maintain runtime state (aka 'stateful plugins') should use this extension point to be notified by the scheduler when resources on a node are being reserved for a given Pod."

The framework also includes what are known as permit plugins, which can delay the binding of a pod to a node until a specific condition has been satisfied. Gharaibeh noted that permit plugins can be used to enable gang scheduling of a group of pods that all need to be deployed at the same time. "One interesting use case we had was trying to make the scheduler friendlier for batch workloads," he said.
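Gang scheduling through a permit step can be pictured as holding each pod of a group until enough members have arrived, then releasing them together. The following is a minimal sketch under that assumption; the GangPermit class, its min_members mapping, and the return convention are hypothetical, and details such as wait timeouts are left out.

```python
class GangPermit:
    """Hold pods of a group at the permit step until enough members have arrived."""

    def __init__(self, min_members):
        self.min_members = min_members  # hypothetical: group name -> required size
        self.waiting = {}               # group name -> pod names held so far

    def permit(self, pod_name, group):
        """Return the pods released for binding (an empty list while still waiting)."""
        held = self.waiting.setdefault(group, [])
        held.append(pod_name)
        if len(held) < self.min_members[group]:
            return []                   # keep waiting; nothing is bound yet
        # The whole gang is present: release everyone and reset the group.
        self.waiting[group] = []
        return held
```

With a required size of three, the first two pods of a job are held back and the third one triggers the release of all three at once.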

Overall, Gharaibeh emphasized that the scheduling framework will make it easier for Kubernetes developers to extend and add new features into the scheduler. The scheduling framework is currently targeted for the Kubernetes 1.19 milestone, which will be out in mid-2020.

Observability improvements

Another area that developers have been working on is observability for scheduler performance and traffic. For example, a new metric now tracks the amount of time it takes to schedule a pod, from the time it was picked up from the queue until it is bound to a node. The ability to track the number of incoming pods per second has also been added so it's possible to see how fast the scheduler is able to drain the scheduling queue. Additionally, the scheduler now provides visibility on the number of scheduling attempts made per second. "We just want more insights into the scheduler," he said.

Gharaibeh also noted that, in Kubernetes 1.17, improvements have been made to better account for pod overhead in order to more accurately determine resource usage. He explained that when pods are started on a node, there is typically some overhead. The kubelet that runs on each node has some state associated with running pods that consumes cluster resources. Sandbox pods, which provide isolation for workloads using technologies such as gVisor or Kata Containers, also needed better resource tracking. Both of those approaches have their own agents running beside a pod that end up consuming resources that were not being accounted for.

Kubernetes has dealt with pod overhead in the past by reserving a predefined amount of resources on the node for system components, an approach that Gharaibeh said doesn't fully account for all of the actual pod overhead. The new "pod overhead" feature that is coming to Kubernetes can be used to define the amount of resources that should be allocated per pod. In Kubernetes 1.17, the scheduler will be aware of pod overhead and, when pods are scheduled, the required amount of overhead will be added to the resources that are requested by the pod, he said.
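The arithmetic behind overhead-aware scheduling is simple: the per-pod overhead is added once to the sum of what the pod's containers request. The sketch below uses made-up resource names and numbers (CPU in millicores, memory in MiB), and the two helper functions are hypothetical; it also ignores refinements such as init containers.

```python
def effective_request(container_requests, overhead):
    """Sum the per-container requests, then add the per-pod overhead once."""
    total = dict(overhead)
    for request in container_requests:
        for resource, amount in request.items():
            total[resource] = total.get(resource, 0) + amount
    return total


def fits(node_free, request):
    """A node is feasible only if it has room for every requested resource."""
    return all(node_free.get(res, 0) >= amount for res, amount in request.items())


# Illustrative numbers only: two containers plus a sandbox-style overhead.
containers = [{"cpu_m": 500, "mem_mib": 256}, {"cpu_m": 250, "mem_mib": 128}]
overhead = {"cpu_m": 250, "mem_mib": 120}
node_free = {"cpu_m": 900, "mem_mib": 1024}
```

With these numbers the containers alone ask for 750 millicores, which fits on the node, while the overhead-aware total of 1000 millicores does not. That gap is the kind of unaccounted usage the feature is meant to close.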

Looking forward to the Kubernetes 1.18 release, there is development in progress to enable in-place updates of pod resource requirements. Gharaibeh explained that changing the resource allocation for a pod currently requires that the pod be recreated, since PodSpec, which defines the container resources that are required for the pod, is immutable. With the changes set to come in Kubernetes 1.18, PodSpec will become mutable with regard to resources.

It's hard to overstate the critical role that the scheduler plays in Kubernetes and the impact that the changes coming to its capabilities will have for users. The new framework approach offers the promise of improved extensibility and customizability that could lead to better overall resource utilization. Add in the improved visibility into the scheduler, and there is a lot for Kubernetes administrators to look forward to in the coming releases.

Index entries for this article
GuestArticles	Kerner, Sean
Conference	KubeCon NA/2019

New features for the Kubernetes scheduler

Posted Dec 11, 2019 20:02 UTC (Wed) by SEJeff (guest, #51588)

"""
Looking forward to the Kubernetes 1.18 release, there is development in progress to enable in-place updates of pod resource requirements.
"""

I believe this is the work needed for the vertical pod autoscaler to function properly, which will be really exciting to see come to fruition.