
Virtual-machine scheduling and scheduling in virtual machines

July 10, 2019


OSPM

As is probably well known, a scheduler is the component of an operating system that decides which CPU the various tasks should run on and for how long they are allowed to do so. This happens when an OS runs on the bare hardware of a physical host, and it is also the case when the OS runs inside a virtual machine; the only difference is that, in the latter case, the OS scheduler marshals tasks among virtual CPUs.

And what are virtual CPUs? Well, on most platforms they too are a kind of special task that wants to run on some CPUs ... therefore we need a scheduler for that! This is usually called the "double-scheduling" property of systems employing virtualization because, well, there literally are two schedulers: one — let us call it the host scheduler, or the hypervisor scheduler — that schedules the virtual CPUs on the host physical CPUs; and another one — let us call it the guest scheduler — that schedules the guest OS's tasks on the guest's virtual CPUs.

Now, what are these two schedulers? That depends on the virtualization platform. They are always distinct, in the sense that it will never happen that, at runtime, a scheduler has to deal with scheduling virtual CPUs and also with scheduling tasks that want to run on those same virtual CPUs (well, it can happen, but then you are not doing virtualization). They can be the same in terms of code, or they can be completely different in that respect as well.

Trying to clarify things a little with an example, we have KVM, where Linux runs on the host and acts as the hypervisor, and hence it is the Linux scheduler that schedules the virtual CPUs of the KVM virtual machines onto the host's physical CPUs. At the same time, if Linux runs in the VM as well, it is still the Linux scheduler that schedules the tasks onto the VM's virtual CPUs. The huge benefit of an approach like this is that work (like features, bug fixes, and performance enhancements) done on the Linux scheduler automatically benefits both the hypervisor scheduler and the guest scheduler (it's the same code!). It also has the drawback, though, that if one wants to add a feature that greatly improves the Linux scheduler's behavior as a hypervisor scheduler, but at the same time has a negative impact on other workloads and use cases, that "might be difficult" (which is a euphemism for saying that comments will be of the "not even when hell freezes over!" kind).
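To make the "virtual CPUs are just tasks" point a bit more concrete: a KVM vCPU is simply a thread of the QEMU (or other user-space VMM) process, so all of the ordinary Linux task interfaces apply to it. The sketch below is only an illustration, not something from the talk; it assumes the vCPU thread's TID has already been found by other means (for example, by listing the threads of the QEMU process under /proc/<pid>/task/) and pins that thread to one host CPU with sched_setaffinity(). Nothing in it is KVM-specific, which is precisely the point.

    /*
     * Minimal sketch: pin a KVM vCPU thread (an ordinary Linux task) to
     * one host CPU.  The TID is assumed to have been found already, e.g.
     * by listing the threads of the QEMU process under /proc/<pid>/task/.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <vcpu-tid> <host-cpu>\n", argv[0]);
            return 1;
        }

        pid_t tid = atoi(argv[1]);
        int cpu = atoi(argv[2]);

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* From now on, the host scheduler will only run this vCPU there. */
        if (sched_setaffinity(tid, sizeof(mask), &mask)) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }

The same effect can be had from the command line with taskset(1), and it is essentially what higher-level management tools do when vCPU pinning is configured for a KVM guest.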

On the other hand, we have the Xen project approach, where it is Xen that runs on the host, and hence it is Xen's scheduler that is in charge of the "virtual CPUs on physical CPUs" side. Inside the VM, we still have Linux, and it is its scheduler that deals with running tasks on the virtual CPUs, just as in the KVM case. This means that the Xen project development community can add whatever virtualization-specific tweaks it likes to the Xen scheduler, which is good. The downside, though, is that no features, testing, or profiling come for free from the much wider Linux world.

Faggioli has been working on virtualization for quite a few years, in the past mostly on Xen (and, in fact, on the Xen scheduler). Now, at SUSE, he is playing with both Xen and KVM. He therefore focused on hypervisor scheduling and wanted to give an overview of the Linux and Xen schedulers. The conference audience, however, seemed very well aware that the Linux scheduler includes multiple scheduling policies (such as OTHER, FIFO, RR, and DEADLINE), that we can set tasks' affinity to CPUs, that we can do "pooling" with control groups (cgroups), that we can control resource allocation, also with cgroups, and that we deal with NUMA systems in multiple ways, for instance with automatic NUMA balancing.
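For readers who are less at home with those knobs, the minimal sketch below (an illustration, not something shown in the talk) exercises one of them from user space: it moves the calling task from the default SCHED_OTHER policy to SCHED_FIFO with sched_setscheduler(). Affinity is set with the sched_setaffinity() call shown earlier, while the cgroup and NUMA-balancing features are configured through the cgroup and sysctl file systems rather than through system calls.

    /*
     * Minimal sketch: switch the calling task from the default policy
     * (SCHED_OTHER) to the FIFO realtime policy.  Needs CAP_SYS_NICE or
     * root privileges to succeed.
     */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 10 };

        if (sched_setscheduler(0, SCHED_FIFO, &param)) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("policy is now %d (SCHED_FIFO is %d)\n",
               sched_getscheduler(0), SCHED_FIFO);
        return 0;
    }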

He therefore quickly skipped through all of that and gave an overview of the Xen scheduler only, in the hope of saying something that sounded new to most conference attendees. Xen also implements multiple scheduling algorithms (CREDIT2, CREDIT, RTDS, NULL), although they are not exactly scheduling policies in the Linux sense. Similarly, Xen has pooling capabilities via a feature called cpupools, although that really partitions the host, rather than grouping tasks like cgroups do in Linux. Xen also supports affinity of virtual CPUs to physical CPUs, and it supports a soft variant of affinity, which Linux does not have, though there was a talk at this very conference about a proposal to implement it.

Xen has recently switched to CREDIT2 as its default scheduler, which is a more advanced and much more maintainable scheduler than the one used before (CREDIT). Xen also has a scheduler called NULL, which does not schedule and, in fact, does literally nothing; Faggioli implemented it himself a few years ago and it has become popular. It might mean something if, after years of research and work on OS and hypervisor scheduling, what people like most about your work is the scheduler that does nothing.

Now that we know a little bit more about this problem of double-scheduling, and about the schedulers in use within the major open-source virtualization platforms, we can start thinking about whether or not the hypervisor scheduler and the guest scheduler interact with each other. And, if not, whether or not they should. Having these two schedulers interact would go under the name of scheduler paravirtualization. This is nothing new, as it has been discussed many times. Faggioli did not talk about it much, mostly because, if he ever proposes scheduler paravirtualization for Linux, that will be done via email, where (very) angry replies are all he will get back, not in a room with actual Linux scheduler maintainers who can throw frozen sharks at him!

Instead, he talked about system topology, both physical and virtual. How CPUs are arranged and how much they share is something that has an impact on the scheduler's behavior — quite an impact, actually. In fact, things like whether it is better to wake up a task (or even a virtual CPU) on one CPU or another, and how frequently and how thoroughly the load among the CPUs should be balanced, all depend on the system topology and, when it comes to virtual machines, on the virtual topology. VMs do run on hosts that have a physical topology — determined by how the various chips of CPUs and memory are arranged on actual silicon — but they also have a virtual topology that the hypervisor and the virtualization tools can define (almost) at will.


Topology is particularly interesting because, while for physical hosts it is determined by the wiring of chips on silicon boards, and hence it cannot change (well, if we decide to ignore CPU and memory hotplug, at least), for virtual machines it can. Strictly speaking, it is not the virtual topology of the VM that varies; that is defined when the VM is created and never (again, ignoring virtual CPU and memory hotplug) changes. What can vary is how that virtual topology maps onto the physical one, i.e. where on the host the virtual CPUs actually run.

The fact is that two virtual CPUs can run on two physical CPUs of the same hardware core at a given point in time, but, unless appropriate measures are taken, the host scheduler can move them whenever it wants and end up running one on one NUMA node and the other on another. This is one of the tricky aspects of the relationship and the interactions between host and guest schedulers, and it is a direct consequence of there being both a physical and a virtual topology.
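To see concretely what the guest believes, one can read the topology the guest kernel itself exposes; whatever it reports stays the same for the lifetime of the VM, no matter where the host scheduler happens to be running the vCPUs at that moment. The sketch below is just an illustration (not code from the talk): it prints, for each online CPU, the package and core identifiers from the usual sysfs files, which reflect the kernel's own view of the (virtual) hardware and are the same information tools like lscpu rely on.

    /*
     * Minimal sketch: print the (virtual) topology a guest kernel believes
     * it has, by reading the per-CPU topology files under sysfs.  Stops at
     * the first CPU for which the files are not present.
     */
    #include <stdio.h>

    static int read_int(const char *fmt, int cpu, int *val)
    {
        char path[128];
        FILE *f;
        int ok;

        snprintf(path, sizeof(path), fmt, cpu);
        f = fopen(path, "r");
        if (!f)
            return 0;
        ok = (fscanf(f, "%d", val) == 1);
        fclose(f);
        return ok;
    }

    int main(void)
    {
        for (int cpu = 0; ; cpu++) {
            int pkg, core;

            if (!read_int("/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                          cpu, &pkg) ||
                !read_int("/sys/devices/system/cpu/cpu%d/topology/core_id",
                          cpu, &core))
                break;
            printf("cpu%d: package %d, core %d\n", cpu, pkg, core);
        }
        return 0;
    }

On a VM for which no virtual topology has been specified, this will typically show every vCPU as its own single-core package, which is exactly the situation discussed next.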

In fact, if one does not define a virtual topology for their VMs, all of the topology-related optimizations that the guest scheduler contains are useless or, even worse, are pure overhead. That is why it is usually necessary to define a virtual topology for the VM itself in order to achieve close-to-host performance from inside a VM. It is, however, better if we do that with some care, or the guest scheduler will, for example during load balancing, move a task to a virtual CPU that is supposed to be on the same core as the one where the task is currently running, but that is in reality running two NUMA hops away.

The talk included some examples, with graphs and data coming from experiments done on some large AMD servers. What they show is that, to achieve the best performance inside a VM, we absolutely need the VM to have a virtual topology, and one that is and remains consistent with the topology of the host. That basically means that, to achieve the best performance, we need both the host and the guest scheduler, and we also need them to "play well" with each other.

And here we are again, stating not only that both the host and the guest scheduler matter, but also that they seem to need to interact somehow in order for the best outcome to be achieved. In fact, it looks like it could be good to know, from inside the guest, whether or not the virtual CPUs are pinned to the physical CPUs at the host level. Once we know that, we can decide whether the guest scheduler should honor or ignore the virtual topology.

Currently, in the Linux scheduler, the link between topology (no matter whether physical or virtual) and the actual scheduler behavior is represented by the scheduling domains, which are constructed according to the topology. Scheduling domains also come with a set of flags, which are what directly determines how the Linux scheduler makes decisions about tasks running on the CPUs that are part of a specific domain. The structure of the scheduling-domain hierarchy and the flags of the domains in the hierarchy have a direct impact on the behavior of the scheduler, and hence on performance.
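The hierarchy and flags in question can actually be inspected from user space. The sketch below is only an illustration and assumes a kernel built with CONFIG_SCHED_DEBUG, which (at the time of writing) exposes the domains under /proc/sys/kernel/sched_domain/; it walks the hierarchy for CPU 0 and prints each level's name (SMT, MC, NUMA, and so on) together with its flags bitmask, whose bits correspond to the kernel's SD_* constants.

    /*
     * Minimal sketch: walk the scheduling-domain hierarchy of CPU 0, as
     * exposed (with CONFIG_SCHED_DEBUG) under /proc/sys/kernel/sched_domain/,
     * printing each domain's name and flags.
     */
    #include <stdio.h>

    int main(void)
    {
        for (int level = 0; ; level++) {
            char path[128], name[64] = "";
            unsigned long flags = 0;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/proc/sys/kernel/sched_domain/cpu0/domain%d/name", level);
            f = fopen(path, "r");
            if (!f)
                break;      /* no more levels in the hierarchy */
            if (fscanf(f, "%63s", name) != 1)
                name[0] = '\0';
            fclose(f);

            snprintf(path, sizeof(path),
                     "/proc/sys/kernel/sched_domain/cpu0/domain%d/flags", level);
            f = fopen(path, "r");
            if (f) {
                if (fscanf(f, "%lu", &flags) != 1)
                    flags = 0;
                fclose(f);
            }

            printf("domain%d: %s, flags 0x%lx\n", level, name, flags);
        }
        return 0;
    }

Which levels show up, and with which flags, is exactly the knob that the proposal described next wants to turn for virtual machines.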

And that was the bulk of Faggioli's proposal: let's try to figure out what the best scheduling-domain hierarchy and set of scheduling-domain flags are for a virtual machine, depending on the physical host topology, the virtual machine topology, and other things (such as whether or not the vCPUs are pinned), and use that to "drive" the behavior of the guest scheduler. Basically, instead of running full throttle toward some form of scheduler paravirtualization, we use something that is already there and see how far we can get. We "only" need a way to figure out (e.g. at VM boot time) whether or not this special "virtual scheduling-domain hierarchy" should be built and used. And, on this front, Faggioli was happy to hear that the Linux scheduler people in the audience thought it should be possible.

This is all still in the proposal state, but more data from benchmarks is already being collected, and work toward a prototype will start "soon" (for some definition of soon).

Index entries for this article
Conference: OS-Directed Power-Management Summit/2019




Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds