The extensible scheduler class
The core idea behind BPF is that it allows programs to be loaded into the kernel from user space at run time; using BPF for scheduling has the potential to enable significantly different scheduling behavior than is seen in Linux systems now. The idea of "pluggable" schedulers is not new; it came up in a 2004 discussion of yet another doomed patch series from Con Kolivas, for example. At that time, the idea of pluggable schedulers was strongly rejected; only by focusing energy on a single scheduler, it was argued, could the development community find a way to satisfy all workloads without filling the kernel with a confusion of special-purpose schedulers.
Of course, the idea that the kernel only has one CPU scheduler is not quite accurate; there are actually several of them, including the realtime and deadline schedulers, that applications can choose between. But almost all work on Linux systems runs under the default "completely fair scheduler", which indeed does a credible job of managing a wide variety of workloads on everything from embedded systems to supercomputers. There is always a desire for better performance, but there have been almost no requests for a pluggable scheduler mechanism for years.
Why, then, is the BPF mechanism being proposed now? In clear anticipation of a long discussion, the cover letter for the patch series describes the motivation behind this work at great length. In short, the argument goes, the ability to write scheduling policies in BPF greatly lowers the difficulty of experimenting with new approaches to scheduling. Both our workloads and the systems they run on have become much more complex since the completely fair scheduler was introduced; experimentation is needed to develop scheduling algorithms that are suited to current systems. The BPF scheduling class allows that experimentation in a safe manner without even needing to reboot the test machine. BPF-written schedulers can also improve performance for niche workloads that may not be worth supporting in the mainline kernel and are much easier to deploy to a large fleet of systems.
Scheduling with BPF
The patch set adds a new scheduling class, called SCHED_EXT, that can be selected with a sched_setscheduler() call like most others (selecting SCHED_DEADLINE is a bit more complicated). It is an unprivileged class, meaning that any process can place itself into SCHED_EXT. SCHED_EXT is placed between the idle class (SCHED_IDLE) and the completely fair scheduler (SCHED_NORMAL) in the priority stack. As a result, no SCHED_EXT scheduler can take over the system in a way that would prevent, for example, an ordinary shell session running as SCHED_NORMAL from running. That placement also suggests that, on systems where SCHED_EXT is in use, the expectation is that the bulk of the workload will be running in that class.
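As a rough illustration (this is a sketch, not code from the patch set; the numeric value of SCHED_EXT is taken from the patches' UAPI header and could change before merging), a process could opt itself into the new class like this:

```c
/* Sketch: place the calling process into the SCHED_EXT class.
 * SCHED_EXT's numeric value (7 here) comes from the patch set's UAPI
 * header and is not part of any released kernel. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7
#endif

int main(void)
{
	/* Non-realtime policies take a priority of 0. */
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_EXT, &param)) {
		perror("sched_setscheduler");
		return 1;
	}
	/* With no BPF scheduler loaded, the task behaves as SCHED_NORMAL. */
	printf("now running in SCHED_EXT\n");
	return 0;
}
```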
The BPF-written scheduler is global to the system as a whole; there is no provision for different groups of processes to load their own schedulers. If there is no BPF scheduler loaded, then any processes that have been put into the SCHED_EXT class will be run as if they were in SCHED_NORMAL instead. Once a BPF scheduler is loaded, though, it will take over the responsibility for all SCHED_EXT tasks. There is also a magic function that a BPF scheduler can call (scx_bpf_switch_all()) that will move all processes running below realtime priority into SCHED_EXT.
A BPF program implementing a scheduler will normally manage a set of dispatch queues, each of which may contain runnable tasks that are waiting for a CPU to execute on. By default, there is one dispatch queue for every CPU in the system, and one global queue. When a CPU is ready to run a new task, the scheduler will pull a task off of the relevant dispatch queue and give it the CPU. The BPF side of the scheduler is mostly implemented as a set of callbacks to be invoked via an operations structure, each of which informs the BPF code of an event or a decision that needs to be made. The list is long; the full set can be found in include/sched/ext.h in the SCHED_EXT repository branch. It includes the following (a brief sketch showing how a few of these callbacks fit together appears after the list):
- prep_enable() and enable()
  The first callback informs the scheduler of a new task that is entering SCHED_EXT; the scheduler can use it to set up any associated data for that task. prep_enable() is allowed to block and can be used for memory allocations. enable(), which cannot block, actually enables scheduling for the new task.
- select_cpu()
  Select a CPU for a task that is just waking up; it should return the number of the CPU to place the task on. This decision can be revisited before the task actually runs, but it may be used by the scheduler to wake the selected CPU if it is currently idle.
- enqueue()
  Enqueue a task into the scheduler for running. Normally this callback will call scx_bpf_dispatch() to place the task into the chosen dispatch queue, from which it will eventually be run. Among other things, that call provides the length of the time slice that should be given to the task once it runs. If the slice is specified as SCX_SLICE_INF, the CPU will go into the tickless mode when this task runs.
  It's worth noting that enqueue() is not required to put the task into any dispatch queue; it could squirrel that task away somewhere for the time being if the task should not run immediately. The kernel keeps track, though, to ensure that no task gets forgotten; if a task languishes for too long (30 seconds by default, though the timeout can be shortened), the BPF scheduler will eventually be unloaded.
- dispatch()
  Called when a CPU's dispatch queue is empty; it should dispatch tasks into that queue to keep the CPU busy. If the dispatch queue remains empty, the scheduler will try to grab tasks from the global queue instead.
- update_idle()
  This callback informs the scheduler when a CPU is entering or leaving the idle state.
- runnable(), running(), stopping(), and quiescent()
  These all inform the scheduler about status changes for a task; they are called when, respectively, a task becomes runnable, starts running on a CPU, is taken off a CPU, or becomes no longer runnable.
- cpu_acquire() and cpu_release()
  These inform the scheduler about the status of the CPUs in the system. When a CPU becomes available for the BPF scheduler to manage, a callback to cpu_acquire() informs it of that fact; the loss of a CPU (because, perhaps, a realtime scheduling class has claimed it) is signaled with a call to cpu_release().
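To give a feel for how these callbacks are put together, here is a rough sketch of a minimal scheduler along the lines of the "dummy" example from the patch series. It supplies only an init() callback (which uses scx_bpf_switch_all() to pull all non-realtime tasks into SCHED_EXT, as described above) and an enqueue() callback. The helper header, the BPF_STRUCT_OPS() macro, and constants such as SCX_DSQ_GLOBAL and SCX_SLICE_DFL are assumed to match the example code in the patches and may differ in detail:

```c
/* Rough sketch only; loosely modeled on the example schedulers in the
 * patch set, with names and signatures assumed rather than verified. */
#include "scx_common.bpf.h"	/* helper header shipped with the examples */

char _license[] SEC("license") = "GPL";

/* init(): called once when the scheduler is loaded; opt every
 * non-realtime task into SCHED_EXT. */
s32 BPF_STRUCT_OPS(minimal_init)
{
	scx_bpf_switch_all();
	return 0;
}

/* enqueue(): queue a runnable task. Everything goes onto the global
 * dispatch queue with the default time slice; CPUs needing work will
 * pull tasks from there. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops minimal_ops = {
	.init	 = (void *)minimal_init,
	.enqueue = (void *)minimal_enqueue,
	.name	 = "minimal",
};
```

Such a program would be loaded from user space using libbpf's struct_ops support; once the operations structure is attached, it takes over scheduling for all SCHED_EXT tasks as described earlier.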
There are numerous other callbacks for the management of control groups, CPU affinity, core scheduling, and more. There is also a set of functions that the scheduler can call to affect scheduling decisions; for example, scx_bpf_kick_cpu() can be used to preempt a task running on a given CPU and call back into the scheduler to pick a new task to run there.
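As a small illustration of that last helper (the flag name is assumed from the patch set's example schedulers), a scheduler that has just queued an urgent task might kick the target CPU so that it preempts whatever it is running and comes back for the new task:

```c
/* Sketch: preempt whatever is running on @cpu and make that CPU call
 * back into the BPF scheduler for a new task. SCX_KICK_PREEMPT is the
 * preemption flag used by the examples in the patch set. */
static void preempt_cpu_for_urgent_task(s32 cpu)
{
	scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
```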
Examples
The end result is a framework that allows the implementation of a wide range of scheduling policies in BPF code. To prove the point, the patch series includes a number of sample schedulers. One patch contains a minimal "dummy" scheduler that uses the defaults for all of the callbacks; it also adds a basic scheduler that implements five priority levels and shows how to stash tasks into BPF maps. "While not very practical, this is useful as a simple example and will be used to demonstrate different features".
Beyond that, there is a "central" scheduler that dedicates one CPU to scheduling decisions, leaving all others free to run the workload. A later patch adds tickless support to that scheduler and concludes:
"While scx_example_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting."
As if that weren't enough, scx_example_pair implements a form of core scheduling using control groups. The scx_example_userland scheduler "implements a fairly unsophisticated sorted-list vruntime scheduler in userland to demonstrate how most scheduling decisions can be delegated to userland". The series concludes with the Atropos scheduler, which features a significant user-space component written in Rust. The cover letter describes one more, scx_example_cgfifo, which wasn't included because it depends on the still out-of-tree BPF rbtree patches. It "provides FIFO policies for individual workloads, and a flattened hierarchical vtree for cgroups", and evidently provides better performance than SCHED_NORMAL for an Apache web-serving benchmark.
Prospects
This patch set is in its second posting and has, so far, not drawn a lot of review comments; perhaps it is too big to bikeshed. Scheduler maintainer Peter Zijlstra responded to the first version, though, saying: "I hate all of this. Linus NAK'ed loadable schedulers a number of times in the past and this is just that again -- with the extra downside of the whole BPF thing on top". He then proceeded to review many of the component patches, though, suggesting that he may not intend to reject this work outright.
Even so, the BPF scheduler class will clearly be a large bite for the core kernel community to swallow. It adds over 10,000 lines of core code and exposes many scheduling details that have, thus far, been kept deep within the kernel. It would be an acknowledgment that one general-purpose scheduler cannot optimally serve all workloads; some may worry that it would mark an end to work on the completely fair scheduler toward that goal and an increase in fragmentation across Linux systems. The BPF-scheduling developers argue the opposite, that the ability to freely experiment with scheduling models would, instead, accelerate improvements to the completely fair scheduler.
How this will play out is hard to predict, other than to note that the BPF juggernaut has, thus far, managed to overcome just about every objection that it has encountered. The days of locking up core-kernel functionality within the kernel itself seem to be coming to an end. It will be interesting to see what new scheduling approaches will be enabled by this subsystem.
Index entries for this article:
- Kernel: BPF/CPU scheduling
- Kernel: Releases/6.12
- Kernel: Scheduler/Extensible scheduler class
Posted Feb 10, 2023 19:53 UTC (Fri) by flussence (guest, #85566) (3 responses)
Or maybe Linux will proudly make no attempt to keep up with the ongoing tidal wave of mainstream x86/ARM core arrangements that don't fit neatly into a C array and we'll end up with a sequel to the Wasted Cores paper. Right now that looks more likely.
Posted Feb 12, 2023 0:52 UTC (Sun) by gerdesj (subscriber, #5446) (1 response)
It's great that you are seeing "the real value", but why not let us know why?
Posted Feb 23, 2023 8:45 UTC (Thu) by Ongy (subscriber, #161805)
We are starting to see more and more pluggable and microkernel (userspace) approaches inside the kernel: with drivers, schedulers, LSM, and I think to some degree even memory allocation (faultfd or whatever the name is).
Which is great. People can experiment, we get better fault safety for drivers, and if upstream is too slow or (arguably) too much of a nuisance, vendors can still ship drivers.
OTOH I really dislike what this enables as well. Vendors get to avoid the GPL. Intel already provides a closed-source driver for webcams and touchscreens IIRC, by means of having it run in userspace.
It's a really double edged sword. While we might get more drivers faster/easier to backport, it also reduces the pressure on vendors to "play nice" and especially to provide open implementations of how to talk to their hardware.
Posted Feb 23, 2023 12:20 UTC (Thu) by mrugiero (guest, #153040)
Posted Feb 11, 2023 9:23 UTC (Sat) by hyeyoo (subscriber, #151417)
Posted Feb 11, 2023 16:33 UTC (Sat) by dullfire (guest, #111432) (3 responses)
Honestly: this line makes me think Linux should take another stab at the whole M:N scheduling thing (which AFAIK has not had a successful/performant implementation on any OS/kernel thus far). With a (small) M:N syscall interface, you could have a scheduling user-space "process". That process would be realtime-scheduled, and then you get the full power of a pluggable scheduler AND the ability to write it in any language you have a runtime/compiler for.
Posted Feb 11, 2023 18:34 UTC (Sat) by posk (subscriber, #120275) (1 response)
Google Fibers is an M:N in-process scheduling framework that has been used internally at Google for 10+ years, quite successfully. There have been several attempts to upstream the kernel side of it (UMCG, FUTEX_SWAP).
Posted Feb 12, 2023 0:20 UTC (Sun) by foom (subscriber, #14868)
The magic seems to be all about having userspace code choose which threads are runnable at any given time -- ensuring that there's only ever one "runnable" thread per CPU the process is allowed to use. The trickiest part of that is the ability to cause a new thread to be chosen and made runnable when another blocks in the kernel.
By ensuring this, the kernel scheduler's decisions are made effectively trivial. ("N CPUs are available, and N threads are currently runnable? Ok I know what to do here!")
Posted Feb 12, 2023 20:32 UTC (Sun) by idont (subscriber, #128754)
Looks like it's unsupported starting from 7.3 as well: https://www.ibm.com/docs/en/aix/7.3?topic=multiprocessing...
Posted Feb 17, 2023 15:55 UTC (Fri) by jcpunk (subscriber, #95796)
If folks are given a "use BPF to implement your custom scheduler" mechanism for their workload, they'll be super unhappy if they have to rewrite bits of it because the kernel internals changed.