Controlling the CPU scheduler with BPF
There are several CPU schedulers in the kernel, each of which works cooperatively to handle specific types of workloads. In systems without realtime processes, though, almost all scheduling is done by the Completely Fair Scheduler (CFS), to the point that most people probably just think of it as "the scheduler". CFS is a complicated beast; it embodies a set of hard-learned heuristics that seek to maximize performance for a wide variety of workloads, and has a number of knobs to tweak for the cases where the heuristics need help. CPU scheduling is a complex task, though, and it is not surprising that the results from CFS are not always seen as being optimal by all users.
Gushchin started the cover letter for the patch set by observing that an extensive look at the effects of the various CFS tuning knobs revealed that most of them have little effect on the performance of the workload. In the end, it came down to a couple of relatively simple decisions:
In other words, some of our workloads benefit by having long running tasks preempted by tasks handling short running requests, and some workloads that run only short term requests benefit from never being preempted.
The best scheduling policy varies from one workload to the next, so there is value in being able to tweak the policy as needed. That said, Gushchin noted that most workloads are well served by CFS as it is now; it may not make much sense to add more tweaks for the relatively small set of workloads that could benefit from them.
This is just the sort of situation where BPF has made inroads into other parts of the kernel. It gives users the flexibility to change policies to meet their needs, is fast enough to be sensibly used in performance-critical subsystems like the CPU scheduler, and adds no overhead on systems where it is not in use. It is somewhat surprising that there have been no serious attempts to integrate BPF into the scheduler until now.
Gushchin's patch set creates a new BPF program type (BPF_PROG_TYPE_SCHED) for programs that influence CPU-scheduler decisions. There are three attachment points for these programs:
- cfs_check_preempt_tick is called during the handling of the scheduler's periodic timer tick; a BPF program attached here can then look at which process is running. If that process should be allowed to continue to run, the hook can return a negative number to prevent preemption. A positive return value, instead, informs the scheduler that it should switch to a different process, thus forcing preemption to happen. Returning zero leaves the decision up to the scheduler as if the hook hadn't been run.
- cfs_check_preempt_wakeup is called when a process is woken by the kernel; a negative return value will prevent this process from preempting the currently running process, a positive value will force preemption, and zero leaves it up to the scheduler.
- cfs_wakeup_preempt_entity is similar to cfs_check_preempt_wakeup, but it is called whenever a new process is being selected for execution and can influence the decision. A negative return indicates no preemption, positive forces it, and zero leaves the decision to other parts of the scheduler.
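All three hooks share the same sign convention: negative suppresses preemption, positive forces it, and zero defers to CFS. A minimal user-space model of how the scheduler would combine a hook's return value with its own default decision (the function name and structure here are illustrative, not taken from the patch set):

```python
def should_preempt(hook_return: int, cfs_default: bool) -> bool:
    """Model the sign convention shared by the three BPF scheduler hooks:
    a negative return suppresses preemption, a positive return forces it,
    and zero leaves the decision to CFS's normal logic."""
    if hook_return < 0:
        return False
    if hook_return > 0:
        return True
    return cfs_default

# A hook protecting a long-running task would return a negative value:
print(should_preempt(-1, True))   # False: preemption suppressed
# A hook favoring a short-running request would return a positive value:
print(should_preempt(1, False))   # True: preemption forced
# Zero lets CFS decide as if the hook had not run:
print(should_preempt(0, True))    # True: CFS's own decision stands
```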
Gushchin notes that, at Facebook, the first experiments using these hooks "look very promising". By posting the patch set, he hoped to start a conversation on how BPF could be used within the scheduler.
For the most part, it seems that this goal has not been attained; the conversation around these patches has been relatively muted. The most significant comments have come from Qais Yousef who, since he comes from the mobile world, has a different perspective on scheduler issues. He noted that, in that realm, vendors tend to heavily modify the CPU scheduler (see this article for a look at one vendor's scheduler changes). Yousef would like to see the scheduler improved to the point that these vendor changes are no longer necessary; he worried that the addition of BPF hooks could thwart that effort:
So my worry is that this will open the gate for these hooks to get more than just micro-optimization done in a platform specific way. And that it will discourage having the right discussion to fix real problems in the scheduler because the easy path is to do whatever you want in userspace. I am not sure we can control how these hooks are used.
Yousef later recognized that there could be value in this feature, but suggested it should be tightly controlled. Among other things, he said, BPF programs used as scheduler hooks should be distributed within the kernel tree itself, with any out-of-tree hooks causing the kernel to become tainted, much like how loadable modules work.
Gushchin's position was that, by making it easy to try out scheduler changes, the new BPF hooks could accelerate scheduler development rather than slowing it down. Meanwhile, he suggested, having vendors apply their scheduler changes as BPF programs might be better than the sorts of patches they create now.
Beyond this exchange, the patch set has not yet received any significant feedback from either the core scheduler developers or the BPF community. That will clearly need to change if this work is to ever be considered for merging into the mainline kernel. Allowing user space to hook into the scheduler is likely to be a hard sell at best, but it's an idea that seems unlikely to go away anytime soon. For better or for worse, the Linux kernel serves a wide variety of users; providing the best solution for every one of them out of the box is always going to be a challenge.
Index entries for this article:
- Kernel: BPF
- Kernel: Scheduler
Posted Oct 22, 2021 4:06 UTC (Fri) by dvdeug (guest, #10998)
The point is, Linus & co. would never tune the kernel for that. It seems weird to imagine that all workloads can be best handled by a standard scheduler; the kernel doesn't know whether this is a mail server with a web server serving instructions on how to get mail (a web server under load is a sign of attack, not valuable work) or web server with a mail server because Exim is installed by default (vice versa) or whether the important thing was getting that headshot (as opposed to using a FPS as a tactile version of top, of course.) I'm sure the scheduler is good enough for virtually all cases, but if you're not running with an excess of CPU power, letting the sysadmin set priorities seems quite useful.
Posted Oct 22, 2021 4:32 UTC (Fri) by alison (subscriber, #63752)
/proc/sys/kernel has lots of scheduler parameters. Are you sure that they were not sufficient for your needs? Note also that there are schedulers besides CFS available.
> if you're not running with an excess of CPU power, letting the sysadmin set priorities seems quite useful.
The sysadmin can use /usr/bin/renice to set priorities. What BPF could do that a sysadmin cannot is dynamically adjust scheduler priorities and other parameters. Whether such adjustments offer genuine improvements in a given situation (or ever, frankly) needs to be measured.
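The renice mechanism mentioned here is also reachable programmatically; a small sketch using Python's standard os module, which wraps the same getpriority(2)/setpriority(2) calls. Raising the nice value (lowering priority) needs no special privileges; lowering it back would require CAP_SYS_NICE:

```python
import os

# 0 as the second argument means "the calling process". Raising the
# nice value by 5 lowers this process's scheduling priority, exactly
# as renice(1) would from the shell.
before = os.getpriority(os.PRIO_PROCESS, 0)
os.setpriority(os.PRIO_PROCESS, 0, before + 5)
after = os.getpriority(os.PRIO_PROCESS, 0)
print(after - before)  # 5
```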
Posted Oct 22, 2021 22:15 UTC (Fri) by flussence (guest, #85566)
They might have been when that Pentium 3 was new — when the Venn diagram of "someone trying to play an MP3 on Linux" and "masochistic computer nerd who enjoys tuning their OS all day" was a circle, but it's now 20 years later and the entire internet is still convinced Linux audio (among other things) is the punchline to a joke because of this missing-stair problem of pretending only supercomputers and full-time administrators use it.
> Note also that there are schedulers besides CFS available.
MuQSS is dead due to maintainer burnout, ProjectC is too unstable to use as a daily driver on a desktop, and telling people to use PREEMPT_RT continues to miss the point: this part of the kernel perpetually sucks for the 99.9% who shouldn't have to become experts to fix it, and is needlessly optimised for the 0.1% who *are* expert enough to tune it to their obscure use case.
"Smooth audio, video and input" is not such an obscure use case — neither are "wifi shouldn't have 4-second RTT" and "writing to a USB stick shouldn't hang the system for minutes", and those areas of the kernel managed to figure this out long ago.
Posted Oct 23, 2021 0:56 UTC (Sat) by NYKevin (subscriber, #129325)

Posted Nov 5, 2021 11:49 UTC (Fri) by Shabbyx (guest, #104730)
That's certainly not true, given that I do everything on Linux, including playing video games of all sorts, and things are pretty smooth. Definitely a better experience than Windows, where large file copies bring the system to a crawl, or a browser update bugs you to restart every two seconds.

Posted Oct 22, 2021 7:22 UTC (Fri) by taladar (subscriber, #68407)

Posted Oct 22, 2021 19:04 UTC (Fri) by NYKevin (subscriber, #129325)
* SCHED_FIFO never preempts, which is problematic for software that assumes it is running in a preemptively-multitasked environment. If that software tries to run forever, it will succeed.
* SCHED_RR will starve all SCHED_OTHER threads on the system whenever the SCHED_RR thread(s) are non-idle. For example, if you have a runaway SCHED_RR process and want to terminate it with kill(1), that can't work unless the terminal emulator, shell, X server, etc. are all running as SCHED_RR (so that your keystrokes are even processed), and your shell does not have the reset-on-fork flag enabled (so that kill can actually get scheduled). This is simply not how the average DE is designed to work by default. On a headless machine, this might make *some* more sense, but you would still need sshd to be SCHED_RR, or else you'd need some kind of hypervisor-like thing which can do its own scheduling independently of the kernel (and even then, you're basically stuck killing the entire container, unless you can somehow inject threads into the runaway process from the hypervisor, which sounds like a crazy thing to do).
* Properly supporting SCHED_DEADLINE is complicated and requires a lot of empirical measurement and experimentation to figure out the correct parameter values. In practice, this basically requires dedicated hardware and software if you want it to work as intended. You can probably get some mileage out of simple trial and error, but the numbers you eventually come up with may not be portable to other hardware and software configurations.

Posted Oct 22, 2021 19:56 UTC (Fri) by andresfreund (subscriber, #69562)

Posted Oct 22, 2021 20:18 UTC (Fri) by NYKevin (subscriber, #129325)
* The term "high-priority" is ambiguous, so I have avoided it. "Importance" here refers to the human sysadmin's subjective opinion of which process should be prioritized, rather than anything the kernel knows about.

Posted Oct 24, 2021 21:23 UTC (Sun) by Jandar (subscriber, #85683)
sched(7) says:
> Since version 2.6.25, Linux also provides two /proc files that can be used to reserve a certain amount of CPU time to be used by non-real-time processes. Reserving CPU time in this fashion allows some CPU time to be allocated to (say) a root shell that can be used to kill a runaway process.
On my desktop 5% are reserved, so killing a runaway SCHED_RR process is no problem.

Posted Nov 4, 2021 9:26 UTC (Thu) by roblucid (guest, #48964)
Long-running tasks are preemptible and run on a subset of cores, while short runners go on cores with shorter timeslices, with penalties that evict them onto the long-runner cores if they fail to block. In general, the long-running tasks won't be preempted until the idle capacity on the short-timeslice cores is used up; this could balance the strategy, with cores moving between sets if a lot of preemption is happening or there is a lot of unused capacity.

Posted Nov 24, 2021 13:17 UTC (Wed) by foxhoundsk (guest, #155434)
Thanks!
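The 5% reservation Jandar mentions comes from the kernel's realtime-throttling defaults. Using the default values documented in sched(7) (sched_rt_period_us=1000000, sched_rt_runtime_us=950000; the numbers below are those documented defaults, not read from a live system), the arithmetic works out as follows:

```python
# Realtime tasks may consume sched_rt_runtime_us out of every
# sched_rt_period_us of CPU time; whatever is left over stays available
# to ordinary SCHED_OTHER tasks, such as a rescue shell used to kill a
# runaway realtime process.
period_us = 1_000_000   # default /proc/sys/kernel/sched_rt_period_us
runtime_us = 950_000    # default /proc/sys/kernel/sched_rt_runtime_us
reserved_pct = 100 * (period_us - runtime_us) / period_us
print(f"{reserved_pct:.0f}% reserved for non-realtime tasks")  # 5%
```

Writing -1 to sched_rt_runtime_us disables the throttling entirely, which removes this safety net.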