Controlling the CPU scheduler with BPF
There are several CPU schedulers in the kernel, each of which works cooperatively to handle specific types of workloads. In systems without realtime processes, though, almost all scheduling is done by the Completely Fair Scheduler (CFS), to the point that most people probably just think of it as "the scheduler". CFS is a complicated beast; it embodies a set of hard-learned heuristics that seek to maximize performance for a wide variety of workloads, and has a number of knobs to tweak for the cases where the heuristics need help. CPU scheduling is a complex task, though, and it is not surprising that the results from CFS are not always seen as being optimal by all users.
Gushchin started the cover letter for the patch set by observing that an extensive look at the effects of the various CFS tuning knobs revealed that most of them have little effect on the performance of the workload. In the end, it came down to a couple of relatively simple decisions:
In other words, some of our workloads benefit by having long running tasks preempted by tasks handling short running requests, and some workloads that run only short term requests benefit from never being preempted.
The best scheduling policy varies from one workload to the next, so there is value in being able to tweak the policy as needed. That said, Gushchin noted that most workloads are well served by CFS as it is now; it may not make much sense to add more tweaks for the relatively small set of workloads that could benefit from them.
This is just the sort of situation where BPF has made inroads into other parts of the kernel. It gives users the flexibility to change policies to meet their needs, is fast enough to be sensibly used in performance-critical subsystems like the CPU scheduler, and adds no overhead on systems where it is not in use. It is somewhat surprising that there have been no serious attempts to integrate BPF into the scheduler until now.
Gushchin's patch set creates a new BPF program type (BPF_PROG_TYPE_SCHED) for programs that influence CPU-scheduler decisions. There are three attachment points for these programs:
- cfs_check_preempt_tick is called during the handling of the scheduler's periodic timer tick; a BPF program attached here can then look at which process is running. If that process should be allowed to continue to run, the hook can return a negative number to prevent preemption. A positive return value, instead, informs the scheduler that it should switch to a different process, thus forcing preemption to happen. Returning zero leaves the decision up to the scheduler as if the hook hadn't been run.
- cfs_check_preempt_wakeup is called when a process is woken by the kernel; a negative return value will prevent this process from preempting the currently running process, a positive value will force preemption, and zero leaves it up to the scheduler.
- cfs_wakeup_preempt_entity is similar to cfs_check_preempt_wakeup, but it is called whenever a new process is being selected for execution and can influence the decision. A negative return indicates no preemption, positive forces it, and zero leaves the decision to other parts of the scheduler.
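All three hooks share the same sign convention: negative suppresses preemption, positive forces it, and zero defers to CFS. A minimal user-space model of how the scheduler would combine a hook's return value with its own default decision (the function name and structure here are illustrative, not taken from the patch set):

```python
def should_preempt(hook_return: int, cfs_default: bool) -> bool:
    """Model the sign convention shared by the three BPF scheduler hooks:
    a negative return suppresses preemption, a positive return forces it,
    and zero leaves the decision to CFS's normal logic."""
    if hook_return < 0:
        return False
    if hook_return > 0:
        return True
    return cfs_default

# A hook protecting a long-running task would return a negative value:
print(should_preempt(-1, True))   # False: preemption suppressed
# A hook favoring a short-running request would return a positive value:
print(should_preempt(1, False))   # True: preemption forced
# Zero lets CFS decide as if the hook had not run:
print(should_preempt(0, True))    # True: CFS's own decision stands
```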
Gushchin notes that, at Facebook, the first experiments using these hooks "look very promising". By posting the patch set, he hoped to start a conversation on how BPF could be used within the scheduler.
For the most part, it seems that this goal has not been attained; the conversation around these patches has been relatively muted. The most significant comments have come from Qais Yousef who, since he comes from the mobile world, has a different perspective on scheduler issues. He noted that, in that realm, vendors tend to heavily modify the CPU scheduler (see this article for a look at one vendor's scheduler changes). Yousef would like to see the scheduler improved to the point that these vendor changes are no longer necessary; he worried that the addition of BPF hooks could thwart that effort:
So my worry is that this will open the gate for these hooks to get more than just micro-optimization done in a platform specific way. And that it will discourage having the right discussion to fix real problems in the scheduler because the easy path is to do whatever you want in userspace. I am not sure we can control how these hooks are used.
Yousef later recognized that there could be value in this feature, but suggested it should be tightly controlled. Among other things, he said, BPF programs used as scheduler hooks should be distributed within the kernel tree itself, with any out-of-tree hooks causing the kernel to become tainted, much like how loadable modules work.
Gushchin's position was that, by making it easy to try out scheduler changes, the new BPF hooks could accelerate scheduler development rather than slowing it down. Meanwhile, he suggested, having vendors apply their scheduler changes as BPF programs might be better than the sorts of patches they create now.
Beyond this exchange, the patch set has not yet received any significant feedback from either the core scheduler developers or the BPF community. That will clearly need to change if this work is to ever be considered for merging into the mainline kernel. Allowing user space to hook into the scheduler is likely to be a hard sell at best, but it's an idea that seems unlikely to go away anytime soon. For better or for worse, the Linux kernel serves a wide variety of users; providing the best solution for every one of them out of the box is always going to be a challenge.
Index entries for this article:
- Kernel: BPF
- Kernel: Scheduler
Posted Oct 22, 2021 4:06 UTC (Fri) by dvdeug (guest, #10998)
The point is, Linus & co. would never tune the kernel for that. It seems weird to imagine that all workloads can be best handled by a standard scheduler; the kernel doesn't know whether this is a mail server with a web server serving instructions on how to get mail (a web server under load is a sign of attack, not valuable work) or web server with a mail server because Exim is installed by default (vice versa) or whether the important thing was getting that headshot (as opposed to using a FPS as a tactile version of top, of course.) I'm sure the scheduler is good enough for virtually all cases, but if you're not running with an excess of CPU power, letting the sysadmin set priorities seems quite useful.
Posted Oct 22, 2021 4:32 UTC (Fri) by alison (subscriber, #63752)
/proc/sys/kernel has lots of scheduler parameters. Are you sure that they were not sufficient for your needs? Note also that there are schedulers besides CFS available.
> if you're not running with an excess of CPU power, letting the sysadmin set priorities seems quite useful.
The sysadmin can use /usr/bin/renice to set priorities. What BPF could do that a sysadmin cannot is dynamically adjust scheduler priorities and other parameters. Whether such adjustments offer genuine improvements in a given situation (or ever, frankly) needs to be measured.
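The renice mechanism mentioned here is also reachable programmatically; a small sketch using Python's standard os module, which wraps the same getpriority(2)/setpriority(2) calls. Raising the nice value (lowering priority) needs no special privileges; lowering it back would require CAP_SYS_NICE:

```python
import os

# 0 as the second argument means "the calling process". Raising the
# nice value by 5 lowers this process's scheduling priority, exactly
# as renice(1) would from the shell.
before = os.getpriority(os.PRIO_PROCESS, 0)
os.setpriority(os.PRIO_PROCESS, 0, before + 5)
after = os.getpriority(os.PRIO_PROCESS, 0)
print(after - before)  # 5
```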
Posted Oct 22, 2021 22:15 UTC (Fri) by flussence (guest, #85566)
They might have been when that Pentium 3 was new — when the Venn diagram of "someone trying to play an MP3 on Linux" and "masochistic computer nerd who enjoys tuning their OS all day" was a circle, but it's now 20 years later and the entire internet is still convinced Linux audio (among other things) is the punchline to a joke because of this missing-stair problem of pretending only supercomputers and full-time administrators use it.
> Note also that there are schedulers besides CFS available.
MuQSS is dead due to maintainer burnout, ProjectC is too unstable to use as a daily driver on a desktop, and telling people to use PREEMPT_RT continues to miss the point: this part of the kernel perpetually sucks for the 99.9% who shouldn't have to become experts to fix it, and is needlessly optimised for the 0.1% who *are* expert enough to tune it to their obscure use case.
"Smooth audio, video and input" is not such an obscure use case — neither are "wifi shouldn't have 4-second RTT" and "writing to a USB stick shouldn't hang the system for minutes", and those areas of the kernel managed to figure this out long ago.
Posted Oct 23, 2021 0:56 UTC (Sat) by NYKevin (subscriber, #129325)

Posted Nov 5, 2021 11:49 UTC (Fri) by Shabbyx (guest, #104730)
That's certainly not true, given that I do everything on Linux, including playing video games of all sorts, and things are pretty smooth. Definitely a better experience than Windows, where large file copies bring the system to a crawl, or a browser update bugs you to restart every two seconds.

Posted Oct 22, 2021 7:22 UTC (Fri) by taladar (subscriber, #68407)

Posted Oct 22, 2021 19:04 UTC (Fri) by NYKevin (subscriber, #129325)
* SCHED_FIFO never preempts, which is problematic for software that assumes it is running in a preemptively-multitasked environment. If that software tries to run forever, it will succeed.
* SCHED_RR will starve all SCHED_OTHER threads on the system whenever the SCHED_RR thread(s) are non-idle. For example, if you have a runaway SCHED_RR process and want to terminate it with kill(1), that can't work unless the terminal emulator, shell, X server, etc. are all running as SCHED_RR (so that your keystrokes are even processed), and your shell does not have the reset-on-fork flag enabled (so that kill can actually get scheduled). This is simply not how the average DE is designed to work by default. On a headless machine, this might make *some* more sense, but you would still need sshd to be SCHED_RR, or else you'd need some kind of hypervisor-like thing which can do its own scheduling independently of the kernel (and even then, you're basically stuck killing the entire container, unless you can somehow inject threads into the runaway process from the hypervisor, which sounds like a crazy thing to do).
* Properly supporting SCHED_DEADLINE is complicated and requires a lot of empirical measurement and experimentation to figure out the correct parameter values. In practice, this basically requires dedicated hardware and software if you want it to work as intended. You can probably get some mileage out of simple trial and error, but the numbers you eventually come up with may not be portable to other hardware and software configurations.

Posted Oct 22, 2021 19:56 UTC (Fri) by andresfreund (subscriber, #69562)

Posted Oct 22, 2021 20:18 UTC (Fri) by NYKevin (subscriber, #129325)
* The term "high-priority" is ambiguous, so I have avoided it. "Importance" here refers to the human sysadmin's subjective opinion of which process should be prioritized, rather than anything the kernel knows about.

Posted Oct 24, 2021 21:23 UTC (Sun) by Jandar (subscriber, #85683)
sched(7) says:
> Since version 2.6.25, Linux also provides two /proc files that can be used to reserve a certain amount of CPU time to be used by non-real-time processes. Reserving CPU time in this fashion allows some CPU time to be allocated to (say) a root shell that can be used to kill a runaway process.
On my desktop 5% are reserved, so killing a runaway SCHED_RR process is no problem.

Posted Nov 4, 2021 9:26 UTC (Thu) by roblucid (guest, #48964)
Long-running tasks are preemptible and run on a subset of cores, while short runners go on cores with shorter timeslices, with penalties that evict them onto the long-runner cores if they fail to block. In general, the long-running tasks won't be preempted until the idle capacity on the short-timeslice cores is used up; this could balance the strategy, with cores moving between sets if a lot of preemption is happening or there is a lot of unused capacity.

Posted Nov 24, 2021 13:17 UTC (Wed) by foxhoundsk (guest, #155434)
Thanks!
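The 5% reservation Jandar mentions comes from the kernel's realtime-throttling defaults. Using the default values documented in sched(7) (sched_rt_period_us=1000000, sched_rt_runtime_us=950000; the numbers below are those documented defaults, not read from a live system), the arithmetic works out as follows:

```python
# Realtime tasks may consume sched_rt_runtime_us out of every
# sched_rt_period_us of CPU time; whatever is left over stays available
# to ordinary SCHED_OTHER tasks, such as a rescue shell used to kill a
# runaway realtime process.
period_us = 1_000_000   # default /proc/sys/kernel/sched_rt_period_us
runtime_us = 950_000    # default /proc/sys/kernel/sched_rt_runtime_us
reserved_pct = 100 * (period_us - runtime_us) / period_us
print(f"{reserved_pct:.0f}% reserved for non-realtime tasks")  # 5%
```

Writing -1 to sched_rt_runtime_us disables the throttling entirely, which removes this safety net.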