Another push for sched_ext
As a quick refresher: sched_ext allows the creation of BPF programs that handle almost every aspect of the scheduling problem; these programs can be loaded (and unloaded) at run time. Sched_ext is designed to safely fall back to the completely fair scheduler should something go wrong (if a process fails to be run within a time limit, for example). It has been used to create a number of special-purpose schedulers, often with impressive performance benefits for the intended workload. See this 2023 article for a more detailed overview of this work.
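The fallback behavior described above can be pictured with a small sketch. This is only an illustration of the idea, not how the kernel implements it: the `Task` type, the function names, and the 30-second timeout are all hypothetical, every task is assumed to be runnable, and Python is used purely for brevity.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    last_ran: float  # when the task last got a CPU, in seconds (hypothetical field)


def stalled_tasks(tasks, now, timeout):
    """Return the tasks that have gone longer than `timeout` without running."""
    return [t for t in tasks if now - t.last_ran > timeout]


def pick_scheduler(tasks, now, timeout=30.0, current="bpf"):
    """Switch from the loaded BPF scheduler to the built-in one if any task stalls."""
    if current == "bpf" and stalled_tasks(tasks, now, timeout):
        return "builtin"
    return current


tasks = [Task("editor", last_ran=9.0), Task("compiler", last_ran=0.0)]
print(pick_scheduler(tasks, now=10.0))  # nothing has stalled yet
print(pick_scheduler(tasks, now=40.0))  # "compiler" has waited 40 seconds
```

The point of the sketch is that the check lives outside the pluggable scheduler: a misbehaving BPF program cannot keep itself loaded, because the decision to fall back is made by code that it does not control.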
Heo lists a number of changes that have been made to sched_ext since the previous version was posted in November. For the most part, these appear to be adjustments to the BPF API to make the writing of schedulers easier. There is also a new shutdown mechanism that, among other things, disables the BPF scheduler during power-management events like system suspend. There is now support for CPU-frequency scaling, and some debugging interfaces have been added to make developing schedulers easier. The core design of sched_ext appears to have stabilized, though.
Increasing interest
Even before getting to the changes, though, Heo called attention to the increasing interest in sched_ext that is being shown across the community and beyond. Valve is planning to use sched_ext for better game scheduling on the Steam Deck. Ubuntu is considering shipping it in the 24.10 release. Meta and Google are increasing their use of it in their production fleets. There is also evidently interest in using it in ChromeOS, and Oculus is looking at it as well. Heo concludes that section with:
Given that there already is substantial adoption which continues to grow and sched_ext doesn't affect the built-in schedulers or the rest of kernel in an invasive manner, I believe it's reasonable to consider sched_ext for inclusion.
Whether that inclusion will happen remains an open question, though. The posting of version 4 of the patch set in July 2023 led to a slow-burning discussion on the merits of this development. Scheduler maintainer Peter Zijlstra rejected the patches outright, saying:
There is not a single doubt in my mind that if I were to merge this, there will be Enterprise software out there that will mandate its own BPF sched thing, or else it won't work. They will not care, they will not contribute, they might even pull a RedHat and only share the code to customers.
He added that he saw no value in merging the code, and dropped out of the conversation. Mel Gorman also expressed his opposition to merging sched_ext, echoing Zijlstra's concern that enterprise software would start requiring the use of special-purpose schedulers. He later added that, in his opinion (one shared with Zijlstra), sched_ext would work actively against the improvement of the current scheduler:
I generally worry that certain things may not have existed in the shipped scheduler if plugging was an option including EAS, throttling control, schedutil integration, big.Little, adapting to chiplets and picking preferred SMT siblings for turbo boost. In each case, integrating support was time consuming painful and a pluggable scheduler would have been a relatively easy out that would ultimately cost us if it was never properly integrated.
Heo, naturally, disagreed with a lot of the concerns that had been raised. There are, he said, scheduling problems that cannot be addressed with tweaks to the current scheduler, especially in "hyperscaling" environments like Meta. He disagreed that sched_ext would impose a maintenance burden, arguing that the intrusion of BPF into other parts of the kernel has not had that result. Making it possible for users to do something new is beneficial, even if there will inevitably be "stupid cases" resulting from how some choose to use the new feature. In summary, he said, opponents are focused on the potential (and, in his opinion, overstated) costs of sched_ext without taking into account the benefits it would bring.
Restarting the conversation
That message, in October, was the end of the conversation at the time. Heo is clearly hoping for a better result this time around, but Zijlstra's response was not encouraging:
I fundamentally believe the approach to be detrimental to the scheduler eco-system. Witness the metric ton of toy schedulers written for it, that's all effort not put into improving the existing code.
He said that he would not accept any part of this patch series until "the cgroup situation" has been resolved. That "situation" is a performance problem that affects certain workloads when a number of control groups are in use. Rik van Riel had put together a patch series to address this problem in 2019, but it never reached the point of being merged; Zijlstra seems to be insisting that this work be completed before sched_ext can be considered, and he gave little encouragement that it would be more favorably considered even afterward.
Heo expressed a willingness (albeit reluctantly) to work on the control-group problem if it would clear the way for sched_ext. He strongly disagreed with Zijlstra's characterization of sched_ext schedulers as "toy schedulers" and the claim that working on sched_ext will take effort away from the mainline scheduler, though. There is, he said, no perfect CPU scheduler, so the mainline scheduler has to settle for being good enough for all users. That makes it almost impossible to experiment with "radical ideas", and severely limits the pool of people who can work on the scheduler. Much of the energy that goes into sched_ext schedulers, he said, is otherwise unavailable for scheduler development at all.
There is, he said, value in some of those radical ideas:
Yet, the many different ways that even simple schedulers can demonstrates sometimes significant behavior and performance benefits for specific workloads suggest that there are a lot of low hanging fruits in the area. Low hanging fruits that we can't easily reach from our current local optimum. A single implementation which has to satisfy all users all the time is unlikely to be an effective vehicle for mapping out such landscape.
Igalia developer Changwoo Min, who is working with Valve on gaming-oriented scheduling, supported Heo's argument, saying that: "The successful implementation of sched_ext enriches the scheduler community with fresh insights, ideas, and code".
That, as of this writing, is where this conversation stands.
What next?
Sched_ext is on the schedule for the BPF track of the Linux Storage, Filesystem, Memory-Management, and BPF Summit, which begins on May 13. That discussion will cover the future development of sched_ext but, most likely, will not be able to address the question of whether this work should be merged at all. That discussion could continue, on the mailing lists and elsewhere, for some time yet.
Sometimes, when a significant kernel development stalls in this way, distributors that see value in it will ship the patches anyway, as Ubuntu, Valve, and ChromeOS are considering doing. While shipping out-of-tree code is often discouraged, it can also serve to demonstrate interest in a feature and flush out any early problems that result from its inclusion. If things go well, this practice can strengthen the argument for merging the code into the mainline, albeit with the ever-present possibility of changes that create pain for the early adopters.
Whether that will be the path taken for sched_ext remains to be seen. What
is certain is that this work has attracted a lot of interest and is
unlikely to go away anytime soon. Sched_ext has the potential to enable a
new level of creativity in scheduler development, even if it remains out of
the mainline — but that potential will be stronger if it does end up being
merged. Significant scheduler patches are not merged quickly even when
they are uncontroversial; this one will be slower than most if it is
accepted at all.
| Index entries for this article | |
|---|---|
| Kernel | BPF/CPU scheduling |
| Kernel | Scheduler/Extensible scheduler class |
Posted May 9, 2024 15:36 UTC (Thu)
by flussence (guest, #85566)
[Link] (6 responses)
No, no. There isn't a “scheduler ecosystem”. There is a scheduler *monoculture*, and for twenty years now people have interpreted that as damage and routed around it. The few of those who've dared to negotiate with the scheduler tyrant directly in the past have burned out and left kernel development forever - which is why it took a witheringly embarrassing fifteen years to properly address the Wasted Cores paper with more than a band-aid.
The only two choices remaining here are either accept the reality of the situation and address the root cause (in old hacker parlance, "maintainer needs face time with a LART"), or continue to be obstinate about it and encourage downstream hacks to proliferate - both individual and corporate. The corporations don't need royal consent to F up the kernels they ship on their devices, they've *been* doing it. And because they're butchering their device-specific kernels with hardwired hacks, for want of a sane pluggable mechanism, *users* are the only ones that get screwed because they can't turn it off.
Posted May 9, 2024 15:48 UTC (Thu)
by intelfx (subscriber, #130118)
[Link] (2 responses)
Could you please translate that? I’m not old hacker enough and that doesn’t seem to parse (or google).
Posted May 9, 2024 16:03 UTC (Thu)
by schessman (subscriber, #82966)
[Link]
Posted May 9, 2024 19:56 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
FWIW, DDG gave the "Luser" expansion in its info box. My Google result has Wikipedia/Wiktionary as the first result (same thing).
Posted May 15, 2024 21:05 UTC (Wed)
by Conan_Kudo (subscriber, #103240)
[Link] (2 responses)
I personally would love to see sched_ext land upstream for the simple reason that nobody upstream cares about desktop Linux and regular user needs for scheduling. Having a way to build optimized schedulers outside of the kernel to bypass CFS and EEVDF and their flawed behaviors for regular users gives us an actual chance to deal with long-running responsiveness and performance-under-load issues that desktop Linux users observe that otherwise drive them to use alternative kernels that patch in other schedulers (like Con Kolivas' MuQSS and its predecessors).
Posted May 15, 2024 22:06 UTC (Wed)
by jordalgo (guest, #170580)
[Link]
Posted May 16, 2024 17:00 UTC (Thu)
by Manifault (guest, #155796)
[Link]
Posted May 9, 2024 18:30 UTC (Thu)
by summentier (guest, #100638)
[Link] (3 responses)
I understand that it has been common practice to strongarm^Wencourage patch submitters into cleaning up about as much mildly related kernel code as they are willing to in order to get their patches merged.
This on the other hand seems to make just as much sense as the argument that Ukraine aid cannot be passed until the US-Mexican border is completely secure.
Or am I missing something here? Unfortunately, the maintainer does not really elaborate in his post ...
Posted May 9, 2024 19:13 UTC (Thu)
by summentier (guest, #100638)
[Link] (2 responses)
I was trying to be funny there, but upon rereading this, I realized I was WAY out of line. This is the kernel, nobody's life is on the line. Apologies to Mr Zijlstra for drawing such a connection.
Unfortunately, there is no way to delete a comment. Granted, it would have been better to catch this before posting. Mr Corbet, feel free to delete this.
Posted May 12, 2024 14:48 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Posted May 28, 2024 2:44 UTC (Tue)
by DanilaBerezin (guest, #168271)
[Link]
Posted May 9, 2024 23:32 UTC (Thu)
by mattdm (subscriber, #18)
[Link] (4 responses)
This is probably futile, but...
That situation _still_ may not make everyone happy, but it'd at least be nice if people were mad about the actual thing, rather than something very different.
If Red Hat makes improvements to the scheduler, via the mainline scheduler or BPF, they'll be shared back in a way that ultimately benefits everyone, and which emphasizes upstream collaboration -- _even if_ the BPF approach would technically allow us to not do so. (If this policy changes in the future, I'll go on record as saying that I'll be one of the people angry about it.)
Posted May 10, 2024 11:11 UTC (Fri)
by hkario (subscriber, #94864)
[Link] (3 responses)
(there are exceptions, of course, like security fixes, or unresponsive upstreams, etc. but the policy is still very much to get every patch shipped merged upstream)
disclaimer: I work at Red Hat
Posted May 10, 2024 11:24 UTC (Fri)
by sdalley (subscriber, #18550)
[Link] (2 responses)
Posted May 11, 2024 16:51 UTC (Sat)
by mattdm (subscriber, #18)
[Link] (1 responses)
Posted May 13, 2024 9:03 UTC (Mon)
by sdalley (subscriber, #18550)
[Link]
Posted May 10, 2024 1:26 UTC (Fri)
by hmanning77 (subscriber, #160992)
[Link] (7 responses)
Posted May 10, 2024 2:14 UTC (Fri)
by Manifault (guest, #155796)
[Link] (6 responses)
Posted May 10, 2024 14:08 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (5 responses)
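From v3, section 2
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
So if the verifier refuses to load a BPF program at runtime, ANY BPF program, isn't it in breach of the GPL?
Cheers,
Wol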
Posted May 10, 2024 14:42 UTC (Fri)
by Manifault (guest, #155796)
[Link] (4 responses)
Also, my comment said that the verifier will reject sched_ext programs that are _not_ licensed with GPLv2. If it rejects programs with a different license, they wouldn't get any protections from GPLv2 regardless.
Posted May 10, 2024 21:10 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
1. Is it legal to make a modified version of the BPF verifier which does not reject non-GPL-licensed programs? It might violate 17 USC 1201 (and similar laws in other countries). Version 3 of the GPL explicitly repudiates that provision of copyright law, but the kernel is licensed under GPLv2. OTOH, the GPL is much more permissive than a "typical" (proprietary) copyright license, so I don't know what courts would say in that situation. Certainly the spirit of the license is that you can change whatever you want, so long as you distribute those changes (if at all) under the GPL. I also believe this is the intent of most if not all of the people who have actually contributed source code to the kernel.
2. Is it legal to make a BPF sched_ext program that is not GPL? In most countries, this will turn on some kind of substantial similarity analysis. In the US, some courts have adopted a software-specific test called the "abstraction-filtration-comparison test," which complicates things, but the short version is that there must be portions of the allegedly infringing work (the BPF program) which are both eligible for copyright protection and also present in the original work (the kernel), and those portions must be substantial enough to infer that some kind of "copying" happened. This is not necessarily limited to literal copying of source code, but it can't be something as abstract as an entire algorithm, either (nobody owns quicksort, for example). The AFC test is basically just a more methodical and specific way of doing that analysis for software. Probably other countries won't do the exact same analysis, but it is hard to imagine them doing something wildly different.
Posted May 11, 2024 1:07 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
This feels more like a DMCA kind of thing to me.
Posted May 11, 2024 8:39 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Are BPF programs distributed as source? I thought the kernel ran them through a jit compiler?
Because we have the problem there that the GPL is not a particularly suitable licence for stuff distributed as source, about the only Freedom you're missing is the ability to distribute - the mere fact it's source gives you everything else.
Oh - and by the way, does it REALLY refuse to run anything that's not GPL2? What about BSD? MIT? MPL2? PD etc?
Cheers,
Wol
Posted May 11, 2024 6:54 UTC (Sat)
by mb (subscriber, #50428)
[Link]
Yes, of course. That does not change the licensing situation at all.
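You can:
- Remove this check from the verifier.
- Use the modified verifier.
- Distribute the resulting verifier code.
- Write a non-GPL BPF program and use it locally without distributing.
It may be a license violation (courts have to decide), if:
- You distribute your non-GPL BPF program.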
It might also be totally fine to distribute the non-GPL BPF program along with the modified verifier, if a court rules that the BPF program is not a derived work.
Posted May 11, 2024 23:58 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
I would expect most of the work there to be not so much about "code" but much more about experimenting with and benchmarking a very wide range of workloads. It sounds like the real question here is not really about "improving existing code" but rather "Can a unique scheduler do everything well enough?" It feels like a very difficult question but it's probably even harder to answer with less room to experiment...
> While shipping out-of-tree code is often discouraged, it can also serve to demonstrate interest in a feature and flush out any early problems that result from its inclusion. If things go well, this practice can strengthen the argument for merging the code into the mainline, albeit with the ever-present possibility of changes that create pain for the early adopters.
Absolutely: trial and error. Sometimes that requires actually shipping it to gather measurements and experience on a very large scale.
I heard some guys even send rockets in space nowadays with the expectation that they will fail.
Posted May 12, 2024 15:15 UTC (Sun)
by marcH (subscriber, #57642)
[Link] (10 responses)
And? What's new?
I bet the vast majority of CPU cycles "scheduled" by Linux already involve large amounts of closed-source software. Even when most web browsers (the OS on top of the OS) are now "open-core", (1) they still include a fair amount of closed-source code (2) The javascript they run is complex and minified (3) WASM is probably closed-source most of the time.
In other words, it's never been possible to reproduce and test "Enterprise" or any other real workload out of the box. If you want help from the maintainer and community, then you've always had to simplify, open and share your workload first. If no one can understand what you do then you're on your own, good bye. That's always been the deal. Simple.
Same logic with source code. Products almost always ship with custom stable branches with various backports and out of tree code. Even Linux distributions have always done this. So to engage from the community and get "free" support, you always had to switch to the latest commit on the main branch and to the maintainer's .config first (minor exaggeration to get the point across).
So I really don't see why a different logic would suddenly apply to custom BPF schedulers. If it's private then you're on your own as usual. Same thing if you share your BPF scheduler but maintainers think it's cr*p, as in "doctor, it hurts when I do this..." The answer to that question has never changed.
BTW what are the open-source test suites and workloads available for the scheduler? I'm surprised none was mentioned, it seems like a key element in that discussion.
Posted May 12, 2024 20:17 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (9 responses)
The problem comes when the patient says "doctor it hurts when I do ..." something that's necessary for normal life, like go for a walk. That regularly lands my grand-daughter in considerable pain. It could easily land my daughter temporarily paralysed on the floor.
And this is the problem. I get the impression there are quite a few important people in the Linux community whose attitude seems to be "the computer is there to run the OS. Who cares about the users" ... this discussion seems to have added another one to the list ...
"BOFH, my computer crashes every time I run our production software" - "Well, don't run your production software, then ..."
Cheers,
Wol
Posted May 12, 2024 22:51 UTC (Sun)
by marcH (subscriber, #57642)
[Link] (8 responses)
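- ... crashes when I run my production software.
- Interesting! What is that software, what does it do?
- Can't tell you.
- .....
Or like:
- Here is the list of all the stupid things the documentation told me not to do and that I'm doing anyway with my custom BPF scheduler...
- Good bye!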
Posted May 13, 2024 9:43 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (7 responses)
"We're running this commercial software on stock <distro of choice>".
Like my daughter / grand-daughter going for a walk.
What then?
Cheers,
Wol
Posted May 13, 2024 14:02 UTC (Mon)
by pizza (subscriber, #46)
[Link] (6 responses)
"Contact <commercial software vendor> for support."
(And, more often than not, the <vendor> will say "you're not using the supported OS/hardware we specified, goodbye.")
Posted May 13, 2024 15:15 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (5 responses)
I'm running stock Excel on stock Windows. Response times are absolute shite - to the point that we are in danger of it taking longer to run than the time available.
(Crap choice of software, I know. Not my choice.)
The most likely response from the vendors I know of is "ask the community".
But the point is, you're all coming up with POSSIBLE excuses for the vendor. What happens when the vendor's recommended, supported environment is "not fit for the purpose intended" to quote UK legalese? ESPECIALLY if said environment contains obvious flaws (not necessarily the vendor's fault) that would help you massively if they were fixed.
That was the complaint with ext4 years ago. That's the complaint with sched_ext now. That's the complaint with Rust. There are people who are actively obstructing attempts to improve Linux, because all they can see is their personal downside.
Are they Luddites? I don't know, I sincerely hope not. After all, the true Luddites could see the benefits of the technology they destroyed - that's why they destroyed it, because they could see the benefits would go to others, not them.
Cheers,
Wol
Posted May 13, 2024 16:49 UTC (Mon)
by pizza (subscriber, #46)
[Link] (4 responses)
The fact that the vendors you know are universally crappy doesn't change whose obligation it is to provide support.
Again, $user has a *commercial* relationship with $vendor. $user has no such relationship with "the community", which made $user zero promises, received zero compensation, and thus has precisely zero legal or moral obligation to give $user the time of day, much less provide support for a product they had nothing to do with.
Posted May 13, 2024 17:08 UTC (Mon)
by mb (subscriber, #50428)
[Link] (3 responses)
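Yet, the users will spam the community mailing lists, because that's the next logical step for them, if the vendor refuses support.
Why should we merge a feature that doesn't benefit the community?
If sched_ext benefits the community, I'm all for it. If not, why can't the exceptional use cases just ship a patched kernel?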
Posted May 13, 2024 17:30 UTC (Mon)
by pizza (subscriber, #46)
[Link]
Yep. And that's what $vendors are counting on as it lets them save money on knowledgeable support staff.
(Mind you, user support requests to vendors are not necessarily reasonable. After all, you wouldn't expect a company that makes hammers to "support" a customer complaining that the structure they are building keeps falling apart)
> Why should we merge a feature that doesn't benefit the community?
Every feature benefits someone. Unfortunately every feature also brings along costs that rarely fall onto the same someones that are reaping the benefits.
Those costs can be short term ("performance regression under every other workload") or longer term (technical debt, combinatorial complexity, security vulnerabilities with cutesy names, etc)
Posted May 13, 2024 22:02 UTC (Mon)
by marcH (subscriber, #57642)
[Link] (1 responses)
And? What's new?
> If sched_ext benefits the community, I'm all for it. If not, why can't the exceptional use cases just ship a patched kernel?
Agreed: if sched_ext turns out to be used ONLY by closed-source vendors then it shouldn't be merged. But that seems rather unlikely, doesn't it?
Posted May 25, 2024 0:16 UTC (Sat)
by mrugiero (guest, #153040)
[Link]
Posted May 13, 2024 16:17 UTC (Mon)
by riking (subscriber, #95706)
[Link]
Posted May 16, 2024 9:14 UTC (Thu)
by rwmj (subscriber, #5474)
[Link] (2 responses)
(1) Is BPF actually fast enough here? I mean, surely this code is called every time you do schedule(), which would be very frequent, so you'd want that to be as fast as it can be. BPF is JITted but does it compete with AOT-compiled C code?
(2) Related to (1), why wouldn't a pluggable system of regular kernel modules work for this use case? From a quick scan of kernel/sched in the sources it seems like the current schedulers are not modules.
Posted May 16, 2024 13:27 UTC (Thu)
by corbet (editor, #1)
[Link] (1 responses)
I'm not the expert in this area, but can try to answer the questions...
As I understand it, BPF is indeed fast enough to use in this way. It's all native code by the time it runs.
Using regular modules would forego many of the safety features of BPF, making it much easier to hose the system. That would be undesirable in a system meant to encourage experimentation.
Posted May 17, 2024 14:57 UTC (Fri)
by daroc (editor, #160859)
[Link]
