An EEVDF CPU scheduler for Linux
Posted Mar 9, 2023 19:04 UTC (Thu)
by geofft (subscriber, #59789)
In reply to: An EEVDF CPU scheduler for Linux by dullfire
Parent article: An EEVDF CPU scheduler for Linux
Unfortunately, the CFS quota mechanism tends to result in a lot of weird runtime behavior. The high-level problem, I think, is that you can use a lot of CPU right after being scheduled without the scheduler stopping you, especially in a multi-threaded process. Then, once you get descheduled, the scheduler realizes that you're so deeply in the red on quota that it won't reschedule you for seconds or even minutes afterwards, which isn't really what users - or the TCP services they talk to - expect. So even though Kubernetes turns it on by default, lots of operators turn it off in practice. It's gotten better recently (see https://lwn.net/Articles/844976/, which also does a better job of explaining the overall mechanism than I'm doing), but I haven't gotten a chance to try it yet.
I'd be curious to know whether EEVDF can implement the quota concept in a way that is less jittery.
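The burst-then-throttle pattern described above can be sketched as a toy model. The period, quota, and overshoot numbers here are illustrative choices of mine, not CFS defaults, and the debt-carryover is a simplification of the behavior the comment describes:

```python
# Toy model of bandwidth throttling: a group bursts well past its
# per-period CPU budget (e.g. many threads running at once before the
# scheduler reacts), then sits throttled while the accumulated debt is
# paid off period by period.

period_ms = 100.0       # bandwidth period
quota_ms = 20.0         # CPU budget per period (20% of one CPU)
overshoot_ms = 200.0    # CPU consumed in the burst before descheduling

debt = overshoot_ms - quota_ms  # runtime owed after the burst
throttled_periods = 0
while debt > 0:
    debt -= quota_ms            # each period's refill pays down debt
    throttled_periods += 1

# A 200 ms burst against a 20 ms/period budget costs 9 throttled
# periods here, i.e. roughly 0.9 s of latency for anything waiting
# on this group - the jitter the comment is complaining about.
assert throttled_periods == 9
```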
Posted Mar 9, 2023 22:33 UTC (Thu)
by Wol (subscriber, #4433)
I'd be surprised if CPU were actually allocated as "so much each second"; I'd allocate it per cycle. So effectively you allocate a bunch of CPU to all your processes, and when it runs out you go round the cycle and allocate again. That way your CPU is not idle if you have processes waiting.
Of course, there's plenty of other things to take into account - what happens if a process fork-bombs or something? And this idea of smaller chunks increasing your likelihood of scheduling might not be so easy to implement this way.
Actually, it's given me an idea for a fork-bomb-proof scheduler :-) Instead of forked processes each getting a fresh new slice of CPU, you set a limit on how much is available each cycle. Let's say for example that we want a cycle to last a maximum of 2 secs, and the default allocation is 100ms. That gives us a maximum of 20 processes getting a full timeslice allocation. So when process 20 forks, the parent loses half its allocation to its child, giving them 50ms each. Effectively, as it forks, the "nice" value goes up. Then as processes die, their slice gets given to other processes.
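A minimal sketch of this split-on-fork idea (the class, numbers, and halving rule are illustrative of the comment's proposal, not any real scheduler):

```python
# Sketch of the split-on-fork timeslice idea: a child is funded out of
# its parent's allocation, so the total cycle budget never grows.

class Task:
    def __init__(self, slice_ms):
        self.slice_ms = slice_ms

    def fork(self):
        """Halve the parent's slice and give the other half to the child."""
        self.slice_ms /= 2
        return Task(self.slice_ms)

# 20 tasks * 100 ms fill the 2-second cycle from the example.
tasks = [Task(100.0) for _ in range(20)]

# A fork bomb in the last task only dilutes that task's own allocation:
bomber = tasks[-1]
children = [bomber.fork() for _ in range(10)]

assert tasks[0].slice_ms == 100.0          # bystanders keep their full slice
assert bomber.slice_ms == 100.0 / 2**10    # the bomber starves itself
```

Note that the bomber plus all its children still sum to the original 100ms, which is what keeps already-running processes (like that terminal session) responsive.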
So a fork bomb can't DoS processes that are already running. And if say you had a terminal session already running it at least stands a chance of getting going and letting you kill the fork-bomb ...
Cheers,
Wol
Posted Mar 11, 2023 19:30 UTC (Sat)
by NYKevin (subscriber, #129325)
I imagine this is resolved by the "virtual time" that corbet mentioned in another comment. If you have too many processes, your virtual clock runs fast, so now there are more than 1000 (virtual) ms in a (real) second, and everybody gets scheduled as allocated. It's just that the 100 (virtual) ms that they get is significantly less than 100 real milliseconds. This is mathematically equivalent to multiplying everyone's quota by 10/11, but you don't have to actually go around doing all those multiplies and divides, nor do you have to deal with the rounding errors failing to line up with each other.
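The equivalence can be checked with a few lines of arithmetic (11 tasks and a 100ms quota are just the numbers implied by the 10/11 example above):

```python
# Virtual-time scaling: with 11 tasks each promised 100 (virtual) ms per
# second, the virtual clock runs at 1100 virtual ms per real second, so
# each task's 100 virtual ms is worth 100 * (1000/1100) real ms.

n_tasks = 11
quota_virtual_ms = 100.0
virtual_ms_per_real_s = n_tasks * quota_virtual_ms  # 1100

# Real CPU time each task receives per real second:
real_ms = quota_virtual_ms * (1000.0 / virtual_ms_per_real_s)

# Mathematically the same as multiplying every quota by 10/11 ...
assert abs(real_ms - quota_virtual_ms * 10 / 11) < 1e-9
# ... and the CPU is still fully allocated, with no rounding slop:
assert abs(n_tasks * real_ms - 1000.0) < 1e-9
```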
Of course, that wouldn't work for realtime scheduling (where you have actually given each process a contractual guarantee of 100 ms/s, and the process likely expects that to be 100 *real* milliseconds), but we're not talking about that. If you try to configure SCHED_DEADLINE in such a way, it will simply refuse the request as impossible to fulfill.