Completing the EEVDF scheduler
A quick EEVDF review
The EEVDF scheduler works to divide the available CPU time equally between all of the runnable tasks in the system (assuming all have the same priority). If four equal-priority tasks are contending for the CPU, each will be given a 25% share. Every task is assigned a virtual run time that represents its allocated share of the CPU; in the four-task example, the virtual run time can be thought of as a clock that runs at 25% of wall-clock speed. As tasks run, the kernel computes the difference between a task's virtual run time and its actual running time; the result is called "lag". A positive lag value means that a task is owed CPU time, while a negative value indicates that a task has received more than its share.
A task is deemed "eligible" if its lag value is zero or greater; whenever the CPU scheduler must pick a task to run, it chooses from the set of eligible tasks. For each of those tasks, a virtual deadline is computed by adding the time remaining in its time slice to the time it became eligible. The task with the earliest virtual deadline will be the one that is chosen to run. Since a longer time slice will lead to a later virtual deadline, tasks with shorter time slices (which are often latency sensitive) will tend to run first.
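As a rough illustration of that selection rule, a sketch (with invented names, and none of the weighting, cgroup, or rbtree machinery that the real code in kernel/sched/fair.c has to deal with) might look like this:

```c
#include <stddef.h>

/* Purely illustrative task state; not the kernel's data structures. */
struct task {
	long long lag;		/* entitled time minus received time (ns) */
	long long vdeadline;	/* eligibility time + remaining slice */
};

/* Pick the eligible task (lag >= 0) with the earliest virtual deadline. */
static struct task *pick_next(struct task *tasks, int ntasks)
{
	struct task *best = NULL;

	for (int i = 0; i < ntasks; i++) {
		if (tasks[i].lag < 0)
			continue;	/* ineligible: ran ahead of its share */
		if (!best || tasks[i].vdeadline < best->vdeadline)
			best = &tasks[i];
	}
	/* Since lag sums to zero, at least one runnable task is eligible. */
	return best;
}
```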
An example might help to visualize how lag works. Imagine three CPU-bound tasks, called A, B, and C, that all start at the same time. Before any of them runs, they will all have a lag of zero:
|      | A   | B   | C   |
|------|-----|-----|-----|
| Lag: | 0ms | 0ms | 0ms |
Since none of the tasks have a negative lag, all are eligible. If the scheduler picks A to run first with a 30ms (to pick a number) time slice, and if A runs until the time slice is exhausted, the lag situation will look like this:
|      | A     | B    | C    |
|------|-------|------|------|
| Lag: | -20ms | 10ms | 10ms |
Over those 30ms, each task was entitled to 10ms (one-third of the total) of CPU time. A actually got 30ms, so it accumulated a lag of -20ms; the other two tasks, which got no CPU time at all, ended up with 10ms of lag, reflecting the 10ms of CPU time that they should have received.
Task A is no longer eligible, so the scheduler will have to pick one of the others next. If B is given (and uses) a 30ms time slice, the situation becomes:
|      | A     | B     | C    |
|------|-------|-------|------|
| Lag: | -10ms | -10ms | 20ms |
Once again, each task has earned 10ms of lag corresponding to the CPU time it was entitled to, and B burned 30ms by actually running. Now only C is eligible, so the scheduler's next decision is easy.
One property of the EEVDF scheduler that can be seen in the above tables is that the sum of all the lag values in the system is always zero.
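That zero-sum property is just arithmetic: every millisecond one task consumes is a millisecond of entitlement shared across all of the runnable tasks. A small user-space sketch (illustrative only, not kernel code) that replays the example above makes the bookkeeping explicit:

```c
#include <stdio.h>

/*
 * Charge 'ran_ms' to task 'who' while every runnable task earns its
 * 1/ntasks share of the same interval.  Illustrative only; the kernel
 * does this accounting in virtual time and weights it by priority.
 */
static void account(long lag[], int ntasks, int who, long ran_ms)
{
	for (int i = 0; i < ntasks; i++)
		lag[i] += ran_ms / ntasks;	/* entitled share */
	lag[who] -= ran_ms;			/* time actually consumed */
}

int main(void)
{
	long lag[3] = { 0, 0, 0 };	/* tasks A, B, C */
	long sum = 0;

	account(lag, 3, 0, 30);		/* A runs a 30ms slice */
	account(lag, 3, 1, 30);		/* B runs a 30ms slice */

	for (int i = 0; i < 3; i++) {
		printf("%c: %ldms\n", 'A' + i, lag[i]);
		sum += lag[i];
	}
	printf("sum: %ldms\n", sum);	/* always zero */
	return 0;
}
```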
Lag and sleeping
The lag calculation is only relevant for tasks that are runnable; a task that sleeps for a day is not actually missing out on its virtual run time (since it has none), so it does not accumulate a huge lag value. The scheduler does, however, retain a task's current lag value when it goes to sleep, and starts from that value when the task wakes. So, if a task had run beyond its allocation before it sleeps, it will pay the penalty for that later, when it wakes.
There does come a point, though, where it may not make sense to preserve a task's lag. Should that task that just slept for a day really be penalized for having been allowed to run a bit over its allocation yesterday? It seems clear that, sooner or later, a task's lag should revert back to zero. But when that should happen is not entirely clear. As Zijlstra pointed out in this patch from the series, forgetting lag immediately on sleep would make it possible for tasks to game the system by sleeping briefly at the end of their time slice (when their lag is probably negative), with the result that they get more than their share of CPU time. Simply decaying the lag value over time will not work well either, he concluded, since lag is tied to virtual run time, which passes at a different (and varying) rate.
The solution is to decay a sleeping task's lag over virtual run time instead. The implementation of this idea in the patch set is somewhat interesting. When a task sleeps, it is normally removed from the run queue so that the scheduler need not consider it. With the new patch, instead, an ineligible process that goes to sleep will be left on the queue, but marked for "deferred dequeue". Since it is ineligible, it will not be chosen to execute, but its lag will increase according to the virtual run time that passes. Once the lag goes positive, the scheduler will notice the task and remove it from the run queue.
The result of this implementation is that a task that sleeps briefly will not be able to escape a negative lag value, but long-sleeping tasks will eventually have their lag debt forgiven. Interestingly, a positive lag value is, instead, retained indefinitely until the task runs again.
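A minimal sketch of the deferred-dequeue idea might look like the following; the names and structure here are invented for illustration and do not reflect the actual patches, which also have to cope with wakeups, migration, and cgroup hierarchies:

```c
/* Illustrative scheduling-entity state; not the kernel's definitions. */
struct entity {
	long long lag;			/* negative: overdrew its allocation */
	int	  on_rq;		/* still accounted to the run queue */
	int	  deferred_dequeue;	/* sleeping, but removal was deferred */
};

/* Called when the task goes to sleep. */
static void sketch_dequeue(struct entity *se)
{
	if (se->lag < 0) {
		/*
		 * In debt: leave the entity on the queue.  Being ineligible,
		 * it will never be picked, but its lag improves as virtual
		 * time passes for the queue as a whole.
		 */
		se->deferred_dequeue = 1;
		return;
	}
	/* Zero or positive lag is simply preserved until the next wakeup. */
	se->on_rq = 0;
}

/* Called as virtual time advances by vdelta for this queue. */
static void sketch_advance(struct entity *se, long long vdelta)
{
	if (!se->deferred_dequeue)
		return;
	se->lag += vdelta;		/* the debt decays in virtual time */
	if (se->lag >= 0) {		/* debt repaid: really dequeue now */
		se->deferred_dequeue = 0;
		se->on_rq = 0;
	}
}
```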
Time-slice control
As noted above, tasks with a shorter time slice will have an earlier virtual deadline, causing them to be selected sooner by the scheduler. But, in current kernels, that implicit priority only takes effect when the scheduler is looking for a new task to run. If a latency-sensitive task with a short time slice wakes up, it may still have to wait for the current task to exhaust its time slice (which might be long) before being able to run. Zijlstra's patch series changes that, though, by allowing one task to preempt another if its virtual deadline is earlier. This change provides more consistent timings for short-time-slice tasks, while perhaps slowing long-running tasks slightly.
That leaves one open question, though: how does one specify that a given task should be given a short time slice? In current kernels, there is no way for a non-realtime process to tell the kernel what its time slice should be, so this patch series adds that capability. Specifically, a task can use the sched_setattr() system call, passing the desired slice time (in nanoseconds) in the sched_runtime field of the sched_attr structure. This field, in current kernels, is only used for deadline scheduling. With this addition, any task can request shorter time slices, which will cause it to be run sooner and, possibly, more frequently. If, however, the requested time slice is too short, the task will find itself frequently preempted and will run slower overall.
The allowed range for time slices is 100µs to 100ms. For the curious, Zijlstra has illustrated the results of various time-slice choices as an impressive set of ASCII-art diagrams in the changelog for this patch.
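In practice, a request for a shorter slice would presumably look something like the sketch below. The structure layout mirrors struct sched_attr from the kernel's UAPI headers (defined locally here under a different name to keep the example self-contained); glibc does not wrap sched_setattr(), so the raw system call is used. Whether the call has any effect for a normal SCHED_OTHER task depends, of course, on running a kernel with these patches applied.

```c
#define _GNU_SOURCE
#include <sched.h>		/* SCHED_OTHER */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the layout of struct sched_attr in the kernel's UAPI headers. */
struct my_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* requested slice, in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct my_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_runtime = 2 * 1000 * 1000;	/* ask for a 2ms slice */

	/* sched_setattr(pid, attr, flags); pid 0 means the calling thread. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}
```

Presumably a value outside the 100µs-100ms window would be rejected or clamped and, as noted above, asking for a very short slice trades overall throughput for latency.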
These changes are in a relatively early state and seem likely to require revisions before they can be considered for merging. Among other things, the interaction with control groups has not yet been investigated and may well not work properly. But, once the details have been taken care of, the EEVDF scheduler should be getting to the point where it is ready for wider use.
| Index entries for this article | |
|---|---|
| Kernel | Releases/6.12 |
| Kernel | Scheduler/EEVDF |
Posted Apr 11, 2024 15:00 UTC (Thu) by brchrisman (subscriber, #71769) (5 responses)
This is mentioned a little offhand, but is this intentional? And is it preserved with respect to the handling of processes with perhaps long sleep times mentioned later in the article? It sounds like the chosen strategy of letting sleeping processes accumulate lag until it goes positive would maintain conservation of lag, whereas the decay-type possibilities discussed earlier in the article would potentially violate it without specific work to maintain that conservation.
Posted Apr 11, 2024 15:26 UTC (Thu) by corbet (editor, #1) (1 response)
Yes, I believe the "sum of all lag is zero" property is intentional. The changes described in the article would tend to conserve that property. That said, there are complications when, for example, a task with a non-zero lag simply exits. The EEVDF paper describes some elaborate schemes for distributing that lag across the remaining tasks, but I'm not sure the Linux scheduler does that.
Posted Apr 11, 2024 22:10 UTC (Thu) by Wol (subscriber, #4433)
Surely the easy way to do that is, as each task exits its timeslice and is allocated -ve lag, that same (+ve) lag is shared out amongst the other tasks. Any sleeping tasks with -ve lag less than their share simply get the -ve lag wiped into it and the share is recalculated. And if the running task exits with +ve lag, that lag is suspended and then merged with the -ve lag accumulated by the next task to run before being shared.
So basically, every time a timeslice expires, that's when everything adds up to 0.
Cheers,
Wol
Posted Apr 11, 2024 15:44 UTC (Thu) by shironeko (subscriber, #159952) (2 responses)
Posted Apr 11, 2024 22:19 UTC (Thu) by Wol (subscriber, #4433) (1 response)
As each task is rescheduled (be it completed or timeslice expired) you sum lag across all tasks that are not "sleeping with +ve lag", and if that is -ve, you cancel that negativity by creating and sharing the matching positivity across those tasks ...
So you can't get into an "all jobs have -ve lag" state.
The only problem is it's possible you'd have all tasks with matching 0 lag, but that is rather unlikely ...
Cheers,
Wol
Posted Apr 14, 2024 20:12 UTC (Sun) by shironeko (subscriber, #159952)
Posted Apr 11, 2024 16:37 UTC (Thu) by flussence (guest, #85566) (2 responses)
EEVDF has been great. MPI/thread-heavy workloads just work and I've felt no temptation to go back to third party patchsets.
Posted Apr 11, 2024 23:50 UTC (Thu) by shuhao (guest, #166834) (1 response)
Posted Apr 12, 2024 14:48 UTC (Fri) by flussence (guest, #85566)
Posted Apr 11, 2024 23:19 UTC (Thu) by atai (subscriber, #10977) (6 responses)
Posted Apr 12, 2024 1:22 UTC (Fri) by Wol (subscriber, #4433)
And on a stack like mine (dm-integrity, raid, lvm) disk i/o can also chew up cpu ...
Cheers,
Wol
Posted Apr 12, 2024 2:24 UTC (Fri) by willy (subscriber, #9762)
Posted Apr 13, 2024 15:11 UTC (Sat) by heftig (subscriber, #73632) (3 responses)
The slowdown may be coming from the thermal limit slowing down the entire processor plus desktop apps getting scheduled to E cores.
I also have an AMD laptop (7840U, 8 cores) with almost the same software where compilation doesn't affect UI nearly as much, even when the system hits its thermal limit.
Posted Apr 30, 2024 3:20 UTC (Tue) by fest3er (guest, #60379) (2 responses)
Is it possible that memory pressure is the source of that UI lag (such as parts of the UI being swapped/paged out)?
Posted Apr 30, 2024 4:02 UTC (Tue) by raven667 (subscriber, #5198) (1 response)
Posted Jun 8, 2024 5:14 UTC (Sat) by fest3er (guest, #60379)
It's a desktop with generally adequate cooling (an Antec case I bought for the Phenom-II I had years ago, and standard, non-exotic cooling).
I might try a Linux build on my Zenbook Ryzen 9 6900H laptop for comparison, and the same build on my desktop, limiting each to 12 concurrent tasks (-j 12). They should be about the same: the desktop has better cooling but slower NVMe, the laptop has a faster CPU/RAM/NVMe but maybe more CPU heat throttling, and both are happy to run at much higher temps than the Vishera liked.
Posted Apr 12, 2024 1:04 UTC (Fri) by glenn (subscriber, #102223)
I have certainly run into priority-inheritance-like problems when SCHED_IDLE and SCHED_OTHER threads contend for a mutex. A SCHED_IDLE thread grabs a mutex, and then SCHED_OTHER threads blocked on the mutex experience a classic priority-inversion problem: the mutex-holding thread keeps getting preempted because it is SCHED_IDLE. The scheduler doesn't know that it's impeding the progress of blocked SCHED_OTHER threads.
Posted Apr 13, 2024 21:37 UTC (Sat) by mirabilos (subscriber, #84359)
Posted Apr 21, 2024 10:13 UTC (Sun) by iq-0 (subscriber, #36655)