Completing the EEVDF scheduler
A quick EEVDF review
The EEVDF scheduler works to divide the available CPU time equally between all of the runnable tasks in the system (assuming all have the same priority). If four equal-priority tasks are contending for the CPU, each will be given a 25% share. Every task is assigned a virtual run time that represents its allocated share of the CPU; in the four-task example, the virtual run time can be thought of as a clock that runs at 25% of wall-clock speed. As tasks run, the kernel computes the difference between a task's virtual run time and its actual running time; the result is called "lag". A positive lag value means that a task is owed CPU time, while a negative value indicates that a task has received more than its share.
A task is deemed "eligible" if its lag value is zero or greater; whenever the CPU scheduler must pick a task to run, it chooses from the set of eligible tasks. For each of those tasks, a virtual deadline is computed by adding the time remaining in its time slice to the time it became eligible. The task with the earliest virtual deadline will be the one that is chosen to run. Since a longer time slice will lead to a later virtual deadline, tasks with shorter time slices (which are often latency sensitive) will tend to run first.
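As a rough illustration of that selection rule, a sketch (with invented names, and none of the weighting, cgroup, or rbtree machinery that the real code in kernel/sched/fair.c has to deal with) might look like this:

```c
#include <stddef.h>

/* Purely illustrative task state; not the kernel's data structures. */
struct task {
	long long lag;		/* entitled time minus received time (ns) */
	long long vdeadline;	/* eligibility time + remaining slice */
};

/* Pick the eligible task (lag >= 0) with the earliest virtual deadline. */
static struct task *pick_next(struct task *tasks, int ntasks)
{
	struct task *best = NULL;

	for (int i = 0; i < ntasks; i++) {
		if (tasks[i].lag < 0)
			continue;	/* ineligible: ran ahead of its share */
		if (!best || tasks[i].vdeadline < best->vdeadline)
			best = &tasks[i];
	}
	/* Since lag sums to zero, at least one runnable task is eligible. */
	return best;
}
```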
An example might help to visualize how lag works. Imagine three CPU-bound tasks, called A, B, and C, that all start at the same time. Before any of them runs, they will all have a lag of zero:
|      | A   | B   | C   |
|------|-----|-----|-----|
| Lag: | 0ms | 0ms | 0ms |
Since none of the tasks have a negative lag, all are eligible. If the scheduler picks A to run first with a 30ms (to pick a number) time slice, and if A runs until the time slice is exhausted, the lag situation will look like this:
|      | A     | B    | C    |
|------|-------|------|------|
| Lag: | -20ms | 10ms | 10ms |
Over those 30ms, each task was entitled to 10ms (one-third of the total) of CPU time. A actually got 30ms, so it accumulated a lag of -20ms; the other two tasks, which got no CPU time at all, ended up with 10ms of lag, reflecting the 10ms of CPU time that they should have received.
Task A is no longer eligible, so the scheduler will have to pick one of the others next. If B is given (and uses) a 30ms time slice, the situation becomes:
|      | A     | B     | C    |
|------|-------|-------|------|
| Lag: | -10ms | -10ms | 20ms |
Once again, each task has earned 10ms of lag corresponding to the CPU time it was entitled to, and B burned 30ms by actually running. Now only C is eligible, so the scheduler's next decision is easy.
One property of the EEVDF scheduler that can be seen in the above tables is that the sum of all the lag values in the system is always zero.
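That zero-sum property is just arithmetic: every millisecond one task consumes is a millisecond of entitlement shared across all of the runnable tasks. A small user-space sketch (illustrative only, not kernel code) that replays the example above makes the bookkeeping explicit:

```c
#include <stdio.h>

/*
 * Charge 'ran_ms' to task 'who' while every runnable task earns its
 * 1/ntasks share of the same interval.  Illustrative only; the kernel
 * does this accounting in virtual time and weights it by priority.
 */
static void account(long lag[], int ntasks, int who, long ran_ms)
{
	for (int i = 0; i < ntasks; i++)
		lag[i] += ran_ms / ntasks;	/* entitled share */
	lag[who] -= ran_ms;			/* time actually consumed */
}

int main(void)
{
	long lag[3] = { 0, 0, 0 };	/* tasks A, B, C */
	long sum = 0;

	account(lag, 3, 0, 30);		/* A runs a 30ms slice */
	account(lag, 3, 1, 30);		/* B runs a 30ms slice */

	for (int i = 0; i < 3; i++) {
		printf("%c: %ldms\n", 'A' + i, lag[i]);
		sum += lag[i];
	}
	printf("sum: %ldms\n", sum);	/* always zero */
	return 0;
}
```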
Lag and sleeping
The lag calculation is only relevant for tasks that are runnable; a task that sleeps for a day is not actually missing out on its virtual run time (since it has none), so it does not accumulate a huge lag value. The scheduler does, however, retain a task's current lag value when it goes to sleep, and starts from that value when the task wakes. So, if a task had run beyond its allocation before it sleeps, it will pay the penalty for that later, when it wakes.
There does come a point, though, where it may not make sense to preserve a task's lag. Should that task that just slept for a day really be penalized for having been allowed to run a bit over its allocation yesterday? It seems clear that, sooner or later, a task's lag should revert back to zero. But when that should happen is not entirely clear. As Zijlstra pointed out in this patch from the series, forgetting lag immediately on sleep would make it possible for tasks to game the system by sleeping briefly at the end of their time slice (when their lag is probably negative), with the result that they get more than their share of CPU time. Simply decaying the lag value over time will not work well either, he concluded, since lag is tied to virtual run time, which passes at a different (and varying) rate.
The solution is to decay a sleeping task's lag over virtual run time instead. The implementation of this idea in the patch set is somewhat interesting. When a task sleeps, it is normally removed from the run queue so that the scheduler need not consider it. With the new patch, instead, an ineligible process that goes to sleep will be left on the queue, but marked for "deferred dequeue". Since it is ineligible, it will not be chosen to execute, but its lag will increase according to the virtual run time that passes. Once the lag goes positive, the scheduler will notice the task and remove it from the run queue.
The result of this implementation is that a task that sleeps briefly will not be able to escape a negative lag value, but long-sleeping tasks will eventually have their lag debt forgiven. Interestingly, a positive lag value is, instead, retained indefinitely until the task runs again.
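A minimal sketch of the deferred-dequeue idea might look like the following; the names and structure here are invented for illustration and do not reflect the actual patches, which also have to cope with wakeups, migration, and cgroup hierarchies:

```c
/* Illustrative scheduling-entity state; not the kernel's definitions. */
struct entity {
	long long lag;			/* negative: overdrew its allocation */
	int	  on_rq;		/* still accounted to the run queue */
	int	  deferred_dequeue;	/* sleeping, but removal was deferred */
};

/* Called when the task goes to sleep. */
static void sketch_dequeue(struct entity *se)
{
	if (se->lag < 0) {
		/*
		 * In debt: leave the entity on the queue.  Being ineligible,
		 * it will never be picked, but its lag improves as virtual
		 * time passes for the queue as a whole.
		 */
		se->deferred_dequeue = 1;
		return;
	}
	/* Zero or positive lag is simply preserved until the next wakeup. */
	se->on_rq = 0;
}

/* Called as virtual time advances by vdelta for this queue. */
static void sketch_advance(struct entity *se, long long vdelta)
{
	if (!se->deferred_dequeue)
		return;
	se->lag += vdelta;		/* the debt decays in virtual time */
	if (se->lag >= 0) {		/* debt repaid: really dequeue now */
		se->deferred_dequeue = 0;
		se->on_rq = 0;
	}
}
```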
Time-slice control
As noted above, tasks with a shorter time slice will have an earlier virtual deadline, causing them to be selected sooner by the scheduler. But, in current kernels, that implicit priority only takes effect when the scheduler is looking for a new task to run. If a latency-sensitive task with a short time slice wakes up, it may still have to wait for the current task to exhaust its time slice (which might be long) before being able to run. Zijlstra's patch series changes that, though, by allowing one task to preempt another if its virtual deadline is earlier. This change provides more consistent timings for short-time-slice tasks, while perhaps slowing long-running tasks slightly.
That leaves one open question, though: how does one specify that a given task should be given a short time slice? In current kernels, there is no way for a non-realtime process to tell the kernel what its time slice should be, so this patch series adds that capability. Specifically, a task can use the sched_setattr() system call, passing the desired slice time (in nanoseconds) in the sched_runtime field of the sched_attr structure. This field, in current kernels, is only used for deadline scheduling. With this addition, any task can request shorter time slices, which will cause it to be run sooner and, possibly, more frequently. If, however, the requested time slice is too short, the task will find itself frequently preempted and will run slower overall.
The allowed range for time slices is 100µs to 100ms. For the curious, Zijlstra has illustrated the results of various time-slice choices as an impressive set of ASCII-art diagrams in the changelog for this patch.
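In practice, a request for a shorter slice would presumably look something like the sketch below. The structure layout mirrors struct sched_attr from the kernel's UAPI headers (defined locally here under a different name to keep the example self-contained); glibc does not wrap sched_setattr(), so the raw system call is used. Whether the call has any effect for a normal SCHED_OTHER task depends, of course, on running a kernel with these patches applied.

```c
#define _GNU_SOURCE
#include <sched.h>		/* SCHED_OTHER */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the layout of struct sched_attr in the kernel's UAPI headers. */
struct my_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* requested slice, in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct my_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_runtime = 2 * 1000 * 1000;	/* ask for a 2ms slice */

	/* sched_setattr(pid, attr, flags); pid 0 means the calling thread. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}
```

Presumably a value outside the 100µs-100ms window would be rejected or clamped and, as noted above, asking for a very short slice trades overall throughput for latency.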
These changes are in a relatively early state and seem likely to require revisions before they can be considered for merging. Among other things, the interaction with control groups has not yet been investigated and may well not work properly. But, once the details have been taken care of, the EEVDF scheduler should be getting to the point where it is ready for wider use.
| Index entries for this article | |
|---|---|
| Kernel | Releases/6.12 |
| Kernel | Scheduler/EEVDF |
Posted Apr 11, 2024 15:00 UTC (Thu) by brchrisman (subscriber, #71769) (5 responses)
This is mentioned a little offhand, but is this intentional? And is it preserved with respect to the handling of processes with perhaps long sleep times mentioned later in the article? It sounds like the chosen strategy of letting sleeping processes accumulate lag until it goes positive would maintain conservation of lag, whereas the decay-type possibilities discussed earlier in the article would potentially violate it without specific work to maintain that conservation.
Posted Apr 11, 2024 15:26 UTC (Thu) by corbet (editor, #1) (1 response)
Yes, I believe the "sum of all lag is zero" property is intentional. The changes described in the article would tend to conserve that property. That said, there are complications when, for example, a task with a non-zero lag simply exits. The EEVDF paper describes some elaborate schemes for distributing that lag across the remaining tasks, but I'm not sure the Linux scheduler does that.
Posted Apr 11, 2024 22:10 UTC (Thu) by Wol (subscriber, #4433)
Surely the easy way to do that is, as each task exits its timeslice and is allocated -ve lag, that same (+ve) lag is shared out amongst the other tasks. Any sleeping tasks with -ve lag less than their share simply get the -ve lag wiped into it and the share is recalculated. And if the running task exits with +ve lag, that lag is suspended and then merged with the -ve lag accumulated by the next task to run before being shared.
So basically, every time a timeslice expires, that's when everything adds up to 0.
Cheers,
Wol
Posted Apr 11, 2024 15:44 UTC (Thu) by shironeko (subscriber, #159952) (2 responses)
Posted Apr 11, 2024 22:19 UTC (Thu) by Wol (subscriber, #4433) (1 response)
As each task is rescheduled (be it completed or timeslice expired) you sum lag across all tasks that are not "sleeping with +ve lag", and if that is -ve, you cancel that negativity by creating and sharing the matching positivity across those tasks ...
So you can't get into an "all jobs have -ve lag" state.
The only problem is it's possible you'd have all tasks with matching 0 lag, but that is rather unlikely ...
Cheers,
Wol
Posted Apr 14, 2024 20:12 UTC (Sun) by shironeko (subscriber, #159952)
Posted Apr 11, 2024 16:37 UTC (Thu) by flussence (guest, #85566) (2 responses)
EEVDF has been great. MPI/thread-heavy workloads just work and I've felt no temptation to go back to third party patchsets.
Posted Apr 11, 2024 23:50 UTC (Thu) by shuhao (guest, #166834) (1 response)
Posted Apr 12, 2024 14:48 UTC (Fri) by flussence (guest, #85566)
Posted Apr 11, 2024 23:19 UTC (Thu) by atai (subscriber, #10977) (6 responses)
Posted Apr 12, 2024 1:22 UTC (Fri) by Wol (subscriber, #4433)
And on a stack like mine (dm-integrity, raid, lvm) disk i/o can also chew up cpu ...
Cheers,
Wol
Posted Apr 12, 2024 2:24 UTC (Fri) by willy (subscriber, #9762)
Posted Apr 13, 2024 15:11 UTC (Sat) by heftig (subscriber, #73632) (3 responses)
The slowdown may be coming from the thermal limit slowing down the entire processor plus desktop apps getting scheduled to E cores.
I also have an AMD laptop (7840U, 8 cores) with almost the same software where compilation doesn't affect UI nearly as much, even when the system hits its thermal limit.
Posted Apr 30, 2024 3:20 UTC (Tue) by fest3er (guest, #60379) (2 responses)
Is it possible that memory pressure is the source of that UI lag (such as parts of the UI being swapped/paged out)?
Posted Apr 30, 2024 4:02 UTC (Tue) by raven667 (subscriber, #5198) (1 response)
Posted Jun 8, 2024 5:14 UTC (Sat) by fest3er (guest, #60379)
It's a desktop with generally adequate cooling (an Antec case I bought for the Phenom-II I had years ago, and standard, non-exotic cooling).
I might try a Linux build on my Zenbook Ryzen 9 6900H laptop for comparison, and the same build on my desktop, limiting each to 12 concurrent tasks (-j 12). They should be about the same: the desktop has better cooling but slower NVMe, the laptop has a faster CPU/RAM/NVMe but maybe more CPU heat throttling, and both are happy to run at much higher temps than the Vishera liked.
Posted Apr 12, 2024 1:04 UTC (Fri) by glenn (subscriber, #102223)
I have certainly run into priority-inheritance-like problems when SCHED_IDLE and SCHED_OTHER threads contend for a mutex. A SCHED_IDLE thread grabs a mutex, and then SCHED_OTHER threads blocked on the mutex experience a classic priority-inversion problem: the mutex-holding thread keeps getting preempted because it is SCHED_IDLE. The scheduler doesn't know that it's impeding the progress of blocked SCHED_OTHER threads.
Posted Apr 13, 2024 21:37 UTC (Sat) by mirabilos (subscriber, #84359)
Posted Apr 21, 2024 10:13 UTC (Sun) by iq-0 (subscriber, #36655)