TurboSched: the return of small-task packing

By Jonathan Corbet
July 1, 2019

CPU scheduling is a difficult task in the best of times; it is not trivial to pick the next process to run while maintaining fairness, minimizing energy use, and using the available CPUs to their fullest potential. The advent of increasingly complex system architectures is not making things easier; scheduling on asymmetric systems (such as the big.LITTLE architecture) is a case in point. The "turbo" mode provided by some recent processors is another. The TurboSched patch set from Parth Shah is an attempt to improve the scheduler's ability to get the best performance from such processors.

Those of us who have been in this field for far too long will, when seeing "turbo mode", think back to the "turbo button" that appeared on personal computers in the 1980s. Pushing it would clock the processor beyond its original breathtaking 4.77MHz rate to something even faster — a rate that certain applications were unprepared for, which is why the "go slower" mode was provided at all. Modern turbo mode is a different thing, though, and it's not just a matter of a missing front-panel button. In short, it allows a processor to be overclocked above its rated maximum frequency for a period of time when the load on the rest of system overall allows it.

Turbo mode can thus increase the CPU cycles available to a given process, but there is a reason why the CPU's rated maximum frequency is lower than what turbo mode provides. The high-speed mode can only be sustained as long as the CPU temperature does not get too high and, crucially (for the scheduler), the overall power load on the system must not be too high. That, in turn, implies that some CPUs must be powered down; if all CPUs are running, there will not be enough power available for any of those CPUs to go into the turbo mode. This mode, thus, is only usable for certain types of workloads and will not be usable (or beneficial) for many others.

A workload that would work well in turbo mode is one where the system as a whole is not fully utilized (so that some CPUs can be shut down), and where a relatively small number of processes can benefit from the higher CPU speeds. But that benefit will only be realized if the turbo mode can actually be used. The CPU scheduler in current kernels balances a great many requirements, but "make sure that some CPUs can go into turbo mode" has not been expressed as a need to be balanced in the past. It's thus unsurprising that the scheduler's operation is not optimal for systems with turbo mode and workloads that want to take advantage of that mode.

One problem in particular is that the scheduler is designed to keep the system as responsive as possible and to make the fullest use of the available CPUs. That goal is reflected in how processes are placed on CPUs throughout the system. If a sleeping process wakes up and needs to execute, the scheduler will try to place that process on an idle CPU, thus allowing it to execute immediately rather than waiting in a run queue. That is the right thing to do much of the time, but it is not ideal if your objective is to keep some CPUs powered down so that the others can run in turbo mode. In such cases, it might be better to put a newly awakened process onto a CPU that is already busy and let sleeping CPUs lie.

Getting the scheduler to pack more processes into running CPUs is the objective of the TurboSched patch set. But such packing needs to be done carefully; otherwise, scheduling latency could increase significantly and system utilization could be reduced. To avoid such problem, TurboSched limits this packing behavior to "jitter" processes — those that run sporadically for limited periods of time and which do not have significant response-time requirements. These processes are often doing some sort of housekeeping work and do not suffer from having to share a CPU with other work.

A question that immediately comes to mind is: how does the scheduler decide which processes fit into this "jitter" category? The answer is that it doesn't; such processes need to be specifically marked by user space. Specifically, TurboSched is built on top of the (still unmerged) scheduler utilization clamping patch set, which allows an administrator to impose limits on how much load any given process appears to put on the system. By putting an upper limit on the apparent load, the administrator can keep a given process from forcing a CPU's frequency to increase, even if that process will happily run 100% of the time. Processes marked this way already have a reduced claim to system CPU resources; TurboSched extends this interpretation and concludes that a sleeping CPU should not be powered up for processes whose maximum utilization is clamped.

The logic as implemented in the patch set actually goes a little beyond that, in that jitter processes will not be placed onto a CPU that is running at less that 12.5% of its capacity. The reasoning is that an underutilized CPU might well go idle soon; putting a new process there could prevent that from happening, which would be an undesirable result. Of course, it would also not be good to overload the running CPUs with jitter tasks, so there is an upper limit to how much jitter load can be placed on any given CPU.

This approach may seem familiar; it is reminiscent of the small-task packing algorithms that have been discussed since (at least) 2012. Small-task packing has never made its way into the mainline, so one might wonder why this variation would be different. The biggest difference this time is in the explicit marking of jitter tasks, which will effectively make TurboSched be a no-op on the bulk of the systems out there. In the absence of clamped tasks, the scheduler will run as it does now, so there should be no performance regressions for any existing workloads.

Meanwhile, the benefit for some workloads can be up to a 13% performance increase, according to some impressive ASCII graphics in the patch posting. This increase won't happen with all workloads, but on dedicated systems with well-understood and tuned workloads with the right mix of processes, TurboSched should make things run better. That, along with the relatively noninvasive nature of the patch set, suggests that it might just clear the high bar for scheduler changes.

Index entries for this article
Kernel	Scheduler

TurboSched: the return of small-task packing

Posted Jul 2, 2019 10:32 UTC (Tue) by nix (subscriber, #2304) [Link]

It feels like the system should be able to detect processes that run sporadically for limited periods of time (below, I call these 'interactive') without any sort of explicit marking. (Spotting processes with less than instant response time requirements automatically seems harder.)

Every desktop and server machine I've owned for the last decade has had turbo mode, so it's lovely to see a patchset that makes the scheduler properly aware of it --- but if this patchset requires explicit marking of processes, it seems unlikely that it'll be widely used :(

Speculating, perhaps the explicit marking could be automated away if there was a PAM module so you could group entire uids as interactive or not, and/or just mark everything that's a direct child of the X session as interactive -- or have the Wayland and X libraries mark clients of X or Wayland as interactive, and maybe arrange that things that talk to a TTY or PTY directly get an interactive marking... and then treat everything else with a low enough recent utilization as amenable to packing. (I'd imagine that new processes inherit the initial state of this mark from their parents, and probably you'd need a bit of code change to have make(1) and ninja etc mark themselves, since they are seriously noninteractive but only operate briefly).

TurboSched: the return of small-task packing

Posted Jul 2, 2019 11:57 UTC (Tue) by james (subscriber, #1325) [Link]

The biggest difference this time is in the explicit marking of jitter tasks, which will effectively make TurboSched be a no-op on the bulk of the systems out there.

Once upon a time this would have been true, but I'd be surprised if systemd didn't pick this up, and Android too (to the extent it isn't doing its own thing).

Android has a very good idea which tasks are interactive at any one point. Systemd doesn't quite have that overview of the system, but it shouldn't be difficult for a distro or admin to identify services that "run sporadically for limited periods of time and ... do not have significant response-time requirements" and mark them as such in the unit file.

I note that the most recent version "removed cgroup based task classification": I hope the systemd maintainers are consulted on the API.

TurboSched: the return of small-task packing

Posted Jul 2, 2019 15:16 UTC (Tue) by NightMonkey (guest, #23051) [Link] (1 responses)

Sorry, I have an affliction that sometimes makes me post this video whenever anything is even remotely close to "timesharing" as a computing topic. :) It's a classic, and, I think, still instructive on the issues and problems of managing computing resources. Has anything discussed here really gone away? They've been there since the beginning... :)

"1963 Timesharing: A Solution to Computer Bottlenecks": https://www.youtube.com/watch?v=Q07PhW5sCEk

Cheers!

TurboSched: the return of small-task packing

Posted Jul 2, 2019 15:35 UTC (Tue) by fotoba (subscriber, #61150) [Link]

Really? There was "bouncing cow problem," and this is to be named "indian run problem"

3.9 Merge window part 1
By Jonathan Corbet
February 20, 2013
A relatively simple scheduler patch fixes the "bouncing cow problem," wherein, on a system with more processors than running processes, those processes can wander across the processors, yielding poor cache behavior. For a "worst-case" tbench benchmark run, the result is a 15x improvement in performance.
https://lwn.net/Articles/538101/

As of Inian run
https://www.youtube.com/watch?v=oFS8dxZ2SdU

TurboSched: the return of small-task packing

Posted Jul 2, 2019 19:04 UTC (Tue) by jhoblitt (subscriber, #77733) [Link] (2 responses)

Wouldn't it be easier to just bring back the turbo button?

TurboSched: the return of small-task packing

Posted Jul 3, 2019 10:56 UTC (Wed) by nilsmeyer (guest, #122604) [Link] (1 responses)

I think some mainboards aimed at Overclockers / Gamers already support that to some extent through some ghastly GUI app in Windows.

TurboSched: the return of small-task packing

Posted Jul 3, 2019 13:50 UTC (Wed) by Clozure (guest, #125255) [Link]

Some are even acute enough to offer such buttons in UEFI's SETUP. Case in point, MSI Z270 Gaming Pro Carbon, which has a big, round OC in the first screen once you enter the configuration utility.

TurboSched: the return of small-task packing

Posted Jul 6, 2019 19:40 UTC (Sat) by robert_s (subscriber, #42402) [Link]

All this "12.5%" and "clamped maximum utilization" thing feels a bit of a hairy heuristic to me. Wouldn't there be cases where someone would specifically want a RT process to have its maximum usage clamped to prevent it being able to lock out system administration if it went wrong?

Wouldn't it be possible for this feature to piggyback on the PR_SET_TIMERSLACK setting in prctl? Surely a task which is marked as "ok being a little bit late" is exactly what this feature is looking for...