Various scheduler-related topics

By Jonathan Corbet
September 15, 2009
Scheduler-related development seems to come in bursts. Things will be relatively quiet for a few development cycles, then activity will suddenly increase. We would appear to be in one of those periods where developers start to show a higher level of interest in what the scheduler is doing. The posting of the BFS scheduler has certainly motivated some of this activity, but there is more than that going on.

Interactivity

On the BFS front, the (mildly) inflammatory part of the discussion would appear to have run its course. Anybody who has watched the linux-kernel list knows that serious attempts to fix problems often follow the storm; that appears to be the case this time around. Benchmarks are being posted by a number of people; as a general rule, the results of these benchmark runs tend to be mixed. There are also developers and users posting about problems that they are observing; see, for example, Jens Axboe's report of a ten-second pause while trying to run the xmodmap command.

As part of the process of tracking down problems, the conversation turned to tuning the scheduler. Ingo Molnar pointed out that there is a whole set of flags governing scheduler behavior, all of which can be tweaked by the system administrator:

Note, these flags are all runtime, the new settings take effect almost immediately (and at the latest it takes effect when a task has started up) and safe to do runtime. It basically gives us 32768 pluggable schedulers each with a slightly separate algorithm - each setting in essence creates a new scheduler.

The idea here is not that each user should be required to pick out the correct scheduler from a set of 32768 - a number which presumably seems high even to the "Linux is about choice" crowd. But these flags can be useful for anybody who is trying to track down why the behavior of the scheduler is not as good as it should be. When a tuning change improves things, it gives developers a hint about where they should be looking to find the source of the problem.

A particular test suggested by Ingo was this:

    echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

(Politically-correct developers will, of course, have debugfs mounted under /sys/kernel/debug. Your editor takes no position on the proper debugfs mount point.)

One tester reported immediately that setting this flag made the problems go away. Jens also noted that his ten-second xmodmap problem was solved. The evidence of problems with the NEW_FAIR_SLEEPERS feature was compelling enough that Ingo posted a patch to disable it by default; that patch has been merged for 2.6.32.

For the curious, the NEW_FAIR_SLEEPERS feature is a simple tweak which gives a process a small runtime credit when it returns to the run queue after a sleep. It is meant to help interactive processes, but, clearly, something is not working as expected. Once the real problem has been tracked down, NEW_FAIR_SLEEPERS could well be enabled by default once again. In the meantime, users experiencing interactivity problems may want to disable it and see whether things improve.
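
For those who want to see the shape of the logic, here is a small toy program modeling the sleeper credit. The constants and names are illustrative rather than taken from the CFS code, but the effect is the same: a task returning from sleep is placed a bit before the run queue's minimum virtual runtime, so it tends to be picked to run soon.

    /*
     * Toy model of the NEW_FAIR_SLEEPERS idea (not the actual kernel code):
     * a waking task is credited with up to one scheduling latency, placing
     * it slightly ahead of tasks which have stayed on the run queue.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define SCHED_LATENCY_NS 20000000ULL    /* 20ms; an illustrative value */

    struct entity {
        uint64_t vruntime;
    };

    static uint64_t max_u64(uint64_t a, uint64_t b)
    {
        return a > b ? a : b;
    }

    /* Place a waking entity relative to the run queue's minimum vruntime. */
    static void place_entity(uint64_t min_vruntime, struct entity *se,
                             int new_fair_sleepers)
    {
        uint64_t vruntime = min_vruntime;

        if (new_fair_sleepers)
            vruntime -= SCHED_LATENCY_NS;   /* the sleeper credit */

        /* never let a task gain time by being placed backwards */
        se->vruntime = max_u64(se->vruntime, vruntime);
    }

    int main(void)
    {
        uint64_t min_vruntime = 100000000ULL;   /* 100ms of queue progress */
        struct entity waker = { .vruntime = 0 };

        place_entity(min_vruntime, &waker, 1);
        printf("with NEW_FAIR_SLEEPERS:    placed at %llu ns\n",
               (unsigned long long)waker.vruntime);

        waker.vruntime = 0;
        place_entity(min_vruntime, &waker, 0);
        printf("without NEW_FAIR_SLEEPERS: placed at %llu ns\n",
               (unsigned long long)waker.vruntime);
        return 0;
    }

The credited task sorts ahead of other runnable tasks in virtual time, which is precisely the sort of boost that can go wrong if it is handed out too generously.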

Child-runs-first

Another default parameter is changing for 2.6.32; it controls which process runs first after a fork(). For much of the recent past, fork() has arranged things such that the child process gets to run before fork() returns to the parent; this behavior was based on the general observation that the child's work is often more important. There is a good reason to run the parent first, though: the parent's state is active in the processor, the translation lookaside buffer (TLB) contains the right information, etc. So parent-runs-first should perform better. It appears that recent tests showed that parent-runs-first does, indeed, outperform child-runs-first on that most important benchmark: kernel builds. That was enough to get the default changed.

There are some concerns that this change could expose application bugs. Jesper Juhl expresses those concerns this way:

I'm just worried that userspace programs have come to rely on a certain behaviour and changing that behaviour may result in undesired results for some apps. In a perfect world people would just fix those apps that (incorrectly) relied on a certain child-/parent-runs-first behaviour, but the world is not perfect, and many apps may not even have source available.

Child-runs-first has never been a part of the fork() API, though; it's not something that applications should rely on. Even before the change, behavior could differ as a result of preemption, SMP systems, and more. So it's really true that child-runs-first was never guaranteed. But that will not make users feel any better if applications break. To help those users, there is a new kernel.sched_child_runs_first sysctl knob; setting it to one will restore the previous behavior.
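
Applications which genuinely depend on a particular ordering are better off enforcing it themselves. The minimal (and entirely hypothetical) example below has the parent block on a pipe until the child says its setup is complete; it works the same way no matter which process the scheduler happens to run first.

    /*
     * Rather than assuming that the child runs first after fork(),
     * synchronize explicitly: the parent blocks on a pipe until the
     * child signals that it has finished its setup.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fds[2];
        char byte = 'x';
        pid_t pid;

        if (pipe(fds) < 0) {
            perror("pipe");
            return 1;
        }

        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }

        if (pid == 0) {                      /* child */
            close(fds[0]);
            printf("child: doing setup\n");  /* work that must happen first */
            if (write(fds[1], &byte, 1) != 1)
                _exit(1);
            close(fds[1]);
            _exit(0);
        }

        /* parent: wait for the child's signal, whichever process ran first */
        close(fds[1]);
        if (read(fds[0], &byte, 1) != 1) {
            perror("read");
            return 1;
        }
        close(fds[0]);
        printf("parent: child setup done, continuing\n");
        wait(NULL);
        return 0;
    }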

Better cpuidle governance

Active CPU scheduling is interesting, but there is also work happening in another area: what happens when nobody wants the CPU? Contemporary processors include a number of power management features which can be used to reduce power consumption when nothing is going on. Clearly, anybody who is concerned about power consumption will want the processor to be in a low-power state whenever possible. There are, however, some problems with a naive "go into a low power state when idle" policy:

  • Transitions between power states will, themselves, consume power. If a CPU is put into a very low-power state, only to be brought back into operation a few microseconds later, the total power consumption will increase.

  • Power state transitions have a performance cost. An extreme example would be simply pulling the plug altogether; power consumption will be admirably low, but the system will experience poor response times that not even the BFS scheduler can fix. Putting the CPU into a more conventional low-power state will still create latencies; it takes a while for the processor to get back into a working mode. So going into a low-power state too easily will hurt the performance of the system.

It turns out that the CPU "governor" code in the mainline kernel often gets this decision wrong, especially for the newer Intel "Nehalem" processors; the result is wasted energy and poor performance, where "poor performance" means a nearly 50% hit on some tests that Arjan van de Ven ran. His response was to put together a patch aimed at fixing the problems. The approach taken is interesting.

Clearly, it makes no sense to put the processor into a low-power state if it will be brought back to full power in the very near future. So all the governor code really has to do is to come up with a convincing prediction of the future so it knows when the CPU will be needed again. Unfortunately, the chip vendors have delayed the availability of the long-promised crystal-ball peripherals yet again, forcing the governor code to rely on heuristics; once again, software must make up for deficiencies in the hardware.

When trying to make a guess about when a CPU might wake up, there are two things to consider. One is entirely well known: the time of the next scheduled timer event. The timer will put an upper bound on the time that the CPU might sleep, but it is not a definitive number; interrupts may wake up the CPU before the timer goes off. Arjan's governor tries to guess when that interrupt might happen by looking at the previous behavior of the system. Every time that the processor wakes up, the governor code calculates the difference between the estimated and actual idle times. A running average of that difference is maintained and used to make a (hopefully) more accurate guess as to what the next idle time will really be.

Actually, several running averages are kept. A very long idle stretch is rather more likely to be interrupted by an interrupt than a short one, so a separate correction factor is maintained for each order of magnitude of idle time - a 1ms estimate gets a different correction factor than a 100µs or 10ms guess. Beyond that, a completely different set of correction factors is used (and maintained) when there is I/O outstanding on the current CPU; if processes are waiting on short-term (block) I/O, the chances of an early wakeup are higher.
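
A toy model of that bookkeeping might look like the following. The bucket count, decay rate, and fixed-point scale are invented for the example, and the correction is expressed here as a multiplicative factor (the arithmetic in Arjan's patch may well differ), but it shows how separate running averages can be kept per idle-time magnitude and per I/O state.

    /*
     * Toy model of the idle-time correction factors: one running average
     * per order of magnitude of expected idle time, with a separate set
     * used when I/O is outstanding.  All constants are illustrative.
     */
    #include <stdio.h>

    #define BUCKETS    6        /* ~1us, 10us, 100us, 1ms, 10ms, 100ms+ */
    #define RESOLUTION 1024     /* fixed-point scale for the factors */
    #define DECAY      8        /* weight given to history in the average */

    static unsigned int correction[2][BUCKETS];  /* [io_waiting][magnitude] */

    static int which_bucket(unsigned int expected_us)
    {
        int bucket = 0;

        while (expected_us >= 10 && bucket < BUCKETS - 1) {
            expected_us /= 10;
            bucket++;
        }
        return bucket;
    }

    /* Predict the real idle time from the distance to the next timer. */
    static unsigned int predict_idle(unsigned int timer_us, int io_waiting)
    {
        unsigned int f = correction[io_waiting][which_bucket(timer_us)];

        return (unsigned int)((unsigned long long)timer_us * f / RESOLUTION);
    }

    /* After waking, fold the measured idle time into the running average. */
    static void update_factor(unsigned int timer_us, unsigned int measured_us,
                              int io_waiting)
    {
        int bucket = which_bucket(timer_us);
        unsigned int f = correction[io_waiting][bucket];

        f = f * (DECAY - 1) / DECAY;
        f += (unsigned int)((unsigned long long)RESOLUTION * measured_us
                            / timer_us / DECAY);
        correction[io_waiting][bucket] = f;
    }

    int main(void)
    {
        int io, b, i;

        /* start out assuming the timer-based estimate is exactly right */
        for (io = 0; io < 2; io++)
            for (b = 0; b < BUCKETS; b++)
                correction[io][b] = RESOLUTION;

        /* a CPU whose 10ms timer sleeps keep being cut short at ~2.5ms */
        for (i = 0; i < 50; i++)
            update_factor(10000, 2500, 0);

        printf("predicted idle for a 10ms timer distance: %u us\n",
               predict_idle(10000, 0));
        return 0;
    }

After a string of sleeps which keep being cut short, the predicted idle time for that bucket shrinks accordingly, steering the governor toward lighter sleep states.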

The performance concern, meanwhile, is addressed by trying to come up with some sort of estimate of how badly power-management latency would hurt the system. A CPU which is doing very little work will probably cause little pain if it goes to sleep for a while. If, instead, the CPU is quite busy, it's probably better to stay powered up and ready to work. In an attempt to quantify "busy," the governor code calculates a "multiplier":

    multiplier = 1 + 20*load_average + 10*iowait_count

All of the numbers are specific to the current CPU. So the multiplier is heavily influenced by the system load average, and a bit less so by the number of processes waiting for I/O. Or so it seems - but remember that processes in uninterruptible waits (as are used for block I/O) are counted in the load average, so their influence is higher than it might seem. In summary, this multiplier grows quickly as the number of active processes increases.

The final step is to examine all of the possible sleep states that the processor provides, starting with the deepest sleep. Each sleep state has an associated "exit latency" value describing how long it takes to get out of that state; deeper sleeps have higher exit latencies. The new governor code multiplies the exit latency by the multiplier calculated above, then compares the result to its best guess for the idle time; if the expected idle time exceeds the adjusted latency, that sleep state is chosen. Given the large multipliers involved, the expected idle time must become quite long before a deep sleep state will be chosen on a busy system.
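
Put together, the selection step looks roughly like the sketch below. The sleep-state table and its exit latencies are invented for the example (the real governor gets them from the hardware description), and the load average is treated as a plain process count rather than the kernel's fixed-point value, but the comparison is the one described above.

    #include <stdio.h>

    struct sleep_state {
        const char *name;
        unsigned int exit_latency_us;
    };

    /* deepest (and slowest to leave) state first; values are made up */
    static const struct sleep_state states[] = {
        { "C3 (deep sleep)", 200 },
        { "C2 (stop clock)",  20 },
        { "C1 (halt)",         1 },
    };

    /* busier CPUs get a bigger penalty against slow-to-exit states */
    static unsigned int performance_multiplier(unsigned int load_average,
                                               unsigned int iowait_count)
    {
        return 1 + 20 * load_average + 10 * iowait_count;
    }

    static const char *pick_state(unsigned int predicted_idle_us,
                                  unsigned int load_average,
                                  unsigned int iowait_count)
    {
        unsigned int mult = performance_multiplier(load_average, iowait_count);
        unsigned int i;

        for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
            if (predicted_idle_us > states[i].exit_latency_us * mult)
                return states[i].name;

        return "no sleep state is worthwhile";
    }

    int main(void)
    {
        /* an idle CPU: a 5ms predicted idle easily justifies a deep sleep */
        printf("idle system:   %s\n", pick_state(5000, 0, 0));

        /* the same prediction on a loaded CPU only justifies a light state */
        printf("loaded system: %s\n", pick_state(5000, 2, 1));
        return 0;
    }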

According to Arjan, this change restores performance to something very close to that of a system which is not using sleep states at all. The improvement is significant enough that Arjan would like to see the code merged for 2.6.32, even though it appeared only during the merge window. That might happen, though it is possible that it will be turned into a separate CPU governor for one development cycle, just in case regressions turn up.



Various scheduler-related topics

Posted Sep 17, 2009 10:02 UTC (Thu) by TRS-80 (subscriber, #1804) [Link]

Has anyone looked at the OpenSolaris power-aware dispatcher and compared it to the Linux scheduler?

Various scheduler-related topics

Posted Sep 17, 2009 20:32 UTC (Thu) by arjan (subscriber, #36785) [Link]

Linux has had a power aware scheduler policy for several years already...

Various scheduler-related topics

Posted Sep 17, 2009 11:45 UTC (Thu) by hppnq (guest, #14462) [Link]

There is a good reason to run the parent first, though: the parent's state is active in the processor, the translation lookaside buffer (TLB) contains the right information, etc. So parent-runs-first should perform better. It appears that recent tests showed that parent-runs-first does, indeed, outperform child-runs-first on that most important benchmark: kernel builds. That was enough to get the default changed.

Ah, nostalgia... Here -- scroll down a bit -- the idea to run the child first was introduced, thinking that this would improve performance in many common cases, because it prevented unnecessary COWs when the child would not do very complicated things. Of course, Linus seemed to use lmbench instead of kernel builds. ;-)

Also, ESR released CML2-1.2.0!

Various scheduler-related topics

Posted Sep 17, 2009 15:11 UTC (Thu) by jimparis (subscriber, #38647) [Link]

Linus' comment at the time:
> We've invalidated the TLB anyway due to the page table copy
seems to contradict one point in this current article:
> .. the translation lookaside buffer (TLB) contains the right information, etc.
Unless things have changed in 8 years, of course.

Various scheduler-related topics

Posted Sep 18, 2009 7:12 UTC (Fri) by mingo (subscriber, #31122) [Link]

Unless things have changed in 8 years, of course.

Two fundamental things have changed.

Firstly, we have an increase in parallelism on the hardware level (more hardware threads, more cores, more sockets) - so pushing child tasks away to other CPUs is generally a good idea.

Secondly, back then in the 2.4 heydays we also didn't have proper vfork() yet. vfork() will in essence guarantee child-runs-first via an explicit handshake between parent and the exec()-ing child. (regardless of the sysctl_sched_child_runs_first control.) That's faster and more efficient than any heuristics the scheduler can provide in this area.

Various scheduler-related topics

Posted Sep 24, 2009 8:33 UTC (Thu) by renox (subscriber, #23785) [Link]

Erm, the vfork manpage comment "It is rather unfortunate that Linux revived this spectre from the past." isn't very motivating for application developers to use vfork...

I think that the API is faulty here: whether it is forking or sending a message, application developers know whether they would prefer the child/target process to be scheduled immediately, so the API should let the developer express that preference instead of relying on one global default.

Various scheduler-related topics

Posted Sep 18, 2009 7:37 UTC (Fri) by Kamilion (subscriber, #42576) [Link]

Could be the whole SMP-gone-mainstream thing... ;)
The x86 architecture has gone through some interesting permutations since 2001.

*sigh* I remember kernel 2.4.4... Wrestling with going from Slink to Potato and ReiserFS in 2.4.6 going haywire... The install discs still have a warm place in my big CD binder of Linux CDs going back to The Linux System Administrator's Survival Guide's Slackware 3.0 pack-in from 1996.

We've come so far. B-tree filesystems (B+Tree now) are hip again, Gentoo's visibly early console/splash work has paved the way for xsplash, 1.6GHz chips have become a hot item again, and we're finally at the point where 256GB SSDs are competing with the 250GB hard drives everyone was awestruck with years ago. Good times.

Various scheduler-related topics

Posted Sep 17, 2009 17:44 UTC (Thu) by michel (subscriber, #10186) [Link]

An inspired article. That was a joy to read.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds