Some notes from the BFS discussion
Since then, CFS creator Ingo Molnar has responded with a series of
benchmark results comparing the two schedulers. Tests included kernel
build times, pipe performance, messaging performance, and an online
transaction processing test; graphs were posted showing how each scheduler
performed on each test. Ingo's conclusion: "Alas, as it can be seen
in the graphs, i can not see any BFS performance improvements, on this
box." In fact, the opposite was true: BFS generally performed
worse than the mainline scheduler.
Con's answer was best described as "dismissive":
[snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]
As far as your editor can tell, Con's objections to the results mirror those heard elsewhere: Ingo chose an atypical machine for his tests, and those tests, in any case, do not really measure the performance of a scheduler in a desktop situation. The more cynical observers seem to believe that Ingo is more interested in defending the current scheduler than improving the desktop experience for "normal" users.
The machine chosen was certainly at the high end of the "desktop" scale:
A number of people thought that this box is not a typical desktop Linux system. That may indeed be true - today. But, as Ingo (among others) has pointed out, it's important to be a little ahead of the curve when designing kernel subsystems:
Btw., that's why the Linux scheduler performs so well on quad core systems today - the groundwork for that was laid two years ago when scheduler developers were testing on a quads. If we discovered fundamental problems on quads _today_ it would be way too late to help Linux users.
Partly in response to the criticisms, though, Ingo reran his tests on a single quad-core system, the same type of system as Con's box. The end results were just about the same.
The hardware used is irrelevant, though, if the benchmarks are not testing performance characteristics that desktop users care about. The concern here is latency: how long it takes before a runnable process can get its work done. If latencies are too high, audio or video streams will skip, the pointer will lag the mouse, scrolling will be jerky, and Maelstrom players will lose their ships. A number of Ingo's original tests were latency-related, and he added a couple more in the second round. So it looks like the benchmarks at least tried to measure the relevant quantity.
Benchmark results are not the same as a better desktop experience, though, and a number of users are reporting a "smoother" desktop when running with BFS. On the other hand, making significant scheduler changes in response to reports of subjective "feel" is a sure recipe for trouble: if one cannot measure improvement, one not only risks failing to fix any problems, one is also at significant risk of introducing performance regressions for other users. There has to be some sort of relatively objective way to judge scheduler improvements.
The way preferred by the current scheduler maintainers is to identify causes of latencies and fix them. The kernel's infrastructure for the identification of latency problems has improved considerably over the last year or two. One useful tool is latencytop, which collects data on what is delaying applications and presents the results to the user. The ftrace tracing framework is also able to create data on the delay between when a process is awakened and when it actually gets into the CPU; see this post from Frederic Weisbecker for an overview of how these measurements can be taken.
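The wakeup-to-run delay mentioned above can also be sampled without patching anything. Here is a minimal sketch, assuming root access, a kernel built with CONFIG_SCHED_TRACER, and the standard debugfs mount point; if those assumptions do not hold, the script just says so:

```shell
# Sketch: sample the worst-case wakeup latency recorded by ftrace's
# "wakeup" tracer (the delay between a task being woken and actually
# getting onto the CPU). Guarded in case the tracer is unavailable
# or we lack root.
T=/sys/kernel/debug/tracing
if [ -w "$T/current_tracer" ]; then
    echo wakeup > "$T/current_tracer"    # track highest-priority wakeups
    echo 0 > "$T/tracing_max_latency"    # reset the recorded maximum
    sleep 5                              # let some scheduling happen
    msg="max wakeup latency (us): $(cat "$T/tracing_max_latency")"
    echo nop > "$T/current_tracer"       # restore the default tracer
else
    msg="ftrace wakeup tracer not available (need root and CONFIG_SCHED_TRACER)"
fi
echo "$msg"
```

This only records the single worst case; for per-wakeup data, the ftrace approach described in Frederic Weisbecker's post, or latencytop itself, gives a fuller picture.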
If there are real latency problems remaining in the Linux scheduler - and there are enough "BFS is better" reports to suggest that there are - then using the available tools to describe them seems like the right direction to take. Once the problem is better understood, it will be possible to consider possible remedies. It may well be that the mainline scheduler can be adjusted to make those problems go away. Or, possibly, a more radical sort of approach is necessary. But, without some understanding of the problem - and associated ability to measure it - attempted fixes seem a bit like a risky shot in the dark.
Ingo welcomed Con back to the development community and invited him to help improve the Linux scheduler. This seems unlikely to happen, though. Con's way of working has never meshed well with the kernel development community, and he is showing little sign of wanting to change that situation. That is unfortunate; he is a talented developer who could do a lot to improve Linux for an important user community. The adoption of the current CFS scheduler is a direct result of his earlier work, even if he did not write the code which was actually merged. In general, though, improving Linux requires working with the Linux development community; in the absence of a desire to do that effectively, there will be severe limits on what a developer will be able to accomplish.
(See also: Frans Pop's benchmark tests,
which show decidedly mixed results.)
| Index entries for this article | |
|---|---|
| Kernel | Latency |
| Kernel | Scheduler |
Posted Sep 10, 2009 2:34 UTC (Thu)
by ncm (guest, #165)
It sounds as if Con could have much greater effect by posting benchmarks that mimic what he and those who agree with him consider typical use cases, and that do poorly under the current scheduler. The kernel people seem to be pretty good at hitting numeric targets they can reproduce.
Posted Sep 10, 2009 16:10 UTC (Thu)
by mingo (guest, #31122)
Note that Con did write a tool that measures various latency aspects of the kernel scheduler: InterBench.
Interestingly, the BFS vs. mainline numbers Con posted show a mainline desktop latency advantage (also here).
(Caveat emptor: I have not done those measurements myself, so I don't know how reliable they are - the standard deviation seems very high.)
Note that you don't 'have to' come up with a numeric result - a deterministic result that is described well and can be reproduced by a kernel developer is useful too.
Obviously, numeric results have the considerable advantage of removing subjective bias from tests: they turn a subjective impression into a hard number that cannot be ignored by either side of an argument. On the flip side, they are harder to obtain; latencytop, for example, should help out there.
Posted Sep 10, 2009 6:12 UTC (Thu)
by fredrik (subscriber, #232)
http://thread.gmane.org/gmane.linux.kernel/886319/focus=8...
And if I interpret the thread correctly, it seems that Ingo Molnar and Jens Axboe actually managed to pinpoint and fix a latency-related issue in CFS - an issue that might have gone undetected had it not been for BFS.
Yay for "trolls" that spur kernel improvement. ;)
Posted Sep 10, 2009 6:26 UTC (Thu)
by drag (guest, #31333)
Posted Sep 10, 2009 11:21 UTC (Thu)
by Tracey (guest, #30515)
I wasn't sure when I'd have the time to try them, but later into the night I was tuning up a Fedora 11 system for audio work. After I had set it up and was testing audio latency via the JACK audio system, I decided to start tuning (err, poking things into) some of the scheduler stuff in /proc/sys/kernel.
This was on an older dual-core with 4 GB of RAM running Fedora 11. I tried the scheduler tweaks on kernels 2.6.30.5-43.fc11.x86_64 (stock Fedora) and kernel-rt-2.6.29.6-1.rt23.4.fc11.ccrma.x86_64 (Fernando at CCRMA's real-time patched kernel).
What I was looking for was how low I could get the audio latency without getting xruns in the audio system. I noticed that by tweaking sched_latency_ns, sched_wakeup_granularity_ns, and sched_min_granularity_ns I could get better latency on both the Fedora and CCRMA kernels.
The testing mostly consisted of starting JACK from qjackctl, starting the Hydrogen drum machine and sometimes another soft-synth, then starting glxgears and dragging it or something else quickly around the screen. I also opened Firefox and other things, just to try to harass the audio session.
I could get the Fedora kernel down to about 5 ms latency and the CCRMA-rt kernel to just above 1 ms latency using the scheduler tweaks. That was an improvement of 30-50% over the kernel defaults. So I did prove, to myself at least, that the CFS scheduler can be tweaked. Of course, the system load took something of a hit (just as I was told it would).
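For reference, the tunables mentioned above live under /proc/sys/kernel on kernels of that era. A read-only sketch that just prints their current values - the 10 ms write shown in the trailing comment is purely illustrative, not a recommendation:

```shell
# Sketch: inspect the CFS tunables discussed above. Read-only; the
# files only exist on CFS kernels of this vintage, so each read is
# guarded. Writing new values (root required) is shown only as a comment.
K=/proc/sys/kernel
for t in sched_latency_ns sched_min_granularity_ns sched_wakeup_granularity_ns; do
    if [ -r "$K/$t" ]; then
        echo "$t = $(cat "$K/$t") ns"
    else
        echo "$t: not present on this kernel"
    fi
done
# To lower the latency target to an illustrative 10 ms (as root):
#   echo 10000000 > /proc/sys/kernel/sched_latency_ns
```

Values are in nanoseconds, so typos are easy to make; printing before writing is a cheap sanity check.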
Anyway, here's the really funny part: after I set the scheduler parameters lower, I "noticed" that the screen was smoother and more responsive. Totally subjective on my part. Of course, it was very late and I needed sleep.
This whole BFS versus CFS thing seems to be a black hole that likes to tear up the folks who get too close to it.
Posted Sep 10, 2009 15:54 UTC (Thu)
by mingo (guest, #31122)
Anyway, here's the really funny part: after I set the scheduler parameters lower, I "noticed" that the screen was smoother and more responsive. Totally subjective on my part. Of course, it was very late and I needed sleep.
That's very much possible. The upstream scheduler is a deadline scheduler in essence, and /proc/sys/kernel/sched_latency_ns sets the latency target. The scheduler tries to schedule tasks so that no task ever gets a longer delay than this latency target (i.e., no task misses its deadline).
The defaults on 2.6.31 are 20 msecs for 1-CPU systems, 40 msecs for 2-CPU systems and 60 msecs for 4-CPU systems (etc. - growing logarithmically by CPU count).
A smaller value there means more scheduling - but also faster reaction and 'smoother' mixing of workloads. So if you lower your 40 msecs down to 20 msecs, you could get a "two times smoother" visual experience for certain GUI workloads.
You can think of it as if your 50 Hz flickering screen went to 100 Hz by halving its latency target. Such changes can affect the subjective end result rather spectacularly.
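Those defaults follow a simple rule: the 20 ms base target scaled by (1 + log2(ncpus)), which matches the logarithmic growth described above. A quick shell check of the quoted numbers (the formula here is inferred from the 20/40/60 ms figures, not taken from the kernel source):

```shell
# Compute latency_target = 20 ms * (1 + log2(ncpus)) for a few CPU
# counts; the 1-, 2- and 4-CPU results match the 20/40/60 ms defaults
# quoted above.
for ncpus in 1 2 4 8; do
    n=$ncpus; log2=0
    while [ "$n" -gt 1 ]; do n=$((n / 2)); log2=$((log2 + 1)); done
    target=$((20 * (1 + log2)))
    echo "ncpus=$ncpus -> latency target ${target} ms"
done
```

Note how slowly the target grows: doubling the CPU count adds a constant 20 ms rather than doubling the target.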
It would be nice if you documented your latency parameter changes so that we could consider them for the mainline scheduler. Those parameters were always meant to be tweaked (and regularly were), with the effects re-measured.
The latest scheduler tree (the 2.6.32 scheduler bits) also has them lowered - you can test it by booting the -tip kernel.
Does the -tip tree feel more interactive to you, or do you still need to lower the latency targets there too?
(Feel free to report it in email or here on LWN.net.)
Posted Sep 17, 2009 6:34 UTC (Thu)
by eduperez (guest, #11232)
The defaults on 2.6.31 are 20 msecs for 1-CPU systems, 40 msecs for 2-CPU systems and 60 msecs for 4-CPU systems (etc. - growing logarithmically by CPU count).
From my complete ignorance of how it works, may I ask why? This seems counter-intuitive to me: as the number of CPUs increases, users expect to feel lower latency; and having more CPUs means the scheduler has an easier time finding an empty CPU where the delayed task can execute. Thanks.
Posted Sep 17, 2009 16:44 UTC (Thu)
by dlang (guest, #313)
Posted Sep 21, 2009 12:56 UTC (Mon)
by eduperez (guest, #11232)
Posted Sep 10, 2009 10:45 UTC (Thu)
by ctg (guest, #3459)
Posted Sep 10, 2009 22:21 UTC (Thu)
by Velmont (guest, #46433)
Yes, that email should really be read. It gives me a warm fuzzy feeling all over. Just a quote from the mail Con Kolivas sent [snip]
Posted Sep 10, 2009 11:58 UTC (Thu)
by liw (subscriber, #6379)
Posted Sep 10, 2009 13:08 UTC (Thu)
by mjthayer (guest, #39183)
(That is, the blips should be irrelevant for throughput once you average things out over a period of time, and as far as the responsiveness is concerned, they have already happened and you can no longer take them back.)
Posted Sep 10, 2009 15:22 UTC (Thu)
by anton (subscriber, #25547)
Posted Sep 10, 2009 16:18 UTC (Thu)
by mingo (guest, #31122)
Beyond IO bound tasks, there's also a general quality argument behind rewarding sleepers:
Lighter, leaner tasks get an advantage. They run less and consequently sleep more.
Tasks that do intelligent multi-threading with a nice, parallel set of tasks get an advantage too.
CPU hogs that slow down the desktop and eat battery like the end of the world is nigh should take a back seat compared to lighter, friendlier, 'more interactive' tasks.
So the Linux scheduler always tried to reward tasks that are more judicious with CPU resources. An app can get 10% snappier by using 5% less CPU time.
Posted Sep 10, 2009 19:08 UTC (Thu)
by mjthayer (guest, #39183)
And regarding the rewarding of processes, it sounds a bit like the scheduler wants to know better than the user what the user wants. It would be much less heavy-handed to just let the user know that a thread was not behaving nicely, and to let the user deal with it; they might have a good reason for running it, after all.
Just my thoughts, not to be given more weight than they deserve.
Posted Sep 10, 2009 19:20 UTC (Thu)
by mingo (guest, #31122)
I agree with your observations - these are the basic tradeoffs to consider.
Note that the reward for tasks is limited. (unlimited would open up a starvation hole)
But you are right to suggest that the scheduler should not be guessing about the purpose of tasks.
So this capability was always kept optional, and was turned on/off during the fair scheduler's evolution, mainly driven by user feedback and by benchmarks. We might turn it off again - there are indications that it's causing problems.
Posted Sep 10, 2009 21:11 UTC (Thu)
by anton (subscriber, #25547)
Posted Sep 11, 2009 5:18 UTC (Fri)
by mjthayer (guest, #39183)
Posted Sep 11, 2009 5:20 UTC (Fri)
by mjthayer (guest, #39183)
Posted Sep 10, 2009 13:10 UTC (Thu)
by busterb (subscriber, #560)
http://www.cyanogenmod.com/home/4-1-6-is-here-with-100-mo...
Posted Sep 10, 2009 14:58 UTC (Thu)
by kirkengaard (guest, #15022)
Posted Sep 11, 2009 18:03 UTC (Fri)
by iabervon (subscriber, #722)
BFS may give better interactivity by not giving X clients as low latency in their attempts to generate work for the server. Also, disabling the "new fair sleepers" feature helps some people, which also suggests that this is actually effectively a priority inversion problem between the task that seems to be slow and tasks that are doing work on behalf of that task and are actually slow.
Posted Sep 12, 2009 18:04 UTC (Sat)
by Thalience (subscriber, #4217)
I think the best response we can give to this assertion is the same one used in the audiophile community: Blind A/B tests. Have someone else switch between the two systems while doing the subjective evaluation. Don't tell the user which one is which. If they consistently prefer one over the other, perhaps there is a real effect. Otherwise....
Posted Sep 17, 2009 16:55 UTC (Thu)
by realnc (guest, #60393)
As much as we find it frustrating that there will always be some people who insist that they perceive a subjective improvement that is not measured by any benchmark you care to name, it is human nature. I think the best response we can give to this assertion is the same one used in the audiophile community: Blind A/B tests.
I don't need an ABX test to tell that sound stops with CFS if I start alt+tabbing, while the sound continues playing when doing the same with BFS. Unless you think that some psychosomatic effect exists within BFS that makes me hear stuff that isn't there. :)
benchmarks
Some notes from the BFS discussion - and Con Kolivas responded...
I did not notice that those latencies were _per-CPU_, and (wrongly) assumed they were _global_...; it makes a lot more sense now; thanks.
What does please me now, though, is that this message thread is finally
concentrating on what BFS was all about. The fact that it doesn't scale is no
mystery whatsoever. The fact that throughput and lack of scaling was
what was given attention was missing the point entirely. To point that out I
used the bluntest response possible, because I know that works on lkml (does
it not?). Unfortunately I was so blunt that I ended up writing it in another
language; Troll. So for that, I apologise.
It pleases me immensely to see that it has already spurred on a flood of
changes to the interactivity side of mainline development in its few days of
existence, including some ideas that BFS uses itself. That in itself, to me,
means it has already started to accomplish its goal.
Benchmarking the scheduler on a desktop machine
what benefit is there to trying to adjust a process's
scheduling based on how much time it got in the past?
If a process is I/O-bound (e.g., waiting for the user most of the time
rather than computing), it can be useful to prefer it, because it will
give a faster response to the user (or, for disk-bound processes, it
will come up faster with the next request for the disk, increasing disk
utilization and, hopefully, reducing total run time), whereas a
CPU-bound process usually does not benefit from getting its timeslice
now rather than later.
Does a process rendering animation, or mixing music which
is played back as it is rendered/mixed, fare well enough here?
Such processes normally won't use all of the CPU (unless the CPU is
too slow for them), because they are limited by the speed at which
they want to play back the content, so a scheduler preferring sleepers
over CPU hogs will prefer them over, say, oggenc. Of course, a
browser might get even more preferential treatment, which you may not
want; and clock scaling will tend to make anything that consumes a
significant, mostly-constant amount of CPU look almost CPU-bound if
it is alone on the CPU (but then it does not really matter).
It would be much less heavy-handed to just let the user
know that a thread was not behaving nicely, and to let the user deal
with it.
Traditionally, Unix had nice for that. I'm not sure
this still works properly with current Linux schedulers; the
last time I tried, it did not work well.
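The traditional mechanism is easy to demonstrate: with GNU coreutils, nice run with no command prints the current niceness, so a relative adjustment is directly visible (a minimal sketch; it assumes a default shell session, typically at niceness 0):

```shell
# Run `nice` (which, with no command, prints the current niceness)
# under a +10 adjustment. In a default session base is 0 and adjusted
# is 10; if the shell already runs at a nonzero niceness, the
# adjustment is relative to that (and clamps at 19).
base=$(nice)                 # current niceness, typically 0
adjusted=$(nice -n 10 nice)  # niceness seen by a child started with +10
echo "base=$base adjusted=$adjusted"
```

Whether the scheduler then actually gives the niced CPU hog proportionally less time under load is exactly the behavior the commenter is questioning.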
Reactions appear to be a noticeable increase in responsiveness and a decrease in stability.
