
BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 11:49 UTC (Mon) by bvdm (guest, #42755)
In reply to: BFS vs. mainline scheduler benchmarks and measurements by ikm
Parent article: BFS vs. mainline scheduler benchmarks and measurements

ikm: I don't think you should expect to convince the LWN.net audience with arguments suggesting Ingo Molnar's technical incompetence. Really.

Everyone: can we raise the level of this debate a bit?



BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 12:09 UTC (Mon) by kragil (guest, #34373) [Link] (10 responses)

Ingo is probably one of the best hackers on this planet, but that does not mean he lives in the same world as everyone else.

When I read:
"So the testbox i picked fits into the upper portion of what i
consider a sane range of systems to tune for - and should still fit
into BFS's design bracket as well according to your description:
it's a dual quad core system with hyperthreading."

Tune the scheduler for a 16-core machine? Thank you very much. I know nobody with more than a quad-core, and those are spanking new.

And it is really, really unfair to test a scheduler that aims to enhance interactivity for pure performance on a system that is clearly at the upper limit of what the scheduler was designed for.

What I take from this discussion is that kernel devs live in a world where Intel's fastest chips in multi-socket systems are low end, and they will cater only to the enterprise bullcrap that pays their bills.

Despite what Linus says, Linux is not intended to be used on the desktop (at least not in the real world).

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 13:18 UTC (Mon) by aigarius (subscriber, #7329) [Link] (3 responses)

i7 has been around for what? A year already? 8 cores there. Benchmarking a couple of years ahead for kernel development is a reasonable assumption. Meanwhile, even people with quad-cores say that Ingo's tests are still showing the same results.

Con needs to show quantifiable tests so that the performance of different versions of schedulers can actually be compared. How can we know that a patch improves the code if there is no quantifiable number showing that conclusively?

Scientific approach, please. Insulting people does not win arguments in technical communities. Facts, tests and numbers do.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 16:13 UTC (Mon) by andreashappe (subscriber, #4810) [Link]

> i7 has been around for what? A year already? 8 cores there.

4 cores plus HT.

Still makes me smile when I see the htop output.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 7:48 UTC (Tue) by epa (subscriber, #39769) [Link]

It might help to see some numbers. Take Fedora's smolt data, which comes from people who clicked 'yes' when installing Fedora and reported what hardware they use.

This shows that more than half of Fedora systems are dual-processor, with another 38% having a single CPU. So based on hardware that's in use now, a one- or two-processor test would be more reasonable. Of course it's useful to test on 16-processor monsters as well, but that is not the typical desktop and won't be for some time. (And by the time it is, all sorts of other assumptions will have changed too.)

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 8:34 UTC (Tue) by branden (guest, #7029) [Link]

Aigarius,

How about we benchmark based on the profiles of the machines people bring to Debconf?

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 13:37 UTC (Mon) by mingo (subscriber, #31122) [Link] (3 responses)

What I take from this discussion is that kernel devs live in a world where Intel's fastest chips in multi-socket systems are low end, and they will cater only to the enterprise bullcrap that pays their bills.

I certainly don't live in such a world, and I use a bog-standard dual-core system as my main desktop. I also have an 833 MHz Pentium-3 laptop that I booted into a new kernel four times today alone:

  #0, d5f8b495, Mon_Sep__7_08_39_36_CEST_2009: 0 kernels/hour
  #1, b9e808ca, Mon_Sep__7_09_19_47_CEST_2009: 1 kernels/hour
  #2, b9e808ca, Mon_Sep__7_10_26_28_CEST_2009: 1 kernels/hour
  #3, b9e808ca, Mon_Sep__7_14_58_48_CEST_2009: 0 kernels/hour

  $ head /proc/cpuinfo 
  processor	: 0
  vendor_id	: GenuineIntel
  cpu family	: 6
  model		: 8
  model name	: Pentium III (Coppermine)
  stepping	: 10
  cpu MHz	: 846.242
  cache size	: 256 KB

  $ uname -a
  Linux m 2.6.31-rc9-tip-01360-gb9e808c-dirty #1178 SMP Mon Sep 7 22:38:18 CEST 2009 i686 i686 i386 GNU/Linux

And that test system does that every day - today isn't a special day. Look at the build count: #1178. This means that I have booted more than a thousand development kernels on this system already.

Now, to reply to your suggestion: for scheduler performance I picked the 8-core system because that's where I do scheduler tests: it allows me to characterise that system _and_ also to characterise lower-performance systems to a fair degree.

Check out the updated jpgs with quad-core results.

See how similar the single-socket quad results are to the 8-core results I posted initially? People who do scheduler development use this trick frequently: most of the "obvious" results can be downscaled as a ballpark figure.

(The reason for that is very fundamental: you don't see new scheduler limitations pop up as you go down in the number of cores. The larger system already includes all the limitations the scheduler has on 4, 2 or 1 cores, and reflects those properties already, so there are no surprises. Plus, testing is a lot faster. It took me 8 hours today to get all the results from the quad system - and this is right before the 2.6.32 merge window opens, when Linux maintainers like me are very busy.)

Certainly there are borderline graphs and trickier cases that cannot be downscaled like that, and in general 'interactivity' - i.e. all things latency-related - comes out on smaller systems in a more pronounced way.

But when it comes to scheduler design and merge decisions that will trickle down and affect users 1-2 years down the line (once it gets upstream, once distros use the new kernels, once users install the new distros, etc.), I have to "look ahead" quite a bit (1-2 years) in terms of the hardware spectrum.

Btw., that's why the Linux scheduler performs so well on quad-core systems today - the groundwork for that was laid two years ago, when scheduler developers were testing on quads. If we discovered fundamental problems on quads _today_, it would be way too late to help Linux users.

Hope this explains why kernel devs are sometimes seen to be ahead of the hardware curve. It's really essential, and it does not mean we are detached from reality.

In any case - if you see any interactivity problems, on any class of systems, please do report them to lkml and help us fix them.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 8:46 UTC (Tue) by kragil (guest, #34373) [Link] (2 responses)

Reading all your answers calmed me down a bit :) Thanks

I think our major disagreement here is the "look ahead".

I strongly believe that computers have reached the point where this relentless upgrade cycle should stop, and has. If you bought a P4 with HT and 1 GB in 2003, it is still perfectly capable of running the newest software 95% of desktop users need. Machines like that will soon turn 7 YEARS old. People will look for computers that use less energy and don't have moving parts that just break after a few years.
PCs will be like old TV sets and work for many, many years (10 to 15 years). The software has to adapt. That is the "look ahead" I see, but I can understand why Red Hat plans for something different.

I think faster ARM, MIPS and Atom CPUs are the architectures most desktop Linux kernels will run on, and the relative percentage of X-core x86 monsters will decline (maybe even rapidly).

And no, I don't think Fedora's smolt data is any good here. Fedora users are technical people and are unlikely to run really old hardware, like my sister's for example.

I also don't think Linux will ever have problems with the fastest computers; its dominance in the HPC area will make sure of that.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 9:30 UTC (Tue) by mingo (subscriber, #31122) [Link] (1 responses)

And no, I don't think Fedora's smolt data is any good here. Fedora users are technical people and are unlikely to run really old hardware, like my sister's for example.

That's all fine, and I have a Fedora Core 6 box too, on hardware which is very old.

I wouldn't upgrade the kernel on it though - and non-technical users are even less likely to do that. Software and hardware form a single unit, and for reasons similar to why it is hard to upgrade hardware, it is difficult to upgrade software as well. Yes, you pick up security fixes, etc. - but otherwise main components like the kernel tend to be cast in stone at install time. (And no, if you are reading this on LWN.net then your box probably does not qualify ;-)

Which means that most of the 4-year-old systems have a 4-year-old distribution on them, with a 4-year-old kernel. That kernel was developed 5 years ago, and any deep scheduler decisions were made 6 years ago or even earlier.

So yes, I agree that the upgrade treadmill has to stop eventually, but _I_ cannot make it stop - I just observe reality and adapt to it. I see what users do, I see what vendors do, and I try to develop the kernel in the best possible technical way, matching those externalities.

What I'm seeing right now as the scheduler maintainer and as the x86 co-maintainer is that the hardware side shows no signs of slowing down, and that users who are willing to install new kernels show eagerness to buy shiny new hardware. Quads yesterday, six-cores today, octo-cores in a year or two.

Most new kernel installs go to fresh new systems, so that's an important focus of the upstream kernel - and of any distribution maker. That is the space where we _can_ realistically do something, and if we did something else we'd be ignoring our users.

I could certainly be wrong about all that in some subtle (or not so subtle) way - but right now the fact is that most of the bug reports I get against development code are filed on relatively new hardware.

That is natural to a certain degree - new hardware triggers new, previously unknown limitations and bottlenecks, and new hardware has its own problems too, which get mixed into kernel problems, etc. Old hardware has also already settled into its workload, so there's little reason to upgrade an old, working box in general. There's also the built-in human excitement factor that shiny new hardware triggers on a genetic level ;-)

There's an easy way out though: please report bugs on old hardware and make old hardware count. The mainline kernel can only recognize and consider people who are willing to engage. The upstream kernel process is a fundamentally auto-tuning and auto-correcting mechanism, and it is mainly influenced by people willing to improve the code.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 9, 2009 11:41 UTC (Wed) by nix (subscriber, #2304) [Link]

Well, I'm a counterexample: I upgrade my hardware every decade, if that, but the kernels are normally as new as possible, because I'd like newish software, thanks, and that often likes new kernels. Further, everyone I know who isn't made of money and runs Linux does the same thing: they tend to run Fedora, recentish Ubuntu, or Debian testing, because non-enterprise users generally do not want to run enterprise distros because all the software on them is ancient, and non-enterprise distro kernels *do* get upgraded.

I suspect your argument is pretty much only true for corporate uses of Linux (i.e. 'just work with *this* set of software', as opposed to other uses which often involve installation of new stuff). But perhaps those are the only uses that matter to you...

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 16:19 UTC (Mon) by einstein (subscriber, #2052) [Link]

> Despite what Linus says Linux is not intended to be used on the desktop(at least not in the real world).

Speak for yourself. I've been using linux on the desktop in the real world for years, as have a number of other people I know, your snarky little jabs notwithstanding.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 19:38 UTC (Mon) by leoc (guest, #39773) [Link]

Despite what Linus says Linux is not intended to be used on the desktop(at least not in the real world).

For a system not intended to be used in the "real world", it is doing pretty well, considering it has around 1/4 the market share of OS X.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 12:22 UTC (Mon) by ikm (subscriber, #493) [Link] (11 responses)

> i don't think you should expect to convince the lwn.net audience with arguments suggesting Ingo Molnar's technical incompetence. Really.

I expect everyone can draw their own conclusions. I've made mine. Ingo's a nice guy, but I don't think he's measuring the right things here. But how are you going to measure things like:
  • mplayer using OpenGL renderer doesn't drop frames anymore when dragging and dropping the video window around in an OpenGL composited desktop
  • Composite desktop effects like zoom and fade out don't stall for sub-second periods of time while there's CPU load in the background
  • LMMS (a tool utilizing real-time sound synthesis) does not produce "pops", "crackles" and drops in the sound during real-time playback due to buffer under-runs
  • Games like Doom 3 and such don't "freeze" periodically for small amounts of time (again for sub-second amounts) when something in the background grabs CPU time
Those are things a person has reported as a follow-up on the thread in question. Do you think he was lying?
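Complaints like the four above can be turned into numbers rather than feelings. As a crude illustration (this is not interbench or any tool from the thread, just a sketch), a shell loop can measure how far 1 ms sleeps overshoot their deadline, a rough proxy for scheduler wakeup latency; compare the result with and without CPU load running in the background:

```shell
# Crude wakeup-latency probe: sleep 1 ms repeatedly and report the worst
# overshoot in microseconds. The fork/exec cost of `date` inflates every
# sample, so the numbers are only meaningful relative to each other.
worst=0
for i in $(seq 1 100); do
  t0=$(date +%s%N)                      # GNU date: nanosecond timestamp
  sleep 0.001
  t1=$(date +%s%N)
  over=$(( (t1 - t0) / 1000 - 1000 ))   # elapsed us minus the 1000 us requested
  if [ "$over" -gt "$worst" ]; then worst=$over; fi
done
echo "worst overshoot: ${worst} us"
```

Running the same loop while, say, a kernel build saturates every core turns "Doom 3 freezes for sub-second amounts" into a before/after figure one can actually argue about.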

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 12:41 UTC (Mon) by bvdm (guest, #42755) [Link] (1 responses)

Do you have a point other than that the current scheduler is not perfect? We all knew that. And Ingo invited Con to help improve it. So you don't really have a point at all, do you?

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 12:59 UTC (Mon) by ikm (subscriber, #493) [Link]

Go troll elsewhere. Thank you.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 13:11 UTC (Mon) by mingo (subscriber, #31122) [Link] (2 responses)

But how are you going to measure things like:

* mplayer using OpenGL renderer doesn't drop frames anymore when dragging and dropping the video window around in an OpenGL composited desktop

* Composite desktop effects like zoom and fade out don't stall for sub-second periods of time while there's CPU load in the background

* LMMS (a tool utilizing real-time sound synthesis) does not produce "pops", "crackles" and drops in the sound during real-time playback due to buffer under-runs

* Games like Doom 3 and such don't "freeze" periodically for small amounts of time (again for sub-second amounts) when something in the background grabs CPU time

This is a list of routine interactivity problems that we track down and address. In the past few years we've got extensive infrastructure built up in the mainline kernel that allows their measurement and allows us to eliminate them.

A good place to start would be to try the latency tracing suggestions from Frederic Weisbecker on lkml.

Such properties of the desktop are measured routinely (sometimes easily, sometimes with quite a bit of work) - so please report them and help out in tracking them down.
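For reference, the mainline tracing infrastructure referred to here is mostly reachable through debugfs. A minimal session with ftrace's wakeup-latency tracer might look like the following sketch (file names vary across kernel versions, CONFIG_SCHED_TRACER must be enabled, and root is required):

```shell
# Mount debugfs if it is not already there
mount -t debugfs nodev /sys/kernel/debug 2>/dev/null

cd /sys/kernel/debug/tracing

# Track the worst-case wakeup latency of the highest-priority task
echo wakeup > current_tracer
echo 0 > tracing_max_latency     # reset the recorded maximum
echo 1 > tracing_on              # 'tracing_enabled' on 2.6.31-era kernels

# ... reproduce the stall here (drag windows, play audio, etc.) ...

echo 0 > tracing_on
cat tracing_max_latency          # worst-case wakeup latency, in microseconds
head -40 trace                   # what was running when the latency happened
```

The point of the exercise is that "it stutters" becomes a concrete latency value plus the trace of the task that held the CPU.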

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 14:02 UTC (Mon) by ikm (subscriber, #493) [Link] (1 responses)

Yay, that's a start. I hope this can go somewhere eventually. Clearly it's the interactivity issues Con has always been after, not the bulk workloads. With a way to measure and quantify those issues and scenarios, something might get going somewhere.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 21:53 UTC (Mon) by mingo (subscriber, #31122) [Link]

You might want to try latencytop. We added the instrumentation for that after the CFS merge - to make it easier to prove/report scheduler (and other) latencies.
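latencytop itself needs CONFIG_LATENCYTOP in the kernel. Assuming such a kernel, a session could look like this sketch (interface details differ across versions, and root is required):

```shell
# Enable collection of per-task latency statistics
echo 1 > /proc/sys/kernel/latencytop

# The raw per-callchain numbers are exported here...
head /proc/latency_stats

# ...and the latencytop(8) curses front-end aggregates them per process
latencytop
```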

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 16:21 UTC (Mon) by lacostej (guest, #2760) [Link] (5 responses)

> But how are you going to measure things like:

Can't these tools detect when they hang/stall?

Can't we modify them to report the issues in a known format (or to a third-party daemon) and use those tools as tests?

I mean if I was Con, that's the first thing I would do: create a measurable suite of tests.

Instead of talking about feelings, we would talk about measurable things. It's not like we're talking about usability - and even usability can be tested to some degree.

So, can't we elevate the debate ?

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 18:35 UTC (Mon) by hppnq (guest, #14462) [Link] (4 responses)

I mean if I was Con, that's the first thing I would do: create a measurable suite of tests.

Actually, he did that: you may find interbench interesting. It was used to produce Con's performance statistics. Also, see this 2002 interview with Con, discussing his earlier effort ConTest and scheduler benchmarking in general.

The challenge, it seems, is to get scheduler developers to agree on what constitutes a normal workload on normal systems tuned in normal ways.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 7, 2009 21:45 UTC (Mon) by mingo (subscriber, #31122) [Link] (3 responses)

The challenge, it seems, is to get scheduler developers to agree on what constitutes a normal workload on normal systems tuned in normal ways.

There's not much disagreement really. Everyone agrees that interactivity problems need to be investigated and fixed - it's as simple as that. We have a lot of tools to do just that, and things that get reported to us we try to get fixed.

In practice, interactivity fixes rarely get in the way of server tunings - and when they do, the upstream kernel perspective has always been for desktop/latency tunings to take precedence over server/throughput tunings.

I'm aware that the opposite is being claimed, but that does not make it a fact.

Try a simple experiment: post a patch to lkml with Linus Cc:-ed that blatantly changes some tunable to be more server-friendly (double the default latency target or increase some IO batching default) at the expense of desktop latencies. My guess is that you'll see a very quick NAK.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 8:01 UTC (Tue) by hppnq (guest, #14462) [Link] (2 responses)

We have a lot of tools to do just that, and things that get reported to us we try to get fixed.

Ah, my point is that you claim to compare apples to apples while you use different tools than Con to compare the performance of the BFS and CFS schedulers. It is entirely possible that I missed the comparison of benchmarking tools, of course, and I'm not saying that you or Con should choose any particular tool: I am simply observing there is a difference.

But, looking at the interbench results, I cannot help but think that it would have been better if Con had used some other benchmarks as well: one could drive a truck through those standard deviations.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 8:48 UTC (Tue) by mingo (subscriber, #31122) [Link] (1 responses)

Ah, my point is that you claim to compare apples to apples while you use different tools than Con to compare the performance of the BFS and CFS schedulers. It is entirely possible that I missed the comparison of benchmarking tools, of course, and I'm not saying that you or Con should choose any particular tool: I am simply observing there is a difference.

Well, the reason I spent 8+ hours on each round of testing is because I threw a lot of reliable and relevant benchmarks/workloads at the schedulers. Most of those were used by Con too in the past for scheduler work he did, so it's not like he never runs them or disagrees with them on some fundamental basis - he just chose not to test them on BFS this time around. Sysbench comes from FreeBSD for example, hackbench was written many years ago to test chat server latencies/throughput, and kbuild, lat_tcp and lat_pipe are well-known as well.

Basically I applied a wide spectrum of tests that _I_ find useful in building a picture of how good a scheduler is, and posted the results. (I wanted to find the strong spot of BFS - which in turn would be a weak spot of the mainline scheduler.)

So I tested what I was curious about (basic latency in four tests, throughput and scalability in two other tests) - others can test what they are curious about. Testing these schedulers is not that hard; it's not like I have a monopoly on posting scheduler comparisons ;-)
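Several of the benchmarks named here are indeed easy to rerun. Assuming hackbench and the lmbench binaries are installed (neither ships with most distributions), a minimal round might look like:

```shell
# Scheduler stress in the style of a chat server: 50 groups of 40 tasks
# exchanging messages; prints the total runtime in seconds
hackbench 50

# lmbench microbenchmarks: round-trip latency through a pipe...
lat_pipe

# ...and through local TCP (start server, measure, shut server down)
lat_tcp -s
lat_tcp localhost
lat_tcp -S localhost

# "kbuild": time a parallel kernel build as a throughput test
time make -j8 bzImage
```

Repeating the same round under two kernels, one with each scheduler, gives numbers directly comparable to the ones posted in this thread.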

But, looking at the interbench results, I cannot help but think that it would have been better if Con had used some other benchmarks as well: one could drive a truck through those standard deviations.

The inherent noise in the interbench numbers does not look particularly good - and I found that too in the past. But it's still a useful test, so I'm not dissing it - it's just very noisy in general. I prefer low-noise tests, as I want to be able to stand behind them later on. When I post benchmarks they get a lot of scrutiny, for natural reasons, so I want sound results. You won't find many (any?) measurements from me in the lkml archives that were discredited later.

Also, on the theoretical angle, I don't think there's much to be won on the interactivity front either: the mainline scheduler has a fixed deadline (/proc/sys/kernel/sched_latency_ns), which you can tune down if you wish, and it works hard to meet that latency goal for every task. If it doesn't, then that's a bug we want to fix, not some fundamental design weakness.
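As a sketch of what tuning that knob down looks like (the sysctl's name and default value have changed across kernel versions, and on much newer kernels the knob moved out of /proc entirely, so treat the paths as illustrative):

```shell
# Current latency target in nanoseconds (20 ms is a typical default)
cat /proc/sys/kernel/sched_latency_ns

# Halve it to favour desktop latency over throughput (root required)
echo 10000000 > /proc/sys/kernel/sched_latency_ns

# Related knob, usually scaled alongside: the minimum timeslice granularity
cat /proc/sys/kernel/sched_min_granularity_ns
```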

But ... theory is one thing and practice is another, so it always makes sense to walk the walk and keep an open mind about all this.

So what we need now are bug reports and testers willing to help us. These kinds of heated discussions about the scheduler are always useful, as the attention on the scheduler increases and we are able to fix bugs that don't get reported otherwise - so I'm not complaining ;-)

For latency characterisation and debugging we use the latency tests I did post (pipe, messaging, etc.), plus, to measure a live desktop, we use latencytop, the latency tracer, the 'perf' tool, etc.

So there are plenty of good tools, plenty of well-known benchmarks, plenty of good and reliable data, and a decade-old kernel policy that desktop latencies take precedence over server throughput - and the scheduler developers are eager to fix all bugs that get reported.

Let me note here that despite these 100+ comment discussions here on LWN, and on Slashdot as well, we got only a single specific latency bug report against the upstream scheduler in the past 24 hours. So there's a lot of smoke, a lot of wild claims and complaints - but little actionable feedback from real Linux users right now.

So please, if you see some weirdness that is suspected to be caused by the scheduler, then please post it to lkml. (Please Cc: Peter Zijlstra and me as well on any email.) I'm sure the scheduler is not bug-free, and I'm sure there are interactivity bugs to fix as well, so don't hesitate to help out.

BFS vs. mainline scheduler benchmarks and measurements

Posted Sep 8, 2009 11:45 UTC (Tue) by hppnq (guest, #14462) [Link]

Thanks for clarifying! Not only do I appreciate all those hours of developing and testing wonderful software, I also like it a lot that you take the time to comment about it here at LWN. :-)


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds