
Tbench troubles II

By Jonathan Corbet
November 19, 2008
LWN has previously covered concerns over slowly deteriorating performance by current Linux systems on the network- and scheduler-heavy tbench benchmark. Tbench runs have been getting worse since roughly 2.6.22. At the end of the last episode, attention had been directed toward the CFS scheduler as the presumptive culprit. That article concluded with the suggestion that, now that attention had been focused on the scheduler's role in the tbench performance regression, fixes would be relatively quick in coming. One month later, it would appear that those fixes have indeed come, and that developers looking for better tbench results will need to cast their gaze beyond the scheduler.

The discussion resumed after a routine weekly posting of the post-2.6.26 regression list; one entry in that list is the tbench performance issue. Ingo Molnar responded to that posting with a pointer to an extensive set of benchmark runs done by Mike Galbraith. The conclusion Ingo draws from all those runs is that the CFS scheduler is now faster than the old O(1) scheduler, and that "all scheduler components of this regression have been eliminated." Beyond that:

In fact his numbers show that scheduler speedups since 2.6.22 have offset and hidden most other sources of tbench regression. (i.e. the scheduler portion got 5% faster, hence it was able to offset a slowdown of 5% in other areas of the kernel that tbench triggers)

This improvement is not something that just happened; it is the result of a focused effort on the part of the scheduler developers. Quite a few changes have been merged; they all seem like small tweaks, but, together, they add up to substantial improvements in scheduler performance. One change fixes a spot where the scheduler code disabled interrupts needlessly. Some others (here and here) adjust the scheduler's "wakeup buddy" mechanism, a feature which ties processes together in the scheduler's view. As an example, consider a process which wakes up a second process, then runs out of its allocated time on the CPU. The wakeup buddy system will cause the scheduler to bias its selection mechanism to favor the just-woken process, on the theory that said process will be consuming cache-warm data created by the waking process. By allowing cooperating processes like this to run slightly ahead of what a strictly fair scheduling algorithm would provide, the scheduler gets better performance out of the system as a whole.

The recent changes add a "backward buddy" concept. If there is no recently-woken process to switch to, the scheduler will, instead, bias the selection toward the process which was preempted to enable the outgoing process to run. Chances are relatively good that the preempted process might (1) be cooperating with the outgoing process or (2) have some data still in cache - or both. So running that process next is likely to yield better performance overall.
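Conceptually, both buddies amount to a small bias layered on top of the strictly fair pick. The toy program below is a hedged sketch of that idea only: it borrows the vruntime and next/last buddy vocabulary from CFS, but the fixed threshold, the structures, and the helper names are invented for illustration and are not the kernel's actual implementation.

/*
 * Toy model of the "buddy" bias described above -- not kernel code.
 * The names (vruntime, next/last buddy) mirror CFS terminology; the
 * selection logic and threshold are simplified for illustration.
 */
#include <stdio.h>

struct task {
	const char *name;
	long long vruntime;	/* lower vruntime = more entitled to run */
};

/* The leftmost task is what a strictly fair scheduler would pick next. */
static struct task *leftmost;
/* Set when a task wakes another ("next") or is preempted by it ("last"). */
static struct task *next_buddy;
static struct task *last_buddy;

/* Honor a buddy only if running it would not be grossly unfair. */
static int buddy_is_fair_enough(struct task *buddy, struct task *fairest)
{
	const long long granularity = 2000000;	/* ~2ms, illustrative value */
	return buddy && buddy->vruntime - fairest->vruntime < granularity;
}

static struct task *pick_next_task(void)
{
	/* Prefer the just-woken task (likely to use cache-warm data)... */
	if (buddy_is_fair_enough(next_buddy, leftmost))
		return next_buddy;
	/* ...then the task we preempted (the "backward buddy")... */
	if (buddy_is_fair_enough(last_buddy, leftmost))
		return last_buddy;
	/* ...otherwise fall back to strict fairness. */
	return leftmost;
}

int main(void)
{
	struct task waker = { "waker", 1000000 };
	struct task wakee = { "wakee", 1500000 };
	struct task other = { "other",  900000 };

	leftmost = &other;	/* the strictly fair choice */
	next_buddy = &wakee;	/* but the waker just woke this one up */
	last_buddy = &waker;

	printf("next to run: %s\n", pick_next_task()->name);
	return 0;
}

In this example the buddy bias wins: "wakee" runs next even though "other" has the smaller vruntime. The real scheduler guards the buddy choice with a wakeup-preemption fairness test rather than the fixed threshold used here, but the shape of the decision is the same.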

A number of other small changes have been merged, to the point that the scheduler developers think that the tbench regressions are no longer their problem. Networking maintainer David Miller has disagreed with this assessment, though, claiming that performance problems still exist in the scheduler. Ingo responded in a couple of ways, starting with the posting of some profiling results which show very little scheduler overhead. Interestingly, it turns out that the networking developers get different results from their profiling runs than the scheduler developers do. And that, in turn, is a result of the different hardware that they are using for their work. Ingo has a bleeding-edge Intel processor to play with; the networking folks have processors which are not quite so new. David Miller tends to run on SPARC processors, which may be adding unique problems of their own.

The other thing Ingo did was, for all practical purposes, to profile the entire kernel code path involved in a tbench run, then to disassemble the executable and examine the profile results on a per-instruction basis. The postings that resulted (example) point out a number of potential problem spots, most of which are in the networking code. Some of those have already been fixed, while others are being disputed. It is, in the end, a large amount of raw data which is likely to inspire discussion for a while.

To an outsider, this whole affair can have the look of an ongoing finger-pointing exercise. And, perhaps, that's what it is. But it's highly technical finger-pointing which has increased the understanding of how the kernel responds to a specific type of stress while also demonstrating the limits of some of our measurement tools and the performance differences exhibited by various types of hardware. The end result will be a faster, more tightly-tuned kernel - and better tbench numbers too.



Tbench troubles II

Posted Nov 20, 2008 3:49 UTC (Thu) by abatters (✭ supporter ✭, #6932)

Will any of these improvements be merged in 2.6.27 -stable, or are performance regressions not considered important enough for -stable?

No way!

Posted Nov 20, 2008 7:45 UTC (Thu) by khim (subscriber, #9252)

First: performance regressions are not security-related at all (unless it's a slowdown of 100 times or more).
Second: the changes involved tend to be small but potentially dangerous (oh, we don't need this lock, so we can drop it - but what if we are wrong?).
Not -stable material at all.

No way!

Posted Nov 20, 2008 7:53 UTC (Thu) by nix (subscriber, #2304)

IIRC the scheduler's hrtick feature has been turned off in -stable already :)

Tbench troubles II

Posted Nov 20, 2008 8:20 UTC (Thu) by tajyrink (subscriber, #2750)

I'd hope someone would put a similar amount of effort into tracking disk usage performance.

Tbench troubles II

Posted Nov 20, 2008 13:03 UTC (Thu) by nix (subscriber, #2304)

David Miller tends to run on SPARC processors
May I be the first to say wow! advanced! and no wonder he cares about SPARC performance. ;}

Tbench troubles II

Posted Nov 20, 2008 15:26 UTC (Thu) by sayler (guest, #3164)

Unfortunately, he's stuck on a first-generation Niagara. He can work on a lot of patches at the same time, but the latency is horrible...

Tbench troubles II

Posted Nov 21, 2008 4:45 UTC (Fri) by pflugstad (subscriber, #224)

+5 Funny!!!

Caching

Posted Nov 20, 2008 20:44 UTC (Thu) by ncm (subscriber, #165)

We are reminded, once again, that a cache is a deal with the Devil.

Caching

Posted Nov 21, 2008 17:05 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)

A deal made long, long ago, when we constructed the first memory hierarchies of rotating drums, mercury delay lines, cathode ray storage tubes and fast, fast core memory.

A great quote appears in my copy of Hennessy and Patterson:

Ideally one would desire an indefinitely large memory capacity such that any particular . . . word would be immediately available . . . . We are . . . forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.
— A. W. Burks, H. H. Goldstine and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946

Nineteen Hundred Forty Six. Over 60 years ago! Before the invention of the first compiler!

So if caches are a deal with the devil, would this be computing's original sin?

Caching

Posted Nov 21, 2008 18:41 UTC (Fri) by giraffedata (subscriber, #1954)

A deal made long, long ago, when we constructed the first memory hierarchies of rotating drums, mercury delay lines, cathode ray storage tubes and fast, fast core memory.

Don't forget cards. The drum was mostly a cache for data principally stored on cards. Since it would take a computer many minutes to access data on cards (it involved instructing a human to load them), the decks used most frequently, such as compilers, typically stayed on the drum to optimize throughput. The system also did readahead and writebehind of cards via the drum, i.e. spooling.

And at the other end: registers. Computers always tried to keep some data in registers (made out of electrical feedback circuits) to avoid the slowness of core memory, and the same scheduling complexities we're talking about now applied to optimizing use of that cache.

N/W subsystem might be tuned too much for the O(1)

Posted Nov 24, 2008 3:49 UTC (Mon) by nikanth (guest, #50093)

The problem could be that, thanks to these various improvements, the scheduler now picks the best process to run, while other subsystems may have been tuned to the O(1) [old/non-optimal ;)] way of picking the next process.

IOW, the networking stack might perform best only with O(1).
