
Some patches of interest

There are a few patches in circulation which merit a quick look.

What if you could improve kernel performance by 10% without writing any code? Arjan van de Ven has posted a patch which, he says, does just that - at least, for some specific benchmarks. The patch uses an obscure gcc option (-ffunction-sections) which causes the compiler to put every function into its own ELF section; the linker is then instructed to arrange those functions in a specific order in the final executable.
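
As a rough sketch of the mechanism (the file name, function names, and commands below are illustrative, not taken from Arjan's patch):

    /* layout.c - a toy translation unit */
    int fast_path(int x)
    {
            return x + 1;           /* called on every request */
    }

    int error_path(int x)
    {
            return -x;              /* almost never called */
    }

    /*
     * Built with "gcc -c -ffunction-sections layout.c", each function ends
     * up in its own ELF section (.text.fast_path, .text.error_path), as
     * "objdump -h layout.o" would show.  A linker script (or an ordering
     * file handed to the linker) can then place .text.fast_path at the
     * front of the image and push .text.error_path toward the back.
     */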

A typical current x86-64 kernel (the architecture Arjan has been working with) fills on the order of 4MB of memory. The kernel uses large pages (2MB each on x86-64) to hold its text, but a kernel of that size will still require at least two translation lookaside buffer (TLB) entries to cover its entire code body. Some kernel functions are used much more heavily than others, however; much of the code in the kernel - error handling, for example - never gets run at all if you are lucky. So, if all of the regularly-used functions are moved to the beginning of the kernel image - into the first 2MB - the kernel should be able to operate with a single TLB entry for its text, most of the time. TLB entries are important: if an address is found in the TLB, the processor can avoid looking it up in the page tables, speeding access significantly. They are also scarce. So allowing the kernel to operate within a single TLB entry makes a big difference.

There are some details to work out yet. Optimizing TLB use will require that the kernel be loaded at a large-page-aligned address, so that a single TLB entry can actually cover the start of the text; that is not currently done on many architectures. Another part of Arjan's patch uses a second gcc option to move blocks marked with unlikely() into a separate section. Since that option can expand the code, require long-distance jumps within functions, and make stack backtraces hard to read, it is not yet clear whether it makes sense. Then there is the issue of ordering the functions properly; that task will require looking at a lot of kernel profiles to be sure that some workloads are not optimized at the expense of others. But, once these issues are taken care of, a reorganized and faster kernel will likely result.
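
For reference, unlikely() is just a wrapper around gcc's branch-prediction hint, and the section trick pushes the annotated block out of the hot part of the function. A minimal illustration follows; the function and types are made up, and the splitting option is presumably something like -freorder-blocks-and-partition, though the patch itself is the authority:

    #include <errno.h>

    /* Essentially the kernel's definition: a hint for gcc's branch predictor. */
    #define unlikely(x)     __builtin_expect(!!(x), 0)

    struct request { int sector; };                 /* placeholder type */
    static int submit(struct request *req) { return req->sector; }

    int do_io(struct request *req)
    {
            if (unlikely(req == NULL))      /* cold: block splitting could  */
                    return -EINVAL;         /* move this branch into a      */
                                            /* separate text section        */
            return submit(req);             /* hot path stays where it is   */
    }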

On another front: it is generally easy to see, on a Linux system, what resources a given process is using. What's harder to find out is what the process is not using because the resources are not available. As a way of giving more visibility to that side of the equation, Shailabh Nagar has been working on a set of task delay accounting patches. This facility is intended for use with large-scale load management applications, but the information may be useful in other contexts as well.

This patch adds a new structure (struct task_delay_info) which is attached to the task structure. It contains a lock, a couple of timestamp variables, and sets of delay counters. Whenever a process goes into a delayed state (meaning, currently, waiting on a run queue, performing synchronous block I/O, or waiting for a page fault), the time is noted. At the end of the delay, when the process can run again, the system notes how much time has passed and updates a counter in the task_delay_info structure. Thus, over time, one can get a picture of how much time the process has spent waiting for things when it would rather have been executing.
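
A rough sketch of the shape of such a structure, written with kernel types (<linux/types.h>, <linux/spinlock.h>); the field names here are illustrative guesses, not copied from Shailabh Nagar's patch:

    /* Illustrative only - the real patch may name and group things
     * differently.  One start-timestamp / accumulated-delay / count
     * group per class of delay. */
    struct task_delay_info {
            spinlock_t lock;        /* protects everything below */

            u64 blkio_start;        /* when the current block-I/O wait began */
            u64 blkio_delay;        /* total ns spent in synchronous block I/O */
            u32 blkio_count;        /* how many such waits there were */

            u64 swapin_delay;       /* total ns spent waiting on page faults */
            u32 swapin_count;

            u64 runq_delay;         /* total ns spent runnable but not running */
            u32 runq_count;
    };

    /* The accounting pattern at each delay point: record the time when the
     * task starts waiting, then fold the elapsed time into the counters
     * once it becomes runnable again. */
    static void blkio_delay_end(struct task_delay_info *d, u64 now)
    {
            spin_lock(&d->lock);
            d->blkio_delay += now - d->blkio_start;
            d->blkio_count++;
            spin_unlock(&d->lock);
    }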

Perhaps the most complicated part of the patch set is the netlink interface used to report delay statistics back to user space. This interface has been carefully written to be as generic as possible on the theory that it may eventually be used for other sorts of process-related reporting as well. There has been a request that some of this information, at least, also be made available through /proc, so that it could be easily displayed by tools like top.

Finally, those who worked with kernel modules in 2.4 and prior kernels will remember the MODULE_PARM() macro, used to define load-time parameters. This macro has been deprecated since 2004, but there are still a few hundred uses of MODULE_PARM() spread across several dozen files in the 2.6.16-rc kernels. These old uses came to light recently when gcc started optimizing them out. Given the choice between making the old macro work with current gcc and simply getting rid of it, Rusty Russell chose to get rid of it. This patch has not yet been merged anywhere, but it seems uncontroversial. If there are any out-of-tree modules still using MODULE_PARM(), updating them soon might be a good idea.
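
The conversion itself is mechanical; for a module with a single integer parameter (the name below is invented), the before and after look roughly like:

    /* (kernel module context: <linux/module.h>, <linux/moduleparam.h>) */

    /* 2.4-style declaration, now on its way out: */
    static int debug_level;
    MODULE_PARM(debug_level, "i");          /* "i" meant "a single int" */

    /* Current form, available since 2004: */
    static int debug_level;
    module_param(debug_level, int, 0444);   /* type plus sysfs permissions */
    MODULE_PARM_DESC(debug_level, "Verbosity of the driver");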



whole-program optimisation the hard way

Posted Mar 2, 2006 6:12 UTC (Thu) by xoddam (subscriber, #2322) [Link]

> This patch uses an obscure gcc option which causes the compiler to put
> every function into its own ELF section. Then, the linker is instructed
> to arrange those functions into a specific order in the final
> executable.

This looks like doing whole-program optimisation the hard way (by hand).

The narrow interface (ELF only) between the compiler and the linker
prevents all sorts of potential clever tricks. Is anyone working on
them?

whole-program optimisation the hard way

Posted Mar 2, 2006 13:38 UTC (Thu) by nix (subscriber, #2304) [Link]

GCC 4.1 provides the -fwhole-program and --combine options to do this, but you have to pass all translation units you want optimized together to GCC at once. This is very unfriendly to existing makefiles and counter to existing practice.
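
Concretely, that means every .c file goes to a single gcc invocation instead of being compiled separately (file names invented):

    gcc -O2 --combine -fwhole-program a.c b.c main.c -o prog

With -fwhole-program, gcc can treat everything except main() as local to the program and optimize across file boundaries, which is exactly what an ordinary one-object-per-file makefile rule cannot express.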

There are several plans afoot in the GCC development community to fix this: possibilities include an intermediate representation (in the form of a bytecoded language for a nonexistent virtual machine) which GCC can save, load several of, and optimize. Politics is involved here, though, and whatever's done it'll be a lot of work.

whole-program optimisation the hard way

Posted Mar 10, 2006 7:29 UTC (Fri) by massimiliano (subscriber, #3048) [Link]

...possibilities include an intermediate representation (in the form of a bytecoded language for a nonexistent virtual machine) which GCC can save, load several of, and optimize. Politics is involved here, though, and whatever's done it'll be a lot of work.

Yes, but the approach would be the right one IMHO.
For instance, in Mono the JIT lays out the compiled methods sequentially in memory, and since methods are compiled on demand, this naturally creates a "cache friendly" memory layout for the machine code, where methods close in the call tree are close in memory.
We have an AOT compiler, but it misses this (and other) optimization opportunities, and we can see it.

And having a CPU-independent intermediate representation can solve a lot of other problems as well (and is the whole point of the existence of the ECMA standards implemented by Mono and MS .NET).

Now, of course this involves politics :-(

whole-program optimisation the hard way

Posted May 18, 2006 8:16 UTC (Thu) by job (guest, #670) [Link]

I think the politics nix refers to is that if gcc writes down intermediate representations to disk, you suddenly have a perfect opportunity to extend gcc with non-free software. One of the reasons gcc has been the most successful free compiler is that new architectures and optimizers have to go in via free code; if you had the ability to swap out the back end, that would not have happened.

whole-program optimisation the hard way

Posted Jul 16, 2006 7:55 UTC (Sun) by jzbiciak (✭ supporter ✭, #5246) [Link]

One easy way to address that is to generate the specific encoding when you build the compiler. That is, if the output of GCC really is the binary representation of GCC's internal representation, it will be *very* dependent on the specific version of GCC you're using.

Thus, it won't be a stable interface. Not even close.

I think it's perfectly acceptable to require a complete rebuild if the compiler version changes on you and you're trying to do whole-program optimization. Sure, it makes the feature harder to use, but it doesn't constrain it unnecessarily.

The fact of the matter is that unless the IR is merely the GIMPLE output of the front end, different versions of the compiler are going to have different things to say about the code within the IR they output. And if the IR is merely the GIMPLE output from the front end, well, you've only saved parsing time across all your source files; every -fwhole-program build still runs everything else on the entire program. You only start saving build time noticeably if you output stuff from later stages of analysis.

Some patches of interest

Posted Mar 2, 2006 13:29 UTC (Thu) by balbir_singh (subscriber, #34142) [Link]

The genetlink interface is quite new, so the code might seem complex. genetlink allows the code to be flexible (its controller module is very nicely designed). Its usage and design would make a good article for LWN readers.

Netlink

Posted Mar 2, 2006 13:33 UTC (Thu) by corbet (editor, #1) [Link]

The (relatively) new netlink stuff is definitely on my list...I just have to go in there and figure out how it really works...

Netlink

Posted Mar 2, 2006 16:04 UTC (Thu) by balbir_singh (subscriber, #34142) [Link]

I suspect that genetlink will become a very popular method to exchange data between kernel and user space.

I'll look forward to reading your article about it on LWN.

Some patches of interest

Posted Mar 2, 2006 13:33 UTC (Thu) by nix (subscriber, #2304) [Link]

The problem with -ffunction-sections is that it requires long-distance jumps *between* static functions in the same translation unit. (The only calls which are immune from that are self-recursive ones.)
