
Removing four bytes from the kernel ABI

By Jake Edge
May 23, 2012

Four bytes may not seem like a lot of space—typically it isn't—but when that space is wasted millions of times, it starts to add up. In addition, if the extra space has become part of the kernel ABI (intentionally or not), it will be difficult to remove it. That particular problem came up again in a recent linux-kernel discussion regarding the trace event header.

Just over a year ago, we looked at the unused lock_depth field in event headers. Frederic Weisbecker had added the field temporarily to assist in removal of the big kernel lock (BKL), and once the BKL was gone Steven Rostedt removed those, now useless, four bytes from the header. Unfortunately, in the interim, PowerTOP had started accessing events in the perf ring buffer, so removing lock_depth broke PowerTOP. That field wasn't actually used by PowerTOP, but the tool expected the header to have a particular size, which changed after Rostedt removed the wasted space.

That led to a reversion of the removal, which means that every event recorded by ftrace or perf has added overhead. The event format is fully self-describing, however, so there is no need for utilities like PowerTOP to grub around in the binary data making assumptions about what the format is. It was, however, easier to read the data directly rather than parse the format description, which is why PowerTOP did so. Rostedt has created a library to parse trace events using the format data that the kernel provides to avoid that situation in the future. That library was picked up by the recently released PowerTOP 2.0, so Rostedt posted an RFC asking when the lock_depth field—renamed to padding as part of the revert—could be removed.
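To illustrate the difference between assuming a layout and reading the self-description, here is a minimal Python sketch of the format-driven approach. It is not Rostedt's library: the function names, the `example_event` event, and its `value` field are made up for illustration, the sample text only imitates the layout of the per-event `format` files the kernel exports, and little-endian records are assumed.

```python
import re
import struct

# Text laid out like a per-event "format" file. The event name, ID, and the
# "value" field are hypothetical; the common_* header follows the article.
SAMPLE_FORMAT = (
    "name: example_event\n"
    "ID: 42\n"
    "format:\n"
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;\n"
    "\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\tfield:int common_padding;\toffset:8;\tsize:4;\tsigned:1;\n"
    "\n"
    "\tfield:int value;\toffset:12;\tsize:4;\tsigned:1;\n"
)

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>\d+);"
)


def parse_format(text):
    """Map each field name to (offset, size, signed) as the description states."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        # The declaration is C-like ("int common_pid", "char comm[16]");
        # the name is the last token, minus any array suffix.
        name = match.group("decl").split()[-1].split("[")[0]
        fields[name] = (
            int(match.group("offset")),
            int(match.group("size")),
            match.group("signed") == "1",
        )
    return fields


def read_field(record, fields, name):
    """Extract one integer field from a raw record (little-endian assumed)."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], "little", signed=signed)


fields = parse_format(SAMPLE_FORMAT)
# A fake 16-byte record matching the description above.
record = struct.pack("<HBBiii", 42, 0, 0, 1234, 0, 99)
print(read_field(record, fields, "common_pid"))  # 1234
print(read_field(record, fields, "value"))       # 99
```

A reader built this way never mentions the padding field, so it keeps working if the kernel drops those four bytes and publishes `value` at offset 8 instead of 12.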

Linus Torvalds was not particularly concerned about the wasted space, but did want to understand which distributions were picking up the new PowerTOP. It turns out that the version in Fedora 14 (which Torvalds said he still uses sometimes) is old enough that it doesn't use perf events at all, so it is unaffected. More recent Fedoras (16, 17) are using PowerTOP 1.98, which won't work with kernels built without the padding.

The assumption in the thread is that distributions will be picking up PowerTOP 2.0 for releases coming later in the year, but that still leaves users who build their own kernels on existing distributions in a bit of a bind if the padding is removed. Existing distributions also have various lifespans, and some will not be picking up the latest PowerTOP at all. Rostedt asked how long the kernel needed to support older distributions. PowerTOP, it seems, is in a different category from other applications because it is a developer-oriented tool. So Torvalds was willing to see the kernel change even if some distributions get left behind:

But breaking something like a F14-15 timeframe distro or something staid like a SLES (or "Debian Stale" or whatever they call that thing that only takes crazy-old binaries)? It's fine. We don't want to *rush* into it, but no, if those distros are basically not updating, we can't care about them forever for something like powertop.

Things that break *normal* applications are different. There the rule really must be "never".

Arjan van de Ven concurred, pointing to 3.6 as a potential time frame to remove the padding, noting that those who haven't updated their distribution to get the newer PowerTOP are unlikely to be updating their kernel either. Rostedt said he will queue the patch up for 3.6 or 3.7.

While the four bytes seem unimportant to both Torvalds and Ingo Molnar, Rostedt pointed out that it is a frequent problem for tracing users. Beyond that, though, he disagrees with Molnar's contention that the wasted space is merely a "cosmetic detail":

4 bytes is not cosmetic for a 32 byte event. That's 1/8th overhead. If we could get rid of 4 bytes from struct page, would we do that? It's only just 4 bytes for [every] 4096 bytes. Just a 1/1024 overhead. Of course perf events are much bigger than 32 bytes. It's one of the biggest complaints I hear about perf, the size of its events. We should be trying hard to fix that.

For memory-constrained situations, for example on embedded devices or for users trying to squeeze every process they can onto their systems, reducing the overhead of events can make a difference. By capturing more events in the same amount of memory, there is a better chance of finding the problem that tracing was enabled for. When the issue came up a year ago, David Sharp of Google noted that the size of events was a big problem for the search giant. Others undoubtedly face similar challenges.
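The arithmetic behind Rostedt's complaint is easy to check. The sketch below reproduces his two ratios and then estimates how many more 32-byte events would fit once the padding is gone. The 1 MiB buffer size is an assumption picked for illustration, and the calculation ignores any per-page or per-record bookkeeping a real ring buffer adds.

```python
from fractions import Fraction


def padding_overhead(padding_bytes, total_bytes):
    """Fraction of the total that is taken up by padding."""
    return Fraction(padding_bytes, total_bytes)


def events_that_fit(buffer_bytes, event_bytes):
    """How many whole events of a given size fit in the buffer."""
    return buffer_bytes // event_bytes


# The two ratios from the quote above.
print(padding_overhead(4, 32))    # 1/8
print(padding_overhead(4, 4096))  # 1/1024

# Hypothetical 1 MiB buffer holding nothing but 32-byte events.
BUFFER_BYTES = 1024 * 1024
with_padding = events_that_fit(BUFFER_BYTES, 32)
without_padding = events_that_fit(BUFFER_BYTES, 32 - 4)
print(with_padding, without_padding, without_padding - with_padding)
# 32768 37449 4681
```

Under those assumptions, dropping the padding makes room for roughly 14% more events in the same memory.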

While the format of the perf ring buffer data may soon be a solved problem—though it's possible, if unlikely, that other tools are manually pulling data from the ring buffer—tracepoints as a whole are still an unresolved ABI issue. Right now, much of the work is in adding new tracepoints, but some day one or more of those may need to come out or be modified. If tools are dependent on specific tracepoints providing the exact same information in just the right place in the code, changing those will be a real problem. And it will be one that is difficult for a library to paper over.



Removing four bytes from the kernel ABI

Posted May 24, 2012 8:43 UTC (Thu) by fhuberts (subscriber, #64683) [Link]

they could also do a patch on powertop 1.x ...

Removing four bytes from the kernel ABI

Posted May 24, 2012 10:47 UTC (Thu) by intgr (subscriber, #39733) [Link]

The real PowerTOP 1.x (up to 1.2 I think) doesn't use perf events at all, so it is unaffected.

The problem is PowerTOP 1.9x versions, which are actually prereleases (beta) of 2.0, but already shipped by some distros. In my opinion, they should just update to 2.0 since it's a bugfix update over the 1.9x branch.

FTA:
> It was, however, easier to read the data directly rather than parse the format description, which is why PowerTOP did so

Well, sounds to me like the kernel didn't actually break its promised ABI -- PowerTOP didn't respect the event description so it misused the ABI.

Removing four bytes from the kernel ABI

Posted May 24, 2012 15:57 UTC (Thu) by nevets (subscriber, #11875) [Link]

> Well, sounds to me like the kernel didn't actually break its promised ABI -- PowerTOP didn't respect the event description so it misused the ABI.

True but unfortunately that doesn't matter. As Linus pointed out:

And if binaries don't use the interface to parse the format (or just parse it wrongly - see the fairly recent example of adding uuid's to /proc/self/mountinfo), then it's a regression.

[...]

If you made an interface that can be used without parsing the interface description, then we're stuck with the interface. Theory simply doesn't matter.

Basically it came down to the fact that we didn't push the library that parses the data strongly enough. And we also made it too easy for apps to circumvent the library. Peter Zijlstra once asked me to make the field order random, to keep tools from doing this (before PowerTOP actually did), but to do so would have added a high overhead to tracing that I did not think was worth it at the time. Then when this happened, I realized that I was mistaken.

If the author of PowerTOP wasn't a kernel developer, I highly doubt we would have had this problem. But the author was and for him, it was much easier to look at what the kernel code was doing and access it directly than to create a parsing library. I do not blame him for this. It was our fault for letting this happen.

Removing four bytes from the kernel ABI

Posted May 25, 2012 0:20 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

Linus:
If you made an interface that can be used without parsing the interface description, then we're stuck with the interface.

Linus is always oversimplifying things. I know he doesn't really believe that the kernel is stuck with an interface just because kernel developers made it possible for someone to consider it to exist. It simply isn't technically possible to prevent someone from using an interface that wasn't intended.

Linus' real and more reasonable policy would probably be better exemplified by:

If an important user found a way to use your interface without parsing the interface description, then we're stuck with the interface.

Removing four bytes from the kernel ABI

Posted May 25, 2012 6:32 UTC (Fri) by drag (subscriber, #31333) [Link]

> they could also do a patch on powertop 1.x ...

How can they do that?

Last time I checked, Linux kernel developers don't have a back door that will allow them to update random affected binaries on my machines when I update the kernel. At least I hope not.

Removing four bytes from the kernel ABI

Posted May 25, 2012 19:34 UTC (Fri) by nevets (subscriber, #11875) [Link]

> Last time I checked, Linux kernel developers don't have a back door that will allow them to update random affected binaries on my machines when I update the kernel.

BruhahahaHAHAH! The kernel 0wns your box! Why do you think we became kernel developers?

WORLD DOMINATION!

Removing four bytes from the kernel ABI

Posted May 28, 2012 13:10 UTC (Mon) by nix (subscriber, #2304) [Link]

Quite. Why hack the binaries when you can have binfmt_elf.c detect affected binaries at runtime and slam in a binary patch? Plus, that works everywhere (FSVO 'everywhere' equal to 'on the machine it was tested on') and makes debugging when you don't know the feature is there so much more exciting!

(I wish I was joking, but Windows does exactly this routinely.)

Removing four bytes from the kernel ABI

Posted May 29, 2012 19:21 UTC (Tue) by BenHutchings (subscriber, #37955) [Link]

So far as I know Windows doesn't patch third-party code in memory, but it does enable compatibility quirks on a per-process basis based on recognition of certain programs. In some cases that approach may be superior to maintaining the old behaviour for all programs - the usual reason for wanting to change the implementation is to improve performance, and Windows can provide that improvement for most programs.

Linux does have per-process compatibility quirks (see setarch(8)) but no provision for enabling them automatically. I'm not sure why, though it may be that such recognition would be better implemented in userland.

Some tracepoints have already been removed

Posted May 24, 2012 16:53 UTC (Thu) by Anssi (subscriber, #52242) [Link]

> Right now, much of the work is in adding new tracepoints, but some day one or more of those may need to come out or be modified.

Actually some tracepoints have already been removed, at least in this i915 commit from Feb 2011:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-...

e.g. i915_gem_request_submit was removed (in favor of a more generic tracepoint), and it was used by PowerTOP to determine GPU ops/s. PowerTOP got patched for that in December.

Fuzzing

Posted May 24, 2012 17:29 UTC (Thu) by cesarb (subscriber, #6266) [Link]

They should add a CONFIG_BREAK_PERF option which randomly adds padding to perf events, and a CONFIG_BREAK_PERF_HARDER option which randomly removes fields from the events.

Utilities which can deal with event formats changing would keep working, with just a bit of performance loss (or information loss in the second case), and utilities which do not parse the format description would break. Even better, by being a config option it would now be part of the ABI: perf events can change randomly from one boot to another, so you better use the format description.

(Now you have to decide if I am joking or if this was a serious suggestion. Or both.)
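As a rough illustration of what such fuzzing would flush out, here is a toy user-space simulation in Python. Nothing in it is kernel code and no such config option exists: `emit_event`, the 8-byte header, and the single `value` field are all invented for the sketch, and little-endian records are assumed. The fake "kernel" inserts a variable number of padding words and publishes the resulting layout; one reader trusts the published layout, the other hardcodes the offset that happens to be right for exactly one padding word.

```python
import random
import struct


def emit_event(value, pad_words):
    """Build a fake record with pad_words 4-byte padding words before the payload.

    Returns (layout, record); layout stands in for the format description.
    """
    header = struct.pack("<HBBi", 1, 0, 0, 1234)  # 8-byte made-up header
    record = header + b"\xff" * (4 * pad_words) + struct.pack("<i", value)
    layout = {"value": (len(header) + 4 * pad_words, 4)}
    return layout, record


def read_with_layout(layout, record):
    """Reader that trusts the published layout."""
    offset, size = layout["value"]
    return int.from_bytes(record[offset:offset + size], "little", signed=True)


def read_hardcoded(record):
    """Reader that assumes the payload always sits at byte 12."""
    return int.from_bytes(record[12:16], "little", signed=True)


rng = random.Random()
for _ in range(5):
    pad_words = rng.randrange(4)  # 0 to 3 padding words, chosen at random
    layout, record = emit_event(99, pad_words)
    print(pad_words, read_with_layout(layout, record), read_hardcoded(record))
```

The layout-driven reader returns 99 on every run; the hardcoded one does so only when the random choice lands on one padding word.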

Modutils

Posted May 24, 2012 18:34 UTC (Thu) by ncm (subscriber, #165) [Link]

I got used to updating modutils and pcmcia-tools when I built a new kernel. The new modprobe still worked with the old kernel. Is this case really any different?

Modutils

Posted May 25, 2012 6:35 UTC (Fri) by drag (subscriber, #31333) [Link]

Maybe the kernel developers can stick some logic in the make files that goes and checks all the installed software on your system and will refuse to compile if any of them get broken by an ABI change.

Modutils

Posted May 25, 2012 6:37 UTC (Fri) by drag (subscriber, #31333) [Link]

Well that wouldn't work if you got a kernel built by somebody else, so maybe they need to add a runtime checker that will cause the kernel to refuse to boot if it breaks any of your software. That way you know when you need to update your userland. :P

Removing four bytes from the kernel ABI

Posted Jun 2, 2012 10:15 UTC (Sat) by Duncan (guest, #6647) [Link]

Why not handle it the same way they handled the stale udev API -- twice? Just create a kconfig option for legacy-powertop-compatible event headers. New distros using the new powertop can simply turn it off, as can users building their own kernel who either don't use powertop or use a new enough version, while the distros and users building new kernels for an old distro install can turn it on if they need to.

Then, after some time (preferably somewhat longer than the 3.6/3.7 timeframe mentioned in TFA, we're already in the 3.5 cycle, after all, and 3.7 could well be before year-end), that option could disappear. But meanwhile, only folks unwilling to upgrade what was after all a 2.0-pre-release powertop to the full 2.0+, would have to suffer the additional overhead.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds