Another attempt to address the tracepoint ABI problem

By Jonathan Corbet
October 27, 2017

2017 Kernel Summit

Tracepoints provide a great deal of visibility into the inner workings of the kernel, which is both a blessing and a curse. The advantages of knowing what the kernel is doing are obvious; the disadvantage is that tracepoints risk becoming a part of the kernel's ABI if applications start to depend on them. The need to maintain tracepoints could impede the ongoing development of the kernel. Ways of avoiding this problem have been discussed for years; at the 2017 Kernel Summit, Steve Rostedt talked about yet another scheme.

The risk of creating a new ABI has made some maintainers reluctant to add instrumentation to their parts of the kernel, he said. They might be willing to add new interfaces to provide access to specific information but, in the absence of tools that use this information, it is hard to figure out which information is needed or what a proper interface would be. The solution might be to adopt an approach that is similar to the staging tree, where not-ready-for-prime-time drivers can go until they are brought up to the necessary level of quality.

People talk about "tracepoints", but there are actually two mechanisms in the kernel. Internally, a tracepoint is a simple marker in the code, a hook to which a kernel function can be attached. What user space sees as a tracepoint is actually a "trace event", which is a specific interface that is implemented using the internal tracepoints. Without trace events, there is no interface visible to user space.
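The split between the two mechanisms can be illustrated with a toy user-space model. This is only an analogy, not kernel code: the `Tracepoint` and `TraceEvent` classes and the `sched_wakeup_demo` name are invented for this sketch.

```python
from typing import Callable, Dict, List


class Tracepoint:
    """Toy stand-in for an internal tracepoint: a named hook in the code."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.probes: List[Callable[..., None]] = []

    def register(self, probe: Callable[..., None]) -> None:
        # Attach a function to the hook.
        self.probes.append(probe)

    def fire(self, *args: int) -> None:
        # With no probes attached, hitting the hook has no visible effect.
        for probe in self.probes:
            probe(*args)


class TraceEvent:
    """Toy stand-in for a trace event: a probe that exports named fields."""

    def __init__(self, tracepoint: Tracepoint, fields: List[str]) -> None:
        self.fields = fields
        self.buffer: List[Dict[str, int]] = []
        tracepoint.register(self._record)

    def _record(self, *args: int) -> None:
        # Turn raw arguments into a record that an outside reader can consume.
        self.buffer.append(dict(zip(self.fields, args)))


wakeup = Tracepoint("sched_wakeup_demo")    # hypothetical name
wakeup.fire(1234, 3)                        # no trace event yet: nothing is recorded
event = TraceEvent(wakeup, ["pid", "cpu"])  # now there is an outward-facing interface
wakeup.fire(1234, 3)
print(event.buffer)
```

In this model, only the field names chosen by `TraceEvent` are visible to an outside reader; the bare hook exposes nothing on its own.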

The proposed solution to the ABI problem is to place a tracepoint at locations of interest, but not bother with the trace event. Making the tracepoint available to user space would then require loading a kernel module; this module would be kept out of the mainline tree. It would be, he said, a development space to try out interfaces for the more sensitive tracepoints. Since it is not a part of the mainline kernel, it could not become part of the kernel ABI. But distributors could ship this module, making the tracepoints available to user-space developers.

Ben Hutchings, a Debian kernel maintainer, said that this approach would not work in a number of cases. There are many situations where it's not possible to just load a random module into the kernel. Many customers are using module signing, for example, to prevent exactly that from happening. Even if distributions ship this module, users of different distributions would have different modules and the tracepoints would be incompatible; that would make it harder to write tools to use them.

Another member of the audience expressed skepticism, saying that if every distributor ships this module, it will become an ABI that has to be maintained anyway. Ben Herrenschmidt agreed and suggested that the right solution was to make the tracepoints be self-describing. As Rostedt pointed out, though, they are already self-describing; changing the availability of information will still break things. Tools may depend on specific information that is no longer available, or they may simply ignore the format information for the tracepoint. That makes it hard to remove obsolete tracepoints which, since they each occupy about 5KB of memory, is unfortunate.

Matthew Wilcox asked whether the proposed scheme would have solved the problem with powertop, which broke some years ago when a variable was removed from a tracepoint. Rostedt said that it would have; Ted Ts'o noted that the powertop problem shows that self-describing formats do not work as a solution to this problem.
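A minimal sketch of why a self-describing format only goes so far follows. The two event descriptions imitate the style of a tracefs `format` file, but the event, its fields, and the helper names (`parse_format`, `read_field`) are made up for illustration.

```python
import re
import struct

# Made-up event descriptions; FORMAT_NEW drops "state" and moves "cpu" forward.
FORMAT_OLD = (
    "name: demo_event\n"
    "format:\n"
    "\tfield:int pid;\toffset:0;\tsize:4;\tsigned:1;\n"
    "\tfield:int state;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\tfield:int cpu;\toffset:8;\tsize:4;\tsigned:1;\n"
)
FORMAT_NEW = (
    "name: demo_event\n"
    "format:\n"
    "\tfield:int pid;\toffset:0;\tsize:4;\tsigned:1;\n"
    "\tfield:int cpu;\toffset:4;\tsize:4;\tsigned:1;\n"
)

FIELD_RE = re.compile(
    r"field:(?P<type>.+?)\s+(?P<name>\w+);\s*"
    r"offset:(?P<offset>\d+);\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format(text):
    """Map each field name to its (offset, size, signed) description."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        fields[m["name"]] = (int(m["offset"]), int(m["size"]), m["signed"] == "1")
    return fields


def read_field(record, fields, name):
    """Decode one field by name; raises KeyError if the field no longer exists."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], "little", signed=signed)


new_fields = parse_format(FORMAT_NEW)
new_record = struct.pack("<ii", 1234, 3)  # pid=1234, cpu=3 in the new layout

# A tool that consults the format still finds "cpu" after it moved.
cpu = read_field(new_record, new_fields, "cpu")

# A tool that ignores the format and reads "state" at its old offset
# silently gets the cpu number instead.
stale_state = int.from_bytes(new_record[4:8], "little", signed=True)

# No amount of self-description brings back a field that was removed.
state_available = "state" in new_fields
```

The format-aware reader survives the layout change, but a tool that needs `state` breaks either way, which is the situation Rostedt and Ts'o described.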

Much of the current work is being pushed by developers within Facebook, who use a vast library of tracepoints to diagnose performance problems. They are willing to deal with their tools breaking when the kernel changes. That led Andrew Morton to ask whether Linus Torvalds made the right call by including tracepoints in the kernel ABI. Rostedt said he disagrees with that decision, but it doesn't matter, since Torvalds has the final say. David Woodhouse complained that the group was talking about "arbitrary technical nonsense"; perhaps the loaded module should just set a flag to make the tracepoints available. Morton agreed that the module idea "sounds like bullshit" and suggested that perhaps it was time to get the rule changed. But Rostedt has tried that before, he said, and he still bears the scars that resulted.

Chris Mason said that, while Facebook can handle tracepoint changes that break its tools, there is a need to know when such a change has happened. Just moving the ABI to a loadable module will not solve that problem; it just pushes the problem onto the distributors instead.

Ts'o then launched into a discussion of the growing set of tools that work by attaching BPF scripts to tracepoints. These tools are becoming popular and soon will be as popular as powertop; that will result in the same kinds of problems when they break. The problem is here now and needs to be addressed.

Doing so will be hard, he said. The topic had been suggested for the invitation-only Maintainers Summit, since it is "fundamentally a Linus problem", but Torvalds had vetoed it. Torvalds wants to make a guarantee to user-space tools that works in 99% of the cases, but it is hard to live up to for tools that are closely tied to the kernel. So the powertop problem will come again, only worse; BPF will "turn it into a trainwreck". Rostedt added that Linux started off as "a desktop toy", but it is no longer a simple system. Nobody knows the whole thing, so they are relying more on tooling to know what is going on.

The conversation came to an end about here, but the topic did return at the Maintainers Summit later that week, after Torvalds and Rostedt had discussed it. The solution that was arrived at for now, as related by Torvalds, is to hold off on adding explicit tracepoints to the kernel. Instead, support will be added to make it easy for an application to attach a BPF script to any function in the kernel, with access to that function's arguments. That should give tools access to the information they need, and may make it possible to (eventually) remove many of the existing explicit tracepoints.

Arnd Bergmann asked what would happen if a popular script breaks as the result of the removal of a function; Torvalds replied that he would not see it as a regression that needs to be fixed. But, he said, if that happens it should be seen as a sign that the kernel should be providing that information in a more straightforward manner. A tracepoint or other interface could be added at that time.

Whether this solution provides what the tools need will take time to determine. But if it does, it may just be possible that a multi-year debate has finally come to some sort of conclusion that all of the parties involved can live with.

[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].

Posted Oct 27, 2017 15:14 UTC (Fri) by josh (subscriber, #17465)

> Instead, support will be added to make it easy for an application to attach a BPF script to any function in the kernel, with access to that function's arguments. That should give tools access to the information they need, and may make it possible to (eventually) remove many of the existing explicit tracepoints.

> Arnd Bergmann asked what would happen if a popular script breaks as the result of the removal of a function; Torvalds replied that he would not see it as a regression that needs to be fixed.

Sounds like the "attach a BPF script to any function" mechanism would have an ABI like that of init_module or finit_module: the thing it loads lives inside the kernel, making it fall under the kernel's internal ABI (lack of) guarantees rather than the userspace ABI guarantees.

Posted Oct 27, 2017 17:30 UTC (Fri) by davecb (subscriber, #1574)

We had a variant of that in Solaris, specifically in the libraries, which had to be both stable and changeable (;-))

David J. Brown attached version numbers to function entry points, like SUNW_1.1 (for something public: SUNWprivate for everything else). If the entry point changed, even if the function signature was the same, the number got bumped to SUNW_1.2.

If a tracepoint points to a function to attach a BPF script to, the script for that tracepoint can check if the function version has changed, and if it has, the person using it can go and do a "git blame" hunt to see what they need to do.
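A minimal sketch of that version check, under loud assumptions: the kernel exports no such per-function version table today, so `KERNEL_FUNCTION_VERSIONS`, the `DEMO_1.x` labels, and `check_probe_target` are all hypothetical names invented here.

```python
# Hypothetical table of per-function interface versions, in the spirit of
# the SUNW_1.1 / SUNW_1.2 labels described above. Nothing like it exists
# in the kernel; it is assumed purely for illustration.
KERNEL_FUNCTION_VERSIONS = {
    "demo_wake_up": "DEMO_1.2",
}


class VersionMismatch(Exception):
    """The probed function changed since the script was written."""


def check_probe_target(function, expected, versions):
    """Refuse to attach if the target function is gone or has been re-versioned."""
    actual = versions.get(function)
    if actual is None:
        raise VersionMismatch(f"{function} no longer exists")
    if actual != expected:
        raise VersionMismatch(
            f"{function} is now {actual}; script was written for {expected}"
        )


# A script written against DEMO_1.1 finds out before it attaches anything.
try:
    check_probe_target("demo_wake_up", "DEMO_1.1", KERNEL_FUNCTION_VERSIONS)
except VersionMismatch as err:
    print(f"refusing to attach: {err}")
```

The check does not keep the script working; it only turns a silent breakage into an explicit one that tells the user where to start looking.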

Posted Oct 27, 2017 17:43 UTC (Fri) by jhoblitt (subscriber, #77733)

It sounds like there are conflicting use cases for tracepoints: kernel debugging and exporting data to userspace. The powertop example begs the question of why a tracing framework is needed to collect process statistics (looks like wakeups). Couldn't that be accomplished by enhanced process statistics that could be exposed via sysfs when enabled?

Posted Oct 28, 2017 9:15 UTC (Sat) by SelaLWN (guest, #118519)

There is already support for running a BPF program when an arbitrary kernel function is called, with access to that function's arguments -- this can be done using kprobes. Indeed, in BCC (https://github.com/iovisor/bcc) we have a large collection of BPF-based tools that rely on tracepoints when available, and kprobes when not. Over the last 1.5 years, we had multiple tools break because of intentional and unintentional changes to kernel functions when using kprobes; not so much when using tracepoints.

Posted Oct 31, 2017 15:39 UTC (Tue) by jikos (subscriber, #43140)

BTW how does that tool access the function parameters? Does it assume that x86_64 ABI is always being followed, or do you require DWARF2 debuginfo data for those?

Posted Nov 1, 2017 14:33 UTC (Wed) by SelaLWN (guest, #118519)

It's ABI-sensitive: if a language or runtime decides to pass arguments in a heap location or something that's not the standard ABI, it won't work (i.e. debuginfo isn't used).
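As a rough sketch of what "ABI-sensitive" means here: under the System V AMD64 calling convention, the first six integer or pointer arguments arrive in fixed registers, so a probe can pick argument N out of a register snapshot without debuginfo. The `probe_arg` helper and the snapshot values below are invented for illustration and are not BCC's actual code.

```python
# Integer/pointer argument registers, in order, under the System V AMD64
# calling convention.
SYSV_AMD64_INT_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def probe_arg(regs, index):
    """Return argument number `index` (0-based) from a register snapshot.

    This is only correct if the probed function really follows the
    standard convention; nothing here can detect that it does not.
    """
    if not 0 <= index < len(SYSV_AMD64_INT_ARG_REGS):
        raise IndexError("only the first six integer arguments are in registers")
    return regs[SYSV_AMD64_INT_ARG_REGS[index]]


# Made-up register snapshot taken at function entry.
snapshot = {"rdi": 1234, "rsi": 3, "rdx": 0, "rcx": 0, "r8": 0, "r9": 0}
pid = probe_arg(snapshot, 0)
cpu = probe_arg(snapshot, 1)
```

If the compiler passes an argument somewhere else, the lookup still returns a value; it is just the wrong one.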

Posted Nov 1, 2017 16:25 UTC (Wed) by jikos (subscriber, #43140)

That makes it super-fragile though. GCC does a lot of optimizations on static functions that break x86 ABI (IPA-RA for example).