Tracepoints with BPF

By Jonathan Corbet
April 13, 2016

One of the attractive features of tracing tools like SystemTap or DTrace is the ability to load code into the kernel to perform first-level analysis on the trace data stream. Tracing can produce vast amounts of data, but that data can often be reduced considerably by some simple processing — incrementing histogram buckets, for example. Current kernels have a wealth of tracepoints, but they lack the ability to perform arbitrary processing of trace events in kernel space before exporting the result. It would appear, though, that this situation will change as the result of a set of patches targeted for the 4.7 release.

It should come as no surprise to regular LWN readers at this point that the technology being used for the loading of code into the kernel is the BPF virtual machine. BPF allows code to be executed in kernel space under tight constraints; among other things, it can only access data that is explicitly provided to it and it cannot contain loops; thus, it is guaranteed to run within a bounded time. BPF code can also be translated to native code with the in-kernel just-in-time compiler, making it fast to run. This combination of attributes has helped BPF to move beyond the networking stack and make inroads into a number of kernel subsystems.

Every BPF program loaded into the kernel has a specific type assigned to it; that type restricts the places where the program may be run. The patch set from Alexei Starovoitov creates a new type (BPF_PROG_TYPE_TRACEPOINT) for programs intended to be attached to tracepoints. Those programs can then be loaded into the kernel with the bpf() system call. Actually attaching a program to a tracepoint is done by reading the tracepoint's ID from its id file (in debugfs or tracefs), opening a perf event for that ID with perf_event_open(), then using the PERF_EVENT_IOC_SET_BPF ioctl() command on the resulting file descriptor. That command exists in current kernels to allow BPF programs to be attached to kprobes; the patch set extends it to do the right thing depending on the type of BPF program passed to it.
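The attachment sequence can be sketched in user-space C roughly as follows. This is a minimal sketch rather than the code from the patch set: the function names are hypothetical, the events directory is assumed to live under /sys/kernel/debug/tracing, and prog_fd is assumed to be a descriptor already returned by bpf() for a loaded BPF_PROG_TYPE_TRACEPOINT program. Opening the perf event normally requires elevated privileges.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed mount point; some systems expose tracefs elsewhere. */
#define TRACEFS_EVENTS "/sys/kernel/debug/tracing/events"

/* Build the path of a tracepoint's "id" file. Returns 0, or -1 if buf is too small. */
int tracepoint_id_path(const char *category, const char *name,
                       char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s/%s/%s/id", TRACEFS_EVENTS, category, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the numeric tracepoint ID from its id file. Returns -1 on failure. */
long read_tracepoint_id(const char *path)
{
    char text[32];
    char *end;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    char *ok = fgets(text, sizeof(text), f);
    fclose(f);
    if (!ok)
        return -1;
    long id = strtol(text, &end, 10);
    return (end == text || id < 0) ? -1 : id;
}

/* Open a perf event for the tracepoint and attach the BPF program to it.
 * Returns the perf event descriptor, or -1 on failure. */
int attach_to_tracepoint(long id, int prog_fd)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.size = sizeof(attr);
    attr.config = (unsigned long long)id;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.sample_period = 1;
    attr.wakeup_events = 1;

    /* pid = -1, cpu = 0: all processes on CPU 0; no group, no flags. */
    int efd = (int)syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
    if (efd < 0)
        return -1;
    if (ioctl(efd, PERF_EVENT_IOC_ENABLE, 0) < 0 ||
        ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
        close(efd);
        return -1;
    }
    return efd;
}
```

A caller would build the path for ("sched", "sched_switch"), pass the result to read_tracepoint_id(), and hand the ID to attach_to_tracepoint(). Closing the returned descriptor detaches the program. Covering every CPU would presumably mean repeating the perf_event_open() step once per CPU.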

When a tracepoint with a BPF program attached to it fires, that program will be run. The "context" area passed to the program is simply the tracepoint data as it would be passed to user space, except that the "common" fields are not accessible. As an example, the patch set includes a sample that attaches to the sched/sched_switch tracepoint, which fires when the scheduler switches execution from one process to another. The format file for that tracepoint (found in the tracepoint directory in debugfs or tracefs) provides the following data:

    field:unsigned short common_type;	offset:0;	size:2;	signed:0;
    field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
    field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
    field:int common_pid;	offset:4;	size:4;	signed:1;

    field:char prev_comm[16];	offset:8;	size:16;	signed:1;
    field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
    field:int prev_prio;	offset:28;	size:4;	signed:1;
    field:long prev_state;	offset:32;	size:8;	signed:1;
    field:char next_comm[16];	offset:40;	size:16;	signed:1;
    field:pid_t next_pid;	offset:56;	size:4;	signed:1;
    field:int next_prio;	offset:60;	size:4;	signed:1;
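Lines in this format are regular enough that a user-space consumer can parse them with very little code. The following is a minimal sketch of such a parser; the structure and function names are hypothetical and are not taken from any existing library.

```c
#include <stdio.h>
#include <string.h>

/* One parsed field description from a tracepoint format file. */
struct tp_field {
    char decl[64];   /* C declaration, e.g. "pid_t prev_pid" */
    int offset;      /* byte offset within the event record */
    int size;        /* size in bytes */
    int is_signed;   /* 1 if the field is signed */
};

/* Parse one "field:...; offset:N; size:N; signed:N;" line.
 * Returns 0 on success, -1 if the line is not a field description. */
int parse_format_line(const char *line, struct tp_field *out)
{
    const char *p = strstr(line, "field:");

    if (!p)
        return -1;
    /* A space in the scanf format matches any run of blanks or tabs,
     * including none at all. */
    int n = sscanf(p, "field:%63[^;]; offset:%d; size:%d; signed:%d;",
                   out->decl, &out->offset, &out->size, &out->is_signed);
    return n == 4 ? 0 : -1;
}
```

Calling parse_format_line() on each line returned by fgets() from the format file, and skipping the lines for which it returns -1, would yield the offset and size of every field.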

Any program that accesses tracepoint data is expected to read this file to figure out which data is available and where it is to be found; failure to do so risks trouble in the future should the data associated with this tracepoint change. An in-kernel BPF program cannot read this file, so another solution must be found. That solution is for the developer to read the format file and turn it into a C structure; there is a tool (called tplist) that will do this job. The patch set contains the following structure, which was generated with tplist:

    /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
    struct sched_switch_args {
        unsigned long long pad;
        char prev_comm[16];
        int prev_pid;
        int prev_prio;
        long long prev_state;
        char next_comm[16];
        int next_pid;
        int next_prio;
    };

The pad field exists because the first four fields (those common to all tracepoints) are not accessible to BPF programs. The rest, however, can be accessed by name in a C program (which will be compiled to BPF and loaded into the kernel). This program will likely extract the data of interest from this structure, process it in its own special way, and store the result in a BPF map; user space can then access the result.
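As an illustration, the layout of that structure can be checked against the offsets in the format file at compile time, and the kind of processing a handler might do can be modeled in ordinary C. The sketch below is a user-space model, not a BPF program: the handler name and the per-priority counter array are hypothetical, the array stands in for a BPF map, and the limit of 140 priority levels is an assumption about the scheduler's priority range.

```c
#include <stddef.h>   /* offsetof */

struct sched_switch_args {
    unsigned long long pad;     /* covers the 8 bytes of common_* fields */
    char prev_comm[16];
    int prev_pid;
    int prev_prio;
    long long prev_state;
    char next_comm[16];
    int next_pid;
    int next_prio;
};

/* Offsets copied from the sched_switch format file. */
_Static_assert(offsetof(struct sched_switch_args, prev_comm) == 8, "prev_comm");
_Static_assert(offsetof(struct sched_switch_args, prev_pid) == 24, "prev_pid");
_Static_assert(offsetof(struct sched_switch_args, prev_prio) == 28, "prev_prio");
_Static_assert(offsetof(struct sched_switch_args, prev_state) == 32, "prev_state");
_Static_assert(offsetof(struct sched_switch_args, next_comm) == 40, "next_comm");
_Static_assert(offsetof(struct sched_switch_args, next_pid) == 56, "next_pid");
_Static_assert(offsetof(struct sched_switch_args, next_prio) == 60, "next_prio");

#define NR_PRIO 140   /* assumption: priorities run from 0 to 139 */

/* Stand-in for a BPF array map indexed by priority. */
static unsigned long switches_by_prio[NR_PRIO];

/* Model of a handler: count context switches by the priority of the
 * task being switched in. Out-of-range values are ignored. */
int on_sched_switch(const struct sched_switch_args *ctx)
{
    if (ctx->next_prio < 0 || ctx->next_prio >= NR_PRIO)
        return 0;
    switches_by_prio[ctx->next_prio]++;
    return 0;
}
```

If the tracepoint's layout ever changed, regenerating the structure and updating the assertions together would turn a silent mismatch into a build failure. A real BPF program would replace the global array with a map and update it through helper calls such as bpf_map_lookup_elem().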

As with the other BPF program types, the helper code supplied with the kernel uses section names as a directive for what should be done with a specific program. A program meant to be attached to a tracepoint should be explicitly placed in a section called "tracepoint/name", where "name" is the name of the tracepoint of interest, including its subsystem. For the sample program, the section name is "tracepoint/sched/sched_switch".
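A loader that follows this convention has to recognize the "tracepoint/" prefix and recover the tracepoint's subsystem and event name from the rest of the section name. The sketch below shows one way that might be done; the function name and its interface are hypothetical and do not reproduce the kernel's helper code.

```c
#include <stddef.h>
#include <string.h>

#define TP_PREFIX "tracepoint/"

/* Split "tracepoint/<category>/<event>" into its two components.
 * Returns 0 on success, -1 if the section name does not have that
 * shape or a component does not fit in the caller's buffer. */
int split_tracepoint_section(const char *section,
                             char *category, size_t cat_len,
                             char *event, size_t ev_len)
{
    size_t plen = strlen(TP_PREFIX);

    if (strncmp(section, TP_PREFIX, plen) != 0)
        return -1;

    const char *rest = section + plen;
    const char *slash = strchr(rest, '/');

    /* Require a non-empty category and a non-empty event name. */
    if (!slash || slash == rest || slash[1] == '\0')
        return -1;

    size_t clen = (size_t)(slash - rest);
    size_t elen = strlen(slash + 1);

    if (clen >= cat_len || elen >= ev_len)
        return -1;

    memcpy(category, rest, clen);
    category[clen] = '\0';
    memcpy(event, slash + 1, elen + 1);   /* copies the terminating NUL */
    return 0;
}
```

For the sample program's section name this yields "sched" and "sched_switch", which are the two path components needed to find the tracepoint's directory under debugfs or tracefs. Sections with other prefixes are rejected, so a loader could try this function alongside its handling of the other program types.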

The mechanism works, and, importantly, a tracepoint-attached BPF program is quite a bit more efficient than placing a kprobe and attaching a program there. There are already tools in development (argdist, for example) that will create BPF programs for specific tasks; argdist will create a program to make a histogram of the values of a given tracepoint field. All told, it looks like a useful advance in the kernel's instrumentation.

There is a potential catch, though: the old issue of tracepoints and ABI stability. Tracepoints expose the inner workings of the kernel, which suggests that they must change if the kernel does. Changing tracepoints can, however, break applications that use them; this is an issue that has come up many times in the past. It is also why certain subsystems (the virtual filesystem layer, for example) do not allow tracepoints at all: the maintainers are worried that they might be unable to make important changes because they may break applications dependent on those tracepoints.

For user-space programs, the issue has been mitigated somewhat by providing library code to access tracepoint data. An application that uses these utilities should be portable across multiple kernel versions. BPF programs, though, do not have access to such libraries and will break, perhaps silently, if the tracepoints they are using are changed. ABI concerns have stalled the merging of this capability in the past, but there was little discussion of ABI worries this time around. Alexei maintains that the interface available to BPF programs is the same as that seen by user-space programs, so there should be no new ABI worries. Whether the BPF interface truly brings no new ABI issues is something that will have to be seen over the coming years.

And it does appear that we will have the chance to see how that plays out; David Miller has applied the patches to the net-next tree, meaning that they should reach the mainline in the 4.7 merge window. Users wanting more visibility into what's happening inside the kernel will likely be happy to have it.

Posted Apr 14, 2016 0:41 UTC (Thu) by bgregg (guest, #46639)

Getting tracepoint access will be huge. We're currently making do with kprobes, but I've been holding back on tool creation until tracepoints are available, and new tools can be much more stable. Coincidentally, this morning I saw a "remove rw argument from submit_bio" patch on lkml, and I bet that breaks some kprobe-based scripts. Common tracing like block device I/O should be tracepoints, leaving kprobes for internals.

There aren't many major features left to add to the kernel for tracing support. We're almost there. And we've been getting there quickly in the last 3 months -- there have been so many BPF enhancements added that it's hard to keep up (thanks for doing so with this article, and thanks to Alexei and others for great progress). This has been a really exciting episode of "As the tracing world turns". :-) The next episode might be the finale.

Posted Apr 14, 2016 2:03 UTC (Thu) by viro (subscriber, #7872)

For the "tracepoints are harmless" theory? Hopefully it will be, at that; the question is how steep a price will we pay for that lesson...

Posted Apr 14, 2016 9:22 UTC (Thu) by mjthayer (guest, #39183)

I wonder whether there is a sweet spot between this and systemtap - using loadable modules to add tracepoints, to put the API-to-support issue to rest, and doing the rest with BPF. Of course, I am playing the armchair expert here without knowing if this is even feasible.

Posted Apr 14, 2016 1:50 UTC (Thu) by dw (guest, #12017)

Reading about the proliferation of methods to introspect the kernel's structure leaves me wondering if Linux couldn't borrow something like FreeBSD CTF for uniformly describing kernel types.

CTF is just a compact version of DWARF that minimally describes functions and structures. It was introduced for DTrace, but recently there have been attempts to extend it to more areas where userland needs the same information, like ps and netstat (e.g. https://wiki.freebsd.org/SummerOfCode2014Ideas#Use_CTF_da... )

Posted Apr 15, 2016 17:51 UTC (Fri) by nix (subscriber, #2304)

There is already code available to do precisely that, meant for DTrace for Linux but GPLv2-licensed: <https://oss.oracle.com/git/?p=linux-uek.git;a=tree;f=scri...>. This code depends on a port/extension of the old Solaris libctf library, which has been dual-licensed under the GPLv2 for the purpose: <https://oss.oracle.com/git/?p=libdtrace-ctf.git;a=summary> (despite its name this is not DTrace-specific: I simply wanted to use a different name from the old Solaris library to indicate that it isn't precisely compatible at the data-structure level and you can't expect to use Solaris-generated CTF with it. It has a few new features, too, such as the ability to encode the types of global variables.)

The types are deduplicated, gzip-compressed, and (in conjunction with the modified Makefile.modpost in the parent directory in that git tree) written to new ELF sections in the kernel modules. Types shared between more than one module, or between modules and the core kernel, are written into a new, empty-but-for-types 'ctf.ko' module, whose contained CTF is meant to be set as the parent of the other CTFs at runtime, so that types depending on or pointing to those shared types will be correctly represented. The core UEK4 kernel and shared types use about a megabyte: all 3000+ modules' types come to about 3.5MiB all told.

(Caveat: this code has at least one known bug in that it generates wrong type graphs in some rare situations. I have a significant rewrite of the deduplicator planned that will fix this, but it's a rare enough problem that I haven't done it yet... and to be honest there are plenty of things that CTF just can't represent which force us to leave the occasional thing out of the type graph. That is still well under .001% of types, but there are limits to how accurate we can get. See the blacklists in that directory and the commit logs associated with them.)

I would have no objection whatsoever to people using this code, fixing it, enhancing it, contributing improvements back, etc. (DWARF4 support is one big missing piece, and I've only really tested it with GCC 4.4, 4.8 and 4.9, so other versions might well generate debuginfo that it needs adjustments to comprehend. That's the nature of these sorts of things, really.)