
Leading items

Welcome to the LWN.net Weekly Edition for April 15, 2021

This edition contains the following feature content:

  • Enabling debuginfod for Fedora by default: on-demand debugging data for Fedora users, and the security and privacy questions it raises.
  • Seccomp user-space notification and signals: a proposed fix for seccomp()-mediated system calls interrupted by Go's preemption signals.
  • NUMA-aware qspinlocks: squeezing more performance out of the kernel's spinlock implementation on NUMA systems.
  • Debian votes on a statement — and a leader: the project votes on Richard Stallman's return to the FSF board, and on its next leader.
  • Comparing SystemTap and bpftrace: two tools for instrumenting the kernel and user-space programs.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Enabling debuginfod for Fedora by default

By Jake Edge
April 14, 2021

In early April, Fedora program manager Ben Cotton posted a proposal to use the distribution's debuginfod servers by default in Fedora 35. This feature would help developers who are trying to debug or trace their programs using various tools, but who are lacking the source code and debugging symbols needed. The servers can provide that data directly to the tools as needed, but there are some security and privacy concerns to work through before turning the feature on by default.

The required source code and debugging information is available for Fedora already, of course, but it lives in debuginfo and src RPMs that must be installed to be used by the tools. Those RPM files are quite large and generally cover much more than the symbols and source for a single file that a user might want to look at in a tracing or debugging session. In addition, installing them via DNF requires root privileges, which may not be available to the user. Grabbing just the pieces needed, at the right time and without extra privileges, is a highly useful service that the debuginfod feature can provide.

An October 2019 Red Hat blog post describes debuginfod and notes that it is a new feature coming in the elfutils tools. The idea is that the Build.ID hash that gets stored in object files by GCC and LLVM can be used to identify which version of the symbols and source code corresponds to the object. Build.ID support was added for Fedora 8 in 2007. The Build.ID directly identifies the debugging symbols for the object file; the source code path is also stored in the object file, which can be used to identify (and thus serve) the right source file package as well.
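
The Build.ID is stored as an ELF note in the binary itself, so it can be inspected with ordinary tools; for example, something like the following (the hash shown in the output is what a debuginfod client sends to the server to request the matching debugging data):

    $ readelf -n /usr/bin/ls | grep 'Build ID'
        Build ID: <hexadecimal hash identifying this particular build>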

In October 2020, Frank Ch. Eigler posted a note to the fedora-devel mailing list about the debuginfod feature and a test server that had been set up. He wanted to gauge if there was interest in having Fedora set up its own:

The problem is that Fedora itself doesn't run a server, and our test server can afford to carry only a subset of debuginfo/debugsource rpms & architectures. So, fedora developers / users cannot get at all the info, or from an official source. I wonder if it's time to get one set up. If there is interest, I'd be happy to start discussing logistics with fedora infrastructure folks.

The overall reaction was positive, which led to the proposal to set the DEBUGINFOD_URLS environment variable in the Fedora 35 release so that the new Fedora debuginfod servers would be used when debugging information is needed. As Eigler, who is the owner of the feature proposal, pointed out, there is a staging server available at debuginfod.stg.fedoraproject.org, so Fedora users can already try things out:

[...] it already is there in most tools, try it.
% export DEBUGINFOD_URLS=https://debuginfod.stg.fedoraproject.org/
and shortly all F32+ packages/versions/architectures will be debuggable that way.

Once the staging version of the server has been populated and is working, the data will be moved to the production server at debuginfod.fedoraproject.org, which is where DEBUGINFOD_URLS would be pointed for Fedora 35. Overall, using the staging server has been working well, though Michael Catanzaro was concerned about its performance:

I started testing the staging server yesterday. It seems a little slow -- I worry for anyone debugging anything that links to WebKit, which on the desktop is a lot -- but otherwise it works *very* well. I'm impressed. Very nice.

Owen Taylor noted that the feature works for Flatpak applications as well. Eigler explained that some of the performance problems Catanzaro saw could have been caused by the processing that the staging server had been doing to pull in the RPM files from the Fedora Koji build system. But there are some inherent delays in the process, though caching may help:

In the case where a new file is sought from inside some large rpm, yeah there will be some inherent latency in decompressing it (CPU bound), saving the selected file to tmp disk (RAM/storage bound) and then sending you the file (network bound).

At the server, there are some prefetching/caching options available to take advantage of the spacious servers kindly provided by fedora-infra.

There is also aggressive caching on the client side, with an adjustable week-long retention for files.

David Malcolm wondered about Python programs that are currently part of the debuginfo packages for Python. Back in 2010, for the Fedora 13 release, he had added these programs as part of a feature for easier Python debugging with GDB. But he was worried about having them automatically installed:

I'm much more nervous about arbitrary python scripts being supplied over this service, as the barrier to entry for bad guys to do Bad Things would be so much lower as compared to malformed DWARF, so perhaps if people want the .py files, they have to install the debuginfo package in the traditional way? (It's still .py files from rpm payloads, but having them autodownloaded with little user involvement seems troublesome as compared to manually installing debuginfo rpms).

Eigler said that debuginfod has no way to serve standalone .py programs, but that a way could be added if it was deemed important. He noted that the provenance of the data is meant to provide the security:

[...] I'm not sure bad DWARF is inherently safer than bad Python. Nevertheless, all this is based on the proposition that the files are coming from a generally trusted build system, serving trusted artifacts maintained by trusted individuals. If malevolent enough files get in there, the service can be degraded and/or clients can be affected.

In a similar vein, Björn Persson pointed out that the proposal was lacking answers for a variety of security and privacy questions. The proposal does mention some privacy concerns raised by Debian users after a debuginfod service for Debian was announced in February, but also noted that openSUSE Tumbleweed enabled the feature by default (in February 2020) "and we have heard of no controversy". The Build.ID and source file name being debugged are leaked to the server if the local system lacks the proper files; Debian added an install-time query to decide whether to enable the feature.

Beyond the privacy issues, Persson also wanted the proposal to address the security implications of the server: what kinds of attacks might be possible, how are the files received from the server verified, and what kind of signing and authentication is done for the files. Eigler thanked him for the questions and added a "Security FAQ" section to the online version of the proposal with initial answers to Persson's questions.

As might be guessed, the scope of attacks depends in large part on the tool that is consuming the data, its privileges and robustness, as well as the privileges of the program being debugged. There is no special verification being added for the information provided from the server; as Eigler's response about Python noted, it all comes down to trusting the distribution—and trusting HTTPS. As the FAQ puts it:

Debuginfod servers provide the verbatim contents of the verbatim distro archives, and transmit them securely across HTTPS. There is no per-file signing infrastructure in Fedora, and debuginfod doesn't add one. Thus there is no mechanism to manually verify these files, beyond downloading a corresponding signed archive out-of-band and comparing. The client side code will be taking some rudimentary measures with file permissions to reduce risk of accidental change. In principle, if the received files were tampered with, then the same tamperers could mess with the user's consumer tools and/or take over the account.

There are multiple benefits to the feature, including making it easier for users who may not be able to install the debuginfo packages using DNF and reducing the amount of data that needs to be sent and stored in comparison to the, generally rather large, debuginfo packages. The debuginfod server only sends the pieces needed for the user's immediate debugging task, rather than all of the associated source code and debugging information for the entire package.

While there may be a need to add an explicit option for administrators to disable (or opt-out of) consulting the debuginfod servers, it seems highly likely that Fedora 35 will default to using them. In the meantime, users can already start consulting the server by setting their DEBUGINFOD_URLS environment variable appropriately.
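
For readers who want to try it before Fedora 35, a minimal session might look something like the following sketch; GDB 10.1 and later will consult DEBUGINFOD_URLS and fetch missing debugging data on demand, and the elfutils debuginfod-find client can be used to download (and cache) it explicitly:

    $ export DEBUGINFOD_URLS=https://debuginfod.stg.fedoraproject.org/
    $ gdb /usr/bin/ls                          # symbols and sources fetched as needed
    $ debuginfod-find debuginfo /usr/bin/ls    # or fetch the debug data directly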

Comments (4 posted)

Seccomp user-space notification and signals

By Jonathan Corbet
April 9, 2021
The seccomp() mechanism allows the imposition of a filter program (expressed in "classic" BPF) that makes policy decisions on whether to allow each system call invoked by the target process. The user-space notification feature further allows those decisions to be deferred to another process. As this recent patch set from Sargun Dhillon shows, though, user-space notification still has some rough edges, especially when it comes to signals. The patch set makes a simple change to try to address a rather complex problem brought to the fore by changes in the Go language's preemption model.

Normally, seccomp() is used to implement a simple sort of attack-surface reduction, making much of the system-call space off limits for the affected process. User-space notification can be used to that end, but the objective there is often different: it allows a supervisor process to emulate system calls for the target process. An example might be a container manager that wishes to make mount() available inside a container, but with some strict limits on what can actually be mounted. User-space notification allows the (privileged) supervisor to actually perform the mount operations it approves of and return the results to the target process.

While the supervisor is handling an intercepted system call, the target process will be blocked in the kernel, waiting for a response to come back. Should that process receive a signal, though, it will stop waiting and respond immediately to the signal; if the signal itself is not fatal, the result may well be the system call returning an EINTR error to the target process. The supervisor, instead, will not know about the signal until it tries to give the kernel its answer to the original notification; at that point, it will get an ENOENT error indicating that the notification is no longer alive.
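
As a rough sketch of the flow described above (not code from Dhillon's patch set), a supervisor that has already obtained a notification file descriptor, for example by installing its filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, handles a single notification along these lines; the ENOENT check at the end is where the signal race shows up:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/seccomp.h>

    /* Handle one notification arriving on notify_fd (a sketch). */
    static int handle_one(int notify_fd)
    {
        struct seccomp_notif req;
        struct seccomp_notif_resp resp;

        memset(&req, 0, sizeof(req));
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
            return -1;

        /* ... emulate the system call described by req.data here ... */

        memset(&resp, 0, sizeof(resp));
        resp.id = req.id;       /* must match the request */
        resp.error = 0;         /* or a negative errno to fail the call */
        resp.val = 0;           /* "return value" seen by the target */

        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0 &&
            errno == ENOENT) {
            /* The target was interrupted by a signal while we worked;
               the notification is gone and our work may be wasted. */
            fprintf(stderr, "notification %llu vanished\n",
                    (unsigned long long)req.id);
        }
        return 0;
    }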

This sort of interruption can be inconvenient, especially if the supervisor has carried out some sort of long task on the target's behalf. If the signal does not kill the target process, it is likely that the same operation will be retried shortly, leading to extra work being done. Most of the time, though, non-fatal signals of this type are likely to be rare in programs running under seccomp() monitoring.

Go signal

More accurately, that was once true, but the developers of the Go language had a problem of their own to solve. That language's "goroutine" lightweight thread model requires that the Go runtime handle scheduling, switching between goroutines as needed so that they all get a chance to run. Beyond that, there is a need for occasional "stop the world" events where all goroutines are paused so that the garbage collector can do its job. This has been handled by having the compiler put preemption checks at the beginning of each function.

What happens, though, if a goroutine runs for a long time without calling any functions? This can happen if the routine is running inside of some sort of tight loop; in the worst case, that loop could be spinning on a lock and preventing the lock holder from running to release it, a situation that tends to increase the overall level of user disgruntlement. Another way to delay preemption is to make a long-running system call.

The Go developers have tried a few ways of solving this problem. One of them involved inserting preemption checks at backward jumps in the code (thus at the end of a loop, for example). Even when that check was reduced to a single instruction, the resulting performance penalty was deemed to be too high; this approach also doesn't help in the long-running system-call case. So the Go community decided to address this problem with a non-cooperative preemption mechanism instead. In simple terms, any goroutine that runs for 10ms without yielding will receive a SIGURG signal from the runtime, which will then reschedule the thread, initiate garbage collection, or do whatever else needs to be done at that time.

System calls that end up being referred to another process via seccomp() tend to run longer than usual, and the sorts of tasks that a supervisor process might carry out — mounting a filesystem, for example — can take longer yet. This has evidently led to a lot of interrupted, seccomp()-mediated system calls in Go programs and an associated desire to find a way to stop those interruptions.

Masking non-fatal signals

To address this problem, Dhillon's patch set adds a new flag (called SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) to the SECCOMP_IOCTL_NOTIF_RECV ioctl() command that is used by the supervisor process to receive notifications. If that flag is set when a notification is given to the supervisor, the target process will be put into a "killable" wait, meaning that fatal signals will still be delivered, but any others will be masked until after the supervisor has responded to the notification. Non-fatal signals will thus no longer interrupt system calls while the supervisor process is working on them.

Note that if a non-fatal signal arrives before the supervisor reads the notification, the target's system call will be interrupted as usual. The notification will be canceled, and the supervisor will get an error if it tries to read that notification. The end result in that case is as if the system call never happened in the first place. Once the notification is delivered, though, the system call will run to completion. It is a relatively small change that solves this problem, though that solution comes at the expense of adding arbitrary delays to Go's preemption mechanism when seccomp() and user-space notification are in use. That is just the sort of delay that the preemption mechanism was created to prevent, but it will at least be under the control of the supervisor and, presumably, bounded.

This patch set has been posted twice as of this writing; it has not received much in the way of responses. That may suggest that few people have looked at it so far, which is not an ideal situation for a security-related change to the user-space API. Until that situation changes, this work seems unlikely to advance and users of Go with seccomp() and user-space notifications will continue to have problems.

Comments (12 posted)

NUMA-aware qspinlocks

By Jonathan Corbet
April 12, 2021
While some parts of the core kernel reached a relatively stable "done" state years ago, others never really seem to be finished. One of the latter variety is undoubtedly the kernel's implementation of spinlocks, which arbitrate access to data at the lowest levels of the kernel. Lock performance can have a significant effect on the performance of the system as a whole, so optimization work can pay back big dividends. Lest one think that this work is finally done, the NUMA-aware qspinlock patch set shows how some more performance can be squeezed out of the kernel's spinlock implementation.

In its simplest form, a spinlock is a single word in memory, initially set to one. Any CPU wishing to acquire the lock will perform an atomic decrement-and-test operation; if the result is zero, the lock has been successfully taken. Otherwise the CPU will increment the value, then "spin" in a tight loop until the operation succeeds. The kernel has long since left this sort of implementation behind, though, for a number of reasons, including performance. All those atomic operations on the lock word cause its cache line to be bounced around the system, slowing things considerably even if contention for the lock is light.
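
A user-space caricature of that simplest form, written with C11 atomics purely for illustration (the kernel's actual locks have not looked like this for a long time), might be:

    #include <stdatomic.h>

    /* The lock is a single word, initialized to 1 (unlocked). */
    typedef struct { atomic_int val; } naive_spinlock;
    #define NAIVE_SPINLOCK_INIT { 1 }

    static void naive_lock(naive_spinlock *lock)
    {
        for (;;) {
            /* Atomic decrement-and-test: seeing 1 means we took the lock. */
            if (atomic_fetch_sub(&lock->val, 1) == 1)
                return;
            /* Missed it: restore the value and keep spinning, which keeps
               the lock word's cache line bouncing between CPUs. */
            atomic_fetch_add(&lock->val, 1);
        }
    }

    static void naive_unlock(naive_spinlock *lock)
    {
        atomic_fetch_add(&lock->val, 1);
    }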

The current "qspinlock" implementation is based on MCS locks, which implement a queue of CPUs waiting for the lock as a simple linked list. Normally, linked lists are just the sort of data structure that one wants to avoid when cache efficiency is a concern, but nobody ever has to traverse this list. Instead, each CPU will spin on its own entry in the list, and only reach into the next entry to release the lock. See this article for a more complete description, complete with cheesy diagrams, of how MCS locks work.

MCS locks on NUMA systems

MCS locks seem nearly optimal; each CPU focuses on its own queue entry, so cache-line bouncing between processors is nearly eliminated. They are also fair; the queue of waiters ensures that no CPU is starved of access. But it seems that there is a way to do better, at least on non-uniform memory-access (NUMA) systems. Such machines are made up of multiple nodes, each of which contains some number of CPUs; memory attached to a CPU's node will be faster to access than memory attached to a remote node. Access to cached memory is (relatively) fast, of course, regardless of the node that memory is attached to, but moving cache lines between nodes is expensive, even more expensive than bouncing cache lines between CPUs on the same node. Thus, minimizing cache-line movement between NUMA nodes will be good for performance.

If a spinlock is released by a CPU on one node and subsequently acquired by a CPU on a different node, its cache line will have to move between the nodes. If, instead, a contended spinlock could be passed to another CPU on the same node, that expense will be avoided. That alone can make a difference, but it's worth remembering that spinlocks protect data structures. Two processors contending for a given lock are quite likely to be trying to access the same data, so moving the lock between nodes will also cause the cache lines for the protected data to move. For heavily contended data structures, the resulting slowdown can hurt.

The NUMA-aware qspinlock attempts to keep locks from bouncing between NUMA nodes by handing them off to another CPU on the same node whenever possible. To do this, the queue of CPUs waiting for the lock is split into two — a primary and secondary queue. If a CPU finds the lock unavailable, it will add itself to the primary queue and wait as usual. When a CPU gets to the head of the queue, though, it will look at the next CPU in line; if that next CPU is on a different NUMA node, it will be shunted over to the secondary queue.

In this way, the waiting CPUs will eventually be sorted into two queues, one of which (the primary queue) consists only of CPUs on the same node as the current owner of the lock, and one (the secondary) which contains all the rest. When a CPU releases the lock it will hand it to the next CPU in the primary queue, thus keeping the lock on the same NUMA node. The lock will only move to another node once the primary queue empties out, at which point the secondary queue will be moved to the primary and the process starts all over again.

Tweaks and benchmarks

There is an obvious pitfall with a scheme like this: if the lock is heavily contended, the primary queue may never empty out and the other nodes in the system will be starved for the lock. The solution to this problem is to make a note of the time when the first CPU was moved to the secondary queue. If the primary queue does not empty out for 10ms (by default), the entire secondary queue will be promoted to the head of the primary queue, thus forcing the lock to move to another node. The timeout can be changed (within a range of 1-100ms) with the numa_spinlock_threshold command-line parameter.

One optimization that has been added is called "shuffle reduction". If the lock is not all that heavily contended, the extra work of maintaining the secondary queue does not really buy anything. To mitigate this extra cost, the code uses a pseudo-random number generator to only try to create the secondary queue one time out of every 128 lock acquisitions. If the lock gets busy, that will happen relatively often, after which the secondary queue will be maintained until the primary queue empties again (or the above-mentioned timeout occurs).

Finally, the code exempts CPUs running in interrupt (or non-maskable interrupt) mode, and those running realtime tasks, from being pushed to the secondary queue. That allows these CPUs, which presumably have a higher priority, to acquire the lock relatively quickly even if they are running on the wrong NUMA node.

A number of benchmark results are included with the patch set. For lightly contended locks the performance benefits of NUMA awareness are relatively modest. As the number of contending threads grows, though, the speedup does as well, approaching a factor of two for ridiculously heavily contended loads.

This patch set has been through 14 revisions since it was first posted in January 2019. It has evolved quite a bit over that time as comments were raised and addressed; it would appear to be approaching a sort of steady state where it is getting close to being ready to merge. Given that this work has been pending for over two years already, though, and given that it makes significant changes to one of the kernel's fundamental synchronization primitives, it would not be surprising if it took a little longer yet before it hits the mainline.

This paper describes the NUMA-aware qspinlock algorithm in more detail, though the details of the implementation have diverged somewhat from what is described there.

Comments (17 posted)

Debian votes on a statement — and a leader

By Jonathan Corbet
April 8, 2021
Richard Stallman's return to the Free Software Foundation's board of directors has provoked a flurry of responses, and many organizations in the free-software community have expressed their unhappiness with that appointment. In almost every case, the process leading up to that expression has been carried out behind closed doors. The Debian project, instead, is deciding what to do in a classic Debian way — holding a public vote on a general resolution with a wide range of possible outcomes.

The discussion appears to have started on March 23, when Gunnar Wolf floated the possibility of the project taking a position on this issue. One day later, Steve Langasek proposed a general resolution that would make the project a signatory of this open letter opposing Stallman's return. Several hundred (not always pleasant) emails and many proposed amendments later, the final resolution was put out for a vote. In keeping with Debian's reputation for packaging everything, this ballot contains eight options for developers to rank, covering a whole spectrum of potential actions.

Choice 1, at one end of the scale, is essentially the original proposal; it would cause the project as a whole to sign onto the open letter calling for the removal of the entire FSF board. It is a long-winded option, as it contains the full text of that open letter. Choice 2 also calls for Stallman's resignation, but does not ask that the rest of the FSF board resign. It does, however, say that the FSF "needs to seriously reflect" on the decision to reappoint him and states that Debian is "unable to collaborate" with the FSF as long as he remains there.

Choice 3 weakens the message further by removing all talk of resignations or removals and stating instead that "the Debian Project discourages collaborating both with the FSF and any other organisation in which Richard Stallman has a leading position". Weaker yet is choice 4, which calls on the FSF "to further steps it has taken in March 2021 to overhaul governance of the organisation" but takes no position on collaboration with the FSF while that process happens.

Choice 5 is a strong move toward the other end of the spectrum; it would cause Debian to become a signatory to this open letter supporting Stallman's reinstatement. Like the first choice, it includes the full text of that letter. Choice 6, instead, takes little time to work through; it reads, in full: "Debian refuses to participate in and denounces the witch-hunt against Richard Stallman, the Free Software Foundation, and the members of the board of the Free Software Foundation." The intent is clear, even if the style might be better suited to a political Twitter account.

For those who are tired of this whole discussion, choice 7 states simply that Debian will not issue a public statement of any type on this subject. And for those who have just begun to talk about it and want more, choice 8 is the obligatory "further discussion" option that appears on all Debian ballots — as if, as Russ Allbery wryly noted, there were ever any hope of getting Debian to stop discussing a topic.

For people who are accustomed to first-past-the-post voting systems, the Debian ballot may look a little crazy. Votes would be split across a number of similar options and the ultimate outcome might be far from the actual majority position in the project. In the Debian scheme, though, voters will rank their choices and the process will hopefully coalesce on the option that pleases most of them. We will find out sometime after April 17, when voting closes.

Early in the process, Sam Hartman asked that the customary discussion period prior to the vote be shortened in this case. The Debian project leader has the power to make that decision, and current leader Jonathan Carter went along with that request. There was also a request from Jean Duprat that votes be kept secret in this case. There is, evidently, a fair amount of fear that voting on this issue will make developers into targets for attacks — regardless of how they actually voted. The Debian constitution does not allow for secret votes on general resolutions, though, so the actual votes will be made public once the election concludes.

This discussion has nearly obscured another vote happening within the Debian project at the same time: the annual choice of the project leader for the next year. There are two candidates: Carter and challenger Sruthi Chandran. The platforms for the two campaigns can be found on this page. Both were candidates in 2020 as well; LWN covered those campaigns at the time. This vote, too, closes on the 17th.

The Debian project has been through a number of divisive general-resolution votes in its history, with the systemd battles still being fresh in the memory of many. Regardless of the outcome of this vote, "further discussion" seems inevitable, along with loud resignations from the project and more. If past experience holds, though, Debian will pick itself up from the low points of this discussion and continue to develop and improve its distribution, as it has with all of its past controversies.

Comments (89 posted)

Comparing SystemTap and bpftrace

April 13, 2021

This article was contributed by Emanuele Rocca

There are times when developers and system administrators need to diagnose problems in running code. The program to be examined can be a user-space process, the kernel, or both. Two of the major tools available on Linux to perform this sort of analysis are SystemTap and bpftrace. SystemTap has been available since 2005, while bpftrace is a more recent contender that, to some, may appear to have made SystemTap obsolete. However, SystemTap is still the preferred tool for some real-world use cases.

Although dynamic instrumentation capabilities, in the form of KProbes, were added to Linux as early as 2004, the functionality was hard to use and not particularly well known. Sun released DTrace one year later, and soon that system became one of the highlights of Solaris. Naturally, Linux users started asking for something similar, and SystemTap quickly emerged as the most promising answer. But SystemTap was criticized as being difficult to get working, while DTrace on Solaris could be expected to simply work out of the box.

While DTrace came with both kernel and user-space tracing capabilities, it wasn't until 2012 that Linux gained support for user-space tracing in the form of Uprobes. Around 2019, bpftrace gained significant traction, in part due to the general attention being paid to the various use cases for BPF. More recently, Oracle has been working on a re-implementation of DTrace, for Linux, based on the latest tracing facilities in the kernel, although, at this point, it may be too late for DTrace given the options that are already available in this space.

The underlying kernel infrastructure used by both SystemTap and bpftrace is largely the same: KProbes, for dynamically tracing kernel functions, tracepoints for static kernel instrumentation, Uprobes for dynamic instrumentation of user-level functions, and user-level statically defined tracing (USDT) for static user-space instrumentation. Both systems allow instrumenting the kernel and user-space programs through a "script" in a high-level language that can be used to specify what needs to be probed and how.

The important design distinction between the two is that SystemTap translates the user-supplied script into C code, which is then compiled and loaded as a module into a running Linux kernel. Instead, bpftrace converts the script to LLVM intermediate representation, which is then compiled to BPF. Using BPF has several advantages: creating and running a BPF program is significantly faster than building and loading a kernel module. Support for data structures consisting of key/value pairs can be easily added by using BPF maps. The BPF verifier ensures that BPF programs will not cause the system to crash, while the kernel module approach used by SystemTap implies the need for implementing various safety checks in the runtime. On the other hand, using BPF makes certain features hard to implement, for example, a custom stack walker, as we shall see later in the article.

The following example shows the similarity between the two systems from the user standpoint. A simple SystemTap program to instrument the kernel function icmp_echo() looks like this:

    probe kernel.function("icmp_echo") {
        println("icmp_echo was called")
    }

The equivalent bpftrace program is:

    kprobe:icmp_echo {
        print("icmp_echo was called")
    }

We will now look at the differences between SystemTap and bpftrace in terms of installation procedure, program structure, and features.

Installation

Both SystemTap and bpftrace are packaged by all major Linux distributions and can be installed easily using the familiar package managers. SystemTap requires the Linux kernel headers to be installed in order to work, while bpftrace does not, as long as the kernel has BPF Type Format (BTF) support enabled. Depending on whether the user wants to analyze a user-space program or the kernel, there might be additional requirements. For user-space software, both SystemTap and bpftrace require the debugging symbols of the software under examination. The details of how to install the symbol data depend on the distribution.
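
On a Debian-style system, for example, the basic installation might look like this (package names vary somewhat between distributions, and the headers package is only needed for SystemTap):

    # apt install systemtap bpftrace
    # apt install linux-headers-$(uname -r)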

On systems with elfutils 0.178 or later, SystemTap makes the process of finding and installing the right debug symbols fully automatic by using a remote debuginfod server. For example, on Debian systems:

    # export DEBUGINFOD_URLS=https://debuginfod.debian.net
    # export DEBUGINFOD_PROGRESS=1
    # stap -ve 'probe process("/bin/ls").function("format_user_or_group") { println(pp()) }'
    Downloading from https://debuginfod.debian.net/
    [...]

This feature is not yet available for bpftrace.

For kernel instrumentation, SystemTap requires the kernel debugging symbols to be installed in order to use the advanced features of the tool, such as looking up the arguments or local variables of a function, as well as instrumenting specific lines of code within the function body. In this case, too, a remote debuginfod server can be used to automate the process.

Program structure

Both systems provide an AWK-like language, inspired by DTrace's D, to describe predicates and actions. The bpftrace language is pretty much the same as D, and follows this general structure:

    probe-descriptions
    /predicate/
    {
        action-statements
    }

That is to say: when the probes fire, if the given (optional) predicate matches, perform the specified actions.

The structure of SystemTap programs is slightly different:

    probe PROBEPOINT [, PROBEPOINT] {
        [STMT ...]
    }

In SystemTap there is no support for specifying a predicate built into the language, but conditional statements can be used to achieve the same goal.

For example, the following bpftrace program prints all mmap() calls issued by the process with PID 31316:

    uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap
    /pid == 31316/
    {
        print("mmap by 31316")
    }

The SystemTap equivalent is:

    probe process("/lib/x86_64-linux-gnu/libc.so.6").function("mmap") {
        if (pid() == 31316) {
            println("mmap by 31316")
        }
    }

Data aggregation and reporting in bpftrace is done exactly the same way as it is done in DTrace. For example, the following program does a by-PID sum and aggregation of the number of bytes sent with the tcp_sendmsg() kernel function:

    $ sudo bpftrace -e 'kprobe:tcp_sendmsg { @bytes[pid] = sum(arg2); }'
    Attaching 1 probe...
    ^C
    
    @bytes[58832]: 75
    @bytes[58847]: 77
    @bytes[58852]: 857

Like DTrace, bpftrace defaults to automatically printing aggregation results when the program exits: no code had to be written to print the breakdown by PID above. The downside of this implicit behavior is that, to avoid automatic printing of all data structures, users have to explicitly clear() those that should not be printed. For instance, to change the script above and only print the top 5 processes, the bytes map must be cleared upon program termination:

    kprobe:tcp_sendmsg {
        @bytes[pid] = sum(arg2);
    }
    
    END {
        print(@bytes, 5);
        clear(@bytes);
    }

Some powerful facilities for generating histograms are available too, allowing for terse scripts such as the following, which operates on the number of bytes read in calls to vfs_read():

    $ sudo bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'
    Attaching 1 probe...
    ^C
    
    @bytes: 
    (..., 0)             169 |@@                                                  |
    [0]                  206 |@@@                                                 |
    [1]                 1579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
    [2, 4)                13 |                                                    |
    [4, 8)                 9 |                                                    |
    [8, 16)             2970 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16, 32)              45 |                                                    |
    [32, 64)              91 |@                                                   |
    [64, 128)            108 |@                                                   |
    [128, 256)            10 |                                                    |
    [256, 512)             8 |                                                    |
    [512, 1K)             69 |@                                                   |
    [1K, 2K)              97 |@                                                   |
    [2K, 4K)              37 |                                                    |
    [4K, 8K)              64 |@                                                   |
    [8K, 16K)             24 |                                                    |
    [16K, 32K)            29 |                                                    |
    [32K, 64K)            80 |@                                                   |
    [64K, 128K)           18 |                                                    |
    [128K, 256K)           0 |                                                    |
    [256K, 512K)           2 |                                                    |
    [512K, 1M)             1 |                                                    |

Statistical aggregates are also available in SystemTap. The <<< operator allows adding values to a statistical aggregate. SystemTap does not automatically print aggregation results when the program exits, so it needs to be done explicitly:

    global bytes
    probe kernel.function("vfs_read").return {
        bytes <<< $return
    }
    
    probe end {
        print(@hist_log(bytes))
    }
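
Assuming the script above is saved as read-hist.stp (a file name chosen here purely for illustration), it can be run directly with stap; pressing Ctrl-C ends the session, which fires the end probe and prints the histogram:

    # stap read-hist.stp
    ^C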

Features

A very useful feature of DTrace-like systems is the ability to obtain a stack trace to see which sequence of function calls leads to a given probe point. Kernel stack traces can be obtained in bpftrace as follows:

    kprobe:icmp_echo {
        print(kstack);
        exit()
    }

Equivalently, with SystemTap:

    probe kernel.function("icmp_echo") {
        print_backtrace();
        exit()
    }

An important problem affecting bpftrace is that it cannot generate user-space stack traces unless the program being traced was built with frame pointers. For the vast majority of cases, that means that users must recompile the software under examination in order to instrument it.

SystemTap's user-space stack backtrace mechanism, instead, provides a full stack trace by making use of debug information to walk the stack. This means that no recompilation is needed:

    probe process("/bin/ls").function("format_user_or_group") {
        print_ubacktrace();
        exit()
    }

The script above produces a full backtrace, here shortened for readability:

     0x55767a467f60 : format_user_or_group+0x0/0xc0 [/bin/ls]
     0x55767a46d26a : print_long_format+0x58a/0x9f0 [/bin/ls]
     0x55767a46d840 : print_current_files+0x170/0x3e0 [/bin/ls]
     0x55767a465d8d : main+0x62d/0x1a00 [/bin/ls]

The same feature is unlikely to be added to bpftrace, as it would need to be implemented either by the kernel or in BPF bytecode.

Real world uses

Consider the following example of a practical production investigation that could not proceed further with bpftrace because of the backtrace limitation; SystemTap was needed to track the problem down. At Wikimedia we ran into an interesting problem with LuaJIT — after observing high system CPU usage on behalf of Apache Traffic Server, we confirmed that it was due to mmap() being called unusually often:

    $ sudo bpftrace -e 'kprobe:do_mmap /pid == 31316/ { @[arg2]=count(); } interval:s:1 { exit(); }'
    Attaching 2 probes...
    @[65536]: 64988

That is where the investigation would have stopped, had it not been possible to generate user-space backtraces with SystemTap. Note that in this case the issue affected the LuaJIT component: rebuilding Apache Traffic Server with frame pointers to make bpftrace produce a stack trace would not have been sufficient; we would have had to rebuild LuaJIT too.
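
A SystemTap script along the following lines (a sketch rather than the exact script used in the investigation) can produce the user-space call chain behind those mmap() calls; running it with stap's -d /usr/bin/traffic_server and --ldd options supplies the unwind data for the binary and its shared libraries, including LuaJIT:

    probe kernel.function("do_mmap") {
        if (pid() == 31316)
            print_ubacktrace()
    }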

Another important advantage of SystemTap over bpftrace is that it allows accessing function arguments and local variables by their name. With bpftrace, arguments can only be accessed by name when instrumenting the kernel, and specifically when using static kernel tracepoints or the experimental kfunc feature that is available for recent kernels. The kfunc feature is based on BPF trampolines and seems promising. When using regular kprobes, or when instrumenting user-space software, bpftrace can access arguments only by position (arg0, arg1, ... argN).
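
To illustrate the difference with the icmp_echo() example from earlier (the function takes a single struct sk_buff pointer), SystemTap can refer to the argument by name, provided that the kernel debugging information is available:

    probe kernel.function("icmp_echo") {
        printf("skb=%p\n", $skb)
    }

A bpftrace kprobe, on the other hand, only sees the argument's position:

    kprobe:icmp_echo {
        printf("skb=0x%lx\n", arg0);
    }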

SystemTap is also able to list available probe points by source file, and to match by filename in the definition of probes too. The feature can be used to focus the analysis only on specific areas of the code base. For instance, the following command can be used to list (-L) all of the functions defined in Apache Traffic Server's iocore/cache/Cache.cc:

    $ stap -L 'process("/usr/bin/traffic_server").function("*@./iocore/cache/Cache.cc")'

It is often necessary to probe a specific point somewhere in the body of a function, rather than limiting the analysis to the function entry point or to the return statement. This can be done in SystemTap using statement probes; the following will list the probe points available along with the variables available at each point:

    $ stap -L 'process("/bin/ls").statement("format_user_or_group@src/ls.c:*")'
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4110") \
        $name:char const* $id:long unsigned int $width:int
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4115") \
        $name:char const* $id:long unsigned int $width:int
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4116") \
        $width_gap:int $name:char const* $id:long unsigned int $width:int	
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4118") \
        $pad:int $name:char const* $id:long unsigned int $width:int
    [...]
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4131") \
        $name:char const* $id:long unsigned int $width:int $len:size_t

The full output shows that there are ten different lines that can be probed inside the function format_user_or_group(), together with the various variables available in scope. By looking at the source code we can see which one exactly needs to be probed, and write the SystemTap program accordingly.

To try to achieve the same goal with bpftrace we would need to disassemble the function and specify the right offset to the Uprobe based on the assembly instead, which is cumbersome at best. Additionally, bpftrace needs to be explicitly built with Binary File Descriptor (BFD) support for this feature to work.
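
Such a probe would look something like the following, where the offset (a made-up value here) must be worked out by hand from the disassembly and has to land on an instruction boundary:

    uprobe:/bin/ls:format_user_or_group+27
    {
        print("reached offset 27 in format_user_or_group")
    }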

While all software is sooner or later affected by bugs, issues affecting debugging tools are particularly thorny. One specific issue affects bpftrace on systems with certain LLVM versions, and it seems worth mentioning. Due to an LLVM bug causing load/store instructions in the intermediate representation to be reordered when they should not be, valid bpftrace scripts can misbehave in ways that are difficult to figure out. Adding or removing unrelated code might work around or trigger the bug. The same underlying LLVM bug causes other bpftrace scripts to fail. The problem has recently been fixed in LLVM 12; bpftrace users should ensure they are running a recent LLVM version that is not affected by this issue.

Conclusions

SystemTap and bpftrace offer similar functionality, but differ significantly in their design choices by using a loadable kernel module in one case and BPF in the other. The approach based on kernel modules offers greater flexibility, and allows implementing features that are hard if not impossible to do using BPF. On the other hand, BPF is an obviously good choice for tracing tools, as it provides a fast and safe environment to base observability tools on.

For many use cases, bpftrace just works out of the box, while SystemTap generally requires installing additional dependencies in order to take full advantage of all of its features. Bpftrace is generally faster, and provides various facilities for quick aggregation and reporting that are arguably simpler to use than those provided by SystemTap. On the other hand, SystemTap provides several distinguishing features such as: generating user-space backtraces without the need for frame pointers, accessing function arguments and local variables by name, and the ability to probe arbitrary statements. Both would seem to have their place for diagnosing problems in today's Linux systems.

Comments (11 posted)

Page editor: Jonathan Corbet


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds