Comparing SystemTap and bpftrace
There are times when developers and system administrators need to diagnose problems in running code. The program to be examined can be a user-space process, the kernel, or both. Two of the major tools available on Linux to perform this sort of analysis are SystemTap and bpftrace. SystemTap has been available since 2005, while bpftrace is a more recent contender that, to some, may appear to have made SystemTap obsolete. However, SystemTap is still the preferred tool for some real-world use cases.
Although dynamic instrumentation capabilities, in the form of KProbes, were added to Linux as early as 2004, the functionality was hard to use and not particularly well known. Sun released DTrace one year later, and soon that system became one of the highlights of Solaris. Naturally, Linux users started asking for something similar, and SystemTap quickly emerged as the most promising answer. But SystemTap was criticized as being difficult to get working, while DTrace on Solaris could be expected to simply work out of the box.
While DTrace came with both kernel and user-space tracing capabilities, it wasn't until 2012 that Linux gained support for user-space tracing in the form of Uprobes. Around 2019, bpftrace gained significant traction, in part due to the general attention being paid to the various use cases for BPF. More recently, Oracle has been working on a re-implementation of DTrace, for Linux, based on the latest tracing facilities in the kernel, although, at this point, it may be too late for DTrace given the options that are already available in this space.
The underlying kernel infrastructure used by both SystemTap and bpftrace is largely the same: KProbes for dynamically tracing kernel functions, tracepoints for static kernel instrumentation, Uprobes for dynamic instrumentation of user-level functions, and user-level statically defined tracing (USDT) for static user-space instrumentation. Both systems allow instrumenting the kernel and user-space programs through a "script" in a high-level language that specifies what needs to be probed and how.
The important design distinction between the two is that SystemTap translates the user-supplied script into C code, which is then compiled and loaded as a module into a running Linux kernel. bpftrace, by contrast, converts the script to LLVM intermediate representation, which is then compiled to BPF. Using BPF has several advantages. Creating and running a BPF program is significantly faster than building and loading a kernel module. Support for data structures consisting of key/value pairs can be easily added by using BPF maps. The BPF verifier ensures that BPF programs will not cause the system to crash, while the kernel-module approach used by SystemTap requires the runtime to implement various safety checks itself. On the other hand, using BPF makes certain features, such as a custom stack walker, hard to implement, as we shall see later in the article.
The following example shows the similarity between the two systems from the user standpoint. A simple SystemTap program to instrument the kernel function icmp_echo() looks like this:
probe kernel.function("icmp_echo") { println("icmp_echo was called") }
The equivalent bpftrace program is:
kprobe:icmp_echo { print("icmp_echo was called") }
We will now look at the differences between SystemTap and bpftrace in terms of installation procedure, program structure, and features.
Installation
Both SystemTap and bpftrace are packaged by all major Linux distributions and can be installed easily using the familiar package managers. SystemTap requires the Linux kernel headers to be installed in order to work, while bpftrace does not, as long as the kernel has BPF Type Format (BTF) support enabled. Depending on whether the user wants to analyze a user-space program or the kernel, there might be additional requirements. For user-space software, both SystemTap and bpftrace require the debugging symbols of the software under examination. The details of how to install the symbol data depend on the distribution.
On systems with elfutils 0.178 or later, SystemTap makes the process of finding and installing the right debug symbols fully automatic by using a remote debuginfod server. For example, on Debian systems:
    # export DEBUGINFOD_URLS=https://debuginfod.debian.net
    # export DEBUGINFOD_PROGRESS=1
    # stap -ve 'probe process("/bin/ls").function("format_user_or_group") { println(pp()) }'
    Downloading from https://debuginfod.debian.net/
    [...]
This feature is not yet available for bpftrace.
For kernel instrumentation, SystemTap requires the kernel debugging symbols to be installed in order to use the advanced features of the tool, such as looking up the arguments or local variables of a function, as well as instrumenting specific lines of code within the function body. In this case, too, a remote debuginfod server can be used to automate the process.
Program structure
Both systems provide an AWK-like language, inspired by DTrace's D, to describe predicates and actions. The bpftrace language is pretty much the same as D, and follows this general structure:
probe-descriptions /predicate/ { action-statements }
That is to say: when the probes fire, if the given (optional) predicate matches, perform the specified actions.
The structure of SystemTap programs is slightly different:
probe PROBEPOINT [, PROBEPOINT] { [STMT ...] }
The SystemTap language has no built-in support for predicates, but conditional statements can be used to achieve the same goal.
For example, the following bpftrace program prints all mmap() calls issued by the process with PID 31316:
uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap /pid == 31316/ { print("mmap by 31316") }
The SystemTap equivalent is:
probe process("/lib/x86_64-linux-gnu/libc.so.6").function("mmap") { if (pid() == 31316) { println("mmap by 31316") } }
Data aggregation and reporting in bpftrace are done exactly the same way as in DTrace. For example, the following program sums, per PID, the number of bytes sent with the tcp_sendmsg() kernel function:
    $ sudo bpftrace -e 'kprobe:tcp_sendmsg { @bytes[pid] = sum(arg2); }'
    Attaching 1 probe...
    ^C

    @bytes[58832]: 75
    @bytes[58847]: 77
    @bytes[58852]: 857
Like DTrace, bpftrace defaults to automatically printing aggregation results when the program exits: no code had to be written to print the breakdown by PID above. The downside of this implicit behavior is that, to avoid automatic printing of all data structures, users have to explicitly clear() those that should not be printed. For instance, to change the script above and only print the top 5 processes, the bytes map must be cleared upon program termination.
    kprobe:tcp_sendmsg {
        @bytes[pid] = sum(arg2);
    }

    END {
        print(@bytes, 5);
        clear(@bytes);
    }
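The aggregation pattern used above (sum a value per key, then report only the largest entries) can be modeled in a few lines of Python. This is only a sketch of the idea, not how bpftrace implements its maps: the names `sum_by_key` and `top` and the sample events are hypothetical, and the largest-first ordering is an arbitrary choice made for the illustration.

```python
from collections import defaultdict


def sum_by_key(events):
    """Sum byte counts per PID, mirroring @bytes[pid] = sum(arg2)."""
    totals = defaultdict(int)
    for pid, nbytes in events:
        totals[pid] += nbytes
    return dict(totals)


def top(totals, n):
    """Return the n (pid, total) pairs with the largest totals, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up (pid, bytes) events standing in for tcp_sendmsg() calls.
    events = [(58832, 75), (58847, 40), (58847, 37), (58852, 857)]
    for pid, total in top(sum_by_key(events), 2):
        print(f"@bytes[{pid}]: {total}")
```

In the bpftrace script, the kernel-side BPF map plays the role of `totals`, and the END probe plays the role of the final loop.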
Some powerful facilities for generating histograms are available too, allowing for terse scripts such as the following, which operates on the number of bytes read in calls to vfs_read():
    $ sudo bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'
    Attaching 1 probe...
    ^C

    @bytes:
    (..., 0)             169 |@@                                                  |
    [0]                  206 |@@@                                                 |
    [1]                 1579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
    [2, 4)                13 |                                                    |
    [4, 8)                 9 |                                                    |
    [8, 16)             2970 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16, 32)              45 |                                                    |
    [32, 64)              91 |@                                                   |
    [64, 128)            108 |@                                                   |
    [128, 256)            10 |                                                    |
    [256, 512)             8 |                                                    |
    [512, 1K)             69 |@                                                   |
    [1K, 2K)              97 |@                                                   |
    [2K, 4K)              37 |                                                    |
    [4K, 8K)              64 |@                                                   |
    [8K, 16K)             24 |                                                    |
    [16K, 32K)            29 |                                                    |
    [32K, 64K)            80 |@                                                   |
    [64K, 128K)           18 |                                                    |
    [128K, 256K)           0 |                                                    |
    [256K, 512K)           2 |                                                    |
    [512K, 1M)             1 |                                                    |
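To make the bucket labels in the histogram output concrete, the following Python sketch assigns integer values to power-of-two buckets. It is an illustration of the bucketing idea only, not bpftrace's implementation: the names `log2_bucket` and `histogram` are hypothetical, and the K/M suffixes that bpftrace uses for large bounds are left out.

```python
from collections import Counter


def log2_bucket(value: int) -> str:
    """Return a power-of-two bucket label for an integer value."""
    if value < 0:
        return "(..., 0)"  # negative values, e.g. error returns
    if value in (0, 1):
        return f"[{value}]"
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return f"[{low}, {low * 2})"


def histogram(values):
    """Count how many values fall into each bucket."""
    return Counter(log2_bucket(v) for v in values)


if __name__ == "__main__":
    # Made-up return values standing in for vfs_read() results.
    for label, count in histogram([-11, 0, 1, 3, 8, 15, 16, 100]).items():
        print(f"{label:<16}{count:>8}")
```

Each bucket covers twice the range of the previous one, which is why a handful of lines is enough to summarize read sizes from one byte up to a megabyte.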
Statistical aggregates are also available in SystemTap. The <<< operator allows adding values to a statistical aggregate. SystemTap does not automatically print aggregation results when the program exits, so it needs to be done explicitly.
    global bytes

    probe kernel.function("vfs_read").return {
        bytes <<< $return
    }

    probe end {
        print(@hist_log(bytes))
    }
Features
A very useful feature of DTrace-like systems is the ability to obtain a stack trace to see which sequence of function calls led to a given probe point. Kernel stack traces can be obtained in bpftrace as follows:
kprobe:icmp_echo { print(kstack); exit() }
Equivalently, with SystemTap:
probe kernel.function("icmp_echo") { print_backtrace(); exit() }
An important problem affecting bpftrace is that it cannot generate user-space stack traces unless the program being traced was built with frame pointers. For the vast majority of cases, that means that users must recompile the software under examination in order to instrument it.
SystemTap's user-space stack backtrace mechanism, instead, provides a full stack trace by making use of debug information to walk the stack. This means that no recompilation is needed.
probe process("/bin/ls").function("format_user_or_group") { print_ubacktrace(); exit() }
The script above produces a full backtrace, here shortened for readability:
    0x55767a467f60 : format_user_or_group+0x0/0xc0 [/bin/ls]
    0x55767a46d26a : print_long_format+0x58a/0x9f0 [/bin/ls]
    0x55767a46d840 : print_current_files+0x170/0x3e0 [/bin/ls]
    0x55767a465d8d : main+0x62d/0x1a00 [/bin/ls]
The same feature is unlikely to be added to bpftrace, as it would need to be implemented either by the kernel or in BPF bytecode.
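For readers unfamiliar with frame-pointer unwinding, the following Python sketch models why a frame-pointer-based walker gives up on code built without frame pointers. It is a toy model under stated assumptions, not kernel or bpftrace code: the stack is a dictionary from address to stored word, the x86-64 layout (saved frame pointer at `fp`, return address at `fp + 8`) is assumed, and every address in it is made up.

```python
WORD = 8  # bytes per stack slot, assuming x86-64


def walk_frame_pointers(memory, fp, max_depth=64):
    """Follow the chain of saved frame pointers and collect return addresses.

    memory maps an address to the word stored there. In a frame set up with
    a frame pointer, memory[fp] is the caller's saved frame pointer and
    memory[fp + WORD] is the return address into the caller.
    """
    trace = []
    for _ in range(max_depth):
        saved_fp = memory.get(fp)
        return_addr = memory.get(fp + WORD)
        if saved_fp is None or return_addr is None:
            break  # fp does not point at a valid frame: the chain ends here
        trace.append(return_addr)
        fp = saved_fp
    return trace


if __name__ == "__main__":
    # Three made-up frames; a saved frame pointer of 0 marks the outermost one.
    stack = {
        0x7000: 0x7040, 0x7008: 0x1111,
        0x7040: 0x7080, 0x7048: 0x2222,
        0x7080: 0x0,    0x7088: 0x3333,
    }
    print([hex(a) for a in walk_frame_pointers(stack, 0x7000)])
    # A register reused as a general-purpose register holds an arbitrary value.
    print(walk_frame_pointers(stack, 0xDEAD))
```

When the compiler omits frame pointers, the register no longer points at such a chain, so the walk stops immediately or returns garbage. A DWARF-based unwinder, like the one SystemTap uses, instead consults the debug information to find each caller's frame.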
Real world uses
The following production investigation could not proceed with bpftrace because of the backtrace limitation, so SystemTap was needed to track the problem down. At Wikimedia we ran into an interesting problem with LuaJIT: after observing high system CPU usage on behalf of Apache Traffic Server, we could confirm that it was due to mmap() being called unusually often:
    $ sudo bpftrace -e 'kprobe:do_mmap /pid == 31316/ { @[arg2]=count(); } interval:s:1 { exit(); }'
    Attaching 2 probes...
    @[65536]: 64988
That is where the investigation would have stopped, had it not been possible to generate user-space backtraces with SystemTap. Note that in this case the issue affected the LuaJIT component: rebuilding Apache Traffic Server with frame pointers to make bpftrace produce a stack trace would not have been sufficient; we would have had to rebuild LuaJIT too.
Another important advantage of SystemTap over bpftrace is that it allows accessing function arguments and local variables by their name. With bpftrace, arguments can only be accessed by name when instrumenting the kernel, and specifically when using static kernel tracepoints or the experimental kfunc feature that is available for recent kernels. The kfunc feature is based on BPF trampolines and seems promising. When using regular kprobes, or when instrumenting user-space software, bpftrace can access arguments only by position (arg0, arg1, ... argN).
SystemTap is also able to list available probe points by source file, and to match by filename in the definition of probes too. The feature can be used to focus the analysis only on specific areas of the code base. For instance, the following command can be used to list (-L) all of the functions defined in Apache Traffic Server's iocore/cache/Cache.cc:
$ stap -L 'process("/usr/bin/traffic_server").function("*@./iocore/cache/Cache.cc")'
It is often necessary to probe a specific point somewhere in the body of a function, rather than limiting the analysis to the function entry point or to the return statement. This can be done in SystemTap using statement probes; the following will list the probe points available along with the variables available at each point:
    $ stap -L 'process("/bin/ls").statement("format_user_or_group@src/ls.c:*")'
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4110") \
        $name:char const* $id:long unsigned int $width:int
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4115") \
        $name:char const* $id:long unsigned int $width:int
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4116") \
        $width_gap:int $name:char const* $id:long unsigned int $width:int
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4118") \
        $pad:int $name:char const* $id:long unsigned int $width:int
    [...]
    process("/bin/ls").statement("format_user_or_group@src/ls.c:4131") \
        $name:char const* $id:long unsigned int $width:int $len:size_t
The full output shows that there are ten different lines that can be probed inside the function format_user_or_group(), together with the various variables available in scope. By looking at the source code we can see which one exactly needs to be probed, and write the SystemTap program accordingly.
To try to achieve the same goal with bpftrace we would need to disassemble the function and specify the right offset to the Uprobe based on the assembly instead, which is cumbersome at best. Additionally, bpftrace needs to be explicitly built with Binary File Descriptor (BFD) support for this feature to work.
While all software is sooner or later affected by bugs, issues affecting debugging tools are particularly thorny. One specific issue affects bpftrace on systems with certain LLVM versions, and it seems worth mentioning. Due to an LLVM bug causing load/store instructions in the intermediate representation to be reordered when they should not be, valid bpftrace scripts can misbehave in ways that are difficult to figure out. Adding or removing unrelated code might work around or trigger the bug. The same underlying LLVM bug causes other bpftrace scripts to fail. The problem has recently been fixed in LLVM 12; bpftrace users should ensure they are running a recent LLVM version that is not affected by this issue.
Conclusions
SystemTap and bpftrace offer similar functionality, but differ significantly in their design choices by using a loadable kernel module in one case and BPF in the other. The approach based on kernel modules offers greater flexibility, and allows implementing features that are hard if not impossible to do using BPF. On the other hand, BPF is an obviously good choice for tracing tools, as it provides a fast and safe environment to base observability tools on.
For many use cases, bpftrace just works out of the box, while SystemTap generally requires installing additional dependencies in order to take full advantage of all of its features. Bpftrace is generally faster, and provides various facilities for quick aggregation and reporting that are arguably simpler to use than those provided by SystemTap. On the other hand, SystemTap provides several distinguishing features such as: generating user-space backtraces without the need for frame pointers, accessing function arguments and local variables by name, and the ability to probe arbitrary statements. Both would seem to have their place for diagnosing problems in today's Linux systems.
Index entries for this article
Kernel: Development tools/Kernel tracing
GuestArticles: Rocca, Emanuele
Posted Apr 13, 2021 20:46 UTC (Tue) by atnot (subscriber, #124910)
I hope in the future some solution can be found to not have to compromise between security and debuggability in production systems in this way, but with the deep reach that kprobes have into the kernel it seems somewhat unlikely.
Posted Apr 13, 2021 23:50 UTC (Tue) by fuhchee (guest, #40059)
Posted Apr 14, 2021 9:02 UTC (Wed) by seanyoung (subscriber, #28711)
Posted Apr 14, 2021 13:27 UTC (Wed) by fuhchee (guest, #40059)
Last I heard, this was compile-time bounded, implemented by unrolling.
Posted Apr 14, 2021 14:14 UTC (Wed) by mathstuf (subscriber, #69389)
Posted Apr 14, 2021 10:07 UTC (Wed) by hazmat (subscriber, #668)
Posted Apr 14, 2021 11:03 UTC (Wed) by rahulsundaram (subscriber, #21946)
Posted Apr 17, 2021 4:08 UTC (Sat) by ghane (guest, #1805)
I do not use either debugging tool, or debug compiled code, but this article was a pleasure to read for its clarity.
It provides an excellent overview of the *different use cases* where one might use these tools. There are too few of these articles, I think.
Thank you
Posted Apr 20, 2021 7:35 UTC (Tue) by ema (subscriber, #17750)
Posted Apr 20, 2021 1:21 UTC (Tue) by ringerc (subscriber, #3071)
The TL;DR is that in practice the effective, reliable use of any of these tools tends to require that you be able to plan ahead in order to install necessary debuginfo, kernel headers, utilities etc well ahead of time, before they age out of repositories. For best results you'll want a much newer kernel than the enterprise-y distro defaults too. So they work best in tightly controlled farms of machines where the people who care about tracing can control how the systems are installed and updated.
* SystemTap kmod mode needs kernel headers and prefers kernel debuginfo. Both age out of repos quickly.
So on older kernels you can use systemtap, except you probably can't get the kernel headers and debuginfo installed. And you're generally not going to encounter newer bpf-friendly kernels in the wild on production systems unless you're managing your own clusters of systems. If you do have a newer kernel and want to use bpf, you get to fight with its lack of DWARF based unwinding and its primitive to nonexistent ability to understand userspace memory contents.
I find this intensely frustrating, as I get a great deal of value out of both tools in my own debugging and performance work. But systemtap sometimes breaks when I update my kernel on my laptop, and I need a bleeding-edge bcc for even some of the basic functionality I needed for simple userspace tracers.
Posted Apr 20, 2021 1:22 UTC (Tue) by ringerc (subscriber, #3071)
SystemTap is widely available even for older systems, though the packaged versions are usually older so it's a bit of a pain to write scripts that work with them. It's easy to compile if you're allowed to install the needed toolchain and dependencies on the target and you have the time, but that adds to the hassle. Especially when you're not hands-on and you just want the other end (customer, or whatever) to run a tapscript for you.
Additionally, for its most fully featured and default runtime (kmod) SystemTap requires kernel headers and preferably debuginfo. These are frequently unavailable for whatever older kernel point release happens to be running on the target system at the time you need to run some tracing tools. Or at best you have to go digging manually through some archive of old packages that have aged out of the main repositories for the OS. The stap-prep tool can't usually find them for you. So to reliably use systemtap's kmod runtime you need to plan ahead and install kernel headers and debuginfo whenever you update the kernel, which nobody ever does. This drastically limits its practical utility.
But lots of eBPF features and helper functions are only available in much newer kernels. On widely deployed "enterprise" system kernels it's basically useless for nontrivial userspace tracing and analysis. eBPF is quite fragile in the face of kernel version changes as soon as you step outside the canned tracepoints, and the set of helper functions is extremely limited.
Even if you can run your bpf scripts, your userspace stacks are going to look like "-" most of the time, because everything is compiled with -fomit-frame-pointer. AFAICS most bpf tools don't handle external DWARF debuginfo or use tools like libunwind to help them out. So you land up having to recompile with -fno-omit-frame-pointer and use unstripped binaries with debuginfo in the main binary. This basically means you can't do much tracing of packaged userspace binaries as are the norm on production systems.
SystemTap on the other hand will not only get you your userspace stacks using DWARF detached debuginfo, it'll now even talk to a debuginfod to download symbols for you during probe compilation. It'll walk userspace pointers chains, examine struct members, recursively print structs, handle unions and so much more using simple built-in syntax. So it's currently infinitely more powerful for userspace probing and analysis ...
... or it would be if only you could find and install the kernel headers.
SystemTap also has 'dyninst' and 'bpf' runtimes, which entirely avoid the need for kernel headers and can often be used without kernel debuginfo. But a considerable number of the built-in systemtap "tapsets" rely on embedded-C code written for kernelspace, which simply won't work for a dyninst or bpf tapscript. Or they rely on helper functions exported by the kmod runtime that are not implemented for the dyninst or bpf runtimes. So in practice most of your existing systemtap scripts won't work, and scripts are more difficult to write for the dyninst or bpf runtimes.
Additionally, the dyninst runtime requires that you wrap the target using LD_PRELOAD. So it's cool for development and QA work but for a production system it's often impractical, as you frequently want to non-intrusively trace an already-running server process.
This means you can't usually apply eBPF or use SystemTap with any of its runtimes to any system you encounter in the wild.
https://github.com/iovisor/bpftrace/issues/created_by/ema
--
Sanjeev
* SystemTap dyninst runtime requires restarting the target so it can LD_PRELOAD, and is more limited than kmod runtime
* Effective eBPF userspace tracing in practice requires quite new kernels, so bpftrace, bcc, etc are hard or impractical to apply to older kernels common in the wild
* SystemTap bpf runtime is limited by the same kernel version concerns as bcc etc *and* its own systemtap-specific limitations.
* SystemTap doesn't usually work on kernels newer than the systemtap release due to internal kernel API changes. It spews compile errors. So you usually need to get a newer systemtap to work with newer kernels.
* bpftrace and bcc don't currently handle detached DWARF debuginfo, and don't even handle binaries built with the x64 default -fomit-frame-pointer compile flag properly.
* The rich debuginfo based access to userspace state available in systemtap is mostly absent from bpf tooling targeting userspace, so access to your program state is very painful with bpf.