A SystemTap update
SystemTap has been under active development for a some years. More than 35 people have contributed enhancements in the last year. But newer developments, like the ability to dynamically trace user space programs, seem to have been very quietly introduced and, thus, have not always been noticed by users that are not yet using SystemTap extensively. So this article will take a look at what currently works out of the box, what that box should contain to make things work, the work in progress, and the challenges SystemTap faces to be more powerful and get more widespread adoption.
SystemTap's goal is to provide full system observability on production systems, which is safe, non-intrusive, (near) zero-overhead and which allows ubiquitous data collection across the whole system for any interesting event that could happen. To achieve this goal, SystemTap defines the stap language, in which the user defines probes, actions, and data acquisition. The SystemTap translator and runtime guarantees that probe points are only placed on safe locations and that probe functions cannot generate too much overhead when collecting data. For dynamic probes on addresses inside the kernel, SystemTap uses kprobes; for dynamic probes in user space programs, instead, SystemTap uses its cousin uprobes [PDF]. This provides a unified way of probing and then collecting data for observing the whole system. To dynamically find locations for probe points, arguments of the probed functions and the variables in scope at the probe point, SystemTap uses the debuginfo (Dwarf) standard debugging information that the compiler generates.
So, to provide an ideal setting for using SystemTap, GNU/Linux distributions should provide easy access to debuginfo for the kernel and user space programs. Almost all distributors do this. The kernel supports kprobes, which has been in the upstream kernel for some years, and uprobes, which comes with (and is automatically loaded by) SystemTap, but which relies on the full utrace framework, which isn't yet in the mainline kernel. (The latest few releases of the Fedora family, including Red Hat Enterprise Linux and CentOS, do include full utrace support by default). SystemTap works without debuginfo, but the range of probes and the amount of data you can collect is then very limited. And it works without utrace support, but then you won't be able to do deep user space probing, only observe direct user/kernel space interactions.
There are various probe variants one can use with SystemTap, but the most interesting ones are the debuginfo-based probes for the kernel, kernel modules, and user space applications. These can use function, statement or return variants, and wildcards, such as:
-
kernel.function("rpc_new_task"): a named kernel function, -
process("/bin/ls").function("*"): any function entry in a specific process, -
module("usb*").function("*sync*").return: every return of a function containing the word sync, in any module starting with usb, or -
kernel.statement("bio_init@fs/bio.c+3"): for a specific statement in a particular file
Depending on the type of probe, one can access specifics of the
probe point. For the debuginfo based probes these
are $var for in-scope variables or function
arguments, $var->field for accessing structure
fields, $var[N] for array elements, $return
for the return value of a function in a return probe, and meta
variables like $$vars to get a string representation of
all the in-scope variables at a particular probe point. All access to
such constructs are safeguarded by the SystemTap runtime to make sure
no illegal accesses can occur.
Given that one has the debuginfo of a program installed, one can easily get a simple call trace of a specific program, including all function parameters and return values with the following stap script:
probe process("/bin/ls").function("*").call
{
printf("=>%s(%s)\n", probefunc(), $$parms);
}
probe process("/bin/ls").function("*").return
{
printf("<=%s:%s\n", probefunc(), $$return);
}
The examples included with SystemTap come with much more powerful versions that show timed, per-thread call graphs, optionally showing only children of a particular function call.
While these probing and data extraction constructs are powerful, they do require some knowledge of the kernel or program code base. Since you are often interested in what is happening and not precisely how, SystemTap comes with "tapsets," which are utility functions and aliases for groups of interesting probes in a particular subsystem. Examples include system calls, NFS operations, signals, sockets, etc. Currently these tapsets are distributed with SystemTap itself, but ideally each program or subsystem would come with its own tapset of interesting events provided by the program or subsystem maintainer.
Just printing out events while they occur is not always ideal. First, you may be overwhelmed by volume of the output; second, you might only be interested in a specific subset of the same event (only certain parameters, only calls that take longer than a specific time, only from the process that does the most calls over a specific time frame, etc.). Finally, processing all the events on your production system might interfere with the thing you are trying to observe. Especially at the start of your investigations, when you might not yet be sure what the interesting events are, you may do some very wide probing to see what is going on.
For this reason the stap language supports variables that can be used as associative arrays, simple control structures and data aggregation functions to do simple statistics during probe time, with very low overhead and without having to call external programs that might interfere with the system being probed.
The following script might be how you would start investigating a problem involving a system which seems to do an excessive amount of reads. It uses the "vfs" tapset and an associative array to store the number of reads a particular executable with a specific process ID does:
global totals;
probe vfs.read
{
totals[execname(), pid()]++
}
probe end
{
printf("== totals ==\n")
foreach ([name,pid] in totals-)
printf("%s (%d): %d \n", name, pid, totals[name,pid])
}
This will give you a list of executables and their pid sorted by the total number of vfs reads done while the script was running. These facilities in the stap language help greatly to minimize any overhead of the tracing framework. If you would try to do the same thing by just printing each vfs event and then post-processing the results with Perl, you might end up with Perl itself being the process doing the most vfs calls, or worse, by having to parse megabytes of trace data, Perl might start trashing the system even more, making it harder to determine the root cause of the original problem.
SystemTap now also supports static markers in the kernel. This allows subsystem maintainers to mark specific events as interesting, providing a format string of the arguments to the event that can be easily parsed by tracing tools. The advantage of static markers over tapsets is that they are in-code and so might be easier to maintain, though you probably still want to have an associated tapset for utilities to nicely format the arguments or associate various markers with each other. Also, they can work without needing any DWARF debuginfo around, but you lose the ability to inspect local variables or function parameters not passed to the marker. You use them with a command like:
probe kernel.mark("kernel_sched_wakeup")
The tapset can then
access the arguments through $argN and get the argument
format string of the marker with $format.
An alternate way of adding static markers to the kernel, tracepoints, is not yet directly supported in SystemTap. Tracepoints have the disadvantage that they require the DWARF debuginfo to be around because they don't currently specify the types of their arguments except through their function prototypes. So SystemTap can currently only use tracepoints via hand-written intermediary code that maps them to markers.
The development version of SystemTap recently got support for user-space static markers. Although SystemTap defines its own STAP_PROBE macros for usage in applications that want to add static markers, there is also an alternative tracing tool, Dtrace, that has its own way for programs to embed static markers. SystemTap supports the convention used by Dtrace by providing an alternative include file and build preprocessor so that programs using DTRACE_PROBE macros can be compiled as if for Dtrace and have their static markers show up with SystemTap.
Luckily, there are various programs that already have such markers defined. For example PostgreSQL has various static markers to trace higher-level events like transactions and database locks. Currently one has to adapt the build process of such programs by hand, but the next version of SystemTap will come with scripts that will automate that process.
While SystemTap works well on GNU/Linux distributions that support it, there are a couple of challenges to overcome to make it more ubiquitous and easier for more people to use out of the box. This goes beyond work on the SystemTap code base itself. Since the goal is to provide full system observability, from low-level kernel events to high-level application events, there is work to be done all across the GNU/Linux stack. Also needed is better integration into more distributions, providing default installation of SystemTap and tapsets, easy access to debuginfo for deep inspection, binaries compiled with marker support for high-level events, etc. The two main challenges to make SystemTap more powerful and easier to use on any distribution are debuginfo and better kernel support.
A lot of power of SystemTap comes from the fact that it can use DWARF debuginfo from the kernel and applications to do very detailed inspection. But this power comes at a price, since the debuginfo is often large. For example, on Fedora, the kernel debuginfo package is far larger than the kernel package itself. One easy win will be to split the debuginfo package into the DWARF files and the source files, which are needed for a debugger, but not directly for a tracer like SystemTap. Fedora plans to do this for its next release. The elfutils team is also working on a framework for Dwarf transformation and compression that could be used as post-processor on the output of the compiler.
SystemTap sometimes suffers from the same issues you might have with a debugger: the compiler has optimized the code, but forgot where it put a certain variable after the optimization. Of course this is always the variable you are most interested in. Alexandre Oliva is working on improving the local variable debug information in GCC. His variable tracking assignments [PDF] branch in GCC aims to improve debug information by annotating assignments early in the compilation process and carrying over such annotations throughout all optimization passes so that you can always accurately track variables, even in optimized code.
Finally, there is work being done on having a SystemTap "client and server" that could be used on production systems where you might not even want to have any tools or debuginfo installed. You can then set up a development client that has the same configuration as the production system with the addition of the SystemTap translator and all debuginfo, create and test your scripts there. The final result of this work could then be used on the bare-bones production server.
Most of the SystemTap runtime, like the kprobes support, is maintained in the upstream linux kernel, but there is some stuff still missing. This leads to distributions having to add patches to their kernel, especially to support user space tracing. In particular, the utrace framework is still not upstream. Over the last few kernel releases, various parts have been merged, including the utrace user_regset framework, which creates an interface for code accessing the user-space view of any machine specific state, and the tracehook work, which provides a framework for all the user process tracing. The actual utrace framework sits on top of these components; the ptrace() interface is implemented as utrace client. Anything that changes the ptrace implementation is hairy stuff, so there is a large ptrace testsuite to make sure that nothing breaks. One idea under consideration is to push utrace upstream in two installments. At first, using utrace or ptrace on a process would be mutually exclusive. That could pave to path to get pure-utrace upstream in first and then do proper ptrace cooperation in a second go.
This approach would also provide the way for uprobes, which depends on the utrace framework, to be submitted upstream. Uprobes components such as breakpoint insertion and removal and the single-stepping infrastructure are also potentially useful for other user space tracers and debuggers. Like with utrace, one idea is to factor out these portions of uprobes so that it can be used by multiple clients as a shared user-space breakpoint support (ubs) layer. With multiple clients using the same layer, upstream acceptance might be easier.
One candidate for using both the utrace and the uprobes layer besides SystemTap is Froggy, which provides an alternative debugger interface to ptrace. The GDB Archer project would like to serve as testbed for Froggy, which they hope will also make GDB more robust when linked with libpython, which is being used for GDB scripting.
In the past, kernel maintainers were skeptical about tracing, which resulted in tracing frameworks like dprobes, LTT and parts of the SystemTap runtime being maintained outside the main kernel tree. But now that there is actually no shortage of tracing options in the kernel, people like Ted Ts'o have been urging the SystemTap hackers to push as much as possible upstream. Ted also encourages the developers to focus more on the kernel hackers as first-rate customers, rather than focusing exclusively on the whole system experience for production setups. The SystemTap developers have been successful in making their module support "just work" with any kernel. It currently works with kernel versions between 2.6.9 and the latest, 2.6.28; it is also regularly tested against the latest -rc kernels. But, maybe they have been a little too successful, because having this activity be more visible on the linux kernel mailing list would be good publicity. In response, there is now an active SystemTap bug called "Make upstream kernel developers happy" that calls for more frequent postings on the main kernel mailing list, improvements in the usage of debuginfo as described above, and pushing utrace and uprobes upstream first as a priority.
There is still work to do, but over the last couple of years the GNU/Linux tracing and debugging experience has kept improving. Hopefully soon, all these parts will fall into place and provide hackers with a fairly nice environment for not only debugging on development systems, but also for unobtrusive tracing on production systems.
About the author: Mark Wielaard is a Senior Software engineer at Red Hat working in the Engineering Tools group hacking on SystemTap.
| Index entries for this article | |
|---|---|
| Kernel | SystemTap |
| GuestArticles | Wielaard, Mark |
