A SystemTap update

January 21, 2009

This article was contributed by Mark Wielaard

SystemTap has been under active development for some years, and more than 35 people have contributed enhancements in the last year. But newer developments, like the ability to dynamically trace user space programs, seem to have been introduced very quietly and thus have not always been noticed by people who are not yet using SystemTap extensively. So this article will take a look at what currently works out of the box, what that box should contain to make things work, the work in progress, and the challenges SystemTap faces in becoming more powerful and gaining more widespread adoption.

SystemTap's goal is to provide full system observability on production systems in a way that is safe, non-intrusive, and (near) zero-overhead, and that allows ubiquitous data collection across the whole system for any interesting event that could happen. To achieve this goal, SystemTap defines the stap language, in which the user defines probes, actions, and data acquisition. The SystemTap translator and runtime guarantee that probe points are only placed at safe locations and that probe functions cannot generate too much overhead when collecting data. For dynamic probes on addresses inside the kernel, SystemTap uses kprobes; for dynamic probes in user space programs, it uses the user space cousin of kprobes, uprobes. This provides a unified way of probing and then collecting data for observing the whole system. To dynamically find locations for probe points, the arguments of the probed functions, and the variables in scope at the probe point, SystemTap uses the standard DWARF debugging information (debuginfo) that the compiler generates.

So, to provide an ideal setting for using SystemTap, GNU/Linux distributions should provide easy access to debuginfo for the kernel and user space programs. Almost all distributors do this. The kernel should also support kprobes, which has been in the upstream kernel for some years, and uprobes, which comes with (and is automatically loaded by) SystemTap but relies on the full utrace framework, which isn't yet in the mainline kernel. (The latest few releases of the Fedora family, including Red Hat Enterprise Linux and CentOS, do include full utrace support by default.) SystemTap works without debuginfo, but the range of probes and the amount of data you can collect are then very limited. It also works without utrace support, but then you won't be able to do deep user space probing; you can only observe direct user/kernel space interactions.

There are various probe variants one can use with SystemTap, but the most interesting ones are the debuginfo-based probes for the kernel, kernel modules, and user space applications. These can use function, statement or return variants, and wildcards, such as:

kernel.function("rpc_new_task"): a named kernel function,
process("/bin/ls").function("*"): any function entry in a specific process,
module("usb*").function("*sync*").return: every return of a function whose name contains the word sync, in any module whose name starts with usb, or
kernel.statement("bio_init@fs/bio.c+3"): a specific statement in a particular file

Depending on the type of probe, one can access specifics of the probe point. For the debuginfo-based probes these are $var for in-scope variables or function arguments, $var->field for accessing structure fields, $var[N] for array elements, $return for the return value of a function in a return probe, and meta variables like $$vars to get a string representation of all the in-scope variables at a particular probe point. All accesses to such constructs are safeguarded by the SystemTap runtime to make sure no illegal accesses can occur.
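As a minimal sketch of these constructs, the script below probes the kernel's vfs_read() function. It assumes that kernel debuginfo is installed and that the function's parameters are named file and count in the running kernel; those names are assumptions that depend on the kernel version, so adjust them to whatever $$parms reports on your system.

```systemtap
# Entry probe: read two arguments by name and follow a pointer to a
# structure field. The names file and count are assumptions about how
# vfs_read() is declared in the running kernel.
probe kernel.function("vfs_read")
{
  printf("%s (%d): count=%d f_flags=%d\n",
         execname(), pid(), $count, $file->f_flags)
}

# Return probe: $return holds the return value of the probed function.
probe kernel.function("vfs_read").return
{
  if ($return < 0)
    printf("%s (%d): vfs_read failed with %d\n",
           execname(), pid(), $return)
}
```

On a busy system the entry probe prints a line for every read, so in practice one would probably filter on pid() or execname() first.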

Given that one has the debuginfo of a program installed, one can easily get a simple call trace of a specific program, including all function parameters and return values with the following stap script:

  probe process("/bin/ls").function("*").call
  {
    printf("=>%s(%s)\n", probefunc(), $$parms);
  }

  probe process("/bin/ls").function("*").return
  {
    printf("<=%s:%s\n", probefunc(), $$return);
  }

The examples included with SystemTap come with much more powerful versions that show timed, per-thread call graphs, optionally showing only children of a particular function call.
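The core idea behind such call graphs can be sketched with the thread_indent() tapset function. This is a simplified illustration rather than one of the shipped example scripts, and it again assumes debuginfo for /bin/ls is available.

```systemtap
# thread_indent(1) returns a per-thread prefix (relative timestamp,
# process name and thread id) and increases the nesting depth;
# thread_indent(-1) decreases the depth again.
probe process("/bin/ls").function("*").call
{
  printf("%s-> %s\n", thread_indent(1), probefunc())
}

probe process("/bin/ls").function("*").return
{
  printf("%s<- %s\n", thread_indent(-1), probefunc())
}
```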

While these probing and data extraction constructs are powerful, they do require some knowledge of the kernel or program code base. Since you are often interested in what is happening and not precisely how, SystemTap comes with "tapsets," which are utility functions and aliases for groups of interesting probes in a particular subsystem. Examples include system calls, NFS operations, signals, sockets, etc. Currently these tapsets are distributed with SystemTap itself, but ideally each program or subsystem would come with its own tapset of interesting events provided by the program or subsystem maintainer.
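For instance, the system call tapset lets a script watch file opens without knowing which kernel function implements them. The sketch below assumes the syscall.open alias and its name and argstr variables as provided by the tapset shipped with SystemTap; whether a given open reaches this particular system call depends on the kernel and C library in use.

```systemtap
# syscall.open is a probe alias from the syscall tapset; name and argstr
# are variables that the alias sets up for the script.
probe syscall.open
{
  printf("%s (%d): %s(%s)\n", execname(), pid(), name, argstr)
}
```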

Just printing out events while they occur is not always ideal. First, you may be overwhelmed by the volume of the output; second, you might only be interested in a specific subset of the same event (only certain parameters, only calls that take longer than a specific time, only from the process that does the most calls over a specific time frame, etc.). Finally, processing all the events on your production system might interfere with the thing you are trying to observe. Especially at the start of your investigations, when you might not yet be sure what the interesting events are, you may do some very wide probing to see what is going on.

For this reason the stap language supports variables that can be used as associative arrays, simple control structures and data aggregation functions to do simple statistics during probe time, with very low overhead and without having to call external programs that might interfere with the system being probed.

The following script might be how you would start investigating a problem involving a system which seems to do an excessive amount of reads. It uses the "vfs" tapset and an associative array to store the number of reads a particular executable with a specific process ID does:

  global totals;
  probe vfs.read
  {
    totals[execname(), pid()]++
  }

  probe end
  {
    printf("== totals ==\n")
    foreach ([name,pid] in totals-)
      printf("%s (%d): %d \n", name, pid, totals[name,pid])
  }

This will give you a list of executables and their pids, sorted by the total number of vfs reads done while the script was running. These facilities in the stap language help greatly to minimize any overhead of the tracing framework. If you tried to do the same thing by just printing each vfs event and then post-processing the results with Perl, you might end up with Perl itself being the process doing the most vfs calls. Or worse, by having to parse megabytes of trace data, Perl might start thrashing the system even more, making it harder to determine the root cause of the original problem.
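The data aggregation functions mentioned earlier go one step further than plain counters. The following sketch, which is not taken from the SystemTap examples, records the size of every successful read in a statistical aggregate per executable; the probed function name vfs_read and the ten-second run time are assumptions made for illustration.

```systemtap
global sizes

# Feed the size of every successful read into a per-executable aggregate.
probe kernel.function("vfs_read").return
{
  if ($return > 0)
    sizes[execname()] <<< $return
}

# Stop after ten seconds (an arbitrary interval chosen for this sketch).
probe timer.s(10)
{
  exit()
}

# Report the ten executables with the most reads.
probe end
{
  foreach (name in sizes- limit 10)
    printf("%s: %d reads, %d bytes total, avg %d, max %d\n", name,
           @count(sizes[name]), @sum(sizes[name]),
           @avg(sizes[name]), @max(sizes[name]))
}
```

Only the summary lines leave the kernel, so the amount of output stays small no matter how many reads happen while the script runs.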

SystemTap now also supports static markers in the kernel. This allows subsystem maintainers to mark specific events as interesting, providing a format string of the arguments to the event that can be easily parsed by tracing tools. The advantage of static markers over tapsets is that they are in-code and so might be easier to maintain, though you probably still want to have an associated tapset for utilities to nicely format the arguments or associate various markers with each other. Also, they can work without needing any DWARF debuginfo around, but you lose the ability to inspect local variables or function parameters not passed to the marker. You use them with a probe like:

    probe kernel.mark("kernel_sched_wakeup")

The tapset can then access the arguments through $argN and get the argument format string of the marker with $format.
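A complete script built around that probe could look like the sketch below, which only relies on $format and so does not need to know the types of the individual $argN values. Whether the kernel_sched_wakeup marker exists depends on the kernel being traced.

```systemtap
global hits

# Count wakeup marker hits, keyed by the marker's argument format string.
probe kernel.mark("kernel_sched_wakeup")
{
  hits[$format]++
}

probe end
{
  foreach (fmt in hits)
    printf("%d hits, format: %s\n", hits[fmt], fmt)
}
```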

An alternate way of adding static markers to the kernel, tracepoints, is not yet directly supported in SystemTap. Tracepoints have the disadvantage that they require the DWARF debuginfo to be around because they don't currently specify the types of their arguments except through their function prototypes. So SystemTap can currently only use tracepoints via hand-written intermediary code that maps them to markers.

The development version of SystemTap recently got support for user-space static markers. Although SystemTap defines its own STAP_PROBE macros for use in applications that want to add static markers, there is also an alternative tracing tool, DTrace, that has its own way for programs to embed static markers. SystemTap supports the convention used by DTrace by providing an alternative include file and build preprocessor, so that programs using DTRACE_PROBE macros can be compiled as if for DTrace and have their static markers show up with SystemTap.

Luckily, there are various programs that already have such markers defined. For example PostgreSQL has various static markers to trace higher-level events like transactions and database locks. Currently one has to adapt the build process of such programs by hand, but the next version of SystemTap will come with scripts that will automate that process.

While SystemTap works well on GNU/Linux distributions that support it, there are a couple of challenges to overcome to make it more ubiquitous and easier for more people to use out of the box. This goes beyond work on the SystemTap code base itself. Since the goal is to provide full system observability, from low-level kernel events to high-level application events, there is work to be done all across the GNU/Linux stack. Also needed is better integration into more distributions, providing default installation of SystemTap and tapsets, easy access to debuginfo for deep inspection, binaries compiled with marker support for high-level events, etc. The two main challenges to make SystemTap more powerful and easier to use on any distribution are debuginfo and better kernel support.

A lot of the power of SystemTap comes from the fact that it can use DWARF debuginfo from the kernel and applications to do very detailed inspection. But this power comes at a price, since the debuginfo is often large. For example, on Fedora, the kernel debuginfo package is far larger than the kernel package itself. One easy win will be to split the debuginfo package into the DWARF files and the source files, which are needed for a debugger, but not directly for a tracer like SystemTap. Fedora plans to do this for its next release. The elfutils team is also working on a framework for DWARF transformation and compression that could be used as a post-processor on the output of the compiler.

SystemTap sometimes suffers from the same issues you might have with a debugger: the compiler has optimized the code, but forgot where it put a certain variable after the optimization. Of course this is always the variable you are most interested in. Alexandre Oliva is working on improving the local variable debug information in GCC. His variable tracking assignments branch in GCC aims to improve debug information by annotating assignments early in the compilation process and carrying such annotations through all optimization passes, so that you can always accurately track variables, even in optimized code.

Finally, there is work being done on having a SystemTap "client and server" that could be used on production systems where you might not even want to have any tools or debuginfo installed. You can then set up a development client that has the same configuration as the production system with the addition of the SystemTap translator and all debuginfo, and create and test your scripts there. The final result of this work could then be used on the bare-bones production server.

Most of the SystemTap runtime, like the kprobes support, is maintained in the upstream Linux kernel, but there is some stuff still missing. This leads to distributions having to add patches to their kernel, especially to support user space tracing. In particular, the utrace framework is still not upstream. Over the last few kernel releases, various parts have been merged, including the utrace user_regset framework, which creates an interface for code accessing the user-space view of any machine-specific state, and the tracehook work, which provides a framework for all the user process tracing. The actual utrace framework sits on top of these components; the ptrace() interface is implemented as a utrace client. Anything that changes the ptrace implementation is hairy stuff, so there is a large ptrace testsuite to make sure that nothing breaks. One idea under consideration is to push utrace upstream in two installments. At first, using utrace or ptrace on a process would be mutually exclusive. That could pave the path to get pure utrace upstream first and then do proper ptrace cooperation in a second go.

This approach would also provide the way for uprobes, which depends on the utrace framework, to be submitted upstream. Uprobes components such as breakpoint insertion and removal and the single-stepping infrastructure are also potentially useful for other user space tracers and debuggers. As with utrace, one idea is to factor out these portions of uprobes so that they can be used by multiple clients as a shared user-space breakpoint support (ubs) layer. With multiple clients using the same layer, upstream acceptance might be easier.

One candidate for using both the utrace and the uprobes layer besides SystemTap is Froggy, which provides an alternative debugger interface to ptrace. The GDB Archer project would like to serve as a testbed for Froggy, which they hope will also make GDB more robust when linked with libpython, which is being used for GDB scripting.

In the past, kernel maintainers were skeptical about tracing, which resulted in tracing frameworks like dprobes, LTT and parts of the SystemTap runtime being maintained outside the main kernel tree. But now that there is actually no shortage of tracing options in the kernel, people like Ted Ts'o have been urging the SystemTap hackers to push as much as possible upstream. Ted also encourages the developers to focus more on the kernel hackers as first-rate customers, rather than focusing exclusively on the whole system experience for production setups. The SystemTap developers have been successful in making their module support "just work" with any kernel. It currently works with kernel versions between 2.6.9 and the latest, 2.6.28; it is also regularly tested against the latest -rc kernels. But maybe they have been a little too successful, because having this activity be more visible on the Linux kernel mailing list would be good publicity. In response, there is now an active SystemTap bug called "Make upstream kernel developers happy" that calls for more frequent postings on the main kernel mailing list, improvements in the usage of debuginfo as described above, and pushing utrace and uprobes upstream as a first priority.

There is still work to do, but over the last couple of years the GNU/Linux tracing and debugging experience has kept improving. Hopefully soon, all these parts will fall into place and provide hackers with a fairly nice environment for not only debugging on development systems, but also for unobtrusive tracing on production systems.

About the author: Mark Wielaard is a Senior Software Engineer at Red Hat, working in the Engineering Tools group and hacking on SystemTap.


A SystemTap update

Posted Jan 29, 2009 4:04 UTC (Thu) by akpm (guest, #4826)

um, which genius decided to make systemtap dependent upon two large
kernel patches (utrace and uprobes) which have dim-to-zero prospects
of ever being included in Linux?

A SystemTap update

Posted Jan 29, 2009 8:23 UTC (Thu) by eugeniy (guest, #24280)

SystemTap is not dependent on any patches, it works fine with unpatched kernel.

A SystemTap update

Posted Jan 29, 2009 10:08 UTC (Thu) by ctg (guest, #3459)

Me and a colleague were discussing just last night the thorny problem of how we work out which of many competing processes are using disk access - causing contention on the disk, so that everything queues up, and more and more disk access is caused... not really a lot of tools in Linux to do that... (If we knew the worst offenders, then we could focus our effort on making them more efficient - having to put instrumentation in each process is really time consuming).

.. so reading this article was timely. Looks like systemtap would enable us to quickly home in on the big disk users..

.. the article quite clearly states that to get the best out of systemtap you need these patches, so when Mr Morton himself makes this sort of criticism, then it's a bit of a concern.

Despite all that, I'm off to look at systemtap in a bit more detail (its lack of ubiquity has put me off before), but the lack of decent tools for working out what is really going on in a complex system is pretty frustrating (I'm still suffering from the lack of the "W" flag in the output of ps(1) to show which processes are swapped out - I understand why it doesn't show that any more - but when your system goes into swap, it's useful to see which processes are being paged out.. I suspect systemtap might be able to help with this too).

A SystemTap update

Posted Jan 29, 2009 10:42 UTC (Thu) by mjw (subscriber, #16740)

The vfs tapset example in the article works without needing any additional user space hook patches.

Also take a look at some of the examples that come with Systemtap. disktop.stp probably does what you want:
http://sourceware.org/systemtap/examples/keyword-index.ht...

A SystemTap update

Posted Jan 29, 2009 17:05 UTC (Thu) by knobunc (guest, #4678)

A SystemTap update

Posted Jan 29, 2009 20:07 UTC (Thu) by epb205 (guest, #50182)

Why doesn't ps show which processes are swapped out anymore? Is that somehow a security hole?

A SystemTap update

Posted Feb 3, 2009 22:33 UTC (Tue) by oak (guest, #2786)

No idea, but you can also get the same information from /proc/PID/smaps.
It's reported separately for each memory mapping the process has, i.e. you
may need to write a small script to process the data.

If the process has stuff that's marked as swapped, but no longer as
dirty, it's completely swapped out. For some reason the kernel's smaps
output no longer counts swapped pages as dirty, which loses the distinction
between shared dirty and private dirty that smaps shows for pages still in
RAM.

A SystemTap update

Posted Jan 30, 2009 3:15 UTC (Fri) by SEJeff (guest, #51588)

Actually the block_dump feature of modern 2.6 linux kernels will show you which processes are writing to which devices. I wrote a proof of concept script to show them:

http://www.digitalprognosis.com/opensource/scripts/top-di...

The output looks like this:
root@desktopmonster:~# ./top-disk-users
COMMAND PID NUM ACTION DEVICE
banshee-1 23999 8 READ sda9
kjournald 2494 131 WRITE sda5
kjournald 5182 5 WRITE sda8
pdflush 228 15 WRITE sda5
pdflush 228 1 WRITE sda8
pdflush 228 32 WRITE sda9

A SystemTap update

Posted Jan 29, 2009 12:46 UTC (Thu) by eugeniy (guest, #24280)

Correction: patches are not needed for probing kernel. It looks like for userspace utrace is required. uprobes source, it seems, is included in systemtap.

A SystemTap update

Posted Jan 29, 2009 10:38 UTC (Thu) by mjw (subscriber, #16740)

As the article states there is no hard dependency, they are just used for deeper user space probing if wanted. And some of the utrace foundations have been going in, with the groundwork now upstream.

The last part of the article gives some idea of the ways people are working on getting this functionality upstream faster, so that it is included with more distributions by default: by splitting it up, providing other users, etc. One recent example is the utrace->ftrace engine proof of concept: http://lkml.org/lkml/2009/1/27/294

If you have any hints and tips for getting these things, or similar user space hooks that Systemtap can use, upstream faster that would be appreciated.

A SystemTap update

Posted Jan 29, 2009 13:28 UTC (Thu) by fuhchee (guest, #40059)

> [why is] systemtap dependent upon two large
> kernel patches (utrace and uprobes)

For probing user-space, there is apprx. no alternative: one needs a
kprobes-like infrastructure.

> which have dim-to-zero prospects of ever being included in Linux?

While skepticism may be warranted, we are making efforts to make this
code more palatable to the gatekeepers.

A SystemTap update

Posted Jan 29, 2009 16:15 UTC (Thu) by jejb (subscriber, #6654)

Actually, only the user space tracing aspect of systemtap is dependent on these. You can still do kernel space tracing without them.

We've spent quite a lot of effort explaining the problems with the utrace/uprobes dependency (especially the issues of having to pull the process symbol table into the kernel and of having the kernel actually execute the compiled code to do the traps). There is hope that we might be able to go with a lighter weight infrastructure that simply vectors traps to the user space stap runtime and does all the interpreting in user space. It's just that we still haven't quite got systemtap buy-in yet.

A SystemTap update

Posted Jan 29, 2009 16:36 UTC (Thu) by fuhchee (guest, #40059)

> We've spent quite a lot of effort explaining the problems with the
> utrace/uprobes dependency

Can you provide some links to discussion about these specifics: ?

> (especially the issues of having to pull the
> process symbol table into the kernel

User-space symbol tables are made available to the systemtap module
only if it is required by the script - if it performs symbolic
address or backtrace type lookups.

> and of having the kernel actually
> execute the compiled code to do the traps

Like in dtrace, instrumentation is run within the kernel because
having user-space processes instrument each other is too disruptive.
We're looking for microsecond-level probe effect, not something
involving multiple context switches, indirect address space accesses,
and so on.

A SystemTap update

Posted Jan 29, 2009 16:55 UTC (Thu) by jejb (subscriber, #6654)

>> We've spent quite a lot of effort explaining the problems with the
>> utrace/uprobes dependency
>
> Can you provide some links to discussion about these specifics: ?

Um, just use a search ... if you search lkml for utrace you get the less polite version .. if you search the systemtap lists on the same thing, you get the more polite one.

>> (especially the issues of having to pull the
>> process symbol table into the kernel
>
> User-space symbol tables are made available to the systemtap module
> only if it is required by the script - if it performs symbolic
> address or backtrace type lookups.

Only if you buy the premise that the kernel has to be intimately involved in the trace instead of being a simple conduit for mediating it.

>> and of having the kernel actually
>> execute the compiled code to do the traps
>
> Like in dtrace, instrumentation is run within the kernel because
> having user-space processes instrument each other is too disruptive.
> We're looking for microsecond-level probe effect, not something
> involving multiple context switches, indirect address space accesses,
> and so on.

Well, this would be the classic illustration of the problems systemtap faces. Nothing on the above laundry list is impossible even if the kernel merely controls the traced process and lets userspace poke at it ... that, after all, is how gdb works. The brick wall is that kernel developers don't think this is at all a compelling argument and apparently systemtap people think it is.

A SystemTap update

Posted Jan 29, 2009 17:26 UTC (Thu) by fuhchee (guest, #40059)

> Um, just use a search

I asked because I recall no serious debate about the two specific items ("process symbol tables in the kernel" and "having kernel ... execute code ... to do the traps") you listed. Please humor fellow readers and give some links.

> > User-space symbol tables are made available to the systemtap module
> > only if it is required by the script

> Only if you buy the premise that the kernel has to be intimately involved
> in the trace instead of being a simple conduit for mediating it.

There are many possible details behind such a summary. If one wants dtrace-level introspection and manipulation, never mind going beyond it, some "intimate involvement" (kernel-side processing?) is necessary. Merely "mediating" (data copying?) is not sufficient, since the choice of data and the nature of the programmed reaction is itself variable.

> [...] that, after all, is how gdb works. [...]

The work involved in how gdb does its thing is several orders of magnitude heavier.

> The brick wall is that kernel developers don't think this is at all a
> compelling argument and apparently systemtap people think it is.

Individual kernel people don't need to buy into every argument for systemtap to bloom. We have promoted numerous "dual-use" kernel-side technologies that can stand on their own feet. For example, with utrace, if you believe that user-space instrumentation is plausible, you should support utrace and forthcoming ("froggy" or "ubs"-like) layers on top, for dispatching those events to a hypothetical user-space handler.

The details deserve more in-depth discussion.

A SystemTap update

Posted Feb 3, 2009 22:41 UTC (Tue) by oak (guest, #2786)

> Well, this would be the classic illustration of the problems systemtap
> faces. Nothing on the above laundry list is impossible even if the kernel
> merely controls the traced process and lets userspace poke at it ... that,
> after all, is how gdb works.

As to why to do it in the kernel... Doing it from user space is just too slow.
Try e.g. getting backtraces for mallocs through ptrace and you'll notice how
infeasible this is from user space (at least through the interface ptrace
offers). With modern desktop apps that use malloc pretty heavily, the
programs become unusably slow (and besides their usability, their
functionality may also suffer if they use timeouts for responses etc).

A SystemTap update

Posted Jan 30, 2009 10:31 UTC (Fri) by mjw (subscriber, #16740)

> Actually, only the user space tracing aspect of systemtap is dependent on these. You can still do kernel space tracing without them.

Correct.

> We've spent quite a lot of effort explaining the problems with the utrace/uprobes dependency (especially the issues of having to pull the process symbol table into the kernel and of having the kernel actually execute the compiled code to do the traps).

Could you post the problems you see?

How a tracing tool like systemtap processes and uses the symbol table is kind of orthogonal from utrace and uprobes. utrace and uprobes might make it easier to access them during runtime. But that isn't what Systemtap currently does. If you want a tracer to do these things dynamically at trace event time, or even push the whole thing towards user space in reaction to trace events and hand it off to a user space helper then that is certainly a design choice you can make (unlike tracers, debuggers do this for example since they don't mind suspending the tracee for a longer period). The article does hint at why "offloading" this to a user space helper might not be practical (see the vfs example and the explanation of what might happen if you try to offload something like that to a perl script). But those are tradeoffs you can make independent of the infrastructure you use in the kernel to handle events and trace point insertion.

> There is hope that we might be able to go with a lighter weight infrastructure that simply vectors traps to the user space stap runtime and does all the interpreting in user space.

Yes, there is nothing inherent in utrace or uprobes about how you handle trace events or how you use and insert vector traps into user space. That is the basic idea behind pushing them upstream, because they are useful apart from systemtap. They should also be useful for other tracers like connecting them to ftrace or lttng. You could even use them for a new debugger interface if you aren't interested in a no-overhead tracer. That is what the froggy project is exploring. It seems time to provide something better than the ptrace interface for debuggers.

"Client and server"

Posted Jan 30, 2009 23:25 UTC (Fri) by saffroy (guest, #43999)

I'm really glad to see that the idea of the "client and server" to debug constrained platforms has caught on. Systemtap, even when "only" used for kernel debugging, is way too good to be only for big servers where you can install large debug RPMs. Embedded platform developers will bless such a tool when it becomes available!