Tracing: no shortage of options

By Jonathan Corbet
July 22, 2008

Three weeks ago, LWN looked at the renewed interest in dynamic tracing, with an emphasis on SystemTap. Tracing is a perennial presence on end-user wishlists; it remains a handy tool for companies like Sun Microsystems, which wish to show that their offerings (Solaris, for example) are superior to Linux. It is not surprising that there is a lot of interest in tracing implementations for Linux; the main surprise is that, after all this time, Linux still does not have a top-quality answer to DTrace - though, arguably, Linux had a working tracing mechanism long before DTrace made its appearance.

Even a casual reader of the kernel mailing list will have noticed that there are a lot of tracing-related patches in circulation at the moment. There are so many, in fact, that it is hard to keep track of them all. So this article will take a quick look at the code which has been posted in an attempt to make the various options a bit clearer.

SystemTap

SystemTap remains the presumptive Linux tracing solution of choice. It is hampered by a few problems, though, including usability issues, a complete lack of static trace points in the mainline kernel, and no user-space tracing capability. On the usability side, we are seeing a few more kernel developers trying to put SystemTap to work and posting about the problems they are having. If one takes as a working hypothesis the notion that, if kernel hackers cannot make SystemTap work, many other users are likely to encounter difficulties as well, then one might conclude that addressing the reported problems would be a priority for the SystemTap developers.

The SystemTap developers do seem to be interested in these reports, which is a good sign. There are other things happening in the SystemTap arena, including the release of version 0.7 on July 15. This release adds a number of new features and tapsets, and a substantial set of examples as well. Meanwhile, Anup Shan has posted an interesting integration of SystemTap and the fault injection framework, allowing tapsets to control fault injection and trace the results.

James Bottomley has been playing some with the SystemTap code; one result of that work is changes to SystemTap's internal relocation code in an attempt to make it more acceptable for mainline kernel inclusion. There can be no doubt that the out-of-tree nature of much of the SystemTap support code has made it harder for that code to progress, so any improvement which makes it more likely that some of this code will be merged is welcome.

Also by James is this patch implementing a new way to put markers into the kernel. The addition of markers (or static tracepoints) has always been problematic in that many of these markers, by their nature, need to go into some of the hottest code paths in the kernel. To support dynamic tracing, these markers need to be available on production systems, so they must work without creating any significant performance regressions. Quite a bit of work has gone into the static marker code which is in the kernel (but mostly unused) now, but some developers are still uncomfortable with putting them into performance-critical paths.

James's patch addresses these concerns by putting the tracepoints entirely outside of the code paths. Rather than add some sort of marker to the code, these markers just make a note of just where in the code the marker is supposed to be; this note is stored in a separate part of the kernel binary. That information is enough for a run-time tool to patch in an actual jump to a tracing function should somebody want to see the information from that tracepoint. An additional benefit is that these markers do not interfere with any optimizations done by the compiler. Other solutions can insert optimization barriers which, while they do make life easier for the tracing subsystem, also affect the speed of the code even when the trace points are not active.

Ftrace

The text above said that the kernel's static tracepoint code is "mostly unused." That would have been better expressed as "completely," except that the 2.6.27 kernel will include a user in the form of the ftrace framework. One of the things which makes ftrace truly unique is that its documentation was not only merged before the code itself, but well before: the 2.6.26 kernel includes the excellent Documentation/ftrace.txt file.

The ftrace (which stands for "function tracer") framework is one of the many improvements to come out of the realtime effort. Unlike SystemTap, it does not attempt to be a comprehensive, scriptable facility; ftrace is much more oriented toward simplicity. There is a set of virtual files in a debugfs directory which can be used to enable specific tracers and see the results. The function tracer after which ftrace is named simply outputs each function called in the kernel as it happens. Other tracers look at wakeup latency, events enabling and disabling interrupts and preemption, task switches, etc. As one might expect, the available information is best suited for developers working on improving realtime response in Linux. The ftrace framework makes it easy to add new tracers, though, so chances are good that other types of events will be added as developers think of things they would like to look at.

Tracepoints

The kernel markers mechanism is meant to be the way that static tracepoints are inserted into the kernel. To that end, a great deal of effort went into making these markers fast; they are, for all practical purposes, a set of no-op instructions until somebody wants to turn one on, at which point the real tracing code is patched into the running kernel. Since they were merged, however, kernel markers have been the subject of a few grumbles.

In particular, kernel markers use a somewhat awkward mechanism to ensure that any arguments passed to the tracing function are interpreted correctly there. Each marker has a printk()-style format string associated with it; that string describes the type of each "argument" (a variable or expression within the code being traced). When tracing code activates a marker, it will supply a function to be called when the marker is hit and a format string describing the arguments that the function expects. The marker code will ensure that both format strings match; otherwise the marker will not be enabled. The problem is that the format string requires extra work to write and is only approximate in its specification of the types involved. These strings can make it clear that a given argument is a pointer, for example, but they say nothing about what type is pointed to.

In response to various efforts to get around this issue, Mathieu Desnoyers (the original author of the kernel marker work) has proposed a new mechanism called tracepoints. They are another way of putting static trace points into the kernel, but with a simpler and more type-safe way of putting the pieces together.

With tracepoints, every trace point must be declared in a header file with a mildly ugly set of macros:

    #include <linux/tracepoint.h>

    DEFINE_TRACE(tracepoint_name,
                 TPPROTO(trace_function_prototype),
		 TPARGS(trace_function_args));

This definition will create a new tracepoint called tracepoint_name. Any function attached to that tracepoint must have a function prototype as provided in the TPPROTO() macro; the names of the associated arguments are provided with TPARGS().

Perhaps this is better understood with an example. The tracepoints patch set includes quite a few static points for use with the LTTng tracing toolkit. There is one called sched_wakeup which fires whenever the scheduler wakes up a process. It is defined with:

    DEFINE_TRACE(sched_wakeup,
	         TPPROTO(struct rq *rq, struct task_struct *p),
		 TPARGS(rq, p));

The actual insertion of the tracepoint is a line like this:

    trace_sched_wakeup(rq, p);

Note the trace_ prefix added to the supplied name. At this point in the code, a tracing function can be called with rq (the run queue of interest) and p (the process which is waking up) as parameters. Until an actual function is connected to the tracepoint, though, this declaration is essentially a no-op. Connection of a trace function is done through a call to:

    void my_sched_wakeup_tracer(struct rq *rq, struct task_struct *p);

    register_trace_sched_wakeup(my_sched_wakeup_tracer);

The register_trace_sched_wakeup() function (created as part of the DEFINE_TRACE() definition) will connect the supplied trace function to the tracepoint. The fact that the function prototype for the trace function is supplied as part of the tracepoint definition means that the compiler can perform thorough type checking; if the prototypes do not match up, compilation will fail. And that, in turn, should put an end to those embarrassing situations where turning on tracing causes the system to go down in flames.

Interestingly, tracepoints have dispensed with much of the mechanism developed to minimize the runtime impact of kernel markers; in particular, they do not use the "immediate values" code. Profiling has shown that the performance impact of tracepoints is so low that there is little value in the added complexity of runtime patching of kernel code. Still, there are signs that some kernel developers will object to the addition of tracepoints in their current form. Developers want tracing support - but not at the cost of slower performance, even if that cost is hard to measure.

Tracehook

Finally, Roland McGrath recently surfaced with the tracehook patch set. Tracehook has a rather different focus; it is, essentially, a cleanup of the way the kernel handles the ptrace() system call. The tracehook patches try to organize all of the process tracing code (much of which is architecture-dependent) into one place where it can be dealt with as a unit.

Tracehook is meant to be a first step toward the merging of a new version of the utrace code. Utrace has long been planned as the successor to the current ptrace() implementation, which has few admirers. But utrace has encountered a number of difficulties, so its path into the kernel has been slow. It disappeared from the lists entirely for a while, but a new version of the patches is said to be coming soon; Roland notes that he expects "some vigorous feedback" when that happens.

The real importance of the ptrace() rework is that it is the path toward integrated tracing of kernel- and user-space events. And that, of course, is one of the biggest features offered by DTrace which is not yet available in SystemTap. Getting user-space tracing into the kernel - especially if it could work with the tracepoints already being inserted into some applications for DTrace - would be a major step forward for Linux. A lot of people will be watching when this patch set comes around again.

Meanwhile, Roland would like to see the tracehook code merged for 2.6.27. He is late to the party, though, and this code has not done any time in linux-next. So it is not yet clear whether tracehook will go in before the merge window closes, or whether, instead, it will have to wait for 2.6.28.

In summary...

As can be seen, there is a lot happening in the area of tracing support for Linux. Tracing, it seems, is an idea whose time has come, at last. If the pieces described here can be merged and integrated into a unified framework, and if it can all be made sufficiently easy to use, the time for "DTrace envy" will come to an end. Those "ifs" are not small ones, though. There is quite a bit of work to be done yet; hopefully the current level of energy will remain until the job is done.

Index entries for this article
Kernel	Development tools/Kernel tracing
Kernel	SystemTap
Kernel	Tracing

Tracing: no shortage of options

Posted Jul 22, 2008 18:59 UTC (Tue) by fuhchee (guest, #40059) [Link]

> James Bottomley has been playing some with the SystemTap code; one result 
> of that work is changes to SystemTap's internal relocation code in an 
> attempt to make it more acceptable for mainline kernel inclusion.

The part of systemtap code that James is proposing to modify lives
entirely in user-space, so mainline kernel inclusion of that part
hasn't even come up.

Tracing: no shortage of options

Posted Jul 22, 2008 21:40 UTC (Tue) by bronson (subscriber, #4806) [Link] (4 responses)

Also, Paul Fox is porting DTrace to Linux: http://www.crisp.demon.co.uk/blog/index.html  I
haven't tried it yet but I'm sure I will soon.

Tracing: no shortage of options

Posted Jul 22, 2008 23:57 UTC (Tue) by SEJeff (guest, #51588) [Link] (3 responses)

DTrace on Linux is basicly a noop due to licensing conflict. Now what would happen if someone
wrote a syntax compatible "dtrace interpreter" that used systemtap and utrace, once it comes
out?

That would be a win/win because you have 1 language, D, for tracing events on two of the most
popular posix platforms.

Tracing: no shortage of options

Posted Jul 23, 2008 4:05 UTC (Wed) by willy (subscriber, #9762) [Link]

Four -- Linux, Solaris, MacOS and FreeBSD.  The other three already use DTrace.

Tracing: no shortage of options

Posted Jul 23, 2008 12:32 UTC (Wed) by fuhchee (guest, #40059) [Link] (1 responses)

It might not be hard for systemtap to parse dtrace's smaller scripting
language, but there's more to it than that.  A variety of dtrace
kernel and userspace features have no equivalent yet, so many real
dtrace scripts would not work.  Plus, may dtrace scripts use
type names/structs/fields that do not correspond to those in
linux, so one would need to change those scripts or create a
solaris emulation layer someplace.

Tracing: no shortage of options

Posted Jul 31, 2008 9:13 UTC (Thu) by renox (guest, #23785) [Link]

[[Plus, may dtrace scripts use type names/structs/fields that do not correspond to those in
linux, so one would need to change those scripts or create a solaris emulation layer
someplace.]]

Those issue are the same for FreeBSD, MacOS X, no?
And AFAIK they have not made a Solaris emulation layer, so Solaris-specific dtrace's script
won't work here either..

A compatibility layer with dtrace portable scripts would be nice, but compatibility with
OS-specific dtrace scripts doesn't matter much IMHO.

Why trace the kernel?

Posted Jul 24, 2008 23:26 UTC (Thu) by pr1268 (guest, #24648) [Link] (7 responses)

Forgive me for sounding ignorant, but why all the fuss about tracing the kernel? And more specifically, why is there DTrace "envy" (referring to a previous week's LWN article)?

My impressions are that in the Linux kernel, the code is already well-optimized and (relatively) error-free, so why is there such a demand as of late to insert all these probes like stuffing pins in a pincushion? Granted, I'm familiar with using tools like strace(1) (and similar tools, both open-source and proprietary), but I'm still unsure why this has become such an important issue with regards to the kernel lately.

And, what's the envy all about over Sun Microsystems' DTrace? Linux kernel devs are certainly capable of creating their own tracing tools (as this article explains).

Thanks in advance for any comments--I'm just trying to feed my curiosity by asking here.

Why trace the kernel?

Posted Jul 25, 2008 1:33 UTC (Fri) by corbet (editor, #1) [Link] (6 responses)

This kind of tracing is wanted by people trying to track down problems which happen in production environments. You need to see why the system is behaving poorly (or crashing) while operating under its real workload. That requires the ability to hook into the system almost anywhere without perturbing its operation. Tracing can be nice for normal kernel debugging, but it's the folks wondering why their monster trading systems are bogging down that really want it.

The "DTrace envy" is there (1) because Sun has been using it as one of its primary anti-Linux marketing tools, and (2) because what we have isn't as good. We'll get there.

Why trace the kernel?

Posted Jul 29, 2008 21:01 UTC (Tue) by oak (guest, #2786) [Link] (5 responses)

Another point is that unlike strace, SystemTap's system-wide[1], and being 
kernel side instead of going through the fragile (threads/signals...) & 
slow ptrace() interface it can be much faster, especially after it starts 
using the static markers / tracepoints.

[1] One can attach strace to multiple processes, but it really slows down 
the system and is pretty inconvenient (ptrace has side-effects).  If one 
wants to do full system tracing, currently LTT is much better alternative 
than SystemTap though due to its much smaller overhead (especially with 
multiple probes).  LTT and SystemTap are a bit for different purposes 
though.

Currently neither LTT nor SystemTap provide user-space tracing like DTrace 
does, but I think Frysk does some of that.  Frysk doesn't do kernel 
tracing though and it would also profit from utrace as currently it uses 
ptrace().  (Note: ltrace tool is pretty useless for user-space tracing as 
it doesn't trace library->library calls and it doesn't support dlopen() 
which is used almost by anything a bit more complicated on Linux desktop).

Once one has system wide monitoring, one could have something like this:
- http://idea.opensuse.org/content/ideas/filemon-like-syste...
- Or the RedHat Google SOC about getting BootChart type of visuals from 
SystemTap...

PS. This is not just DTrace envy, it would be nice to have a GUIs like 
these:
http://developer.apple.com/tools/performance/optimizingwi...
http://developer.apple.com/documentation/DeveloperTools/C...
(Instruments uses DTrace, Shark doesn't) :-)


Btw. This is more about performance analysis framework stuff than tracing, 
but what is happening with perfmon2?  Is that going to be used e.g. by 
LTT?

Why trace the kernel?

Posted Jul 30, 2008 1:05 UTC (Wed) by nix (subscriber, #2304) [Link] (2 responses)

Regarding your ltrace comments, yes, it's a kludge. Thankfully glibc 
provides LD_AUDIT which is much more capable, because the dynamic linker 
really can tell what dynamic calls are being made. It would be even better 
if there was even the slightest hint of documentation about it, but 
thankfully we don't need that as Jiri Olsa has done it for us: 
<http://latrace.sf.net/>.

Why trace the kernel?

Posted Jul 30, 2008 14:37 UTC (Wed) by oak (guest, #2786) [Link] (1 responses)

Thanks, this looks useful!

(Although dynamic linker based things are faster than ptrace, they require 
restarting the monitored process to enabled the tracing so they are not 
very suitable for ad-hoc system monitoring.)

Why trace the kernel?

Posted Jul 30, 2008 22:22 UTC (Wed) by nix (subscriber, #2304) [Link]

That's true. (Particularly true in this case, where quite a bit of the 
magic is handled at ld.so startup time.)

Why trace the kernel?

Posted Aug 5, 2008 17:01 UTC (Tue) by compudj (subscriber, #43335) [Link] (1 responses)

Currently, LTTng implements support for user-space markers on x86 32 and 64 bits. It's a bit
slow, since it goes through a system call each time an event record must be recorded, and the
API is subject to change, but one can currently add markers to their userspace program or
library. See the package : http://ltt.polymtl.ca/packages/markers-userspace-0.5.tar.bz2. It
depends on the LTTng patchset to enable/disable markers and to record data. The tarball
contains examples telling how to modify the makefiles and linker scripts to use markers in
userspace.

Being the LTTng project lead, I dream about a simple in-kernel API to manage the performance
counters, which would aim at managing these limited resources for the various users (watchdog,
user-space perfmon-like API, in-kernel LTTng). The tracer is itself easily extensible and can
record new events which include performance counters either in an interrupt mode or at
specific events occuring on the system (system call entry/exit, interrupt handler entry/exit,
trap entry/exit...). I just need something to setup these counters and let them run free on
the system without changing them when switching from one task to another : this is something
really annoying when gathering system-wide information. I haven't looked at the perfmon code
lately, but I think that most of the user-space system call API is useless to an in-kernel
user like LTTng. A minimalistic perfmon would be welcome.

oprofile

Posted Mar 18, 2009 5:57 UTC (Wed) by xoddam (subscriber, #2322) [Link]

Forgive me if it's an obtuse question, but if it's performance metrics you're after and not investigating a particularly thorny race condition, isn't oprofile enough?