On the value of static tracepoints

By Jonathan Corbet
April 28, 2009

As has been well publicized by now, the Linux kernel lacks the sort of tracing features which can be found in certain other Unix-like kernels. That gap is not the result of a want of trying. In the past, developers trying to put tracing infrastructure into the kernel have often run into a number of challenges, including opposition from their colleagues who do not see the value of that infrastructure and resent its perceived overhead. More recently, it would seem that the increased interest in tracing has helped developers to overcome some of those objections; an ongoing discussion shows, though, that concerns about tracing are still alive and have the potential to derail the addition of tracing facilities to the kernel.

Sun's DTrace is famously a dynamic tracing facility, meaning that it can be used to insert tracepoints at (almost) any location in the kernel. But the Solaris kernel also comes with an extensive and well-documented set of static tracepoints which can be activated by name. These tracepoints have been placed at carefully-considered locations which facilitate investigations into what the kernel is actually doing. Many real-world DTrace scripts need only the static tracepoints and do no dynamic tracepoint insertion at all.

There is clear value in these static tracepoints. They represent the wisdom of the developers who (presumably) are the most familiar with each kernel subsystem. System administrators can use them to extract a great deal of useful information without having to know the code in question. Properly-placed static tracepoints bring a significant amount of transparency to the kernel. As tracing capabilities in Linux improve, developers naturally want to provide a similar set of static tracepoints. The fact that static tracing is reasonably well supported (via Ftrace) in mainline kernels - with more extensive support available via SystemTap and LTTng - also encourages the creation of static tracepoints. As a result, there have been recent patches adding tracepoints to workqueues and some core memory management functions, among others.

Digression: static tracepoints

As an aside, it's worth looking at the form these tracepoints take; the design of Linux tracepoints gives a perspective on the problems they were intended to solve. As an example, consider the following tracepoint for the memory management code, which reports on page allocations. The declaration of the tracepoint looks like this:

    #include <linux/tracepoint.h>
  
    TRACE_EVENT(mm_page_allocation,

	TP_PROTO(unsigned long pfn, unsigned long free),

	TP_ARGS(pfn, free),

	TP_STRUCT__entry(
		__field(unsigned long, pfn)
		__field(unsigned long, free)
	),

	TP_fast_assign(
		__entry->pfn = pfn;
		__entry->free = free;
	),

	TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)
	);

That seems like a lot of boilerplate for what is, in a sense, a switchable printk() call. But, naturally, there is a reason for each piece. The TRACE_EVENT() macro declares a tracepoint - this one is called mm_page_allocation - but does not yet instantiate it in the code. The tracepoint has arguments which are passed to it at its actual instantiation (which we'll get to below); they are declared fully in the TP_PROTO() macro and named in the TP_ARGS() macro. Essentially, TP_PROTO() provides a function prototype for the tracepoint, while TP_ARGS() looks like a call to that tracepoint.

These values are enough to let the programmer place a tracepoint in the code with a line like:

    trace_mm_page_allocation(page_to_pfn(page),
			     zone_page_state(zone, NR_FREE_PAGES));

This tracepoint is really just a known point in the code which can have, at run time, one or more function pointers stored into it by in-kernel tracing utilities like SystemTap or Ftrace. When the tracepoint is enabled, any functions stored there will be called with the given arguments. In this case, enabling the tracepoint will result in calls whenever a page is allocated; those calls will receive the page frame number of the allocated page and the number of free pages remaining as parameters.
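The mechanism can be illustrated with a small user-space sketch. It is not the kernel's implementation: SKETCH_TRACE(), PARAMS(), MAX_PROBES, and the register_trace_...() helper generated here are names invented for this example. It does show why both a typed prototype and a bare argument list are needed: the first declares the generated function, the second forwards the values to whatever probes have been attached.

```c
/* Lets a comma-separated list travel as a single macro argument. */
#define PARAMS(...) __VA_ARGS__
#define MAX_PROBES 4

/*
 * Illustrative stand-in for a tracepoint declaration (not the kernel macro).
 * "proto" is the typed parameter list; "args" is the same list, names only.
 */
#define SKETCH_TRACE(name, proto, args)                                 \
    typedef void (*name##_probe_t)(proto);                              \
    static name##_probe_t name##_probes[MAX_PROBES];                    \
    static int name##_nr_probes;                                        \
    static int register_trace_##name(name##_probe_t probe)              \
    {                                                                   \
        if (name##_nr_probes == MAX_PROBES)                             \
            return -1;                                                  \
        name##_probes[name##_nr_probes++] = probe;                      \
        return 0;                                                       \
    }                                                                   \
    static void trace_##name(proto)                                     \
    {                                                                   \
        int probe_idx_;                                                 \
        for (probe_idx_ = 0; probe_idx_ < name##_nr_probes;             \
             probe_idx_++)                                              \
            name##_probes[probe_idx_](args);                            \
    }

SKETCH_TRACE(mm_page_allocation,
             PARAMS(unsigned long pfn, unsigned long free),
             PARAMS(pfn, free))

/* State filled in by the example probe below. */
static unsigned long last_pfn, last_free;
static int calls;

/* A probe: receives exactly the arguments given at the tracepoint site. */
static void record_alloc(unsigned long pfn, unsigned long free)
{
    last_pfn = pfn;
    last_free = free;
    calls++;
}

/* Hits the tracepoint before and after attaching a probe; returns calls seen. */
static int demo_tracepoint(void)
{
    trace_mm_page_allocation(0x1000, 512);  /* no probe attached: no call */
    register_trace_mm_page_allocation(record_alloc);
    trace_mm_page_allocation(0x2000, 511);  /* probe is called once */
    return calls;
}
```

With no probe registered, the generated trace_mm_page_allocation() does nothing beyond checking the probe count; once record_alloc() is attached, each hit of the tracepoint calls it with the two values.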

As can be seen in the declaration above, there's more to the tracepoint than those arguments; the rest of the information in the tracepoint declaration is used by the Ftrace subsystem. Ftrace has a couple of seemingly conflicting goals; it wants to be able to quickly enable human-readable output from a tracepoint with no external tools, but the Ftrace developers also want to be able to export trace data from the kernel quickly, without the overhead of encoding it first. And that's where the remaining arguments to TRACE_EVENT() come in.

When properly defined (the magic exists in a bunch of header files under kernel/trace), TP_STRUCT__entry() adds extra fields to the structure which represents the tracepoint; those fields should be capable of holding the binary parameters associated with the tracepoint. The TP_fast_assign() macro provides the code needed to copy the relevant data into that structure. That data can, with some changes merged for 2.6.30, be exported directly to user space in binary format. But, if the user just wants to see formatted information, the TP_printk() macro gives the format string and arguments needed to make that happen.
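The split between cheap binary recording and optional formatting can be sketched in ordinary user-space C. The structure and function names below (page_alloc_entry, record_binary(), format_entry()) are hypothetical and only mirror the roles of TP_STRUCT__entry(), TP_fast_assign(), and TP_printk(); the real macros generate considerably more machinery than this.

```c
#include <stdio.h>
#include <string.h>

/* Plays the role of TP_STRUCT__entry(): a fixed layout for the raw values. */
struct page_alloc_entry {
    unsigned long pfn;
    unsigned long free;
};

/*
 * Plays the role of TP_fast_assign(): copy the arguments into the trace
 * buffer without formatting. Returns bytes written, or 0 if buf is too small.
 */
static size_t record_binary(void *buf, size_t len,
                            unsigned long pfn, unsigned long free)
{
    struct page_alloc_entry entry;

    if (len < sizeof(entry))
        return 0;
    entry.pfn = pfn;
    entry.free = free;
    memcpy(buf, &entry, sizeof(entry));
    return sizeof(entry);
}

/*
 * Plays the role of TP_printk(): turn a stored record into text, and only
 * when a human-readable view is actually requested.
 */
static int format_entry(char *out, size_t len, const void *buf)
{
    struct page_alloc_entry entry;

    memcpy(&entry, buf, sizeof(entry));
    return snprintf(out, len, "pfn=%lx zone_free=%ld",
                    entry.pfn, (long)entry.free);
}
```

A tool that wants raw data can ship the bytes written by record_binary() as they are; the cost of snprintf() is paid only by a reader who asks for text.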

The end result is that defining a tracepoint takes a small amount of work, but using it thereafter is relatively easy. With Ftrace, it's a simple matter of accessing a couple of debugfs files. But other tools, including LTTng and SystemTap, are also able to make use of these tracepoints.
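As a rough idea of what "accessing a couple of debugfs files" involves, here is a minimal sketch that writes a value to a control file. The path is an assumption: it presumes debugfs is mounted at /sys/kernel/debug and that the event shows up under a "kmem" directory, neither of which is stated above. The helper names are invented for this example.

```c
#include <stdio.h>

/* Writes a short string to a control file; returns 0 on success, -1 on error. */
int write_control_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/*
 * Assumed location of the event's switch. The mount point and the "kmem"
 * subsystem directory are guesses made for illustration only.
 */
#define EVENT_ENABLE_PATH \
    "/sys/kernel/debug/tracing/events/kmem/mm_page_allocation/enable"

/* Turns the example event on; needs root and a kernel providing the file. */
int enable_page_alloc_event(void)
{
    return write_control_file(EVENT_ENABLE_PATH, "1");
}
```

The same helper, given "0", would switch the event off again; reading the collected output is a matter of opening another file in the same tracing directory.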

The disagreement

Given all the talk about tracing in recent years, there is clearly demand for this sort of facility in the kernel. So one might think that adding tracepoints would be uncontroversial. But, naturally enough, it's not that simple.

The first objection that usually arises has to do with the performance impact of tracepoints, which are often placed in the most performance-critical code paths in the kernel. That is, after all, where the real action happens. So adding an unconditional function call to implement a tracepoint is out of the question; even putting an if test around it is problematic. After literally years of work, the developers came up with a scheme ~~involving run-time code patching~~ that reduces the performance cost of an inactive tracepoint to, for all practical purposes, zero. Even the most performance-conscious developers have stopped fretting about this particular issue. But, of course, there are others.

A tracepoint exists to make specific kernel information available to user space. So, in some real sense, it becomes part of the kernel ABI. As an ABI feature, a tracepoint becomes set in stone once it's shipped in a stable kernel. There is not a universal agreement on the immutability of kernel tracepoints, but the simple fact is that, once these tracepoints become established and prove their usefulness, changing them will cause user-space tracing tools to break. That means that, even if tracepoints are not seen as a stable ABI the way system calls are, there will still be considerable resistance to changing them.

Keeping tracepoints stable when the code around them changes will be a challenge. A substantial subset of the developer community will probably never use those tracepoints, so they will tend to be unaware of them and will not notice when they break. But even a developer who is trying to keep tracepoints stable is going to run into trouble when the code evolves to the point that the original tracepoint no longer makes sense. One can imagine all kinds of cruft being added so that a set of tracepoints gives the illusion of a very different set of decisions than is being made in a future kernel; one can also imagine the hostile reception any such code will find.

The maintenance burden associated with tracepoints is the reason behind Andrew Morton's opposition to their addition. With regard to the workqueue tracepoints, Andrew said:

If someone wants to get down and optimise our use of workqueues then good for them, but that exercise doesn't require the permanent addition of large amounts of code to the kernel. The same amount of additional code and additional churn could be added to probably tens of core kernel subsystems, but what _point_ is there to all this? Who is using it, what problems are they solving?

We keep on adding all these fancy debug gizmos to the core kernel which look like they will be used by one person, once. If that!

Needless to say, the tracing developers see the code as being more widely useful than that. Frederic Weisbecker gave a detailed description of the sort of debugging which can be done with the workqueue tracepoints. Ingo Molnar's response appears to be an attempt to hold up the addition of other types of kernel instrumentation until the tracepoint issue is resolved. Andrew remains unconvinced, though; it seems he would rather see much of this work done with dynamic tracing tools instead.

As of this writing, that's where things stand. If these tracepoints do not get into the mainline, it is hard to see developers going out and creating others in the future. So Linux could end up without a set of well-defined static tracepoints for a long time yet - though it would not be surprising to see the enterprise Linux vendors adding some to their own kernels. Perhaps that is the outcome that the development community as a whole wants, but it's not clear that this feeling is universal at this time. If, instead, Linux is going to end up with a reasonable set of tracepoints, the development community will need to come to some sort of consensus on which kinds of tracing instrumentation are acceptable.

On the value of static tracepoints

Posted Apr 28, 2009 17:33 UTC (Tue) by mjthayer (guest, #39183) [Link] (2 responses)

Would it help to define the tracepoints deeper in the kernel internals as kernel version-dependent APIs, and warn that any application or script using them may be tied to a few kernel versions? It might help if the user space APIs for accessing static tracepoints were able to fail with something equivalent to "tracepoint no longer available", and this was clearly documented. And including the version number of the kernel where the tracepoint first appeared in the tracepoint name might be a hint too. Presumably though most users of these tracepoints would be sufficiently tied to those kernel versions anyway that this would not be such an issue.

On the value of static tracepoints

Posted Apr 28, 2009 18:13 UTC (Tue) by fuhchee (guest, #40059) [Link]

> It might help if the user space APIs for accessing static tracepoints were
> able to fail with something equivalent to "tracepoint no longer available",
> and this was clearly documented

It's not an API in the sense of a programming interface, but the way
systemtap exposes tracepoints (and indeed any other event source such
as kprobes, utrace events, timers, ....), is that an attempt to attach
to a facility that does not exist/match results in just such an error
message.

Systemtap also has some facilities to adapt to changes in kernels and
elsewhere, using both a preprocessor construct (similar to a #if
KERNEL_VERSION), and a syntax to specify a list of alternative sources
of the same events (the "?" and "!" operators in a probe point sequence).

So systemtap's use of tracepoints in no way imposes a requirement that
they be unmodified and permanent.

On the value of static tracepoints

Posted Apr 28, 2009 19:59 UTC (Tue) by ajb (subscriber, #9694) [Link]

Maybe these developer-only APIs should only get enabled by a magic sysrq key being pressed on the console. That way software which uses them can't become embedded in end-user software.

On the value of static tracepoints

Posted Apr 28, 2009 18:06 UTC (Tue) by fuhchee (guest, #40059) [Link] (2 responses)

> After literally years of work, the developers came up with a scheme
> involving run-time code patching that reduces the performance cost of an
> inactive tracepoint to, for all practical purposes, zero.

Can someone point me to the code that applies code patching to tracepoints?

Code patching

Posted Apr 28, 2009 18:33 UTC (Tue) by corbet (editor, #1) [Link] (1 responses)

Sigh. I had figured that the immediate values/kernel marker stuff was being used here, but a closer look makes it clear that tracepoints do not use that infrastructure. Not sure why. Sorry for the confusion.

Code patching

Posted Apr 28, 2009 19:21 UTC (Tue) by compudj (subscriber, #43335) [Link]

The Immediate Values are still waiting in the LTTng tree. Actually, I am waiting to see enough tracepoints in the mainline Linux kernel to justify the use of Immediate Values before I re-post them. Otherwise, I seem to be the only one convinced of their use, given the number of tracepoints present in the LTTng kernel tree.

Regarding Mainline, kmemtrace adds enough tracepoints to have a tiny, but measurable, impact on the localhost tbench workload (one would still have to figure out if it's really statistically significant by running more passes than I can given the time I have on my hands). Note that this workload is _very_ heavy on the number of tracepoint sites executed and sensitive to cache-line layout changes.

I won't fight to push them, but if there is sufficient willingness to have them merged, I will consider posting them as a "git pull" request.

Mathieu

patchset

Posted Apr 28, 2009 18:26 UTC (Tue) by mattmelton (guest, #34842) [Link] (1 responses)

it would make more sense if someone maintained a patchset that provided the static trace points - ie: a dev kernel build. before a hacker makes the changes to their git repo public, they pop the patchset - or git does it for the developer magically. does git have the ability to transparently add/remove private patches like this?

patchset

Posted Apr 28, 2009 19:59 UTC (Tue) by compudj (subscriber, #43335) [Link]

It already exists, and it's called LTTng.

See :

http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2...

Note that it contains both the instrumentation and the LTTng tracer.

The patchsets are available at :

http://www.kernel.org/pub/linux/kernel/people/compudj/pat...

Mathieu

On the value of static tracepoints

Posted Apr 28, 2009 18:43 UTC (Tue) by bronson (subscriber, #4806) [Link] (7 responses)

Presumably DTrace has solved a lot of these concerns. What decisions did they make?

Also, what's wrong with letting tracepoints just disappear in new kernel releases? Userspace should expect this and it can deal with it gracefully.

Adding code to the kernel just to try to maintain tracepoint backward compatibility? Hah! That'll be the day.

On the value of static tracepoints

Posted Apr 28, 2009 20:23 UTC (Tue) by NAR (subscriber, #1313) [Link] (6 responses)

> Presumably DTrace has solved a lot of these concerns. What decisions did they make?

I guess they are not releasing a new kernel every 3 months... I think this "new release every 3 months" schedule with the "no stable ABI" policy just doesn't work well with "enterprise". I mean an application developer or system administrator might spend a considerable time to get to know these tracepoints, but if they change with every release, then the users won't be happy. And it's not just tracepoints, but tuning parameters under /proc or /sys, configuration parameters, etc. Even filesystems can start to work differently with each new kernel version (see the ext3 issues). I would hate to develop for such a moving target (it's quite enough to follow the customer's requests).

On the other hand people are forced to upgrade to get the security fixes, so they can't afford to stay with the stable well-known solution. I know that this is the market for the enterprise distributions, but it also means that the kernels of the (enterprise) distributions are diverging from the mainline kernel, even though the new kernel development methodology was supposed to prevent this.

Mark Shuttleworth had this idea some time ago that the distributions (or applications) should sync their releases. It might be useful if let's say RHEL, SLE[SD], Ubuntu LTS (and maybe Debian stable) would be released around the same time, would get the same kernel and the same tracepoints, tuning parameters, etc. This could be labelled as a .0 release. This way the enterprise distributions could also backport the same security fixes from the later kernel versions.

On the value of static tracepoints

Posted Apr 29, 2009 9:54 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (5 responses)

> On the other hand people are forced to upgrade to get the security fixes, so they can't afford to stay with the stable well-known solution.

That's what the stable tree is for. The 2.6.x.y ones. You do know about those, yes?

On the value of static tracepoints

Posted Apr 29, 2009 10:01 UTC (Wed) by NAR (subscriber, #1313) [Link] (4 responses)

And a 2.6.29.1 stable release would help in what way on a 2.6.16 kernel? Because running the same kernel for two years *is* stability, changing kernels every 6 months is not.

On the value of static tracepoints

Posted Apr 29, 2009 10:21 UTC (Wed) by hppnq (guest, #14462) [Link]

Obviously one would get the latest release for the kernel you are running. If that doesn't exist, maybe "stable" has become "obsolete". Arguing that "changing kernels" every six months is necessary is nonsense, even disregarding the fact that it has little or nothing to do with stability in the first place.

If it has, you are doing something wrong.

On the value of static tracepoints

Posted Apr 29, 2009 10:22 UTC (Wed) by abacus (guest, #49001) [Link]

If you can't live with fast changes, stick to an enterprise distro. These distros even offer a stable binary kernel API.

On the value of static tracepoints

Posted Apr 29, 2009 16:12 UTC (Wed) by nye (subscriber, #51576) [Link] (1 responses)

2.6.16 *was* maintained for over two years (and is no doubt still maintained by distributions even if the mainline support has been dropped). It's now been replaced by 2.6.27.x as the stable line. There's rather more than 6 months between 16 and 27, but if you really want long-term maintenance then you could always use 2.4.37.1.

On the value of static tracepoints

Posted May 6, 2009 21:13 UTC (Wed) by roelofs (guest, #2599) [Link]

> ... but if you really want long-term maintenance then you could always use 2.4.37.1.

Not if you want a working ntpd.

Greg

Simplifying the declaration may help

Posted Apr 28, 2009 19:26 UTC (Tue) by dw (subscriber, #12017) [Link] (1 responses)

While I appreciate the value of trace points, if someone said they'd add 30 lines of macros to my lovely code because it might benefit a user someday, I'd probably not take kindly to it either.

The printk support doesn't seem useful, it's like a design argument was abandoned in favour of implementing both sides of the debate. Even with printk, it should be possible to combine instantiation + declaration like so:

#define TRACE_EVENT(name, assign, fields...)

TRACE_EVENT(mm_page_allocation,
            ({ t->pfn = pfn; t->free = free; }),
            unsigned long pfn, unsigned long free)

The fields argument could be turned into a string with cpp's stringification (#) operator and converted to a format string on first use at runtime, or perhaps by reading debug information during the build.

Simplifying the declaration may help

Posted Apr 28, 2009 19:56 UTC (Tue) by compudj (subscriber, #43335) [Link]

This is a no-go, as Linux Kernel Markers history has shown. It's good for debug-style ad-hoc tracing, but having a declaration in a global header, like the tracepoints are doing, _really_ helps for tracepoint maintainability.

Having to dig into the C source to find buried trace_mark() instances is not a pretty picture.

I think keeping the trace_mark() infrastructure is good as far as quick local tracing addition is concerned, but should not be added to mainline kernel code, given it is hard to maintain.

And regarding the declaration complexity, it's added by the TRACE_EVENT() macros done by Steven. The original tracepoint flavor I did uses DECLARE_TRACE, which is far simpler, e.g.

DECLARE_TRACE(irq_entry,
    TP_PROTO(unsigned int id, struct pt_regs *regs,
             struct irqaction *action),
    TP_ARGS(id, regs, action));

But it involves creating a probe module which contains the callback elsewhere. Adding tracer-specific callbacks in a different location is seen as too much of a burden by Ingo and Steven, hence their TRACE_EVENT() macro, which combines declaration and callback "definition".

I am still unconvinced it's the best way to go though, as keeping the callbacks separated from the header declaration isolates the "tracer ABI" from the kernel instrumentation. But given we stipulate that the tracer ABI *will* evolve over time (given we give the userspace tools enough flexibility to cope with this), it can be argued that having an "all in one place" TRACE_EVENT() declaration/definition is more valuable than isolation of instrumentation for userspace-visible ABI. I guess usage will tell.

Mathieu

On the value of static tracepoints

Posted Apr 28, 2009 19:45 UTC (Tue) by compudj (subscriber, #43335) [Link] (4 responses)

As an answer to Andrew Morton (which I should have probably posted on LKML rather than here), I would say that one of the primary strengths of a system-wide kernel tracer is to give Linux users the ability to answer this simple question : "Why am I not getting the expected performance or latency when I run such application or use such device on my system ?"

The answer to this question is rather easy when parts of the system are _actively_ eating up CPU time (oprofile is very good at system-wide profiling), but becomes less clear when the issue is a "worst-case latency" or involves delays caused by process "waiting time". Having a trace of wakeup dependencies and the identity of each thread consuming CPU time along with scheduler decisions is incredibly valuable in getting an overall view of the system's behavior.

If this has not been expressed clearly enough in the many presentations many of us have done in the past years, then I guess we are simply unable to reach the right audience. A good case study of the static tracepoint value was presented in this paper 2 years ago. It presents how static tracing has been used to debug problems at Google, IBM and Autodesk.

Linux Kernel Debugging on Google-sized clusters at Ottawa Linux Symposium 2007

I have, in addition, personally been involved with and helped static tracer deployment at Google, IBM, Autodesk, Nokia, Ericsson, Siemens, Novell (SuSE Enterprise real-time), WindRiver, Montavista (Carrier Grade Linux distribution).

And if kernel developers still think that a kernel tracer is only valuable to kernel developers, then we have a big marketing job to do because they are just not getting the message : kernel tracing is _very_ valuable to Linux *users*.

P.S.: I did not reply on this topic on LKML because I think I have done my share of the explanation in the past 4 years, and I would just be repeating myself. *Linux users* have to speak up, not me.

On the value of static tracepoints

Posted Apr 29, 2009 2:22 UTC (Wed) by k8to (guest, #15413) [Link] (1 responses)

LKML is scary and forbidding.
You're suggesting I wander into a contentious topic and start opinionating?

I am a Linux user who does, among other things, systems performance tuning. I do this ad hoc and also as a significant portion of my job. Luckily I have the freedom to do portions of my work on Solaris, where dtrace is accessible. Unfortunately the Solaris internals are generally less well documented or at least less familiar than Linux to me.

On the value of static tracepoints

Posted Apr 29, 2009 5:22 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link]

Andrew Morton has said before that he would like to hear from more users and perhaps you should just do that with some good information on why and how these features are useful. The worst you will get is some flames from other people. Not a huge deal, really.

On the value of static tracepoints

Posted Apr 29, 2009 9:46 UTC (Wed) by dunlapg (guest, #57764) [Link] (1 responses)

"kernel tracing is _very_ valuable to Linux *users*."

Isn't this exactly what the article says Andrew Morton is resistant to? If it's ultimately exposed to the end-user, then it will have to have a set of stable user-space tools to gather the information, which means it will essentially be a part of the ABI that has to be maintained, or which people will complain of if broken.

The Xen hypervisor has a binary-only static tracing facility that I use extensively for my development. The particular traces change on a regular basis as the code evolves; trying to maintain the same interface for user-land tools would be basically impossible. As it is, before each release I have to go through and make sure that all of the traces I need are still there and haven't been broken by someone else. I think it's worth my time as a developer dealing with the instability. But I wouldn't want that promise exposed to an end-user.

On the value of static tracepoints

Posted Apr 29, 2009 13:08 UTC (Wed) by compudj (subscriber, #43335) [Link]

I think Andrew has no problem with exposing this information to trace analysis tools, and that eventually trace analysis tool developers will have to adapt these tools to follow kernel revisions. We just have to make sure we add version identifiers in the exported data and make the tools flexible enough so we don't end up breaking at each kernel version. As an example, maintaining LTTV for about 7 years has not required any tremendous effort to follow new kernel releases, given we made it flexible enough.

I think the main issue he raises here is that Ftrace looks like a gathering of single-purpose tracers which will be useful only to kernel developers (and probably only once, as he says). Maybe Andrew exaggerates a bit, but his main concern, which I think is plausible, is whether the Ftrace approach is useful to the Linux end-users.

Kernel developers can replace some of the static tracepoints discussed above by dynamic instrumentation because they usually won't face the same low-performance-impact requirements as users doing system-wide tracing on heavily-loaded production systems (yes, people do this with LTTng). So the addition of such tracepoints for either a special-purpose tracer or for a tracer which does not care so much about slowing the system down because it only collects a specific subset of data is clearly arguable. I think the main answer to this is to bring a high-performance, system-wide user-available tracer in Linux, so those tracepoints have an in-tree user which uses them extensively. LTTng happens to have been providing this out-of-tree for a few years now.

Macros

Posted Apr 30, 2009 10:14 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

When are the kernel developers going to fork gcc (or build their own language compiler) that
includes LISP-like macros over kernel code? This is a classic case for them.

Macros

Posted Apr 30, 2009 13:30 UTC (Thu) by compudj (subscriber, #43335) [Link]

Sounds like you are reading in Ingo Molnar's mind :

[rfc] built-in native compiler for Linux?
http://lkml.org/lkml/2009/4/22/78

Mathieu