On the value of static tracepoints
Sun's DTrace is famously a dynamic tracing facility, meaning that it can be used to insert tracepoints at (almost) any location in the kernel. But the Solaris kernel also comes with an extensive and well-documented set of static tracepoints which can be activated by name. These tracepoints have been placed at carefully-considered locations which facilitate investigations into what the kernel is actually doing. Many real-world DTrace scripts need only the static tracepoints and do no dynamic tracepoint insertion at all.
There is clear value in these static tracepoints. They represent the wisdom of the developers who (presumably) are the most familiar with each kernel subsystem. System administrators can use them to extract a great deal of useful information without having to know the code in question. Properly-placed static tracepoints bring a significant amount of transparency to the kernel.
As tracing capabilities in Linux improve, developers naturally want to provide a similar set of static tracepoints. The fact that static tracing is reasonably well supported (via Ftrace) in mainline kernels - with more extensive support available via SystemTap and LTTng - also encourages the creation of static tracepoints. As a result, there have been recent patches adding tracepoints to workqueues and some core memory management functions, among others.
Digression: static tracepoints
As an aside, it's worth looking at the form these tracepoints take; the design of Linux tracepoints gives a perspective on the problems they were intended to solve. As an example, consider the following tracepoint for the memory management code, which reports on page allocations. The declaration of the tracepoint looks like this:
    #include <linux/tracepoint.h>

    TRACE_EVENT(mm_page_allocation,
        TP_PROTO(unsigned long pfn, unsigned long free),
        TP_ARGS(pfn, free),
        TP_STRUCT__entry(
            __field(unsigned long, pfn)
            __field(unsigned long, free)
        ),
        TP_fast_assign(
            __entry->pfn = pfn;
            __entry->free = free;
        ),
        TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)
    );
That seems like a lot of boilerplate for what is, in a sense, a switchable printk() call. But, naturally, there is a reason for each piece. The TRACE_EVENT() macro declares a tracepoint - this one is called mm_page_allocation - but does not yet instantiate it in the code. The tracepoint has arguments which are passed to it at its actual instantiation (which we'll get to below); they are declared fully in the TP_PROTO() macro and named in the TP_ARGS() macro. Essentially, TP_PROTO() provides a function prototype for the tracepoint, while TP_ARGS() looks like a call to that tracepoint.
These values are enough to let the programmer place a tracepoint in the code with a line like:
trace_mm_page_allocation(page_to_pfn(page), zone_page_state(zone, NR_FREE_PAGES));
This tracepoint is really just a known point in the code which can have, at run time, one or more function pointers stored into it by in-kernel tracing utilities like SystemTap or Ftrace. When the tracepoint is enabled, any functions stored there will be called with the given arguments. In this case, enabling the tracepoint will result in calls whenever a page is allocated; those calls will receive the page frame number of the allocated page and the number of free pages remaining as parameters.
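The "known point in the code with function pointers stored into it" idea can be modeled in ordinary user-space C. What follows is only a sketch of the concept, not the kernel's implementation: every name in it (tracepoint_model, register_probe(), trace_mm_page_allocation_model(), record_probe()) is hypothetical, and it ignores locking, RCU, and the code patching that the real infrastructure has to deal with.

```c
#include <stddef.h>

/* User-space model only: none of these names are kernel APIs. */
#define MAX_PROBES 4

typedef void (*page_alloc_probe_t)(unsigned long pfn, unsigned long free);

struct tracepoint_model {
    page_alloc_probe_t probes[MAX_PROBES];
    size_t nr_probes;   /* zero means the tracepoint is inactive */
};

static struct tracepoint_model mm_page_allocation_tp;

/* Attach a probe; returns 0 on success, -1 if the table is full. */
static int register_probe(struct tracepoint_model *tp, page_alloc_probe_t fn)
{
    if (tp->nr_probes == MAX_PROBES)
        return -1;
    tp->probes[tp->nr_probes++] = fn;
    return 0;
}

/* The call site: does nothing at all unless probes have been attached. */
static void trace_mm_page_allocation_model(unsigned long pfn,
                                           unsigned long free)
{
    for (size_t i = 0; i < mm_page_allocation_tp.nr_probes; i++)
        mm_page_allocation_tp.probes[i](pfn, free);
}

/* Example probe: remembers the most recent event it was handed. */
static unsigned long last_pfn, last_free, events_seen;

static void record_probe(unsigned long pfn, unsigned long free)
{
    last_pfn = pfn;
    last_free = free;
    events_seen++;
}
```

Before register_probe() is called, trace_mm_page_allocation_model() falls straight through its loop; afterward, each call hands the page frame number and free-page count to record_probe(), much as an enabled kernel tracepoint hands its arguments to whatever a tracer has attached.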
As can be seen in the declaration above, there's more to the tracepoint than those arguments; the rest of the information in the tracepoint declaration is used by the Ftrace subsystem. Ftrace has a couple of seemingly conflicting goals; it wants to be able to quickly enable human-readable output from a tracepoint with no external tools, but the Ftrace developers also want to be able to export trace data from the kernel quickly, without the overhead of encoding it first. And that's where the remaining arguments to TRACE_EVENT() come in.
When properly defined (the magic exists in a bunch of header files under kernel/trace), TP_STRUCT__entry() adds extra fields to the structure which represents the tracepoint; those fields should be capable of holding the binary parameters associated with the tracepoint. The TP_fast_assign() macro provides the code needed to copy the relevant data into that structure. That data can, with some changes merged for 2.6.30, be exported directly to user space in binary format. But, if the user just wants to see formatted information, the TP_printk() macro gives the format string and arguments needed to make that happen.
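To make the division of labor concrete, here is a rough user-space approximation of what the three remaining pieces amount to for the mm_page_allocation example. It is a hand-written stand-in, not what the macros actually expand to: the structure and function names are made up for illustration, and the real entry structure carries additional fields maintained by Ftrace.

```c
#include <stdio.h>

/* Roughly what TP_STRUCT__entry() describes: the binary record. */
struct mm_page_allocation_entry {
    unsigned long pfn;
    unsigned long free;
};

/* Roughly what TP_fast_assign() does: copy raw values, no formatting. */
static void fast_assign(struct mm_page_allocation_entry *entry,
                        unsigned long pfn, unsigned long free)
{
    entry->pfn = pfn;
    entry->free = free;
}

/*
 * Roughly what TP_printk() provides: formatting done later, and only if
 * a human-readable line is wanted. Returns the formatted length.
 */
static int format_entry(const struct mm_page_allocation_entry *entry,
                        char *buf, size_t len)
{
    return snprintf(buf, len, "pfn=%lx zone_free=%ld",
                    entry->pfn, (long)entry->free);
}
```

The point of the split is that fast_assign() is the only work done on the hot path; the comparatively expensive format_entry() step can be skipped entirely when the record is exported in binary form.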
The end result is that defining a tracepoint takes a small amount of work, but using it thereafter is relatively easy. With Ftrace, it's a simple matter of accessing a couple of debugfs files. But other tools, including LTTng and SystemTap, are also able to make use of these tracepoints.
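"Accessing a couple of debugfs files" generally means writing a short string into a control file. The helper below sketches that step in C; it assumes (this is not stated in the article) that each event is controlled by a file named "enable" inside a per-event directory, and the function names are invented for this example. The location of that directory depends on where debugfs is mounted and on the kernel version, so the caller supplies it.

```c
#include <stdio.h>

/* Write a short string to a control file; 0 on success, -1 on failure. */
static int write_control_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Turn one event on or off. event_dir is the directory assumed to hold
 * that event's "enable" file; its real location is system-dependent.
 */
static int set_event_enabled(const char *event_dir, int enable)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "%s/enable", event_dir);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return write_control_file(path, enable ? "1" : "0");
}
```

On a system where the assumed layout holds, calling set_event_enabled() with the event's directory and a nonzero flag would switch the tracepoint on; the formatted output is then read from a separate trace file.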
The disagreement
Given all the talk about tracing in recent years, there is clearly demand for this sort of facility in the kernel. So one might think that adding tracepoints would be uncontroversial. But, naturally enough, it's not that simple.
The first objection that usually arises has to do with the performance impact of tracepoints, which are often placed in the most performance-critical code paths in the kernel. That is, after all, where the real action happens. So adding an unconditional function call to implement a tracepoint is out of the question; even putting an if test around it is problematic. After literally years of work, the developers came up with a scheme involving run-time code patching that reduces the performance cost of an inactive tracepoint to, for all practical purposes, zero. Even the most performance-conscious developers have stopped fretting about this particular issue. But, of course, there are others.
A tracepoint exists to make specific kernel information available to user space. So, in some real sense, it becomes part of the kernel ABI. As an ABI feature, a tracepoint becomes set in stone once it's shipped in a stable kernel. There is not a universal agreement on the immutability of kernel tracepoints, but the simple fact is that, once these tracepoints become established and prove their usefulness, changing them will cause user-space tracing tools to break. That means that, even if tracepoints are not seen as a stable ABI the way system calls are, there will still be considerable resistance to changing them.
Keeping tracepoints stable when the code around them changes will be a challenge. A substantial subset of the developer community will probably never use those tracepoints, so they will tend to be unaware of them and will not notice when they break. But even a developer who is trying to keep tracepoints stable is going to run into trouble when the code evolves to the point that the original tracepoint no longer makes sense. One can imagine all kinds of cruft being added so that a set of tracepoints gives the illusion of a very different set of decisions than is being made in a future kernel; one can also imagine the hostile reception any such code will find.
The maintenance burden associated with tracepoints is the reason behind Andrew Morton's opposition to their addition. With regard to the workqueue tracepoints, Andrew said:
We keep on adding all these fancy debug gizmos to the core kernel which look like they will be used by one person, once. If that!
Needless to say, the tracing developers see the code as being more widely useful than that. Frederic Weisbecker gave a detailed description of the sort of debugging which can be done with the workqueue tracepoints. Ingo Molnar's response appears to be an attempt to hold up the addition of other types of kernel instrumentation until the tracepoint issue is resolved. Andrew remains unconvinced, though; it seems he would rather see much of this work done with dynamic tracing tools instead.
As of this writing, that's where things stand. If these tracepoints do not get into the mainline, it is hard to see developers going out and creating others in the future. So Linux could end up without a set of well-defined static tracepoints for a long time yet - though it would not be surprising to see the enterprise Linux vendors adding some to their own kernels. Perhaps that is the outcome that the development community as a whole wants, but it's not clear that this feeling is universal at this time. If, instead, Linux is going to end up with a reasonable set of tracepoints, the development community will need to come to some sort of consensus on which kinds of tracing instrumentation are acceptable.
Index entries for this article | |
---|---|
Kernel | Development tools/Kernel tracing |
Kernel | Ftrace |
Kernel | Tracing |
On the value of static tracepoints
Posted Apr 28, 2009 17:33 UTC (Tue) by mjthayer (guest, #39183) [Link] (2 responses)

On the value of static tracepoints
Posted Apr 28, 2009 18:13 UTC (Tue) by fuhchee (guest, #40059) [Link]
> able to fail with something equivalent to "tracepoint no longer available",
> and this was clearly documented
It's not an API in the sense of a programming interface, but the way systemtap exposes tracepoints (and indeed any other event source such as kprobes, utrace events, timers, ....), is that an attempt to attach to a facility that does not exist/match results in just such an error message.
Systemtap also has some facilities to adapt to changes in kernels and elsewhere, using both a preprocessor construct (similar to a #if KERNEL_VERSION), and a syntax to specify a list of alternative sources of the same events (the "?" and "!" operators in a probe point sequence).
So systemtap's use of tracepoints in no way imposes a requirement that they be unmodified and permanent.

On the value of static tracepoints
Posted Apr 28, 2009 19:59 UTC (Tue) by ajb (subscriber, #9694) [Link]

On the value of static tracepoints
Posted Apr 28, 2009 18:06 UTC (Tue) by fuhchee (guest, #40059) [Link] (2 responses)
> involving run-time code patching that reduces the performance cost of an
> inactive tracepoint to, for all practical purposes, zero.
Can someone point me to the code that applies code patching to tracepoints?

Code patching
Posted Apr 28, 2009 18:33 UTC (Tue) by corbet (editor, #1) [Link] (1 responses)
Sigh. I had figured that the immediate values/kernel marker stuff was being used here, but a closer look makes it clear that tracepoints do not use that infrastructure. Not sure why. Sorry for the confusion.

Code patching
Posted Apr 28, 2009 19:21 UTC (Tue) by compudj (subscriber, #43335) [Link]
Regarding Mainline, kmemtrace adds enough tracepoints to have a tiny, but measurable, impact on the localhost tbench workload (one would still have to figure out if it's really statistically significant by running more passes than I can given the time I have on my hands). Note that this workload is _very_ heavy on the number of tracepoint sites executed and sensitive to cache-line layout changes.
I won't fight to push them, but if there is sufficient willingness to have them merged, I will consider posting them as a "git pull" request.
Mathieu

patchset
Posted Apr 28, 2009 18:26 UTC (Tue) by mattmelton (guest, #34842) [Link] (1 responses)

patchset
Posted Apr 28, 2009 19:59 UTC (Tue) by compudj (subscriber, #43335) [Link]
See :
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2...
Note that it contains both the instrumentation and the LTTng tracer.
The patchsets are available at :
http://www.kernel.org/pub/linux/kernel/people/compudj/pat...
Mathieu

On the value of static tracepoints
Posted Apr 28, 2009 18:43 UTC (Tue) by bronson (subscriber, #4806) [Link] (7 responses)
Presumably DTrace has solved a lot of these concerns. What decisions did they make?
Also, what's wrong with letting tracepoints just disappear in new kernel releases? Userspace should expect this and it can deal with it gracefully.
Adding code to the kernel just to try to maintain tracepoint backward compatibility? Hah! That'll be the day.

On the value of static tracepoints
Posted Apr 28, 2009 20:23 UTC (Tue) by NAR (subscriber, #1313) [Link] (6 responses)
I guess they are not releasing a new kernel every 3 months... I think this "new release every 3 months" schedule with the "no stable ABI" policy just doesn't work well with "enterprise". I mean an application developer or system administrator might spend a considerable time to get to know these tracepoints, but if they change with every release, then the users won't be happy. And it's not just tracepoints, but tuning parameters under /proc or /sys, configuration parameters, etc. Even filesystems can start to work differently with each new kernel version (see the ext3 issues). I would hate to develop for such a moving target (it's quite enough to follow the customer's requests).
On the other hand people are forced to upgrade to get the security fixes, so they can't afford to stay with the stable well-known solution. I know that this is the market for the enterprise distributions, but it also means that the kernels of the (enterprise) distributions are diverging from the mainline kernel, even though the new kernel development methodology is supposed to prevent this.
Mark Shuttleworth had this idea some time ago that the distributions (or applications) should sync their releases. It might be useful if let's say RHEL, SLE[SD], Ubuntu LTS (and maybe Debian stable) would be released around the same time, would get the same kernel and the same tracepoints, tuning parameters, etc. This could be labelled as a .0 release. This way the enterprise distributions could also backport the same security fixes from the later kernel versions.

On the value of static tracepoints
Posted Apr 29, 2009 9:54 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (5 responses)
> On the other hand people are forced to upgrade to get the security fixes, so they can't afford to stay with the stable well-known solution.
That's what the stable tree is for. The 2.6.x.y ones. You do know about those, yes?

On the value of static tracepoints
Posted Apr 29, 2009 10:01 UTC (Wed) by NAR (subscriber, #1313) [Link] (4 responses)

On the value of static tracepoints
Posted Apr 29, 2009 10:21 UTC (Wed) by hppnq (guest, #14462) [Link]
Obviously one would get the latest release for the kernel you are running. If that doesn't exist, maybe "stable" has become "obsolete". Arguing that "changing kernels" every six months is necessary is nonsense, even disregarding the fact that it has little or nothing to do with stability in the first place.
If it has, you are doing something wrong.

On the value of static tracepoints
Posted Apr 29, 2009 10:22 UTC (Wed) by abacus (guest, #49001) [Link]

On the value of static tracepoints
Posted Apr 29, 2009 16:12 UTC (Wed) by nye (subscriber, #51576) [Link] (1 responses)
... but if you really want long-term maintenance then you could always use 2.4.37.1.

On the value of static tracepoints
Posted May 6, 2009 21:13 UTC (Wed) by roelofs (guest, #2599) [Link]
Not if you want a working ntpd.
Greg

Simplifying the declaration may help
Posted Apr 28, 2009 19:26 UTC (Tue) by dw (subscriber, #12017) [Link] (1 responses)
While I appreciate the value of trace points, if someone said they'd add 30 lines of macros to my lovely code because it might benefit a user someday, I'd probably not take kindly to it either.
The printk support doesn't seem useful, it's like a design argument was abandoned in favour of implementing both sides of the debate. Even with printk, it should be possible to combine instantiation + declaration like so:
    #define TRACE_EVENT(name, assign, fields...)
    TRACE_EVENT(mm_page_allocation,
        ({ t->pfn = pfn; t->free = free; }),
        unsigned long pfn, unsigned long free)
The fields argument could be stringified with cpp's paste operator and converted to a format string on first use at runtime, or perhaps by reading debug information during the build.

Simplifying the declaration may help
Posted Apr 28, 2009 19:56 UTC (Tue) by compudj (subscriber, #43335) [Link]
Having to dig into the C source to find buried trace_mark() instances is not a pretty picture.
I think keeping the trace_mark() infrastructure is good as far as quick local tracing addition is concerned, but should not be added to mainline kernel code, given it is hard to maintain.
And regarding the declaration complexity, it's added by the TRACE_EVENT() macros done by Steven. The original tracepoint flavor I did uses DECLARE_TRACE, which is far simpler, e.g.
    DECLARE_TRACE(irq_entry,
        TP_PROTO(unsigned int id, struct pt_regs *regs,
            struct irqaction *action),
        TP_ARGS(id, regs, action));
But it involves creating a probe module which contains the callback elsewhere. Adding tracer-specific callbacks in a different location is seen as too much of a burden by Ingo and Steven, hence their TRACE_EVENT() macro, which combines declaration and callback "definition".
I am still unconvinced it's the best way to go though, as keeping the callbacks separated from the header declaration isolates the "tracer ABI" from the kernel instrumentation. But given we stipulate that the tracer ABI *will* evolve over time (given we give the userspace tools enough flexibility to cope with this), it can be argued that having an "all in one place" TRACE_EVENT() declaration/definition is more valuable than isolation of instrumentation for userspace-visible ABI. I guess usage will tell.
Mathieu

On the value of static tracepoints
Posted Apr 28, 2009 19:45 UTC (Tue) by compudj (subscriber, #43335) [Link] (4 responses)
As an answer to Andrew Morton (which I should have probably posted on LKML rather than here), I would say that one of the primary strengths of a system-wide kernel tracer is to give Linux users the ability to answer this simple question : "Why am I not getting the expected performance or latency when I run such application or use such device on my system ?"
The answer to this question is rather easy when parts of the system are _actively_ eating up CPU time (oprofile is very good at system-wide profiling), but becomes less clear when the issue is a "worst-case latency" or involves delays caused by process "waiting time". Having a trace of wakeup dependencies and the identity of each thread consuming CPU time along with scheduler decisions is incredibly valuable in getting an overall view of the system's behavior.
If this has not been expressed clearly enough in the many presentations many of us have done in the past years, then I guess we are simply unable to reach the right audience. A good case study of the static tracepoint value has been presented in this paper 2 years ago. It presents how static tracing has been used to debug problems at Google, IBM and Autodesk.
Linux Kernel Debugging on Google-sized clusters at Ottawa Linux Symposium 2007
I have, in addition, personally been involved with and helped static tracer deployment at Google, IBM, Autodesk, Nokia, Ericsson, Siemens, Novell (SuSE Enterprise real-time), WindRiver, Montavista (Carrier Grade Linux distribution).
And if kernel developers still think that a kernel tracer is only valuable to kernel developers, then we have a big marketing job to do because they are just not getting the message : kernel tracing is _very_ valuable to Linux *users*.
P.S.: I did not reply on this topic on LKML because I think I have done my share of the explanation in the past 4 years, and I would just be repeating myself. *Linux users* have to speak up, not me.

On the value of static tracepoints
Posted Apr 29, 2009 2:22 UTC (Wed) by k8to (guest, #15413) [Link] (1 responses)
You're suggesting i wander into a contentious topic and start opinionating?
I am a Linux user who does, among other things, systems performance tuning. I do this ad hoc and also as a significant portion of my job. Luckily I have the freedom to do portions of my work on Solaris, where dtrace is accessible. Unfortunately the Solaris internals are generally less well documented or at least less familiar than Linux to me.

On the value of static tracepoints
Posted Apr 29, 2009 5:22 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link]

On the value of static tracepoints
Posted Apr 29, 2009 9:46 UTC (Wed) by dunlapg (guest, #57764) [Link] (1 responses)
Isn't this exactly what the article says Andrew Morton is resistant to? If it's ultimately exposed to the end-user, then it will have to have a set of stable user-space tools to gather the information, which means it will essentially be a part of the ABI that has to be maintained, or which people will complain of if broken.
The Xen hypervisor has a binary-only static tracing facility that I use extensively for my development. The particular traces change on a regular basis as the code evolves; trying to maintain the same interface for user-land tools would be basically impossible. As it is, before each release I have to go through and make sure that all of the traces I need are still there and haven't been broken by someone else. I think it's worth my time as a developer dealing with the instability. But I wouldn't want that promise exposed to an end-user.

On the value of static tracepoints
Posted Apr 29, 2009 13:08 UTC (Wed) by compudj (subscriber, #43335) [Link]
I think the main issue he raises here is that Ftrace looks like a gathering of single-purpose tracers which will be useful only to kernel developers (and probably only once, as he says). Maybe Andrew exaggerates a bit, but his main concern, which I think is plausible, is whether the Ftrace approach is useful to the Linux end-users.
Kernel developers can replace some of the static tracepoints discussed above by dynamic instrumentation because they usually won't face the low performance impact requirements as users doing system-wide tracing on heavily-loaded production systems face (yes, people do this with LTTng). So the addition of such tracepoints for either a special-purpose tracer or for a tracer which does not care so much about slowing the system down because it only collects a specific subset of data can clearly be arguable. I think the main answer to this is to bring a high-performance, system-wide user-available tracer in Linux, so those tracepoints have an in-tree user which uses them extensively. LTTng happens to have been providing this out-of-tree for a few years now.

Macros
Posted Apr 30, 2009 10:14 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)
includes LISP-like macros over kernel code? This is a classic case for them.

Macros
Posted Apr 30, 2009 13:30 UTC (Thu) by compudj (subscriber, #43335) [Link]
[rfc] built-in native compiler for Linux?
http://lkml.org/lkml/2009/4/22/78
Mathieu