A collection of tracing topics
The tracing ABI
Once upon a time, Linux had no tracing-oriented interfaces at all. Now, instead, we have two: ftrace and perf events. Some types of information are only available via the ftrace interface, others are only available from perf, and some sorts of events can be obtained in either way. From the discussions that have been happening for some time it's clear that neither interface satisfies everybody's needs. In addition, there are other subsystems waiting in the wings - LTTng and a recently proposed system health subsystem, for example - which bring requirements of their own. The last thing that the system needs is an even wider variety of tracing interfaces; it would be nice, instead, to pull everything together into a single, unified interface.
Almost everybody involved agrees on that point, but that is about where the agreement stops. Your editor, unfortunately, missed the tempestuous session at the Linux Plumbers Conference where a number of tracing developers came to an agreement of sorts: a new ABI would be developed with the explicit goal of being a unified tracing and event interface for the system as a whole. This ABI would be kept out of the mainline until a number of tools had been written to use it; only when it became clear that everybody's needs are met would it be merged. Your editor talked to a number of the people involved in that discussion; all seemed pleased with the outcome.
Ftrace developer Steven Rostedt interpreted the discussion as a mandate to develop an entirely new ABI for tracing purposes.
LTTng developer Mathieu Desnoyers took things even further, posting a "tracing ABI work plan" for discussion. That posting was poorly received, being seen as a document better suited to managerial conference rooms - a perception which was not helped by Mathieu's subsequent posting of a massive common trace format document which would make a standards committee proud. Kernel developers, as always, would rather see code than extensive design documents.
When the code comes, though, it seems that there will be resistance to the idea of creating an entirely new tracing ABI. Thomas Gleixner has expressed his dislike for the current state of affairs and attempts to create complex replacements; he is calling for a gradual move toward a better interface. Ingo Molnar has said similar things:
"We'll need to embark on this incremental path instead of a rewrite-the-world thing. As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can and will do better here."
The existing ABI that Ingo likes, of course, is the perf interface. He would clearly like to see all tracing and event reporting move to the perf side of the house. The perf ABI, he says, is sufficiently extendable to accommodate everybody's needs; there does not seem to be a lot of room for negotiation on this point.
Stable tracepoints
One of the conclusions reached at the 2010 Kernel Summit was that a small set of system tracepoints would be designated "stable" and moved to a separate location in the filesystem hierarchy. Tools using these tracepoints would have a high level of assurance that things would not change in future kernel releases; meanwhile, kernel developers could feel free to add and use tracepoints elsewhere without worrying that they could end up maintaining them forever. It seemed like an outcome that everybody could live with.
Steven recently posted a patch implementing that decision. It adds another tricky macro (STABLE_EVENT()) which creates a stable tracepoint; all such tracepoints are essentially a second, restricted view of an existing "raw" tracepoint. That allows development-oriented tracepoints to provide more information than is deemed suitable for a stable interface and does not require cluttering the code with multiple tracepoint invocations. There is also a new "eventfs" filesystem to host stable tracepoints which is expected to be mounted on /sys/kernel/events. A small number of core tracepoints have been marked as stable - just enough to show how it's done.
There were a number of complaints about eventfs, not the least of which was Greg Kroah-Hartman's gripe that he had already written tracefs for just this purpose. Ingo had a different complaint, though: he is pushing an effort to distribute tracepoints throughout the sysfs hierarchy. The current /sys/kernel/debug/tracing/events directory would not go away (there are tools which depend on it), but future users of, say, ext4-related tracepoints would be expected to look for them in /sys/fs/ext4. It is an interesting idea which possibly makes good sense, but it is somewhat orthogonal to Steven's stable tracepoint posting; it doesn't address the stable/development distinction at all.
It eventually became clear that Ingo is opposed to the concept of marking some tracepoints as stable. He is, instead, taking the position that anything which is used by tools becomes part of the ABI, and that an excess of tools using too many tracepoints is a problem we wish we had. This opposition, needless to say, could make it hard to get the stable tracepoint concept into the kernel.
Here we see one of the hazards of skipping important developer meetings. The stable tracepoint discussion was expected to be one of the more contentious sessions at the kernel summit; in the end, though, everybody present seemed happy with the conclusion that was reached. But Ingo was not present. His point of view was not heard there, and the community believes it has reached consensus on something he apparently disagrees with. If Ingo succeeds in overriding that consensus, then Steven might not be the only person to express thoughts like:
That conversation has quieted for now, but it will almost certainly return. If nothing else, some developers are determined to change tracepoints when the need arises, so this issue can be expected to come up again at some point. One possible source of conflict is the recently-announced trace utility which, according to Ingo, has "no conceptual restrictions" and will use tracepoints without regard for any sort of "stable" designation.
trace_printk()
One useful but little-used tracing-related tool is trace_printk(). It can be called like printk() (though without a logging level), but its output does not go to the system log; instead, everything printed via this path goes into the tracing stream as seen by ftrace. When tracing is off, trace_printk() calls have no effect. When tracing is enabled, instead, trace_printk() data can be made available to a developer with far less overhead than normal printk() output. That overhead can matter - the slowdown caused by printk() calls is often enough to change timing-related behavior, leading to "heisenbugs" which are difficult to track down.
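The behavior described above can be modeled in a few lines of user-space C. What follows is only a sketch meant to illustrate the semantics; it is not the kernel's implementation, and model_trace_printk(), model_buf, model_len, and model_tracing_on are hypothetical names invented for the example. In this model, a call does nothing beyond a flag check when tracing is off; when tracing is on, it formats into a memory buffer rather than writing to a log.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_BUF_SIZE 4096

static char model_buf[MODEL_BUF_SIZE];  /* stands in for the trace buffer */
static size_t model_len;                /* bytes currently stored */
static int model_tracing_on;            /* stands in for the tracing switch */

/* printf-style call with no log level, in the manner of trace_printk(). */
static int model_trace_printk(const char *fmt, ...)
{
    size_t room = MODEL_BUF_SIZE - model_len;  /* always at least 1 */
    va_list ap;
    int n;

    assert(fmt != NULL);
    if (!model_tracing_on)
        return 0;                       /* tracing off: no effect at all */

    va_start(ap, fmt);
    n = vsnprintf(model_buf + model_len, room, fmt, ap);
    va_end(ap);

    if (n < 0)
        return 0;                       /* formatting error: store nothing */
    if ((size_t)n >= room)
        n = (int)(room - 1);            /* output was truncated to fit */
    model_len += (size_t)n;
    return n;                           /* bytes actually stored */
}
```

A caller would set model_tracing_on to 1, invoke model_trace_printk("x=%d\n", 42), and later read the accumulated text from model_buf, much as a developer reads the trace after the fact instead of watching the console.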
Output from trace_printk() does not look like a normal kernel event, though, so it is not available to the perf interface. Steven has posted a patch to rectify that, at the cost of potentially creating large numbers of new trace events. With this patch, every trace_printk() call will create a new event under ...events/printk/ based on the file name. So, to use Steven's example, a trace_printk() on line 2180 in kernel/sched.c would show up in the events hierarchy as .../events/printk/kernel/sched.c/2180. Each call could then be enabled and disabled independently, just like ordinary tracepoints. It's a convenient and understandable interface, but, if use of trace_printk() ever takes off, it could lead to the creation of large numbers of events.
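To make the naming scheme concrete, here is a small user-space sketch which builds the per-call event name from a source file name and a line number. The helper printk_event_path() is hypothetical, and its result is relative to the events directory; the actual patch may well construct its names differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "printk/<file>/<line>", the location of one
 * trace_printk() call relative to the events directory.
 * Returns the length written, or -1 if the name does not fit in 'out'.
 */
static int printk_event_path(char *out, size_t size, const char *file, int line)
{
    int n;

    assert(out != NULL && file != NULL);
    if (strlen(file) == 0)
        return -1;                      /* an empty file name makes no sense */

    n = snprintf(out, size, "printk/%s/%d", file, line);
    if (n < 0 || (size_t)n >= size)
        return -1;                      /* error, or result was truncated */
    return n;
}
```

Given "kernel/sched.c" and 2180, the helper produces printk/kernel/sched.c/2180, matching the example from the patch posting. Since every call site gets its own name, the number of such entries grows with the number of trace_printk() calls in the kernel.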
That idea drew a grumble from Peter Zijlstra, who said that it would be painful to use in perf. One of the reasons for that has to do with how the perf API works: every event must be opened separately with a perf_event_open() call and managed as a separate file descriptor. If the number of events gets large, so does the number of open files which must be juggled.
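The file-descriptor cost is easy to see in a sketch of how a tool opens tracepoints with the perf interface. The C library provides no wrapper for perf_event_open(), so it is invoked through syscall(). The helper names used here (open_tracepoint(), open_tracepoints(), close_tracepoints()) are made up for illustration, and the numeric tracepoint IDs are assumed to have been read beforehand from each event's id file. Error handling is minimal; on many systems these calls will simply fail for unprivileged users.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one tracepoint event; returns a file descriptor or -1 on failure. */
static int open_tracepoint(unsigned long long id, pid_t pid, int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.size = sizeof(attr);
    attr.config = id;       /* tracepoint ID, assumed read from its id file */
    attr.disabled = 1;      /* start disabled; the tool enables it later */

    /* group_fd = -1 (no group), flags = 0 */
    return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0UL);
}

/*
 * Hypothetical helper showing the juggling act: n tracepoints require
 * n separate calls and leave the tool holding up to n descriptors.
 * Returns the number of events successfully opened.
 */
static int open_tracepoints(const unsigned long long *ids, int n, int *fds)
{
    int i, opened = 0;

    assert(ids != NULL && fds != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        fds[i] = open_tracepoint(ids[i], 0, -1);  /* this process, any CPU */
        if (fds[i] >= 0)
            opened++;
    }
    return opened;
}

/* Close whatever was opened and mark every slot as unused. */
static void close_tracepoints(int *fds, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (fds[i] >= 0)
            close(fds[i]);
        fds[i] = -1;
    }
}
```

With a handful of events this is manageable; with one event per trace_printk() call site, the array of descriptors could grow quite large.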
A potential solution also came from Peter,
in the form of a new "tracepoint collection" event for perf. This special
event will, when opened, collect no data at all, but it supports an
ioctl() call allowing tracepoints to be added to it. All
tracepoints associated with the collection event will report through the
same file descriptor, allowing tools to deal with multiple tracepoints in a
single stream of data. Peter says that the patch "is lightly tested and wants some serious testing/review before merging", but we may see this ABI addition become ready in time for 2.6.38.
Unprivileged tracepoints
Finally: access to tracepoints is currently limited to privileged users. Tracepoints provide a great deal of information about what is going on inside the kernel, so allowing anybody to watch them does not seem secure. There is a desire, though, to make some tracepoints generally available so that tools like trace can work in a non-privileged mode. Frederic Weisbecker has posted a patch which makes that possible.
Frederic's patch adds an optional TRACE_EVENT_FLAGS() declaration for tracepoints; currently, the only defined flag is TRACE_EVENT_FL_CAP_ANY, which grants access to unprivileged users. This flag has been applied to the system call tracepoints, allowing anybody to trace system calls - at least, when tracing is focused on a process they own.
In conclusion...
An obvious conclusion from all of the above is that there are still a lot of problems to be solved in the tracing area. The nature of the task is shifting, though. We now have significant tracing capabilities in place, and the developers involved have learned a lot about how the problem should (and should not) be solved. So we're no longer in the position of wondering how tracing can be done at all, and there no longer seems to be any trouble selling the concept of kernel visibility to developers. What needs to be done now is to develop the existing capability into something which is truly useful for the development community and beyond; that looks like a task which will keep developers busy for some time.
