For a long time, tracing was seen as one of the weaker points of the
Linux system. Things have changed dramatically over the last few years, to
the point that Linux has a number of interesting tracing interfaces. The
job is far from done, though, and there is not always agreement on how this
work should proceed. There have been a number of conversations related to
tracing recently; this article will survey some in an attempt to highlight
where the remaining challenges are.
The tracing ABI
Once upon a time, Linux had no tracing-oriented interfaces at all. Now,
instead, we have two: ftrace and perf events. Some types of information
are only available via the ftrace interface, others are only available from
perf, and some sorts of events can be obtained in either way. From the
discussions that have been happening for some time it's clear that neither
interface satisfies everybody's needs. In addition, there are other
subsystems waiting in the wings - LTTng and a recently proposed system
health subsystem, for example - which bring requirements of their own. The
last thing that the system needs is an even wider variety of tracing
interfaces; it would be nice, instead, to pull everything together into a
single, unified interface.
Almost everybody involved agrees on that point, but that is about where the
agreement stops. Your editor, unfortunately, missed the tempestuous
session at the Linux Plumbers Conference where a number of tracing
developers came to an agreement of sorts: a new ABI would be developed with
the explicit goal of being a unified tracing and event interface for the
system as a whole. This ABI would be kept out of the mainline until a
number of tools had been written to use it; only when it became clear that
everybody's needs were met would it be merged. Your editor talked to a
number of the people involved in that discussion; all seemed pleased with
the outcome.
Ftrace developer Steven Rostedt interpreted the
discussion as a mandate to develop an entirely new ABI for tracing:
I think if we take a step back, we can come up with a new
buffering/ABI system that can satisfy everyone. We will still
support the current method now, but I really don't think it is
designed with everything we had in mind. I do not envision that we
can "evolve" to where we want to be. We may have to bite the
bullet, just like iptables did when they saw the failures of
ipchains, and redesign something new now that we understand what
the requirements are.
LTTng developer Mathieu Desnoyers took things even further, posting a "tracing ABI work plan" for discussion. That
posting was poorly received, being seen as a document better suited to
managerial conference rooms - a perception which was not helped by
Mathieu's subsequent posting of a massive common trace format document which would make
a standards committee proud. Kernel developers, as always, would rather see
code than extensive design documents.
When the code comes, though, it seems that there will be resistance to the
idea of creating an entirely new tracing ABI. Thomas Gleixner has expressed
his dislike for both the current state of affairs and attempts to create
complex replacements; he is calling for a gradual move toward a better
interface. Ingo Molnar has said similar things:
Fact is that we have an ABI, happy users, happy tools and happy
developers, so going incrementally is important and allows us to
validate and measure every step while still having a full
tool-space in place - and it will help everyone, in addition to the
We'll need to embark on this incremental path instead of a
rewrite-the-world thing. As a maintainer my task is to say 'no' to
rewrite-the-world approaches - and we can and will do better here.
The existing ABI that Ingo likes, of course, is the perf interface. He
would clearly like to see all tracing and event reporting move to the perf
side of the house. The perf ABI, he says, is sufficiently extendable to
accommodate everybody's needs; there does not seem to be a lot of room for
negotiation on this point.
One of the conclusions reached at the 2010 Kernel Summit was that a small
set of system tracepoints would be designated "stable" and moved to a
separate location in the filesystem hierarchy. Tools using these
tracepoints would have a high level of assurance that things would not
change in future kernel releases; meanwhile, kernel developers could feel
free to add and use tracepoints elsewhere without worrying that they could
end up maintaining them forever. It seemed like an outcome that everybody
could live with.
Steven recently posted an implementation of stable tracepoints to carry
out that decision. His patch adds another
tricky macro (STABLE_EVENT()) which creates a stable tracepoint;
all such tracepoints are essentially a second, restricted view of an
existing "raw" tracepoint. That allows development-oriented tracepoints to
provide more information than is deemed suitable for a stable interface and
does not require cluttering the code with multiple tracepoint invocations.
There is also a new "eventfs" filesystem to host stable tracepoints which
is expected to be mounted on /sys/kernel/events. A small number
of core tracepoints have been marked as stable - just enough to show how
the mechanism works.
There were a number of complaints about eventfs, not the least of which
was Greg Kroah-Hartman's gripe that he had already written tracefs for just
this purpose. Ingo had a different complaint, though: he is pushing
an effort to distribute tracepoints throughout the sysfs hierarchy. The
current /sys/kernel/debug/tracing/events directory would not go
away (there are tools which depend on it), but future users of, say,
ext4-related tracepoints would be expected to look for them in
/sys/fs/ext4. It is an interesting idea which possibly makes good
sense, but it is somewhat orthogonal to Steven's stable tracepoint posting;
it doesn't address the stable/development distinction at all.
It eventually became clear that Ingo is opposed to the
concept of marking some tracepoints as stable. He is, instead, taking the
position that anything which is used by tools becomes part of the ABI, and
that an excess of tools using too many tracepoints is a problem we wish we
had. This opposition, needless to say, could make it hard to get the
stable tracepoint concept into the kernel.
Here we see one of the hazards of skipping important developer meetings.
The stable tracepoint discussion was expected to be one of the more
contentious sessions at the kernel summit; in the end, though, everybody
present seemed happy with the conclusion that was reached. But Ingo was
not present. His point of view was not heard there, and the community
believes it has reached consensus on something he apparently disagrees
with. If Ingo succeeds in overriding that consensus, then Steven might not
be the only person to express thoughts along these lines:
Hmm, seems that every decision that we came to agreement with at
Kernel Summit has been declined in practice. Makes me think that
Kernel Summit is pointless, and was a waste of my time.
That conversation has quieted for now, but it will almost certainly
return. If nothing else, some developers are determined to change
tracepoints when the need arises, so this issue can be expected to come up
again at some point. One possible source of conflict is the
recently-announced trace utility which, according to Ingo, has "no
conceptual restrictions" and will use tracepoints without regard for any
sort of "stable" designation.
One useful but little-used tracing-related tool is
trace_printk(). It can be called like printk() (though
without a logging level), but its output does not go to the system log;
instead, everything printed via this path goes into the tracing stream as
seen by ftrace. When tracing is off, trace_printk() calls have no
effect. When tracing is enabled, though, trace_printk() data can
be made available to a developer with far less overhead than normal
printk() output. That overhead can matter - the slowdown caused
by printk() calls is often enough to change timing-related
behavior, leading to "heisenbugs" which are difficult to track down.
Output from trace_printk() does not look like a normal kernel
event, though, so it is not available to the perf interface. Steven has
posted a patch to rectify that, at the cost
of potentially creating large numbers of new trace events. With this
patch, every trace_printk() call will create a new event under
.../events/printk/ based on the file name and line number. So, to use
Steven's example, a trace_printk() on line 2180 in kernel/sched.c would
show up in the events hierarchy as
.../events/printk/kernel/sched.c/2180. Each call could then be
enabled and disabled independently, just like ordinary tracepoints. It's a
convenient and understandable interface, but, if use of
trace_printk() ever takes off, it could lead to the creation of
large numbers of events.
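The naming scheme is easy to sketch in user space. The helper below is an illustration only, not code from Steven's patch: the function name is hypothetical, and the root directory is assumed to be the /sys/kernel/debug/tracing/events location mentioned earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed root of the events hierarchy; the patch description only
 * shows ".../events". */
#define EVENTS_ROOT "/sys/kernel/debug/tracing/events"

/*
 * Hypothetical helper: build the directory name of the event that a
 * trace_printk() call at file:line would create under this scheme.
 * Returns 0 on success, -1 if the name is empty or buf is too small.
 */
static int printk_event_path(char *buf, size_t len,
                             const char *file, int line)
{
    int n;

    assert(buf != NULL && file != NULL);
    if (strlen(file) == 0)
        return -1;

    n = snprintf(buf, len, "%s/printk/%s/%d", EVENTS_ROOT, file, line);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling printk_event_path() with "kernel/sched.c" and 2180 yields the path from Steven's example, prefixed with the assumed root. One directory per call site is what makes the event count grow with every trace_printk() added to the kernel.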
That idea drew a grumble from Peter
Zijlstra, who said that it would be painful to use in perf. One of the
reasons for that has to do with how the perf API works: every event must be
opened separately with a perf_event_open() call and managed as a
separate file descriptor. If the number of events gets large, so does the
number of open files which must be juggled.
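To make the cost concrete, here is a minimal user-space sketch of the one-descriptor-per-event pattern. The perf_event_open() system call and struct perf_event_attr are the real interface; the helper names and the choice of sampling fields are assumptions made for illustration. The tracepoint IDs are assumed to have been read from each event's "id" file beforehand.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill in the attribute structure for a single tracepoint. */
static void tracepoint_attr_init(struct perf_event_attr *attr,
                                 unsigned long long id)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_TRACEPOINT;
    attr->size = sizeof(*attr);
    attr->config = id;                   /* tracepoint ID */
    attr->sample_period = 1;             /* record every hit */
    attr->sample_type = PERF_SAMPLE_RAW; /* include the raw event data */
}

/* glibc has no wrapper for perf_event_open(), so use syscall(). */
static int tracepoint_open(unsigned long long id, pid_t pid, int cpu)
{
    struct perf_event_attr attr;

    tracepoint_attr_init(&attr, id);
    return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0UL);
}

/*
 * Open one file descriptor per tracepoint ID. Returns the number of
 * descriptors opened; on failure, closes those already opened and
 * returns -1.
 */
static int tracepoints_open_all(const unsigned long long *ids, int n,
                                pid_t pid, int *fds)
{
    int i;

    for (i = 0; i < n; i++) {
        fds[i] = tracepoint_open(ids[i], pid, -1);
        if (fds[i] < 0) {
            while (i-- > 0)
                close(fds[i]);
            return -1;
        }
    }
    return n;
}
```

A tool watching n events ends up holding n descriptors, each of which must be polled, read, and eventually closed; with one event per trace_printk() call, n could become large quickly.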
A potential solution also came from Peter,
in the form of a new "tracepoint collection" event for perf. This special
event will, when opened, collect no data at all, but it supports an
ioctl() call allowing tracepoints to be added to it. All
tracepoints associated with the collection event will report through the
same file descriptor, allowing tools to deal with multiple tracepoints in a
single stream of data. Peter says that the patch "is lightly tested
and wants some serious testing/review before merging," but we may
see this ABI addition become ready in time for 2.6.38.
Finally: access to tracepoints is currently limited to privileged users.
Tracepoints provide a great deal of information about what is going on
inside the kernel, so allowing anybody to watch them does not seem secure.
There is a desire, though, to make some tracepoints generally available so
that tools like trace can work in a non-privileged mode. Frederic
Weisbecker has posted a patch which makes that possible.
Frederic's patch adds an optional TRACE_EVENT_FLAGS() declaration
for tracepoints; currently, the only defined flag is
TRACE_EVENT_FL_CAP_ANY, which grants access to unprivileged
users. This flag has been applied to the system call tracepoints, allowing
anybody to trace system calls - at least, when tracing is focused on a
process they own.
An obvious conclusion from all of the above is that there are still a lot
of problems to be solved in the tracing area. The nature of the task is
shifting, though. We now have significant tracing capabilities in place,
and the developers involved have learned a lot about how the problem should
(and should not) be solved. So we're no longer in the position of
wondering how tracing can be done at all, and there no longer seems to be
any trouble selling the concept of kernel visibility to developers. What
needs to be done now is to develop the existing capability into something
which is truly useful for the development community and beyond; that looks
like a task which will keep developers busy for some time.