By Jonathan Corbet
September 30, 2008
Kernel and user-space tracing were heavily discussed at both the kernel
summit and the Linux Plumbers Conference. Attendees did not emerge from
those discussions with any sort of comprehensive vision of how the tracing
problem will be solved; there is not, yet, a consensus on that point. But
one clear message did come out: we may end up with several different
tracing mechanisms in the kernel, but there is no patience for redundant
low-level tracing buffer implementations. All of the potential tracing
frameworks are going to have to find a way to live with a single mechanism
for collecting trace data and getting it to user space.
This conclusion may look like a way of diverting attention from the
intractable problems at the higher levels and, instead, focusing everybody
on something so low-level that the real issues disappear. There may be
some truth to that. It is also true, though, that there is no call for
duplicating the same sort of machinery across several different tracing
frameworks; coming up with a common solution to this part of the problem
can only lead to a better kernel
in the long run. But there is another objective here which is just as
important: having all the tracing frameworks using a single buffer allows
them to be used together. It is not hard to imagine a future tracing tool
integrating information gathered with simultaneous use of ftrace, LTTng,
SystemTap, and other tracing tools that have not been written yet. Having
all of those tools using the same low-level plumbing should make that
integration easier.
With that in mind, Steven Rostedt set out to create a new, unified tracing
buffer; as of this writing, that patch is already up to its tenth iteration. A casual perusal of the patch might well leave a reader confused: 2000 lines of relatively complex code to implement what is, in the end, just a circular buffer.
This circular buffer is not even
suitable for use by tracing frameworks yet; a separate "tracing" layer is to
be added for that. The key point here is that, with tracing code,
efficiency is crucially important. One of the main use cases for tracing
is to debug performance problems in highly stressed production
environments. A heavyweight tracing mechanism will create an observer
effect which can obscure the situation which called for tracing in the
first place, disrupt the production use of the system, or both. To be
accepted, a tracing framework must have the smallest possible impact on the
system.
So the unified trace buffer patch applies just about every known trick to
limit its runtime cost. The circular buffer is actually a set of per-CPU
buffers, each of which allows lockless addition and consumption of events.
The event format is highly compact, and
every effort is made to avoid copying event data, ever. Rather than maintain a
separate structure to track the contents of an individual page in the
buffer, the patch employs yet another overloaded variant of struct
page in the system memory map. (Your editor would not want to be the
next luckless developer who has to modify struct page and, in the
process, track down and fix all of the tricky
not-really-struct-page uses throughout the kernel.) And so on.
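For the curious, lockless addition of this sort is usually built around a reserve-then-commit pattern: a writer atomically reserves space in its own CPU's buffer, fills the event in, then marks it committed. The toy, user-space sketch below (with entirely invented names) illustrates just the reservation step; it ignores wrap-around, page-sized sub-buffers, and the commit step that a real implementation needs so that a reader never sees a half-written event.

/*
 * Toy illustration of lockless reservation in a per-CPU buffer.
 * This is not the kernel code; all names are invented.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE 4096

struct toy_cpu_buffer {
        _Atomic uint32_t head;          /* offset of the next free byte */
        uint8_t data[TOY_BUF_SIZE];
};

/*
 * Reserve len bytes.  Only code running on the owning CPU writes to
 * this buffer, but an interrupt can nest inside a write, so the
 * reservation still uses a compare-and-swap loop rather than a plain
 * increment.
 */
static void *toy_reserve(struct toy_cpu_buffer *buf, uint32_t len)
{
        uint32_t old = atomic_load(&buf->head);
        uint32_t new;

        do {
                new = old + len;
                if (new > TOY_BUF_SIZE)
                        return NULL;    /* full; no wrap-around handling here */
        } while (!atomic_compare_exchange_weak(&buf->head, &old, new));

        return buf->data + old;
}

/* Write one event: reserve space, then copy the data into place. */
static int toy_write(struct toy_cpu_buffer *buf, const void *event, uint32_t len)
{
        void *slot = toy_reserve(buf, len);

        if (!slot)
                return -1;
        memcpy(slot, event, len);
        return 0;
}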
The patch itself does a fairly good job of describing the trace buffer API;
that discussion will not be repeated here. It is worth taking a quick look
at the low-level event format, though:
struct ring_buffer_event {
        u32 type:2, len:3, time_delta:27;
        u32 array[];
};
This format was driven by the desire to keep the per-event overhead as
small as possible, so there is a single 32-bit word of header information.
Here, type is the type of the event, len is its length
(except when it's not, see below), time_delta is a time
offset value, and array contains the actual event data.
There are four types of events; one of them (RINGBUF_TYPE_PADDING)
is just a way of filling out empty space at the end of a page. Normal
events generated by the tracing system (RINGBUF_TYPE_DATA) have a
length given by the len field, which holds the data length shifted
right by two bits; the actual length is thus len*4, for a maximum of
28 bytes of data (32 bytes including the header word), which is not very
long. For longer events, len is set to zero and the first word of the
array field contains the real length.
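Putting those rules together, a reader could recover the data length of an event with something like the sketch below; the helper name is hypothetical, but the encoding follows the description above.

/*
 * Hypothetical reader-side helper: recover the data length of a
 * RINGBUF_TYPE_DATA event from the header word described above.
 */
static inline unsigned int event_data_length(const struct ring_buffer_event *event)
{
        /* len holds the data length shifted right by two bits... */
        if (event->len)
                return event->len << 2;
        /*
         * ...unless the event is too long, in which case len is zero
         * and the real length lives in the first word of array[].
         */
        return event->array[0];
}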
The other two event types have to do with time stamps. Over the course of
the discussion, it became clear that high-resolution timing information is
needed with all events, for two reasons. The recording of events into
per-CPU arrays, while essential for performance, does have the effect of
separating events which are related in time; the addition of precise
timekeeping will allow events to be collated in the proper order. That
collation could be handled through some sort of serial counter, but some
performance issues can only be understood by looking closely at the precise
timing of specific events. So events need to have real time data, at the highest
resolution which is practical.
Just how that data will be recorded is still unclear, and may end up being
architecture dependent. Some systems may use timestamp counter data
directly, while others may be able to provide real times in nanoseconds.
Whatever format turns out to be used, there is no doubt that it will
require 64 bits of storage. But most of the time data is redundant between
any two events, so there is no real desire to add a full 64-bit time stamp
to every event in the stream. The compromise which was reached was to
store the amount of time which passes between one event and the next in the
27 bits allotted. Should the time delta be too large to fit in that space,
the trace buffer code will insert an artificial event (of type
RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space.
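The description above does not say exactly how a TIME_EXTENT event carries the extra bits; one plausible layout, assumed purely for illustration here, puts the low 27 bits of the delta in time_delta and the remaining upper bits in array[0]. A reader would then reconstruct absolute times by accumulating deltas as it walks the buffer, along these lines:

/*
 * Reader-side sketch of timestamp reconstruction.  The TIME_EXTENT
 * layout assumed here (low 27 bits in time_delta, upper bits in
 * array[0]) is an assumption made for illustration, as is the helper.
 */
#define TOY_TS_SHIFT    27

static void toy_account_time(const struct ring_buffer_event *event, u64 *timestamp)
{
        u64 delta = event->time_delta;

        if (event->type == RINGBUF_TYPE_TIME_EXTENT)
                delta |= (u64)event->array[0] << TOY_TS_SHIFT;

        /* *timestamp is the running time for this per-CPU buffer */
        *timestamp += delta;
}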
The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to
help keep the buffer timestamps in sync." This little bit of functionality
has not yet been implemented, though.
The rate of change of the trace buffer code appears to be slowing somewhat
as comments from various directions are addressed; it may be getting close
to its final form. Then it will be a matter of implementing the
higher-level protocols on top of it. In the meantime, though, the
attentive reader may be wondering: what about relayfs? The relay code has
been in the kernel for years, and it was intended to solve just this kind
of problem.
The most direct (if not the most politic) answer to that question was probably posted by
Peter Zijlstra:
Dude, relayfs is such a bad performing mess that extending it seems
like a bad idea. Better to write something new and delete
everything relayfs related.
Deleting relayfs would not be that hard; there are only a couple of users,
currently. But relayfs developer Tom Zanussi is not convinced that the problems with
relayfs are severe enough to justify tossing it out and starting over. He
has posted a series of patches cleaning up
the relayfs API and addressing some of its performance problems. At this
point, though, it is not clear that anybody is really looking at that work;
it has not received much in the way of comments.
One way or the other, the kernel seems set to have a low-level trace buffer
implementation in place soon. That just leaves a few other little problems
to solve, including making dynamic tracing work, instrumenting the kernel
with static trace points, implementing user-space tracing, etc. Working
those issues out is likely to take a while, and it is likely to result in a
few different tracing solutions aimed at different needs. But we'll have
the low-level plumbing, and that's a start.