Low-level tracing plumbing
This conclusion may look like a way of diverting attention from the intractable problems at the higher levels and, instead, focusing everybody on something so low-level that the real issues disappear. There may be some truth to that. It is also true, though, that there is no call for duplicating the same sort of machinery across several different tracing frameworks; coming up with a common solution to this part of the problem can only lead to a better kernel in the long run. But there is another objective here which is just as important: having all the tracing frameworks using a single buffer allows them to be used together. It is not hard to imagine a future tracing tool integrating information gathered with simultaneous use of ftrace, LTTng, SystemTap, and other tracing tools that have not been written yet. Having all of those tools using the same low-level plumbing should make that integration easier.
With that in mind, Steven Rostedt set out to create a new, unified tracing buffer; as of this writing, that patch was already up to its tenth iteration. A casual perusal of the patch might well leave a reader confused; 2000 lines of relatively complex code to implement what is, in the end, just a circular buffer. This circular buffer is not even suitable for use by tracing frameworks yet; a separate "tracing" layer is to be added for that. The key point here is that, with tracing code, efficiency is crucially important. One of the main use cases for tracing is to debug performance problems in highly stressed production environments. A heavyweight tracing mechanism will create an observer effect which can obscure the situation which called for tracing in the first place, disrupt the production use of the system, or both. To be accepted, a tracing framework must have the smallest possible impact on the system.
So the unified trace buffer patch applies just about every known trick to limit its runtime cost. The circular buffer is actually a set of per-CPU buffers, each of which allows lockless addition and consumption of events. The event format is highly compact, and every effort is made to avoid copying it, ever. Rather than maintain a separate structure to track the contents of an individual page in the buffer, the patch employs yet another overloaded variant of struct page in the system memory map. (Your editor would not want to be the next luckless developer who has to modify struct page and, in the process, track down and fix all of the tricky not-really-struct-page uses throughout the kernel). And so on.
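To see why the per-CPU design eliminates locking, consider this deliberately simplified sketch (not code from the patch; struct cpu_ring and ring_reserve() are invented for illustration, and the real code must also cope with page boundaries, interrupts, and concurrent readers):

    /* One ring per CPU; only that CPU ever writes to it. */
    struct cpu_ring {
        unsigned long head;   /* next write offset */
        unsigned long mask;   /* buffer size - 1; size is a power of two */
        char *data;           /* the buffer pages */
    };

    /*
     * Reserve "length" bytes for a new event.  No other CPU ever writes
     * to this ring, so there is no lock to take; keeping local
     * interrupts and preemption at bay is all the writer needs.
     */
    static void *ring_reserve(struct cpu_ring *ring, unsigned int length)
    {
        void *slot = ring->data + (ring->head & ring->mask);

        ring->head += length;  /* wrapping the offset keeps it circular */
        return slot;
    }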
The patch itself does a fairly good job of describing the trace buffer API; that discussion will not be repeated here. It is worth taking a quick look at the low-level event format, though:
    struct ring_buffer_event {
        u32 type:2, len:3, time_delta:27;
        u32 array[];
    };
This format was driven by the desire to keep the per-event overhead as small as possible, so there is a single 32-bit word of header information. Here, type is the type of the event, len is its length (except when it's not, see below), time_delta is a time offset value, and array contains the actual event data.
There are four types of events; one of them (RINGBUF_TYPE_PADDING) is just a way of filling out empty space at the end of a page. Normal events generated by the tracing system (RINGBUF_TYPE_DATA) have a length given by the len field, which is right-shifted by two bits. So the maximum event length is 28 bytes (32 bytes minus four for the header word), which is not very long. For longer events, len is set to zero and the first word of the array field contains the real length.
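As a quick illustration of those length rules (a standalone sketch, not code from the patch; the event_length() helper is invented for this example, and the header word is redeclared with C99 types to keep it self-contained):

    #include <stdint.h>

    struct ring_buffer_event {
        uint32_t type:2, len:3, time_delta:27;
        uint32_t array[];
    };

    /* Recover the data length of a RINGBUF_TYPE_DATA event. */
    static uint32_t event_length(const struct ring_buffer_event *event)
    {
        if (event->len)                 /* stored right-shifted by two bits... */
            return event->len << 2;     /* ...so shift back: at most 28 bytes */
        return event->array[0];         /* len == 0: real length in array[0] */
    }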
The other two event types have to do with time stamps. Over the course of the discussion, it became clear that high-resolution timing information is needed with all events, for two reasons. The recording of events into per-CPU arrays, while essential for performance, does have the effect of separating events which are related in time; the addition of precise timekeeping will allow events to be collated in the proper order. That collation could be handled through some sort of serial counter, but some performance issues can only be understood by looking closely at the precise timing of specific events. So events need to have real time data, at the highest resolution which is practical.
Just how that data will be recorded is still unclear, and may end up being architecture dependent. Some systems may use timestamp counter data directly, while others may be able to provide real times in nanoseconds. Whatever format turns out to be used, there is no doubt that it will require 64 bits of storage. But most of the time data is redundant between any two events, so there is no real desire to add a full 64-bit time stamp to every event in the stream. The compromise which was reached was to store the amount of time which passes between one event and the next in the 27 bits allotted. Should the time delta be too large to fit in that space, the trace buffer code will insert an artificial event (of type RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space.
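To get a feel for the numbers (a hypothetical sketch; TS_DELTA_BITS, TS_DELTA_MAX, and the emit_time_extend() helper are invented for this example): at nanosecond resolution, 27 bits can represent deltas of up to 2^27 ns, roughly 134ms, so the artificial extend events should be rare in a busy trace stream.

    #include <stdint.h>

    #define TS_DELTA_BITS 27
    #define TS_DELTA_MAX  ((1u << TS_DELTA_BITS) - 1)  /* ~134ms at 1ns resolution */

    /* Hypothetical helper: writes a RINGBUF_TYPE_TIME_EXTENT event
       carrying the high-order bits of an oversized delta. */
    void emit_time_extend(uint64_t high_bits);

    /* Pack a delta into 27 bits, spilling the high bits into an extend event. */
    static uint32_t pack_time_delta(uint64_t delta)
    {
        if (delta > TS_DELTA_MAX) {
            emit_time_extend(delta >> TS_DELTA_BITS);
            delta &= TS_DELTA_MAX;
        }
        return (uint32_t)delta;  /* fits in the time_delta bitfield */
    }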
The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to help keep the buffer timestamps in sync." This little bit of functionality has not yet been implemented, though.
The rate of change of the trace buffer code appears to be slowing somewhat as comments from various directions are addressed; it may be getting close to its final form. Then it will be a matter of implementing the higher-level protocols on top of it. In the meantime, though, the attentive reader may be wondering: what about relayfs? The relay code has been in the kernel for years, and it was intended to solve just this kind of problem.
The most direct (if not most politic) answer to that question was probably posted by Peter Zijlstra, whose suggestion was, in essence, to simply delete it.
Deleting relayfs would not be that hard; there are only a couple of users, currently. But relayfs developer Tom Zanussi is not convinced that the problems with relayfs are severe enough to justify tossing it out and starting over. He has posted a series of patches cleaning up the relayfs API and addressing some of its performance problems. At this point, though, it is not clear that anybody is really looking at that work; it has not received much in the way of comments.
One way or the other, the kernel seems set to have a low-level trace buffer implementation in place soon. That just leaves a few other little problems to solve, including making dynamic tracing work, instrumenting the kernel with static trace points, implementing user-space tracing, etc. Working those issues out is likely to take a while, and it is likely to result in a few different tracing solutions aimed at different needs. But we'll have the low-level plumbing, and that's a start.
Index entries for this article
Kernel: Development tools/Kernel tracing
Kernel: Tracing
I'm no hardware nut, but..
Posted Oct 2, 2008 1:45 UTC (Thu) by dw (subscriber, #12017) [Link] (8 responses)

I'd have thought either of these would be faster:

    struct { uint32_t; uint32_t; uint32_t; }

or:

    struct { uint8_t; uint8_t; uint16_t; }

Side note: has there been any discussion of moving the kernel to the use of C99 fixed-width integer types? It's only been almost a decade. Having worked on a C99 project for a while, the in-kernel type names now look pesky. :)

Looking at (unoptimised Apple) gcc 4 output makes me think the struct from the article may not be so optimal.. http://pastebin.com/f33132559

I'm no hardware nut, but..
Posted Oct 2, 2008 2:52 UTC (Thu) by felixfix (subscriber, #242) [Link]

I'm no hardware nut, but..
Posted Oct 2, 2008 3:51 UTC (Thu) by mbligh (subscriber, #7720) [Link]

> performance.

Yes. We want to pack as much data into the buffer as possible.

> struct { uint32_t; uint32_t; uint32_t; }

That takes 3 times as much space! You're roughly halving the amount of useful data we can store in the trace buffer.

> struct { uint8_t; uint8_t; uint16_t; }

That doesn't leave enough space for the time counter. It's TSC stamping at roughly 2-4GHz, and we need accurate resolution to merge the per-cpu buffers back together. We don't want to log extended timestamp events (whenever your time counter rolls over) too often.

I'm no hardware nut, but..
Posted Oct 2, 2008 9:35 UTC (Thu) by xav (guest, #18536) [Link] (4 responses)

I'm no hardware nut, but..
Posted Oct 2, 2008 17:01 UTC (Thu) by Oddscurity (guest, #46851) [Link] (3 responses)

Having dense fields like this also means a lot more of these will fit in cache.

I'm no hardware nut, but..
Posted Oct 2, 2008 21:51 UTC (Thu) by fuhchee (guest, #40059) [Link] (2 responses)

OTOH, Linus wanted people to be able to use this tracing stuff with just a cat/grep, which seems to eschew *requiring* special userspace tools for unpacking.

I'm no hardware nut, but..
Posted Oct 3, 2008 23:31 UTC (Fri) by tbird20d (subscriber, #1901) [Link] (1 responses)

I'm no hardware nut, but..
Posted Oct 3, 2008 23:43 UTC (Fri) by fuhchee (guest, #40059) [Link]

> collating data can be done in userspace, and possibly offline after enough
> trace data's been collected.

If the kernel isn't going to host huge history buffers, then conversions will in general have to be done essentially online, probably deferred only a bit.

Low-level tracing plumbing
Posted Oct 2, 2008 22:12 UTC (Thu) by ndk (subscriber, #43509) [Link] (3 responses)

"right-shifted by two bits" should read "left-shifted".

Right-shifted
Posted Oct 3, 2008 4:00 UTC (Fri) by corbet (editor, #1) [Link] (2 responses)

In fact, the value stored in len has indeed been right-shifted by two bits. It needs to be left-shifted to be interpreted. Apologies if I wrote it in a way that was less than entirely clear, but I believe it was correct.

Right-shifted
Posted Oct 3, 2008 18:53 UTC (Fri) by ndk (subscriber, #43509) [Link]

Right-shifted
Posted Oct 10, 2008 20:46 UTC (Fri) by sethml (guest, #8471) [Link]

Something like "the len field contains the length with the two lowest bits lopped off" would be clearer.