
Low-level tracing plumbing

By Jonathan Corbet
September 30, 2008
Kernel and user-space tracing were heavily discussed at both the kernel summit and the Linux Plumbers Conference. Attendees did not emerge from those discussions with any sort of comprehensive vision of how the tracing problem will be solved; there is not, yet, a consensus on that point. But one clear message did come out: we may end up with several different tracing mechanisms in the kernel, but there is no patience for redundant low-level tracing buffer implementations. All of the potential tracing frameworks are going to have to find a way to live with a single mechanism for collecting trace data and getting it to user space.

This conclusion may look like a way of diverting attention from the intractable problems at the higher levels and, instead, focusing everybody on something so low-level that the real issues disappear. There may be some truth to that. It is also true, though, that there is no call for duplicating the same sort of machinery across several different tracing frameworks; coming up with a common solution to this part of the problem can only lead to a better kernel in the long run. But there is another objective here which is just as important: having all the tracing frameworks using a single buffer allows them to be used together. It is not hard to imagine a future tracing tool integrating information gathered with simultaneous use of ftrace, LTTng, SystemTap, and other tracing tools that have not been written yet. Having all of those tools using the same low-level plumbing should make that integration easier.

With that in mind, Steven Rostedt set out to create a new, unified tracing buffer; as of this writing, the patch is up to its tenth iteration. A casual perusal of the patch might well leave a reader confused: 2000 lines of relatively complex code to implement what is, in the end, just a circular buffer. This circular buffer is not even suitable for use by tracing frameworks yet; a separate "tracing" layer is to be added for that. The key point here is that, with tracing code, efficiency is crucially important. One of the main use cases for tracing is to debug performance problems in highly stressed production environments. A heavyweight tracing mechanism will create an observer effect which can obscure the situation which called for tracing in the first place, disrupt the production use of the system, or both. To be accepted, a tracing framework must have the smallest possible impact on the system.

So the unified trace buffer patch applies just about every known trick to limit its runtime cost. The circular buffer is actually a set of per-CPU buffers, each of which allows lockless addition and consumption of events. The event format is highly compact, and every effort is made to avoid copying it, ever. Rather than maintain a separate structure to track the contents of an individual page in the buffer, the patch employs yet another overloaded variant of struct page in the system memory map. (Your editor would not want to be the next luckless developer who has to modify struct page and, in the process, track down and fix all of the tricky not-really-struct-page uses throughout the kernel). And so on.
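
The lockless scheme in the actual patch is considerably more intricate, since it must handle writers interrupted by other writers, overwrite mode, and the struct page trickery described above. Purely as an illustration of why per-CPU ownership helps, though, here is a minimal single-producer, single-consumer ring in user-space C11; every name in it is invented for this sketch:

    #include <stdatomic.h>
    #include <stdint.h>

    #define NCPUS	4
    #define BUF_SIZE	4096	/* bytes per CPU; must be a power of two */

    struct cpu_buffer {
	_Atomic uint32_t head;	/* next write offset; advanced only by the owning CPU */
	_Atomic uint32_t tail;	/* next read offset; advanced only by the reader */
	uint8_t data[BUF_SIZE];
    };

    static struct cpu_buffer buffers[NCPUS];

    /* Writer: runs only on the owning CPU (with preemption off, in the
     * kernel case), so no lock is needed; head is shared with the reader
     * but written only here. */
    static int cpu_buffer_write(int cpu, const void *event, uint32_t len)
    {
	struct cpu_buffer *b = &buffers[cpu];
	uint32_t head = atomic_load_explicit(&b->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&b->tail, memory_order_acquire);

	if (BUF_SIZE - (head - tail) < len)
		return -1;	/* full: drop the event */

	for (uint32_t i = 0; i < len; i++)
		b->data[(head + i) & (BUF_SIZE - 1)] = ((const uint8_t *)event)[i];

	/* Publish the bytes before moving head. */
	atomic_store_explicit(&b->head, head + len, memory_order_release);
	return 0;
    }

    /* Reader: the mirror image, advancing tail with a release store. */
    static int cpu_buffer_read(int cpu, void *out, uint32_t len)
    {
	struct cpu_buffer *b = &buffers[cpu];
	uint32_t tail = atomic_load_explicit(&b->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&b->head, memory_order_acquire);

	if (head - tail < len)
		return -1;	/* not enough data yet */

	for (uint32_t i = 0; i < len; i++)
		((uint8_t *)out)[i] = b->data[(tail + i) & (BUF_SIZE - 1)];

	atomic_store_explicit(&b->tail, tail + len, memory_order_release);
	return 0;
    }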

The patch itself does a fairly good job of describing the trace buffer API; that discussion will not be repeated here. It is worth taking a quick look at the low-level event format, though:

    struct ring_buffer_event {
	u32	type:2,		/* what kind of event this is */
		len:3,		/* length of the event data, in four-byte units */
		time_delta:27;	/* time offset from the previous event */
	u32	array[];	/* the event data (or its real length; see below) */
    };

This format was driven by the desire to keep the per-event overhead as small as possible, so there is a single 32-bit word of header information. Here, type is the type of the event, len is its length (except when it's not, see below), time_delta is a time offset value, and array contains the actual event data.

There are four types of events; one of them (RINGBUF_TYPE_PADDING) is just a way of filling out empty space at the end of a page. Normal events generated by the tracing system (RINGBUF_TYPE_DATA) have a length given by the len field, which is right-shifted by two bits. So the maximum event length is 28 bytes (32 bytes minus four for the header word), which is not very long. For longer events, len is set to zero and the first word of the array field contains the real length.
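
To make the length rules concrete, here is a sketch of how a reader might recover an event's data length. The enum values and helper name are invented for illustration (the patch defines its own constants), and the structure is repeated so the fragment stands alone:

    #include <stdint.h>

    typedef uint32_t u32;

    /* The four event types named in the text; the numeric values here
     * are purely illustrative, not taken from the patch. */
    enum {
	RINGBUF_TYPE_PADDING,
	RINGBUF_TYPE_TIME_EXTENT,
	RINGBUF_TYPE_TIME_STAMP,
	RINGBUF_TYPE_DATA,
    };

    struct ring_buffer_event {
	u32	type:2, len:3, time_delta:27;
	u32	array[];
    };

    /* Length, in bytes, of a data event's payload. */
    static u32 event_data_length(const struct ring_buffer_event *ev)
    {
	if (ev->len)
		return ev->len << 2;	/* len stores the length in 4-byte units */
	return ev->array[0];		/* len == 0: real length in the first data word */
    }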

The other two event types have to do with time stamps. Over the course of the discussion, it became clear that high-resolution timing information is needed with all events, for two reasons. The recording of events into per-CPU arrays, while essential for performance, does have the effect of separating events which are related in time; the addition of precise timekeeping will allow events to be collated in the proper order. That collation could be handled through some sort of serial counter, but some performance issues can only be understood by looking closely at the precise timing of specific events. So events need to have real time data, at the highest resolution which is practical.
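
As a sketch of what that collation might look like in a user-space viewer: repeatedly emit whichever per-CPU stream has the pending event with the smallest reconstructed timestamp. The helper functions here are hypothetical, standing in for whatever decoding layer eventually sits on top of the buffer:

    #include <stdbool.h>
    #include <stdint.h>

    #define NCPUS	4

    struct event;	/* an already-decoded event; opaque here */

    /* Hypothetical helpers provided by the decoding layer. */
    extern bool peek_event(int cpu, struct event **ev);
    extern uint64_t event_time(const struct event *ev);
    extern void consume_event(int cpu);
    extern void emit(const struct event *ev);

    /* N-way merge of the per-CPU streams into one time-ordered stream. */
    static void collate(void)
    {
	for (;;) {
		int best = -1;
		uint64_t best_time = UINT64_MAX;
		struct event *ev, *best_ev = NULL;

		for (int cpu = 0; cpu < NCPUS; cpu++) {
			if (peek_event(cpu, &ev) && event_time(ev) < best_time) {
				best = cpu;
				best_time = event_time(ev);
				best_ev = ev;
			}
		}
		if (best < 0)
			break;		/* all streams drained */
		emit(best_ev);
		consume_event(best);
	}
    }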

Just how that data will be recorded is still unclear, and may end up being architecture-dependent. Some systems may use timestamp counter data directly, while others may be able to provide real times in nanoseconds. Whatever format is used, there is no doubt that it will require 64 bits of storage. But most of that time data is redundant between any two events, so there is no real desire to add a full 64-bit time stamp to every event in the stream. The compromise reached was to store, in the 27 bits allotted, the amount of time which passes between one event and the next. Should the time delta be too large to fit in that space, the trace buffer code will insert an artificial event (of type RINGBUF_TYPE_TIME_EXTENT) to provide the necessary storage space.
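
For scale: even if raw cycle counts from a 4GHz TSC were stored directly, 2^27 counts is roughly 33ms between events, so extent events should be rare. What follows is a sketch of what the writer side might do; the helpers are invented for the illustration, and the real patch may split an oversized delta differently:

    #include <stdint.h>

    #define DELTA_BITS	27
    #define DELTA_MAX	((1ULL << DELTA_BITS) - 1)

    static uint64_t last_stamp;	/* one per CPU in reality */

    extern uint64_t read_clock(void);			/* hypothetical time source */
    extern void emit_time_extent(uint64_t delta);	/* hypothetical */
    extern void emit_data_event(uint32_t delta, const void *data, uint32_t len);

    static void trace_write(const void *data, uint32_t len)
    {
	uint64_t now = read_clock();
	uint64_t delta = now - last_stamp;

	if (delta > DELTA_MAX) {
		/* Too big for the 27-bit field: park the full delta in an
		 * artificial time-extent event, then log the data event
		 * with a zero delta. */
		emit_time_extent(delta);
		delta = 0;
	}
	emit_data_event((uint32_t)delta, data, len);
	last_stamp = now;
    }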

The final event type (RINGBUF_TYPE_TIME_STAMP) "will hold data to help keep the buffer timestamps in sync." This little bit of functionality has not yet been implemented, though.

The rate of change of the trace buffer code appears to be slowing somewhat as comments from various directions are addressed; it may be getting close to its final form. Then it will be a matter of implementing the higher-level protocols on top of it. In the meantime, though, the attentive reader may be wondering: what about relayfs? The relay code has been in the kernel for years, and it was intended to solve just this kind of problem.

The most direct (if not most politic) answer to that question was probably posted by Peter Zijlstra:

Dude, relayfs is such a bad performing mess that extending it seems like a bad idea. Better to write something new and delete everything relayfs related.

Deleting relayfs would not be that hard; there are only a couple of users, currently. But relayfs developer Tom Zanussi is not convinced that the problems with relayfs are severe enough to justify tossing it out and starting over. He has posted a series of patches cleaning up the relayfs API and addressing some of its performance problems. At this point, though, it is not clear that anybody is really looking at that work; it has not received much in the way of comments.

One way or the other, the kernel seems set to have a low-level trace buffer implementation in place soon. That just leaves a few other little problems to solve, including making dynamic tracing work, instrumenting the kernel with static trace points, implementing user-space tracing, etc. Working those issues out is likely to take a while, and it is likely to result in a few different tracing solutions aimed at different needs. But we'll have the low-level plumbing, and that's a start.




I'm no hardware nut, but..

Posted Oct 2, 2008 1:45 UTC (Thu) by dw (subscriber, #12017) [Link] (8 responses)

It seems to me that the struct is better optimised for size than for performance. Can gcc really generate speedy code for those kinds of odd-shaped bitfields?

I'd have thought either of these would be faster:

struct { uint32_t; uint32_t; uint32_t; }
struct { uint8_t; uint8_t; uint16_t; }

Side note: has there been any discussion of moving the kernel to the use of C99 fixed-width integer types? It's only been almost a decade. Having worked on a C99 project for a while, the in-kernel type names now look pesky. :)

Looking at (unoptimised Apple) gcc 4 output makes me think the struct from the article may not be so optimal.. http://pastebin.com/f33132559

I'm no hardware nut, but..

Posted Oct 2, 2008 2:52 UTC (Thu) by felixfix (subscriber, #242) [Link]

I'd guess it's better to have more events or to use less memory for the same number of events. The few cycles spent twiddling bits are a much better bargain than eating up memory, and will pay for themselves in reduced memory impact anyway. I have written (long long ago) similar traces and my primary goal was ALWAYS as many trace events as possible.

I'm no hardware nut, but..

Posted Oct 2, 2008 3:51 UTC (Thu) by mbligh (subscriber, #7720) [Link]

> It seems to me that the struct is better optimised for size than it is
> performance.

Yes. We want to pack as much data into the buffer as possible.

> struct { uint32_t; uint32_t; uint32_t; }

That takes 3 times as much space! You're roughly halving the amount of useful data we can store in the trace buffer.

> struct { uint8_t; uint8_t; uint16_t; }

That doesn't leave enough space for the time counter. It's TSC stamping at roughly 2-4GHz, and we need accurate resolution to merge the per-cpu buffers back together. We don't want to log extended timestamp events (whenever your time counter rolls over) too often.

I'm no hardware nut, but..

Posted Oct 2, 2008 9:35 UTC (Thu) by xav (guest, #18536) [Link] (4 responses)

Nowadays, optimizing for size is what counts the most. The few cycles you spend packing and unpacking data (or recomputing it in some cases) are well repaid by the many cycles you save waiting on the memory controller.

I'm no hardware nut, but..

Posted Oct 2, 2008 17:01 UTC (Thu) by Oddscurity (guest, #46851) [Link] (3 responses)

Not only that, but it'll possibly be just the packing. Unpacking and collating data can be done in userspace, and possibly offline after enough trace data's been collected.

Having dense fields like this also means a lot more of these will fit in cache.

I'm no hardware nut, but..

Posted Oct 2, 2008 21:51 UTC (Thu) by fuhchee (guest, #40059) [Link] (2 responses)

> Not only that, but it'll possibly be just the packing. Unpacking and
> collating data can be done in userspace, and possibly offline after enough
> trace data's been collected.

OTOH, Linus wanted people to be able to use this tracing stuff with just a cat/grep, which seems to eschew *requiring* special userspace tools for unpacking.

I'm no hardware nut, but..

Posted Oct 3, 2008 23:31 UTC (Fri) by tbird20d (subscriber, #1901) [Link] (1 responses)

It could still be kernel code doing the unpacking, just not at trace time. This preserves the ability to use cat & grep, while still avoiding the unpacking overhead during the trace itself.

I'm no hardware nut, but..

Posted Oct 3, 2008 23:43 UTC (Fri) by fuhchee (guest, #40059) [Link]

> It could still be kernel code doing the unpacking, just not at trace-time.

If the kernel isn't going to host huge history buffers, then conversions will in general have to be done essentially online, probably deferred only a bit.

Re: the side note

Posted Oct 5, 2008 20:48 UTC (Sun) by dw (subscriber, #12017) [Link]

Turns out there are good reasons not to go with C99 stdint types in the kernel (namespace issues). Thread starting here.

Low-level tracing plumbing

Posted Oct 2, 2008 22:12 UTC (Thu) by ndk (subscriber, #43509) [Link] (3 responses)

> a length given by the len field, which is right-shifted by two bits

should read "left-shifted".

Right-shifted

Posted Oct 3, 2008 4:00 UTC (Fri) by corbet (editor, #1) [Link] (2 responses)

In fact, the value stored in len has indeed been right-shifted by two bits. It needs to be left-shifted to be interpreted. Apologies if I wrote it in a way that was less than entirely clear, but I believe it was correct.

Right-shifted

Posted Oct 3, 2008 18:53 UTC (Fri) by ndk (subscriber, #43509) [Link]

Yep, I misread what you wrote.

Right-shifted

Posted Oct 10, 2008 20:46 UTC (Fri) by sethml (guest, #8471) [Link]

It depends on what the meaning of "is" is. :)

Something like "the len field contains the length with the two lowest bits lopped off" would be clearer.


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds