Conditional tracepoints

By Jonathan Corbet
November 30, 2010

Tracepoints are small hooks placed into kernel code; when they are enabled, they can generate event information which can be consumed through the ftrace or perf interfaces. These tracepoints are defined via the decidedly gnarly TRACE_EVENT() macro which Steven Rostedt nicely described in detail for LWN earlier this year. As kernel developers add more tracepoints to the kernel, they are occasionally finding things which can be improved. One of those seems relatively simple: what if a tracepoint should only fire some of the time?

Arjan van de Ven recently posted a patch adding a tracepoint to __mark_inode_dirty(), a function called deep within the virtual filesystem layer to, surprisingly, mark an inode as being dirty. Arjan's purpose is to figure out which processes are causing files to have dirty contents; that will allow tools like PowerTop to tell laptop users which process is causing their disk to spin up. The only problem is that some calls to __mark_inode_dirty() are essentially noise from this point of view; they happen, for example, when an inode is first created or is being freed. Tracing those calls could create a stream of useless events which would have to be filtered out by PowerTop, causing PowerTop itself to require more power. So it is preferable to avoid creating those events in the first place if possible.

For that reason, Arjan made the call to the tracepoint be conditional:

    if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
            trace_writeback_inode_dirty(inode, flags);

This code works in that it causes the tracepoint to be "hit" only when an application has actually done something to dirty an inode.

The VFS developers seem to have no objection to this tracepoint being added; the resulting information can be useful. But they didn't like the conditional nature of it. Part of the problem is that tracepoints are supposed to keep a low profile; developers want to be able to ignore them most of the time. Expanding a tracepoint to two lines and an if statement rather defeats that goal. But tracepoints are also supposed to not affect execution time. They have been carefully coded to impose almost no overhead when they are not enabled (which is most of the time); with techniques like jump label, that overhead can be reduced even further. But that if statement, being outside of the tracepoint altogether, will always be executed regardless of whether the tracepoint is currently enabled or not. Multiply that test-and-jump across millions of calls to __mark_inode_dirty() on each of millions of machines, and the extra CPU cycles start to add up.

So it was asked: could this test be moved into the tracepoint itself? One approach might be to put the test into the TP_fast_assign() portion of the tracepoint, which copies the tracepoint data into the tracing ring buffer. The problem with that idea is that, by that time, the tracepoint has already fired and space has already been allocated in the ring buffer. There is currently no mechanism to cancel a tracepoint hit partway through. There has, in the past, been talk of adding some sort of "never mind" operation which could be invoked within TP_fast_assign(), but that idea seems less than entirely elegant.

What might happen, instead, is the creation of a variant of TRACE_EVENT() with a name like TRACE_EVENT_CONDITION(). It would take an extra parameter which would be, of course, another tricky macro. For Arjan's tracepoint, the condition would look something like:

    TP_CONDITION(
            if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
                    return 1;
            else
                    return 0;
    ),

The tracepoint code would then test the condition before doing any other work associated with the tracepoint - but only if the tracepoint itself has been enabled.

This solution should help to keep the impact of tracepoints to a minimum once again, especially when those tracepoints are not enabled. There is one potential problem in that the condition is now hidden deeply within the definition of the tracepoint; that definition is usually found in a special header file far from the code where the tracepoint is actually inserted. At the tracepoint itself, the condition which might cause it not to fire is not visible in any way. So, if somebody other than the initial developer wants to use the tracepoint, they could misinterpret a lack of output as a sign that the surrounding code is not being executed at all. That little problem could presumably be worked around with clever tracepoint naming, better documentation, or simply expecting users to understand what tracepoints are actually telling them.


Conditional tracepoints

Posted Dec 2, 2010 3:58 UTC (Thu) by thedevil (guest, #32913)

"Tracing those calls could create a stream of useless events which would have to be filtered out by PowerTop, causing PowerTop itself to require more power."

LOL! Thanks for this bit of typical LWN humor. Sometimes I think the Editor was born on other shores where this gift is more common :)

Conditional tracepoints

Posted Dec 2, 2010 10:16 UTC (Thu) by mjthayer (guest, #39183)

> "Tracing those calls could create a stream of useless events which would have to be filtered out by PowerTop, causing PowerTop itself to require more power."

It does make me wonder why the filtering logic couldn't just be in PowerTop, as having the tracepoint on is clearly not the normal case. I assume that it would have generated enough extra work, and changed the system's power usage enough, to make PowerTop's analysis harder.

Conditional tracepoints

Posted Dec 2, 2010 16:56 UTC (Thu) by intgr (subscriber, #39733)

> There is one potential problem in that the condition is now hidden deeply
> within the definition of the tracepoint; that definition is usually found
> in a special header file far from the code where the tracepoint is
> actually inserted. At the tracepoint itself, the condition which might
> cause it not to fire is not visible in any way.

Why not use code comments? Their purpose, after all, is to inform other coders of things that might not be obvious at first.

Conditional tracepoints

Posted Dec 3, 2010 10:16 UTC (Fri) by dag- (guest, #30207)

You removed the next sentence which gives a clue.

> So, if somebody other than the initial developer wants to use the tracepoint, they could misinterpret a lack of output as a sign that the surrounding code is not being executed at all.

The people using the tracepoints are not necessarily kernel developers, and in most cases they will not be looking at the source code while tracing.

Conditional tracepoints

Posted Dec 8, 2010 11:02 UTC (Wed) by Auders (guest, #53318)

The condition should be reflected in the format then, so that the tool used to read the tracepoint can know about it.