Conditional tracepoints
Arjan van de Ven recently posted a patch adding a tracepoint to __mark_inode_dirty(), a function called deep within the virtual filesystem layer to, surprisingly, mark an inode as being dirty. Arjan's purpose is to figure out which processes are causing files to have dirty contents; that will allow tools like PowerTop to tell laptop users which process is causing their disk to spin up. The only problem is that some calls to __mark_inode_dirty() are essentially noise from this point of view; they happen, for example, when an inode is first created or is being freed. Tracing those calls could create a stream of useless events which would have to be filtered out by PowerTop, causing PowerTop itself to require more power. So it is preferable to avoid creating those events in the first place if possible.
For that reason, Arjan made the call to the tracepoint be conditional:
    if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
        trace_writeback_inode_dirty(inode, flags);
This code works in that it causes the tracepoint to be "hit" only when an application has actually done something to dirty an inode.
The VFS developers seem to have no objection to this tracepoint being added; the resulting information can be useful. But they didn't like the conditional nature of it. Part of the problem is that tracepoints are supposed to keep a low profile; developers want to be able to ignore them most of the time. Expanding a tracepoint to two lines and an if statement rather defeats that goal. But tracepoints are also supposed to not affect execution time. They have been carefully coded to impose almost no overhead when they are not enabled (which is most of the time); with techniques like jump label, that overhead can be reduced even further. But that if statement, being outside of the tracepoint altogether, will always be executed regardless of whether the tracepoint is currently enabled or not. Multiply that test-and-jump across millions of calls to __mark_inode_dirty() on each of millions of machines, and the extra CPU cycles start to add up.
So it was asked: could this test be moved into the tracepoint itself? One approach might be to put the test into the TP_fast_assign() portion of the tracepoint, which copies the tracepoint data into the tracing ring buffer. The problem with that idea is that, by that time, the tracepoint has already fired, space has been allocated in the ring buffer, etc. There is currently no mechanism to cancel a tracepoint hit partway through. There has, in the past, been talk of adding some sort of "never mind" operation which could be invoked within TP_fast_assign(), but that idea seems less than entirely elegant.
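The ordering problem can be illustrated with a small user-space model. This is only a sketch, not kernel code: the `demo_` names, the flag values, and the fixed-size buffer are all assumptions made for illustration, and the real ring buffer is far more elaborate. The point is just that the "fast assign" step runs after space has been claimed, so a filter placed there has nothing to hand the slot back to.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's inode-dirty flags. */
enum {
    DEMO_DIRTY_SYNC     = 1 << 0,
    DEMO_DIRTY_DATASYNC = 1 << 1,
    DEMO_DIRTY_PAGES    = 1 << 2,
    DEMO_NEW            = 1 << 3,  /* an uninteresting "noise" event */
};

#define DEMO_RING_SLOTS 8

struct demo_event {
    unsigned long ino;
    unsigned int flags;
};

/* A toy buffer; slots are handed out in order and never given back. */
struct demo_ring {
    struct demo_event slots[DEMO_RING_SLOTS];
    size_t used;
};

/* Step 1 of a hit: claim space. Returns NULL when the buffer is full. */
static struct demo_event *demo_reserve(struct demo_ring *ring)
{
    if (ring->used == DEMO_RING_SLOTS)
        return NULL;
    return &ring->slots[ring->used++];
}

/* Step 2: copy the event data into the slot that was already claimed. */
static void demo_fast_assign(struct demo_event *ev, unsigned long ino,
                             unsigned int flags)
{
    ev->ino = ino;
    ev->flags = flags;
}

/* A complete hit: by the time step 2 runs, the slot is already spent. */
static void demo_trace_hit(struct demo_ring *ring, unsigned long ino,
                           unsigned int flags)
{
    struct demo_event *ev = demo_reserve(ring);

    if (ev)
        demo_fast_assign(ev, ino, flags);
}
```

In this model, a hit carrying only `DEMO_NEW` consumes a slot just as a real dirtying event does; without some kind of "never mind" operation, nothing inside `demo_fast_assign()` can undo the reservation.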
What might happen, instead, is the creation of a variant of TRACE_EVENT() with a name like TRACE_EVENT_CONDITION(). It would take an extra parameter which would be, of course, another tricky macro. For Arjan's tracepoint, the condition would look something like:
    TP_CONDITION(
        if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
            return 1;
        else
            return 0;
    ),
The tracepoint code would then test the condition before doing any other work associated with the tracepoint - but only if the tracepoint itself has been enabled.
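The difference in ordering can be sketched in user space. Everything here is hypothetical: the `sketch_` names, the flag values, and the counters exist only for illustration, and a plain boolean stands in for whatever mechanism the kernel uses to enable a tracepoint. The sketch simply counts how often the condition is evaluated in each arrangement.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's inode-dirty flags. */
enum {
    SKETCH_DIRTY_SYNC     = 1 << 0,
    SKETCH_DIRTY_DATASYNC = 1 << 1,
    SKETCH_DIRTY_PAGES    = 1 << 2,
    SKETCH_NEW            = 1 << 3,  /* an uninteresting "noise" event */
};

#define SKETCH_DIRTY_MASK \
    (SKETCH_DIRTY_SYNC | SKETCH_DIRTY_DATASYNC | SKETCH_DIRTY_PAGES)

struct sketch_tracepoint {
    bool enabled;               /* stand-in for the enable mechanism */
    unsigned long cond_checks;  /* how many times the condition ran */
    unsigned long hits;         /* how many events were recorded */
};

/* The filter itself, instrumented to count its own evaluations. */
static bool sketch_cond(struct sketch_tracepoint *tp, unsigned int flags)
{
    tp->cond_checks++;
    return (flags & SKETCH_DIRTY_MASK) != 0;
}

/* Caller-side test: the condition runs on every call, enabled or not. */
static void sketch_outer_if(struct sketch_tracepoint *tp, unsigned int flags)
{
    if (sketch_cond(tp, flags)) {
        if (tp->enabled)
            tp->hits++;
    }
}

/* Condition inside the tracepoint: checked only once tracing is on. */
static void sketch_conditional(struct sketch_tracepoint *tp,
                               unsigned int flags)
{
    if (!tp->enabled)
        return;
    if (!sketch_cond(tp, flags))
        return;
    tp->hits++;
}
```

With tracing disabled, `sketch_outer_if()` still pays for the flag test on each call, while `sketch_conditional()` returns before reaching it; once enabled, both record the same set of events.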
This solution should help to keep the impact of tracepoints to a minimum
once again, especially when those tracepoints are not enabled. There is
one potential problem in that the condition is now hidden deeply within the
definition of the tracepoint; that definition is usually found in a special
header file far from the code where the tracepoint is actually inserted.
At the tracepoint itself, the condition which might cause it not to fire is
not visible in any way. So, if somebody other than the initial developer
wants to use the tracepoint, they could misinterpret a lack of output as a
sign that the surrounding code is not being executed at all. That little
problem could presumably be worked around with clever tracepoint naming,
better documentation, or simply expecting users to understand what
tracepoints are actually telling them.
Posted Dec 2, 2010 3:58 UTC (Thu)
by thedevil (guest, #32913)
LOL! Thanks for this bit of typical LWN humor. Sometimes I think the Editor has been born on other shores where this gift is more common :)
Posted Dec 2, 2010 10:16 UTC (Thu)
by mjthayer (guest, #39183)
It does make me wonder why the filtering logic couldn't just be in PowerTop, as having the tracepoint on is clearly not the normal case. I assume that it would have generated enough extra work, and changed the system's power usage enough, to make PowerTop's analysis harder.
Posted Dec 2, 2010 16:56 UTC (Thu)
by intgr (subscriber, #39733)
Why not use code comments? Their purpose, after all, is to inform other coders of things that might not be obvious at first.
Posted Dec 3, 2010 10:16 UTC (Fri)
by dag- (guest, #30207)
> So, if somebody other than the initial developer wants to use the tracepoint, they could misinterpret a lack of output as a sign that the surrounding code is not being executed at all.
The people using the tracepoints are not necessarily kernel developers, and in most cases they will not be looking at the source code while tracing.
Posted Dec 8, 2010 11:02 UTC (Wed)
by Auders (guest, #53318)
> within the definition of the tracepoint; that definition is usually found
> in a special header file far from the code where the tracepoint is
> actually inserted. At the tracepoint itself, the condition which might
> cause it not to fire is not visible in any way.
