Brief items
The current development kernel is 2.6.34-rc3,
released on March 30. "
Anyway, from a messy -rc2 we now have a -rc3 that should be in much better
shape. Regressions fixed, and the ShortLog is short enough to be worth
posting to lkml." Testers who are using both SELinux and ext3 will
want to read the announcement; a regression in previous kernels could leave
such systems with corrupted security labels. The short-form changelog is
in the announcement, or see the full changelog for all the details.
There have been no stable updates over the last week. There are no
fewer than four updates in the review process, though:
2.6.27.46 (45 patches),
2.6.31.13 (89 patches),
2.6.32.11 (116 patches), and
2.6.33.2 (156 patches). The release of all
of these updates can be expected on or after April 1.
Ok, so -rc2 was messy, no question about it. I'm too much of a
softie to hold back some peoples work, so my hard-line -rc1 didn't
work out the way I wanted. But _next_ time! For sure this time.
--
Linus Torvalds
What happens if 10000 processes simultaneously write to this
thing? It's root-only so I guess the answer is "root becomes
unemployed".
--
Andrew Morton
Take a look at a default install of Fedora. If you can understand
the security implications of disabling setuid, you get a cookie.
If you can figure out which programs will result in a change of
security label when exec'd, you get another cookie.
--
Andy Lutomirski
By Jonathan Corbet
March 30, 2010
Contemporary Linux systems allow processes to set up their environments in
any of a number of ways. For various reasons, developers sometimes want
even more flexibility; in particular, they would like to take something
away (filesystem access, network access, capabilities) from a running
process, usually in the name of security. The problem is that such changes
can actually make security worse; as has been seen many times, privileged
programs can be made to do strange and unfortunate things when run in
unexpected environments.
As Andy Lutomirski notes, one
response to this problem is to disable setuid semantics as well. But there
are a lot of ways for the execve() system call to change a
process's privileges which do not involve setuid programs; this is
especially true in the presence of security modules. So Andy has proposed
a different idea: opt out of execve() instead. To that end, he
proposes a new prctl() option (PR_RESTRICT_ME) which
could be used to add restrictions to a running process; the first of those
is that the process cannot call execve(). Disabling
execve() would be mandatory before any other restrictions could be
added.
But a process running in a restricted mode might still want to run other
programs; that's how Linux programs often work. To accommodate that need,
Andy has added a new system call, named execve_nosecurity().
This variant of execve() will run the indicated program, but it
will perform absolutely no security transitions first. So no setuid, no
SELinux type changes, etc. The end result is a system call with
functionality similar to simply mapping the program into the caller's
address space and running it directly. With execve_nosecurity(),
it is not possible to increase privileges by running another program, so it
should make the removal of capabilities from running processes safer.
This patch should address a number of the concerns developers have had with
the restricting of privileges. It's hard to tell for sure, though, because
there has been very little in the way of response so far.
By Jonathan Corbet
March 30, 2010
The removal of the big kernel lock (BKL) has been a kernel development goal
for many years. The BKL creates scalability problems and provides some
truly strange locking semantics that would be nice to eliminate. The
actual work of removing this lock has been a long process, though; it is a
tedious job requiring a fairly deep understanding of the affected code.
Relatively few people are willing to do that work, so the BKL has survived
for far longer than anybody might have liked.
One developer who has put some significant time into BKL removal is Arnd
Bergmann; Arnd has just posted a
patch series which promises to eliminate the BKL altogether - almost.
To that end, a number of significant changes have been made. The block and
tty subsystems both get subsystem-level mutexes to replace their use of the
BKL; that is a relatively tricky job because the locking semantics provided
by a mutex are rather different. An extensive effort has been made to
audit and document ioctl() and llseek() functions which
still require the BKL; no other function called from the
file_operations structure expects the BKL now. Code still
requiring the BKL is now explicitly marked in the kernel configuration
system, making it possible to build BKL-free kernels. The patch set also
includes a significant series from Jan Blunck removing the BKL from much of
the VFS layer.
What's left is a few "mostly obscure device driver modules."
Arnd has used a fairly large value of "mostly obscure," though; the USB
subsystem, for example, still has a BKL dependency. All told, there are 148 modules still using the BKL, most of which
are drivers. That may seem like a lot, but it's a huge step in the right
direction. Many of us may be running BKL-free kernels sooner than we might
have expected.
Kernel development news
By Jonathan Corbet
March 30, 2010
Interrupt handlers run asynchronously in response to signals from the
hardware. Since they pull the CPU away from whatever it was doing before,
handlers are supposed to be very quick; they should, in most cases, tell
the hardware to be quiet, arrange for any followup work to be done, and get
out of the way. Historically, the situation has not been so simple,
though, leading to a distinction between "fast" and "slow" handlers in the
earliest days of Linux. It now seems, though, that this distinction could
disappear as early as 2.6.35.
The core distinction between fast and slow handlers is this: fast handlers
are run with further interrupts disabled, while slow handlers run with
interrupts enabled. A slow handler, thus, can be interrupted by another
handler, while a fast handler cannot. In an ideal world, slow handlers
would not exist; they would all get their work done quickly and not
monopolize the CPU, so there would be no point in interrupting them. In
the real world, which includes problematic
hardware, slow processors, and developers of varying ability, slow handlers
have been a fact of life. The nature of some hardware (old IDE
controllers, for example) makes it hard to avoid doing a lot of work in the
interrupt handler. Meanwhile, other types of devices must have exceedingly
fast interrupt response to avoid loss of data; a classic example here is a
number of serial ports which are able to buffer exactly one character in
the UART. The slow IDE work could not be allowed to delay serial
processing; thus, the IDE interrupt handler had to be a slow one.
Over time, though, the situation has changed. Hardware has gotten smarter
and better able to handle interrupt response latency. CPUs have gotten
faster, so even a relatively slow handler can get a lot of work done
quickly. The needs of the realtime tree (and other latency-sensitive
workloads) have motivated the reworking of the worst interrupt-time
offenders, and improvements in the kernel's deferred work mechanisms have
made it easier to move work out of handlers. So the need for the
distinction between the two types of interrupt handlers has been fading.
Simultaneously, problems associated with the fast/slow dichotomy have been
growing. There is no way to run handlers for interrupts on shared lines
(found on any system with a PCI bus) with interrupts disabled, because any
other handler for a device on the same line can enable interrupts.
Allowing interrupt handlers to interrupt each other leads to worse cache
behavior and unpredictable completion times. What set off the recent
discussion, though, was
this patch from Andi Kleen which
was aimed at addressing another problem: deeply nested interrupt handlers
can overflow the processor's interrupt stack - a situation from which good
things cannot be expected to ensue.
Andi's solution is to monitor the depth of the interrupt stack within the
core kernel's interrupt-handling code. Should the stack become more than
half full, the core code will no longer enable interrupts before calling
slow handlers. In effect, it treats slow handlers as if they were fast
handlers for the duration of the stack-space squeeze. This patch solved
the problem that was being observed, but it ran into some trouble; in
particular, Thomas Gleixner did not hesitate to make his dislike for the
patch known. Your editor will try to rephrase the argument in slightly
more polite terms; according to Thomas, the patch implemented a solution
which was unreliable at best, was liable to create significant latencies in
the system, and which ignored the real problem.
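The policy described for Andi's patch can be sketched as a small userspace model. Nothing here is kernel code: `IRQ_STACK_SIZE`, `struct irq_model`, and `should_enable_irqs()` are hypothetical names, and the stack size is an assumed value chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of the interrupt stack, in bytes (illustrative only). */
#define IRQ_STACK_SIZE 16384u

/* Hypothetical handler descriptor: fast handlers never get interrupts enabled. */
struct irq_model {
    bool fast;
};

/*
 * Decide whether the core code enables interrupts before calling a handler.
 * A slow handler normally runs with interrupts enabled, but once more than
 * half of the interrupt stack is in use it is treated like a fast handler,
 * which stops further nesting.
 */
static bool should_enable_irqs(const struct irq_model *h, size_t stack_used)
{
    assert(stack_used <= IRQ_STACK_SIZE);
    if (h->fast)
        return false;
    return stack_used <= IRQ_STACK_SIZE / 2;
}
```

In this model a slow handler invoked with 1KB of stack in use still gets interrupts enabled, while the same handler invoked with the stack more than half full does not.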
Said real problem, according to Thomas, is the fact that slow handlers
exist at all. He would like to see a world where all interrupt handlers
are run with interrupts disabled, and where all of those handlers get their
work done quickly. Any extended interrupt processing should be moved to
threaded handlers. In summary:
So what's the point of running a well written (short) interrupt
handler with interrupts enabled ? Nothing at all. It just makes us
deal with crap like stacks overflowing for no good reason.
Linus initially squashed the idea, saying
that a world where we only have fast handlers is not really possible:
So Thomas, you're wrong. We can't fix all irq handlers to be really
quick, and we MUST NOT run them with all other irq's disabled if
they are not - because that obviates the whole point.
It is interesting to note, though, that this position shifted over time.
Linus (and others) expressed a number of concerns about running all
handlers with interrupts disabled:
- The handlers for some devices simply have to do a lot of work, and
that cannot be easily changed. Embedded systems, in particular, can
have fussy hardware and slow processors.
- Some handlers will not work properly if interrupts are not
enabled. In the past, some drivers have done things like waiting for
a certain amount of time to pass (as reflected in changes to the
jiffies variable). This dubious practice fails outright if
interrupts are disabled: the timer interrupt will be blocked, and
jiffies will not advance.
- Some hardware simply has strict latency requirements which cannot wait
for another interrupt handler to finish its job.
Looking at all these worries, one might well wonder if a system which
disabled interrupts for all handlers would function well at all. So it is
interesting to note one thing: any system which has the lockdep
locking checker enabled has been running all handlers that way for some
years now. Many developers and testers run lockdep-enabled kernels, and
they are available for some of the more adventurous distributions (Rawhide,
for example) as well. So we have quite a bit of test coverage for this
mode of operation already.
Another thing that happened over the last few years was the integration of
the dynamic tick code, which disables the clock tick when the system is
idle. Clock ticks are not turned back on for interrupt handlers. So any
handler which expects jiffies to change while it is running will,
sooner or later, go into a rather undignified infinite loop. Users tend to
notice that kind of behavior, so most drivers which behave this way have
long since been fixed.
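The jiffies problem can be reproduced in a toy userspace model. All names (`sim_jiffies`, `sim_timer_interrupt()`, `sim_wait_ticks()`) are invented for this sketch, and the spin limit exists only so that the model terminates; a real handler without such a limit would loop forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated tick counter; stands in for the kernel's jiffies variable. */
static unsigned long sim_jiffies;

/* Simulated timer interrupt: it can only be delivered with interrupts enabled. */
static void sim_timer_interrupt(bool irqs_enabled)
{
    if (irqs_enabled)
        sim_jiffies++;
}

/*
 * Busy-wait until `ticks` ticks have passed, the way some old handlers did.
 * Returns true if the wait completed within `max_spins` iterations.
 */
static bool sim_wait_ticks(unsigned long ticks, bool irqs_enabled,
                           unsigned long max_spins)
{
    unsigned long deadline = sim_jiffies + ticks;
    unsigned long spins;

    assert(max_spins > 0);
    for (spins = 0; spins < max_spins; spins++) {
        if (sim_jiffies >= deadline)
            return true;
        sim_timer_interrupt(irqs_enabled);
    }
    return false;   /* the counter never advanced far enough */
}
```

Calling `sim_wait_ticks(3, true, 100)` completes, while `sim_wait_ticks(3, false, 100)` gives up because the simulated counter never moves.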
Finally, the realtime tree developers have spent a great deal of time
tracking down sources of latency; excessive time spent in interrupt
handlers is one of the worst of those. So drivers which control hardware
of interest have generally been fixed. The addition of threaded interrupt
handlers has made it easier to fix drivers; most of the code can simply be
pushed into the threaded handler with no other change at all.
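The split between a quick hard handler and a threaded handler can be pictured with the following userspace sketch. The return codes are modelled on the kernel's `IRQ_NONE`, `IRQ_HANDLED`, and `IRQ_WAKE_THREAD`; the `sim_` names and the device structure are hypothetical, and no actual thread is created here.

```c
#include <assert.h>
#include <stdbool.h>

/* Return codes modelled on the kernel's irqreturn_t values. */
enum sim_irqreturn { SIM_IRQ_NONE, SIM_IRQ_HANDLED, SIM_IRQ_WAKE_THREAD };

/* Hypothetical device state shared by both halves of the handler. */
struct sim_device {
    bool irq_pending;   /* device is asserting its interrupt */
    int  unprocessed;   /* items waiting for the slow path */
    int  processed;     /* items handled by the threaded half */
};

/* Hard handler: runs with interrupts off, so it only quiets the device. */
static enum sim_irqreturn sim_hardirq(struct sim_device *dev)
{
    if (!dev->irq_pending)
        return SIM_IRQ_NONE;       /* not our interrupt (shared line) */
    dev->irq_pending = false;      /* tell the hardware to be quiet */
    return SIM_IRQ_WAKE_THREAD;    /* defer the real work */
}

/* Threaded handler: runs later and may take as long as it needs. */
static enum sim_irqreturn sim_irq_thread(struct sim_device *dev)
{
    assert(!dev->irq_pending);
    dev->processed += dev->unprocessed;
    dev->unprocessed = 0;
    return SIM_IRQ_HANDLED;
}
```

In a real driver the two functions would be registered together with `request_threaded_irq()`; here they are simply called one after the other.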
Given all of this, Ingo Molnar felt
confident in saying:
I'm fairly certain, based on having played with those aspects from
many angles that disabling irqs in all drivers should work just
fine today already.
After hearing this from a few core developers, and after doing some
research of his own, Linus eventually stopped
opposing the idea and started talking about how it should be
implemented. Thomas then posted a patch implementing the
change. With this patch, the IRQF_DISABLED flag (used to indicate
a fast handler) becomes a no-op; it is expected to be removed altogether in
2.6.36.
There are still some concerns about the
change, especially with regard to slow hardware on embedded systems. In
some of these cases, the problem can be solved with threaded interrupt
handlers. Some developers worry, though, that threaded handlers impose too
much latency on interrupt response. Improving on that situation is a task
for the future; in the meantime, some interrupt handlers may just have to
enable interrupts internally to get the required behavior. The preferred
function for this purpose is local_irq_enable_in_hardirq(); its
use can already be found in the IDE layer.
Since all of the technical obstacles have seemingly been overcome, chances
are good that this patch will find its way into the kernel in the 2.6.35
merge window.
March 30, 2010
This article was contributed by Wolfram Sang
Creating patches is usually manual work, fixing one specific issue at a time.
Once in a while though, there is janitorial work to be done or some
infrastructure to change. Then, a larger number of issues have to be taken care
of simultaneously, yet all of them follow the same basic pattern, e.g. a
replacement. Such tasks are often addressed at the source-code level using scripts
in sed, perl, and the like. This article examines the usage of Coccinelle, a
tool targeted at exactly those kinds of repetitive patching jobs.
Because Coccinelle understands C syntax, though, it can handle those jobs
much more easily.
The major drawback of using scripts for code transformation is that they use non-trivial
regular expressions in order to match previously unknown names, parse
structures, and so forth.
To simplify such tasks, "semantic patches"—patches that describe the
kinds of changes to be made, rather than the specific line and
difference that come in normal patches—have been
introduced along with Coccinelle to process them. Coccinelle
translates the source files to an abstract representation, making it easier
to deal with
C expressions, isomorphisms, code paths and so on. For an introduction,
refer to Valerie Aurora's LWN article or the
Coccinelle web site. This article will provide a step-by-step description of
how a semantic patch came into existence once a certain problem was identified.
Learning Coccinelle is still a bit challenging as
information is scattered and, as with all languages, just listing the
abilities is not even half of the story. Studying other semantic patches
(in addition to asking on the mailing list) worked best for me, so in return this
article describes the creation of a semantic patch from scratch. I would
like to thank Julia Lawall for her immediate responses to my questions and
bug reports.
The problem
An issue was pointed out while developing an I2C driver for hardware
monitoring: the driver serving an I2C slave device
(called client) uses i2c_set_clientdata() to store a pointer to its private
data structure, usually somewhere in the probe function. In the remove function,
the driver was then supposed to clear the pointer to the data structure before
freeing it, because clients are not really removed but are just unbound from the
driver. To prevent a dangling pointer in the still existing client, a typical
fix looks like:
+ i2c_set_clientdata(client, NULL);
/* clientdata pointed to data before */
kfree(data);
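The dangling pointer is easy to picture in a userspace stand-in. The sketch below is not driver code: `struct sim_client` plays the role of the long-lived client, and `sim_probe()` and `sim_remove()` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the client, which outlives the driver binding. */
struct sim_client {
    void *clientdata;
};

/* Hypothetical private data of the driver. */
struct sim_priv {
    int reg_cache;
};

/* probe(): allocate the private data and attach it to the client. */
static int sim_probe(struct sim_client *client)
{
    struct sim_priv *data = calloc(1, sizeof(*data));

    if (!data)
        return -1;
    client->clientdata = data;
    return 0;
}

/* remove(): clear the pointer first, then free what it pointed to. */
static void sim_remove(struct sim_client *client)
{
    struct sim_priv *data = client->clientdata;

    assert(data != NULL);
    client->clientdata = NULL;   /* without this line the client keeps a stale pointer */
    free(data);
}
```

After `sim_remove()` returns, the client is still around but no longer refers to freed memory.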
As this dangling pointer looks quite easy to miss, checking all drivers is
a job perfectly suited for Coccinelle. The goal is a patch series fixing
this flaw in I2C drivers all over the kernel tree. While the patch series
Coccinelle successfully created will probably not be merged directly, it helped in finding
a more
generic solution. It was agreed that the i2c-core should
clear the pointer to the private data structure as there is no guarantee for
such pointers after the remove. A follow-up patch series will likely be based on
the semantic patch presented below. In any case, the creation process will be
useful for similar tasks in the future.
The task can be further divided into two sub-problems:
- Find relevant kfree() calls, which have the private data structure as an argument
- Check if clientdata is NULL already
If the latter is not the case, a fix is needed. For the following examples,
Coccinelle 0.2.2 and a 2.6.34-rc1 kernel were used. Older kernels can also
be used
to get the idea, of course.
Find relevant kfree() calls
A typical remove() routine for an I2C driver looks like this
(from drivers/rtc/rtc-pcf8563.c):
static int pcf8563_remove(struct i2c_client *client)
{
    struct pcf8563 *pcf8563 = i2c_get_clientdata(client);
    if (pcf8563->rtc)
        rtc_device_unregister(pcf8563->rtc);
    kfree(pcf8563);
    return 0;
}
The pointer to the data structure of interest was obtained using
i2c_get_clientdata(). When the structure itself gets freed, then a
check for the call setting clientdata to NULL is needed. So this
combination of i2c_get_clientdata() and kfree() is of
interest, keeping in mind that the name of the pointer and its type can be
anything. As Coccinelle parses the C source on an abstract level, this is
easily possible using a few so-called metavariables in the header of our
matching rule. Those can then carry the actual naming as used in the source
file. Always remember that Coccinelle works on an abstract level. It is quite
easy to forget as most of us are used to standard patches on source-code level.
A first attempt at our semantic patch, consisting of a single rule, may look like this:
@@
// This is the rule header; metavariables must be declared here
type T;
identifier client, data;
@@
// The matching rule itself:
// Catch the clientdata
T data = i2c_get_clientdata(client);
// then anything in between is allowed
...
// prepend the fix if kfree() is found
+ i2c_set_clientdata(client, NULL);
kfree(data);
For the pcf8563 example above, this patch matches. That means, after the first
line of the rule, the metavariable T will carry the type struct pcf8563 *,
data will carry the identifier pcf8563 and client will carry the identifier
client. Later uses of these metavariables will, of course, be replaced
accordingly. So, kfree(data) will in fact look for kfree(pcf8563). As this is also
found, the match is complete and the line containing the fix will be added.
But the patch did not find all relevant places. The
probe() function also has a dangling pointer in the error path. It
wasn't matched as it uses i2c_set_clientdata() instead of
i2c_get_clientdata(). So there should be an alternation in the
semantic patch handling both cases. And to make it short, a third variant is
necessary because other drivers use i2c_get_clientdata()
without declaring the type on the same line. It is usually a good idea to do a
little bit of grepping first to get an idea in what ways functions are called.
Here is the patch including all alternations marked by "(", "|", and ")" in the
first column:
@@
type T;
identifier client, data;
@@
// Check if function uses clientdata
(
i2c_set_clientdata(client, data);
|
data = i2c_get_clientdata(client);
|
T data = i2c_get_clientdata(client);
)
// anything in between is allowed
...
+ i2c_set_clientdata(client, NULL);
kfree(data);
Surprisingly, there is still no fixup for the probe() function. Why is
that? The "..." operator in Coccinelle matches if and only if it matches for
all code paths taken. This is to ensure consistency of the modifications. It
usually makes a lot of sense; this case, however, is an exception. As it is
written now, the lower block of the patch says "anything in between is allowed,
but then a kfree(data) must follow on all paths". Of course, the
probe() routine does not free the structure if all went well because
the driver is going to use it. So, the above rule will not match on this path
and thus will fail entirely. What is needed here is a "may exist or may not
exist" operator. This is, similar to regular expressions, "?". After changing
the kfree() line to the following
? kfree(data);
the meaning of the lower block changes to "anything in between is allowed and
kfree(data) may occur later". That implies that, if it occurs, the fix
connected to kfree(data) will be applied as well, so finally there is
the second match.
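To see why the free is not reached on every path, consider a userspace stand-in for such a probe() routine, shown here with the fix already applied. The names `demo_client`, `demo_probe()`, and the `hw_ok` parameter are hypothetical; `hw_ok` represents whatever initialization may fail after clientdata has been set.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the client structure. */
struct demo_client {
    void *clientdata;
};

static int demo_probe(struct demo_client *client, int hw_ok)
{
    int *data;

    assert(client != NULL);
    data = malloc(sizeof(*data));
    if (!data)
        return -1;
    client->clientdata = data;    /* corresponds to i2c_set_clientdata(client, data) */

    if (!hw_ok)
        goto fail;
    return 0;                     /* success path: the data stays allocated */

fail:
    client->clientdata = NULL;    /* the line the semantic patch adds */
    free(data);                   /* the free exists on the error path only */
    return -1;
}
```

Only the path through `fail` contains a free, which is the situation the "?" operator is meant to tolerate.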
Check if clientdata is NULL already
When applying this semantic patch to the whole rtc subdirectory, there
are a number of fixes, but also false positives, i.e. the pointer has correctly
been cleared already by the driver, which is now done twice. To fix this, an
alternation can be used again. As in many languages, evaluation of an alternation
stops as soon as one branch matches. So the replacing part can be done
like this:
(
// If this pattern is found, clientdata is set to NULL before data is freed.
// Do nothing and skip the rest of the alternation
i2c_set_clientdata(client, NULL);
...
kfree(data);
|
// Otherwise apply a fix if kfree() has been found in some code path
// (doesn't need to be in all paths).
+ i2c_set_clientdata(client, NULL);
? kfree(data);
)
If the first block is met, the driver does the right thing. There still is a
match, but no output is produced because no lines are added or removed. If this
is not the case, the fix is applied (if needed). While we are here, note that a few drivers
clear the pointer after they free the structure. The other way around would be
cleaner, so the following snippet is added as a third branch of the alternation:
+ i2c_set_clientdata(client, NULL);
kfree(data);
...
- i2c_set_clientdata(client, NULL);
The final version of the semantic
patch is hopefully less frightening:
@@
type T;
identifier client, data;
@@
// Check if function uses clientdata
(
i2c_set_clientdata(client, data);
|
data = i2c_get_clientdata(client);
|
T data = i2c_get_clientdata(client);
)
// Anything in between is OK
...
(
// If this pattern is found, clientdata is set to NULL before data is freed.
// Do nothing and skip the rest of the alternation
i2c_set_clientdata(client, NULL);
...
kfree(data);
|
// If this pattern is found, clientdata is set to NULL after data is freed.
// Move it to the front and skip the rest of the alternation
+ i2c_set_clientdata(client, NULL);
kfree(data);
...
- i2c_set_clientdata(client, NULL);
|
// Otherwise apply a fix if kfree() has been found in some code path
// (doesn't need to be in all paths).
+ i2c_set_clientdata(client, NULL);
? kfree(data);
)
This matched 96 drivers in 23 directories, changing 213 lines. Note that one
really should review those patches afterward. There might be issues which lead
to further improvement of the semantic patch. Or there are problematic
parts in the source code, but they need to be handled manually. For
example, in this
patch series, there was once a kfree() missing, so a memory leak was
discovered. Also check the Coccinelle output for anomalies. In this case,
there are some exceptions regarding "inconsistent control-flow paths". That
means, the source code was modified in such a way that code paths outside our
match would also be affected. An example is a simple error path in a probe function
(excerpt from drivers/gpio/pcf857x.c):
    gpio = kzalloc(sizeof *gpio, GFP_KERNEL);
    if (!gpio)
        return -ENOMEM;
    ... /* set 'status' according to initialization */
    if (status < 0)
        goto fail; /* clientdata not used yet! */
    ...
    i2c_set_clientdata(client, gpio);
    ...
    status = gpiochip_add(&gpio->chip);
    if (status < 0)
        goto fail; /* clientdata was modified */
    ...
fail:
    dev_dbg(...)
    /* 'i2c_set_clientdata(client, NULL)' placed here would be executed for all jumps to 'fail'! */
    kfree(gpio);
    return status;
As seen, a jump to fail can happen before or after clientdata has been set to the
private data structure. The former case is outside the scope of the above
semantic patch, yet the added line would still be executed on its code path. In this example, the
change is harmless as clientdata is still NULL and will be set to NULL again,
but Coccinelle cannot know and outputs a warning. It is possible to enforce
inconsistent changes using the command-line option -allow_inconsistent_paths,
but it is marked as dangerous in the help text for a reason. Either
triple-check the outcome or just handle the exceptions manually.
Conclusion
The article is meant to incrementally describe the creation of a semantic patch
using Coccinelle. While the result is working and the patch series was submitted,
be aware that the semantic patch here is primarily meant for educational
purposes; more
advanced features available in Coccinelle have been left out.
One has to get used to a slightly different way of thinking regarding
patches along with learning
some new syntax when getting started with Coccinelle. The intention of this article
was to demonstrate that it is no major task, though. Once the basic
stuff is familiar, semantic patches are easier to understand than scripts with loads of regular
expressions. Coccinelle has also been around for some time now and produced a number
of useful patch series (available via kernel-janitors), so it is not in
alpha stage anymore.
In the future, being able to read semantic patches will become increasingly
important. Larger tasks, like API changes, might start being
done in an automatic fashion.
Coccinelle is a handy tool, and trying it out is likely to pay off.
March 31, 2010
This article was contributed by Steven Rostedt
In Part 1, the process of creating a
tracepoint in the core kernel was explained. This article continues from
there with tricks to lower the tracepoint footprint by using the
DECLARE_EVENT_CLASS() macro.
In addition, the macros used to build the TP_STRUCT__entry fields
are described and the TP_printk
helper functions are explained.
Saving space by using DECLARE_EVENT_CLASS()
Every tracepoint that is created with the TRACE_EVENT() macro creates several functions
that allow perf and Ftrace to interact with the tracepoint automatically.
Since these functions have unique prototypes (defined by the
TP_PROTO and TP_ARGS macros in the TRACE_EVENT()
definition),
reference unique structures (defined by the
TP_STRUCT__entry macro), assign
them uniquely to the ring buffer (as defined by TP_fast_assign), and have a unique way
to print out the data (defined in TP_printk), there is very little that the TRACE_EVENT()
macro can do to reuse code. That means that every TRACE_EVENT() defined will increase
the footprint of the kernel, which is enough to make quite a difference with hundreds of TRACE_EVENT()
macros.
   text    data     bss     dec     hex filename
 452114    2788    3520  458422   6feb6 fs/xfs/xfs.o.notrace
 996954   38116    4480 1039550   fdcbe fs/xfs/xfs.o.trace
The XFS filesystem declares over a hundred separate trace events. The data section increased
substantially, but that is expected because each event has a corresponding structure
with a set of function pointers attached to it. What was not acceptable,
though, was that enabling the trace events causes the xfs.o text
section to double in size!
That pushed an effort to find a way to condense trace events. The obvious place
to start was to have several events, which record the same structured data, share
their functions. If two events have the same TP_PROTO, TP_ARGS and TP_STRUCT__entry,
there should be a way to have these events share the functions that they use.
This was the motivation for the new macro DECLARE_EVENT_CLASS() (originally
called TRACE_EVENT_TEMPLATE()) and DEFINE_EVENT().
The DECLARE_EVENT_CLASS() macro has the exact same format as TRACE_EVENT():
DECLARE_EVENT_CLASS(sched_wakeup_template,
    TP_PROTO(struct rq *rq, struct task_struct *p, int success),
    TP_ARGS(rq, p, success),
    TP_STRUCT__entry(
        __array( char, comm, TASK_COMM_LEN )
        __field( pid_t, pid )
        __field( int, prio )
        __field( int, success )
        __field( int, target_cpu )
    ),
    TP_fast_assign(
        memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
        __entry->pid = p->pid;
        __entry->prio = p->prio;
        __entry->success = success;
        __entry->target_cpu = task_cpu(p);
    ),
    TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
        __entry->comm, __entry->pid, __entry->prio,
        __entry->success, __entry->target_cpu)
);
This creates a trace framework that can be used by multiple events.
The DEFINE_EVENT() macro is used to create trace events defined by
DECLARE_EVENT_CLASS():
DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
    TP_PROTO(struct rq *rq, struct task_struct *p, int success),
    TP_ARGS(rq, p, success));
DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
    TP_PROTO(struct rq *rq, struct task_struct *p, int success),
    TP_ARGS(rq, p, success));
The example above creates two trace events sched_wakeup
and sched_wakeup_new. The DEFINE_EVENT() macro requires 4 parameters:
DEFINE_EVENT(class, name, proto, args)
- class - the name of the class created with DECLARE_EVENT_CLASS().
- name - the name of the trace event.
- proto - the prototype that is the same as TP_PROTO in the DECLARE_EVENT_CLASS().
- args - the arguments of the prototype that is the same as TP_ARGS in
DECLARE_EVENT_CLASS().
Unfortunately, due to the limitations of the C preprocessor, the DEFINE_EVENT() macro
needs to repeat the prototype and arguments of the DECLARE_EVENT_CLASS().
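The reason a class saves text can be shown with a userspace analogy. This is only a sketch of the idea, not the way the tracing macros are implemented: `struct wakeup_entry`, `struct sim_event`, and `SIM_DEFINE_EVENT()` are hypothetical, and the class is reduced to a single shared print function.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical event record shared by every event in the class. */
struct wakeup_entry {
    int pid;
    int prio;
};

/* One formatting function for the whole class, emitted exactly once. */
static int wakeup_class_print(char *buf, size_t len, const char *event,
                              const struct wakeup_entry *e)
{
    assert(buf != NULL && e != NULL);
    return snprintf(buf, len, "%s: pid=%d prio=%d", event, e->pid, e->prio);
}

/* Per-event data is only a name plus a pointer to the shared function. */
struct sim_event {
    const char *name;
    int (*print)(char *, size_t, const char *, const struct wakeup_entry *);
};

/* Loosely modelled on DEFINE_EVENT(): no new function is generated. */
#define SIM_DEFINE_EVENT(cls, ev) \
    static const struct sim_event sim_event_##ev = { #ev, cls##_print }

SIM_DEFINE_EVENT(wakeup_class, sched_wakeup);
SIM_DEFINE_EVENT(wakeup_class, sched_wakeup_new);
```

Both events end up pointing at the same function, so adding another event to the class costs a small structure rather than another copy of the code.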
Because several of the tracepoints in XFS are very similar, using the DECLARE_EVENT_CLASS()
brought down the text and bss size quite substantially.
   text    data     bss     dec     hex filename
 452114    2788    3520  458422   6feb6 fs/xfs/xfs.o.notrace
 996954   38116    4480 1039550   fdcbe fs/xfs/xfs.o.trace
 638482   38116    3744  680342   a6196 fs/xfs/xfs.o.class
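Working through the text column makes the saving concrete: tracing originally added 996954 - 452114 = 544840 bytes of text, while the class-based version adds 638482 - 452114 = 186368 bytes, roughly a third of the original overhead. The helper names in this small sketch are made up; the numbers are the ones from the table.

```c
#include <assert.h>

/* Text-section sizes in bytes, taken from the size output above. */
enum {
    TEXT_NOTRACE = 452114,
    TEXT_TRACE   = 996954,
    TEXT_CLASS   = 638482
};

/* Extra text added by tracing, relative to the untraced build. */
static long trace_overhead(long text_with_tracing)
{
    assert(text_with_tracing >= TEXT_NOTRACE);
    return text_with_tracing - TEXT_NOTRACE;
}

/* Percentage of the original overhead that remains, rounded down. */
static long remaining_percent(void)
{
    return trace_overhead(TEXT_CLASS) * 100 / trace_overhead(TEXT_TRACE);
}
```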
To keep the footprint of trace events down, try to consolidate events using
the DECLARE_EVENT_CLASS() and DEFINE_EVENT() macros. There is no advantage to using
the TRACE_EVENT() macro over the other two. In fact, the TRACE_EVENT() macro is
now defined as just:
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
    DECLARE_EVENT_CLASS(name, \
        PARAMS(proto), \
        PARAMS(args), \
        PARAMS(tstruct), \
        PARAMS(assign), \
        PARAMS(print)); \
    DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
Note that the
PARAMS macro allows the arguments to contain commas
and not be mistaken as multiple parameters of
DECLARE_EVENT_CLASS() or
DEFINE_EVENT().
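The comma problem can be demonstrated outside the kernel with standard C variadic macros. `SIM_PARAMS()`, `SIM_DEFINE_CALL()`, and `SIM_WRAP()` are hypothetical names for this sketch; the kernel's own definition of PARAMS is not reproduced here.

```c
#include <assert.h>

/* Pass a comma-containing list through as a single macro argument. */
#define SIM_PARAMS(...) __VA_ARGS__

/* Inner macro: generates a wrapper function from a prototype and arguments. */
#define SIM_DEFINE_CALL(name, proto, args) \
    static int call_##name(proto) { return name(args); }

/*
 * Outer macro: forwards its lists to the inner one. By this point `proto`
 * and `args` have expanded to bare comma-separated lists, so they must be
 * wrapped again or the inner macro would see too many arguments.
 */
#define SIM_WRAP(name, proto, args) \
    SIM_DEFINE_CALL(name, SIM_PARAMS(proto), SIM_PARAMS(args))

static int add3(int a, int b, int c)
{
    return a + b + c;
}

/* The caller wraps the lists too, so SIM_WRAP() sees exactly three arguments. */
SIM_WRAP(add3, SIM_PARAMS(int a, int b, int c), SIM_PARAMS(a, b, c))

/* Self-check: the generated wrapper forwards all three arguments. */
static void sim_params_selfcheck(void)
{
    assert(call_add3(1, 2, 3) == 6);
}
```

Removing the `SIM_PARAMS()` wrappers inside `SIM_WRAP()` makes the preprocessor reject the inner invocation for having too many arguments.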
TP_STRUCT__entry macros
The first article mentioned the __field and __array macros used
to create the structure format of the event that is stored in the ring
buffer.
The __field(type, item) macro declares
a field in the structure called item of type type
(i.e. type item;). The
__array(type, item, len) macro declares a fixed-size array called
item with len elements of type
type (i.e. type item[len];).
Those two are the most common, but there are other macros that allow for more
complex storage into the ring buffer.
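The effect of the two basic macros can be imitated in ordinary C. The macro names below are hypothetical equivalents (identifiers starting with a double underscore are reserved outside the kernel), and `SIM_COMM_LEN` is an assumed value standing in for TASK_COMM_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical equivalents of __field() and __array(). */
#define SIM_FIELD(type, item)      type item;
#define SIM_ARRAY(type, item, len) type item[len];

#define SIM_COMM_LEN 16   /* assumed value, standing in for TASK_COMM_LEN */

/* The field list expands into ordinary structure members. */
struct sim_wakeup_entry {
    SIM_ARRAY(char, comm, SIM_COMM_LEN)
    SIM_FIELD(int, pid)
    SIM_FIELD(int, prio)
};

/* Fill an entry, playing the role of a TP_fast_assign() body. */
static void sim_assign(struct sim_wakeup_entry *entry,
                       const char *comm, int pid, int prio)
{
    assert(entry != NULL && comm != NULL);
    memset(entry->comm, 0, sizeof(entry->comm));
    strncpy(entry->comm, comm, sizeof(entry->comm) - 1);
    entry->pid = pid;
    entry->prio = prio;
}
```

After preprocessing, the structure is simply `char comm[16]; int pid; int prio;`.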
__field_ext(type, item, filter_type)
The __field_ext macro is mainly used for helping the event filter. The event filter
(to be discussed in an upcoming article) allows the user to filter events based on the
contents of their fields. The type and item are the same as the fields
used by __field; the filter_type is an enum. Currently only
the following values are used:
- FILTER_OTHER - equivalent to the standard __field() macro.
- FILTER_PTR_STRING - the field points to a string outside the ring buffer.
The FILTER_PTR_STRING and __field_ext are currently only used by the
big kernel lock tracepoints. These fields point to the function and file name
that contain the tracepoint, which triggers when the big kernel lock is taken or released. This extension is not
recommended since it makes the field useless for user-space tools that read the ring
buffer in binary format. The big kernel lock tracepoints are an exception because
they are currently being used to remove the big kernel lock, so hopefully these
tracepoints will be removed from the kernel along with the big kernel lock.
Fields defined by the __field_ext macro are assigned into the ring buffer in
TP_fast_assign the same way that fields defined by __field are.
__string(item, src)
The __string macro is used to record a variable length string, which
must be null terminated. The first parameter is the name of the field in
TP_STRUCT__entry, the second parameter is the source that will fill the string.
For example, in the irq_handler_entry tracepoint's TP_STRUCT__entry:
__string( name, action->name )
The variable action is declared as one of the tracepoint's parameters.
The __string macro will allocate enough space in the ring buffer and
place the string at the end of the event data. To assign the string in the
TP_fast_assign:
__assign_str(name, action->name);
This will copy the string (action->name) into the reserved space in the ring buffer.
To output the string, in TP_printk:
TP_printk("irq=%d name=%s", __entry->irq, __get_str(name))
The __get_str macro returns a reference to the dynamic string in the __entry
structure.
__dynamic_array(type, item, len)
If more control is needed over a dynamic string or variable length
array that is not a string, __dynamic_array can be
used. The __dynamic_array macro is used to implement the __string
macro. It takes three parameters: the type and item are the
same as for the __field macro, but the third gives how to
determine the length.
For example, the block_rq_with_error tracepoint has the following:
__dynamic_array( char, cmd, blk_cmd_buf_len(rq) )
The call to blk_cmd_buf_len() will determine the length of the array needed to save
the data.
To assign a dynamic array field in TP_fast_assign, another macro is needed to get a
reference to the array: __get_dynamic_array(item). Note that, since the
block_rq_with_error tracepoint defines a dynamic array that is a string, it uses the
macro __get_str(item) instead:
blk_dump_cmd(__get_str(cmd), rq);
The blk_dump_cmd() just fills the cmd array with data determined
by the rq variable. The tracepoint can do this because the
__get_str macro is defined as:
#define __get_str(field) (char *)__get_dynamic_array(field)
Either __get_dynamic_array or __get_str can be used in the
TP_printk macro to get a reference to the dynamic array.
TP_printk helper functions
There are four TP_printk helper functions, two of which were already described
in the previous section (__get_str and __get_dynamic_array).
The other two helper functions are more complex and deal with mapping
numbers to names.
__print_flags(flags, delimiter, values)
Being able to see the values of flags in a field as symbolic names instead of numbers
makes reading a trace much easier. Imagine having to manually parse kmalloc()
GFP flags of 0x80d0 instead of GFP_KERNEL|GFP_ZERO.
The first two parameters of the __print_flags are simply the variable that
contains the flags (__entry->gfp_flags) and a string delimiter to use between
flags if more than one is found ("|"). The delimiter may also be NULL or
an empty string (""). The third parameter is an array of structures of the type:
struct trace_print_flags {
unsigned long mask;
const char *name;
};
The module_load tracepoint contains a good example of using __print_flags:
TP_printk("%s %s", __get_str(name), __print_flags(flags, "",
{ (1UL << TAINT_PROPRIETARY_MODULE), "P" },
{ (1UL << TAINT_FORCED_MODULE), "F" },
{ (1UL << TAINT_CRAP), "C" }))
Depending on which taint flag is set, the corresponding letter ("P", "F", and/or "C") will
be displayed. If the value
of the flags field is not found within the values parameter, then the value of the flags
parameter is converted to a hex string and that is returned. If no bit is set in the flags
parameter, then __print_flags returns an empty string. Note that
__print_flags internally terminates the values array, so
no explicit termination is required.
Alert readers will have noticed that the previous example of the kmalloc GFP flags used a complex
bit mask. GFP_KERNEL is not a single bit, but is made up of multiple bits. A mask in
values can contain more than one bit. __print_flags will iterate through
values, and will use the first match for any particular set of bits. GFP_KERNEL is made up of
(__GFP_WAIT | __GFP_IO | __GFP_FS). The kmalloc tracepoint passes in the GFP_KERNEL mask
before each of the single-bit values. This allows __print_flags to pick GFP_KERNEL rather
than the individual flags. If one of the three flags that make up GFP_KERNEL were listed in
values before GFP_KERNEL, then the individual flags would appear in the output instead of
GFP_KERNEL. Any remaining flag will also be parsed (as was GFP_ZERO). If bits are still set
after all values have been applied, then those bits will show up as a hex number at the end,
following the delimiter.
__print_symbolic(val, values)
The __print_symbolic function is very similar to __print_flags except that it only
produces output for exact matches. The values field is still an array of
struct trace_print_flags but the mask must match exactly to val in order to have it
print name. If no match is found, val is converted to a hex string, which is returned.
No delimiter is needed since only one value is returned by __print_symbolic. Here's an
example of its use by the irq tracepoints:
#define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
#define show_softirq_name(val) \
__print_symbolic(val, \
softirq_name(HI), \
softirq_name(TIMER), \
softirq_name(NET_TX), \
softirq_name(NET_RX), \
softirq_name(BLOCK), \
softirq_name(BLOCK_IOPOLL), \
softirq_name(TASKLET), \
softirq_name(SCHED), \
softirq_name(HRTIMER), \
softirq_name(RCU))
[...]
TP_printk("vec=%d [action=%s]", __entry->vec,
show_softirq_name(__entry->vec))
Notice how a helper macro is used to set up the values. This is recommended
because
macros will be evaluated before they show up in the output format, but functions
will not. User-space tools will still be able to parse this because a macro was used
rather than a function.
A quick demo
To get a better understanding of what is happening with the events, the following contains
some simple usage of event tracing.
The examples assume that the user has changed directories to
tracing in debugfs (usually, but not always, /sys/kernel/debug/tracing). Also notice that the prompt contains '#' which signifies
that these operations require a privileged user:
[tracing] # echo 1 > events/module/module_load/enable
[tracing] # insmod /tmp/taintme.ko
[tracing] # insmod /tmp/gpl-nice.ko
[tracing] # cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
insmod-1812 [003] 469717.724908: module_load: taintme P
insmod-1814 [003] 470058.525771: module_load: gpl_nice
The taintme.ko module is a module I wrote that does nothing, but does not
have a GPL-compliant license. This causes the "P" taint flag to appear. Notice
that no flag appeared for gpl_nice (which, as the name implies,
does have a GPL license). Remember, if no bit is set in the flags passed to the
__print_flags macro, an empty string is returned.
[tracing] # echo irq_handler_entry softirq_entry > set_event
[tracing] # cat trace | head
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [002] 470574.178475: irq_handler_entry: irq=26 handler=hpet4
<idle>-0 [002] 470574.178485: softirq_entry: softirq=1 action=TIMER
<idle>-0 [002] 470574.178492: softirq_entry: softirq=7 action=SCHED
<idle>-0 [002] 470574.178495: softirq_entry: softirq=9 action=RCU
<idle>-0 [000] 470574.178678: irq_handler_entry: irq=35 handler=eth0
<idle>-0 [000] 470574.178684: softirq_entry: softirq=3 action=NET_RX
Notice that this command used the set_event file to enable tracing.
Using this file has the same effect as using the enable file within the events
directory. Because the tracepoint names are (at least so far) unique, just
echoing the name into set_event is the equivalent of enabling the
tracepoint using events/irq/irq_handler_entry/enable, for example.
For enabling multiple tracepoints at once it is usually more convenient to use the
set_event file, but when activating a single event, all events in a subsystem, or all events,
it is more convenient to use the enable files. More details about using the
event tracer will be explained in an upcoming article.
The IRQ and soft IRQ events shown above illustrate the output of a dynamic string
and use of the __print_symbolic helper function. The irq_handler_entry tracepoint
saves the name of the interrupt device (hpet4 and eth0) using
a dynamic string to display the name in the trace.
The softirq_entry tracepoint uses the __print_symbolic helper function to
convert the number of the soft IRQ vector into the name that it represents
(TIMER, SCHED, RCU, and NET_RX).
Coming in Part 3
Part 3 will look at defining tracepoints outside of the include/trace/events directory
(for modules and architecture-specific tracepoints) along with a look at how the
TRACE_EVENT() macro does its magic. It will also include some more examples of how the
tracepoints are used with Ftrace.
Patches and updates
Kernel trees
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
- Mimi Zohar: EVM (March 29, 2010)
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet