Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.34-rc3, released on March 30. "Anyway, from a messy -rc2 we now have a -rc3 that should be in much better shape. Regressions fixed, and the ShortLog is short enough to be worth posting to lkml." Testers who are using both SELinux and ext3 will want to read the announcement; a regression in previous kernels could leave such systems with corrupted security labels. The short-form changelog is in the announcement, or see the full changelog for all the details.
There have been no stable updates over the last week. There are no less than four updates in the review process, though: 2.6.27.46 (45 patches), 2.6.31.13 (89 patches), 2.6.32.11 (116 patches), and 2.6.33.2 (156 patches). The release of all of these updates can be expected on or after April 1.
Quotes of the week
Toward a saner execve()
Contemporary Linux systems allow processes to set up their environments in any of a number of ways. For various reasons, developers sometimes want even more flexibility; in particular, they would like to take something away (filesystem access, network access, capabilities) from a running process, usually in the name of security. The problem is that such changes can actually make security worse; as has been seen many times, privileged programs can be made to do strange and unfortunate things when run in unexpected environments.

As Andy Lutomirski notes, one response to this problem is to disable setuid semantics as well. But there are a lot of ways for the execve() system call to change a process's privileges which do not involve setuid programs; this is especially true in the presence of security modules. So Andy has proposed a different idea: opt out of execve() instead. To that end, he proposes a new prctl() option (PR_RESTRICT_ME) which could be used to add restrictions to a running process; the first of those is that the process cannot call execve(). Disabling execve() would be mandatory before any other restrictions could be added.
But a process running in a restricted mode might still want to run other programs; that's how Linux programs often work. To accommodate that need, Andy has added a new system call, named execve_nosecurity(). This variant of execve() will run the indicated program, but it will perform absolutely no security transitions first. So no setuid, no SELinux type changes, etc. The end result is a system call with functionality similar to simply mapping the program into the caller's address space and running it directly. With execve_nosecurity(), it is not possible to increase privileges by running another program, so it should make the removal of capabilities from running processes safer.
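Neither PR_RESTRICT_ME nor execve_nosecurity() exists in mainline kernels, so any calling convention is speculative. Still, a sketch (with the constant names and syscall number invented here purely for illustration) shows how a process might use the proposed pair:

```c
/* Hypothetical sketch only: PR_RESTRICT_ME, PR_RESTRICT_EXEC, and
 * SYS_execve_nosecurity are proposals, not merged interfaces; the
 * names and calling convention are invented for illustration. */
static int run_restricted(const char *prog, char *const argv[],
			  char *const envp[])
{
	/* Opting out of execve() must come before any other restriction. */
	if (prctl(PR_RESTRICT_ME, PR_RESTRICT_EXEC, 0, 0, 0) < 0)
		return -1;

	/* Further restrictions (dropping capabilities, etc.) go here. */

	/* Run the program with no security transitions at all:
	 * no setuid, no security-module domain changes. */
	return syscall(SYS_execve_nosecurity, prog, argv, envp);
}
```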
This patch should address a number of the concerns developers have had with the restricting of privileges. It's hard to tell for sure, though, because there has been very little in the way of response so far.
The BKL end game
The removal of the big kernel lock (BKL) has been a kernel development goal for many years. The BKL creates scalability problems and provides some truly strange locking semantics that would be nice to eliminate. The actual work of removing this lock has been a long process, though; it is a tedious job requiring a fairly deep understanding of the affected code. Relatively few people are willing to do that work, so the BKL has survived for far longer than anybody might have liked.

One developer who has put some significant time into BKL removal is Arnd Bergmann; Arnd has just posted a patch series which promises to eliminate the BKL altogether - almost.
To that end, a number of significant changes have been made. The block and tty subsystems both get subsystem-level mutexes to replace their use of the BKL; that is a relatively tricky job because the locking semantics provided by a mutex are rather different. An extensive effort has been made to audit and document ioctl() and llseek() functions which still require the BKL; no other function called from the file_operations structure expects the BKL now. Code still requiring the BKL is now explicitly marked in the kernel configuration system, making it possible to build BKL-free kernels. The patch set also includes a significant series from Jan Blunck removing the BKL from much of the VFS layer.
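A conversion of the kind described above can be sketched as follows (this is not the actual block or tty patch; the device names are invented). The tricky part is that the BKL, unlike a mutex, is released automatically when the holder sleeps and can be acquired recursively, so every converted path must be audited for reliance on those semantics:

```c
/* Sketch: replacing the BKL with a subsystem-level mutex in an
 * ioctl() handler.  The conversion must verify that no code path
 * relied on the BKL's release-on-sleep or recursive locking. */
static DEFINE_MUTEX(mydev_mutex);

static long mydev_unlocked_ioctl(struct file *file, unsigned int cmd,
				 unsigned long arg)
{
	long ret;

	mutex_lock(&mydev_mutex);	/* was: lock_kernel() */
	ret = mydev_do_ioctl(file, cmd, arg);
	mutex_unlock(&mydev_mutex);	/* was: unlock_kernel() */

	return ret;
}
```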
What's left is a few "mostly obscure device driver modules". Arnd has used a fairly large value of "mostly obscure," though; the USB subsystem, for example, still has a BKL dependency. All told, there are 148 modules still using the BKL, most of which are drivers. That may seem like a lot, but it's a huge step in the right direction. Many of us may be running BKL-free kernels sooner than we might have expected.
Kernel development news
Disabling IRQF_DISABLED
Interrupt handlers run asynchronously in response to signals from the hardware. Since they pull the CPU away from whatever it was doing before, handlers are supposed to be very quick; they should, in most cases, tell the hardware to be quiet, arrange for any followup work to be done, and get out of the way. Historically, the situation has not been so simple, though, leading to a distinction between "fast" and "slow" handlers in the earliest days of Linux. It now seems, though, that this distinction could disappear as early as 2.6.35.

The core distinction between fast and slow handlers is this: fast handlers are run with further interrupts disabled, while slow handlers run with interrupts enabled. A slow handler, thus, can be interrupted by another handler, while a fast handler cannot. In an ideal world, slow handlers would not exist; they would all get their work done quickly and not monopolize the CPU, so there would be no point in interrupting them. In the real world, which includes problematic hardware, slow processors, and developers of varying ability, slow handlers have been a fact of life. The nature of some hardware (old IDE controllers, for example) makes it hard to avoid doing a lot of work in the interrupt handler. Meanwhile, other types of devices must have exceedingly fast interrupt response to avoid loss of data; a classic example here is a number of serial ports which are able to buffer exactly one character in the UART. The slow IDE work could not be allowed to delay serial processing; thus, the IDE interrupt handler had to be a slow one.
Over time, though, the situation has changed. Hardware has gotten smarter and better able to handle interrupt response latency. CPUs have gotten faster, so even a relatively slow handler can get a lot of work done quickly. The needs of the realtime tree (and other latency-sensitive workloads) have motivated the reworking of the worst interrupt-time offenders, and improvements in the kernel's deferred work mechanisms have made it easier to move work out of handlers. So the need for the distinction between the two types of interrupt handlers has been fading.
Simultaneously, problems associated with the fast/slow dichotomy have been growing. There is no way to run handlers for interrupts on shared lines (found on any system with a PCI bus) with interrupts disabled, because any other handler for a device on the same line can enable interrupts. Allowing interrupt handlers to interrupt each other leads to worse cache behavior and unpredictable completion times. What set off the recent discussion, though, was this patch from Andi Kleen which was aimed at addressing another problem: deeply nested interrupt handlers can overflow the processor's interrupt stack - a situation from which good things cannot be expected to ensue.
Andi's solution is to monitor the depth of the interrupt stack within the core kernel's interrupt-handling code. Should the stack become more than half full, the core code will no longer enable interrupts before calling slow handlers. In effect, it treats slow handlers as if they were fast handlers for the duration of the stack-space squeeze. This patch solved the problem that was being observed, but it ran into some trouble; in particular, Thomas Gleixner did not hesitate to make his dislike for the patch known. Your editor will try to rephrase the argument in slightly more polite terms; according to Thomas, the patch implemented a solution which was unreliable at best, was liable to create significant latencies in the system, and which ignored the real problem.
Said real problem, according to Thomas, is the fact that slow handlers exist at all. He would like to see a world where all interrupt handlers are run with interrupts disabled, and where all of those handlers get their work done quickly. Any extended interrupt processing, he argued, should be moved to threaded handlers.
Linus initially squashed the idea, saying that a world where we only have fast handlers is not really possible.
It is interesting to note, though, that this position shifted over time. Linus (and others) expressed a number of concerns about running all handlers with interrupts disabled:
- The handlers for some devices simply have to do a lot of work, and
that cannot be easily changed. Embedded systems, in particular, can
have fussy hardware and slow processors.
- Some handlers will not work properly if interrupts are not
enabled. In the past, some drivers have done things like waiting for
a certain amount of time to pass (as reflected in changes to the
jiffies variable). This dubious practice fails outright if
interrupts are disabled: the timer interrupt will be blocked, and
jiffies will not advance.
- Some hardware simply has strict latency requirements which cannot wait for another interrupt handler to finish its job.
Looking at all these worries, one might well wonder if a system which disabled interrupts for all handlers would function well at all. So it is interesting to note one thing: any system which has the lockdep locking checker enabled has been running all handlers that way for some years now. Many developers and testers run lockdep-enabled kernels, and they are available for some of the more adventurous distributions (Rawhide, for example) as well. So we have quite a bit of test coverage for this mode of operation already.
Another thing that happened over the last few years was the integration of the dynamic tick code, which disables the clock tick when the system is idle. Clock ticks are not turned back on for interrupt handlers. So any handler which expects jiffies to change while it is running will, sooner or later, go into a rather undignified infinite loop. Users tend to notice that kind of behavior, so most drivers which behave this way have long since been fixed.
Finally, the realtime tree developers have spent a great deal of time tracking down sources of latency; excessive time spent in interrupt handlers is one of the worst of those. So drivers which control hardware of interest have generally been fixed. The addition of threaded interrupt handlers has made it easier to fix drivers; most of the code can simply be pushed into the threaded handler with no other change at all.
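The threaded-handler split mentioned above is built on request_threaded_irq(). In a sketch (the device-specific functions are invented), the hard handler only quiets the device, while the formerly slow work moves to a thread that runs in process context and may sleep:

```c
/* Hard handler: runs in interrupt context, should be very quick. */
static irqreturn_t mydev_hard_irq(int irq, void *dev_id)
{
	struct mydev *dev = dev_id;

	if (!mydev_irq_is_ours(dev))	/* shared line: maybe not for us */
		return IRQ_NONE;
	mydev_quiet_irq(dev);		/* tell the hardware to be quiet */
	return IRQ_WAKE_THREAD;		/* defer the heavy lifting */
}

/* Threaded handler: runs in process context and may sleep. */
static irqreturn_t mydev_thread_irq(int irq, void *dev_id)
{
	mydev_process_data(dev_id);	/* the formerly "slow" work */
	return IRQ_HANDLED;
}

/* Registration:
 *	request_threaded_irq(irq, mydev_hard_irq, mydev_thread_irq,
 *			     IRQF_SHARED, "mydev", dev);
 */
```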
Given all of this, Ingo Molnar felt confident in saying that the switch could be made safely.
After hearing this from a few core developers, and after doing some research of his own, Linus eventually stopped opposing the idea and started talking about how it should be implemented. Thomas then posted a patch implementing the change. With this patch, the IRQF_DISABLED flag (used to indicate a fast handler) becomes a no-op; it is expected to be removed altogether in 2.6.36.
There are still some concerns about the change, especially with regard to slow hardware on embedded systems. In some of these cases, the problem can be solved with threaded interrupt handlers. Some developers worry, though, that threaded handlers impose too much latency on interrupt response. Improving on that situation is a task for the future; in the meantime, some interrupt handlers may just have to enable interrupts internally to get the required behavior. The preferred function for this purpose is local_irq_enable_in_hardirq(); its use can already be found in the IDE layer.
Since all of the technical obstacles have seemingly been overcome, chances are good that this patch will find its way into the kernel in the 2.6.35 merge window.
Evolutionary development of a semantic patch using Coccinelle
Creating patches is usually handwork, fixing one specific issue at a time. Once in a while, though, there is janitorial work to be done or some infrastructure to change. Then a larger number of issues has to be taken care of simultaneously, yet all of them follow the same basic pattern, e.g. a replacement. Such tasks are often addressed at the source-code level using scripts in sed, perl, and the like. This article examines the usage of Coccinelle, a tool targeted at exactly those kinds of repetitive patching jobs. Because Coccinelle understands C syntax, it can handle those jobs much more easily.
The major drawback of using scripts for code transformation is that they need non-trivial regular expressions in order to match previously unknown names, parse structures, and so forth. To simplify such tasks, "semantic patches" (patches that describe the kinds of changes to be made, rather than the specific lines and differences that make up normal patches) have been introduced, along with Coccinelle to process them. Coccinelle translates the source files to an abstract representation, making it easier to deal with C expressions, isomorphisms, code paths, and so on. For an introduction, refer to Valerie Aurora's LWN article or the Coccinelle web site. This article will provide a step-by-step description of how a semantic patch came into existence once a certain problem was identified.
Learning Coccinelle is still a bit challenging, as information is scattered and, as with all languages, just listing its abilities is not even half of the story. Studying other semantic patches (in addition to asking on the mailing list) worked best for me, so in return this article describes the creation of a semantic patch from scratch. I would like to thank Julia Lawall for her immediate responses to my questions and bug reports.
The problem
An issue was pointed out while developing an I2C driver for hardware monitoring: the driver serving an I2C slave device (called client) uses i2c_set_clientdata() to store a pointer to its private data structure, usually somewhere in the probe function. In the remove function, the driver was then supposed to clear the pointer to the data structure before freeing it, because clients are not really removed but are just unbound from the driver. To prevent a dangling pointer in the still existing client, a typical fix looks like:
+ i2c_set_clientdata(client, NULL);
/* clientdata pointed to data before */
kfree(data);
As this dangling pointer looks quite easy to miss, checking all drivers is a job perfectly suited to Coccinelle. The goal is a patch series fixing this flaw in I2C drivers all over the kernel tree. While the patch series Coccinelle successfully created will probably not be merged directly, it helped in finding a more generic solution: it was agreed that the i2c-core should clear the pointer to the private data structure itself, as there is no guarantee for such pointers after remove(). A follow-up patch series will likely be based on the semantic patch presented below. In any case, the creation process will be useful for similar tasks in the future.
The task can be further divided into two sub-problems:
- Find relevant kfree() calls, which have the private data structure as an argument
- Check if clientdata is NULL already
If the latter is not the case, a fix is needed. For the following examples, Coccinelle 0.2.2 and a 2.6.34-rc1 kernel were used. Older kernels can also be used to get the idea, of course.
Find relevant kfree() calls
A typical remove() routine for an I2C driver looks like this (from drivers/rtc/rtc-pcf8563.c):
static int pcf8563_remove(struct i2c_client *client)
{
	struct pcf8563 *pcf8563 = i2c_get_clientdata(client);

	if (pcf8563->rtc)
		rtc_device_unregister(pcf8563->rtc);

	kfree(pcf8563);
	return 0;
}
The pointer to the data structure of interest was obtained using i2c_get_clientdata(). When the structure itself gets freed, then a check for the call setting clientdata to NULL is needed. So this combination of i2c_get_clientdata() and kfree() is of interest, keeping in mind that the name of the pointer and its type can be anything. As Coccinelle parses the C source on an abstract level, this is easily possible using a few so-called metavariables in the header of our matching rule. Those can then carry the actual naming as used in the source file. Always remember that Coccinelle works on an abstract level. It is quite easy to forget as most of us are used to standard patches on source-code level. A first attempt of our semantic patch having one rule may look like this:
@@
// This is the rule header; metavariables must be declared here
type T;
identifier client, data;
@@
// The matching rule itself:
// Catch the clientdata
T data = i2c_get_clientdata(client);
// then anything in between is allowed
...
// prepend the fix if kfree() is found
+ i2c_set_clientdata(client, NULL);
kfree(data);
For the pcf8563 example above, this patch matches. That means that, after the first line of the rule, the metavariable T will carry the type struct pcf8563 *, data will carry the identifier pcf8563, and client will carry the identifier client. Later uses of these metavariables will, of course, be replaced accordingly. So kfree(data) will in fact look for kfree(pcf8563). As this is also found, the match is complete and the line containing the fix will be added.
But the patch did not find all relevant places. The probe() function also has a dangling pointer in the error path. It wasn't matched because it uses i2c_set_clientdata() instead of i2c_get_clientdata(). So there should be an alternation in the semantic patch handling both cases. And, to make a long story short, a third variant is necessary because other drivers use i2c_get_clientdata() without declaring the type on the same line. It is usually a good idea to do a little grepping first to get an idea of the ways the functions are called. Here is the patch including all alternations, marked by "(", "|", and ")" in the first column:
@@
type T;
identifier client, data;
@@
// Check if function uses clientdata
(
i2c_set_clientdata(client, data);
|
data = i2c_get_clientdata(client);
|
T data = i2c_get_clientdata(client);
)
// anything in between is allowed
...
+ i2c_set_clientdata(client, NULL);
kfree(data);
Surprisingly, there is still no fixup for the probe() function. Why is that? The "..." operator in Coccinelle matches if and only if it matches for all code paths taken. This is to ensure consistency of the modifications. It usually makes a lot of sense; this case, however, is an exception. As it is written now, the lower block of the patch says "anything in between is allowed, but then a kfree(data) must follow on all paths". Of course, the probe() routine does not free the structure if all went well, because the driver is going to use it. So the above rule will not match on this path and thus will fail entirely. What is needed here is a "may or may not exist" operator; as with regular expressions, this is "?". After changing the kfree() line to the following
? kfree(data);
the meaning of the lower block changes to "anything in between is allowed and kfree(data) may occur later". That implies that, if it occurs, the fix connected to kfree(data) will be applied as well, so finally there is the second match.
Check if clientdata is freed already
When applying this semantic patch to the whole rtc subdirectory, there are a number of fixes, but also false positives, i.e. drivers that have correctly cleared the pointer already, which would now be done twice. To fix this, an alternation can be used again. As in many languages, alternation is short-circuited once a branch matches. So the replacing part can be done like this:
(
// If this pattern is found, clientdata is set to NULL before data is freed.
// Do nothing and skip the rest of the alternation
i2c_set_clientdata(client, NULL);
...
kfree(data);
|
// Otherwise apply a fix if kfree() has been found in some code path
// (doesn't need to be in all paths).
+ i2c_set_clientdata(client, NULL);
? kfree(data);
)
If the first block is matched, the driver does the right thing. There still is a match, but no output is produced because no lines are added or removed. If this is not the case, the fix is applied (if needed). While we are at it: a few drivers clear the pointer after they free the structure. The other way around would be cleaner, so the following snippet is the third alternation:
+ i2c_set_clientdata(client, NULL);
kfree(data);
...
- i2c_set_clientdata(client, NULL);
The final version of the semantic patch is hopefully less frightening:
@@
type T;
identifier client, data;
@@
// Check if function uses clientdata
(
i2c_set_clientdata(client, data);
|
data = i2c_get_clientdata(client);
|
T data = i2c_get_clientdata(client);
)
// Anything in between is OK
...
(
// If this pattern is found, clientdata is set to NULL before data is freed.
// Do nothing and skip the rest of the alternation
i2c_set_clientdata(client, NULL);
...
kfree(data);
|
// If this pattern is found, clientdata is set to NULL after data is freed.
// Move it to the front and skip the rest of the alternation
+ i2c_set_clientdata(client, NULL);
kfree(data);
...
- i2c_set_clientdata(client, NULL);
|
// Otherwise apply a fix if kfree() has been found in some code path
// (doesn't need to be in all paths).
+ i2c_set_clientdata(client, NULL);
? kfree(data);
)
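To apply the finished rule, assuming it has been saved as i2c_clientdata.cocci (an invented file name), a Coccinelle 0.2.x run over part of the tree looks roughly like the following; the generated diff can then be split into per-driver patches:

```
$ spatch -sp_file i2c_clientdata.cocci -dir drivers/rtc > clientdata.diff
```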
This matched 96 drivers in 23 directories, changing 213 lines. Note that one really should review those patches afterward. There might be issues which lead to further improvement of the semantic patch, or there may be problematic parts in the source code which need to be handled manually. For example, in this patch series, one driver was missing a kfree() entirely, so a memory leak was discovered. Also check the Coccinelle output for anomalies. In this case, there are some exceptions regarding "inconsistent control-flow paths". That means the source code would be modified in such a way that code paths outside our match would also be affected. An example is a simple error path in a probe function (excerpt from drivers/gpio/pcf857x.c):
	gpio = kzalloc(sizeof *gpio, GFP_KERNEL);
	if (!gpio)
		return -ENOMEM;
	... /* set 'status' according to initialization */
	if (status < 0)
		goto fail;	/* clientdata not used yet! */
	...
	i2c_set_clientdata(client, gpio);
	...
	status = gpiochip_add(&gpio->chip);
	if (status < 0)
		goto fail;	/* clientdata was modified */
	...
fail:
	dev_dbg(...)
	/* 'i2c_set_clientdata(client, NULL)' placed here would be
	   executed for all jumps to 'fail'! */
	kfree(gpio);
	return status;
As can be seen, a jump to fail can happen after clientdata was set to the private data structure, or before. The latter case is outside the scope of the above semantic patch, yet its code path would still be modified. In this example, the change is harmless, as clientdata is still NULL and will just be set to NULL again, but Coccinelle cannot know that and outputs a warning. It is possible to enforce inconsistent changes using the command-line option -allow_inconsistent_paths, but it is marked as dangerous in the help text for a reason. Either triple-check the outcome or just handle the exceptions manually.
Conclusion
This article was meant to incrementally describe the creation of a semantic patch using Coccinelle. While the result works and the patch series was submitted, be aware that the semantic patch here is primarily meant for educational purposes; more advanced features available in Coccinelle have been left out.
One has to get used to a slightly different way of thinking regarding patches along with learning some new syntax when getting started with Coccinelle. The intention of this article was to demonstrate that it is no major task, though. Once the basic stuff is familiar, semantic patches are easier to understand than scripts with loads of regular expressions. Coccinelle has also been around for some time now and produced a number of useful patch series (available via kernel-janitors), so it is not in alpha stage anymore. In the future, being able to read semantic patches will become increasingly important. Larger tasks, like API changes, might start being done in an automatic fashion. Coccinelle is a handy tool, and trying it out is likely to pay off.
Using the TRACE_EVENT() macro (Part 2)
In Part 1, the process of creating a tracepoint in the core kernel was explained. This article continues from there with tricks to lower the tracepoint footprint by using the DECLARE_EVENT_CLASS() macro. In addition, the macros used to build the TP_STRUCT__entry fields are described and the TP_printk helper functions are explained.
Saving space by using DECLARE_EVENT_CLASS()
Every tracepoint created with the TRACE_EVENT() macro creates several functions that allow perf and Ftrace to interact with the tracepoint automatically. Since these functions have unique prototypes (defined by the TP_PROTO and TP_ARGS macros in the TRACE_EVENT() definition), reference unique structures (defined by the TP_STRUCT__entry macro), assign data uniquely to the ring buffer (as defined by TP_fast_assign), and have a unique way to print out that data (defined in TP_printk), there is very little that the TRACE_EVENT() macro can do to reuse code. That means that every TRACE_EVENT() defined will increase the footprint of the kernel; with hundreds of TRACE_EVENT() macros, that is enough to make quite a difference.
text data bss dec hex filename
452114 2788 3520 458422 6feb6 fs/xfs/xfs.o.notrace
996954 38116 4480 1039550 fdcbe fs/xfs/xfs.o.trace
The XFS filesystem declares over a hundred separate trace events. The data section increased substantially, but that is expected because each event has a corresponding structure with a set of function pointers attached to it. What was not acceptable, though, was that enabling the trace events caused the xfs.o text section to double in size!
That pushed an effort to find a way to condense trace events. The obvious place to start was to have several events, which record the same structured data, share their functions. If two events have the same TP_PROTO, TP_ARGS and TP_STRUCT__entry, there should be a way to have these events share the functions that they use. This was the motivation for the new macro DECLARE_EVENT_CLASS() (originally called TRACE_EVENT_TEMPLATE()) and DEFINE_EVENT().
The DECLARE_EVENT_CLASS() macro has the exact same format as TRACE_EVENT():
DECLARE_EVENT_CLASS(sched_wakeup_template,

	TP_PROTO(struct rq *rq, struct task_struct *p, int success),

	TP_ARGS(rq, p, success),

	TP_STRUCT__entry(
		__array( char,	comm,	TASK_COMM_LEN )
		__field( pid_t,	pid )
		__field( int,	prio )
		__field( int,	success )
		__field( int,	target_cpu )
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid = p->pid;
		__entry->prio = p->prio;
		__entry->success = success;
		__entry->target_cpu = task_cpu(p);
	),

	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
		  __entry->comm, __entry->pid, __entry->prio,
		  __entry->success, __entry->target_cpu)
);
This creates a trace framework that can be used by multiple events. The DEFINE_EVENT() macro is used to create trace events defined by DECLARE_EVENT_CLASS():
DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
	     TP_ARGS(rq, p, success));

DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
	     TP_ARGS(rq, p, success));
The example above creates two trace events, sched_wakeup and sched_wakeup_new. The DEFINE_EVENT() macro requires four parameters:
DEFINE_EVENT(class, name, proto, args)
- class - the name of the class created with DECLARE_EVENT_CLASS().
- name - the name of the trace event.
- proto - the prototype that is the same as TP_PROTO in the DECLARE_EVENT_CLASS().
- args - the arguments of the prototype that is the same as TP_ARGS in DECLARE_EVENT_CLASS().
Unfortunately, due to the limitations of the C preprocessor, the DEFINE_EVENT() macro needs to repeat the prototype and arguments of the DECLARE_EVENT_CLASS().
Because several of the tracepoints in XFS are very similar, using the DECLARE_EVENT_CLASS() brought down the text and bss size quite substantially.
text data bss dec hex filename
452114 2788 3520 458422 6feb6 fs/xfs/xfs.o.notrace
996954 38116 4480 1039550 fdcbe fs/xfs/xfs.o.trace
638482 38116 3744 680342 a6196 fs/xfs/xfs.o.class
To keep the footprint of trace events down, try to consolidate events using the DECLARE_EVENT_CLASS() and DEFINE_EVENT() macros. There is no advantage to using the TRACE_EVENT() macro over the other two. In fact, the TRACE_EVENT() macro is now defined as just:
#define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
	DECLARE_EVENT_CLASS(name,				\
			    PARAMS(proto),			\
			    PARAMS(args),			\
			    PARAMS(tstruct),			\
			    PARAMS(assign),			\
			    PARAMS(print));			\
	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));

Note that the PARAMS() macro allows the arguments to contain commas without being mistaken for multiple parameters of DECLARE_EVENT_CLASS() or DEFINE_EVENT().
TP_STRUCT__entry macros
The first article mentioned the __field and __array macros used to create the structure format of the event that is stored in the ring buffer. __field(type, item) declares a field in the structure called item of type type (i.e. type item;). __array(type, item, len) declares a static array called item with len elements of type type (i.e. type item[len];). Those two are the most common, but there are other macros that allow for more complex storage in the ring buffer.
__field_ext(type, item, filter_type)
The __field_ext macro is mainly used to help the event filter. The event filter (to be discussed in an upcoming article) allows the user to filter events based on the contents of their fields. The type and item are the same as the fields used by __field; the filter_type is an enum. Currently only the following values are used:
- FILTER_OTHER - equivalent to the standard __field() macro.
- FILTER_PTR_STRING - the field points to a string outside the ring buffer.
The FILTER_PTR_STRING and __field_ext are currently only used by the big kernel lock tracepoints. These fields point to the function and file name that contain the tracepoint, which triggers when the big kernel lock is taken or released. This extension is not recommended since it makes the field useless for user-space tools that read the ring buffer in binary format. The big kernel lock tracepoints are an exception because they are currently being used to remove the big kernel lock, so hopefully these tracepoints will be removed from the kernel along with the big kernel lock.
Fields defined by the __field_ext macro are assigned into the ring buffer in TP_fast_assign the same way that fields defined by __field are.
__string(item, src)
The __string macro is used to record a variable length string, which must be null terminated. The first parameter is the name of the field in TP_STRUCT__entry, the second parameter is the source that will fill the string. For example, in the irq_handler_entry tracepoint's TP_STRUCT__entry:
__string( name, action->name )
The variable action is declared as one of the tracepoint's parameters. The __string macro will allocate enough space in the ring buffer and place the string at the end of the event data. To assign the string in the TP_fast_assign:
__assign_str(name, action->name);
This will copy the string (action->name) into the reserved space in the ring buffer. To output the string, in TP_printk:
TP_printk("irq=%d name=%s", __entry->irq, __get_str(name))
The __get_str macro returns a reference to the dynamic string in the __entry structure.
__dynamic_array(type, item, len)
If more control is needed over a dynamic string, or for a variable-length array that is not a string, __dynamic_array can be used. (In fact, the __dynamic_array macro is used to implement the __string macro.) It takes three parameters: the type and item are the same as for the __field macro, but the third tells how to determine the length. For example, the block_rq_with_error tracepoint has the following:
__dynamic_array( char, cmd, blk_cmd_buf_len(rq) )
The call to blk_cmd_buf_len() will determine the length of the array needed to save the data.
To assign a dynamic array field in TP_fast_assign, another macro is needed to get a reference to the array: __get_dynamic_array(item). Note that, since the block_rq_with_error tracepoint defines a dynamic array that is a string, it uses the macro __get_str(item) instead:
blk_dump_cmd(__get_str(cmd), rq);
blk_dump_cmd() simply fills the cmd array with data derived from the rq variable. The tracepoint can do this because the __get_str macro is defined as:
#define __get_str(field) (char *)__get_dynamic_array(field)
Either __get_dynamic_array or __get_str can be used in the TP_printk macro to get a reference to the dynamic array.
TP_printk helper functions
There are four TP_printk helper functions, two of which were already described in the previous section (__get_str and __get_dynamic_array). The other two helper functions are more complex and deal with mapping numbers to names.
__print_flags(flags, delimiter, values)
Being able to see the values of flags in a field as symbolic names instead of numbers makes reading a trace much easier. Imagine having to manually parse kmalloc() GFP flags of 0x80d0 instead of GFP_KERNEL|GFP_ZERO.
The first two parameters of the __print_flags are simply the variable that contains the flags (__entry->gfp_flags) and a string delimiter to use between flags if more than one is found ("|"). The delimiter may also be NULL or an empty string (""). The third parameter is an array of structures of the type:
struct trace_print_flags {
	unsigned long mask;
	const char *name;
};
The module_load tracepoint contains a good example of using __print_flags:
TP_printk("%s %s", __get_str(name), __print_flags(__entry->taints, "",
	{ (1UL << TAINT_PROPRIETARY_MODULE), "P" },
	{ (1UL << TAINT_FORCED_MODULE), "F" },
	{ (1UL << TAINT_CRAP), "C" }))
Depending on which taint flag is set, the corresponding letter ("P", "F", and/or "C") will be displayed. If the value of the flags field is not found within the values parameter, then the value of the flags parameter is converted to a hex string and that is returned. If no bit is set in the flags parameter, then __print_flags returns an empty string. Note that __print_flags internally terminates the values array, so no explicit termination is required.
Alert readers will have noticed that the previous kmalloc GFP flags example used a complex bit mask: GFP_KERNEL is not a single bit, but is made up of multiple bits (__GFP_WAIT | __GFP_IO | __GFP_FS). A mask in values may contain more than one bit, and __print_flags iterates through values, using the first match for any particular set of bits. The kmalloc tracepoint lists the GFP_KERNEL mask before each of the single-bit values, which allows __print_flags to pick GFP_KERNEL rather than the individual flags; if one of the three flags that make up GFP_KERNEL were listed before GFP_KERNEL, that individual flag would appear in the output instead. Any remaining flags are then matched in turn (as GFP_ZERO was). If bits are still set after all values have been applied, those bits show up as a hex number at the end, following the delimiter.
__print_symbolic(val, values)
The __print_symbolic function is very similar to __print_flags except that it only produces output for exact matches. The values parameter is still an array of struct trace_print_flags, but the mask must exactly match val for name to be printed. If no match is found, val is converted to a hex string, which is returned instead. No delimiter is needed since __print_symbolic returns only one value. Here's an example of its use by the irq tracepoints:
#define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
#define show_softirq_name(val) \
__print_symbolic(val, \
softirq_name(HI), \
softirq_name(TIMER), \
softirq_name(NET_TX), \
softirq_name(NET_RX), \
softirq_name(BLOCK), \
softirq_name(BLOCK_IOPOLL), \
softirq_name(TASKLET), \
softirq_name(SCHED), \
softirq_name(HRTIMER), \
softirq_name(RCU))
[...]
TP_printk("vec=%d [action=%s]", __entry->vec,
show_softirq_name(__entry->vec))
Notice how a helper macro is used to set up the values. This is recommended because macros are evaluated before they show up in the output format, while functions are not; user-space tools can still parse the result because a macro was used rather than a function.
A quick demo
To get a better understanding of what is happening with the events, the following contains some simple usage of event tracing. The examples assume that the user has changed directories to tracing in debugfs (usually, but not always, /sys/kernel/debug/tracing). Also notice that the prompt contains '#' which signifies that these operations require a privileged user:
[tracing] # echo 1 > events/module/module_load/enable
[tracing] # insmod /tmp/taintme.ko
[tracing] # insmod /tmp/gpl-nice.ko
[tracing] # cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
insmod-1812 [003] 469717.724908: module_load: taintme P
insmod-1814 [003] 470058.525771: module_load: gpl_nice
The taintme.ko module is a module I wrote that does nothing, but does not have a GPL-compliant license. This causes the "P" taint flag to appear. Notice that no flag appeared for gpl_nice (which, as the name implies, does have a GPL license). Remember, if no bit is set in the flags passed to the __print_flags macro, an empty string is returned.
[tracing] # echo irq_handler_entry softirq_entry > set_event
[tracing] # cat trace | head
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [002] 470574.178475: irq_handler_entry: irq=26 handler=hpet4
<idle>-0 [002] 470574.178485: softirq_entry: softirq=1 action=TIMER
<idle>-0 [002] 470574.178492: softirq_entry: softirq=7 action=SCHED
<idle>-0 [002] 470574.178495: softirq_entry: softirq=9 action=RCU
<idle>-0 [000] 470574.178678: irq_handler_entry: irq=35 handler=eth0
<idle>-0 [000] 470574.178684: softirq_entry: softirq=3 action=NET_RX
Notice that this command used the set_event file to enable tracing. Using this file or the enable files within the events directory has the same effect. Because tracepoint names are (at least so far) unique, echoing a name into set_event is equivalent to enabling the tracepoint through its enable file (events/irq/irq_handler_entry/enable, for example). For enabling multiple tracepoints at once it is usually more convenient to use the set_event file, but when activating a single event, all events in a subsystem, or all events, it is more convenient to use the enable files. More details about using the event tracer will be explained in an upcoming article.
The IRQ and soft IRQ events shown above illustrate the output of a dynamic string and the use of the __print_symbolic helper function. The irq_handler_entry tracepoint saves the name of the interrupt device (hpet4 and eth0) as a dynamic string in order to display it in the trace. The softirq_entry tracepoint uses the __print_symbolic helper function to convert the soft IRQ vector number into the name it represents (TIMER, SCHED, RCU, and NET_RX).
Coming in Part 3
Part 3 will look at defining tracepoints outside of the include/trace/events directory (for modules and architecture-specific tracepoints) along with a look at how the TRACE_EVENT() macro does its magic. It will also include some more examples of how the tracepoints are used with Ftrace.
Page editor: Jonathan Corbet