Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.34-rc5, released on April 19. "Random fixes all around. The most noticeable (for people who got hit by it) may be the fix for bootup problems that some people had (ACPI dividing by zero: kernel bugzilla 15749), but there's stuff all over. The shortlog gives some idea." Said shortlog is in the announcement, or see the full changelog for all the details. For added fun, this release has a new code name: "Sheep on meth."
There have been no stable updates since April 1.
Quote of the week
Ceph: The Distributed File System Creature from the Object Lagoon (Linux Mag)
Linux Magazine has a detailed look at the Ceph filesystem, which was merged for 2.6.34. "However, probably the most fundamental core assumption in the design of Ceph is that large-scale storage systems are dynamic and there are guaranteed to be failures. The first part of the assumption, assuming storage systems are dynamic, means that storage hardware is added and removed and the workloads on the system are changing. Included in this assumption is that it is presumed there will be hardware failures and the file system needs to be adaptable and resilient."
Fixing the ondemand governor
The "cpufreq" subsystem is charged with adjusting the CPU clock frequency for optimal performance. Definitions of "optimal" can vary, so there is more than one governor - and, thus, more than one policy - available. The "performance" governor prioritizes throughput above all else, while the "powersave" governor tries to keep power consumption to a minimum. The most commonly-used governor, though, is "ondemand," which attempts to perform a balancing act between power usage and throughput.
In a simplified form, ondemand works like this: every so often the governor wakes up and looks at how busy the CPU is. If the idle time falls below a threshold, the CPU frequency will be bumped up; if, instead, there is too much idle time, the frequency will be reduced. By default, on a system with high-resolution timers, the minimum idle percentage is 5%; CPU frequency will be reduced if idle time goes above 15%. The minimum percentage can be adjusted in sysfs (under /sys/devices/system/cpu/cpuN/cpufreq/); the maximum is wired at 10% above the minimum.
This governor has been in use for some time, but, as it turns out, it can create performance difficulties in certain situations. Whenever the system workload alternates quickly between CPU-intensive and I/O-intensive phases, things slow down. That's because the governor, on seeing the system go idle, drops the frequency down to the minimum. After the CPU gets busy again, it runs for a while at low speed until the governor figures out that the situation has changed. Then things go idle and the cycle starts over. As it happens, this kind of workload is fairly common; "git grep" and the startup of a large program are a couple of examples.
Arjan van de Ven has come up with a fix for this governor which is quite simple in concept. The accounting of "idle time" is changed so that time spent waiting for disk I/O no longer counts. If a processor is dominated by a program alternating between processing and waiting for disk operations, that processor will appear to be busy all the time. So it will remain at a higher frequency and perform better. That makes the immediate problem go away without, says Arjan, significantly increasing power consumption.
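The decision rule described above, together with Arjan's accounting change, might be sketched like this (the names, signatures, and threshold handling here are illustrative, not the actual cpufreq code):

```c
#include <assert.h>

#define MIN_IDLE_PCT  5   /* below this much idle, raise the frequency   */
#define MAX_IDLE_PCT 15   /* above this (min + 10), lower the frequency  */

enum freq_action { FREQ_RAISE, FREQ_LOWER, FREQ_KEEP };

/*
 * idle_pct is the percentage of the sampling period the CPU spent idle;
 * iowait_pct is the portion of that idle time spent waiting on disk I/O.
 * With count_iowait_as_idle == 0 (modeling Arjan's change), time spent in
 * iowait no longer counts as idle time.
 */
static enum freq_action ondemand_decide(int idle_pct, int iowait_pct,
                                        int count_iowait_as_idle)
{
    int effective_idle = idle_pct;

    if (!count_iowait_as_idle)
        effective_idle -= iowait_pct;

    if (effective_idle < MIN_IDLE_PCT)
        return FREQ_RAISE;          /* CPU looks busy: speed it up     */
    if (effective_idle > MAX_IDLE_PCT)
        return FREQ_LOWER;          /* CPU looks idle: slow it down    */
    return FREQ_KEEP;
}
```

Under the old accounting, a CPU that is 40% idle but spends all of that idle time waiting for the disk looks underused and gets slowed down; with iowait excluded, the same CPU looks fully busy and stays at a high frequency, which is exactly the behavior the fix is after.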
But, Arjan says, "there are many things wrong with ondemand, and I'm writing a new governor to fix the more fundamental issues with it". That code has not yet been posted, so it's not clear what sort of heuristics it will contain. Stay tuned; the demand for ondemand may soon be reduced significantly.
DM and MD come a little closer
The management of RAID arrays in the kernel is a complicated task - and one upon which the fate of much data relies. Given that, it would make sense to have a single set of RAID routines which is improved by all. What the Linux kernel has, instead, is three different RAID implementations: in the multiple device (MD) subsystem, in the device mapper (DM) code, and in the Btrfs filesystem. It has often been said that unifying these implementations would be a good thing, but that is not easy and, thus far, it has not happened.
MD maintainer Neil Brown has now taken a step in this direction with the posting of his dm-raid456 module, a RAID implementation for the device mapper which is built on the MD code. This patch set has the potential to eliminate a bunch of duplicated code, which can only be a good thing. It also brings some nice features, including RAID6 support, multiple-target support, and more, to the device mapper layer.
This is early work which, probably, is not destined for the next merge window. The response from the device mapper side has been reasonably positive, though. So, with luck, we'll someday have both subsystems using the same RAID code.
Kernel development news
ELC: Using LTTng
Several tracing presentations were in evidence at this year's Embedded Linux Conference, including Mathieu Desnoyers's "hands-on tutorial" on the Linux Trace Toolkit next generation (LTTng). Desnoyers showed how to use LTTng to solve real-world performance and latency problems, while giving a good overview of the process of Linux kernel problem solving. The target was embedded developers, but the presentation was useful for anyone interested in finding and fixing problems in Linux itself, or applications running atop it.
Desnoyers has been hacking on the kernel for many years, and was recently awarded his Ph.D. in computer engineering based largely on his work with LTTng. Since then, he has started his own company, EfficiOS, to do consulting work on LTTng as well as to help various customers diagnose their performance problems. In addition to LTTng itself, he also developed the Userspace RCU library, which allows user-space applications to use the Read-Copy-Update (RCU) data synchronization technique that is used by the kernel.
LTTng consists of three separate components: patches to the Linux kernel to add tracepoints along with the LTTng infrastructure and ring buffer, the LTT control user-space application, and the LTTV GUI interface for viewing trace output. Each is available on the LTTng web site and there are versions for kernels going back to 2.6.12. There is extensive documentation as well.
Lockless trace clock
The lockless trace clock is "one major piece of LTTng that is architecture dependent". This clock is a high-precision timestamp, derived from the processor's cycle counter, that is placed on each trace event. The timestamp is coordinated between separate processors, allowing system-wide correlation of events.
On processors with frequency scaling and non-synchronized cycle counters, like some x86 systems, the trace clock can get confused when the processor operating frequency changes, so that feature needs to be disabled before tracing. Desnoyers noted that Nokia had funded work for LTTng to properly handle the frequency scaling and power management features of the ARM OMAP3 processor, but that it hasn't yet been done for x86.
Tracing strategy
A portion of the talk was about the tradeoffs of various tracing strategies. Desnoyers described the factors that need to be considered when deciding how to trace a problem, including what kind of bug it is, how reproducible it is, how much tracing overhead the system can tolerate, the availability of a system to reproduce it on, whether it occurs only on production systems, and so on. Each of these things "impact the number of tracing iterations available". It may be that, because the bug occurs infrequently or is on a third-party production system, one can only get a single tracing run, "or you may have the luxury of multiple trace runs".
Based on those factors, there are different kinds of tracing to choose from in LTTng. At the top level, you can use producer-consumer tracing, where all of the enabled events are recorded to a filesystem, "flight recorder" mode, where the events are stored in fixed-size memory buffers and newer events overwrite the old, or both of these modes can be active at once. There are advantages and disadvantages to each, of course, starting with higher overhead for producer-consumer traces. But the amount of data which can be stored for those types of traces is generally much higher than for flight recorder traces.
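The overwrite behavior of flight recorder mode can be modeled in a few lines of C. This toy ring buffer only illustrates the semantics described above; LTTng's real ring buffer is per-CPU, lockless, and far more sophisticated, and all names here are invented for the sketch:

```c
#include <assert.h>

#define RING_SIZE 4   /* tiny for illustration; real buffers are much larger */

struct flight_recorder {
    int events[RING_SIZE];
    unsigned long head;   /* total number of events ever recorded */
};

/* Record an event, overwriting the oldest one once the buffer is full. */
static void fr_record(struct flight_recorder *fr, int event)
{
    fr->events[fr->head % RING_SIZE] = event;
    fr->head++;
}

/* The oldest event still in the buffer: where post-mortem analysis starts. */
static int fr_oldest(const struct flight_recorder *fr)
{
    if (fr->head <= RING_SIZE)
        return fr->events[0];
    return fr->events[fr->head % RING_SIZE];
}
```

After six events have been recorded into a four-slot buffer, the first two are gone; only the most recent four survive for analysis, which is precisely the tradeoff against producer-consumer tracing: bounded overhead, bounded history.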
Because there is generally an enormous amount of data in a trace, Desnoyers described several techniques to help home in on the problem area. In LTTng, events are grouped by subsystem into "channels" and each channel can have a different buffer size for flight recorder mode. That allows more backlog for certain kinds of events, while limiting others. In addition, instrumentation (trace events, kernel markers) can be disabled to reduce the amount of trace data that is generated.
Another technique is to use "anchors", specific events in the trace that are the starting point for analysis. The anchor is often used by going backward in the trace from that point. The anchors can either come from instrumentation like the LTTng user-space tracer or kernel trace events, or they can be generated by the analysis itself. The longest timer interrupt jitter (i.e. how far from the nominal time where it should happen) is an example he gave of one kind of analysis-generated anchor.
A related idea is "triggers", which are a kind of instrumentation with a side-effect. By using ltt_trace_stop("name"), a trigger can stop the tracing when a particular condition occurs in the kernel. Using lttctl from user space is another way to stop a trace. Triggers are particularly helpful for flight recorder traces, he said.
Live demo
Desnoyers also did a demonstration of LTTng on two separate kinds of problems, both involving the scheduler. One was to try to identify sources of audio latency by running a program with a periodic timer that expired every 10ms. The code was written to put an anchor into the trace data every time the timer missed by more than 5ms. He ran the program, then moved a window around on the screen, which caused delays to the timer of up to 60ms.
Using that data, he brought up the LTTV GUI to look at what was going on. It was a bit hard to follow exactly what he was doing, but eventually he narrowed it down to the scheduling of the X.org server. He then instrumented the kernel scheduler with a kernel marker to get more information on how it was making its decisions. Kernel markers were proposed for upstream inclusion a ways back, but were roundly criticized for cluttering up the kernel with things that looked like debug printk() calls. Markers are good for ad hoc tracepoints, but "don't try to push it upstream", he warned.
The other demo was to look at a buffer underrun in ALSA's aplay utility. The same scheduler marker was used to investigate that problem as well.
The status of LTTng
Perhaps the hardest question was saved for the end of the presentation: what is the status of LTTng getting into the mainline? Desnoyers seemed upbeat about the prospects of that happening, partly because there has been so much progress made in the kernel tracing area. Most of the instrumentation side has already gone in, and the tracer itself is ongoing work that he now has time to do. He gave a "status of LTTng" presentation at the Linux Foundation Collaboration Summit (LFCS), which was held just after ELC, and there was agreement among some of the tracing developers to work together on getting more of LTTng into the kernel.
Desnoyers does not see a problem with having multiple tracers in the kernel, so long as they use common infrastructure and target different audiences. Ftrace is "more oriented toward kernel developers", he said, and it has different tracers for specific purposes. LTTng, on the other hand, is geared toward users and user-space programmers who need a look into the kernel to diagnose their problems. With Ftrace developer Steven Rostedt participating in both the ELC and LFCS talks—and agreeing with many of Desnoyers's ideas—the prospects look good for at least parts of LTTng to make their way into the mainline over the next year.
When writeback goes wrong
Like any other performance-conscious kernel, Linux does not immediately flush data written to files back to the underlying storage. Caching that data in memory can help optimize filesystem layout and seek times; it also eliminates duplicate writes should the same blocks be written multiple times in succession. Sooner or later (preferably sooner), that data must find its way to persistent storage; the process of getting it there is called "writeback." Unfortunately, as some recent discussions demonstrate, all is not well in the Linux writeback code at the moment.
There are two distinct ways in which writeback is done in contemporary kernels. A series of kernel threads handles writeback to specific block devices, attempting to keep each device busy as much of the time as possible. But writeback also happens in the form of "direct reclaim," and that, it seems, is where much of the trouble is. Direct reclaim happens when the core memory allocator is short of memory; rather than cause memory allocations to fail, the memory management subsystem will go casting around for pages to free. Once a sufficient amount of memory is freed, the allocator will look again, hoping that nobody else has swiped the pages it worked so hard to free in the meantime.
Dave Chinner recently encountered a problem involving direct reclaim which manifested itself as a kernel stack overflow. Direct reclaim can happen as a result of almost any memory allocation call, meaning that it can be tacked onto the end of a call chain of nearly arbitrary length. So, by the time that direct reclaim is entered, a large amount of kernel stack space may have already been used. Kernel stacks are small - usually no larger than 8KB and often only 4KB - so there is not a lot of space to spare in the best of conditions. Direct reclaim, being invoked from random places in the kernel, cannot count on finding the best of conditions.
The problem is that direct reclaim, itself, can invoke code paths of great complexity. At best, reclaim of dirty pages involves a call into filesystem code, which is complex enough in its own right. But if that filesystem is part of a union mount which sits on top of a RAID device which, in turn, is made up of iSCSI drives distributed over the network, the resulting call chain may be deep indeed. This is not a task that one wants to undertake with stack space already depleted.
Dave ran into stack overflows - with an 8K stack - while working with XFS. The XFS filesystem is not known for its minimalist approach to stack use, but that hardly matters; in the case he describes, over 3K of stack space was already used before XFS got a chance to take its share. This is clearly a situation where things can easily go wrong. Dave's answer was a patch which disables the use of writeback in direct reclaim. Instead, the direct reclaim path must content itself with kicking off the flusher threads and grabbing any clean pages which it may find.
There is another advantage to avoiding writeback in direct reclaim. The per-device flusher threads can accumulate adjacent disk blocks and attempt to write data in a way which minimizes seeks, thus maximizing I/O throughput. Direct reclaim, instead, takes pages from the least-recently-used (LRU) list with an eye toward freeing pages in a specific zone. As a result, pages flushed by direct reclaim tend to be scattered more widely across the storage devices, causing higher seek rates and worse performance. So disabling writeback in direct reclaim looks like a winning strategy.
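The seek effect can be illustrated with a small sketch (illustrative only; kernel block and page handling is far more involved, and these helper names are invented). Sorting a batch of pending block numbers, as a flusher thread effectively can, sharply reduces the number of non-contiguous jumps the disk head must make, while LRU order leaves them scattered:

```c
#include <assert.h>
#include <stdlib.h>

/* Compare two block numbers for qsort(). */
static int cmp_blocks(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Count "seeks": transitions between non-adjacent blocks in submission order. */
static int count_seeks(const unsigned long *blocks, int n)
{
    int i, seeks = 0;

    for (i = 1; i < n; i++)
        if (blocks[i] != blocks[i - 1] + 1)
            seeks++;
    return seeks;
}

/* What a flusher thread can do that direct reclaim cannot: sort the batch. */
static void sort_batch(unsigned long *blocks, int n)
{
    qsort(blocks, n, sizeof(*blocks), cmp_blocks);
}
```

A batch like {100, 7, 8, 101, 9, 102} takes four seeks when submitted in LRU order, but only one once sorted into {7, 8, 9, 100, 101, 102}.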
Except, of course, we're talking about virtual memory management code, and nothing is quite that simple. As Mel Gorman pointed out, no longer waiting for writeback in direct reclaim may well increase the frequency with which direct reclaim fails. That, in turn, can throw the system into the out-of-memory state, which is rarely a fun experience for anybody involved. This is not just a theoretical concern; it has been observed at Google and elsewhere.
Direct reclaim is also where lumpy reclaim is done. The lumpy reclaim algorithm attempts to free pages in physically-contiguous (in RAM) chunks, minimizing memory fragmentation and increasing the reliability of larger allocations. There is, unfortunately, a tradeoff to be made here: the nature of virtual memory is such that pages which are physically contiguous in RAM are likely to be widely dispersed on the backing storage device. So lumpy reclaim, by its nature, is likely to create seeky I/O patterns, but skipping lumpy reclaim increases the likelihood of higher-order allocation failures.
So various other solutions have been contemplated. One of those is simply putting the kernel on a new stack-usage diet in the hope of avoiding stack overflows in the future. Dave's stack trace, for example, shows that the select() system call grabs 1600 bytes of stack before actually doing any work. Once again, though, there is a tradeoff here: select() behaves that way in order to reduce allocations (and improve performance) for the common case where the number of file descriptors is relatively small. Constraining its stack use would make an often performance-critical system call slower.
Beyond that, reducing stack usage - while being a worthy activity in its own right - is seen as a temporary fix at best. Stack fixes can make a specific call chain work, but, as long as arbitrarily-complex writeback paths can be invoked with an arbitrary amount of stack space already used, problems will pop up in other places. So a more definitive kind of fix is required; stack diets may buy time but will not really solve the problem.
One common suggestion is to move direct reclaim into a separate kernel thread. That would put reclaim (and writeback) onto its own stack where there will be no contention with system calls or other kernel code. The memory allocation paths could poke this thread when its services are needed and, if necessary, block until the reclaim thread has made some pages available. Eventually, the lumpy reclaim code could perhaps be made smarter so that it produces less seeky I/O patterns.
Another possibility is simply to increase the size of the kernel stack. But, given that overflows are being seen with 8K stacks, an expansion to 16K would be required. The increase in memory use would not be welcome, and the increase in larger allocations required to provide those stacks would put more pressure on the lumpy reclaim code. Still, such an expansion may well be in the cards at some point.
According to Andrew Morton, though, the real problem is to be found elsewhere: not in how direct reclaim is behaving, but in the fact that direct reclaim is happening as often as it is in the first place. If there were less need to invoke direct reclaim, the problems it causes would be less pressing.
So, if Andrew gets his way, the focus of this work will shift to figuring out why the memory management code's behavior changed and fixing it. To that end, Dave has posted a set of tracepoints which should give some visibility into how the writeback code is making its decisions. Those tracepoints have already revealed some bugs, which have been duly fixed. The main issue remains unresolved, though. It has already been named as a discussion topic for the upcoming filesystems, storage, and memory management workshop (happening with LinuxCon in August), but many of the people involved are hoping that this particular issue will be long-solved by then.
Using the TRACE_EVENT() macro (Part 3)
Tracepoints within the kernel facilitate the analysis of how the kernel performs. The flow of critical information can be followed and examined in order to debug a latency problem, or to simply figure out better ways to tune the system. The core kernel tracepoints, like the scheduler and interrupt tracepoints, let the user see when and how events take place inside the kernel. Module developers can also take advantage of tracepoints; if their users or customers have problems, the developer can have them enable the tracepoints and analyze the situation. This article will explain how to add tracepoints in modules that are outside of the core kernel code.
In Part 1, the process of creating a tracepoint in the core kernel was explained. Part 2 explained how to consolidate tracepoints with the use of DECLARE_EVENT_CLASS() and DEFINE_EVENT(), and went over the field macros of TP_STRUCT__entry and the function helpers of TP_printk(). This article takes a look at how to add tracepoints outside of the core kernel, where they can be used by modules or architecture-specific code. It also takes a brief look at some of the magic behind the TRACE_EVENT() macro, and provides a few more examples to get your feet wet with using tracepoints.
Defining a trace header outside of include/trace/events
For tracepoints in modules or in architecture-specific directories, having trace header files in the global include/trace/events directory may clutter it. The result would be to put files like mips_cpu.h or arm_cpu.h, which are not necessary for the core kernel, into that directory. It would end up something like the old include/asm-*/ setup. Also, if tracepoints went into staging drivers, putting staging header files in the core kernel code base would be a bad design.
Because trace header files are handled very differently than other header files, the best solution is to have the header files placed at the location where they are used. For example, the XFS tracepoints are located in the XFS subdirectory in fs/xfs/xfs_trace.h. But, some of the magic of define_trace.h is that it must be able to include the trace file that included it (the reason for TRACE_HEADER_MULTI_READ). As explained in Part 1, the trace header files start with the cpp conditional:
#if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SCHED_H
Part 1 explained that one and only one of the C files that include a particular trace header will define CREATE_TRACE_POINTS before including the trace header. That activates the define_trace.h that the trace header file includes. The define_trace.h file will include the header again, but will first define TRACE_HEADER_MULTI_READ. As the cpp condition shows, this define will allow the contents of the trace header to be read again.
For define_trace.h to include the trace header file, it must be able to find it. To do this, some changes need to be made to the Makefile where the trace file is included, and that file will need to tell define_trace.h not to look for it in the default location (include/trace/events).
To tell define_trace.h where to find the trace header, the Makefile must define the path to the location of the trace file. One method is to extend CFLAGS to include the path:
EXTRA_CFLAGS = -I$(src)
But that affects CFLAGS for all files that the Makefile builds. If it is desired to only modify the CFLAGS for the C file that has CREATE_TRACE_POINTS defined, then the method used by the net/mac80211/Makefile can be used:
CFLAGS_driver-trace.o = -I$(src)
The driver-trace.c file contains the CREATE_TRACE_POINTS define and the include of driver-trace.h that contains the TRACE_EVENT() macros for the mac80211 tracepoints.
To demonstrate how to add tracepoints to a module, I wrote a simple module, called sillymod, which just creates a thread that wakes up every second and performs a printk and records the number of times that it has done so. I will look at the relevant portions of the files, but the full file contents are also available: module, Makefile, the module with tracepoint, and the trace header file.
The first step is to create the desired tracepoints. The trace header file is created the same way as the core trace headers described in Part 1, with a few more additions. The header must start by defining the system where all tracepoints within the file will belong to:
#undef TRACE_SYSTEM
#define TRACE_SYSTEM silly
This module creates a trace system called silly. Then the special cpp condition is included:
#if !defined(_SILLY_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _SILLY_TRACE_H
The linux/tracepoint.h file is included, and finally the TRACE_EVENT() macros; one in this example:
#include <linux/tracepoint.h>
TRACE_EVENT(me_silly,
TP_PROTO(unsigned long time, unsigned long count),
TP_ARGS(time, count),
TP_STRUCT__entry(
__field( unsigned long, time )
__field( unsigned long, count )
),
TP_fast_assign(
__entry->time = jiffies;
__entry->count = count;
),
TP_printk("time=%lu count=%lu", __entry->time, __entry->count)
);
#endif /* _SILLY_TRACE_H */
The above is the same as what was described in Part 1 for core kernel tracepoints. After the #endif things become a bit different. Before including the define_trace.h file the following is added:
/* This part must be outside protection */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE silly-trace
#include <trace/define_trace.h>
The TRACE_INCLUDE_PATH tells define_trace.h not to look in the default location (include/trace/events) for the trace header, but instead to look in the include search path. By default, define_trace.h will include a file named for TRACE_SYSTEM. The TRACE_INCLUDE_FILE tells define_trace.h that the trace header is called silly-trace.h (the .h is automatically added to the end of TRACE_INCLUDE_FILE).
To add the tracepoint to the module, the module now includes the trace header. Before including the trace header it must also define CREATE_TRACE_POINTS:
#define CREATE_TRACE_POINTS #include "silly-trace.h"
The tracepoint can now be added to the code.
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(HZ);
printk("hello! %lu\n", count);
trace_me_silly(jiffies, count);
count++;
Finally, the Makefile must set the CFLAGS so that the include search path contains the local directory where the silly-trace.h file resides.
CFLAGS_sillymod-event.o = -I$(src)
One might believe the following would also work without modifying the Makefile, if the module resided in the kernel tree:
#define TRACE_INCLUDE_PATH ../../path/to/trace/header
Using a path name in TRACE_INCLUDE_PATH other than '.' runs the risk of the path containing a macro name. For example, if XFS defined TRACE_INCLUDE_PATH as ../../fs/xfs/linux-2.6, it would fail. That is because the Linux build #defines the name linux to nothing, which would make the resulting include path be ../../fs/xfs/-2.6/xfs_trace.h.
Now the trace event is available.
[mod] # insmod sillymod-event.ko
[mod] # cd /sys/kernel/debug/tracing
[tracing] # ls events
block ext4 header_event i915 jbd2 module sched skb timer
enable ftrace header_page irq kmem power silly syscalls workqueue
[tracing] # ls events/silly
enable filter me_silly
[tracing] # echo 1 > events/silly/me_silly/enable
[tracing] # cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
silly-thread-5377 [000] 1802.845581: me_silly: time=4304842209 count=10
silly-thread-5377 [000] 1803.845008: me_silly: time=4304843209 count=11
silly-thread-5377 [000] 1804.844451: me_silly: time=4304844209 count=12
silly-thread-5377 [000] 1805.843886: me_silly: time=4304845209 count=13
Once the define_trace.h file can safely locate the trace header, the module's tracepoints can be created. To understand why all this manipulation is needed, a look at how define_trace.h is implemented may clarify things a bit.
A look inside the magic of TRACE_EVENT()
For those who dare to jump into the mystical world of the C preprocessor, take a look at include/trace/ftrace.h. But be warned: what you find there may leave you a bit loony, or at least thinking that the ones who wrote that code were a bit loony (in which case, you may be right). The include/trace/define_trace.h file does some basic setup for the TRACE_EVENT() macro, but for a tracer to take advantage of it, the tracer must have a header file included by define_trace.h to do the real work (as Ftrace and perf do).
cpp tricks and treats
While I was working on my Master's, a professor showed me a trick with cpp that lets one map strings to enums using the same data:
#define DOGS { C(JACK_RUSSELL), C(BULL_TERRIER), C(ITALIAN_GREYHOUND) }
#undef C
#define C(a) ENUM_##a
enum dog_enums DOGS;
#undef C
#define C(a) #a
char *dog_strings[] = DOGS;
char *dog_to_string(enum dog_enums dog)
{
return dog_strings[dog];
}
The trick is that the macro DOGS contains a sub-macro C() that we can redefine to change the behavior of DOGS. This concept is key to how the TRACE_EVENT() macro works: all the sub-macros within TRACE_EVENT() can be redefined, causing TRACE_EVENT() to generate different code from the same information. Part 1 described the requirements for creating a tracepoint; one set of data (in the TRACE_EVENT() definition) must be able to do several things. Using this cpp trick, it is able to accomplish just that.
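Dropped into a complete translation unit, the trick works exactly as described; here it is again in compilable form, with a small dog_to_string() accessor added for checking (the accessor is my addition, not part of the original example):

```c
#include <assert.h>
#include <string.h>

/* One list of data, expanded twice: once to build an enum, once to
 * build the matching string table. */
#define DOGS { C(JACK_RUSSELL), C(BULL_TERRIER), C(ITALIAN_GREYHOUND) }

#undef C
#define C(a) ENUM_##a
enum dog_enums DOGS;                    /* ENUM_JACK_RUSSELL, ... */

#undef C
#define C(a) #a
static const char *dog_strings[] = DOGS;  /* "JACK_RUSSELL", ... */

/* Because both expansions come from the same list, enum values index
 * the string table directly. */
static const char *dog_to_string(enum dog_enums dog)
{
    return dog_strings[dog];
}
```

Adding a dog means touching only the DOGS list; the enum and the string table can never fall out of sync, which is exactly the property TRACE_EVENT() relies on at much larger scale.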
The tracepoint code created by Mathieu Desnoyers required a DECLARE_TRACE(name, proto, args) be defined in a header file, and in some C file DEFINE_TRACE(name) was used. TRACE_EVENT() now does both jobs. In include/linux/tracepoint.h:
#define TRACE_EVENT(name, proto, args, struct, assign, print) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
The PARAMS() macro lets proto and args contain commas without being mistaken for multiple parameters of DECLARE_TRACE(). Since tracepoint.h must be included in all trace headers, this makes the TRACE_EVENT() macro fulfill the first part of tracepoint creation. When a C file defines CREATE_TRACE_POINTS before including a trace header, define_trace.h becomes active and performs:
#undef TRACE_EVENT
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DEFINE_TRACE(name)
That is not enough, however, because define_trace.h is included after the TRACE_EVENT() macros are used. For this code to have an impact, the TRACE_EVENT() macros must be processed again. The define_trace.h file does some nasty C preprocessor obfuscation to be able to include the file that just included it:
#define TRACE_HEADER_MULTI_READ
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
The defining of TRACE_HEADER_MULTI_READ will let the trace header be read again (and this is why it is needed in the first place). The TRACE_INCLUDE(TRACE_INCLUDE_FILE) is more cpp macro trickery that includes the file that included define_trace.h. As explained in previous articles, this macro will use either TRACE_SYSTEM.h or TRACE_INCLUDE_FILE.h if that is defined, and will include the file from include/trace/events/ if TRACE_INCLUDE_PATH is not defined. I'll spare the reader the ugliness of the macros that accomplish this; the more masochistic reader should feel free to look at the include/trace/define_trace.h file directly. When the file is included again, the TRACE_EVENT() macro will be processed again, but with its new meaning.
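The core of that trickery is the two-level stringification idiom, which can be demonstrated in isolation (a sketch of the idea, not the literal define_trace.h macros):

```c
#include <assert.h>
#include <string.h>

/*
 * Stringifying a macro's *expansion* needs a two-level helper, because
 * '#' applies to the argument before it is expanded.
 */
#define __stringify_1(x) #x
#define __stringify(x)   __stringify_1(x)

#define TRACE_INCLUDE_FILE silly-trace

/* One level: the argument is stringified as-is, unexpanded. */
static const char *one_level  = __stringify_1(TRACE_INCLUDE_FILE.h);
/* Two levels: TRACE_INCLUDE_FILE is expanded first, then stringified. */
static const char *two_levels = __stringify(TRACE_INCLUDE_FILE.h);
```

define_trace.h builds the argument of its #include directive with this kind of expansion, which is how a computed file name like "silly-trace.h" ends up in an include directive.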
The above explains how tracepoints are created, but it only creates the tracepoint itself; it does nothing to add the tracepoint to a tracer's infrastructure. For Ftrace, this is where the ftrace.h file included by define_trace.h comes into play (warning: the ftrace.h file is even more bizarre than define_trace.h). The macros in ftrace.h create the files and directories found in tracing/events. ftrace.h uses the same tricks explained earlier: redefining the macros within the TRACE_EVENT() macro, as well as redefining the TRACE_EVENT() macro itself. How ftrace.h works is beyond the scope of this article, but feel free to read it directly if you don't have any allergies to backslashes.
Playing with trace events
If you change directories to the debugfs filesystem mount point (usually /sys/kernel/debug) and take a look inside tracing/events you will see all of the trace event systems defined in your kernel (i.e. the trace headers that defined TRACE_SYSTEM).
[tracing] # ls events
block  enable  ftrace  header_event  header_page  irq  kmem
module  power  sched  skb  syscalls  timer  workqueue
As mentioned in Part 2, the enable files are used to enable tracepoints. The enable file at the top of the events directory can enable or disable every event in the kernel, the enable file in one of the system directories can enable or disable all events within that system, and the enable file within a specific event's directory can enable or disable that single event.
Note that writing a '1' into any of the enable files will enable all events within that directory and below; writing a '0' will disable them all.
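Taking the sched system as one concrete example (sched_switch is one of its events; paths are relative to the tracing directory), the first command below enables every event in every system, the second every event in the sched system, and the third only the sched_switch event:

```sh
[tracing] # echo 1 > events/enable
[tracing] # echo 1 > events/sched/enable
[tracing] # echo 1 > events/sched/sched_switch/enable
```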
One nice feature about events is that they also show up in the Ftrace tracers. If an event is enabled while a tracer is running, those events will show up in the trace. Enabling events can make the function tracer even more informative:
[tracing] # echo 1 > events/sched/enable
[tracing] # echo function > current_tracer
[tracing] # head -15 trace
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
Xorg-1608 [001] 1695.236400: task_of <-update_curr
Xorg-1608 [001] 1695.236401: sched_stat_runtime: task: Xorg:1608 runtime: 402851 [ns], vruntime: 153144994503 [ns]
Xorg-1608 [001] 1695.236402: account_group_exec_runtime <-update_curr
Xorg-1608 [001] 1695.236402: list_add <-enqueue_entity
Xorg-1608 [001] 1695.236403: place_entity <-enqueue_entity
Xorg-1608 [001] 1695.236403: task_of <-enqueue_entity
Xorg-1608 [001] 1695.236404: sched_stat_sleep: task: gnome-terminal:1864 sleep: 639071 [ns]
Xorg-1608 [001] 1695.236405: __enqueue_entity <-enqueue_entity
Xorg-1608 [001] 1695.236406: hrtick_start_fair <-enqueue_task_fair
Xorg-1608 [001] 1695.236407: sched_wakeup: task gnome-terminal:1864 [120] success=1 [001]
Xorg-1608 [001] 1695.236408: check_preempt_curr <-try_to_wake_up
Combining the events with tricks from the function graph tracer, we can find interrupt latencies, and which interrupts are responsible for long latencies.
[tracing] # echo do_IRQ > set_ftrace_filter
[tracing] # echo 1 > events/irq/irq_handler_entry/enable
[tracing] # echo function_graph > current_tracer
[tracing] # cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) ==========> |
0) | do_IRQ() {
0) | /* irq_handler_entry: irq=30 handler=iwl3945 */
0) ==========> |
0) | do_IRQ() {
0) | /* irq_handler_entry: irq=30 handler=iwl3945 */
0) + 22.965 us | }
0) <========== |
0) ! 148.135 us | }
0) <========== |
0) ==========> |
0) | do_IRQ() {
0) | /* irq_handler_entry: irq=1 handler=i8042 */
0) + 45.347 us | }
0) <========== |
Writing do_IRQ into set_ftrace_filter makes the function tracer trace only the do_IRQ() function. Then the irq_handler_entry tracepoint is activated and the function_graph tracer is selected. Since the function graph tracer shows the time a function executed, we can see how long the interrupts ran. But the function graph tracer only shows that the do_IRQ() function ran, not which interrupt it serviced. Enabling the irq_handler_entry event reveals which interrupt was running. The above shows that my laptop's iwl3945 interrupt, which handles the wireless communication, caused a 148 microsecond latency!
Conclusion
Tracepoints are a very powerful tool, but to make them useful, they must be flexible and trivial to add. Adding TRACE_EVENT() macros is quite easy, and they are popping up all over the kernel: the 2.6.34-rc3 kernel has 341 TRACE_EVENT() macros defined as of this writing.
The code to implement the trace events uses lots of cpp tricks to accomplish its task, but the complexity of the implementation simplifies the usage of tracepoints. The rule of thumb in creating the TRACE_EVENT() macro was: make its use as simple as possible, even if that makes its implementation extremely complex. Sometimes, the easier something is to use, the more complexity is needed to create it. A developer does not need to know how the TRACE_EVENT() macro works; he or she needs only to know that the work has been done for them. Adding TRACE_EVENT()s is easy, and any developer can now take advantage of them.
Page editor: Jonathan Corbet
