
Kernel development

Brief items

Kernel release status

The 2.6.36 kernel is out, released on October 20 with only a small number of changes since the 2.6.36-rc8 prepatch. Headline features in this release include the AppArmor security module, the LIRC infrared controller driver subsystem, and the new Tile architecture. The fanotify notification mechanism for anti-malware applications was disabled at the last minute due to ABI concerns. See the KernelNewbies 2.6.36 page for lots of information.

Linus also noted that a two-week merge window, starting on October 21, would run into the Kernel Summit.

So I'm going to hope that we could perhaps even do the 2.6.37 -rc1 release and close the merge window the Sunday before KS opens. Since 2.6.36 was longer than usual (at least it felt that way), I wouldn't mind having a 2.6.37 that is shorter than usual. But holler if this really screws up any plans. Ten days instead of two weeks? Let's see if it's even reasonably realistic.

(The 2.6.36 development cycle was 80 days, incidentally; almost exactly average for the last few years).

Previously, 2.6.36-rc8 was released on October 15. "I really hate doing this, and I'd much rather just release 2.6.36, but I'm doing one last final -rc instead. There's too much noise here (and pending in emails) for me to be happy, and while I first thought to just delay a day or two, I'm now looking at next week instead, and thus the additional -rc."

Stable updates: there have been no stable updates released in the last week.

Comments (none posted)

Quotes of the week

I suspect there are a few places in MM/VFS/writeback which could/should be using something like this. Of course, if we do this then your nice little function will end up 250 lines long, utterly incomprehensible and full of subtle bugs. We like things to be that way.
-- Andrew Morton

In contrast, the "v2.6.25-rc1~1089^2~98" expression is actually well-defined. There is no ambiguity there, but it's also obviously not really all that human-readable.
-- Linus Torvalds

Frankly, I'm a lot more concerned about the locking being natural for the data structures we have and easily understood. We *are* fscking close to the complexity cliff, hopefully still on the right side of it.
-- Al Viro

Comments (none posted)

Dueling inode scalability patches

By Jonathan Corbet
October 20, 2010
Nick Piggin's VFS scalability patch set has been under development for well over a year. Some pieces were merged for 2.6.36, but the more complicated parts were deferred because Nick thought they needed more work and more testing. Then things went quiet; Nick changed jobs and went on vacation, so little work was done for some time. Eventually it became clear that Nick was unlikely to get the scalability work into shape for a 2.6.37 merge.

So Dave Chinner decided to jump in and work on these patches, and on the code breaking up the inode lock in particular. His first patch set was posted in late September, with a number of revisions following since. Dave worked on splitting the patch series into smaller, more reviewable chunks; he also took out some of the (to him) scarier changes. Subsequent revisions brought larger changes, to the point that the version 5 posting reads:

None of the patches are unchanged, and several of them are new or completely rewritten, so any previous testing is completely invalidated. I have not tried to optimise locking by using trylock loops - anywhere that requires out-of-order locking drops locks and regains the locks needed for the next operation. This approach simplified the code and led to several improvements in the patch series (e.g. moving inode->i_lock inside writeback_single_inode(), and the dispose_one_inode factoring) that would have gone unnoticed if I'd gone down the same trylock loop path that Nick used.

According to Dave, this patch set helps with the scalability problems he has been seeing, and other reviewers seem to think that the patch set is starting to look fairly good.

But then Nick returned. While he welcomed the new interest in scalability work, he did not take long to indicate that he was not pleased with the direction in which Dave had taken his patches. He has posted a 35-part patch series which he hopes to merge; the patch posting also details why he doesn't like Dave's alternative approach. The ensuing discussion has been a bit rough in spots, though it has remained mostly focused on the technical issues.

What it has not done, though, is to come up with any sort of conclusion. There are two patch sets out there; both deal with the intersection of the virtual filesystem layer and the memory management code. Much of the contention seems to be over whether "VFS people" or "memory management people" should have the ultimate say in how things are done. Given the difficult nature of both patch sets and the imminent opening of the 2.6.37 merge window, it seems fairly safe to say that neither will be merged unless Linus makes an executive decision. Pushing back this code to 2.6.38 will provide an opportunity for the patches to be discussed at length, and, possibly, for the upcoming Kernel Summit to consider them as well.

Comments (4 posted)

IMA memory hog

By Jonathan Corbet
October 20, 2010
Dave Chinner recently noticed a problem on one of the kernel.org systems: the slab cache was using well over 2GB of memory, mainly on radix tree nodes. Intrigued, he looked further into the problem. It came down to the integrity measurement architecture (IMA) security code, which uses the hardware TPM to help ensure that files on the system have not been tampered with. IMA, it seems, was using a radix tree to store integrity information, indexed by inode address. Radix trees perform poorly with sparse, unclustered keys, so IMA's usage was causing the creation of a separate node for each inode in the system. That added up to a lot of memory.

A number of questions came after this revelation, including:

  1. Why is IMA using such an inappropriate data structure?
  2. Why is it keeping all this information around even though it was disabled on the system in question?
  3. Why was IMA configured into the kernel in the first place?

The answer to the first question seems to be that the IMA developers, as part of the process of getting the code into the mainline, were not allowed to expand the inode structure at all. So they created a separate tree for per-inode information; it just happens that they chose the wrong type of tree and never noticed how poorly it performed.

Question #2 is answered like this: the IMA code needs to keep track of which files are opened for write access at any time. There is no point in measuring the integrity of files (checksumming them, essentially) when they can be changed at any time. Without tracking the state of all files all the time, IMA can never know which files are opened for write access when it first starts up. The only way to be sure, it seems, is to track all files starting at boot time just in case somebody tries to turn IMA on at some point.

As for #3: kernel.org was running a Fedora kernel, and the Fedora folks turned on the feature because it looked like it might be useful to some people. Nobody expected that it would have such an impact on systems where it was not turned on. Some participants in the discussion have given the Fedora kernel maintainers some grief for not having audited the code before enabling it, but auditing everything in the kernel to that level is a bit larger task than Fedora can really be expected to take on.

Eric Paris has started work on slimming IMA down; his patches work by moving the "open for write" counts into the inode structure itself, eliminating the need to allocate the separate IMA structures most of the time. IMA is also shifted over to a red-black tree when it does need to track those structures. This work eliminates almost all of the memory waste, but at the cost of growing the inode structure slightly. That does not sit well with everybody, especially, it seems, those developers who feel that IMA should not exist in the first place. But it's a clear step in the right direction, so one should expect something along these lines to be merged for 2.6.37.

Comments (13 posted)

Kernel development news

Shielding driver authors from locking

By Jonathan Corbet
October 20, 2010
Much of the time, patches can be developed against the mainline kernel and submitted for the next merge window without trouble. At other times, though, the mainline is far removed from the environment a patch will have to fit into at merge time. Your editor, who has been trying the (considerable) patience of the Video4Linux maintainer in an attempt to get a driver merged for 2.6.37 at the last minute, has encountered this fact of life the hard way: he submitted a driver which did not even come close to compiling inside the 2.6.37 V4L2 tree. Things have changed considerably there. This article will look at one of those changes with an eye toward the kind of design decisions that are being made in that part of the kernel.

The removal of the big kernel lock (BKL) has been documented here over the years. One of the biggest holdouts at this point is the V4L2 subsystem; almost everything that happens in a V4L2 driver is the result of an ioctl() call, and those calls have always been protected by the BKL. Removing BKL protection means auditing the drivers - and there are a lot of them - and, in many cases, providing a replacement locking scheme. It seems that a lot of V4L2 drivers - especially the older ones - do not exhibit the sort of attention to locking that one would expect from code submitted today.

The approach to this problem chosen by the V4L2 developers has proved to be mildly controversial within the group: they have tried to make it possible for driver authors to continue to avoid paying attention to locking. To that end, the video_device structure has gained a new lock field; it is a pointer to a mutex. If that field is non-null, the V4L2 core will acquire the mutex before calling any of the vast number of driver callbacks. So all driver operations are inherently serialized and driver authors need not worry about things. At least, they need not worry in the absence of other types of concurrency - like interrupts.
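
For a driver author, opting in amounts to pointing that field at a mutex before registering the device. Here is a minimal sketch of what that might look like; the names are illustrative and not taken from any real driver:

    /* Sketch: opt into V4L2 core serialization via the new lock field. */
    static DEFINE_MUTEX(mydev_lock);

    static int mydev_video_register(struct video_device *vdev)
    {
        vdev->lock = &mydev_lock;  /* core acquires this around callbacks */
        return video_register_device(vdev, VFL_TYPE_GRABBER, -1);
    }

Leaving the lock field null, instead, keeps the old behavior: the core takes no lock, and the driver is on its own.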

Hans Verkuil, the developer behind many recent V4L2 improvements, clearly feels that it's better to handle the locking centrally:

If he wants to punish himself and do all the locking manually (and prove that it is correct), then by all means, do so. If you want to use the core locking support and so simplify your driver and allow your brain to concentrate on getting the hardware to work, rather than trying to get the locking right, then that's fine as well. As a code reviewer I'd definitely prefer the latter approach as it makes my life much easier.

On the other side, developers like Laurent Pinchart argue that trying to insulate developers from locking is not the right approach:

Developers must not get told to be stupid and don't care about locks just because other developers got it wrong in the past. If people don't get locking right we need to educate them, not encourage them to understand even less of it.

Central locking at the V4L2 level leads to some interesting problems as well. The V4L2 user-space streaming API offers a pair of ioctl() calls for the management of frame buffers: VIDIOC_DQBUF to obtain a buffer from the driver, and VIDIOC_QBUF to give a buffer back. If there are no buffers available at the time of the call, VIDIOC_DQBUF will normally block until a buffer becomes available. When this call is protected by the BKL, blocking will automatically release the lock and enable other V4L2 operations to continue. That behavior is important: one of those other operations might be a VIDIOC_QBUF call providing the buffer needed to allow the VIDIOC_DQBUF call to proceed; if VIDIOC_DQBUF fails to release the lock, things could deadlock.

Drivers which handle their own locking will naturally release locks before blocking in a situation like this. Either the driver author thinks of it at the outset, or the need is made clear by disgruntled users later on. If the driver author is not even aware that the lock exists, though, it's less likely that the lock will be released at a time like this. That could lead to surprises in drivers which do their own I/O buffer management. If, however, the driver uses videobuf, this problem will be handled transparently with some scary-looking code in videobuf_waiton():

    is_ext_locked = q->ext_lock && mutex_is_locked(q->ext_lock);

    /* Release vdev lock to prevent this wait from blocking outside access to
       the device. */
    if (is_ext_locked)
        mutex_unlock(q->ext_lock);

With enough due care, one assumes that it's possible to be sure that unlocking a mutex acquired elsewhere is a reasonable thing to do here. But one must hope that the driver author - who is not concerned with locking, after all - has left things in a consistent state before calling videobuf_waiton(). Otherwise those disgruntled users will eventually make a return.

Locking complexity in the kernel is growing, and that makes it harder for developers to come up to speed. Complex locking can be an especially difficult challenge for somebody writing this type of driver; these authors tend not to be full-time kernel developers. So the appeal of taking care of locking for them and letting them concentrate on getting their hardware to do reasonable things is clear, especially if it makes the code review process easier as well. Such efforts may ultimately be successful, but there can be no doubt that they will run into disagreement from those who feel that kernel developers should either understand what is going on or switch to Java development. Expect this sort of discussion to pop up in a number of contexts as core developers try to make it easier for others to write correct code.

Comments (2 posted)

A netlink-based user-space crypto API

By Jake Edge
October 20, 2010

User-space access to the kernel cryptography subsystem has reared its head several times of late. We looked at one proposal back in August that had a /dev/crypto interface patterned after similar functionality in OpenBSD. There is another related effort, known as the NCR API, and crypto API maintainer Herbert Xu has recently posted an RFC for yet another. But giving user space the ability to request that the kernel perform its computation-intensive crypto operations is not uncontroversial.

As noted back in August, some kernel hackers are skeptical that there would be any performance gains from moving user-space crypto into the kernel. But there are a number of systems, especially embedded systems, with dedicated cryptographic hardware. Allowing user space to access that hardware will likely result in performance gains; in fact, 50-100x performance improvements have been reported.

Another problem with both the /dev/crypto and NCR APIs (collectively known as the cryptodev-linux modules) is that they add an enormous amount of code to the kernel to support crypto algorithms beyond those that are already available. The two modules have adapted user-space libraries for crypto and multi-precision integers and pulled them into the kernel; that code is needed to satisfy some government crypto standards and certifications which require a separation between user space and crypto processing. So the cryptodev-linux modules are really trying to solve two separate (or potentially separate) problems: user-space access to crypto hardware acceleration and security standards compliance.

When Xu first put out an RFC on his idea for the API (without any accompanying code) back in September, Christoph Hellwig had a rather strongly worded reaction:

doing crypto in kernel for userspace consumers [is] simply insane. It's computational intensive code which has no business in kernel space unless absolutely required (e.g. for kernel consumers). In addition to that adding the context switch overhead and address space transitions is god [awful] too.

Xu more or less agrees with Hellwig, but sees his API as a way to provide access to the hardware crypto devices. Because Xu's API is based on netlink sockets (as opposed to the ioctl()-based or entirely new APIs that the cryptodev-linux modules introduce), he is clearly hoping that it will provide a way forward without requiring such large changes to the kernel:

FWIW I don't care about user-space using kernel software crypto at all. It's the security people that do.

The purpose of the user-space API is to export the hardware crypto devices to user-space. This means PCI devices mostly, as things like aesni-intel [Intel AES instructions] can already be used without kernel help.

Now as a side-effect if this means that we can shut the security people up about adding another interface then all the better. But I will certainly not go out of the way to add more crap to the kernel for that purpose.

The netlink-based interface uses a new AF_ALG address family that gets passed to the initial socket() call. There is also a new struct sockaddr_alg that contains information about what type of algorithm (e.g. "hash" or "skcipher") is to be used as well as the specific algorithm name (e.g. "sha1" or "cbc(aes)") that is being requested. That structure is then passed in the bind() call on the socket.
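
From user space, setting up an algorithm socket would look something like the sketch below. The header name and structure layout here follow the patch posting and are assumptions; they could change before any merge:

    #include <string.h>
    #include <sys/socket.h>
    #include <linux/if_alg.h>   /* AF_ALG, struct sockaddr_alg (patch set) */

    /* Create and bind an algorithm socket; error handling omitted. */
    static int alg_socket(const char *type, const char *name)
    {
        struct sockaddr_alg sa;
        int fd = socket(AF_ALG, SOCK_SEQPACKET, 0);

        memset(&sa, 0, sizeof(sa));
        sa.salg_family = AF_ALG;
        strcpy((char *)sa.salg_type, type);   /* e.g. "hash" */
        strcpy((char *)sa.salg_name, name);   /* e.g. "sha1" */
        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        return fd;
    }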

For things like hashing, where there is little or no additional information needed, an accept() is done on the socket, which yields an operation file descriptor. The data to be hashed is written to that descriptor and, when there is no more data to be hashed, the appropriate number of bytes (20 for sha1) are then read from the descriptor.
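
In code, hashing a buffer with sha1 might thus look like this (a sketch building on the hypothetical alg_socket() helper above):

    const char data[] = "hello, world";    /* example input */
    unsigned char digest[20];              /* sha1 produces 20 bytes */
    int tfm = alg_socket("hash", "sha1");
    int op = accept(tfm, NULL, 0);         /* operation descriptor */

    write(op, data, sizeof(data) - 1);     /* feed in the data */
    read(op, digest, sizeof(digest));      /* read back the hash */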

It is a bit more complicated for ciphers. Before accepting the connection on the socket, a key needs to be established for a symmetric key cipher. That is done with a setsockopt() call using the new SOL_ALG level and ALG_SET_KEY option name and passing the key data and its length. But there are additional parameters that need to be set up for ciphers, and those are done using sendmsg().

A cipher will need to know which direction it is operating in (i.e. encrypting or decrypting) and may need an initialization vector. Those are specified with the ALG_SET_OP and ALG_SET_IV messages. Once the accept() has been done, those messages are sent to the operational descriptor and the cipher is ready for use. Data can be sent as messages or written to the operational descriptor, and the resulting data can then be read from that descriptor.
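
Putting those steps together for cbc(aes) encryption might look like the following fragment. The SOL_ALG and ALG_* constants and the control-message layout are assumed from the description above and could well differ in the final API:

    unsigned char key[16] = { 0 };   /* example AES-128 key */
    int tfm = alg_socket("skcipher", "cbc(aes)");
    setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, sizeof(key));

    int op = accept(tfm, NULL, 0);

    /* Send the operation type and IV as control messages. */
    char cbuf[CMSG_SPACE(sizeof(__u32)) + CMSG_SPACE(4 + 16)] = { 0 };
    struct msghdr msg = {
        .msg_control = cbuf,
        .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
    *(__u32 *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;

    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    cmsg->cmsg_len = CMSG_LEN(4 + 16);   /* 32-bit IV length + 16-byte IV */
    /* copy the IV length and IV bytes into CMSG_DATA(cmsg) here */

    sendmsg(op, &msg, 0);   /* data to encrypt follows on this descriptor */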

There is an additional wrinkle for the "authenticated encryption with associated data" (AEAD) block cipher mode, which can include authentication information (i.e. message authentication code or MAC) into the ciphertext stream. Because of that, AEAD requires two data streams, one containing the data itself and another with the associated authentication data (the MAC). This is handled in Xu's API by doing two accept() calls, the first for the operational descriptor, and the second for the associated data. If the cipher is operating in encryption mode, both descriptors will be written to, while the encrypted data is read from the operational descriptor. For decryption, the ciphertext is written to the operational descriptor, while the plaintext and authentication data are read from the two descriptors.
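
Following that description, the AEAD setup would amount to something like this sketch (the algorithm name is purely illustrative):

    int tfm = alg_socket("aead", "gcm(aes)");
    int op  = accept(tfm, NULL, 0);   /* data stream */
    int aad = accept(tfm, NULL, 0);   /* associated authentication data */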

There hasn't been much discussion, yet, of the actual code posting, but Xu's September posting elicited a number of complaints about performance, most from proponents of the cryptodev-linux modules. But it would seem that there is some real resistance to adding completely new APIs (as NCR does) or to adding a complicated ioctl()-based API (as /dev/crypto does). Now there are three competing solutions available, but it isn't at all clear that any interface to the kernel crypto subsystem will be acceptable to the kernel community at large. We will have to wait to see how it all plays out.

Comments (21 posted)

trace-cmd: A front-end for Ftrace

October 20, 2010

This article was contributed by Steven Rostedt

Previous LWN articles have explained the basic way to use Ftrace directly through the debugfs filesystem (part 1 and part 2). While the debugfs interface is rather simple, it can also be awkward to work with. It is especially convenient, though, for embedded platforms where it may be difficult to build and install special user tools on the device. On the desktop, it may be more convenient to have a command-line tool that works with Ftrace instead of echoing various commands into strange files and reading the result from another file. This tool does exist, and it is called trace-cmd.

trace-cmd is a user-space front-end command-line tool for Ftrace. You can download it from the git repository at git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git. Some distributions ship it as a package, and some that do not currently will soon. Full man pages are included, and are installed with make install_doc. This article will not go over the information that is already in the man pages; instead, it will explain a little about how trace-cmd works and how to use it.

How it works

A simple use case of trace-cmd is to record a trace and then report it.

    # trace-cmd record -e ext4 ls
    [...]
    # trace-cmd report
    version = 6
    CPU 1 is empty
    cpus=2
           trace-cmd-7374  [000]  1062.484227: ext4_request_inode:   \
                                  dev 253:2 dir 40801 mode 33188
           trace-cmd-7374  [000]  1062.484309: ext4_allocate_inode:  \
                                  dev 253:2 ino 10454 dir 40801 mode 33188

The above example enables the ext4 tracepoints for Ftrace, runs the ls command and records the Ftrace data into a file named trace.dat. The report command reads the trace.dat file and outputs the tracing data to standard output. Some metadata is also shown before the trace output is displayed: the version of the file, any empty CPU buffers, and the number of CPUs that were recorded.

By default, the record and report commands write to and read from the trace.dat file. You can use the -o or -i options to pick a different file to write to or read from, respectively, but this article will use the default name when referencing the data file created by trace-cmd.

When recording a trace, trace-cmd will fork off a process for each CPU on the system. Each of these processes will open the file in debugfs that represents the CPU the process is dedicated to record from. The process recording CPU0 will open /sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw, the process recording CPU1 will open a similar file in the cpu1 directory, and so on. The trace_pipe_raw file is a mapping directly to the Ftrace internal buffer for each CPU. Each of these CPU processes will read these files using splice to record into a temporary file during the trace. At the end of the record, the main process will concatenate the temporary files into a single trace.dat file.
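
The heart of each per-CPU recording process is, conceptually, a splice loop. A simplified sketch (not trace-cmd's actual code; error handling omitted) might read:

    /* Move raw ring-buffer pages to a temporary file without copying
       them through user space. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void record_cpu(int raw_fd, int tmp_fd, size_t page_size)
    {
        int pfd[2];

        pipe(pfd);
        for (;;) {
            /* ring buffer -> pipe */
            if (splice(raw_fd, NULL, pfd[1], NULL, page_size,
                       SPLICE_F_MOVE) <= 0)
                break;
            /* pipe -> temporary file */
            splice(pfd[0], NULL, tmp_fd, NULL, page_size, SPLICE_F_MOVE);
        }
    }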

There's no need to mount the debugfs filesystem manually before using the tool, as trace-cmd will check whether and where it is mounted. If debugfs is not mounted, trace-cmd will automatically mount it at /sys/kernel/debug.

Recording a trace

As noted above, trace-cmd forks off a process for each CPU, dedicated to recording from that CPU, but, in order to prevent scheduling interference, the threads are not pinned to a CPU. Pinning the threads to the CPUs being traced may result in better cache usage, so a future version of trace-cmd may add an option to do that. The Ftrace ring buffers are allocated one per CPU, and each thread reads from a particular CPU's ring buffer. It is important to mention this because these threads can show up in the trace.

A common request is for trace-cmd to ignore events caused by trace-cmd itself. But it is not wise to ignore these events, because they show where the tracer may have made an impact on what it is tracing. The events can be filtered out after the trace, but they are good to keep around in the trace.dat file: if some delay was caused by the tracing itself, the events may show that.

As trace-cmd is a front end to Ftrace, the arguments of record reflect some of the features of Ftrace. The -e option enables an event. The argument following the -e can be an event name, event subsystem name, or the special name all. The all name will make trace-cmd enable all events that the system supports. If a subsystem name is specified, then all events under that subsystem will be enabled during the trace. For example, specifying sched will enable all the events within the sched subsystem. To enable a single event, the event name can be used by itself, or the subsystem:event format can be used. If the subsystem name is left off, then all events with the given name will be enabled. Currently this would not be an issue because, as of this writing, all events have unique names. If more than one event or subsystem is to be traced, then multiple -e options may be specified.

Ftrace also has special plugin tracers that do not simply trace specific events. These tracers include the function, function graph, and latency tracers. Through the debugfs tracing directory, these plugins are enabled by echoing the type of tracer into the current_tracer file. With trace-cmd record, they are enabled with the -p option. Using the tracer plugin name as the argument for -p enables that plugin. You can still specify one or more events with a plugin, but you may only specify a single plugin, or no plugin at all.

When the record is finished, trace-cmd examines the kernel buffers and outputs some statistics, which may be a little confusing. Here's an example:

    Kernel buffer statistics:
      Note: "entries" are the entries left in the kernel ring buffer and are not
            recorded in the trace data. They should all be zero.

    CPU: 0
    entries: 0
    overrun: 0
    commit overrun: 0

    CPU: 1
    [...]

As the output explains, the entries field is not the number of entries that were traced, but the number of entries left in the kernel buffer. If entries were dropped because trace-cmd could not read the buffer faster than it was being written to, and the writer overflowed the buffer, then either the overrun or commit overrun values would be something other than zero. The overrun value is the number of entries that were dropped due to the buffer filling up, and the writer deleting the older entries.

The commit overrun is much less likely to occur. Writing to the buffer is a three-step process: first the writer reserves space in the ring buffer, then it writes the data, then it commits the change. Writing to the buffer does not disable interrupts. If a write is preempted by an interrupt, and the interrupt performs enough tracing to fill the buffer all the way around to the space reserved by the preempted write, then events must be dropped, because the reserved space cannot be touched until it is committed. Those dropped events are the commit overrun. This is highly unlikely to happen unless you have a small buffer.
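
In kernel terms, each event write follows the reserve/write/commit pattern of the ring-buffer API. A rough sketch of what a producer does (simplified, not the actual Ftrace code):

    /* The three-step write to the Ftrace ring buffer. */
    struct ring_buffer_event *event;

    event = ring_buffer_lock_reserve(buffer, length);        /* 1: reserve */
    if (event) {
        memcpy(ring_buffer_event_data(event), data, length); /* 2: write  */
        ring_buffer_unlock_commit(buffer, event);            /* 3: commit */
    }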

Filtering while recording

As explained in "Secrets of the Ftrace function tracer", Ftrace allows you to filter what functions will be traced by the function tracer. Also, you can graph a single function, or a select set of functions, with the function graph tracer. These features are not lost when using trace-cmd.

    # trace-cmd record -p function -l 'sched_*' -n 'sched_slice'

(Note that the above does not specify a command to execute, so trace-cmd will record until Ctrl-C is hit.)

The -l option is the same as echoing its argument into set_ftrace_filter, and the -n option is the same as echoing its argument into set_ftrace_notrace. You can have more than one -l or -n option on the command line. trace-cmd will simply write all the arguments into the appropriate file. Note, those options are only useful with the function and function_graph plugins. The -g option (not shown) will pass its argument into the set_graph_function file.

Here is a nice trick to see how long interrupts take in the kernel:

    # trace-cmd record -p function_graph -l do_IRQ -e irq_handler_entry sleep 10
    # trace-cmd report
    version = 6
    cpus=2
            Xorg-4262  [001] 212767.154882: funcgraph_entry:                   |  do_IRQ() {
            Xorg-4262  [001] 212767.154887: irq_handler_entry:    irq=21 name=sata_nv
            Xorg-4262  [001] 212767.154952: funcgraph_exit:       + 71.706 us  |  }
            Xorg-4262  [001] 212767.156948: funcgraph_entry:                   |  do_IRQ() {
            Xorg-4262  [001] 212767.156952: irq_handler_entry:    irq=22 name=ehci_hcd:usb1
            Xorg-4262  [001] 212767.156955: irq_handler_entry:    irq=22 name=NVidia CK804
            Xorg-4262  [001] 212767.156985: funcgraph_exit:       + 37.795 us  |  }

The events can also be filtered. To see which fields can be used to filter a specific event, look at the event's format file under /sys/kernel/debug/tracing/events/<subsystem>/<event>/format, or run

    # trace-cmd report --events | less

on a trace.dat file that was created by the local system. The --events argument lists the event formats of all events that were available on the system that created the tracing file:

    $ trace-cmd report --events
    [...]
    name: kmalloc_node
    ID: 338
    format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;
        field:int common_lock_depth;    offset:8;       size:4; signed:1;

        field:unsigned long call_site;  offset:16;      size:8; signed:0;
        field:const void * ptr; offset:24;      size:8; signed:0;
        field:size_t bytes_req; offset:32;      size:8; signed:0;
    [...]

Using the kmalloc_node event, we can filter on all requests that were greater than 1000 bytes:

    # trace-cmd record -e kmalloc_node -f 'bytes_req > 1000'

The -f option specifies a filter for the event (specified in a -e option) preceding it.

Reading the trace

As the initial example showed, reading the trace is a simple matter of running the report command. By default, it reads the trace.dat file; the -i option specifies a different file to read, or the input file may simply be given as the last argument.

    $ trace-cmd report -i mytrace.dat
    $ trace-cmd report mytrace.dat

The above two examples give the same result. The report command is not a privileged operation and only requires read permission on the data file it is reading.

    $ trace-cmd report
    version = 6
    cpus=2
      trace-cmd-8412  [000] 13140.422056: sched_switch:         8412:120:S ==> 0:120: swapper
        <idle>-0     [000] 13140.422068: power_start:          type=1 state=2
        <idle>-0     [000] 13140.422174: irq_handler_entry:    irq=0 handler=timer
        <idle>-0     [000] 13140.422180: irq_handler_exit:     irq=0 return=handled

The output is similar to what you would see in /sys/kernel/debug/tracing/trace.

Having the trace data in a file gives some advantages over reading from a debugfs file. We can now easily filter what events we want to see, or pick a specific CPU to output.

You can do extensive filtering on events and what CPUs you want to focus on:

    $ trace-cmd report --cpu 0 -F 'sched_wakeup: success == 1'
    version = 6
    cpus=2
          ls-8414  [000] 13140.423106: sched_wakeup: 8414:?:? + 8412:120:? trace-cmd Success
          ls-8414  [000] 13140.424179: sched_wakeup: 8414:?:? + 1155:120:? kondemand/0 Success
          ls-8414  [000] 13140.426925: sched_wakeup: 8414:?:? + 704:120:? phy0 Success
          ls-8414  [000] 13140.431172: sched_wakeup: 8414:?:? + 9:120:? events/0 Success
      events/0-9   [000] 13140.431182: sched_wakeup: 9:?:? + 11734:120:? sshd Success
          ls-8414  [000] 13140.434173: sched_wakeup: 8414:?:? + 1155:120:? kondemand/0 Success

The --cpu 0 option limits the output to events that occurred on CPU 0. The -F option limits the output further, to sched_wakeup events whose success field is equal to 1. For more information about filtering, consult the trace-cmd-report(1) man page.

Tracing over the network

There may be situations where you want to trace an embedded device or some machine with very little disk space. Perhaps another machine has lots of disk space and you want to record the trace there, or maybe you are tracing the filesystem itself and want minimal interference with that code. This is where tracing over the network comes in handy.

To set up a trace server, simply run something like the following command:

    $ trace-cmd listen -p 12345 -D -d /images/tracing/ -l /images/tracing/logfile

The only required option in the above is the -p option, which tells trace-cmd what port to listen on. The -D puts trace-cmd into daemon mode, while the -d /images/tracing/ tells trace-cmd to output the trace files from the connections it receives into the /images/tracing/ directory. Obviously, any directory you have write permission for can be used here. The -l /images/tracing/logfile tells trace-cmd to not write messages to standard output, but to the /images/tracing/logfile file instead. The listen command is not privileged, and can be run by any user.

On the embedded device (or whatever client is used), instead of specifying an output file, trace-cmd record is given the -N option followed by a host:port argument.

    # trace-cmd record -N gandalf:12345 -e sched_switch -e sched_wakeup -e irq hackbench 50

Back on the host gandalf, a file is created in the /images/tracing/ directory with a name of the form "trace.<client-host>:<client-port>.dat".

    $ ls /images/tracing/
    logfile  trace.frodo:35287.dat

    $ cat /images/tracing/logfile
    [29078]Connected with frodo:35287
    [29078]cpus=2
    [29078]pagesize=4096

    $ trace-cmd report /images/tracing/trace.frodo\:35287.dat
    version = 6
    cpus=2
        <...>-17215 [000] 19858.840695: sched_switch:      17215:120:S ==> 0:120: swapper
       <idle>-0     [000] 19858.840934: irq_handler_entry: irq=30 handler=iwl3945
       <idle>-0     [000] 19858.840959: irq_handler_exit:  irq=30 return=handled
       <idle>-0     [000] 19858.840960: softirq_entry:     softirq=6 action=TASKLET
       <idle>-0     [000] 19858.841005: softirq_exit:      softirq=6 action=TASKLET
      [...]

trace-cmd is versatile enough to handle heterogeneous systems. All the information needed to create and read the trace.dat file is passed from the client to the host. The host could be a 64-bit x86 and the client a 32-bit PowerPC and the above would not change. A big-endian machine can read a little-endian file and vice versa. The compatibility to read various system types is not limited to network tracing. If a trace is performed on a big-endian 32-bit system, the resulting file can still be read from a little-endian 64-bit system.

For the lazy Ftrace user

If using the internal kernel Ftrace buffer is sufficient and there is no need to record the trace, trace-cmd can still be useful. Pretty much all of the record options can be used with the trace-cmd start command. start does not create a trace.dat file, but simply starts Ftrace. Similarly, the stop command is just a convenient way to do:

    $ echo 0 > /sys/kernel/debug/tracing/tracing_on

For example:

    # trace-cmd start -p function_graph -g ip_rcv

    # sleep 10

    # trace-cmd stop

    # cat /sys/kernel/debug/tracing/trace
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     1)               |  ip_rcv() {
     1)               |    T.769() {
     1)               |      nf_hook_slow() {
     1)   0.497 us    |        add_preempt_count();
     1)               |        nf_iterate() {
     1)   0.458 us    |          ip_sabotage_in();

If there is a case where the trace needs to be converted into a trace.dat file, the extract command can be used. After the above trace was done:

    # trace-cmd extract -o kernel-buf.dat

    # trace-cmd report kernel-buf.dat
    version = 6
    cpus=2
       <idle>-0  [001] 214146.661193: funcgraph_entry:          |  ip_rcv() {
       <idle>-0  [001] 214146.661196: funcgraph_entry:          |    T.769() {
       <idle>-0  [001] 214146.661197: funcgraph_entry:          |      nf_hook_slow() {
       <idle>-0  [001] 214146.661197: funcgraph_entry: 0.497 us |        add_preempt_count();
       <idle>-0  [001] 214146.661198: funcgraph_entry:          |        nf_iterate() {
       <idle>-0  [001] 214146.661199: funcgraph_entry: 0.458 us |          ip_sabotage_in();

To disable all tracing, which will ensure that no overhead is left from using the function tracers or events, the reset command can be used. It will disable all of Ftrace and bring the system back to full performance.

    # trace-cmd reset

What's next?

This article has explained some of the use cases for trace-cmd. There is still more that it can do, but there was not space to cover it all here. This article and the trace-cmd man pages should be enough to get you on your way to using trace-cmd in a productive manner.

So what's next? This article has shown how trace-cmd serves as a front-end tool for Ftrace; the next article will present kernelshark, a graphical front end to trace-cmd. Stay tuned.

Comments (3 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.36
Linus Torvalds: 2.6.36-rc8

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Michael Kerrisk: man-pages-3.29 is released

Memory management

Networking

KOVACS Krisztian: tproxy: add IPv6 support

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds