Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc4, released on January 12. There have been 366 non-merge commits to Linus's tree since -rc4, mostly bug fixes. One might expect an -rc5 soon, probably just after publication.

Stable updates: Three stable kernels were released on January 18: 2.6.32.4, 2.6.31.12, and 2.6.27.44. Each has multiple fixes (2.6.32.4 includes 52 patches), including some security issues. Greg Kroah-Hartman also released a status report on the stable trees, which notes that there will almost certainly be no more 2.6.31 updates, that 2.6.27 is only viable for another 6-8 months, and that he will be maintaining 2.6.32 as a "long-term" stable release.

Quotes of the week

Since 32 bits means that any machine with 1 GB or more means HIGHMEM, the number of non-embedded machines that should run 32-bit kernels today is functionally the null set. Unfortunately Linux distros have not properly promoted 64-bit kernels for 32-bit distros; although pure 64 bits is better, it would be a *helluva* lot better if people stuck on 32 bits for compatibility reasons had a saner alternative.
-- H. Peter Anvin

Can people now accept that the reason we have rather more complex models for security policy is because generally speaking these "oh so simple" little magic switches don't actually work or solve any real world problems.
-- Alan Cox

devtmpfs to lose EXPERIMENTAL tag

By Jake Edge
January 20, 2010
In a fairly short period of time, devtmpfs has gone from a controversial proposal in May to being merged into the mainline for 2.6.32. It was merged as an experimental feature, though, which is something the devtmpfs developers would like to see change. Kay Sievers posted a patch that would remove the experimental designation as well as make it the default: "All major distros enable devtmpfs on recent systems, so remove the EXPERIMENTAL flag, and enable it by default to reflect how it is used today."

Comments on the patch indicate that there is little complaint about removing the experimental designation, but making it the default was not particularly popular. Arjan van de Ven complained that enabling devtmpfs by default violated a kernel convention: "we use 'default y' only for those things that used to be on, and are now turned into a config option". Sievers, at least, had never heard of that convention, but is willing to follow it—if it exists.

Alan Cox pointed out that existing distributions do not use devtmpfs, only those in development, but Sievers sees no harm for older systems:

And it should not harm any old system if it is enabled. If initramfs is used it's completely invisible, if a custom kernel with kernel-mounted rootfs is used, the udev boot script usually over-mounts the devtmpfs at /dev with an empty tmpfs, like it has always done it before.

It is unclear whether this change is meant for 2.6.33 or is just being floated early for the 2.6.34 merge window, but the removal of EXPERIMENTAL seems to have no real opposition. Whether it becomes the default or not looks to be up in the air; either way, devtmpfs has quickly cemented its place in the mainline kernel.

Kernel development news

LCA: Why filesystems are hard

By Jonathan Corbet
January 20, 2010
The ext4 filesystem is reaching the culmination of a long development process. It has been marked as stable in the mainline kernel for over a year, distributions are installing it by default, and it may start to see more widespread enterprise-level deployment toward the end of this year. At linux.conf.au 2010, ext4 hacker Ted Ts'o talked about the process of stabilizing ext4 and why filesystems take a long time to become ready for production use.

In general, Ted says, people tend to be overly optimistic about how quickly a filesystem can stabilize. It is not a fast process, for a number of fairly clear reasons. Certain aspects of software make it hard to test and debug: premature optimization ("the root of all evil"), the presence of large amounts of internal state, and an environment involving a lot of parallelism. Any of these features will make code more difficult to understand and complicate the testing environment.

Filesystems suffer from all of these problems. Users demand that a general-purpose filesystem be heavily optimized for a wide variety of workloads; this optimization work must be done at all levels of the code. The entire job of a filesystem is to store and manage internal state. Among other things, that makes it hard for developers to reproduce problems: specific bugs are quite likely to be associated with the state of a specific filesystem, which a user may be unwilling to share, even setting aside the practical difficulty of making hundreds of gigabytes of data available to developers. And parallelism is a core part of the environment for any general-purpose filesystem; there will always be many things going on at once. All of these factors combine to make filesystems difficult to stabilize.

What it comes down to, Ted says, is that filesystems, like fine wines, have to age for a fair period of time before they are ready. But there's an associated problem: the workload-dependent nature of many filesystem problems guarantees that filesystem developers cannot, by themselves, find all of the bugs in their code. There will always be a need for users to test the code and report their experiences. So filesystem developers have a strong incentive to encourage users to use the code, but the more ethical developers (at least) do not want to cause users to lose data. It's a fine line which can be hard to manage.

So what does it take to get a filesystem written and ready for use? As part of the process of seeking funding for Btrfs development, Ted talked to veterans of a number of filesystem development projects over the years. They all estimated that getting a filesystem to a production-ready state would require something between 75 and 100 person-years of effort - or more. That can be a daunting thing to tell corporate executives when one is trying to get a project funded; for Btrfs, Ted settled for suggesting that every company involved should donate two engineers to the cause. Alas, not all of the companies followed through completely; vague problems associated with an economic crisis got in the way.

An associated example: Sun started working on the ZFS filesystem in 2001. The project was only announced in 2005, with the first shipments happening in 2006. But it is really only in the last year or so that system administrators have gained enough confidence in ZFS to start using it in production environments. Over that period of time, the ZFS team - well over a dozen people at its peak - devoted a lot of time to the development of the filesystem.

So where do things stand with ext4? It is, Ted says, an interesting time. It has been shipping in community distributions for a while, with a number of them now installing it by default. With luck, the long term support and enterprise distributions will start shipping it soon; enterprise-level adoption can be expected to follow a year or so after that.

Over the last year or so, there have been something between 60 and 100 ext4 patches in each mainline kernel release. Just under half of those are bug fixes; many of the rest are cleanup patches. A small amount of new feature and performance work continues as well. Ted noted that the number of bug fixes has not been going down in recent releases. That, he says, is to be expected; the user community for ext4 is growing rapidly, and more users will find (and report) more bugs.

A certain number of those bugs are denial of service problems; many of those are system crashes in response to a corrupted on-disk filesystem image. A larger share of the problems are race conditions and, especially, deadlocks. There are a few problems associated with synchronization; one does not normally notice these at all unless the system crashes at the wrong time. And there are a few memory leaks, usually associated with poorly-tested error-handling paths.

A look at where the bulk of these bugs can be found is illuminating. There have been problems in the interaction between the block allocator and the online resize functionality - it turns out that people do not resize filesystems often, so this code is not always all that heavily tested. Other bugs have come up in the interaction between block pre-allocation and out-of-space handling. Online defragmentation has had a number of problems, including one nasty security bug; it turned out that nobody had really been testing that code. The FIEMAP ioctl() command, really only used by one utility, had some problems. There were issues associated with disk quotas; this feature, too, is not often used, especially by filesystem developers. And there have been problems with the no-journal mode contributed by Google; the filesystem had a number of "there is always a journal" assumptions inherited from ext3, but, again, few people have tested this feature.

The common theme here should be clear: a lot of the bugs turning up in this stage of the game are associated with little-used features which have not received as much testing as the core filesystem functions. The good news is that, as a result, most of the bugs have not actually affected that many users.

There was one problem in particular which took six months to find; about once a month, it would corrupt a filesystem belonging to a dedicated and long-suffering tester. It turned out that there was a race condition which could corrupt the disk if two processes were writing the same file at the same time. Samba, as it happens, does exactly that, whereas the applications run by most filesystem developers do not. The moral of the story: just because the filesystem developer has not seen problems does not mean that the code is truly safe.

Another bug would only strike if the system crashed at just the wrong time; it had been there for a long time before anybody noticed it. How long? The bug was present in the ext3 filesystem as well, but nobody ever reported it.

There have also been a number of performance problems which have been found and fixed. Perhaps the most significant one had to do with performance in the writeback path. According to Ted, the core writeback code in the kernel is fairly badly broken at the moment, with the result that it will not tell the filesystem to write back more than 1024 blocks at a time. That is far too small for large, fast devices. So ext4 contains a hack whereby it will write back much more data than the VFS layer has requested; it is justified, he says, because all of the other filesystems do it too. In general, nobody wants to touch the writeback code, partly because they fear breaking all of the workarounds which have found their way into filesystem-specific code over the years.

Ted concluded by noting that, in one way, filesystems are easy: the Linux kernel contains a great deal of generic support code which does much of the work. But the truth of the matter is that they are hard. There are lots of workloads to support, the performance demands are strong, and there tend to be lots of processes running in parallel. The creation of a new filesystem is done as a labor of love; it's generally hard to justify from a business perspective. This reality is reflected in the fact that almost nobody is investing in filesystem work currently, with the one high-profile exception being Sun and its ZFS work. But, Ted noted, that work has cost them a lot, and it's not clear that they have gotten a return which justifies that investment. Hopefully the considerable amount of work which has gone into Linux filesystem development will have a more obvious payback.

Uprobes: not quite there yet

By Jonathan Corbet
January 20, 2010
Tracing support in Linux has made a great deal of progress over the course of the last year or so. One important feature still lacks support in the mainline kernel, though: seamless, combined tracing of user-space execution along with the kernel. The subsystem which is meant to support this feature - utrace - has run into a number of roadblocks on its way into the mainline. Now a higher-level feature, uprobes, has been proposed as a solution for dynamic probing of user-space programs. All told, the combination shows a lot of progress toward inclusion, but the resulting discussion suggests that there are still problems to be overcome before this code will be merged.

This version of uprobes is actually two independent modules which address the problem at different levels. The lower-level piece is called "UBP," for user-space break points; its job is to handle the actual placement of probes into user-space processes. The developers reasoned that there might be additional users of user-space probes in the future, so the facilities for the placement and removal of those probes were carved out separately.

On top of UBP is the actual uprobes code, which handles higher-level details. Uprobes arbitrates between multiple users of breakpoints, even if two users want to place a probe at the same location. It uses utrace to ensure that processes are not running in an area which is about to have a probe inserted, and deals with the case of multiple processes running the same code where some are being traced and others are not. The uprobe code is also in charge of actually calling the probe function when a probe is hit and recovering properly if that function behaves poorly.

This separation is the first point of contention; Peter Zijlstra (who has been the main reviewer of this code so far) sees uprobes as an unnecessary glue layer which could be eliminated. Peter would rather see any needed features pushed down into UBP, after which the higher-level code could be dropped. The uprobes developers disagree, though, saying that the functions implemented at that level are necessary and cannot really be eliminated. This part of the discussion kind of died out, but it doesn't look like the developers are inclined to make major changes here.

The next problem is with the implementation of the probes themselves. When a probe is placed in a user-space program, the instruction at the probed location is overwritten by a breakpoint. When the breakpoint is hit, the probe handler function is invoked; once it returns, the replaced instruction must be executed somehow. A simple implementation would put that instruction back into its original location, single-step through it, then restore the breakpoint once again. That approach fails, though, if there is a second process (or thread) running the probed code. If that second process executes through the probed area while the probe has been removed, the associated event will be lost.

So the uprobes developers took a different approach, called "single-step out of line" or "execute out of line" (XOL). A separate region of memory is set up for the purpose of holding instructions which have been displaced by probe breakpoints. When one of those instructions is to be executed, it is run (again, in single-step mode) out of this separate area; after that, control returns to the instruction following the probe location. This solution allows a probe to work with multiple processes at the same time.

The problem is this: the memory containing the XOL instructions must be in the probed process's address space. So the XOL code adds a virtual memory area (VMA) to the process, reserving a range of address space for this purpose. This works, but it strikes some observers as inelegant at best, and potentially disruptive at worst. Currently, the layout of a process's address space is almost entirely under the control of the process itself. The injection of a special kernel VMA can perturb the process's control of its address space, causing other VMAs to move or conflicting with an attempt by the process to place a VMA at a specific location. Debuggers are known to distort application behavior (leading to "heisenbugs" which disappear when somebody attempts to observe them directly), but tracing, which is meant to work on production systems, should minimize such distortions. Peter also dislikes the precedent of kernel code messing with a process's address space. Finally, on 32-bit systems, losing even a small amount of address space to a kernel function is likely to be unwelcome in a number of situations.

Solving this problem is not necessarily easy. Peter seems to favor emulating the displaced instruction, but that would require the implementation of a full instruction emulator in the kernel. That code would be large, architecture-specific, and error prone. There was some discussion of trying to run the instruction in kernel space, but doing that securely appears to be a challenging task. After an extended discussion, the prevailing opinion seemed to be something like that expressed by Pekka Enberg:

I guess we're looking at few megabytes of the address space for normal scenarios which doesn't seem too excessive... I don't like the idea but if the performance benefits are real (are they?), maybe it's a worthwhile trade-off.

In the end, perhaps the kernel developers will hold their noses and merge this approach, but chances are they'll need to talk about it for a while yet first.

The uprobes code comes with an ftrace plugin which provides an interface to user space for the placement and management of probes. The problem here is that the kernel developers have, for all practical purposes, decided that there will be no more ftrace plugins added to the kernel. New features are supposed to go through the perf events subsystem instead, which is seen as having a better-designed interface. So the current ftrace plugin will almost certainly have to be redone for perf events before this code can go in.

The ftrace plugin also associates user-space probes with specific processes of interest. Peter argues that it makes more sense to hook probes onto executable files, then make the process association by way of the VMA structure when the file is mapped. Existing features in the kernel, perhaps supplemented with a simple hook or two, would make it easy for uprobes to find processes running code from a file and to deal with process comings and goings while the probes are in place. The uprobes developers have not said as much, as of this writing, but it seems likely that the API could be reworked in those terms.

Then, there is the nagging issue of the utrace layer, which has not yet found its way into the mainline. It has recently been added to linux-next, but there is some discomfort with that and it's not clear if it will remain there or not.

All of this may seem like a lot of obstacles to the merging of this code, but it also represents a step forward. The road into the mainline has been long for utrace; a final detour or two seems about par for the course. The existence of uprobes as an in-kernel user of utrace might help its cause, once uprobes itself passes muster. Assuming consensus on these issues can be reached, it should be possible to make a last round of changes and be quite close to getting the code merged - though it might be difficult to get this done for the 2.6.34 merge window. But, if things go well, we should have user-space probing not too much later than that.

Secrets of the Ftrace function tracer

January 20, 2010

This article was contributed by Steven Rostedt

Probably the most powerful tracer derived from Ftrace is the function tracer. It has the ability to trace practically every function in the kernel. It can be used not only for debugging and analysis, but also to learn and observe the flow of the Linux kernel.

Two previous articles, Debugging the Linux Kernel Using Ftrace parts I and II, explain some of the basic features of Ftrace and the function tracer; this article is written with the assumption that the reader has already read them. As with the previous articles, the examples here expect that the user has already changed to the debugfs tracing directory (a setup sketch follows the configuration list below). The kernel configuration options that need to be enabled to follow the examples in this article are:

  • CONFIG_FUNCTION_TRACER
  • CONFIG_DYNAMIC_FTRACE
  • CONFIG_FUNCTION_GRAPH_TRACER

Note, the CONFIG_HAVE_FUNCTION_TRACER, CONFIG_HAVE_DYNAMIC_FTRACE, and CONFIG_HAVE_FUNCTION_GRAPH_TRACER options are enabled when the architecture supports the corresponding feature. Do not confuse them with the listed options. The features are only enabled when the listed configuration options are enabled and not when only the _HAVE_ options are.
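
The examples also assume that debugfs is mounted. If your distribution does not mount it automatically, a minimal setup looks like this (the /sys/kernel/debug mount point is the usual convention, but any mount point works):

   # mount -t debugfs nodev /sys/kernel/debug
   # cd /sys/kernel/debug/tracing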

As shown in the previous articles, here is a quick example of how to enable the function tracer.

   [tracing]# echo function > current_tracer
   [tracing]# cat trace
          <idle>-0     [000] 1726568.996435: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [000] 1726568.996436: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [000] 1726568.996436: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [000] 1726568.996437: rcu_needs_cpu <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 1726568.996438: enter_idle <-cpu_idle
          ...

The above shows you the process name (<idle>), the PID (0), the CPU that the trace executed on ([000]), a time-stamp in seconds with the decimal places down to microseconds (1726568.996435), the function being traced (hrtimer_get_next_event), and the parent that called that function (get_next_timer_interrupt).

Function filtering

Running the function tracer can be overwhelming. The amount of data can be vast, and very hard for a human brain to absorb. Ftrace provides a way to limit what functions you see; two files control which functions are traced:

   set_ftrace_filter
   set_ftrace_notrace

These filtering features depend on the CONFIG_DYNAMIC_FTRACE option. As explained in the previous articles, when this option is enabled, all of the mcount caller locations are stored and, at boot time, converted into NOPs. These locations are saved and used to enable tracing when the function tracer is activated. This also has a nice side effect: not all functions must be enabled, and the files above determine which functions get enabled and which do not.

When any functions are listed in set_ftrace_filter, only those functions will be traced, which helps the performance of the system while the trace is active. Tracing every function incurs a large overhead; when set_ftrace_filter is used, only the functions listed in that file have their NOPs changed to call the tracer. Depending on which functions are being traced, having just a couple of hundred functions enabled is hardly noticeable.

The set_ftrace_notrace file is the opposite of set_ftrace_filter. Instead of limiting the trace to a set of functions, functions listed in set_ftrace_notrace will not be traced. Some functions show up quite often; not only does tracing these functions slow down the system, but they can also fill up the trace buffer and make it harder to analyze the functions you care about. Functions such as rcu_read_lock() and spin_lock() fall into this category.

The process to add functions to these files typically uses bash redirection. Using the symbol '>' will remove all existing functions in the file and add what is being echoed into the file. Appending to the file using '>>' will keep the existing functions and add new ones.

   [tracing]# echo sys_read > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_read
   [tracing]# echo sys_write >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_write
   sys_read
   [tracing]# echo sys_open > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_open

Multiple functions can be added with a single echo; to remove all functions, just echo a blank line into the filter file:

   [tracing]# echo sys_read sys_open sys_write > set_ftrace_notrace 
   [tracing]# cat set_ftrace_notrace
   sys_open
   sys_write
   sys_read
   [tracing]# echo > set_ftrace_notrace
   [tracing]# cat set_ftrace_notrace
   [tracing]#

The functions listed in these files can also be set on the kernel command line. The ftrace_notrace and ftrace_filter options will preset these files from a comma-delimited list of functions:

   ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock
   ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault

Functions added on the kernel command line determine the initial contents of the corresponding filter files. These options only pre-load the files; functions can still be removed or added using the bash redirections explained above.

The functions listed in set_ftrace_notrace take precedence. That is, if a function is listed in both set_ftrace_notrace and set_ftrace_filter, that function will not be traced.
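
As a quick illustration of this precedence (a sketch, reusing the system calls from the earlier examples):

   [tracing]# echo sys_read sys_write > set_ftrace_filter
   [tracing]# echo sys_write > set_ftrace_notrace

With this setup, sys_read will be traced; sys_write, being listed in both files, will not.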

Wildcard filters

A list of functions that can be added to the filter files is shown in the available_filter_functions file. This list of functions was derived from the list of stored mcount callers previously mentioned.

   [tracing]# cat available_filter_functions | head -8
   _stext
   do_one_initcall
   run_init_process
   init_post
   name_to_dev_t
   create_dev
   T.627
   set_personality_64bit

You can grep this file and redirect the result into one of the filter files:

   [tracing]# grep sched available_filter_functions > set_ftrace_filter
   [tracing]# cat set_ftrace_filter | head -8
   save_stack_address_nosched
   mce_schedule_work
   smp_reschedule_interrupt
   native_smp_send_reschedule
   sys32_sched_rr_get_interval
   sched_avg_update
   proc_sched_set_task
   sys_sched_get_priority_max

Unfortunately, adding lots of functions to the filter files is slow, and you will notice that the above grep took several seconds to execute. This is because each function name written into the filter file is processed individually. The above grep produces over 300 function names, and each of those names is compared (using strcmp()) against every function name in the kernel, which is quite a lot.

   [tracing]# wc -l  available_filter_functions 
   24331 available_filter_functions

So the grep above caused set_ftrace_filter to generate over 300 * 24331 (7,299,300) comparisons!

Fortunately, these files also take wildcards; the following glob expressions are valid:

  • value* - Select all functions that begin with value.
  • *value* - Select all functions that contain the text value.
  • *value - Select all functions that end with value.

The kernel contains a rather simple parser, and will not process value*value in the expected way: it will ignore the second value and select all functions that start with value, regardless of how the name ends. Wildcards passed to the filter files are matched directly against each available function, which is much faster than passing in a long list of individual names.
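
To illustrate the limitation (a sketch, reusing function names from the listing below): because the parser ignores everything after the first star, these two commands select exactly the same functions:

   [tracing]# echo 'set*' > set_ftrace_filter
   [tracing]# echo 'set*mode' > set_ftrace_filter

Either way, the filter ends up containing every function that begins with set, such as set_personality_64bit, not just functions like set_tsc_mode.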

Because the star (*) is also used by bash, it is best to wrap the input with quotes:

   [tracing]# echo set* > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   [tracing]# echo 'set*' > set_ftrace_filter
   [tracing]# cat set_ftrace_filter | head -5
   set_personality_64bit
   set_intr_gate_ist
   set_intr_gate
   set_intr_gate
   set_tsc_mode

The filters can also select only those functions that belong to a specific module by using the 'mod' command in the input to the filter file:


   [tracing]# echo ':mod:tg3' > set_ftrace_filter
   [tracing]# cat set_ftrace_filter |head -8
   tg3_write32
   tg3_read32
   tg3_write_flush_reg32
   tw32_mailbox_flush
   tg3_write32_tx_mbox
   tg3_read32_mbox_5906
   tg3_write32_mbox_5906
   tg3_disable_ints

This is very useful if you are debugging a single module, and only want to see the functions that belong to that module in the trace.

In the earlier articles, enabling and disabling recording to the ring buffer was done using the tracing_on file and the tracing_on() and tracing_off() kernel functions. But if you do not want to recompile the kernel, and you want to stop the tracing at a particular function, set_ftrace_filter has a method to do so.

The format of the command to have the function tracer enable or disable the ring buffer is as follows:

   function:command[:count]

This will execute the command when the function is entered. The command is either traceon or traceoff, and an optional count can be added to have the command execute only a given number of times. If the count is left off (including the leading colon), the command will be executed every time the function is called.
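
For example, to stop recording the first time kfree() is hit, and only that once (a sketch; any function from available_filter_functions would do):

   [tracing]# echo 'kfree:traceoff:1' > set_ftrace_filter

Recording stops the first time kfree() is called; since the count is then exhausted, subsequent calls will not trigger the command again (recording stays off until it is re-enabled).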

A while back, I was debugging a change I had made to the kernel that was causing segmentation faults in some programs. I was having a hard time catching the trace because, by the time I was able to stop the trace after seeing the segmentation fault, the data had already been overwritten. But the backtrace on the console showed that the function __bad_area_nosemaphore was being called. I was then able to stop the tracer with the following command:


  [tracing]# echo '__bad_area_nosemaphore:traceoff' > set_ftrace_filter
  [tracing]# cat set_ftrace_filter
  #### all functions enabled ####
  __bad_area_nosemaphore:traceoff:unlimited
  [tracing]# echo function > current_tracer

Notice that functions with commands do not affect the general filters. Even though a command has been added to __bad_area_nosemaphore, the filter still allowed all functions to be traced. Commands and filter functions are separate and do not affect each other. With the above command attached to the function __bad_area_nosemaphore, the next time the segmentation fault occurred, the trace stopped and contained the data I needed to debug the situation.

Removing functions from the filters

As stated earlier, echoing in nothing with '>' will clear the filter file. But what if you only want to remove a few functions from the filter?

   [tracing]# cat set_ftrace_filter > /tmp/filter
   [tracing]# grep -v lock /tmp/filter > set_ftrace_filter

The above works, but as mentioned, it may take a while to complete if there are several functions already in set_ftrace_filter. The following does the same thing but is much faster:

   [tracing]# echo '!*lock*' >> set_ftrace_filter

The '!' symbol will remove functions listed in the filter file. As shown above, the '!' works with wildcards, but it can also be used with a single function. Since '!' has special meaning in bash, it must be wrapped in single quotes or bash will try to execute what follows it. Also note that '>>' is used; if you make the mistake of using '>', you will end up with no functions in the filter file.

Because the commands and filters do not interfere with each other, clearing the set_ftrace_filter will not clear the commands. The commands must be cleared with the '!' symbol.

   [tracing]# echo 'sched*' > set_ftrace_filter
   [tracing]# echo 'schedule:traceoff' >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter | tail -5
   schedule_console_callback
   schedule_bh
   schedule_iso_resource
   schedule_reallocations
   schedule:traceoff:unlimited
   [tracing]# echo > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   schedule:traceoff:unlimited
   [tracing]# echo '!schedule:traceoff' >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   [tracing]#

This may seem awkward, but having '>' and '>>' affect only the functions to be traced, and not the function commands, actually simplifies control over filtering functions and adding and removing commands.

Tracing a specific process

Perhaps you only need to trace a specific process, or set of processes. The set_ftrace_pid file lets you specify which processes to trace. To trace just the current thread, you can do the following:

   [tracing]# echo $$ > set_ftrace_pid

The above will set the function tracer to trace only the bash shell that executed the echo command. If you want to trace a specific program, you can create a shell script wrapper:

   [tracing]# cat ~/bin/ftrace-me
   #!/bin/sh
   # Find where debugfs is mounted.
   DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`
   # Limit function tracing to this shell, and thus to the command exec'd below.
   echo $$ > $DEBUGFS/tracing/set_ftrace_pid
   echo function > $DEBUGFS/tracing/current_tracer
   # "$@" preserves quoted arguments; $* would split them on whitespace.
   exec "$@"
   [tracing]# ~/bin/ftrace-me ls -ltr

Note, you must clear the set_ftrace_pid file if you want to go back to generic function tracing after performing the above.

   [tracing]# echo -1 > set_ftrace_pid

What calls a specific function?

Sometimes it is useful to know what is calling a particular function. The immediate caller is helpful, but an entire backtrace is even better. The function tracer contains an option that will record a backtrace in the ring buffer for every function that is traced. Since creating a backtrace for every function has a large overhead, which could livelock the system, care must be taken when using this feature. Imagine the timer interrupt on a slower system running at 1000 HZ: it is quite possible that producing a backtrace for every function the timer interrupt calls could take a full millisecond. In that case, by the time the timer interrupt returns, a new one will trigger before any other work can be done, which leads to a livelock.

To use the function tracer backtrace feature, it is imperative that the functions being traced are limited by the function filters. The option to enable function backtracing is unique to the function tracer, and it can only be set when the function tracer is active. This means you must first enable the function tracer before you have access to the option:

   [tracing]# echo kfree > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   kfree
   [tracing]# echo function > current_tracer
   [tracing]# echo 1 > options/func_stack_trace
   [tracing]# cat trace | tail -8
    => sys32_execve
    => ia32_ptregs_common
                cat-6829  [000] 1867248.965100: kfree <-free_bprm
                cat-6829  [000] 1867248.965100: <stack trace>

    => free_bprm
    => compat_do_execve
    => sys32_execve
    => ia32_ptregs_common
   [tracing]# echo 0 > options/func_stack_trace
   [tracing]# echo > set_ftrace_filter

Notice that I was careful to cat set_ftrace_filter before enabling the func_stack_trace option, to ensure that the filter was in place. At the end, I disabled options/func_stack_trace before clearing the filter. Also note that the option is persistent; even if you switch current_tracer to another tracer, the option will still be set if you re-enable the function tracer.

The function_graph tracer

The function tracer is very powerful, but it may be difficult to understand the linear format that it produces. Frederic Weisbecker has extended the function tracer into the function_graph tracer. The function_graph tracer piggy-backs off of most of the code created by the function tracer, but adds its own hook in the mcount call. Because it still uses the mcount calling method, most of the function filtering explained above also applies to the function_graph tracer, with the exception of the traceon/traceoff commands and set_ftrace_pid (although the latter may change in the future).

The function_graph tracer was also explained in the previous articles, but the set_graph_function file was not described. The func_stack_trace option used in the previous section shows what might call a function; set_graph_function, in contrast, can be used to see what a function calls:

   [tracing]# echo kfree > set_graph_function
   [tracing]# echo function_graph > current_tracer
   [tracing]# cat trace
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    0)               |  kfree() {
    0)               |    virt_to_cache() {
    0)               |      virt_to_head_page() {
    0)   0.955 us    |        __phys_addr();
    0)   2.643 us    |      }
    0)   4.299 us    |    }
    0)   0.855 us    |    __cache_free();
    0)   ==========> |
    0)               |    smp_apic_timer_interrupt() {
    0)               |      apic_write() {
    0)   0.849 us    |        native_apic_mem_write();
    0)   2.853 us    |      }
   [tracing]# echo > set_graph_function

This displays the call graph of kfree() only. The "==========>" marker shows that an interrupt happened during the call. The trace records all functions within the kfree() block, even functions called by an interrupt that triggered while within the scope of kfree().

The function_graph tracer shows the time a function took in the duration field. In the previous articles, it was mentioned that only the leaf functions, the ones that do not call other functions, have an accurate duration, since the duration of parent functions also includes the overhead of the function_graph tracer calling the child functions. By using the set_ftrace_filter file, you can force any function into becoming a leaf function in the function_graph tracer, and this will allow you to see an accurate duration of that function.

   [tracing]# echo smp_apic_timer_interrupt > set_ftrace_filter
   [tracing]# echo function_graph > current_tracer
   [tracing]# cat trace | head
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    1)   ==========> |
    1) + 16.433 us   |  smp_apic_timer_interrupt();
    1)   ==========> |
    1) + 25.897 us   |  smp_apic_timer_interrupt();
    1)   ==========> |
    1) + 24.764 us   |  smp_apic_timer_interrupt();

The above shows that the timer interrupt takes between 16 and 26 microseconds to complete.

Function profiling

oprofile and perf are very powerful profiling tools that take periodic samples of the system and can show where most of the time is spent. The function profiler, in contrast, looks at actual function execution rather than samples. If CONFIG_FUNCTION_GRAPH_TRACER is configured in the kernel, the function profiler will use the function graph infrastructure to record how long each function executes. If only CONFIG_FUNCTION_TRACER is configured, the function profiler will simply count the calls to each function.


   [tracing]# echo nop > current_tracer
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0 | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     schedule                             22943    1994458706 us     86931.03 us 
     poll_schedule_timeout                 8683    1429165515 us     164593.5 us 
     schedule_hrtimeout_range              8638    1429155793 us     165449.8 us 
     sys_poll                             12366    875206110 us     70775.19 us 
     do_sys_poll                          12367    875136511 us     70763.84 us 
     compat_sys_select                     3395    527531945 us     155384.9 us 
     compat_core_sys_select                3395    527503300 us     155376.5 us 
     do_select                             3395    527477553 us     155368.9 us 

The above times also include the time during which the function was preempted, or when schedule() was called and the task was swapped out. This may seem useless, but it does give an idea of which functions get preempted often. Ftrace also includes options that allow the function_graph tracer to ignore the time that the task was scheduled out.

   [tracing]# echo 0 > options/sleep-time
   [tracing]# echo 0 > function_profile_enabled
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0  | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     default_idle                          2493    6763414 us     2712.962 us 
     native_safe_halt                      2492    6760641 us     2712.938 us 
     sys_poll                              4723    714243.6 us     151.226 us  
     do_sys_poll                           4723    692887.4 us     146.704 us  
     sys_read                              9211    460896.3 us     50.037 us   
     vfs_read                              9243    434521.2 us     47.010 us   
     smp_apic_timer_interrupt              3940    275747.4 us     69.986 us   
     sock_poll                            80613    268743.2 us     3.333 us    

Note that the sleep-time option contains a "-" and is not sleep_time.

Disabling the function profiler and then re-enabling it causes the numbers to reset. The list is sorted by total time, but using scripts you can easily sort by any of the numbers. The trace_stat/function0 file only represents CPU 0; there exists a trace_stat/function# for each CPU on the system. All functions that are traced and have been hit appear in this file.

   [tracing]# cat trace_stat/function0  | wc -l
   2978

Functions that were not hit are not listed. The above shows that 2978 functions were hit since I started the profiling.
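
As mentioned above, the list can be re-sorted with ordinary tools; here is a small sketch that re-sorts CPU 0's profile by hit count (the second column), skipping the two header lines:

   [tracing]# tail -n +3 trace_stat/function0 | sort -k2 -nr | head -5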

Another option that affects profiling is graph-time (again with a "-"); it is enabled by default. When enabled, the time reported for a function includes the time spent in all of the functions that it called. As you can see from the output in the above examples, several system calls are listed with the highest averages. When graph-time is disabled, the reported times include only the execution of the function itself, and do not include the times of functions called from within it:

   [tracing]# echo 0 > options/graph-time
   [tracing]# echo 0 > function_profile_enabled
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0  | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     mwait_idle                           10132    246835458 us     24361.96 us 
     tg_shares_up                        154467    389883.5 us     2.524 us    
     _raw_spin_lock_irqsave              343012    263504.3 us     0.768 us    
     _raw_spin_unlock_irqrestore         351269    175205.6 us     0.498 us    
     walk_tg_tree                         14087    126078.4 us     8.949 us    
     __set_se_shares                     274937    88436.65 us     0.321 us    
     _raw_spin_lock                      100715    82692.61 us     0.821 us    
     kstat_irqs_cpu                      257500    80124.96 us     0.311 us    

Note that both sleep-time and graph-time also affect the duration times displayed by the function_graph tracer.

Conclusion

The function tracer is very powerful, with lots of different options. It is already available in mainline Linux and, hopefully, will be enabled by default in most distributions. It allows you to see into the depths of the kernel and, with its arsenal of features, gives you a good idea of why things are happening the way they do. Start using the function tracer to open up the black box that we call the kernel. Have fun!

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds