Brief items
The current development kernel is 2.6.33-rc4, released on January 12. There
have been 366 non-merge commits to Linus's tree since -rc4, mostly bug
fixes. One might expect an -rc5 soon, probably just after publication.
Stable updates: Three stable kernels were released on January 18: 2.6.32.4, 2.6.31.12 , and 2.6.27.44. Each has multiple fixes (2.6.32.4
includes 52 patches),
including some security issues. Greg Kroah-Hartman also released a status report on the stable
trees, which notes that there will almost certainly be no more 2.6.31
updates, that 2.6.27 is only viable for another 6-8 months, and that he
will be maintaining 2.6.32 as a "long-term" stable release.
Comments (none posted)
Since 32 bits means that any machine with 1 GB more means HIGHMEM,
the number of non-embedded machines that should run 32-bit kernels
today is functionally the null set. Unfortunately Linux distros
have not properly promoted 64-bit kernels for 32-bit distros;
although pure 64 bits is better, it would be a *helluva* lot better
if people stuck on 32 bits for compatibility reasons had a saner
alternative.
--
H. Peter Anvin
Can people now accept that the reason we have rather more complex
models for security policy is because generally speaking these "oh
so simple" little magic switches don't actually work or solve any
real world problems.
--
Alan Cox
Comments (27 posted)
By Jake Edge
January 20, 2010
In a fairly short period of time, devtmpfs has gone from a
controversial proposal in May to
being merged into the mainline for 2.6.32. It was merged as an
experimental feature, though, which is something the devtmpfs developers
would like to see change. Kay Sievers posted a
patch that would remove the
experimental designation as well as make it the default: "
All major distros enable devtmpfs on recent systems, so remove
the EXPERIMENTAL flag, and enable it by default to reflect how it
is used today."
Comments on the patch indicate that there is little complaint about
removing the experimental designation, but making it the default was not
particularly popular. Arjan van de Ven complained that enabling devtmpfs by default
violated a kernel convention: "we use 'default y' only for those things
that used to be on, and are now turned into a config option."
Sievers, at least, had never heard of that convention, but is willing to follow it—if it exists.
Alan Cox pointed out that existing
distributions do not use devtmpfs, only those in development, but Sievers
sees no harm for older systems:
And it should not harm any old system if it is enabled. If initramfs
is used it's completely invisible, if a custom kernel with
kernel-mounted rootfs is used, the udev boot script usually
over-mounts the devtmpfs at /dev with an empty tmpfs, like it has
always done it before.
It is unclear whether this change is meant for 2.6.33 or is just being
floated early for the 2.6.34 merge window, but the removal of EXPERIMENTAL
seems to have no real opposition. Whether it becomes the default or not
looks to be up in the air, but in a fairly short period of time, devtmpfs
has cemented its place in the mainline kernel.
Comments (none posted)
Kernel development news
By Jonathan Corbet
January 20, 2010
The ext4 filesystem is reaching the culmination of a long development
process. It has been marked as stable in the mainline kernel for over a
year, distributions are installing it by default, and it may start to see
more widespread enterprise-level deployment toward the end of this year.
At linux.conf.au 2010, ext4 hacker Ted Ts'o talked about the process of
stabilizing ext4 and why filesystems take a long time to become ready for
production use.
In general, Ted says, people tend to be overly optimistic about how quickly
a filesystem can stabilize. It is not a fast process, for a number of
fairly clear reasons. In general, there are some aspects of software which
can make it hard to test and debug. These include premature optimization
("the root of all evil"), the presence of large amounts of internal state,
and an environment involving a lot of parallelism. Any of these features
will make code more difficult to understand and complicate the testing
environment.
Filesystems suffer from all of these problems. Users demand that a
general-purpose filesystem be heavily optimized for a wide variety of
workloads; this optimization work must be done at all levels of the code.
The entire job of a filesystem is to store and manage internal state.
Among other things, that makes it hard for developers to reproduce
problems; specific bugs are quite likely to be associated with the state of
a specific filesystem which a user may be unwilling to share even in the
absence of the practical difficulties implicit in making hundreds of
gigabytes of data available to developers. And parallelism is a core part of the
environment for any general-purpose filesystem; there will always be many
things going on at once. All of these factors combine to make filesystems
difficult to stabilize.
What it comes down to, Ted says, is that filesystems, like fine wines, have
to age for a fair period of time before they are ready. But there's an
associated problem: the workload-dependent nature of many filesystem
problems guarantees that filesystem developers cannot, by themselves, find
all of the bugs in their code. There will always be a need for users to
test the code and report their experiences. So filesystem developers have
a strong incentive to encourage users to use the code, but the more ethical
developers (at least) do not want to cause users to lose data. It's a fine
line which can be hard to manage.
So what does it take to get a filesystem written and ready for use? As
part of the process of seeking funding for Btrfs development, Ted talked to
veterans of a number of filesystem development projects over the years.
They all estimated that getting a filesystem to a production-ready state
would require something between 75 and 100 person-years of effort - or
more. That can be a daunting thing to tell corporate executives when one
is trying to get a project funded; for Btrfs, Ted settled for suggesting
that every company involved should donate two engineers to the cause.
Alas, not all of the companies followed through completely; vague problems
associated with an economic crisis got in the way.
An associated example: Sun started working on the ZFS filesystem in 2001.
The project was only announced in 2005, with the first shipments happening
in 2006. But it is really only in the last year or so that system
administrators have gained enough confidence in ZFS to start using it in
production environments. Over that period of time, the ZFS team - well
over a dozen people at its peak - devoted a lot of time to the development
of the filesystem.
So where do things stand with ext4? It is, Ted says, an interesting time.
It has been shipping in community distributions for a while, with a
number of them now installing it by default. With luck, the long term
support and enterprise distributions will start shipping it soon;
enterprise-level adoption can be expected to follow a year or so after
that.
Over the last year or so, There have been something between 60 and 100 ext4
patches in each mainline kernel release. Just under half of those are bug
fixes; many of the rest are cleanup patches. There's also a small amount
of new feature and performance enhancement work still. Ted noted that the
number of bug fixes has not been going down in recent releases. That, he
says, is to be expected; the user community for ext4 is growing rapidly,
and more users will find (and report) more bugs.
A certain number of those bugs are denial of service problems; many of
those are system crashes in response to a corrupted on-disk filesystem
image. A larger share of the problems are race conditions and, especially
deadlocks. There are a few problems associated with synchronization; one
does not normally notice these at all unless the system crashes at the
wrong time. And there are a few memory leaks, usually associated with
poorly-tested error-handling paths.
The areas where the bulk of these bugs can be found is illuminating. There
have been problems in the interaction between the block allocator and the
online resize functionality - it turns out that people do not resize
filesystems often, so this code is not always all that heavily tested.
Other bugs have come up in the interaction between block pre-allocation and
out-of-space handling. Online defragmentation has had a number of
problems, including one nasty security bug; it turned out that nobody had
really been testing that code. The FIEMAP ioctl()
command, really only used by one utility, had some problems. There were
issues associated with disk quotas; this feature, too, is not often used,
especially by filesystem developers. And there have been problems with the
no-journal mode contributed by Google; the filesystem had a number of
"there is always a journal" assumptions inherited from ext3, but, again,
few people have tested this feature.
The common theme here should be clear: a lot of the bugs turning up in this
stage of the game are associated with little-used features which have not
received as much testing as the core filesystem functions. The good news
is that, as a result, most of the bugs have not actually affected that many
users.
There was one problem in particular which took six months to find;
about once a month, it would corrupt a filesystem belonging to a dedicated
and long-suffering tester. It turned out that there was a race condition
which could corrupt the disk if two processes were writing the same file at
the same time. Samba, as it happens, does exactly that, whereas the
applications run by most filesystem developers do not. The moral of the
story: just because the filesystem developer has not seen problems does
not mean that the code is truly safe.
Another bug would only strike if the system crashed at just the wrong time;
it had been there for a long time before anybody noticed it. How long?
The bug was present in the ext3 filesystem as well, but nobody ever
reported it.
There have also been a number of performance problems which have been found
and fixed. Perhaps the most significant one had to do with performance in
the writeback path. According to Ted, the core writeback code in the
kernel is fairly badly broken at the moment, with the result that it will
not tell the filesystem to write back more than 1024 blocks at a time.
That is far too small for large, fast devices. So ext4 contains a hack
whereby it will write back much more data than the VFS layer has requested; it
is justified, he says, because all of the other filesystems do it too. In
general, nobody wants to touch the writeback code, partly because they fear
breaking all of the workarounds which have found their way into
filesystem-specific code over the years.
Ted concluded by noting that, in one way, filesystems are easy: the Linux
kernel contains a great deal of generic support code which does much of the
work. But the truth of the matter is that they are hard. There are lots
of workloads to support, the performance demands are strong, and there tend
to be lots of processes running in parallel.
The creation of a new filesystem is done as a labor of love; it's generally
hard to justify from a business perspective. This reality is reflected in
the fact that almost nobody is investing in filesystem work currently, with
the one high-profile exception being Sun and its ZFS work. But, Ted noted,
that work has cost them a lot, and it's not clear that they have gotten a
return which justifies that investment. Hopefully the considerable amount
of work which has gone into Linux filesystem development will have a more
obvious payback.
Comments (24 posted)
By Jonathan Corbet
January 20, 2010
Tracing support in Linux has made a great deal of progress over the course
of the last year or so. One important feature still lacks support in the
mainline kernel, though: seamless, combined tracing of user-space execution
along with the kernel. The subsystem which is meant to support this
feature - utrace - has
run into a
number of roadblocks on its way into the mainline. Now a higher-level
feature,
uprobes, has been
proposed as a solution for dynamic probing of user-space programs. All
told, the combination shows a lot of progress toward inclusion, but the
resulting
discussion suggests that there are still problems to be overcome before
this code will be merged.
This version of uprobes is actually two independent modules which address
the problem at different levels. The lower-level piece is called "UBP,"
for user-space break points; its job is to handle the actual placement of
probes into user-space processes. The developers reasoned that there might
be additional users of user-space probes in the future, so the facilities
for the placement and removal of those probes were carved out separately.
On top of UBP is the actual uprobes code, which handles higher-level
details. Uprobes arbitrates between multiple users of breakpoints, even if
two users want to place a probe at the same location. It uses utrace to
ensure that
processes are not running in an area which is about to have a probe
inserted, and deals with the case of multiple processes running the same
code where some are being traced and others are not. The uprobe code is
also in charge of actually calling the probe function when a probe is hit
and recovering properly if that function behaves poorly.
This separation is the first point of contention; Peter Zijlstra (who has
been the main reviewer of this code so far) sees uprobes as an unnecessary glue
layer which could be eliminated. Peter would rather see any needed
features pushed down into UBP, after which the higher-level code could
be dropped. The uprobes developers disagree, though, saying that the
functions implemented at that level are necessary and cannot really be
eliminated. This part of the discussion kind of died out, but it doesn't
look like the developers are inclined to make major changes here.
The next problem is with the implementation of the probes themselves. When
a probe is placed in a user-space program, the instruction at the probed
location is overwritten by a breakpoint. When the breakpoint is hit, the
probe handler function is invoked; once it returns, the replaced
instruction must be executed somehow. A simple implementation would put
that instruction back into its original location, single-step through it,
then restore the breakpoint once again. That approach fails, though, if
there is a second process (or thread) running the probed code. If that
second process executes through the probed area while the probe has been
removed, the associated event will be lost.
So the uprobes developers took a separate approach, called "single-step out
of line" or "execute out of line" (XOL). A separate region of memory is
set up for the purpose of holding instructions which have been displaced by
probe breakpoints. When one of those instructions is to be executed, it is
run (again, in single-step mode) out of this separate area; after that,
control returns after the probe location. This solution allows a probe to
work with multiple processes at the same time.
The problem is this: the memory containing the XOL instructions must be in
the probed process's address space. So the XOL code adds a virtual memory
area (VMA) to the process, reserving a range of address space for this purpose.
This works, but it strikes some observers as inelegant at best, and
potentially disruptive at worst. Currently, the layout of a process's
address space is almost entirely under the control of the process itself.
The injection of a special kernel VMA can perturb the process's control of
its address space, causing other VMAs to move or conflicting with an
attempt by the process to place a VMA at a specific location. Debuggers
are often known to distort application behavior (leading to "heisenbugs"
which disappear when somebody attempts to observe them directly), but
tracing, which is meant to work on production systems, should really
minimize such distortions. Peter also dislikes the precedent of kernel
code messing with a process's address space. Finally, on 32-bit systems,
losing even a small amount of address space to a kernel function is likely
to be unwelcome in a number of situations.
Solving this problem is not necessarily easy. Peter seems to favor
emulating the displaced instruction, but that would require
the implementation of a full instruction emulator in the kernel. That code
would be large, architecture-specific, and error prone. There was some
discussion of trying to run the instruction in kernel space, but doing that
securely appears to be a challenging task. After an extended discussion,
the prevailing opinion seemed to be something like that expressed by Pekka Enberg:
I guess we're looking at few megabytes of the address space for
normal scenarios which doesn't seem too excessive... I don't like
the idea but if the performance benefits are real (are they?),
maybe it's a worthwhile trade-off.
In the end, perhaps the kernel developers will hold their noses and merge
this approach, but chances are they'll need to talk about it for a while
yet first.
The uprobes code comes with an ftrace plugin which provides an interface to
user space for the placement and management of probes. The problem here is
that the kernel developers have, for all practical purposes, decided that
there will be no more ftrace plugins added to the kernel. New features are
supposed to go through the perf events subsystem instead, which is seen as
having a better-designed interface. So the current ftrace plugin will
almost certainly have to be redone for perf events before this code can go
in.
The ftrace plugin also associates user-space probes with specific process
of interest. Peter argues that it makes more sense to hook probes onto
executable files, then make the process association by way of the VMA
structure when the file is mapped. Existing features in the kernel,
perhaps supplemented with a
simple hook or two, would make it easy for uprobes to find processes
running code from a file and to deal with process comings and goings while
the probes are in place. The uprobes developers have not said as much, as
of this writing, but it seems likely that the API could be reworked in
those terms.
Then, there is the nagging issue of the utrace layer, which has not yet
found its way into the mainline. It has recently been added to
linux-next, but there is some discomfort
with that and it's not clear if it will remain there or not.
All of this may seem like a lot of obstacles to the merging of this code, but
it also represents a step forward. The road into the mainline has been
long for utrace; a final detour or two seems about par for the course. The
existence of uprobes as an in-kernel user of utrace might help its cause,
once uprobes itself passes muster.
Assuming consensus on these issues can be reached, it should be possible to
make a last round of changes and be quite close to getting the code merged
- though it might be difficult to get this done for the 2.6.34 merge
window. But, if things go well, we should have user-space probing not too
much later than that.
Comments (9 posted)
January 20, 2010
This article was contributed by Steven Rostedt
Probably the most powerful tracer derived from Ftrace is the
function tracer. It has the ability to trace practically every function in the
kernel. It can be run not just for debugging or analyzing, but also
to learn and observe the flow of the Linux kernel.
Two previous articles, Debugging the Linux Kernel Using Ftrace
parts I and
II, explain some of the basic features
of Ftrace and the function tracer; this article is written with the
assumption
that the reader has already read them. As with the previous articles, the
examples in this article expect that
the user has already changed to the debug file system tracing directory.
The kernel configuration options that are need to be enabled to follow
the examples in this article are:
- CONFIG_FUNCTION_TRACER
- CONFIG_DYNAMIC_FTRACE
- CONFIG_FUNCTION_GRAPH_TRACER
Note, the CONFIG_HAVE_FUNCTION_TRACER, CONFIG_HAVE_DYNAMIC_FTRACE,
and CONFIG_HAVE_FUNCTION_GRAPH_TRACER options are enabled when the architecture
supports the corresponding feature. Do not confuse them with the listed options.
The features are only enabled when the listed configuration options are enabled
and not when only the _HAVE_ options are.
As shown in the previous articles, here is a quick example of how to
enable the function tracer.
[tracing]# echo function > current_tracer
[tracing]# cat trace
<idle>-0 [000] 1726568.996435: hrtimer_get_next_event <-get_next_timer_interrupt
<idle>-0 [000] 1726568.996436: _spin_lock_irqsave <-hrtimer_get_next_event
<idle>-0 [000] 1726568.996436: _spin_unlock_irqrestore <-hrtimer_get_next_event
<idle>-0 [000] 1726568.996437: rcu_needs_cpu <-tick_nohz_stop_sched_tick
<idle>-0 [000] 1726568.996438: enter_idle <-cpu_idle
...
The above shows you the process name (<idle>), PID (0) the CPU
that the trace executed on ([000]), a time-stamp in seconds with the
decimal places down to microseconds (1726568.996435) the function being
traced (hrtimer_get_next_event) and the parent that called that
function (get_next_timer_interrupt).
Function filtering
Running the function tracer can be overwhelming. The amount of data
may be vast, and very hard to get a hold of by the human brain.
Ftrace provides a way to limit what functions you see. Two files exist
that let you limit what functions are traced:
set_ftrace_filter
set_ftrace_notrace
These filtering features depend on the CONFIG_DYNAMIC_FTRACE
option.
As explained in the previous articles, when this configuration is enabled
all of the mcount caller locations are stored and at boot time are
converted into
NOPs. These locations are saved and used to enable tracing when the function tracer
is activated.
But this also has a nice side effect: not all functions must be enabled.
The above files will determine which functions gets enabled and which do not.
When any function is listed in the set_ftrace_filter, only those functions will
be traced. This will help the performance of the system when the trace is
active.
Tracing every function incurs a large overhead, but when using the set_ftrace_filter,
only those functions listed in that file will have the NOPs changed to call the tracer.
Depending on which functions are being traced, just having a couple of hundred functions
enabled is hardly noticeable.
The set_ftrace_notrace file is the opposite of set_ftrace_filter.
Instead of limiting the
trace to a set of functions, functions listed in set_ftrace_notrace will not be traced.
Some functions show up quite often and not only does tracing these functions slow down
the system, they can fill up the trace buffer and make it harder to analyze the
functions you care about. Functions such as rcu_read_lock() and spin_lock()
fall into this category.
The process to add functions to these files typically uses bash redirection.
Using the symbol '>' will remove all existing functions in the file and add what
is being echoed into the file. Appending to the file using '>>' will keep the existing
functions and add new ones.
[tracing]# echo sys_read > set_ftrace_filter
[tracing]# cat set_ftrace_filter
sys_read
[tracing]# echo sys_write >> set_ftrace_filter
[tracing]# cat set_ftrace_filter
sys_write
sys_read
[tracing]# echo sys_open > set_ftrace_filter
[tracing]# cat set_ftrace_filter
sys_open
To remove all functions just echo a blank line into the filter file.
[tracing]# echo sys_read sys_open sys_write > set_ftrace_notrace
[tracing]# cat set_ftrace_notrace
sys_open
sys_write
sys_read
[tracing]# echo > set_ftrace_notrace
[tracing]# cat set_ftrace_notrace
[tracing]#
The functions listed in these files can also be set on the kernel command line.
The options ftrace_notrace and ftrace_filter will preset
these files by listing a comma delimited set of functions.
ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock
ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault
Functions added by the kernel command line set what will be in the
corresponding filter files. These options only pre-load the
files, functions can still be removed or added using the bash redirection
as explained above.
The functions listed in set_ftrace_notrace take precedence. That is, if a function
is listed in both set_ftrace_notrace and set_ftrace_filter, that function will not
be traced.
Wildcard filters
A list of functions that can be added to the filter files is
shown in the available_filter_functions file. This list of functions was
derived from the list of stored mcount callers previously mentioned.
[tracing]# cat available_filter_functions | head -8
_stext
do_one_initcall
run_init_process
init_post
name_to_dev_t
create_dev
T.627
set_personality_64bit
You can grep this file and redirect the result into one of the filter files:
[tracing]# grep sched available_filter_functions > set_ftrace_filter
[tracing]# cat set_ftrace_filter | head -8
save_stack_address_nosched
mce_schedule_work
smp_reschedule_interrupt
native_smp_send_reschedule
sys32_sched_rr_get_interval
sched_avg_update
proc_sched_set_task
sys_sched_get_priority_max
Unfortunately, adding lots of functions to the filtering files is slow and you will notice that the
above grep took several seconds to execute. This is because each function
name written into the filter file will be processed individually. The above grep
produces over 300 function names. Each of those 300 names will be compared
(using strcmp()) against
every function name in the kernel, which is quite a lot.
[tracing]# wc -l available_filter_functions
24331 available_filter_functions
So the grep above caused set_ftrace_filter to generate over 300
* 24331
(7,299,300) comparisons!
Fortunately, these files also take wildcards; the following glob expressions
are valid:
- value* - Select all functions that begin with value.
- *value* - Select all functions that contain the text value.
- *value - Select all functions that end with value.
The kernel contains a rather simple parser, and will not process
value*value in the expected way. It will ignore the second
value and select all functions that start with value regardless
of what it ends with.
Wildcards passed to the filter files are processed directly for each available
function, which is much faster than passing in individual functions in a
list.
Because the star (*) is also used by bash, it is best to wrap the input with quotes:
[tracing]# echo set* > set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
[tracing]# echo 'set*' > set_ftrace_filter
[tracing]# cat set_ftrace_filter | head -5
set_personality_64bit
set_intr_gate_ist
set_intr_gate
set_intr_gate
set_tsc_mode
The filters can also select only those functions that belong to a specific
module by using the 'mod' command in the input to the filter file:
[tracing]# echo ':mod:tg3' > set_ftrace_filter
[tracing]# cat set_ftrace_filter |head -8
tg3_write32
tg3_read32
tg3_write_flush_reg32
tw32_mailbox_flush
tg3_write32_tx_mbox
tg3_read32_mbox_5906
tg3_write32_mbox_5906
tg3_disable_ints
This is very useful if you are debugging a single module, and only
want to see the functions that belong to that module in the trace.
In the earlier articles, enabling and disabling recording to the ring
buffer was done using the
tracing_on file and the
tracing_on() and
tracing_off() kernel functions. But if you do not want to
recompile the kernel, and you want to stop the tracing at a particular function,
set_ftrace_filter has a method to do so.
The format of the command to have the function trace enable or disable
the ring buffer is
as follows:
function:command[:count]
This will execute the command at the start of the
function. The command is either
traceon or traceoff, and an optional count can
be added to have the command only execute a given number of times. If the
count is left off (including the leading colon) then the command
will be executed every time the function is called.
A while back, I was debugging a change to the kernel I made that was
causing a segmentation fault to some programs. I was having a hard time
catching the trace, because by the time I was able to stop the trace after seeing
the segmentation fault, the data had already been overwritten. But the backtrace
on the console showed that the function __bad_area_nosemaphore was
being called. I was then able to stop the tracer with the following command:
[tracing]# echo '__bad_area_nosemaphore:traceoff' > set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
__bad_area_nosemaphore:traceoff:unlimited
[tracing]# echo function > current_tracer
Notice that functions with commands do not affect the general filters.
Even though a command has been added to __bad_area_nosemaphore, the
filter
still allowed all functions to be traced. Commands and filter functions are
separate
and do not affect each other. With the above command attached to the function
__bad_area_nosemaphore, the next time the segmentation fault occurred,
the trace stopped and contained the data I needed to debug the situation.
Removing functions from the filters
As stated earlier, echoing in nothing with '>' will clear the filter file.
But what if you only want to remove a few functions from the filter?
[tracing]# cat set_ftrace_filter > /tmp/filter
[tracing]# grep -v lock /tmp/filter > set_ftrace_filter
The above works, but as mentioned, it may take a while to complete if there were
several functions already in set_ftrace_filter. The following does the
same thing but is much faster:
[tracing]# echo '!*lock*' >> set_ftrace_filter
The '!' symbol will remove functions listed in the filter file. As shown
above, the '!' works with wildcards, but could also be used with a single
function. Since '!' has special meaning in bash it must be wrapped with
single quotes or bash will try to execute what follows it. Also note the
'>>' is used. If you make the mistake of using '>' you will end up with no
functions in the filter file.
Because the commands and filters do not interfere with each other, clearing
the set_ftrace_filter will not clear the commands. The commands
must be cleared with the '!' symbol.
[tracing]# echo 'sched*' > set_ftrace_filter
[tracing]# echo 'schedule:traceoff' >> set_ftrace_filter
[tracing]# cat trace | tail -5
schedule_console_callback
schedule_bh
schedule_iso_resource
schedule_reallocations
schedule:traceoff:unlimited
[tracing]# echo > set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
schedule:traceoff:unlimited
[tracing]# echo '!schedule:traceoff' >> set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
[tracing]#
This may seem awkward, but having the '>' and '>>' only affect
the functions to be traced and not the function commands, actually simplifies
the control between filtering functions and adding and removing commands.
Tracing a specific process
Perhaps you only need to trace a specific process, or set of processes.
The file set_ftrace_pid lets you specify specific processes that you want
to trace. To just trace the current thread you can do the following:
[tracing]# echo $$ > set_ftrace_pid
The above will set the function tracer to only trace the bash shell that executed
the echo command. If you want to trace a specific process, you can create
a shell script wrapper program.
[tracing]# cat ~/bin/ftrace-me
#!/bin/sh
DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer
exec $*
[tracing]# ~/bin/ftrace-me ls -ltr
Note, you must clear the set_ftrace_pid file if you want to go back
to generic function tracing after performing the above.
[tracing]# echo -1 > set_ftrace_pid
What calls a specific function?
Sometimes it is useful to know what is calling a particular function.
The immediate predecessor is helpful, but an entire backtrace is even better.
The function tracer contains an option that will create a backtrace in
the ring buffer for every function that is called by the tracer.
Since creating a backtrace for every function has a large overhead,
which could live lock the system, care must be taken when using this feature.
Imagine the timer interrupt on a slower system where it is run at
1000 HZ. It is quite possible that having every function that the timer interrupt
calls produce a backtrace could take 1 millisecond to complete. By the time
the timer interrupt returns, a new one will be triggered before any other work
can be done, which leads to a live lock.
To use the function tracer backtrace feature, it is imperative that the functions
being called are limited by the function filters. The option to enable the function
backtracing is unique to the function tracer and activating it can only be
done when the function tracer is enabled. This means you must first enable
the function tracer before you have access to the option:
[tracing]# echo kfree > set_ftrace_filter
[tracing]# cat set_ftrace_filter
kfree
[tracing]# echo function > current_tracer
[tracing]# echo 1 > options/func_stack_trace
[tracing]# cat trace | tail -8
=> sys32_execve
=> ia32_ptregs_common
cat-6829 [000] 1867248.965100: kfree <-free_bprm
cat-6829 [000] 1867248.965100: <stack trace>
=> free_bprm
=> compat_do_execve
=> sys32_execve
=> ia32_ptregs_common
[tracing]# echo 0 > options/func_stack_trace
[tracing]# echo > set_ftrace_filter
Notice that I was careful to cat the set_ftrace_filter before enabling
the func_stack_trace option to ensure that the filter was enabled.
At the end, I disabled the options/func_stack_trace before disabling the filter.
Also note that the option is non-volatile, that is, even if you enable another
tracer plugin in current_tracer, the option will still be enabled if you re-enable
the function tracer.
The function_graph tracer
The function tracer is very powerful, but it may be difficult to understand the
linear format that it produces. Frederic Weisbecker has extended
the function tracer into the function_graph tracer. The function_graph tracer piggy-backs off of most of the code created by the function tracer, but adds its own hook
in the
mcount call. Because it still uses the
mcount calling methods
most of the function filtering explained above also applies to the function_graph
tracer, with the exception of the
traceon/
traceoff commands
and
set_ftrace_pid (although the latter may change in the future).
The function_graph tracer was also explained in the previous articles,
but the set_graph_function file was not described.
The func_stack_trace used in the previous section can see what might call a function,
but set_graph_function can be used to see what a function calls:
[tracing]# echo kfree > set_graph_function
[tracing]# echo function_graph > current_tracer
[tracing]# cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) | kfree() {
0) | virt_to_cache() {
0) | virt_to_head_page() {
0) 0.955 us | __phys_addr();
0) 2.643 us | }
0) 4.299 us | }
0) 0.855 us | __cache_free();
0) ==========> |
0) | smp_apic_timer_interrupt() {
0) | apic_write() {
0) 0.849 us | native_apic_mem_write();
0) 2.853 us | }
[tracing]# echo > set_graph_function
This displays the call graph performed only by kfree().
The "==========>" shows that an interrupt happened during the
call. The trace records all functions within the kfree()
block, even those functions called by an interrupt that triggered while
in the scope of kfree().
The function_graph tracer shows the time a function took in the duration
field. In the previous articles, it was mentioned that only the leaf
functions, the ones that do not call other functions, have an accurate
duration, since the duration of parent functions also includes the
overhead of the function_graph tracer calling the child functions. By
using the set_ftrace_filter file, you can force any function
into becoming a leaf function in the function_graph tracer, and this
will allow you to see an accurate duration of that function.
[tracing]# echo smp_apic_timer_interrupt > set_ftrace_filter
[tracing]# echo function_graph > current_tracer
[tracing]# cat trace | head
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) ==========> |
1) + 16.433 us | smp_apic_timer_interrupt();
1) ==========> |
1) + 25.897 us | smp_apic_timer_interrupt();
1) ==========> |
1) + 24.764 us | smp_apic_timer_interrupt();
The above shows that the timer interrupt takes between 16 and 26
microseconds to complete.
Function profiling
oprofile and perf are very powerful profiling
tools that take periodic samples of the system and can show where
most of the time is spent. With the function profiler, it is possible to
take a good look at the actual function execution and not just samples.
If CONFIG_FUNCTION_GRAPH_TRACER is configured in the kernel, the function
profiler will use the function graph infrastructure to record how long a function
has been executing. If just CONFIG_FUNCTION_TRACER is configured,
the function profiler will just count the functions being called.
[tracing]# echo nop > current_tracer
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function 0 |head
Function Hit Time Avg
-------- --- ---- ---
schedule 22943 1994458706 us 86931.03 us
poll_schedule_timeout 8683 1429165515 us 164593.5 us
schedule_hrtimeout_range 8638 1429155793 us 165449.8 us
sys_poll 12366 875206110 us 70775.19 us
do_sys_poll 12367 875136511 us 70763.84 us
compat_sys_select 3395 527531945 us 155384.9 us
compat_core_sys_select 3395 527503300 us 155376.5 us
do_select 3395 527477553 us 155368.9 us
The above also includes the times that a function has been preempted or
schedule() was called and the task was swapped out.
This may seem useless, but it does give an idea of what functions get
preempted often. Ftrace also includes options that allow
you to have the function graph tracer ignore the time the task was
scheduled out.
[tracing]# echo 0 > options/sleep-time
[tracing]# echo 0 > function_profile_enabled
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function0 | head
Function Hit Time Avg
-------- --- ---- ---
default_idle 2493 6763414 us 2712.962 us
native_safe_halt 2492 6760641 us 2712.938 us
sys_poll 4723 714243.6 us 151.226 us
do_sys_poll 4723 692887.4 us 146.704 us
sys_read 9211 460896.3 us 50.037 us
vfs_read 9243 434521.2 us 47.010 us
smp_apic_timer_interrupt 3940 275747.4 us 69.986 us
sock_poll 80613 268743.2 us 3.333 us
Note that the sleep-time option contains a "-" and is not
sleep_time.
Disabling the function profiler and then re-enabling it causes the
numbers to reset. The list is sorted by the Avg times, but
using scripts you can easily sort by any of the numbers.
The trace_stat/function0 only represents CPU 0. There exists a
trace_stat/function# for each CPU on the system.
All functions that are traced and have been hit are in this file.
[tracing]# cat trace_stat/function0 | wc -l
2978
Functions that were not hit are not listed. The above shows that 2978 functions
were hit since I started the profiling.
Another option that affects profiling is graph-time (again with a "-").
By default it is enabled. When enabled, the times for a function include the
times of all the functions that were called within the function. As you can
see from the output in the above examples, several system calls are listed with
the highest average. When disabled, the times only include the execution of the
function itself, and do not contain the times of functions that are called
from the function:
[tracing]# echo 0 > options/graph-time
[tracing]# echo 0 > function_profile_enabled
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function0 | head
Function Hit Time Avg
-------- --- ---- ---
mwait_idle 10132 246835458 us 24361.96 us
tg_shares_up 154467 389883.5 us 2.524 us
_raw_spin_lock_irqsave 343012 263504.3 us 0.768 us
_raw_spin_unlock_irqrestore 351269 175205.6 us 0.498 us
walk_tg_tree 14087 126078.4 us 8.949 us
__set_se_shares 274937 88436.65 us 0.321 us
_raw_spin_lock 100715 82692.61 us 0.821 us
kstat_irqs_cpu 257500 80124.96 us 0.311 us
Note that both sleep-time and graph-time also affect the
duration times displayed by the function_graph tracer.
Conclusion
The function tracer is very powerful with lots of different options.
It is already available in mainline Linux, and hopefully will be enabled by
default in most distributions. It allows you to see into the depths of the kernel
and with its arsenal of features, gives you a good idea of why things are
happening the way they do. Start using the function tracer to open up the
black box that we call the kernel. Have fun!
Comments (4 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Page editor: Jonathan Corbet
Next page: Distributions>>