Brief items
The 2.6.36 kernel is out,
released on October 20 with
only a small number of changes since the 2.6.36-rc8 prepatch. Headline
features in this release include the AppArmor
security module, the LIRC infrared controller driver subsystem, and the new
Tile architecture. The fanotify notification mechanism for anti-malware
applications was disabled at the last minute due to
ABI concerns. See
the KernelNewbies 2.6.36
page for lots of information.
Linus also noted that a two-week merge window, starting on October 21, would run into
the Kernel Summit.
So I'm going to hope that we could perhaps even do the 2.6.37 -rc1
release and close the merge window the Sunday before KS opens. Since
2.6.36 was longer than usual (at least it felt that way), I wouldn't
mind having a 2.6.37 that is shorter than usual.
But holler if this really screws up any plans. Ten days instead of two
weeks? Let's see if it's even reasonably realistic.
(The 2.6.36
development cycle was 80 days, incidentally; almost exactly average for the
last few years).
Previously, 2.6.36-rc8 was released on October 15. "I
really hate doing this, and I'd much rather just release 2.6.36, but I'm
doing one last final -rc instead. There's too much noise here (and pending
in emails) for me to be happy, and while I first thought to just delay a
day or two, I'm now looking at next week instead, and thus the additional
-rc."
Stable updates: there have been no stable updates released in the
last week.
Comments (none posted)
I suspect there are a few places in MM/VFS/writeback which
could/should be using something like this. Of course, if we do
this then your nice little function will end up 250 lines long,
utterly incomprehensible and full of subtle bugs. We like things
to be that way.
--
Andrew Morton
In contrast, the "v2.6.25-rc1~1089^2~98" expression is actually
well-defined. There is no ambiguity there, but it's also obviously
not really all that human-readable.
--
Linus Torvalds
Frankly, I'm a lot more concerned about the locking being natural
for the data structures we have and easily understood. We *are*
fscking close to the complexity cliff, hopefully still on the right
side of it.
--
Al Viro
Comments (none posted)
By Jonathan Corbet
October 20, 2010
Nick Piggin's
VFS scalability
patch set has been under development for well over a year. Some pieces
were merged for 2.6.36, but the more complicated parts were deferred
because Nick thought they needed more work and more testing. Then things
went quiet; Nick changed jobs and went on vacation, so little work was done
for some time. Eventually it became clear that Nick was unlikely to get
the scalability work into shape for a 2.6.37 merge.
So Dave Chinner decided to jump in and work on these patches, and the code
breaking up the inode lock in particular. His first patch set was posted in
late September, with a number of revisions happening since. Dave worked on
splitting the patch series into smaller, more reviewable chunks. He also
took out some of the (to him) scarier changes. Subsequent revisions
brought larger changes, to the point that version 5 reads:
None of the patches are unchanged, and several of them are new or
completely rewritten, so any previous testing is completely
invalidated. I have not tried to optimise locking by using trylock
loops - anywhere that requires out-of-order locking drops locks and
regains the locks needed for the next operation. This approach
simplified the code and lead to several improvements in the patch
series (e.g. moving inode->i_lock inside writeback_single_inode(),
and the dispose_one_inode factoring) that would have gone unnoticed
if I'd gone down the same trylock loop path that Nick used.
According to Dave, this patch set helps with the scalability problems he
has been seeing, and other reviewers seem to think that the patch set is
starting to look fairly good.
But then Nick returned. While he welcomed the new interest in scalability
work, he did not take long to indicate that he was not pleased with the
direction in which Dave had taken his patches. He has posted a 35-part patch series which he
hopes to merge; the patch posting also details why he doesn't like Dave's
alternative approach. The ensuing discussion has been a bit rough in
spots, though it has remained mostly focused on the technical issues.
What it has not done, though, is to come up with any sort of conclusion.
There are two patch sets out there; both deal with the intersection of
the virtual filesystem layer and the memory management code. Much of the
contention seems to be over whether "VFS people" or "memory management
people" should have the ultimate say in how things are done. Given the
difficult nature of both patch sets and the imminent opening of the 2.6.37
merge window, it seems fairly safe to say that neither will be merged
unless Linus makes an executive decision. Pushing back this code to 2.6.38
will provide an opportunity for the patches to be discussed at length, and,
possibly, for the upcoming Kernel Summit to consider them as well.
Comments (4 posted)
By Jonathan Corbet
October 20, 2010
Dave Chinner recently
noticed a problem on
one of the kernel.org systems: the slab cache was using well over 2GB of
memory, mainly on radix tree nodes. Intrigued, he looked further into the
problem. It came down to the integrity measurement architecture (IMA)
security code, which uses the hardware TPM to help ensure that files on the
system have not been tampered with. IMA, it seems, was using a
radix tree to store integrity
information, indexed by inode address. Radix trees perform poorly with
sparse, unclustered keys, so IMA's usage was causing the creation of a
separate node for each inode in the system. That added up to a lot of
memory.
A number of questions came after this revelation, including:
- Why is IMA using such an inappropriate data structure?
- Why is it keeping all this information around even though it was
disabled on the system in question?
- Why was IMA configured into the kernel in the first place?
The answer to the first question seems to be that the IMA developers, as
part of the process of getting the code into the mainline, were not allowed
to expand the inode structure at all. So they created a separate tree for
per-inode information; it just happens that they chose the wrong type of
tree and never noticed how poorly it performed.
Question #2 is answered like this: the IMA code needs to keep track of
which files are opened for write access at any time. There is no point in
measuring the integrity of files (checksumming them, essentially) when they
can be changed at any time. Without tracking the state of all files all
the time, IMA can never know which files are opened for write access when
it first starts up. The only way to be sure, it seems, is to track all
files starting at boot time just in case somebody tries to turn IMA on at some
point.
As for #3: kernel.org was running a Fedora kernel, and the Fedora folks
turned on the feature because it looked like it might be useful to some
people. Nobody expected that it would have such an impact on systems where
it was not turned on. Some participants in the discussion have given the
Fedora kernel maintainers some grief for not having audited the code before
enabling it, but auditing everything in the kernel to that level is a bit
larger task than Fedora can really be expected to take on.
Eric Paris has started work on slimming IMA down; his patches work by moving the "open for write"
counts into the inode structure itself, eliminating the need to allocate
the separate IMA structures most of the time. IMA is also shifted over to
a red-black tree when it does need to track those structures. This work
eliminates almost all of the memory waste, but at the cost of growing the
inode structure slightly. That does not sit well with everybody,
especially, it seems, those developers who feel that IMA should not exist
in the first place. But it's a clear step in the right direction, so one
should expect something along these lines to be merged for 2.6.37.
Comments (13 posted)
Kernel development news
By Jonathan Corbet
October 20, 2010
Much of the time, patches can be developed against the mainline kernel and
submitted for the next merge window without trouble. At other times,
though, the mainline is far removed from the environment a patch will have
to fit into at merge time. Your editor, who has been trying the
(considerable) patience of the Video4Linux maintainer by trying to get a
driver merged for 2.6.37 at the last minute, has encountered this fact of
life the hard way: he submitted a driver which did not even come close to
compiling inside the 2.6.37 V4L2 tree. Things have changed considerably
there. This article will look at one of those changes with an eye
toward the kind of design decisions that are being made in that part of the
kernel.
The removal of the big kernel lock (BKL) has been documented here over the
years. One of the biggest holdouts at this point is the V4L2 subsystem;
almost everything that happens in a V4L2 driver is the result of an
ioctl() call, and those calls have always been protected by the
BKL. Removing BKL protection means auditing the drivers - and there are a
lot of them - and, in many cases, providing a replacement locking scheme.
It seems that a lot of V4L2 drivers - especially the older ones - do not
exhibit the sort of attention to locking that one would expect from code
submitted today.
The approach to this problem chosen by the V4L2 developers has proved to be
mildly controversial within the group: they have tried to make it possible
for driver authors to continue to avoid paying attention to locking. To
that end, the video_device structure has gained a new
lock field; it is a pointer to a mutex. If that field is
non-null, the V4L2 core will acquire the mutex before calling any of the
vast number of driver callbacks. So all driver operations are inherently
serialized and driver authors need not worry about things. At least, they
need not worry in the absence of other types of concurrency - like
interrupts.
Hans Verkuil, the developer behind many recent V4L2 improvements, clearly
feels that it's better to handle the
locking centrally:
If he wants to punish himself and do all the locking manually (and
prove that it is correct), then by all means, do so. If you want to
use the core locking support and so simplify your driver and allow
your brain to concentrate on getting the hardware to work, rather
than trying to get the locking right, then that's fine as well. As
a code reviewer I'd definitely prefer the latter approach as it
makes my life much easier.
On the other side, developers like Laurent Pinchart argue that trying to insulate developers from
locking is not the right approach:
Developers must not get told to be stupid and don't care about
locks just because other developers got it wrong in the past. If
people don't get locking right we need to educate them, not
encourage them to understand even less of it.
Central locking at the V4L2 level leads to some interesting problems as
well. The V4L2 user-space streaming API offers a pair of
ioctl() calls for the management of frame buffers:
VIDIOC_DQBUF to obtain a buffer from the driver, and
VIDIOC_QBUF to give a buffer back. If there are no buffers
available at the time of the call, VIDIOC_DQBUF will normally
block until a buffer becomes available. When this call is protected by the
BKL, blocking will automatically release the lock and enable other V4L2
operations to continue. That behavior is important: one of those other
operations might be a VIDIOC_QBUF call providing the buffer needed
to allow the VIDIOC_DQBUF call to proceed; if
VIDIOC_DQBUF fails to release the lock, things
could deadlock.
Drivers which handle their own locking will naturally release locks
before blocking in a situation like this. Either the driver author thinks
of it at the outset, or the need is made clear by disgruntled users later
on. If the driver author is not even aware that the lock exists, though,
it's less likely that the lock will be released at a time like this. That
could lead to surprises in drivers which do their own I/O buffer
management. If, however,
the driver uses videobuf,
this problem will be handled transparently
with some scary-looking code in
videobuf_waiton():
is_ext_locked = q->ext_lock && mutex_is_locked(q->ext_lock);
/* Release vdev lock to prevent this wait from blocking outside access to
the device. */
if (is_ext_locked)
mutex_unlock(q->ext_lock);
With enough due care, one assumes that it's possible to be sure that
unlocking a mutex acquired elsewhere is a reasonable thing to do here. But
one must hope that the driver author - who is not concerned with locking,
after all - has left things in a consistent state before calling
videobuf_waiton(). Otherwise those disgruntled users will
eventually make a return.
Locking complexity in the kernel is growing, and that makes it harder for
developers to come up to speed. Complex locking can be an especially
difficult challenge for somebody writing this type of driver; these authors
tend not to be full-time kernel developers. So the appeal of taking care
of locking for them and letting them concentrate on getting their hardware
to do reasonable things is clear, especially if it makes the code review
process easier as well. Such efforts may ultimately be
successful, but there can be no doubt that they will run into disagreement
from those who feel that kernel developers should either understand what is
going on or switch to Java development. Expect this sort of discussion to
pop up in a number of contexts as core developers try to make it easier for
others to write correct code.
Comments (2 posted)
By Jake Edge
October 20, 2010
User-space access to the kernel cryptography subsystem has reared its head
several times of late. We looked
at one proposal back in August that had a /dev/crypto
interface patterned after
similar
functionality in OpenBSD. There is another related effort, known as the NCR API, and
crypto API maintainer Herbert Xu has recently posted an RFC
for yet another. But giving user space the ability to request that the kernel
perform its computation-intensive crypto operations is not uncontroversial.
As noted back in August, some kernel hackers are skeptical that there would
be any performance gains by moving user-space crypto into the kernel. But
there are a number of systems, especially embedded systems, with dedicated
cryptographic hardware. Allowing user space to access that hardware will
likely result in performance gains, in fact 50-100x performance improvements
have been reported.
Another problem with both the /dev/crypto and NCR APIs
(collectively known as the cryptodev-linux modules) is the
addition of an enormous amount of code to the kernel to support
crypto algorithms beyond those that are already available. Those two
modules have adapted user-space
libraries for crypto and multi-precision integers and included them into
the kernel. They are necessary to support
some government crypto standards and certifications that require a
separation between user
space and crypto processing. So, the cryptodev-linux modules are trying to
solve two separate (or potentially separate) problems: user-space access to
crypto hardware acceleration and security standards compliance.
When Xu first put out an RFC on his idea
for the API (without any accompanying code) back in September, Christoph
Hellwig had a rather strongly worded
reaction:
doing crypto in kernel for userspace consumers [is] simply insane.
It's computational intensive code which has no business in kernel space
unless absolutely required (e.g. for kernel consumers). In addition
to that adding the context switch overhead and address space transitions
is god [awful] too.
Xu more or less agrees with Hellwig, but sees his API as a way to provide
access to the hardware crypto devices. Because Xu's API is based on
netlink sockets (as opposed to ioctl()-based or a brand new API that the
cryptodev-linux modules introduce), he is clearly hoping that it will provide a
way forward without requiring such large changes to the kernel:
FWIW I don't care about user-space using kernel software crypto at
all. It's the security people that do.
The purpose of the user-space API is to export the hardware crypto
devices to user-space. This means PCI devices mostly, as things
like aesni-intel [Intel AES instructions] can already be used without
kernel help.
Now as a side-effect if this means that we can shut the security
people up about adding another interface then all the better. But
I will certainly not go out of the way to add more crap to the
kernel for that purpose.
The netlink-based interface uses a new AF_ALG address family that gets passed to the initial
socket() call. There is also a new struct sockaddr_alg
that contains
information about what type of algorithm (e.g. "hash" or "skcipher") is to
be used as well as the specific algorithm name (e.g. "sha1" or "cbc(aes)")
that is being requested. That structure is then passed in the
bind() call on
the socket.
For things like hashing, where there is little or no additional information
needed, an accept() is done on the socket, which yields an
operation file descriptor. The data to be hashed is written to that descriptor
and, when there is no more data to be hashed, the appropriate number of
bytes (20 for sha1) are then read from the descriptor.
It is a bit more
complicated for ciphers.
Before accepting the connection on the socket, a key needs to be
established for a symmetric key cipher. That is done with a
setsockopt() call using the new SOL_ALG level and
ALG_SET_KEY option name and passing the key data and its
length. But there are additional parameters that need to be set up for
ciphers, and those are done using sendmsg().
A cipher will need to know which direction it is operating in
(i.e. encrypting or decrypting) and may need an initialization vector.
Those are
specified with the ALG_SET_OP and ALG_SET_IV messages.
Once the accept() has been done, those messages are sent to the
operational descriptor and the cipher is ready for use. Data can be sent as
messages or written to the operational descriptor, and the resulting data can then
be read from that descriptor.
There is an additional wrinkle for the "authenticated
encryption with associated data" (AEAD) block cipher mode, which can
include authentication information (i.e. message authentication code or
MAC) into the ciphertext stream. Because of that, AEAD requires two data
streams, one containing the data itself and another with the associated
authentication data (the MAC). This is handled in Xu's API by
doing two accept() calls, the first for the operational descriptor, and
the second for the associated data. If the cipher is operating in
encryption mode, both descriptors will be written to, while the encrypted data is
read from the operational descriptor. For decryption, the ciphertext is written to
the operational descriptor, while the plaintext and authentication data are read
from the two descriptors.
There hasn't been much discussion, yet, of the actual code posting, but
Xu's September posting elicited a number of complaints about performance,
most from proponents of the cryptodev-linux modules. But it would seem
that there is some real resistance to adding completely new APIs (as NCR does)
or to adding a complicated ioctl()-based API (as /dev/crypto
does). Now there are three competing solutions available, but it isn't at
all clear that any interface to the kernel crypto subsystem will be
acceptable to the kernel community at large. We will have to wait to see
how it all plays out.
Comments (21 posted)
October 20, 2010
This article was contributed by Steven Rostedt
Previous LWN articles have
explained the basic way to use Ftrace directly through the debugfs filesystem (part 1 and part 2). While the debugfs interface is rather simple, it can also be awkward to
work with. It is especially convenient, though, for embedded platforms where
it may be difficult to build and install special user tools on the device.
On the desktop, it may be more convenient to have a command-line tool
that works with Ftrace instead of echoing various commands into
strange files and reading the result from another file. This tool does
exist, and it is called trace-cmd.
trace-cmd is a user-space front-end command-line tool for
Ftrace. You can download it from the git repository at
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git. Some
distributions ship it as a package, and some that currently do not, will
soon. There are full man pages included, which are installed with a
make install_doc. This article will not go over the information
that is already in the man pages, but instead will explain a little about
how trace-cmd works and how to use it.
How it works
A simple use case of trace-cmd is to record a trace and then
report it.
# trace-cmd record -e ext4 ls
[...]
# trace-cmd report
version = 6
CPU 1 is empty
cpus=2
trace-cmd-7374 [000] 1062.484227: ext4_request_inode: \
dev 253:2 dir 40801 mode 33188
trace-cmd-7374 [000] 1062.484309: ext4_allocate_inode: \
dev 253:2 ino 10454 dir 40801 mode 33188
The above example enables the ext4 tracepoints for Ftrace, runs the
ls command and records the Ftrace data into a file named
trace.dat. The report command reads the
trace.dat file and outputs the tracing data to standard
output. Some metadata is also shown before the trace output is displayed: the version of the
file, any empty CPU buffers, and the number of CPUs that were recorded.
By default, the record and report options write and
read to the trace.dat file. You can use the -o or
-i options to pick a different file to write to or read from
respectively, but this article will use the default name when
referencing the data file created by trace-cmd.
When recording a trace, trace-cmd will fork off a process for
each CPU on the system. Each of these processes will open the file
in debugfs that represents the CPU the process is
dedicated to record from. The process recording CPU0 will open
/sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw, the process
recording CPU1 will open a similar file in the cpu1 directory, and
so on. The trace_pipe_raw file is a mapping directly to the
Ftrace internal buffer for each CPU. Each of these CPU processes will read
these files using splice to record into a temporary file during the
trace. At the end of the record, the main process will concatenate the
temporary files into a single trace.dat file.
There's no need to manually mount the debugfs filesystem before using
the tool as trace-cmd will look to see if and where it is mounted. If
debugfs is not mounted, it will automatically mount it at
/sys/kernel/debug.
Recording a trace
As noted above, trace-cmd
forks off a process for each CPU dedicated to record from that CPU but, in
order to prevent scheduling interference, the
threads are not pinned to a CPU. Pinning the threads to the CPU being
traced may result in better cache usage, so a future version of
trace-cmd may add an option to do that.
The Ftrace ring buffers are allocated one
per CPU, and each thread will read from a particular CPU's ring buffer. It is important
to mention this because these threads can show up in the trace.
A common
request is to have trace-cmd ignore events that are caused by
trace-cmd itself. But it is not wise to ignore these events because
they show where the tracer may have made an impact on what it is
tracing. These events can be filtered out after the trace, but they are
good to keep around in the trace.dat file in case some delay
was caused by the trace itself, and the events may show that.
As trace-cmd is a front end to Ftrace, the arguments of
record reflect some of the features of Ftrace. The -e
option enables an event. The argument following the -e can be an
event name, event subsystem name, or the special name all. The
all name will make trace-cmd enable all events that the
system supports. If a subsystem name is specified, then all events under that
subsystem will be enabled during the trace. For example, specifying
sched will enable all the events within the sched
subsystem. To enable a single event, the event name can be used by itself, or the
subsystem:event format can be used. If the subsystem name is left off, then all events with
the given name will be enabled. Currently this would not be an issue
because, as of this writing, all events have unique names. If more than
one event or subsystem is to be traced, then multiple -e options may
be specified.
Ftrace also has special plugin tracers that do not simply trace
specific events. These tracers include the function, function graph, and
latency tracers. Through the debugfs tracing directory, these
plugins are
enabled by echoing the type of tracer into the current_tracer
file. With trace-cmd record, they are enabled with the
-p option. Using the tracer plugin name as the argument for
-p enables that plugin. You can still specify one or more events
with a plugin, but you may only specify a single plugin, or no plugin at
all.
When the record is finished, trace-cmd examines the kernel buffers
and outputs some statistics, which may be a little confusing. Here's an
example:
Kernel buffer statistics:
Note: "entries" are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.
CPU: 0
entries: 0
overrun: 0
commit overrun: 0
CPU: 1
[...]
As the output explains, the entries field is not the number of
entries that were traced, but the number of entries left in the kernel
buffer. If entries were dropped because trace-cmd could not read
the buffer faster than it was being written to, and the writer overflowed
the buffer, then either the overrun or commit overrun
values would be something other than zero. The overrun value is
the number of entries that were dropped due to the buffer filling up, and
the writer deleting the older entries.
The commit overrun is much
less likely to occur. Writes to the buffer is a three step process. First
the writer reserves space in the ring buffer, then it writes to it, then it
commits the change. Writing to the buffer does not disable interrupts. If a
write is preempted by an interrupt, and the interrupt performs a large
amount of tracing where it fills the buffer up to the point of the space
that was reserved for
the write it preempted, then it must drop events because it cannot touch
the reserved space until it is committed. These dropped events
are the commit overrun. This is highly unlikely to happen unless
you have a small buffer.
Filtering while recording
As explained in "Secrets of the Ftrace function
tracer", Ftrace allows you to filter what functions will be traced by
the function tracer. Also, you can graph a single function, or a select set of
functions, with the function graph tracer. These features are not lost when
using
trace-cmd.
# trace-cmd record -p function -l 'sched_*' -n 'sched_slice'
(Note that the above does not specify a command to execute, so
trace-cmd will record until Ctrl^C is hit.)
The -l option is the same as echoing its argument into
set_ftrace_filter, and the -n option is the same as echoing
its argument into set_ftrace_notrace. You can have more than
one -l or -n option on the command
line. trace-cmd will simply write all the arguments into the
appropriate file. Note, those options are only useful with the
function and function_graph plugins. The -g
option (not shown) will pass its argument into the
set_graph_function file.
Here is a nice trick to see how long interrupts take in the kernel:
# trace-cmd record -p function_graph -l do_IRQ -e irq_handler_entry sleep 10
# trace-cmd report
version = 6
cpus=2
Xorg-4262 [001] 212767.154882: funcgraph_entry: | do_IRQ() {
Xorg-4262 [001] 212767.154887: irq_handler_entry: irq=21 name=sata_nv
Xorg-4262 [001] 212767.154952: funcgraph_exit: + 71.706 us | }
Xorg-4262 [001] 212767.156948: funcgraph_entry: | do_IRQ() {
Xorg-4262 [001] 212767.156952: irq_handler_entry: irq=22 name=ehci_hcd:usb1
Xorg-4262 [001] 212767.156955: irq_handler_entry: irq=22 name=NVidia CK804
Xorg-4262 [001] 212767.156985: funcgraph_exit: + 37.795 us | }
The events can also be filtered. To know what fields can be used for
filtering a specific event, look in the format file from
/sys/kernel/debug/tracing/events/<subsystem>/<event>/format,
or run
# trace-cmd report --events | less
on a
trace.dat
file that was created by the local system. The
--events argument
will list
the event formats of all events that were available in the system that
created the tracing file:
$ trace-cmd report --events
[...]
name: kmalloc_node
ID: 338
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;
field:unsigned long call_site; offset:16; size:8; signed:0;
field:const void * ptr; offset:24; size:8; signed:0;
field:size_t bytes_req; offset:32; size:8; signed:0;
[...]
Using the kmalloc_node event, we can filter on all requests
that were greater than 1000 bytes:
# trace-cmd record -e kmalloc_node -f 'bytes_req > 1000'
The -f option specifies a filter for the event (specified in a
-e option) preceding it.
Reading the trace
As the initial example showed, to read the trace simply run the
report command. By default, it will read the trace.dat
file, unless the -i option specifies a different file to read, or
the input file may simply be specified as the last argument.
$ trace-cmd report -i mytrace.dat
$ trace-cmd report mytrace.dat
The above two examples give the same result. The report command is not
a privileged operation and only requires read permission on the data file
it is reading.
$ trace-cmd report
version = 6
cpus=2
trace-cmd-8412 [000] 13140.422056: sched_switch: 8412:120:S ==> 0:120: swapper
<idle>-0 [000] 13140.422068: power_start: type=1 state=2
<idle>-0 [000] 13140.422174: irq_handler_entry: irq=0 handler=timer
<idle>-0 [000] 13140.422180: irq_handler_exit: irq=0 return=handled
The output is similar to what you would see in
/sys/kernel/debug/tracing/trace.
Having the trace data in a file gives some advantages over reading from
a
debugfs file. We can now easily filter what events we want to see, or pick
a specific CPU to output.
You can do extensive filtering on
events and what CPUs you want to focus on:
$ trace-cmd report --cpu 0 -F 'sched_wakeup: success == 1'
version = 6
cpus=2
ls-8414 [000] 13140.423106: sched_wakeup: 8414:?:? + 8412:120:? trace-cmd Success
ls-8414 [000] 13140.424179: sched_wakeup: 8414:?:? + 1155:120:? kondemand/0 Success
ls-8414 [000] 13140.426925: sched_wakeup: 8414:?:? + 704:120:? phy0 Success
ls-8414 [000] 13140.431172: sched_wakeup: 8414:?:? + 9:120:? events/0 Success
events/0-9 [000] 13140.431182: sched_wakeup: 9:?:? + 11734:120:? sshd Success
ls-8414 [000] 13140.434173: sched_wakeup: 8414:?:? + 1155:120:? kondemand/0 Success
The --cpu 0 limits the output to only show the events that
occurred on CPU 0. The -F option limits
the output further to only show sched_wakeup events that have its
success field equal to 1. For more information about the filtering,
consult the trace-cmd-report(1) man page.
Tracing over the network
There may be situations where you want to trace an embedded device or some
machine with very little disk space. Perhaps another machine has lots of
disk space and you want to record the trace to that machine or maybe you
are tracing the filesystem itself and minimal interference to that
code is needed. This is where tracing over the network comes in handy.
To set up a trace server, simply run something like the following command:
$ trace-cmd listen -p 12345 -D -d /images/tracing/ -l /images/tracing/logfile
The only required option in the above is the -p option, which
tells trace-cmd what port to listen on. The -D puts
trace-cmd into daemon mode, while the -d /images/tracing/ tells
trace-cmd to output the trace files from the connections it
receives into the /images/tracing/ directory. Obviously, any
directory you have write permission for can be used here. The -l /images/tracing/logfile tells trace-cmd to not write messages
to standard output, but to the /images/tracing/logfile file
instead. The listen command is not
privileged, and can be run by any user.
On the embedded device (or whatever client is used), instead of
specifying a output file in the trace-cmd record the -N
option is used followed by host:port syntax.
# trace-cmd record -N gandalf:12345 -e sched_switch -e sched_wakeup -e irq hackbench 50
Back on the host gandalf a file is made in the
/images/tracing/ directory of the format
"trace.<client-host>:<client-port>.dat".
$ ls /images/tracing/
logfile trace.frodo:35287.dat
$ cat /images/tracing/logfile
[29078]Connected with frodo:35287
[29078]cpus=2
[29078]pagesize=4096
$ trace-cmd report /images/tracing/trace.frodo\:35287.dat
version = 6
cpus=2
<...>-17215 [000] 19858.840695: sched_switch: 17215:120:S ==> 0:120: swapper
<idle>-0 [000] 19858.840934: irq_handler_entry: irq=30 handler=iwl3945
<idle>-0 [000] 19858.840959: irq_handler_exit: irq=30 return=handled
<idle>-0 [000] 19858.840960: softirq_entry: softirq=6 action=TASKLET
<idle>-0 [000] 19858.841005: softirq_exit: softirq=6 action=TASKLET
[...]
trace-cmd is versatile enough to handle heterogeneous systems.
All the information needed to create and read the trace.dat file
is passed from the client to the host. The host could be a 64-bit x86 and
the client a 32-bit PowerPC and the above would not change. A big-endian
machine can read a little-endian file and vice versa. The compatibility to
read various system types is not limited to
network tracing. If a trace is performed on a big-endian 32-bit system, the
resulting file can still be read from a little-endian 64-bit system.
For the lazy Ftrace user
If using the internal kernel
Ftrace buffer is sufficient and there is no need to record the trace,
trace-cmd can still be useful. Pretty
much all of the record options can be used with the trace-cmd start
command. start does not create a trace.dat
file, but simply starts Ftrace.
Similarly, the stop command is just a convenient way to do:
$ echo 0 > /sys/kernel/debug/tracing/tracing_on
For example:
# trace-cmd start -p function_graph -g ip_rcv
# sleep 10
# trace-cmd stop
# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) | ip_rcv() {
1) | T.769() {
1) | nf_hook_slow() {
1) 0.497 us | add_preempt_count();
1) | nf_iterate() {
1) 0.458 us | ip_sabotage_in();
If there is a case where the trace needs to be converted
into a trace.dat file, the extract command can be
used. After
the above trace was done:
# trace-cmd extract -o kernel-buf.dat
# trace-cmd report kernel-buf.dat
version = 6
cpus=2
<idle>-0 [001] 214146.661193: funcgraph_entry: | ip_rcv() {
<idle>-0 [001] 214146.661196: funcgraph_entry: | T.769() {
<idle>-0 [001] 214146.661197: funcgraph_entry: | nf_hook_slow() {
<idle>-0 [001] 214146.661197: funcgraph_entry: 0.497 us | add_preempt_count();
<idle>-0 [001] 214146.661198: funcgraph_entry: | nf_iterate() {
<idle>-0 [001] 214146.661199: funcgraph_entry: 0.458 us | ip_sabotage_in();
To disable all tracing, which will ensure that no overhead is left from using
the function tracers or events, the reset command can be used. It
will disable all of Ftrace and bring the system back to full performance.
# trace-cmd reset
What's next?
This article explains some of the use cases for
trace-cmd. There is still more that it can do but we just did not
have the space to fit it all in. This article and the
trace-cmd man pages should be enough to get you on your way to
using trace-cmd in a productive manner.
So what's next? As this article shows how trace-cmd is a
front-end tool for Ftrace, the next article will present kernelshark
which is a graphical front-end to trace-cmd. Stay tuned.
Comments (2 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>