Brief items
The current development kernel is 2.6.33-rc1,
released by Linus on
December 17. The 2.6.33 merge window is now closed; significant
patches merged since
last week's
summary include a direct rendering driver for the VMware virtual GPU,
along with drivers for Maxim 8660/8661 voltage regulators, Marvell 88PM8607
PMIC devices, STMicroeletronics LIS3LV02Dx accelerometers, LaCie Network
Space v2 NAS boards, DesignWare SPI controller core devices, Samsung
S3C64XX series type SPI controllers, MDIO buses on Octeon SOCs, Octeon
Management port Ethernet controllers, Cisco PowerTV platforms, and HP Smart
Array SCSI controllers.
There have been a small number of patches merged since the 2.6.33-rc1
release; along with fixes, these include the kfifo API rework patch set.
Stable updates: The 2.6.27.42, 2.6.31.9, and 2.6.32.2 stable kernel updates were released
on December 18. Each contains another long list of important fixes
all across the kernel tree.
Comments (none posted)
For an added bonus make sure the rant includes something highly
quotable and funny as a put down, then you'll make lwn.net as well
--
Alan Cox
We need a sane scripting language available to the kernel build,
and given all the problems we have had with different versions or
even just sometimes different builds of sh, awk, and even bc --
plus the fact that those utilities just don't necessarily do what
we want makes it very frustrating. Personally I think a dependency
on Perl is better than the mess we're in; I understand other people
disagree. What is definitely not acceptable, however, is the
status quo. The situation is, quite frankly, ridiculous enough
that perhaps the right thing to do is to write a small scripting
engine and bundle it with the kernel.
--
H. Peter Anvin
The two-week merge window is _not_ supposed to be "one day merge
window after thirteen days of silence". In fact, I think that next
time around I'll make the merge window be 11-12 days instead, and
people who try to game the system and do a last-minute pull request
will get a surprise, and get unceremoniously bumped to 2.6.35
instead.
--
Linus Torvalds
Comments (11 posted)
By Jonathan Corbet
December 22, 2009
One of a kernel developer's best friends is the
printk() function,
which works much like
printf() in user-space programs. There
are some differences, though, including the existence of various levels of
logging. The convention used is a little funny, with the logging level
being a short string prepended to the format string. So a warning might be
printed this way:
printk(KERN_WARNING "Core meltdown imminent\n");
This form is not universally loved, though; some call it verbose, making it
hard to make lines fit within 80 columns, and the severity string is easy
to forget. As an alternative, the 2.6.28-rc5 kernel saw the addition of a
set of pr_*() macros, written by Martin Schwidefsky, which are
designed to make life a little easier. So, for example, the above warning
could be rewritten as:
pr_warning("Software patents detected\n");
These macros sat in relative obscurity for a few development cycles until
Joe Perches decided to switch a number of printk() statements in
the core kernel. That led to an outburst
from Peter Zijlstra and the eventual reverting of the change. Peter says:
I might be strange, but if I want to print something in C I write
print[fk]() and be done with it, there's no reason what so ever to
introduce fancy wankery for this. We try to stick to ANSI-C as
much as possible, we've got kalloc,kfree,strcmp,strnlen and all the
other 'regular' C bits, deviating from that serves no purpose but
seed confusion.
Chances are, there won't be any more such conversions near that part of the
kernel. But the pr_*() macros won't be going away either. Their
real purpose, perhaps, was best expressed
by Arjan van de Ven: "pr_ is really just for 'I am a driver and want
a single line message out in a standardized format'. Nothing wrong with
that."
Comments (4 posted)
By Jonathan Corbet
December 23, 2009
Tejun Heo's
concurrency-managed
workqueues patch was discussed here in October. This work has been
progressing, with some of the associated clean-up patches being merged for
2.6.33; the main part of the work would appear to be on a path for merging
in 2.6.34. Or maybe not: some developers are starting to express some
qualms.
The loudest complaints come from Peter Zijlstra, who would rather see
effort go into converting workqueue users to using threaded interrupt
handlers instead. To developers like Peter, the new workqueues look like a
bunch of new complexity which could create new problems (management of
CPU-intensive workqueue tasks, for example) while failing to address other
issues, including the locking problems which can plague workqueue users
now.
Tejun has responded with a description of
some of the problems being solved by the redone workqueues, concluding:
Shifting complexity out of peripheral code to better crafted and
managed core code is the right thing to do and it will shift a lot
of complexity out of peripheral codes.
That may actually be where some of the trouble lies: the patch set, in its
current state, does not really demonstrate this shift in complexity. So
Ingo Molnar has requested some example
conversions that show the advantages of concurrency-managed workqueues:
For this particular patchset it should be possible to identify
existing patterns of code in the existing code base of 6+ millions
lines of Linux driver code that would make the advantages of this
+2000 lines of core kernel code plain obvious. There were multiple
claims of problems with the current abstractions - so there sure
must be a way to show off the new code in a few places.
Tejun has indicated that he will work to provide this demonstration.
Should the next version of the patch set prove convincing on this front,
the new workqueues might still be on-track for 2.6.34.
Comments (none posted)
Kernel development news
By Jonathan Corbet
December 22, 2009
The 2.6.33 merge window has run its course, and a great deal of code has
been merged into the mainline. The merge window always seems like a bit of
a game of musical chairs, though: when the music stops, at least one
project tends to be left conspicuously standing. This time around, two
projects were left without a chair in the mainline despite having sent in
pull requests: the
Ceph
distributed filesystem and the
AlacrityVM hypervisor code.
Often, originators of ignored pull requests are left in silence to wonder why those
requests were not acted upon. This time around, Linus explained the missing pulls: there didn't seem
to be enough interest in those features. As he put it:
The best thing to do is to try to have users that are vocal about
the feature, and talk about how great it is. Some advocates for it,
in other words. Just a few other people saying "hey, I use this,
it's great", is actually a big deal to me. For alacrityvm and
cephfs, I didn't have that, or they just weren't loud enough for me
to hear.
Sun CEO Scott McNealy once remarked that free software is like a free
puppy. There is some truth to that remark in general, and with respect to
code pulled into the kernel in particular. The code itself comes for free,
with a nice, GPL-compatible license attached to it. But the kernel
maintainers know that this new code is likely to make a few messes around
the house and chew up their favorite pair of slippers before it is properly
trained. It also must be fed and taken for an occasional visit to the
veterinarian for years into the future. So it is important to be sure
that, at a minimum, this is a puppy that users actually want. That is why
Linus is asking for users to express their support for proposed new
features.
Getting that support can be a bit of a catch-22 situation, though. It
takes a dedicated user indeed to grab an in-progress patch and build it
into their own kernel; most users will not do that. Life can be easier if
distributions package proposed code, giving users a chance to test it out
without having to build and install a new kernel, but distributors can get
into trouble for doing that. The
recent fuss over Nouveau was a clear example of unhappiness about
shipping out-of-tree kernel code. Similarly, a few years ago, SUSE shipped
AppArmor without merging it
first, drawing this complaint
from Andrew Morton:
Sigh. Please don't put us in this position again. Get stuff
upstream before shipping it to customers, OK? It ain't rocket
science.
But getting the customers to request the software - within hearing of Linus
Torvalds - before it has been either merged or shipped to them can indeed
seem like rocket science at times.
There has been at least one public request for the merging of Ceph in a
future development cycle. The bar may be even higher for AlacrityVM,
though. There does not appear to be crowd of users asking for a new set of
virtualized device drivers which are meant to be used with an out-of-tree
virtualization mechanism. Beyond that, past discussions about this code
have been long and heated, with some significant disagreements between
AlacrityVM developer Gregory Haskins and (in particular) the KVM
developers.
This history led Ingo Molnar to post a
reminder of flame wars past and a request that Gregory try harder to
work with the KVM development community. Needless to say, this posting has
started another extensive discussion, with Gregory stating that he has tried hard indeed to work
with the other developers, and that, in any case, the current AlacrityVM
posting, which consists mostly of drivers, is not relevant to KVM. From
there, the discussion moved into whether this work is really necessary, the
best approaches to improving I/O performance in virtualized guests, and so
on.
It's not clear that there is an obvious solution to this particular
disagreement other than having serious users try out the various solutions
and report on what works best. That will be hard to do with an out-of-tree
virtualization solution, but the existence of this kind of controversy will
only make getting the code into the mainline harder. Linus was quite clear on that:
So when I see another virtualization interface, I want the
virtualization people to just argue it out amongst
themselves. Thanks to the virtue of me personally not caring one
whit about virtualization, I can stand back and just watch the
fireworks.
Which is not to say that I enjoy it (I like the occasional
flame-fest, but in order to like them I need to _care_ enough to
get fired up about them!). So I just don't want the in-fighting to
take place in my tree, so I'd rather see the fighting die out
_before_ I actually pull.
This code was
developed by SUSE, which, presumably, wishes to provide AlacrityVM to its
customers. This may be one of those situations where the distributor has
no choice but to ship the code ahead of mainline integration, just to get
the user feedback that shows it's worthwhile. That course has risks: the
code may never be merged, or it may suffer incompatible changes on its way
into the mainline later on. But the alternative may be to see this code
languish on the sideline indefinitely.
Comments (3 posted)
December 21, 2009
This article was contributed by Matthew Garrett
The Advanced Configuration and Power Interface specification (ACPI)
was introduced to replace the myriad of differing protocols for
providing configuration data to the operating system. It additionally
provided a new power control specification to replace APM, moving
policy decisions out of the hard-coded BIOS and into the operating
system. Since then it's had a number of extensions implementing all
kinds of functionality, variously specified and unspecified. Hardware
vendors have seized upon this to implement their own custom "value
add" interfaces, taking advantage of the existing specified
functionality while adding their own non-standard extensions.
In this article we'll be looking at writing a driver to manage a
persistent irritation on Toshiba hardware: the case of the missing
Bluetooth. Some Toshibas will boot without Bluetooth, or will manage to
lose it somewhere between being suspended and resumed. Sometimes it'll
be there until the owner hits the rfkill switch, at which point it's
gone no matter how plaintively the user flicks the switch back and
forth. In short, the Bluetooth interface is fickle, flaky and not to be
relied upon.
We're lucky, though. Toshiba implemented their Bluetooth control in
the form of an ACPI device. At this point some of you may feel that
this is some unusual meaning of the word "lucky", but it's really not
as bad as it could be.
First, we'll need one of the ACPI tables. ACPI
tables are sections of information provided to the operating system by
the BIOS; they contain either blocks of configuration information or,
alternatively, executable code in a compiled bytecode called AML. The
table that we want is the "Discrete System Descriptor Table", or
DSDT. This provides a set of configuration information and control
methods for the system hardware. On Linux, it can be found in
/sys/firmware/acpi/tables/DSDT. We need to decompile it from the AML
bytecode to ASL (ACPI Source Language), which can be done with iasl -
the Intel ACPI compiler. This will typically be available as a package
in distributions but can also be downloaded as source from
acpica.org . The -d
option to iasl decompiles an
executable table to something resembling the original source. For
reference there's an example of a decompiled
DSDT here,
and it contains the devices and methods discussed in the rest of this
article.
Looking at the decompiled DSDT, the first thing we realise is that
there's a huge pile of junk and extraneous configuration in here, so
let's try to find something useful. First of all, let's look for
interesting devices. ACPI device names are limited to four characters,
which is generally not helpful in finding something interesting from
scratch. Thankfully there's also the _HID string, which provides a tag
for identifying the type of device. These strings use the same
namespace as old ISA PNP devices, so some of them may be familiar to
those of you who spent too long cursing at IRQ and IO settings in the
bad old days. So ignore anything with a _HID tag that starts with
PNP -
it's some piece of standardized system hardware that's unlikely to be
doing anything interesting.
On this Toshiba, that leaves us with 5 devices - NSC1100, SMCF030,
TOS6205, ACPI0003 and TOS6208. According to the ACPI specification,
ACPI0003 is an AC adapter. So ignore that. Google says that NSC1100 is
a TPM device. SMCF030 is an infrared port. So that leaves TOS6205 and
TOS6208, which look something like this:
Device (BT)
{
Name (_HID, "TOS6205")
...
}
Device (VALZ)
{
Name (_HID, "TOS6208")
...
}
VALZ turns out to be the generic event and control interface for all
kinds of other bits of laptop functionality. There's already a driver
for this in the kernel (toshiba_acpi.c), so let's ignore that. The one
called BT certainly sounds like a better bet, so TOS6205 it is.
At this point we can write a skeleton driver that does nothing other
than bind to this ACPI device. It's only a few lines of code to do that,
and it's consistent between all ACPI drivers. All we need to do is
register an ACPI driver structure with add and remove functions. These
will be called whenever the kernel finds an ACPI device with the TOS6205
ID, and we can do further setup there.
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/types.h>
#include <acpi/acpi_bus.h>
#include <acpi/acpi_drivers.h>
static int toshiba_bt_rfkill_add(struct acpi_device *device);
static int toshiba_bt_rfkill_remove(struct acpi_device *device, int type);
static const struct acpi_device_id bt_device_ids[] = {
{ "TOS6205", 0},
{ "", 0},
};
MODULE_DEVICE_TABLE(acpi, bt_device_ids);
static struct acpi_driver toshiba_bt_driver = {
.name = "Toshiba BT",
.class = "Toshiba",
.ids = bt_device_ids,
.ops = {
.add = toshiba_bt_rfkill_add,
.remove = toshiba_bt_rfkill_remove,
},
.owner = THIS_MODULE,
};
static int toshiba_bt_rfkill_add(struct acpi_device *device)
{
return 0;
}
static int __init toshiba_bt_rfkill_init(void)
{
int result = 0;
result = acpi_bus_register_driver(&toshiba_bt_driver);
if (result < 0) {
ACPI_DEBUG_PRINT((ACPI_DB_ERROR,
"Error registering driver\n"));
return -ENODEV;
}
return 0;
}
static int toshiba_bt_rfkill_remove(struct acpi_device *device, int type)
{
return 0;
}
static void __exit toshiba_bt_rfkill_exit(void)
{
acpi_bus_unregister_driver(&toshiba_bt_driver);
}
module_init(toshiba_bt_rfkill_init);
module_exit(toshiba_bt_rfkill_exit);
Now what? Let's look at the device's methods. These are functions
associated with the device, and will typically be declared in the same
scope block. For the Toshiba Bluetooth device, we see something like this:
Device (BT)
{
Name (_HID "TOS6205")
Method (_STA, 0, Notserialized)
{
...
}
Method (AUSB, 0, Notserialized)
{
...
}
Method (DUSB, 0, Notserialized)
{
...
}
Method (BTPO, 0, Notserialized)
{
...
}
Method (BTPF, 0, Notserialized)
{
...
}
Method (BTST, 0, Notserialized)
{
...
}
}
The device has an _STA method. Methods with a leading _
are supposed to
be reserved for the ACPI specification. _STA is defined as returning
the device status, and, in this case, will tell us whether the TOS6205
device is functional or not. If there's no Bluetooth on the machine,
it'll return zero. The ACPI core handles this for us, so we can ignore
it.
The other methods are BTST, AUSB, DUSB,
BTPO and BTPF. Working out
what these do can be awkward. The DSDT I'm looking at (and which is
linked above) has methods that read and write from a set of objects
which reflect the hardware state in some way. DUSB writes a zero to an
object called BTDT, while AUSB writes one to it.
BTPF writes a zero to an
object called BTPW, while BTPO writes one to
it. BTST returns a byte
with the value of BTSW in the low bit, BTDT in bit six,
and BTPW in bit
seven. Looking at the DSDT, we can see that BTDT, BTPW
and BTSW are
defined in a block that looks like this:
OperationRegion (ERAM, EmbeddedControl, 0x00, 0xFF)
Field (ERAM, ByteAcc, Lock, Preserve) {
BTDT, 1
BTPW, 1
...
BTSW, 1
...
}
An OperationRegion is ACPI speak for an addressable system resource -
it may refer to an area of memory, system IO space, a PCI
configuration block or (in this case) the registers of the system
embedded controller. Objects can be declared within that block in
order to let the ACPI code read and write to them. In this case, BTDT,
BTPW and BTSW all refer to sections of the embedded controller
register space. The number after the comma is the number of bits the
object takes up, so we now know that BTDT is the first bit of the
first byte-sized register of the embedded controller, BTPW the second
bit and BTSW the fifth. Writing to these values will probably cause
the embedded controller (a microprocessor running its own firmware) to
perform some act in return - alternatively, an external event (such as
flicking a switch) may generate an event picked up by the embedded
controller and causing it to update a register's contents.
(Side note: generally speaking, accessing the embedded controller
registers directly from the operating system is an error. Vendors may
alter the embedded controller firmware and the layout of the bits
between different models or even different BIOS versions. When this
happens, ACPI methods will be updated to match - so if there's an ACPI
method to call to interface with the embedded controller, use that
rather than trying to drive it directly. This kind of thing can also
act in our favour - ACPI interfaces tend to be retained over a range
of models, even if the underlying hardware has changed
dramatically. One machine may make an opaque system management call in
response to an ACPI method, while another with the same method may
return a register directly. By comparing the tables from different
machines with the same interface, it's often possible to learn more
about what these methods are actually meant to do).
We can make pretty good guesses about what's going on here. BTST
returns a set of register values, so is probably the BlueTooth device
STatus. BTSW presumably contains the state of the kill switch on the
laptop. BTPW sounds like it's the power state - so if
BTPF and BTPO
change that, they're presumably BlueTooth Power ofF and BlueTooth
Power On. AUSB and DUSB sound like Attach USB and Detach USB.
So at this point, we have enough knowledge of the interface to
implement a Bluetooth enabler driver. On module load we should check
the switch status. If it's on, we should enable the Bluetooth.
static int toshiba_bluetooth_enable(acpi_handle handle)
{
acpi_status result;
acpi_integer status;
result = acpi_evaluate_integer(handle, "BTST", NULL, &status);
if (ACPI_FAILURE(result))
return -EINVAL;
/* Check the current status */
if (!(status & (1 << 0))) /* The switch is off */
return -EBUSY;
if (!(status & (1 << 6))) /* The device is detached - attach it */
result = acpi_evaluate_object(handle, "AUSB", NULL, NULL);
if (ACPI_FAILURE(result)) {
printk(KERN_ERR "Failed to reattach Toshiba Bluetooth device\n");
return -ENODEV;
}
if (!(status & (1 << 7))) /* The device isn't powered up */
result = acpi_evaluate_object(handle, "BTPO", NULL, NULL);
if (ACPI_FAILURE(result)) {
printk(KERN_ERR "Failed to power on Toshiba Bluetooth device\n");
return -ENODEV;
}
return 0;
}
static int toshiba_bt_rfkill_add(struct acpi_device *device)
{
return toshiba_bluetooth_enable(device->handle);
}
acpi_evaluate_object() asks the kernel's ACPI interpreter to find a
method underneath a specific area of namespace (in this case, the
namespace of the device that we've bound to - the first argument
provides that), with the third and fourth arguments providing the
arguments to be passed to the function and the results returned from
the function respectively. acpi_evaluate_integer() is similar, but
assumes that the method is going to return an integer.
(Random stylistic sidepoint: When looking at individual bits inside a
returned integer, I find it neater to refer to the 0th bit as
"1 << 0" rather than 1. There's no especially good
reason for this).
Ok. What about when the user flicks the switch back? ACPI can catch
various hardware events and then signal them to the operating
system. This is done via the "Notify" keyword. Grepping for "Notify"
in the DSDT reveals a pile of cases we don't care about, but also a
couple of lines like:
Notify (BT, 0x90)
The first argument to notify is the device to be notified. The kernel
will check whether a driver is bound to this device, and if so will
pass the notification on to the driver. 0x90 is simply the event type
- numbers below 0x80 are device-independent and defined by the ACPI
specification, whereas those above are device-specific and only
defined by the ACPI specification if the device itself is defined by
the ACPI specification. Toshiba has presumably defined what 0x90
means, but given that it's the only event sent by the device we'll
just assume it means "Reattach the Bluetooth device". This machine
doesn't seem to send an event when Bluetooth device is turned off, but
that's less of a concern because the USB device vanishes anyway -
there's nothing we need to do in response. Anyway, let's hook up a
notification function.
.ops = {
.add = toshiba_bt_rfkill_add,
.remove = toshiba_bt_rfkill_remove,
.notify = toshiba_bt_rfkill_notify,
},
static void toshiba_bt_rfkill_notify(struct acpi_device *device, u32 event)
{
toshiba_bluetooth_enable(device->handle);
}
We ignore the event type in this case because some machines seem to
send different numbers here, and we don't really get useful
information from it. Other devices may use different event values to
indicate the type of event that's been received, and in those cases
you'd want to check it appropriately.
At this point we now have a driver that enables Bluetooth at boot (if
it's enabled) and responds to the switch being flicked by enabling
Bluetooth again. The only remaining case is the one where the user
turned off Bluetooth while the system was running, suspended and then
flicked the switch back to enable. Let's check again on resume.
.ops = {
.add = toshiba_bt_rfkill_add,
.remove = toshiba_bt_rfkill_remove,
.notify = toshiba_bt_rfkill_notify,
.resume = toshiba_bt_rfkill_resume,
},
static int toshiba_bt_rfkill_resume(struct acpi_device *device)
{
toshiba_bluetooth_enable(device->handle);
return 0;
}
And that's it - a fully functional ACPI driver. The TOS6205 device is
one of the simplest ACPI devices I've found, but the principle is the
same for any other. In more complex cases you'll want to expose some
kind of userspace interface to perform method calls on the
device. These should use standard kernel interfaces wherever
possible. Backlight control should be carried out via the backlight
class, device hotkeys should be sent via an input device and more
complex radio control should use the rfkill layer. More
device-specific functionality may require you to add sysfs attributes
directly, which is somewhat outside the scope of this article.
It would be nice to hope for these ACPI interfaces to become
standardized over time, but unfortunately there seems to be little
willingness on the part of the companies involved to do so. Some
vendors are even moving away from using ACPI directly and are instead
using WMI interfaces - a mechanism intended for exposing system
management information to Windows applications, but easily subverted
into a general purpose control system with even less transparency than
pure ACPI. But that's a subject for a different article, and there are
still many pieces of hardware with pure ACPI interfaces and no
drivers written as yet.
Comments (18 posted)
December 22, 2009
This article was contributed by Steven Rostedt
The Ftrace tracing utility has many different features that will assist
in tracking down Linux kernel problems. The previous
article discussed setting up Ftrace, using the function and function graph
tracers, using trace_printk(), and a simple way to stop the recording
of a trace from user space. This installment will touch on how user space
can interact with Ftrace, faster ways of stopping the trace, debugging a
crash, and finding what kernel functions are the biggest stack hogs.
Trace Markers
Seeing what happens inside the kernel gives the user a better
understanding of how their system works. But sometimes there needs to be
coordination between what is happening in user space and what is happening
inside the kernel. The timestamps that are shown in the traces are all
relative to what is happening within the trace, but they do not correspond
well with wall time.
To help synchronize between the actions in user space and kernel space,
the trace_marker file was created. It provides a way to write into the
Ftrace ring buffer from user space. This marker will then appear in the trace
to give a location in the trace of where a specific event occurred.
[tracing]# echo hello world > trace_marker
[tracing]# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-3718 [001] 5546.183420: 0: hello world
The <...> indicates that the name of the task that
wrote the marker was not recorded. Future releases may fix this.
Starting, Stopping and Recording in a Program
The tracing_on and trace_marker
files work very well to trace the activities of an application if the
source of the application is available. If there is a problem within the
application and you need to find out what is happening inside the kernel at
a particular location of the application, these two files come in
handy.
At the start of the application, you can open
these files to have the file descriptors ready:
int trace_fd = -1;
int marker_fd = -1;
int main(int argc, char *argv)
{
char *debugfs;
char path[256];
[...]
debugfs = find_debugfs();
if (debugfs) {
strcpy(path, debugfs);
strcat(path,"/tracing/tracing_on");
trace_fd = open(path, O_WRONLY);
if (trace_fd >= 0)
write(trace_fd, "1", 1);
strcpy(path, debugfs);
strcat(path,"/tracing/trace_marker");
marker_fd = open(path, O_WRONLY);
Then, at some critical location in the code, markers can be placed
to show where the application currently is:
if (marker_fd >= 0)
write(marker_fd, "In critical area\n", 17);
if (critical_function() < 0) {
/* we failed! */
if (trace_fd >= 0)
write(trace_fd, "0", 1);
}
In looking at the example, first you see a function
called "find_debugfs()". The proper location to mount the debug file system
is at /sys/kernel/debug but a robust tool should not depend on the
debug file system being mounted there. An example of
find_debugfs() is located here.
The file descriptors are initialized to -1 to allow this code to work
both with and without a tracing enabled kernel.
When the problem is detected, writing the ASCII character "0"
into the trace_fd file descriptor stops tracing. As discussed
in part 1, this only disables the recording into the Ftrace ring buffer,
but the tracers are still incurring overhead.
When using the initialization code above, tracing will be enabled
at the beginning of the application because
the tracer runs in overwrite mode. That is, when the trace buffer
fills up, it will remove the old data and replace it with the new.
Since only the most recent trace information is relevant when the problem
occurs
there is no need to stop and start the tracing during the normal
running of the application. The tracer only needs to be disabled when
the problem is detected so the trace will have the history of what led
up to the error. If interval tracing is needed within the application, it can
write an ASCII "1" into the trace_fd to enable the tracing.
Here is an example of a simple program called simple_trace.c
that uses
the initialization process described above:
req.tv_sec = 0;
req.tv_nsec = 1000;
write(marker_fd, "before nano\n", 12);
nanosleep(&req, NULL);
write(marker_fd, "after nano\n", 11);
write(trace_fd, "0", 1);
(No error checking was added due to this being a simple program for
example purposes only.)
Here is the process to trace this simple program:
[tracing]# echo 0 > tracing_on
[tracing]# echo function_graph > current_tracer
[tracing]# ~/simple_trace
[tracing]# cat trace
The first line disables tracing because the program will enable it at
start up. Next the function graph tracer is selected. The program is
executed, which results in the following trace. Note that the output can
be a little verbose so much of it has been cut and replaced with
[...]:
[...]
0) | __kmalloc() {
0) 0.528 us | get_slab();
0) 2.271 us | }
0) | /* before nano */
0) | kfree() {
0) 0.475 us | __phys_addr();
0) 2.062 us | }
0) 0.608 us | inotify_inode_queue_event();
0) 0.485 us | __fsnotify_parent();
[...]
1) 0.523 us | _spin_unlock();
0) 0.495 us | current_kernel_time();
1) | it_real_fn() {
0) 1.602 us | }
1) 0.728 us | __rcu_read_lock();
0) | sys_nanosleep() {
0) | hrtimer_nanosleep() {
0) 0.526 us | hrtimer_init();
1) 0.418 us | __rcu_read_lock();
0) | do_nanosleep() {
1) 1.114 us | _spin_lock_irqsave();
[...]
0) | __kmalloc() {
1) 2.760 us | }
0) 0.556 us | get_slab();
1) | mwait_idle() {
0) 1.851 us | }
0) | /* after nano */
0) | kfree() {
0) 0.486 us | __phys_addr();
Notice that the writes to trace_marker show up as comments in
the function graph tracer.
The first column here represents the CPU. When we have the CPU traces
interleaved like this, it may become hard to read the trace. The tool
grep can easily filter this, or the per_cpu trace files
may be used. The per_cpu trace files are located in the debugfs
tracing directory under per_cpu.
[tracing]# ls per_cpu
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
There exists a trace file in each one of these CPU directories that
only show the trace for that CPU.
To get a nice view of the function graph tracer without the interference of
other CPUs just look at per_cpu/cpu0/trace.
[tracing]# cat per_cpu/cpu0/trace
0) | __kmalloc() {
0) 0.528 us | get_slab();
0) 2.271 us | }
0) | /* before nano */
0) | kfree() {
0) 0.475 us | __phys_addr();
0) 2.062 us | }
0) 0.608 us | inotify_inode_queue_event();
0) 0.485 us | __fsnotify_parent();
0) 0.488 us | inotify_dentry_parent_queue_event();
0) 1.106 us | fsnotify();
[...]
0) 0.721 us | _spin_unlock_irqrestore();
0) 3.380 us | }
0) | audit_syscall_entry() {
0) 0.495 us | current_kernel_time();
0) 1.602 us | }
0) | sys_nanosleep() {
0) | hrtimer_nanosleep() {
0) 0.526 us | hrtimer_init();
0) | do_nanosleep() {
0) | hrtimer_start_range_ns() {
0) | __hrtimer_start_range_ns() {
0) | lock_hrtimer_base() {
0) 0.866 us | _spin_lock_irqsave();
[...]
0) | __kmalloc() {
0) | get_slab() {
0) 1.851 us | }
0) | /* after nano */
0) | kfree() {
0) 0.486 us | __phys_addr();
Disabling the Tracer Within the Kernel
During the development of a kernel driver there may exist strange
errors that occur during testing. Perhaps the driver gets stuck in a sleep
state and never wakes up. Trying to disable the tracer from user space
when a kernel event occurs is difficult and usually results in a buffer
overflow and loss of the relevant information before the user can stop
the trace.
There are two functions that work well inside the kernel:
tracing_on() and tracing_off(). These two act just like
echoing "1" or "0" respectively into the tracing_on file. If there is
some condition that can be checked for inside the kernel, then the tracer
may be stopped by adding something like the following:
if (test_for_error())
tracing_off();
Next, add several trace_printk()s (see part 1), recompile, and
boot the kernel. You can then enable the function or function graph tracer
and just
wait for the error condition to happen. Examining the tracing_on
file will let you know when the error condition occurred. It will switch
from "1" to "0" when the kernel calls tracing_off().
After examining the trace, or saving it off in another file with:
cat trace > ~/trace.sav
you can continue the trace to examine another
hit. To do so, just echo "1" into
tracing_on, and the trace will
continue. This is also useful if the condition that triggers the
tracing_off() call can be triggered legitimately. If the condition was
triggered by normal operation, just restart the trace by echoing a "1" back
into
tracing_on and hopefully the next time the condition is hit
will be because of the abnormality.
ftrace_dump_on_oops
There are times that the kernel will crash and examining the memory and
state of the crash is more of a CSI science than a program debugging
science. Using kdump/kexec with the crash
utility is a valuable way to examine the state of the system at the point
of the crash, but it does not let you see what has happened prior to the
event that caused the crash.
Having Ftrace configured and enabling ftrace_dump_on_oops in
the kernel boot parameters, or by echoing a "1" into
/proc/sys/kernel/ftrace_dump_on_oops, will enable Ftrace to dump
to the console the entire trace buffer in ASCII format on oops or panic.
Having the console output to a serial log makes debugging crashes much
easier. You can now trace back the events that led up to the crash.
Dumping to the console may take a long time since the default Ftrace
ring buffer is over a megabyte per CPU. To shrink the size of the ring
buffer, write the number of kilobytes you want the ring buffer to be to
buffer_size_kb. Note that the value is per CPU, not the total
size of the ring buffer.
[tracing]# echo 50 > buffer_size_kb
The above will shrink the Ftrace ring buffer down to 50 kilobytes per
CPU.
You can also trigger a dump of the Ftrace buffer to the console with
sysrq-z.
To choose a particular location for the kernel dump, the kernel may call
ftrace_dump() directly. Note, this may permanently disable Ftrace
and a reboot may be necessary to enable it again. This is because
ftrace_dump() reads the buffer. The buffer is made to be written
to in all contexts (interrupt, NMI, scheduling) but the reading of the
buffer requires locking. To be able to perform ftrace_dump() the
locking is disabled and the buffer may end up being corrupted after the
output.
/*
* The following code will lock up the box, so we dump out the
* trace before we hit that location.
*/
ftrace_dump();
/* code that locks up */
Stack Tracing
The final topic to discuss is the ability to examine the size of the
kernel stack and how much stack space each function is using. Enabling the
stack tracer (CONFIG_STACK_TRACER) will show where the biggest use
of the stack takes place.
The stack tracer is built from the function tracer infrastructure. It
does not use the Ftrace ring buffer, but it does use the function tracer to
hook into every function call. Because it uses the function tracer
infrastructure, it does not add overhead when not enabled. To enable the
stack tracer, echo 1 into
/proc/sys/kernel/stack_tracer_enabled. To see the max stack size
during boot up, add "stacktrace" to the kernel boot parameters.
The stack tracer checks the size of the stack at every function call. If it
is greater than the last recorded maximum, it records the stack trace and
updates the maximum with the new size. To see the current maximum, look at the
stack_max_size file.
[tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
[tracing]# cat stack_max_size
2928
[tracing]# cat stack_trace
Depth Size Location (34 entries)
----- ---- --------
0) 2952 16 mempool_alloc_slab+0x15/0x17
1) 2936 144 mempool_alloc+0x52/0x104
2) 2792 16 scsi_sg_alloc+0x4a/0x4c [scsi_mod]
3) 2776 112 __sg_alloc_table+0x62/0x103
[...]
13) 2072 48 __elv_add_request+0x98/0x9f
14) 2024 112 __make_request+0x43e/0x4bb
15) 1912 224 generic_make_request+0x424/0x471
16) 1688 80 submit_bio+0x108/0x115
17) 1608 48 submit_bh+0xfc/0x11e
18) 1560 112 __block_write_full_page+0x1ee/0x2e8
19) 1448 80 block_write_full_page_endio+0xff/0x10e
20) 1368 16 block_write_full_page+0x15/0x17
21) 1352 16 blkdev_writepage+0x18/0x1a
22) 1336 32 __writepage+0x1a/0x40
23) 1304 304 write_cache_pages+0x241/0x3c1
24) 1000 16 generic_writepages+0x27/0x29
[...]
30) 424 64 bdi_writeback_task+0x3f/0xb0
31) 360 48 bdi_start_fn+0x76/0xd7
32) 312 128 kthread+0x7f/0x87
33) 184 184 child_rip+0xa/0x20
Not only does this give you the size of the maximum stack found, it also
shows the breakdown of the stack sizes used by each function. Notice that
write_cache_pages had the biggest stack with 304 bytes being used,
followed by generic_make_request with 224 bytes of stack.
To reset the maximum, echo "0" into the stack_max_size
file.
[tracing]# echo 0 > stack_max_size
Keeping this running for a while will show where the
kernel is using a bit too much stack. But remember that the stack tracer
only has no overhead when it is not enabled. When it is running you may notice a
bit of a performance degradation.
Note that the stack tracer will not trace the max stack size when the
kernel is using a separate stack. Because interrupts have their own stack,
it will not trace the stack usage there. The reason is that currently
there is no easy way to quickly see what the top of the stack is when the
stack is something other than the current task's stack. When using split
stacks, a process stack may be two pages but the interrupt stack may only
be one. This may be fixed in the future, but keep this in mind when using
the stack tracer.
Conclusion
Ftrace is a very powerful tool and easy to configure. No extra tools are
necessary. Everything that was shown it this tutorial can be used on
embedded devices that only have Busybox installed. Taking advantage of the
Ftrace infrastructure should cut the time needed to debug that hard-to-find
race condition. I seldom use printk() any more because using the
function and function graph tracers along with trace_printk() and
tracing_off() have become my main tools for debugging the Linux
kernel.
Comments (none posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Page editor: Jonathan Corbet
Next page: Distributions>>