Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc1, released by Linus on December 17. The 2.6.33 merge window is now closed; significant patches merged since last week's summary include a direct rendering driver for the VMware virtual GPU, along with drivers for Maxim 8660/8661 voltage regulators, Marvell 88PM8607 PMIC devices, STMicroeletronics LIS3LV02Dx accelerometers, LaCie Network Space v2 NAS boards, DesignWare SPI controller core devices, Samsung S3C64XX series type SPI controllers, MDIO buses on Octeon SOCs, Octeon Management port Ethernet controllers, Cisco PowerTV platforms, and HP Smart Array SCSI controllers.

There have been a small number of patches merged since the 2.6.33-rc1 release; along with fixes, these include the kfifo API rework patch set.

Stable updates: The 2.6.27.42, 2.6.31.9, and 2.6.32.2 stable kernel updates were released on December 18. Each contains another long list of important fixes all across the kernel tree.

Comments (none posted)

Quotes of the week

For an added bonus make sure the rant includes something highly quotable and funny as a put down, then you'll make lwn.net as well

-- Alan Cox

We need a sane scripting language available to the kernel build, and given all the problems we have had with different versions or even just sometimes different builds of sh, awk, and even bc -- plus the fact that those utilities just don't necessarily do what we want makes it very frustrating. Personally I think a dependency on Perl is better than the mess we're in; I understand other people disagree. What is definitely not acceptable, however, is the status quo. The situation is, quite frankly, ridiculous enough that perhaps the right thing to do is to write a small scripting engine and bundle it with the kernel.

-- H. Peter Anvin

The two-week merge window is _not_ supposed to be "one day merge window after thirteen days of silence". In fact, I think that next time around I'll make the merge window be 11-12 days instead, and people who try to game the system and do a last-minute pull request will get a surprise, and get unceremoniously bumped to 2.6.35 instead.

-- Linus Torvalds

Comments (11 posted)

pr_unwanted()?

By Jonathan Corbet
December 22, 2009

One of a kernel developer's best friends is the printk() function, which works much like printf() in user-space programs. There are some differences, though, including the existence of various levels of logging. The convention used is a little funny, with the logging level being a short string prepended to the format string. So a warning might be printed this way:

    printk(KERN_WARNING "Core meltdown imminent\n");

This form is not universally loved, though; some call it verbose, making it hard to make lines fit within 80 columns, and the severity string is easy to forget. As an alternative, the 2.6.28-rc5 kernel saw the addition of a set of pr_*() macros, written by Martin Schwidefsky, which are designed to make life a little easier. So, for example, the above warning could be rewritten as:

    pr_warning("Software patents detected\n");

These macros sat in relative obscurity for a few development cycles until Joe Perches decided to switch a number of printk() statements in the core kernel. That led to an outburst from Peter Zijlstra and the eventual reverting of the change. Peter says:

I might be strange, but if I want to print something in C I write print[fk]() and be done with it, there's no reason what so ever to introduce fancy wankery for this. We try to stick to ANSI-C as much as possible, we've got kalloc,kfree,strcmp,strnlen and all the other 'regular' C bits, deviating from that serves no purpose but seed confusion.

Chances are, there won't be any more such conversions near that part of the kernel. But the pr_*() macros won't be going away either. Their real purpose, perhaps, was best expressed by Arjan van de Ven: "pr_ is really just for 'I am a driver and want a single line message out in a standardized format'. Nothing wrong with that."

Comments (4 posted)

Turbulence for concurrency-managed workqueues

By Jonathan Corbet
December 23, 2009

Tejun Heo's concurrency-managed workqueues patch was discussed here in October. This work has been progressing, with some of the associated clean-up patches being merged for 2.6.33; the main part of the work would appear to be on a path for merging in 2.6.34. Or maybe not: some developers are starting to express some qualms.

The loudest complaints come from Peter Zijlstra, who would rather see effort go into converting workqueue users to using threaded interrupt handlers instead. To developers like Peter, the new workqueues look like a bunch of new complexity which could create new problems (management of CPU-intensive workqueue tasks, for example) while failing to address other issues, including the locking problems which can plague workqueue users now.

Tejun has responded with a description of some of the problems being solved by the redone workqueues, concluding:

Shifting complexity out of peripheral code to better crafted and managed core code is the right thing to do and it will shift a lot of complexity out of peripheral codes.

That may actually be where some of the trouble lies: the patch set, in its current state, does not really demonstrate this shift in complexity. So Ingo Molnar has requested some example conversions that show the advantages of concurrency-managed workqueues:

For this particular patchset it should be possible to identify existing patterns of code in the existing code base of 6+ millions lines of Linux driver code that would make the advantages of this +2000 lines of core kernel code plain obvious. There were multiple claims of problems with the current abstractions - so there sure must be a way to show off the new code in a few places.

Tejun has indicated that he will work to provide this demonstration. Should the next version of the patch set prove convincing on this front, the new workqueues might still be on-track for 2.6.34.

Comments (none posted)

Kernel development news

Two that didn't make it

By Jonathan Corbet
December 22, 2009

The 2.6.33 merge window has run its course, and a great deal of code has been merged into the mainline. The merge window always seems like a bit of a game of musical chairs, though: when the music stops, at least one project tends to be left conspicuously standing. This time around, two projects were left without a chair in the mainline despite having sent in pull requests: the Ceph distributed filesystem and the AlacrityVM hypervisor code.

Often, originators of ignored pull requests are left in silence to wonder why those requests were not acted upon. This time around, Linus explained the missing pulls: there didn't seem to be enough interest in those features. As he put it:

The best thing to do is to try to have users that are vocal about the feature, and talk about how great it is. Some advocates for it, in other words. Just a few other people saying "hey, I use this, it's great", is actually a big deal to me. For alacrityvm and cephfs, I didn't have that, or they just weren't loud enough for me to hear.

Sun CEO Scott McNealy once remarked that free software is like a free puppy. There is some truth to that remark in general, and with respect to code pulled into the kernel in particular. The code itself comes for free, with a nice, GPL-compatible license attached to it. But the kernel maintainers know that this new code is likely to make a few messes around the house and chew up their favorite pair of slippers before it is properly trained. It also must be fed and taken for an occasional visit to the veterinarian for years into the future. So it is important to be sure that, at a minimum, this is a puppy that users actually want. That is why Linus is asking for users to express their support for proposed new features.

Getting that support can be a bit of a catch-22 situation, though. It takes a dedicated user indeed to grab an in-progress patch and build it into their own kernel; most users will not do that. Life can be easier if distributions package proposed code, giving users a chance to test it out without having to build and install a new kernel, but distributors can get into trouble for doing that. The recent fuss over Nouveau was a clear example of unhappiness about shipping out-of-tree kernel code. Similarly, a few years ago, SUSE shipped AppArmor without merging it first, drawing this complaint from Andrew Morton:

Sigh. Please don't put us in this position again. Get stuff upstream before shipping it to customers, OK? It ain't rocket science.

But getting the customers to request the software - within hearing of Linus Torvalds - before it has been either merged or shipped to them can indeed seem like rocket science at times.

There has been at least one public request for the merging of Ceph in a future development cycle. The bar may be even higher for AlacrityVM, though. There does not appear to be crowd of users asking for a new set of virtualized device drivers which are meant to be used with an out-of-tree virtualization mechanism. Beyond that, past discussions about this code have been long and heated, with some significant disagreements between AlacrityVM developer Gregory Haskins and (in particular) the KVM developers.

This history led Ingo Molnar to post a reminder of flame wars past and a request that Gregory try harder to work with the KVM development community. Needless to say, this posting has started another extensive discussion, with Gregory stating that he has tried hard indeed to work with the other developers, and that, in any case, the current AlacrityVM posting, which consists mostly of drivers, is not relevant to KVM. From there, the discussion moved into whether this work is really necessary, the best approaches to improving I/O performance in virtualized guests, and so on.

It's not clear that there is an obvious solution to this particular disagreement other than having serious users try out the various solutions and report on what works best. That will be hard to do with an out-of-tree virtualization solution, but the existence of this kind of controversy will only make getting the code into the mainline harder. Linus was quite clear on that:

So when I see another virtualization interface, I want the virtualization people to just argue it out amongst themselves. Thanks to the virtue of me personally not caring one whit about virtualization, I can stand back and just watch the fireworks.

Which is not to say that I enjoy it (I like the occasional flame-fest, but in order to like them I need to _care_ enough to get fired up about them!). So I just don't want the in-fighting to take place in my tree, so I'd rather see the fighting die out _before_ I actually pull.

This code was developed by SUSE, which, presumably, wishes to provide AlacrityVM to its customers. This may be one of those situations where the distributor has no choice but to ship the code ahead of mainline integration, just to get the user feedback that shows it's worthwhile. That course has risks: the code may never be merged, or it may suffer incompatible changes on its way into the mainline later on. But the alternative may be to see this code languish on the sideline indefinitely.

Comments (3 posted)

Writing an ACPI driver - an introduction

December 21, 2009

This article was contributed by Matthew Garrett

The Advanced Configuration and Power Interface specification (ACPI) was introduced to replace the myriad of differing protocols for providing configuration data to the operating system. It additionally provided a new power control specification to replace APM, moving policy decisions out of the hard-coded BIOS and into the operating system. Since then it's had a number of extensions implementing all kinds of functionality, variously specified and unspecified. Hardware vendors have seized upon this to implement their own custom "value add" interfaces, taking advantage of the existing specified functionality while adding their own non-standard extensions.

In this article we'll be looking at writing a driver to manage a persistent irritation on Toshiba hardware: the case of the missing Bluetooth. Some Toshibas will boot without Bluetooth, or will manage to lose it somewhere between being suspended and resumed. Sometimes it'll be there until the owner hits the rfkill switch, at which point it's gone no matter how plaintively the user flicks the switch back and forth. In short, the Bluetooth interface is fickle, flaky and not to be relied upon. We're lucky, though. Toshiba implemented their Bluetooth control in the form of an ACPI device. At this point some of you may feel that this is some unusual meaning of the word "lucky", but it's really not as bad as it could be.

First, we'll need one of the ACPI tables. ACPI tables are sections of information provided to the operating system by the BIOS; they contain either blocks of configuration information or, alternatively, executable code in a compiled bytecode called AML. The table that we want is the "Discrete System Descriptor Table", or DSDT. This provides a set of configuration information and control methods for the system hardware. On Linux, it can be found in /sys/firmware/acpi/tables/DSDT. We need to decompile it from the AML bytecode to ASL (ACPI Source Language), which can be done with iasl - the Intel ACPI compiler. This will typically be available as a package in distributions but can also be downloaded as source from acpica.org . The -d option to iasl decompiles an executable table to something resembling the original source. For reference there's an example of a decompiled DSDT here, and it contains the devices and methods discussed in the rest of this article.

Looking at the decompiled DSDT, the first thing we realise is that there's a huge pile of junk and extraneous configuration in here, so let's try to find something useful. First of all, let's look for interesting devices. ACPI device names are limited to four characters, which is generally not helpful in finding something interesting from scratch. Thankfully there's also the _HID string, which provides a tag for identifying the type of device. These strings use the same namespace as old ISA PNP devices, so some of them may be familiar to those of you who spent too long cursing at IRQ and IO settings in the bad old days. So ignore anything with a _HID tag that starts with PNP - it's some piece of standardized system hardware that's unlikely to be doing anything interesting.

On this Toshiba, that leaves us with 5 devices - NSC1100, SMCF030, TOS6205, ACPI0003 and TOS6208. According to the ACPI specification, ACPI0003 is an AC adapter. So ignore that. Google says that NSC1100 is a TPM device. SMCF030 is an infrared port. So that leaves TOS6205 and TOS6208, which look something like this:

    Device (BT)
    {
	Name (_HID, "TOS6205")
	...
    }

    Device (VALZ)
    {
	Name (_HID, "TOS6208")
	...
    }

VALZ turns out to be the generic event and control interface for all kinds of other bits of laptop functionality. There's already a driver for this in the kernel (toshiba_acpi.c), so let's ignore that. The one called BT certainly sounds like a better bet, so TOS6205 it is.

At this point we can write a skeleton driver that does nothing other than bind to this ACPI device. It's only a few lines of code to do that, and it's consistent between all ACPI drivers. All we need to do is register an ACPI driver structure with add and remove functions. These will be called whenever the kernel finds an ACPI device with the TOS6205 ID, and we can do further setup there.

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/types.h>
    #include <acpi/acpi_bus.h>
    #include <acpi/acpi_drivers.h>

    static int toshiba_bt_rfkill_add(struct acpi_device *device);
    static int toshiba_bt_rfkill_remove(struct acpi_device *device, int type);

    static const struct acpi_device_id bt_device_ids[] = {
        { "TOS6205", 0},
        { "", 0},
    };
    MODULE_DEVICE_TABLE(acpi, bt_device_ids);

    static struct acpi_driver toshiba_bt_driver = {
        .name =         "Toshiba BT",
        .class =        "Toshiba",
        .ids =          bt_device_ids,
        .ops =          {
                                .add =          toshiba_bt_rfkill_add,
                                .remove =       toshiba_bt_rfkill_remove,
                        },
	.owner =	THIS_MODULE,
    };

    static int toshiba_bt_rfkill_add(struct acpi_device *device)
    {
	return 0;
    }

    static int __init toshiba_bt_rfkill_init(void)
    {
        int result = 0;

        result = acpi_bus_register_driver(&toshiba_bt_driver);
        if (result < 0) {
                ACPI_DEBUG_PRINT((ACPI_DB_ERROR,
                                  "Error registering driver\n"));
                return -ENODEV;
        }

        return 0;
    }

    static int toshiba_bt_rfkill_remove(struct acpi_device *device, int type)
    {
	return 0;
    }

    static void __exit toshiba_bt_rfkill_exit(void)
    {
        acpi_bus_unregister_driver(&toshiba_bt_driver);
    }

    module_init(toshiba_bt_rfkill_init);
    module_exit(toshiba_bt_rfkill_exit);

Now what? Let's look at the device's methods. These are functions associated with the device, and will typically be declared in the same scope block. For the Toshiba Bluetooth device, we see something like this:

    Device (BT)
    {
	Name (_HID "TOS6205")
	Method (_STA, 0, Notserialized)
	{
		...
	}
	Method (AUSB, 0, Notserialized)
	{
		...
	}
	Method (DUSB, 0, Notserialized)
	{
		...
	}
	Method (BTPO, 0, Notserialized)
	{
		...
	}
	Method (BTPF, 0, Notserialized)
	{
		...
	}
	Method (BTST, 0, Notserialized)
	{
		...
	}
    }

The device has an _STA method. Methods with a leading _ are supposed to be reserved for the ACPI specification. _STA is defined as returning the device status, and, in this case, will tell us whether the TOS6205 device is functional or not. If there's no Bluetooth on the machine, it'll return zero. The ACPI core handles this for us, so we can ignore it.

The other methods are BTST, AUSB, DUSB, BTPO and BTPF. Working out what these do can be awkward. The DSDT I'm looking at (and which is linked above) has methods that read and write from a set of objects which reflect the hardware state in some way. DUSB writes a zero to an object called BTDT, while AUSB writes one to it. BTPF writes a zero to an object called BTPW, while BTPO writes one to it. BTST returns a byte with the value of BTSW in the low bit, BTDT in bit six, and BTPW in bit seven. Looking at the DSDT, we can see that BTDT, BTPW and BTSW are defined in a block that looks like this:

    OperationRegion (ERAM, EmbeddedControl, 0x00, 0xFF)
    Field (ERAM, ByteAcc, Lock, Preserve) {
        BTDT, 1
	BTPW, 1
	...
	BTSW, 1
	...
    }

An OperationRegion is ACPI speak for an addressable system resource - it may refer to an area of memory, system IO space, a PCI configuration block or (in this case) the registers of the system embedded controller. Objects can be declared within that block in order to let the ACPI code read and write to them. In this case, BTDT, BTPW and BTSW all refer to sections of the embedded controller register space. The number after the comma is the number of bits the object takes up, so we now know that BTDT is the first bit of the first byte-sized register of the embedded controller, BTPW the second bit and BTSW the fifth. Writing to these values will probably cause the embedded controller (a microprocessor running its own firmware) to perform some act in return - alternatively, an external event (such as flicking a switch) may generate an event picked up by the embedded controller and causing it to update a register's contents.

(Side note: generally speaking, accessing the embedded controller registers directly from the operating system is an error. Vendors may alter the embedded controller firmware and the layout of the bits between different models or even different BIOS versions. When this happens, ACPI methods will be updated to match - so if there's an ACPI method to call to interface with the embedded controller, use that rather than trying to drive it directly. This kind of thing can also act in our favour - ACPI interfaces tend to be retained over a range of models, even if the underlying hardware has changed dramatically. One machine may make an opaque system management call in response to an ACPI method, while another with the same method may return a register directly. By comparing the tables from different machines with the same interface, it's often possible to learn more about what these methods are actually meant to do).

We can make pretty good guesses about what's going on here. BTST returns a set of register values, so is probably the BlueTooth device STatus. BTSW presumably contains the state of the kill switch on the laptop. BTPW sounds like it's the power state - so if BTPF and BTPO change that, they're presumably BlueTooth Power ofF and BlueTooth Power On. AUSB and DUSB sound like Attach USB and Detach USB.

So at this point, we have enough knowledge of the interface to implement a Bluetooth enabler driver. On module load we should check the switch status. If it's on, we should enable the Bluetooth.

    static int toshiba_bluetooth_enable(acpi_handle handle)
    {
	acpi_status result;
	acpi_integer status;

	result = acpi_evaluate_integer(handle, "BTST", NULL, &status);

	if (ACPI_FAILURE(result))
	    return -EINVAL;

	/* Check the current status */
	if (!(status & (1 << 0))) /* The switch is off */
	    return -EBUSY;

	if (!(status & (1 << 6))) /* The device is detached - attach it */
	    result = acpi_evaluate_object(handle, "AUSB", NULL, NULL);

	if (ACPI_FAILURE(result)) {
	    printk(KERN_ERR "Failed to reattach Toshiba Bluetooth device\n");
	    return -ENODEV;
	}

	if (!(status & (1 << 7))) /* The device isn't powered up */
	    result = acpi_evaluate_object(handle, "BTPO", NULL, NULL);

	if (ACPI_FAILURE(result)) {
	    printk(KERN_ERR "Failed to power on Toshiba Bluetooth device\n");
	    return -ENODEV;
	}

	return 0;
    }

    static int toshiba_bt_rfkill_add(struct acpi_device *device)
    {
        return toshiba_bluetooth_enable(device->handle);
    }

acpi_evaluate_object() asks the kernel's ACPI interpreter to find a method underneath a specific area of namespace (in this case, the namespace of the device that we've bound to - the first argument provides that), with the third and fourth arguments providing the arguments to be passed to the function and the results returned from the function respectively. acpi_evaluate_integer() is similar, but assumes that the method is going to return an integer.

(Random stylistic sidepoint: When looking at individual bits inside a returned integer, I find it neater to refer to the 0th bit as "1 << 0" rather than 1. There's no especially good reason for this).

Ok. What about when the user flicks the switch back? ACPI can catch various hardware events and then signal them to the operating system. This is done via the "Notify" keyword. Grepping for "Notify" in the DSDT reveals a pile of cases we don't care about, but also a couple of lines like:

    Notify (BT, 0x90)

The first argument to notify is the device to be notified. The kernel will check whether a driver is bound to this device, and if so will pass the notification on to the driver. 0x90 is simply the event type - numbers below 0x80 are device-independent and defined by the ACPI specification, whereas those above are device-specific and only defined by the ACPI specification if the device itself is defined by the ACPI specification. Toshiba has presumably defined what 0x90 means, but given that it's the only event sent by the device we'll just assume it means "Reattach the Bluetooth device". This machine doesn't seem to send an event when Bluetooth device is turned off, but that's less of a concern because the USB device vanishes anyway - there's nothing we need to do in response. Anyway, let's hook up a notification function.

    .ops = {
		.add =          toshiba_bt_rfkill_add,
		.remove =       toshiba_bt_rfkill_remove,
		.notify =       toshiba_bt_rfkill_notify,
            },

    static void toshiba_bt_rfkill_notify(struct acpi_device *device, u32 event)
    {
	toshiba_bluetooth_enable(device->handle);
    }

We ignore the event type in this case because some machines seem to send different numbers here, and we don't really get useful information from it. Other devices may use different event values to indicate the type of event that's been received, and in those cases you'd want to check it appropriately.

At this point we now have a driver that enables Bluetooth at boot (if it's enabled) and responds to the switch being flicked by enabling Bluetooth again. The only remaining case is the one where the user turned off Bluetooth while the system was running, suspended and then flicked the switch back to enable. Let's check again on resume.

    .ops = {
		.add =          toshiba_bt_rfkill_add,
		.remove =       toshiba_bt_rfkill_remove,
		.notify =       toshiba_bt_rfkill_notify,
		.resume =       toshiba_bt_rfkill_resume,
            },

    static int toshiba_bt_rfkill_resume(struct acpi_device *device)
    {
	toshiba_bluetooth_enable(device->handle);
	return 0;
    }

And that's it - a fully functional ACPI driver. The TOS6205 device is one of the simplest ACPI devices I've found, but the principle is the same for any other. In more complex cases you'll want to expose some kind of userspace interface to perform method calls on the device. These should use standard kernel interfaces wherever possible. Backlight control should be carried out via the backlight class, device hotkeys should be sent via an input device and more complex radio control should use the rfkill layer. More device-specific functionality may require you to add sysfs attributes directly, which is somewhat outside the scope of this article.

It would be nice to hope for these ACPI interfaces to become standardized over time, but unfortunately there seems to be little willingness on the part of the companies involved to do so. Some vendors are even moving away from using ACPI directly and are instead using WMI interfaces - a mechanism intended for exposing system management information to Windows applications, but easily subverted into a general purpose control system with even less transparency than pure ACPI. But that's a subject for a different article, and there are still many pieces of hardware with pure ACPI interfaces and no drivers written as yet.

Comments (18 posted)

Debugging the kernel using Ftrace - part 2

December 22, 2009

This article was contributed by Steven Rostedt

The Ftrace tracing utility has many different features that will assist in tracking down Linux kernel problems. The previous article discussed setting up Ftrace, using the function and function graph tracers, using trace_printk(), and a simple way to stop the recording of a trace from user space. This installment will touch on how user space can interact with Ftrace, faster ways of stopping the trace, debugging a crash, and finding what kernel functions are the biggest stack hogs.

Trace Markers

Seeing what happens inside the kernel gives the user a better understanding of how their system works. But sometimes there needs to be coordination between what is happening in user space and what is happening inside the kernel. The timestamps that are shown in the traces are all relative to what is happening within the trace, but they do not correspond well with wall time.

To help synchronize between the actions in user space and kernel space, the trace_marker file was created. It provides a way to write into the Ftrace ring buffer from user space. This marker will then appear in the trace to give a location in the trace of where a specific event occurred.

    [tracing]# echo hello world > trace_marker
    [tracing]# cat trace
    # tracer: nop
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               <...>-3718  [001]  5546.183420: 0: hello world

The <...> indicates that the name of the task that wrote the marker was not recorded. Future releases may fix this.

Starting, Stopping and Recording in a Program

The tracing_on and trace_marker files work very well to trace the activities of an application if the source of the application is available. If there is a problem within the application and you need to find out what is happening inside the kernel at a particular location of the application, these two files come in handy.

At the start of the application, you can open these files to have the file descriptors ready:

    int trace_fd = -1;
    int marker_fd = -1;

    int main(int argc, char **argv)
    {
	    char *debugfs;
	    char path[256];
	    [...]

	    debugfs = find_debugfs();
	    if (debugfs) {
		    strcpy(path, debugfs);  /* BEWARE buffer overflow */
		    strcat(path,"/tracing/tracing_on");
		    trace_fd = open(path, O_WRONLY);
		    if (trace_fd >= 0)
			    write(trace_fd, "1", 1);

		    strcpy(path, debugfs);
		    strcat(path,"/tracing/trace_marker");
		    marker_fd = open(path, O_WRONLY);

Then, at some critical location in the code, markers can be placed to show where the application currently is:

    if (marker_fd >= 0)
	    write(marker_fd, "In critical area\n", 17);

    if (critical_function() < 0) {
	    /* we failed! */
	    if (trace_fd >= 0)
		    write(trace_fd, "0", 1);
    }

In looking at the example, first you see a function called "find_debugfs()". The proper location to mount the debug file system is at /sys/kernel/debug but a robust tool should not depend on the debug file system being mounted there. An example of find_debugfs() is located here. The file descriptors are initialized to -1 to allow this code to work both with and without a tracing enabled kernel.

When the problem is detected, writing the ASCII character "0" into the trace_fd file descriptor stops tracing. As discussed in part 1, this only disables the recording into the Ftrace ring buffer, but the tracers are still incurring overhead.

When using the initialization code above, tracing will be enabled at the beginning of the application because the tracer runs in overwrite mode. That is, when the trace buffer fills up, it will remove the old data and replace it with the new. Since only the most recent trace information is relevant when the problem occurs there is no need to stop and start the tracing during the normal running of the application. The tracer only needs to be disabled when the problem is detected so the trace will have the history of what led up to the error. If interval tracing is needed within the application, it can write an ASCII "1" into the trace_fd to enable the tracing.

Here is an example of a simple program called simple_trace.c that uses the initialization process described above:

    req.tv_sec = 0;
    req.tv_nsec = 1000;
    write(marker_fd, "before nano\n", 12);
    nanosleep(&req, NULL);
    write(marker_fd, "after nano\n", 11);
    write(trace_fd, "0", 1);

(No error checking was added due to this being a simple program for example purposes only.)

Here is the process to trace this simple program:

    [tracing]# echo 0 > tracing_on
    [tracing]# echo function_graph > current_tracer
    [tracing]# ~/simple_trace
    [tracing]# cat trace

The first line disables tracing because the program will enable it at start up. Next the function graph tracer is selected. The program is executed, which results in the following trace. Note that the output can be a little verbose so much of it has been cut and replaced with [...]:

    [...]
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
    [...]
     1)   0.523 us    |          _spin_unlock();
     0)   0.495 us    |    current_kernel_time();
     1)               |          it_real_fn() {
     0)   1.602 us    |  }
     1)   0.728 us    |            __rcu_read_lock();
     0)               |  sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     1)   0.418 us    |            __rcu_read_lock();
     0)               |      do_nanosleep() {
     1)   1.114 us    |            _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     1)   2.760 us    |  }
     0)   0.556 us    |        get_slab();
     1)               |  mwait_idle() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

Notice that the writes to trace_marker show up as comments in the function graph tracer.

The first column here represents the CPU. When we have the CPU traces interleaved like this, it may become hard to read the trace. The tool grep can easily filter this, or the per_cpu trace files may be used. The per_cpu trace files are located in the debugfs tracing directory under per_cpu.

    [tracing]# ls per_cpu
    cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7

There exists a trace file in each one of these CPU directories that only show the trace for that CPU.

To get a nice view of the function graph tracer without the interference of other CPUs just look at per_cpu/cpu0/trace.

    [tracing]# cat per_cpu/cpu0/trace
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
     0)   0.488 us    |      inotify_dentry_parent_queue_event();
     0)   1.106 us    |      fsnotify();
    [...]
     0)   0.721 us    |    _spin_unlock_irqrestore();
     0)   3.380 us    |  }
     0)               |  audit_syscall_entry() {
     0)   0.495 us    |    current_kernel_time();
     0)   1.602 us    |  }
     0)               |  sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     0)               |      do_nanosleep() {
     0)               |        hrtimer_start_range_ns() {
     0)               |          __hrtimer_start_range_ns() {
     0)               |            lock_hrtimer_base() {
     0)   0.866 us    |              _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     0)               |        get_slab() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

Disabling the Tracer Within the Kernel

During the development of a kernel driver there may exist strange errors that occur during testing. Perhaps the driver gets stuck in a sleep state and never wakes up. Trying to disable the tracer from user space when a kernel event occurs is difficult and usually results in a buffer overflow and loss of the relevant information before the user can stop the trace.

There are two functions that work well inside the kernel: tracing_on() and tracing_off(). These two act just like echoing "1" or "0" respectively into the tracing_on file. If there is some condition that can be checked for inside the kernel, then the tracer may be stopped by adding something like the following:

    if (test_for_error())
	    tracing_off();

Next, add several trace_printk()s (see part 1), recompile, and boot the kernel. You can then enable the function or function graph tracer and just wait for the error condition to happen. Examining the tracing_on file will let you know when the error condition occurred. It will switch from "1" to "0" when the kernel calls tracing_off().

After examining the trace, or saving it off in another file with:

cat trace > ~/trace.sav

you can continue the trace to examine another hit. To do so, just echo "1" into tracing_on, and the trace will continue. This is also useful if the condition that triggers the tracing_off() call can be triggered legitimately. If the condition was triggered by normal operation, just restart the trace by echoing a "1" back into tracing_on and hopefully the next time the condition is hit will be because of the abnormality.

ftrace_dump_on_oops

There are times that the kernel will crash and examining the memory and state of the crash is more of a CSI science than a program debugging science. Using kdump/kexec with the crash utility is a valuable way to examine the state of the system at the point of the crash, but it does not let you see what has happened prior to the event that caused the crash.

Having Ftrace configured and enabling ftrace_dump_on_oops in the kernel boot parameters, or by echoing a "1" into /proc/sys/kernel/ftrace_dump_on_oops, will enable Ftrace to dump to the console the entire trace buffer in ASCII format on oops or panic. Having the console output to a serial log makes debugging crashes much easier. You can now trace back the events that led up to the crash.

Dumping to the console may take a long time since the default Ftrace ring buffer is over a megabyte per CPU. To shrink the size of the ring buffer, write the number of kilobytes you want the ring buffer to be to buffer_size_kb. Note that the value is per CPU, not the total size of the ring buffer.

    [tracing]# echo 50 > buffer_size_kb

The above will shrink the Ftrace ring buffer down to 50 kilobytes per CPU.

You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.

To choose a particular location for the kernel dump, the kernel may call ftrace_dump() directly. Note, this may permanently disable Ftrace and a reboot may be necessary to enable it again. This is because ftrace_dump() reads the buffer. The buffer is made to be written to in all contexts (interrupt, NMI, scheduling) but the reading of the buffer requires locking. To be able to perform ftrace_dump() the locking is disabled and the buffer may end up being corrupted after the output.

    /*
     * The following code will lock up the box, so we dump out the
     * trace before we hit that location.
     */
    ftrace_dump();

    /* code that locks up */

Stack Tracing

The final topic to discuss is the ability to examine the size of the kernel stack and how much stack space each function is using. Enabling the stack tracer (CONFIG_STACK_TRACER) will show where the biggest use of the stack takes place.

The stack tracer is built from the function tracer infrastructure. It does not use the Ftrace ring buffer, but it does use the function tracer to hook into every function call. Because it uses the function tracer infrastructure, it does not add overhead when not enabled. To enable the stack tracer, echo 1 into /proc/sys/kernel/stack_tracer_enabled. To see the max stack size during boot up, add "stacktrace" to the kernel boot parameters.

The stack tracer checks the size of the stack at every function call. If it is greater than the last recorded maximum, it records the stack trace and updates the maximum with the new size. To see the current maximum, look at the stack_max_size file.

    [tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
    [tracing]# cat stack_max_size
    2928
    [tracing]# cat stack_trace
            Depth    Size   Location    (34 entries)
            -----    ----   --------
      0)     2952      16   mempool_alloc_slab+0x15/0x17
      1)     2936     144   mempool_alloc+0x52/0x104
      2)     2792      16   scsi_sg_alloc+0x4a/0x4c [scsi_mod]
      3)     2776     112   __sg_alloc_table+0x62/0x103
    [...]
     13)     2072      48   __elv_add_request+0x98/0x9f
     14)     2024     112   __make_request+0x43e/0x4bb
     15)     1912     224   generic_make_request+0x424/0x471
     16)     1688      80   submit_bio+0x108/0x115
     17)     1608      48   submit_bh+0xfc/0x11e
     18)     1560     112   __block_write_full_page+0x1ee/0x2e8
     19)     1448      80   block_write_full_page_endio+0xff/0x10e
     20)     1368      16   block_write_full_page+0x15/0x17
     21)     1352      16   blkdev_writepage+0x18/0x1a
     22)     1336      32   __writepage+0x1a/0x40
     23)     1304     304   write_cache_pages+0x241/0x3c1
     24)     1000      16   generic_writepages+0x27/0x29
    [...]
     30)      424      64   bdi_writeback_task+0x3f/0xb0
     31)      360      48   bdi_start_fn+0x76/0xd7
     32)      312     128   kthread+0x7f/0x87
     33)      184     184   child_rip+0xa/0x20

Not only does this give you the size of the maximum stack found, it also shows the breakdown of the stack sizes used by each function. Notice that write_cache_pages had the biggest stack with 304 bytes being used, followed by generic_make_request with 224 bytes of stack.

To reset the maximum, echo "0" into the stack_max_size file.

    [tracing]# echo 0 > stack_max_size

Keeping this running for a while will show where the kernel is using a bit too much stack. But remember that the stack tracer only has no overhead when it is not enabled. When it is running you may notice a bit of a performance degradation.

Note that the stack tracer will not trace the max stack size when the kernel is using a separate stack. Because interrupts have their own stack, it will not trace the stack usage there. The reason is that currently there is no easy way to quickly see what the top of the stack is when the stack is something other than the current task's stack. When using split stacks, a process stack may be two pages but the interrupt stack may only be one. This may be fixed in the future, but keep this in mind when using the stack tracer.

Conclusion

Ftrace is a very powerful tool and easy to configure. No extra tools are necessary. Everything that was shown it this tutorial can be used on embedded devices that only have Busybox installed. Taking advantage of the Ftrace infrastructure should cut the time needed to debug that hard-to-find race condition. I seldom use printk() any more because using the function and function graph tracers along with trace_printk() and tracing_off() have become my main tools for debugging the Linux kernel.

Comments (1 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.33-rc1 ?

Greg KH Linux 2.6.32.2 ?

Greg KH Linux 2.6.31.9 ?

Greg KH Linux 2.6.27.42 ?

Architecture-specific

Jean Pihet Perf Event support for ARMv7 (was: Re: [PATCH 5/5] arm/perfevents: implement perf event support for ARMv6) ?

Jeremy Kerr Device-tree support for ARM ?

john stultz [RFC][PATCH 0/14] Convert remaining arches to read/update_persistent_clock ?

Core kernel code

Tejun Heo Concurrency-managed workqueue ?

Con Kolivas BFS v0.312 configurable CPU scheduler for 2.6.32 ?

Christoph Lameter Per cpu atomics in core allocators, cleanup and more this_cpu_ops ?

Stefani Seibold new kqueue API v.08 ?

Stefani Seibold new kfifo API v.08 (fixed and mostly checkpatch.pl clean) ?

Andi Kleen [0/11] SYSCTL: Use RCU to avoid races with string sysctls ?

Ben Blum cgroups: support for module-loadable subsystems ?

Development tools

Darren Hart tracecmd: Add initial python tracecmd API Dec 17

Masami Hiramatsu kprobes: Kprobes jump optimization support ?

Jiri Olsa dynamic debug - adding ring buffer storage support ?

Xiao Guangrong [PATCH v2 0/5] perf_event: introduce 'perf timer' to analyze timer's behavior ?

Device drivers

Christopher Heiny [PATCH 0/1] input/touchscreen: Synaptics Touchscreen Driver ?

Alessandro Rubini i2c port expanders support for nomadik nhk8815 ?

Sakari Ailus V4L2 file handles and event interface ?

Pawel Osciak Mem-to-mem device framework ?

Pawel Osciak [ARM] Add Samsung S3C/S5P image rotator driver ?

Greg KH Staging tree status for the .33 kernel merge ?

Filesystems and block I/O

Steve Rago improve the performance of large sequential write NFS workloads ?

Nick Piggin fs: truncate introduce new sequence ?

Memory management

Dan Magenheimer Tmem [PATCH 0/5] (Take 3): Transcendent memory ?

KAMEZAWA Hiroyuki [RFC 0/4] speculative page fault (Was Re: [mm][RFC][PATCH 0/11] mm accessor updates. ?

Andrea Arcangeli Transparent Hugepage support #2 ?

Page editor: Jonathan Corbet
Next page: Distributions>>