Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.31-rc4, released on July 22. "Ok, that was a fun week. We had a binutils bug, a ccache bug, and a compiler bug. And that was just the bugs that were outside the kernel, but resulted in a broken build." Beyond that, it's mostly just a big pile of fixes, many of which are for newly-discovered NULL pointer problems; see the long-format changelog for full details.

The current stable 2.6 kernel is 2.6.30.3, released (along with 2.6.27.28) on July 24. This is a single-fix update to work around a compiler problem which affected 2.6.30.2 and 2.6.27.27.

The 2.6.30.4 and 2.6.27.29 updates are currently in the review process. These kernels (each containing a long list of assorted fixes) will likely be released sometime on July 30.

For 2.4 users: the 2.4.37.4 update was released on July 26. Among other things, it contains a personality-related security fix; 2.4 maintainer Willy Tarreau would appreciate more eyes on this code to help come up with a proper fix.

Kernel development news

Quotes of the week

Geeze you guys send a lot of stuff. Stop writing new code and go fix some bugs!

-- Andrew Morton

I feel bad for just sending this email instead of proper bug reports and patches, but the truth is that I'm cycling through Africa on a bicycle. I sleep in a tent. It took me days to scrape together enough electricity and internet to send this one email...

-- Dan Carpenter with another lame excuse

But we don't do language-lawyering based on standards that inevitably never really delve into all the nitty-gritty details. We are simply better than that. Leave the language-lawyering to the people who can't do things well, and then whine about their crap being "technically correct".

-- Linus Torvalds

In Brief

By Jonathan Corbet
July 29, 2009

FAT timestamps. The FAT filesystem has a number of deficiencies. The fact that it cannot record time stamps for the root directory of a filesystem is probably not at the top of most people's lists, but Jorg Schummer has put together a patch to provide those time stamps anyway. The patch is a hack which stores the time stamp information in the FAT volume label, essentially hiding it from any system which doesn't know to look for it. This is not a new scheme; Mac OS X does the same thing. There does not seem to be a great clamor for this feature, but it is optional, the implementation is straightforward, and it's off by default. So there is little reason to leave it out either.

Remapping ext2/3 UIDs. Another failing of FAT is its inability to associate user or group ownership information with files. One would not normally want to port this "feature" to more complete filesystems, but Ludwig Nussel has noted a problem: a user moving an ext3 filesystem from one system to another will have problems accessing the files if said user's accounts have different user IDs on the two boxes. The solution is to add a uid= mount option to ext2 and ext3; the filesystem will then map between the given user ID (on the running system) and zero (on the filesystem).

There doesn't seem to be a great clamor for this feature either; the use of ext3 on filesystems moved between machines is probably relatively rare. Still, Andreas Dilger indicated that the feature might have its uses, but that some changes would be welcome. The ability to create root-owned setuid files needed to go away, and it would be nice to have a more general "remap UID1 to UID2" capability instead of just mapping to and from the root UID. Andreas also requested an ext4 version of the patch.
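The mapping itself is small enough to sketch. The Python below is only an illustration of the idea, not the patch's code; the function names are invented here, and leaving every other ID untouched is an assumption, since the description above only covers the mapping between the mount-option UID and zero.

```python
ROOT_UID = 0  # the ID recorded on the filesystem under this scheme


def disk_to_system(disk_uid: int, mount_uid: int) -> int:
    """Owner stored on disk -> owner shown to the running system."""
    # Assumption: IDs other than zero pass through unchanged.
    return mount_uid if disk_uid == ROOT_UID else disk_uid


def system_to_disk(system_uid: int, mount_uid: int) -> int:
    """Owner on the running system -> owner written to disk."""
    return ROOT_UID if system_uid == mount_uid else system_uid


# A user with UID 1000 on one box and 1001 on another sees the files as
# their own on both, because the disk only ever records zero.
on_disk = system_to_disk(1000, mount_uid=1000)
seen_elsewhere = disk_to_system(on_disk, mount_uid=1001)
```

The more general "remap UID1 to UID2" capability that Andreas asked for would replace the hard-wired zero with a second parameter.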

Fanotify. Eric Paris has posted a description of the new fanotify API for comments, noting that real patches will follow soon. That API has changed considerably since it was covered here at the beginning of July; the strange use of getsockopt() to get notifications is no more. Instead, a relatively normal socket is created, with read() being used to read notification events. There were a number of comments and suggestions, but the consensus seems to be that things are headed in the right direction.

ABUSE. We have FUSE, which allows the implementation of filesystems in user space, and CUSE, which does the same for char devices. So why not do the same thing for block devices? With Zachary Amsden's ABUSE patch, that now becomes possible. Zachary says: "This device is not about performance, is it about extending the boundaries of the kernel to the almost improbable." The code commentary notes that the feature can be "incredibly useful," but it's not clear what use case is being targeted at the moment.

ABUSE is highly unlikely to be merged, for the simple reason that much of what it does is already doable with the network block device (NBD) driver. Zachary plans to move to NBD for whatever purpose he has in mind. That purpose, apparently, makes it necessary to have access to partitions, which is why FUSE cannot be used.

The partitions topic led to a small side discussion, where Alan Cox suggested that partition support should be removed from the kernel altogether. Instead, the device mapper should be used to implement partitions. There are a lot of advantages - mostly administrative flexibility - which come from the use of the device mapper, but there are users, Linus included, who are not interested in requiring its use. So the kernel's partition code will not be going anywhere anytime soon.

A new book on the way. Man pages maintainer Michael Kerrisk, while writing about a recent release, noted that he is well along in the writing of a new book which extensively documents the Linux kernel's user-space API. It will not be light reading; it looks to end up at about 1500 pages. For the curious, Michael has posted a general description of the book and the table of contents. Publication is expected sometime in the first half of 2010.

Dynamic probes with ftrace

By Jonathan Corbet
July 28, 2009

The ftrace tracing infrastructure has only been in the mainline since 2.6.27 - less than one year. During that time, ftrace has seen a great deal of development and has acquired several new capabilities. It now provides many of the features that come with more heavyweight tools like SystemTap, along with some which are unique to ftrace. But there are still capabilities found in "real" tracing utilities which are not present in ftrace. One of the more significant limitations is the lack of dynamic tracing; ftrace can easily trace function calls or use static tracepoints placed in the kernel source, but it cannot add its own tracepoints on the fly. That could change, though, should Masami Hiramatsu's kprobe-based event tracer patch make it into the mainline.

The kprobes mechanism has been a part of the kernel for a long time; LWN ran an overview of it back in 2005. Kprobes are, of course, dynamic tracepoints; by use of on-the-fly code patching, the kernel can hook into its own code at any point. Tools like SystemTap use kprobes to implement their dynamic tracing features. With SystemTap, though, these probes are inserted by way of a special kernel module generated on the fly - a bit of a tricky interface. Masami's patch aims to turn the insertion of dynamic probes into something close to a command-line operation.

The patch creates a new debugfs file /sys/kernel/debug/tracing/kprobe_events. A new probe is inserted by appending a line to that file; that line has a somewhat complex format:

    p[:EVENT] SYMBOL[+offset|-offset]|MEMADDR [FETCHARGS]
    r[:EVENT] SYMBOL[+0] [FETCHARGS]

The first variant will set a probe with the optional name EVENT (if the name isn't supplied, the code makes one up). The probe will be placed at the location of the given SYMBOL, adjusted by the optional offset; an absolute address (MEMADDR) can also be used to locate the probe. The FETCHARGS portion of the line describes the data to be fetched and emitted when the tracepoint is hit; the syntax allows the specification of various types of data, including register contents, stack offsets, absolute addresses, kernel symbols, function arguments, and more. What the code does not currently allow is much in the way of sophisticated formatting of this data; it comes out in straight hexadecimal format.

The second line, above, inserts a "retprobe" instead. Retprobes are fired when the given function (as specified by SYMBOL) returns to its caller; they can capture the function's return value and the address it is returning to.

The patch posting contains an example of a couple of probes placed in do_sys_open(); the commands to do so are:

    echo p:myprobe do_sys_open a0 a1 a2 a3 > /sys/kernel/debug/tracing/kprobe_events
    echo r:myretprobe do_sys_open rv ra >> /sys/kernel/debug/tracing/kprobe_events

Two probes are placed here. One called myprobe will fire on entry to do_sys_open() and output the values of the four arguments passed to that function. The other, myretprobe, triggers when do_sys_open() returns, fetching the return value and return address in the process.
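For readers who would rather generate these lines than type them, the format can be assembled mechanically. The following Python is a rough sketch only: the helper name is made up for this example, the way offsets are printed (signed hexadecimal) is an assumption rather than something taken from the patch, and nothing here writes to kprobe_events.

```python
from typing import Optional, Sequence


def probe_definition(symbol: str,
                     fetchargs: Sequence[str] = (),
                     event: Optional[str] = None,
                     offset: int = 0,
                     is_return: bool = False) -> str:
    """Build one line in the p[:EVENT] / r[:EVENT] format shown above."""
    if is_return and offset != 0:
        raise ValueError("a return probe only accepts SYMBOL or SYMBOL+0")
    kind = "r" if is_return else "p"
    head = f"{kind}:{event}" if event else kind
    # Assumption: offsets are written as signed hex, e.g. do_sys_open+0x10.
    location = symbol if offset == 0 else f"{symbol}{offset:+#x}"
    return " ".join([head, location, *fetchargs])


entry = probe_definition("do_sys_open", ["a0", "a1", "a2", "a3"],
                         event="myprobe")
ret = probe_definition("do_sys_open", ["rv", "ra"],
                       event="myretprobe", is_return=True)
```

Here `entry` and `ret` hold the same text that the two echo commands above write into the debugfs file.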

The output from these tracepoints can be seen by reading /sys/kernel/debug/tracing/trace:

    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               <...>-1447  [001] 1038282.286885: do_sys_open+0x0/0xd6: 0xffffff9c 0x40413c 0x8000 0x1b6
               <...>-1447  [001] 1038282.286915: sys_open+0x1b/0x1d <- do_sys_open: 0x3 0xffffffff81367a3a

Here we see a call to do_sys_open() with its four parameters: the directory file descriptor (0xffffff9c), file name pointer (0x40413c), flags (0x8000), and mode (0x1b6). For the curious, the strange file descriptor value is the magic value AT_FDCWD, meaning that the file lookup should begin in the current working directory. There is also a return line (as indicated by the "<-" arrow) showing that the call returned to sys_open(), having opened file descriptor 3.
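Since the values come out as raw hexadecimal, a little post-processing helps when reading them. This Python sketch decodes the entry line from the output above; the function names are invented for the example, and it assumes the fetched values are the last colon-separated field on the line, as they are in that sample.

```python
AT_FDCWD = -100  # value of AT_FDCWD on Linux


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def decode_open_args(trace_line: str) -> dict:
    """Pull the four do_sys_open() arguments out of one trace line."""
    fields = trace_line.rsplit(": ", 1)[1].split()
    dfd, filename_ptr, flags, mode = (int(field, 16) for field in fields)
    return {
        "dfd": to_signed32(dfd),       # directory file descriptor
        "filename_ptr": filename_ptr,  # user-space pointer, left as-is
        "flags": flags,
        "mode": oct(mode),             # permissions read best in octal
    }


line = ("<...>-1447  [001] 1038282.286885: do_sys_open+0x0/0xd6: "
        "0xffffff9c 0x40413c 0x8000 0x1b6")
args = decode_open_args(line)
```

Interpreted as a signed 32-bit number, 0xffffff9c is -100, which is AT_FDCWD, and the mode 0x1b6 is the familiar octal 0666.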

The patch also provides mechanisms for turning individual probes on and off, filtering probe output, and maintaining profiles of probe hits.

Tracing of function entry and exit as shown above is a useful feature, but the existing ftrace function tracer can do that already. The obvious value in this new patch is the ability to place tracepoints at locations other than function entry and exit points. But that leads to an interesting question: how does the user manage to get tracepoints set in the right locations? Guessing at offsets from function symbols seems like a recipe for trouble, especially given that the placement of a tracepoint in the middle of an instruction is unlikely to lead to pleasant results.

Addressing that last concern is, as it turns out, the job of the bulk of the code in Masami's patch. Placing probes is relatively easy - the code to do that is already in the kernel. But making sure that the probe is in the right place requires the addition of an x86 instruction decoding module. When a probe is requested within a function, the instruction decoder goes to work; it starts at the beginning of the function and decodes instructions until it reaches the probe point. If the probe is located at an instruction boundary, all is well; otherwise the placement of the probe is disallowed.
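The boundary check described above amounts to a simple walk. In this Python sketch the hard part, decoding x86 instructions to learn their lengths, is replaced by a ready-made list of lengths; that list and the function name are stand-ins for illustration and say nothing about how the patch's decoder is structured.

```python
from typing import Sequence


def on_instruction_boundary(instruction_lengths: Sequence[int],
                            probe_offset: int) -> bool:
    """Walk from the start of the function to see where the offset lands."""
    position = 0
    for length in instruction_lengths:
        if position == probe_offset:
            return True    # the probe sits at the start of an instruction
        if position > probe_offset:
            return False   # walked past it: the offset is mid-instruction
        position += length
    return False           # the offset is at or beyond the function's end


# Hypothetical function whose instructions are 1, 3, 2 and 5 bytes long,
# so instructions begin at offsets 0, 1, 4 and 6.
lengths = [1, 3, 2, 5]
allowed = on_instruction_boundary(lengths, 4)
refused = on_instruction_boundary(lengths, 2)
```

A probe at offset 4 would be accepted; one at offset 2 falls inside the three-byte instruction and would be disallowed.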

Actually generating the right offsets for dynamic probes is likely to be a job for user-space software, such as a debugger or SystemTap, which can parse debugging information and map line numbers onto offsets. It is, in fact, conceivable that tools like SystemTap could move over to this mechanism once it's merged; that would allow SystemTap to share more of the low-level ftrace plumbing and get it closer to working with unpatched mainline kernels.

That's getting a little ahead of the game, though; first the kprobe-based event tracing code needs to be merged. There does not appear to be any real opposition to that merger - but this code has been around for a while and is currently on its 13th revision. The value of getting real dynamic probing support into the kernel seems reasonably evident, though; expect this feature to get in at some point.

Finding buffer overflows with Parfait

By Jake Edge
July 29, 2009

Recently, Roel Kluin has been proposing patches to fix a number of buffer overflows in the kernel, some of which he credited to "Parfait". It turns out that Parfait is a static source code checking tool that comes out of Sun Labs in Australia. The project reported 54 buffer overflows to the linux-security mailing list in early July, and Kluin has been going through them to get them fixed.

It is best to treat buffer overflows as potential security vulnerabilities, even though they may be hard, or even impossible, to exploit. Various types of these bugs have been thought to be unexploitable along the way, but then were found to be exploitable, so caution is clearly indicated. The full list was sent to the private kernel security alias, and then passed along to Kluin by Andrew Morton. Kluin has since been posting patches to linux-kernel, as well as the netdev mailing list, to fix them. A number of the fixes have already been picked up by subsystem maintainers, and some have made their way into the mainline.

The tool itself is relatively new, first demonstrated as an alpha last October, and takes a multi-layered approach using an "ensemble" of static analysis techniques. Thus the name. One of the goals, from the outset, was to produce something that could analyze a huge codebase—the OpenSolaris or Linux kernel for example—in a matter of minutes rather than the days or weeks that other tools require.

As part of a paper [PDF] presented at the Kernel Conference Australia in mid-July, the Parfait developers reported checking 5.7 million lines of code in the 2.6.29 kernel for buffer overflows in 13 minutes. The times for OpenSolaris and OpenBSD were similar when scaled for the number of lines of code checked.

Unsurprisingly, for all three kernels, the majority of buffer overflows were found in the driver code. For 2.6.29, Parfait found 12 buffer overflows in the Linux core, and 85 in the drivers (which make up 71% of the codebase). Some of those were false positives, but the paper does not make it clear just how many. Given that 54 were reported to linux-security, though, it would seem that something approaching half were false positives.
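The "approaching half" estimate is easy to reproduce. The arithmetic below rests on two assumptions that the paper does not confirm: that the 54 reports all came from this same 2.6.29 scan, and that every finding which went unreported was a false positive.

```python
core_findings = 12
driver_findings = 85
reported = 54

total_findings = core_findings + driver_findings       # 97 in all
presumed_false_positives = total_findings - reported   # 43 left unreported
false_positive_rate = presumed_false_positives / total_findings

print(f"{presumed_false_positives} of {total_findings} findings, "
      f"or about {false_positive_rate:.0%}")
```

That works out to 43 of 97 findings, roughly 44%.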

Kluin provided an example of the Parfait output:

    Bug type: Buffer overflow
    File: /usr/src/linux-2.6.29/security/smack/smackfs.c
    Line: 777
    Function: smk_write_netlbladdr
    Code snippet:

    0772:   if (count < SMK_NETLBLADDRMIN || count > SMK_NETLBLADDRMAX)
    0773:           return -EINVAL;
    0774:   if (copy_from_user(data, buf, count) != 0)
    0775:           return -EFAULT;
    0776:
    0777:   data[count] = '\0';
    0778:
    0779:   rc = sscanf(data, "%hhd.%hhd.%hhd.%hhd/%d %s",
    0780:           &host[0], &host[1], &host[2], &host[3], &m, smack);
    0781:   if (rc != 6) {
    0782:           rc = sscanf(data, "%hhd.%hhd.%hhd.%hhd %s",

    Parfait report:
    Error: Buffer overflow at
    /usr/src/linux-2.6.29/security/smack/smackfs.c:777 in function
    'smk_write_netlbladdr' [Symbolic analysis]
      In array dereference of data[count] with index 'count'
      Array size is 42 bytes, count >= 9 and count <= 42

    Comments:
    Off-by-one when adding the trailing null on line 777 - data is
    declared with size SMK_NETLBLADDRMAX, and count is allowed to
    equal SMK_NETLBLADDRMAX

This report shows a buffer overflow that Kluin had already fixed in the kernel before the Parfait report arrived. The paper also describes a GUI tool that collects the code and declarations which lead Parfait to believe there is a problem; that can help developers determine whether the problem is real.
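The off-by-one in that report can be modeled in a few lines. This Python is a stand-in for the C code, not a copy of it: the function name is invented, the constants are taken from the sizes in the Parfait report (42 and 9), and Python raises an IndexError at the point where the C code would quietly write one byte past the end of the array. Declaring the buffer one byte larger is shown as one possible fix, which is not necessarily the change that went into the kernel.

```python
SMK_NETLBLADDRMIN = 9    # lower bound on count, per the Parfait report
SMK_NETLBLADDRMAX = 42   # upper bound on count, and the array size


def write_terminator(buffer_size: int, count: int) -> bytearray:
    """Model of lines 772-777: check count, then store the trailing null."""
    if count < SMK_NETLBLADDRMIN or count > SMK_NETLBLADDRMAX:
        raise ValueError("EINVAL")
    data = bytearray(buffer_size)
    data[count] = 0  # out of bounds whenever count == buffer_size
    return data


# count == SMK_NETLBLADDRMAX passes the range check but overruns the array.
try:
    write_terminator(SMK_NETLBLADDRMAX, SMK_NETLBLADDRMAX)
    overflowed = False
except IndexError:
    overflowed = True

# One possible fix: leave room for the terminator.
fixed = write_terminator(SMK_NETLBLADDRMAX + 1, SMK_NETLBLADDRMAX)
```

The check Parfait describes comes down to comparing the largest value the index can take (42) with the array size (42); any index equal to or greater than the size is out of bounds.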

Currently, Parfait is not available to those outside of Sun, but a binary release is planned. According to lead developer Cristina Cifuentes, it should be available on the web site within the next month or two: "I estimate it will be end of August (to be optimistic) before the binary release is out, a more pessimistic estimate is end of September." That release will be available for "use on a non-commercial basis", she said. Sun is considering an open source release, but no decision on that has yet been made.

In an interview on the Sun Labs web site, Cifuentes gives a broader view of where Parfait is headed—more than just looking for buffer overflows:

At the moment the types of bugs we're finding include other memory-pointer related bugs. Things like null pointer dereference, double free, use after free, memory leaks, format string type mismatches — they can all be found with similar types of analysis. Those are some that we're putting our emphasis on now.

In many ways, Parfait is similar to the Coverity analysis tool that has been used on the kernel as well as other free software. In both cases, at least for now, the analysis can only be run by the company who owns the tool, or those who have licensed it in the case of Coverity. A free software analysis tool that did these kinds of checks—and didn't depend on the goodwill of various companies—would be a real boon. With luck, perhaps Parfait will some day fill that role.

These source analysis tools clearly find real bugs, though there is some evidence that the bug reports resulting from the scans are not being used to their fullest. The Coverity scanner found the tun.c NULL pointer dereference problem long before it was fixed in the kernel, but the report either went unnoticed or was (incorrectly as it turns out) not seen to be a serious problem. More source code analysis—at least any that isn't plagued by too many false positives—can only be a good thing, but the problems found need to be addressed or the value of the effort drops dramatically. It would be awfully nice to have free versions of these kinds of tools as well.

A tempest in a tty pot

By Jonathan Corbet
July 29, 2009

There are dark areas of the kernel where only the bravest hackers dare to tread. Places where the code is twisted, the requirements are complex, and everything depends on ancient code which has seen little change over the years because even the most qualified developers fear the consequences. Arguably, no part of the kernel is darker and scarier than the serial terminal (TTY) code. Recently, this code was getting a much-needed update, but it now appears that a disconnect within the community has brought that work to a halt and thrown TTY back into the "unmaintained" column - at a time when that code has known regressions in the 2.6.31-rc kernel.

At first glance, the TTY layer wouldn't seem like it should be all that challenging. It is, after all, just a simple char device which is charged with transferring byte-oriented data streams between two well-defined points. But the problem is harder than it looks. Much of the TTY code has roots in ancient hardware implementing the RS-232 standard - one of the loosest, most variable standards out there. TTY drivers also have to monitor the data stream and extract information from it; this duty can include ^S/^Q flow control, parity checking, and detection of control characters. Control characters may turn into out-of-band information which must be communicated to user space; ^D may become an end-of-file when the application reads to the appropriate point in the data stream, while other characters map onto signals. So the TTY code has to deal with complex signal delivery as well - never a path to a simple code base. Echoing of data - possibly transforming it in the process - must be handled. With the addition of pseudo terminals (PTYs), the TTY code has also become a sort of interprocess communication mechanism, with all of the weird TTY semantics preserved. The TTY code also needs to support networking protocols like PPP without creating performance bottlenecks.

All told, it's a complicated problem. It is also a problem which seems to interest relatively few developers. The top of drivers/char/tty_io.c still reads "Copyright (C) 1991, 1992, Linus Torvalds." Much of the code is still dependent on the big kernel lock. There are deadlocks and race conditions to be found. Almost nobody wants to touch it, but it still mostly works.

Alan, you are a true wizard :-) The tty layer is one of the very few pieces of kernel code that scares the hell out of me :-)
-- Ingo Molnar, July, 2007

In recent times, though, an energetic TTY maintainer has stepped forward: Alan Cox. One could almost hear the sighs of relief across the net when this happened; if anybody could clean out that particular set of Augean Stables, it would certainly be Alan, who has the combination of technical skill and attention to detail needed to avoid breaking things. Over the last year, it has been clear that fixing the TTY code has stressed even Alan's skills; the work has been slow and apparently laborious. But it has also been successful at getting the TTY code into better shape while preserving it as a functioning subsystem.

At least, that was the case until 2.6.31, where the combination of significant changes and some last-minute tweaks led to regressions. Users started to report that the kdesu application stopped working. The emacs compile mode started losing output. And so on. It turns out that there were a few separate bugs, not all of which were in the TTY layer:

The problem with kdesu appears to be a KDE bug; the application would read too much data, then wonder why the next read didn't have what it wanted. This code worked with the older TTY code, but broke with 2.6.31. There is probably no way to fix it which doesn't saddle the kernel with maintaining weird legacy bug-compatibility code - something the TTY layer does not need more of.

The emacs problem is different. In this case, the compile process would finish its work (writing its final output to the PTY) and exit. Emacs would try to read that final output, but would get a failed read resulting from the SIGCHLD signal sent by the exiting compile process. That failure was unexpected and caused emacs to drop the data. In essence, emacs expected that, by the time the compile process had completed its close() of the PTY file descriptor, the data written to that descriptor had been pushed through to the other end and would be available for reading. The 2.6.31 changes broke that assumption.

The second problem results from the complex nature of TTY data processing. It's not just a serial stream of data; instead, there is the line discipline processing in the middle. In 2.6.31, data written to a PTY will have been queued up for line discipline attention by the time a close() is allowed to complete, but there's no assurance that the line discipline code will have actually run and passed the data through to the other end. So the SIGCHLD signal can pass the data and arrive first.

Alan thinks this behavior is reasonable; it complies with the applicable standards and can be implemented in a relatively straightforward way. Making a close() on a PTY block until the other end has received the data might make emacs work better, but it also risks deadlock if both sides write data and close their file descriptors at the same time. Even so, Alan posted an "elegant in all the wrong ways" patch which fixed the problem, but also made it clear that he thought emacs was buggy and that the real fix belonged there.

Linus merged a version of this patch, but he was not happy about it. He believes that emacs is correct in its assumptions, and would like to see a better fix which makes the ordering of events clear and deterministic. He made his frustration clear:

Why? Why blame emacs? Why call user land buggy, when the bug was introduced by you, and was in the kernel? Why are you fighting it? Why did it take so long to admit that all the regressions were kernel problems? Why were you trying to dismiss the regression report as a user-land bug from the very beginning?

At that point, it was Alan's turn to express frustration; he did not hold back:

I've been working on fixing it. I have spent a huge amount of time working on the tty stuff trying to gradually get it sane without breaking anything and fixing security holes along the way as they came up. I spent the past two evenings working on the tty regressions.

However I've had enough. If you think that problem is easy to fix you fix it. Have fun.

The message included a patch removing Alan as the maintainer of the TTY layer.

And that is where things stand, as of this writing. The TTY code is unmaintained again, a promising rework has halted partway through, and the person most qualified to fix the problems has thrown up his hands and left the building (though it should be noted that he is participating in the conversation on how the next maintainer, whoever that might be, can fix things). Kernel development will go on, but development in this area will go rather more slowly; the TTY layer has claimed another victim.

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.31-rc4

Greg KH: Linux 2.6.30.3

Greg KH: Linux 2.6.27.28

Willy Tarreau: Linux 2.4.37.4

Core kernel code

Dave Hansen: flexible array implementation v4

Paul E. McKenney: RCU cleanups and simplified preemptable RCU

Jon Hunter: Dynamic Tick: Enabling longer sleep times on 32-bit

Pallipadi, Venkatesh: ondemand: Make kondemand workqueue run with dynamic sched priority

Martin Schwidefsky: clocksource / timekeeping rework V2

Development tools

Zhaolei: Add walltime support for ring-buffer

Catalin Marinas: Kmemleak patches for 2.6.32

Masami Hiramatsu: [PATCH -tip -v13 00/11] tracing: kprobe-based event tracer and x86 instruction decoder

Prerna Saxena: Hardware Breakpoint support for systemtap translator

Darrick J. Wong: ACPI 4.0 power meter

Xiao Guangrong: ftrace: add tracepoint for timer

Mel Gorman: Add some trace events for the page allocator

Robert Richter: oprofile: Performance counter multiplexing

Device drivers

Mark Allyn: Revsion 2 of the security processor kernel driver;

Mark Allyn: Restricted access regions register driver revision 2

Zhang Rui: introduce device async actions mechanism

Wan ZongShun: Add watchdog driver for w90p910

Mark Brown: WM831x drivers

Mark Brown: hwmon: WM831x PMIC hardware monitoring driver

Sascha Hauer: Add Support for Freescale FlexCAN CAN controller

Rafael J. Wysocki: [PATCH update] PM: Introduce core framework for run-time PM of I/O devices (rev. 11)

Jayamohan Kallickal: RFC: be iscsi driver

Documentation

Michael Kerrisk: man-pages-3.22 is released

Filesystems and block I/O

Ludwig Nussel: implement uid mount option for ext2 and ext3

Eric Paris: fanotify - overall design before I start sending patches

Vivek Goyal: IO scheduler based IO controller V7

Ryo Tsuruta: blkio-cgroup-v10: Introduction

Zachary Amsden: Allow userspace block device implementation

Jorg Schummer: fat: Save FAT root directory timestamps to volume label

Janitorial

Thomas Gleixner: Cleanup init_MUTEX[_LOCKED] / DECLARE_MUTEX

Memory management

Roland Dreier: ummunotify: Userspace support for MMU notifications

Networking

Hannes Eder: IPVS full NAT support + netfilter 'ipvs' match support

Virtualization and containers

Joerg Roedel: KVM: support for 1gb pages

Oren Laadan: c/r: checkpoint and restore FIFOs

Paul Menage: CGroup: Support for named and empty hierarchies

Benchmarks and bugs

Rafael J. Wysocki: 2.6.31-rc4: Reported regressions from 2.6.30

Rafael J. Wysocki: 2.6.31-rc4: Reported regressions 2.6.29 -> 2.6.30

Miscellaneous

Daniel Lezcano: lxc: linux container tools 0.6.3 release

Tetsuo Handa: TOMOYO's new userland tools package

Page editor: Jonathan Corbet