
Kernel development

Brief items

Kernel release status

The 2.6.31 kernel is out, released by Linus on September 9. A few of the major features in 2.6.31 include performance counter support, the "fsnotify" notification infrastructure, kernel mode setting for ATI Radeon chipsets, the kmemleak tool, char drivers in user space support, USB 3 support, and much more. As always, see the KernelNewbies 2.6.31 page for a much more exhaustive list.

The last prepatch, 2.6.31-rc9, was released on September 5.

The current stable kernel update was released (along with an update to the older stable series) on September 8. Both contain a long list of fixes, many of which are in the KVM subsystem.

Comments (6 posted)

Kernel development news

Quotes of the week

After reading more and more about BFS, I've realized that it's the Fight Club of schedulers. You do not talk about BFS on linux-kernel. BFS does not benchmark, it does not keep score, it has no leaderboard. BFS only exists in the time between when Flash Player starts and when Flash Player crashes.
-- Wesley Felter

My life's project is to hunt down the guy who invented mail client wordwrapping, set him on fire then dance on his ashes.
-- Andrew Morton (Thanks to Nikanth K)

Linux is a 18+ years old kernel, there's not that many easy projects left in it anymore :-/ Core kernel features that look basic and which are not in Linux yet often turn out to be not that simple.
-- Ingo Molnar

Checkpoint/restart has traditionally been interesting in the mainframe and supercomputer space. These environments have very different security profiles from a user desktop. No one at the [.......] National Supercomputer Centre cares if you can save your rogue game as soon as you pick up the Amulet of Yendor and restart it if you get killed on the way up. These environments are concerned with leaking data between the groups that have funded the facility, which is why they are very often customers of advanced access control technologies. I don't know that I see a really good security story for [checkpoint/restart] in the desktop space, and as Russell points out, there are plenty of opportunities to exploit the feature.
-- Casey Schaufler

Comments (1 posted)

In brief

By Jonathan Corbet
September 9, 2009
reflink() for 2.6.32. Joel Becker's announcement of his 2.6.32 ocfs2 merge plans included a mention that the reflink() system call would be merged alongside the ocfs2 changes. A call to reflink() creates a lightweight copy, wherein both files share the same blocks in a copy-on-write mode. The final reflink() API looks like this:

    int reflink(const char *oldpath, const char *newpath, int preserve);
    int reflinkat(int olddirfd, const char *oldpath,
                  int newdirfd, const char *newpath,
                  int preserve, int flags);

A call to reflink() causes newpath to look like a copy of oldpath. If preserve is REFLINK_ATTR_PRESERVE, then the entire security state of oldpath will be replicated for the new file; this is a privileged operation. Otherwise (if preserve is REFLINK_ATTR_NONE), newpath will get a new security state as if it were an entirely new file. The reflinkat() form adds the ability to supply the starting directories for relative paths and flags like the other *at() system calls. For more information, see the documentation file at the top of the reflink() patch.

Joel's patch adds reflink() support for the ocfs2 filesystem; it's not clear whether other filesystems will get reflink() support in 2.6.32 or not.

A stable debugfs? Recurring linux-kernel arguments tend to focus on vitally important issues - like where debugfs should be mounted. The official word is that it belongs at /sys/kernel/debug, but there have been ongoing problems with rogue developers mounting it in unofficial places like /debug instead. Greg Kroah-Hartman defends /sys/kernel/debug by noting that debugfs is for kernel developers only; there's no reason for users to be interested in it.

Except, of course, that there is. The increasing utility of the ftrace framework is making it more interesting beyond kernel development circles. That led Steven Rostedt to make a suggestion:

I think that the tracing system has matured beyond a "debug" level and is being enabled on production systems. Both fedora and debian are now shipping kernels with it enabled. Perhaps we should create another pseudo fs that can be like debugfs but for stable ABIs. A new interface could start out in debugfs, but when it has reached a stable interface, then it could be moved to another location to signal this.

Steven would like a new virtual filesystem for stable kernel ABIs which is easier to work with than sysfs and which can be mounted in a more typing-friendly location. Responses to the suggestion have been scarce so far; somebody will probably need to post a patch to get a real discussion going.
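In the meantime, getting at debugfs in its official location is a one-liner; a sketch, assuming root privileges and a kernel built with CONFIG_DEBUG_FS:

```shell
# Mount debugfs at the official location; many current distributions
# do not do this by default.
mount -t debugfs none /sys/kernel/debug

# Or make it permanent with an /etc/fstab entry:
#   none  /sys/kernel/debug  debugfs  defaults  0  0
ls /sys/kernel/debug
```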

data=guarded. Chris Mason has posted a new version of the ext3 data=guarded mode patch. The guarded mode works to ensure that data blocks arrive on disk before any metadata changes which reference those blocks. The goal is to provide the performance benefits of the data=writeback mode while avoiding the potential information disclosure (after a crash) problems with that mode. Chris had mentioned in the past that he would like to merge this code for 2.6.32; the latest posting, though, suggests that some work still needs to be done, so it might not be ready in time.

Comments (1 posted)

Some notes from the BFS discussion

By Jonathan Corbet
September 9, 2009
As was recently reported here, Con Kolivas has resurfaced with a new CPU scheduler called "BFS". This scheduler, he said, addresses the problems which ail the mainline CFS scheduler; the biggest of these, it seems, is the prioritization of "scalability" over use on normal desktop systems. BFS is meant to put the focus back on ordinary desktop systems and, perhaps, make the case for supporting multiple schedulers in the kernel.

Since then, CFS creator Ingo Molnar has responded with a series of benchmark results comparing the two schedulers. Tests included kernel build times, pipe performance, messaging performance, and an online transaction processing test; graphs were posted showing how each scheduler performed on each test. Ingo's conclusion: "Alas, as it can be seen in the graphs, i can not see any BFS performance improvements, on this box." In fact, the opposite was true: BFS generally performed worse than the mainline scheduler.

Con's answer was best described as "dismissive":

/me sees Ingo run off to find the right combination of hardware and benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]

As far as your editor can tell, Con's objections to the results mirror those heard elsewhere: Ingo chose an atypical machine for his tests, and those tests, in any case, do not really measure the performance of a scheduler in a desktop situation. The more cynical observers seem to believe that Ingo is more interested in defending the current scheduler than improving the desktop experience for "normal" users.

The machine chosen was certainly at the high end of the "desktop" scale:

So the testbox i picked fits into the upper portion of what i consider a sane range of systems to tune for - and should still fit into BFS's design bracket as well according to your description: it's a dual quad core system with hyperthreading. It has twice as many cores as the quad you tested on but it's not excessive and certainly does not have 4096 CPUs.

A number of people thought that this box is not a typical desktop Linux system. That may indeed be true - today. But, as Ingo (among others) has pointed out, it's important to be a little ahead of the curve when designing kernel subsystems:

But when it comes to scheduler design and merge decisions that will trickle down and affect users 1-2 years down the line (once it gets upstream, once distros use the new kernels, once users install the new distros, etc.), i have to "look ahead" quite a bit (1-2 years) in terms of the hardware spectrum.

Btw., that's why the Linux scheduler performs so well on quad core systems today - the groundwork for that was laid two years ago when scheduler developers were testing on a quads. If we discovered fundamental problems on quads _today_ it would be way too late to help Linux users.

Partly in response to the criticisms, though, Ingo reran his tests on a single quad-core system, the same type of system as Con's box. The end results were just about the same.

The hardware used is irrelevant, though, if the benchmarks are not testing performance characteristics that desktop users care about. The concern here is latency: how long it takes before a runnable process can get its work done. If latencies are too high, audio or video streams will skip, the pointer will lag the mouse, scrolling will be jerky, and Maelstrom players will lose their ships. A number of Ingo's original tests were latency-related, and he added a couple more in the second round. So it looks like the benchmarks at least tried to measure the relevant quantity.

Benchmark results are not the same as a better desktop experience, though, and a number of users are reporting a "smoother" desktop when running with BFS. On the other hand, making significant scheduler changes in response to reports of subjective "feel" is a sure recipe for trouble: if one cannot measure improvement, one not only risks failing to fix any problems, one is also at significant risk of introducing performance regressions for other users. There has to be some sort of relatively objective way to judge scheduler improvements.

The way preferred by the current scheduler maintainers is to identify causes of latencies and fix them. The kernel's infrastructure for the identification of latency problems has improved considerably over the last year or two. One useful tool is latencytop, which collects data on what is delaying applications and presents the results to the user. The ftrace tracing framework is also able to create data on the delay between when a process is awakened and when it actually gets into the CPU; see this post from Frederic Weisbecker for an overview of how these measurements can be taken.
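As a sketch of the second approach: the ftrace "wakeup" tracer records the longest delay between a process being awakened and actually running. The commands below assume debugfs is mounted at /sys/kernel/debug and the kernel was built with CONFIG_SCHED_TRACER:

```shell
cd /sys/kernel/debug/tracing
echo wakeup > current_tracer    # trace wakeup-to-run latency
echo 0 > tracing_max_latency    # reset the recorded worst case
echo 1 > tracing_enabled        # start tracing
sleep 10                        # run the workload of interest
echo 0 > tracing_enabled        # stop tracing
cat tracing_max_latency         # worst latency seen, in microseconds
head trace                      # trace of that worst-case wakeup
```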

If there are real latency problems remaining in the Linux scheduler - and there are enough "BFS is better" reports to suggest that there are - then using the available tools to describe them seems like the right direction to take. Once the problem is better understood, it will be possible to consider possible remedies. It may well be that the mainline scheduler can be adjusted to make those problems go away. Or, possibly, a more radical sort of approach is necessary. But, without some understanding of the problem - and associated ability to measure it - attempted fixes seem a bit like a risky shot in the dark.

Ingo welcomed Con back to the development community and invited him to help improve the Linux scheduler. This seems unlikely to happen, though. Con's way of working has never meshed well with the kernel development community, and he is showing little sign of wanting to change that situation. That is unfortunate; he is a talented developer who could do a lot to improve Linux for an important user community. The adoption of the current CFS scheduler is a direct result of his earlier work, even if he did not write the code which was actually merged. In general, though, improving Linux requires working with the Linux development community; in the absence of a desire to do that effectively, there will be severe limits on what a developer will be able to accomplish.

(See also: Frans Pop's benchmark tests, which show decidedly mixed results.)

Comments (25 posted)

News from the staging tree

By Jake Edge
September 9, 2009

The staging tree has made a lot of progress since it appeared in June 2008. To start with, the tree itself quickly moved into the mainline in October 2008; it has also accumulated more than 40 drivers of various sorts. Staging is an outgrowth of the Linux Driver Project; it is meant to collect drivers and other standalone code, such as filesystems, that are not yet ready for the mainline. But it was never meant to be a "dumping ground for dead code", as staging maintainer Greg Kroah-Hartman put it in a recent status update. Code that is not being improved, so that it can move into the mainline, will be removed from the tree.

Some of the code that is, at least currently, slated for removal includes some fairly high-profile drivers, including one from Microsoft that was released with great fanfare in July. After a massive cleanup that resulted in more than 200 patches to get the code "into a semi-sane kernel coding style", Kroah-Hartman said that it may have to be removed in six months or so:

Unfortunately the Microsoft developers seem to have disappeared, and no one is answering my emails. If they do not show back up to claim this driver soon, it will be removed in the 2.6.33 release. So sad...

Microsoft is certainly not alone in Kroah-Hartman's report—which details the status of the tree for the upcoming 2.6.32 merge window—as several other large companies' drivers are in roughly the same boat. Drivers for Android hardware (staging/android) and Intel's Management Engine Interface (MEI) hardware (staging/heci), among others, were called out in the report. Both are slated for removal: android in 2.6.32, and heci in 2.6.33 (presumably). The latter provides an excellent example of how not to do Linux driver development:

A wonderful example of a company throwing code over the wall, watching it get rejected, and then running away as fast as possible, all the while yelling over their shoulder, "it's required on all new systems, you will love it!" We don't, it sucks, either fix it up, or I am removing it.

Kroah-Hartman's lengthy report covers more than just drivers that may be removed; it also looks at those that have made progress, including some that should be moving to the mainline, as well as new drivers that are being added to staging. But the list of drivers that aren't being actively worked on is roughly as long as the other two lists combined, which is clearly suboptimal.

Presumably to see if folks read all the way through, Kroah-Hartman sprinkles a few laughs in an otherwise dry summary. For the me4000 and meilhaus drivers, he notes that there is no reason to continue those drivers "except to watch the RT guys squirm as they try to figure out the byzantine locking and build logic here (which certainly does count for something, cheap entertainment is always good.)"

He also notes several drivers that are in the inactive category, but are quite close to being merge-worthy. He suggests that developers looking for a way to contribute consider drivers such as asus_oled (Asus OLED display), frontier (Frontier digital audio workstation controller), line6 (PODxt Pro audio effects modeler), mimio (Mimio Xi interactive whiteboard), and panel (parallel port LCD/keypad). Each of those should be relatively easy to get into shape for inclusion in the mainline.

There are a fair number of new drivers being added for 2.6.32, including the Microsoft Hyper-V drivers (staging/hv) mentioned earlier, as well as VME bus drivers (staging/vme), the industrial I/O subsystem (staging/iio), and several wireless drivers (VIA vt6655 and vt6656, Realtek rtl8192e, and Ralink 3090). Also, "another COW driver" is being added: the Cowloop copy-on-write pseudo block driver (staging/cowloop).

Two of Evgeniy Polyakov's projects—mistakenly listed in the "new driver" section though they were added in 2.6.30—were also mentioned. The distributed storage (DST) network block device (staging/dst), which Kroah-Hartman notes may be "dead", is a candidate for removal, while the distributed filesystem POHMELFS (staging/pohmelfs) is mostly being worked on out of tree. Polyakov agrees that DST is not needed in the mainline, but is wondering about moving POHMELFS out of staging and into fs/. Since there are extensive changes on the way for POHMELFS, it is unlikely to move out of staging for at least another few kernel releases.

There was also praise for various drivers which have been actively worked on over the last few months. Bartlomiej Zolnierkiewicz was singled out for his work on the rt* and rtl* wireless drivers (which put him atop the list of most active 2.6.31 developers), along with Alan Cox for his work on the et131x driver for the Agere gigabit Ethernet adapter. Johannes Berg noted that much of Zolnierkiewicz's work on the rt* drivers "will have been in vain" because of the progress being made by the rt2x00 project. But that doesn't faze Zolnierkiewicz:

The end goal of this work has always been having native rt2x00 support for all those chipsets (as have been explained multiple times). If this means that one day we will delete all Ralink drivers in staging in favor of proper wireless drivers -- fine with me.

In the meantime (before clean and proper support becomes useful) Linux users are provided with the possibility to use their hardware before it becomes obsolete.

At least one developer stepped up to work on one of the inactive drivers (asus_oled) in the thread. In addition, Willy Tarreau mentioned that he had heard from another who was working on panel, telling Kroah-Hartman: "This proves that the principle of the staging tree seems to work".

Overall, the staging tree seems to be doing exactly what Kroah-Hartman and others envisioned. Adding staging into the mainline, which raised the profile and availability of those drivers, has led to a fair amount of cleanup work, some of which has resulted in the drivers themselves moving out of staging and into the mainline. Some drivers seem to be falling by the wayside, but one would guess that Kroah-Hartman would welcome them back into the tree should anyone show up to work on them. In the meantime, the code certainly hasn't suffered from whatever fixes various kernel hackers found time to do. Those changes will be waiting for anyone who wants to pick that code back up, even if it is no longer part of staging.

Comments (11 posted)

POSIX v. reality: A position on O_PONIES

September 9, 2009

This article was contributed by Valerie Aurora (formerly Henson)

Sure, programmers (especially operating systems programmers) love their specifications. Clean, well-defined interfaces are a key element of scalable software development. But what is it about file systems, POSIX, and when file data is guaranteed to hit permanent storage that brings out the POSIX fundamentalist in all of us? The recent fsync()/rename()/O_PONIES controversy was the most heated in recent memory but not out of character for fsync()-related discussions. In this article, we'll explore the relationship between file systems developers, the POSIX file I/O standard, and people who just want to store their data.

In the beginning, there was creat()

Like many practical interfaces (including HTML and TCP/IP), the POSIX file system interface was implemented first and specified second. UNIX was written beginning in 1969; the first release of the POSIX specification for the UNIX file I/O interface (IEEE Standard 1003.1) was released in 1988. Before UNIX, application access to non-volatile storage (e.g., a spinning drum) was a decidedly application- and hardware-specific affair. Record-based file I/O was a common paradigm, growing naturally out of punch cards, and each kind of file was treated differently. The new interface was designed by a few guys (Ken Thompson, Dennis Ritchie, et alia) screwing around with their new machine, writing an operating system that would make it easier to, well, write more operating systems.

As we know now, the new I/O interface was a hit. It turned out to be a portable, versatile, simple paradigm that made modular software development much easier. It was by no means perfect, of course: a number of warts revealed themselves over time, not all of which were removed before the interface was codified into the POSIX specification. One example is directory hard links, which permit the creation of a directory cycle - a directory that is a descendant of itself - and its subsequent detachment from the file system hierarchy, resulting in allocated but inaccessible directories and files. Recording the time of last access - atime - turns every read into a tiny write. And don't forget the apocryphal quote from Ken Thompson when asked if he'd do anything differently if he were designing UNIX today: "If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'". (That's the creat() system call to create a new file.) But overall, the UNIX file system interface is a huge success.

POSIX file I/O today: Ponies and fsync()

Over time, various more-or-less portable additions have accreted around the standard set of POSIX file I/O interfaces; they have been occasionally standardized and added to the canon - revelations from latter-day prophets. Some examples off the top of my head include pread()/pwrite(), direct I/O, file preallocation, extended attributes, access control lists (ACLs) of every stripe and color, and a vast array of mount-time options. While these additions are often debated and implemented in incompatible forms, in most cases no one is trying to oppose them purely on the basis of not being present in a standard written in 1988. Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.

Why, then, does the topic of when file system data is guaranteed to be "on disk" suddenly turn file systems developers into pedantic POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes down to this: Waiting for data to actually hit disk before returning from a system call is a losing game for file system performance. As the most extreme example, the original synchronous version of the UNIX file system frequently used only 3-5% of the disk throughput. Nearly every file system performance improvement since then has been primarily the result of saving up writes so that we can allocate and write them out as a group. As file systems developers, we are going to look for every loophole in fsync() and squirm our way through it.

Fortunately for the file systems developers, the POSIX specification is so very minimal that it doesn't even mention the topic of file system behavior after a system crash. After all, the original FFS-style file systems (e.g., ext2) can theoretically lose your entire file system after a crash, and are still POSIX-compliant. Ironically, as file systems developers, we spend 90% of our brain power coming up with ways to quickly recover file system consistency after system crash! No wonder file systems users are irked when we define file system metadata as important enough to keep consistent, but not file data - we take care of our own so well. File systems developers have magnanimously conceded, though, that on return from fsync(), and only from fsync(), and only on a file system with the right mount options, the changes to that file will be available if the system crashes after that point.

At the same time, fsync() is often more expensive than it absolutely needs to be. The easiest way to implement fsync() is to force out every outstanding write to the file system, regardless of whether it is a journaling file system, a COW file system, or a file system with no crash recovery mechanism whatsoever. This is because it is very difficult to map backward from a given file to the dirty file system blocks needing to be written to disk in order to create a consistent file system containing those changes. For example, the block containing the bitmap for newly allocated file data blocks may also have been changed by a later allocation for a different file, which then requires that we also write out the indirect blocks pointing to the data for that second file, which changes another bitmap block... When you solve the problem of tracing specific dependencies of any particular write, you end up with the complexity of soft updates. No surprise then, that most file systems take the brute force approach, with the result that fsync() commonly takes time proportional to all outstanding writes to the file system.

So, now we have the following situation: fsync() is required to guarantee that file data is on stable storage, but it may perform arbitrarily poorly, depending on what other activity is going on in the file system. Given this situation, application developers came to rely on what is, on the face of it, a completely reasonable assumption: rename() of one file over another will either result in the contents of the old file, or the contents of the new file as of the time of the rename(). This is a subtle and interesting optimization: rather than asking the file system to synchronously write the data, it is instead a request to order the writes to the file system. Ordering writes is far easier for the file system to do efficiently than synchronous writes.

However, the ordering effect of rename() turns out to be a file system specific implementation side effect. It only works when changes to the file data in the file system are ordered with respect to changes in the file system metadata. In ext3/4, this is only true when the file system is mounted with the data=ordered mount option - a name which hopefully makes more sense now! Up until recently, data=ordered was the default journal mode for ext3, which, in turn, was the default file system for Linux; as a result, ext3 data=ordered was all that many Linux application developers had any experience with. During the Great File System Upheaval of 2.6.30, the default journal mode for ext3 changed to data=writeback, which means that file data will get written to disk when the file system feels like it, very likely after the file's metadata specifying where its contents are located has been written to disk. This not only breaks the rename() ordering assumption, but also means that the newly renamed file may contain arbitrary garbage - or a copy of /etc/shadow, making this a security hole as well as a data corruption problem.
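Users who relied on the old behavior can pin the journal mode explicitly at mount time; a hypothetical /etc/fstab line (the device and mount point here are placeholders) would look like:

```
/dev/sda1  /  ext3  errors=remount-ro,data=ordered  0  1
```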

Which brings us to the present day fsync/rename/O_PONIES controversy, in which many file systems developers argue that applications should explicitly call fsync() before renaming a file if they want the file's data to be on disk before the rename takes effect - a position which seems bizarre and random until you understand the individual decisions, each perfectly reasonable, that piled up to create the current situation. Personally, as a file systems developer, I think it is counterproductive to replace a performance-friendly implicit ordering request in the form of a rename() with an impossible to optimize fsync(). It may not be POSIX, but the programmer's intent is clear - no one ever, ever wrote "creat(); write(); close(); rename();" and hoped they would get an empty file if the system crashed during the next 5 minutes. That's what truncate() is for. A generalized "O_PONIES do-what-I-want" flag is indeed not possible, but in this case, it is to the file systems developers' benefit to extend the semantics of rename() to imply ordering so that we reduce the number of fsync() calls we have to cope with. (And, I have to note, I did have a real, live pony when I was a kid, so I tend to be on the side of giving programmers ponies when they ask for them.)

My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize applications developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.

Comments (119 posted)

Patches and updates

Kernel trees


Build system

Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management



Virtualization and containers

Benchmarks and bugs


Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds