Brief items

The 2.6.31 kernel was released by Linus on September 9. A few of the major features in 2.6.31 include performance counter support, the "fsnotify" notification infrastructure, kernel mode setting for ATI Radeon chipsets, the kmemleak tool, support for character drivers in user space, USB 3 support, and much more. As always, see the KernelNewbies 2.6.31 page for a much more exhaustive list.
The last prepatch, 2.6.31-rc9, was released on September 5.
Kernel development news
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int preserve, int flags);
A call to reflink() causes newpath to look like a copy of oldpath. If preserve is REFLINK_ATTR_PRESERVE, then the entire security state of oldpath will be replicated for the new file; this is a privileged operation. Otherwise (if preserve is REFLINK_ATTR_NONE), newpath will get a new security state as if it were an entirely new file. The reflinkat() form adds the ability to supply the starting directories for relative paths and flags like the other *at() system calls. For more information, see the documentation file at the top of the reflink() patch.
Joel's patch adds reflink() support for the ocfs2 filesystem; it's not clear whether other filesystems will get reflink() support in 2.6.32 or not.
A stable debugfs?. Recurring linux-kernel arguments tend to focus on vitally important issues - like where debugfs should be mounted. The official word is that it belongs on /sys/kernel/debug, but there have been ongoing problems with rogue developers mounting it on unofficial places like /debug instead. Greg Kroah-Hartman defends /sys/kernel/debug by noting that debugfs is for kernel developers only; there's no reason for users to be interested in it.
Except, of course, that there is. The increasing utility of the ftrace framework is making it more interesting beyond kernel development circles. That led Steven Rostedt to make a suggestion:
Steven would like a new virtual filesystem for stable kernel ABIs which is easier to work with than sysfs and which can be mounted in a more typing-friendly location. Responses to the suggestion have been scarce so far; somebody will probably need to post a patch to get a real discussion going.
data=guarded. Chris Mason has posted a new version of the ext3 data=guarded mode patch. The guarded mode works to ensure that data blocks arrive on disk before any metadata changes which reference those blocks. The goal is to provide the performance benefits of the data=writeback mode while avoiding the potential information disclosure (after a crash) problems with that mode. Chris had mentioned in the past that he would like to merge this code for 2.6.32; the latest posting, though, suggests that some work still needs to be done, so it might not be ready in time.

As reported here, Con Kolivas recently resurfaced with a new CPU scheduler called "BFS". This scheduler, he said, addresses the problems which ail the mainline CFS scheduler; the biggest of these, it seems, is the prioritization of "scalability" over use on normal desktop systems. BFS was meant to put the focus back on user-level systems and, perhaps, make the case for supporting multiple schedulers in the kernel.
Since then, CFS creator Ingo Molnar has responded with a series of benchmark results comparing the two schedulers. Tests included kernel build times, pipe performance, messaging performance, and an online transaction processing test; graphs were posted showing how each scheduler performed on each test. Ingo's conclusion: "Alas, as it can be seen in the graphs, i can not see any BFS performance improvements, on this box." In fact, the opposite was true: BFS generally performed worse than the mainline scheduler.
Con's answer was best described as "dismissive":
[snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]
As far as your editor can tell, Con's objections to the results mirror those heard elsewhere: Ingo chose an atypical machine for his tests, and those tests, in any case, do not really measure the performance of a scheduler in a desktop situation. The more cynical observers seem to believe that Ingo is more interested in defending the current scheduler than improving the desktop experience for "normal" users.
The machine chosen was certainly at the high end of the "desktop" scale:
A number of people thought that this box is not a typical desktop Linux system. That may indeed be true - today. But, as Ingo (among others) has pointed out, it's important to be a little ahead of the curve when designing kernel subsystems:
Btw., that's why the Linux scheduler performs so well on quad core systems today - the groundwork for that was laid two years ago when scheduler developers were testing on quads. If we discovered fundamental problems on quads _today_ it would be way too late to help Linux users.
Partly in response to the criticisms, though, Ingo reran his tests on a single quad-core system, the same type of system as Con's box. The end results were just about the same.
The hardware used is irrelevant, though, if the benchmarks are not testing performance characteristics that desktop users care about. The concern here is latency: how long it takes before a runnable process can get its work done. If latencies are too high, audio or video streams will skip, the pointer will lag the mouse, scrolling will be jerky, and Maelstrom players will lose their ships. A number of Ingo's original tests were latency-related, and he added a couple more in the second round. So it looks like the benchmarks at least tried to measure the relevant quantity.
Benchmark results are not the same as a better desktop experience, though, and a number of users are reporting a "smoother" desktop when running with BFS. On the other hand, making significant scheduler changes in response to reports of subjective "feel" is a sure recipe for trouble: if one cannot measure improvement, one not only risks failing to fix any problems, one is also at significant risk of introducing performance regressions for other users. There has to be some sort of relatively objective way to judge scheduler improvements.
The way preferred by the current scheduler maintainers is to identify causes of latencies and fix them. The kernel's infrastructure for the identification of latency problems has improved considerably over the last year or two. One useful tool is latencytop, which collects data on what is delaying applications and presents the results to the user. The ftrace tracing framework is also able to create data on the delay between when a process is awakened and when it actually gets into the CPU; see this post from Frederic Weisbecker for an overview of how these measurements can be taken.
If there are real latency problems remaining in the Linux scheduler - and there are enough "BFS is better" reports to suggest that there are - then using the available tools to describe them seems like the right direction to take. Once the problem is better understood, it will be possible to consider possible remedies. It may well be that the mainline scheduler can be adjusted to make those problems go away. Or, possibly, a more radical sort of approach is necessary. But, without some understanding of the problem - and associated ability to measure it - attempted fixes seem a bit like a risky shot in the dark.
Ingo welcomed Con back to the development community and invited him to help improve the Linux scheduler. This seems unlikely to happen, though. Con's way of working has never meshed well with the kernel development community, and he is showing little sign of wanting to change that situation. That is unfortunate; he is a talented developer who could do a lot to improve Linux for an important user community. The adoption of the current CFS scheduler is a direct result of his earlier work, even if he did not write the code which was actually merged. In general, though, improving Linux requires working with the Linux development community; in the absence of a desire to do that effectively, there will be severe limits on what a developer will be able to accomplish.
(See also: Frans Pop's benchmark tests, which show decidedly mixed results.)
The staging tree has made a lot of progress since it appeared in June 2008. To start with, the tree itself quickly moved into the mainline in October 2008; it also has accumulated more than 40 drivers of various sorts. Staging is an outgrowth of the Linux Driver Project that is meant to collect drivers, and other "standalone" code such as filesystems, that are not yet ready for the mainline. But, it was never meant to be a "dumping ground for dead code", as staging maintainer Greg Kroah-Hartman put it in a recent status update. Code that is not being improved, so that it can move into the mainline, will be removed from the tree.
Some of the code that is, at least currently, slated for removal includes some fairly high-profile drivers, including one from Microsoft that was released with great fanfare in July. After a massive cleanup that resulted in more than 200 patches to get the code "into a semi-sane kernel coding style", Kroah-Hartman said that it may have to be removed in six months or so.
Microsoft is certainly not alone in Kroah-Hartman's report—which details the status of the tree for the upcoming 2.6.32 merge window—as several other large companies' drivers are in roughly the same boat. Drivers for Android hardware (staging/android) and Intel's Management Engine Interface (MEI) hardware (staging/heci), among others, were called out in the report. Both are slated for removal: android in 2.6.32 and heci in 2.6.33 (presumably). The latter provides an excellent example of how not to do Linux driver development.
Kroah-Hartman's lengthy report covers more than just drivers that may be removed; it also looks at those that have made progress, including some that should be moving to the mainline, as well as new drivers that are being added to staging. But the list of drivers that aren't being actively worked on is roughly as long as the other two lists combined, which is clearly suboptimal.
Presumably to see if folks read all the way through, Kroah-Hartman sprinkles a few laughs in an otherwise dry summary. For the me4000 and meilhaus drivers, he notes that there is no reason to continue those drivers "except to watch the RT guys squirm as they try to figure out the byzantine locking and build logic here (which certainly does count for something, cheap entertainment is always good.)"
He also notes several drivers that are in the inactive category, but are quite close to being merge-worthy. He suggests that developers looking for a way to contribute consider drivers such as asus_oled (Asus OLED display), frontier (Frontier digital audio workstation controller), line6 (PODxt Pro audio effects modeler), mimio (Mimio Xi interactive whiteboard), and panel (parallel port LCD/keypad). Each of those should be relatively easy to get into shape for inclusion in the mainline.
There are a fair number of new drivers being added for 2.6.32, including the Microsoft Hyper-V drivers (staging/hv) mentioned earlier, as well as VME bus drivers (staging/vme), the industrial I/O subsystem (staging/iio), and several wireless drivers (VIA vt6655 and vt6656, Realtek rtl8192e, and Ralink 3090). Also, "another COW driver" is being added: the Cowloop copy-on-write pseudo block driver (staging/cowloop).
Two of Evgeniy Polyakov's projects—mistakenly listed in the "new driver" section though they were added in 2.6.30—were also mentioned. The distributed storage (DST) network block device (staging/dst), which Kroah-Hartman notes may be "dead", is a candidate for removal, while the distributed filesystem POHMELFS (staging/pohmelfs) is mostly being worked on out-of-tree. Polyakov agrees that DST is not needed in the mainline, but is wondering about moving POHMELFS out of staging and into fs/. Since there are extensive changes on the way for POHMELFS, it is unlikely to move out of staging for another few kernel releases at least.
There was also praise for the work on various drivers which have been actively worked on over the last few months. Bartlomiej Zolnierkiewicz was singled out for his work on rt* and rtl* wireless drivers (which put him atop the list of most active 2.6.31 developers), along with Alan Cox for work on the et131x driver for the Agere gigabit Ethernet adapter. Johannes Berg noted that much of Zolnierkiewicz's work on the rt* drivers "will have been in vain" because of the progress being made by the rt2x00 project. But that doesn't faze Zolnierkiewicz:
In the meantime (before clean and proper support becomes useful) Linux users are provided with the possibility to use their hardware before it becomes obsolete.
At least one developer stepped up to work on one of the inactive drivers (asus_oled) in the thread. In addition, Willy Tarreau mentioned that he had heard from another who was working on panel, telling Kroah-Hartman: "This proves that the principle of the staging tree seems to work".
Overall, the staging tree seems to be doing exactly what Kroah-Hartman and others envisioned. Adding staging into the mainline, which raised the profile and availability of those drivers, has led to a fair amount of cleanup work, some of which has resulted in the drivers themselves moving out of staging and into the mainline. Some drivers seem to be falling by the wayside, but one would guess that Kroah-Hartman would welcome them back into the tree should anyone show up to work on them. In the meantime, the code certainly hasn't suffered from whatever fixes various kernel hackers found time to do. Those changes will be waiting for anyone who wants to pick that code back up, even if it is no longer part of staging.
O_PONIES. The recent controversy was the most heated in recent memory, but not out of character for fsync()-related discussions. In this article, we'll explore the relationship between file systems developers, the POSIX file I/O standard, and people who just want to store their data.
As we know now, the new I/O interface was a hit. It turned out to be a
portable, versatile, simple paradigm that made modular software
development much easier. It was by no means perfect, of course: a
number of warts revealed themselves over time, not all of which were
removed before the interface was codified into the POSIX
specification. One example is directory hard links, which permit the
creation of a directory cycle - a directory that is a descendant of
itself - and its subsequent detachment from the file system hierarchy,
resulting in allocated but inaccessible directories and files.
Recording the time of the last access time - atime - turns every read
into a tiny write. And don't forget the apocryphal quote from Ken
Thompson when asked if he'd do anything differently if he were
designing UNIX today: "If I had to do it over again? Hmm... I guess
I'd spell 'creat' with an 'e'". (That's the
system call to create a new file.) But overall, the UNIX file system
interface is a huge success.
Why, then, does the topic of when file system data is guaranteed to be
"on disk" suddenly turn file systems developers into pedantic
POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes
down to this: Waiting for data to actually hit disk before returning
from a system call is a losing game for file system performance. As
the most extreme example, the original synchronous version of the UNIX
file system frequently used only 3-5% of the disk throughput. Nearly
every file system performance improvement since then has been
primarily the result of saving up writes so that we can allocate and
write them out as a group. As file systems developers, we are going to look for every loophole in fsync() and squirm our way out of writing data to disk any sooner than we have to.
Fortunately for the file systems developers, the POSIX specification
is so very minimal that it doesn't even mention the topic of file
system behavior after a system crash. After all, the original
FFS-style file systems (e.g., ext2) can theoretically lose your entire
file system after a crash, and are still POSIX-compliant. Ironically,
as file systems developers, we spend 90% of our brain power coming up
with ways to quickly recover file system consistency after system
crash! No wonder file systems users are irked when we define file
system metadata as important enough to keep consistent, but not file
data - we take care of our own so well. File systems developers have
magnanimously conceded, though, that on return from fsync(), and only fsync(), and only on a file system with the right mount options, the changes to that file will be available if the system crashes after that point.
At the same time,
fsync() is often more expensive than it
absolutely needs to be. The easiest way to
fsync() is to force out every outstanding write
to the file system, regardless of whether it is a journaling file
system, a COW file system, or a file system with no crash recovery
mechanism whatsoever. This is because it is very difficult to map
backward from a given file to the dirty file system blocks needing to
be written to disk in order to create a consistent file system
containing those changes. For example, the block containing the
bitmap for newly allocated file data blocks may also have been changed
by a later allocation for a different file, which then requires that
we also write out the indirect blocks pointing to the data for that
second file, which changes another bitmap block... When you solve the
problem of tracing specific dependencies of any particular write, you
end up with the complexity
of soft updates. No
surprise then, that most file systems take the brute force approach,
with the result that
fsync() commonly takes time
proportional to all outstanding writes to the file system.
So, now we have the following situation: fsync() is required to guarantee that file data is on stable storage, but it may
perform arbitrarily poorly, depending on what other activity is going
on in the file system. Given this situation, application developers
came to rely on what is, on the face of it, a completely reasonable assumption: a rename() of one file over another will either
result in the contents of the old file, or the contents of the new
file as of the time of the
rename(). This is a subtle
and interesting optimization: rather than asking the file system to
synchronously write the data, it is instead a request to order the
writes to the file system. Ordering writes is far easier for the file
system to do efficiently than synchronous writes.
However, the ordering effect of
rename() turns out to be
a file system specific implementation side effect. It only works when
changes to the file data in the file system are ordered with respect
to changes in the file system metadata. In ext3/4, this is only true
when the file system is mounted with the data=ordered mount option - a name which hopefully makes more sense now! Up until 2.6.30, data=ordered was the default journal mode for
ext3, which, in turn, was the default file system for Linux; as a result,
ext3 data=ordered was all that
many Linux application developers had any experience with. During the
Great File System Upheaval of 2.6.30, the default journal mode for
ext3 changed to
data=writeback, which means that file
data will get written to disk when the file system feels like it, very
likely after the file's metadata specifying where its contents are
located has been written to disk. This not only breaks the rename() ordering assumption, but also means that the newly renamed file may contain arbitrary garbage - or a copy of /etc/shadow, making this a security hole as well as a
data corruption problem.
Which brings us to the present
controversy, in which many file systems developers argue that
applications should explicitly call fsync() before renaming a file if they want the file's data to be on disk before the
rename takes effect - a position which seems bizarre and random until
you understand the individual decisions, each perfectly reasonable,
that piled up to create the current situation. Personally, as a file
systems developer, I think it is counterproductive to replace a
performance-friendly implicit ordering request in the form of
rename() with an impossible-to-optimize fsync(). It may not be POSIX, but the
programmer's intent is clear - no one ever, ever wrote
"creat(); write(); close(); rename();" and hoped they
would get an empty file if the system crashed during the next 5
minutes. That's what
truncate() is for. A generalized
"O_PONIES do-what-I-want" flag is indeed not possible,
but in this case, it is to the file systems developers' benefit to
extend the semantics of
rename() to imply ordering so
that we reduce the number of
fsync() calls we have to cope
with. (And, I have to note, I did have a real, live pony when I was a
kid, so I tend to be on the side of giving programmers ponies when
they ask for them.)
My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize applications developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet
Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds