Brief items
The 2.6.31 kernel is out, released by Linus on September 9. A few of the
major features in 2.6.31 include performance counter support, the
"fsnotify" notification infrastructure, kernel mode setting for ATI Radeon
chipsets, the kmemleak tool, support for char drivers in user space, USB 3
support, and much more. As always, see the KernelNewbies 2.6.31 page for a
much more exhaustive list.
The last prepatch, 2.6.31-rc9, was released on September 5.
The current stable kernel is 2.6.30.6, released (along with 2.6.27.32 and 2.6.27.33) on September 8.
Both contain a long list of fixes, many of which are in the KVM subsystem.
Kernel development news
After reading more and more about BFS, I've realized that it's the
Fight Club of schedulers. You do not talk about BFS on
linux-kernel. BFS does not benchmark, it does not keep score, it
has no leaderboard. BFS only exists in the time between when Flash
Player starts and when Flash Player crashes.
--
Wesley Felter
My life's project is to hunt down the guy who invented mail client
wordwrapping, set him on fire then dance on his ashes.
--
Andrew Morton (Thanks to Nikanth K)
Linux is a 18+ years old kernel, there's not that many easy
projects left in it anymore :-/ Core kernel features that look
basic and which are not in Linux yet often turn out to be not that
simple.
--
Ingo Molnar
Checkpoint/restart has traditionally been interesting in the
mainframe and supercomputer space. These environments have very
different security profiles from a user desktop. No one at the
[.......] National Supercomputer Centre cares if you can save your
rogue game as soon as you pick up the Amulet of Yendor and restart
it if you get killed on the way up. These environments are
concerned with leaking data between the groups that have funded the
facility, which is why they are very often customers of advanced
access control technologies. I don't know that I see a really good
security story for [checkpoint/restart] in the desktop space, and as Russell points
out, there are plenty of opportunities to exploit the feature.
--
Casey Schaufler
By Jonathan Corbet
September 9, 2009
reflink() for 2.6.32. Joel Becker's announcement of his 2.6.32 ocfs2 merge
plans included a mention that the reflink() system call would be merged
alongside the ocfs2 changes. A call to reflink() creates a lightweight
copy, wherein both files share the same blocks in a copy-on-write mode. The
final reflink() API looks like this:
    int reflink(const char *oldpath, const char *newpath, int preserve);
    int reflinkat(int olddirfd, const char *oldpath,
                  int newdirfd, const char *newpath,
                  int preserve, int flags);
A call to reflink() causes newpath to look like a copy of
oldpath. If preserve is REFLINK_ATTR_PRESERVE,
then the entire security state of oldpath will be replicated for
the new file; this is a privileged operation. Otherwise (if
preserve is REFLINK_ATTR_NONE), newpath will get
a new security state as if it were an entirely new file. The
reflinkat() form adds the ability to supply the starting
directories for relative paths and flags like the other *at()
system calls. For more information, see the documentation file at the top
of the reflink() patch.
Joel's patch adds reflink() support for the ocfs2 filesystem; it's
not clear whether other filesystems will get reflink() support in
2.6.32 or not.
A stable debugfs? Recurring linux-kernel arguments tend to focus
on vitally important issues - like where debugfs should be mounted. The
official word is that it belongs on /sys/kernel/debug, but there
have been ongoing problems with rogue developers mounting it in unofficial places
like /debug instead. Greg Kroah-Hartman defends /sys/kernel/debug by noting
that debugfs is for kernel developers only; there's no reason for users to
be interested in it.
Except, of course, that there is. The increasing utility of the ftrace
framework is making it more interesting beyond kernel development circles.
That led Steven Rostedt to make a suggestion:
I think that the tracing system has matured beyond a "debug" level
and is being enabled on production systems. Both fedora and debian
are now shipping kernels with it enabled. Perhaps we should create
another pseudo fs that can be like debugfs but for stable ABIs. A
new interface could start out in debugfs, but when it has reached
a stable interface, then it could be moved to another location to
signal this.
Steven would like a new virtual filesystem for stable kernel ABIs
which is easier to work with than sysfs and which can be mounted in a more
typing-friendly location. Responses to the suggestion have been scarce so
far; somebody will probably need to post a patch to get a real discussion
going.
data=guarded. Chris Mason has posted a new version of the ext3
data=guarded mode patch. The guarded mode works to ensure that data
blocks arrive on disk before any metadata changes which reference those
blocks. The goal is to provide the performance benefits of the
data=writeback mode while avoiding the potential information disclosure
(after a crash) problems with that mode. Chris had mentioned in the past
that he would like to merge this code for 2.6.32; the latest posting,
though, suggests that some work still needs to be done, so it might not be
ready in time.
By Jonathan Corbet
September 9, 2009
As was recently reported here, Con Kolivas has resurfaced with a new CPU
scheduler called "BFS". This scheduler, he said, addresses the
problems which ail the mainline CFS scheduler; the biggest of these, it
seems, is the prioritization of "scalability" over use on normal desktop
systems. BFS was meant to put the focus back on user-level systems and,
perhaps, make the case for supporting multiple schedulers in the kernel.
Since then, CFS creator Ingo Molnar has responded with a series of
benchmark results comparing the two schedulers. Tests included kernel
build times, pipe performance, messaging performance, and an online
transaction processing test; graphs were posted showing how each scheduler
performed on each test. Ingo's conclusion: "Alas, as it can be seen
in the graphs, i can not see any BFS performance improvements, on this
box." In fact, the opposite was true: BFS generally performed
worse than the mainline scheduler.
Con's answer was best described as "dismissive":
/me sees Ingo run off to find the right combination of hardware and
benchmark to prove his point.
[snip lots of bullshit meaningless benchmarks showing how great cfs
is and/or how bad bfs is, along with telling people they should use
these artificial benchmarks to determine how good it is,
demonstrating yet again why benchmarks fail the desktop]
As far as your editor can tell, Con's objections to the results mirror
those heard elsewhere: Ingo chose an atypical machine for his tests, and
those tests, in any case, do not really measure the performance of a
scheduler in a desktop situation. The more cynical observers seem to
believe that Ingo is more interested in defending the current scheduler
than improving the desktop experience for "normal" users.
The machine chosen was certainly at the high end of the "desktop" scale:
So the testbox i picked fits into the upper portion of what i
consider a sane range of systems to tune for - and should still fit
into BFS's design bracket as well according to your description:
it's a dual quad core system with hyperthreading. It has twice as
many cores as the quad you tested on but it's not excessive and
certainly does not have 4096 CPUs.
A number of people thought that this box is not a typical desktop Linux
system. That may indeed be true - today. But, as Ingo (among others) has
pointed out, it's important to be a little
ahead of the curve when designing kernel subsystems:
But when it comes to scheduler design and merge decisions that will
trickle down and affect users 1-2 years down the line (once it gets
upstream, once distros use the new kernels, once users install the
new distros, etc.), i have to "look ahead" quite a bit (1-2 years)
in terms of the hardware spectrum.
Btw., that's why the Linux scheduler performs so well on quad core
systems today - the groundwork for that was laid two years ago when
scheduler developers were testing on a quads. If we discovered
fundamental problems on quads _today_ it would be way too late to
help Linux users.
Partly in response to the criticisms, though, Ingo reran his tests on a single quad-core system,
the same type of system as Con's box. The end results were just about the
same.
The hardware used is irrelevant, though, if the benchmarks are not testing
performance characteristics that desktop users care about. The concern
here is latency: how long it takes before a runnable process can get its
work done. If latencies are too high, audio or video streams will skip,
the pointer will lag the mouse, scrolling will be jerky, and Maelstrom
players will lose their ships. A number of Ingo's original tests were
latency-related, and he added a couple more in the second round. So it
looks like the benchmarks at least tried to measure the relevant quantity.
Benchmark results are not the same as a better desktop experience, though,
and a number of users are reporting a "smoother" desktop when running with
BFS. On the other hand, making significant scheduler changes in response
to reports of subjective "feel" is a sure recipe for trouble: if one cannot
measure improvement, one not only risks failing to fix any problems, one is
also at significant risk of introducing performance regressions for other
users. There has to be some sort of relatively objective way to judge
scheduler improvements.
The way preferred by the current scheduler maintainers is to identify
causes of latencies and fix them. The kernel's infrastructure for the
identification of latency problems has improved considerably over the last
year or two. One useful tool is latencytop, which collects data on
what is delaying applications and presents the results to the user. The
ftrace tracing framework is also able to produce data on the delay between
when a process is awakened and when it actually gets onto the CPU; see this
post from Frederic Weisbecker for an overview of how these measurements can
be taken.
If there are real latency problems remaining in the Linux scheduler - and
there are enough "BFS is better" reports to suggest that there are - then
using the available tools to describe them seems like the right direction
to take. Once the problem is better understood, it will be possible to
consider remedies. It may well be that the mainline scheduler can
be adjusted to make those problems go away. Or, possibly, a more radical
sort of approach is necessary. But, without some understanding of the
problem - and associated ability to measure it - attempted fixes seem a bit
like a risky shot in the dark.
Ingo welcomed Con back to the development community and invited him to help
improve the Linux scheduler. This seems unlikely to happen, though. Con's
way of working has never meshed well with the kernel development community,
and he is showing little sign of wanting to change that situation. That is
unfortunate; he is a talented developer who could do a lot to improve Linux
for an important user community. The adoption of the current CFS scheduler
is a direct result of his earlier work, even if he did not write the code
which was actually merged. In general, though, improving Linux requires
working with the Linux development community; in the absence of a desire to
do that effectively, there will be severe limits on what a developer will
be able to accomplish.
(See also: Frans Pop's benchmark tests,
which show decidedly mixed results.)
By Jake Edge
September 9, 2009
The staging tree has made a lot of progress since it appeared in June 2008. To start with, the
tree itself quickly moved into the mainline
in October 2008; it also has accumulated more than 40 drivers of various
sorts. Staging is an outgrowth of the Linux Driver Project that is
meant to collect drivers, and other "standalone" code such as filesystems,
that are not yet ready for the mainline. But, it was never meant to be a
"dumping ground for dead code", as staging maintainer Greg Kroah-Hartman
put it in a recent status update. Code that is not being improved, so that
it can move into the mainline, will be removed from the tree.
The code that is, at least currently, slated for removal includes
some fairly high-profile drivers, among them one from Microsoft that was
released with great fanfare in July. After a massive cleanup that resulted
in more than 200 patches to get the code "into a semi-sane kernel coding
style", Kroah-Hartman said that it may have to be removed in six months or so:
Unfortunately the Microsoft developers
seem to have disappeared, and no one is answering my emails.
If they do not show back up to claim this driver soon, it will
be removed in the 2.6.33 release. So sad...
Microsoft is certainly not alone in Kroah-Hartman's report—which
details the status of the tree for the upcoming 2.6.32 merge
window—as several other large companies' drivers are in roughly the
same boat. Drivers for Android hardware (staging/android) and
Intel's Management Engine Interface (MEI) hardware (staging/heci),
among others, were called out in the report. Both are slated
for removal: android in 2.6.32 and heci in 2.6.33
(presumably). The latter provides an excellent example of how not to
do Linux driver development:
A wonderful example of a company throwing code over the
wall, watching it get rejected, and then running away as fast
as possible, all the while yelling over their shoulder, "it's
required on all new systems, you will love it!" We don't, it
sucks, either fix it up, or I am removing it.
Kroah-Hartman's lengthy report covers more than just drivers that may be
removed; it also looks at those that have made progress, including some
that should be moving to the mainline, as well as new drivers that are
being added to staging. But the list of drivers that aren't being actively
worked on is roughly as long as the other two lists combined, which is
clearly suboptimal.
Presumably to see if folks read all the way through,
Kroah-Hartman sprinkles a few laughs in an otherwise dry summary. For the
me4000 and meilhaus drivers, he notes that there is no
reason to continue those drivers "except to watch the RT guys squirm
as they try to figure out the byzantine locking and build logic here (which
certainly does count for something, cheap entertainment is
always good.)"
He also notes several drivers that are in the inactive category, but are
quite close to being merge-worthy. He suggests that developers looking
for a way to contribute consider drivers such as asus_oled (Asus
OLED display), frontier (Frontier digital audio workstation controller),
line6 (PODxt Pro audio effects modeler), mimio (Mimio Xi
interactive whiteboard), and panel (parallel port LCD/keypad).
Each of those should be relatively easy to get into shape for inclusion in
the mainline.
There are a fair number of new drivers being added for 2.6.32,
including the Microsoft Hyper-V drivers (staging/hv) mentioned
earlier, as well as VME bus drivers (staging/vme), the industrial
I/O subsystem (staging/iio), and several wireless drivers (VIA
vt6655 and vt6656, Realtek rtl8192e, and Ralink 3090). Also,
"another COW driver" is being added: the Cowloop copy-on-write
pseudo block driver (staging/cowloop).
Two of Evgeniy Polyakov's projects, mistakenly listed in the "new driver"
section though they were added in 2.6.30, were also mentioned.
The distributed storage (DST) network block device (staging/dst),
which Kroah-Hartman notes may be "dead", is a candidate for removal,
while the distributed filesystem POHMELFS (staging/pohmelfs) is mostly
being worked on out-of-tree. Polyakov agrees that DST is not needed in the
mainline, but is wondering about moving POHMELFS out of staging and
into fs/. Since there are extensive changes on the way for POHMELFS,
it is unlikely to move out of staging for another few kernel releases at
least.
There was also praise for developers who have been actively improving
various drivers over the last few months. Bartlomiej Zolnierkiewicz
was singled out for his work on rt* and rtl* wireless
drivers (which put him atop the list of most active 2.6.31
developers), along with Alan Cox for work on the et131x driver
for the Agere gigabit Ethernet adapter. Johannes Berg noted that much of
Zolnierkiewicz's work on the rt* drivers "will have been in vain" because of
the progress being made by the rt2x00 project. But that doesn't faze Zolnierkiewicz:
The end goal of this work has always been having native rt2x00 support
for all those chipsets (as have been explained multiple times). If this
means that one day we will delete all Ralink drivers in staging in favor
of proper wireless drivers -- fine with me.
In the meantime (before clean and proper support becomes useful) Linux
users are provided with the possibility to use their hardware before it
becomes obsolete.
At least one developer stepped up to work on one of the inactive drivers (asus_oled) in
the thread. In addition, Willy Tarreau mentioned that he had heard from another who
was working on panel, telling Kroah-Hartman: "This
proves that the principle of the staging tree seems to work".
Overall, the staging tree seems to be doing exactly what Kroah-Hartman and
others envisioned. Adding staging into the mainline, which raised the
profile and availability of those drivers, has led to a fair amount of
cleanup work, some of which has resulted in the drivers themselves moving
out of staging and into the mainline. Some drivers seem to be falling by
the wayside, but one would guess that Kroah-Hartman would welcome them back
into the tree should anyone show up to work on them. In the meantime, the
code certainly hasn't suffered from whatever fixes various kernel
hackers found time to do. Those changes will be waiting for anyone who
wants to pick that code back up, even if it is no longer part of staging.
September 9, 2009
This article was contributed by Valerie Aurora (formerly Henson)
Sure, programmers (especially operating systems programmers) love
their specifications. Clean, well-defined interfaces are a key
element of scalable software development. But what is it about file
systems, POSIX, and when file data is guaranteed to hit permanent
storage that brings out the POSIX fundamentalist in all of us? The
recent fsync()/rename()/O_PONIES controversy was the most heated in recent
memory but not out of character for fsync()-related discussions. In this
article, we'll explore the relationship between file systems
developers, the POSIX file I/O standard, and people who just want to
store their data.
In the beginning, there was creat()
Like many practical interfaces (including HTML and TCP/IP), the POSIX file system
interface was implemented first and specified second. Work on UNIX
began in 1969; the first version of the POSIX
specification for the UNIX file I/O interface (IEEE Standard 1003.1)
was released in 1988. Before UNIX, application access to non-volatile
storage (e.g., a spinning drum) was a decidedly application- and
hardware-specific affair. Record-based file I/O was a common paradigm,
growing naturally out of punch cards, and each kind of file was treated
differently. The new interface was designed by a few guys (Ken
Thompson, Dennis Ritchie, et alia) screwing around with their new
machine, writing an operating system that would make it easier
to, well, write more operating systems.
As we know now, the new I/O interface was a hit. It turned out to be a
portable, versatile, simple paradigm that made modular software
development much easier. It was by no means perfect, of course: a
number of warts revealed themselves over time, not all of which were
removed before the interface was codified into the POSIX
specification. One example is directory hard links, which permit the
creation of a directory cycle - a directory that is a descendant of
itself - and its subsequent detachment from the file system hierarchy,
resulting in allocated but inaccessible directories and files.
Recording the last access time - atime - turns every read
into a tiny write. And don't forget the apocryphal quote from Ken
Thompson when asked if he'd do anything differently if he were
designing UNIX today: "If I had to do it over again? Hmm... I guess
I'd spell 'creat' with an 'e'". (That's the creat()
system call to create a new file.) But overall, the UNIX file system
interface is a huge success.
POSIX file I/O today: Ponies and fsync()
Over time, various more-or-less portable additions have accreted
around the standard set of POSIX file I/O interfaces; they have been
occasionally standardized and added to the canon - revelations from
latter-day prophets. Some examples off the top of my head include
pread()/pwrite(), direct I/O, file preallocation, extended attributes,
access control lists (ACLs) of every stripe and color, and a vast
array of mount-time options. While these additions are often debated
and implemented in incompatible forms, in most cases no one is trying
to oppose them purely on the basis of not being present in a standard
written in 1988. Similarly, there is relatively little debate about
refusing to conform to some of the more brain-dead POSIX details, such
as the aforementioned directory hard link feature.
Why, then, does the topic of when file system data is guaranteed to be
"on disk" suddenly turn file systems developers into pedantic
POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes
down to this: Waiting for data to actually hit disk before returning
from a system call is a losing game for file system performance. As
the most extreme example, the original synchronous version of the UNIX
file system frequently used only 3-5% of the disk throughput. Nearly
every file system performance improvement since then has been
primarily the result of saving up writes so that we can allocate and
write them out as a group. As file systems developers, we are going
to look for every loophole in fsync() and squirm our way
through it.
Fortunately for the file systems developers, the POSIX specification
is so very minimal that it doesn't even mention the topic of file
system behavior after a system crash. After all, the original
FFS-style file systems (e.g., ext2) can theoretically lose your entire
file system after a crash, and are still POSIX-compliant. Ironically,
as file systems developers, we spend 90% of our brain power coming up
with ways to quickly recover file system consistency after system
crash! No wonder file systems users are irked when we define file
system metadata as important enough to keep consistent, but not file
data - we take care of our own so well. File systems developers have
magnanimously conceded, though, that on return
from fsync(), and only from fsync(), and
only on a file system with the right mount options, the changes to
that file will be available if the system crashes after that point.
At the same time, fsync() is often more expensive than it
absolutely needs to be. The easiest way to
implement fsync() is to force out every outstanding write
to the file system, regardless of whether it is a journaling file
system, a COW file system, or a file system with no crash recovery
mechanism whatsoever. This is because it is very difficult to map
backward from a given file to the dirty file system blocks needing to
be written to disk in order to create a consistent file system
containing those changes. For example, the block containing the
bitmap for newly allocated file data blocks may also have been changed
by a later allocation for a different file, which then requires that
we also write out the indirect blocks pointing to the data for that
second file, which changes another bitmap block... When you solve the
problem of tracing specific dependencies of any particular write, you
end up with the complexity of soft updates. No surprise, then, that most
file systems take the brute force approach, with the result that fsync()
commonly takes time proportional to all outstanding writes to the file
system.
So, now we have the following situation: fsync() is
required to guarantee that file data is on stable storage, but it may
perform arbitrarily poorly, depending on what other activity is going
on in the file system. Given this situation, application developers
came to rely on what is, on the face of it, a completely reasonable
assumption: rename() of one file over another will either
result in the contents of the old file, or the contents of the new
file as of the time of the rename(). This is a subtle
and interesting optimization: rather than asking the file system to
synchronously write the data, it is instead a request to order the
writes to the file system. Ordering writes is far easier for the file
system to do efficiently than synchronous writes.
However, the ordering effect of rename() turns out to be
a file system specific implementation side effect. It only works when
changes to the file data in the file system are ordered with respect
to changes in the file system metadata. In ext3/4, this is only true
when the file system is mounted with the data=ordered
mount option - a name which hopefully makes more sense now! Up until
recently, data=ordered was the default journal mode for
ext3, which, in turn, was the default file system for Linux; as a result,
ext3 data=ordered was all that
many Linux application developers had any experience with. During the
Great File System Upheaval of 2.6.30, the default journal mode for
ext3 changed to data=writeback, which means that file
data will get written to disk when the file system feels like it, very
likely after the file's metadata specifying where its contents are
located has been written to disk. This not only breaks
the rename() ordering assumption, but also means that the
newly renamed file may contain arbitrary garbage - or a copy
of /etc/shadow, making this a security hole as well as a
data corruption problem.
Which brings us to the present-day fsync/rename/O_PONIES
controversy, in which many file systems developers argue that
applications should explicitly call fsync() before
renaming a file if they want the file's data to be on disk before the
rename takes effect - a position which seems bizarre and random until
you understand the individual decisions, each perfectly reasonable,
that piled up to create the current situation. Personally, as a file
systems developer, I think it is counterproductive to replace a
performance-friendly implicit ordering request in the form of a rename()
with an impossible-to-optimize fsync(). It may not be POSIX, but the
programmer's intent is clear - no one ever, ever wrote
"creat(); write(); close(); rename();" and hoped they
would get an empty file if the system crashed during the next 5
minutes. That's what truncate() is for. A generalized
"O_PONIES do-what-I-want" flag is indeed not possible,
but in this case, it is to the file systems developers' benefit to
extend the semantics of rename() to imply ordering so
that we reduce the number of fsync() calls we have to cope
with. (And, I have to note, I did have a real, live pony when I was a
kid, so I tend to be on the side of giving programmers ponies when
they ask for them.)
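For readers who have not seen it spelled out, the explicit sequence that
those file systems developers are asking for might look something like the
following C sketch. It is an illustration only, not code from any of the
applications involved in the controversy: the replace_file() name, the
".tmp" suffix, and the 0644 mode are assumptions made up for the example,
and error handling is kept to a minimum.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): replace `path` with `len`
 * bytes from `data` so that, after a crash, a reader should see either
 * the old contents or the new ones. Returns 0 on success, -1 on error.
 */
int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    size_t done = 0;
    int fd;

    assert(path != NULL && data != NULL);

    /* Write the new contents to a temporary file in the same directory. */
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
        return -1;
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        done += (size_t)n;
    }

    /* The explicit step: push the data to stable storage before rename(). */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Atomically switch the name over to the fully written file. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A caller would use it as replace_file("settings.conf", buf, len). Dropping
the fsync() call gives the shorter sequence that applications have come to
rely on, which is only crash-safe when the file system orders data writes
before the rename. A more paranoid version would also fsync() the
containing directory so that the rename itself is durable; that step is
omitted here.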
My opinion is that POSIX and most other useful standards are helpful
clarifications of existing practice, but are not sufficient when we
encounter surprising new circumstances. We criticize applications
developers for using folk-programming practices ("It seems to work!")
and coming to rely on file system-specific side effects, but the bare
POSIX specification is clearly insufficient to define useful system
behavior. In cases where programmer intent is unambiguous, we should
do the right thing, and put the new behavior on the list for the next
standards session.
Page editor: Jonathan Corbet