The current 2.6 development kernel is 2.6.30-rc4, released
by Linus (who has
reverted to the old "just after LWN goes out" schedule) on April 29.
Changes this time around include Tux's return as the kernel mascot and a
whole bunch of fixes. Plus the code name for this release has been changed
to "Vindictive Armadillo." Full details can be found in the
Patches continue to flow into the mainline repository; they are almost all
fixes, including one from LWN editor Jake Edge addressing some of the address space randomization
problems covered on last week's Security Page.
No stable 2.6.29 updates have been made in the last week. We did
see the release of the 2.6.27.22 and 2.6.28.10 updates on May 2.
They contain fixes all over the tree (58 and 88 patches respectively);
several have CVE numbers associated with them, so users are encouraged to
upgrade. Also: "NOTE, this is the LAST update of the 2.6.28 kernel
series, so all users are very strongly encouraged to upgrade to the 2.6.29
series at this point in time!" 2.6.27 will continue to be maintained
by the stable folks for quite some time to come.
Kernel development news
We were able to shave 400 milliseconds off the shutdown time by
slightly trimming the WAV file shutdown music.
-- John Curran; boot time is no longer the battleground.
Tools Are Not Deities To Be Appeased. Subject saying in effect
"$TOOL is upset!!!" is bloody useless.
-- Al Viro
From the few reports I have heard that the actual bug is not in the
linux kernel code but rather it sounds like a denial of service attack
against the implementation of http://uscode.house.gov/. With the
attackers being able to inject a few bogus values, and cause lots of
Now in the linux kernel we work around lots of bugs from lots of
different sources, and this may be a place to work around someone
else's bug. This does not appear to be a context where anyone is
concerned about a 0 day exploit, so we don't need to rush. Further
the functionality has been the same in the same in all places for a
long time, and all of the pieces are at least in theory open to public
review. So this should be a reasonable context for a public discussion.
The only reason I can see for not ultimately talking about things publicly
is if this is one company making shady deals with another company in which
case I do not see why the maintenance burden for those decision should
fall on the linux community as a whole.
-- Eric Biederman
Jon Masters is experimenting with the idea of creating a short podcast with
a summary of discussions on the linux-kernel mailing list. The initial
episode is just under four minutes long; it includes brief summaries of
discussions about DRBD,
GFP_PANIC, file descriptor abuse, and more. "I am hoping this of use to some people who can't read LKML every day.
Yesterday took 15-20 minutes to put together, and that's doable on a
regular basis, subject to it being of use to anyone. I figured I'm
reading LKML whether I do I summary recording or not. If it takes off,
then I'll try forming a small team to share the effort out."
The drive for faster boot times has led to a number of changes in the
kernel. Some, like the parallelization of USB
initialization we looked at last week, have caused disruptions for some
users. But others, like the recently proposed devtmpfs, have a different set of challenges.
While it may provide a good solution to reducing boot times,
devtmpfs faces some
fairly stiff resistance, at least partially because it reminds some folks
of a feature previously excised from the kernel, namely devfs.
The basic idea is to create a tmpfs early in the kernel
initialization before the driver core has initialized. Then, as each
device registers with the driver core, its major and minor numbers and
device name can be used to create an entry in that filesystem. Eventually,
the root filesystem will be mounted and the populated tmpfs can be
mounted at /dev.
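The bookkeeping described above can be sketched in user space. This is only an illustration under stated assumptions, not the code from the patch: `struct dev_node`, `describe_node()` and `create_node()` are hypothetical names, and the 0600 mode anticipates the fixed permissions mentioned later in this article. Only `mknod()` and `makedev()` are real library interfaces.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical record of one node in the early /dev tmpfs. */
struct dev_node {
    char path[64];
    mode_t mode;   /* file type plus permission bits */
    dev_t dev;     /* combined major/minor number */
};

/*
 * Fill in the node a registering device would get from its name and
 * numbers. Returns 0 on success, -1 if the path does not fit.
 */
int describe_node(const char *root, const char *name, int is_block,
                  unsigned int maj, unsigned int min,
                  struct dev_node *out)
{
    int n;

    assert(root && name && out);
    n = snprintf(out->path, sizeof(out->path), "%s/%s", root, name);
    if (n < 0 || (size_t)n >= sizeof(out->path))
        return -1;
    out->mode = (is_block ? S_IFBLK : S_IFCHR) | 0600;
    out->dev = makedev(maj, min);
    return 0;
}

/* Create the node in the filesystem; this needs root privileges. */
int create_node(const struct dev_node *node)
{
    return mknod(node->path, node->mode, node->dev);
}
```

A caller would run `describe_node("/dev", "sda", 1, 8, 0, &node)` and then `create_node(&node)` for each device as it registers; the kernel patch does the equivalent work internally, without any user-space process.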
This has a number of benefits, all of which derive from the fact that no
user-space support is required to have a working /dev directory.
With the current udev-based approach, there is a need for a
reasonably functional user-space environment for udev to operate
in. For simplified booting scenarios—like rescue tools or using the
init=/bin/sh kernel boot parameter—a functional
/dev directory is needed, in particular because of
dynamic device numbers. It would also be useful for embedded devices that
do not need or want a full-featured user space.
Andrew Morton's immediate reaction was amusement: "Lol, devfs." Greg
Kroah-Hartman, who authored the patch along with Kay Sievers and Jan
Blunck, admitted that it was a kind of
devfs: "Well, devfs 'done right' with hopefully none of the
vfs problems the
last devfs had. :)" But Morton is somewhat concerned that "devfs2", as he calls
it, is just going over old ground:
I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based)
would have fixed up these things. But it was never quite completed and
came when minds were already made up.
I don't understand why we need devfs2, really. What problems are
people having with [the] existing design?
Though the other advantages are important, Kroah-Hartman replied with the crux of the argument for devtmpfs:
Boot speed, boot speed, boot speed.
Oh, and reduction in complexity in init scripts, and saving embedded
systems a lot of effort to implement a dynamic /dev properly (have you
_seen_ what Android does to keep from having to ship udev? It's
But Alan Cox is not so sure. His argument
is that moving this functionality (back) into the kernel just papers over
a user-space problem, while increasing the kernel's memory usage, which is
not pageable. Others think that the kernel should just
buffer uevents—the messages generated by the kernel to send to udev
on device state
changes—until udevd is started. But, that doesn't solve the
synchronization problem: user space must still wait for a populated /dev.
A problem with the current scheme is that it
essentially does the device enumeration twice—once in the kernel as
devices are registered and once in user space by udevd, when it gets
started. The device information that was gathered by the kernel is lost. When
udevd initializes, it walks the /sys directory to find
devices, then creates device nodes for them. That can take 1-2 seconds on
a complex system—on the order of twice the kernel boot time—but
worse still, no other user-space processes can start until this "coldplug"
pass has completed. Using devtmpfs, there will be a working
/dev that other user-space code can use, so that the udev
coldplug pass can be done in parallel.
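One small piece of such a walk can be sketched as follows. Device directories in sysfs expose a `dev` attribute containing the text `MAJOR:MINOR`; a user-space tool can read it to learn the numbers the kernel already knew at registration time. The function names here are hypothetical, and this stands in for, rather than reproduces, what udevd actually does.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the text of a sysfs "dev" attribute, which has the form
 * "MAJOR:MINOR\n". Returns 0 on success, -1 on malformed input.
 */
int parse_dev_attr(const char *text, unsigned int *maj, unsigned int *min)
{
    assert(text && maj && min);
    if (sscanf(text, "%u:%u", maj, min) != 2)
        return -1;
    return 0;
}

/* Read and parse one attribute file, e.g. /sys/class/tty/tty0/dev. */
int read_dev_attr(const char *path, unsigned int *maj, unsigned int *min)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        ret = parse_dev_attr(buf, maj, min);
    fclose(f);
    return ret;
}
```

Repeating this for every device directory under /sys is the second enumeration that the article describes; with a kernel-populated /dev, it no longer has to finish before other processes can start.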
Several alternate methods of solving the problem were proposed in the
thread, but, by and large, Sievers was able to show why they didn't solve
the problem. In some cases, the behavior of devfs is being
incorrectly attributed to devtmpfs, but the two are quite different.
The new scheme would create root-owned device nodes, with fixed 0600
permissions, for each device. It would avoid much of the complexity of
devfs. As Sievers puts it:
We are not implementing anything crazy here like devfs did, including
the later versions - there is no modprobe behind your back, no lookup
hooks, no stupid new naming scheme, no new filesystem type to
Christoph Hellwig objected to the proposal
as well. Part of his complaint is how quickly devtmpfs was added
to the linux-next tree, but he also sees it as adding devfs back
into the kernel:
It basically does re-introduce devfs under a different name, and from
looking at the implementation it might not be quite as bad a Gooch's
original, but it's certainly worse than Adam Richters rewrite the we
never ended up merging.
Now we might want to revisit the decision to leave all the device name
handling to a userspace daemon, because it [proved] to be quite fragile
under certain circumstances, and you apparently see performance issues.
Sievers outlines the differences between
devtmpfs and Adam Richter's proposal
from 2003. It mostly boils down to complexity; devtmpfs is a much
simpler scheme, which really adds very little to the kernel. The
implementation is around 300 lines of code, in comparison to roughly 3600
for devfs and 600 for an early version of Richter's mini-devfs.
Anticipating the next complaint, Sievers also points out that the device
naming policy is already in the kernel, but that udev can override
the kernel-supplied values if need be. From his perspective this has
already occurred, making that an invalid argument against devtmpfs:
The kernel carries the policy today for 98% of the devices,
if you change any driver given name, it will no longer show up in /dev
with the current name. That's the reality since years, and will not be
different anytime soon, there is no real naming policy besides the
current kernel supplied names.
It is clear that the devtmpfs developers have put a fair amount of
thought into just what was needed, and how it could work with existing
code—both inside and outside the kernel. It is also clear that there
is some resistance to returning to anything even remotely reminiscent of
devfs. Because devtmpfs is really quite different, and
has a nice effect on boot speed, one would think that it is likely to find
its way into the mainline sooner or later. If no further objections are
raised, and the
linux-next trials go well, 2.6.31 may very well be the release that sees
the inclusion of devtmpfs.
When Microsoft filed its lawsuit against TomTom, it named two patents which
cover the VFAT filesystem. That, naturally, led to a renewed push to
either (1) get those patents invalidated, or (2) move away from
VFAT altogether. But some participants have advocated a third approach:
find a way to work around the patents which retains most of the VFAT
filesystem functionality while, with luck, avoiding any potential infringement of the
claims of the patent. But, as a recently-posted patch
and the ensuing discussion
show, workarounds are not a straightforward solution even after the lawyers
have been satisfied.
The patch (written by Andrew Tridgell, but posted by Dave Kleikamp) comes
with this changelog:
Add CONFIG_VFAT_NO_CREATE_WITH_LONGNAMES option
When this option is enabled the VFAT filesystem will refuse to
create new files with long names. Accessing existing files with
long names will continue to work.
Note that the changelog gives no clue as to why one might want this
particular configuration option. What it probably comes down to is this: all of the
claims in the VFAT patent refer to the creation of long file names.
Reading filesystems with such names is not addressed by the patent. So the
apparent thinking is that, even if the named patents really read on the Linux
VFAT implementation, they will not read on a version which cannot create
files with long names.
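The effect of such an option can be illustrated with a deliberately simplified sketch. The patch's actual test is not shown in this article; `fits_8_3()` and `may_create()` are hypothetical names, and the check below accepts only upper-case letters, digits and underscores, which is narrower than the real FAT short-name rules.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Simplified test for a name in the classic 8.3 form: one to eight
 * characters, optionally a dot and one to three more. Real FAT rules
 * permit more characters than are accepted here.
 */
int fits_8_3(const char *name)
{
    const char *dot = strchr(name, '.');
    size_t base = dot ? (size_t)(dot - name) : strlen(name);
    size_t ext = dot ? strlen(dot + 1) : 0;

    if (base < 1 || base > 8 || ext > 3)
        return 0;
    if (dot && ext == 0)
        return 0;               /* a trailing dot is not accepted */
    for (const char *p = name; *p; p++) {
        unsigned char c = (unsigned char)*p;

        if (p == dot)
            continue;
        if (!isupper(c) && !isdigit(c) && c != '_')
            return 0;           /* includes any second dot */
    }
    return 1;
}

/* With the option on, refuse to create names that need a long entry. */
int may_create(const char *name, int no_create_with_longnames)
{
    assert(name);
    return !no_create_with_longnames || fits_8_3(name);
}
```

Note that only creation is gated; nothing in this sketch affects lookup of existing names, which matches the changelog's statement that existing long-named files remain accessible.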
It looks like a reasonable hack. Interoperability with all existing VFAT
filesystems is retained, as long as one does not need to create files with
long names on the Linux side. But systems which run kernels with this
option enabled have a much lower probability of being found to infringe on
the VFAT patents. It could, maybe, be an optimal solution.
That said, the patch has been poorly received in the kernel development
community. One of the reasons for this chilly reception, certainly, is
general hostility to the software patent system and an associated lack of
willingness to capitulate to it. Add in a generous helping of contempt for
the VFAT patents - and their owner - in particular, and it is not
surprising that some developers would rather not entertain "solutions" to
The bigger issue, though, is that the patch does not describe the real
problem that it is trying to solve. There has been a lot of fairly
weaselly discussion from IBM developers on the lists, but none of them are
willing to just come out and say what is going on. The closest, perhaps,
is this message from Tridge:
However, if you are willing to concede that there are good
non-technical arguments for wanting to "get the VFAT out" then
choosing the best way to achieve that is most definitely a
technical decision, and that is what we can discuss here.
Unfortunately I am unable to discuss any of the non-technical
reasons for why "get the VFAT out" might be a good idea in the
first place. That is damn frustrating, but it is just how things
All of this talk creates a certain feeling of patches being sent out to the
list from some smoke-filled room deep within IBM headquarters. But, more
importantly, the lack of information makes it impossible for the
development community to determine whether the patch works. To make that
decision, developers need to know what problem is being solved, and how the
proposed solution makes the problem go away. But they don't have that
information; instead, they simply have a patch which makes it possible to
remove some functionality from the kernel.
The subtext of the conversation is that some lawyers at IBM have,
presumably, determined that a potential problem exists. That problem could
be as simple as "this feature may attract infringement suits,"
independently of whether the patents are valid or whether Linux
infringes on them. For any number of Linux users, the simple fact
that the probability of being sued might go up is enough to inspire a
search for alternatives. Also, presumably, these same lawyers have
concluded that this particular workaround can resolve these worries. So
now they believe it should be a part of the Linux kernel.
But if the lawyers have really come to these conclusions, they are not
saying so in any public forum. So the kernel developers are left wondering
what is really going on. Are there really lawyers involved, or is this
patch the work of a couple of programmers who have tried to create a
solution (to a problem perceived by them) on their own? Why can't a
company like TomTom just patch out the long-name functionality on their own
if they are truly worried about it? Might the inclusion of this patch open
the kernel up to other potential legal difficulties that we don't know about?
Tridge's suggestion is that a prominent
kernel developer needs to have a conversation with a lawyer before making
the decision on this patch. That approach might lead to a correct outcome,
but it will still leave most of the community in the dark and unhappy about it.
It would appear that a better way is required. Currently, it is difficult
for developers to determine whether a patent really applies to an algorithm
in the kernel or not. If they conclude that there is a patent problem,
these same developers are poorly placed to figure out what a minimal
workaround might be. We need some help in this area. This particular
problem is likely to come up again in other contexts; if we can put some
sort of process in place for addressing legal issues, life will be easier
in the future.
IBM is said to have extensive documentation on the process of working
around patents; for some strange reason, this information has never been
released to the public. Unfortunately, determinations by lawyers are also
unlikely to be released to the public, for any number of reasons. But
developers need all of this information to respond properly to legal
issues. There may be no alternative to some sort of process where a limited group of
developers is given access to information under non-disclosure agreements.
Such processes are distasteful, but they also are fairly common; many
device drivers are created under non-disclosure agreements.
The Linux Foundation currently has an NDA program intended to connect
developers with hardware documentation. Perhaps a similar program (under
the auspices of the Linux Foundation, or of another group like the Software
Freedom Law Center or the Open Invention Network) could be created for
access to legal information. As it is, we have a situation where some
developers are talking to their employers' lawyers and nobody else has any
real idea of what is going on. That will lead to slow, loud, and
contentious attempts to solve legal problems. Given that we're almost
certain to have more of these problems in the future, we might want to put
some thought into finding a better way.
One of the discussions your editor missed at the recent Linux Storage and
Filesystem workshop covered the proposed reflink() system call.
Fortunately, the filesystem developers have now filled in the relevant
information with a detailed email exchange, complete with patches. We now
have a proposed system call
which has created
more open questions than answers. The creation of a new core system call
requires a lot of thought, so a close look at these questions would seem to
be called for.
The proposed system calls are pretty simple:
    int reflink(const char *oldname, const char *newname);
    int reflinkat(int old_dir_fd, const char *oldname,
                  int new_dir_fd, const char *newname, int flags);
These system calls function much like link() and linkat()
but with an important exception: rather than create a new link pointing to
an existing inode, they create a new inode which happens to share the same
disk blocks as the existing file. So, at the conclusion of a
reflink() call, newname looks very much like a copy of
oldname, but the actual data blocks have not been duplicated. The
files are copy-on-write, though, meaning that a write to either file will cause
some or all of the blocks to be duplicated. A change to one of the files
will thus not be visible in the other file. In a sense, a reflink()
call behaves like a low-cost file copy operation, though how copy-like it will be
remains to be seen.
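The copy-on-write behavior can be modeled with a toy in-memory structure. This is a sketch of the general technique only, assuming reference-counted blocks; every name in it is hypothetical, and it says nothing about how any real filesystem implements block sharing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 8
#define MAX_BLOCKS 4

/* A data block shared by one or more files. */
struct block {
    int refs;
    char data[BLOCK_SIZE];
};

/* A toy "inode": just pointers to possibly shared blocks. */
struct toy_file {
    struct block *blocks[MAX_BLOCKS];
};

/* Allocate zero-filled, unshared blocks for a new file. */
int toy_create(struct toy_file *f)
{
    for (int i = 0; i < MAX_BLOCKS; i++) {
        f->blocks[i] = calloc(1, sizeof(struct block));
        if (!f->blocks[i]) {
            while (i-- > 0)
                free(f->blocks[i]);
            return -1;
        }
        f->blocks[i]->refs = 1;
    }
    return 0;
}

/* New inode, same blocks: only the reference counts change. */
void toy_reflink(const struct toy_file *src, struct toy_file *dst)
{
    for (int i = 0; i < MAX_BLOCKS; i++) {
        dst->blocks[i] = src->blocks[i];
        dst->blocks[i]->refs++;
    }
}

/* Write one byte; a shared block is duplicated before it is changed. */
int toy_write(struct toy_file *f, int idx, int off, char c)
{
    struct block *b;

    assert(idx >= 0 && idx < MAX_BLOCKS && off >= 0 && off < BLOCK_SIZE);
    b = f->blocks[idx];
    if (b->refs > 1) {
        struct block *copy = malloc(sizeof(*copy));

        if (!copy)
            return -1;
        memcpy(copy->data, b->data, BLOCK_SIZE);
        copy->refs = 1;
        b->refs--;
        f->blocks[idx] = copy;
        b = copy;
    }
    b->data[off] = c;
    return 0;
}
```

After `toy_reflink()`, a write through either file duplicates only the block it touches; the untouched blocks stay shared, which is why the operation is cheap up front and pays for storage only as the files diverge.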
The first question to arise was: does the kernel really need to provide
both the reflink() and reflinkat() system calls? Most of
the other *at() calls are paired with the non-at versions because
the latter came first. Since Unix-like systems have had link()
for a long time, it cannot be removed without breaking applications. So
linkat() had to go in as a separate call. But
reflink() is new, so it can just as easily be implemented in the C
library as a wrapper around reflinkat(). That is how things
will probably be done in the end.
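A library-level wrapper of that kind might look like the sketch below. Since no kernel provides reflinkat(), the function standing in for it is a stub that only records its arguments; the `sketch_` names and the `last_call` variable are hypothetical. `AT_FDCWD` is the real constant used by the existing *at() calls to mean "relative to the current working directory".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>   /* AT_FDCWD */
#include <stddef.h>

/* Arguments seen by the stand-in below, kept for inspection. */
struct reflinkat_args {
    int old_dir_fd, new_dir_fd, flags;
    const char *oldname, *newname;
};
struct reflinkat_args last_call;

/*
 * Stand-in for the proposed reflinkat() system call. It does no real
 * work: it records its arguments and reports success.
 */
int sketch_reflinkat(int old_dir_fd, const char *oldname,
                     int new_dir_fd, const char *newname, int flags)
{
    assert(oldname && newname);
    last_call = (struct reflinkat_args){
        .old_dir_fd = old_dir_fd, .new_dir_fd = new_dir_fd,
        .flags = flags, .oldname = oldname, .newname = newname,
    };
    return 0;
}

/* reflink() as a wrapper: both names relative to the cwd, no flags. */
int sketch_reflink(const char *oldname, const char *newname)
{
    return sketch_reflinkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
```

The two-argument form adds nothing the five-argument form cannot express, which is the argument for keeping only one entry point in the kernel.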
The deeper discussion, though, reveals that there are two fundamentally
different views of how this system call should work. Joel Becker, who
posted the reflink() patch, sees it as a new variant of the
link() system call. Others, though, would like it to behave more
like a file copy operation. If you see reflink() as being a type
of link(), then certain implications emerge:
- The reflink-as-link view requires that the new file have (to the
greatest extent possible) the same metadata as the old one; in
particular, it must have (at the end of the reflink() system
call) the same owner, just like hard links do.
- Creating low-level snapshots of filesystems (or portions thereof) is
straightforward and easy. Reflinked files look just like the
originals; in particular, they have (mostly) the same metadata and can
share the same security context.
- Disk quotas are a problem. Should a reflinked file count against the
owner's disk quota, even though little or no extra storage is actually
used? If so, the system must take extra steps to keep users from
creating a reflink to a file they do not own; otherwise one user could
exhaust another user's quota. If, instead, storage is charged against
the quota of the user who created the reflink, complicated structures
will be needed to track usage associated with files owned by others.
- What happens if the new file's metadata - permissions or owner - are
changed? In some scenarios, depending on the underlying filesystem
implementation, it seems that a metadata change could
require a copy-on-write of the whole file. That would turn a command
like chmod into a rather heavy-weight operation.
On the other hand, if a reflink is like making a copy, the situation changes:
- Security becomes a rather more complicated affair. Making a hard link
doesn't require messing with SELinux security contexts, but a
reflink-as-copy would require that. Permission checks (again,
including security module checks) would have to become more
elaborate; it would have to be clear that the user making the reflink
had read access to the file.
- The quota problem goes away. If a reflink is essentially a copy, then
the resulting link should be owned by the user who creates it, rather
than the owner of the original file. The only course which makes
sense is to charge both users for the full size of the file. There
are no concerns about one user exhausting another's disk quota, and
there are no real difficulties with keeping disk usage information
- Metadata changes are handled naturally, since the files are completely
separate from each other.
- Reflinks are no longer true snapshots; they will not work to represent
the state of the filesystem at a given time. For a user whose real
interest is low-level snapshotting, reflink-as-copy will not work.
On the other hand, reflink-as-copy could be used in a lot of other
interesting situations; the cp command could create reflinks by
default when the destination file is on the same filesystem. That would
turn "cp -r" into a fast and efficient operation. They could
also be used to share files between virtualized guests.
What it comes down to is that there are real uses for both the
reflink-as-link and reflink-as-copy modes of operation. So the right
solution may well be to implement both modes. The flags parameter
to reflinkat() can be used to distinguish between the two.
Implementing both behaviors will complicate the implementation somewhat,
and it muddies up what is otherwise a conceptually clean system call. But
that's what happens, sometimes, when designs encounter the real world.
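One difference between the two modes, the ownership of the new file, can be expressed as a small decision function. The flag name below is invented for illustration; the discussion described here does not settle on any flag names, and the function covers only the ownership rule stated in the lists above.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical flag; no such name appears in the posted patches. */
#define SKETCH_REFLINK_AS_COPY 0x1

/*
 * Decide who owns the inode created by a reflink.
 *  - link-like (no flag): same owner as the original, as with a hard link
 *  - copy-like (flag set): the user who made the reflink
 */
uid_t reflink_new_owner(int flags, uid_t src_owner, uid_t caller)
{
    /* This sketch knows only one flag; anything else is a caller bug. */
    assert((flags & ~SKETCH_REFLINK_AS_COPY) == 0);

    if (flags & SKETCH_REFLINK_AS_COPY)
        return caller;
    return src_owner;
}
```

For example, with an original owned by uid 1000 and a caller with uid 2000, the link-like mode yields 1000 and the copy-like mode yields 2000; the quota and security questions raised above follow from that one choice.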
Patches and updates
Core kernel code
- Nigel Cunningham: TuxOnIce (May 6, 2009)
Filesystems and block I/O
Virtualization and containers
- Gregory Haskins: irqfd (May 5, 2009)
Page editor: Jonathan Corbet