The current 2.6 prepatch is 2.6.18-rc3, released
on July 29. The
patch rate is beginning to slow as this kernel stabilizes, so this prepatch
adds a number of fixes but not much else. The long-format
changelog has the details.
Well over 100 fixes have been merged into the mainline repository since
-rc3 was released.
The current -mm tree is 2.6.18-rc2-mm1. Recent changes
to -mm include a big x86-64 update, an NFS update, and lots of fixes.
Kernel development news
I will, in fact, claim that the difference between a bad programmer
and a good one is whether he considers his code or his data
structures more important. Bad programmers worry about the
code. Good programmers worry about data structures and their
relationships.
-- Linus Torvalds
Marcelo Tosatti has announced the
availability of the third 2.4.33 release candidate, containing a very small
number of patches. He has also announced that the 2.4 maintainership is
passing on to Willy Tarreau, who has been running the 2.4 "hotfix" patch
series for some time. Many thanks are due to Marcelo, who has maintained
the 2.4 kernel since 2.4.16.
Burning data to a CD or DVD is a complicated task, involving the use of a
wide range of SCSI commands. So, any application which burns discs must
have the ability to send special SCSI operations to the drive. Just before
the 2.6.8 kernel came out, however, the kernel developers decided that
applications should not be able to send just any
SCSI command. Some
of those commands could lead the drive to rewrite its firmware, catch fire,
or replace music tracks with recordings of Richard Stallman singing. In an
attempt to keep such undesirable things from happening, Linus added a late patch
which blocks unprivileged users from using
any SCSI commands which do not appear in an
explicit list of allowed commands.
It is almost certainly true that no user ever destroyed a CD drive with a
2.6.8 system. In fact, very few of them even wrote discs; the filtering at
that stage was so severe that unprivileged users could not do anything
useful at all. Subsequent updates made things better, however, and by
about 2.6.10 burning worked again for most users.
Not for all users, however. As Dave Jones recently noted on the linux-scsi list, the command
filtering still trips up some Plextor drives. The cdrecord utility tries
to send vendor-specific commands to those drives, but the kernel
filters them out. Everything then comes to a halt, and the user must retry
the operation as root to get the job done. Dave asked: might it be a good
idea to add a per-vendor exceptions capability to the filtering code?
The response which came back from a couple of block subsystem developers
was that the command filtering should simply be taken out altogether.
Evidently this topic had been discussed at the recent storage summit, and
the participants had agreed that this feature should be removed. James
Bottomley put it this way:
If we're going to allow users access to burn CDs, it's impossible
to police them with certainty as this case indicates. If we allow
vendor specific commands down, there are bound to be some that
format the drive or destroy the firmware...
So I think ripping the table out and acknowledging we have no
security is better than giving the illusion of having it.
There are a number of complaints about the filtering code. It is a way of
encoding policy in the kernel, which is generally frowned upon - even
though the policy, in this case, is really an attempt to enforce a
difference between access to a disc within a drive and access to the drive
itself. The command list will never be entirely correct; it seems that
some drives must receive the appropriate, vendor-specific incantations or
they will refuse to write discs. Some commands mean different things to
different types of devices; what's safe for a CD burner might be a
destructive operation on a different SCSI-like device. It also doesn't
help that there are, in fact, two different SCSI command filters in the
kernel (one in drivers/scsi/sg.c, the other in
block/scsi_ioctl.c) which implement different policies. For all of these
reasons, attendees at the storage summit apparently agreed to take the
filter out.
There's just one little problem with this plan. Linus feels differently about filtering:
Put another way: you will remove that command filtering in
block/scsi_ioctl.c only in a kernel that I don't maintain, or by
disabling it in some way that is so hidden that I won't
notice. Because I'm not so stupid as to think that it's ok for
normal users to set driver passwords or rewrite the disk firmware
just because they have write permissions to the device. That's
pretty damn final.
This statement would appear to be pretty damn final. That does not mean
that the situation cannot be improved, however. The leading idea at the
moment would appear to be to allow a privileged user to make changes to the
command filter table. Distributions could then ship tools which detect
problematic devices and modify the filtering tables accordingly; the whole
thing could be transparently integrated with the hotplug functionality.
Jens Axboe has a
patch (originally from Peter Jones) which turns the filter list into a
per-device object, tweakable through sysfs, so each device could have its
own set of exceptions.
Just how this interface works may yet require some discussion to nail
down. But the configurable, per-device filter looks like the way forward.
It retains the filtering of dangerous commands while moving the policy
decisions to user space. Once the policy can be changed, distributors can
do the work to ensure that specific devices are well supported, or, if they
prefer, simply mark all commands as "allowed" and, for all practical
purposes, remove the filter altogether.
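
A minimal sketch, in C, of what a configurable per-device filter might look like. The structure, the function names, and the split between commands allowed for readers and for writers are assumptions made for illustration; none of them are taken from the patch under discussion, and the sysfs plumbing is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-device filter: one bit per 8-bit SCSI opcode. */
struct cmd_filter {
    uint8_t read_ok[256 / 8];   /* allowed with read access to the device */
    uint8_t write_ok[256 / 8];  /* allowed only with write access */
};

/* Start from the most restrictive policy: nothing is allowed. */
void filter_init(struct cmd_filter *f)
{
    memset(f, 0, sizeof(*f));
}

/* Mark one opcode as allowed in the given bitmap. */
void filter_allow(uint8_t *bitmap, uint8_t opcode)
{
    assert(bitmap != NULL);
    bitmap[opcode / 8] |= (uint8_t)(1u << (opcode % 8));
}

bool filter_bit_set(const uint8_t *bitmap, uint8_t opcode)
{
    return ((bitmap[opcode / 8] >> (opcode % 8)) & 1u) != 0;
}

/* Decide whether a command may be passed through to the device. */
bool filter_permits(const struct cmd_filter *f, uint8_t opcode,
                    bool privileged, bool has_write_access)
{
    if (privileged)
        return true;            /* a privileged user bypasses the filter */
    if (filter_bit_set(f->read_ok, opcode))
        return true;
    return has_write_access && filter_bit_set(f->write_ok, opcode);
}
```

With something along these lines, a distribution tool which recognized a problematic burner could call filter_allow() for the vendor-specific opcodes that drive needs (SCSI sets aside opcodes 0xC0 through 0xFF for vendor use), while setting every bit would amount to removing the filter for that device.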
Hans Reiser is nothing if not persistent. Back in October, 2002, he requested
that his new reiser4
filesystem be included in the 2.5 development kernel before it went into
the pre-2.6 stabilization mode. Nearly four years have passed, during
which reiser4 has been through endless linux-kernel debates, numerous
changes to fix problems found by reviewers, the removal of core features,
and a long wait in the -mm kernel. Despite all of this, reiser4 is still
not in the mainline - but Hans has not given up.
There have been a number of obstacles to overcome so far. The "files as
directories" feature tweaked POSIX semantics in a way that disturbed some
people, and, more importantly, had crucial locking problems; that feature
has been removed. The posted benchmarks have not been entirely credible to
all observers. There is concern about how committed the reiser4 developers
are to ongoing support of the filesystem, once it is merged. Hans tends to
have difficult relations with other kernel developers, and does not always
respond entirely gracefully to (often not entirely graceful) review
comments. The end result has been a difficult path toward inclusion for a
filesystem which truly does offer some interesting ideas and the potential
for top-level performance.
Partially as a result of a feeling that the reiser4 process has gone on for
too long, the debate has returned to linux-kernel. Hans and company would
like to see reiser4 put into 2.6.19, and it seems that they might just
get their wish.
Some outstanding issues remain, though some of them may not be as
problematic as some people think. The biggest of those, probably, is the
reiser4 plugin concept. Plugins allow the filesystem to behave differently
for every file stored there; they can add features like compression,
encryption, or many of the more esoteric things currently done with FUSE.
Plugins raise all kinds of red flags in the development community. So, for
example, Linus states:
As long you call them "plugins" and treat them as such, I (and I
suspect a lot of other people) are totally uninterested, and in
fact, a lot of people will suspect that the primary aim is to
either subvert the kernel copyright rules, or at best to create a
mess of incompatible semantics with no sane overlying rules for
Jeff Garzik has concerns as well:
I don't want to be the distro support person trying to fix a crash
in "reiser4", where the customer has secretly replaced the standard
inode data structure with a plugin written by an intern, and
secretly replaced the directory algorithm with a closed source
plugin from PickYourVendor. Trying picking through that mess with a
The message for the reiser4 developers over the last few years is that any
such mechanism, if it makes sense at all, should be implemented within the
VFS level, rather than within any specific filesystem. Reiser4 plugins are
seen as a separate, private VFS with a long potential for problems.
What a number of people have not realized, perhaps, is that the plugin
issue is much smaller than it once might have been. They cannot be loaded
at run time, so there should not be copyright issues like those that
accompany closed-source kernel modules. And most of the plugin
functionality has been removed in response to past comments. Andrew
Morton, who has recently reviewed the code, put it this way:
The plugins appear to be wildly misnamed - they're just an internal
abstraction layer which permits later feature additions to be added
in a clean and safe manner. Certainly not worth all this fuss.
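
To illustrate the distinction being drawn, here is a rough sketch in C of an "internal abstraction layer" of this general kind: a table of behaviors fixed at build time and selected per file, with no way to register new entries at run time. Every name in it is hypothetical, and it makes no claim to resemble the actual reiser4 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file behavior: how much space a file's data occupies. */
struct file_plugin {
    const char *name;
    size_t (*stored_size)(size_t logical_size);
};

/* Data stored as-is. */
static size_t plain_size(size_t n)
{
    return n;
}

/* Data padded up to a multiple of 16 bytes, as a block cipher might need. */
static size_t padded_size(size_t n)
{
    return (n + 15) / 16 * 16;
}

enum plugin_id { PLUGIN_PLAIN, PLUGIN_PADDED, PLUGIN_COUNT };

/* Fixed at build time: there is no call which adds an entry later. */
static const struct file_plugin plugin_table[PLUGIN_COUNT] = {
    [PLUGIN_PLAIN]  = { "plain",  plain_size  },
    [PLUGIN_PADDED] = { "padded", padded_size },
};

/* Look up the behavior recorded for a file by its stored identifier. */
const struct file_plugin *plugin_for(enum plugin_id id)
{
    assert(id < PLUGIN_COUNT);
    return &plugin_table[id];
}
```

Since the table is const and there is no registration function, adding a behavior means changing the source and rebuilding, which is the property that sets this arrangement apart from loadable modules.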
From Andrew's point of view, the biggest problems would appear to be the
lack of direct I/O and extended attribute support. Direct I/O looks like
it might not be too far in the future, but it does not appear that there is
any immediate prospect of extended attributes. That means that, among
other things, a reiser4 filesystem cannot support SELinux. That limitation
may cause some distributors to leave reiser4 support out, even after
reiser4 has finally been merged into the mainline kernel.
The remaining objections may be enough to dissuade some users or
distributors from working with reiser4, but it would seem that they should
not be enough to block the merging of reiser4 into the mainline. A new
filesystem does not affect anybody who does not use it, and the bad
pitfalls for reiser4 users (deadlocks, for example) should be long gone.
So it may just be that Hans Reiser's long wait is nearing its end.
Last week's article on
network channels suggested that channels might not be the way of the future
at all. Since then, there has been a great deal of discussion on how
networking might move forward on many levels, some of which might yet
include channels. Your editor plans to gain an understanding of
the Grand Unified Flow Cache and related concepts (such as Rusty's plans to
thrash up netfilter yet again) for a future article; for now,
we'll look at a different aspect of networking (and beyond): a user-space
event interface.
Unlike some other operating systems, Linux currently lacks a system call
for generalized event reporting. Linux applications, instead, use calls
like poll() to figure out when there is work to be done.
Unfortunately, poll() does not solve the entire problem, so
application event loops must resort to complicated workarounds to handle
signals and the like. Handling asynchronous I/O within a traditional Linux event loop
can be especially tricky. If there were a single interface which provided
an application with all of the event information it needed, applications
would get simpler. There is also the potential for significant performance
improvements.
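
For reference, the core of a traditional poll()-based wait can be sketched as follows. The helper name wait_and_read() is made up for this example; poll() and read() are the standard POSIX calls.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read from it.
 * Returns the result of read(), 0 if the wait timed out, or -1 on error.
 */
ssize_t wait_and_read(int fd, char *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    assert(buf != NULL);
    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;      /* includes EINTR: a signal arrived while waiting */
    if (ready == 0)
        return 0;       /* timeout: nothing to read yet */
    if (pfd.revents & POLLIN)
        return read(fd, buf, len);
    return -1;          /* POLLERR, POLLHUP, or POLLNVAL without data */
}
```

Note what is missing: poll() only knows about file descriptors. A signal shows up as an interrupted call rather than as an event, and asynchronous I/O completions have to be collected through a separate interface, which is the sort of gap a unified event mechanism is meant to close.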
There are two active proposals for event interfaces for Linux: the kevent mechanism and the event
channel API proposed by Ulrich
Drepper at this year's Ottawa Linux Symposium. Of the two, kevents
currently have the advantage for one simple reason: there is an existing,
working implementation to look at. So most of the discussion has concerned
how kevents can be improved.
The original kevent API is seen as being a bit difficult; it relies on a
single multiplexer system call (kevent_ctl()), an approach which is generally
frowned upon. The call also requires the application to construct an array
with two different types of structures, which is a bit awkward. So one of
the first suggestions has been to separate out various parts of the API.
The current kevent patch (as
of August 1) contains a new system call:
    int kevent_get_events(int ctl_fd,
                          unsigned int min_nr,
                          unsigned int max_nr,
                          unsigned int timeout,
                          void *buf,
                          unsigned flags);
This call would return between min_nr and max_nr events,
storing them sequentially in buf, subject to the given
timeout (specified in milliseconds). The flags argument
is unused in the current implementation.
There are a number of things which might be improved with this interface,
but, as it happens, its final form is likely to look quite
different. The current interface still requires frequent system calls to
retrieve events; Linux system calls are fast, but, in a high-bandwidth
situation, it still would be preferable to spend more time in user space if
possible. With a different approach to event reporting, it might just be
possible.
The idea which has been discussed is to map an array of kevent
structures between kernel and user space. This array would be treated as a
circular buffer, perhaps managed using a cache-friendly, channel-like index
mechanism. The kernel would place events into the buffer when they occur,
and user-space would consume them. Whenever there are events to process,
the application could obtain them without entering the kernel at all. Once
this mechanism is in place, the kevent_get_events() call could go
away, replaced by a simple "wait for events" interface (though glibc would
almost certainly provide a synchronous "get events" function). The result
should be a very fast interface, especially when the number of events is
high.
There are a couple of issues to be worked out, still. One has to do with
what happens when the buffer fills. The current asynchronous I/O interface
does not allow there to be more outstanding operations than there are
available control block structures; that way, there is guaranteed to be
space to report on the status of each operation. That can be important,
since the place in the kernel which wants to do the reporting is often
running at software or hardware interrupt level. If one envisions using
kevents to track thousands of open sockets, an unknown number of connection
events, etc., however, preallocating all of the event structures becomes
increasingly impractical. So something intelligent will have to be done
when the buffer fills.
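
The circular buffer idea, along with the buffer-full problem, can be sketched in ordinary C. This is a single-threaded model only: the structure layout and names are invented for illustration, a real buffer shared between kernel and user space would need memory barriers or atomic operations on the indices, and counting dropped events is just a placeholder policy, not something the developers have settled on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u    /* must be a power of two */

struct event {          /* hypothetical stand-in for a kevent structure */
    uint32_t id;
    uint32_t type;
};

struct event_ring {
    uint32_t head;      /* advanced only by the producer (the kernel) */
    uint32_t tail;      /* advanced only by the consumer (the application) */
    uint32_t dropped;   /* events lost because the ring was full */
    struct event slots[RING_SIZE];
};

void ring_init(struct event_ring *r)
{
    /* Index wraparound below relies on a power-of-two size. */
    assert((RING_SIZE & (RING_SIZE - 1u)) == 0);
    r->head = r->tail = r->dropped = 0;
}

/* Producer side: returns false, and counts a drop, if the ring is full. */
bool ring_put(struct event_ring *r, struct event ev)
{
    if (r->head - r->tail == RING_SIZE) {
        r->dropped++;
        return false;
    }
    r->slots[r->head % RING_SIZE] = ev;
    r->head++;
    return true;
}

/* Consumer side: no system call is needed while events are available. */
bool ring_get(struct event_ring *r, struct event *out)
{
    if (r->head == r->tail)
        return false;   /* empty: here the application would wait */
    *out = r->slots[r->tail % RING_SIZE];
    r->tail++;
    return true;
}
```

Because each index is written by only one side, the consumer can drain events with plain memory reads. The producer, which may be running in interrupt context, cannot sleep when the ring is full; it must either lose the event, as here, or have space reserved ahead of time.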
The other issue has to do with "level-triggered" events which correspond
more to a specific status than a real event which has occurred. "This
socket can be written to" is such an event. When an interface like
poll() is used to query whether a write would block, the kernel
can check the status and return immediately if the given file descriptor
can be written to. Reporting this sort of status through a circular buffer
is rather harder to do. So, one way or another, applications will have to
explicitly poll for such events.
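
The level-triggered behavior of poll() is easy to demonstrate. In this small sketch (the function name is an assumption; the calls are standard POSIX), a zero timeout asks the kernel to report the current status and return at once.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Ask the kernel, without blocking, whether fd can be written to right now. */
int writable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int ready;

    assert(fd >= 0);
    ready = poll(&pfd, 1, 0);   /* timeout 0: report status and return */
    return ready > 0 && (pfd.revents & POLLOUT) != 0;
}
```

Calling writable_now() repeatedly on the write end of an empty pipe gives the same answer every time, since nothing has "happened" between calls. There is a condition which holds, but no discrete occurrence to queue, which is why this kind of status fits poorly into a buffer of events.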
Given the current level of interest, some way of dealing with these issues
seems likely to surface in the near future. That could clear the path for
merging kevents into the mainline, perhaps as early as 2.6.20.
The udev utility has a well-defined job: take information from
kernel events and the sysfs virtual filesystem and use it to create device
files corresponding to the actual configuration of the system. If
udev falls down, the system will be partially or completely
unusable, a situation which tends to go over poorly with users. So, when
Andrew James Wade reported a udev
failure with a recent -mm kernel, the developers took notice.
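
As a small illustration of how closely such a tool depends on sysfs formats, consider the "dev" attribute, which exposes a device number as text in "major:minor" form. The parser below is a sketch written for this example, not code from udev; the function and parameter names are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the contents of a sysfs "dev" attribute, which holds a device
 * number as "major:minor" (for example "8:1", usually followed by a newline).
 * Returns 0 on success, -1 if the text is malformed.
 */
int parse_dev_attr(const char *text, unsigned int *major_out,
                   unsigned int *minor_out)
{
    char trailing;

    assert(text && major_out && minor_out);
    /*
     * Exactly two conversions must succeed. A third one succeeding means
     * there was unexpected non-whitespace text after the minor number.
     */
    if (sscanf(text, "%u:%u %c", major_out, minor_out, &trailing) != 2)
        return -1;
    return 0;
}
```

A device manager would go on to hand the two numbers to mknod() to create the device file. If the kernel moves or reformats an attribute that an older tool expects to find, a parse like this fails and the device node never appears.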
The problem, as it turns out, is caused by some sysfs changes designed to
improve power management in the kernel. The immediate problem can be fixed
by adding another patch, but that, in turn, only leads to further problems;
a number of distributions will break because the version of udev
they ship is too old to understand the new sysfs format. Andrew Morton complained that Fedora Core 3 breaks, but
the problem is likely to be more widespread than that.
Greg Kroah-Hartman, the developer behind the changes, responded this way:
That distro is unsupported now, right?
How long do you expect the kernel to support unsupported, community
based distros that thrive on the fact that they are quickly
And yes, I will revert the patch in mainline that causes people to
have to upgrade to a udev that is in FC5, and wait till the next
release for that to happen (the minimum will be 081, which was
released in January, 2006, by the time 2.6.19 is out, that will be
about 10 months old.)
Andrew was unimpressed:
My (repeat) point is that we're proposing to break _all_ distros
which are older than ten months. We don't play the "oh, that isn't
supported any more" game....
This sucks. Do you know what machines we'll be breaking out there?
I sure don't.
Among others, distributions scheduled to break with the 2.6.19 kernel
include Ubuntu 6.06 LTS ("dapper") and the not-yet-released Slackware 11.
So, unsurprisingly, it's not just Andrew who is displeased by this change; there is
a definite chance that the whole set of patches will be withdrawn and
Greg asks a fundamental question, however:
"How long should the community have to care about a distro after the
creators of it have abandoned it?" The traditional answer has been
"forever," but the new generation of "kernel in user space" tools is making
that promise harder to keep. Tools like udev are tightly tied to
the sysfs filesystem which, in turn, is a nearly direct representation of internal
kernel data structures. Sysfs functions, in some ways, like an internal
kernel API, but it is, in reality, a user-space interface. Keeping it
stable and avoiding compatibility problems with older user-space tools is a
difficult challenge, aggravated by the fact that the kernel developers are
still well within the process of figuring out how sysfs should really work.
At this year's Kernel Summit,
there was some talk of folding tools like
udev into the kernel code base and distributing them together.
New kernels would always come with a version of udev that worked,
and some of these compatibility problems would go away. There are limits,
however, to how many tools can be packaged in this way, and, in any case,
it can be hard to see this approach as anything other than a hack to avoid
the hard problem of keeping such a wide and complex ABI stable.
This particular problem will likely be worked around, one way or another.
But it won't be the last such problem. If the kernel developers are going to
continue to promise that the user-space ABI will remain stable
indefinitely, they will have to get a handle on all aspects of that ABI -
not just the system calls. It will not be easy: modern systems require
complex communications between the user and kernel realms. But the kernel
developers have solved plenty of "not easy" problems so far; given the
increased attention being paid to ABI regressions, they will probably
figure this one out too.
Page editor: Jonathan Corbet