Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.18-rc3, released on July 29. The patch rate is beginning to slow as this kernel stabilizes, so this prepatch adds a number of fixes but not much else. The long-format changelog has the details.
Well over 100 fixes have been merged into the mainline repository since -rc3 was released.
The current -mm tree is 2.6.18-rc2-mm1. Recent changes to -mm include a big x86-64 update, an NFS update, and lots of fixes.
Kernel development news
Quote of the week
Marcelo Tosatti passes the 2.4 baton
Marcelo Tosatti has announced the availability of the third 2.4.33 release candidate, containing a very small number of patches. He has also announced that the 2.4 maintainership is passing on to Willy Tarreau, who has been running the 2.4 "hotfix" patch series for some time. Many thanks are due to Marcelo, who has maintained the 2.4 kernel since 2.4.16.
SCSI command filtering
Burning data to a CD or DVD is a complicated task, involving the use of a wide range of SCSI commands. So, any application which burns discs must have the ability to send special SCSI operations to the drive. Just before the 2.6.8 kernel came out, however, the kernel developers decided that applications should not be able to send just any SCSI command. Some of those commands could lead the drive to rewrite its firmware, catch fire, or replace music tracks with recordings of Richard Stallman singing. In an attempt to keep such undesirable things from happening, Linus added a late patch which blocked unprivileged users from using any SCSI commands which do not appear in an in-kernel whitelist.
It is almost certainly true that no user ever destroyed a CD drive with a 2.6.8 system. In fact, very few of them even wrote discs; the filtering at that stage was so severe that unprivileged users could not do anything useful at all. Subsequent updates made things better, however, and by about 2.6.10 burning worked again for most users.
Not for all users, however. As Dave Jones recently noted on the linux-scsi list, the command filtering still trips up some Plextor drives. The cdrecord utility tries to send vendor-specific commands to those drives, but the kernel filters them out. Everything then comes to a halt, and the user must retry the operation as root to get the job done. Dave asked: might it be a good idea to add a per-vendor exceptions capability to the filtering code?
The response which came back from a couple of block subsystem developers was that the command filtering should simply be taken out altogether. Evidently this topic had been discussed at the recent storage summit, and the participants had agreed that this feature should be removed. James Bottomley put it this way:
So I think ripping the table out and acknowledging we have no security is better than giving the illusion of having it.
There are a number of complaints about the filtering code. It is a way of encoding policy in the kernel, which is generally frowned upon - even though the policy, in this case, is really an attempt to enforce a difference between access to a disc within a drive and access to the drive itself. The command list will never be entirely correct; it seems that some drives must receive the appropriate, vendor-specific incantations or they will refuse to write discs. Some commands mean different things to different types of devices; what's safe for a CD burner might be a destructive operation on a different SCSI-like device. It also doesn't help that there are, in fact, two different SCSI command filters in the kernel (one in drivers/scsi/sg.c, the other in block/scsi_ioctl.c) which implement different policies. For all of these reasons, attendees at the storage summit apparently agreed to take the filtering out.
There's just one little problem with this plan: Linus feels differently about filtering:
This statement would appear to be pretty damn final. That does not mean that the situation cannot be improved, however. The leading idea at the moment would appear to be to allow a privileged user to make changes to the command filter table. Distributions could then ship tools which detect problematic devices and modify the filtering tables accordingly; the whole thing could be transparently integrated with the hotplug functionality. Jens Axboe has a patch (originally from Peter Jones) which turns the filter list into a per-device object, tweakable through sysfs, so each device could have its own set of exceptions.
Just how this interface works may yet require some discussion to nail down. But the configurable, per-device filter looks like the way forward. It retains the filtering of dangerous commands while moving the policy decisions to user space. Once the policy can be changed, distributors can do the work to ensure that specific devices are well supported, or, if they prefer, simply mark all commands as "allowed" and, for all practical purposes, remove the filter altogether.
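To make the per-device idea a bit more concrete, here is a minimal user-space sketch in C of a filter table kept as one bit per SCSI opcode. It is not the code from the patch discussed above; the structure and function names (cmd_filter, filter_init_default(), filter_allow(), filter_permits()) are invented for illustration, and the default whitelist is an assumption chosen only to keep the example short. The opcode values shown are the standard SCSI ones; 0xe9 is simply an arbitrary value in the vendor-specific range.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-device filter: one bit for each of the 256 opcodes. */
struct cmd_filter {
    uint32_t allowed[256 / 32];
};

/* A few standard SCSI opcodes, used here only as examples. */
enum {
    OP_TEST_UNIT_READY = 0x00,
    OP_READ_10         = 0x28,
    OP_WRITE_10        = 0x2a,
    OP_WRITE_BUFFER    = 0x3b   /* can be used to download firmware */
};

/* Mark one opcode as permitted for unprivileged users of this device. */
void filter_allow(struct cmd_filter *f, uint8_t opcode)
{
    f->allowed[opcode / 32] |= UINT32_C(1) << (opcode % 32);
}

/* Start from an empty table and add a small, assumed default whitelist. */
void filter_init_default(struct cmd_filter *f)
{
    memset(f, 0, sizeof(*f));
    filter_allow(f, OP_TEST_UNIT_READY);
    filter_allow(f, OP_READ_10);
    filter_allow(f, OP_WRITE_10);
}

/* Privileged callers bypass the table; everybody else is checked. */
bool filter_permits(const struct cmd_filter *f, uint8_t opcode,
                    bool privileged)
{
    if (privileged)
        return true;
    return (f->allowed[opcode / 32] >> (opcode % 32)) & 1;
}
```

With a table like this attached to each device, a distribution tool which recognizes a problematic burner could call the equivalent of filter_allow() for the vendor-specific opcodes that drive needs, leaving every other device's table untouched.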
Debating reiser4 - again
Hans Reiser is nothing if not persistent. Back in October, 2002, he requested that his new reiser4 filesystem be included into the 2.5 development kernel before it went into the pre-2.6 stabilization mode. Nearly four years have passed, during which reiser4 has been through endless linux-kernel debates, numerous changes to fix problems found by reviewers, the removal of core features, and a long wait in the -mm kernel. Despite all of this, reiser4 is still not in the mainline - but Hans has not given up.
There have been a number of obstacles to overcome so far. The "files as directories" feature tweaked POSIX semantics in a way that disturbed some people, and, more importantly, had crucial locking problems; that feature has been removed. The posted benchmarks have not been entirely credible to all observers. There is concern about how committed the reiser4 developers are to ongoing support of the filesystem, once it is merged. Hans tends to have difficult relations with other kernel developers, and does not always respond entirely gracefully to (often not entirely graceful) review comments. The end result has been a difficult path toward inclusion for a filesystem which truly does offer some interesting ideas and the potential for top-level performance.
Partially as a result of a feeling that the reiser4 process has gone on for too long, the debate has returned to linux-kernel. Hans and company would like to see reiser4 put into 2.6.19, and it seems that they might just succeed.
Some outstanding issues remain, though some of them may not be as problematic as some people think. The biggest of those, probably, is the reiser4 plugin concept. Plugins allow the filesystem to behave differently for every file stored there; they can add features like compression, encryption, or many of the more esoteric things currently done with FUSE. Plugins raise all kinds of red flags in the development community. So, for example, Linus states:
Jeff Garzik has concerns as well:
The message for the reiser4 developers over the last few years is that any such mechanism, if it makes sense at all, should be implemented within the VFS level, rather than within any specific filesystem. Reiser4 plugins are seen as a separate, private VFS with considerable potential for problems.
What a number of people have not realized, perhaps, is that the plugin issue is much smaller than it once might have been. They cannot be loaded at run time, so there should not be copyright issues like those that accompany closed-source kernel modules. And most of the plugin functionality has been removed in response to past comments. Andrew Morton, who has recently reviewed the code himself, comments:
From Andrew's point of view, the biggest problems would appear to be the lack of direct I/O and extended attribute support. Direct I/O looks like it might not be too far in the future, but it does not appear that there is any immediate prospect of extended attributes. That means that, among other things, a reiser4 filesystem cannot support SELinux. That limitation may cause some distributors to leave reiser4 support out, even after reiser4 has finally been merged into the mainline kernel.
The remaining objections may be enough to dissuade some users or distributors from working with reiser4, but it would seem that they should not be enough to block the merging of reiser4 into the mainline. A new filesystem does not affect anybody who does not use it, and the bad pitfalls for reiser4 users (deadlocks, for example) should be long gone. So it may just be that Hans Reiser's long wait is nearing its end.
Toward a kernel events interface
Last week's article on network channels suggested that channels might not be the way of the future at all. Since then, there has been a great deal of discussion on how networking might move forward on many levels, some of which might yet include channels. Your editor plans to gain an understanding of the Grand Unified Flow Cache and related concepts (such as Rusty's plans to thrash up netfilter yet again) for a future article; for now, we'll look at a different aspect of networking (and beyond): a user-space events interface.
Unlike some other operating systems, Linux currently lacks a system call for generalized event reporting. Linux applications, instead, use calls like poll() to figure out when there is work to be done. Unfortunately, poll() does not solve the entire problem, so application event loops must do complicated things to deal with things like signals. Handling asynchronous I/O within a traditional Linux event loop can be especially tricky. If there were a single interface which provided an application with all of the event information it needed, applications would get simpler. There is also the potential for significant performance improvements.
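As an illustration of the sort of "complicated things" a poll()-based loop ends up doing, here is a minimal C sketch of the well-known "self-pipe" trick: a signal handler writes the signal number into a pipe, so that the signal shows up in poll() as just another readable file descriptor. This is not taken from any of the proposals discussed here; the function names (signal_pipe_init(), wait_for_signal()) are invented for the example, and a real application would watch many descriptors rather than just this one.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Illustrative names only; nothing here comes from a kernel patch. */
static int sig_pipe[2] = { -1, -1 };

/* Signal handler: forward the signal number into the pipe. */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t ignored = write(sig_pipe[1], &byte, 1);

    (void)ignored;          /* a full pipe just drops the notification */
    errno = saved_errno;
}

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the pipe and route the given signal into it. */
int signal_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(sig_pipe) < 0)
        return -1;
    if (set_nonblocking(sig_pipe[0]) < 0 || set_nonblocking(sig_pipe[1]) < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/*
 * Wait for a forwarded signal.  Returns the signal number, 0 on
 * timeout, or -1 on error.  An interrupted poll() is simply retried
 * with the full timeout, which is good enough for a sketch.
 */
int wait_for_signal(int timeout_ms)
{
    struct pollfd pfd = { .fd = sig_pipe[0], .events = POLLIN };
    unsigned char byte;
    int n;

    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        return n;
    if (read(sig_pipe[0], &byte, 1) != 1)
        return -1;
    return byte;
}
```

All of this machinery exists only to turn one kind of event into another; a unified event interface would let the application ask for signal events directly.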
There are two active proposals for event interfaces for Linux: the kevent mechanism and the event channel API proposed by Ulrich Drepper at this year's Ottawa Linux Symposium. Of the two, kevents currently have the advantage for one simple reason: there is an existing, working implementation to look at. So most of the discussion has concerned how kevents can be improved.
The original kevent API is seen as being a bit difficult; it relies on a single multiplexer system call (kevent_ctl()), an approach which is generally frowned upon. The call also requires the application to construct an array with two different types of structures, which is a bit awkward. So one of the first suggestions has been to separate out various parts of the API. The current kevent patch (as of August 1) contains a new system call:
int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void *buf, unsigned flags);
This call would return between min_nr and max_nr events, storing them sequentially in buf, subject to the given timeout (specified in milliseconds). The flags argument is unused in the current implementation.
There are a number of things which might be improved with this interface, but, as it happens, its final form is likely to look quite different. The current interface still requires frequent system calls to retrieve events; Linux system calls are fast, but, in a high-bandwidth situation, it still would be preferable to spend more time in user space if possible. With a different approach to event reporting, it might just be possible.
The idea which has been discussed is to map an array of kevent structures between kernel and user space. This array would be treated as a circular buffer, perhaps managed using a cache-friendly, channel-like index mechanism. The kernel would place events into the buffer when they occur, and user-space would consume them. Whenever there are events to process, the application could obtain them without entering the kernel at all. Once this mechanism is in place, the kevent_get_events() call could go away, replaced by a simple "wait for events" interface (though glibc would almost certainly provide a synchronous "get events" function). The result should be a very fast interface, especially when the number of events is large.
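The circular buffer idea can be sketched in a few lines of C. What follows is a single-process simulation under stated assumptions, not the kevent code: the structure layout, the names (demo_event, event_ring, ring_put(), ring_get()) and the eight-slot size are all made up for the example. A real buffer shared between the kernel and user space would also need memory barriers around the index updates, which are omitted here.

```c
#include <stdint.h>

#define RING_SLOTS 8u   /* must be a power of two */

/* Stand-in for a kevent structure; the fields are invented. */
struct demo_event {
    uint32_t id;
    uint32_t type;
};

struct event_ring {
    unsigned int head;  /* advanced only by the producer (the "kernel") */
    unsigned int tail;  /* advanced only by the consumer (the application) */
    struct demo_event slots[RING_SLOTS];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_put(struct event_ring *r, struct demo_event ev)
{
    if (r->head - r->tail == RING_SLOTS)
        return -1;
    r->slots[r->head & (RING_SLOTS - 1)] = ev;
    r->head++;
    return 0;
}

/* Consumer side: returns 1 and fills *out, or 0 if nothing is pending. */
int ring_get(struct event_ring *r, struct demo_event *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->slots[r->tail & (RING_SLOTS - 1)];
    r->tail++;
    return 1;
}
```

The indexes run freely and are masked only when a slot is accessed, so "full" and "empty" can be told apart without sacrificing a slot. The consumer never needs a system call as long as ring_get() keeps returning events; the -1 return from ring_put() marks the point where a real implementation has to decide what to do with an event that does not fit.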
There are a couple of issues to be worked out, still. One has to do with what happens when the buffer fills. The current asynchronous I/O interface does not allow there to be more outstanding operations than there are available control block structures; that way, there is guaranteed to be space to report on the status of each operation. That can be important, since the place in the kernel which wants to do the reporting is often running at software or hardware interrupt level. If one envisions using kevents to track thousands of open sockets, an unknown number of connection events, etc., however, preallocating all of the event structures becomes increasingly impractical. So something intelligent will have to be done when the buffer fills.
The other issue has to do with "level-triggered" events which correspond more to a specific status than a real event which has occurred. "This socket can be written to" is such an event. When an interface like poll() is used to query whether a write would block, the kernel can check the status and return immediately if the given file descriptor can be written to. Reporting this sort of status through a circular buffer is rather harder to do. So, one way or another, applications will have to explicitly poll for such events.
Given the current level of interest, some way of dealing with these issues seems likely to surface in the near future. That could clear the path for merging kevents into the mainline, perhaps as early as 2.6.20.
New kernels and old distributions
The udev utility has a well-defined job: take information from kernel events and the sysfs virtual filesystem and use it to create device files corresponding to the actual configuration of the system. If udev falls down, the system will be partially or completely unusable, a situation which tends to go over poorly with users. So, when Andrew James Wade reported a udev failure with a recent -mm kernel, the developers took notice.
The problem, as it turns out, is caused by some sysfs changes designed to improve power management in the kernel. The immediate problem can be fixed by adding another patch, but that, in turn, only leads to further problems; a number of distributions will break because the version of udev they ship is too old to understand the new sysfs format. Andrew Morton complained that Fedora Core 3 breaks, but the problem is likely to be more widespread than that.
Greg Kroah-Hartman, the developer behind the changes, responded this way:
How long do you expect the kernel to support unsupported, community based distros that thrive on the fact that they are quickly updated? [...]
And yes, I will revert the patch in mainline that causes people to have to upgrade to a udev that is in FC5, and wait till the next release for that to happen (the minimum will be 081, which was released in January, 2006, by the time 2.6.19 is out, that will be about 10 months old.)
Andrew was unimpressed:
Among others, distributions scheduled to break with the 2.6.19 kernel include Ubuntu 6.06 LTS ("dapper") and the not-yet-released Slackware 11. So, unsurprisingly, it's not just Andrew who is displeased by this change; there is a definite chance that the whole set of patches will be withdrawn and rethought.
Greg asks a fundamental question, however:
"How long should the community have to care about a distro after the creators of it have abandoned it?" The traditional answer has been "forever," but the new generation of "kernel in user space" tools is making that promise harder to keep. Tools like udev are tightly tied to the sysfs filesystem which, in turn, is a nearly direct representation of internal kernel data structures. Sysfs functions, in some ways, like an internal kernel API, but it is, in reality, a user-space interface. Keeping it stable and avoiding compatibility problems with older user-space tools is a difficult challenge, aggravated by the fact that the kernel developers are still well within the process of figuring out how sysfs should really work.
At this year's Kernel Summit, there was some talk of folding tools like udev into the kernel code base and distributing them together. New kernels would always come with a version of udev that worked, and some of these compatibility problems would go away. There are limits, however, to how many tools can be packaged in this way, and, in any case, it can be hard to see this approach as anything other than a hack to avoid the hard problem of keeping such a wide and complex ABI stable.
This particular problem will likely be worked around, one way or another. But it won't be the last such. If the kernel developers are going to continue to promise that the user-space ABI will remain stable indefinitely, they will have to get a handle on all aspects of that ABI - not just the system calls. It will not be easy: modern systems require complex communications between the user and kernel realms. But the kernel developers have solved plenty of "not easy" problems so far; given the increased attention being paid to ABI regressions, they will probably figure this one out too.
Page editor: Jonathan Corbet