
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.23-rc4, released by Linus (under the code name "Pink Farting Weasel") on August 27. It has a rather large pile of fixes; "most regressions" have been dealt with at this point. See the short-form changelog for details, or the long-form changelog for lots of details.

As of this writing, there have been no patches merged into the mainline repository since the -rc4 release. There have been no -mm tree releases over the last week.

The current stable 2.6 kernel is 2.6.22.5, released on August 22. It contains about 20 patches for serious problems. The 2.6.22.6 review process (involving a couple dozen more patches) is underway, with the release being a bit overdue as of this writing.

For older kernels: 2.6.20.17 was released on August 25 with a long list of fixes. 2.6.20.18, released on August 28, reverts two of those fixes which turned out not to be such a good idea after all.


Kernel development news

Quote of the week

In other words, consuming half of your processor is (surprise!) detrimental to multimedia playback performance. At this point, it becomes clear that the process scheduler folks and the networking folks are bitter enemies and do not converse.
-- Robert Love (not talking about Linux)


Kernel Summit 2007 - an advance view

By Jonathan Corbet
August 24, 2007
For the past several years, the annual, invitation-only kernel developers' summit has been held immediately prior to the Ottawa Linux Symposium. This year is different, though: the summit is, instead, happening just after LinuxConf Europe in Cambridge, UK. As usual, your editor will be there and will be able to report from the event. The preliminary agenda has already been posted, as has the list of attendees, so it is possible to look forward and get a sense for what is likely to be discussed.

A few months ago, a discussion of interesting topics was held on the 2007 summit list. Many of the usual topics came around; there is always plenty of interesting development work going on in the kernel community. Andrew Morton objected to many of the topics under discussion, though, saying that the summit was not the appropriate venue to talk about them:

My overall take on kernel summit: we spend far too much time talking about technical stuff. There is little benefit in doing this: we conduct technical discussions over email and we do it well, and there are many very good reasons for doing it that way.... We fly halfway around the world to yap on about dentry cache scalability? Spare me, we'd get more done by staying home.

Andrew's conclusion, which was seconded by a number of other developers, was that the process-oriented discussions are always more interesting and useful than the deep technical sessions. Discussions of virtualization, memory management, or device drivers will always be uninteresting to a significant part of the group, and they do not necessarily add much over what can be done with email. But the process-oriented talk affects everybody and is much harder to do electronically.

So this year's agenda is more high-level than in previous years. That does not mean that there will be no technical talk, though. Some of the more technical sessions will cover:

  • Reports from mini-summits. The kernel is a big program, and developers often find that subsystem-specific questions are better addressed in smaller groups. At the summit, attendees from some recent mini-summits (covering power management, filesystems, storage, and virtualization, at least) will report back to the larger group.

  • Real time and scheduler issues are on the agenda because there are some big decisions to make. While much of the real-time tree has found its way into the mainline, some of the more disruptive chunks (sleeping spinlocks, threaded interrupt handlers) remain outside. Also outside of the mainline is the syslets/threadlets patch set. Hopefully some decisions will be made on whether these features should be merged, and, if so, what needs to be done to get them into shape.

  • There are a number of memory management issues out there, including the variable page and variable block size patches, approaches to deadlock avoidance, scalability work, and more. Also on the agenda is the more process-oriented question of why memory management patches are so hard to get into the mainline.

  • Virtualization has fallen off the agenda because most of the kernel-level work in this area has already been merged. The containers developers are just getting going, though, and there are a lot of questions about what their final destination is thought to be. A full containers implementation could impose significant overhead - on developers and on run-time performance - and could prove hard to sell.

That's about it for the serious technical talks; everything else will have a higher-level focus. The summit will start with a panel of distributor kernel maintainers. To a great extent, distributors are the immediate customers for the kernels that the developers put out; those distributors are then charged with getting mainline releases into a condition that allows them to be shipped to users. Distributor kernel maintainers tend to be on the front line when things go wrong; they always hear about all the problems. This panel will be a chance for those maintainers to talk about the quality of the kernels they are getting from the mainline and how things could be made to work better.

Once upon a time, the kernel stood alone and presented services to the system by way of the system call interface. In current systems, instead, users see a view of the system which is created by a whole set of utilities, including the C library, udev, HAL, and more. Interactions between these low-level components and the kernel are not always as smooth as they could be, and, despite the best efforts of the kernel development community, kernel releases have been known to occasionally break utilities like udev. The "greater kernel ecosystem" session will cover these issues and the general question of making the system as a whole work better together. Establishing better control over the user-space API is likely to come up, though the problem remains difficult.

There is a half-hour session on developer relations. The kernel development community is visibly growing, and that is generally a good thing. Ensuring the continued health of kernel development requires bringing in a steady stream of new developers - from all over the world. This session will be the place to talk about how that can be done, and how participation from under-represented parts of the world can be improved.

Andrew Morton gets an hour to pound the table on kernel quality and related issues. There still appears to be a consensus among the developers that the kernel is not getting buggier, but that view is not universally held. Everybody agrees that fewer bugs would be a good thing, though. So topics like bug tracking, fixing the reviewer shortage, possible stabilization releases, and so on, are likely to come up in this session.

Documentation is, inevitably, on the agenda - everybody wants more of it, but, somehow, it fails to just show up on its own. Last year there was some talk of imposing documentation requirements on new patches, but few people took the idea all that seriously. So maybe some different ideas for improving the situation will come about this time around. Also on the list may be the area of managing translations - an area of increasing interest - and standardizing kernel messaging.

Various other process-oriented questions have been swept into a session late on the second day. Are big code cleanups worth it? How can we improve our handling of large patches which affect a number of different subsystems? How do we deal with problematic maintainers? And, in general, is the kernel process going too fast? But perhaps the discussion will be dominated by Andrew Morton's suggestion that the developers form a union and demand a massive pay raise.

There are other sessions on the agenda as well; see the posted version for the full list. Whenever a group of this nature comes together, interesting things are bound to come out of it. Tune into LWN around September 6 for coverage from the event.


Cleaning up the block driver API

By Jonathan Corbet
August 28, 2007
Once upon a time, block device drivers implemented the same file_operations structure used by char drivers - despite the fact that block drivers are quite different and many of the file_operations methods had no relevance to them. By the 2.4 release, though, the block driver API had been significantly reworked, and struct file_operations was no longer used. Instead, block drivers have a block_device_operations structure containing many of the driver's exported operations. "Many" because certain other operations, including the ones which actually enqueue I/O requests, end up being stored in the request queue structure instead.

When the move to block_device_operations was done, a number of methods were carried over directly from the file_operations vector with their prototypes unchanged. Doing things this way minimized the pain for driver maintainers, but it led to some interesting interface artifacts. For example, consider the open() method:

    int (*open)(struct inode *ino, struct file *filp);

When a char device or an actual file is being opened, filp points to the internal file structure used by the kernel to manage the open file. If a user-space process opens a block device directly, filp will be used in the same way. Most of the time, though, block devices are opened by the kernel as a step toward mounting a filesystem stored there. In that case, there is no associated file structure. That's why a perusal of the source reveals code like this:

    /*
     * This crockload is due to bad choice of ->open() type.
     * It will go away.
     * For now, block device ->open() routine must _not_
     * examine anything in 'inode' argument except ->i_rdev.
     */
    struct file fake_file = {};
    struct dentry fake_dentry = {};
    fake_file.f_mode = mode;
    fake_file.f_flags = flags;
    fake_file.f_path.dentry = &fake_dentry;
    fake_dentry.d_inode = bdev->bd_inode;

Al Viro (who is responsible for much of the current API) has taken a look at this problem and others. In the case of open(), there is very little of the information passed in the inode and file structure pointers which is actually used by drivers. And some of that is used in hazardous ways - any driver which depends on anything in fake_file lasting beyond the open() call will find itself in trouble. There are other issues with the API as well, leading Al to propose some significant changes. The result, which is almost certain to be merged when it is ready (possibly as soon as 2.6.24), will be a cleaner block driver API - at the cost of changes for every existing driver.

The first change will be to move some of the flags found in f_flags over to f_mode, which is not subject to being changed by fcntl() calls from user space. As part of the move, drivers will be expected not to change those flags - or any other part of the file structure. This change will enable a cleanup of some code in the much-maligned floppy driver, which currently stores some information in that structure at open() time.

The new open() prototype is projected to be:

    int (*open)(struct block_device *bdev, mode_t mode);

Here, mode carries the usual read/write flags, along with some of the other open()-time flags like O_NDELAY. This value will not be changed by the drivers and will not necessarily exist in any sort of file structure. It will be stored safely in an undisclosed location by the kernel and will be available at release() time, when some drivers will need access to those flags.
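
As a rough illustration of what a driver's open() method might look like under the proposed prototype, consider the user-space sketch below. Nothing here is taken from Al's patch, which has not been posted: struct block_device is a stand-in with made-up fields, the SKETCH_MODE_* bits are a hypothetical encoding of mode, and sketch_open() is an invented name. The point is only that the method can make its decisions from bdev and mode alone, without touching an inode or file structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the kernel's struct block_device; the fields are invented. */
struct block_device {
    int read_only;   /* hypothetical: nonzero if the media is write-protected */
    int open_count;  /* hypothetical: number of successful opens */
};

/* Hypothetical encoding of "mode"; the real flag values are an assumption. */
#define SKETCH_MODE_READ   0x1u
#define SKETCH_MODE_WRITE  0x2u
#define SKETCH_MODE_NDELAY 0x4u

/*
 * open() in the shape of the proposed prototype: everything the driver
 * needs arrives in bdev and mode, and mode is only read, never modified.
 */
int sketch_open(struct block_device *bdev, mode_t mode)
{
    assert(bdev != NULL);

    /* Refuse write access to write-protected media. */
    if ((mode & SKETCH_MODE_WRITE) && bdev->read_only)
        return -EROFS;

    bdev->open_count++;
    return 0;
}
```

A caller would pass the same mode value again at release() time, so a driver written this way has no reason to stash anything in a file structure which might not outlive the call.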

Speaking of release(), that function, too, currently has an old prototype:

    int (*release)(struct inode *ino, struct file *filp);

In this case, filp is often passed as NULL by the kernel, forcing drivers to check the value and implement some sort of default behavior in the absence of a file structure. But, sometimes, drivers need to know about some of the flags which were provided at open() time. So the new release() method will look something like:

    int (*release)(struct gendisk *disk, mode_t mode);
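
To see why having mode at release() time could be useful, here is a small user-space sketch of a release() method in the proposed shape. It is an assumption-laden illustration, not code from any posted patch: struct gendisk is a stand-in with invented fields, the SKETCH_MODE_* bits are hypothetical, and the "flush when the last writer leaves" policy is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the kernel's struct gendisk; the fields are invented. */
struct gendisk {
    int users;          /* hypothetical: current number of openers */
    int writers;        /* hypothetical: openers which asked for write access */
    int cache_flushes;  /* hypothetical: how often the cache was flushed */
};

/* Hypothetical encoding of "mode"; the real flag values are an assumption. */
#define SKETCH_MODE_READ  0x1u
#define SKETCH_MODE_WRITE 0x2u

/*
 * release() in the shape of the proposed prototype.  There is no filp to
 * test against NULL; the open()-time flags arrive directly in mode.
 */
int sketch_release(struct gendisk *disk, mode_t mode)
{
    assert(disk != NULL && disk->users > 0);

    if (mode & SKETCH_MODE_WRITE) {
        assert(disk->writers > 0);
        /* Flush once the last writer has gone away. */
        if (--disk->writers == 0)
            disk->cache_flushes++;
    }

    disk->users--;
    return 0;
}
```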

The changes do not stop there. Al points out that there is a bit of confusion in the ioctl() interface:

    int (*ioctl)(struct inode *ino, struct file *filp, unsigned cmd, 
                 unsigned long arg);
    long (*unlocked_ioctl)(struct file *filp, unsigned cmd, unsigned long arg);
    long (*compat_ioctl) (struct file *filp, unsigned cmd, unsigned long arg);

The different versions have different arguments - and even different return types. Once again, drivers tend not to care about most of what can be found in the inode and file structures - even when those structures exist. So the new form of the ioctl() methods will be:

    int (*ioctl)(struct block_device *bdev, mode_t mode, unsigned int cmd, 
                 unsigned long arg);
    int (*compat_ioctl)(struct block_device *bdev, mode_t mode, unsigned int cmd,
                        unsigned long arg);

Note that unlocked_ioctl() is gone: it is arguably past time to get rid of the big kernel lock (BKL) in the block ioctl() implementation. So any driver still using the locked version (ioctl() in the old API) will be modified to take the BKL internally. Any block driver which still requires the BKL is probably in need of a more serious review, though.
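
The conversion of a driver which still depends on the BKL might look something like the following user-space sketch. In the kernel the lock is taken and released with lock_kernel() and unlock_kernel(); here those calls are replaced by counting stubs so that the example can be built and run anywhere. The struct block_device stand-in, the SKETCH_GET_SECTORS command, and all function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the kernel's struct block_device; the field is invented. */
struct block_device {
    int sectors;
};

/* Counting stubs standing in for lock_kernel() and unlock_kernel(). */
static int sketch_bkl_depth;
static void sketch_lock_kernel(void)   { sketch_bkl_depth++; }
static void sketch_unlock_kernel(void) { sketch_bkl_depth--; }

#define SKETCH_GET_SECTORS 1u  /* hypothetical ioctl command number */

/* Old driver code which was written assuming the BKL is already held. */
static int legacy_ioctl_body(struct block_device *bdev, unsigned int cmd,
                             unsigned long arg)
{
    (void)arg;
    assert(sketch_bkl_depth > 0);

    switch (cmd) {
    case SKETCH_GET_SECTORS:
        return bdev->sectors;
    default:
        return -ENOTTY;
    }
}

/*
 * ioctl() in the proposed shape: the core no longer takes the lock, so
 * the driver wraps its old code in the lock itself.
 */
int sketch_ioctl(struct block_device *bdev, mode_t mode, unsigned int cmd,
                 unsigned long arg)
{
    int ret;

    (void)mode;
    assert(bdev != NULL);

    sketch_lock_kernel();
    ret = legacy_ioctl_body(bdev, cmd, arg);
    sketch_unlock_kernel();
    return ret;
}
```

Pushing the lock down in this way changes no behavior by itself, but it makes the dependency visible in the driver source, where it can be reviewed and, eventually, removed.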

As of this writing, there have been no arguments against the change. The word from Linus is:

From your description, I have no objections - everything sounds good. My only concern is how painful the patch ends up being (and a worry about whether this will affect a metric truck-load of external modules? That said, I can't really see us worrying about those)

Al says that he has a patch in progress which should be ready for posting soon, and that the amount of pain should be relatively small - for in-tree drivers, anyway. For those maintaining out-of-tree block drivers, the writing is on the wall: a significant API change is coming.


Re-deprecating sysctl()

By Jonathan Corbet
August 29, 2007
The sysctl() system call allows a suitably-privileged application to tweak various kernel parameters. It is a useful feature which, as it happens, is almost never used. The reason for that is the existence of the /proc/sys virtual directory hierarchy which exports the same functionality in a form which is much easier to use. Callers of sysctl() have been encouraged to use /proc/sys instead for a long time and the addition of new parameters to sysctl() is considered to be against the rules. One year ago, sysctl() was removed from the 2.6.19-rc kernels, only to be restored before the final release.
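
For those wondering what "much easier to use" means in practice: a parameter under /proc/sys is an ordinary text file, so reading one takes nothing more than the standard I/O library. The helper below is a minimal sketch; read_param() is an invented name, and the function works on any path handed to it rather than knowing anything special about /proc/sys.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a text parameter from the file at "path" into buf (at most len - 1
 * bytes), dropping everything from the first newline on.
 * Returns 0 on success, or -1 if the file could not be opened.
 */
int read_param(const char *path, char *buf, size_t len)
{
    FILE *f;
    size_t n;

    assert(path != NULL && buf != NULL && len > 0);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    n = fread(buf, 1, len - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Parameter files normally end with a newline; strip it. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

On a Linux system, a call like read_param("/proc/sys/kernel/ostype", buf, sizeof(buf)) would fetch the same value which the binary interface exposes through a numeric name, with no need for any sysctl-specific header files.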

sysctl() is part of the user-space ABI; it is supposed to continue working forever. That is why the attempt to remove it was ultimately rolled back. So it may be surprising to some to see a new removal attempt by Eric Biederman. His latest patch adds a new deprecation warning and an entry in the feature removal schedule putting the end of sysctl() in September, 2010. Says Eric:

After adding checking to register_sysctl_table and finding a whole new set of bugs. Missed by countless code reviews and testers I have finally lost patience with the binary sysctl interface.

The binary sysctl interface has been sort of deprecated for years and finding a user space program that uses the syscall is more difficult then finding a needle in a haystack. Problems continue to crop up, with the in kernel implementation. So since supporting something that no one uses is silly, deprecate sys_sysctl with a sufficient grace period and notice that the handful of user space applications that care can be fixed or replaced.

Eric's claim is that this interface is so little-used that it is visibly rotting. There is sufficiently little common code between the sysctl() and /proc/sys implementations that it is easy for the two to diverge. In the long term, he says, the kernel community will do a better job of not breaking applications by getting rid of sysctl() in favor of the interface which is actually used and maintained.

The new patch has, predictably, drawn opposition from developers who do not want to see the user-space ABI broken in this way. Alan Cox has also suggested that the deprecation warning approach will not be successful in getting the few remaining users to switch to /proc/sys:

The whole "whine a bit" process simply doesn't work when you are trying to persuade people to move in a non-hobbyist context. They don't want to move, the message is simply an annoyance, their upstream huge package vendor won't change just to deal with it and they'll class it as a regression from previous releases, an incompatibility and file bugs until it goes away.

Andrew Morton, instead, is not opposed to the patch:

I think it's worth a try. It might take two, three or five years, who knows? If it turns out to be impractical then we we can just change our minds later, no big loss.

While there is little disagreement with the policy that the user-space ABI should never break, it does seem that there is room for discussion on how that goal might best be met. Unused code has always had a tendency to break accidentally, and sysctl() looks to be very close to being entirely unused. One could, presumably, address this problem with some sort of regression test suite - something the kernel could use more of in general. But the maintenance of interfaces which are of almost entirely historical interest is not really helpful to Linux users. So, perhaps, there needs to be a way to remove system calls which have fallen into disuse for a long-enough period. Should this patch go through, we shall see whether three years is sufficient warning for such a change or not.



Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds