
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.26-rc1, released on May 3. "So this merge window was somewhat rocky in the sense that there was a lot of arguments about it, but at the same time I at least personally think that from a technical angle, we had somewhat less scary stuff going on than has been almost the rule lately." At about 7500 commits, this cycle has fewer changes than the last couple have; a lot of the changes are infrastructural, so there will be fewer obvious new features with 2.6.26 than with some of its predecessors. See the short-form changelog for details, or the full changelog for lots of details.

A relatively slow stream of patches has been heading into the mainline git repository since the -rc1 release.

The current stable 2.6 release came out on May 6. This release contains a single fix for a locally-exploitable security problem in the filesystem locks code. Updates to older stable series were also released with this fix.

Previously, stable updates had been released with a larger set of fixes. In the absence of another security issue, there will probably not be any more 2.6.24 stable updates.


Kernel development news

Quotes of the week

Usually my git problems are root-caused down to my lack of a PhD in hermeneutic metaphysiology, but not this time, methinks.
-- Andrew Morton

Kids: do not shove random modules into your kernel. Just because Linus does something doesn't make it a good idea...

We've moved half the kernel brains to userspace with udev, initrd and modules; it's really unfair that you're not sharing all that why-won't-my-machine-boot love.

-- Rusty Russell

[T]he kernel team has evolved from a small team of buddies to a large enterprise. And to survive this evolution, we may need to apply the immoral principles found in big companies.
-- Willy Tarreau


The last things through the 2.6.26 merge window

By Jonathan Corbet
May 5, 2008
About 500 changesets were merged after the publication of the first and second 2.6.26 merge window summaries. The merge window is now closed; here is the final set of changes which got in:

  • New drivers for Solarflare Communications Solarstorm SFC4000 controller-based Ethernet controllers, Hauppauge HVR-1600 TV tuner cards, ISP 1760 USB host controllers, Cypress c67x00 OTG controllers, and Intel PXA 27x USB controllers.

  • 8KB stacks are, once again, the default for the x86 architecture. "Out-of-memory situations are less problematic than silent and hard to debug stack corruption."

  • The klist type now has the usual-form macros for declaration and initialization: DEFINE_KLIST() and KLIST_INIT(). Two new functions (klist_add_after() and klist_add_before()) can be used to add entries to a klist in a specific position.

  • As had been planned, struct class_device has been removed from the driver core, along with all of the associated infrastructure. Classes are now implemented with an ordinary struct device.

  • kmap_atomic_to_page() is no longer exported to modules.

  • There are some new generic functions for performing 64-bit integer division in the kernel:

        u64 div_u64(u64 dividend, u32 divisor);
        u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder);
        s64 div_s64(s64 dividend, s32 divisor);
        s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
    Unlike do_div(), these functions are explicit about whether signed or unsigned math is being done. The x86-specific div_long_long_rem() has been removed in favor of these new functions.

  • There is a new string function:

         bool sysfs_streq(const char *s1, const char *s2);

    It compares the two strings while ignoring an optional trailing newline.

  • The prototype for i2c probe() methods has changed:

         int (*probe)(struct i2c_client *client, 
                      const struct i2c_device_id *id);

    The new id argument supports i2c device name aliasing.

  • There is a new configuration option (MODULE_FORCE_LOAD) which controls whether the loading of modules can be forced when the kernel thinks something is not right; it defaults to "no."


Time to slow down?

By Jonathan Corbet
May 7, 2008
All communities develop rituals over time. One of the enduring linux-kernel rituals is the regular heated discussion on development processes and kernel quality. To an outside observer, these events can give the impression that the whole enterprise is about to come crashing down. But the reality is a lot like the New Year celebrations your editor was privileged enough to see in Beijing: vast amounts of smoke and noise, but everybody gets back to work as usual the next day.

Beyond that, though, discussions of this nature have real value. Any group which is concerned about issues like quality must, on occasion, take a step back and evaluate the situation. Even if there are no immediate outcomes, the ideas raised often reverberate over the following months, sometimes leading to real improvements.

The immediate inspiration for this round of discussion was broken systems resulting from the 2.6.26 merge window. This development cycle has had a rougher start than some, with more than the usual number of patches causing boot failures and other sorts of inconvenient behavior. That led to some back-and-forth between developers on how patches should be handled. Broken patches are unfortunate, but one thing is worth noting here: these problems were caught and fixed even before the 2.6.26-rc1 kernel release was made. The problems which set off this round of discussion are not bugs which will affect Linux users.

But, beyond any doubt, there will be other bugs which are slower to surface and slower to be fixed. The number of these bugs has led to a number of calls to slow down the development process in one way or another. To that end, it is worth noting that the process has slowed down somewhat, with the 2.6.26 merge window bringing in far fewer changesets than were seen for 2.6.24 or 2.6.25. Whether this slower pace will continue into future development cycles, or whether it's simply a lull after two exceptionally busy cycles, remains to be seen.

But, if the process does not slow down on its own, there are developers who would like to find a way to force it to happen. Some have argued for simply throttling the process by, for example, limiting new features in each development cycle to specific subsystems of the kernel. There has also been talk of picking the subsystems with the worst regression counts and excluding new features from those subsystems until things improve. The fact of the matter, though, is that throttling is unlikely to help the situation.

Slowing down merging does not keep developers from developing; it just keeps their code out of the tree. An extreme example can be found in the 2.4 kernel: the merging of new code was heavily throttled for a long time. What happened was that the distributors started merging new developments themselves because their users were demanding them. So a lot of kernels which went under the name "2.4" were far removed from anything which could be downloaded from kernel.org. That way lies fragmentation - and almost certainly lower quality as well.

Linus actually takes this argument further by arguing that quickly merging patches leads to better quality:

[M]y personal belief is that the best way to raise quality of code is to distribute it. Yes, as patches for discussion, but even more so as a part of a cohesive whole - as _merged_ patches!

The thing is, the quality of individual patches isn't what matters! What matters is the quality of the end result. And people are going to be a lot more involved in looking at, testing, and working with code that is merged, rather than code that isn't.

Andrew Morton has also argued against throttling:

If we simply throttled things, people would spend more time watching the shopping channel while merging smaller amounts of the same old crap.

Kernel developers are, of course, known to be hard-core shoppers, so giving them more opportunity to pursue that activity is probably not the best idea. Seriously, though: Andrew is in favor of a slower development process, but only when approached from a different angle: his point is that an increased focus on quality will, as a side effect, result in slower development. Kernel developers need to be focused on finding and fixing bugs rather than creating new ones and/or shopping.

It is worth noting that a substantial portion of the development community appears to believe that there are no real problems in this regard. Bugs are being found and fixed at a high rate and the kernel is solid for most users. Arjan van de Ven notes:

Are we doing worse on quality? My (subjective) opinion is that we are doing better than last year. We are focused more on quality. We are fixing the bugs that people hit most. We are fixing most of the regressions (yes, not all). Subsystems are seeing flat or lower bugcounts/bugrates.

Ted Ts'o points out that a lot of problems result from obscure and low-quality hardware, and that it's not possible to make everybody happy. Andrew is unconvinced, though, and seems to fear that the kernel is declining in quality.

In a sense, though, that part of the discussion is moot. Nobody would argue against the idea that fewer bugs is a worthy goal, regardless of whether one believes that the current process has quality problems. So talk of ways to make things better is always on-topic.

Testing remains a big issue; the kernel, more than almost any other project, is highly sensitive to the systems on which it is run. Many problems (arguably the majority of them) are related to specific hardware, or specific combinations of hardware; there is no way for the developers, who do not have all possible hardware to test on, to ever find all of these bugs. Users have to help with that process. Getting widespread testing coverage is always hard; H. Peter Anvin argues that the current process has actually made that harder:

One thing is that we keep fragmenting the tester base by adding new confidence levels: we now have -mm, -next, mainline -git, mainline -rc, mainline release, stable, distro testing, and distro release (and some distros even have aggressive versus conservative tracks.) Furthermore, thanks to craniorectal immersion on the part of graphics vendors, a lot of users have to run proprietary drivers on their "main work" systems, which means they can't even test newer releases even if they would dare.

There is, in fact, a wealth of development kernels to test, and it is not always clear where users and developers should be concentrating their testing effort. A consensus may be forming, though, that more people should be looking at the linux-next tree in particular. Linux-next is where all of the patches intended for the next merge window are supposed to congregate; the current contents of linux-next, as of this writing, are targeted toward 2.6.27. This is the place where early integration issues and other problems should be found; if linux-next is well tested, the number of problems showing up in the next merge window should be somewhat reduced.

The linux-next tree is an interesting experiment. It is, for all practical purposes, making the development cycle longer: since linux-next exists, the 2.6.27 cycle has, in some sense, already started. Linux-next also does something which kernel developers have tended to resist: causing the stabilization period for one development cycle to overlap with active development for the next cycle. In the past, it has been argued that this kind of overlap will cause developers to prioritize the creation of new toys over fixing the problems with last week's toys.

Some people argue that this is happening now: developers are not spending enough time dealing with bugs - and that their carelessness is creating too many bugs in the first place. Others assert that, while it will never be possible to fix every reported bug, the bugs that really matter are being addressed. A real resolution to this disagreement seems unlikely; the creation of meaningful metrics on kernel quality is a difficult task. About the best that can be done is to try to keep the regression list as small as possible; as long as systems which once worked continue to work, it is hard to argue too forcefully that things are headed in the wrong direction.


Read-only bind mounts

By Jonathan Corbet
May 6, 2008
Bind mounts can be thought of as a sort of symbolic link at the filesystem level. Using mount --bind, it is possible to create a second mount point for an existing filesystem, making that filesystem visible at a different spot in the namespace. Bind mounts are thus useful for creating specific views of the filesystem namespace; one can, for example, create a bind mount which makes a piece of a filesystem visible within an environment which is otherwise closed off with chroot().

There is one constraint to be found with bind mounts as implemented in kernels through 2.6.25, though: they have the same mount options as the primary mount. So a command like:

    mount --bind -o ro /vital_data /untrusted_container/vital_data

will fail to make /vital_data read-only under /untrusted_container if it was mounted writable initially. On your editor's 2.6.25 system, the failure is silent - the bind mount will be made writable despite the read-only request and no error message will be generated (the mount man page does document that options cannot be changed).

There is clear value in the ability to make bind mounts read-only, though. Containers are one example: an administrator may wish to create a container in which processes may be running as root. It may be useful for that container to have access to filesystems on the host, but the container should not necessarily have write access to those filesystems. As of 2.6.26, this sort of configuration will be possible, thanks to the merging of the read-only bind mounts patches by Dave Hansen.

As it happens, it's still not possible to create a read-only bind mount with the command shown above; the read-only attribute can only be added with a remount operation afterward. So the necessary sequence is something like:

    mount --bind /vital_data /untrusted_container/vital_data
    mount -o remount,ro /untrusted_container/vital_data

This example raises an interesting question: what if some process opens a file for write access between the two mount operations? A system administrator has the right to expect that a read-only mount will, in fact, only be used for read operations. The 2.6.26 patch is designed to live up to that expectation, though the amount of work required turned out to be more than the developers might have expected.

Filesystems normally track which files are opened for write access, so an attempt to remount a filesystem read-only can be passed to the low-level filesystem code for approval. But the low-level filesystem knows nothing about bind mounts, which are implemented entirely within the virtual filesystem (VFS) layer. So making read-only access for bind mounts work requires that the VFS keep track of all files which have been opened for write access. Or, more precisely, the VFS really only needs to keep track of how many files are open for write access.

The technique chosen was to create something which looks like a write lock for filesystems. Whenever the VFS is about to do something which involves writing, it must first call:

    int mnt_want_write(struct vfsmount *mnt);

The return value is zero if write access is possible, or a negative error code otherwise. This call can be found in obvious places - such as in the implementation of open() - when write access is requested. But write access comes into play in many other situations as well; for example, renaming a file requires write access for the duration of the operation. So mnt_want_write() calls have been sprinkled throughout the VFS code.

When write access is no longer needed, the "write lock" should be released with a call to:

    void mnt_drop_write(struct vfsmount *mnt);

One of the discoveries which has been made is that write access is needed in rather more places than one might have thought. In particular, it turns out that there is need for mnt_want_write() calls within the low-level filesystems as well as in the VFS layer. So getting the read-only bind mounts patch into shape has been an ongoing process of finding the spots which have been missed and adding mnt_want_write() calls there. In an attempt to make this process a bit less error-prone, Miklos Szeredi has put together a set of VFS helper functions which encapsulate the situations where write access is needed. Those functions have not been merged for 2.6.26, however.

Superficially, mnt_want_write() is easy to understand - it simply increments a counter of outstanding write accesses. The problem with a simple implementation, though, is that a shared, per-filesystem counter would create scalability problems. On multiprocessor systems, the cache line containing the counter would bounce around the system, slowing things considerably.

A common response to this type of problem is to turn the counter into a per-CPU variable, allowing operations on the counter to remain local to each processor. When somebody needs to know the total value of the counters, it's a simple matter of adding each CPU's version; this operation is slow, but it is also rare. On big systems, though, the number of CPUs can be large - as can the number of filesystems, and bind mounts will only increase that number. The result is a multiplicative effect which, once again, is a scalability problem, only this time it manifests itself in the form of excessive memory use.

The read-only bind mounts patch resolves this situation by, in effect, going back to global counters which are cached on specific processors. To that end, each CPU has one of these structures:

    struct mnt_writer {
	spinlock_t lock;
	unsigned long count;
	struct vfsmount *mnt;
    };
At any given time, this structure will hold a local count for one filesystem, represented by mnt. If the processor needs to adjust the write count for that filesystem, it's a simple matter of incrementing or decrementing count. When the processor's attention turns to a different filesystem, it must first adjust the global count for the old filesystem, then it can switch its local mnt_writer structure to the new one. The result is a compromise between purely local and purely global counters which yields "good enough" performance on benchmarks designed to stress the system.

Read-only bind mounts join with other features (such as shared subtrees) to create a flexible set of tools for the construction of the filesystem namespace. It is not clear how much of this functionality is being used at this time, but, as the implementation of containers in the mainline gets closer to completion, there is likely to be more interest in this capability. Linux systems in coming years may have much more complex filesystem layouts than have been seen in the past.


Page editor: Jonathan Corbet

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds