
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.25-rc6, released on March 16. The changes are mostly fixes, but there are still quite a few of them for this point in the release cycle. See the announcement for details, or the long-format changelog for lots of details.

A handful of changes have gone into the mainline git repository since the 2.6.25-rc6 release.

As of this writing, vger.kernel.org is down, slowing the development process somewhat. Or, perhaps, slowing talk and speeding development. Regardless, the failure (a disk in vger's RAID array) is being addressed with the intent of getting vger back online as soon as possible.


Kernel development news

Quotes of the week

One man, 12 nights (13 days), one bottle of cuban rum and little bits of scotch whisky, 82 'House M.D' series... feels good.
-- How Evgeniy Polyakov gets work done

So, we're going to have to now convert all drivers, right? Nice, I can always use a bump up in the "number of patches submitted" numbers :)
-- Greg Kroah-Hartman


Recovering deleted files from ext3

Carlo Wood seems to have mistakenly deleted his home directory and instead of reaching for his backups, he dug into the ext3 filesystem structure. The result is an in-depth look at ext3 including how to undelete files. The end result is an ext3grep tool that looks like it might be rather useful. "However, this is utter nonsense. All information is still there, also the block pointers. It is just slightly less likely that those are still there (than on ext2), since they have to be recovered from the journal. On top of that, the meta data is less coherently related to the real data so that heuristic algorithms are needed to find things back." (seen at Val Henson's weblog)


Generic semaphores

By Jonathan Corbet
March 17, 2008
Most kernel patches delete some code, replacing it with newer and (presumably) better code. Much of the time, it seems, the new code is more voluminous than what came before. Occasionally, though, a patch comes along which deletes over 7600 lines of code - replacing it with a mere 314 lines - while claiming to maintain the same functionality. Matthew Wilcox's generic semaphore patch is one of those changes.

In essence, a semaphore is a counter with a wait queue attached to it. When kernel code wants to access the resource protected by the semaphore, it makes a call to:

    void down(struct semaphore *sem);

This call will check the counter associated with sem; if it is greater than zero, the counter will be decremented and control returns to the caller. Otherwise the caller will be put to sleep until sometime in the future when the counter has been increased again. Increasing the counter - when the protected resource is no longer needed - is done with a call to up(). Semaphores can be used in any situation where there is a need to put an upper limit on the number of processes which can be within a given critical section at any time. In practice, that upper limit is almost always set to one, resulting in semaphores which are used as a straightforward mutual exclusion primitive.

In current kernels, semaphores are implemented with highly-optimized, architecture-specific code. There are, in fact, more than twenty independent semaphore implementations in the kernel code base. Matthew's patch rips all of that out and replaces it with a single, generic implementation which works on all architectures. After the patch is applied, a semaphore looks like this:

    struct semaphore {
	spinlock_t		lock;
	int			count;
	struct list_head	wait_list;
    };

The implementation follows from this definition in a straightforward way: the spinlock is used to protect manipulations of count, while wait_list is used to put processes to sleep when they must wait for count to increase. The actual code, of course, is somewhat complicated by performance and interrupt-safety considerations, but it remains relatively short and simple.

One might ask: why weren't semaphores done this way in the first place? The answer is that, once upon a time (prior to 2.6.16), semaphores were one of the primary mutual exclusion mechanisms in the kernel. The 2.6.16 cycle brought in mutexes from the realtime tree, and most semaphore users were converted over. So semaphores, which were once a performance-critical primitive, are now much less so. As a result, any need there may have been for carefully hand-tuned, architecture-specific code is gone. So the code might as well go too.

The other question which comes up is: why are semaphores still being used at all? The number of semaphore users has dropped considerably since 2.6.16, but there are still a number of them in the kernel. Some of those could certainly be converted to mutexes, but doing so requires a careful audit of the code to be sure that the semaphore's counting feature is not being used. Once that work is done, it may turn out that, in some places, a semaphore is truly the right data structure. So semaphores are likely to remain - but they'll require rather less code than before.


The return of authoritative hooks

By Jonathan Corbet
March 18, 2008
The containers developers have what would seem to be a relatively straightforward problem: they would like to control access to devices on a per-container basis. Then containers could safely be granted access to specific devices without compromising the overall security of the system - even if a container has a root-capable process which can create new device files. Implementing this feature has been a longer journey than these developers had imagined, though, with the "device whitelist" feature being sent around to different kernel subsystems almost like one of those famous garbage barges from years past. A final resting place may have been found, though, and it may signal a change in how some security decisions are made in the kernel in the future.

The original version of the patch, posted by Pavel Emelyanov, set up a control group for the management of device accessibility within containers. The actual rules - and their enforcement - were stored deep within the device model subsystem. This drew an objection from Greg Kroah-Hartman, who suggested that, instead, this kind of access control should be done either with udev or with the Linux security module (LSM) subsystem. Udev does not give the desired degree of control and, apparently, can be problematic for those wanting to run older distributions within containers, so it was not seriously considered. The LSM suggestion was, after some resistance, taken to heart, though.

The result was the device whitelist LSM patch, posted by Serge Hallyn. It was a stacking security module which made changes to a number of hooks. This is where James Morris came in and suggested that, instead, the whitelist should just be added to the existing capabilities security module. Then there would be no need for a separate module and things could be generally simplified.

So Serge duly rolled out version 3 of the patch which moved the whitelist into the capabilities module. But this one ran into resistance as well. Quoting James Morris again:

Moving this logic into LSM means that instead of the cgroups security logic being called from one place in the main kernel (where cgroups lives), it must be called identically from each LSM (none of which are even aware of cgroups), which I think is pretty obviously the wrong solution.

Casey Schaufler also didn't like this idea:

When the next feature comes along are we going to stuff it into capabilities, too? Maybe we'll cram it into audit or CIPSO instead, but how long can this go on? Eventually we need a mechanism that allows more or less general mix-and-match, maybe with a few rules like "don't mix plaids and stripes" to keep things sane or these lesser facilities have no chance. Seems like we're still making LSM too hard to use

At this point, the complaint was clearly not with just the device whitelist, but with the capabilities module as well. It seems that capabilities are a bit of a poor fit with the LSM idea as a whole. The fact that they exist at all is a bit of a historical artifact; some developers wanted to see them implemented that way to show the flexibility of the LSM interface and to let capabilities be omitted from embedded setups. As it happens, it's still not possible to remove capabilities, and they impose a bit of a cost on all other security modules.

The core problem is this: LSM, fundamentally, is a restrictive mechanism. An LSM hook can deny an action, but it can never empower a process to do something it would not have been allowed to do in the absence of the security module. The decision to disallow "authoritative hooks" was made explicitly back in 2001 as a way of restricting the scope of LSM modules and, hopefully, ensuring that those modules would not themselves become security problems.

But capabilities are an inherently authoritative mechanism - a capability check verifies the existence of a special permission which would otherwise not be there. The device whitelist is the same sort of thing: it grants access which would otherwise be denied. So it fits poorly with the LSM model.

Serge came back with yet another patch which takes the whitelist code out of the LSM framework and, instead, inserts a separate set of hooks into the relevant places in the code. Those hooks sit right next to the LSM hooks, but operate in a permissive manner. So far, this approach seems to be passing muster, with no developers (yet) talking about booting it out into yet another subsystem.

Things may yet change, though. Casey Schaufler is now talking about the creation of a "Linux privilege module" framework for the management of all permissions checks. The normal discretionary access control checks could be moved there, as could all capability and "are they root?" logic. And, of course, the device whitelist code. Nobody has really spoken out against this idea - but, then, nobody has seen any code yet either. But, if things continue in this direction, authoritative hooks may have finally found a home, many years after having been rejected from the LSM mechanism.


A new suspend/hibernate infrastructure

By Jonathan Corbet
March 19, 2008
While attending conferences, your editor has, for some years, made a point of seeing just how many other attendees have some sort of suspend and resume functionality working on their laptops. There is, after all, obvious value in being able to sit down in a lecture hall, open the lid, and immediately start heckling the speaker via IRC without having to wait for the entire bootstrap sequence to unfold. But, regardless of whether one is talking about suspend-to-RAM ("suspend") or suspend-to-disk ("hibernation"), there are surprisingly few people using this capability. Despite the efforts which have been made by developers and distributors, suspend and hibernate still just do not work reliably for a lot of people.

For your editor, suspend always works, but the success rate of the resume operation is about 95% - just enough to keep using it while inspiring a fair amount of profanity in inopportune places.

Various approaches to fixing suspend and hibernation have been proposed; these include TuxOnIce and kexec jump. Another possibility, though, is to simply fix the code which is in the kernel now. There is a lot that has to be done to make that goal a reality, including making the whole process more robust and separating the suspend and hibernation cases which, as Linus has stated rather strongly several times, are really two different problems. To that end, Rafael Wysocki has posted a new suspend and hibernation infrastructure for devices which has the potential to improve the situation - but at a cost of creating no less than 20 separate device callbacks.

For the (relatively) simple suspend case, there are four basic callbacks which should be provided in the new pm_ops structure by each bus and, eventually, by every device:

    int (*prepare)(struct device *dev);
    int (*suspend)(struct device *dev);

    int (*resume)(struct device *dev);
    void (*complete)(struct device *dev);

When the system is suspending, each device will first see a call to its prepare() callback. This call can be seen as a sort of warning that the suspend is coming, and that any necessary preparation work should be done. This work includes preventing the addition of any new child devices and anything which might require the involvement of user space. Any significant memory allocations should also be done at this time; the system is still functional at this point and, if necessary, I/O can be performed to make memory available. What should not happen in prepare() is actually putting the device into a low-power state; it needs to remain functional and available.

As usual, a return value of zero indicates that the preparation was successful, while a negative error code indicates failure. In cases where the failure is temporary (a race with the addition of a new child device is one possibility), the callback should return -EAGAIN, which will cause a repeat attempt later in the process.

At a later point, suspend() will be called to actually power down the device. With the current patch, each device will see a prepare() call quickly followed by suspend(). Future versions are likely to change things so that all devices get a prepare() call before any of them are suspended; that way, even the last prepare() callback can count on the availability of a fully-functioning system.

The resume process calls resume() to wake the device up, restore it to its previous state, and generally make it ready to operate. Once the resume process is done, complete() is called to clean up anything left over from prepare(). A call to complete() could also be made directly after prepare() (without an intervening suspend) if the suspend process fails somewhere else in the system.

The hibernation process is more complicated, in that there are more intermediate states. In this case, too, the process begins with a call to prepare(). Then calls are made to:

    int (*freeze)(struct device *dev);
    int (*poweroff)(struct device *dev);

The freeze() callback happens before the hibernation image (the system image which is written to persistent store) is created; it should put the device into a quiescent state but leave it operational. Then, after the hibernation image has been saved and another call to prepare() made, poweroff() is called to shut things down.

When the system is powered back up, the process is reversed through calls to:

    int (*quiesce)(struct device *dev);
    int (*restore)(struct device *dev);

The call to quiesce() will happen early in the resume process, after the hibernation image has been loaded from disk, but before it has been used to recreate the pre-hibernation system's memory. This callback should quiet the device so that memory can be reassembled without being corrupted by device operations. A call to complete() will follow, then a call to restore(), which should put the device back into a fully-functional state. A final complete() call finishes the process.

There are still two more hibernation-related callbacks:

    int (*thaw)(struct device *dev);
    int (*recover)(struct device *dev);

These functions will be called when things go wrong; once again, each of these calls will be followed by a call to complete(). The purpose of thaw() is to undo the work done by freeze() or quiesce(); it should put the device back into a working state. The recover() call will be made if the creation of the hibernation image fails, or if restoring from that image fails; its job is to clean up and get the hardware back into an operating state.

For added fun, there are actually two sets of pm_ops callbacks. One is for normal system operation, but there is another set intended to be called when interrupts are disabled and only one CPU is operational - just before the system goes down or just after it comes back up. Clearly, interactions with devices will be different in such an environment, so different callbacks make sense. But the result is that fully 20 callbacks must be provided for full suspend and hibernate functionality. These callbacks have been added to the bus_type structure as:

    struct pm_ops *pm;
    struct pm_ops *pm_noirq;

Fields by the same name have also been added to the pci_driver structure, allowing each device driver to add its own version of these callbacks. For now, the old PCI driver suspend() and resume() callbacks will be used if the pm_ops structures have not been provided, and no drivers have been converted (at least in the patch as posted).

As of this writing, discussion of the patch is hampered by an outage at vger.kernel.org. There are some concerns, though, and things are likely to change in future revisions. Among other things, the number of "no IRQ" callbacks may be reduced. But, with luck, the final resolution will leave us all in a position where suspend and hibernate work reliably.


Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.25-rc6

Architecture-specific

Jeremy Fitzhardinge: x86: unification and xen updates
venkatesh.pallipadi@intel.com: x86: PAT support updated - v3

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds