Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.34-rc7, released on May 9. Linus says: "I think this is the last -rc - things have been pretty quiet on the patch front, although there's been some rather spirited discussions." The full changelog contains all the details.
According to the latest regression posting, there are 24 unresolved regressions in 2.6.34.
Stable updates: the 2.6.32.13 and 2.6.33.4 stable kernel updates were released on May 12. Both are large - on the order of 100 patches each - and fix a number of important problems.
Quotes of the week
Adaptive spinning futexes
As a general rule, a well-written program should, when it needs a resource currently owned by another program, step aside and allow other work to proceed until that resource becomes available. When it comes to low-level synchronization primitives, though, this rule does not always hold. Better overall system performance can often be achieved if a program busy-waits rather than sleeping. If the wait is short, the performance benefits that come from giving the resource to an already-running, cache-hot process outweigh the cost of the busy wait.

The best-supported (by the kernel) user-space synchronization primitive is the futex. Darren Hart has been working on a patch series intended to bring adaptive spinning to futexes in an attempt to improve the performance of multi-threaded applications. These patches, while still marked as "not ready for inclusion," have evolved considerably over time.
The core idea is simple: if a process attempts to acquire a futex which is already owned by another, it will spin in an acquisition loop until the holding process either releases the futex or is scheduled out. If all goes well, the new process will be able to grab the futex quickly and get on with its work in the most efficient way. In practice, adaptive spinning generally outperforms regular futexes, but only occasionally does better than the highly tweaked, assembly-coded adaptive spinning mutex code used by the pthreads library.
Adaptive spinning requires that the kernel know which process currently owns the futex; that is a problem, because the current futex operations do not provide that information. So a new locking operation is required in situations where adaptive spinning is to be used.
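The acquisition loop can be modeled in user space. This is a minimal sketch of the idea only - the type, the function names, and the owner_running flag (which, in the real patches, the kernel would derive from the owner's run state) are all hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an adaptively-spinning lock. */
struct adaptive_lock {
    atomic_int owner;            /* 0 == unlocked, else holder's TID */
    atomic_bool owner_running;   /* kernel-provided in the real thing */
};

/*
 * Spin trying to acquire the lock as long as the holder is on a CPU;
 * give up (so the caller can sleep in futex()) once the holder is
 * scheduled out or the spin budget is exhausted.
 */
static bool adaptive_trylock(struct adaptive_lock *l, int tid, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->owner, &expected, tid))
            return true;                     /* acquired */
        if (!atomic_load(&l->owner_running))
            break;                           /* holder asleep: stop spinning */
    }
    return false;                            /* caller falls back to sleeping */
}

static void adaptive_unlock(struct adaptive_lock *l)
{
    atomic_store(&l->owner, 0);
}
```

The point of the owner_running check is exactly the one made above: spinning only pays off while the holder is actually making progress on another CPU.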
There is an alternative approach which has been recommended by some developers: do the spinning in user space rather than in the kernel. User-space spinning might just be faster, but it's trickier, because it's harder for user space to know whether the current holder of a futex is executing or not. Providing the requisite information will require the design of a special (and fast) API - work which has not yet been done.
Uprobes returns - again
The Uprobes module is becoming one of the longer-lasting stories in the kernel development community. For a few years now, developers have been trying to get this code - which allows the placement of dynamic tracepoints into user-space programs - into the mainline. We last looked at Uprobes back in January; now, as the 2.6.35 merge window approaches, Uprobes is back for another round.

At this point, Uprobes has been entirely separated from the utrace layer, which is not a part of this patch series. Utrace is controversial in its own right and has not proved helpful in getting Uprobes merged. Other changes which have been made include the addition of interfaces to the tracing and perf events subsystems. That means that dynamic probes can be inserted from the command line, then watched using the Ftrace interface or aggregated with perf.
On the other hand, Uprobes retains the "execute out of line" (XOL) mechanism for the execution of instructions displaced by probes. XOL works, but it does so at the cost of injecting a new virtual memory area into the probed process; that is a larger disturbance than some developers would like to see. But the alternative - adding an emulator for those instructions to the kernel - is invasive in different ways.
Review comments so far have focused on relatively small details. That does not mean that Uprobes will be accepted when the merge window opens, but its chances do seem better than they have in the past.
Detecting idle patterns
The cpuidle subsystem is charged with putting the CPU into the optimal sleep state when there is nothing for it to do. One of the key inputs into this decision is the next scheduled timer event; that event puts an upper bound on how long the processor can expect to be able to sleep undisturbed. A more distant next timer event suggests that a deeper sleep state is appropriate.

But timer events are not the only way to wake up a processor; device interrupts will also do that. There are times when hardware can be expected to interrupt well before the next timer expiration, but those times can be hard for the processor to predict. There is seemingly an exception, though: sometimes hardware interrupts are so regular that they become a sort of timer tick in their own right. A moving mouse can generate that sort of pattern; network traffic can do it too. In such situations, the current cpuidle "menu" governor may repeatedly choose the wrong sleep state.
Arjan van de Ven has come to the rescue with a simple cpuidle patch which maintains an array of the last eight actual sleep periods. Whenever it is time to put the processor to sleep, the standard deviation of those sleep periods is calculated; if it is small, then the average sleep is considered to be a better guide to the expected sleep period than the next timer event.
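The heuristic can be sketched in a few lines of C. Everything here - the structure, the names, and the 20% threshold - is an illustration of the idea rather than code from the actual patch:

```c
#include <stdint.h>

#define INTERVALS 8

/* Illustrative only: a ring buffer of recent sleep lengths. */
struct idle_history {
    uint32_t us[INTERVALS];   /* the last eight sleep lengths, microseconds */
    int next;                 /* ring-buffer cursor */
};

static void record_sleep(struct idle_history *h, uint32_t microseconds)
{
    h->us[h->next] = microseconds;
    h->next = (h->next + 1) % INTERVALS;
}

/*
 * Predict the next sleep: if the recent sleeps cluster tightly (standard
 * deviation under 20% of the mean, an invented threshold), trust their
 * average; otherwise fall back to the next-timer estimate.
 */
static uint32_t predict_sleep(const struct idle_history *h,
                              uint32_t next_timer_us)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < INTERVALS; i++)
        mean += h->us[i];
    mean /= INTERVALS;

    for (int i = 0; i < INTERVALS; i++) {
        double d = h->us[i] - mean;
        var += d * d;
    }
    var /= INTERVALS;

    if (var < (0.2 * mean) * (0.2 * mean))   /* stddev < 20% of mean */
        return (uint32_t)mean;
    return next_timer_us;
}
```

A device interrupting every 100µs, for example, would produce a tight cluster and a prediction of 100µs even when the next timer is milliseconds away - so a shallow sleep state would be chosen.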
As machine learning goes, this code is a relatively simple example. But it should be smart enough to catch simple patterns and run the hardware in something closer to an optimal mode.
Kernel development news
The Next3 filesystem
The ext3 filesystem is tried and true, but it lacks a number of features deemed interesting by contemporary users. Snapshots - the ability to quickly capture the state of the filesystem at an arbitrary time - is at the top of many lists. It is currently possible to use the LVM snapshotting feature with ext3, but snapshots taken through LVM have some significant limitations. The Next3 filesystem offers an approach which might prove easier and more flexible: snapshots implemented directly in ext3.

Next3 was developed by CTERA Networks, which has started shipping it on its C200 network-attached storage device. This code has also been posted on SourceForge and proposed for merging into the mainline kernel. The Next3 filesystem adds a simple snapshot feature to ext3 in ways which are (mostly) compatible with the existing on-disk format. It looks like a useful feature, but its path into the mainline looks to be longer than its implementers might have hoped.
The Next3 filesystem is a new filesystem type - it's not just an addition to ext3. At its core, it works by creating special, magic files, each of which represents a snapshot of the filesystem. These files have the same apparent size as the storage volume as a whole, but they are sparse files, so they take almost no space at the outset. When a change is made to a block on disk, the filesystem must first check to see whether that block has been saved in the most recent snapshot already. If not, the affected block is moved over to the snapshot file, and a new block is allocated to replace it. Thus, over time, disk blocks migrate to the snapshot file as they are rewritten with new contents.
Gaining read-only access to a snapshot is a simple matter of doing a loopback mount of the snapshot file as an ext2 filesystem. The snapshot file is sufficiently magic that any attempts to read blocks in the holes (which represent blocks that have not been changed since the snapshot was taken) will be satisfied from a later snapshot - which will have captured the contents of that block when it was eventually changed - or from the underlying storage device. Deleting a snapshot requires moving changed blocks into the previous snapshot, if it exists, because the deleted snapshot holds blocks which are logically part of the earlier snapshots.
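The move-on-write and read-through behavior can be modeled with a toy in-memory "filesystem." The names here are invented, and real Next3 of course operates on disk blocks and sparse-file holes rather than arrays:

```c
/* Toy model of the Next3-style scheme described above. */
#define NBLOCKS 16
#define HOLE    (-1)   /* "hole": block unchanged since the snapshot */

struct toy_fs {
    int data[NBLOCKS];       /* current contents of each logical block */
    int snapshot[NBLOCKS];   /* snapshot file: HOLE or preserved contents */
};

/*
 * Write to a block: preserve the old contents in the snapshot file
 * first, unless that block has already been captured there.
 */
static void toy_write(struct toy_fs *fs, int blk, int value)
{
    if (fs->snapshot[blk] == HOLE)          /* not yet in the snapshot */
        fs->snapshot[blk] = fs->data[blk];  /* "move" the old block over */
    fs->data[blk] = value;
}

/*
 * Read through the snapshot: holes fall through to the live data -
 * the behavior a loopback mount of the snapshot file relies on.
 */
static int toy_snapshot_read(const struct toy_fs *fs, int blk)
{
    return fs->snapshot[blk] != HOLE ? fs->snapshot[blk] : fs->data[blk];
}
```

A reader of the snapshot always sees the filesystem as it was at snapshot time, while the live filesystem moves on underneath it.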
The changes to the ext3 on-disk format are minimal, to the point that a Next3 filesystem can be mounted by the ordinary ext3 code. If snapshots exist, though, ext3 cannot be allowed to modify the filesystem, lest the changed blocks fail to be saved in the snapshot. So, when snapshots exist on the filesystem, it will be marked with a feature flag which forces ext3 to mount the filesystem readonly.
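This is the standard ext2/3 "read-only compatible" feature mechanism at work; the rule can be sketched as below. The mask and flag values are invented for illustration - the real ones live in the filesystem headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; the real feature masks differ from these. */
#define EXT3_KNOWN_RO_COMPAT              0x0007
#define NEXT3_FEATURE_RO_COMPAT_SNAPSHOT  0x0100

/*
 * An ext2/3-style mount decision: a filesystem carrying any RO_COMPAT
 * feature bit the implementation does not understand may still be
 * mounted, but only read-only.
 */
static bool can_mount_read_write(uint32_t ro_compat)
{
    return (ro_compat & ~(uint32_t)EXT3_KNOWN_RO_COMPAT) == 0;
}
```

So an ext3 driver that has never heard of Next3 snapshots will still mount the filesystem - it simply refuses to write to it, which is exactly the safety property Next3 needs.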
On the performance side, the news is said to be mostly good. Writes will take a little longer due to the need to move the old block to a snapshot file. The worst performance impact is seemingly on truncate operations; these may have to save a large number of blocks and can get a lot slower. It is also worth noting that the moving of modified blocks to the snapshot file will, over time, wreck the nice, contiguous on-disk format that ext3 tries so hard to create, with an unfortunate effect on streaming read performance. Files which must not be fragmented can be marked with a special flag which will cause blocks to be copied into the snapshot file rather than moved; that will slow writes further, but will keep the file contiguous on disk.
Next3 developer Amir Goldstein requested relatively quick review of the patches because he is trying to finalize some of the on-disk formatting. The answer he got from Ted Ts'o was probably not quite what he was looking for:
Amir's response was that, while porting the patches to ext4 is on the "we'll get around to it someday" list, that port is not an easy thing to do. The biggest problem, apparently, is making the movement of blocks into the snapshot file work properly with ext4's extent-oriented format. Beyond that, Amir says, he's not actually trying to get the changes into ext3 - he wants to merge a separate filesystem called Next3 which happens to be mostly compatible with ext3.
The "separate Next3" approach is unlikely to fly very far, though. As Ted put it, ext2, ext3, and ext4 are really just different implementations of the same basic filesystem format; this format has never really been forked. Next3, as a separate filesystem, would be a fork of the format. The fact that Next3 has taken over some data structure fields which are put to different purposes in ext4 has not helped matters:
The answer appears fairly clear: patches adding the snapshot feature might be welcome, but not as a fork of the ext3 filesystem. At a bare minimum, the filesystem format will have to be changed to avoid conflicts with ext4, but the real solution appears to be simply implementing the patches on top of ext4 instead of ext3. That is a fair amount of extra work which might have been avoided had the Next3 developers talked with the community prior to starting to code.
Moving x86 to LMB
The early days of the 2.6.34 development cycle were made more difficult for some testers by difficulties in the NO_BOOTMEM patches which came in during the merge window. The kinks in that code were eventually ironed out, but things might just get interesting again in 2.6.35 - Yinghai Lu is back with another set of patches which continues the process of completely reworking how early memory allocation is done on the x86 architecture. The potential for trouble with this kind of work is always there, but the end result does indeed seem worth aiming for.

Some review: in a running kernel, memory management is handled by the buddy allocator (at the page level), with the slab allocator on top. These allocators are complex pieces of code which cannot run in the absence of a mostly functional kernel, so they cannot be used in the early stages of the bootstrap process. What is used, instead, is an architecture-specific chain of simple allocators. For x86, things start with a brk()-like mechanism which yields to the "e820" early reservation code, which, in turn, gives way to the bootmem allocator. Once the bootstrap has gotten far enough, the slab allocator can take over from the bootmem code. Yinghai's 2.6.34 changes were meant to short out the bootmem stage, allowing the system to use the early reservation code until slab can run.
During the review process for that code, some reviewers asked why x86 did not use the "logical memory block" (LMB) allocator instead of its own early reservation code. LMB is currently used by the Microblaze, PowerPC, SuperH, and SPARC architectures, so it has the look of a generic solution. There are obvious advantages to using generic code over architecture-specific variants; there are more eyes to look at the code and the overall maintenance cost is reduced. So the idea of moving to LMB made obvious sense.
LMB is, as might be expected, a truly simplistic memory manager. Low-level architecture code gives it blocks of memory to manage as it discovers them with:
long lmb_add(u64 base, u64 size);
The LMB allocator will duly store that region into a fixed-length array of known memory blocks, coalescing it with existing blocks if need be. Memory may then be allocated with:
u64 lmb_alloc(u64 size, u64 align);
Allocated blocks are tracked in a second array which looks just like the first; an allocation is satisfied by iterating through the available blocks, trying to find a sufficiently large chunk which is not already reserved by somebody else. There are other functions for reserving specific regions of memory, allocating memory on specific NUMA nodes, etc. But, at its core, LMB is a simple allocator which is meant to do a good-enough job until something more sophisticated can take over.
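A stripped-down model of those two calls might look like the following. It is a sketch of the behavior described above, not the kernel's implementation: real LMB also coalesces adjacent regions, supports freeing and specific-region reservation, and handles error cases more carefully. Alignment is assumed to be a power of two.

```c
#include <stdint.h>

typedef uint64_t u64;

#define MAX_REGIONS 32

struct lmb_region { u64 base, size; };

static struct lmb_region memory[MAX_REGIONS];   /* known memory blocks */
static struct lmb_region reserved[MAX_REGIONS]; /* satisfied allocations */
static int n_memory, n_reserved;

/* Register a block of memory as available (no coalescing here). */
long lmb_add(u64 base, u64 size)
{
    if (n_memory >= MAX_REGIONS)
        return -1;
    memory[n_memory].base = base;
    memory[n_memory].size = size;
    n_memory++;
    return 0;
}

static int overlaps_reserved(u64 base, u64 size)
{
    for (int i = 0; i < n_reserved; i++)
        if (base < reserved[i].base + reserved[i].size &&
            reserved[i].base < base + size)
            return 1;
    return 0;
}

/* Scan the known blocks for an aligned chunk nobody has reserved yet. */
u64 lmb_alloc(u64 size, u64 align)
{
    for (int i = 0; i < n_memory; i++) {
        u64 end = memory[i].base + memory[i].size;
        for (u64 p = (memory[i].base + align - 1) & ~(align - 1);
             p + size <= end; p += align) {
            if (!overlaps_reserved(p, size)) {
                reserved[n_reserved].base = p;
                reserved[n_reserved].size = size;
                n_reserved++;
                return p;
            }
        }
    }
    return 0;   /* allocation failure */
}
```

The linear scans are exactly why LMB is only suitable for early boot: with a handful of regions and a handful of allocations, simplicity wins over speed.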
Yinghai's patch set makes a number of changes to the LMB code itself, starting with a move from the lib directory over to mm with the rest of the memory-management code. Some new functions are added to match the different semantics supported by the early reservation code, which works in a two-step, "find a memory block, then reserve it" mode. There is also a new function to transfer LMB reservations into the bootmem allocator for configurations where bootmem is still in use. The 22-part series culminates with a switch to LMB calls for early allocations and the removal of the now-unused early reservation code.
There has been surprisingly little discussion for a patch series which makes such fundamental changes. It seems that most kernel developers pay relatively little attention to what happens at the architecture-specific levels. One exception is Ben Herrenschmidt, who keeps an eye on LMB from the PowerPC perspective. Ben disagrees with a number of the LMB-level changes, feeling that they complicate the API and potentially introduce problems. Instead, it looks like Ben would like to fix up the LMB code himself, letting Yinghai work on the x86-specific side of things.
To that end, Ben has posted a patch series of his own, saying:
Some of the changes simply clean up the LMB code, adding, for example, a for_each_lmb() macro for iterating through the array of memory blocks. The fixed-length arrays are made variable, phys_addr_t is used to represent physical addresses, and the code is substantially reorganized. There is much that Ben still plans to do, including, happily, the addition of actual documentation to the API, but even without all that, it's a significant cleanup for the LMB code.
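An iteration macro in that style might look something like the following sketch; the structure layout and the macro itself are guesses for illustration, and Ben's actual code may well differ:

```c
struct lmb_region {
    unsigned long base;
    unsigned long size;
};

struct lmb_type {
    unsigned long cnt;            /* number of regions in use */
    struct lmb_region *regions;   /* now a variable-sized array */
};

/* Hypothetical for_each_lmb()-style iterator over a region array. */
#define for_each_lmb(type, reg)                             \
    for ((reg) = (type)->regions;                           \
         (reg) < (type)->regions + (type)->cnt;             \
         (reg)++)

/* Example use: sum the sizes of all known memory blocks. */
static unsigned long lmb_total_size(struct lmb_type *t)
{
    struct lmb_region *r;
    unsigned long total = 0;

    for_each_lmb(t, r)
        total += r->size;
    return total;
}
```

Macros like this hide the array-walking details from callers, which is what makes later changes - such as replacing the fixed-length arrays - much less disruptive.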
As with Yinghai's patches, there has been little in the way of discussion. It may be that these changes will remain below the radar while the two patch sets are integrated and - maybe - merged for 2.6.35. With luck, they'll remain below the radar thereafter as well, with few people even noticing the difference.
MeeGo and Btrfs
MeeGo is arguably the dark horse in the mobile platform race: it is new, unfinished, and unavailable on any currently-shipping product, but it is going after the same market as a number of more established platforms. MeeGo is interesting: it is a combined effort by two strong industry players which are trying, in the usual slow manner, to build a truly community-oriented development process. For the time being, though, important development decisions are still being made centrally. Recently, a significant decision has come to light: MeeGo will be based on the Btrfs filesystem by default.

Btrfs is seen as the long-term future of Linux filesystems, representing a much-needed clean break from the legacy filesystem designs we have been using for all these years. With the demise of reiser4 and the unavailability of ZFS, Btrfs would seem to be the only contender for that title. But talk about Btrfs is always framed in "it's not stable yet" terms, with few people willing to commit themselves to an actual date when the filesystem might be ready for production use. It is generally assumed that most cautious users will spend some years running on ext4 before making the jump to Btrfs. The 2.6.34 kernel will be released with this text still guarding the Btrfs configuration entry:
The MeeGo 1.0 release could happen as early as this month; given that, the above words might just seem a bit scary. In fact, they are more scary than they need to be: further on-disk format changes are not expected. The warning, it seems, will be scaled down for 2.6.35.
So why pick Btrfs for MeeGo? Arjan van de Ven described the decision this way:
He went on to describe a number of reasons why Btrfs makes sense for the MeeGo platform, starting with its data integrity features. The copy-on-write design which is at the core of Btrfs has a number of nice attributes, one of which is that users should never, ever see garbage data in files, even in a "pulled out the battery at the worst moment" situation. Device manufacturers, understandably, like that idea.
The on-disk compression feature is interesting for the MeeGo environment as well. It makes the initial system load take less space, making more available for the users of the device. But, as Arjan points out, manufacturers like it too: a smaller system image takes less time to shovel onto the storage device.
It would appear that there are a number of plans for the use of the Btrfs snapshot feature, starting with reversible package updates. With snapshots, a device can support a multi-user mode where each user appears to have the entire system to him- or herself. And the "reset to factory defaults" operation becomes simple, with no need for a separate recovery partition on the disk. Snapshots are not just for enterprise users anymore.
There are a number of other advantages, including small-file performance, built-in defragmentation (which is most useful for keeping boot time short), the storage management features, and more. In short, there's no doubt that Btrfs offers a useful set of features for any distribution; it's not hard to see why MeeGo wanted to use it. But that does leave an interesting open question: is Btrfs ready for inclusion into MeeGo, where it will, presumably, be installed onto systems intended for users who aren't looking to become development-stage filesystem testers?
Btrfs was initially merged for the 2.6.29 kernel; since then, patch activity looks like this:
So there is a steady rate of change to the filesystem, significant but not overwhelming. There is a wide range of contributors to this code, though the bulk of the work (by far) has been done by developers from Oracle and Red Hat. There are certainly people using Btrfs in normal use, and Fedora offers it as an experimental option. The mailing list shows a number of oops reports still, and it would appear that the famous ENOSPC issue (where the filesystem reacts poorly when the storage device overflows) is still not entirely solved. Significant feature patches - direct I/O support and RAID 4/5 support, for example - remain unmerged. In summary: Btrfs does not quite have that "it's done" look to it yet.
That said, it may well be getting close to ready for the sort of restricted and well-tested environment likely to be found in MeeGo deployments. Btrfs will also have stabilized further by the time devices actually start shipping with MeeGo - helped, no doubt, by the work of the MeeGo developers themselves. So, while this decision may appear to be ambitious now, it is not necessarily unreasonable. A dark-horse platform can only be helped by taking advantage of the best technology available to it.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet