Brief items
The current development kernel is 2.6.34-rc7, released on May 9. Linus
says: "I think this is the last -rc - things have been pretty quiet
on the patch front, although there's been some rather spirited
discussions." The full changelog contains all the details.
According to the latest regression posting, there are 24 unresolved
regressions in 2.6.34.
Stable updates: the 2.6.32.13 and 2.6.33.4 stable kernel updates were released
on May 12. Both
are large - on the order of 100 patches each - and fix a number of
important problems.
As a side effect, this patch removes the time-travel feature in kvm
guests.
--
Glauber Costa
But at the point where you're adding code to every driver's suspend
function to determine whether or not it's got any pending events
that userspace hasn't consumed yet, and adding code to every bit of
userspace to allow it to indicate whether or not it's busy
consuming events or just busy drawing 3D bouncing cattle, I think
you've reinvented suspend blocks.
--
Matthew Garrett
By Jonathan Corbet
May 11, 2010
As a general rule, a well-written program should, when it needs a resource
currently owned by another program, step aside and allow other work to
proceed until that resource becomes available. When it comes to low-level
synchronization primitives, though, this rule does not always hold. Better
overall system performance can often be achieved if a program
busy-waits rather than sleeping. If the wait is short, the performance
benefits that come from giving the resource to an already-running,
cache-hot process outweigh the cost of the busy wait.
The best-supported (by the kernel) user-space synchronization primitive is
the futex. Darren Hart has been working on a patch series intended to bring
adaptive spinning to futexes in an attempt to improve the performance of
multi-threaded applications. These patches, while still marked as "not
ready for inclusion," have evolved considerably over time.
The core idea is simple: if a process attempts to acquire a futex which is
already owned by another, it will spin in an acquisition loop until the
holding process either releases the futex or is scheduled out. If all goes
well, the new process will be able to grab the futex quickly and get on
with its work in the most efficient way. In practice, adaptive spinning
generally outperforms regular futexes, but only occasionally does better
than the highly tweaked, assembly-coded adaptive spinning mutex code used
by the pthreads library.
Adaptive spinning requires that the kernel know which process currently
owns the futex; that is a minor problem because the current futex
operations do not provide that information. So a new locking operation is
required in situations where adaptive spinning is to be used.
There is an alternative approach which has been recommended by some
developers: do the spinning in user space rather than in the kernel.
User-space spinning might just be faster, but it's trickier, because it's
harder for user space to know whether the current holder of a futex is
executing or not. Providing the requisite information will require the
design of a special (and fast) API - work which has not yet been done.
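The user-space alternative can be pictured with a toy lock. What follows is a minimal sketch in C11, not Darren Hart's patch and not the pthreads implementation. Since user space has no cheap way to tell whether the lock holder is running, the sketch substitutes a bounded spin count for the "owner has been scheduled out" test, and it calls sched_yield() at the point where a real lock would go to sleep in the kernel. The names (stl_lock, stl_acquire, SPIN_LIMIT) and the limit value are all assumptions made for illustration.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 1000        /* assumed tuning value, not from the patches */

struct stl_lock {
    atomic_int held;           /* 0 = free, 1 = held */
};

/* One acquisition attempt; returns true if the lock was taken. */
bool stl_try_lock(struct stl_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

/*
 * Spin for up to SPIN_LIMIT failed attempts, then yield the CPU between
 * attempts. Returns the number of busy-wait iterations that were used.
 */
unsigned long stl_acquire(struct stl_lock *l)
{
    unsigned long spins = 0;

    for (;;) {
        if (stl_try_lock(l))
            return spins;
        if (spins < SPIN_LIMIT)
            spins++;           /* busy-wait phase: hope the holder is quick */
        else
            sched_yield();     /* stand-in for sleeping in the kernel */
    }
}

void stl_unlock(struct stl_lock *l)
{
    int was = atomic_exchange(&l->held, 0);
    assert(was == 1);          /* unlocking a free lock is a caller bug */
    (void)was;
}
```

An uncontended stl_acquire() returns zero, having never spun. The hard part described above, knowing when spinning has stopped being worthwhile, is exactly what the fixed SPIN_LIMIT papers over.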
By Jonathan Corbet
May 11, 2010
The Uprobes module is becoming one of the longer-lasting stories in the
kernel development community. For a few years now, developers have been
trying to get this code - which allows the placement of dynamic tracepoints
into user-space programs - into the mainline. We last looked at Uprobes
back in January; now, as the 2.6.35 merge window approaches, Uprobes is
back for another round.
At this point, Uprobes has been entirely separated from the utrace layer,
which is not a part of this patch series.
Utrace is controversial in its own right and has not proved helpful in
getting Uprobes merged. Other changes which have been made include the
addition of interfaces to the tracing and perf events subsystems. That
means that dynamic probes can be inserted from the command line, then
watched using the Ftrace interface or aggregated with perf.
On the other hand, Uprobes retains the "execute out of line" mechanism for
the execution of instructions displaced by probes. XOL works, but it does
so at the cost of injecting a new virtual memory area into the probed
process; that is a larger disturbance than some developers would like to
see. But the alternative - adding an emulator for those instructions to
the kernel - is invasive in different ways.
Review comments so far have focused on relatively small details. That does
not mean that Uprobes will be accepted when the merge window opens, but its
chances do seem better than they have in the past.
By Jonathan Corbet
May 11, 2010
The cpuidle subsystem is
charged with putting the CPU into the optimal sleep state when there is
nothing for it to do. One of the key inputs into this decision is the next
scheduled timer event; that event puts an upper bound on how long the
processor can expect to be able to sleep undisturbed. A more distant next
timer event suggests that a deeper sleep state is appropriate.
But timer events are not the only way to wake up a processor; device
interrupts will also do that. There are times when hardware can be
expected to interrupt well before the next timer expiration, but those
times can be hard for the processor to predict. There is seemingly an
exception, though: sometimes hardware interrupts are so regular that they
become a sort of timer tick in their own right. A moving mouse can
generate that sort of pattern; network traffic can do it too. In such
situations, the current cpuidle "menu" governor may repeatedly choose the
wrong sleep state.
Arjan van de Ven has come to the rescue with a simple cpuidle patch which
maintains an array of the last eight actual sleep periods. Whenever it is
time to put the processor to sleep, the standard deviation of those sleep
periods is calculated; if it is small, then the average sleep is considered
to be a better guide to the expected sleep period than the next timer
event.
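The heuristic is easy to sketch. The following C fragment is an illustration of the idea as described here, not the code from Arjan's patch: the function names, the microsecond units, and the caller-supplied threshold for what counts as a "small" standard deviation are all assumptions. The clamp to the next timer event is also an assumption; it follows from the timer being an upper bound on the sleep time.

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLES 8   /* the patch is described as keeping eight sleep periods */

/* Integer square root (floor), computed bit by bit. */
uint64_t isqrt_u64(uint64_t x)
{
    uint64_t r = 0, bit = (uint64_t)1 << 62;

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/*
 * Predict the coming sleep period. If the recent sleeps were regular
 * (standard deviation at or below max_stddev_us), trust their average,
 * clamped to the next timer event; otherwise fall back to the timer.
 */
uint32_t predict_sleep_us(const uint32_t last[SAMPLES],
                          uint32_t next_timer_us, uint32_t max_stddev_us)
{
    uint64_t sum = 0, variance = 0, avg, stddev;

    assert(last != 0);
    for (int i = 0; i < SAMPLES; i++)
        sum += last[i];
    avg = sum / SAMPLES;

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t d = last[i] > avg ? last[i] - avg : avg - last[i];
        variance += d * d / SAMPLES;   /* divide per term: no overflow */
    }
    stddev = isqrt_u64(variance);

    if (stddev <= max_stddev_us)
        return avg < next_timer_us ? (uint32_t)avg : next_timer_us;
    return next_timer_us;
}
```

With eight samples clustered around 1000us and a timer 20ms away, the function returns the 1000us average; with wildly varying samples it returns the timer value unchanged.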
As machine learning goes, this code is a relatively simple example. But it
should be smart enough to catch simple patterns and run the hardware in
something closer to an optimal mode.
Kernel development news
By Jonathan Corbet
May 11, 2010
The ext3 filesystem is tried and true, but it lacks a number of features
deemed interesting by contemporary users. Snapshot support - the ability to
quickly capture the state of the filesystem at an arbitrary time - is at
the top of many lists. It is currently possible to use the LVM
snapshotting feature with ext3, but snapshots taken through LVM have some
significant limitations. The Next3 filesystem offers an approach which
might prove easier and more flexible: snapshots implemented directly in
ext3.
Next3 was developed by CTERA Networks, which has started shipping it on its C200 network-attached
storage device. This code has also been posted on SourceForge and
proposed for merging into the mainline kernel. The Next3 filesystem adds a
simple snapshot feature to ext3 in ways which are (mostly) compatible with
the existing on-disk format. It looks like a useful feature, but its path
into the mainline looks to be longer than its implementers might have
hoped.
The Next3 filesystem is a new filesystem type - it's not just an addition
to ext3. At its core, it works by creating a special, magic file to
represent each snapshot of the filesystem. Each snapshot file has the same
apparent size as the storage volume as a whole, but it is a sparse file, so
it takes almost no space at the outset. When a change is made to a block on
disk, the filesystem must first check to see whether that block has been
saved in the most recent snapshot already. If not, the affected block is
moved over to the snapshot file, and a new block is allocated to replace
it. Thus, over time, disk blocks migrate to the snapshot file as they are
rewritten with new contents.
Gaining read-only access to a snapshot is a simple matter of doing a
loopback mount of the snapshot file as an ext2 filesystem. The snapshot
file is sufficiently magic that any attempts to read blocks in the holes
(which represent blocks that have not been changed since the snapshot was
taken) will be satisfied from a later snapshot - which will have captured
the contents of that block when it was eventually changed - or from
the underlying storage device. Deleting a snapshot requires moving changed
blocks into the previous snapshot, if it exists, because the deleted
snapshot holds blocks which are logically part of the earlier snapshots.
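The write and read paths described above can be modeled in a few lines of C. This is a toy under loud assumptions: "blocks" are single integers, snapshots live in a small in-memory array, and saving a block is a plain copy, so the distinction between moving and copying a block on disk does not appear at all. None of the names below come from the Next3 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS   8
#define MAX_SNAPS 4

struct toy_snapshot {
    bool saved[NBLOCKS];   /* false = hole: block unchanged since snapshot */
    int  data[NBLOCKS];    /* saved contents, valid where saved[] is true */
};

struct toy_fs {
    int live[NBLOCKS];                    /* the "storage device" */
    struct toy_snapshot snaps[MAX_SNAPS]; /* oldest first */
    int nsnaps;
};

void toy_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof(*fs));
}

/* A new snapshot starts out as all holes, so it costs (almost) nothing. */
int toy_take_snapshot(struct toy_fs *fs)
{
    assert(fs->nsnaps < MAX_SNAPS);
    memset(&fs->snaps[fs->nsnaps], 0, sizeof(fs->snaps[0]));
    return fs->nsnaps++;
}

/* Before overwriting a block, save the old contents in the newest snapshot. */
void toy_write(struct toy_fs *fs, int blk, int value)
{
    assert(blk >= 0 && blk < NBLOCKS);
    if (fs->nsnaps > 0) {
        struct toy_snapshot *latest = &fs->snaps[fs->nsnaps - 1];
        if (!latest->saved[blk]) {
            latest->data[blk] = fs->live[blk];
            latest->saved[blk] = true;
        }
    }
    fs->live[blk] = value;
}

/* A hole is satisfied from a later snapshot, or finally from the device. */
int toy_snapshot_read(const struct toy_fs *fs, int snap, int blk)
{
    assert(snap >= 0 && snap < fs->nsnaps);
    assert(blk >= 0 && blk < NBLOCKS);
    for (int s = snap; s < fs->nsnaps; s++)
        if (fs->snaps[s].saved[blk])
            return fs->snaps[s].data[blk];
    return fs->live[blk];
}
```

The read loop also shows why deletion is not free: an older snapshot's holes may be relying on blocks saved in a newer one, so those blocks have to be handed back before the newer snapshot can go away.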
The changes to the ext3 on-disk format are minimal, to the point that a Next3
filesystem can be mounted by the ordinary ext3 code. If snapshots exist,
though, ext3 cannot be allowed to modify the filesystem, lest the changed
blocks fail to be saved in the snapshot. So, when snapshots exist on the
filesystem, it will be marked with a feature flag which forces ext3 to
mount the filesystem readonly.
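The flag mechanism relies on the "read-only compatible" feature bits which the ext2/3 family keeps in its superblock: an implementation which finds a bit it does not understand may still read the filesystem, but must not write to it. A hedged sketch of that check follows; the flag values and names are hypothetical stand-ins, not the real ext3 or Next3 definitions, and the refusal of a read-write mount reflects how the check is generally understood to behave rather than anything stated in the Next3 posting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read-only-compatible feature bits. */
#define TOY_RO_COMPAT_SPARSE_SUPER  0x0001u
#define TOY_RO_COMPAT_HAS_SNAPSHOT  0x0080u   /* made-up value */

/* What each toy implementation claims to understand. */
#define TOY_EXT3_SUPPORTED   (TOY_RO_COMPAT_SPARSE_SUPER)
#define TOY_NEXT3_SUPPORTED  (TOY_EXT3_SUPPORTED | TOY_RO_COMPAT_HAS_SNAPSHOT)

static_assert((TOY_EXT3_SUPPORTED & TOY_RO_COMPAT_HAS_SNAPSHOT) == 0,
              "plain ext3 must not understand the snapshot flag");

enum toy_mount_result { TOY_MOUNT_RW, TOY_MOUNT_RO, TOY_MOUNT_REFUSED };

/*
 * Unknown ro_compat bits are harmless for a read-only mount, but a
 * read-write mount has to be refused.
 */
enum toy_mount_result toy_mount(uint32_t ro_compat, uint32_t supported,
                                bool want_rw)
{
    uint32_t unknown = ro_compat & ~supported;

    if (!want_rw)
        return TOY_MOUNT_RO;
    return unknown ? TOY_MOUNT_REFUSED : TOY_MOUNT_RW;
}
```

In this model a filesystem carrying the snapshot bit can be mounted read-write only by the implementation which knows how to preserve snapshots; everybody else gets read-only access or nothing.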
On the performance side, the news is said to be mostly good. Writes will
take a little longer due to the need to move the old block to a snapshot
file. The worst performance impact is seemingly on truncate operations;
these may have to save a large number of blocks and can get a lot slower.
It is also worth noting that the moving of modified blocks to the snapshot
file will, over time, wreck the nice, contiguous on-disk format that ext3
tries so hard to create, with an unfortunate effect on streaming read
performance. Files which must not be fragmented can be marked with a
special flag which will cause blocks to be copied into the snapshot file
rather than moved; that will slow writes further, but will keep the file
contiguous on disk.
Next3 developer Amir Goldstein requested relatively quick review of the
patches because he is trying to finalize some of the on-disk formatting.
The answer he got from Ted Ts'o was
probably not quite what he was looking for:
Ext4 is where new development takes place in the ext2/3/4 series.
So enhancements such as Next3 will probably not be received with
great welcome into ext3.
Amir's response was that, while porting the patches to ext4 is on the
"we'll get around to it someday" list, that port is not an easy thing to
do. The biggest problem, apparently, is making the movement of blocks into
the snapshot file work properly with ext4's extent-oriented format. Beyond
that, Amir says, he's not actually trying to get the changes into ext3 - he
wants to merge a separate filesystem called Next3 which happens to be
mostly compatible with ext3.
The "separate Next3" approach is unlikely to fly very far, though. As Ted
put it, ext2, ext3, and ext4 are really
just different implementations of the same basic filesystem format; this
format has never really been forked. Next3, as a separate filesystem,
would be a fork of the format. The fact that Next3 has taken over some
data structure fields which are used to different purpose in ext4 has not
helped matters:
The "ext" in ext2 stands for "extended", as in the "the second
extended file system" for Linux. It perhaps would be better if we
had used the term "extensible", since that's the main thing about
ext2/3/4 that has given it so much staying power. We've been able
to add, in very carefully backwards and forwards compatible way,
new features to the file system format. This is why I object to
why Next3 uses some fields that overlaps with ext4. It means that
e2fsprogs, which supports _one_ and _only_ _one_ file system
format, will now need to support two file system formats. And
that's not something I want to do.
The answer appears fairly clear: patches adding the snapshot feature might
be welcome, but not as a fork of the ext3 filesystem. At a bare minimum,
the filesystem format will have to be changed to avoid conflicts with ext4,
but the real solution appears to be simply implementing the patches on top
of ext4 instead of ext3. That is a fair amount of extra work which might
have been avoided had the Next3 developers talked with the community prior
to starting to code.
By Jonathan Corbet
May 11, 2010
The early days of the 2.6.34 development cycle were made more difficult for
some testers by problems in the NO_BOOTMEM patches which came in
during the merge window. The kinks in that code were eventually ironed
out, but things might just get interesting again in 2.6.35 - Yinghai Lu is
back with another set of patches which continues the process of completely
reworking how early memory allocation is done on the x86 architecture. The
potential for
trouble with this kind of work is always there, but the end result does
indeed seem worth aiming for.
Some review: in a running kernel, memory management is handled by the buddy
allocator (at the page level), with the slab allocator on top. These
allocators are complex pieces of code which cannot run in the absence of a
mostly functional kernel, so they cannot be used in the early stages of the
bootstrap process. What is used, instead, is an architecture-specific
chain of simple allocators. For x86, things start with a
brk()-like mechanism which yields to the "e820" early reservation
code, which, in turn, gives way to the bootmem allocator. Once the
bootstrap has gotten far enough, the slab allocator can take over from the
bootmem code. Yinghai's 2.6.34 changes were meant to short out the bootmem
stage, allowing the system to use the early reservation code until slab can
run.
During the review process for that code, some reviewers asked why
x86 did not use the "logical memory block" (LMB) allocator instead of its
own early reservation code. LMB is currently used by the Microblaze,
PowerPC, SuperH, and SPARC architectures, so it has the look of a generic
solution. There are obvious advantages to using generic code over
architecture-specific variants; there are more eyes to look at the code and
the overall maintenance cost is reduced. So the idea of moving to LMB made
obvious sense.
LMB is, as might be expected, a truly simplistic memory manager. Low-level
architecture code gives it blocks of memory to manage as it discovers them
with:
long lmb_add(u64 base, u64 size);
The LMB allocator will duly store that region into a fixed-length array of
known memory blocks, coalescing it with existing blocks if need be. Memory
may then be allocated with:
u64 lmb_alloc(u64 size, u64 align);
Allocated blocks are tracked in a second array which looks just like the
first; an allocation is satisfied by iterating through the available
blocks, trying to find a sufficiently large chunk which is not already
reserved by somebody else. There are other functions for reserving
specific regions of memory, allocating memory on specific NUMA nodes, etc.
But, at its core, LMB is a simple allocator which is meant to do a
good-enough job until something more sophisticated can take over.
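The two-array design is simple enough to model in user space. The sketch below is a toy, not the kernel's LMB code: the toy_ names, the array size, the bottom-up search order, and the use of zero as a failure return are all assumptions, and the coalescing handles only exactly adjacent regions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 16

struct toy_region { uint64_t base, size; };

struct toy_lmb {
    struct toy_region memory[TOY_MAX_REGIONS];    /* known memory blocks */
    int nmem;
    struct toy_region reserved[TOY_MAX_REGIONS];  /* allocated blocks */
    int nres;
};

static uint64_t align_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Record a block of memory, merging it with an adjacent block if any. */
int toy_lmb_add(struct toy_lmb *l, uint64_t base, uint64_t size)
{
    for (int i = 0; i < l->nmem; i++) {
        struct toy_region *r = &l->memory[i];
        if (r->base + r->size == base) {
            r->size += size;
            return 0;
        }
        if (base + size == r->base) {
            r->base = base;
            r->size += size;
            return 0;
        }
    }
    if (l->nmem == TOY_MAX_REGIONS)
        return -1;
    l->memory[l->nmem++] = (struct toy_region){ base, size };
    return 0;
}

/* End of a reservation overlapping [start, start + size), or 0 if none. */
static uint64_t overlap_end(const struct toy_lmb *l, uint64_t start,
                            uint64_t size)
{
    for (int i = 0; i < l->nres; i++) {
        const struct toy_region *r = &l->reserved[i];
        if (start < r->base + r->size && r->base < start + size)
            return r->base + r->size;
    }
    return 0;
}

/* First aligned, unreserved chunk of the given size; 0 on failure. */
uint64_t toy_lmb_alloc(struct toy_lmb *l, uint64_t size, uint64_t align)
{
    assert(size > 0 && align > 0 && (align & (align - 1)) == 0);
    if (l->nres == TOY_MAX_REGIONS)
        return 0;
    for (int i = 0; i < l->nmem; i++) {
        uint64_t end = l->memory[i].base + l->memory[i].size;
        uint64_t start = align_up(l->memory[i].base, align);

        while (start + size <= end) {
            uint64_t skip = overlap_end(l, start, size);
            if (skip == 0) {
                l->reserved[l->nres++] = (struct toy_region){ start, size };
                return start;
            }
            start = align_up(skip, align);   /* retry past the reservation */
        }
    }
    return 0;
}
```

Nothing here is clever, which is the point: a linear scan over two small arrays is good enough for the handful of allocations made before the real allocators come up.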
Yinghai's patch set makes a number of changes to the LMB code itself,
starting with a move from the lib directory over to mm
with the rest of the memory-management code. Some new functions are added
to match the different semantics supported by the early reservation code,
which works in a two-step, "find a memory block, then reserve it" mode.
There is also a new function to transfer LMB reservations into the bootmem
allocator for configurations where bootmem is still in use. The 22-part
series culminates with a switch to LMB calls for early allocations and the
removal of the now-unused early reservation code.
There has been surprisingly little discussion for a patch series which
makes such fundamental changes. It seems that most kernel developers pay
relatively little attention to what happens at the architecture-specific
levels. One exception is Ben Herrenschmidt, who keeps an eye on LMB from
the PowerPC perspective. Ben disagrees with a number of the LMB-level
changes, feeling that they complicate the API and potentially introduce
problems. Instead, it looks like Ben would like to fix up the LMB code
himself, letting Yinghai work on the x86-specific side of things.
To that end, Ben has posted a
patch series of his own, saying:
My aim is still to replace the bottom part of Yinghai's patch
series rather than build on top of it, and from there, add whatever
he needs to successfully port x86 over and turn NO_BOOTMEM into
something half decent without adding a ton of unneeded crap to the
core lmb.
Some of the changes simply clean up the LMB code, adding, for example, a
for_each_lmb() macro for iterating through the array of memory
blocks. The fixed-length arrays are made variable, phys_addr_t is
used to represent physical addresses, and the code is substantially
reorganized. There is much that Ben still plans to do, including, happily,
the addition of actual documentation to the API, but even without all that,
it's a significant cleanup for the LMB code.
As with Yinghai's patches, there has been little in the way of discussion.
It may be that these changes will remain below the radar while the two
patch sets are
integrated and - maybe - merged for 2.6.35. With luck, they'll remain
below the radar thereafter as well, with few people even noticing the
difference.
By Jonathan Corbet
May 11, 2010
MeeGo is arguably the dark horse in the mobile platform race: it is new,
unfinished, and unavailable on any currently-shipping product, but it is
going after the same market as a number of more established platforms.
MeeGo is interesting: it is a combined effort by two strong industry players
which are trying, in the usual slow manner, to build a truly
community-oriented development process. For the time being, though,
important development decisions are still being made centrally. Recently,
a significant decision has come to light: MeeGo will be based on the
Btrfs file system by default.
Btrfs is seen as the long-term future of Linux filesystems, representing a
much-needed clean break from the legacy filesystem designs we have been
using for all these years. With the demise of reiser4 and the
unavailability of ZFS, Btrfs would seem to be the only contender for that
title. But talk about Btrfs is always framed in "it's not stable yet"
terms, with few people willing to commit themselves to an actual date when
the filesystem might be ready for production use. It is generally assumed
that most cautious users will spend some years running on ext4 before
making the jump to Btrfs. The 2.6.34 kernel will be released with this
text still guarding the Btrfs configuration entry:
Btrfs is highly experimental, and THE DISK FORMAT IS NOT YET
FINALIZED. You should say N here unless you are interested in
testing Btrfs with non-critical data.
The MeeGo 1.0 release could happen as early as this month; given that, the
above words might just seem a bit scary. In fact, they are more scary than
they need to be: further on-disk format changes are not expected. The
warning, it seems, will be scaled down for 2.6.35.
So why pick Btrfs for MeeGo? Arjan van de Ven described the decision this way:
It's the future of Linux filesystems. We had a case where the old
guard (ext3) is getting retired, and there are two new filesystems
on the table (btrfs and ext4). We felt that if we picked ext4, we'd
have all the pain of a new filesystem, and we'd then change again a
year later to btrfs.
He went on to describe a number of reasons why Btrfs makes sense for the
MeeGo platform, starting with its data integrity features. The
copy-on-write design which is at the core of Btrfs has a number of nice
attributes, one of which is that users should never, ever see garbage data
in files, even in a "pulled out the battery at the worst moment"
situation. Device manufacturers, understandably, like that idea.
The on-disk compression feature is interesting for the MeeGo environment as
well. It makes the initial system load take less space, making more
available for the users of the device. But, as Arjan points out,
manufacturers like it too: a smaller system image takes less time to shovel
onto the storage device.
It would appear that there are a number of plans for the use of the Btrfs
snapshot feature, starting with reversible package updates. With
snapshots, a device can support a multi-user mode where each user appears
to have the entire system to him- or herself. And the "reset to factory
defaults" operation becomes a simple operation which does not require a
separate recovery partition on the disk. Snapshots are not just for
enterprise users anymore.
There are a number of other advantages, including small-file performance,
built-in defragmentation (which is most useful for keeping boot time
short), the storage management features, and more. In short, there's no
doubt that Btrfs offers a useful set of features for any distribution; it's
not hard to see why MeeGo wanted to use it. But that does leave an
interesting open question: is Btrfs ready for inclusion into MeeGo, where
it will, presumably, be installed onto systems intended for users who
aren't looking to become development-stage filesystem testers?
Btrfs was initially merged for the 2.6.29 kernel; since then, patch
activity looks like this:
So there is a steady rate of change to the filesystem, significant but not
overwhelming. There is a wide range of contributors to this code, though
the bulk of the work (by far) has been done by developers from Oracle and
Red Hat. There are certainly people using Btrfs in normal use, and Fedora
offers it as an experimental option. The mailing list shows a number of
oops reports still, and it would appear that the famous ENOSPC issue (where
the filesystem reacts poorly when the storage device overflows) is still
not entirely solved. Significant feature patches - direct I/O support and
RAID 4/5 support, for example - remain unmerged. In summary: Btrfs does
not quite have that "it's done" look to it yet.
That said, it may well be getting close to ready for the sort of restricted
and well-tested environment likely to be found in MeeGo deployments. Btrfs
will also have stabilized further by the time devices actually start
shipping with MeeGo - helped, no doubt, by the work of the MeeGo developers
themselves. So, while this decision may appear to be ambitious now, it is
not necessarily unreasonable. A dark-horse platform can only be helped by
taking advantage of the best technology available to it.
Page editor: Jonathan Corbet