
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.20-rc4, released on January 6. Says Linus: "There's absolutely nothing interesting here, unless you want to play with KVM, or happened to be bitten by the bug with really old versions of the linker that made parts of entry.S just go away."

About 100 patches have been merged into the mainline git repository since -rc4, as of this writing. They are fixes, mostly in the architecture, ALSA, and networking subsystems.

The current -mm tree is 2.6.20-rc3-mm1. Recent changes to -mm include a bunch of KVM work (see below), another set of workqueue API changes, and the virtualization of struct user.

The current stable 2.6 kernel is, released on January 10. It contains a long list of fixes, including the fix for the file corruption problem and several with security implications.

For older kernels: was released on January 9 with a long list of fixes - many of which are security-related.


Kernel development news

Why is slow is the main repository for the Linux kernel source, numerous development trees, and a great deal of associated material. It also offers mirroring for some other Linux-related projects - distribution CD images, for example. Users of have occasionally noticed that the service is rather slow. Kernel tree releases are a long time in making it to the front page, and the mirror network tends to lag behind. This important part of the kernel's development infrastructure, it seems, is not keeping up with demand.

Discussion on the mailing lists reveals that the servers (there are two of them) often run with load averages in the range of 200 to 300. So it's not entirely surprising that they are not always quite as responsive as one would like. There is talk of adding servers, but there is also a sense that the current servers should be able to keep up with the load. So the developers have been looking into what is going on.

The problem seems to originate with git. hosts quite a few git repositories and a version of the gitweb system as well - though gitweb is often disabled when the load gets too high. The git-related problems, in turn, come down to the speed with which Linux can read directories. According to administrator H. Peter Anvin:

During extremely high load, it appears that what slows down more than anything else is the time that each individual getdents() call takes. When I've looked this I've observed times from 200 ms to almost 2 seconds! Since an unpacked *OR* unpruned git tree adds 256 directories to a cleanly packed tree, you can do the math yourself.
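The math mentioned in the quote can be worked through in a few lines. This is only a back-of-the-envelope sketch: it assumes (the source does not say so) that each of the 256 extra directories costs exactly one getdents() call and that the calls are serviced one after another. The function name is hypothetical.

```python
def extra_scan_seconds(extra_dirs, seconds_per_call, calls_per_dir=1):
    """Estimate the added wall-clock time for reading the extra directories.

    calls_per_dir=1 is an assumption; a large directory may need more calls.
    """
    return extra_dirs * calls_per_dir * seconds_per_call


EXTRA_DIRS = 256  # directories added by an unpacked or unpruned git tree

best = extra_scan_seconds(EXTRA_DIRS, 0.2)   # 200 ms per call
worst = extra_scan_seconds(EXTRA_DIRS, 2.0)  # almost 2 seconds per call
print(f"roughly {best:.0f} to {worst:.0f} seconds of added directory reads")
```

Under those assumptions, a single walk over one such tree spends somewhere between most of a minute and several minutes just waiting on directory reads.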

Clearly, something is not quite right with the handling of large filesystems under heavy load. Part of the problem may be that Linux is not dedicating enough memory to caching directories in this situation, but the real problems are elsewhere. It turns out that:

  • The getdents() system call, used to read a directory, is, according to Linus, one of the most expensive in Linux. The locking is such that only one process can be reading a given directory at any given time. If that process must wait for disk I/O, it sleeps holding the inode semaphore and blocks all other readers - even if some of the others could work with parts of the directory which are already in memory.

  • No readahead is done on directories, so each block must be read, one by one, with the whole process stopping and waiting for I/O each time.

  • To make things worse, while the ext3 filesystem tries hard to lay out files contiguously on the disk, it does not make the same effort with directories. So the chances are good that a multi-block directory will be scattered on the disk, forcing a seek for each read and defeating any track caching the drive may be doing.

It has been reported that the third of the above-listed problems can be addressed by moving to XFS, which does a better job at keeping directories together. could make such a switch - at the cost of about a week's downtime for each server. So one should not expect it to happen overnight.

The first priority for improving the situation is, most likely, the implementation of some sort of directory readahead. That change would cut the amount of time spent waiting for directory I/O and, crucially, would require no change to existing filesystems - not even a backup and restore - to get better performance. An early readahead patch has been circulated, but this issue looks complex enough that a few iterations of careful work will be required to arrive at a real solution. So look for something to show up in the 2.6.21 time frame.


Some KVM developments

The KVM patch set was covered here briefly last October. In short, KVM allows for (relatively) simple support of virtualized clients on recent processors. On a CPU with Intel's or AMD's hardware virtualization support, a hypervisor can open /dev/kvm and, through a series of ioctl() calls, create virtualized processors and launch guest systems on them. Compared to a full paravirtualization system like Xen, KVM is relatively small and straightforward; that is one of the reasons why KVM went into 2.6.20, while Xen remains on the outside.

While KVM is in the mainline, it is not exactly in a finished state yet, and it may see significant changes before and after the 2.6.20 release. One current problem has to do with the implementation of "shadow page tables," which does not perform as well as one would like. The solution is conceptually straightforward - at least, once one understands what shadow page tables do.

A page table, of course, is a mapping from a virtual address to the associated physical address (or a flag that said mapping does not currently exist). A virtualized operating system is given a range of "physical" memory to work with, and it implements its own page tables to map between its virtual address spaces and that memory range. But the guest's "physical" memory is a virtual range administered by the host; guests do not deal directly with "bare metal" memory. The result is that there are actually two sets of page tables between a virtual address space on a virtualized guest and the real, physical memory it maps to. The guest can set up one level of translation, but only the host can manage the mapping between the guest's "physical" memory and the real thing.

This situation is handled by way of shadow page tables. The virtualized client thinks it is maintaining its own page tables, but the processor does not actually use them. Instead, the host system implements a "shadow" table which mirrors the guest's table, but which maps guest virtual addresses directly to physical addresses. The shadow table starts out empty; every page fault on the guest then results in the filling in of the appropriate shadow entry. Once the guest has faulted in the pages it needs, it will be able to run at native speed with no further hypervisor attention required.
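The two levels of translation, and the way a shadow table collapses them into one, can be illustrated with a toy model. This is a sketch for illustration only, not KVM's code: addresses are bare page numbers, page tables are dictionaries, and the class and method names are hypothetical.

```python
class ShadowMMU:
    """Toy model of a shadow page table (all names are hypothetical)."""

    def __init__(self, guest_table, host_map):
        self.guest_table = guest_table  # guest virtual -> guest "physical"
        self.host_map = host_map        # guest "physical" -> real physical
        self.shadow = {}                # guest virtual -> real physical
        self.faults = 0                 # how often the host had to step in

    def translate(self, gva):
        """Translate as the processor would: only the shadow table is used."""
        if gva not in self.shadow:
            self._handle_fault(gva)
        return self.shadow[gva]

    def _handle_fault(self, gva):
        """Host-side fault handler: walk both tables, fill one shadow entry."""
        self.faults += 1
        gpa = self.guest_table[gva]            # first level, set up by the guest
        self.shadow[gva] = self.host_map[gpa]  # second level, owned by the host


mmu = ShadowMMU(guest_table={0: 7, 1: 3}, host_map={7: 1042, 3: 1999})
first = mmu.translate(0)   # faults once, filling the shadow entry
second = mmu.translate(0)  # served directly from the shadow table
```

After the first access, repeated translations of the same page never reach the fault handler, which is the "native speed" case described above.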

With the version of KVM found in 2.6.20-rc4, that happy situation tends not to last for very long, though. Once the guest performs a context switch, the painfully-built shadow page table is dumped and a new one is started. Changing the shadow table is required, since the process running after the context switch will have a different set of address mappings. But, when the previous process gets back into the CPU, it would be nice if its shadow page tables were there waiting for it.

The shadow page table caching patch posted by Avi Kivity does just that. Rather than just dump the shadow table, it sets that table aside so that it can be loaded again the next time it's needed. The idea seems simple, but the implementation requires a 33-part patch - there are a lot of details to take care of. Much of the trouble comes from the fact that the host cannot always tell for sure when the guest has made a page table entry change. As a result, guest page tables must be write-protected. Whenever the guest makes a change, it will trap into the hypervisor, which can complete the change and update the shadow table accordingly.

To make the write-protect mechanism work, the caching patch must add a reverse-mapping mechanism to allow it to trace faults back to the page table(s) of interest. There is also an interesting situation where, occasionally, a page will stop being used as a page table without the host system knowing about it. To detect that situation, the KVM code looks for overly-frequent or misaligned writes, either of which indicates (heuristically) that the function of the page has changed.
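The effect of caching can be seen in a small simulation. The sketch below is a deliberately simplified model, not the actual patch: each guest process is identified by a hypothetical page table "root" key, tables are dictionaries of page numbers, and the trap on a write-protected guest page table is modeled as an explicit method call. The root key stands in for the reverse-mapping machinery, and the heuristics for detecting pages that are no longer page tables are left out.

```python
class CachingShadowMMU:
    """Toy model of shadow page table caching (all names are hypothetical)."""

    def __init__(self, guest_tables, host_map, cache_enabled=True):
        self.guest_tables = guest_tables  # root -> {guest virtual: guest "physical"}
        self.host_map = host_map          # guest "physical" -> real physical
        self.cache_enabled = cache_enabled
        self.cached = {}                  # root -> shadow table set aside earlier
        self.root = None                  # currently running guest process
        self.shadow = {}                  # active shadow table
        self.faults = 0

    def context_switch(self, root):
        """The guest loads a different page table root."""
        if self.cache_enabled and self.root is not None:
            self.cached[self.root] = self.shadow  # set aside rather than dump
        self.root = root
        self.shadow = self.cached.pop(root, {}) if self.cache_enabled else {}

    def translate(self, gva):
        if gva not in self.shadow:
            self.faults += 1                      # trap into the host
            gpa = self.guest_tables[self.root][gva]
            self.shadow[gva] = self.host_map[gpa]
        return self.shadow[gva]

    def guest_pte_write(self, root, gva, new_gpa):
        """Models the trap taken when the guest writes a protected page table."""
        self.guest_tables[root][gva] = new_gpa    # complete the guest's change
        table = self.shadow if root == self.root else self.cached.get(root, {})
        if gva in table:
            table[gva] = self.host_map[new_gpa]   # keep the shadow in sync


def faults_for_round_trip(cache_enabled):
    tables = {"A": {0: 1}, "B": {0: 2}}
    mmu = CachingShadowMMU(tables, {1: 100, 2: 200, 3: 300}, cache_enabled)
    for root in ("A", "B", "A"):                  # A runs, B runs, A returns
        mmu.context_switch(root)
        mmu.translate(0)
    return mmu.faults


without_cache = faults_for_round_trip(False)
with_cache = faults_for_round_trip(True)
```

In this model the returning process faults again only when caching is disabled; with caching, its shadow table is waiting for it. The write trap is what keeps a set-aside table from going stale while its owner is not running.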

The 2.6.20 kernel is in a relatively late stage of development, with the final release expected later this month. Even so, Avi would like to see this large change merged now. Ingo Molnar concurs, saying:

I have tested the new MMU changes quite extensively and they are converging nicely. It brings down context-switch costs by a factor of 10 and more, even for microbenchmarks: instead of throwing away the full shadow pagetable hierarchy we have worked so hard to construct this patchset allows the intelligent caching of shadow pagetables. The effect is human-visible as well - the system got visibly snappier

Since the KVM code is new for 2.6.20, changes within it cannot cause regressions for anybody. So this sort of feature addition is likely to be allowed, even this late in the development cycle.

Ingo has been busy on this front, announcing a patch entitled KVM paravirtualization for Linux. It is a set of patches which allows a Linux guest to run under KVM. It is a paravirtualization solution, though, rather than full virtualization: the guest system knows that it is running as a virtual guest. Paravirtualization should not be strictly necessary with hardware virtualization support, but a paravirtualized kernel can take some shortcuts which speed things up considerably. With these patches and the full set of KVM patches, Ingo is able to get benchmark results which are surprisingly close to native hardware speeds, and at least an order of magnitude faster than running under Qemu.

This patch is, in fact, the current form of the paravirt_ops concept. With paravirt_ops, low-level, hardware-specific operations are hidden behind a structure full of member functions. This paravirt_ops structure, by default, contains functions which operate on the hardware directly. Those functions can be replaced, however, by alternatives which operate through a hypervisor. Ingo's patch replaces a relatively small set of operations - mostly those involved with the maintenance of page tables.
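The idea of a structure full of replaceable operations can be sketched as follows. This is an illustration only: the real paravirt_ops is a C structure of function pointers, and the member names, signatures, and the hypercall log used here are hypothetical.

```python
def native_set_pte(page_table, vaddr, paddr):
    """Default operation: write the page table entry directly."""
    page_table[vaddr] = paddr


def native_flush_tlb():
    """Default operation: stands in for a direct hardware TLB flush."""
    pass


class ParavirtOps:
    """A structure of member functions; the defaults touch 'hardware' directly."""

    def __init__(self, set_pte=native_set_pte, flush_tlb=native_flush_tlb):
        self.set_pte = set_pte
        self.flush_tlb = flush_tlb


hypercalls = []  # hypothetical record of requests made to a hypervisor


def hv_set_pte(page_table, vaddr, paddr):
    """Replacement operation: route the update through the hypervisor."""
    hypercalls.append(("set_pte", vaddr, paddr))
    page_table[vaddr] = paddr


paravirt_ops = ParavirtOps()


def map_page(page_table, vaddr, paddr):
    """Kernel-side code calls through the structure, never the hardware."""
    paravirt_ops.set_pte(page_table, vaddr, paddr)
    paravirt_ops.flush_tlb()


page_table = {}
map_page(page_table, 0x1000, 0x8000)   # native path, no hypercall
paravirt_ops.set_pte = hv_set_pte      # a paravirtualized guest swaps one member
map_page(page_table, 0x2000, 0x9000)   # same caller, now goes via the hypervisor
```

The caller, map_page(), does not change when the operation is replaced; only the structure's contents do, which is why a small set of replaced members is enough to alter page table maintenance.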

There was one interesting complaint which came out of Ingo's patch - even though Ingo's new code is not really the problem. The paravirt_ops structure is exported to modules, making it possible for loadable modules to work properly with hypervisors. But there are many operations in paravirt_ops which have never been made available to modules in the past. So paravirt_ops represents a significant widening of the module interface. Ingo responded with a patch which splits paravirt_ops into two structures, only one of which (paravirt_mod_ops) is exported to modules. It seems that the preferred approach, however, will be to create wrapper functions around the operations deemed suitable for modules and export those. That minimizes the intrusiveness of the patch and keeps the paravirt_ops structure out of module reach.

One remaining nagging little detail with the KVM subsystem is what the interface to user space will look like. Avi Kivity has noted that the API currently found in the mainline kernel has a number of shortcomings and will need some changes; many of those, it appears, are likely to show up in 2.6.21. The proposed API is still heavy on ioctl() calls, which does not sit well with all developers, but no alternatives have been proposed. This is a discussion which is likely to continue for some time yet.

Perhaps the most interesting outcome of all this, however, is how KVM is gaining momentum as the virtualization approach of choice - at least for contemporary and future hardware. One can almost see the interest in Xen (for example) fading; KVM comes across as a much simpler, more maintainable way to support full and paravirtualization. The community seems to be converging on KVM as the low-level virtualization interface; commercial vendors of higher-level products will want to adapt to this interface if they want their products to be supported in the future.



Unionfs

A longstanding (and long unsupported in Linux) filesystem concept is that of a union filesystem. In brief, a union filesystem is a logical combination of two or more other filesystems to create the illusion of a single filesystem with the contents of all the others.

As an example, imagine that a user wanted to mount a distribution DVD full of packages. It would be nice to be able to add updated packages to close today's security holes, but the DVD is a read-only medium. The solution is a union filesystem. A system administrator can take a writable filesystem and join it with the read-only DVD, creating a writable filesystem with the contents of both. If the user then adds packages, they will go into the writable filesystem, which can be smaller than would be needed if it were to hold the entire contents.

The unionfs patch posted by Josef Sipek provides this capability. With unionfs in place, the system administrator could construct the union with a command sequence like:

    mount -r /dev/dvd /mnt/media/dvd
    mount    /dev/hdb1 /mnt/media/dvd-overlay
    mount -t unionfs \
          -o dirs=/mnt/media/dvd-overlay=rw:/mnt/media/dvd=ro \
          none /writable-dvd

The first two lines just mount the DVD and the writable partition as normal filesystems. The final command then joins them into a single union, mounted on /writable-dvd. Each "branch" of a union has a priority, determined by the order in which they are given in the dirs= option. When a file is looked up, the branches are searched in priority order, with the first occurrence found being returned to the user. If an attempt is made to write a read-only file, that file will be copied into the highest-priority writable branch and written there.

As one might imagine, there is a fair amount of complexity required to make all of this actually work. Joining together filesystem hierarchies, copying files between them, and inserting "whiteouts" to mask files deleted from read-only branches are just a few of the challenges which must be met. The unionfs code seems to handle most of them well, providing convincing Unix semantics in the joined filesystem.
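The lookup order, copy-up, and whiteout behavior described above can be modeled in miniature. This is a sketch under strong simplifying assumptions, not the unionfs implementation: each branch is a flat dictionary of file names to contents, the writable branch is assumed to have the highest priority (as in the DVD example), and all names are hypothetical.

```python
WHITEOUT = object()  # marker that hides a name found in lower-priority branches


class Union:
    """Toy union of branches, given as (files, writable) in priority order."""

    def __init__(self, branches):
        self.branches = branches

    def lookup(self, name):
        """Search branches in priority order; the first occurrence wins."""
        for files, _writable in self.branches:
            if name in files:
                if files[name] is WHITEOUT:
                    break  # deleted in the union; stop searching lower branches
                return files[name]
        raise FileNotFoundError(name)

    def _top_writable(self):
        for files, writable in self.branches:
            if writable:
                return files
        raise PermissionError("no writable branch in this union")

    def append(self, name, data):
        """Modify a file; a read-only copy is first copied up."""
        current = self.lookup(name)
        self._top_writable()[name] = current + data

    def delete(self, name):
        """Deleting leaves a whiteout instead of touching read-only branches."""
        self.lookup(name)  # raises FileNotFoundError if the name is not visible
        self._top_writable()[name] = WHITEOUT


dvd = {"pkg-1.0": "old"}  # read-only branch
overlay = {}              # writable branch, highest priority
union = Union([(overlay, True), (dvd, False)])

union.append("pkg-1.0", "+fix")  # copied up into the overlay, then modified
patched = union.lookup("pkg-1.0")
union.delete("pkg-1.0")          # masked by a whiteout; the DVD copy remains
```

Throughout, the dvd dictionary is never written; every change lands in the overlay, which is why the writable branch only needs room for what has changed.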

Reviewers immediately jumped on one exception, which was noted in the documentation:

Modifying a Unionfs branch directly, while the union is mounted, is currently unsupported. Any such change can cause Unionfs to oops, or stay silent and even RESULT IN DATA LOSS.

What this means is that it is dangerous to mess directly with the filesystems which have been joined into a union mount. Andrew Morton pointed out that, as user-friendly interfaces go, this one is a little on the rough side. Since bind mounts don't have this problem, he asked, why should unionfs present such a trap to its users? Josef responded:

Bind mounts are a purely VFS level construct. Unionfs is, as the name implies, a filesystem. Last year at OLS, it seemed that a lot of people agreed that unioning is neither purely a fs construct, nor purely a vfs construct.

That, in turn, led to some fairly definitive statements that unionfs should be implemented at the virtual filesystem level. Without that, it's not clear that it will ever be possible to keep the namespace coherent in the face of modifications at all levels of the union. So it seems clear that, to truly gain the approval of the kernel developers, unionfs needs a rewrite. Andrew Morton has been heard to wonder if the current version should be merged anyway in the hopes that it would help inspire that rewrite to happen. No decisions have been made as of this writing, so it's far from clear whether Linux will have unionfs support in the near future or not.


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds