Brief items
The current 2.6 prepatch is 2.6.20-rc4,
released on January 6.
Says Linus: "There's absolutely nothing interesting here, unless you
want to play with KVM, or happened to be bitten by the bug with really old
versions of the linker that made parts of entry.S just go away."
About 100 patches have been merged into the mainline git repository since
-rc4, as of this writing. They are fixes, mostly in the architecture,
ALSA, and networking subsystems.
The current -mm tree is 2.6.20-rc3-mm1. Recent changes
to -mm include a bunch of KVM work (see below), another set of workqueue
API changes, and the virtualization of struct user.
The current stable 2.6 kernel is 2.6.19.2, released on January 10. It
contains a long list of fixes, including the fix for the file corruption
problem and several with security implications.
For older kernels: 2.6.16.38-rc1 was released on
January 9 with a long list of fixes - many of which are
security-related.
Kernel development news
Kernel.org is the main repository for
the Linux kernel source, numerous development trees, and a great deal of
associated material. It also offers mirroring for some other Linux-related
projects - distribution CD images, for example. Users of kernel.org have
occasionally noticed that the service is rather slow. Kernel tree releases
are a long time in making it to the front page, and the mirror network
tends to lag behind. This important part of the kernel's development
infrastructure, it seems, is not keeping up with demand.
Discussion on the mailing lists reveals that the kernel.org servers (there
are two of them) often run with load averages in the range of 200 to 300. So
it's not entirely surprising that they are not always quite as responsive
as one would like. There is talk of adding servers, but there is also a
sense that the current servers should be able to keep up with the load. So
the developers have been looking into what is going on.
The problem seems to originate with git. Kernel.org hosts quite a few git
repositories and a version of the gitweb system as well - though gitweb is
often disabled when the load gets too high. The git-related problems, in
turn, come down to the speed with which Linux can read directories. According to kernel.org administrator H. Peter
Anvin:
During extremely high load, it appears that what slows kernel.org down more
than anything else is the time that each individual getdents() call takes.
When I've looked this I've observed times from 200 ms to almost 2 seconds!
Since an unpacked *OR* unpruned git tree adds 256 directories to a cleanly
packed tree, you can do the math yourself.
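The arithmetic left to the reader can be sketched as follows. This is a rough estimate which assumes (the quote does not say so) that each of the 256 extra directories costs exactly one getdents() call at the observed latency; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope estimate of the extra directory-read time for one
# unpacked or unpruned git tree, using the figures from the quote above.
# Assumption: one getdents() call per extra directory (a large directory
# would need more calls, so this is a lower bound).

EXTRA_DIRECTORIES = 256      # added by an unpacked or unpruned tree
LATENCY_LOW_S = 0.2          # 200 ms per getdents() call
LATENCY_HIGH_S = 2.0         # "almost 2 seconds" per call


def extra_scan_time(directories: int, latency_s: float) -> float:
    """Total time spent in getdents() if the calls run back to back."""
    return directories * latency_s


low = extra_scan_time(EXTRA_DIRECTORIES, LATENCY_LOW_S)
high = extra_scan_time(EXTRA_DIRECTORIES, LATENCY_HIGH_S)
print(f"extra time per tree: {low:.1f} s to {high:.1f} s")
```

Under that assumption, a single unpacked tree adds somewhere between about 51 seconds and eight and a half minutes of directory reading while the servers are heavily loaded.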
Clearly, something is not quite right with the handling of large
filesystems under heavy load. Part of the problem may be that Linux is not
dedicating enough memory to caching directories in this situation, but the
real problems are elsewhere. It turns out that:
- The getdents() system call, used to read a directory, is, according to Linus, one of the most
expensive in Linux. The locking is such that only one process can be
reading a given directory at any given time. If that process must
wait for disk I/O, it sleeps holding the inode semaphore and blocks
all other readers - even if some of the others could work with parts
of the directory which are already in memory.
- No readahead is done on directories, so each block must be read, one
by one, with the whole process stopping and waiting for I/O each time.
- To make things worse, while the ext3 filesystem tries hard to lay out
files contiguously on the disk, it does not make the same effort with
directories. So the chances are good that a multi-block directory
will be scattered on the disk, forcing a seek for each read and
defeating any track caching the drive may be doing.
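The combined effect of the three problems listed above can be illustrated with a toy cost model. The seek and transfer times are assumed round numbers rather than measurements from kernel.org, and the function names are hypothetical.

```python
# Toy cost model for reading one multi-block directory from disk.
# All timing figures are illustrative assumptions, not measurements.

SEEK_MS = 8.0        # assumed average seek plus rotational delay
TRANSFER_MS = 0.1    # assumed time to transfer one directory block


def scattered_no_readahead(blocks: int) -> float:
    """Each block needs its own seek, and the reader waits for every I/O."""
    return blocks * (SEEK_MS + TRANSFER_MS)


def contiguous_with_readahead(blocks: int) -> float:
    """One seek, then the blocks stream in back to back."""
    return SEEK_MS + blocks * TRANSFER_MS


def last_reader_wait(readers: int, read_ms: float) -> float:
    """Readers are serialized by the directory lock, so the last of
    `readers` processes waits for every earlier read plus its own
    (assuming nothing is served from the cache)."""
    return readers * read_ms


blocks = 16
slow = scattered_no_readahead(blocks)
fast = contiguous_with_readahead(blocks)
print(f"scattered: {slow:.1f} ms, contiguous: {fast:.1f} ms")
print(f"10 serialized readers, worst case: {last_reader_wait(10, slow):.1f} ms")
```

With these assumed figures, a 16-block directory costs roughly 130 ms when scattered and read block by block, against roughly 10 ms when contiguous and read ahead. Ten serialized readers push the worst-case wait past a second, which is the same order of magnitude as the getdents() times quoted earlier.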
It has been reported that the third of the above-listed problems can be
addressed by moving to XFS, which
does a better job at keeping directories together. Kernel.org could make such
a switch - at the cost of about a week's downtime for each server. So one
should not expect it to happen overnight.
The first priority for improving the situation is, most likely, the
implementation of some sort of directory readahead. That change would cut
the amount of time spent waiting for directory I/O and, crucially, would
require no change to existing filesystems - not even a backup and restore -
to get better performance. An early readahead patch has been circulated,
but this issue looks complex enough that a few iterations of careful work
will be required to arrive at a real solution. So look for something to
show up in the 2.6.21 time frame.
The KVM patch set was covered here briefly last October. In short, KVM
allows for (relatively)
simple support of virtualized clients on recent processors. On a CPU with
Intel's or AMD's hardware virtualization support, a hypervisor can open
/dev/kvm and, through a series of
ioctl() calls, create
virtualized processors and launch guest systems on them. Compared to a full
paravirtualization system like Xen, KVM is relatively small and
straightforward; that is one of the reasons why KVM went into 2.6.20,
while Xen remains on the outside.
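The open-then-ioctl() pattern described above might look something like the sketch below. Note the assumptions: the ioctl numbers are those found in later versions of <linux/kvm.h> and may not match the interface in 2.6.20, and create_vcpu() is a hypothetical helper which does nothing more than create an empty virtual machine with one virtual CPU.

```python
import fcntl
import os

# ioctl numbers as defined in later versions of <linux/kvm.h>; the numbering
# in the 2.6.20 tree may differ, so treat these values as assumptions.
KVMIO = 0xAE
KVM_GET_API_VERSION = (KVMIO << 8) | 0x00   # _IO(KVMIO, 0x00)
KVM_CREATE_VM = (KVMIO << 8) | 0x01         # _IO(KVMIO, 0x01)
KVM_CREATE_VCPU = (KVMIO << 8) | 0x41       # _IO(KVMIO, 0x41)


def create_vcpu(path="/dev/kvm"):
    """Open the KVM device, then create a VM and one virtual CPU.

    Returns the API version reported by the kernel, or None when the device
    is missing, inaccessible, or refuses one of the requests.
    """
    fds = []
    try:
        kvm_fd = os.open(path, os.O_RDWR)
        fds.append(kvm_fd)
        version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)       # returns a VM fd
        fds.append(vm_fd)
        vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)    # vcpu id 0
        fds.append(vcpu_fd)
        return version
    except OSError:
        return None
    finally:
        for fd in fds:
            os.close(fd)
```

A real hypervisor would go on to give the guest some memory, load register state, and run the virtual CPU; those steps are omitted here.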
While KVM is in the mainline, it is not exactly in a finished state yet,
and it may see significant changes before and after the 2.6.20 release.
One current
problem has to do with the implementation of "shadow page tables," which
does not perform as well as one would like. The solution is conceptually
straightforward - at least, once one understands what shadow page tables
do.
A page table, of course, is a mapping from a virtual address to the
associated physical address (or a flag that said mapping does not currently
exist). A virtualized operating system is given a range of "physical"
memory to work with, and it implements its own page tables to map between
its virtual address spaces and that memory range. But the guest's
"physical" memory is a virtual range administered by the host; guests do
not deal directly with "bare metal" memory. The result is that there are
actually two sets of page tables between a virtual address space on a
virtualized guest and the real, physical memory it maps to. The guest can
set up one level of translation, but only the host can manage the mapping
between the guest's "physical" memory and the real thing.
This situation is handled by way of shadow page tables. The virtualized
client thinks it is maintaining its own page tables, but the
processor does not actually use them. Instead, the host system implements
a "shadow" table which mirrors the guest's table, but which maps guest
virtual addresses directly to physical addresses. The shadow table starts
out empty; every page fault on the guest then results in the filling in of
the appropriate shadow entry. Once the guest has faulted in the pages it
needs, it
will be able to run at native speed with no further hypervisor attention
required.
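The fault-driven filling of a shadow table can be modeled in a few lines. This is a deliberately simplified sketch: addresses are bare page numbers, each table is a flat dictionary instead of a multi-level structure, and the ShadowMMU class is hypothetical and not taken from the KVM code.

```python
# Toy model of shadow paging. Addresses are page numbers, and both
# translation tables are plain dicts; real page tables are multi-level.

class ShadowMMU:
    def __init__(self, guest_table, host_table):
        self.guest_table = guest_table   # guest virtual -> guest "physical"
        self.host_table = host_table     # guest "physical" -> real physical
        self.shadow = {}                 # guest virtual -> real physical
        self.faults = 0

    def translate(self, gva):
        """Return the real page backing a guest virtual page."""
        if gva not in self.shadow:
            # Page fault: the host walks both tables and fills in the entry.
            self.faults += 1
            gpa = self.guest_table[gva]
            self.shadow[gva] = self.host_table[gpa]
        return self.shadow[gva]


guest = {0: 10, 1: 11, 2: 12}
host = {10: 700, 11: 701, 12: 950}
mmu = ShadowMMU(guest, host)
first_pass = [mmu.translate(page) for page in (0, 1, 2)]
faults_after_first_pass = mmu.faults
second_pass = [mmu.translate(page) for page in (0, 1, 2)]
```

The first pass takes one fault per page; the second pass is served entirely from the shadow table, which corresponds to the "native speed" case described above.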
With the version of KVM found in 2.6.20-rc4, that happy situation tends not
to last for very long, though. Once the guest performs a context switch,
the painfully-built shadow page table is dumped and a new one is started.
Changing the shadow table is required, since the process running after the
context switch will have a different set of address mappings. But, when
the previous process gets back into the CPU, it would be nice if its shadow
page tables were there waiting for it.
The shadow page table caching
patch posted by Avi Kivity does just that. Rather than just dump the
shadow table, it sets that table aside so that it can be loaded again the
next time it's needed. The idea seems simple, but the implementation
requires a 33-part patch - there are a lot of details to take care of.
Much of the trouble comes from the fact that the host cannot always tell
for sure when the guest has made a page table entry change. As a result,
guest page tables must be write-protected. Whenever the guest makes a
change, it will trap into the hypervisor, which can complete the change and
update the shadow table accordingly.
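The difference between dumping and caching shadow tables can be seen in a small simulation. It is only a sketch under stated assumptions: each guest process is identified by a made-up page-table root id, tables are flat dictionaries, and a trapped guest write simply invalidates the stale shadow entry where the real patch updates it in place.

```python
# Toy comparison of dropping versus caching shadow tables across guest
# context switches. Each guest process is identified by the (hypothetical)
# id of its page-table root.

class Guest:
    def __init__(self, host_table, cache_shadows):
        self.host_table = host_table      # guest "physical" -> real physical
        self.cache_shadows = cache_shadows
        self.guest_tables = {}            # root id -> {gva: gpa}
        self.shadows = {}                 # root id -> {gva: real page}
        self.current = None
        self.faults = 0

    def context_switch(self, root):
        if not self.cache_shadows and self.current is not None:
            self.shadows.pop(self.current, None)   # dump the old table
        self.current = root
        self.shadows.setdefault(root, {})

    def access(self, gva):
        shadow = self.shadows[self.current]
        if gva not in shadow:
            self.faults += 1                       # the fault fills the entry
            gpa = self.guest_tables[self.current][gva]
            shadow[gva] = self.host_table[gpa]
        return shadow[gva]

    def guest_pte_write(self, root, gva, gpa):
        """Guest page tables are write-protected, so this write traps: the
        host applies it and drops the stale shadow entry, if there is one."""
        self.guest_tables[root][gva] = gpa
        self.shadows.get(root, {}).pop(gva, None)


def run(cache_shadows):
    guest = Guest({n: 1000 + n for n in range(8)}, cache_shadows)
    guest.guest_tables = {"A": {0: 1, 1: 2}, "B": {0: 3, 1: 4}}
    for root in ("A", "B", "A", "B"):
        guest.context_switch(root)
        guest.access(0)
        guest.access(1)
    return guest.faults
```

In this model, run(False) takes a fault on every access while run(True) only faults the first time each process touches a page. The numbers say nothing about real-world speedups; they only show where the saved work comes from.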
To make the write-protect mechanism work, the caching patch must add a
reverse-mapping mechanism to allow it to trace faults back to the page
table(s) of interest. There is also an interesting situation where,
occasionally, a page will stop being used as a page table without the host
system knowing about it. To detect that situation, the KVM code looks for
overly-frequent or misaligned writes, either of which indicates
(heuristically) that the function of the page has changed.
The 2.6.20 kernel is in a relatively late stage of development, with the
final release expected later this month. Even so, Avi would like to see
this large change merged now. Ingo Molnar concurs, saying:
I have tested the new MMU changes quite extensively and they are
converging nicely. It brings down context-switch costs by a factor
of 10 and more, even for microbenchmarks: instead of throwing away
the full shadow pagetable hierarchy we have worked so hard to
construct this patchset allows the intelligent caching of shadow
pagetables. The effect is human-visible as well - the system got
visibly snappier
Since the KVM code is new for 2.6.20, changes within it cannot cause
regressions for anybody. So this sort of feature addition is likely to be
allowed, even this late in the development cycle.
Ingo has been busy on this front, announcing a patch set entitled "KVM
paravirtualization for Linux," which allows a Linux guest to run under
KVM. It is a paravirtualization solution, though, rather than full
virtualization: the guest system knows that it is running as a virtual
guest. Paravirtualization should not be strictly necessary with hardware
virtualization support, but a paravirtualized kernel can take some
shortcuts which speed things up considerably. With these patches and the
full set of KVM patches, Ingo is able to get benchmark results which are
surprisingly close to native hardware speeds, and at least an order of
magnitude faster than running under Qemu.
This patch is, in fact, the current form of the paravirt_ops concept. With
paravirt_ops, low-level, hardware-specific operations are hidden behind a
structure full of member functions. This paravirt_ops structure, by
default, contains functions which operate on the hardware directly. Those
functions can be replaced, however, by alternatives which operate through a
hypervisor. Ingo's patch replaces a relatively small set of operations -
mostly those involved with the maintenance of page tables.
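The structure-of-functions idea can be modeled as below. This is an illustration only: the operation names (set_pte, flush_tlb) and the ParavirtOps class are hypothetical stand-ins, and the "hypercall" merely records that it was invoked.

```python
# Toy model of the paravirt_ops idea: a structure of function pointers that
# defaults to "native" operations and can be overridden by a hypervisor
# back end. The operation names here are hypothetical.

from dataclasses import dataclass, replace
from typing import Callable

log = []


def native_set_pte(entry, value):
    log.append(("native set_pte", entry, value))


def native_flush_tlb():
    log.append(("native flush_tlb",))


def hypercall_set_pte(entry, value):
    log.append(("hypercall set_pte", entry, value))


@dataclass(frozen=True)
class ParavirtOps:
    set_pte: Callable = native_set_pte
    flush_tlb: Callable = native_flush_tlb


native_ops = ParavirtOps()
# A guest kernel swaps in only the operations it wants to accelerate.
kvm_guest_ops = replace(native_ops, set_pte=hypercall_set_pte)

kvm_guest_ops.set_pte(3, 0x1000)
kvm_guest_ops.flush_tlb()
```

Callers go through the structure and never need to know which implementation is behind it; here only the page-table operation has been redirected, while flush_tlb still runs the native version.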
There was one interesting complaint which came out of Ingo's patch - even
though Ingo's new code is not really the problem. The
paravirt_ops structure is exported to modules, making it possible
for loadable modules to work properly with hypervisors. But there are many
operations in paravirt_ops which have never been made available to
modules in the past. So paravirt_ops represents a significant
widening of the module interface. Ingo responded with a patch which
splits paravirt_ops into two structures, only one of which
(paravirt_mod_ops) is exported to modules. It seems that the
preferred approach, however, will be to create
wrapper functions around the operations deemed suitable for modules and
export those. That minimizes the intrusiveness of the patch and keeps the
paravirt_ops structure out of module reach.
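The wrapper approach can be sketched in the same style. Everything here is hypothetical: the operation names, the private table, and the module_call() helper, which stands in for the symbol resolution done when a module is loaded.

```python
# Toy model of the wrapper approach: the full operations table stays
# private, and modules see only a few exported wrapper functions.
# All names are hypothetical.

_paravirt_ops = {                 # private: not visible to "modules"
    "irq_disable": lambda: "irq disabled",
    "irq_enable": lambda: "irq enabled",
    "write_cr3": lambda value: f"cr3 <- {value:#x}",   # core kernel only
}


def irq_disable():
    """Exported wrapper around the private irq_disable operation."""
    return _paravirt_ops["irq_disable"]()


def irq_enable():
    """Exported wrapper around the private irq_enable operation."""
    return _paravirt_ops["irq_enable"]()


# The explicit export list plays the role of EXPORT_SYMBOL().
EXPORTED_TO_MODULES = {"irq_disable": irq_disable, "irq_enable": irq_enable}


def module_call(name, *args):
    """What a loadable module can reach: exported wrappers only."""
    if name not in EXPORTED_TO_MODULES:
        raise PermissionError(f"{name} is not exported to modules")
    return EXPORTED_TO_MODULES[name](*args)
```

A "module" can call module_call("irq_disable"), but an attempt to reach write_cr3 is refused, since no wrapper was exported for it.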
One remaining nagging little detail with the KVM subsystem is what the
interface to user space will look like. Avi Kivity has noted that the API currently
found in the mainline kernel has a number of shortcomings and will need
some changes; many of those, it appears, are likely to show up in 2.6.21.
The proposed API is still heavy on ioctl() calls, which does not
sit well with all developers, but no alternatives have been proposed. This
is a discussion which is likely to continue for some time yet.
Perhaps the most interesting outcome of all this, however, is how KVM is
gaining momentum as the virtualization approach of choice - at least for
contemporary and future hardware. One can almost see the interest in Xen
(for example) fading; KVM comes across as a much simpler, more maintainable
way to support full and paravirtualization. The community seems to be
converging on KVM as the low-level virtualization interface;
commercial vendors of higher-level products will want to adapt to this
interface if they want their products to be supported in the future.
A longstanding (and long unsupported in Linux) filesystem concept is that
of a union filesystem. In brief, a union filesystem is a logical
combination of two or more other filesystems to create the illusion of a
single filesystem with the contents of all the others.
As an example, imagine that a user wanted to mount a distribution DVD full
of packages. It would be nice to be able to add updated packages to close
today's security holes, but the DVD is a read-only medium. The solution
is a union filesystem. A system administrator can take a writable
filesystem and join it with the read-only DVD, creating a writable
filesystem with the contents of both. If the user then adds packages, they
will go into the writable filesystem, which can be smaller than would be
needed if it were to hold the entire contents.
The unionfs patch posted by
Josef Sipek provides this capability. With unionfs in place, the system
administrator could construct the union with a command sequence like:
    mount -r /dev/dvd /mnt/media/dvd
    mount /dev/hdb1 /mnt/media/dvd-overlay
    # "none" fills the device slot that mount expects before the mount point
    mount -t unionfs \
          -o dirs=/mnt/media/dvd-overlay=rw:/mnt/media/dvd=ro \
          none /writable-dvd
The first two lines just mount the DVD and the writable partition as normal
filesystems. The final command then joins them into a single union, mounted
on /writable-dvd. Each "branch" of a union has a priority,
determined by the order in which they are given in the dirs=
option. When a file is looked up, the branches are searched in priority
order, with the first occurrence found being returned to the user. If an
attempt is made to write a read-only file, that file will be copied into
the highest-priority writable branch and written there.
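The lookup and copy-up rules just described can be modeled with dictionaries standing in for filesystems. This is a sketch, not the unionfs algorithm: the Union class and its methods are hypothetical, and an append operation is used so that the copy-up step has a visible effect.

```python
# Toy model of a union: each branch is a dict of path -> contents, listed
# from highest to lowest priority, with a flag saying whether it is writable.

class Union:
    def __init__(self, branches):
        self.branches = branches          # list of (files, writable) pairs

    def lookup(self, path):
        """Return the first occurrence of path in priority order."""
        for files, _writable in self.branches:
            if path in files:
                return files[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Append to a file via the highest-priority writable branch,
        copying the file up from a lower branch first if necessary."""
        for files, writable in self.branches:
            if writable:
                if path not in files:
                    files[path] = self.lookup(path)   # copy-up
                files[path] += data
                return
        raise PermissionError("no writable branch")


dvd = {"README": "original\n"}
overlay = {}
union = Union([(overlay, True), (dvd, False)])
union.append("README", "updated\n")
```

After the append, the overlay holds the modified copy and shadows the DVD's version on lookup, while the DVD's contents are untouched.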
As one might imagine, there is a fair amount of complexity required to make
all of this actually work. Joining together filesystem hierarchies,
copying files between them, and inserting "whiteouts" to mask files deleted
from read-only branches are just a few of the challenges which must be
met. The unionfs code seems to handle most of them well, providing
convincing Unix semantics in the joined filesystem.
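Whiteouts can be modeled in a similar way. In this sketch a sentinel object stands in for whatever on-disk marker unionfs actually uses; the function names and the dictionary-based branches are assumptions made for illustration.

```python
# Toy model of whiteouts: deleting a file that lives on a read-only branch
# records a marker in the writable branch that hides the file below.

WHITEOUT = object()     # sentinel standing in for an on-disk whiteout entry


def lookup(branches, path):
    """Search branches in priority order, honoring whiteouts."""
    for files in branches:
        if path in files:
            if files[path] is WHITEOUT:
                break                  # masked: do not search lower branches
            return files[path]
    raise FileNotFoundError(path)


def unlink(writable, lower_branches, path):
    """Remove path from the union without touching read-only branches."""
    if writable.get(path) is WHITEOUT:
        raise FileNotFoundError(path)  # already deleted
    existed_above = path in writable
    writable.pop(path, None)
    if any(path in files for files in lower_branches):
        writable[path] = WHITEOUT      # hide the read-only copy
    elif not existed_above:
        raise FileNotFoundError(path)


dvd = {"old.rpm": "data"}
overlay = {}
unlink(overlay, [dvd], "old.rpm")
```

The file is still present on the read-only branch, but a lookup through the union now fails as if it had been removed.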
Reviewers immediately jumped on one exception, which was noted in the
documentation:
Modifying a Unionfs branch directly, while the union is mounted, is
currently unsupported. Any such change can cause Unionfs to oops,
or stay silent and even RESULT IN DATA LOSS.
What this means is that it is dangerous to mess directly with the
filesystems which have been joined into a union mount. Andrew Morton
pointed out that, as user-friendly interfaces go, this one is a little on
the rough side. Since bind mounts don't have this problem, he asked, why
should unionfs present such a trap to its users? Josef responded:
Bind mounts are a purely VFS level construct. Unionfs is, as the name
implies, a filesystem. Last year at OLS, it seemed that a lot of people
agreed that unioning is neither purely a fs construct, nor purely a vfs
construct.
That, in turn, led to some fairly definitive statements that unionfs
should be implemented at the virtual filesystem level. Without
that, it's not clear that it will ever be possible to keep the namespace
coherent in the face of modifications at all levels of the union. So it
seems clear that, to truly gain the approval of the kernel developers,
unionfs needs a rewrite. Andrew Morton has been heard to wonder if the current version should
be merged anyway in the hopes that it would help inspire that rewrite to
happen. No decisions have been made as of this writing, so it's far from
clear whether Linux will have unionfs support in the near future or not.
Page editor: Jonathan Corbet