The KVM patch set was covered
briefly last October. In short, KVM allows for (relatively)
simple support of virtualized clients on recent processors. On a CPU with
Intel's or AMD's hardware virtualization support, a hypervisor can open
/dev/kvm and, through a series of ioctl() calls, create
virtualized processors and launch guest systems on them. Compared to a full
paravirtualization system like Xen, KVM is relatively small and
straightforward; that is one of the reasons why KVM went into 2.6.20,
while Xen remains on the outside.
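As a rough illustration of the "open the device, then issue ioctl() calls" pattern, here is a minimal sketch which asks KVM for its API version. The request number is an assumption taken from current x86 kernel headers (_IO(0xAE, 0x00)); the interface found in 2.6.20 may number things differently, and the helper name kvm_api_version() is made up for this example.

```python
import fcntl
import os

# Assumption: value of KVM_GET_API_VERSION, i.e. _IO(0xAE, 0x00), on x86
# as defined in current <linux/kvm.h>. Older kernels may differ.
KVM_GET_API_VERSION = 0xAE00


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if KVM is not usable here."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        # No such device, or no permission to open it.
        return None
    try:
        # With an integer argument, fcntl.ioctl() returns the ioctl's
        # integer result.
        return fcntl.ioctl(fd, KVM_GET_API_VERSION)
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("KVM API version:", kvm_api_version())
```

Creating virtual machines and virtual CPUs follows the same shape: further ioctl() calls on this file descriptor and on the descriptors it hands back.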
While KVM is in the mainline, it is not exactly in a finished state yet,
and it may see significant changes before and after the 2.6.20 release. One
problem has to do with the implementation of "shadow page tables," which
does not perform as well as one would like. The solution is conceptually
straightforward - at least, once one understands what shadow page tables do.
A page table, of course, is a mapping from a virtual address to the
associated physical address (or a flag that said mapping does not currently
exist). A virtualized operating system is given a range of "physical"
memory to work with, and it implements its own page tables to map between
its virtual address spaces and that memory range. But the guest's
"physical" memory is a virtual range administered by the host; guests do
not deal directly with "bare metal" memory. The result is that there are
actually two sets of page tables between a virtual address space on a
virtualized guest and the real, physical memory it maps to. The guest can
set up one level of translation, but only the host can manage the mapping
between the guest's "physical" memory and the real thing.
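The two levels of translation can be sketched with a pair of toy, single-level page tables. Everything here (the page size, the table contents, the function names) is invented for illustration; real page tables are multi-level structures walked by the hardware.

```python
PAGE_SIZE = 4096


def translate(table, addr):
    """Translate addr through a toy page table (page number -> frame number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"page fault at {addr:#x}")
    return table[page] * PAGE_SIZE + offset


# Maintained by the guest: guest virtual page -> guest "physical" page.
guest_page_table = {0x10: 0x2, 0x11: 0x3}
# Maintained by the host: guest "physical" page -> real physical frame.
host_page_table = {0x2: 0x80, 0x3: 0x95}


def guest_virtual_to_host_physical(gva):
    """Apply both translations, one after the other."""
    gpa = translate(guest_page_table, gva)   # the guest's view
    return translate(host_page_table, gpa)   # the host's view


if __name__ == "__main__":
    print(hex(guest_virtual_to_host_physical(0x10123)))
```

Doing both lookups on every memory access would be far too slow, which is why something has to collapse them into a single table.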
This situation is handled by way of shadow page tables. The virtualized
client thinks it is maintaining its own page tables, but the
processor does not actually use them. Instead, the host system implements
a "shadow" table which mirrors the guest's table, but which maps guest
virtual addresses directly to physical addresses. The shadow table starts
out empty; every page fault on the guest then results in the filling in of
the appropriate shadow entry. Once the guest has faulted in the pages it
needs, it will be able to run at native speed with no further hypervisor attention.
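A minimal sketch of the fault-driven filling of a shadow table might look like the following. The class and attribute names are hypothetical, and the tables are reduced to dictionaries keyed by page number; the point is only that the expensive two-table walk happens once per page.

```python
class ShadowMMU:
    """Toy model of a shadow page table which is filled in on faults."""

    def __init__(self, guest_table, host_table):
        self.guest_table = guest_table  # guest virtual page -> guest physical page
        self.host_table = host_table    # guest physical page -> host frame
        self.shadow = {}                # guest virtual page -> host frame
        self.faults = 0

    def access(self, vpage):
        """Return the host frame for vpage, faulting it in if necessary."""
        if vpage not in self.shadow:
            # Shadow miss: the hypervisor walks both tables and records
            # the combined translation.
            self.faults += 1
            gpage = self.guest_table[vpage]
            self.shadow[vpage] = self.host_table[gpage]
        # Shadow hit: no hypervisor involvement is needed.
        return self.shadow[vpage]


if __name__ == "__main__":
    mmu = ShadowMMU({1: 10, 2: 11}, {10: 100, 11: 101})
    for page in (1, 2, 1, 2, 1):
        mmu.access(page)
    print("faults:", mmu.faults)
```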
With the version of KVM found in 2.6.20-rc4, that happy situation tends not
to last for very long, though. Once the guest performs a context switch,
the painfully-built shadow page table is dumped and a new one is started.
Changing the shadow table is required, since the process running after the
context switch will have a different set of address mappings. But, when
the previous process gets back into the CPU, it would be nice if its shadow
page tables were there waiting for it.
The shadow page table caching
patch posted by Avi Kivity does just that. Rather than just dump the
shadow table, it sets that table aside so that it can be loaded again the
next time it's needed. The idea seems simple, but the implementation
requires a 33-part patch - there are a lot of details to take care of.
Much of the trouble comes from the fact that the host cannot always tell
for sure when the guest has made a page table entry change. As a result,
guest page tables must be write-protected. Whenever the guest makes a
change, it will trap into the hypervisor, which can complete the change and
update the shadow table accordingly.
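The effect of caching can be seen in a toy model which keeps one shadow table per guest page-table root and counts faults. This is a sketch of the idea only, not of Avi's patch: all names are invented, and a trapped guest write is handled here by simply dropping the stale shadow entry so that the next access refills it.

```python
class CachingShadowMMU:
    """Toy shadow MMU which can keep shadow tables across context switches."""

    def __init__(self, host_table, cache_enabled=True):
        self.host_table = host_table    # guest physical page -> host frame
        self.cache_enabled = cache_enabled
        self.guest_tables = {}          # root id -> guest page table
        self.shadow_cache = {}          # root id -> shadow table
        self.current_root = None
        self.faults = 0

    def context_switch(self, root, guest_table):
        """The guest switches to the address space identified by root."""
        if not self.cache_enabled:
            # Old behavior: throw every shadow table away.
            self.shadow_cache.clear()
        self.guest_tables[root] = guest_table
        # Reuse a cached shadow table if one exists, else start empty.
        self.shadow_cache.setdefault(root, {})
        self.current_root = root

    def access(self, vpage):
        shadow = self.shadow_cache[self.current_root]
        if vpage not in shadow:
            self.faults += 1
            gpage = self.guest_tables[self.current_root][vpage]
            shadow[vpage] = self.host_table[gpage]
        return shadow[vpage]

    def guest_pte_write(self, root, vpage, gpage):
        """A write to a write-protected guest page table has trapped."""
        self.guest_tables[root][vpage] = gpage        # complete the write
        self.shadow_cache.get(root, {}).pop(vpage, None)  # drop stale entry


def count_faults(cache_enabled):
    """Run A -> B -> A and report how many shadow faults were taken."""
    mmu = CachingShadowMMU({10: 100, 11: 101}, cache_enabled)
    for root, table in (("A", {1: 10}), ("B", {1: 11}), ("A", {1: 10})):
        mmu.context_switch(root, table)
        mmu.access(1)
    return mmu.faults


if __name__ == "__main__":
    print("without caching:", count_faults(False))
    print("with caching:   ", count_faults(True))
```

In this model the return to process A costs a fault without the cache and nothing with it; a real guest would be refaulting its entire working set.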
To make the write-protect mechanism work, the caching patch must add a
reverse-mapping mechanism to allow it to trace faults back to the page
table(s) of interest. There is also an interesting situation where,
occasionally, a page will stop being used as a page table without the host
system knowing about it. To detect that situation, the KVM code looks for
overly-frequent or misaligned writes, either of which indicates
(heuristically) that the function of the page has changed.
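The heuristic could be sketched along these lines. The entry size, the flood threshold, and all names are assumptions made for the example rather than values from the patch, and "misaligned" is simplified to mean any write which does not cover exactly one aligned entry.

```python
PTE_SIZE = 8           # assumed page table entry size, in bytes
WRITE_FLOOD_LIMIT = 3  # hypothetical threshold, not taken from the patch


class PageTableTracker:
    """Guess whether a write-protected page is still used as a page table."""

    def __init__(self):
        self.consecutive_writes = 0
        self.is_page_table = True

    def note_guest_progress(self):
        """The guest ran for a while without writing to this page."""
        self.consecutive_writes = 0

    def note_write(self, offset, length):
        """Record a trapped write; return the current guess."""
        misaligned = offset % PTE_SIZE != 0 or length != PTE_SIZE
        self.consecutive_writes += 1
        flooded = self.consecutive_writes > WRITE_FLOOD_LIMIT
        if misaligned or flooded:
            # Real page table updates are aligned and fairly rare, so the
            # page has probably been put to some other use.
            self.is_page_table = False
        return self.is_page_table


if __name__ == "__main__":
    tracker = PageTableTracker()
    print(tracker.note_write(16, 8))  # aligned, entry-sized write
    print(tracker.note_write(3, 8))   # misaligned write
```

Once the guess turns negative, the hypervisor can stop write-protecting the page and discard the associated shadow state.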
The 2.6.20 kernel is in a relatively late stage of development, with the
final release expected later this month. Even so, Avi would like to see
this large change merged now. Ingo Molnar concurs, saying:
I have tested the new MMU changes quite extensively and they are
converging nicely. It brings down context-switch costs by a factor
of 10 and more, even for microbenchmarks: instead of throwing away
the full shadow pagetable hierarchy we have worked so hard to
construct this patchset allows the intelligent caching of shadow
pagetables. The effect is human-visible as well - the system got
Since the KVM code is new for 2.6.20, changes within it cannot cause
regressions for anybody. So this sort of feature addition is likely to be
allowed, even this late in the development cycle.
Ingo has been busy on this front, announcing a patch entitled KVM paravirtualization for
Linux. It is a set of patches which allows a Linux guest to run under
KVM. It is a paravirtualization solution, though, rather than full
virtualization: the guest system knows that it is running as a virtual
guest. Paravirtualization should not be strictly necessary with hardware
virtualization support, but a paravirtualized kernel can take some
shortcuts which speed things up considerably. With these patches and the
full set of KVM patches, Ingo is able to get benchmark results which are
surprisingly close to native hardware speeds, and at least an order of
magnitude faster than running under Qemu.
This patch is, in fact, the current form of the paravirt_ops concept. With
paravirt_ops, low-level, hardware-specific operations are hidden behind a
structure full of member functions. This paravirt_ops structure, by
default, contains functions which operate on the hardware directly. Those
functions can be replaced, however, by alternatives which operate through a
hypervisor. Ingo's patch replaces a relatively small set of operations -
mostly those involved with the maintenance of page tables.
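The pattern is easier to see in miniature. In this sketch a table of operations starts out pointing at "native" functions, and a hypervisor-aware guest replaces only the page table operation. The kernel's version is a C structure of function pointers; the operation names and the log used here are invented for the example.

```python
log = []


def native_set_pte(page_table, index, value):
    """Write the entry directly, as on bare hardware."""
    page_table[index] = value
    log.append("native set_pte")


def native_flush_tlb():
    log.append("native flush_tlb")


def hypercall_set_pte(page_table, index, value):
    """Stand-in for asking the hypervisor to make the change."""
    page_table[index] = value
    log.append("hypercall set_pte")


class ParavirtOps:
    """A table of replaceable low-level operations (hypothetical names)."""

    def __init__(self):
        # By default every operation goes straight to the hardware.
        self.set_pte = native_set_pte
        self.flush_tlb = native_flush_tlb


paravirt_ops = ParavirtOps()


def install_hypervisor_ops(ops):
    """Replace only the operations which benefit from hypervisor help."""
    ops.set_pte = hypercall_set_pte


if __name__ == "__main__":
    table = {}
    paravirt_ops.set_pte(table, 0, 5)
    install_hypervisor_ops(paravirt_ops)
    paravirt_ops.set_pte(table, 1, 6)
    paravirt_ops.flush_tlb()
    print(log)
```

Callers never need to know which implementation is installed; they just call through the structure.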
There was one interesting complaint which came out of Ingo's patch - even
though Ingo's new code is not really the problem. The
paravirt_ops structure is exported to modules, making it possible
for loadable modules to work properly with hypervisors. But there are many
operations in paravirt_ops which have never been made available to
modules in the past. So paravirt_ops represents a significant
widening of the module interface. Ingo responded with a patch which
splits paravirt_ops into two structures, only one of which
(paravirt_mod_ops) is exported to modules. It seems that the
preferred approach, however, will be to create
wrapper functions around the operations deemed suitable for modules and
export those. That minimizes the intrusiveness of the patch and keeps the
paravirt_ops structure out of module reach.
One remaining nagging little detail with the KVM subsystem is what the
interface to user space will look like. Avi Kivity has noted that the API currently
found in the mainline kernel has a number of shortcomings and will need
some changes; many of those, it appears, are likely to show up in 2.6.21.
The proposed API is still heavy on ioctl() calls, which does not
sit well with all developers, but no alternatives have been proposed. This
is a discussion which is likely to continue for some time yet.
Perhaps the most interesting outcome of all this, however, is how KVM is
gaining momentum as the virtualization approach of choice - at least for
contemporary and future hardware. One can almost see the interest in Xen
(for example) fading; KVM comes across as a much simpler, more maintainable
way to support both full virtualization and paravirtualization. The community seems to be
converging on KVM as the low-level virtualization interface;
commercial vendors of higher-level products will want to adapt to this
interface if they want their products to be supported in the future.