March 29, 2007
This article was contributed by Aggelos Economopoulos
In this article, we will describe several aspects of the architecture of
DragonFly BSD's virtual kernel infrastructure, which allows the kernel to
be run as a user-space process. Its design and implementation are
largely the work of the project's lead developer, Matthew Dillon, who first
announced his intention to modify the kernel to run in userspace on
September 2nd, 2006. The first stable DragonFly BSD version to
feature virtual kernel (vkernel) support was DragonFly 1.8, released on January
30th, 2007.
The motivation for this work (as can be found in the initial mail linked
to above) was finding an elegant solution to one immediate and one long-term
issue in pursuing the project's main goal of Single System Image clustering
over the Internet. First, as anyone familiar with distributed
algorithms will attest, implementing cache coherency without hardware support is
a complex task. It would not be made any easier by enduring a 2-3 minute delay
in the edit-compile-run cycle while each machine goes through the boot
sequence. As a nice side effect, userspace programming errors are unlikely to
bring the machine down and one has the benefit of working with superior
debugging tools (and can more easily develop new ones).
The second, long-term issue that virtual kernels are intended to
address is finding a way to securely and
efficiently dedicate system resources to a cluster that operates over the
(hostile) Internet. Because a kernel is a more or less standalone
environment, it should be possible to completely isolate the process a
virtual kernel runs in from the rest of the system. While the
problem of process isolation is far from solved, there exist a number of
promising approaches. One option, for example, would be to use systrace
(refer to [Provos03]) to mask out all but the few (and hopefully
carefully audited) system calls that the vkernel requires after initialization
has taken place. This setup would allow for a significantly higher degree of
protection for the host system in the event that the virtualized environment was
compromised. Moreover, the host kernel already has well-tested facilities for
arbitrating resources, although these facilities are not necessarily sufficient
or dependable; the CPU scheduler is not infallible and mechanisms for allocating
disk I/O bandwidth will need to be implemented or expanded. In any case,
leveraging preexisting mechanisms reduces the burden on the project's
development team, which can't be all bad.
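As a purely hypothetical sketch of what such a policy could look like, a systrace configuration for a vkernel binary might permit a handful of system calls and refuse everything else. The binary path and the presence of vmspace_* entries below are assumptions for illustration, not taken from a real setup.

    Policy: /usr/bin/vkernel, Emulation: native
        native-read: permit
        native-write: permit
        native-mmap: permit
        native-munmap: permit
        native-madvise: permit
        native-vmspace_create: permit
        native-vmspace_ctl: permit
        native-vmspace_destroy: permit
        native-exit: permit
        native-fork: deny
        native-execve: deny

System calls not matched by any rule would not be silently permitted, which is what limits the damage a compromised virtual kernel could do to the host.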
Preparatory work
Getting the kernel to build as a regular, userspace, ELF executable
required tidying up large portions of the source tree. In this section we
will focus on the two large sets of changes that took place as part of
this cleanup. The second set might seem superficial and hardly worthy of
mention as such, but in explaining the reasons that led to it, we shall
discuss an important decision that was made in the implementation of the
virtual kernel.
The first set of changes separated machine-dependent code into
platform- and CPU-specific parts. The real and virtual kernels can be
considered to run on two different platforms; the first runs (only, as must
reluctantly be admitted) on 32-bit PC-style hardware, while the
second runs on a DragonFly kernel. Regardless of the differences
between the two platforms, both kernels expect the same processor
architecture. After the separation, the cpu/i386
directory of the kernel tree is left with hand-optimized assembly
versions of certain kernel routines, headers relevant only to x86 CPUs
and code that deals with object relocation and debug information. The
real kernel's platform directory (platform/pc32) is
familiar with things like programmable interrupt controllers, power
management, and the PC BIOS (which the vkernel doesn't need), while
the virtual kernel's platform/vkernel directory
happily uses the system calls that the real kernel can't have. Of
course this does not imply that there is absolutely no code duplication,
but fixing that is not a pressing problem.
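In rough outline, the resulting directory split (relative to the kernel source tree, with contents paraphrased from the description above) is:

    cpu/i386/           - x86-only headers, hand-optimized assembly routines,
                          relocation and debug-info handling (shared by both)
    platform/pc32/      - real kernel: interrupt controllers, power management,
                          PC BIOS glue
    platform/vkernel/   - virtual kernel: "hardware" provided via host system calls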
The massive second set of changes primarily involved renaming quite
a few kernel symbols so that they do not clash with the libc ones
(e.g. *printf(), qsort(), errno, etc.) and using kdev_t for the POSIX dev_t
type in the kernel. As should be plain, this was a prerequisite for
having the virtual kernel link with the standard C library. Given that
the kernel is self-hosted (this means that, since it cannot generally
rely on support software after it has been loaded, the kernel includes
its own helper routines), one can question the decision to pull in all
of libc instead of simply adding the (few) system calls that the vkernel
actually uses. A controversial choice at the time, it prevailed because
it was deemed that it would allow future vkernel code to leverage the
extended functionality provided by libc. In particular, thread awareness in the
system C library should accommodate the (medium-term) plan to mimic
multiprocessor operation by using one vkernel thread for each hypothetical
CPU. It is safe to say that if that plan materializes, linking against libc
will prove to have been a worthwhile tradeoff.
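A minimal sketch of what such renaming looks like is shown below; the prototypes are illustrative simplifications based on the names mentioned above, not copies of the actual DragonFly headers.

    /* Illustrative sketch only: kernel-private names that avoid colliding
     * with libc symbols, so that the vkernel can link against libc.
     * Prototypes are simplified assumptions, not the DragonFly originals. */
    #include <stddef.h>

    int  kprintf(const char *fmt, ...);         /* kernel's own printf() */
    void kqsort(void *base, size_t nmemb, size_t size,
                int (*cmp)(const void *, const void *)); /* kernel's qsort() */

    /* The kernel similarly uses kdev_t internally, leaving the POSIX dev_t
     * name free for libc; its exact definition is omitted here. */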
The Virtual Kernel
In this section, we will study the architecture of the virtual kernel and
the design choices made in its development, focusing on its differences from a
kernel running on actual hardware. In the process, we'll need to describe the
changes made in the real (host) kernel code, specifically in order to support a
DragonFly kernel running as a user process.
Address Space Model
The first design choice made in the development of the vkernel is that the
whole virtualized environment executes as part of the same real-kernel
process. This imposes well-defined limits on the amount of real-kernel
resources it may consume and makes containment straightforward.
Processes running under the vkernel are not in direct competition with host
processes for CPU time, and most of the bookkeeping expected
from a kernel during the lifetime of a process is handled by the virtual
kernel. The alternative[1],
running each vkernel process[2]
in the context of a real
kernel process, imposes an extra burden on the host kernel and requires additional
mechanisms for effective isolation of vkernel processes from the host system.
That said, the real kernel still has to deal with some amount of VM work and
reserve some memory space that is proportional to the number of processes
running under the vkernel. This will become clear after we examine
the new system calls for manipulating vmspace objects.
In the kernel, the main purpose of a vmspace object is to describe the
address space of one or more processes. Each process normally has one vmspace,
but a vmspace may be shared by several processes. An address space is logically
partitioned into sets of pages, so that all pages in a set are backed by the
same VM object (and are linearly mapped onto it) and have the same protection
bits. All such sets are represented as vm_map_entry structures. VM map entries
are linked together in both a tree and a linked list so that lookups,
additions, deletions and merges can be performed efficiently (with low time
complexity). Control information and pointers to these data structures are
encapsulated in the vm_map object that is contained in every vmspace (see the
diagram below).
A VM object (vm_object) is an interface to a data store
and can be of various types (default, swap, vnode, ...) depending on where it
gets its pages from. The existence of shadow objects somewhat complicates
matters, but for our purposes this simplified model should be sufficient. For
more information you're urged to have a look at the source and refer to
[McKusick04]
and [Dillon00].
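To make these relationships concrete, here is a heavily simplified sketch of the data structures in C; the field names and types are illustrative and do not match the real DragonFly definitions.

    /* Simplified sketch of the VM data structures described above.
     * Field names and types are illustrative, not the DragonFly originals. */
    #include <stdint.h>
    #include <sys/types.h>

    struct vm_object;                       /* interface to a data store */

    struct vm_map_entry {
        uintptr_t         start, end;       /* virtual address range of the page set */
        struct vm_object *object;           /* backing VM object */
        off_t             offset;           /* linear offset into that object */
        int               protection;       /* protection bits shared by the range */
        /* entries are linked into both a list and a tree for fast
         * lookups, insertions, deletions and merges */
    };

    struct vm_map {
        struct vm_map_entry *entries;       /* the list/tree of map entries */
        int                  nentries;      /* control information */
    };

    struct vmspace {
        struct vm_map vm_map;               /* describes one address space, which
                                               may be shared by several processes */
    };

    struct vm_object {
        int type;                           /* default, swap, vnode, ... (where
                                               the pages come from) */
    };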
In the first stages of vkernel development, a number of system
calls were added to the host kernel that allow a process to associate itself with
more than one vmspace. The creation of a vmspace is accomplished by
vmspace_create(). The new vmspace is uniquely identified by an arbitrary value
supplied as an argument. Similarly, the vmspace_destroy() call deletes the
vmspace identified by the value of its only parameter. It is expected that only
a virtual kernel running as a user process will need access to alternate
address spaces. Also, it should be made clear that while a process can have
many vmspaces associated with it, only one vmspace is active at any given time.
The active vmspace is the one operated on by
mmap()/munmap()/madvise()/etc.
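As a sketch, per-vproc address space management in the vkernel might look like the code below; the system call prototypes and the vproc structure are assumptions made for illustration, simplified from the description above.

    /* Hypothetical sketch of per-vproc vmspace management; prototypes are
     * assumptions based on the text, not the exact DragonFly interfaces. */
    #include <stddef.h>

    struct vproc;                                    /* vkernel's process structure */

    int vmspace_create(void *id, int type, void *data); /* assumed prototype */
    int vmspace_destroy(void *id);                       /* assumed prototype */

    static void vproc_alloc_vmspace(struct vproc *vp)
    {
        /* the vproc's address acts as the arbitrary unique identifier */
        vmspace_create(vp, 0, NULL);
        /* ... map the vproc's text, data and stack into the new vmspace ... */
    }

    static void vproc_free_vmspace(struct vproc *vp)
    {
        /* the vmspace only needs to exist while the vproc is running; the
         * vkernel keeps all authoritative process state in its own memory */
        vmspace_destroy(vp);
    }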
The virtual kernel creates a vmspace for each of its processes and
destroys the associated vmspace when the vproc is terminated, but this behavior
is not compulsory. Since, just like in the real kernel, all information about a
process and its address space is stored in kernel memory[3], the vmspace
can be disposed of and reinstantiated at
will; its existence is only necessary while the vproc is running. One can
imagine the vkernel destroying the vproc vmspaces in response to a low-memory
situation on the host system.
When it decides that it needs to run a certain process, the vkernel issues
a vmspace_ctl() system call with an argument of
VMSPACE_CTL_RUN as the command
(currently there are no other commands available), specifying the desired
vmspace to activate. Naturally, it also needs to supply the necessary context
(values of general purpose registers, instruction/stack pointers, descriptors)
in which execution will resume. The original vmspace is special; if, while
running on an alternate address space, a condition occurs which requires kernel
intervention (for example, a floating-point operation throws an exception or a
system call is made), the host kernel automatically switches back to the
previous vmspace, hands over the execution context as it was when the exceptional
condition caused entry into the kernel, and leaves it to the vkernel to resolve
matters. Signals from other host processes are likewise delivered after switching
back to the vkernel vmspace.
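In code, the dispatch path just described might be sketched as follows; the context structure, the prototype of vmspace_ctl() and the value of VMSPACE_CTL_RUN are assumptions for illustration, not the real definitions.

    /* Hypothetical sketch of running a vproc in its alternate vmspace. */
    struct vproc_context {
        unsigned long gpr[8];          /* general-purpose registers */
        unsigned long eip, esp;        /* instruction and stack pointers */
        unsigned long seg[6];          /* segment descriptors (simplified) */
    };

    #define VMSPACE_CTL_RUN 1          /* assumed command value */

    int vmspace_ctl(void *id, int cmd, struct vproc_context *ctx); /* assumed */

    static void vproc_dispatch(void *id, struct vproc_context *ctx)
    {
        /* vmspace_ctl() returns only when the vproc needs kernel attention:
         * it made a system call, took a fault/exception, or a signal arrived.
         * By then the host kernel has switched back to the vkernel's original
         * vmspace and *ctx holds the vproc's execution state at that point. */
        vmspace_ctl(id, VMSPACE_CTL_RUN, ctx);

        /* ... inspect *ctx, emulate the system call or deliver the signal,
         * then call vmspace_ctl() again to resume the vproc ... */
    }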
Support for creating and managing alternate vmspaces is also
available to vkernel processes. This requires special care so that all the
relevant code sections can operate in a recursive manner. The result is that
vkernels can be nested, that is, one can have a vkernel running as a process
under a second vkernel running as a process under a third vkernel and so
on. Naturally, the overhead incurred for each level of recursion does not
make this an attractive setup performance-wise, but it is a neat feature
nonetheless.
The previous paragraphs have described the background of vkernel
development and have given a high-level overview of how the vkernel fits in with
the abstractions provided by the real kernel. We are now ready to dive into the
most interesting parts of the code, where we will get acquainted with a new
type of page table and discuss the details of FPU virtualization and vproc <->
vkernel communication. But that discussion needs an article of its own,
so it will have to wait for a future week.
Bibliography
[McKusick04] Kirk McKusick and George Neville-Neil, The Design and Implementation of the FreeBSD Operating System.
[Dillon00] Matthew Dillon, Design elements of the FreeBSD VM system.
[Lemon00] Jonathan Lemon, Kqueue: A generic and scalable event notification facility.
[AST06] Andrew Tanenbaum and Albert Woodhull, Operating Systems Design and Implementation.
[Provos03] Niels Provos, Improving Host Security with System Call Policies.
[Stevens99] Richard Stevens, UNIX Network Programming, Volume 1: Sockets and XTI.
Notes
[1] There are of course other alternatives, the most obvious one being
having one process for the virtual kernel and another for contained processes,
which is mostly equivalent to the choice made in DragonFly.
[2] A process running under a virtual kernel will also be referred to as a
"vproc" to distinguish it from host kernel processes.
[3] The small matter of the actual data belonging to the vproc is not an issue,
but you will have to wait until we get to the RAM file in the next subsection to
see why.