Brief items
The current 2.6 prepatch is 2.6.21-rc7, released by Linus on April 15. The
list of fixes is relatively short; the next release - expected any day now -
should be the final 2.6.21 kernel.
About 30 fixes have been merged into the mainline git repository since
-rc7. Also merged is the removal of the unused
alloc_skb_from_cache() function.
The current stable 2.6 kernel is 2.6.20.7, released on April 13. It
contains fixes for a dozen or so serious problems.
For older kernels: 2.6.16.47 was released on
April 14, followed by 2.6.16.48 on April 16.
Each contains around a dozen fixes, some of which are security-related.
Kernel development news
So I claim that anything that cannot be fair by user ID is actually
really REALLY unfair. I think it's absolutely humongously STUPID to
call something the "Completely Fair Scheduler", and then just be
fair on a thread level. That's not fair AT ALL! It's the
anti-thesis of being fair!
--
Linus Torvalds
It just reminds me that the concept of "release early, release
often" doesn't actually work in the kernel. What is far more
obvious is "release code only when it's so close to perfect that
noone can argue against it" since most of the work is done by one
person, otherwise someone will come out with a counterpatch that is
_complete_ earlier but in all possibility not as good, it's just
ready sooner.
--
Con Kolivas
The RSDL scheduler (since renamed the staircase deadline scheduler) by Con
Kolivas was, for a period of time, assumed to be positioned for merging into
the mainline, perhaps as soon as 2.6.22. Difficulties with certain workloads
made the future of this scheduler a little less certain. Now Con would appear
to have rediscovered one of the most reliable ways of getting a new idea into
the kernel: post some code then wait for Ingo Molnar to rework the whole thing
in a two-day hacking binge. So, while Con has recently updated the SD
scheduler patch, his work now looks like it might be upstaged by Ingo's new
completely fair scheduler (CFS), at version 2 as of this writing.
There are a number of interesting aspects to CFS. To begin with, it does
away with the arrays of run queues altogether. Instead, the CFS works with
a single red-black tree to
track all processes which are in a runnable state. The process which pops
up at the leftmost node of the tree is the one which is most entitled to
run at any given time. So the key to understanding this scheduler is to
get a sense for how it calculates the key value used to insert a process
into the tree.
That calculation is reasonably simple. When a task goes into the run
queue, the current time is noted. As the process waits for the CPU, the
scheduler tracks the amount of processor time it would have been entitled
to; this entitlement is simply the wait time divided by the number of
running processes (with a correction for different priority values). For
all practical purposes, the key is the amount of CPU time due to the
process, with higher-priority processes getting a bit of a boost. The
short-term priority of a process will thus vary depending on whether it is
getting its fair share of the processor or not.
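The entitlement arithmetic described above can be sketched in a few lines of C. This is only an illustration, not the CFS code: the structure, the function name, and the weight scale (1024 for default priority) are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical per-task bookkeeping for this sketch; not a kernel structure. */
struct cfs_demo_task {
    uint64_t enqueued_at_ns; /* time at which the task entered the run queue */
    uint32_t weight;         /* assumed scale: 1024 means default priority */
};

/*
 * CPU time the task would have been entitled to while it waited: the wait
 * time divided by the number of runnable tasks, scaled by the task's weight.
 */
uint64_t cfs_demo_entitlement_ns(const struct cfs_demo_task *t,
                                 uint64_t now_ns, unsigned int nr_running)
{
    uint64_t waited_ns = now_ns - t->enqueued_at_ns;

    if (nr_running == 0)
        return 0;
    return (waited_ns / nr_running) * t->weight / 1024;
}
```

With four runnable tasks and a wait of 4000ns, a default-weight task accrues 1000ns of entitlement in this model; doubling the weight doubles that figure.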
It is only a slight oversimplification to say that the above discussion
covers the entirety of the CFS scheduler. There is no tracking of sleep
time, no attempt to identify interactive processes, etc. In a sense, the
CFS scheduler even does away with the concept of time slices; it's all a
matter of whether a given process is getting the share of the CPU it is
entitled to given the number of processes which are trying to run. The
CFS scheduler offers a single tunable: a "granularity" value which
describes how quickly the scheduler will switch processes in order to
maintain fairness. A low granularity gives more frequent switching; this
setting translates to lower latency for interactive responses but can lower
throughput slightly. Server systems may run better with a higher
granularity value.
Ingo claims that the CFS scheduler provides solid, fair interactive
response in almost all situations. There's a whole set of nasty programs
in circulation which can be used to destroy interactivity under the current
scheduler; none of them, says Ingo, will impact interactivity under CFS.
The CFS posting came with another feature which surprised almost everybody
who has been watching this area of kernel development: a modular scheduler
framework. Ingo describes it as "an extensible hierarchy of scheduler
modules," but, if so, it's a hierarchy with no branches. It's a simple
linked list of modules in priority order; the first scheduler module which
can come up with a runnable task gets to decide who goes next. Currently
two modules are provided: the CFS scheduler described above and a
simplified version of the real-time scheduler. The real-time scheduler
appears first in the list, so any real-time tasks will run ahead of normal
processes.
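The "first module with a runnable task wins" rule can be modeled as a walk over a singly-linked list. The sketch below is a toy under stated assumptions: the modsched_ names, the two-slot run queue, and the layout of the class structure are invented for illustration and do not come from the patch.

```c
#include <stddef.h>

struct modsched_task { int pid; };

/* Toy run queue: at most one runnable task per class, NULL if none. */
struct modsched_rq {
    struct modsched_task *rt_task;
    struct modsched_task *fair_task;
};

/* One scheduler module; "next" points at the next lower-priority module. */
struct modsched_class {
    const struct modsched_class *next;
    struct modsched_task *(*pick_next_task)(struct modsched_rq *rq);
};

struct modsched_task *modsched_pick_rt(struct modsched_rq *rq)
{
    return rq->rt_task;
}

struct modsched_task *modsched_pick_fair(struct modsched_rq *rq)
{
    return rq->fair_task;
}

/* The real-time module comes first, so its tasks run ahead of all others. */
const struct modsched_class modsched_fair_class = { NULL, modsched_pick_fair };
const struct modsched_class modsched_rt_class = {
    &modsched_fair_class, modsched_pick_rt
};

/* Ask each module in priority order; the first one with a task decides. */
struct modsched_task *modsched_pick_next(struct modsched_rq *rq)
{
    const struct modsched_class *c;

    for (c = &modsched_rt_class; c != NULL; c = c->next) {
        struct modsched_task *p = c->pick_next_task(rq);
        if (p != NULL)
            return p;
    }
    return NULL; /* nothing runnable in any module */
}
```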
There is a relatively small set of methods implemented by each scheduler
module, starting with the queueing functions:
void (*enqueue_task) (struct rq *rq, struct task_struct *p);
void (*dequeue_task) (struct rq *rq, struct task_struct *p);
void (*requeue_task) (struct rq *rq, struct task_struct *p);
When a task enters the runnable state, the core scheduler will hand it to
the appropriate scheduler module with enqueue_task(); a task which
is no longer runnable is taken out with dequeue_task(). The
requeue_task() function puts the process behind all others at the
same priority; it is used to implement sched_yield().
A few functions exist for helping the scheduler track processes:
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*task_init) (struct rq *rq, struct task_struct *p);
void (*task_tick) (struct rq *rq, struct task_struct *p);
The core scheduler will call task_new()
when processes are created.
task_init() initializes any needed priority calculations and such;
it can be called when a process is reniced, for example. The
task_tick() function is called from the timer tick to update
accounting and possibly switch to a different process.
The core scheduler can ask a scheduler module whether the currently
executing process should be preempted now:
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
In the CFS scheduler, this check tests the given process's priority against
that of the currently running process, followed by the fairness test. When
the fairness test is done, the scheduling granularity is taken into
account, possibly allowing a process to run a little longer than strict
fairness would allow.
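A minimal sketch of such a check might look like the following. The field names, the convention that a lower prio value means higher priority, and the exact comparison are assumptions for illustration; the actual patch may order or scale these tests differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task state for this sketch. */
struct gran_task {
    int prio;         /* assumed: lower value means higher priority */
    int64_t fair_key; /* assumed: smaller value means more entitled to run */
};

/*
 * Should the newly runnable task p preempt the running task curr?
 * Priority is compared first; between equal priorities, p must be ahead
 * of curr by more than the granularity before a switch is forced.
 */
bool gran_should_preempt(const struct gran_task *curr,
                         const struct gran_task *p, int64_t granularity)
{
    if (p->prio < curr->prio)
        return true;
    if (p->prio > curr->prio)
        return false;
    return p->fair_key + granularity < curr->fair_key;
}
```

A larger granularity makes the last comparison harder to satisfy, which is how the running process gets to keep the CPU a little longer.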
When it's time for the core scheduler to choose a process to run, it will use
these methods:
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
The call to pick_next_task() asks a scheduler module to decide
which process (among those in the class managed by that module) should be
running currently. When a task is switched out of the CPU, the module will
be informed with a call to put_prev_task().
Finally, there's a pair of methods intended to help with load balancing
across CPUs:
struct task_struct * (*load_balance_start) (struct rq *rq);
struct task_struct * (*load_balance_next) (struct rq *rq);
These functions implement a simple iterator which the scheduler can use to
work through all processes currently managed by the scheduling module.
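The start/next pattern is easy to model. In the sketch below, the lb_ names and the cursor kept in the run queue are assumptions made for the example; the patch's modules are free to track their iteration state however they like.

```c
#include <stddef.h>

struct lb_task {
    int pid;
    struct lb_task *next;
};

/* Toy run queue holding a list of tasks plus the iterator position. */
struct lb_rq {
    struct lb_task *head;
    struct lb_task *cursor;
};

/* Reset the iterator and return the first task, or NULL if there is none. */
struct lb_task *lb_balance_start(struct lb_rq *rq)
{
    rq->cursor = rq->head;
    return rq->cursor;
}

/* Advance the iterator and return the next task, or NULL at the end. */
struct lb_task *lb_balance_next(struct lb_rq *rq)
{
    if (rq->cursor != NULL)
        rq->cursor = rq->cursor->next;
    return rq->cursor;
}

/* Example consumer: walk every task the module manages and count them. */
int lb_count_tasks(struct lb_rq *rq)
{
    struct lb_task *p;
    int n = 0;

    for (p = lb_balance_start(rq); p != NULL; p = lb_balance_next(rq))
        n++;
    return n;
}
```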
One assumes that this framework could be used to implement different
scheduling regimes in the future. It might need some filling out; there is,
for example, no way to prioritize scheduling modules (or choose the default
module) other than changing the source. Beyond that, if anybody ever wants to
implement modules which schedule tasks at the same general priority level, the
strict priority ordering of the current framework will have to change - and
that could be an interesting task. But it's a start.
The reason that this development is so surprising is that nobody had really
been talking about modular schedulers. And the reason for that silence is
that pluggable scheduling frameworks had been soundly rejected in the past -
by Ingo Molnar, among others:
So i consider scheduler plugins as the STREAMS equivalent of
scheduling and i am not very positive about it. Just like STREAMS,
i consider 'scheduler plugins' as the easy but deceptive and wrong
way out of current problems, which will create much worse problems
than the ones it tries to solve.
So the obvious question was: what has changed? Ingo has posted an explanation which goes on at some length.
In essence, the previous pluggable scheduler patches were focused on
replacing the entire scheduler rather than smaller pieces of it; they did
not help to make the scheduler simpler.
So now there are three scheduler replacement proposals on the table: SD by
Con Kolivas, CFS by Ingo Molnar, and "nicksched" by Nick Piggin (a
longstanding project which clearly deserves treatment on this page as
well). For the moment, Con appears to have decided to take his marbles and
go home, removing SD from consideration. Still, there are a few options
out there, and one big chance (for now) to replace the core CPU scheduler.
While Ingo's work has been generally well received, not even Ingo is likely
to get a free pass on a decision like this; expect there to be some serious
discussion before an actual replacement of the scheduler is made. Among
other things, that suggests that a new scheduler for 2.6.22 is probably not
in the cards.
Anybody who has tried to figure out why a Linux system is running short of
memory can attest that the memory usage information made available by the
kernel is, at best, difficult to use. Matt Mackall has recently been working
on a set of patches aimed at improving this situation. Given the constraints
imposed by embedded Linux systems, it is not surprising that Matt chose the
Embedded Linux Conference to present his work (which, incidentally, was funded
by the Consumer Electronics Linux Forum).
Matt pointed out that the currently-available information is confusing at
best. The page cache muddies the situation, and the sharing of pages
between applications complicates things even more. The result is that it
is hard to say where memory is being used; one can't even get a definitive
answer to the question of how big a specific application is. More detailed
questions - such as which parts of an application are using the most memory
- are even harder to answer. Trying to answer questions of interest to
embedded systems developers - how many applications can run on a specific
device without pushing it into thrashing, for example - is nearly
impossible without simply running a test.
The problem is that the numbers exported by the current kernels are nearly
meaningless. The reported virtual size of an application is nearly
irrelevant; it says nothing about how much of that virtual space is
actually being used. The resident set size (RSS) number is a little
better, but there is no information on sharing of pages there. The
/proc/pid/smaps file gives a bit of detail, but also lacks
sharing information. And the presence of memory pressure can change the
situation significantly.
The Linux virtual memory system, in other words, is a black box which
provides too little information on what is going on inside. Matt's project
is to open up that box and shine some light inside.
The first step is to add a new file (pagemap) in each process's
/proc directory. It is a binary file containing the page frame
number for each page in the process's address space. The file can be read
to see where a process's pages have been placed and, more interestingly, it
can be compared between processes to see which pages are being shared.
Matt has a little graphical tool which can display this file, showing the
patterns of which pages are present in memory and which are not.
Then, there is a file (/proc/kpagemap) which provides information
about the kernel's memory map. For each physical page in the system,
kpagemap contains the mapping count and the page flags. This
information can be used to learn about sharing of pages and about how each
page is being used. There were a couple of graphical applications using
this file as well; one showed the degree to which each page is being
shared, while the other showed the use of each page as determined by its
flags.
Once this information is available, one can start to generate some useful
numbers on memory use. Matt is proposing two new metrics. The
"proportional set size" (PSS) of a process is the count of pages it has in
memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one
other process, its PSS will be 1500. The unique set size (USS), instead,
is a simple count of unshared pages. It is, for all practical purposes,
the number of pages which will be returned to the system if the process is
killed.
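Both metrics follow directly from per-page mapping counts. The sketch below assumes the caller has already gathered, for each resident page of the process, the number of processes mapping it (for instance by combining the two files described above); the mem_ names are invented for the example.

```c
#include <stddef.h>

struct mem_metrics {
    double pss_pages; /* proportional set size, in pages */
    size_t uss_pages; /* unique set size, in pages */
};

/*
 * mapcount[i] is the number of processes mapping the i-th resident page
 * of the process being measured. Entries of zero are skipped.
 */
struct mem_metrics mem_compute(const unsigned int *mapcount, size_t npages)
{
    struct mem_metrics m = { 0.0, 0 };
    size_t i;

    for (i = 0; i < npages; i++) {
        if (mapcount[i] == 0)
            continue;
        m.pss_pages += 1.0 / mapcount[i]; /* each sharer pays an equal part */
        if (mapcount[i] == 1)
            m.uss_pages++;                /* page belongs to us alone */
    }
    return m;
}
```

Feeding it 1000 pages with a count of one and 1000 pages with a count of two reproduces the example in the text: a PSS of 1500 pages and a USS of 1000.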
These numbers are relatively expensive to calculate, since they require a
pass through the process's address space. So they will not be something
which is regularly exported from the kernel. They can be calculated in
user space using the pagemap files, though. Matt demonstrated a couple of
tools to do these calculations. Using "memstats" on a galeon
process, he supplemented the currently-available virtual size and resident
set size numbers (105MB and 41MB, respectively) with a PSS of 26MB and a
USS of 20MB. There is also a "memrank" tool which lists processes
in the system sorted by decreasing PSS. With a tool like that, finding the
memory hogs on the system becomes a trivial task.
Matt pointed out that these numbers, while useful, will change depending on
the amount of memory pressure being experienced by the system. It would be
nice to be able to figure out how much memory a given process truly needs
before it will begin to thrash. To this end, his patch creates a new
clear_refs file for each process; this file can be used to reset
the "referenced" flag on each page in the process's working set. After the
process runs for a bit, one can look at which pages have had their
referenced bits set again; those are the pages it actually needed to run
during that time.
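The measurement procedure amounts to "clear every referenced bit, let the process run, count the bits that came back." The following toy model shows that cycle on a plain array of flags; it does not touch the kernel interface at all, and the ws_ names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of clearing the referenced flag on every page of a process. */
void ws_clear_refs(bool *referenced, size_t npages)
{
    size_t i;

    for (i = 0; i < npages; i++)
        referenced[i] = false;
}

/* Pages touched since the last clear: an estimate of the working set. */
size_t ws_count_referenced(const bool *referenced, size_t npages)
{
    size_t i, n = 0;

    for (i = 0; i < npages; i++)
        if (referenced[i])
            n++;
    return n;
}
```

In the model, a page "touch" is simply setting its flag to true; on a real system the hardware and the VM code do that as the process runs.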
The patches are in the -mm tree currently; it's possible that they could
find their way into the mainline once the 2.6.22 merge window opens up.
Those who would like to play with Matt's scripts can find them in this directory; the slides from
his talk are packaged there as well. With luck,
understanding system memory usage will require far less guesswork in the
near future.
April 16, 2007
This article was contributed by Aggelos Economopoulos
[Editor's note: this article is the second and final part of Aggelos
Economopoulos's look at the DragonFly BSD virtual kernel. For those who
questioned why a BSD development appears on this page, the answer is simple:
there is value in seeing how others have solved common problems.]
Userspace I/O
Our previous article gave an overview of the DragonFly
virtual kernel and the kernel virtual memory subsystem. In this
article, we can finally cover the complications that present themselves in
implementing such a virtualized execution environment. If you haven't
read the previous article, it would be a good idea to do so before
continuing.
Now that we know how the virtual kernel regains control when its processes
request/need servicing, let us turn to how it goes about satisfying those
requests. Signal transmission and most of the filesystem I/O (read, write, ...),
process control (kill, signal, ...) and net I/O system calls are easy; the
vkernel takes the same code paths that a real kernel would. The only difference
is in the implementation of the copyin()/copyout() family of routines for
performing I/O to and from userspace.
When the real kernel needs to access user memory locations, it must first
make sure that the page in question is resident and will remain in memory for
the duration of a copy. In addition, because it acts on behalf of a user
process, it should adhere to the permissions associated with that process. Now,
on top of that, the vkernel has to work around the fact that the process address
space is not mapped while it is running. Of course, the vkernel knows which
pages it needs to access and can therefore perform the copy by creating a
temporary kernel mapping for the pages in question. This operation is
reasonably fast; nevertheless, it does incur measurable overhead compared to
the host kernel.
Page Faults
The interesting part is dealing with page faults (this includes lazily
servicing mmap()/madvise()/... operations). When a process mmap()s a file (or
anonymous memory) in its address space, the kernel (real or virtual) does not
immediately allocate pages to read in the file data (or locate the pages in the
cache, if applicable), nor does it set up the pagetable entries to fulfill the
request. Instead, it merely notes in its data structures that it has promised
that the specified data will be there when read and that writes to the
corresponding memory locations will not fail (for a writable mapping) and will
be reflected on disk (if they correspond to a file area). Later, if the process
tries to access these addresses (which no longer have valid pagetable entries
(PTEs), if they ever did, because new mappings invalidate old ones), the CPU
throws a pagefault and the fault handling code has to deliver as promised; it
obtains the necessary data pages and updates the PTEs. Following that, the
faulting instruction is restarted.
Consider what happens when a process running on an alternate vmspace of a
vkernel process generates a page fault trying to access the memory region it
has just mmap()ed. The real kernel knows nothing about this and through a
mechanism that will be described later, passes the information about the fault
on to the vkernel. So, how does the vkernel deal with it? The case when the
faulting address is invalid is trivially handled by delivering a signal (SIGBUS
or SIGSEGV) to the faulting vproc. But in the case of a reference to a valid
address, how can the vkernel ensure that the current and succeeding accesses
will complete? Existing system facilities are not appropriate for this task;
clearly, a new mechanism is called for.
What we need is a way for the vkernel to execute mmap-like operations on
its alternate vmspaces. With this functionality available as a set of system
calls, say vmspace_mmap()/vmspace_munmap()/etc, the vkernel code servicing an
mmap()/munmap()/mprotect()/etc vproc call would, after doing some sanity
checks, just execute the corresponding new system call specifying the vmspace
to operate on. This way, the real kernel would be made aware of the required
mapping and its VM system would do our work for us.
The DragonFly kernel provides a vmspace_mmap() and a vmspace_munmap()
like the ones we described above, but none of the other calls we thought we
would need. The reason for this is that it takes a different, non-obvious,
approach
that is probably the most intriguing aspect of the vkernel work. The kernel's
generic mmap code now recognizes a new flag, MAP_VPAGETABLE. This flag
specifies that the created mapping is governed by a userspace virtual pagetable
structure (a vpagetable), the address of which can be set using the new
vmspace_mcontrol() system call (which is an extension of madvise(), accepting an
extra pointer parameter) with an argument of MADV_SETMAP. This software
pagetable structure is similar to most architecture-defined pagetables. The complementary
vmspace_munmap(), not surprisingly, removes mappings in alternate address
spaces. These are the primitives on which the memory management of the virtual
kernel is built.
Table 1. New vkernel-related system calls
int vmspace_create(void *id, int type, void *data);
int vmspace_destroy(void *id);
int vmspace_ctl(void *id, int cmd, struct trapframe *tf,
                struct vextframe *vf);
int vmspace_mmap(void *id, void *start, size_t len, int prot,
                 int flags, int fd, off_t offset);
int vmspace_munmap(void *id, void *start, size_t len);
int mcontrol(void *start, size_t len, int adv, void *val);
int vmspace_mcontrol(void *id, void *start, size_t len, int adv,
                     void *val);
At this point, an overview of the virtual memory map of each
vmspace associated with the vkernel process is in order. When the
virtual kernel starts up, there is just one vmspace for the process and it is
similar to that of any other process that has just begun executing (mainly
consisting of mappings for the heap, stack, program text and libc). During its
initialization, the vkernel mmap()s a disk file that serves the role of physical
memory (RAM). The real kernel is instructed (via madvise(MADV_NOSYNC)) to not
bother synchronizing this memory region with the disk file unless it has to,
which is typically when the host kernel is trying to reclaim RAM pages in a low
memory situation. This is imperative; otherwise all the vkernel "RAM" data
would be treated as valuable by the host kernel and would periodically be
flushed to disk. Using MADV_NOSYNC, the vkernel data will be lost if the system
crashes, just like actual RAM, which is exactly what we want: it is up to the
vkernel to sync user data back to its own filesystem. The memory file is
mmap()ed specifying MAP_VPAGETABLE. It is in this region that all
memory allocations (both for the virtual kernel and its processes) take place.
The pmap module, the role of which is to manage the vpagetables according to
instructions from higher level VM code, also uses this space to create the
vpagetables for user processes.
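The "file as RAM" setup can be approximated with standard calls. This is a sketch under assumptions, not the vkernel's startup code: ram_map_file() is a made-up helper, the DragonFly-specific MAP_VPAGETABLE flag is deliberately left out, and the MADV_NOSYNC hint is only applied where the platform's headers define it.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Size the backing file to len bytes and map it shared and read/write so
 * that it can stand in for "physical" memory. Returns NULL on failure.
 */
void *ram_map_file(int fd, size_t len)
{
    void *p;

    if (ftruncate(fd, (off_t)len) != 0)
        return NULL;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_NOSYNC
    /* BSD hint: do not flush these pages to the file unless forced to. */
    (void)madvise(p, len, MADV_NOSYNC);
#endif
    return p;
}
```

On systems without MADV_NOSYNC the mapping still works, but the host is free to write dirty pages back to the file periodically, which is exactly the cost the hint is meant to avoid.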
On the real kernel side, new vmspaces that are created for these user
processes are very simple in structure. They consist of a single vm_map_entry
that covers the 0 - VM_MAX_USER_ADDRESS address range. This entry is of type
MAPTYPE_VPAGETABLE and the address for its vpagetable has been set (by means of
vmspace_mcontrol()) to point to the vkernel's RAM, wherever the pagetable for
the process has been allocated.
The true vm_map_entry structures are managed by the vkernel's VM
subsystem. For every one of its processes, the virtual kernel maintains the
whole set of vmspace/vm_map, vm_map_entry, vm_object objects that we described
earlier. Additionally, the pmap module needs to keep its own (not to be
described here) data structures. All of the above objects reside in
the vkernel's "physical" memory. Here we see the primary benefit of the
DragonFly approach: no matter how fragmented an alternate vmspace's virtual
memory map is and independently of the amount of sharing of a given page by
processes of the virtual kernel, the host kernel expends a fixed (and
reasonably sized) amount of memory for each vmspace. Also, after the initial
vmspace creation, the host kernel's VM system is taken out of the equation
(except for pagefault handling), so that when vkernel processes require VM
services, they only compete among themselves for CPU time and not with the host
processes. Compared to the "obvious" solution, this approach saves large
amounts of host kernel memory and achieves a higher degree of isolation.
Now that we have grasped the larger picture, we can finally examine our
"interesting" case: a page fault occurs while the vkernel process is using one
of its alternate vmspaces. In that case, the vm_fault() code will notice it is
dealing with a mapping governed by a virtual pagetable and proceed to walk the
vpagetable much like the hardware would. Suppose there is a valid entry in the
vpagetable for the faulting address; then the host kernel simply updates its
own pagetable and returns to userspace. If, on the other hand, the search
fails, the pagefault is passed on to the vkernel which has the necessary
information to update the vpagetable or deliver a signal to the faulting vproc
if the access was invalid. Assuming the vpagetable was updated, the next time
the vkernel process runs on the vmspace that caused the fault, the host kernel
will be able to correct its own pagetable after searching the vpagetable as
described above.
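The host-side decision can be sketched as a software walk of a two-level table. Everything here is illustrative: the vpt_ names, the 10/10/12 address split, and the entry layout are assumptions modeled on a classic 32-bit pagetable, not the DragonFly vpagetable format.

```c
#include <stddef.h>
#include <stdint.h>

#define VPT_VALID   0x1u /* assumed "entry is valid" flag */
#define VPT_ENTRIES 1024 /* assumed entries per table level */

typedef uint32_t vpt_entry; /* assumed layout: frame number << 12 | flags */

/* Top level: pointers to leaf tables, NULL where nothing is mapped. */
struct vpt_dir {
    const vpt_entry *tables[VPT_ENTRIES];
};

enum vpt_result {
    VPT_INSTALLED,         /* host pagetable updated, return to userspace */
    VPT_FORWARD_TO_VKERNEL /* no valid entry: the vkernel must handle it */
};

/*
 * Walk the software pagetable for vaddr much as the hardware would. On a
 * hit, hand back the entry the host should install in its own pagetable.
 */
enum vpt_result vpt_handle_fault(const struct vpt_dir *dir, uint32_t vaddr,
                                 uint32_t *host_pte_out)
{
    uint32_t dir_index = vaddr >> 22;
    uint32_t table_index = (vaddr >> 12) & 0x3ffu;
    const vpt_entry *leaf = dir->tables[dir_index];

    if (leaf == NULL || !(leaf[table_index] & VPT_VALID))
        return VPT_FORWARD_TO_VKERNEL;

    *host_pte_out = leaf[table_index];
    return VPT_INSTALLED;
}
```

The sketch ignores the complication that a table page might itself be paged out; a real walker has to make each level resident before reading it.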
There are a few complications to take into account, however. First of
all, any level of the vpagetable might be paged out. This is straightforward to
deal with; the code that walks the vpagetable must make sure that a page is
resident before it tries to access it. Secondly, the real and virtual
kernels must work together to update the accessed and modified bits in
the virtual pagetable entries (VPTEs). Traditionally, in
architecture-defined pagetables, the hardware conveniently sets those
bits for us. The hardware knows nothing about vpagetables, though.
Ignoring the problem altogether is not a viable solution. The
availability of these two bits is necessary in order for the VM subsystem
algorithms to be able to decide if a page is heavily used and whether it
can be easily reclaimed or not (see [AST06]). Note
that the different semantics of the modified and accessed bits mean that we are
dealing with two separate problems.
Keeping track of the accessed bit turns out to require a minimal
amount of work. To explain this, we need to give a short, incomplete,
description of how the VM subsystem utilizes the accessed bit to keep
memory reference statistics for every physical page it manages. When the
DragonFly pageout daemon is awakened and begins scanning pages, it first
instructs the pmap subsystem to free whatever memory it can that is consumed by
process pagetables, updating the physical page reference and modification
statistics from the PTEs it throws away. Until the next scan, any pages that are
referenced will cause a pagefault and the fault code will have to set the
accessed bit on the corresponding pte (or vpte). As a result, the hardware is
not involved[4]. The behavior of the virtual kernel is identical to that just
sketched above, except that in this case page faults are more expensive since
they must always go through the real kernel.
While the advisory nature of the accessed bit gives us the flexibility to
exchange a little bit of accuracy in the statistics to avoid a considerable
loss in performance, this is not an option in emulating the modified bit. If
the data has been altered via some mapping the (now "dirty") page cannot be
reused at will; it is imperative that the data be stored in the backing object
first. The software is not notified when a pte has the modified bit set in
the hardware pagetable. To work around this, when a vproc requests a writable
mapping for a page, the host kernel will disallow writes in the pagetable
entry that it instantiates. This way, when the vproc
tries to modify the page data, a fault will occur and the relevant code will
set the modified bit in the vpte. After that, writes on the page can finally be
enabled. Naturally, when the vkernel clears the modified bit in the vpagetable
it must force the real kernel to invalidate the hardware pte so that it can
detect further writes to the page and again set the bit in the vpte, if
necessary.
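The write-protect trick can be modeled as a small state machine over a software entry and a host entry. The dirty_ structures and functions below are invented for this sketch and reduce each entry to a pair of booleans.

```c
#include <stdbool.h>

/* The vkernel's software entry for a page (simplified). */
struct dirty_vpte {
    bool writable; /* the vproc is allowed to write this page */
    bool modified; /* emulated "dirty" bit */
};

/* The host's hardware entry for the same page (simplified). */
struct dirty_hpte {
    bool present;
    bool write_enabled;
};

/* Host instantiates the mapping: always read-only at first. */
void dirty_install(const struct dirty_vpte *v, struct dirty_hpte *h)
{
    (void)v; /* writes stay off even when the vpte says writable */
    h->present = true;
    h->write_enabled = false;
}

/*
 * Write fault on the page. Returns true if the write may now proceed,
 * false if the access is a genuine protection violation.
 */
bool dirty_write_fault(struct dirty_vpte *v, struct dirty_hpte *h)
{
    if (!v->writable)
        return false;
    v->modified = true;      /* record the first write in the vpte */
    h->write_enabled = true; /* later writes no longer fault */
    return true;
}

/* The vkernel has cleaned the page: clear the bit, invalidate the host pte. */
void dirty_clear_modified(struct dirty_vpte *v, struct dirty_hpte *h)
{
    v->modified = false;
    h->present = false;
    h->write_enabled = false;
}
```

After dirty_clear_modified(), the next access goes through dirty_install() again, so a subsequent write faults once more and sets the bit anew.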
Floating Point Context
Another issue that requires special treatment is saving and
restoring of the state of the processor's Floating Point Unit (FPU) when
switching vprocs. To the real kernel, the FPU context is a per-thread
entity. On a thread switch, it is always saved[5]
and machine-dependent arrangements are made that will force an exception
("device not available" or DNA) the first time that the new thread (or any
thread that gets scheduled later) tries to access the FPU[6]. This gives the kernel
the opportunity to restore the proper FPU context so that floating point
computations can proceed as normal.
Now, the vkernel needs to perform similar tasks if one of its
vprocs throws an exception because of missing FPU context. The only
difficulty is that it is the host kernel that initially receives the
exception. When such a condition occurs, the host kernel must
first restore the vkernel thread's FPU state, if another host thread was given
ownership of the FPU in the meantime. The virtual kernel, on the other
hand, is only interested in the exception if it has some saved context to
restore. The correct behavior is obtained by having the vkernel inform the real
kernel whether it also needs to handle the DNA exception. This is done by
setting a new flag (PGEX_FPFAULT) in the trapframe argument of vmspace_ctl(). Of
course, the flag need not be set if the to-be-run virtualized thread is the
owner of the currently loaded FPU state. The existence of PGEX_FPFAULT causes
the vkernel host thread to be tagged with FP_VIRTFP. If the host kernel notices
said tag when handed a "device not available" condition, it will restore the
context that was saved for the vkernel thread, if any, before passing the
exception on to the vkernel.
Platform drivers
Just like for ports to new hardware platforms, the changes made for
vkernel are confined to a few parts of the source tree and most of the kernel code
is not aware that it is in fact running as a user process. This applies to
filesystems, the vfs, the network stack and core kernel code. Hardware device
drivers are not needed or wanted and special drivers have been developed
to allow the vkernel to communicate with the outside world. In this
subsection, we will briefly mention a couple of places in the platform code
where the virtual kernel needs to differentiate itself from the host
kernel. These examples should make clear how much easier it is to emulate
platform devices using the high level primitives provided by the host
kernel, than dealing directly with the hardware.
Timer. The DragonFly
kernel works with two timer types. The first type provides an abstraction for a
per-CPU timer (called a systimer) implemented on top of a cputimer. The latter
is just an interface to a platform-specific timer. The vkernel implements one
cputimer using kqueue's EVFILT_TIMER. kqueue is the BSD high performance event
notification and filtering facility described in some detail in
[Lemon00]. The EVFILT_TIMER filter provides access to a periodic or
one-shot timer. In DragonFly, kqueue has been extended with signal-driven I/O
support (see [Stevens99]) which, coupled with a signal mailbox delivery
mechanism, allows for fast and very low overhead signal reception. The vkernel
makes full use of the two
extensions.
Console. The system console is simply the terminal from which the vkernel
was executed. It should be mentioned that the vkernel applies special
treatment to some of the signals that might be generated by this
terminal; for instance, SIGINT will drop the user to the in-kernel
debugger.
Virtual Device Drivers
The virtual kernel disk driver exports a standard disk driver
interface and provides access to an externally specified file. This file
is treated as a disk image and is accessed with a combination of the read(),
write() and lseek() system calls. It is probably the simplest driver in the kernel
tree, even with the memio driver for /dev/zero included in the comparison.
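To give a feel for how little such a driver has to do, here is a sketch of sector-level access to a disk image using only lseek(), read() and write(). The img_ function names and the fixed 512-byte sector size are assumptions for the example; the real driver's interface is not shown in this article.

```c
#include <sys/types.h>
#include <unistd.h>

#define IMG_SECTOR_SIZE 512 /* assumed sector size for this sketch */

/* Read one sector from an open image file. Returns 0 on success, -1 on error. */
int img_read_sector(int fd, off_t sector, void *buf)
{
    ssize_t n;

    if (lseek(fd, sector * IMG_SECTOR_SIZE, SEEK_SET) == (off_t)-1)
        return -1;
    n = read(fd, buf, IMG_SECTOR_SIZE);
    return n == IMG_SECTOR_SIZE ? 0 : -1; /* short reads count as errors */
}

/* Write one sector to an open image file. Returns 0 on success, -1 on error. */
int img_write_sector(int fd, off_t sector, const void *buf)
{
    ssize_t n;

    if (lseek(fd, sector * IMG_SECTOR_SIZE, SEEK_SET) == (off_t)-1)
        return -1;
    n = write(fd, buf, IMG_SECTOR_SIZE);
    return n == IMG_SECTOR_SIZE ? 0 : -1;
}
```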
VKE implements an ethernet interface (in the vkernel) that tunnels all the
packets it gets to the corresponding tap interface in the host kernel. It is a
typical example of a network interface driver, with the exception that its
interrupt routine runs as a response to an event notification by kqueue. A
properly configured vke interface is the vkernel's window to the outside
world.
Bibliography
[McKusick04] The Design and Implementation of the FreeBSD Operating System, Kirk McKusick and George Neville-Neil.
[Dillon00] Design elements of the FreeBSD VM system, Matthew Dillon.
[Lemon00] Kqueue: A generic and scalable event notification facility, Jonathan Lemon.
[AST06] Operating Systems Design and Implementation, Andrew Tanenbaum and Albert Woodhull.
[Provos03] Improving Host Security with System Call Policies, Niels Provos.
[Stevens99] UNIX Network Programming, Volume 1: Sockets and XTI, Richard Stevens.
Notes
[4] Well not really, but a thorough VM walkthrough is out of scope here.
[5] This is not optimal; x86 hardware supports fully lazy FPU save, but the current implementation does not take advantage of that yet.
[6] The kernel will occasionally make use of the FPU itself, but this does not directly affect the vkernel related code paths.
[7] Or any alternative stack the user has designated for signal delivery.
Page editor: Jonathan Corbet