Brief items
The current 2.6 prepatch is 2.6.21-rc2,
released by Linus on
February 27. This prepatch contains a big Video4Linux update, a big
PA-RISC architecture update, the beginning of "SMARTMIPS" support, a driver
for Davicom DM9601 USB ethernet adapters, a driver for Code Mercenaries "IO
Warrior" devices, and HID support in the Bluetooth subsystem. Several
patches were also reverted in -rc2 as a result of regressions.
Says Linus: "This is not how an -rc2 should look. Need to really calm
things down!" See the changelog for the details.
As of this writing, there have been no commits to the mainline repository
since -rc2 was released.
There have been no -mm releases over the last week.
On the stable front: 2.6.19.5 and 2.6.18.8 were both released on
February 23. They contain a fair number of fixes. Further updates to
2.6.18 are unlikely; there will probably be one more 2.6.19 release in the
near future.
2.6.16.42 was released on
February 26 with several fixes, some of which are security-related.
Kernel development news
Because if you don't see why I'm complaining, I can't pull from
you. You can send me patches, but for me to pull a git patch from
you, I need to know that you know what you're doing, and I need to
be able to trust things *without* then having to go and check every
individual change by hand.
-- Linus Torvalds
Progress in the virtualization world sometimes seems slow. Xen has been
the hot topic in the paravirtualization area for some years now - the first
"stable" release was
announced
in 2003 - but the code remains outside of the mainline Linux kernel. News
from that project has been relatively scarce as of late - though the Xen
hackers are certainly still out there working on the code.
On the other hand, KVM
appears to be on the fast path. This project first surfaced in
October, 2006; it found its way into the 2.6.20 kernel a few months later.
On February 25, KVM 15 was announced; this release has an
interesting new feature: live migration. The speed with which the KVM
developers have been able to add relatively advanced features is
impressive; equally impressive is just how simple the code which implements
live migration is.
KVM starts with a big advantage over other virtualization projects: it
relies on support from the hardware, which is only available in recent
processors. As a result, KVM will not work on the bulk of
currently-deployed systems. On the other hand, designing for future
hardware is often a good idea - the future tends to come quickly in the
technology world. By focusing on hardware-supported virtualization, KVM
is able to concentrate on developing interesting features to run on the systems
that companies are buying now.
The migration code is built into the QEMU emulator; the relevant source
file is less than 800 lines long. The live migration task comes down to
the following steps:
- A connection is made to the destination system. This can currently be
done with a straight TCP connection to an open port on the destination
(which would not be the most secure way to go) or by way of ssh.
- The guest's memory is copied to the destination. This process is just
a matter of looping through the guest's physical address space (which
is just virtual memory on the host side) and sending it, one page at a
time, to the destination system. As each page is copied, it is made
read-only for the guest.
- The guest is still running while this copy process is happening.
Whenever it tries to modify a page which has already been copied, it
will trap back into QEMU, which restores write access and marks the
page dirty. Copying memory thus becomes an iterative process; once
the entire range has been done, the migration code loops back to the
beginning and re-copies all pages which have been modified by the
guest. The hope is that the list of pages which must be copied
shrinks with each pass over the space.
- Once the number of dirty pages goes below a threshold, the guest
system is stopped and the remaining pages are copied. Then it's just
a matter of transmitting the current state of the guest (registers, in
particular) and the job is done; the migrated guest can be restarted
on its new host system.
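The iterative copy described in the steps above can be sketched as a small user-space simulation. This is a toy model and not the QEMU code: the names (toy_guest, guest_write(), migrate(), demo_step()), the tiny page size, and the max_passes safety limit are all assumptions made for illustration, and the "guest" runs between passes rather than truly concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64  /* toy value, far smaller than a real page */
#define TOY_NPAGES    8

struct toy_guest {
    unsigned char mem[TOY_NPAGES][TOY_PAGE_SIZE];
    bool dirty[TOY_NPAGES];  /* true: page must be (re)sent */
};

/* Hypothetical hook that lets the simulated guest run after each pass. */
typedef void (*guest_step_fn)(struct toy_guest *g, int pass);

/* A guest write marks the page dirty; this stands in for the write
 * fault that traps back into QEMU in the description above. */
void guest_write(struct toy_guest *g, size_t page, size_t off, unsigned char v)
{
    assert(page < TOY_NPAGES && off < TOY_PAGE_SIZE);
    g->mem[page][off] = v;
    g->dirty[page] = true;
}

size_t count_dirty(const struct toy_guest *g)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_NPAGES; i++)
        n += g->dirty[i] ? 1 : 0;
    return n;
}

/* Send every dirty page and mark it clean (the "read-only" state). */
static size_t copy_pass(struct toy_guest *g,
                        unsigned char dst[TOY_NPAGES][TOY_PAGE_SIZE])
{
    size_t sent = 0;
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        if (!g->dirty[i])
            continue;
        memcpy(dst[i], g->mem[i], TOY_PAGE_SIZE);
        g->dirty[i] = false;
        sent++;
    }
    return sent;
}

/* Returns the number of passes made while the guest was still running. */
int migrate(struct toy_guest *g, unsigned char dst[TOY_NPAGES][TOY_PAGE_SIZE],
            size_t threshold, int max_passes, guest_step_fn step)
{
    assert(g != NULL && dst != NULL && max_passes > 0);
    for (size_t i = 0; i < TOY_NPAGES; i++)
        g->dirty[i] = true;            /* the first pass sends everything */

    int pass = 0;
    while (pass < max_passes) {
        copy_pass(g, dst);
        pass++;
        if (step != NULL)
            step(g, pass);             /* the guest keeps running */
        if (count_dirty(g) < threshold)
            break;                     /* few enough pages left */
    }
    copy_pass(g, dst);                 /* guest stopped: send the remainder */
    return pass;
}

/* Example guest whose working set shrinks from pass to pass. */
void demo_step(struct toy_guest *g, int pass)
{
    if (pass == 1) {
        guest_write(g, 1, 0, 0xAA);
        guest_write(g, 2, 0, 0xAA);
        guest_write(g, 3, 0, 0xAA);
    } else if (pass == 2) {
        guest_write(g, 2, 1, 0xBB);
    }
}
```

Calling migrate(&g, dst, 2, 10, demo_step) on a zeroed toy_guest leaves dst identical to the guest's memory. The max_passes limit covers the case the article only hints at: a guest which dirties pages faster than they can be sent, so that the list never shrinks.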
As it happens, guest systems can be moved between Intel and AMD processors
with no problems at all. Moving a 64-bit guest to a 32-bit host remains
impossible; the KVM developers appear uninterested in fixing this
particular limitation anytime soon. A little more information can be found
on the KVM migration
page.
The other feature of note is the announced plan to freeze the KVM interface
for 2.6.21. This interface has been evolving quickly, despite the fact
that it is a user-space API; this flexibility has been allowed because KVM
is new, experimental, and has no real user base yet. The freezing of the
API suggests that the KVM developers think things are reaching a stable
point where KVM can be put to work in production systems. Perhaps that
means that, soon, we'll find out how Qumranet, the company which has been
funding the KVM work, plans to make its living.
Remember
fibrils? The memory
may be dim, seeing as the fibril concept was posted way back in January,
but the work inspired by this idea continues. The latest
syslet patch from Ingo Molnar
was posted on February 24; it brings some interesting changes to this
approach to asynchronous system call execution.
The concept of "atoms" which was part of the first syslet patch remains;
an atom is a unit of work which is executed in kernel space. Atoms can be
chained together with some simple flow control operations, with the entire
sequence being executed without leaving the kernel. A sequence of atoms
will be executed synchronously if possible; if an atom blocks, however, a
new thread will be created to return to user space. As a result,
asynchronous code can be executed in parallel, but the overhead of thread
creation is only incurred when there is a need for it.
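The chaining idea can be illustrated with a toy interpreter which runs entirely in user space. Nothing here is the syslet API: the toy_atom structure, the TOY_STOP_ON_NEGATIVE flag, and toy_run_chain() are hypothetical names invented for this sketch, and the blocking case (where the kernel would hand user space a new thread) is not modeled at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flow-control flag: end the chain on a negative result. */
#define TOY_STOP_ON_NEGATIVE 0x1u

/* A toy "atom": one unit of work plus a link to the next one. */
struct toy_atom {
    long (*fn)(long arg);     /* stands in for a system call */
    long arg;
    unsigned flags;
    long ret;                 /* result, filled in when the atom runs */
    struct toy_atom *next;
};

/* Run a chain to completion in one call, much as a chain of atoms
 * runs without leaving the kernel. Returns the last atom executed. */
struct toy_atom *toy_run_chain(struct toy_atom *atom)
{
    struct toy_atom *last = NULL;

    while (atom != NULL) {
        assert(atom->fn != NULL);
        atom->ret = atom->fn(atom->arg);
        last = atom;
        if ((atom->flags & TOY_STOP_ON_NEGATIVE) && atom->ret < 0)
            break;
        atom = atom->next;
    }
    return last;
}

/* Two trivial units of work for demonstration. */
long toy_double(long x) { return 2 * x; }
long toy_fail(long x)   { (void)x; return -1; }
```

With a chain of three atoms in which the second one fails and carries the stop flag, toy_run_chain() returns a pointer to the second atom and the third never runs.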
The syslet API has changed, however, in response to some concerns about how
completion events were handled. User space must now create a
structure to go along with the atom sequence:
struct async_head_user {
    unsigned long kernel_ring_idx;
    unsigned long user_ring_idx;
    struct syslet_uatom __user **completion_ring;
    unsigned long ring_size_bytes;
    /* There is other stuff here too */
};
This structure defines the completion ring - a circular buffer which is
filled (by the kernel) with pointers to atoms which have completed
execution. There is no longer a need to register this buffer with the
kernel; instead, the structure is passed in when the atoms are passed to
the kernel for execution:
struct syslet_uatom *async_exec (struct syslet_uatom *atom,
                                 struct async_head_user *ahu);
An implication of this new interface is that each chain of atoms can, if
desired, have its own completion ring. These rings are no longer pinned
into memory, so there can be an arbitrary number of them. The return value
from async_exec() will be a pointer to the last atom to execute if
the chain runs without blocking, or NULL if the chain blocked and
user space is running in a new thread.
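How user space might drain such a ring can be sketched as follows. The four structure fields come from the description above (minus the kernel's __user annotation); everything else is an assumption: the placeholder syslet_uatom contents, the convention that an empty slot holds NULL, and the fake_kernel_complete() function which stands in for the kernel. The model is single-threaded, so it ignores the memory-ordering questions a real implementation would face.

```c
#include <assert.h>
#include <stddef.h>

struct syslet_uatom { int id; };   /* placeholder contents, for the demo */

struct async_head_user {
    unsigned long kernel_ring_idx;          /* next slot the kernel fills */
    unsigned long user_ring_idx;            /* next slot user space reads */
    struct syslet_uatom **completion_ring;
    unsigned long ring_size_bytes;
};

static unsigned long ring_slots(const struct async_head_user *ahu)
{
    return ahu->ring_size_bytes / sizeof(ahu->completion_ring[0]);
}

/* Stand-in for the kernel side: post one completed atom.
 * Returns 0 on success, -1 if the slot has not been consumed yet. */
int fake_kernel_complete(struct async_head_user *ahu, struct syslet_uatom *atom)
{
    unsigned long idx = ahu->kernel_ring_idx;

    assert(atom != NULL);
    if (ahu->completion_ring[idx] != NULL)
        return -1;
    ahu->completion_ring[idx] = atom;
    ahu->kernel_ring_idx = (idx + 1) % ring_slots(ahu);
    return 0;
}

/* User side: take the next completed atom, or NULL if none is ready. */
struct syslet_uatom *ring_pop(struct async_head_user *ahu)
{
    unsigned long idx = ahu->user_ring_idx;
    struct syslet_uatom *atom = ahu->completion_ring[idx];

    if (atom == NULL)
        return NULL;
    ahu->completion_ring[idx] = NULL;       /* hand the slot back */
    ahu->user_ring_idx = (idx + 1) % ring_slots(ahu);
    return atom;
}
```

Since the ring is just ordinary user-space memory described by the structure, a program could keep one such structure per chain of atoms, which is the flexibility the new interface is meant to provide.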
Jens Axboe, Suparna Bhattacharya, and others have been doing some
benchmarking with the current syslet code. Many (but not all) of the
benchmark runs show that syslets perform better than the current
asynchronous I/O implementation. The causes for the divergence between
results are still being investigated; one thing that has come out is that
the CFQ I/O scheduler does not work properly with syslets. CFQ takes a
process-oriented approach to scheduling, so it is not entirely surprising
that changes to the process model could prove confusing there.
Nonetheless, Ingo is confident that syslets
are a performance win:
[I]n my own (FIO based) measurements syslets beat the native KAIO
interfaces both in the cached and in the non-cached [== many
threads] case. I did not expect the latter at all: the non-cached
syslet codepath is not optimized at all yet, so i expected it to
have (much) higher CPU overhead than KAIO.
This means that KAIO is in worse shape than i thought - there's
just way too much context KAIO has to build up to submit parallel
IO contexts. Many years of optimizations went into KAIO already,
so it's probably at its outer edge of performance capabilities.
Perhaps the biggest change in the new patch set, however, is the creation
of a new concept known as "threadlets." The threadlet idea brings the
on-demand thread creation idea to user space. Threadlets are ordinary
user-space code which will be run synchronously if possible; should this
code block, however, a new thread will be created to allow user space to
continue while the threadlet waits.
The API as described by Ingo requires the application to define a function
to run as a threadlet:
long threadlet_fn(void *data)
{
    /* Almost anything can go here */
    return complete_threadlet_fn(event, ahu);
}
About the only thing which is different here is that the call to
complete_threadlet_fn() is required:
long complete_threadlet_fn(void *event, struct async_head_user *ahu);
The event parameter is stored in the completion ring; since there
is no atom structure here, user space must provide a value to identify
which threadlet completed. The async_head_user structure
describes the completion ring, as before.
The application can fire off a
threadlet with:
long threadlet_exec(long threadlet_fn(void *),
                    unsigned long stack,
                    struct async_head_user *ahu);
Besides the threadlet_fn() described above, this call requires
that the application provide stack space for the new threadlet. The
stack argument is thus a pointer (despite its unsigned
long type) to a few pages of ordinary user-space memory set aside for
this purpose. There is also an async_head_user structure to
provide for the reporting of threadlet completion. If
threadlet_fn() runs to completion without blocking, the return
value of threadlet_exec() will be 1; otherwise zero is
returned.
As it happens, threadlet_exec() is a user-space wrapper which
hides much of the complexity of the real interface. This function switches
over to the given stack immediately, then calls
threadlet_on(), which is a true system call, passing it the
original stack address as a parameter. This call saves that stack address,
ensures that a "cache miss thread" will be available if needed, and marks
the process as running in an asynchronous mode. It then returns to user
space, which executes the user's threadlet_fn(). Should that
function block, the kernel will grab a new thread, set it up with the
original stack, and send it back to user space. The threadlet function
will then continue to execute in the original thread once the condition
which blocked it is resolved.
Unsurprisingly, complete_threadlet_fn() is also a wrapper. It
calls threadlet_off() to indicate that the execution of the
threadlet is complete. If threadlet_off() returns 1, the
threadlet ran synchronously and there is no more to do. Otherwise, a call
is made to:
long async_thread(void *event, struct async_head_user *ahu);
This system call will store event in the completion ring. Since
this thread is running asynchronously, returning to user space is not in
the cards - user space went its own way when things first blocked. So
async_thread() puts the current thread onto the list of threads
available the next time one is needed for asynchronous execution.
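The control flow of the completion wrapper, as described above, can be sketched with stubs in place of the two system calls. This is not Ingo's code: stub_threadlet_off() and stub_async_thread() are hypothetical stand-ins, the argument list of threadlet_off() is an assumption (the article does not give it), and so is the return value in the synchronous case. In particular, the real async_thread() would not return to user space, while the stub simply records its arguments and returns.

```c
#include <assert.h>
#include <stddef.h>

struct async_head_user;   /* opaque here; only passed through */

/* Test knobs and records for the stubs below. */
static long stub_threadlet_off_result = 1;
static void *stub_last_event = NULL;
static int stub_async_thread_calls = 0;

/* Stand-in for threadlet_off(): 1 means "ran synchronously". */
static long stub_threadlet_off(void)
{
    return stub_threadlet_off_result;
}

/* Stand-in for async_thread(): the real call would store the event in
 * the completion ring and park the thread; this one only takes notes. */
static long stub_async_thread(void *event, struct async_head_user *ahu)
{
    (void)ahu;
    stub_last_event = event;
    stub_async_thread_calls++;
    return 0;
}

/* Sketch of the wrapper logic described in the text. */
long sketch_complete_threadlet_fn(void *event, struct async_head_user *ahu)
{
    if (stub_threadlet_off() == 1)
        return 1;                       /* synchronous: nothing more to do */
    return stub_async_thread(event, ahu);   /* asynchronous: report it */
}
```

The point of the split is that the common, non-blocking case costs a single threadlet_off() call, with the completion ring only touched when the threadlet actually went asynchronous.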
The above description has left out a couple of details, mostly related to
the management of user-space stacks. It's worth noting that there appears
to be no guard page put at the end of a threadlet stack, meaning that, if
the stack is too small, user space could easily overflow it. The result
would likely be some truly obscure bugs which would not be fun to find.
This API could also change a bit; Ingo apparently has plans for turning
threadlet_on() and threadlet_off() into vsyscalls which
could execute without going into the kernel at all. That, of course, would
improve the performance of threadlets further.
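An application worried about the missing guard page could provide one itself when allocating a threadlet stack. The following is a minimal sketch using the standard mmap() and mprotect() calls; the guarded_stack structure and function names are invented for this example, and it assumes a stack which grows downward, so the inaccessible page sits at the lowest address. Whether threadlet_exec() expects the base or the top of the stack area is not stated in the description above.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct guarded_stack {
    void *base;     /* start of the mapping, which is the guard page */
    size_t total;   /* length of the whole mapping, guard included */
    void *top;      /* one past the highest usable byte */
};

/* Allocate usable_pages of stack with one inaccessible page below it.
 * Returns 0 on success, -1 on failure. */
int guarded_stack_alloc(struct guarded_stack *gs, size_t usable_pages)
{
    assert(gs != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || usable_pages == 0)
        return -1;

    size_t total = (usable_pages + 1) * (size_t)page;
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* An overflow now faults immediately instead of silently
     * scribbling over whatever lies below the stack. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, total);
        return -1;
    }

    gs->base = base;
    gs->total = total;
    gs->top = (char *)base + total;
    return 0;
}

void guarded_stack_free(struct guarded_stack *gs)
{
    assert(gs != NULL);
    munmap(gs->base, gs->total);
}
```

The cost is one extra page of address space per stack, which turns an obscure memory-corruption bug into an immediate segmentation fault at the point of overflow.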
While the syslet interface provided interesting functionality, it was
immediately seen as being hard to work with. The new threadlet API was
designed to get around those objections by getting away from the whole
"atom" concept and making it possible to run user-space code asynchronously
with a minimum of fuss. The syslet mechanism is likely to remain, as it
will still be the fastest way to get a task done. But syslets may see
little use outside of special-purpose libraries which hide their
complexity. For everything else, threadlets could prove to be the way to
go.
The ongoing discussion of threadlets (or fibrils, or whatever they will be
called next week) has considered the addition of a major new API to the
kernel. This discussion has, however, studiously ignored an important
question: what about the longstanding kevent patch which, at some level,
solves the same problems? The motivation for the first fibril patch was to
make it easier to provide comprehensive asynchronous I/O in the kernel -
and that was one of the reasons for kevents as well. So it has been
surprising that kevents have not figured into this conversation.
Kevents have finally become part of the discussion, however, resulting in
an interesting exchange between kevent hacker Evgeniy Polyakov, threadlet
(and everything else) hacker Ingo Molnar, and several others as well.
Benchmarks have been thrown around to illustrate the performance
characteristics of both approaches, but the real question is this: what is
the best way to allow user-space applications to juggle multiple
simultaneous operations in a scalable manner?
Evgeniy's core claim appears to be that an event-oriented approach is
inherently more scalable than using threads. He says:
If things decreases performance noticeably, it is a bad things, but
it is matter of taste. Anyway, kevents are very small, threads are
very big, and both are the way they are exactly on purpose -
threads serve for processing of any generic code, kevents are used
for event waiting - IO is such an event, it does not require a lot
of infrastructure to handle, it only needs some simple bits, so it
can be optimized to be extremely fast, with huge infrastructure
behind each IO (like in case when it is a separated thread) it can
not be done effectively.
In other words, using threads for event management is simply too slow.
David Miller has also argued that threads
are inherently wrong for network-oriented tasks. One of the big advantages
behind the threadlet approach is that it is very fast in the non-blocking
case, which is expected to be the situation much of the time. In
networking, however, one normally expects to block. As a result, a highly
multi-threaded networking application could create massive numbers of
threads in short order. Networking is inherently an event-oriented
activity.
Ingo challenges the notion that using
threads and the scheduler will be slower than maintaining lists of jobs
which turn into events:
To me the picture is this: conceptually the scheduler runqueue is a
queue of work. You get items queued upon certain events, and they
can unqueue themselves. (there is also register context but that is
already optimized to death by hardware) So whatever scheduling
overhead we have, it's a pure software thing...
Now look at kevents as the queueing model. It does not queue
'tasks', it lets user-space queue requests in essence, in various
states. But it's still the same conceptual thing: a memory buffer
with some state associated to it. Yes, it has no legacies, it has
no priorities and other queueing concepts attached to it
... yet. If kevents got mainstream, it would get the same kind of
pressure to grow 'more advanced' event queueing and event
scheduling capabilities. Prioritization would be needed, etc.
The point here is that the scheduler has been brutally optimized over the
course of many years. The actual overhead of switching contexts is quite
small - perhaps less than that of a system call to manage events. The only
real difference is that the memory overhead of maintaining threads is quite
a bit higher than the overhead of kevents. But, says Ingo, with proper
programming that should not be an insurmountable problem.
The real issue, though, tends to be one of ease of programming - on both
the kernel and the user sides. In user space, the classic pattern for an
event-based application involves a central loop which only blocks when it
is waiting for events. Any actual work done within the loop must happen in
a non-blocking manner; should the loop block, events will pile up while the
application is doing nothing. Blocking in the wrong place can kill
performance. But avoiding blocking in all situations is
tricky at best, and sometimes impossible. The threadlet model lets the
application developer stop worrying about blocking; if an operation blocks,
the application simply continues to run in a newly-created thread.
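For reference, the classic event-loop pattern looks something like the sketch below, which uses the standard poll() call. The function names (event_loop_once(), count_byte()) and the handler signature are made up for this example; a real application would call event_loop_once() repeatedly from its central loop. Note that the handler reads only what poll() has reported as available; any blocking call made from inside it would stall every other descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Handlers run inside the loop and so must never block. */
typedef void (*event_handler)(int fd, void *ctx);

/* One iteration of the central loop: wait up to timeout_ms for events,
 * then dispatch each readable descriptor. Returns the number of
 * handlers called, 0 on timeout, or -1 on error. */
int event_loop_once(struct pollfd *fds, nfds_t nfds, int timeout_ms,
                    event_handler on_readable, void *ctx)
{
    assert(fds != NULL && on_readable != NULL);

    int ready = poll(fds, nfds, timeout_ms);   /* the only blocking point */
    if (ready <= 0)
        return ready;

    int dispatched = 0;
    for (nfds_t i = 0; i < nfds; i++) {
        if (fds[i].revents & POLLIN) {
            on_readable(fds[i].fd, ctx);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example handler: consume one byte and count it. The read will not
 * block because poll() has just reported the descriptor readable. */
void count_byte(int fd, void *ctx)
{
    char c;

    if (read(fd, &c, 1) == 1)
        (*(int *)ctx)++;
}
```

The contrast with threadlets is in the handler: under this model the programmer must guarantee that it never blocks, while the threadlet model lets it block and moves the rest of the application to another thread.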
More generally, programs written as state machines - the style
necessitated by event-driven models - tend to be hard for people to
understand. And there are a number of kernel operations (opening a file,
for example) which can block in any of a number of places, and which are
just about impossible to code in a state-machine style. Multi-threaded
programs present their own challenges for developers who are not prepared
to think about concurrency issues, but they still tend to be easier for
most to understand. Threadlets, by making any sequence of calls easily
implementable in a threaded model, should be relatively easy to program.
At least, that's how the argument goes.
That argument applies to kernel space as well. The struggle to bring
event-based asynchronous I/O to Linux has occupied a number of
highly-capable kernel developers for years - and the job is still far from
complete. It requires the addition of an entirely new infrastructure and
the application of state-machine techniques to inherently sequential series
of events. The complexity of the retry-based asynchronous buffered
file I/O patch set is a case in point: this code has seen work (on and
off) for years, and it still hasn't found its way into the mainline. It
still depends on worker threads for some of its operation as well.
Threadlets, it is argued, allow for any system call to be invoked
asynchronously, with almost no added complexity or overhead at all.
Eventually the discussion reached a point where Linus jumped in to express a bit of frustration.
His position is that it's not a matter of choosing between event-based and
thread-based mechanisms, since there is a place for both:
Use select/poll/epoll/kevent/whatever for event mechanisms. STOP
CLAIMING that you'd use threadlets/syslets/aio for that.... Event
mechanisms are *superior* for events. But they *suck* for things
that aren't events, but are actual code execution with random
places that can block.
In this view, it's not a matter of picking one or the other, but providing
both so that the right tool can be used for each job. It seems likely that
this opinion is fairly widespread, meaning that some sort of thread-based
asynchronous mechanism will probably find its way into the mainline before
too long. Event-based interfaces will continue to be supported as well; the big
question there is whether the existing interfaces (epoll in particular) are
sufficient, or whether the addition of kevents is called for.
Patches and updates
Memory management
- Christoph Lameter: SLUB v2.
(February 26, 2007)
Page editor: Jonathan Corbet