The current development kernel is 2.6.31-rc5, released on July 31.
"Apart from various regression fixes, the diffstat shows a couple of
new drivers (at_hdmac, uc2322, gspca/sn9c20x, ds2782 battery driver), and
some big KMS radeon changes...
" Also included was the "flexible
array" infrastructure (see below). See the
for the details.
The current stable kernel is 2.6.30.4, released (along with 2.6.27.29) on July 30.
Both updates contain another long list of important fixes.
Comments (none posted)
Kernel development news
My great act of protest? Disabling Twitter forwarding to my
Facebook status. Ha! Take that, enormous corporation!
-- Valerie Aurora shows them who's boss
Anyway, Andrew Morton was right, we should have merged into
mainline as soon as Tux3 was booting as root. That would have
taken a big load off me. Instead, somebody posted to LKML and
called for atomic commit as a precondition for merging. Sounds
like a good idea, sounds logical. But actually, in open source it
is counter productive, it just puts a bigger load on me, a limited
resource. We should have merged first, then got the logging and
replay working. In fact, we probably should still do that. I will
say this now: if we are invited to merge in the next major release,
or in -mm or whatever, we will happily do it. If we are not
invited to merge, nobody has any cause to complain about progress.
-- Daniel Phillips
0 bits in the green bag, 1 bits in the black bag please
-- Alan Cox on how to recycle code
Comments (4 posted)
TTY maintenance: Greg Kroah-Hartman, admitting that he is a glutton
for punishment, has agreed to take on maintenance of the TTY layer - a job
recently abandoned by Alan Cox. Patches have begun to flow toward the
mainline, with Linus taking a larger-than-usual interest in getting them
into shape. The fate of Alan's longer-term cleanup plans remains
uncertain, but basic maintenance and bug fixing, at least, seems to be in good hands.
Regressions. Rafael Wysocki has posted the 2.6.31-rc5 known regressions
list. A total of 76 regressions have been reported in this development
cycle; 28 of those remain unresolved. For this stage in the process, that
is about normal, or, perhaps, just a bit better than average. Less
encouraging, perhaps, is the fact that the 2.6.30 regression list still
shows 39 unresolved problems.
make V=1. Once upon a time, building a kernel filled the screen
with vast amounts of output, including the full command line for each
compilation command. Needless to say, it was hard to get much information
out of that much noise; in more recent times, the kernel build system emits
much more concise information about what it's doing. Sometimes, though,
one needs to see what's really going on; in such cases, running
"make V=1" will cause the build system to output everything
Except that, as Dave Airlie discovered, it
doesn't; some commands are still hidden from view even when V=1 is
specified. Build system maintainer Sam Ravnborg explained: "The problem is that V=1 is
already too chatty, so people sometimes hide their stuff - as in this
case." His suggestion is to implement multiple levels of verbosity,
so that "V=2" could be used to view the truly full stream of
commands. There's a minor problem in that "V=2" is already used
to get make to print out which file caused a particular rebuild to
happen. But, as Sam puts it, few people ever use that option, so maybe it
could be replaced with a "be more verbose" mode. Unless somebody objects
soon, that's likely to be how it goes.
devtmpfs. Greg Kroah-Hartman, evidently not feeling sufficiently
challenged by the TTY layer, has reposted the devtmpfs patch,
suggesting that it's ready for merging into the mainline. Greg says:
For .32 it's a simple and clean patch. It's been tested and agreed
by three major distros that this is a good idea. SuSE has been
shipping this in their kernels for a while now with no problems,
and actual speedups measured on their boot times.
It would be fair to say, though, that the development community is not yet
sold on the desirability of merging this patch; expect some interesting
discussion in the near future.
Xtables2. The future of Linux packet filtering might be nftables, but Jan Engelhardt
isn't holding his breath. He has, instead, put together an immense patch set massively
reworking the existing iptables mechanism. The internal data structures
have been torn out and reimplemented as a more flexible linked list,
setting the stage for easier single-rule changes in the future. Perhaps
the biggest payoff, though, is in the unification of the IPv4, IPv6, and
ARP versions of the packet-filtering engine; that, he says, enables the
removal of about 50% of the code.
The initial responses suggested that potential reviewers were overwhelmed
by the magnitude of the change. Jan has posted a more detailed explanation of what various
groups of patches do, which has helped. Eventual merging of this code will
probably require breaking the sequence up into multiple steps, though.
Montreal Linux power management mini-summit notes have been posted by Len Brown;
they give a good (if terse) summary
of recent developments in the area and what is being worked on now.
Comments (none posted)
While virtualization has been a boon for many users and data centers, it
tends to suffer from performance problems, particularly I/O performance.
Addressing that problem is the goal of a newly announced project, AlacrityVM,
which has created a hypervisor based on KVM. By shortening the I/O path,
AlacrityVM seeks to provide I/O performance near that of "bare metal" systems.
The project is in a "pre-alpha" stage, according to the web page, but it is
already reporting some fairly impressive results from a proof-of-concept network
driver. For throughput, the AlacrityVM guest performance compared favorably
to that of 2.6.28 and 2.6.29-rc8 hosts; it also clearly out-performed the
virtio drivers in a similar comparison.
The major change that allows AlacrityVM to achieve those gains comes
from a new kernel-based virtual I/O scheme known as Virtual-Bus
(or vbus). Currently, KVM guests use emulated devices—implemented
in user space by QEMU—in order to handle I/O requests. That leads to
kernel-to-user-space transitions for each I/O operation. The idea behind
vbus is to allow guests to directly access the host kernel driver, thus
reducing the overhead for I/O.
Using vbus, a host administrator can define a virtual bus that contains
virtual devices—closely patterned on the Linux device
model—which allow access to the underlying kernel driver. The
guest accesses the bus through vbus guest drivers and will only be able to
use those devices that the administrator explicitly instantiates on
that vbus. The vbus interface supports only two "verbs": call()
for synchronous requests, and shm() for asynchronous communication
using shared memory.
A presentation [PDF] by AlacrityVM developer Gregory Haskins describes how to configure
and use vbus. Vbus provides a sysfs interface that an administrator can
use to create container-like objects that will constrain
guests so that they can only access those devices specifically configured
for their use. That helps alleviate one of the potential problems with
guests accessing kernel drivers more-or-less directly: security.
The vbus web page has a look at the security issues and how they are
handled. The main concerns are ensuring that guests cannot use the vbus
mechanism to escape their isolation from other guests and processes, as well
as making sure that guests cannot cause a denial of service on the host.
The bus can
only be created and populated on the host side,
and each lives in an isolated namespace, which reduces or eliminates the
risk of a cross-bus exploit to violate the isolation. In addition, each task
can only be associated with one vbus—enforced by putting a vbus
reference in the task struct—so that a guest can only see the
device ids specified for that bus.
Care was taken in the vbus implementation to punish guests for any
misbehavior, rather than the host. The two areas mentioned are for guests
that, maliciously or otherwise, mangle data structures in the shared memory
or fail to service their ring buffer. A naïve implementation could
allow these conditions to cause a denial of service by stalling host OS
threads or by creating a condition that might normally be handled by a
BUG_ON(). Vbus takes steps to ensure that the host to guest path
is resistant to stalling, while also aborting guests that write garbage to
the ring buffer data structures.
Haskins has posted a series
of patches to add the vbus infrastructure, along with a driver for
accelerated ethernet. So far, the patches seem to be fairly well-received,
though there are not yet very many comments. The web page makes it clear
that the project's goal is "to work towards upstream acceptance of
the project on a timeline that suits the community". The
flexibility shown in that goal should serve the project well in getting
mainline acceptance down the road.
The project sums up its status and future plans on the web page as
well: "we have a working design which includes the basic hypervisor,
linux-guest support, and accelerated networking. We will be expanding this
to include other areas of importance, such as accelerated disk-io, IPC,
real-time extensions, and accelerated MS Windows guest support." As
one might guess, the web page also has mailing lists for users and
developers as well as kernel and user-space git trees available for those interested.
AlacrityVM and vbus both look to be interesting projects that are probably
worth investigating as potential virtualization solutions sometime in the
future. The performance gains that come with vbus make it
likely to be useful to other projects as well.
Comments (24 posted)
There has been relatively little noise out of the realtime preemption camp
in recent months. That does not mean that the realtime developers have
been idle, though; instead, they are preparing for the realtime endgame:
the merger of the bulk of the remaining patches into the mainline kernel.
The 2.6.31-rc4-rt1 tree
recently announced by Thomas Gleixner shows the results of much of this
work. This article will look at some of the recent changes to -rt.
The point of the realtime preemption project is to enable a general-purpose
Linux kernel to provide deterministic response times to high-priority
processes. "Realtime" does not (necessarily) mean "fast"; it means knowing
for sure that the system can respond to important events within a specific
time period. It has often been said that this cannot be done, that the
complexity of a full operating system would thwart any attempt to guarantee
bounded response times. Of course, it was also said that free software
developers could never create a full operating system in the first place.
The realtime hackers believe that both claims are equally false, and they
have been working to prove it.
One of the long-term realtime features was threaded interrupt handlers. A
"hard" interrupt handler can monopolize the CPU for as long as it runs;
that can create latencies for other users. Moving interrupt handlers into
their own threads, instead, allows them to be scheduled like any other
process on the system. Thus, threaded interrupt handlers cannot get in the
way of higher-priority processes.
Much of the threaded interrupt handling code moved into the mainline for
the 2.6.30 release, but in a
somewhat different form. While the threading of interrupt handlers is
nearly universal in a realtime kernel, it's an optional (and, thus far,
little-used) feature in the mainline, so the APIs had to change somewhat.
Realtime interrupt handling has been reworked on top of the mainline
threaded interrupt mechanism, but it still has its own twists.
In particular, the kernel can still be configured to force all interrupt
handlers into threads. If a given driver explicitly requests a threaded
handler, behavior is similar to a non-realtime kernel; the driver's "hard"
interrupt handler runs as usual in IRQ context. Drivers which do not
request threaded handlers get one anyway, with a special hard handler which
masks the interrupt line while the driver's handler runs. Interrupt
handler threads are per-device now (rather than per-IRQ line). All told,
the amount of code which is specific to the realtime tree is fairly small
now; the bulk of it is in the mainline.
Software interrupt handling is somewhat different in the realtime tree.
Mainline kernels will normally handle software interrupts at convenient
moments - context switches or when returning to user space from a system
call, usually. If the software interrupt load gets too heavy, though, handling will
be deferred to the per-CPU "ksoftirqd" thread. In the realtime tree
(subject to a configuration option), all software interrupt handling goes
into ksoftirqd - but now there is a separate thread for each interrupt
line. So each CPU will get a couple of ksoftirqd threads for network
processing, one for the block subsystem, one for RCU, one for tasklets, and
so on. Software interrupts are also preemptable, though that may not
happen very often; they run at realtime priority.
The work which first kicked off the realtime preemption tree was the
replacement of spinlocks with sleeping mutexes. The spinlock technique is
difficult to square with deterministic latencies; any processor which is
spinning on a lock will wait an arbitrary period of time, depending on what
code in another CPU is doing. Code holding spinlocks also cannot be
preempted; doing so would cause serious latencies (at best) or deadlocks.
So the goal of ensuring bounded response times required the elimination of
spinlocks to the greatest extent possible.
Replacing spinlocks throughout the kernel with realtime mutexes solves much
of the problem. Threads waiting for a mutex will sleep, freeing the
processor for some other task. Threads holding mutexes can be preempted if
a higher-priority process comes along. So, if the priorities have been set
properly, there should be little in the way of the highest-priority process
being able to respond to events at any time. This is the core idea behind
the entire realtime preemption concept.
As it happens, though, not all spinlocks can be replaced by mutexes. At
the lowest levels of the system, there is still a need for true (or "raw")
spinlocks; the locks which are used to implement mutexes are one obvious
example. Over the years, a fair amount of effort has gone into the task of
figuring out which spinlocks really needed to be "raw" locks. At the code
level, the difference was papered over through the use of some rather ugly
trickery in the spinlock primitives. Regardless of whether a raw spinlock
or a sleeping lock was being used, the code would call spin_lock()
to acquire it; the only difference was where the lock was declared.
This approach was probably useful during the early development phases where
it was often necessary to change the type of specific locks. But ugly
compiler trickery which serves to obfuscate the type of lock being used in
any specific context seems unlikely to fly when it comes to merger into the
mainline. So the realtime hackers have bitten the bullet and split the two
types of locks entirely. The replacement of "spinlocks" with mutexes still
happens as before, for the simple reason that changing every spinlock call
would be a massive, disruptive change across the entire kernel code base.
But the "raw" spinlock type, which is used in far fewer places, is more
amenable to this kind of change.
The result is a new mutual exclusion primitive, called
atomic_spinlock_t, which looks a lot like the spinlock API it replaces:
void atomic_spin_lock(atomic_spinlock_t *lock);
void atomic_spin_lock_irqsave(atomic_spinlock_t *lock, long flags);
void atomic_spin_lock_irq(atomic_spinlock_t *lock);
void atomic_spin_lock_bh(atomic_spinlock_t *lock);
int atomic_spin_trylock(atomic_spinlock_t *lock);
void atomic_spin_unlock(atomic_spinlock_t *lock);
void atomic_spin_unlock_irqrestore(atomic_spinlock_t *lock, long flags);
void atomic_spin_unlock_irq(atomic_spinlock_t *lock);
void atomic_spin_unlock_bh(atomic_spinlock_t *lock);
These new "atomic spinlocks" are used in the scheduler, low-level interrupt
handling code, clock-handling, PCI bus management, ACPI subsystem, and in
many other places. The change is still large and disruptive - but much
less so than changing ordinary "spinlock" users would have been.
One might argue that putting atomic spinlocks back into the kernel will
reintroduce the same latency problems that the realtime developers are
working to get rid of. There is certainly a risk of that happening, but it
can be minimized with due care. Auditing every kernel path which uses
spinlocks is clearly not a feasible task, but it is possible to look
very closely at the (much smaller) number of code paths using atomic
spinlocks. So there can be a reasonable degree of assurance that the
remaining atomic spinlocks will not cause the kernel to exceed the latency goals.
(As an aside, Thomas Gleixner is looking for a
better name for the atomic_spinlock_t type. Suggest the
winning idea, and free beer at the next conference may be your reward.)
Similar changes have been made to a number of other kernel mutual exclusion
mechanisms. There is a new atomic_seqlock_t variant on seqlocks for cases where the
seqlock writer cannot be preempted. The anon_semaphore type
mostly appears to be a renaming of semaphores and their related functions;
it is a part of the continuing effort to eliminate the use of semaphores in
any place where a mutex or completion should be used instead. There is
also a rw_anon_semaphore type as a replacement for rw_semaphore.
Quite a few other realtime-specific changes remain in the -rt tree. The
realtime code is incompatible with the SLUB allocator, so only slab is
allowed. There is also an interesting problem with kmap_atomic();
this function creates a temporary, per-CPU kernel-space address mapping for
a given memory page. Preemption cannot be allowed to happen when an atomic
kmap is active; it would be possible for other code to change the mapping
before the preempted code tries to use it. In the realtime setting, the
performance benefits from atomic kmaps are outweighed by the additional
latency they can cause. So, for all practical purposes,
kmap_atomic() does not exist in a realtime kernel; calls to
kmap_atomic() are mapped to ordinary kmap() calls.
As for work which is not yet even in the realtime tree, the first priority
would appear to be clear:
We seriously want to tackle the elimination of the PREEMPT_RT
annoyance #1, aka BKL. The Big Kernel Lock is still used in ~330
files all across the kernel.
At this point, the remaining BKL-removal work comes down to low-level
audits of individual filesystems and drivers; for the most part, it has
been pushed out of the core kernel.
Beyond that, of course, there is the little task of getting as much of this
code as possible into the mainline kernel. To that end, a proper git tree
with a bisectable sequence of patches is being prepared, though that work
is not yet complete. There will also be a gathering of realtime Linux
developers at the Eleventh Real-Time
Linux Workshop this September in Dresden; getting the realtime work
into the mainline is expected to be discussed seriously there. As it
happens, your editor plans to be in the room; watch this space in late September
for an update.
Comments (31 posted)
Kernel developers must keep in mind many constraints which are unique to
that programming environment; one of those is that memory
allocations become less reliable as they get larger. Single-page
allocations will, for all practical purposes, always succeed. A request
for two physically-contiguous pages has a high probability of working, but
each doubling of the size decreases the chances of a successful
allocation. The fragmentation of memory which occurs over the system's
lifetime makes it increasingly hard to find groups of
physically-contiguous pages on demand. So large allocations are strongly discouraged.
Kernel programmers will sometimes respond to this problem by allocating
pages with vmalloc(). Memory allocated this way is virtually
contiguous, but physically scattered. So, as long as physically-contiguous
pages are not needed, vmalloc() looks like a good solution to the
problem. It's not ideal, though. On 32-bit systems, memory from
vmalloc() must be mapped into a relatively small address space; it's easy to
run out. On SMP systems, the page table changes required by
vmalloc() allocations can require expensive cross-processor
interrupts on all CPUs. And, on all systems, use of space in the
vmalloc() range increases pressure on the translation lookaside
buffer (TLB), reducing the performance of the system.
So it would be nice to have a mechanism which could handle the allocation
of large arrays in a manner which (1) is reliable, and (2) does not use
vmalloc(). To date, any such mechanisms have generally been
pieced together by developers solving a specific problem; there has been
nothing designed for more general use. That has changed, though, with the
merging of the "flexible array" mechanism, written by Dave Hansen, for 2.6.31.
A flexible array holds an arbitrary (within limits) number of fixed-sized
objects, accessed via an integer index. Sparse arrays are handled
reasonably well. Only single-page allocations are made, so memory
allocation failures should be relatively rare. The down sides are that the
arrays cannot be indexed directly, individual object size cannot exceed the
system page size, and putting data into a flexible array
requires a copy operation. It's also worth noting that flexible arrays do
no internal locking at all; if concurrent access to an array is possible,
then the caller must arrange for appropriate mutual exclusion.
The creation of a flexible array is done with:
struct flex_array *flex_array_alloc(int element_size, int total, gfp_t flags);
The individual object size is provided by element_size, while
total is the maximum number of objects which can be stored in the
array. The flags argument is passed directly to the internal
memory allocation calls. With the current code, using flags to
ask for high memory is likely to lead to notably unpleasant side effects.
Storing data into a flexible array is accomplished with a call to:
int flex_array_put(struct flex_array *array, int element_nr, void *src, gfp_t flags);
This call will copy the data from src into the array,
in the position indicated by element_nr (which must be less than
the maximum specified when the array was created). If any memory allocations must be
performed, flags will be used. The return value is zero on
success, a negative error code otherwise.
There might possibly be a need to store data into a flexible array while
running in some sort of atomic context; in this situation, sleeping
in the memory allocator would be a bad thing. That can be avoided by using
GFP_ATOMIC for the flags value, but, often, there is a
better way. The trick is to ensure that any needed memory allocations are
done before entering atomic context, using:
int flex_array_prealloc(struct flex_array *array, int start, int end, gfp_t flags);
This function will ensure that memory for the elements indexed in the range
defined by start and end has been allocated. Thereafter,
a flex_array_put() call on an element in that range is guaranteed
not to block.
Getting data back out of the array is done with:
void *flex_array_get(struct flex_array *fa, int element_nr);
The return value is a pointer to the data element, or NULL if that
particular element has never been allocated.
Note that it is possible to get back a valid pointer for an element which
has never been stored in the array. Memory for array elements is allocated
one page at a time; a single allocation could provide memory for several
adjacent elements. The flexible array code does not know if a specific
element has been written to; it only knows if the associated memory is
present. So a flex_array_get() call on an element which was never
stored in the array has the potential to return a pointer to random data.
If the caller does not have a separate way to know which elements were
actually stored, it might be wise, at least, to add GFP_ZERO to
the flags argument to ensure that all elements are zeroed.
There is no way to remove a single element from the array. It is possible,
though, to remove all elements with a call to:
void flex_array_free_parts(struct flex_array *array);
This call frees all elements, but leaves the array itself in place.
Freeing the entire array is done with:
void flex_array_free(struct flex_array *array);
As of this writing, there are no users of flexible arrays in the mainline
kernel. The functions described here are also not exported to modules;
that will probably be fixed when somebody comes up with a need for it.
Comments (4 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet