
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.31-rc5, released on July 31. "Apart from various regression fixes, the diffstat shows a couple of new drivers (at_hdmac, uc2322, gspca/sn9c20x, ds2782 battery driver), and some big KMS radeon changes..." Also included was the "flexible array" infrastructure (see below). See the full changelog for the details.

The current stable kernel is 2.6.30.4, released (along with 2.6.27.29) on July 30. Both updates contain another long list of important fixes.

Comments (none posted)

Kernel development news

Quotes of the week

My great act of protest? Disabling Twitter forwarding to my Facebook status. Ha! Take that, enormous corporation!
-- Valerie Aurora shows them who's boss

Anyway, Andrew Morton was right, we should have merged into mainline as soon as Tux3 was booting as root. That would have taken a big load off me. Instead, somebody posted to LKML and called for atomic commit as a precondition for merging. Sounds like a good idea, sounds logical. But actually, in open source it is counter productive, it just puts a bigger load on me, a limited resource. We should have merged first, then got the logging and replay working. In fact, we probably should still do that. I will say this now: if we are invited to merge in the next major release, or in -mm or whatever, we will happily do it. If we are not invited to merge, nobody has any cause to complain about progress slowing down.
-- Daniel Phillips

0 bits in the green bag, 1 bits in the black bag please
-- Alan Cox on how to recycle code

Comments (4 posted)

In brief

By Jonathan Corbet
August 5, 2009
TTY maintenance. Greg Kroah-Hartman, admitting that he is a glutton for punishment, has agreed to take on maintenance of the TTY layer - a job recently abandoned by Alan Cox. Patches have begun to flow toward the mainline, with Linus taking a larger-than-usual interest in getting them into shape. The fate of Alan's longer-term cleanup plans remains uncertain, but basic maintenance and bug fixing, at least, seems to be in place.

Regressions. Rafael Wysocki has posted the 2.6.31-rc5 known regressions list. A total of 76 regressions have been reported in this development cycle; 28 of those remain unresolved. For this stage in the process, that is about normal, or, perhaps, just a bit better than average. Less encouraging, perhaps, is the fact that the 2.6.30 regression list still shows 39 unresolved problems.

make V=1. Once upon a time, building a kernel filled the screen with vast amounts of output, including the full command line for each compilation command. Needless to say, it was hard to get much information out of that much noise; in more recent times, the kernel build system emits much more concise information about what it's doing. Sometimes, though, one needs to see what's really going on; in such cases, running "make V=1" will cause the build system to output everything it's doing.

Except that, as Dave Airlie discovered, it doesn't; some commands are still hidden from view even when V=1 is specified. Build system maintainer Sam Ravnborg explained: "The problem is that V=1 is already too chatty, so people sometimes hide their stuff - as in this case." His suggestion is to implement multiple levels of verbosity, so that "V=2" could be used to view the truly full stream of commands. There's a minor problem in that "V=2" is already used to get make to print out which file caused a particular rebuild to happen. But, as Sam puts it, few people ever use that option, so maybe it could be replaced with a "be more verbose" mode. Unless somebody objects soon, that's likely to be how it goes.

devtmpfs. Greg Kroah-Hartman, evidently not feeling sufficiently challenged by the TTY layer, has reposted the devtmpfs patch, suggesting that it's ready for merging into the mainline. Greg says:

For .32 it's a simple and clean patch. It's been tested and agreed by three major distros that this is a good idea. SuSE has been shipping this in their kernels for a while now with no problems, and actual speedups measured on their boot times.

It would be fair to say, though, that the development community is not yet sold on the desirability of merging this patch; expect some interesting discussion in the near future.

Xtables2. The future of Linux packet filtering might be nftables, but Jan Engelhardt isn't holding his breath. He has, instead, put together an immense patch set massively reworking the existing iptables mechanism. The internal data structures have been torn out and reimplemented as a more flexible linked list, setting the stage for easier single-rule changes in the future. Perhaps the biggest payoff, though, is in the unification of the IPv4, IPv6, and ARP versions of the packet-filtering engine; that, he says, enables the removal of about 50% of the code.

The initial responses suggested that potential reviewers were overwhelmed by the magnitude of the change. Jan has posted a more detailed explanation of what various groups of patches do, which has helped. Eventual merging of this code will probably require breaking the sequence up into multiple steps, though.

Montreal Linux power management mini-summit notes have been posted by Len Brown; they give a good (if terse) summary of recent developments in the area and what is being worked on now.

Comments (none posted)

AlacrityVM

By Jake Edge
August 5, 2009

While virtualization has been a boon for many users and data centers, it tends to suffer from performance problems, particularly I/O performance. Addressing that problem is the goal of a newly announced project, AlacrityVM, which has created a hypervisor based on KVM. By shortening the I/O path for guests, AlacrityVM seeks to provide I/O performance near that of "bare metal" hardware.

The project is in a "pre-alpha" stage, according to the web page, but it is already reporting some fairly impressive results from a proof-of-concept network driver. For both throughput and latency, AlacrityVM guest performance compared favorably to that of 2.6.28 and 2.6.29-rc8 hosts. It also clearly out-performed the virtio drivers in a KVM guest.

The major change that allows AlacrityVM to achieve those gains comes from a new kernel-based virtual I/O scheme known as Virtual-Bus (or vbus). Currently, KVM guests use emulated devices—implemented in user space by QEMU—in order to handle I/O requests. That leads to multiple kernel-to-user-space transitions for each I/O operation. The idea behind vbus is to allow guests to directly access the host kernel driver, thus reducing the overhead for I/O.

Using vbus, a host administrator can define a virtual bus that contains virtual devices—closely patterned on the Linux device model—which allow access to the underlying kernel driver. The guest accesses the bus through vbus guest drivers and will only be able to use those devices that the administrator explicitly instantiates on that vbus. The vbus interface supports only two "verbs": call() for synchronous requests, and shm() for asynchronous communication using shared memory.

A document [PDF] by AlacrityVM developer Gregory Haskins describes how to configure and use vbus. Vbus provides a sysfs interface that an administrator can use to create container-like objects that will constrain guests so that they can only access those devices specifically configured for their use. That helps alleviate one of the potential problems with guests accessing kernel drivers more-or-less directly: security.

The vbus web page has a look at the security issues and how they are handled. The main concerns are ensuring that guests cannot use the vbus mechanism to escape their isolation from other guests and processes, as well as making sure that guests cannot cause a denial of service on the host. The bus can only be created and populated on the host side, and each lives in an isolated namespace, which reduces or eliminates the risk of a cross-bus exploit to violate the isolation. In addition, each task can only be associated with one vbus—enforced by putting a vbus reference in the task struct—so that a guest can only see the device ids specified for that bus.

Care was taken in the vbus implementation to ensure that guests, rather than the host, pay the price for any misbehavior. The two areas mentioned are for guests that, maliciously or otherwise, mangle data structures in the shared memory or fail to service their ring buffer. A naïve implementation could allow these conditions to cause a denial of service by stalling host OS threads or by creating a condition that might normally be handled by a BUG_ON(). Vbus takes steps to ensure that the host-to-guest path is resistant to stalling, while also aborting guests that write garbage to the ring buffer data structures.

Haskins has posted a series of patches to add the vbus infrastructure, along with a driver for accelerated ethernet. So far, the patches seem to be fairly well-received, though there are not, yet, very many comments. The web page makes it clear that the project's goal is "to work towards upstream acceptance of the project on a timeline that suits the community". The flexibility shown in that goal should serve the project well in getting mainline acceptance down the road.

The project sums up its status and future plans on the web page as well: "we have a working design which includes the basic hypervisor, linux-guest support, and accelerated networking. We will be expanding this to include other areas of importance, such as accelerated disk-io, IPC, real-time extensions, and accelerated MS Windows guest support." As one might guess, the web page also has mailing lists for users and developers as well as kernel and user-space git trees available for interested folks.

AlacrityVM and vbus both look to be interesting projects that are probably worth investigating as potential virtualization solutions in the future. The performance gains that come with vbus make it likely to be useful to other projects as well.

Comments (24 posted)

The realtime preemption endgame

By Jonathan Corbet
August 5, 2009
There has been relatively little noise out of the realtime preemption camp in recent months. That does not mean that the realtime developers have been idle, though; instead, they are preparing for the realtime endgame: the merger of the bulk of the remaining patches into the mainline kernel. The 2.6.31-rc4-rt1 tree recently announced by Thomas Gleixner shows the results of much of this work. This article will look at some of the recent changes to -rt.

The point of the realtime preemption project is to enable a general-purpose Linux kernel to provide deterministic response times to high-priority processes. "Realtime" does not (necessarily) mean "fast"; it means knowing for sure that the system can respond to important events within a specific time period. It has often been said that this cannot be done, that the complexity of a full operating system would thwart any attempt to guarantee bounded response times. Of course, it was also said that free software developers could never create a full operating system in the first place. The realtime hackers believe that both claims are equally false, and they have been working to prove it.

One of the long-term realtime features was threaded interrupt handlers. A "hard" interrupt handler can monopolize the CPU for as long as it runs; that can create latencies for other users. Moving interrupt handlers into their own threads, instead, allows them to be scheduled like any other process on the system. Thus, threaded interrupt handlers cannot get in the way of higher-priority processes.

Much of the threaded interrupt handling code moved into the mainline for the 2.6.30 release, but in a somewhat different form. While the threading of interrupt handlers is nearly universal in a realtime kernel, it's an optional (and, thus far, little-used) feature in the mainline, so the APIs had to change somewhat. Realtime interrupt handling has been reworked on top of the mainline threaded interrupt mechanism, but it still has its own twists.

In particular, the kernel can still be configured to force all interrupt handlers into threads. If a given driver explicitly requests a threaded handler, behavior is similar to a non-realtime kernel; the driver's "hard" interrupt handler runs as usual in IRQ context. Drivers which do not request threaded handlers get one anyway, with a special hard handler which masks the interrupt line while the driver's handler runs. Interrupt handler threads are per-device now (rather than per-IRQ line). All told, the amount of code which is specific to the realtime tree is fairly small now; the bulk of it is in the mainline.

Software interrupt handling is somewhat different in the realtime tree. Mainline kernels will normally handle software interrupts at convenient moments - context switches or when returning to user space from a system call, usually. If the software interrupt load gets too heavy, though, handling will be deferred to the per-CPU "ksoftirqd" thread. In the realtime tree (subject to a configuration option), all software interrupt handling goes into ksoftirqd - but now there is a separate thread for each interrupt line. So each CPU will get a couple of ksoftirqd threads for network processing, one for the block subsystem, one for RCU, one for tasklets, and so on. Software interrupts are also preemptable, though that may not happen very often; they run at realtime priority.

The work which first kicked off the realtime preemption tree was the replacement of spinlocks with sleeping mutexes. The spinlock technique is difficult to square with deterministic latencies; any processor which is spinning on a lock will wait an arbitrary period of time, depending on what code in another CPU is doing. Code holding spinlocks also cannot be preempted; doing so would cause serious latencies (at best) or deadlocks. So the goal of ensuring bounded response times required the elimination of spinlocks to the greatest extent possible.

Replacing spinlocks throughout the kernel with realtime mutexes solves much of the problem. Threads waiting for a mutex will sleep, freeing the processor for some other task. Threads holding mutexes can be preempted if a higher-priority process comes along. So, if the priorities have been set properly, there should be little standing in the way of the highest-priority process responding to events at any time. This is the core idea behind the entire realtime preemption concept.

As it happens, though, not all spinlocks can be replaced by mutexes. At the lowest levels of the system, there is still a need for true (or "raw") spinlocks; the locks which are used to implement mutexes are one obvious example. Over the years, a fair amount of effort has gone into the task of figuring out which spinlocks really needed to be "raw" locks. At the code level, the difference was papered over through the use of some rather ugly trickery in the spinlock primitives. Regardless of whether a raw spinlock or a sleeping lock was being used, the code would call spin_lock() to acquire it; the only difference was where the lock was declared.

This approach was probably useful during the early development phases where it was often necessary to change the type of specific locks. But ugly compiler trickery which serves to obfuscate the type of lock being used in any specific context seems unlikely to fly when it comes to merger into the mainline. So the realtime hackers have bitten the bullet and split the two types of locks entirely. The replacement of "spinlocks" with mutexes still happens as before, for the simple reason that changing every spinlock call would be a massive, disruptive change across the entire kernel code base. But the "raw" spinlock type, which is used in far fewer places, is more amenable to this kind of change.

The result is a new mutual exclusion primitive, called atomic_spinlock_t, which looks a lot like traditional spinlocks:

    #include <linux/spinlock.h>

    DEFINE_ATOMIC_SPINLOCK(name)
    atomic_spin_lock_init(atomic_spinlock_t *lock);

    void atomic_spin_lock(atomic_spinlock_t *lock);
    void atomic_spin_lock_irqsave(atomic_spinlock_t *lock, unsigned long flags);
    void atomic_spin_lock_irq(atomic_spinlock_t *lock);
    void atomic_spin_lock_bh(atomic_spinlock_t *lock);
    int atomic_spin_trylock(atomic_spinlock_t *lock);

    void atomic_spin_unlock(atomic_spinlock_t *lock);
    void atomic_spin_unlock_irqrestore(atomic_spinlock_t *lock, unsigned long flags);
    void atomic_spin_unlock_irq(atomic_spinlock_t *lock);
    void atomic_spin_unlock_bh(atomic_spinlock_t *lock);

These new "atomic spinlocks" are used in the scheduler, low-level interrupt handling code, clock-handling, PCI bus management, ACPI subsystem, and in many other places. The change is still large and disruptive - but much less so than changing ordinary "spinlock" users would have been.

One might argue that putting atomic spinlocks back into the kernel will reintroduce the same latency problems that the realtime developers are working to get rid of. There is certainly a risk of that happening, but it can be minimized with due care. Auditing every kernel path which uses spinlocks is clearly not a feasible task, but it is possible to look very closely at the (much smaller) number of code paths using atomic spinlocks. So there can be a reasonable degree of assurance that the remaining atomic spinlocks will not cause the kernel to exceed the latency goals.

(As an aside, Thomas Gleixner is looking for a better name for the atomic_spinlock_t type. Suggest the winning idea, and free beer at the next conference may be your reward.)

Similar changes have been made to a number of other kernel mutual exclusion mechanisms. There is a new atomic_seqlock_t variant on seqlocks for cases where the seqlock writer must not be preemptible. The anon_semaphore type mostly appears to be a renaming of semaphores and their related functions; it is a part of the continuing effort to eliminate the use of semaphores in any place where a mutex or completion should be used instead. There is also a rw_anon_semaphore type as a replacement for rw_semaphore.

Quite a few other realtime-specific changes remain in the -rt tree. The realtime code is incompatible with the SLUB allocator, so only slab is allowed. There is also an interesting problem with kmap_atomic(); this function creates a temporary, per-CPU kernel-space address mapping for a given memory page. Preemption cannot be allowed to happen when an atomic kmap is active; it would be possible for other code to change the mapping before the preempted code tries to use it. In the realtime setting, the performance benefits from atomic kmaps are outweighed by the additional latency they can cause. So, for all practical purposes, kmap_atomic() does not exist in a realtime kernel; calls to kmap_atomic() are mapped to ordinary kmap() calls. And so on.

As for work which is not yet even in the realtime tree, the first priority would appear to be clear:

We seriously want to tackle the elimination of the PREEMPT_RT annoyance #1, aka BKL. The Big Kernel Lock is still used in ~330 files all across the kernel.

At this point, the remaining BKL-removal work comes down to low-level audits of individual filesystems and drivers; for the most part, it has been pushed out of the core kernel.

Beyond that, of course, there is the little task of getting as much of this code as possible into the mainline kernel. To that end, a proper git tree with a bisectable sequence of patches is being prepared, though that work is not yet complete. There will also be a gathering of realtime Linux developers at the Eleventh Real-Time Linux Workshop this September in Dresden; getting the realtime work into the mainline is expected to be discussed seriously there. As it happens, your editor plans to be in the room; watch this space in late September for an update.

Comments (31 posted)

Flexible arrays

By Jonathan Corbet
August 5, 2009
Kernel developers must keep in mind many constraints which are unique to that programming environment; one of those is that memory allocations become less reliable as they get larger. Single-page allocations will, for all practical purposes, always succeed. A request for two physically-contiguous pages has a high probability of working, but each doubling of the size decreases the chances of a successful allocation. The fragmentation of memory which occurs over the system's lifetime makes it increasingly hard to find groups of physically-contiguous pages on demand. So large allocations are strongly discouraged.

Kernel programmers will sometimes respond to this problem by allocating pages with vmalloc(). Memory allocated this way is virtually contiguous, but physically scattered. So, as long as physically-contiguous pages are not needed, vmalloc() looks like a good solution to the problem. It's not ideal, though. On 32-bit systems, memory from vmalloc() must be mapped into a relatively small address space; it's easy to run out. On SMP systems, the page table changes required by vmalloc() allocations can require expensive cross-processor interrupts on all CPUs. And, on all systems, use of space in the vmalloc() range increases pressure on the translation lookaside buffer (TLB), reducing the performance of the system.

So it would be nice to have a mechanism which could handle the allocation of large arrays in a manner which (1) is reliable, and (2) does not use vmalloc(). To date, any such mechanisms have generally been pieced together by developers solving a specific problem; there has been nothing designed for more general use. That has changed, though, with the merging of the "flexible array" mechanism, written by Dave Hansen, for 2.6.31-rc5.

A flexible array holds an arbitrary (within limits) number of fixed-size objects, accessed via an integer index. Sparse arrays are handled reasonably well. Only single-page allocations are made, so memory allocation failures should be relatively rare. The down sides are that the arrays cannot be indexed directly, individual object size cannot exceed the system page size, and putting data into a flexible array requires a copy operation. It's also worth noting that flexible arrays do no internal locking at all; if concurrent access to an array is possible, then the caller must arrange for appropriate mutual exclusion.

The creation of a flexible array is done with:

    #include <linux/flex_array.h>

    struct flex_array *flex_array_alloc(int element_size, int total, gfp_t flags);

The individual object size is provided by element_size, while total is the maximum number of objects which can be stored in the array. The flags argument is passed directly to the internal memory allocation calls. With the current code, using flags to ask for high memory is likely to lead to notably unpleasant side effects.

Storing data into a flexible array is accomplished with a call to:

    int flex_array_put(struct flex_array *array, int element_nr, void *src, gfp_t flags);

This call will copy the data from src into the array, in the position indicated by element_nr (which must be less than the maximum specified when the array was created). If any memory allocations must be performed, flags will be used. The return value is zero on success, a negative error code otherwise.

There might possibly be a need to store data into a flexible array while running in some sort of atomic context; in this situation, sleeping in the memory allocator would be a bad thing. That can be avoided by using GFP_ATOMIC for the flags value, but, often, there is a better way. The trick is to ensure that any needed memory allocations are done before entering atomic context, using:

    int flex_array_prealloc(struct flex_array *array, int start, int end, gfp_t flags);

This function will ensure that memory for the elements indexed in the range defined by start and end has been allocated. Thereafter, a flex_array_put() call on an element in that range is guaranteed not to block.

Getting data back out of the array is done with:

    void *flex_array_get(struct flex_array *fa, int element_nr);

The return value is a pointer to the data element, or NULL if that particular element has never been allocated.

Note that it is possible to get back a valid pointer for an element which has never been stored in the array. Memory for array elements is allocated one page at a time; a single allocation could provide memory for several adjacent elements. The flexible array code does not know if a specific element has been written to; it only knows if the associated memory is present. So a flex_array_get() call on an element which was never stored in the array has the potential to return a pointer to random data. If the caller does not have a separate way to know which elements were actually stored, it might be wise, at least, to add GFP_ZERO to the flags argument to ensure that all elements are zeroed.

There is no way to remove a single element from the array. It is possible, though, to remove all elements with a call to:

    void flex_array_free_parts(struct flex_array *array);

This call frees all elements, but leaves the array itself in place. Freeing the entire array is done with:

    void flex_array_free(struct flex_array *array);

As of this writing, there are no users of flexible arrays in the mainline kernel. The functions described here are also not exported to modules; that will probably be fixed when somebody comes up with a need for it.

Comments (4 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds