Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.62, which was released by Linus on February 17. It included a new version of the dentry cache which uses read-copy-update for lockless file lookups, a number of architecture updates, some kbuild fixes (including module alias and device table support), more signal cleanup work, and (a classic sign that the freeze is progressing) lots of spelling fixes. The long-format changelog has the details.

2.5.61 was released on February 14. Changes in this release include a number of SCSI driver fixes, an x86-64 merge, a new set of AGP changes, some ACPI work, an SCTP update, and, of course, numerous other fixes. Once again, see the long-format changelog for the details.
Linus's (pre-2.5.63) BitKeeper tree contains, as of this writing, the longstanding POSIX timers patch (but without the high-resolution timers), a new set of IDE changes (see below), updates for obscure architectures (Visual Workstation, v850, m68k-nommu), an ACPI update (including the license change to dual BSD/GPL), and another big set of spelling fixes.
The current stable kernel is 2.4.20; there have been no 2.4.21 prepatches issued over the last week.
Alan Cox has released the fourth 2.2.24 release candidate.
Kernel development news
Synchronous signal handling
While fixing various problems in the signal handling code of recent kernels, Linus evidently decided to take a stab at the issue of signal handling races. The result was this patch implementing a prototype of a new signal handling mechanism. The idea needs some fleshing out before it might be merged into the kernel, but it has attracted a certain amount of interest among the developers.

The patch adds a new sigfd() system call:
int sigfd(sigset_t *mask, unsigned long flags);
The system call returns a file descriptor which will report on the set of signals found in the given mask (the flags argument is not used for now). A process reading from the file descriptor will receive a structure describing one signal which was delivered to the process; it will block if there are no outstanding signals.
This approach offers some advantages. Since signals are queued up and read one at a time, they can be dealt with in an orderly manner. The user-space application need not worry about races between signal handlers and other code. The signal file descriptor can also be used with the select() and poll() system calls, allowing signal handling to be folded into application event processing loops. An application can even pass the file descriptor to another process, should there be, for some reason, a desire to let that other process listen in on the first process's signals.
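To make the idea concrete, here is a rough user-space sketch of how such a descriptor might fit into a poll() loop. It assumes the relevant signals have been blocked so they are only delivered via the descriptor; the sigfd() declaration and the signal_event structure are made up for illustration, since the actual layout of the data returned by read() had not been settled.

/*
 * Hypothetical sketch only: sigfd() was a prototype system call with no
 * glibc wrapper, and the structure returned by read() was not finalized;
 * "struct signal_event" and the sigfd() declaration are invented here.
 */
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <poll.h>

extern int sigfd(sigset_t *mask, unsigned long flags);

struct signal_event {          /* hypothetical layout */
    int si_signo;
    int si_code;
    pid_t si_pid;
};

void event_loop(int netfd)
{
    sigset_t mask;
    struct pollfd fds[2];
    int sigd;

    sigemptyset(&mask);
    sigaddset(&mask, SIGHUP);
    sigaddset(&mask, SIGCHLD);
    sigprocmask(SIG_BLOCK, &mask, NULL);  /* keep async handlers from running */

    sigd = sigfd(&mask, 0);               /* flags are unused for now */

    fds[0].fd = netfd;  fds[0].events = POLLIN;
    fds[1].fd = sigd;   fds[1].events = POLLIN;

    for (;;) {
        poll(fds, 2, -1);
        if (fds[1].revents & POLLIN) {
            struct signal_event ev;

            read(sigd, &ev, sizeof(ev));  /* one queued signal at a time */
            /* deal with ev.si_signo synchronously here */
        }
        if (fds[0].revents & POLLIN) {
            /* handle network traffic */
        }
    }
}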
There was some immediate discussion of expanding this interface into a more generic event-handling mechanism. For example, timer events, asynchronous I/O events, etc. could also be delivered via the same file descriptor. Linus stated that, to an extent, expanding the interface is what the flags argument was intended for. He doesn't want to put too much into this interface, however.
Looking at the patch, a few developers commented on how much of it is really just boilerplate filesystem and inode code. It has to be there to make the file descriptor work, but really has little to do with the task at hand. Much of that code is duplicated with other subsystems which have to make "virtual" file descriptors. Davide Libenzi responded to this observation with a patch implementing a new, shared, "virtual filesystem" capability. If some variant of that patch goes in, it has the potential of ridding the kernel of a fair amount of tedious and error-prone code duplication.
A new round of IDE patches
After a long pause, a new set of IDE patches has found its way into Linus's pre-2.5.63 BitKeeper tree. Most of these patches have been around for a while (in the 2.4-ac tree), but Alan Cox has not felt that 2.5.x was stable enough to attempt new IDE work. Now that things are working a little better, the patches are flowing again.
The new generation of IDE changes is rather more restrained than last year's "cleanup" effort. Changes that have gone in this time around include cleaning out some old data structures that were either unused or did not suit the purpose to which they were being put. Some improved locking has been put in place, and the handling of missing drives (e.g. PCMCIA drives which are removed by the user) has been improved, though work remains to be done in that area. There is also a new ide_execute_command() function which is meant to be the way commands are passed down to drives in the future. For now, though, it is only used for CD drives ("As with 2.4 I want it to run for a bit on read only media first").
The IDE work is one of the more prominent entries remaining on the "todo" list for 2.5. Given the need to proceed slowly (it really is no fun to ship a kernel with broken IDE), this work may take some time yet. So it's good to see the patches finding their way into Linus's tree again.
GFP_KERNEL or SLAB_KERNEL?
The low-level kernel memory allocation functions take a set of flags describing how that allocation is to be performed. Among other things, these GFP_ ("get free page") flags control whether the allocation process can sleep and wait for memory, whether high memory can be used, and so on. See this article for the full set.

The kernel slab allocator is an additional layer built on top of the low-level code; it handles situations where numerous objects of the same size are frequently allocated and freed. The slab code, too, has a set of flags describing how memory allocation is to happen. They look suspiciously like the low-level flags, but they have different names; instead of GFP_KERNEL, for example, users of the slab code are expected to say SLAB_KERNEL.
Underneath it all, however, the two sets of flags are the same. As a result, many calls to the slab code just use the GFP_ flags, rather than the SLAB_ flags. William Lee Irwin decided it was time to fix that; he posted a patch converting several slab users over to the SLAB_ flags. It looked like a fairly standard, freeze-stage kernel cleanup.
The question came up, however: why bother? Not everybody, it seems, thinks that the separate SLAB_ flags are worth the trouble. William responded with another patch which gets rid of the SLAB_ flags altogether. So far, neither patch has been merged. But they do raise a worthwhile question: why do we need a separate set of flags if the callers have nothing different to say?
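For those unfamiliar with the two interfaces, here is a minimal sketch of the parallel flag sets; the cache and object type are hypothetical:

#include <linux/slab.h>

/* A hypothetical cache and object type, for illustration only. */
struct my_obj {
    int busy;
};
static kmem_cache_t *my_cache;

static struct my_obj *grab_object(void)
{
    /* The slab layer is nominally called with SLAB_ flags... */
    struct my_obj *obj = kmem_cache_alloc(my_cache, SLAB_KERNEL);

    /* ...but, since the flag values are identical, many callers pass
     * GFP_KERNEL instead; that inconsistency is what the patches above
     * are about. */
    return obj;
}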
Driver porting
New additions to the driver porting series
The LWN.net series on porting drivers (and other kernel code) to the 2.5 kernel continues this week with three new articles. Two of them (on low-level memory allocation and per-CPU variables) appear below; the third (an updated description of the seqlock mechanism) is available but won't be included inline here. As always, the full series can be found at http://lwn.net/Articles/driver-porting/.

Driver porting: low-level memory allocation
This article is part of the LWN Porting Drivers to 2.6 series.
Allocation flags
The old <linux/malloc.h> include file is gone; it is now necessary to include <linux/slab.h> instead.

The GFP_BUFFER allocation flag is gone (it was actually removed in 2.4.6). That will bother few people, since almost nobody used it. There are two new flags which have replaced it: GFP_NOIO and GFP_NOFS. The GFP_NOIO flag allows sleeping, but no I/O operations will be started to help satisfy the request. GFP_NOFS is a bit less restrictive; some I/O operations can be started (writing to a swap area, for example), but no filesystem operations will be performed.
For reference, here is the full set of allocation flags, from the most restrictive to the least:
- GFP_ATOMIC: a high-priority allocation which will not sleep; this is the flag to use in interrupt handlers and other non-blocking situations.
- GFP_NOIO: blocking is possible, but no I/O will be performed.
- GFP_NOFS: no filesystem operations will be performed.
- GFP_KERNEL: a regular, blocking allocation.
- GFP_USER: a blocking allocation for user-space pages.
- GFP_HIGHUSER: for allocating user-space pages where high memory may be used.
The __GFP_DMA and __GFP_HIGHMEM flags still exist and may be added to the above to direct an allocation to a particular memory zone. In addition, 2.5.69 added some new modifiers:
- __GFP_REPEAT: this flag tells the page allocator to "try harder," repeating failed allocation attempts if need be. Allocations can still fail, but failure should be less likely.
- __GFP_NOFAIL: try even harder; allocations with this flag must not fail. Needless to say, such an allocation could take a long time to satisfy.
- __GFP_NORETRY: failed allocations should not be retried; instead, a failure status will be returned to the caller immediately.
The __GFP_NOFAIL flag is sure to be tempting to programmers who would rather not code failure paths, but that temptation should be resisted most of the time. Only allocations which truly cannot be allowed to fail should use this flag.
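As a quick illustration of how the two most commonly used flags are chosen (the helper functions here are made up):

#include <linux/slab.h>
#include <linux/gfp.h>

/* In process context it is usually safe to sleep, so GFP_KERNEL is the
 * normal choice. */
static void *grab_buffer(size_t size)
{
    return kmalloc(size, GFP_KERNEL);
}

/* In interrupt handlers (or with a spinlock held) sleeping is not allowed;
 * GFP_ATOMIC must be used, and a failed (NULL) return is more likely, so
 * callers must be prepared for it. */
static void *grab_buffer_atomic(size_t size)
{
    return kmalloc(size, GFP_ATOMIC);
}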
Page-level allocation
For page-level allocations, the alloc_pages() and get_free_page() functions (and variants) exist as always. They are now defined in <linux/gfp.h>, however, and there are a few new ones as well. On NUMA systems, the allocator will do its best to allocate pages on the same node as the caller. To explicitly allocate pages on a different NUMA node, use:
struct page *alloc_pages_node(int node_id, unsigned int gfp_mask, unsigned int order);
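A minimal sketch of how a driver might use it, assuming the relevant node number has already been determined elsewhere (the helper name is made up):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate an order-2 (four page) buffer on the NUMA node where, say, the
 * device doing the I/O lives. */
static struct page *alloc_buffer_on_node(int node)
{
    return alloc_pages_node(node, GFP_KERNEL, 2);
}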
The memory allocator now distinguishes between "hot" and "cold" pages. A hot page is one that is likely to be represented in the processor's cache; cold pages, instead, must be fetched from RAM. In general, it is preferable to use hot pages whenever possible, since they are already cached. Even if the page is to be overwritten immediately (usually the case with memory allocations, after all), hot pages are better - overwriting them will not push some other, perhaps useful, data from the cache. So alloc_pages() and friends will return hot pages when they are available.
On occasion, however, a cold page is preferable. In particular, pages which will be overwritten via a DMA read from a device might as well be cold, since their cache data will be invalidated anyway. In this sort of situation, the __GFP_COLD flag should be passed into the allocation.
Of course, this whole scheme depends on the memory allocator knowing which pages are likely to be hot. Normally, order-zero allocations (i.e. single pages) are assumed to be hot. If you know the state of a page you are freeing, you can tell the allocator explicitly with one of the following:
void free_hot_page(struct page *page);
void free_cold_page(struct page *page);
These functions only work with order-zero allocations; the hot/cold status of larger blocks is not tracked.
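Here is a brief sketch of how the cold-page machinery might be used for a DMA receive buffer; the helper names are made up:

#include <linux/gfp.h>
#include <linux/mm.h>

/* The page will be filled by a DMA read from the device, so its current
 * cache contents are of no interest; ask for a cold page. */
static struct page *get_rx_page(void)
{
    return alloc_pages(GFP_KERNEL | __GFP_COLD, 0);
}

/* When a single (order-zero) page is known to be cache-cold, it can be
 * returned with free_cold_page(). */
static void put_rx_page(struct page *page)
{
    free_cold_page(page);
}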
Memory pools
Memory pools were one of the very first changes in the 2.5 series - they were added to 2.5.1 to support the new block I/O layer. The purpose of mempools is to help out in situations where a memory allocation must succeed, but sleeping is not an option. To that end, mempools pre-allocate a pool of memory and reserve it until it is needed. Mempools make life easier in some situations, but they should be used with restraint; each mempool takes a chunk of kernel memory out of circulation and raises the minimum amount of memory the kernel needs to run effectively.

To work with mempools, your code should include <linux/mempool.h>. A mempool is created with mempool_create():
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
                          mempool_free_t *free_fn, void *pool_data);

Here, min_nr is the minimum number of pre-allocated objects that the mempool tries to keep around. The mempool defers the actual allocation and deallocation of objects to user-supplied routines, which have the following prototypes:
typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);
The allocation function should take care not to sleep unless __GFP_WAIT is set in the given gfp_mask. In all of the above cases, pool_data is a private pointer that may be used by the allocation and deallocation functions.
Creators of mempools will often want to use the slab allocator to do the actual object allocation and deallocation. To do that, create the slab, pass it in to mempool_create() as the pool_data value, and give mempool_alloc_slab and mempool_free_slab as the allocation and deallocation functions.
A mempool may be returned to the system by passing it to mempool_destroy(). You must have returned all items to the pool before destroying it, or the mempool code will get upset and oops the system.
Allocating and freeing objects from the mempool is done with:
void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);
mempool_alloc() will first call the pool's allocation function to satisfy the request; the pre-allocated pool will only be used if the allocation function fails. The allocation may sleep if the given gfp_mask allows it; it can also fail if memory is tight and the preallocated pool has been exhausted.
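Putting the pieces together, here is a minimal sketch of a slab-backed mempool; the cache name, object type, and pool size are made up for illustration:

#include <linux/slab.h>
#include <linux/mempool.h>
#include <linux/errno.h>

struct my_request {
    int op;
    void *buffer;
};

static kmem_cache_t *req_cache;
static mempool_t *req_pool;

static int setup_pool(void)
{
    req_cache = kmem_cache_create("my_requests", sizeof(struct my_request),
                                  0, 0, NULL, NULL);
    if (!req_cache)
        return -ENOMEM;

    /* Keep at least 16 objects in reserve; the slab cache does the real
     * allocation work via mempool_alloc_slab()/mempool_free_slab(). */
    req_pool = mempool_create(16, mempool_alloc_slab, mempool_free_slab,
                              req_cache);
    if (!req_pool) {
        kmem_cache_destroy(req_cache);
        return -ENOMEM;
    }
    return 0;
}

static struct my_request *get_request(void)
{
    /* May sleep; the reserved pool is used only if the slab allocation
     * fails. */
    return mempool_alloc(req_pool, GFP_NOIO);
}

static void put_request(struct my_request *req)
{
    mempool_free(req, req_pool);
}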
Finally, a pool can be resized, if necessary, with:
int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
This function will change the size of the pre-allocated pool, using the given gfp_mask to allocate more memory if need be. Note that, as of 2.5.60, mempool_resize() is disabled in the source, since nobody is actually using it.
Driver porting: per-CPU variables
This article is part of the LWN Porting Drivers to 2.6 series.

Per-CPU variables are used in many places in the 2.6 kernel; they offer a couple of advantages over data shared among all processors:
- Per-CPU variables have fewer locking requirements since they are (normally) only accessed by a single processor. There is nothing other than convention that keeps processors from digging around in other processors' per-CPU data, however, so the programmer must remain aware of what is going on.
- Nothing destroys cache performance as quickly as accessing the same data from multiple processors. Restricting each processor to its own area eliminates cache line bouncing and improves performance.
Examples of per-CPU data in the 2.6 kernel include lists of buffer heads, lists of hot and cold pages, various kernel and networking statistics (which are occasionally summed together into the full system values), timer queues, and so on. There are currently no drivers using per-CPU values, but some applications (e.g. networking statistics for high-bandwidth adapters) might benefit from their use.
The normal way of creating per-CPU variables at compile time is with this macro (defined in <linux/percpu.h>):
DEFINE_PER_CPU(type, name);
This sort of definition will create name, which will hold one object of the given type for each processor on the system. If the variables are to be exported to modules, use:
EXPORT_PER_CPU_SYMBOL(name);
EXPORT_PER_CPU_SYMBOL_GPL(name);
If you need to link to a per-CPU variable defined elsewhere, a similar macro may be used:
DECLARE_PER_CPU(type, name);
Variables defined in this way are actually an array of values. To get at a particular processor's value, the per_cpu() macro may be used; it works as an lvalue, so code like the following works:
DEFINE_PER_CPU(int, mypcint);

per_cpu(mypcint, smp_processor_id()) = 0;
The above code can be dangerous, however. Accessing per-CPU variables can often be done without locking, since each processor has its own private area to work in. The 2.6 kernel is preemptible, however, and that adds a couple of challenges. Since kernel code can be preempted, it is possible to encounter race conditions with other kernel threads running on the same processor. Also, accessing a per-CPU variable requires knowing which processor you are running on; it would not do to be preempted and moved to a different CPU between looking up the processor ID and accessing a per-CPU variable.
For both of the above reasons, kernel preemption usually must be disabled when working with per-CPU data. The usual way of doing this is with the get_cpu_var and put_cpu_var macros. get_cpu_var works as an lvalue, so it can be assigned to, have its address taken, etc. Perhaps the simplest example of the use of these macros can be found in net/socket.c:
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
Of course, since preemption is disabled between the calls, the code should take care not to sleep. Note that there is no version of these macros for access to another CPU's data; cross-processor access to per-CPU data requires explicit locking arrangements.
It is also possible to allocate per-CPU variables dynamically. Simply use these functions:
void *alloc_percpu(type);
void free_percpu(const void *);
alloc_percpu() will allocate one object (of the given type) for each CPU on the system; the allocated storage will be zeroed before being returned to the caller.
There is another set of macros which may be used to access per-CPU data obtained with alloc_percpu(). At the lowest level, you may use:
per_cpu_ptr(void *ptr, int cpu)
which returns (without any concurrency control) a pointer to the per-CPU data for the given cpu. For access to a local processor's data, with preemption disabled, use:
get_cpu_ptr(ptr)
put_cpu_ptr(ptr)
These disable preemption while you work with the local copy, with the usual proviso that you do not sleep between the two calls.
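As an illustration, here is a rough sketch of a dynamically allocated per-CPU counter using these interfaces; the structure and helper names are invented:

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/errno.h>

struct my_stats {
    unsigned long packets;
};

static struct my_stats *stats;   /* one instance per CPU */

static int setup_stats(void)
{
    stats = alloc_percpu(struct my_stats);
    return stats ? 0 : -ENOMEM;
}

static void teardown_stats(void)
{
    free_percpu(stats);
}

static void count_packet(void)
{
    /* get_cpu_ptr() disables preemption and returns this CPU's copy. */
    struct my_stats *s = get_cpu_ptr(stats);
    s->packets++;
    put_cpu_ptr(stats);
}

static unsigned long total_packets(void)
{
    unsigned long sum = 0;
    int cpu;

    /* Summing across CPUs requires per_cpu_ptr(); no locking is done
     * here, so the result is only approximate. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_possible(cpu))
            sum += per_cpu_ptr(stats, cpu)->packets;
    return sum;
}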
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet