Kernel development
Brief items
Kernel release status
The current stable 2.6 kernel is 2.6.16.1, released on March 27; 2.6.15.7 was released at the same time. Both patches contain a fair number of important fixes, some of which are security-related. There has been no 2.6 development prepatch released over the last week. Patches are flowing into the mainline git repository at a high rate, however; see below for a list.
The current -mm tree is 2.6.16-mm2. Recent changes to -mm include the ability to call poll() on sysfs files (LWN coverage), support for 64-bit I/O and memory resources, priority-inheriting futex support, and a new set of central time management patches.
Kernel development news
What's going into 2.6.17, part 2
The flood of patches heading into the mainline continues at full rate - though the merge window should be closing soon. The following are the highlights from code merged since last week's summary, starting with the user-visible changes:
- The lightweight robust
futexes patch.
- The software RAID (MD) layer can now handle on-the-fly resizing of
RAID5 arrays.
- Support for devfs has been removed from the SCSI subsystem, though it
remains in many other parts of the kernel.
- The user-space software
suspend patch.
- A big XFS update.
- An 802.11 software MAC implementation for wireless networking stacks.
Version 20 of the wireless extensions API was also merged.
- The reverse-engineered Broadcom
43xx driver has been merged. As a result, the list of wireless
network cards supported by Linux has just grown considerably.
- A "memory spreading" mechanism which can be used to spread page cache
and filesystem buffer allocations across all nodes of a NUMA system.
- Two new fadvise()
operations for controlling asynchronous file writeout behavior.
- Support for reordering functions in the linked kernel image. The idea
here is to put the highly-used bits of kernel code together so that
the highly-trafficked part of the kernel fits within a single TLB
entry. Currently, only x86-64 has the infrastructure for reordering.
- Multiple-block allocation and mapping has been added to the ext3
filesystem, improving performance for sequential file access patterns.
- A new scheduling domain has been added to represent multi-core
systems.
- A new RTC subsystem has been added, providing support for a variety of real-time hardware clocks.
Internal kernel API changes merged include:
- A new utility function has been added:
int execute_in_process_context(void (*fn)(void *data), void *data,
                               struct execute_work *work);
This function will arrange for fn() to be called in process context (where it can sleep). Depending on when execute_in_process_context() is called, fn() could be invoked immediately or delayed by way of a work queue. (A brief usage sketch appears after this list.)
- The SMP alternatives
patch.
- A rework of the relayfs API - but the sysfs interface has been left
out for now.
- A tracing mechanism for developers debugging block subsystem code.
- There is a new internal flag (FMODE_EXEC) used to indicate
that a file has been opened for execution.
- The obsolete MODULE_PARM() macro is gone forevermore.
- A new function, flush_anon_page(), can be used in conjunction
with get_user_pages() to safely perform DMA to anonymous
pages in user space.
- Zero-filled memory can now be allocated from slab caches with
kmem_cache_zalloc(). There is also a new slab debugging
option to produce a /proc/slab_allocators file with detailed
allocation information.
- There are four new ways of creating mempools:
mempool_t *mempool_create_page_pool(int min_nr, int order);
mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_kzalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_slab_pool(int min_nr, struct kmem_cache *cache);
The first creates a pool which allocates whole pages (the number of pages in each allocation being determined by order), while the second and third create pools backed by kmalloc() and kzalloc(), respectively. The fourth is a shorthand form for creating slab-backed pools.
- The prototype for hrtimer_forward() has changed:
unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now,
                              ktime_t interval);
The new now argument is expected to be the current time. This change allows some calls to be optimized. The data field has also been removed from the hrtimer structure.
- A whole set of generic bit operations (find first set, count set bits,
etc.) has been added, helping to unify this code across architectures
and subsystems.
- The inode f_ops pointer - which refers to the file_operations structure for the open file - has been marked const, and quite a bit of code that used to modify that structure has been changed to compensate. Similar changes have been made in many filesystems. "The goal is both to increase correctness (harder to accidentally write to shared datastructures) and reducing the false sharing of cachelines with things that get dirty in .data (while .rodata is nicely read only and thus cache clean)."
If the usual pattern holds, the merging of new features will stop sometime around the end of the month, with 2.6.17-rc1 being released shortly thereafter.
A framework for page replacement policies
"Holy cow."
That was Andrew Morton's reaction to a 34-part patch, posted by Peter Zijlstra, which creates an abstract API for page replacement policies. The page replacement code is at the core of the virtual memory system; it is, essentially, a set of heuristics which must decide which pages should be evicted from main memory and made available for other uses. Page replacement is a bit of a black art; it is easy to see when a system is managing memory poorly, but the path to improvements is often far from clear. Memory management in Linux was a sore point for many years, but it seems to work well for most loads now. Given that all this tricky code has finally been beaten into reasonably good shape, why would anybody want to mess with it now?
The answer is that there is quite a bit of research work going into alternative page replacement mechanisms, and Linux might just be able to benefit from some of that work. After all, few people would say that Linux virtual memory works so well that there is no room for improvement.
This massive patch set creates an API for page replacement algorithms, allowing them to be changed at will. Or, at least, changed at reboot; there is currently no provision for loading replacement algorithms as modules or swapping them out on the fly. But, by selecting a page replacement scheme at kernel configuration time, system administrators can choose one which best suits their workload. Virtual memory hackers and others can play with different algorithms to see how they work out. And there is no need to pick one in particular as the page replacement algorithm for the Linux kernel.
To work with this API, a page replacement algorithm must define a set of specific functions. Thus, for example, there is a pair of initialization functions:
void page_replace_init(void);
void page_replace_init_zone(struct zone *zone);
These functions, called at boot time, prepare the page replacement code to work with the system it finds itself running on.
When the core kernel knows something about the use of specific pages, it can tell the replacement algorithm with these calls:
void page_replace_hint_active(struct page *page);
void page_replace_hint_use_once(struct page *page);
The first is called when the kernel notes that the page is in active use, while the second indicates that the page is unlikely to be used again in the near future.
There are various other functions for helping with the housekeeping, but the core of the API is this function:
void page_replace_candidates(struct zone *zone, int count,
struct list_head *list);
This function must select up to count pages from the given zone as candidates for eviction. This is where the page replacement code will gaze into its crystal ball to figure out which pages will not be used again anytime soon; those are the ones which will be singled out and passed back to the core kernel.
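To make the shape of this interface concrete, here is a rough, purely illustrative sketch of how a trivial policy might implement it; the per-zone policy_list used below is an invented placeholder rather than anything from the actual patch set, and locking is omitted.

#include <linux/mm.h>

void page_replace_candidates(struct zone *zone, int count,
                             struct list_head *list)
{
    /* Hand back up to count eviction candidates from this policy's
       private (hypothetical) list of pages, oldest first. */
    while (count-- > 0 && !list_empty(&zone->policy_list)) {
        struct page *page = list_entry(zone->policy_list.prev,
                                       struct page, lru);
        /* Move the chosen victim onto the list supplied by the core VM. */
        list_move(&page->lru, list);
    }
}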
Quite a few other functions exist. They deal with issues like page migration, tracking of non-resident pages, printing out information from the page replacement code, and more. See the documentation file for a full list and brief explanation of those other functions.
The patch set also contains four different page replacement mechanisms. One is the modified least-recently-used (LRU) code found in current kernels, reworked to use the new API. Another is the CLOCK-PRO algorithm, covered here last August. There is an implementation of the CART technique, discussed in this paper [PDF]. Then there is a simple random replacement scheme, seemingly just for the fun of it. Actually, the random replacement patch is, due to its simplicity, a good place to start for somebody interested in seeing what a modularized page replacement algorithm looks like.
This patch looks somewhat similar to the pluggable CPU scheduler patch, which allows the scheduling algorithm to be changed. That patch continues to be maintained but, since its initial posting in 2004, has never been seriously considered for inclusion into the mainline kernel. There is a strong preference for figuring out what's wrong - if anything - with the current code and fixing it, rather than creating a mechanism for playing with entirely different implementations; Andrew Morton's followup to his initial response took much the same line.
Linus has a similar opinion and, additionally, is not convinced that page replacement is really an issue needing a great deal of attention: "It smells like university research to me."
The proponents of this patch respond that there are, indeed, situations where the current code falls apart. Given that, the next logical step would seem to be gathering information on the cases where Linux memory management fails. Then the developers can start to think about what needs to be done to address those failures. Even if the page replacement framework patches are never merged, it looks like they may help to drive forward the next phase of work in Linux memory management algorithms. That should be a good thing regardless.
The new pselect() system call
Applications like network servers that need to monitor multiple file descriptors using select(), poll(), or (on Linux) epoll_wait() sometimes face a problem: how to wait until either one of the file descriptors becomes ready, or a signal (say, SIGINT) is delivered. These system calls, as it turns out, do not interact entirely well with signals.
A seemingly obvious solution would be to write an empty handler for the signal, so that the signal delivery interrupts the select() call:
#include <sys/select.h>
#include <signal.h>

static void handler(int sig) { /* do nothing */ }

int main(int argc, char *argv[])
{
    fd_set readfds;
    struct sigaction sa;
    int nfds, ready;

    sa.sa_handler = handler;         /* Establish signal handler */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    /* ... initialize nfds and readfds ... */

    ready = select(nfds, &readfds, NULL, NULL, NULL);

    /* ... */
After select() returns we can determine what happened by looking at the function result and errno. If errno comes back as EINTR, we know that the select() call was interrupted by a signal, and can act accordingly. But this solution suffers from a race condition: if the SIGINT signal is delivered after the call to sigaction(), but before the call to select(), it will fail to interrupt that select() call and will thus be lost.
We can try playing various games like setting a global flag within the signal handler and monitoring that flag in the main program, and using sigprocmask() to block the signal until just before the select() call. However, none of these techniques can entirely eliminate the race condition: there is always some interval, no matter how brief, where the signal could be handled before the select() call is started.
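To see why, consider a sketch (not taken from any real application) of the flag-based approach; the window between checking the flag and entering the kernel is small, but it never goes away.

static volatile sig_atomic_t got_sigint = 0;

static void handler(int sig) { got_sigint = 1; }

/* ... */

if (!got_sigint) {
    /* A SIGINT arriving right here sets the flag, but it is already too
       late to stop the following call from blocking indefinitely. */
    ready = select(nfds, &readfds, NULL, NULL, NULL);
}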
The traditional solution to this problem is the so-called self-pipe trick, often credited to D J Bernstein. Using this technique, a program establishes a signal handler that writes a byte to a specially created pipe whose read end is also monitored by the select(). The self-pipe trick cleverly solves the problem of safely waiting either for a file descriptor to become ready or a signal to be delivered. However, it requires a relatively large amount of code to implement a requirement that is essentially simple. (For example, a robust solution requires marking both the read and write ends of the pipe non-blocking.)
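In outline, the trick looks something like this (error handling and the surrounding program are omitted; readfds and nfds are the variables from the earlier example, and nfds must be large enough to cover pipefd[0]):

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

static int pipefd[2];                /* the "self-pipe" */

static void handler(int sig)
{
    int saved = errno;
    write(pipefd[1], "x", 1);        /* async-signal-safe; may fail if the pipe is full */
    errno = saved;
}

/* ... before installing the handler ... */
pipe(pipefd);
fcntl(pipefd[0], F_SETFL, fcntl(pipefd[0], F_GETFL) | O_NONBLOCK);
fcntl(pipefd[1], F_SETFL, fcntl(pipefd[1], F_GETFL) | O_NONBLOCK);

/* ... the pipe's read end is monitored along with the other descriptors ... */
FD_SET(pipefd[0], &readfds);
ready = select(nfds, &readfds, NULL, NULL, NULL);
if (ready > 0 && FD_ISSET(pipefd[0], &readfds)) {
    char ch;
    while (read(pipefd[0], &ch, 1) > 0)
        ;                            /* drain the pipe; a signal was delivered */
}

Even in this stripped-down form, that is a fair amount of machinery for what is conceptually a simple requirement.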
For this reason, the POSIX.1g committee devised an enhanced version of select(), called pselect(). The major difference between select() and pselect() is that the latter call has a signal mask (sigset_t) as an additional argument:
int pselect(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
const struct timespec *timeout, const sigset_t *sigmask);
The sigmask argument specifies a set of signals that should be blocked during the pselect() call; it overrides the current signal mask for the duration of that call.
So, when we make the following call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds,
timeout, &sigmask);
the kernel performs a sequence of steps that
is equivalent to atomically performing the following system calls:
sigset_t sigsaved;
sigprocmask(SIG_SETMASK, &sigmask, &sigsaved);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
sigprocmask(SIG_SETMASK, &sigsaved, NULL);
For some time now, glibc has provided a library implementation of pselect() that actually uses the above sequence of system calls. The problem is that this implementation remains vulnerable to the very race condition that pselect() was designed to avoid, because the separate system calls are not executed as an atomic unit.
Using pselect(), we can safely wait for either a signal to be delivered or a file descriptor to become ready, by replacing the first part of our example program with the following code:
sigset_t emptyset, blockset;
sigemptyset(&blockset); /* Block SIGINT */
sigaddset(&blockset, SIGINT);
sigprocmask(SIG_BLOCK, &blockset, NULL);
sa.sa_handler = handler; /* Establish signal handler */
sa.sa_flags = 0;
sigemptyset(&sa.sa_mask);
sigaction(SIGINT, &sa, NULL);
/* Initialize nfds and readfds, and perhaps do other work here */
/* Unblock signal, then wait for signal or ready file descriptor */
sigemptyset(&emptyset);
ready = pselect(nfds, &readfds, NULL, NULL, NULL, &emptyset);
...
This code works because the SIGINT signal is only unblocked once control has passed to the kernel. As a result, there is no point where the signal can be delivered before pselect() executes. If the signal is generated while pselect() is blocked, then, as with select(), the system call is interrupted, and the signal is delivered before the system call returns.
Although pselect() was conceived several years ago, and was already publicized in 1998 by W. Richard Stevens in his Unix Network Programming, vol. 1, 2nd ed., actual implementations have been slow to appear. Their eventual appearance in recent releases of various Unix implementations has been driven in part by the fact that the 2001 revision of the POSIX.1 standard requires a conforming system to support pselect(). With the 2.6.16 kernel release, and the required wrapper function that appears in the recently released glibc 2.4, pselect() also becomes available on Linux.
Linux 2.6.16 also includes a new (but nonstandard) ppoll() system call, which adds a signal mask argument to the traditional poll() interface:
int ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *timeout,
const sigset_t *sigmask);
This system call adds the same functionality to poll() that pselect() adds to select(). Not to be left out in the cold, the epoll maintainer has patches in the pipeline to add similar functionality in the form of a new epoll_pwait() system call.
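Reusing the emptyset mask from the pselect() example above, a ppoll() call waiting indefinitely on a single (assumed) descriptor sock_fd might look like this:

struct pollfd pfd;

pfd.fd = sock_fd;                    /* descriptor to monitor */
pfd.events = POLLIN;

sigemptyset(&emptyset);              /* unblock all signals during the call */
ready = ppoll(&pfd, 1, NULL, &emptyset);   /* NULL timeout: wait forever */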
There are a few other minor differences between pselect() and ppoll() and their traditional counterparts. For example, the timeout argument is a struct timespec:
struct timespec {
    long tv_sec;     /* Seconds */
    long tv_nsec;    /* Nanoseconds */
};
This allows the timeout interval to be specified with greater precision than
is available with the older system calls.
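Reusing the variables from the earlier example, a 5.5-second timeout could be expressed as:

struct timespec timeout;

timeout.tv_sec = 5;                  /* five seconds... */
timeout.tv_nsec = 500000000;         /* ...plus 500,000,000 ns, i.e. 5.5 seconds total */

ready = pselect(nfds, &readfds, NULL, NULL, &timeout, &emptyset);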
The glibc wrappers for pselect() and ppoll() also hide a couple of details of the underlying system calls.
First, the system calls actually expect the signal mask argument to be described by two arguments, one of which is a pointer to a sigset_t structure, while the other is an integer that indicates the size of that structure in bytes. This allows for the possibility of a larger sigset_t type in the future.
The underlying system calls also modify their timeout argument so that on an early return (because a file descriptor became ready, or a signal was delivered), the caller knows how much of the timeout remained. However, the respective wrapper functions hide this detail by making a local copy of the timeout argument and passing that copy to the underlying system calls. (The Linux select() system call also modifies its timeout argument, and this behavior is visible to applications. However, many other select() implementations don't modify this argument. POSIX.1 permits either behavior in a select() implementation.)
Further details of pselect() and ppoll() can be found in the latest versions of the select(2) and poll(2) man pages.
