
Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.63, released by Linus on February 24. It includes an ISAPnP update, some IDE changes (see last week's Kernel Page), an ACPI update, various architecture updates, a new x86 "double fault" handler, a Bluetooth update, and the inevitable set of spelling fixes. The long-format changelog has the details.

Linus's BitKeeper tree includes some more loadable module fixes, more spelling fixes ("A 'wether' is a castrated goat"), a uClinux update, an XFS update, a software suspend update, and various other fixes and performance improvements.

The current stable kernel is 2.4.20. Marcelo has promised a new 2.4.21 prepatch soon, but it had not appeared as of this writing.

The current 2.4 prepatch from Alan Cox is 2.4.21-pre4-ac6; it adds mostly driver updates.

Comments (3 posted)

Kernel development news

The object-based reverse-mapping VM

The reverse-mapping VM (RMAP) was merged into 2.5 to solve a specific problem: there was no easy way for the kernel to find out which page tables referred to a given physical page. Certain activities - swapping being at the top of the list - require making changes to all relevant page tables. You simply cannot swap a page to disk until all of the page table entries pointing to it have been invalidated. The 2.4 kernel handles swapping by scanning through the page tables, one process at a time, and invalidating entries for pages that look like suitable victims. If it happens to find all of the page table entries in time, the page can then be evicted to disk.

In 2.5, a new data structure was added to make this process easier. Initially each page in the system (as represented by its struct page structure in the system memory map) had a linked list of reverse mapping entries pointing to every page table entry referencing that page. That worked, but it introduced some problems of its own. The reverse mapping entries took up a lot of memory, and quite a bit of time to maintain. Operations which required working with a lot of pages slowed down. And the fork() system call, which must add a new reverse mapping entry for every page in the process's address space, slowed significantly. As a result, there has been an ongoing effort to mitigate RMAP's costs.

Now a new technique, as embodied in this patch by Dave McCracken, has been proposed. This approach, called "object-based reverse mapping," is based on the realization that, in some cases at least, there are other paths from a struct page to a page table entry. If those paths can be used, the full RMAP overhead is unnecessary and can be cut out.

By one reckoning, there are two basic types of user-mode page in a Linux system. Anonymous pages are just plain memory, the kind a process would get from malloc(). Most other pages are file-backed in some way; this means that, behind the scenes, the contents of that page are associated with a file somewhere in the system. File-backed pages include program code and files mapped in with mmap(). For these pages, it is possible to find their page table entries without using RMAP entries. To see how, let us refer to the following low-quality graphic, the result of your editor's nonexistent drawing skills:

[Cheezy drawing]

The struct page structure for a given page is in the upper left corner. One of the fields of that structure is called mapping; it points to an address_space structure describing the object which backs up that page. That structure includes the inode for the file, various data structures for managing the pages belonging to the file, and two linked lists (i_mmap and i_mmap_shared) containing the vm_area_struct structures for each process which has a mapping into the file. The vm_area_struct (usually called a "VMA") describes how the mapping appears in a particular process's address space; the file /proc/pid/maps lists out the VMAs for the process with ID pid. The VMA provides the information needed to find out what a given page's virtual address is in that process's address space, and that, in turn, can be used to find the correct page table entry.

So all the object-based RMAP patch does is remove the direct reverse mapping entry (pointing from the page structure directly to the page table entry). When it is necessary to find that entry, the virtual memory subsystem simply takes the longer way around, via the address_space and vm_area_struct structures. Finding a page table entry this way certainly will take longer than following a direct pointer, but it should come out cheaper when one considers all of the RMAP information that no longer needs to be maintained.

The object-based RMAP patch does not change the handling of anonymous pages, which do not have an associated address_space structure.

Martin Bligh has posted some initial benchmarks showing some moderate improvement in the all-important kernel compilation test. The object-based approach does seem to help with some of the worst RMAP performance regressions. Andrew Morton pointed out a worst-case performance scenario for this approach, but it is not clear how big a problem it would really be. Andrew has included this patch in his 2.5.62-mm3 tree.

Assuming that this patch goes in (it's late in the development process, but that hasn't stopped Linus from taking rather more disruptive VM patches before...), one might wonder if a complete object-based implementation might follow. The answer is "probably not." Anonymous pages tend to be private to individual processes, so there is no long chain of reverse mappings to manage in any case. So even if such pages came to look like file-backed pages (as could happen, say, with a rework of the swapping code), there isn't necessarily much to be gained from the object-based approach.

Comments (3 posted)

Page clustering

The object-based RMAP patch is one approach to reducing the overhead of the virtual memory subsystem. William Lee Irwin has posted another: page clustering. Much of the VM subsystem's overhead is per-page; each page requires a memory map entry, possibly RMAP chains, etc. One way of reducing that overhead, clearly, would be to have fewer pages. Since most users will react poorly to suggestions that they remove memory from their systems, the only feasible way of reducing the page count is to make the pages themselves bigger.

The page clustering patch (based on work originally done by Hugh Dickins) works by taking physical pages (as seen by the hardware) and grouping them into larger, virtual pages as seen by the kernel. x86 hardware works (normally) with 4K pages; with page clustering the kernel can work with pages as large as 32K (according to the comments in page.h) or 64K (according to what the code is actually doing). Thus, the page count (and associated overhead) can be reduced by a factor of up to 16.

This idea is not particularly new; early versions of BSD clustered the 512-byte pages provided by VAX systems into 1024-byte internal pages. Still, it's a bit tricky to implement inside the Linux kernel. Much kernel code thinks it understands the concept of the "page size," but, with this patch, there are two different page sizes. Code dealing with the hardware memory management unit (MMU) must work on the MMU's terms, while code working with kernel pages should see the larger size. The result is a great deal of work trying to figure out whether each bit of code should be working with PAGE_SIZE units, or the new MMUPAGE_SIZE. It is not a job for the faint of heart.

This patch is, for now, not for casual users; by William's admission a number of things are still broken. But, fear not: "I've yet to encounter non-fsck-recoverable filesystem corruption with remotely current sources." Even when the problems are fixed, this patch looks fairly involved for 2.5 at this point. But, one never knows.

Comments (1 posted)

Threads and /proc

One result of all the work that was done with improved threading support in the 2.5 kernel is that threads stopped showing up in the /proc filesystem. Most people don't miss them, but there are reasons for wanting to be able to deal with individual threads through /proc. The main problems have been usability and performance. If you are running a system with thousands of threads, /proc becomes rather large and difficult to work with. It's also slow. Ingo Molnar found that, with 16,000 threads in /proc, the top utility took 22 seconds to work through them all.

The result of Ingo's work, of course, is a patch improving the situation. The first thing Ingo did was to create a "lookup cursor" that gets stashed into the file structure for a process that is digging through /proc. That cursor caches the current state of the directory read operation and greatly speeds the process of reading through a large /proc directory. Ingo also added some new process information so that the thread group leader can be queried for cumulative information on the whole group.

Nobody complained much about those changes; there was one other, though, that was a bit more controversial. With Ingo's patch, threads show up in /proc with a period in front of the process ID. Thus, a normal process might be represented as /proc/1234, while a thread would, instead, be /proc/.1234. That change makes it easy for applications to distinguish threads from "full" processes; it also has the effect of hiding threads from a casual /proc directory listing.

Unsurprisingly, a number of developers (including Linus) see the period as being a bit of a hack. Wouldn't it be better to put threads in a subdirectory under the thread group leader's ID? Linus even posted a quick patch showing how he thought it could be done. A new patch from Ingo has not yet appeared, but it seems likely that the next revision will put threads into subdirectories. At that point, threads will probably return to /proc in the 2.5 kernel.

And /proc will remain fast even with large numbers of threads; Ingo's 16,000-thread top case went from 22 seconds to 0.16 seconds.

Comments (none posted)

Driver porting

New articles in the driver porting series

Below you'll find two new articles in the LWN driver porting series; they deal with timekeeping and safe sleeping. Since last week we have also added an article on working with the preemptible kernel and an updated description of the 2.5 workqueue interface. Those articles, and all the others, can be found on the driver porting page.

Comments (none posted)

Driver porting: Timekeeping changes

This article is part of the LWN Porting Drivers to 2.6 series.
One might be tempted to think that the basic task of keeping track of the time would not change that much from one kernel to the next. And, in fact, most kernel code which worries about times (and time intervals) will likely work unchanged in the 2.6 kernel. Code which gets into the details of how the kernel manages time may well need to adapt to some changes, however.

Internal clock frequency

One change which shouldn't be problematic for most code is the change in the internal clock rate on the x86 architecture. In previous kernels, HZ was 100; in 2.6 it has been bumped up to 1000. If your code makes any assumptions about what HZ really was (or, by extension, what jiffies really signified), you may have to make some changes now. For what it's worth, as of 2.6.0-test9, the default values of HZ in the mainline kernel source (which sometimes lags the architecture-specific trees) are as follows: Alpha: 1024/1200; ARM: 100/128/200/1000; CRIS: 100; i386: 1000; IA-64: 1024; M68K: 100; M68K-nommu: 50-1000; MIPS: 100/128/1000; MIPS64: 100; PA-RISC: 100/1000; PowerPC32: 100; PowerPC64: 1000; S/390: 100; SPARC32: 100; SPARC64: 100; SuperH: 100/1000; UML: 100; v850: 24-100; x86-64: 1000.

Kernel time variables

When the internal clock rate on a 32-bit system is set to 1000, the classic 32-bit jiffies variable will overflow in just over 49 days. Overflows could always happen on systems with a long uptime, but, when it took well over a year of uptime, it was a relatively rare occurrence - even on Linux systems. It is not uncommon at all, however, for a system to be up for more than 50 days. In most cases, having jiffies wrap around is not a real problem; it can be inconvenient for tasks like process accounting, however. So the 2.5 kernel has a new counter called jiffies_64. With 64 bits to work with, jiffies_64 will not wrap around in a time frame that need concern most of us - at least until some future kernel starts using a gigahertz internal clock.

For what it's worth, on most architectures, the classic, 32-bit jiffies variable is now just the least significant half of jiffies_64.

Note that, on 32-bit systems, a 64-bit jiffies value raises concurrency issues. It is deliberately not declared as a volatile value (for performance reasons), so the possibility exists that code like:

    u64 my_time = jiffies_64;

could get an inconsistent version of the variable, where the top and bottom halves do not match. To avoid this possibility, code accessing jiffies_64 should use xtime_lock, which became a seqlock as of 2.5.60. In most cases, though, it will be easier to just use the convenience function provided by the kernel:

    #include <linux/jiffies.h>

    u64 my_time = get_jiffies_64();

Users of the internal xtime variable will notice a couple of similar changes. One is that xtime, too, is now protected by xtime_lock (as it is in 2.4 as of 2.4.10), so any code which plays around with disabling interrupts or such before accessing xtime will need to change. The best solution is probably to use:

    struct timespec current_kernel_time(void);

which takes care of locking for you. xtime is also now a struct timespec rather than a struct timeval; the difference is that the sub-second field, tv_nsec, is in nanoseconds.

Timers

The kernel timer interface is essentially unchanged since 2.4, with one exception. The new function:

    void add_timer_on(struct timer_list *timer, int cpu);

will cause the timer function to run on the given CPU when the expiration time hits.

Delays

The 2.5 kernel includes a new macro ndelay(), which delays for a given number of nanoseconds. It can be useful for interactions with hardware which insists on very short delays between operations. On most architectures, however, ndelay(n) simply falls back to udelay(1) for delays of less than one microsecond.

POSIX clocks

The POSIX clocks patch (merged into 2.5.63) is beyond the scope of this article. If you are working with a device which can provide an interesting time service (high resolution or high accuracy), you may want to consider using it to drive a POSIX clock. Look into kernel/posix-timers.c for more information.

Comments (2 posted)

Driver porting: sleeping and waking up

This article is part of the LWN Porting Drivers to 2.6 series.
Contrary to expectations, the classic functions sleep_on() and interruptible_sleep_on() were not removed in the 2.5 series. It seems that they are still needed in a few places where (1) taking them out is quite a bit of work, and (2) they are actually used in a way that is safe. Most authors of kernel code should, however, pretend that those functions no longer exist. There are very few situations in which they can be used safely, and better alternatives exist.

wait_event() and friends

Most of those alternatives have been around since 2.3 or earlier. In many situations, one can use the wait_event() macros:

    DECLARE_WAIT_QUEUE_HEAD(queue);

    wait_event(queue, condition);
    int wait_event_interruptible(queue, condition);

These macros work the same as in 2.4: condition is a boolean condition which will be tested within the macro; the wait will end when the condition evaluates true.

It is worth noting that these macros have moved from <linux/sched.h> to <linux/wait.h>, which seems a more sensible place for them. There is also a new one:

    int wait_event_interruptible_timeout(queue, condition, timeout);

which will terminate the wait if the timeout expires.

prepare_to_wait() and friends

In many situations, wait_event() does not provide enough flexibility - often because tricky locking is involved. The alternative in those cases has been to do a full "manual" sleep, which involves the following steps (shown here in a sort of pseudocode, of course):

    DECLARE_WAIT_QUEUE_HEAD(queue);
    DECLARE_WAITQUEUE(wait, current);

    for (;;) {
        add_wait_queue(&queue, &wait);
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        schedule();
        remove_wait_queue(&queue, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS;
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&queue, &wait);

A sleep coded in this manner is safe against missed wakeups. It is also a fair amount of error-prone boilerplate code for a very common situation. In 2.6, a set of helper functions has been added which makes this task easier. The modern equivalent of the above code would look like:

    DECLARE_WAIT_QUEUE_HEAD(queue);
    DEFINE_WAIT(wait);

    while (! condition) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        if (! condition)
            schedule();
        finish_wait(&queue, &wait);
    }

prepare_to_wait_exclusive() should be used when an exclusive wait is needed. Note that the new macro DEFINE_WAIT() is used here, rather than DECLARE_WAITQUEUE(). The former should be used when the wait queue entry is to be used with prepare_to_wait(), and should probably not be used in other situations unless you understand what it is doing (which we'll get into next).

Wait queue changes

In addition to being more concise and less error prone, prepare_to_wait() can yield higher performance in situations where wakeups happen frequently. This improvement is obtained by causing the process to be removed from the wait queue immediately upon wakeup; that removal keeps the process from seeing multiple wakeups if it doesn't otherwise get around to removing itself for a bit.

The automatic wait queue removal is implemented via a change in the wait queue mechanism. Each wait queue entry now includes its own "wake function," whose job it is to handle wakeups. The default wake function (which has the unsurprising name default_wake_function()) behaves in the customary way: it sets the waiting task into the TASK_RUNNING state and handles scheduling issues. The DEFINE_WAIT() macro creates a wait queue entry with a different wake function, autoremove_wake_function(), which automatically takes the newly-awakened task out of the queue.

And that, of course, is how DEFINE_WAIT() differs from DECLARE_WAITQUEUE() - they set different wake functions. How the semantics of the two differ is not immediately evident from their names, but that's how it goes. (The new runtime initialization function init_wait() differs from the older init_waitqueue_entry() in exactly the same way).

If need be, you can define your own wake function - though the need for that should be quite rare (about the only user, currently, is the support code for the epoll() system calls). The wake function has this prototype:

    typedef int (*wait_queue_func_t)(wait_queue_t *wait, 
                                     unsigned mode, int sync);

A wait queue entry can be given a different wakeup function with:

    void init_waitqueue_func_entry(wait_queue_t *queue, 
                                   wait_queue_func_t func);

One other change that most programmers won't notice: a bunch of wait queue cruft from 2.4 (two different kinds of wait queue lock, wait queue debugging) has been removed from 2.6.

Comments (7 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Andrew Morton 2.5.62-mm2
Andrew Morton 2.5.62-mm3
William Lee Irwin III pgcl-2.5.63-1

Networking

Max Krasnyansky Bluetooth updates for 2.5.62
Kazunori Miyazawa IPv6 IPSEC support

Benchmarks and bugs

Miscellaneous

Greg KH klibc for 2.5.63

Page editor: Jonathan Corbet


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds