
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.6, which was announced by Linus on May 9. Changes since the last prepatch include an NTFS update, an XFS update, some small virtual memory patches, an ACPI update, various architecture updates, and lots of fixes. The list of changes since 2.6.5 is much more extensive, including POSIX message queues, significant ext2 and ext3 filesystem performance improvements, the "laptop mode" patch, 4KB stacks for the i386 architecture, non-executable stack support for several architectures, a big reiserfs update, the lightweight auditing framework, the "completely fair queueing" I/O scheduler, TCP "Vegas" congestion avoidance, and much more. The long-format changelog has the details.

As of this writing, no 2.6.7 prepatches have been released. Patches are accumulating in Linus's BitKeeper repository, however; they include a libata update, some architecture updates, the scheduling domains patch set (covered here last month), the removal of the InterMezzo filesystem due to lack of use and support, a sysctl variable giving "huge page" access to an administrator-specified group (see below), the ability to re-enable interrupts while waiting in spin_lock_irqsave() (for all architectures now), support in reiserfs for quotas and external attributes (added over Hans Reiser's objections), and lots of fixes.

The current prepatch from Andrew Morton is 2.6.6-mm1. Recent additions to -mm include backing store for sysfs (covered here last February), a number of patches for shrinking the heavily-used dentry structure, another set of (relatively small) virtual memory patches, ia64 hotplug CPU support, a generic qsort() function for the kernel, and the usual pile of fixes.

The current 2.4 kernel is 2.4.26; no 2.4.27 prepatches have been released since 2.4.27-pre2 came out on May 3.


Kernel development news

Magic groups in 2.6

The 2.6.6-mm1 tree includes, among many other things, patches which add two new /proc/sys variables. They are:

/proc/sys/vm/hugetlb_shm_group
If this value is non-zero, it is interpreted as a group ID which gives access to the "huge pages" feature of the 2.6 VM.

/proc/sys/vm/mlock_group
This variable behaves similarly, but it controls access to the mlock() system call (which locks memory into physical RAM) instead.

The current Linux kernel will not allow a process to perform either of the above actions unless that process has the CAP_IPC_LOCK capability; in practice, this means that the process needs to run as root. The main user of huge pages would appear to be a small program called "Oracle," which is something that many users would rather not run with root privileges. The new sysctl variables allow an administrator to give the ability to use huge pages (and mlock()) to a specific group; if Oracle runs within that group, it will be able to do what it needs without higher privileges.
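
As a rough sketch of what this enables, the fragment below shows an unprivileged process asking for a huge-page shared memory segment, assuming the administrator has already written the appropriate group ID into /proc/sys/vm/hugetlb_shm_group (with sysctl or a simple echo) and that the process runs with that group. The 16MB size and the fallback SHM_HUGETLB definition are illustrative assumptions, not something taken from the patch itself.

    /*
     * Sketch: unprivileged use of huge pages via the magic group.
     * Assumes the administrator has done something like
     *     echo <gid> > /proc/sys/vm/hugetlb_shm_group
     * and that this process runs with that group ID; the 16MB size
     * is arbitrary.
     */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000           /* back the segment with huge pages */
    #endif

    int main(void)
    {
        int id;
        char *p;

        /* Without CAP_IPC_LOCK this succeeds only if our group matches
         * the hugetlb_shm_group sysctl. */
        id = shmget(IPC_PRIVATE, 16UL * 1024 * 1024,
                    SHM_HUGETLB | IPC_CREAT | 0600);
        if (id < 0) {
            perror("shmget(SHM_HUGETLB)");
            return 1;
        }

        p = shmat(id, NULL, 0);
        if (p == (char *) -1) {
            perror("shmat");
            return 1;
        }
        p[0] = 1;                       /* touch the segment */

        shmdt(p);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }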

These patches are not universally popular; the addition of "magic groups" with special meaning inside the kernel strikes many developers as an inelegant, un-Unix-like solution to the problem. So these developers were not happy when the hugetlb_shm_group patch was merged for 2.6.7 shortly after appearing in the -mm tree. Rather than rush an ugly hack into the kernel (which will then have to be supported indefinitely into the future), they argue, it would have been better to come up with a proper solution.

The problem, it seems, is that there are no better solutions on the horizon. Says Andrew Morton:

Capabilities are broken and don't work. Nobody has a clue how to provide the required services with SELinux and nobody has any code and we need the feature *now* before vendors go shipping even more ghastly stuff.

The problems with capabilities were covered here back in April, when this issue last came up. SELinux can, in principle, solve this problem, but there is the little disadvantage that nobody has been able to put together a production-ready, working distribution with SELinux yet. The distributors have been creating their own patches to enable Oracle to use the huge pages feature, and many of those are seen as being worse than the "magic groups" approach. Rather than see each distribution take the kernel in a different direction, Andrew merged the magic groups patch as the least evil alternative:

Nasty workarounds will be shipped to end users by vendors. That's a certainty. We cannot change this now. What I wish to do is to ensure that all users receive the *same* nasty workaround. Call it damage control.

To some, however, the control appears worse than the damage. If vendors add their own hacks, they take responsibility for maintaining those hacks, or for weaning users off of them at some future time. Pulling features out of the mainline kernel is harder. Be that as it may, for lack of a better short-term solution the "magic groups" patch is now part of 2.6.


4K Stacks in 2.6

Traditionally, the Linux kernel has used 8KB kernel stacks on most architectures. That stack must suffice for any sequence of calls that may result from a system call - plus the needs of any (hard or soft) interrupt handlers that may be invoked at the same time. In practice, stack overflows are pretty much unheard of in stable kernels; the kernel developers have long since learned to avoid large automatic variables, recursive functions, and other things which can use large amounts of stack space.

There have been patches circulating for some time now which reduce the kernel stack to 4KB. It is generally understood that the switch to smaller stacks will happen at some point; as a result, much work has recently gone into finding code paths in the kernel which are overly stack-hungry. Part of that effort is simply lots of testing; for that reason, recent -mm kernels no longer even offer an 8KB stack option. The hope is that, if enough people try out the smaller stacks and shake out the bugs, 4KB stacks can be merged into 2.6 in the near future.

The smaller stacks are scary to some people; it is hard to be certain that all of the possible paths through the kernel have actually been tested. 4KB stacks also break binary-only modules, the nVidia drivers in particular. So there is a certain amount of pressure to defer this change to 2.7.

One might well wonder why the kernel hackers are trying to put this sort of change into a stable kernel series. The problem with 8KB stacks is that they require an "order 1" memory allocation: two pages which are contiguous in physical memory. Order 1 allocations can be very hard to satisfy once the system has been running for a while; physical memory can become so fragmented that two adjacent free pages simply do not exist. The kernel will try hard to free up pages to satisfy larger allocations; the result can be a slow, painful, thrashing system.

Each process on the system has its own kernel stack, which is used whenever the system goes into kernel mode while that process is running. Since each process requires a kernel stack, the creation of a new process requires an order 1 allocation. So the two-page kernel stacks can limit the creation of new processes, even though the system as a whole is not particularly short of resources. Shrinking kernel stacks to a single page eliminates this problem and makes it easy for Linux systems to handle far more processes at any given time.
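
A simplified sketch of the allocation involved may make the point clearer; this is not the actual task-creation code, and STACK_ORDER, alloc_kernel_stack() and free_kernel_stack() are invented names standing in for the architecture's real stack-size configuration. With the traditional 8KB stacks the request is order 1 (two contiguous pages); with the 4KB option it drops to order 0, which the page allocator can almost always satisfy.

    /*
     * Illustrative sketch only - not the actual fork() path.  With
     * the 4K-stacks option the kernel stack fits in a single page and
     * the allocation becomes order 0.
     */
    #include <linux/gfp.h>

    #ifdef CONFIG_4KSTACKS
    #define STACK_ORDER 0           /* one 4KB page per kernel stack  */
    #else
    #define STACK_ORDER 1           /* two contiguous pages (8KB)     */
    #endif

    unsigned long alloc_kernel_stack(void)
    {
        /* __get_free_pages() must find 2^STACK_ORDER physically
         * contiguous free pages; the order-1 requests are the ones
         * that become hard to satisfy on a fragmented machine. */
        return __get_free_pages(GFP_KERNEL, STACK_ORDER);
    }

    void free_kernel_stack(unsigned long stack)
    {
        free_pages(stack, STACK_ORDER);
    }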

Arjan van de Ven also made the interesting claim that the 4KB stacks are actually safer. His reasoning has to do with one other aspect of the 4KB stack patch: it moves interrupt handling onto a separate, dedicated stack, and software interrupts get their own stack as well. Since interrupt handling no longer has to share the per-process kernel stack, the space available to the system call path is about what it was with the shared 8KB stacks, while interrupts now have more dedicated room than before.

The final decision on the integration of 4KB stacks has not yet been made; there are, seemingly, a few problems which still need to be tracked down. If things settle out, however, this fairly significant change could yet be merged into 2.6.


Deleting timers quickly

Kernel timers are a mechanism which allows kernel code to request that a function be called, in software interrupt context, after a given period of time has passed. They are heavily used for all sorts of delays and deferred actions within the kernel. The timer interface has been relatively stable for some time; it has not changed greatly in 2.6. Linux Device Drivers, Chapter 6 covers the timer interface in some detail.
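
For those who have not used it, a minimal sketch of the 2.6-style interface looks something like the following; my_dev, my_timeout() and the one-second delay are invented for illustration.

    /* Minimal sketch of 2.6-style timer usage; the names are invented. */
    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #include <linux/param.h>

    struct my_dev {
        struct timer_list timer;
        int ticks;
    };

    /* Runs in software interrupt context once the delay has expired. */
    static void my_timeout(unsigned long data)
    {
        struct my_dev *dev = (struct my_dev *) data;

        dev->ticks++;
    }

    static void my_start(struct my_dev *dev)
    {
        init_timer(&dev->timer);
        dev->timer.function = my_timeout;
        dev->timer.data     = (unsigned long) dev;
        dev->timer.expires  = jiffies + HZ;     /* about one second */
        add_timer(&dev->timer);
    }

    static void my_stop(struct my_dev *dev)
    {
        /* See below: ensures the handler is neither queued nor running. */
        del_timer_sync(&dev->timer);
    }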

Often, kernel code which has queued a timer finds that it needs to delete that timer. There are two functions which perform this task:

    int del_timer(struct timer_list *timer);
    int del_timer_sync(struct timer_list *timer);

del_timer() ensures that the given timer is not queued to run anywhere in the system; it returns a non-zero value if the timer actually had to be dequeued. del_timer_sync() performs the same function, but it also guarantees that the timer is not actually running on any processor in the system; it will block the current process if necessary while it waits for a running timer to complete. The stronger guarantee is often needed; an unexpected timer running in the corner can create no end of unpleasant race conditions.

Geoff Gustafson recently discovered that del_timer_sync() was one of the biggest kernel CPU hogs on a 32-processor NUMA system running "an enterprise database application." The problem is that del_timer_sync() must query each processor to ensure that the given timer is not currently running there. As the number of processors grows, this query loop takes longer to run. The situation is even worse on NUMA systems, since the loop must look at non-local (read "slow") memory for each processor.
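
A simplified, self-contained sketch of the idea shows where the time goes; the real code lives in kernel/timer.c, its per-CPU structures are private to that file, and timer_base_sketch and timer_bases below are invented stand-ins. The real function also rechecks afterward for a timer that resubmitted itself, which this sketch omits.

    /*
     * Sketch of the cross-CPU check in del_timer_sync();
     * timer_base_sketch and timer_bases are invented stand-ins for
     * the private per-CPU structures in kernel/timer.c.
     */
    #include <linux/timer.h>
    #include <linux/percpu.h>
    #include <linux/cpumask.h>
    #include <linux/sched.h>

    struct timer_base_sketch {
        struct timer_list *running_timer;   /* handler executing right now */
    };

    static DEFINE_PER_CPU(struct timer_base_sketch, timer_bases);

    int del_timer_sync_sketch(struct timer_list *timer)
    {
        int cpu, ret;

        ret = del_timer(timer);             /* dequeue it if still pending */

        /* Now make sure the handler is not running anywhere. */
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            struct timer_base_sketch *base;

            if (!cpu_online(cpu))
                continue;
            base = &per_cpu(timer_bases, cpu);

            /* On NUMA systems most of these reads go to remote memory. */
            while (base->running_timer == timer)
                cpu_relax();
        }
        return ret;
    }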

Geoff posted a patch which solved the problem by remembering where each timer last ran. Since the kernel does not move timers across processors, the query loop in del_timer_sync() could then be reduced to looking at the single processor where the timer would have to be. It was observed, however, that a simpler solution is possible:

    if (! del_timer(timer))
        /* Do the full CPU query loop */

The idea here is that, if the timer was successfully deleted from the queue before it ran, there is no need to check whether it is running anywhere. The only problem with this idea is that it is wrong. Timer functions can - and often do - resubmit themselves. If the timer being deleted has resubmitted itself while its handler is still running, del_timer() will dequeue the new submission and return success, the cross-CPU check will be skipped, and the handler may still be executing on another processor. Kernel code which is deleting a timer really should first ensure that said timer will not resubmit itself, but the timer code cannot count on that behavior.
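
Consider a hypothetical self-rearming timer like the one below (the names are invented). While my_poll() is still running on one CPU, its mod_timer() call has already made the timer pending again, so del_timer() on another CPU will dequeue it and return success even though the handler has not yet finished.

    /* Hypothetical self-rearming timer: it becomes pending again before
     * its handler has returned, defeating the del_timer() shortcut. */
    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #include <linux/param.h>

    static struct timer_list poll_timer;

    static void my_poll(unsigned long data)
    {
        /* ... poll the hardware ... */

        /* From here on the timer is queued again; a concurrent
         * del_timer() "succeeds" while this code is still running. */
        mod_timer(&poll_timer, jiffies + HZ / 10);
    }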

That said, some of the top callers of del_timer_sync() within the kernel use timers which do not resubmit themselves. There is no reason for that code to pay for a full cross-CPU search: if a single-shot timer was removed from the queue before it ran, it is guaranteed not to be running on any processor. For cases like this, a new function has been created:

    int del_singleshot_timer_sync(struct timer_list *timer);

Callers of this function must guarantee that the timer does not resubmit itself; in its current form, del_singleshot_timer_sync() will generate an oops if it detects a resubmitted timer. This function has not yet found its way into the mainline, but, given that it can yield a performance improvement of 2-3 orders of magnitude on large NUMA systems, its addition seems likely.

