
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.16, released on March 19. A fair number of fixes have been merged since 2.6.16-rc6, but nothing too major. For those just tuning in, some of the big, user-visible changes in this kernel include the OCFS2 cluster filesystem, a number of networking changes including CUBIC congestion control, TIPC support, and an IPv6 version of DCCP, the swap migration and direct migration patches, a new SCHED_BATCH scheduler class, a number of new filesystem-oriented system calls, and the error detection and correction code. Big internal changes include the mutex changeover and the high-resolution timer code. The long-format changelog has lots of details.

The mainline git repository contains a big pile of patches merged for 2.6.17-rc1; see below for a summary.

The current -mm tree is 2.6.16-rc6-mm2. Recent changes to -mm include a reorganization of the page migration code (since merged), some high-resolution timers changes, some scheduler tweaks, and the MD RAID reshaping patches.


Kernel development news

Quote of the week

I do nothing more than trade stout rope for good behavior. I anchor one end to a boulder, the other to a task's neck. The mechanism is agnostic. The task determines whether it gets hung or not, and the user determines how long the rope is.

-- Mike Galbraith. Who says scheduler patches are hard to understand?


OSDL's technical advisory board

As promised earlier, OSDL has announced the formation of a "technical advisory board" to help improve its relations with the kernel development community. The initial members are James Bottomley, Wim Coekaerts, Randy Dunlap, Greg Kroah-Hartman, Christoph Lameter, Matt Mackall, Theodore Ts'o, Arjan van de Ven, and Chris Wright.


What's coming in 2.6.17

As of this writing, the process of merging patches into the mainline for 2.6.17 has been underway for a couple of days. Somewhat over 1500 patches have been merged, though the number of user-visible changes is relatively small. Here is what has gone into the kernel so far:

  • There is a big SPARC update, which, among other things, includes support for the new "Niagara" architecture.

  • A large number of wireless networking updates, including some 802.11 development work. The ipw2200 driver has seen changes which, among other things, will require users to have version 3.0 of the adapter firmware.

  • The DCCP code continues to develop; among other things, CCID2 (using TCP-like congestion control) has been added.

  • A netfilter connection tracking helper for the H.323 protocol.

  • A big JFS update.

  • A huge set of video/DVB patches, adding support for a number of new devices and fixing many issues.

  • A big serial ATA update. The SCSI and ALSA subsystems have also seen large updates.

  • A number of USB audio drivers have been removed; USB audio hardware is better supported through the ALSA subsystem.

  • The semaphore-to-mutex conversion process continues in many parts of the tree.

  • EXPORT_SYMBOL_GPL_FUTURE() has been merged.

  • The SLAB_NO_REAP slab cache option, which ostensibly caused the slab not to be cleaned up when the system is under memory pressure, has been removed. The kmem_cache_t typedef is also being phased out in favor of struct kmem_cache.

  • Reservation of "huge" pages has been tightened up in an effort to avoid out-of-memory situations in some use cases. mprotect() can also now be used on huge pages.

The merge window for 2.6.17 should stay open until around the end of the month, so there is still plenty of time for more patches to find their way in.


The last-minute unshare() discussion

One of many new system calls added in the 2.6.16 kernel is unshare(). Its purpose is to perform the opposite of the various sharing flags provided with clone(): it is used to disconnect some of a process's resources from those of its ancestor and sibling processes. With unshare(), a process can ask to have its own filesystems, namespaces, or file descriptor table. The unsharing of other resources, including semaphore undo information, virtual memory, signal handlers, and more is stubbed in for future releases.

A couple of last-second issues with unshare() surfaced just as 2.6.16 was being prepared for final release; only some of those issues were resolved in the resulting kernel.

One of those had to do with the implementation of unshare(CLONE_VM), which causes the calling process to stop sharing memory with others. This functionality appeared to be present and complete until Oleg Nesterov noticed that the code does not take into account the possibility that a core dump of the address space may be in progress. The solution, for now, is simply to disable the unsharing of memory. Nobody seems to need this feature immediately, and it was too late to be fixing up a core memory management function.

Eric Biederman raised a couple of other issues relating to the unshare() API which he would have liked to see fixed before that API becomes part of a released kernel. One was the use of the same set of flags used by clone() to specify sharing. Eric says:

sys_unshare can't implement half of the clone flags under any circumstances and those that it does implement have subtlely different semantics than the clone flags. Using a different set of flags sets the expectation that things will be different.

That discussion did not get very far, however; Linus prefers to use the same flags, and nobody else seems to be terribly upset about it.

Eric's other point was that unshare() does not test for unrecognized flags; they are silently ignored. So user space can ask for the unsharing of resources which are not known to - or supported by - the unshare() call and no error status will be returned. This behavior could be a problem in the future, when the set of legal flags for unshare() is expected to grow. A program written to use one of the new flags may not do the right thing if it is subsequently run on a 2.6.16 kernel; the functionality it asks for will not be present, but the kernel will not inform it of the fact.

The patch submitted by Eric addressed both issues: the names of the flags and testing for unrecognized flags. It was not merged for 2.6.16, however. The unrecognized flag test, on its own, might have gotten in (and such a patch has been merged for 2.6.17), but the combined patch didn't make it. Andrew Morton remarked: "Your single patch did two different things - there's a lesson here." The creation of tightly-focused patches truly is important, especially just prior to a final kernel release.
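The unrecognized-flag problem can be illustrated with a small toy model. This is a sketch only, not the kernel's code: the flag values match the clone() definitions in the Linux headers, but SUPPORTED, FUTURE_FLAG, and both functions are hypothetical names invented for this example, and the supported set is a simplification based on the description above.

```python
import errno

# Flag values as defined for clone() in the Linux headers.
CLONE_FS = 0x00000200
CLONE_FILES = 0x00000400
CLONE_NEWNS = 0x00020000

# Toy assumption: the set of flags this model actually acts on.
SUPPORTED = CLONE_FS | CLONE_FILES | CLONE_NEWNS

# Hypothetical bit standing in for a flag that only a later kernel knows.
FUTURE_FLAG = 0x40000000


def unshare_silent(flags):
    """Model of the 2.6.16 behavior: unknown bits are silently dropped."""
    return 0, flags & SUPPORTED


def unshare_checked(flags):
    """Model of the stricter behavior: unknown bits yield -EINVAL."""
    if flags & ~SUPPORTED:
        return -errno.EINVAL, 0
    return 0, flags & SUPPORTED
```

In this model, a caller passing CLONE_FS | FUTURE_FLAG to unshare_silent() gets a success status even though only CLONE_FS was acted upon; unshare_checked() reports the problem instead, so the program can fall back or fail cleanly.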


Solving starvation problems in the scheduler

The Linux CPU scheduler has come a long way since the early 2.6 days, when it was the cause for quite a bit of worry. Scheduling domains fixed many of the problems on larger systems, while a whole set of interactivity heuristics made desktops work better. The interactivity work, in particular, is based on the notion of a "sleep average." Any process which spends a significant amount of its time sleeping, relative to the time it runs, is deemed to be "interactive" and is given a higher priority.
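As a rough illustration of the sleep-average idea, the sketch below maps the fraction of time a process spends sleeping onto a priority bonus. It is a simplified model rather than the scheduler's actual arithmetic; MAX_BONUS and both function names are assumptions made for this example.

```python
MAX_BONUS = 10  # assumed width of the bonus range, in priority levels


def interactivity_bonus(sleep_time, run_time):
    """Map the fraction of time spent sleeping to a bonus in [-5, +5]."""
    total = sleep_time + run_time
    if total == 0:
        return 0
    return (sleep_time * MAX_BONUS) // total - MAX_BONUS // 2


def dynamic_priority(static_prio, sleep_time, run_time):
    """Lower numbers mean higher priority, as in the kernel."""
    return static_prio - interactivity_bonus(sleep_time, run_time)
```

A process that sleeps 90% of the time gets a bonus of 4 in this model, while one that never sleeps is penalized by 5.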

This mechanism works well enough that few people complain about interactive response with current 2.6 kernels. Every now and then, however, somebody comes up with a workload which manages to fool the scheduler and bring the desktop to a halt. Mike Galbraith has been chasing down a few of these, producing patches in the process which should help to mitigate the problems.

The Linux scheduler maintains two "arrays" of run queues for each processor. When a process starts out, it is given a time slice and put onto the "active" array, where it can compete for the CPU. Once the time slice runs out, that process will move over to the "expired" array, where it languishes until all other runnable processes have used up their time slices. Once all processes are on the expired array, the two arrays are switched and the cycle begins again.
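The array switch can be sketched in a few lines. This toy model ignores priorities entirely (the real arrays hold one queue per priority level) and assumes at least one runnable task; the RunQueue class and its method are hypothetical names for illustration.

```python
from collections import deque


class RunQueue:
    def __init__(self, tasks):
        self.active = deque(tasks)
        self.expired = deque()

    def run_one_slice(self):
        """Run the next task for a full slice and return its name."""
        if not self.active:
            # Every runnable task has used its slice: swap the arrays.
            self.active, self.expired = self.expired, self.active
        task = self.active.popleft()
        # The slice is used up, so the task moves to the expired array.
        self.expired.append(task)
        return task
```

With three tasks, six calls to run_one_slice() visit each task twice in the same order, with one array swap in the middle.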

There is an exception, however, in the 2.6 kernel: a process which is deemed to be interactive (because it spends enough time in interruptible sleeps) will, on expiration of its time slice, be put back onto the active array. As a result, an interactive process should not have to wait while some long-running batch process cranks through its time slice. To keep this mechanism from blocking out expired processes altogether, however, the scheduler checks to see if the processes in the expired array have been waiting for too long. Once the starvation threshold has been exceeded, all processes go to the expired array at the end of their slices, allowing the scheduler to perform the array switch in the relatively near future.
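The requeueing decision described here boils down to a small predicate. The sketch below is an illustration under assumed names: STARVATION_LIMIT and requeue_target() are hypothetical, and the real threshold is computed differently.

```python
STARVATION_LIMIT = 5  # hypothetical threshold, in time slices


def requeue_target(is_interactive, expired_wait):
    """Pick the array for a task whose time slice just ran out."""
    expired_starving = expired_wait > STARVATION_LIMIT
    if is_interactive and not expired_starving:
        return "active"
    return "expired"
```

An interactive task goes back onto the active array only while the expired array has not been waiting too long; everything else expires normally.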

Mike found that, on a system with a heavily-loaded Apache server running, tasks could find themselves starved for long periods of time; it seems that the starvation-avoidance logic was not working right. The problem turned out to be in the wakeup code. That code always put freshly-awakened processes onto the active array, regardless of what was going on elsewhere in the system. With a large number of server processes being continually awakened as requests came in, the scheduler was never able to switch arrays. The fix was to put the starvation test into __activate_task(); as a result, when expired processes are starving, processes will be awakened onto the expired array. That small change resolved much of the problem.
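A toy simulation shows why the wakeup path matters. It assumes one server task is awakened per time slice and measures starvation in slices; the function name, its parameters, and the default limit are all invented for this sketch and do not come from the patch.

```python
def slices_until_switch(fixed, limit=3, max_slices=50):
    """Count slices until the active array drains, or None if it never does.

    One server task is awakened per slice; `limit` is a hypothetical
    starvation threshold measured in slices.
    """
    active = 1        # tasks currently on the active array
    expired_wait = 0  # how long the expired array has been waiting
    for n in range(1, max_slices + 1):
        active -= 1   # the running task uses up its slice
        expired_wait += 1
        starving = expired_wait > limit
        # Without the fix, every wakeup lands on the active array.
        wake_onto_active = not (fixed and starving)
        if wake_onto_active:
            active += 1
        if active == 0:
            return n  # the arrays can now be switched
    return None
```

Without the starvation test in the wakeup path, the active array never empties in this model; with it, the switch happens one slice after the threshold is crossed.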

A fuller fix, however, involves the task throttling patch which Mike has been working on for some time. There are a number of fixes involved in this work, but the core observation is this: the "sleep average" code can be too generous to processes which sleep only part of the time. A process which manages regular, short sleeps can boost its priority significantly, to the point that it can force out other processes running on the system. And once a process obtains an interactivity bonus, it can keep it for some time. This behavior is all by design; some interactive programs can sit for a very long time, then perform some serious processing for a while. Think about the X server with that nice compositing window manager; it spends quite a bit of time idle, only to pin the CPU when the user starts dragging windows around. But this behavior can also give an interactive priority bonus to processes which are not truly interactive.

The solution here involves a few changes. One of them is to simply be a bit less generous with the interactivity bonuses. But the core of the patch is a function called refresh_timeslice(). This function looks at the current sleep average, and compares it to the amount of time that the process is actually spending on the CPU. Based on this comparison, a per-process throttle time is adjusted. If more CPU time is being used than would be suggested by the sleep average, the throttle time is moved backward; otherwise it moves forward. If a process runs into the throttle time, its sleep average starts to decay quickly, depriving it of its interactivity bonus.

The throttle time provides a grace period which allows processes to use short bursts of CPU time without being penalized. The amount of grace time can be adjusted by way of a pair of knobs exported by the throttling code. "Grace 1" is the amount of time new processes get to establish their averages before being exposed to the throttling mechanism, while "grace 2" is how long a process can run above its expected CPU usage before the throttle kicks in. There have been some objections to the addition of these knobs; they look like another obscure set of kernel tunables that most administrators will not know how to set properly. So there has been a push for the knobs to be replaced with a simple on/off switch. Systems meant for interactive use would leave the throttling on, while server systems would simply turn it all off. Working this issue out may delay the acceptance of this patch, though there seems to be little disagreement with the rest of it.
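The throttle and its grace period can be modeled loosely as a margin which shrinks when a process uses more CPU than its sleep average suggests and recovers otherwise. This is one interpretation of the description above, not the patch's implementation: ThrottleModel, its fields, and the halving decay are assumptions, and the sleep average is expressed as a simple percentage.

```python
class ThrottleModel:
    def __init__(self, sleep_avg, grace):
        self.sleep_avg = sleep_avg  # percent of time asleep, 0 to 100
        self.grace = grace          # largest margin allowed, in ticks
        self.margin = grace         # remaining distance to the throttle

    def refresh(self, cpu_used, period):
        """Update the margin after `cpu_used` ticks of CPU in `period` ticks."""
        expected = (100 - self.sleep_avg) * period // 100
        if cpu_used > expected:
            # Overuse moves the process toward the throttle point.
            self.margin = max(0, self.margin - (cpu_used - expected))
        else:
            # Staying within expectations restores the grace margin.
            self.margin = min(self.grace, self.margin + (expected - cpu_used))
        if self.margin == 0:
            # Throttled: the sleep average decays quickly (assumed halving).
            self.sleep_avg //= 2
        return self.margin
```

A short burst only eats into the margin, and a quiet period restores it; sustained overuse exhausts the margin and starts cutting the sleep average, which takes the interactivity bonus with it.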


Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds