Brief itemsreleased on March 19. A fair number of fixes have been merged since 2.6.16-rc6, but nothing too major. For those just tuning in, some of the big, user-visible changes in this kernel include the OCFS2 cluster filesystem, a number of networking changes including CUBIC congestion control, TIPC support, and an IPv6 version of DCCP, the swap migration and direct migration patches, a new SCHED_BATCH scheduler class, a number of new filesystem-oriented system calls, and the error detection and correction code. Big internal changes include the mutex changeover and the high-resolution timer code. The long-format changelog has lots of details.
The mainline git repository contains a big pile of patches merged for 2.6.17-rc1; see below for a summary.
The current -mm tree is 2.6.16-rc6-mm2. Recent changes to -mm include a reorganization of the page migration code (since merged), some high-resolution timers changes, some scheduler tweaks, and the MD RAID reshaping patches.
Kernel development news
-- Mike Galbraith. Who says scheduler patches are hard to understand?
The merge window for 2.6.17 should stay open until around the end of the month, so there is still plenty of time for more patches to find their way in.
A couple of last-second issues with unshare() surfaced just as 2.6.16 was being prepared for final release; only some of those issues were resolved in the resulting kernel.
One of those had to do with the implementation of unshare(CLONE_VM), which causes the calling process to stop sharing memory with others. It seemed that this functionality was present and complete, until Oleg Nesterov noticed that the code does not take into account the possibility that a core dump of the address space may be in process. The solution, for now, is to simply disable unsharing of memory. It seems that there is nobody who needs this feature immediately, and it was too late to be trying to fix up a core memory management function.
Eric Biederman raised a couple of other issues relating to the unshare() API which he would have liked to see fixed before that API becomes part of a released kernel. One was the use of the same set of flags used by clone() to specify sharing. Eric says:
That discussion did not get very far, however; Linus prefers to use the same flags, and nobody else seems to be terribly upset about it.
Eric's other point was that unshare() does not test for unrecognized flags; they are silently ignored. So user space can ask for the unsharing of resources which are not known to - or supported by - the unshare() call and no error status will be returned. This behavior could be a problem in the future, when the set of legal flags for unshare() is expected to grow. A program written to use one of the new flags may not do the right thing if it is subsequently run on a 2.6.16 kernel; the functionality it asks for will not be present, but the kernel will not inform it of the fact.
The patch submitted by Eric addressed both issues: the names of the flags and testing for unrecognized flags. It was not merged for 2.6.16, however. The unrecognized flag test, on its own, might have gotten in (and such a patch has been merged for 2.6.17), but the combined patch didn't make it. Andrew Morton remarked: "Your single patch did two different things - there's a lesson here." The creation of tightly-focused patches truly is important, especially just prior to a final kernel release.
This mechanism works well enough that few people complain about interactive response with current 2.6 kernels. Every now and then, however, somebody comes up with a workload which manages to fool the scheduler and bring the desktop to a halt. Mike Galbraith has been chasing down a few of these, producing patches in the process which should help to mitigate the problems.
The Linux scheduler maintains two "arrays" of run queues for each processor. When a process starts out, it is given a time slice and put onto the "active" array, where it can compete for the CPU. Once the time slice runs out, that process will move over to the "expired" array, where it languishes until all other runnable processes have used up their time slices. Once all processes are on the expired array, the two arrays are switched and the process begins again.
There is an exception, however, in the 2.6 kernel: a process which is deemed to be interactive (because it spends enough time in interruptible sleeps) will, on expiration of its time slice, be put back onto the active array. As a result, an interactive process should not have to wait while some long-running batch process cranks through its time slice. To keep this mechanism from blocking out expired processes altogether, however, the scheduler checks to see if the processes in the expired array have been waiting for too long. Once the starvation threshold has been exceeded, all processes go to the expired array at the end of their slices, allowing the scheduler to perform the array switch in the relatively near future.
Mike found that, on a system with a heavily-loaded Apache server running, tasks could find themselves starved for long periods of time; it seems that the starvation-avoidance logic was not working right. The problem turned out to be in the wakeup code. That code was always putting freshly-awakened processes onto the active array, regardless of what was going on elsewhere in the system. With a large number of server processes being continually awakened as requests came in, the scheduler was never able to switch arrays. The fix was to put the starvation test into __activate_task(); as a result, when expired processes are starving, processes will be awakened onto the expired array. That small fix fixed much of the problem.
A fuller fix, however, involves the task throttling patch which Mike has been working on for some time. There's a number of fixes involved in this work, but the core observation is this: the "sleep average" code can be too generous to processes which sleep only part of the time. A process which manages regular, short sleeps can boost its priority significantly, to the point that it can force out other processes running on the system. And once a process obtains an interactivity bonus, it can keep it for some time. This behavior is all by design; some interactive programs can sit for a very long time, then perform some serious processing for a while. Think about the X server with that nice compositing window manager; it spends quite a bit of time idle, only to pin the CPU when the user starts dragging windows around. But this behavior can also give an interactive priority bonus to processes which are not truly interactive.
The solution here involves a few changes. One of them is to simply be a bit less generous with the interactivity bonuses. But the core of the patch is a function called refresh_timeslice(). This function looks at the current sleep average, and compares it to the amount of time that the process is actually spending in the CPU. Based on this comparison, a per-process throttle time is adjusted. If more CPU time is being used than would be suggested by the sleep average, the throttle time is moved backward; otherwise it moves forward. If a process runs into the throttle time, its sleep average starts to decay quickly, depriving it of its interactivity bonus.
The throttle time provides a grace period which allows processes to use short bursts of CPU time without being penalized. The amount of grace time can be adjusted by way of a pair of knobs exported by the throttling code. "Grace 1" is the amount of time new processes get to establish their averages before being exposed to the throttling mechanism, while "grace 2" is how long a process can run above its expected CPU usage before the throttle kicks in. There have been some objections to the addition of these knobs; they look like another obscure set of kernel tunables that most administrators will not know how to set properly. So there has been a push for the knobs to be replaced with a simple on/off switch. Systems meant for interactive use would leave the throttling on, while server systems would simply turn it all off. Working this issue out may delay the acceptance of this patch, though there seems to be little disagreement with the rest of it.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Next page: Distributions>>
Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds