The current 2.6 kernel is 2.6.16, released on March 19. A
fair number of fixes have been merged since 2.6.16-rc6, but nothing too
major. For those just tuning in, some of the big, user-visible changes in
this kernel include the OCFS2 cluster filesystem, a number of networking
changes including CUBIC congestion control, TIPC, and an IPv6 version of
DCCP, the swap migration and direct migration patches, a new
scheduler class, a number of new filesystem-oriented system calls, and the
error detection and correction code. Big internal changes include the mutex
changeover and the high-resolution timer code. The changelog
has lots of details.
The mainline git repository contains a big pile of patches merged for
2.6.17-rc1; see below for a summary.
The current -mm tree is 2.6.16-rc6-mm2. Recent changes
to -mm include a reorganization of the page migration code (since merged),
some high-resolution timers changes, some scheduler tweaks, and the MD RAID
Kernel development news
I do nothing more than trade stout rope for good behavior. I
anchor one end to a boulder, the other to a task's neck. The
mechanism is agnostic. The task determines whether it gets hung or
not, and the user determines how long the rope is.
-- Mike Galbraith. Who says scheduler
patches are hard to understand?
As has been promised earlier, OSDL has announced the formation of a
"technical advisory board" to help improve its relations with the kernel
development community. Initial members are James Bottomley, Wim Coekaerts,
Randy Dunlap, Greg Kroah-Hartman, Christoph Lameter, Matt Mackall, Theodore
Ts'o, Arjan van de Ven, and Chris Wright.
As of this writing, the process of merging patches into the mainline for
2.6.17 has been underway for a couple of days. Something over 1500 patches
have been merged, though the number of user-visible changes is relatively
small. Here is what has gone into the kernel so far:
- There is a big SPARC update, which, among other things, includes
support for the new "Niagara" architecture.
- A large number of wireless networking updates, including some 802.11
development work. The ipw2200 driver has seen changes which, among
other things, will require users to have version 3.0 of the adapter
firmware.
- The DCCP code continues to develop; among other things, CCID2 (using
TCP-like congestion control) has been added.
- A netfilter connection tracking helper for the H.323 protocol.
- A big JFS update.
- A huge set of video/DVB patches, adding support for a number of new
devices and fixing many issues.
- A big serial ATA update. The SCSI and ALSA subsystems have also seen
updates.
- A number of USB audio drivers have been removed; USB audio hardware is
better supported through the ALSA subsystem.
- The semaphore-to-mutex conversion process continues in many parts of
has been merged.
- The SLAB_NO_REAP slab cache option, which ostensibly caused
the slab not to be cleaned up when the system is under memory
pressure, has been removed. The kmem_cache_t typedef is also
being phased out in favor of struct kmem_cache.
- Reservation of "huge" pages has been tightened up in an effort to
avoid out-of-memory situations in some use cases. mprotect()
can also now be used on huge pages.
The merge window for 2.6.17 should stay open until around the end of the
month, so there is still plenty of time for more patches to find their way
in.
One of many new system calls added in the 2.6.16 kernel is unshare().
Its purpose is to perform the opposite of the various
sharing flags provided with clone(): it is used to disconnect some
of a process's resources from those of its ancestor and sibling processes. With
unshare(), a process can ask to have its own filesystems,
namespaces, or file descriptor table. The unsharing of other resources,
including semaphore undo information, virtual memory, signal handlers, and
more is stubbed in for future releases.
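From user space, a call to unshare() might look roughly like the following. This is a minimal sketch, assuming a C library which provides an unshare() wrapper in <sched.h> (glibc does when _GNU_SOURCE is defined); the helper name detach_fd_table() is invented for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: ask the kernel to give the calling process a
 * private copy of its file descriptor table.
 * Returns 0 on success or a negative errno value on failure.
 */
int detach_fd_table(void)
{
    if (unshare(CLONE_FILES) == -1)
        return -errno;
    return 0;
}
```

After a successful call, descriptors opened or closed by the caller no longer affect tasks which previously shared its table. Some sandboxed environments refuse the call outright, so the return value is worth checking.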
A couple of last-second issues with unshare() surfaced just as
2.6.16 was being prepared for final release; only some of those issues were
resolved in the resulting kernel.
One of those had to do with the implementation of
unshare(CLONE_VM), which causes the calling process to stop
sharing memory with others. It seemed that this functionality was present
and complete, until Oleg Nesterov noticed that the code does not take into
account the possibility that a core dump of the address space may be in
process. The solution, for now, is to simply disable unsharing of memory.
It seems that there is nobody who needs this feature immediately, and it
was too late to be trying to fix up a core memory management function.
Eric Biederman raised a couple of other
issues relating to the unshare() API which he would have liked
to see fixed before that API becomes part of a released kernel. One was
the use of the same set of flags used by clone() to specify
sharing. Eric says:
sys_unshare can't implement half of the clone flags under any
circumstances and those that it does implement have subtlely
different semantics than the clone flags. Using a different set of
flags sets the expectation that things will be different.
That discussion did not get very far, however; Linus prefers to use the same flags, and nobody else
seems to be terribly upset about it.
Eric's other point was that unshare() does not test for
unrecognized flags; they are silently ignored. So user space can ask for
the unsharing of resources which are not known to - or supported by - the
unshare() call and no error status will be returned. This
behavior could be a problem in the future, when the set of legal flags for
unshare() is expected to grow. A program written to use one of
the new flags may not do the right thing if it is subsequently run on a
2.6.16 kernel; the functionality it asks for will not be present, but the
kernel will not inform it of the fact.
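The missing test is easy to picture. What follows is a user-space sketch of such a check, not the patch itself: the function name, the macro name, and the choice of "supported" flags are all assumptions made for illustration (the flag set simply mirrors the three resources mentioned above). The CLONE_* values come from the kernel's exported <linux/sched.h> header.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_* flag values */

/*
 * Flags this toy model claims to support. The set is an assumption for
 * illustration: filesystem information, namespaces, and the file
 * descriptor table.
 */
#define MODEL_UNSHARE_SUPPORTED (CLONE_FS | CLONE_NEWNS | CLONE_FILES)

/*
 * Return 0 if every requested bit is recognized, -EINVAL otherwise,
 * rather than silently ignoring the unknown bits.
 */
int model_check_unshare_flags(unsigned long flags)
{
    if (flags & ~(unsigned long)MODEL_UNSHARE_SUPPORTED)
        return -EINVAL;
    return 0;
}
```

With a check of this kind in place, a program asking for, say, CLONE_SYSVSEM on a kernel which does not implement it gets an error back and can fall back to something else.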
The patch submitted by Eric addressed both issues: the names of the flags
and testing for unrecognized flags. It was not merged for 2.6.16,
however. The unrecognized flag test, on its own, might have gotten in
(and such a patch has been merged for 2.6.17), but
the combined patch didn't make it. Andrew Morton remarked: "Your single patch did two
different things - there's a lesson here." The creation of
tightly-focused patches truly is important, especially just prior to a
final kernel release.
The Linux CPU scheduler has come a long way since the early 2.6 days, when
it was the cause for quite a bit of worry. Scheduling domains fixed many
of the problems on larger systems, while a whole set of interactivity
heuristics made desktops work better. The interactivity work, in
particular, is based on the notion of a "sleep average." Any process which
spends a significant amount of its time sleeping, relative to the time it
runs, is deemed to be "interactive" and is given a higher priority.
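The sleep average idea can be sketched in a few lines. This is a simplified model rather than the scheduler's code: the structure, the function names, and the two constants are hypothetical, and the real heuristics are considerably more involved. Lower priority numbers mean higher priority here.

```c
#define SA_MAX_SLEEP 1000UL  /* cap on the sleep average, in ticks (illustrative) */
#define SA_MAX_BONUS 10      /* largest priority bonus (illustrative) */

struct sa_task {
    int static_prio;          /* base priority; lower number = higher priority */
    unsigned long sleep_avg;  /* recent sleep credit, 0..SA_MAX_SLEEP */
};

/* Sleeping earns credit, up to the cap. */
void sa_note_sleep(struct sa_task *t, unsigned long ticks)
{
    if (ticks > SA_MAX_SLEEP - t->sleep_avg)
        t->sleep_avg = SA_MAX_SLEEP;
    else
        t->sleep_avg += ticks;
}

/* Running spends credit, down to zero. */
void sa_note_run(struct sa_task *t, unsigned long ticks)
{
    if (ticks > t->sleep_avg)
        t->sleep_avg = 0;
    else
        t->sleep_avg -= ticks;
}

/* Scale the credit into a bonus centered on the static priority. */
int sa_effective_prio(const struct sa_task *t)
{
    int bonus = (int)(t->sleep_avg * SA_MAX_BONUS / SA_MAX_SLEEP);

    return t->static_prio - bonus + SA_MAX_BONUS / 2;
}
```

In this model a task which mostly sleeps ends up five levels better than its static priority, while one which never sleeps ends up five levels worse.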
This mechanism works well enough that few people complain about interactive
response with current 2.6 kernels. Every now and then, however, somebody
comes up with a workload which manages to fool the scheduler and bring the
desktop to a halt. Mike Galbraith has been chasing down a few of these,
producing patches in the process which should help to mitigate the problem.
The Linux scheduler maintains two "arrays" of run queues for each
processor. When a process starts out, it is given a time slice and put
onto the "active" array, where it can compete for the CPU. Once the time
slice runs out, that process will move over to the "expired" array, where
it languishes until all other runnable processes have used up their time
slices. Once all processes are on the expired array, the two arrays are
switched and the process begins again.
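A toy version of the two-array scheme might look like the following. It is a sketch under heavy simplification: the real arrays hold one list per priority level, while this model keeps a single FIFO of task IDs (assumed to be non-negative), and every name in it is invented for illustration.

```c
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* One array of runnable tasks, kept in FIFO order. */
struct toy_array {
    int task[TOY_MAX_TASKS];
    size_t count;
};

/* Per-CPU run queue: two arrays plus pointers naming their current roles. */
struct toy_rq {
    struct toy_array arrays[2];
    struct toy_array *active;
    struct toy_array *expired;
    unsigned long switches;     /* number of array switches so far */
};

void toy_rq_init(struct toy_rq *rq)
{
    rq->arrays[0].count = 0;
    rq->arrays[1].count = 0;
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
    rq->switches = 0;
}

/* A new task starts on the active array. Returns -1 if the queue is full. */
int toy_add_task(struct toy_rq *rq, int id)
{
    struct toy_array *a = rq->active;

    if (rq->active->count + rq->expired->count >= TOY_MAX_TASKS)
        return -1;
    a->task[a->count++] = id;
    return 0;
}

/*
 * Let the next task use up one time slice, then move it to the expired
 * array. Returns the ID of the task that ran, or -1 if nothing is runnable.
 */
int toy_run_slice(struct toy_rq *rq)
{
    struct toy_array *a;
    size_t i;
    int id;

    if (rq->active->count == 0) {
        struct toy_array *tmp = rq->active;

        if (rq->expired->count == 0)
            return -1;
        /* Everybody has expired: swap the roles of the two arrays. */
        rq->active = rq->expired;
        rq->expired = tmp;
        rq->switches++;
    }

    a = rq->active;
    id = a->task[0];
    for (i = 1; i < a->count; i++)
        a->task[i - 1] = a->task[i];
    a->count--;
    rq->expired->task[rq->expired->count++] = id;
    return id;
}
```

The switch itself is just a pointer swap, which is why it costs the same no matter how many tasks are waiting.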
There is an exception, however, in the 2.6 kernel: a process which is
deemed to be interactive (because it spends enough time in interruptible
sleeps) will, on expiration of its time slice, be put back onto the active
array. As a result, an interactive process should not have to wait while
some long-running batch process cranks through its time slice. To keep
this mechanism from blocking out expired processes altogether, however, the
scheduler checks to see if the processes in the expired array have been
waiting for too long. Once the starvation threshold has been exceeded, all processes
go to the expired array at the end of their slices, allowing the scheduler
to perform the array switch in the relatively near future.
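The decision made at the end of a time slice can be sketched as a pair of small functions. All of the names here are hypothetical, and the form of the threshold (a per-task limit scaled by the number of runnable tasks) is an assumption of this sketch; the text above says only that a threshold exists.

```c
#include <stdbool.h>

enum requeue_array { REQUEUE_ACTIVE, REQUEUE_EXPIRED };

struct requeue_state {
    unsigned long now;            /* current time, in ticks */
    unsigned long first_expired;  /* when the first task expired; 0 if none has */
    unsigned long limit_per_task; /* assumed tunable: allowed wait per runnable task */
    unsigned long nr_running;     /* number of runnable tasks */
};

/* Have the tasks on the expired array been waiting too long? */
bool requeue_expired_starving(const struct requeue_state *s)
{
    if (s->first_expired == 0)
        return false;
    return s->now - s->first_expired >=
           s->limit_per_task * s->nr_running + 1;
}

/* Where does a task go when its time slice runs out? */
enum requeue_array requeue_target(bool interactive,
                                  const struct requeue_state *s)
{
    if (interactive && !requeue_expired_starving(s))
        return REQUEUE_ACTIVE;
    return REQUEUE_EXPIRED;
}
```

Once the starvation test trips, interactive tasks lose their special treatment until the arrays have been switched.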
Mike found that, on a system with a heavily-loaded Apache server running,
tasks could find themselves starved for long periods of time; it seems that
the starvation-avoidance logic was not working right. The problem turned
out to be in the wakeup code. That code was always putting
freshly-awakened processes onto the active array, regardless of what was
going on elsewhere in the system. With a large number of server processes
being continually awakened as requests came in, the scheduler was never
able to switch arrays. The fix was to put
the starvation test into __activate_task(); as a result, when
expired processes are starving, processes will be awakened onto the expired
array. That small change resolved much of the problem.
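The before-and-after behavior of the wakeup path reduces to a one-line difference. The sketch below is a model of the description above, not the code from __activate_task(); the function and type names are made up for the purpose.

```c
#include <stdbool.h>

enum wake_array { WAKE_ACTIVE, WAKE_EXPIRED };

/* The problem case: wakeups always land on the active array. */
enum wake_array wake_target_before(bool expired_starving)
{
    (void)expired_starving;   /* starvation state was never consulted */
    return WAKE_ACTIVE;
}

/* With the starvation test added, wakeups can land on the expired array. */
enum wake_array wake_target_after(bool expired_starving)
{
    return expired_starving ? WAKE_EXPIRED : WAKE_ACTIVE;
}
```

With the second version, a steady stream of wakeups can no longer keep the active array populated forever, so the array switch eventually happens.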
A fuller fix, however, involves the task throttling patch which Mike
has been working on for some time. There are a number of fixes involved in
this work, but the core observation is this: the "sleep average" code can
be too generous to processes which sleep only part of the time. A process
which manages regular, short sleeps can boost its priority significantly,
to the point that it can force out other processes running on the system.
And once a process obtains an interactivity bonus, it can keep it for some
time. This behavior is all by design; some interactive programs can sit
for a very long time, then perform some serious processing for a while.
Think about the X server with that nice compositing window manager; it
spends quite a bit of time idle, only to pin the CPU when the user starts
dragging windows around. But this behavior can also give an interactive
priority bonus to processes which are not truly interactive.
The solution here involves a few changes. One of them is to simply be a
bit less generous with the interactivity bonuses. But the core of the
patch is a function called refresh_timeslice(). This function
looks at the current sleep average, and compares it to the amount of time
that the process is actually spending on the CPU. Based on this
comparison, a per-process throttle time is adjusted. If more CPU time is
being used than would be suggested by the sleep average, the throttle time
is moved backward; otherwise it moves forward. If a process runs into the
throttle time, its sleep average starts to decay quickly, depriving it of
its interactivity bonus.
The throttle time provides a grace period which allows processes to use
short bursts of CPU time without being penalized. The amount of grace time
can be adjusted by way of a pair of knobs exported by the throttling code.
"Grace 1" is the amount of time new processes get to establish their
averages before being exposed to the throttling mechanism, while
"grace 2" is how long a process can run above its expected CPU usage
before the throttle kicks in. There have been some objections to the
addition of these knobs; they look like another obscure set of kernel
tunables that most administrators will not know how to set properly. So
there has been a push for the knobs to be replaced with a simple on/off
switch. Systems meant for interactive use would leave the throttling on,
while server systems would simply turn it all off. Working this issue out
may delay the acceptance of this patch, though there seems to be little
disagreement with the rest of it.
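One way to picture the throttle is as a small allowance which a process spends whenever it uses more CPU than its sleep average would suggest. The sketch below is a toy model of that idea only; it is not refresh_timeslice(), and the accounting in percentages per period, the halving of the sleep average, and all of the names are assumptions made for illustration. The grace parameter plays a role loosely similar to "grace 2".

```c
struct thr_task {
    unsigned int sleep_pct;   /* sleep average as a percentage, 0..100 */
    unsigned int slack;       /* remaining grace, in accounting periods */
};

/*
 * Account for one period in which the task spent cpu_pct percent of its
 * time on the CPU. "grace" caps how much slack a task can bank.
 */
void thr_account_period(struct thr_task *t, unsigned int cpu_pct,
                        unsigned int grace)
{
    unsigned int sleep = t->sleep_pct > 100 ? 100 : t->sleep_pct;
    unsigned int expected_cpu_pct = 100 - sleep;

    if (cpu_pct > expected_cpu_pct) {
        if (t->slack > 0)
            t->slack--;          /* burst tolerated: spend some grace */
        else
            t->sleep_pct /= 2;   /* grace exhausted: decay quickly */
    } else if (t->slack < grace) {
        t->slack++;              /* well behaved: earn grace back */
    }
}
```

In this model a task with a sleep average of 80% and three periods of slack can pin the CPU for three periods without penalty; after that its sleep average, and with it the interactivity bonus, falls by half each period.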
Page editor: Jonathan Corbet