The current 2.6 prepatch is 2.6.9-rc3, released by Linus on September 29.
Changes this time include lots more annotations for the "sparse" checker,
an NTFS update, a patch causing ICMP "source quench" messages to be
ignored, the new I/O memory access functions (see the September 23 Kernel
Page), a big set of input driver patches, m32r architecture support, a
User-mode Linux
update, the merger of the two in-kernel software suspend implementations, a
tunable "max sectors" limit for block I/O requests (a latency reduction
feature), and a new prctl()
option allowing programs to change
their name. The long-format changelog has all the details.
For what it's worth, Andrew Morton estimates that 2.6.9-rc4 will be out "later
this week," with the final 2.6.9 release happening about a week after that.
Linus's BitKeeper repository contains another big set of "sparse"
annotations, the removal of get_cpu_ptr(), a generic,
netlink-based network statistics interface, some networking fixes, and a
number of architecture updates.
Also to be found in BitKeeper is a kernel
management style document which Linus quietly committed as "wisdom
passed down the ages on clay tablets."
This preemptive admission of incompetence might also make the
people who actually do the work also think twice about whether it's
worth doing or not. After all, if _they_ aren't certain whether
it's a good idea, you sure as hell shouldn't encourage them by
promising them that what they work on will be included. Make them
at least think twice before they embark on a big endeavor.
The current prepatch from Andrew Morton is 2.6.9-rc3-mm2. Recent changes to -mm include
a reworking of the IRQ subsystem, a set of ext3 online resizing fixes, a
"completely fair queueing" I/O scheduler update, the switchable I/O
schedulers patch, an I/O write barrier primitive, a security module for BSD
secure levels, and lots of fixes.
The current 2.4 prepatch remains 2.4.28-pre3, which dates back to
Comments (none posted)
Kernel development news
And I have to warn people if they think that the churn is fast
and the rate of change in the networking is high right now, you
have seen absolutely nothing yet. :-)
-- David Miller
The _reality_ is that there is _no_ point in time where you and Linus allow
for stabilization of the main tree prior to release. The release criteria
has devolved to a point where we call it done when the stack of pancakes
gets too high.
-- Jeff Garzik (among others) is concerned
about the current development model.
Comments (8 posted)
Andrew Morton's -mm kernel tree now fills the role which might have once
been taken by an odd-numbered development series. We don't have 2.7.x;
instead, new stuff finds its way into 2.6.x-mm. So it can be interesting
to step back, occasionally, and look at what patches are lurking there.
2.6.9-rc3-mm2 contains a full 1213 patches. About half of these come from
trees managed by various subsystem maintainers; seeing what those are
usually requires pulling a separate BitKeeper tree and looking inside.
These trees hold patches which are usually (usually!) relatively small and
maintenance-oriented. The external trees brought into -mm currently
include those dedicated to the ACPI, AGPGART, ALSA, i2c, IDE, IEEE 1394,
input, serial ATA, networking, NTFS, driver core, PCI, USB, and SCSI
subsystems.
Among the other 654 patches in 2.6.9-rc3-mm2 are found:
- A change to how rlimit settings are interpreted; they become
per-process settings, rather than per-thread.
- The sysfs backing store patches
continue to languish in -mm, apparently waiting for a review from some
of the core developers.
- Ingo Molnar's "generic IRQ subsystem" work. These patches, posted on October 2, are a big
reorganization of the interrupt handling code. Over the years, much
of the IRQ code had been copied from one architecture to the next,
leading to a lot of duplicated functions. These patches pull the
generic code out of the architecture subtrees and remove some 3000
lines of code from the kernel.
- Numerous kernel debugger (kgdb) patches continue to live in -mm; as
always, they are unlikely to move into the mainline.
- They get less attention than they used to, but there are still must-fix and should-fix lists in -mm.
- Arjan van de Ven's patch which keeps processes from being able to
overwrite kernel memory via /dev/mem. This patch has been
shipped with Red Hat/Fedora kernels for a while, but is not yet in the
mainline.
- An extensive set of ext3 patches implementing block reservations. Stephen Tweedie has
recently resumed working on these patches, so they may move forward in
the near future. The ext3 online resizing patch set is also in -mm.
- Mikael Pettersson's performance counters patches.
- The -mm tree continues to be a testing ground for scheduler patches.
It currently contains Peter Williams's Single Priority Array scheduler
(covered briefly here last August).
There is also an extensive set of scheduling domains fixes and a
number of latency-reduction patches from Ingo Molnar's work.
- Ingo Molnar's big kernel semaphore patch.
- A set of PCMCIA patches adding driver model and hotplug support.
- A big DVD+RW support patch, which includes CDRW packet writing
support.
- Support for in-kernel keyrings and their management.
- The CacheFS filesystem.
- The kexec patches, including support for using kexec as a kernel crash
dump mechanism.
- The reiser4 filesystem and a large number of fixes.
- The modular I/O schedulers patch and
the reworked "completely fair queueing" scheduler.
- The remap_page_range() change.
- A security module implementing the BSD "secure levels" mechanism.
Mixed in with these big patches is the usual array of architecture updates,
subsystem fixes, etc.
In other words, -mm is a big patch; it is significantly different from the
mainline kernel. For some developers, it is too far removed; David Miller
recently responded to a request to test
networking changes in -mm this way:
Putting the net stuff into -mm makes debugging of networking
changes harder, as -mm has a ton of experimental stuff in it as
well. -mm frequently makes machines unbootable, and particularly
this is felt on non-x86 platforms such as sparc64 which is where I
do all of my work.
This kind of observation is not new; many developers continued to create
their patches on the 2.4 kernel long after the 2.5 branch opened because
2.5 struck them as being too unstable. When one is trying to shake out
bugs in new code, it is nice to minimize the number of other unrelated,
disruptive changes. That said, -mm continues to be the main staging area
for much of the code going into the mainline, and many developers target it
specifically with their patches. Given the number of bugs found after
patches go into -mm, people are clearly running it as well.
Comments (3 posted)
"High order" allocations, in the kernel, are attempts to obtain multiple,
contiguous pages for an application which needs more than one page in a
single, physically-contiguous block. These allocations have always been a
problem for the kernel to satisfy; once the system has been running for a
while, physical memory is usually fragmented to the point that very few
groups of adjacent, free pages exist. Last month, this page looked at Nick Piggin's kswapd changes
which attempt to
mitigate this problem somewhat. There are other people working in this
area.
One of those is Marcelo Tosatti, who posted a
patch which adds active memory defragmentation to the kernel. At a
high level, the algorithm used is relatively simple: to obtain free blocks
of order N, start with the largest free blocks of lower order that can be
found, and
try to relocate the contents of the pages immediately before and after the
block. If enough pages can be moved, a larger block of free pages will
have been created.
Naturally, this process seems rather more complicated when looked at
closely. Not all pages can be relocated; those which are locked or
reserved, for example, are not touchable. The patch also declines to work
with pages which are currently under writeback; until the writeback I/O
completes, those pages must not move. A number of more complicated cases,
such as moving pages which are part of a nonlinear mapping, are not handled
with the current patch.
If a page does appear to be relocatable, it must first be locked and have
its contents copied to the new page. Then all page tables which reference
the old page must be re-pointed to the new page. Reverse mapping
information, if any, must be set correctly. If there is a copy of the page
in swap, that copy must be connected with the new page. And so on.
Marcelo's patch responds to many of the more complicated cases by simply
refusing to move the page. Even so, Marcelo reports good results in
creating large, contiguous blocks of free memory.
Of course, there are a few glitches, including problems on SMP systems.
But, says Marcelo, never fear:
But it works fine on UP (for a few minutes :)), and easily creates
large physically contiguous areas of memory.
It was pointed out that this patch has some common features with a
different effort: the drive to support hotpluggable memory. When memory
is to be removed from the system, all pages currently stored in that memory
must be relocated. In essence, the hotplug memory patches seek to create a
large block of free memory which happens to cover a specific set of
physical addresses.
Dave Hansen described two patches adding
hotplug memory support - one done at IBM, and one from Fujitsu. Each
apparently has its strong and weak points.
Between Marcelo's work and the hotplug patches, there is a significant
amount of experience in moving pages aside to free blocks of memory. An
effort to bring together those patches into a single one containing the
best of each will probably be necessary before any can be merged. But the
end result of that work could be an end to problems with high-order
allocations.
Comments (1 posted)
The performance of modern computers is heavily influenced by how well they
use the processor's memory cache. Going to main memory is a slow operation
(from a processor's point of view); an operating system which forces main
memory accesses too often will run slowly. One of the things the Linux
kernel does to optimize cache use is to try to avoid moving processes
between CPUs if it is likely that those processes have a fair amount of
useful data in the cache. When a process moves, it leaves its cached data
behind and must begin populating the new CPU's cache from the beginning.
That repopulation requires memory accesses and slows things down.
The metric used by the kernel to decide whether moving a particular task is
advisable is a scheduling domain parameter called cache_hot_time.
If the process has run in the current processor within the "hot time," it
is considered to have significant data in the cache and is not moved
unnecessarily. In recent kernels, cache_hot_time for processors
on non-NUMA, SMP systems is 2.5ms.
Kenneth Chen recently did some tests to see
if that value makes sense. On his four-processor system, he found that
workload throughput with a 2.5ms hot time was 12% below its peak level -
which happens with a 10ms value. As it turns out, 10ms was once the
default value for the cache hot time; Kenneth proposes that this value be
restored. Others have, instead, suggested that a new tunable parameter be
provided so that administrators could find and set the optimal value for
their systems.
Ingo Molnar has come up with a different
approach - have the computer figure out for itself what the optimal
"cache hot" time is. To this end, his code performs the following steps
for each pair of processors on the system:
- The first processor fills a large, shared buffer with data, thus
populating its own cache with (some of) the contents of that buffer.
- The second processor fills a private buffer, filling its own cache.
- The second processor then overwrites the shared buffer, moving the
contents of that buffer into its own cache.
The time required for the third step is, to an approximation, a worst case
scenario for what it costs to move a process when it has filled the local
cache with data. Ingo tested the code on a few systems and got optimal
values which vary from 5ms (on a four-processor Pentium 4 system) to
87ms (for an eight-processor, semi-NUMA, Pentium 3 system). Clearly,
one default value for all systems is not the right answer. This also looks
like a good number for the computer to find for itself - assuming
subsequent tests show that this patch (or a successor) is finding something
close to the optimal value.
Comments (6 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet