Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.9-rc3, announced by Linus on September 29. Changes this time include lots more annotations for the "sparse" checker, an NTFS update, a patch causing ICMP "source quench" messages to be ignored, the new I/O memory access functions (see the September 23 Kernel Page), a big set of input driver patches, m32r architecture support, a User-mode Linux update, the merger of the two in-kernel software suspend implementations, a tunable "max sectors" limit for block I/O requests (a latency reduction feature), and a new prctl() option allowing programs to change their name. The long-format changelog has the details.
For what it's worth, Andrew Morton estimates that 2.6.9-rc4 will be out "later this week," with the final 2.6.9 release happening about a week after that.
Linus's BitKeeper repository contains another big set of "sparse" annotations, the removal of get_cpu_ptr(), a generic, netlink-based network statistics interface, some networking fixes, and a number of architecture updates.
Also to be found in BitKeeper is a kernel management style document which Linus quietly committed as "wisdom passed down the ages on clay tablets."
The current prepatch from Andrew Morton is 2.6.9-rc3-mm2. Recent changes to -mm include a reworking of the IRQ subsystem, a set of ext3 online resizing fixes, a "completely fair queueing" I/O scheduler update, the switchable I/O schedulers patch, an I/O write barrier primitive, a security module for BSD secure levels, and lots of fixes.
The current 2.4 prepatch remains 2.4.28-pre3, which dates back to September 11.
Kernel development news
Quotes of the week
-- David Miller
-- Jeff Garzik (among others) is concerned about the current development model.
The -mm development tree
Andrew Morton's -mm kernel tree now fills the role which might have once been taken by an odd-numbered development series. We don't have 2.7.x; instead, new stuff finds its way into 2.6.x-mm. So it can be interesting to step back, occasionally, and look at what patches are lurking there.
2.6.9-rc3-mm2 contains a full 1213 patches. About half of these come from trees managed by various subsystem maintainers; seeing what those are usually requires pulling a separate BitKeeper tree and looking inside. These trees hold patches which are usually (usually!) relatively small and maintenance-oriented. The external trees brought into -mm currently include those dedicated to the ACPI, AGPGART, ALSA, i2c, IDE, IEEE 1394, input, serial ATA, networking, NTFS, driver core, PCI, USB, and SCSI subsystems.
Among the other 654 patches in 2.6.9-rc3-mm2 are found:
- A change to how rlimit settings are interpreted; they become
per-process settings, rather than per-thread.
- The sysfs backing store patches
continue to languish in -mm, apparently waiting for a review from some
of the core developers.
- Ingo Molnar's "generic IRQ subsystem" work. These patches, posted on October 2, are a big
reorganization of the interrupt handling code. Over the years, much
of the IRQ code had been copied from one architecture to the next,
leading to a lot of duplicated functions. These patches pull the
generic code out of the architecture subtrees and remove some 3000
lines of code from the kernel.
- Numerous kernel debugger (kgdb) patches continue to live in -mm; as
always, they are unlikely to move into the mainline.
- They get less attention than they used to, but there are still must-fix and should-fix lists in -mm.
- Arjan van de Ven's patch which keeps processes from being able to
overwrite kernel memory via /dev/mem. This patch has been
shipped with Red Hat/Fedora kernels for a while, but is not yet in the
mainline.
- An extensive set of ext3 patches implementing block reservations. Stephen Tweedie has
recently resumed working on these patches, so they may move forward in
the near future. The ext3 online resizing patch set is also in -mm.
- Mikael Pettersson's performance counters patches.
- The -mm tree continues to be a testing ground for scheduler patches.
It currently contains Peter Williams's Single Priority Array scheduler
(covered briefly here last August).
There is also an extensive set of scheduling domains fixes and a
number of latency-reduction patches from Ingo Molnar's work.
- Ingo Molnar's big kernel semaphore
patch.
- A set of PCMCIA patches adding driver model and hotplug support.
- A big DVD+RW support patch, which includes CDRW packet writing
support.
- Support for in-kernel keyrings and their management.
- The CacheFS filesystem.
- The kexec patches, including support for using kexec as a kernel crash
dump mechanism.
- The reiser4 filesystem and a large number of fixes.
- The modular I/O schedulers patch and
the reworked "completely fair queueing" scheduler.
- The remap_page_range() change
to remap_pfn_range().
- A security module implementing the BSD "secure levels" mechanism.
Mixed in with these big patches is the usual array of architecture updates, subsystem fixes, etc.
In other words, -mm is a big patch; it is significantly different from the mainline kernel. For some developers, it is too far removed; David Miller recently responded to a request to test networking changes in -mm this way:
This kind of observation is not new; many developers continued to create their patches on the 2.4 kernel long after the 2.5 branch opened because 2.5 struck them as being too unstable. When one is trying to shake out bugs in new code, it is nice to minimize the number of other unrelated, disruptive changes. That said, -mm continues to be the main staging area for much of the code going into the mainline, and many developers target it specifically with their patches. Given the number of bugs found after patches go into -mm, people are clearly running it as well.
Active memory defragmentation
"High order" allocations, in the kernel, are attempts to obtain multiple, contiguous pages for an application which needs more than one page in a single, physically-contiguous block. These allocations have always been a problem for the kernel to satisfy; once the system has been running for a while, physical memory is usually fragmented to the point that very few groups of adjacent, free pages exist. Last month, this page looked at Nick Piggin's kswapd changes which attempt to mitigate this problem somewhat. There are other people working in this area, however.
One of those is Marcelo Tosatti, who posted a patch which adds active memory defragmentation to the kernel. At a high level, the algorithm used is relatively simple: to obtain free blocks of order N, start with the largest lower-order free blocks you can find, and try to relocate the contents of the pages immediately before and after each block. If enough pages can be moved, a larger block of free pages will have been created.
Naturally, this process seems rather more complicated when looked at closely. Not all pages can be relocated; those which are locked or reserved, for example, are not touchable. The patch also declines to work with pages which are currently under writeback; until the writeback I/O completes, those pages must not move. A number of more complicated cases, such as moving pages which are part of a nonlinear mapping, are not handled with the current patch.
If a page does appear to be relocatable, it must first be locked and have its contents copied to the new page. Then all page tables which reference the old page must be re-pointed to the new page. Reverse mapping information, if any, must be set correctly. If there is a copy of the page in swap, that copy must be connected with the new page. And so on. Marcelo's patch responds to many of the more complicated cases by simply refusing to move the page. Even so, Marcelo reports good results in creating large, contiguous blocks of free memory.
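The relocation idea can be illustrated with a toy model in Python. This is not Marcelo's code, and it simplifies the search: rather than growing outward from an existing free block, it scans aligned windows of 2**order pages and picks the one needing the fewest moves. The page states (FREE, MOVABLE, PINNED) and the function name defragment() are inventions for this sketch; PINNED stands in for every case the patch refuses to touch (locked, reserved, or under writeback).

```python
FREE, MOVABLE, PINNED = "free", "movable", "pinned"


def defragment(pages, order):
    """Try to free one aligned block of 2**order pages by relocating pages.

    `pages` is a list of page states and is modified in place.
    Returns the index of the first page of the freed block, or None.
    """
    size = 1 << order
    candidates = []
    for start in range(0, len(pages) - size + 1, size):
        window = pages[start:start + size]
        if PINNED in window:
            continue  # an unmovable page rules out the whole window
        candidates.append((window.count(MOVABLE), start))

    # Prefer the window that needs the fewest relocations.
    for to_move, start in sorted(candidates):
        window_range = range(start, start + size)
        # Free pages outside the window can receive relocated contents.
        targets = [i for i, state in enumerate(pages)
                   if state == FREE and i not in window_range]
        if len(targets) < to_move:
            continue  # nowhere to put the pages we would have to move
        for i in window_range:
            if pages[i] == MOVABLE:
                dest = targets.pop()
                pages[dest] = MOVABLE  # stands in for copy + remap
                pages[i] = FREE
        return start
    return None
```

In the real kernel, the single assignment marked "copy + remap" hides all of the hard work described above: locking the page, copying it, re-pointing page tables, and fixing up reverse mappings and swap entries.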
Of course, there are a few glitches, including problems on SMP systems. But, says Marcelo, never fear:
It was pointed out that this patch has some common features with a different effort: the drive to support hotpluggable memory. When memory is to be removed from the system, all pages currently stored in that memory must be relocated. In essence, the hotplug memory patches seek to create a large block of free memory which happens to cover a specific set of physical addresses.
Dave Hansen described two patches adding hotplug memory support - one done at IBM, and one from Fujitsu. Each apparently has its strong and weak points.
Between Marcelo's work and the hotplug patches, there is a significant amount of experience in moving pages aside to free blocks of memory. An effort to bring together those patches into a single one containing the best of each will probably be necessary before any can be merged. But the end result of that work could be an end to problems with high-order allocations.
When should a process be migrated?
The performance of modern computers is heavily influenced by how well they use the processor's memory cache. Going to main memory is a slow operation (from a processor's point of view); an operating system which forces main memory accesses too often will run slowly. One of the things the Linux kernel does to optimize cache use is to try to avoid moving processes between CPUs if it is likely that those processes have a fair amount of useful data in the cache. When a process moves, it leaves its cached data behind and must begin populating the new CPU's cache from the beginning. That repopulation requires memory accesses and slows things down.
The metric used by the kernel to decide whether moving a particular task is advisable is a scheduling domain parameter called cache_hot_time. If the process has run on the current processor within the "hot time," it is considered to have significant data in the cache and is not moved unnecessarily. In recent kernels, cache_hot_time for processors on non-NUMA, SMP systems is 2.5ms.
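As a rough illustration of that test, here is a toy model in Python (not kernel code). Only the cache_hot_time parameter and its 2.5ms value come from the text; the function names is_cache_hot() and pick_migration_candidate(), the nanosecond timestamps, and the (name, last_ran) task tuples are assumptions made for this sketch.

```python
CACHE_HOT_TIME_NS = 2_500_000  # 2.5ms, the non-NUMA SMP value quoted above


def is_cache_hot(now_ns, last_ran_ns, cache_hot_time_ns=CACHE_HOT_TIME_NS):
    """Return True if the task ran recently enough to count as cache-hot."""
    return now_ns - last_ran_ns < cache_hot_time_ns


def pick_migration_candidate(tasks, now_ns,
                             cache_hot_time_ns=CACHE_HOT_TIME_NS):
    """Return the name of the first task that is not cache-hot, or None.

    `tasks` is a list of (name, last_ran_ns) tuples (hypothetical layout).
    """
    for name, last_ran_ns in tasks:
        if not is_cache_hot(now_ns, last_ran_ns, cache_hot_time_ns):
            return name
    return None
```

With the 2.5ms default, a task which last ran 8ms ago is fair game for migration; pass cache_hot_time_ns=10_000_000 and the same task is left where it is.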
Kenneth Chen recently did some tests to see if that value makes sense. On his four-processor system, he found that workload throughput with a 2.5ms hot time was 12% below its peak level - which happens with a 10ms value. As it turns out, 10ms was once the default value for the cache hot time; Kenneth proposes that this value be restored. Others have, instead, suggested that a new tunable parameter be provided so that administrators could find and set the optimal value for their systems.
Ingo Molnar has come up with a different approach - have the computer figure out for itself what the optimal "cache hot" time is. To this end, his code performs the following steps for each pair of processors on the system:
- The first processor fills a large, shared buffer with data, thus
populating its own cache with (some of) the contents of that buffer.
- The second processor fills a private buffer, filling its own cache.
- The second processor then overwrites the shared buffer, moving the contents of that buffer into its own cache.
The time required for the third step is, to an approximation, a worst case scenario for what it costs to move a process when it has filled the local cache with data. Ingo tested the code on a few systems and got optimal values which vary from 5ms (on a four-processor Pentium 4 system) to 87ms (for an eight-processor, semi-NUMA, Pentium 3 system). Clearly, one default value for all systems is not the right answer. This also looks like a good number for the computer to find for itself - assuming subsequent tests show that this patch (or a successor) is finding something close to the optimal value.
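The shape of that three-step measurement can be sketched in Python. This shows the structure only; it is not Ingo's code, and timings taken from an interpreted language say little about real cache behavior. The run_on(cpu, func) hook is hypothetical: the caller supplies something which runs func on the given CPU and returns its result. The names timed_fill(), measure_pair(), and measure_all_pairs(), and the 1MB default buffer size, are likewise assumptions.

```python
import itertools
import time


def timed_fill(buf, value):
    """Overwrite every byte of buf; return the elapsed time in seconds."""
    start = time.perf_counter()
    buf[:] = bytes([value]) * len(buf)
    return time.perf_counter() - start


def measure_pair(run_on, cpu_a, cpu_b, size=1 << 20):
    """Run the three steps for one pair; return the cost of step three."""
    shared = bytearray(size)
    private = bytearray(size)
    # Step 1: the first processor fills the shared buffer.
    run_on(cpu_a, lambda: timed_fill(shared, 1))
    # Step 2: the second processor fills a private buffer.
    run_on(cpu_b, lambda: timed_fill(private, 2))
    # Step 3: the second processor overwrites the shared buffer;
    # the time this takes is the number of interest.
    return run_on(cpu_b, lambda: timed_fill(shared, 3))


def measure_all_pairs(run_on, cpus, size=1 << 20):
    """Return {(cpu_a, cpu_b): seconds} for each pair of processors."""
    return {(a, b): measure_pair(run_on, a, b, size)
            for a, b in itertools.combinations(cpus, 2)}
```

How the per-pair numbers are turned into a single "cache hot" value is not described here, so the sketch simply returns them all.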
Page editor: Jonathan Corbet
