
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.6-rc3, unchanged from last week.

Linus's BitKeeper tree contains, as of this writing, an important workqueue fix (it seems nobody had actually tried to use cancel_delayed_work() until now...), an updated MTD concatenating driver, several architecture updates, and lots of fixes.

The current tree from Andrew Morton is 2.6.6-rc3-mm2. Recent additions to the -mm tree include another set of reverse mapping VM patches from Hugh Dickins, a new ia64 hotplug CPU patch set, a patch to enable interrupts while waiting on spinlocks, the permanent abolition of 8K stacks on the x86 architecture, a new /proc/sys/kernel/vermagic file to enable package installers to figure out how the kernel was built, filtered sleeps and wakeups (see below), a new NUMA API, and, of course, lots of fixes.

Andrew indicates that the scheduling domains patches are being fixed up and prepared for merging once 2.6.6 is released. He also plans to merge a number of the reverse mapping VM patches, including the anonmm work, even though the final decision on whether to go that way or with the rival anon_vma technique has not yet been made.

The current 2.4 prepatch is 2.4.27-pre2, which was released by Marcelo on May 3. Changes this time include some crypto updates, some XFS fixes, various networking updates, and a handful of other fixes.


Kernel development news

Quote of the week


-- Alexander Viro's alternative for a less alarming replacement for the term "tainted," applied to kernels which have had non-free modules loaded into them.


2.6 swapping behavior

There has, recently, been a new round of complaints about how the 2.6 kernel swaps out memory. Some users have been very vocal in their belief that, if they have sufficient physical memory, their applications should never be swapped out. These people get annoyed when they sit down at their display in the morning and find that their office suite or web browser is unresponsive, and stays that way for some time. They get even more annoyed when they look and see how much memory the kernel is using for caching file contents rather than process memory. The obvious question to ask is: couldn't the kernel cut back a bit on the file caches and keep applications in memory?

The answer is that the kernel can be made to behave that way by tweaking a runtime parameter, but it is not necessarily a good idea. Before getting into that, however, it's worth noting that recent 2.6 kernels have a memory management bug which can cause serious trouble after an application which reads through entire filesystems (updatedb, say, or a backup) has run. The cause is the slab cache's tendency to request allocations of multiple, contiguous pages; these allocations, when done at the behest of filesystem code, can bring the system to a halt. A patch has been merged which fixes this particular problem for 2.6.6.

The bigger issue remains, however: should the kernel swap out user applications in order to cache more file contents? There are plenty of arguments in favor of this behavior. Quite a few large applications set up big areas of memory which they rarely, if ever, use. If application memory is occasionally forced to disk, the unused parts will remain there, and that much physical memory will be freed for more useful contents. Without swapping application memory to disk and seeing what gets faulted back in, it is almost impossible to figure out which pages are not really needed. A large file cache is also a performance enhancer. The speedups that come from having frequently-accessed data in memory are harder to see than the slowdowns caused by having to fault in a large application, but they can lead to better system throughput overall.

Still, there are users who insist that, for example, a system backup should never force OpenOffice out to disk. They don't care how quickly a system maintenance application runs at 3:00 in the morning, but they care a lot about how the system responds when they are at the keyboard. This wish was expressed repeatedly until Andrew Morton exclaimed:

I'm gonna stick my fingers in my ears and sing "la la la" until people tell me "I set swappiness to zero and it didn't do what I wanted it to do".

This helped quiet the debate as the parties involved looked more closely at this particular parameter. Or, perhaps, it was just fear of Andrew's singing. Either way, it has become clear that most people are unaware of what the "swappiness" parameter does; the fact that it has never been documented may have something to do with that.

So... swappiness, which is exported to /proc/sys/vm/swappiness, is a parameter which sets the kernel's balance between reclaiming pages from the page cache and swapping out process memory. The reclaim code works (in a very simplified way) by calculating a few numbers:

  • The "distress" value is a measure of how much trouble the kernel is having freeing memory. The first time the kernel decides it needs to start reclaiming pages, distress will be zero; if more attempts are required, that value goes up, approaching a high value of 100.

  • mapped_ratio is an approximate percentage of how much of the system's total memory is mapped (i.e. is part of a process's address space) within a given memory zone.

  • vm_swappiness is the swappiness parameter, which is set to 60 by default.

With those numbers in hand, the kernel calculates its "swap tendency":

	swap_tendency = mapped_ratio/2 + distress + vm_swappiness;

If swap_tendency is below 100, the kernel will only reclaim page cache pages. Once it reaches that value, however, pages which are part of some process's address space will also be considered for reclaim. So, if life is easy, swappiness is set to 60, and distress is zero, the system will not swap process memory until mapped memory reaches 80% of the total (80/2 + 0 + 60 = 100). Users who would like to never see application memory swapped out can set swappiness to zero; that setting will cause the kernel to ignore process memory until the distress value gets quite high.
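The arithmetic can be tried out with a small user-space model. This is only a sketch of the simplified calculation described here, not the kernel's reclaim code; the function names are made up for illustration, and treating a value of exactly 100 as the point where mapped pages become eligible is an assumption taken from the "below 100" wording.

```c
#include <stdbool.h>

/*
 * User-space model of the simplified calculation in the text.
 * mapped_ratio and distress are on a 0..100 scale; vm_swappiness is
 * the value of the swappiness parameter (60 by default).
 */
static int swap_tendency(int mapped_ratio, int distress, int vm_swappiness)
{
	return mapped_ratio / 2 + distress + vm_swappiness;
}

/*
 * Returns true once process (mapped) pages would be considered for
 * reclaim. Assumption: the threshold is "100 or more".
 */
static bool reclaim_mapped(int mapped_ratio, int distress, int vm_swappiness)
{
	return swap_tendency(mapped_ratio, distress, vm_swappiness) >= 100;
}
```

With the default swappiness of 60 and no distress, reclaim_mapped() turns true when mapped_ratio reaches 80. With swappiness at zero and every page mapped, it stays false until distress reaches 50.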

The swappiness parameter should do what a lot of users want, but it does not solve the whole problem. Swappiness is a global parameter; it affects every process on the system in the same way. What a number of people would like to see, however, is a way to single out individual applications for special treatment. Possible approaches include using the process's "nice" value to control memory behavior; a low-priority process would not be able to push out significant amounts of a high-priority process's memory. Alternatively, the VM subsystem and the scheduler could become more tightly integrated. The scheduler already makes an effort to detect "interactive" processes; those processes could be given the benefit of a larger working set in memory. That sort of thing is 2.7 work, however; in the meantime, people who are unhappy with the kernel's swap behavior may want to try playing with the knobs which have been provided.
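For those who want to experiment, the knob is an ordinary text file holding one integer, so it can be read and written with plain stdio calls. The helpers below are a minimal sketch; their names are hypothetical, and they assume nothing about the file beyond that single-integer format.

```c
#include <stdio.h>

/*
 * Read an integer tunable from a /proc-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_int_tunable(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Write an integer tunable. Returns 0 on success, -1 on failure.
 * Writing to files under /proc/sys normally requires root.
 */
static int write_int_tunable(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling read_int_tunable("/proc/sys/vm/swappiness") returns the current setting, and write_int_tunable("/proc/sys/vm/swappiness", 0) asks the kernel to leave process memory alone for as long as it can.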


Filtered wakeups

Kernel code often finds itself having to wait for a particular physical page; if, for example, a page is currently under I/O, prospective users must wait until that operation has completed. In the early days of 2.4 (and before), the struct page structure (which the kernel uses to track physical memory) contained a wait queue head for this purpose. This technique worked, but adding a wait queue for every page in the system was not a particularly efficient use of memory. At any given time, only a tiny percentage of those wait queues are actually in use.

To recover some of the memory used by wait queues, the kernel developers added the concept of hashed wait queues. The per-page queues were replaced with a much smaller number of shared queues; when a thread needs to wait on a particular page, it hashes the page address to pick the appropriate queue. When the page becomes available, all processes waiting on that queue will be awakened. The use of this technique has since been extended to other parts of the kernel as well.
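The hashing step can be pictured with a toy user-space model. Everything here is an illustrative assumption rather than the kernel's implementation: the table size, the 4K page shift, the modulo hash, and all of the names.

```c
#include <stdint.h>

#define NR_QUEUES	8	/* hypothetical table size, small enough to collide */
#define PAGE_SHIFT_4K	12	/* assumes 4K pages */

/* Stand-in for a wait queue: it only counts its sleepers. */
struct hashed_queue {
	int nr_waiters;
};

static struct hashed_queue queues[NR_QUEUES];

/* Hash a page address down to one of the shared queues. */
static struct hashed_queue *page_queue(const void *page)
{
	uintptr_t pfn = (uintptr_t)page >> PAGE_SHIFT_4K;

	return &queues[pfn % NR_QUEUES];
}

/* Record one more sleeper on the queue that this page hashes to. */
static void wait_on_page(const void *page)
{
	page_queue(page)->nr_waiters++;
}

/*
 * Wake everybody on the page's queue, including sleepers that were
 * waiting for a different page which happens to share the queue.
 * Returns the number of sleepers awakened.
 */
static int wake_page_queue(const void *page)
{
	struct hashed_queue *q = page_queue(page);
	int woken = q->nr_waiters;

	q->nr_waiters = 0;
	return woken;
}
```

In this model the pages at addresses 0x1000 and 0x9000 hash to the same queue, so a wakeup for either one also wakes anything sleeping on the other.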

Hashed wait queues have achieved the desired space savings, but, as it turns out, at a certain computational cost. William Lee Irwin did some research, and found that hash queue collisions are fairly common. So, when a wakeup is performed on one of the hashed wait queues, it is likely that unrelated processes are being awakened. Each of those processes must run, determine that the event they are waiting for has not yet occurred, and go back to sleep. This variant on the "thundering herd" problem can hurt performance.

One possible solution to this problem would be to expand the number of wait queues to make collisions less likely. That approach is simple, but it also would bring back the original problem by expanding the amount of memory dedicated to wait queues. So William came up with another approach, which he calls "filtered wakeups."

The idea behind a filtered wakeup is fairly simple. When a process goes to sleep on a (shared) filtered wait queue, it provides a "key" value, which will typically be the address of the resource being waited for. The wakeup call is made with a key value as well; as the wait queue is traversed, only the processes waiting for the given key are awakened.

The patch which implements filtered waits is fairly simple, and includes an example of their use. It creates a new filtered_wait_queue structure:

	struct filtered_wait_queue {
		void *key;
		wait_queue_t wait;
	};

A process which is about to go into a filtered wait will use code which looks something like the following to create and use a filtered queue entry:


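	/* "wait" is a struct filtered_wait_queue whose key field is set to key */
	do {
		prepare_to_wait(queue, &wait.wait, TASK_INTERRUPTIBLE);
		if (not_ready_yet(key))
			schedule();	/* sleep until a wakeup for this key arrives */
	} while (not_ready_yet(key));
	finish_wait(queue, &wait.wait);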

Awakening a process in this sort of sleep is a simple matter of calling:

    void wake_up_filtered(wait_queue_head_t *queue, void *key);
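The effect of the key comparison can be seen in a small user-space model. This is a sketch only: the structures and function names below are invented for illustration and are not taken from the patch, and "waking" a sleeper just clears a flag.

```c
#include <stddef.h>

#define MAX_WAITERS 16	/* arbitrary limit for the model */

struct waiter {
	const void *key;	/* resource this sleeper is waiting for */
	int asleep;		/* 1 while waiting, 0 once awakened */
};

struct wait_queue_model {
	struct waiter *entries[MAX_WAITERS];
	size_t count;
};

/* Put a sleeper on the shared queue; returns -1 if the queue is full. */
static int queue_add(struct wait_queue_model *q, struct waiter *w,
		     const void *key)
{
	if (q->count >= MAX_WAITERS)
		return -1;
	w->key = key;
	w->asleep = 1;
	q->entries[q->count++] = w;
	return 0;
}

/*
 * Wake only the sleepers whose key matches and leave the rest queued.
 * Returns the number of sleepers awakened.
 */
static size_t wake_filtered(struct wait_queue_model *q, const void *key)
{
	size_t woken = 0, kept = 0;

	for (size_t i = 0; i < q->count; i++) {
		struct waiter *w = q->entries[i];

		if (w->key == key) {
			w->asleep = 0;
			woken++;
		} else {
			q->entries[kept++] = w;
		}
	}
	q->count = kept;
	return woken;
}
```

Sleepers waiting on other keys are never touched, which is what removes the spurious wakeups seen with plain hashed queues. The cost is a walk over the shared queue on each wakeup.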

William claims some significant performance improvements from his changes, including large reductions in CPU usage and a near tripling of the peak I/O rates in some situations.


Patches and updates

Memory management


  • Ulrich Drepper: NUMA API. (April 30, 2004)

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds