Brief items
The current 2.6 prepatch is 2.6.6-rc3, unchanged from last week.
Linus's BitKeeper tree contains, as of this writing, an important workqueue
fix (it seems nobody had actually tried to use
cancel_delayed_work() until now...), an updated MTD concatenating
driver, several architecture updates, and lots of fixes.
The current tree from Andrew Morton is 2.6.6-rc3-mm2. Recent additions to the -mm
tree include another set of reverse mapping VM patches from Hugh Dickins, a
new ia64 hotplug CPU patch set, a patch to enable interrupts while waiting
on spinlocks, the permanent abolition of 8K stacks on the x86 architecture,
a new /proc/sys/kernel/vermagic file to enable package installers
to figure out how the kernel was built, filtered sleeps and wakeups (see
below), a new NUMA API, and, of course, lots of fixes.
Andrew indicates that the scheduling domains
patches are being fixed up and prepared for merging once 2.6.6 is
released. He also plans to merge a number of the reverse mapping VM
patches, including the anonmm work, even
though the final decision on whether to go that way or with the rival anon_vma technique has not yet been made.
The current 2.4 prepatch is 2.4.27-pre2, which was released by Marcelo on May 3. Changes
this time include some crypto updates, some XFS fixes, various networking
updates, and a handful of other fixes.
Kernel development news
"screwed"
-- Alexander Viro's alternative for a less
alarming replacement for the term "tainted," applied to kernels which have
had non-free modules loaded into them.
There has, recently, been a new round of complaints about how the 2.6
kernel swaps out memory. Some users have been very vocal in their belief
that, if they
have sufficient physical memory, their applications should never be swapped
out. These people get annoyed when they sit down at their display in the
morning and find that their office suite or web browser is unresponsive,
and stays that way for some time. They get even more annoyed when they look
and see how much memory the kernel is using for caching file contents
rather than process memory. The obvious question to ask is: couldn't the
kernel cut back a bit on the file caches and keep applications in memory?
The answer is that the kernel can be made to behave that way by tweaking a
runtime parameter, but it is not necessarily a good idea. Before getting
into that, however, it's worth noting that recent 2.6 kernels have a memory
management problem which can cause serious problems after an application
which reads through entire filesystems (updatedb, say, or a backup) has
run. The problem is the slab cache's tendency to request allocations of
multiple, contiguous pages; these allocations, when done at the behest
of filesystem code, can bring the system to a halt. A patch has been merged which fixes this
particular problem for 2.6.6.
The bigger issue remains, however: should the kernel swap out user
applications in order to cache more file contents? There are plenty of
arguments in favor of this behavior. Quite a few large applications set up
big areas of memory which they rarely, if ever, use. If application memory
is occasionally forced to disk, the unused parts will remain there, and
that much physical memory will be freed for more useful contents. Without
swapping application memory to disk and seeing what gets faulted back in,
it is almost impossible to figure out which pages are not really needed.
A large file cache is also a performance enhancer. The speedups that come
from having frequently-accessed data in memory are harder to see than the
slowdowns caused by having to fault in a large application, but they can
lead to better system throughput overall.
Still, there are users who insist that, for example, a system backup should
never force OpenOffice out to disk. They don't care how quickly a system
maintenance application runs at 3:00 in the morning, but they care a lot
about how the system responds when they are at the keyboard. This wish was
expressed repeatedly until Andrew Morton exclaimed:
I'm gonna stick my fingers in my ears and sing "la la la" until
people tell me "I set swappiness to zero and it didn't do what I
wanted it to do".
This helped quiet the debate as the parties involved looked more closely at
this particular parameter. Or, perhaps, it was just fear of Andrew's
singing. Either way, it has become clear that most people are unaware of
what the "swappiness" parameter does; the fact that it has never been documented may
have something to do with that.
So... swappiness, which is exported to
/proc/sys/vm/swappiness, is a parameter which sets the kernel's
balance between reclaiming pages from the page cache and swapping out
process memory. The reclaim code works (in a very simplified way) by
calculating a few numbers:
- The "distress" value is a measure of how much trouble the kernel
is having freeing memory. The first time the kernel decides it needs
to start reclaiming pages, distress will be zero; if more
attempts are required, that value goes up, approaching a high value of
100.
- mapped_ratio is an approximate percentage of how much of the
system's total memory is mapped (i.e. is part of a process's address
space) within a given memory zone.
- vm_swappiness is the swappiness parameter, which is set to 60
by default.
With those numbers in hand, the kernel calculates its "swap tendency":
    swap_tendency = mapped_ratio/2 + distress + vm_swappiness;
If swap_tendency is below 100, the kernel will only reclaim page
cache pages. Once it goes above that value, however, pages which are part
of some process's address space will also be considered for reclaim. So,
if life is easy, swappiness is set to 60, and distress is zero,
the system will not swap
process memory until it reaches 80% of the total. Users who would like to
never see application memory swapped out can set swappiness to zero; that
setting will cause the kernel to ignore process memory until the
distress value gets quite high.
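The calculation described above can be tried out with a small user-space model. This is only a sketch, not the kernel's reclaim code: the struct and function names are invented for illustration, the inputs are assumed to be percentages between 0 and 100, and process memory is assumed to become a reclaim candidate once the tendency reaches 100 (which matches the 80% figure given above).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three numbers described in the text. */
struct reclaim_inputs {
    int mapped_ratio;   /* percentage of memory mapped into address spaces */
    int distress;       /* 0 on the first reclaim pass, rising toward 100 */
    int vm_swappiness;  /* the /proc/sys/vm/swappiness setting */
};

/* The formula quoted in the article. */
int swap_tendency(struct reclaim_inputs in)
{
    assert(in.mapped_ratio >= 0 && in.mapped_ratio <= 100);
    return in.mapped_ratio / 2 + in.distress + in.vm_swappiness;
}

/* Assumption: mapped pages are considered once the tendency reaches 100. */
bool considers_mapped_pages(struct reclaim_inputs in)
{
    return swap_tendency(in) >= 100;
}
```

In this model, a swappiness of zero with every page mapped (mapped_ratio of 100) still needs a distress value of 50 before process memory is touched; with the default of 60 and no distress, the threshold is crossed when mapped_ratio hits 80.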
The swappiness parameter should do what a lot of users want, but it does
not solve the whole problem. Swappiness is a global parameter; it affects
every process on the system in the same way. What a number of people would
like to see, however, is a way to single out individual applications for
special treatment. Possible approaches include using the process's "nice"
value to control memory behavior; a low-priority process would not be able
to push out significant amounts of a high-priority process's memory.
Alternatively, the VM subsystem and the scheduler could become more tightly
integrated. The scheduler already makes an effort to detect "interactive"
processes; those processes could be given the benefit of a larger working
set in memory. That sort of thing is 2.7 work, however; in the meantime,
people who are unhappy with the kernel's swap behavior may want to try
playing with the knobs which have been provided.
Kernel code often finds itself having to wait for a particular physical
page; if, for example, a page is currently under I/O, prospective users
must wait until that operation has completed. In the early days of 2.4
(and before), the
struct page structure (which the kernel uses to
track physical memory) contained a wait queue head for this purpose. This
technique worked, but adding a wait queue for every page in the system was
not a particularly efficient use of memory. At any given time, only a tiny
percentage of those wait queues are actually in use.
To recover some of the memory used by wait queues, the kernel developers
added the concept of hashed wait queues. The per-page queues were replaced
with a much smaller number of shared queues; when a thread needs to wait on
a particular page, it hashes the page address to pick the appropriate
queue. When the page becomes available, all processes waiting on that
queue will be awakened. The use of this technique has since been extended
to other parts of the kernel as well.
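As a rough illustration of the idea, the following user-space sketch maps a page address onto one of a small number of shared queues. The table size, the multiplicative hash, and all of the names here are assumptions made for the example; the kernel's own hash function and queue structures differ.

```c
#include <assert.h>
#include <stdint.h>

#define WAIT_TABLE_BITS 4                        /* hypothetical: 16 shared queues */
#define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

/* Stand-in for a real wait queue head. */
struct wait_queue_stub {
    int nr_waiters;
};

struct wait_queue_stub wait_table[WAIT_TABLE_SIZE];

/* Hash the page address down to a table index (illustrative hash only). */
unsigned int page_queue_index(const void *page)
{
    uint32_t addr = (uint32_t)(uintptr_t)page;

    /* Multiply by a large odd constant and keep the top bits. */
    return (uint32_t)(addr * 2654435761u) >> (32 - WAIT_TABLE_BITS);
}

/* Every waiter on a given page ends up on the same shared queue. */
struct wait_queue_stub *page_wait_queue(const void *page)
{
    unsigned int idx = page_queue_index(page);

    assert(idx < WAIT_TABLE_SIZE);
    return &wait_table[idx];
}
```

Since there are far more pages than queues, unrelated pages will inevitably share a queue; that is the price paid for the memory savings.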
Hashed wait queues have achieved the desired space savings, but, as it
turns out, at a certain computational cost. William Lee Irwin did some research, and found that hash queue
collisions are fairly common. So, when a wakeup is performed on one of the
hashed wait queues, it is likely that unrelated processes are being
awakened. Each of those processes must run, determine that the event they
are waiting for has not yet occurred, and go back to sleep. This variant
on the "thundering herd" problem can hurt performance.
One possible solution to this problem would be to expand the number of wait
queues to make collisions less likely. That approach is simple, but it
also would bring back the original problem by expanding the amount of
memory dedicated to wait queues. So William came up with another approach,
which he calls "filtered wakeups."
The idea behind a filtered wakeup is fairly simple. When a process goes to
sleep on a (shared) filtered wait queue, it provides a "key" value, which
will typically be the address of the resource being waited for. The wakeup
call is made with a key value as well; as the wait queue is traversed, only
the processes waiting for the given key are awakened.
The patch which implements filtered waits is
fairly simple, and includes an example of their use. It creates a new
filtered_wait_queue structure:
    struct filtered_wait_queue {
        void *key;
        wait_queue_t wait;
    };
A process which is about to go into a filtered wait will use code which
looks something like the following to create and use a filtered queue entry:
    DEFINE_FILTERED_WAIT(wait, key);

    do {
        /* Put this entry on the shared queue before testing the condition */
        prepare_to_wait(queue, &wait.wait, TASK_INTERRUPTIBLE);
        if (not_ready_yet(key))
            schedule();
    } while (not_ready_yet(key));
    finish_wait(queue, &wait.wait);
Awakening a process in this sort of sleep is a simple matter of calling:
    void wake_up_filtered(wait_queue_head_t *queue, void *key);
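The difference between an ordinary wakeup and a filtered one can be seen in a toy user-space model of a shared queue. Nothing below is taken from the patch: the structures, the fixed-size array, and the function names are all hypothetical, and "sleeping" is reduced to a flag rather than a real scheduler state.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8   /* hypothetical fixed capacity for this model */

struct waiter {
    void *key;      /* the resource this waiter is sleeping on */
    int asleep;     /* 1 while waiting, 0 once awakened */
};

struct shared_queue {
    struct waiter waiters[MAX_WAITERS];
    size_t count;
};

/* Add a sleeping waiter to the queue; returns NULL if the queue is full. */
struct waiter *queue_sleep(struct shared_queue *q, void *key)
{
    assert(q != NULL);
    if (q->count == MAX_WAITERS)
        return NULL;
    struct waiter *w = &q->waiters[q->count++];
    w->key = key;
    w->asleep = 1;
    return w;
}

/* Unfiltered wakeup: every sleeper on the shared queue is awakened. */
size_t wake_all(struct shared_queue *q)
{
    size_t woken = 0;
    for (size_t i = 0; i < q->count; i++) {
        if (q->waiters[i].asleep) {
            q->waiters[i].asleep = 0;
            woken++;
        }
    }
    return woken;
}

/* Filtered wakeup: only sleepers whose key matches are awakened. */
size_t wake_filtered(struct shared_queue *q, void *key)
{
    size_t woken = 0;
    for (size_t i = 0; i < q->count; i++) {
        if (q->waiters[i].asleep && q->waiters[i].key == key) {
            q->waiters[i].asleep = 0;
            woken++;
        }
    }
    return woken;
}
```

With two waiters hashed onto the same queue but sleeping on different pages, wake_filtered() disturbs only the one whose page became available, where wake_all() would force both to run and recheck their conditions.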
William claims some significant performance
improvements from his changes, including large reductions in CPU usage and
a near tripling of the peak I/O rates in some situations.
Patches and updates
Kernel trees
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Miscellaneous
- Ulrich Drepper: NUMA API.
(April 30, 2004)
Page editor: Jonathan Corbet