Brief items
announced on August 14. Among other things, this update contains fixes for a few security problems.
The current 2.6 prepatch remains 2.6.13-rc6. There has been a slow but steady stream of fixes trickling into Linus's git repository. It is unclear, as of this writing, whether the quantity of patches is sufficient to force another -rc release before 2.6.13 comes out.
The current -mm release remains 2.6.13-rc5-mm1; there have been no -mm releases since August 7.
Kernel development news
The reigning algorithm used in most systems is a variant of the least-recently-used (LRU) scheme. If a page has not been used in a long time, the reasoning goes, it probably will not be needed again in the near future; pages which have not been used for a while are thus candidates for eviction from main memory. In practice, tracking the usage of every page would impose an unacceptable amount of overhead, and is not done. Instead, the VM subsystem scans sequentially through the "active list" of pages in use, marking them as "inactive." Pages on the inactive list are candidates for eviction. Some of those pages will certainly be needed soon, however, with the result that they will be referenced before that eviction takes place. When this happens, the affected pages are put back on the active list at the "recently used" end. As long as pages stay in the inactive list for a reasonable time before eviction, this algorithm approximates a true LRU scheme.
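The two-list scheme described above can be sketched in user-space C. This is only an illustration, not kernel code: the names (page_list, vm_lists, demote_oldest(), reference_page(), evict_one()) and the fixed-size arrays are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAGES 8   /* hypothetical capacity of each list */

/* A list of page ids; index 0 is the oldest end, count-1 the most recent. */
struct page_list {
    int ids[MAX_PAGES];
    int count;
};

struct vm_lists {
    struct page_list active;
    struct page_list inactive;
};

/* Add a page at the "recently used" end of a list. */
static void list_push_recent(struct page_list *l, int id)
{
    assert(l->count < MAX_PAGES);
    l->ids[l->count++] = id;
}

/* Remove and return the oldest page, or -1 if the list is empty. */
static int list_pop_oldest(struct page_list *l)
{
    int id;

    if (l->count == 0)
        return -1;
    id = l->ids[0];
    memmove(&l->ids[0], &l->ids[1],
            (size_t)(l->count - 1) * sizeof(l->ids[0]));
    l->count--;
    return id;
}

/* Remove a specific page; returns 1 if it was on the list, 0 otherwise. */
static int list_remove(struct page_list *l, int id)
{
    for (int i = 0; i < l->count; i++) {
        if (l->ids[i] == id) {
            memmove(&l->ids[i], &l->ids[i + 1],
                    (size_t)(l->count - i - 1) * sizeof(l->ids[0]));
            l->count--;
            return 1;
        }
    }
    return 0;
}

/* The scanner: move the oldest active page to the inactive list. */
static int demote_oldest(struct vm_lists *vm)
{
    int id = list_pop_oldest(&vm->active);

    if (id >= 0)
        list_push_recent(&vm->inactive, id);
    return id;
}

/* A reference to an inactive page rescues it back to the active list. */
static void reference_page(struct vm_lists *vm, int id)
{
    if (list_remove(&vm->inactive, id))
        list_push_recent(&vm->active, id);
}

/* Eviction takes the page which has been inactive the longest. */
static int evict_one(struct vm_lists *vm)
{
    return list_pop_oldest(&vm->inactive);
}
```

A page which is demoted and then referenced before evict_one() reaches it ends up at the recent end of the active list; the longer pages sit on the inactive list, the more closely this behaves like true LRU.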
This mechanism tends to fall apart with certain types of workloads, however. Actions like initializing a huge array, reading a large file (for streaming media playback, for example), starting OpenOffice, or walking through a large part of the filesystem can fill main memory with pages which are unlikely to be used again anytime soon - at the expense of the pages the system actually needs. Pages from files start in the inactive list and may, at least, be shoved back out relatively quickly, but anonymous memory pages go straight to the active list. Many Linux users are familiar with the occasional sluggish response which can come after the active list has been flushed in this way; with some workloads, this behavior can be a constant thing, and the system will consistently perform poorly.
Rik van Riel has recently posted a set of patches aimed at improving the performance of the VM subsystem under contemporary loads. The algorithm implemented is based on CLOCK-Pro, developed by Song Jiang, Feng Chen, and Xiaodong Zhang. CLOCK-Pro attempts to move beyond the LRU approach by tracking how often pages are accessed and tweaking the behavior of the VM code to match. At its core, CLOCK-Pro tries to ensure that pages in the inactive list are referenced less frequently than those on the active list. It thus differs from LRU schemes, which prioritize the most recently accessed pages even if those particular pages are almost never used by the application. Consider, as an example, the diagram to the right showing access patterns for two pages. At the time t1 marked by the red line, an LRU algorithm would prefer page 2 over page 1, even though the latter is more likely to be used again in the near future.
Implementing CLOCK-Pro requires that the kernel keep track of pages which have recently been evicted from main memory. To this end, Rik's patches create a new data structure which tries to perform this tracking without adding much extra overhead. There is a new kernel function:
int do_remember_page(struct address_space *mapping, unsigned long index);
The VM code will, when moving a page out of main memory, first call remember_page() with the relevant information. This function implements a data structure which looks a little like the following:
When a page is to be remembered, a hash value is generated from the mapping and index parameters; this value will be used as an index into the nonres_table array. Each hash bucket contains a fixed number of entries for nonresident pages. do_remember_page() treats the hash bucket like a circular buffer; it will use the hand index to store a cookie representing the page (a separate hash, essentially) in the next available slot, possibly overwriting information which was there before. The size of the entire data structure is chosen so that it can remember approximately as many evicted pages as there are pages of real memory in the system. The cost of the structure is one 32-bit word for each remembered page.
At some point in the future, the kernel will find itself faulting a page into memory. It can then see if it has seen that page before with a call to:
int recently_evicted(struct address_space *mapping, unsigned long index);
A non-negative return value indicates that the given page was found in the nonresident page cache, and had, indeed, been evicted not all that long ago. The return value is actually an estimate of the page's "distance" - a value which is taken by seeing how far the page's entry is from the current value of the hand index (in a circular buffer sense) and scaling it by the size of the array. In a rough sense, the distance is the number of pages which have been evicted since the page of interest was pushed out.
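For readers who want something concrete, here is a minimal user-space sketch of a nonresident page table along the lines described above. It is not Rik's code: the table dimensions, the mix() hash, the _sketch function names, the use of plain integers in place of the mapping pointer, and the choice to forget an entry once it has been found are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BUCKETS 64u   /* hypothetical; must be a power of two */
#define NR_SLOTS    8u   /* hypothetical entries per hash bucket */

struct nr_bucket {
    uint32_t hand;               /* next slot to overwrite */
    uint32_t cookie[NR_SLOTS];   /* one 32-bit word per remembered page */
};

static struct nr_bucket nonres_table[NR_BUCKETS];

/* Hypothetical mixing function; any reasonable hash would do here. */
static uint32_t mix(uint64_t mapping, uint64_t index, uint32_t seed)
{
    uint64_t h = (mapping ^ seed) * 0x9E3779B97F4A7C15ull;

    h ^= index + 0x9E3779B97F4A7C15ull + (h << 6) + (h >> 2);
    h *= 0xBF58476D1CE4E5B9ull;
    return (uint32_t)(h >> 32);
}

/* The first hash selects the bucket. */
static struct nr_bucket *bucket_for(uint64_t mapping, uint64_t index)
{
    assert((NR_BUCKETS & (NR_BUCKETS - 1)) == 0);
    return &nonres_table[mix(mapping, index, 1) & (NR_BUCKETS - 1)];
}

/* A second hash identifies the page; 0 is reserved for "empty slot". */
static uint32_t cookie_for(uint64_t mapping, uint64_t index)
{
    uint32_t c = mix(mapping, index, 2);

    return c ? c : 1;
}

/* Store the page's cookie at the hand, overwriting whatever was there. */
static void remember_page_sketch(uint64_t mapping, uint64_t index)
{
    struct nr_bucket *b = bucket_for(mapping, index);

    b->cookie[b->hand] = cookie_for(mapping, index);
    b->hand = (b->hand + 1) % NR_SLOTS;
}

/*
 * Returns an estimated distance if the page is remembered, -1 otherwise.
 * The distance is the number of slots written in this bucket since the
 * page was stored, scaled by the number of buckets on the assumption
 * that evictions are spread evenly across the table.
 */
static long recently_evicted_sketch(uint64_t mapping, uint64_t index)
{
    struct nr_bucket *b = bucket_for(mapping, index);
    uint32_t wanted = cookie_for(mapping, index);

    for (uint32_t i = 0; i < NR_SLOTS; i++) {
        if (b->cookie[i] == wanted) {
            uint32_t age = (b->hand + NR_SLOTS - 1 - i) % NR_SLOTS;

            b->cookie[i] = 0;   /* assumption: forget it, it is resident again */
            return (long)(age * NR_BUCKETS);
        }
    }
    return -1;
}
```

Because only a 32-bit cookie is stored, two different pages can occasionally be mistaken for each other; that imprecision is the price of keeping the cost down to one word per remembered page.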
Whenever a page is faulted in, the kernel computes a distance for the oldest page in the active list; this distance is an estimate taken from how long ago the oldest page would have been scanned (at the current rate). This distance is compared to the distance of the newly-faulted page (which is scaled relative to the total number of recently evicted pages) to get a sense for whether this page (which had been evicted) has been accessed more frequently than the oldest in-memory page. If so, the kernel concludes that the wrong pages are in memory; in response, it will decrease the maximum desired size of the active list to make room for the more-frequently accessed pages which are languishing in secondary storage. The kernel will also, in this case, add the just-faulted page directly to the active list, on the theory that it will be useful for a while.
If, instead, pages being faulted in are more "distant" than in-core pages, the VM subsystem concludes that it is doing the right thing. In this situation, the size of the active list will be slowly increased (up to a maximum limit). More distant pages are faulted in to the inactive list, meaning that they are more likely to be evicted again in the near future.
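The feedback loop described in the last two paragraphs might be sketched as follows. The structure, the step sizes, and the handle_refault() name are hypothetical, and treating never-seen pages (a negative distance) as "more distant" is an assumption of this example rather than something taken from the patch.

```c
#include <assert.h>

/* Hypothetical tuning state for the active list. */
struct active_tuning {
    unsigned long target;       /* desired maximum size of the active list */
    unsigned long min_target;
    unsigned long max_target;
    unsigned long shrink_step;  /* applied when the wrong pages are in memory */
    unsigned long grow_step;    /* smaller, so that growth is slow */
};

enum refault_list { TO_INACTIVE = 0, TO_ACTIVE = 1 };

/*
 * Decide where a newly faulted page goes and adjust the active list
 * target. refault_distance is negative for a page which is not in the
 * nonresident table; oldest_active_distance is the estimate for the
 * oldest page on the active list.
 */
static enum refault_list handle_refault(struct active_tuning *t,
                                        long refault_distance,
                                        unsigned long oldest_active_distance)
{
    assert(t->min_target <= t->target && t->target <= t->max_target);

    if (refault_distance >= 0 &&
        (unsigned long)refault_distance < oldest_active_distance) {
        /* The evicted page was "closer" than a resident one: shrink. */
        if (t->target - t->min_target > t->shrink_step)
            t->target -= t->shrink_step;
        else
            t->target = t->min_target;
        return TO_ACTIVE;
    }

    /* The right pages appear to be in memory: grow slowly, up to the limit. */
    if (t->max_target - t->target > t->grow_step)
        t->target += t->grow_step;
    else
        t->target = t->max_target;
    return TO_INACTIVE;
}
```

With a target of 1000 pages, a shrink step of 64 and a grow step of 8, a refault at distance 50 against an oldest-active distance of 500 would return TO_ACTIVE and lower the target to 936; a refault at distance 900 would return TO_INACTIVE and raise it by 8.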
Your editor applied the patch to a vanilla 2.6.12 kernel and ran some highly scientific tests: a highly parallel kernel make while simultaneously running a large "grep -r" to read large amounts of file data into the page cache. The patched kernel adds a file (/proc/refaults) which summarizes the results from the nonresident page cache; after this experiment it looked like this:
Refault distance      Hits
    0 -  4096          138
 4096 -  8192          108
 8192 - 12288           93
12288 - 16384           88
16384 - 20480           86
20480 - 24576           84
24576 - 28672           59
28672 - 32768           48
32768 - 36864           53
36864 - 40960           46
40960 - 45056           43
45056 - 49152           46
49152 - 53248           39
53248 - 57344           39
57344 - 61440           39
New/Beyond 61440     11227
This histogram shows that the vast majority of pages brought into the system had never been seen before; they would be mainly the result of the large grep. A much smaller number of pages - a few hundred - had very small distances. If the patch is working right, those pages (being, one hopes, important things like the C compiler) would be fast-tracked into the active list while the large number of unknown pages would be hustled back out of main memory relatively quickly.
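To put a number on "the vast majority": the fifteen distance ranges in the table add up to 1009 hits, against 11227 pages in the "New/Beyond" row, so roughly 92% of the faulted pages were new or beyond the table's reach. The sketch below shows how such a histogram could be kept and that fraction computed; the names, and the choice to put a distance of exactly 4096 in the second range, are assumptions of this example and not taken from the patch.

```c
#include <assert.h>

#define BUCKET_WIDTH 4096L
#define NR_RANGES    15     /* 0 - 4096 through 57344 - 61440 */

/* The final slot counts pages which are new or beyond the last range. */
struct refault_hist {
    unsigned long hits[NR_RANGES + 1];
};

/* Account one fault; a negative distance means "never seen before". */
static void record_refault(struct refault_hist *h, long distance)
{
    long bucket = NR_RANGES;

    assert(h);
    if (distance >= 0 && distance / BUCKET_WIDTH < NR_RANGES)
        bucket = distance / BUCKET_WIDTH;
    h->hits[bucket]++;
}

/* Fraction of all recorded faults which landed in the New/Beyond slot. */
static double new_fraction(const struct refault_hist *h)
{
    unsigned long total = 0;

    assert(h);
    for (int i = 0; i <= NR_RANGES; i++)
        total += h->hits[i];
    return total ? (double)h->hits[NR_RANGES] / (double)total : 0.0;
}
```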
As it turns out, the patch doesn't work right quite yet. Much of the structure is in place, but the desired results are not yet being seen. These details will presumably be worked out before too long. Only at that point will it be possible to benchmark the new paging code and decide whether it truly performs better or not. One never knows ahead of time with virtual memory code; the proof, as they say, is in the paging.
[Thanks to Rik van Riel for his review of a previous draft of this article.]
ran into a little problem. He was trying to read the value of a kernel variable using /dev/kmem, but his attempts returned an I/O error. The resulting inquiry has led to people asking whether /dev/kmem should exist at all.
Unix-like systems have, since nearly the beginning, offered a couple of character device files called /dev/mem and /dev/kmem. /dev/mem is a straightforward window into main memory; a suitably privileged application can access any physical page in the system by opening /dev/mem and seeking to its physical address. This special file can also be used to map parts of the physical address space directly into a process's virtual space, though this only works for addresses which do not correspond to RAM (the X server uses it, for example, to access the video adapter's memory and control registers).
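The "open and seek" access pattern looks like the following from user space. This is a sketch using only standard POSIX calls; the read_at() helper name is made up for the example, and it takes the path as a parameter so that it can be tried on any seekable file without touching /dev/mem.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64   /* physical addresses may not fit in 32 bits */

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes starting at the given offset of a file.
 * Returns the number of bytes read, or -1 on any failure.
 */
static ssize_t read_at(const char *path, off_t offset, void *buf, size_t len)
{
    ssize_t n;
    int fd;

    assert(path && buf);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* For /dev/mem the offset is a physical address. */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }

    n = read(fd, buf, len);
    close(fd);
    return n;
}
```

A suitably privileged process could call read_at("/dev/mem", 0x1000, buf, sizeof(buf)) to look at the second physical page on a system with 4096-byte pages; whether that succeeds depends on the kernel and its configuration.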
/dev/kmem is supposed to be different in that its window is from the kernel's point of view. A valid offset in /dev/kmem would be a kernel virtual address - these addresses look much like physical addresses, but they are not. On commonly-configured i386 systems, for example, the base of the kernel's virtual address space is at 0xc0000000. The code which implements mmap() for /dev/kmem looks like this in 2.6.12:
    if (!pfn_valid(vma->vm_pgoff))
        return -EIO;
    val = (u64)vma->vm_pgoff << PAGE_SHIFT;
    vma->vm_pgoff = __pa(val) >> PAGE_SHIFT;
    return mmap_mem(file, vma);
The idea is to turn the kernel virtual address into a physical address (using __pa()), then use the regular /dev/mem mapping function. The problem, of course, is that the pfn_valid() test is performed before the given page frame number has been moved into the physical space; thus, any attempt to map an address in the kernel's virtual space will return -EIO - except on some systems with large amounts of physical memory, and, even then, the result will not be what the programmer was after. This mistake would almost certainly be a security hole, except that only root can access /dev/kmem in the first place.
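The ordering problem can be reproduced in a small user-space model. Everything here is a stand-in: pfn_valid_sketch() and pa_sketch() only imitate the kernel's pfn_valid() and __pa(), the 512MB memory size is hypothetical, and kmem_pgoff_translate_first() merely illustrates one way of ordering the steps; it is not presented as the fix which was merged.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xc0000000ull  /* kernel virtual base, common i386 setup */
#define MAX_PFN      0x20000ull     /* hypothetical machine with 512MB of RAM */

/* Stand-in for pfn_valid(): is this page frame backed by RAM? */
static int pfn_valid_sketch(uint64_t pfn)
{
    return pfn < MAX_PFN;
}

/* Stand-in for __pa(): kernel virtual address to physical address. */
static uint64_t pa_sketch(uint64_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}

/* Same ordering as the 2.6.12 code: validate first, translate second. */
static int kmem_pgoff_buggy(uint64_t *pgoff)
{
    assert(pgoff);
    if (!pfn_valid_sketch(*pgoff))
        return -EIO;
    *pgoff = pa_sketch(*pgoff << PAGE_SHIFT) >> PAGE_SHIFT;
    return 0;
}

/* Translate to a physical frame number first, then validate that. */
static int kmem_pgoff_translate_first(uint64_t *pgoff)
{
    uint64_t vaddr, pfn;

    assert(pgoff);
    vaddr = *pgoff << PAGE_SHIFT;
    if (vaddr < PAGE_OFFSET)
        return -EIO;            /* not a kernel virtual address at all */
    pfn = pa_sketch(vaddr) >> PAGE_SHIFT;
    if (!pfn_valid_sketch(pfn))
        return -EIO;
    *pgoff = pfn;
    return 0;
}
```

For the kernel virtual address 0xc0100000 (page offset 0xc0100), the first function returns -EIO because 0xc0100 is far beyond the last real page frame, while the second yields physical frame 0x100. On a machine with enough memory for frame 0xc0100 to be valid, the first version would pass its check while testing the wrong page.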
Linus has merged a simple fix for 2.6.13. It does not even try to solve the whole problem, in that it still fails to properly check the full address range requested by the application. But the real question that has come out of this episode is: is there any reason to keep /dev/kmem around? The fact that it has been broken for some time suggests that there are not a whole lot of users out there. It has been suggested that root kits are the largest user community for this kind of access, but there are no forward compatibility guarantees for root kit authors. The Fedora kernel, as it turns out, has not supported /dev/kmem for a long time.
Removing a feature like that is not in the cards for 2.6.13. But, unless some sort of important user shows up, chances are that /dev/kmem will not survive into 2.6.14. Anybody who would be inconvenienced by that change should speak up soon.
Consider, for example, a flag called PG_checked. Its definition in include/linux/page-flags.h (2.6.13-rc6) reads as follows:
#define PG_checked 8 /* kill me in 2.5.<early>. */
Somebody clearly missed a deadline. In fact, there is a certain amount of confusion over just what this flag does. A bit of research revealed that it is used in several filesystems, and that it is unlikely to go away anytime soon. ext3 uses this flag to mark pages to be written to disk at a future time. AFS uses it to indicate valid directory pages. Reiserfs uses this flag for journaling purposes. And the (out-of-tree) cachefs implementation uses it to mark pages currently being written to local backing store.
So this flag clearly is not going away anytime soon, much less by 2.5.early. In an effort to clarify the situation, Daniel Phillips has posted a patch which renames the flag as follows:
#define PG_fs_misc 8 /* don't let me spread */
There is some disagreement over naming, but the core of the patch is uncontroversial. This flag will officially be dedicated to filesystem use.
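For those who have not looked at page flags before: the number in the definition is a bit position within the flags word of the page structure, which is why every flag is a scarce resource. The sketch below shows the idea with a stand-in structure and hypothetical helper names; the real kernel uses atomic bit operations on struct page, which this simple version does not attempt to reproduce.

```c
#include <assert.h>

#define PG_fs_misc 8   /* bit number, as in the renamed definition */

/* Stand-in for struct page; only the flags word matters here. */
struct page_sketch {
    unsigned long flags;
};

/* Return 1 if the given flag bit is set, 0 otherwise. */
static int test_flag(const struct page_sketch *p, unsigned int bit)
{
    assert(bit < 8 * sizeof(p->flags));
    return (int)((p->flags >> bit) & 1UL);
}

/* Set the given flag bit (not atomic, unlike the kernel's helpers). */
static void set_flag(struct page_sketch *p, unsigned int bit)
{
    assert(bit < 8 * sizeof(p->flags));
    p->flags |= 1UL << bit;
}

/* Clear the given flag bit. */
static void clear_flag(struct page_sketch *p, unsigned int bit)
{
    assert(bit < 8 * sizeof(p->flags));
    p->flags &= ~(1UL << bit);
}
```

Each filesystem is free to give the bit its own meaning, as the ext3, AFS and reiserfs uses above show; that works only as long as the flag stays confined to filesystem code.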
Another flag with significant history is PG_reserved. In this case, too, the meaning of the flag has been somewhat obscured over time, though it can be summarized as "this page is special and the VM subsystem should leave it alone." It marks parts of the physical address space which have page structures, but which are not real memory - the legacy ISA hole in the i386 space, for example. The memory dedicated to the kernel text is also marked reserved. The kernel function which maps physical address spaces into a process's virtual space (remap_pfn_range()) will refuse to remap unreserved memory, leading to a long history of device drivers setting that flag to remap internal buffers.
The consensus seems to be that the "reserved" flag can go. So Nick Piggin has been working on a patch which takes it out - mostly. In many cases, code which was testing that flag was really trying to decide if it was looking at a valid RAM page; there are other, better ways of making that test. In other cases, the higher-level VMA structure (which has its own VM_RESERVED flag) contains all of the needed information. In the remap_pfn_range() case, the test is simply removed, allowing all memory to be remapped. This change will modify the behavior of /dev/mem, which, previously, could not be used to mmap() regular RAM.
All that is left, after Nick's patch, is a set of tests in the software suspend code. Once that has been taken care of, PG_reserved can go.
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds