Brief items
The current stable 2.6 release is 2.6.12.5, which was
announced on August 14.
Among other things, this update contains fixes for a few security problems.
The current 2.6 prepatch remains 2.6.13-rc6. There has been a slow
but steady stream of fixes trickling into Linus's git repository. It is
unclear, as of this writing, whether the quantity of patches is sufficient
to force another -rc release before 2.6.13 comes out.
The current -mm release remains 2.6.13-rc5-mm1; there have been no -mm
releases since August 7.
Comments (none posted)
Kernel development news
Virtual memory management appears to be a perennially unsolved operating
systems problem. Nobody has yet figured out how to perform page
replacement in such a way as to ensure that the pages that will be needed
in the future may be found in main memory. Crystal balls, it seems, remain
fiendishly difficult to implement.
The reigning algorithm used in most systems is a variant of the
least-recently-used (LRU) scheme. If a page has not been used in a long
time, the reasoning goes, it probably will not be needed again in the near
future; pages which have not been used for a while are thus candidates for
eviction from main memory. In practice, tracking the usage of every page
would impose an unacceptable amount of overhead, and is not done. Instead,
the VM subsystem scans sequentially through the "active list" of pages in
use, marking them as "inactive." Pages on the inactive
list are candidates for eviction. Some of those pages will certainly be
needed soon, however, with the result that they will be referenced
before that eviction takes place. When this happens, the affected pages
are put back on the active list at the "recently used" end. As long as
pages stay in the inactive list for a reasonable time before eviction, this
algorithm approximates a true LRU scheme.
This mechanism tends to fall apart with certain types of workloads,
however. Actions like initializing a huge array, reading a large file (for
streaming media playback, for example), starting OpenOffice, or walking
through a large part of the filesystem can fill main memory with pages
which are unlikely to be used again anytime soon - at the expense of the
pages the system actually needs. Pages from files start in the inactive
list and may, at least, be shoved back out relatively quickly, but
anonymous memory pages go straight to the active list.
Many Linux users are familiar with the
occasional sluggish response which can come after the active list has been
flushed in this way; with some workloads, this behavior can be a constant
thing, and the system will consistently perform poorly.
Rik van Riel has recently posted a set of patches aimed at improving the
performance of the VM subsystem under contemporary loads. The algorithm
implemented is based on CLOCK-Pro,
developed by Song Jiang, Feng Chen, and Xiaodong Zhang. CLOCK-Pro attempts
to move beyond the LRU approach by tracking how often pages are accessed
and tweaking the behavior of the VM code to match. At its core, CLOCK-Pro
tries to ensure that pages in the inactive list are referenced less
frequently than those on the active list. It thus differs from LRU
schemes, which prioritize the most recently accessed pages even if those
particular pages are almost never used by the application. Consider, as an
example, the diagram to the right showing access patterns for two pages.
At the time t1 marked by the red line, an LRU algorithm would
prefer page 2 over page 1, even though the latter is more likely
to be used again in the near future.
Implementing CLOCK-Pro requires that the kernel keep track of pages which
have recently been evicted from main memory. To this end, Rik's patches
create a new data structure which tries to perform this tracking without
adding much extra overhead. There is a new kernel function:
int do_remember_page(struct address_space *mapping, unsigned long index);
The VM code will, when moving a page out of main memory, first call
remember_page() with the relevant information. This function
implements a data structure which looks a little like the following:
When a page is to be remembered, a hash value is generated from the
mapping and index parameters; this value will be used as
an index into the nonres_table array. Each hash bucket contains a
fixed number of entries for nonresident pages. do_remember_page()
treats the hash bucket like a circular buffer; it will use the
hand index to store a cookie representing the page (a separate
hash, essentially) in the next available slot, possibly overwriting
information which was there before. The size of the entire data structure
is chosen so that it can remember approximately as many evicted pages as
there are pages of real memory in the system. The cost of the structure is
one 32-bit word for each remembered page.
At some point in the future, the kernel will find itself faulting a page
into memory. It can then see if it has seen that page before with a call
to:
int recently_evicted(struct address_space *mapping, unsigned long index);
A non-negative return value indicates that the given page was found in the
nonresident page cache, and had, indeed, been evicted not all that long
ago. The return value is actually an estimate of the page's "distance" - a
value which is taken by seeing how far the page's entry is from the current
value of the hand index (in a circular buffer sense) and scaling
it by the size of the array. In a rough sense, the distance is the number
of pages which have been evicted since the page of interest was pushed out.
Whenever a page is faulted in, the kernel computes a distance for the
oldest page in the active list; this distance is an estimate taken from how
long ago the oldest page would have been scanned (at the current rate).
This distance is compared to the distance of the newly-faulted page (which
is scaled relative to the total number of recently evicted pages) to get a
sense for whether this page (which had been evicted) has been accessed more
frequently than the oldest in-memory page. If so, the kernel concludes that
the wrong pages are in memory; in response, it will decrease the maximum
desired size of the active list to make room for the more-frequently
accessed pages which are languishing in secondary storage. The kernel will
also, in this case, add the just-faulted page directly to the active list,
on the theory that it will be useful for a while.
If, instead, pages being faulted in are more "distant" than in-core pages,
the VM subsystem concludes that it is doing the right thing. In this
situation, the size of the active list will be slowly increased (up to a
maximum limit). More
distant pages are faulted in to the inactive list, meaning that they are
more likely to be evicted again in the near future.
Your editor applied the patch to a vanilla 2.6.12 kernel and ran some
highly scientific tests: a highly parallel kernel make while simultaneously
running a large "grep -r to read large amounts of file data
into the page cache. The patched kernel adds a file
(/proc/refaults) which summarizes the results from the nonresident
page cache; after this experiment it looked like this:
Refault distance Hits
0 - 4096 138
4096 - 8192 108
8192 - 12288 93
12288 - 16384 88
16384 - 20480 86
20480 - 24576 84
24576 - 28672 59
28672 - 32768 48
32768 - 36864 53
36864 - 40960 46
40960 - 45056 43
45056 - 49152 46
49152 - 53248 39
53248 - 57344 39
57344 - 61440 39
New/Beyond 61440 11227
This histogram shows that the vast majority of pages brought into the
system had never been seen before; they would be mainly the result of the
large grep. A much smaller number of pages - a few hundred - had
very small distances. If the patch is working right, those pages (being,
one hopes, important things like the C compiler) would be fast-tracked into
the active list while the large number of unknown pages would be hustled
back out of main memory relatively quickly.
As it turns out, the patch doesn't work
right quite yet. Much of the structure is in place, but the desired
results are not yet being seen. These details will presumably be worked
out before too long. Only at that point will it be possible to benchmark
the new paging code and decide whether it truly performs better or not.
One never knows ahead of time with virtual memory code; the proof, as they
say, is in the paging.
[Thanks to Rik van Riel for his review of a previous draft of this
article.]
Comments (4 posted)
Steven Rostedt recently
ran into a little
problem. He was trying to read the value of a kernel variable using
/dev/kmem, but his attempts returned an I/O error. The resulting
inquiry has led to people asking whether
/dev/kmem should exist at
all.
Unix-like systems have, since nearly the beginning, offered a couple of
character device files called /dev/mem and /dev/kmem.
/dev/mem is a straightforward window into main memory; a suitably
privileged application can access any physical page in the system by
opening /dev/mem and seeking to its physical address. This
special file can also be used to map parts of the physical address space
directly into a process's virtual space, though this only works for
addresses which do not correspond to RAM (the X server uses it, for
example, to access the video adapter's memory and control registers).
/dev/kmem is supposed to be different in that its window is from
the kernel's point of view. A valid offset in /dev/kmem would be
a kernel virtual address - these addresses look much like physical
addresses, but they are not. On commonly-configured i386 systems, for
example, the base of the kernel's virtual address space is at
0xc0000000. The code which implements mmap() for
/dev/kmem looks like this in 2.6.12:
if (!pfn_valid(vma->vm_pgoff))
return -EIO;
val = (u64)vma->vm_pgoff << PAGE_SHIFT;
vma->vm_pgoff = __pa(val) >> PAGE_SHIFT;
return mmap_mem(file, vma);
The idea is to turn the kernel virtual address into a physical address
(using __pa()), then use the regular /dev/mem mapping
function. The problem, of course, is that the pfn_valid() test is
performed before the given page frame number has been moved into the
physical space; thus, any attempt to map an address in the kernel's virtual
space will return -EIO - except on some systems with large amounts
of physical memory, and, even then, the result will not be what the
programmer was after. This mistake would almost certainly be a security
hole, except that only root can access /dev/kmem in the first
place.
Linus has merged a simple fix for 2.6.13.
It does not even try to solve the whole problem, in that it still fails to
properly check the full address range requested by the application. But
the real question that has come out of this episode is: is there any reason
to keep /dev/kmem around? The fact that it has been broken for
some time suggests that there are not a whole lot of users out there. It
has been suggested that root kits are the largest user community for this
kind of access, but there are no forward compatibility guarantees for root
kit authors. The Fedora kernel, as it turns out, has not supported
/dev/kmem for a long time.
Removing a feature like that is not in the cards for 2.6.13. But, unless
some sort of important user shows up, chances are that /dev/kmem
will not survive into 2.6.14. Anybody who would be inconvenienced by that
change should speak up soon.
Comments (15 posted)
struct page is at the core of the memory management subsystem;
one of these structures exists for every physical page of memory on the
system (and for a few places which are not memory). Since a typical system
will contain large numbers of
page structures, there is a great
deal of pressure to keep that structure small. But there are a lot of
things that the kernel needs to know about pages. The result is that
struct page contains a densely-packed
flags field, and
that the developers continually worry about running out of space for flags
- even though a fair number of them are currently unused. Some of these
flags also carry a fair amount of historical baggage which would be nice to
clean up.
Consider, for example, a flag called PG_checked. Its definition
in include/linux/page-flags.h (2.6.13-rc6) reads as follows:
#define PG_checked 8 /* kill me in 2.5.<early>. */
Somebody clearly missed a deadline. In fact, there is a certain amount of
confusion over just what this flag does. A bit of research revealed that
it is used in several filesystems, and that it is unlikely to go away
anytime soon. ext3 uses this flag to mark pages to be written to disk at a
future time. AFS uses it to indicate valid directory pages. Reiserfs uses
this flag for journaling purposes. And the (out-of-tree) cachefs
implementation uses it to mark pages currently being written to local
backing store.
So this flag clearly is not going away anytime soon, much less by
2.5.early. In an effort to clarify the situation, Daniel Phillips has
posted a patch which renames the flag as
follows:
#define PG_fs_misc 8 /* don't let me spread */
There is some disagreement over naming, but the core of the patch is
uncontroversial. This flag will officially be dedicated to filesystem use.
Another flag with significant history is PG_reserved. In this
case, too, the meaning of the flag has been somewhat obscured over time,
though it can be summarized as "this page is special and the VM subsystem
should leave it alone." It marks parts of the physical address space which
have page structures, but which are not real memory - the legacy
ISA hole in the i386 space, for example. The memory dedicated to the
kernel text is also marked reserved. The kernel function which maps
physical address spaces into a process's virtual space
(remap_pfn_range()) will refuse to remap unreserved memory,
leading to a long history of device drivers setting that flag to remap
internal buffers.
The consensus seems to be that the "reserved" flag can go. So Nick Piggin
has been working on a patch
which takes it out - mostly. In many cases, code which was testing that
flag was really trying to decide if it was looking at a valid RAM page;
there are other, better ways of making that test. In other cases, the
higher-level VMA structure (which has its own VM_RESERVED flag)
contains all of the needed information. In the remap_pfn_range()
case, the test is simply removed, allowing all memory to be remapped. This
change will modify the behavior of /dev/mem, which, previously,
could not be used to mmap() regular RAM.
All that is left, after Nick's patch, is a set of tests in the software
suspend code. Once that has been taken care of, PG_reserved can
go.
Comments (none posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
- Marco Costalba: qgit-0.9.
(August 16, 2005)
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>