Flushing the page cache from user space

Martin Hicks recently posted a patch which adds a new degree of user-space control over memory management policy. In particular, it creates a new /proc entry:

    /proc/sys/vm/toss_page_cache_nodes

If a suitably privileged process writes one or more NUMA node numbers to that file, all page cache pages belonging to those nodes will be flushed out. Essentially, this operation causes a node to forget about all locally cached pages of files in the filesystem.
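As a rough sketch of how an administrator's tool might drive this interface, the helper below writes a list of node numbers to a file. The function name toss_nodes is made up for illustration, and the input format (decimal node numbers separated by spaces, ending in a newline) is an assumption; the article does not say how multiple nodes are delimited. The file itself only exists on kernels carrying the patch.

```c
#include <stdio.h>

/* Path named in the article; present only on kernels with the patch. */
#define TOSS_PATH "/proc/sys/vm/toss_page_cache_nodes"

/*
 * Write `count` node IDs to `path` (hypothetical helper).
 * ASSUMPTION: the file accepts space-separated decimal node numbers
 * followed by a newline; the real format is not given in the article.
 * Returns 0 on success, -1 on any failure.
 */
int toss_nodes(const char *path, const int *nodes, size_t count)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = 1;
    for (size_t i = 0; i < count && ok; i++) {
        /* Separate entries with a single space. */
        if (fprintf(f, "%s%d", i ? " " : "", nodes[i]) < 0)
            ok = 0;
    }
    if (ok && fputc('\n', f) == EOF)
        ok = 0;

    /* fclose() flushes the buffer, so a rejected write shows up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A job launcher would presumably call something like toss_nodes(TOSS_PATH, nodes, n) as root just before starting a new job.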

Clearing the page cache in this way would normally be bad for performance. The page cache exists so that common filesystem requests can be satisfied without going to the disk; clearing the cache defeats that purpose. There are exceptions to everything, however. This patch is aimed at large-scale high-performance computing tasks running in a cluster environment. Such jobs typically do best if they can start with a clean system; they have no real use for whatever may have been cached for the previous user. More to the point, a full page cache can cause memory allocations to be satisfied with non-local (slower) memory, resulting in significantly worse performance. By clearing the cache before starting a new job, a system administrator can ensure that local memory is available for that job.

Not everybody likes the patch. Ingo Molnar thinks that this capability will create confusion and make the debugging of memory problems even harder.

How are we supposed to debug VM problems where one player periodically flushes the whole pagecache? ... Providing APIs to flush system caches, sysctl or syscall, is the road to VM madness.

Andrew Morton, instead, sees the value of the patch for some users, but doesn't like the implementation. He would like to see this capability made useful for other classes of users, such as kernel developers who want to put the system into a known state before running tests. He also doesn't like the /proc interface, and argues for a new system call instead. His suggestion was:

    sys_free_node_memory(long node_id, long pages_to_make_free, 
                         long what_to_free);

This form of the call would allow the clearing of something less than the entire page cache, making the tool a bit less crude. The what_to_free argument would be a bitmask specifying which types of memory to free; beyond the page cache, this call could cause the kernel to reclaim anonymous memory or slab caches.
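To make the bitmask idea concrete, here is a small sketch of how the what_to_free argument might be encoded and checked. None of this comes from the proposal itself: the flag names, their values, and the check_free_request helper are all hypothetical, since the discussion only says that the argument would be a bitmask covering the page cache, anonymous memory, and slab caches.

```c
#include <errno.h>

/* HYPOTHETICAL flag names and values; the proposal does not define any. */
#define FREE_PAGECACHE (1L << 0)   /* page cache pages */
#define FREE_ANON      (1L << 1)   /* anonymous memory */
#define FREE_SLAB      (1L << 2)   /* slab caches */
#define FREE_ALL       (FREE_PAGECACHE | FREE_ANON | FREE_SLAB)

/*
 * Argument check that an implementation might perform before reclaiming
 * anything (hypothetical helper, user-space sketch only).
 * Returns 0 if the request is well-formed, -EINVAL otherwise.
 */
long check_free_request(long node_id, long pages_to_make_free,
                        long what_to_free)
{
    if (node_id < 0 || pages_to_make_free <= 0)
        return -EINVAL;
    /* At least one known bit must be set, and no unknown bits. */
    if (what_to_free == 0 || (what_to_free & ~FREE_ALL) != 0)
        return -EINVAL;
    return 0;
}
```

Under these assumptions, a caller wanting to reclaim page cache and slab but leave anonymous memory alone would pass FREE_PAGECACHE | FREE_SLAB.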

The system call approach would seem to make sense; there is one remaining glitch, however: SUSE already shipped the /proc interface in SLES9. That revelation drew a complaint from Andrew:

This is why you should target kernel.org kernels first. Now we risk ending up with poor old suse carrying an obsolete interface and application developers have to be able to cater for both interfaces.

An explicit purpose behind the 2.6 development model is to get patches into the mainline quickly so that their form can be stabilized before distributors ship them. As the developers become used to this mode of operation, this sort of issue should become relatively rare.



I want something similar

Posted Feb 24, 2005 17:00 UTC (Thu) by ncm (subscriber, #165) [Link]

To me, and to most of us, a much more helpful feature would be a way to identify inodes whose pages should not be cached at all. So that it could be used by unprivileged processes, the effect might apply only for the current process: pages seen by another process from that inode would still stick in the cache, and the current process would still use those, but reads or writes (or unmappings) it initiates would wash out of the cache immediately.

We might also, or instead, have an attribute stored on the inode to identify an uncached file. Then, most programs wouldn't need to know about the feature.

Tar and xine might be good examples of programs that should know about, or at least benefit from, such a feature.

I want something similar

Posted Feb 24, 2005 18:09 UTC (Thu) by jsbarnes (guest, #4096) [Link]

fadvise and madvise could provide this feature for programs like tar &
xine. madvise(addr, len, MADV_DONTNEED) will tell the kernel that a
given virtual address range is unlikely to be accessed again anytime
soon, and I believe there's an equivalent for files as well.

I want something similar

Posted Mar 13, 2005 18:16 UTC (Sun) by farnz (guest, #17727) [Link]

posix_fadvise already provides that functionality. posix_fadvise(fd,0,0,POSIX_FADV_NOREUSE); will tell the kernel that each byte of a file is only accessed once, while posix_fadvise(fd,start,len,POSIX_FADV_DONTNEED); will tell the kernel that a particular range of data is finished with.

Apart from developer time, there's no real reason not to use posix_fadvise and posix_madvise (the equivalent for memory mapped objects); if the kernel doesn't handle them, they get ignored, and they make it possible for the kernel to do more intelligent caching if it does understand them.

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds