POSIX_FADV_VOLATILE

By Jonathan Corbet
November 22, 2011
Caching plays an important role at almost all levels of a contemporary operating system. Without the ability to cache frequently-used objects in faster memory, performance suffers; the same idea holds whether one is talking about individual cache lines in the processor's memory cache or image data cached by a web browser. But caching requires resources; those needs must be balanced with other demands on the same resources. In other words, sometimes cached data must be dropped; often, overall performance can be improved if the program doing the caching has a say in what gets removed from the cache. A recent patch from John Stultz attempts to make it easier for applications to offer up caches for reclamation when memory gets tight.

John's patch takes a lot of inspiration from the ashmem device implemented for Android by Robert Love. But ashmem functions like a device and performs its own memory management, which makes it hard to merge upstream. John's patch, instead, tries to integrate things more deeply into the kernel's own memory management subsystem. So it takes the form of a new set of options to the posix_fadvise() system call. In particular, an application can mark a range of pages in an open file as "volatile" with the POSIX_FADV_VOLATILE operation. Pages that are so marked can be discarded by the kernel if memory gets tight. Crucially, even dirty pages can be discarded - without writeback - if they have been marked volatile. This operation differs from POSIX_FADV_DONTNEED in that the given pages will not (normally) be discarded right away - the application might want the contents of volatile pages in the future, but it will be able to recover if they disappear.

If a particular range of pages becomes useful later on, the application should use the POSIX_FADV_NONVOLATILE operation to remove the "volatile" marking. The return value from this operation is important: a non-zero return from posix_fadvise() indicates that the kernel has removed one or more pages from the indicated range while it was marked volatile. That is the only indication the application will get that the kernel has accepted its offer and cleaned out some volatile pages. If those pages have not been removed, posix_fadvise() will return zero and the cached data will be available to the application.
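The mark/unmark cycle described above might look roughly like the following from user space. This is only a sketch: the numeric values given to POSIX_FADV_VOLATILE and POSIX_FADV_NONVOLATILE are placeholders (the patch's actual numbers are not stated here), the helper names are invented for illustration, and treating EINVAL as "kernel does not support this" is an assumption about how an unpatched kernel would respond.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Placeholder advice values; the real numbers would come from the patch. */
#ifndef POSIX_FADV_VOLATILE
#define POSIX_FADV_VOLATILE    8
#endif
#ifndef POSIX_FADV_NONVOLATILE
#define POSIX_FADV_NONVOLATILE 9
#endif

enum cache_state { CACHE_INTACT, CACHE_PURGED, CACHE_UNSUPPORTED };

/* Interpret the return value of a POSIX_FADV_NONVOLATILE call. */
enum cache_state classify_unmark(int ret)
{
    if (ret == 0)
        return CACHE_INTACT;       /* no pages were dropped; data is usable */
    if (ret == EINVAL)
        return CACHE_UNSUPPORTED;  /* assumption: kernel lacks the patch */
    return CACHE_PURGED;           /* any other non-zero value: rebuild */
}

/* Offer the byte range [off, off + len) of fd to the kernel for reclaim.
 * Returns 0 on success or an error number, as posix_fadvise() does. */
int cache_offer(int fd, off_t off, off_t len)
{
    assert(len >= 0);
    return posix_fadvise(fd, off, len, POSIX_FADV_VOLATILE);
}

/* Withdraw the offer and report whether the cached data survived. */
enum cache_state cache_take_back(int fd, off_t off, off_t len)
{
    assert(len >= 0);
    return classify_unmark(posix_fadvise(fd, off, len, POSIX_FADV_NONVOLATILE));
}
```

An application would call cache_offer() when it stops using a cached object and cache_take_back() before touching it again, regenerating the object whenever the result is CACHE_PURGED.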

There is also a POSIX_FADV_ISVOLATILE operation to query whether a given range has been marked volatile or not.

Rik van Riel raised a couple of questions about this functionality. He expressed concern that the kernel might remove a single page of a multi-page cached object, thus wrecking the caching while failing to reclaim all of the memory used to cache that object. Ashmem apparently does its own memory management partially to avoid this very situation; when an object's memory is reclaimed, all of it will be taken back. John would apparently rather avoid adding another least-recently-used list to the kernel, but he did respond that it might be possible to add logic to reclaim an entire volatile range once a single page is taken from that range.
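To make the multi-page concern concrete, the arithmetic can be sketched as below. The helper and structure names are invented for illustration. An object of 10,000 bytes starting at offset 5,000 in a file with 4,096-byte pages touches three pages, so dropping any single one of them invalidates the whole object while returning only a third of the memory it occupies.

```c
#include <assert.h>
#include <stdint.h>

/* Page-aligned span that fully covers a cached object. */
struct page_span {
    uint64_t start;   /* byte offset of the first page touched */
    uint64_t length;  /* total bytes in all pages touched */
    uint64_t pages;   /* number of pages touched */
};

/* Compute which pages the object at [off, off + len) occupies. */
struct page_span covering_pages(uint64_t off, uint64_t len, uint64_t page_size)
{
    struct page_span span = { 0, 0, 0 };
    uint64_t first, last;

    assert(page_size > 0);
    if (len == 0)
        return span;

    first = off / page_size;              /* index of first page touched */
    last  = (off + len - 1) / page_size;  /* index of last page touched */

    span.start  = first * page_size;
    span.pages  = last - first + 1;
    span.length = span.pages * page_size;
    return span;
}
```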

Rik also worried about the overhead of this mechanism and proposed an alternative that he has apparently been thinking about for a while. In this scheme, applications would be able to open (and pass to poll()) a special file descriptor that would receive a message whenever the kernel finds itself short of memory. Applications would be expected to respond by freeing whatever memory they can do without. The mechanism has a certain kind of simplicity, but could also prove difficult in real-world use. When an application gets a "free up some memory" message, the first thing it will probably need to do is to fault in its code for handling that message - an action which will require the allocation of more memory. Marking the memory ahead of time and freeing it directly from the kernel may turn out to be a more reliable approach.
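The application side of such a notification scheme might be sketched as follows. No such special file descriptor exists in this discussion beyond the proposal, so notify_fd, the message format, and the app_cache structure are all hypothetical; only poll(), read(), and free() are real interfaces. Any readable descriptor (a pipe, for example) can stand in for the notification source when experimenting.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical application cache that can be thrown away on request. */
struct app_cache {
    void  *data;
    size_t size;
};

/* Wait up to timeout_ms for a low-memory message on notify_fd.
 * Returns 1 if a message arrived and the cache was freed,
 * 0 if there was nothing to read, and -1 on error. */
int handle_memory_pressure(int notify_fd, int timeout_ms, struct app_cache *cache)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    char msg[64];
    int ready;

    assert(cache != NULL);

    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0 || !(pfd.revents & POLLIN))
        return 0;

    /* Consume the message; its contents are ignored in this sketch. */
    if (read(notify_fd, msg, sizeof msg) < 0)
        return -1;

    /* Give back whatever the application can do without. */
    free(cache->data);
    cache->data = NULL;
    cache->size = 0;
    return 1;
}
```

Note that everything this function touches (its own code, the stack, the cache bookkeeping) must be resident for it to run, which is exactly the difficulty raised above.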

After the recent frontswap discussions, it is perhaps unsurprising that nobody has dared to observe that volatile memory ranges bear a more than passing resemblance to transcendent memory. In particular, it looks a lot like "cleancache," which was merged in the 3.0 development cycle. There are differences: putting a page into cleancache removes it from normal memory while volatile memory can remain in place, and cleancache lacks a user-space interface. But the core idea is the same: asking the system to hold some memory, but allowing that memory to be dropped if the need arises. It could be that the two mechanisms could be made to work together.

But, as noted above, nobody has mentioned this idea, and your editor would certainly not be so daring.

One other question that has not been discussed is whether this code could eventually replace ashmem, reducing the differences between the mainline and the Android kernel. Any such replacement would not happen anytime soon; ashmem has its own ABI that will need to be supported by Android kernels for a long time. Over years, a transition to posix_fadvise() could possibly be made if the Android developers were willing to do so. But first the posix_fadvise() patch will need to get into the mainline. It is a very new patch, so it is hard to say if or when that might happen. Its relatively non-intrusive nature and the clear need for this capability would tend to argue in its favor, though.



POSIX_FADV_VOLATILE

Posted Nov 24, 2011 13:25 UTC (Thu) by juliank (subscriber, #45896) [Link]

Defining a new variable with POSIX in its name without it being in POSIX seems wrong.

POSIX_FADV_VOLATILE

Posted Nov 25, 2011 6:30 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

> But, as noted above, nobody has mentioned this idea, and your editor would certainly not be so daring.

Except, of course, in the context of an LWN article. ;-)

The idea also reminds me a bit of the "swapping in userspace" ideas that GNU Hurd has explored, at least on L4. IIRC, the Hurd L4 microkernel only deals in physical pages, and can hand physical pages to tasks or ask for physical pages back. All actual "swapping" decisions get made by the individual user-space tasks themselves.

This does have a certain conceptual nicety to it: When memory pressure increases, you can run garbage collectors more frequently and drop caches more aggressively. And each application can choose the strategy that makes the most sense for itself.

The downside, of course, is that it requires perfect cooperation between all processes. There is no way for the OS to override the decisions of an ill-behaved userspace application, other than to terminate it.

(Note: I don't know what the current state of the art is in GNU Hurd land. The article I linked was from 2005. Then again, Wikipedia suggests that not much has happened since then other than changing microkernels a few times after L4. I think that adds another twist to the microkernel-vs-monolithic kernel debate, but I don't really feel like going there right now. ;-) )

POSIX_FADV_VOLATILE

Posted Nov 25, 2011 10:04 UTC (Fri) by civodul (subscriber, #58311) [Link]

> In this scheme, applications would be able to open (and pass to poll()) a special file descriptor that would receive a message whenever the kernel finds itself short of memory. Applications would be expected to respond by freeing whatever memory they can do without.

Neal Walfield (GNU Hurd hacker) also explored adaptive application-driven memory management in a series of papers found at http://walfield.org/ .

POSIX_FADV_VOLATILE

Posted Nov 30, 2011 0:41 UTC (Wed) by vomlehn (subscriber, #45588) [Link]

Though the idea of process notifications when the kernel runs low on free pages is a common one, this approach still gives me the willies. Two issues give me pause:

  • In the general case, processes freeing memory may temporarily use more memory (say, by swapping pages in or allocating file buffers for output) in order to free memory. Marking pages for the kernel to grab without running the process that has them seems much less likely to dead- or livelock.
  • The reason the kernel is running out of memory may very well be due to a new process grabbing memory as it starts up. Such a process may grab memory a lot faster than the cleanup process happens.

Rik Van Riel's issue about multipage objects is valid, but having the kernel grab all associated pages addresses that and, I expect, does so in a way that matches the way multi-page objects are likely to be used, so I think this is a moot point.

Looking at things from a higher level, though, the two approaches are not exclusive and each has cases where it can do things the other can't.

POSIX_FADV_VOLATILE and Android

Posted Nov 27, 2011 0:21 UTC (Sun) by jhhaller (subscriber, #56103) [Link]

The statement that Android needs to support an older mechanism for some time is too strong, as an Android version generally is an association between a user and a kernel. It's quite unusual to upgrade the kernel without updating the userland (although the opposite isn't necessarily true). Adding a configuration to the userland build to match it to the kernel should be sufficient to allow a flag day change.

You forgot about NDK

Posted Nov 27, 2011 0:58 UTC (Sun) by khim (subscriber, #9252) [Link]

Close, but no cigar. If you think only about Dalvik, then yes, everything can be emulated. But NDK programs can (and do) use ashmem directly. This means it must be supported for a long, long time. No flag day for you :-)

You forgot about NDK

Posted Nov 27, 2011 10:16 UTC (Sun) by jhhaller (subscriber, #56103) [Link]

If NDK applications were statically linked with libcutils and libc, then yes, they won't work. The shared libraries could be changed to emulate the previous behavior, but, admittedly, that's starting to get a bit extreme to change open and ioctl to strip off the ashmem operations and emulate them.

Changing the ashmem_* API implementation would be the most straightforward, and would address most applications. Those not using the ashmem_* APIs, or using a static library version of libcutils, deserve what they get, particularly taking heed of the NDK warning about which APIs are stable. If an application uses a static library, they should set the maxSdkVersion attribute, as they are likely to get broken in some future release. But, the ashmem APIs will need to be available essentially forever in the library, although the ashmem header file could be removed in future NDK releases, assuming the new service was implemented in the kernel. The user/kernel dependencies are still somewhat nebulous. While ICS devices are generally built with a 3.0 kernel, the emulator still runs under 2.6.39, so the flag day might still take an additional platform release, or the ashmem implementation would need to know the capability of the kernel. The Android ROM hacker community is generally still working with 2.6.36 kernels, particularly for non-OMAP devices.

POSIX_FADV_VOLATILE

Posted Dec 1, 2011 1:02 UTC (Thu) by zlynx (subscriber, #2285) [Link]

Overriding this call seems wrong to me. In my opinion a new flag to mmap would work better. Then the entire mmap range could be cleared, fixing the potential problem of losing one page from the middle of an object.

Kernel to process notification?

Posted Mar 23, 2012 18:36 UTC (Fri) by cheako (guest, #81350) [Link]

This solution still needs a way for the kernel to let the process know when its memory was relinquished.

POSIX_FADV_VOLATILE has a corresponding call for this task, and it would be better overall if this lock/unlock||unPOSIX_FADV_VOLATILE/POSIX_FADV_VOLATILE didn't have many options.

POSIX_FADV_VOLATILE

Posted Dec 1, 2011 6:22 UTC (Thu) by kevinm (guest, #69913) [Link]

Pagecache pages are shared between processes.

What happens if a page I've dirtied happens to lie in a range marked volatile by another process - can my changes still be lost?

Not so sure

Posted Dec 1, 2011 15:57 UTC (Thu) by renox (subscriber, #23785) [Link]

> When an application gets a "free up some memory" message, the first thing it will probably need to do is to fault in its code for handling that message - an action which will require the allocation of more memory.

Except for the stack for function calls, that's not necessarily true: the coder knows that the goal of the code is to handle a 'low memory' condition, so a good coder will
a) try to minimize the memory needed for this handling of 'low memory'
b) reserve ahead the memory needed for the task
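Points a) and b) can be sketched as a small arena that is set aside at startup and used only by the low-memory handler. All names here are invented for illustration, and the 4,096-byte size is arbitrary. Writing to the buffer at startup makes the kernel back it with real pages, but that alone does not keep those pages from being swapped out later; pinning them with mlock() would be a further step that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RESERVE_BYTES 4096

/* Scratch memory set aside in advance for the low-memory handler. */
static _Alignas(16) unsigned char reserve[RESERVE_BYTES];
static size_t reserve_used;

/* Call once at startup: write to the arena so its pages are allocated. */
void reserve_init(void)
{
    memset(reserve, 0xA5, sizeof reserve);
    reserve_used = 0;
}

/* Hand out n bytes from the arena, rounded up to a 16-byte boundary.
 * Returns NULL when the arena is exhausted; never calls malloc(). */
void *reserve_alloc(size_t n)
{
    size_t aligned = (n + 15) & ~(size_t)15;
    void *p;

    assert(reserve_used <= RESERVE_BYTES);
    if (aligned < n || aligned > RESERVE_BYTES - reserve_used)
        return NULL;

    p = reserve + reserve_used;
    reserve_used += aligned;
    return p;
}
```

A handler written this way needs no new heap allocations while memory is short, though its code pages may still have to be faulted in.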

> Marking the memory ahead of time and freeing it directly from the kernel may turn out to be a more reliable approach.

Uh? Let's say the application has a cache, how could it delegate to the kernel the task to reduce the cache size in 'low memory' condition?

Normally reserved memory, but unused?

Posted Mar 23, 2012 18:31 UTC (Fri) by cheako (guest, #81350) [Link]

b doesn't sound too pleasant. Perhaps a process can register the size of memory needed for the task, but that's trying to find a cure for the cure.

POSIX_FADV_VOLATILE

Posted Mar 23, 2012 18:46 UTC (Fri) by cheako (guest, #81350) [Link]

I believe that the kernel should reclaim the entire range; however, I also think that a big application would have several levels of cache and, in the case of tabs, each tab might have a cache.

I don't know what effect having many small POSIX_FADV_VOLATILE ranges would be, but an application could clearly make decisions that would pool small caches into one bigger in order to keep the total number of POSIX_FADV_VOLATILE ranges reasonable.

Perhaps exposing the average lifetime over the past five minutes of any POSIX_FADV_VOLATILE range would help an application decide if it should even bother. There can be other performance counters that would help an application determine the number and target size of POSIX_FADV_VOLATILE ranges.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds