Caching plays an important role at almost all levels of a contemporary operating system. Without the ability to cache frequently-used objects in faster memory, performance suffers; the same idea holds whether one is talking about individual cache lines in the processor's memory cache or image data cached by a web browser. But caching requires resources; those needs must be balanced with other demands on the same resources. In other words, sometimes cached data must be dropped; often, overall performance can be improved if the program doing the caching has a say in what gets removed from the cache. A recent patch from John Stultz attempts to make it easier for applications to offer up caches for reclamation when memory gets tight.
John's patch takes a lot of inspiration from the ashmem device implemented for Android by Robert Love. But ashmem functions like a device and performs its own memory management, which makes it hard to merge upstream. John's patch, instead, tries to integrate things more deeply into the kernel's own memory management subsystem. So it takes the form of a new set of options to the posix_fadvise() system call. In particular, an application can mark a range of pages in an open file as "volatile" with the POSIX_FADV_VOLATILE operation. Pages that are so marked can be discarded by the kernel if memory gets tight. Crucially, even dirty pages can be discarded - without writeback - if they have been marked volatile. This operation differs from POSIX_FADV_DONTNEED in that the given pages will not (normally) be discarded right away - the application might want the contents of volatile pages in the future, but it will be able to recover if they disappear.
If a particular range of pages becomes useful later on, the application should use the POSIX_FADV_NONVOLATILE operation to remove the "volatile" marking. The return value from this operation is important: a non-zero return from posix_fadvise() indicates that the kernel has removed one or more pages from the indicated range while it was marked volatile. That is the only indication the application will get that the kernel has accepted its offer and cleaned out some volatile pages. If those pages have not been removed, posix_fadvise() will return zero and the cached data will be available to the application.
There is also a POSIX_FADV_ISVOLATILE operation to query whether a given range has been marked volatile or not.
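As a rough illustration of how an application might drive this interface, here is a minimal sketch in C. It rests on several assumptions: the numeric values given to POSIX_FADV_VOLATILE and POSIX_FADV_NONVOLATILE are placeholders (the patch is not in the mainline, so no header defines them), the helper names cache_offer() and cache_take_back() are invented for the example, and the C library wrapper is assumed to pass a non-zero "purged" return through to the caller. Treating the three documented posix_fadvise() errors as failures and anything else non-zero as "purged" is one possible reading of the description above, not necessarily what the patch does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * ASSUMPTION: the operations proposed by the patch are not in mainline
 * headers. The values below are placeholders for this sketch only; an
 * unpatched kernel is expected to reject them with EINVAL.
 */
#ifndef POSIX_FADV_VOLATILE
#define POSIX_FADV_VOLATILE    1000
#endif
#ifndef POSIX_FADV_NONVOLATILE
#define POSIX_FADV_NONVOLATILE 1001
#endif

enum cache_state {
    CACHE_INTACT,   /* range is no longer volatile; contents survived */
    CACHE_PURGED,   /* kernel discarded at least one page; rebuild it */
    CACHE_ERROR     /* call failed (bad descriptor, unsupported, ...) */
};

/* Offer a range of an open file to the kernel. Returns 0 or an errno value. */
int cache_offer(int fd, off_t offset, off_t len)
{
    assert(len >= 0);
    return posix_fadvise(fd, offset, len, POSIX_FADV_VOLATILE);
}

/* Take the range back before using it, and report whether it survived. */
enum cache_state cache_take_back(int fd, off_t offset, off_t len)
{
    assert(len >= 0);
    int ret = posix_fadvise(fd, offset, len, POSIX_FADV_NONVOLATILE);

    if (ret == 0)
        return CACHE_INTACT;
    /* The documented posix_fadvise() failures are treated as errors. */
    if (ret == EBADF || ret == EINVAL || ret == ESPIPE)
        return CACHE_ERROR;
    /* ASSUMPTION: any other non-zero value means "pages were purged". */
    return CACHE_PURGED;
}
```

An application would call cache_offer() when it finishes with a cached object and cache_take_back() before touching it again, regenerating the data whenever the result is CACHE_PURGED.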
Rik van Riel raised a couple of questions about this functionality. He expressed concern that the kernel might remove a single page of a multi-page cached object, thus wrecking the caching while failing to reclaim all of the memory used to cache that object. Ashmem apparently does its own memory management partially to avoid this very situation; when an object's memory is reclaimed, all of it will be taken back. John would apparently rather avoid adding another least-recently-used list to the kernel, but he did respond that it might be possible to add logic to reclaim an entire volatile range once a single page is taken from that range.
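The "reclaim the whole range once one page is taken" idea can be modeled in a few lines of user-space C. This is purely a toy simulation of the policy under discussion, not kernel code and not taken from the patch; the struct vrange type, the MAX_PAGES limit, and all function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 16    /* arbitrary limit for this toy model */

/* One cached object spanning several pages (hypothetical model type). */
struct vrange {
    size_t npages;
    bool   resident[MAX_PAGES]; /* is page i still in memory? */
    bool   is_volatile;         /* has the application offered it up? */
    bool   purged;              /* did reclaim take anything? */
};

void vrange_init(struct vrange *r, size_t npages)
{
    assert(npages <= MAX_PAGES);
    r->npages = npages;
    r->is_volatile = false;
    r->purged = false;
    for (size_t i = 0; i < npages; i++)
        r->resident[i] = true;
}

void vrange_mark_volatile(struct vrange *r)
{
    r->is_volatile = true;
}

/*
 * Reclaim policy: if any one page of a volatile range is chosen, free
 * every page in the range, since a partial object is useless to the
 * application. Returns the number of pages freed.
 */
size_t vrange_reclaim_page(struct vrange *r, size_t page)
{
    size_t freed = 0;

    if (!r->is_volatile || page >= r->npages || !r->resident[page])
        return 0;
    for (size_t i = 0; i < r->npages; i++) {
        if (r->resident[i]) {
            r->resident[i] = false;
            freed++;
        }
    }
    r->purged = true;
    return freed;
}

/* Remove the volatile marking; non-zero means the contents were lost. */
int vrange_mark_nonvolatile(struct vrange *r)
{
    int was_purged = r->purged;

    r->is_volatile = false;
    r->purged = false;
    return was_purged;
}
```

In this model, picking a single page of a four-page volatile object frees all four pages at once, so the memory spent on a now-useless partial object is not left behind.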
Rik also worried about the overhead of this mechanism and proposed an alternative that he has apparently been thinking about for a while. In this scheme, applications would be able to open (and pass to poll()) a special file descriptor that would receive a message whenever the kernel finds itself short of memory. Applications would be expected to respond by freeing whatever memory they can do without. The mechanism has a certain kind of simplicity, but could also prove difficult in real-world use. When an application gets a "free up some memory" message, the first thing it will probably need to do is to fault in its code for handling that message - an action which will require the allocation of more memory. Marking the memory ahead of time and freeing it directly from the kernel may turn out to be a more reliable approach.
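For comparison, the notification-based scheme might look something like the following sketch on the application side. No such special file descriptor is specified in the discussion, so everything about it here is an assumption: notify_fd stands for whatever descriptor the kernel would provide, the message format is treated as opaque, and struct demo_cache, demo_cache_drop(), and handle_memory_pressure() are hypothetical names. Any readable descriptor, such as the read end of a pipe, can stand in for notify_fd when experimenting.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical application cache; only its size matters for the sketch. */
struct demo_cache {
    size_t bytes;
};

/* Callback type: free what can be spared, return the number of bytes freed. */
typedef size_t (*shrink_fn)(void *cache);

/* Example callback that simply drops the entire cache. */
size_t demo_cache_drop(void *cache)
{
    struct demo_cache *c = cache;
    size_t freed = c->bytes;

    c->bytes = 0;
    return freed;
}

/*
 * Wait up to timeout_ms for a low-memory message on notify_fd and, if one
 * arrives, shrink the cache. Returns 1 if a message was handled, 0 on
 * timeout, and -1 on error.
 */
int handle_memory_pressure(int notify_fd, int timeout_ms,
                           shrink_fn shrink, void *cache)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    char msg[64];   /* ASSUMPTION: message contents are opaque here */
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (!(pfd.revents & POLLIN))
        return -1;
    if (read(notify_fd, msg, sizeof(msg)) <= 0)
        return -1;
    shrink(cache);
    return 1;
}
```

Note that the handler's code, its stack buffer, and the callback must all be resident in memory before the message can be acted on, which is exactly the weakness described above.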
After the recent frontswap discussions, it is perhaps unsurprising that nobody has dared to observe that volatile memory ranges bear a more than passing resemblance to transcendent memory. In particular, it looks a lot like "cleancache," which was merged in the 3.0 development cycle. There are differences: putting a page into cleancache removes it from normal memory while volatile memory can remain in place, and cleancache lacks a user-space interface. But the core idea is the same: asking the system to hold some memory, but allowing that memory to be dropped if the need arises. It could be that the two mechanisms could be made to work together.
But, as noted above, nobody has mentioned this idea, and your editor would certainly not be so daring.
One other question that has not been discussed is whether this code could eventually replace ashmem, reducing the differences between the mainline and the Android kernel. Any such replacement would not happen anytime soon; ashmem has its own ABI that will need to be supported by Android kernels for a long time. Over years, a transition to posix_fadvise() could possibly be made if the Android developers were willing to do so. But first the posix_fadvise() patch will need to get into the mainline. It is a very new patch, so it is hard to say if or when that might happen. Its relatively non-intrusive nature and the clear need for this capability would tend to argue in its favor, though.
POSIX_FADV_VOLATILE
Posted Nov 24, 2011 13:25 UTC (Thu) by juliank (subscriber, #45896) [Link]
POSIX_FADV_VOLATILE
Posted Nov 25, 2011 6:30 UTC (Fri) by jzbiciak (guest, #5246) [Link]
> But, as noted above, nobody has mentioned this idea, and your editor would certainly not be so daring.
Except, of course, in the context of an LWN article. ;-)
The idea also reminds me a bit of the "swapping in userspace" ideas that GNU Hurd has explored, at least on L4. IIRC, the Hurd L4 microkernel only deals in physical pages, and can hand physical pages to tasks or ask for physical pages back. All actual "swapping" decisions get made by the individual user-space tasks themselves.
This does have a certain conceptual nicety to it: When memory pressure increases, you can run garbage collectors more frequently and drop caches more aggressively. And each application can choose the strategy that makes the most sense for itself.
The downside, of course, is that it requires perfect cooperation between all processes. There is no way for the OS to override the decisions of an ill-behaved userspace application, other than to terminate it.
(Note: I don't know what the current state of the art is in GNU Hurd land. The article I linked was from 2005. Then again, Wikipedia suggests that not much has happened since then other than changing microkernels a few times after L4. I think that adds another twist to the microkernel-vs-monolithic kernel debate, but I don't really feel like going there right now. ;-) )
POSIX_FADV_VOLATILE
Posted Nov 25, 2011 10:04 UTC (Fri) by civodul (subscriber, #58311) [Link]
Neal Walfield (GNU Hurd hacker) also explored adaptive application-driven memory management in a series of papers found at http://walfield.org/ .
POSIX_FADV_VOLATILE
Posted Nov 30, 2011 0:41 UTC (Wed) by vomlehn (subscriber, #45588) [Link]
Though the idea of process notifications when the kernel runs low on free pages is a common one, this approach still gives me the willies. Two issues give me pause:
Rik van Riel's issue about multi-page objects is valid, but having the kernel grab all associated pages addresses that and, I expect, does so in a way that matches the way multi-page objects are likely to be used, so I think this is a moot point.
Looking at things from a higher level, though, the two approaches are not exclusive and each has cases where it can do things the other can't.
POSIX_FADV_VOLATILE and Android
Posted Nov 27, 2011 0:21 UTC (Sun) by jhhaller (subscriber, #56103) [Link]
You forgot about NDK
Posted Nov 27, 2011 0:58 UTC (Sun) by khim (subscriber, #9252) [Link]
Close, but no cigar. If you think only about Dalvik, then yes, everything can be emulated. But NDK programs can (and do) use ashmem directly. This means it must be supported for a long, long time. No flag day for you :-)
You forgot about NDK
Posted Nov 27, 2011 10:16 UTC (Sun) by jhhaller (subscriber, #56103) [Link]
Changing the ashmem_* API implementation would be the most straightforward, and would address most applications. Those not using the ashmem_* APIs, or using a static library version of libcutils, deserve what they get, particularly taking heed of the NDK warning about which APIs are stable. If an application uses a static library, they should set the maxSdkVersion attribute, as they are likely to get broken in some future release. But, the ashmem APIs will need to be available essentially forever in the library, although the ashmem header file could be removed in future NDK releases, assuming the new service was implemented in the kernel. The user/kernel dependencies are still somewhat nebulous. While ICS devices are generally built with a 3.0 kernel, the emulator still runs under 2.6.39, so the flag day might still take an additional platform release, or the ashmem implementation would need to know the capability of the kernel. The Android ROM hacker community is generally still working with 2.6.36 kernels, particularly for non-OMAP devices.
POSIX_FADV_VOLATILE
Posted Dec 1, 2011 1:02 UTC (Thu) by zlynx (subscriber, #2285) [Link]
Kernel to process notification?
Posted Mar 23, 2012 18:36 UTC (Fri) by cheako (guest, #81350) [Link]
POSIX_FADV_VOLATILE as a corresponding call for this task and it would be better overall if this lock/unlock||unPOSIX_FADV_VOLATILE/POSIX_FADV_VOLATILE didn't have many options.
POSIX_FADV_VOLATILE
Posted Dec 1, 2011 6:22 UTC (Thu) by kevinm (guest, #69913) [Link]
What happens if a page I've dirtied happens to lie in a range marked volatile by another process - can my changes still be lost?
Not so sure
Posted Dec 1, 2011 15:57 UTC (Thu) by renox (subscriber, #23785) [Link]
Except for the stack for function calls, that's not necessarily true: the coder knows that the goal of the code is to handle a 'low memory' condition, so a good coder will
a) try to minimize the memory needed for this handling of 'low memory'
b) reserve ahead the memory needed for the task
> Marking the memory ahead of time and freeing it directly from the kernel may turn out to be a more reliable approach.
Uh? Let's say the application has a cache; how could it delegate to the kernel the task of reducing the cache size in a 'low memory' condition?
Normally reserved memory, but unused?
Posted Mar 23, 2012 18:31 UTC (Fri) by cheako (guest, #81350) [Link]
POSIX_FADV_VOLATILE
Posted Mar 23, 2012 18:46 UTC (Fri) by cheako (guest, #81350) [Link]
I don't know what effect having many small POSIX_FADV_VOLATILE ranges would have, but an application could clearly make decisions that would pool small caches into one bigger one in order to keep the total number of POSIX_FADV_VOLATILE ranges reasonable.
Perhaps exposing the average lifetime over the past five minutes of any POSIX_FADV_VOLATILE range would help an application decide if it should even bother. There can be other performance counters that would help an application determine the number and target size of POSIX_FADV_VOLATILE ranges.
Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds