Volatile ranges and MADV_FREE

By Jonathan Corbet
March 19, 2014

Within the kernel, the "shrinker" interface allows the memory-management subsystem to inform other subsystems that memory is tight and that some space should be freed if possible. Various attempts have been made to add a similar mechanism that would allow the kernel to ask user-space processes to do some tidying up, but all have run up against the familiar problems of complexity and the general difficulty of getting memory-management changes merged. That doesn't stop developers from trying, though; recently two new patches of this type have been posted.

Both of these patch sets implement variations on a feature that has often gone by the name volatile ranges. A volatile range is a region of memory in a process's address space that is used to store data that can be regenerated if need be. If the kernel finds itself short of memory, it can take pages from a volatile range, secure in the knowledge that the process using that range of memory can recover from the loss, albeit with a possible performance hit. But, as long as memory remains plentiful, volatile ranges will not be reclaimed by the kernel and the data cached there can be freely used by applications.

Much of the volatile range work is motivated by the desire to create a replacement for Android's ashmem mechanism that is better integrated with the core memory-management subsystem. But there are other potential users of this functionality as well.

Volatile ranges

There have been many versions of the volatile ranges patch set over the last few years. At times, volatile ranges were implemented with the posix_fadvise() system call; at others, the feature was added to fallocate() instead. Other versions have made it a feature of madvise(). But version 11 of the volatile ranges patch set from John Stultz takes none of those approaches. Instead, it adds a new system call:

    int vrange(void *start, size_t length, int mode, int *purged);

In this incarnation, a vrange() call operates on the length bytes of memory beginning at start. If mode is VRANGE_VOLATILE, that range of memory will be marked as volatile. If, instead, mode is VRANGE_NONVOLATILE, the volatile marking will be removed. By that time, though, some or all of the pages previously marked as being volatile might have been reclaimed; if so, *purged will be set to a non-zero value to indicate that the previous contents of that memory range are no longer available. If *purged is set to zero, the application knows that the memory contents have not been lost.
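
The calling pattern might look something like the following minimal sketch. Since vrange() exists only in the proposed patch set, the sketch supplies a user-space stand-in with the same signature so that it can be compiled and run anywhere. The constant values, the zero-on-success return convention, simulate_purge(), and the cache helpers are all assumptions made for illustration; none of them come from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical values; the patch set's real constants may differ. */
#define VRANGE_NONVOLATILE 0
#define VRANGE_VOLATILE    1

/* Stand-in bookkeeping for a single range (the real state lives in the kernel). */
static int range_is_volatile;
static int range_was_purged;

/* User-space stand-in using the article's signature; NOT the real system call. */
static int vrange(void *start, size_t length, int mode, int *purged)
{
    (void)start;
    (void)length;
    assert(mode == VRANGE_VOLATILE || mode == VRANGE_NONVOLATILE);
    if (mode == VRANGE_VOLATILE) {
        range_is_volatile = 1;
        range_was_purged = 0;
    } else {
        range_is_volatile = 0;
        if (purged)
            *purged = range_was_purged;
        range_was_purged = 0;
    }
    return 0;   /* assumed convention: zero on success */
}

/* Pretend that the kernel reclaimed the range under memory pressure. */
static void simulate_purge(void *start, size_t length)
{
    if (range_is_volatile) {
        memset(start, 0, length);
        range_was_purged = 1;
    }
}

static char cache[4096];
static int regenerations;

/* Rebuild the cached data; a real program would redo some expensive work. */
static void regenerate(void)
{
    memset(cache, 'x', sizeof(cache));
    regenerations++;
}

/* Finished with the cache for now: the kernel may take the pages. */
static void cache_release(void)
{
    vrange(cache, sizeof(cache), VRANGE_VOLATILE, NULL);
}

/* Pin the cache before use, regenerating it if the contents were lost. */
static const char *cache_acquire(void)
{
    int purged = 0;

    if (vrange(cache, sizeof(cache), VRANGE_NONVOLATILE, &purged) != 0)
        return NULL;
    if (purged)
        regenerate();
    return cache;
}
```

A program would bracket each use of the cache with cache_acquire() and cache_release(); the regeneration cost is paid only when the purged flag reports that pages were actually taken.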

A process may continue to access memory contained within a volatile range. Should it attempt to access a page that has been reclaimed, though, it will get a SIGBUS signal to indicate that the page is no longer there. Thus, programs that are prepared to handle that signal can use volatile ranges without the need for a second vrange() call before actually accessing the memory.

This version of the patch differs from its predecessors in another significant way: it only works with anonymous pages while the previous versions worked only with the tmpfs filesystem. Working with anonymous pages satisfies the need to simplify the patch set as much as possible in the hope of getting it reviewed and eventually merged, but it has a significant cost: the inability to work with tmpfs means that volatile ranges are not a viable replacement for ashmem. The intent is to support the file-backed case (which adds more complexity) after there is consensus on the basic patch.

Internally, vrange() works at the virtual memory area (VMA) level. All pages within a VMA are either volatile or not; if need be, VMAs will be split or coalesced in response to vrange() calls. This should make a vrange() call reasonably fast since there is no need to iterate over every page in the range.

MADV_FREE

A different approach to a similar problem can be seen in Minchan Kim's MADV_FREE patch set. This patch adds a new command to the existing madvise() system call:

    int madvise(void *addr, size_t length, int advice);

Like vrange(), madvise() operates on a range of memory specified by the caller; what it does is determined by the advice argument. Callers can specify MADV_SEQUENTIAL to tell the kernel that the pages in that range will be accessed sequentially, or MADV_RANDOM to indicate the opposite. The MADV_DONTNEED call causes the kernel to reclaim the indicated pages immediately and drop their contents.
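
The effect of MADV_DONTNEED can be observed with a small test. The following sketch assumes Linux and a private anonymous mapping, where the documented behavior is that dropped pages are refilled with zeroes on the next access; other systems may treat the request differently. The helper name byte_after_dontneed() is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store a byte in a fresh anonymous page, drop the page with
 * MADV_DONTNEED, and return what that byte reads as afterward.
 * Returns -1 if mmap() or madvise() fails.
 */
static int byte_after_dontneed(unsigned char value)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    int result = -1;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = value;
    /* The kernel discards the page contents right away. */
    if (madvise(p, len, MADV_DONTNEED) == 0)
        result = p[0];   /* this access faults in a new, zero-filled page */

    munmap(p, len);
    return result;
}
```

On Linux, byte_after_dontneed(42) should return 0: the stored value is gone as soon as the madvise() call returns, whether or not memory is tight.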

The new MADV_FREE operation is similar to MADV_DONTNEED, but there is an important difference. Rather than reclaiming the pages immediately, this operation marks them for "lazy freeing" at some future point. Should the kernel run low on memory, these pages will be among the first reclaimed for other uses; should the application try to use such a page after it has been reclaimed, the kernel will give it a new, zero-filled page. But if memory is not tight, pages marked with MADV_FREE will remain in place; a future access to those pages will clear the "lazy free" bit and use the memory that was there before the MADV_FREE call.

There is no way for the calling application to know if the contents of those pages have been discarded or not without examining the data contained therein. So a program could conceivably implement something similar to volatile ranges by putting a recognizable structure into each page before the MADV_FREE operation, then testing for that structure's presence before accessing any other data in the pages. But that does not seem to be the intended use case for this feature.
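
For the curious, the sentinel trick might be sketched as below. This is an illustration under stated assumptions rather than anything from the patch: the header layout, the marker value, and every function name are invented here. It also relies on the behavior documented in the Linux madvise(2) man page for the MADV_FREE that was eventually merged, where a write to a page cancels its pending lazy free. For that reason the check writes to the page before reading the marker, so the answer cannot change afterward.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_MAGIC UINT32_C(0xC0FFEE01)   /* arbitrary non-zero marker */

/* Hypothetical header kept at the start of each cached page. */
struct page_hdr {
    uint32_t magic;   /* PAGE_MAGIC while the page contents are valid */
    uint32_t touch;   /* scratch word, written to cancel a lazy free */
};

static size_t page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (size_t)sz;
}

/* Map one zero-filled private anonymous page; returns NULL on failure. */
static void *page_alloc(void)
{
    void *p = mmap(NULL, page_size(), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Stamp the marker once the rest of the page has been filled in. */
static void page_mark_valid(void *page)
{
    ((struct page_hdr *)page)->magic = PAGE_MAGIC;
}

/* Tell the kernel that it may discard the page when memory gets tight. */
static int page_lazy_free(void *page)
{
#ifdef MADV_FREE
    return madvise(page, page_size(), MADV_FREE);
#else
    (void)page;
    return -1;   /* MADV_FREE is not available with these headers */
#endif
}

/*
 * Returns 1 if the old contents survived, 0 if the page was discarded
 * (or never marked). The write comes first: it either cancels the pending
 * free or faults in a fresh zero page, so the marker read after it is stable.
 */
static int page_still_valid(void *page)
{
    volatile struct page_hdr *h = page;

    h->touch = 1;
    return h->magic == PAGE_MAGIC;
}
```

A range spanning several pages would need a marker in every page, since the kernel can discard pages individually.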

Instead, MADV_FREE appears to be aimed at user-space memory allocator implementations. When an application frees a set of pages, the allocator will use an MADV_FREE call to tell the kernel that the contents of those pages no longer matter. Should the application quickly allocate more memory in the same address range, it will use the same pages, thus avoiding much of the overhead of freeing the old pages and allocating and zeroing the new ones. In short, MADV_FREE is meant as a way to say "I don't care about the data in this address range, but I may reuse the address range itself in the near future."
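
A toy version of that allocator pattern is sketched here. The one-entry free list, the chunk size, and the function names are all invented for illustration; real allocators are far more elaborate. The MADV_FREE call is compiled in only where the system headers define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE (64 * 1024)   /* a multiple of the common page sizes */

static void *spare_chunk;        /* one-entry free list */

/* Hand out a chunk, preferring the one most recently freed. */
static void *chunk_alloc(void)
{
    void *p = spare_chunk;

    if (p) {
        /* Same address range; the old pages may well still be in place. */
        spare_chunk = NULL;
        return p;
    }
    p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give a chunk back, keeping its address range around for quick reuse. */
static void chunk_free(void *p)
{
    assert(p != NULL);
    if (spare_chunk) {
        /* The free list is full, so really return this one. */
        munmap(p, CHUNK_SIZE);
        return;
    }
#ifdef MADV_FREE
    /*
     * The contents no longer matter, but the range may be reused soon.
     * A failure here is harmless: the pages simply stay resident.
     */
    (void)madvise(p, CHUNK_SIZE, MADV_FREE);
#endif
    spare_chunk = p;
}
```

As with malloc(), a caller of chunk_alloc() cannot assume anything about the initial contents of a recycled chunk: it may hold the old data or freshly zeroed pages, depending on whether the kernel reclaimed anything in the meantime.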

It's worth noting that MADV_FREE is already supported by BSD kernels, so, unlike vrange(), it would not be a Linux-only feature. Indeed, it would likely improve the portability of programs that use this feature on BSD systems now.

Neither patch has received much in the way of reviews as of this writing. The real review, in any case, is likely to happen at this year's Linux Storage, Filesystem, and Memory Management Summit, which begins on March 24. LWN will be there, and we promise to make at least a token effort to not be too distracted by the charms of California wine country; stay tuned for reports from that discussion.

Volatile ranges and MADV_FREE

Posted Mar 20, 2014 17:36 UTC (Thu) by justincormack (subscriber, #70439)

The purged thing makes no sense to me; there is a race condition, so surely you have to deal with the signal anyway. But an interface with signals is ugly. I think pages that may just be zeroed, combined with a sentinel, are actually easier to code for.

Volatile ranges and MADV_FREE

Posted Mar 21, 2014 11:11 UTC (Fri) by etienne (guest, #25256)

> pages that may just zero and using a sentinel

Race? Thread 1 checks the sentinel, thread 2 interrupts and uses a lot of memory, the page is reclaimed, thread 2 frees lots of memory, the page is refilled again, and thread 1 uses the wrong data.

Volatile ranges and MADV_FREE

Posted Mar 21, 2014 11:14 UTC (Fri) by justincormack (subscriber, #70439)

Good point. OK I can't see any way to make it non racy...

Volatile ranges and MADV_FREE

Posted Mar 21, 2014 11:21 UTC (Fri) by mchapman (subscriber, #66589)

> Good point. OK I can't see any way to make it non racy...

It's not a race. When thread 1 reads the sentinel the MADV_FREE advice is dropped, so thread 2's activity won't cause the page to be reclaimed.

Volatile ranges and MADV_FREE

Posted Mar 25, 2014 14:58 UTC (Tue) by nix (subscriber, #2304)

You have to deal with the signal anyway, but signal dispatch is notably slow: checking for purgedness first should reduce the frequency of signals in this case by an order of magnitude or so.

Volatile ranges and MADV_FREE

Posted Mar 20, 2014 19:58 UTC (Thu) by jstultz (subscriber, #212)

Just a small clarification: While I sent out v11 and had reworked most of the changes myself for this release, it started with a code base Minchan and I have been working on (off and on) together for the last year, which also included introducing the vrange syscall (mostly because madvise doesn't provide the needed semantics for properly handling both errors and providing info on purged state of the range). I really want to make sure Minchan gets credit for this, because he has really been the heavy lifter on the patch set over the last 3 revisions or so, and his contributions have been critical.

My main focus with the v11 release is to pare back the scope of the functionality and to stay within common mm code practices (both suggested by Johannes Weiner in his review of Minchan's v10 release) in order to keep the patch small and get more reviewer interest in the basic semantics.

Similarly Minchan's MADV_FREE is splitting off a usage case that we had earlier tried to fit within the volatile ranges interface, and trying to simply implement that functionality on its own.

Basically the problem with earlier volatile ranges patches was that we had adopted too many use cases, which all had subtle performance requirements and semantics. This made it difficult for reviewers to understand. So we're sort of blowing up the old patch set and seeing if we can get the same utility from separate simpler and smaller parts.

Volatile ranges and MADV_FREE

Posted Mar 20, 2014 20:05 UTC (Thu) by jstultz (subscriber, #212)

Also... regarding the bit about "This should make a vrange() call reasonably fast since there is no need to iterate over every page in the range."

The v11 actually does have to iterate over the pages in the range when marking non-volatile in order to detect purged pages. It's also likely that we will touch the pages in the range when marking volatile in order to make sure the pages are all of the same "age" on the lru, and are more likely to be purged together (so we don't purge one page from a number of different volatile ranges, requiring lots of data to eventually have to be regenerated).

These are both performance tradeoffs that we're willing to take in order to keep within existing mm code standards, and for the code to get reviewer interest. If the semantics are well understood and an implementation is merged, I'd like to eventually revisit some of the performance aspects.

missing flags argument

Posted Mar 21, 2014 22:20 UTC (Fri) by Jandar (subscriber, #85683)

Seeing a new syscall proposed, I remember "Flags as a system call API design pattern". Is this a new instance of ignoring this pattern, or is there a reason this doesn't apply to vrange()?

missing flags argument

Posted Mar 21, 2014 22:26 UTC (Fri) by corbet (editor, #1)

Ask and you shall receive.

Volatile ranges and MADV_FREE

Posted Mar 27, 2014 12:04 UTC (Thu) by kevinm (guest, #69913)

> MADV_FREE is meant as a way to say "I don't care...

Probably should be MADV_DONTCARE then ;)