
Volatile ranges with fallocate()

By Jonathan Corbet
June 5, 2012
Last November LWN looked at the volatile ranges patch set from John Stultz. This patch is intended to bring an Android feature into the mainline, but as a reimplementation that is more deeply tied into the memory management subsystem. That patch has now returned, but the API has changed, so another look is warranted.

A "volatile range" is a set of pages in memory containing data that might be useful to an application at some point in the future; a key point is that, if the need arises, the application is able to reacquire (or regenerate) that data from another source. A web browser's in-RAM image cache is a classic example. Keeping the images around can reduce net traffic and improve page rendering times, but, should the cached images vanish, the browser can request a new copy from the source. Thus, while it makes sense to keep this data around, it also makes sense to get rid of it if a more pressing need for memory arises.

If the kernel knew about this sort of cached data, it could dump that data during times of memory stress and quickly reclaim the underlying memory. In such a situation, applications could cache more data than they otherwise would, knowing that there are limits to how much that caching can affect the system as a whole. The result would be better utilization of memory and a system that performs better overall.

Google's Robert Love implemented such a mechanism for Android as "ashmem." There is a desire to get the ashmem functionality into the mainline kernel, but the implementation and API were not to everybody's taste. To get around that problem, John took the core ashmem code, reworked the virtual memory integration, and hooked it into the posix_fadvise() system call; that is the version of the patch that was described last November.

Dave Chinner subsequently pointed out that this functionality might be better suited to the fallocate() system call. That call looks like this:

    int fallocate(int fd, int mode, off_t offset, off_t len);

This system call is meant to operate on ranges of data within a file. Of particular interest, perhaps, is the FALLOC_FL_PUNCH_HOLE operation, which removes a block of data from an arbitrary location within a file. Declaring a volatile range can be thought of as a form of hole punching, but with a kernel-determined delay. If memory is tight, the hole could be punched immediately; otherwise the operation could complete at some later time, or not at all. Given the similarity between these two operations, it made sense to implement them within the same system call; John duly reworked the patch along those lines.
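
For comparison, immediate hole punching is already available through this call. The following is a minimal sketch, assuming Linux with glibc and a scratch file under /tmp on a filesystem that supports the operation (not all do); the function name punch_middle_block() and the file name are invented for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* fallocate() and FALLOC_FL_* (needs _GNU_SOURCE) */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write three 4096-byte blocks of 'x' to a scratch file, punch out the
 * middle block, and check what is left there.
 * Returns 1 if the punched block reads back as zeros, 0 if the filesystem
 * does not support hole punching, -1 on any other error.
 */
int punch_middle_block(void)
{
    enum { BLOCK = 4096 };
    char path[] = "/tmp/punch-demo-XXXXXX";   /* hypothetical scratch file */
    char buf[BLOCK];
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file lives until close() */

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 3; i++)
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            goto out;

    /* PUNCH_HOLE must be combined with KEEP_SIZE; the file length stays put. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  BLOCK, BLOCK) != 0) {
        result = (errno == EOPNOTSUPP || errno == ENOSYS) ? 0 : -1;
        goto out;
    }

    /* The punched range now reads as zeros. */
    if (pread(fd, buf, sizeof(buf), BLOCK) != (ssize_t)sizeof(buf))
        goto out;
    result = 1;
    for (size_t i = 0; i < sizeof(buf); i++)
        if (buf[i] != 0)
            result = -1;
out:
    close(fd);
    return result;
}
```

Here the data disappears as soon as the call returns; the volatile-range idea defers that decision to the kernel.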

With the new patch set, to declare that a range of a file's contents is volatile, an application would call:

    fallocate(fd, FALLOCATE_FL_MARK_VOLATILE, offset, len);

Where offset and len describe the actual range to be marked. After the call completes, the kernel is not obligated to keep that range in memory, and is not obligated to write that range to backing store before reclaiming it. The application should not attempt to access that portion of the file while it has been marked volatile, since the contents could disappear at any time. Instead, if and when the data turns out to be useful, a call should be made to:

    fallocate(fd, FALLOCATE_FL_UNMARK_VOLATILE, offset, len);

If the indicated range is still present in memory, the call will return zero and the application can proceed to work with the data. If, instead, any part of the range has been purged by the kernel since it was marked volatile, a non-zero return value will inform the application that it needs to find that data somewhere else.
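
The calling pattern described above might look roughly like this in an application. Because the two flags exist only in the patch set, this sketch routes the calls through a hypothetical volatile_ops structure; a real backend would call fallocate() with FALLOCATE_FL_MARK_VOLATILE and FALLOCATE_FL_UNMARK_VOLATILE. All of the names here (volatile_ops, cached_object, object_acquire(), object_release(), fake_kernel) are invented, and fake_kernel merely imitates a purge so the logic can be exercised.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical indirection over the two proposed fallocate() modes.
 * unmark() returns non-zero if any part of the range was purged. */
struct volatile_ops {
    int (*mark)(void *ctx, size_t offset, size_t len);
    int (*unmark)(void *ctx, size_t offset, size_t len);
    void *ctx;
};

/* One cached object at [offset, offset + len) of a mapped tmpfs file. */
struct cached_object {
    char  *data;                                /* address inside the mapping */
    size_t offset;
    size_t len;
    void (*regenerate)(char *dst, size_t len);  /* refetch or recompute */
};

/* Call before touching the data. Returns 1 if it had to be regenerated. */
int object_acquire(const struct volatile_ops *ops, struct cached_object *obj)
{
    if (ops->unmark(ops->ctx, obj->offset, obj->len) == 0)
        return 0;                               /* contents survived */
    obj->regenerate(obj->data, obj->len);
    return 1;
}

/* Call when done: the range may now be discarded at any time. */
int object_release(const struct volatile_ops *ops, struct cached_object *obj)
{
    return ops->mark(ops->ctx, obj->offset, obj->len);
}

/* ---- Stand-in for the kernel, for illustration only ---- */
struct fake_kernel {
    char *base;            /* start of the pretend mapping */
    int   under_pressure;  /* if set, purge a range as soon as it is marked */
    int   purged;
};

static int fake_mark(void *ctx, size_t offset, size_t len)
{
    struct fake_kernel *k = ctx;
    if (k->under_pressure) {
        memset(k->base + offset, 0, len);
        k->purged = 1;
    }
    return 0;
}

static int fake_unmark(void *ctx, size_t offset, size_t len)
{
    struct fake_kernel *k = ctx;
    int was_purged = k->purged;
    (void)offset;
    (void)len;
    k->purged = 0;
    return was_purged;
}

static void fill_with_a(char *dst, size_t len) { memset(dst, 'a', len); }

/* Runs two release/acquire cycles, one without memory pressure and one
 * with, and returns how many regenerations were needed (-1 if the data
 * ended up wrong). */
int demo_regenerations(void)
{
    static char mapping[8192];
    struct fake_kernel k = { mapping, 0, 0 };
    struct volatile_ops ops = { fake_mark, fake_unmark, &k };
    struct cached_object obj = { mapping + 4096, 4096, 4096, fill_with_a };
    int count = 0;

    fill_with_a(obj.data, obj.len);

    object_release(&ops, &obj);            /* no pressure: data is kept */
    count += object_acquire(&ops, &obj);

    k.under_pressure = 1;
    object_release(&ops, &obj);            /* pressure: data is purged */
    count += object_acquire(&ops, &obj);

    return (obj.data[0] == 'a') ? count : -1;
}
```

The point of the pattern is that the data is only touched between a successful acquire and the following release.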

Any filesystem could conceivably implement this functionality, but, in practice, it only makes sense for a RAM-based filesystem like tmpfs, so it is only implemented there.

The patch is in its third revision as of this writing, having gotten a number of comments in its first two iterations. The number of complaints has fallen off considerably, though, suggesting that most reviewers are happy now. So this feature may just find its way into the 3.6 kernel.




Volatile ranges with fallocate()

Posted Jun 7, 2012 22:15 UTC (Thu) by benh (subscriber, #43720) [Link] (2 responses)

Does it work with anonymous memory?

Volatile ranges with fallocate()

Posted Jun 8, 2012 0:31 UTC (Fri) by ncm (guest, #165) [Link] (1 responses)

... and, can you mark a big space as volatile, and then pin and unpin individual pages?

Volatile ranges with fallocate()

Posted Jun 8, 2012 18:17 UTC (Fri) by jstultz (subscriber, #212) [Link]

Err. It's not clear to me what you mean by pin/unpin. But you can mark a large space as volatile, and then unmark individual pages (basically breaking the large range into smaller fragments).
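
As a rough illustration of that fragmentation, the sketch below shows the bookkeeping only; it is not the data structure the patch uses, and struct range and range_unmark() are names invented here.

```c
#include <stddef.h>

struct range { size_t start, end; };   /* half-open interval: [start, end) */

/*
 * Remove [cut.start, cut.end) from the volatile range `vol`.
 * The surviving volatile fragments are written to out[0..1]; the return
 * value says how many there are (0, 1, or 2).
 */
int range_unmark(struct range vol, struct range cut, struct range out[2])
{
    int n = 0;

    /* No overlap: the volatile range is untouched. */
    if (cut.end <= vol.start || cut.start >= vol.end) {
        out[n++] = vol;
        return n;
    }
    /* Piece left over in front of the cut. */
    if (cut.start > vol.start)
        out[n++] = (struct range){ vol.start, cut.start };
    /* Piece left over behind the cut. */
    if (cut.end < vol.end)
        out[n++] = (struct range){ cut.end, vol.end };
    return n;
}
```

Unmarking one page in the middle of a sixteen-page volatile range, for instance, leaves two volatile fragments on either side of it.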

Volatile ranges with fallocate()

Posted Jun 8, 2012 0:31 UTC (Fri) by robert_s (subscriber, #42402) [Link] (1 responses)

"Any filesystem could conceivably implement this functionality, but, in practice, it only makes sense for a RAM-based filesystem like tmpfs, so it is only implemented there."

So, for the stated example, the web browser would have to:

1. Find a mounted ram-based filesystem it had access to.
2. Create a file for the cache.
3. Presumably mmap() the file so it can be accessed conveniently.

?

Also, how would the application know when an area had been reclaimed? Canaries? What if an application tried to access an object that had only partially been reclaimed - say a single 4k page had innocuously been taken from a huge blob?

Volatile ranges with fallocate()

Posted Jun 8, 2012 0:35 UTC (Fri) by dlang (guest, #313) [Link]

To answer your second question (how would it know that it's been reclaimed): per the article, the application does

fallocate(fd, FALLOCATE_FL_UNMARK_VOLATILE, offset, len);

and if the return code is 0, it can use the data; if the return code is non-zero, some part of the data has been reclaimed and the application needs to regenerate it.

I would hope that when the system grabs part of a block that's been marked volatile, it grabs the entire block, rather than just punching a small hole in it (or at least remembers what it damaged earlier when it needs more memory, on the theory that you may mark a large block volatile and then try to re-use small chunks of it).

Volatile ranges with fallocate()

Posted Jun 8, 2012 7:01 UTC (Fri) by kugel (subscriber, #70540) [Link] (1 responses)

Now I wonder if this is useful for Google in order to mainline ashmem. Did Robert Love (or another Googler) comment on that?

Volatile ranges with fallocate()

Posted Jun 8, 2012 18:25 UTC (Fri) by jstultz (subscriber, #212) [Link]

I can't speak for Google, but I have a patch I'll be sending out in my next iteration that removes the ashmem unpinned range management code and replaces it with fallocate() calls (killing ~320 lines in ashmem).

Ashmem also serves a separate purpose to Android, which is to provide a way to create atomically unlinked tmpfs files that can be shared between applications. This avoids having applications accidentally leave tmpfs files that take memory even if no one is using them.

For this purpose, the ashmem driver still is useful, but can be submitted and reviewed independently without tangling the pin/unpin functionality into the discussion.

Volatile ranges with fallocate()

Posted Jun 8, 2012 9:29 UTC (Fri) by roblucid (guest, #48964) [Link]

"This system call is meant to operate on ranges of data within a file. Of particular interest, perhaps, is the FALLOC_FL_PUNCH_HOLE operation, which removes a block of data from an arbitrary location within a file. Declaring a volatile range can be thought of as a form of hole punching, but with a kernel-determined delay. If memory is tight, the hole could be punched immediately; otherwise the operation could complete at some later time, or not at all."

So why wouldn't volatile data be useful on disk filesystems with directories like /var/cache? It seems to me to be a useful response to "FS full" conditions, rather than returning ENOSPC to some (important) application writing to /var/spool. Similarly, perhaps a desktop environment might like to implement a "Recycle Bin" for deleted files by marking them as volatile. If FS atimes are respected when choosing what to reclaim, then the kernel can actually do a rough LRU policy. But I guess that would be just TOO convenient and useful to application programmers, so we'll just tell everyone to re-implement time stamping and cache management, over and over, rather than provide a simple-to-use, reusable feature.

Balkanised features for swap- and block-device-backed filesystems introduce finicky requirements, infecting applications with implementation specifics. Remember the fsync() issues? "What do you mean, the filesystem can't efficiently sync the contents of this 1 tiny file?"

Volatile ranges with fallocate()

Posted Jun 8, 2012 10:27 UTC (Fri) by juliank (guest, #45896) [Link] (4 responses)

I don't think that's really useful in its current form. Instead, this should be part of the madvise() family, and work on anonymous memory as well.

Otherwise, making it usable for applications is almost impossible.

Volatile ranges with fallocate()

Posted Jun 9, 2012 1:08 UTC (Sat) by ncm (guest, #165) [Link] (3 responses)

/tmp might or might not be tmpfs, /dev/shm is tmpfs. Mapping a shared memory region with shm_open and mmap, and then shm_unlinking it, should make it indistinguishable from anonymous memory. Right?

Volatile ranges with fallocate()

Posted Jun 9, 2012 2:38 UTC (Sat) by foom (subscriber, #14868) [Link] (2 responses)

Sure, but this API is a really convoluted and irritating way to make users go about things.

Consider the two options (note: code written in comment editor, may not actually work; error handling omitted):

Option 1:

    volatile_storage = mmap(NULL, 163840, PROT_READ | PROT_WRITE,
                            MAP_ANON | MAP_PRIVATE, -1, 0);

API option 2, needlessly difficult for no good reason:

    int fd = -1;
    char name[64];
    for (int i = 0; fd == -1; i++) {
        // shm_open() names take the form "/name", with no further slashes
        snprintf(name, sizeof(name), "/volatile-%d-%d", (int)getpid(), i);
        fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
    }
    shm_unlink(name);
    ftruncate(fd, 163840);
    // MAP_SHARED, so that stores actually land in the tmpfs file
    volatile_storage = mmap(NULL, 163840, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

Then, the usage is ridiculous too. Sane API:

    madvise(object_addr, 4096, MADV_VOLATILE);

Insane API:

    // First implement a function "find_mapping_info_for" which does e.g. a binary
    // search through a list of shm_open'd regions, to find one that contains the
    // address in question. Then...
    mapping_info = find_mapping_info_for(object_addr);
    fallocate(mapping_info.fd, FALLOCATE_FL_MARK_VOLATILE,
              object_addr - mapping_info.base_addr, 4096);

Why would anyone want the second API? It basically requires users to go through an extra song-and-dance at mmap time, to keep extra file descriptors open, and to track which file descriptors belong to which memory regions, for seemingly no reason.
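
The find_mapping_info_for() helper is only named above, not spelled out. One possible shape, assuming the application keeps its shm_open'd regions in an array sorted by base address, is sketched here; the struct layout and the extra array parameters are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kept for each shm_open()/mmap() pair. */
struct mapping_info {
    uintptr_t base_addr;   /* start address returned by mmap() */
    size_t    length;      /* size of the mapping in bytes */
    int       fd;          /* descriptor to hand to fallocate() */
};

/*
 * Binary search for the mapping that contains `addr`.
 * `maps` must be sorted by base_addr and must not overlap.
 * Returns NULL if no mapping contains the address.
 */
const struct mapping_info *
find_mapping_info_for(const struct mapping_info *maps, size_t count,
                      const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (a < maps[mid].base_addr)
            hi = mid;                          /* look in the lower half */
        else if (a - maps[mid].base_addr >= maps[mid].length)
            lo = mid + 1;                      /* past the end of this one */
        else
            return &maps[mid];                 /* base <= a < base + length */
    }
    return NULL;
}
```

Every mark or unmark request then costs a lookup like this before fallocate() can be called, which is the overhead being complained about.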

Volatile ranges with fallocate()

Posted Jun 9, 2012 6:32 UTC (Sat) by ncm (guest, #165) [Link]

It seems like anonymous mmap might as well be implemented using shm under the covers. Maybe it is.

The sample code looks plausible, but you omitted code to check whether backing store for an address range has been reclaimed. Actually we don't need a new flag; MADV_REMOVE suffices. The docs don't say what the process would find in a removed page, were it to look, or what marking a previously removed range MADV_NORMAL does. Mark it MADV_NORMAL to see if everything written is still there. If that fails, check progressively smaller ranges. Or unmap the lot and forget all about it. Or mark it DONTNEED, equivalent to munmapping and re-mmapping the address range.
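
The MADV_DONTNEED case, at least, can be observed directly. This is a minimal sketch, assuming Linux and a private anonymous mapping; dontneed_zeroes_page() is a name invented for the example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one private anonymous page with a pattern, hand it back with
 * MADV_DONTNEED, and look at it again.
 * Returns 1 if the page now reads as zeros, 0 if the old contents are
 * still visible, -1 on error.
 */
int dontneed_zeroes_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);

    int result = -1;
    if (madvise(p, len, MADV_DONTNEED) == 0) {
        /* The mapping stays valid; touching it faults in a fresh page. */
        result = 1;
        for (size_t i = 0; i < len; i++) {
            if (p[i] != 0) {
                result = 0;
                break;
            }
        }
    }
    munmap(p, len);
    return result;
}
```

The discard here is unconditional and nothing reports it afterwards, whereas the unmark call described in the article returns a value saying whether the data survived.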

Volatile ranges with fallocate()

Posted Jun 11, 2012 16:38 UTC (Mon) by jstultz (subscriber, #212) [Link]

So I moved from madvise to fadvise early on because of the need to coordinate shared volatile ranges between processes, in the fashion ashmem provides.

That said, once the backing infrastructure for fallocate() volatile regions is upstream, there isn't a reason why a simpler madvise interface wouldn't also be viable.

I'd invite you to join the discussion on lkml to further discuss this.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds