Volatile ranges with fallocate()
A "volatile range" is a set of pages in memory containing data that might be useful to an application at some point in the future; a key point is that, if the need arises, the application is able to reacquire (or regenerate) that data from another source. A web browser's in-RAM image cache is a classic example. Keeping the images around can reduce net traffic and improve page rendering times, but, should the cached images vanish, the browser can request a new copy from the source. Thus, while it makes sense to keep this data around, it also makes sense to get rid of it if a more pressing need for memory arises.
If the kernel knew about this sort of cached data, it could dump that data during times of memory stress and quickly reclaim the underlying memory. In such a situation, applications could cache more data than they otherwise would, knowing that there are limits to how much that caching can affect the system as a whole. The result would be better utilization of memory and a system that performs better overall.
Google's Robert Love implemented such a mechanism for Android as "ashmem." There is a desire to get the ashmem functionality into the mainline kernel, but the implementation and API were not to everybody's taste. To get around that problem, John Stultz took the core ashmem code, reworked the virtual memory integration, and hooked it into the posix_fadvise() system call; that is the version of the patch that was described last November.
Dave Chinner subsequently pointed out that this functionality might be better suited to the fallocate() system call. That call looks like this:
int fallocate(int fd, int mode, off_t offset, off_t len);
This system call is meant to operate on ranges of data within a file. Of particular interest, perhaps, is the FALLOC_FL_PUNCH_HOLE operation, which removes a block of data from an arbitrary location within a file. Declaring a volatile range can be thought of as a form of hole punching, but with a kernel-determined delay. If memory is tight, the hole could be punched immediately; otherwise the operation could complete at some later time, or not at all. Given the similarity between these two operations, it made sense to implement them within the same system call; John duly reworked the patch along those lines.
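For comparison, ordinary (immediate) hole punching can be sketched with the existing fallocate() interface. This is a minimal sketch, assuming Linux with glibc and a scratch filesystem that supports hole punching; the function name punch_hole_demo() is made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes fallocate() in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>     /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write three blocks of 'x' to a scratch file, punch out the middle block,
 * and report what reads back from the hole.
 * Returns 1 if the hole reads as zeros, 0 if the old data survived, and
 * -1 if setup failed or the filesystem refused the request.
 */
int punch_hole_demo(void)
{
    enum { BLOCK = 4096 };
    char buf[BLOCK];
    int result = -1;

    FILE *fp = tmpfile();               /* unnamed scratch file */
    if (fp == NULL)
        return -1;
    int fd = fileno(fp);
    assert(fd >= 0);

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 3; i++)
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            goto out;

    /* PUNCH_HOLE must be combined with KEEP_SIZE; the file stays 3 blocks long. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, BLOCK, BLOCK) != 0)
        goto out;                       /* e.g. EOPNOTSUPP on some filesystems */

    if (pread(fd, buf, sizeof(buf), BLOCK) != (ssize_t)sizeof(buf))
        goto out;
    result = 1;
    for (size_t i = 0; i < sizeof(buf); i++)
        if (buf[i] != 0)
            result = 0;                 /* stale data is still visible */
out:
    fclose(fp);
    return result;
}
```

Here the data is gone as soon as the call returns; a volatile range differs in that the kernel decides when, and whether, the hole is actually punched.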
With the new patch set, to declare that a range of a file's contents is volatile, an application would call:
fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
Where offset and len describe the actual range to be marked. After the call completes, the kernel is not obligated to keep that range in memory, and is not obligated to write that range to backing store before reclaiming it. The application should not attempt to access that portion of the file while it has been marked volatile, since the contents could disappear at any time. Instead, if and when the data turns out to be useful, a call should be made to:
fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
If the indicated range is still present in memory, the call will return zero and the application can proceed to work with the data. If, instead, any part of the range has been purged by the kernel since it was marked volatile, a non-zero return value will inform the application that it needs to find that data somewhere else.
Any filesystem could conceivably implement this functionality, but, in practice, it only makes sense for a RAM-based filesystem like tmpfs, so it is only implemented there.
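The mark/unmark cycle described above might be wrapped as follows. This is only a sketch of the proposed interface, not working code for any released kernel: the flag names follow the patch set (spelled with the FALLOC_FL_ prefix that existing fallocate() flags use), their numeric values here are placeholders, the helper names are hypothetical, and treating a positive return as "purged" and -1 as "error" is an assumption that goes beyond the "non-zero" wording above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes fallocate() in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * ASSUMPTION: these flags exist only in the proposed patch set. The values
 * are placeholders chosen not to collide with any existing FALLOC_FL_* bit,
 * so a kernel without the patch should reject the call (EOPNOTSUPP) rather
 * than perform some other operation.
 */
#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE   0x10000000
#endif
#ifndef FALLOC_FL_UNMARK_VOLATILE
#define FALLOC_FL_UNMARK_VOLATILE 0x20000000
#endif

enum cache_state { CACHE_INTACT, CACHE_PURGED, CACHE_ERROR };

/* Tell the kernel it may discard [offset, offset + len) of fd at any time. */
int cache_release(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    return fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
}

/* Reclaim the range before touching it again and report what happened. */
enum cache_state cache_reacquire(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    int ret = fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
    if (ret == 0)
        return CACHE_INTACT;    /* data still present, safe to use */
    if (ret > 0)
        return CACHE_PURGED;    /* assumed: some part was reclaimed */
    return CACHE_ERROR;         /* -1 with errno set, e.g. EOPNOTSUPP */
}
```

An application would call cache_release() once it is done with a cached object, and regenerate the object whenever cache_reacquire() reports anything other than CACHE_INTACT.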
The patch is in its third revision as of this writing, having gotten a number of comments in its first two iterations. The number of complaints has fallen off considerably, though, suggesting that most reviewers are happy now. So this feature may just find its way into the 3.6 kernel.
Posted Jun 7, 2012 22:15 UTC (Thu)
by benh (subscriber, #43720)
[Link] (2 responses)
Posted Jun 8, 2012 0:31 UTC (Fri)
by ncm (guest, #165)
[Link] (1 responses)
Posted Jun 8, 2012 18:17 UTC (Fri)
by jstultz (subscriber, #212)
[Link]
Posted Jun 8, 2012 0:31 UTC (Fri)
by robert_s (subscriber, #42402)
[Link] (1 responses)
So, for the stated example, the web browser would have to:
1. Find a mounted RAM-based filesystem it had access to.
2. Create a file for the cache.
3. Presumably mmap() the file so it can be accessed conveniently.
?
Also, how would the application know when an area had been reclaimed? Canaries? What if an application tried to access an object that had only partially been reclaimed - say a single 4k page had innocuously been taken from a huge blob?
Posted Jun 8, 2012 0:35 UTC (Fri)
by dlang (guest, #313)
[Link]
fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
and if the return code is 0, it can use the data; if the return code is non-zero, some part of the data has been reclaimed and the application needs to regenerate it.
I would hope that when the system grabs part of a block that's been marked volatile, it grabs the entire block, rather than just punching a small hole in it (or at least remembers what it damaged earlier when it needs more memory, on the theory that you may mark a large block volatile and then try to re-use small chunks of it).
Posted Jun 8, 2012 7:01 UTC (Fri)
by kugel (subscriber, #70540)
[Link] (1 responses)
Posted Jun 8, 2012 18:25 UTC (Fri)
by jstultz (subscriber, #212)
[Link]
Ashmem also serves a separate purpose for Android, which is to provide a way to create atomically unlinked tmpfs files that can be shared between applications. This avoids having applications accidentally leave tmpfs files that take memory even if no one is using them.
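As an illustration of what "atomically unlinked" buys you, here is a minimal sketch using memfd_create(). This is not ashmem and is not mentioned in the discussion above; it is a swapped-in Linux call (assuming Linux 3.17 or later and glibc 2.27 or later) that creates a tmpfs-backed file which never has a name, so nothing is left behind if the process dies. The function name and the "shared-cache" label are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* memfd_create() is a Linux-specific extension */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a RAM-backed file of the given size that has no name in any
 * filesystem. Returns a file descriptor, or -1 on failure.
 */
int create_unlinked_region(size_t size)
{
    assert(size > 0);

    /* The string is only a debugging label shown in /proc; it is not a path. */
    int fd = memfd_create("shared-cache", MFD_CLOEXEC);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The descriptor can then be mapped with mmap(..., MAP_SHARED, fd, 0) and handed to another process, for example across fork() or over a Unix-domain socket; the memory is freed when the last descriptor and mapping go away.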
For this purpose, the ashmem driver still is useful, but can be submitted and reviewed independently without tangling the pin/unpin functionality into the discussion.
Posted Jun 8, 2012 9:29 UTC (Fri)
by roblucid (guest, #48964)
[Link]
So why wouldn't volatile data be useful on disk filesystems with directories like /var/cache? It seems to me to be a useful response to "FS full" conditions, rather than returning ENOSPC to some (important) application writing to /var/spool. Similarly, perhaps a desktop environment might like to implement a "Recycle Bin" for deleted files by marking them as volatile. If FS atimes are respected when choosing what to reclaim, then the kernel can actually do a rough LRU policy. But I guess that would be just TOO convenient and useful to application programmers, so we'll just tell everyone to re-implement time stamping and cache management, over and over, rather than provide a simple-to-use, reusable feature.
Balkanised features for swap- and block-device-backed filesystems introduce finicky requirements, infecting applications with implementation specifics. Remember the fsync() issues? "What do you mean, the filesystem can't efficiently sync the contents of this 1 tiny file?"
Posted Jun 8, 2012 10:27 UTC (Fri)
by juliank (guest, #45896)
[Link] (4 responses)
Otherwise, making it useable for applications is basically almost impossible.
Posted Jun 9, 2012 1:08 UTC (Sat)
by ncm (guest, #165)
[Link] (3 responses)
Posted Jun 9, 2012 2:38 UTC (Sat)
by foom (subscriber, #14868)
[Link] (2 responses)
Posted Jun 9, 2012 6:32 UTC (Sat)
by ncm (guest, #165)
[Link]
The sample code looks plausible, but you omitted code to check whether backing store for an address range has been reclaimed. Actually we don't need a new flag; MADV_REMOVE suffices. The docs don't say what the process would find in a MADV_REMOVEd page, were it to look, or what marking a previously MADV_REMOVEd range MADV_NORMAL does. Mark it MADV_NORMAL to see if everything written is still there. If that fails, check progressively smaller ranges. Or unmap the lot and forget all about it. Or mark it MADV_DONTNEED, equivalent to munmapping and re-mmapping the address range.
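The MADV_DONTNEED behaviour mentioned here can be checked with a small sketch. It assumes Linux semantics for a private anonymous mapping, where discarded pages read back as zeros on the next access; the function name is made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* MAP_ANONYMOUS and MADV_* in glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one anonymous page, drop it with MADV_DONTNEED, and look at it again.
 * Returns 1 if the contents were discarded (page reads as zeros), 0 if the
 * data survived, and -1 if mmap() or madvise() failed.
 */
int dontneed_discards_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);               /* "cached" data */

    int result = -1;
    if (madvise(p, len, MADV_DONTNEED) == 0)
        result = (p[0] == 0 && p[len - 1] == 0) ? 1 : 0;

    munmap(p, len);
    return result;
}
```

Note that the discard here is immediate and unconditional, and the application gets no indication of whether anything was lost; that is the gap a volatile-range interface is meant to fill.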
Posted Jun 11, 2012 16:38 UTC (Mon)
by jstultz (subscriber, #212)
[Link]
That said, once the backing infrastructure for fallocate() volatile regions is upstream, there isn't a reason why a simpler madvise interface wouldn't also be viable.
I'd invite you to join the discussion on lkml to further discuss this.
Sure, but this API is a really convoluted and irritating way to make users go about things.
Consider the two options (note: code written in comment editor, may not actually work; error handling omitted):
Option 1:
volatile_storage = mmap(NULL, 163840, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
Option 2, absurdly difficult for no good reason:
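int fd = -1;
char name[64];
/* shm_open() wants a name of the form "/name"; tmpnam() paths like /tmp/... are rejected */
for (unsigned i = 0; fd == -1; i++) {
    snprintf(name, sizeof(name), "/volatile-cache-%ld-%u", (long)getpid(), i);
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
}
shm_unlink(name);
ftruncate(fd, 163840);
/* MAP_SHARED, so that stores land in the tmpfs pages that fallocate() acts on */
volatile_storage = mmap(NULL, 163840, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);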
Then, the usage is ridiculous too:
Sane API:
madvise(object_addr, 4096, MADV_VOLATILE);
Insane API:
// First implement a function "find_mapping_info_for" which does e.g. a binary
// search through a list of shm_open'd regions, to find one that contains the
// address in question. Then...
mapping_info = find_mapping_info_for(object_addr);
fallocate(mapping_info.fd, FALLOC_FL_MARK_VOLATILE, object_addr - mapping_info.base_addr, 4096);
Why would anyone want the second API? It basically requires users to go through an extra song-and-dance at mmap time, to keep extra file descriptors open, and to track which file descriptors belong to which memory regions, for seemingly no reason.
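To make the bookkeeping cost concrete, the "find_mapping_info_for" helper mentioned above could look roughly like this. It is a hypothetical sketch: the struct layout and the extra table parameters are assumptions, and it presumes the application keeps its regions in an array sorted by base address with no overlaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per shm_open()ed and mmap()ed region (hypothetical layout). */
struct mapping_info {
    uintptr_t base_addr;    /* start address of the mapping */
    size_t    len;          /* length of the mapping in bytes */
    int       fd;           /* descriptor backing the mapping */
};

/*
 * Binary search over regions sorted by base_addr. Returns the entry whose
 * range [base_addr, base_addr + len) contains addr, or NULL if none does.
 */
const struct mapping_info *
find_mapping_info_for(const struct mapping_info *regions, size_t count,
                      const void *addr)
{
    assert(regions != NULL || count == 0);

    uintptr_t a = (uintptr_t)addr;
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct mapping_info *m = &regions[mid];

        if (a < m->base_addr)
            hi = mid;                   /* addr lies below this region */
        else if (a - m->base_addr >= m->len)
            lo = mid + 1;               /* addr lies above this region */
        else
            return m;                   /* addr falls inside this region */
    }
    return NULL;
}
```

Every caller also has to keep that table up to date on each mmap() and munmap(), which is exactly the per-region state an address-based madvise() interface would not need.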