
The volatile volatile ranges patch set

By Jonathan Corbet
June 18, 2014
"Volatile ranges" is a name given to regions of user-space memory that can be reclaimed by the kernel when memory is tight. The classic use case is for a web browser's image cache; the browser would like to keep that information in memory to speed future page loads, but it can do without that data should the memory used for the cache be needed elsewhere. Implementations of the volatile range concept have experienced more than the usual amount of change; that rate of change may well continue into the future — if a developer can be found to continue the work.

Early versions of the patch set were based on the posix_fadvise() system call. Some developers complained that it was more of an allocation-related concept, so the patch was reworked to use fallocate() instead. By 2013, the plan had shifted toward the addition of two new system calls named fvrange() and mvrange(). Version 11, released in March 2014, moved to a single system call named vrange(). During all of these iterations, there have also been concerns about user-space semantics (what happens when a process tries to access a page that has been purged, in particular) and the best way to implement volatile ranges internally. So nothing has ever been merged into the mainline kernel.

Version 14, posted by John Stultz on April 29, changes the user-space API yet again. Volatile ranges have now shifted to the madvise() system call. In particular, a call to:

    madvise(address, length, MADV_VOLATILE);

will mark the memory range of length bytes starting at address as being volatile. Once the memory range has been marked in this way, the kernel is free to reclaim the associated pages and discard their contents at any time. Should the application need access to the range in the future, it should mark it as being nonvolatile with:

    madvise(address, length, MADV_NONVOLATILE);

The return value is zero for success (the range is now nonvolatile and the previous contents remain intact), a negative number if some sort of error occurred, or one if the operation was successful but at least one of the pages has been purged.

The use of madvise() had been considered in the past; it makes sense, given that the purpose is to advise the kernel about the importance of a particular range of memory. Previous volatile range implementations, though, had the property that marking a range nonvolatile could fail partway through. That meant that the interface had to be able to return two values: (1) how many pages had been successfully marked, and (2) whether any of them had been purged. This time around, John found a way to make the operation atomic, in that it either succeeds or fails as a whole. In the absence of a need for a second return value, the madvise() interface is adequate for the task.

What happens if user space attempts to access a volatile page that has been purged by the kernel? This implementation will deliver a SIGBUS signal in that situation. A properly-equipped application can catch the signal and respond by obtaining the needed data from some other source; applications that are not prepared will litter the disk with unsightly core dumps instead. That may seem like an unfriendly response, but one can argue that an application should not be trying to directly access memory that, according to instructions it gave to the kernel, does not actually need to be kept around.

Minchan Kim does not like this approach; he would prefer, instead, that the application simply receive a new, zero-filled page in this situation. He is, it turns out, thinking about a slightly different use case: code that reuses memory and wants to tell the kernel that the old contents need not be preserved. In this case, the reuse should be as low-overhead as possible; Minchan would prefer to have no need for either an MADV_NONVOLATILE call or a SIGBUS signal handler. John suggested that Minchan's own MADV_FREE patch was better suited to that use case, but Minchan disagreed, noting that MADV_FREE is a one-time operation, while MADV_VOLATILE can "stick" to a range of memory through several purge-and-reuse cycles. John, however, worries that silently substituting zero-filled pages could lead to data corruption or other unpleasant surprises.

Johannes Weiner, who joined the conversation in June, also prefers that purged pages be replaced by zero-filled pages on access. He asked if the patch set could be reworked on top of MADV_FREE (which, he thinks, has a better implementation internally) to provide a choice: applications could request either the new-zero-filled-page or the SIGBUS semantics. John responded that he might give it a try, someday:

I'll see if I can look into it if I get some time. However, I suspect its more likely I'll just have to admit defeat on this one and let someone else champion the effort. Interest and reviews have seemingly dropped again here and with other work ramping up, I'm not sure if I'll be able to justify further work on this.

John certainly cannot be faulted for a lack of effort; this patch set has been through fourteen revisions since 2011; it has also been the subject of sessions at the Kernel Summit and Linux Storage, Filesystem, and Memory Management Summit. It has seen extensive revisions in response to comments from several reviewers. But, somehow, this feature, which has real users waiting for it to show up in a mainline kernel, does not seem much closer to being merged than before.

At the same time, it is hard to fault the reviewers. The volatile ranges concept adds new user-visible memory-management behavior with some subtle aspects. If the implementation and interface are not right, the pain will be felt by developers in both kernel and user space for a long time. Memory-management changes are notoriously hard to get into the kernel for a good reason; user-visible changes are even worse. This patch set crosses two areas where, past history shows, we have a hard time getting things right, so some caution is warranted.

Still, one can't help but wonder if merging nothing at all yields the best kernel in the long run. Users will end up working with out-of-tree variants of this concept (Android's "ashmem" in particular) that the development community has even less control over. Unless somebody comes up with the time to continue trying to push this patch set forward, the mainline kernel may never acquire this feature, leaving users without a capability that they demonstrably have a need for.


The volatile volatile ranges patch set

Posted Jun 19, 2014 5:19 UTC (Thu) by jstultz (subscriber, #212) [Link] (5 responses)

Just as a minor clarification (and I can understand how this was confusing)...

Minchan's objection mentioned here was somewhat tangential and short-lived. It stems from a use case that the Google address-sanitizer folks wanted, where volatility would be sticky and data could be purged, then rewritten and then purged again without any explicit re-marking of volatility.

I *really* don't see how this use case is at all feasible, especially as a generic implementation. My specific objections were listed here: http://thread.gmane.org/gmane.linux.kernel.mm/116952

And after that Minchan agreed and withdrew his objection:
http://thread.gmane.org/gmane.linux.kernel.mm/116952

Johannes' suggestion for zero-fill behavior is viable, and works more in-line with the existing VMM code by overloading the page clean/dirty state as a marker of volatility. My main objection is that this works well from the VMM perspective, but creates more surprising semantics for userspace.

Those semantics could become less surprising (but still not my ideal) to userspace with his additional suggestion of adding a SIGBUS option to MADV_FREE, so this is a potential route and I *really do* appreciate the feedback and suggestion (Johannes, Hugh and many other folks have been very kind and motivational in discussions at conferences). It's just that I've run a bit out of steam on this one and have other work I need to do.

I still think it's a really great and needed feature, and it shames me to feel like I've failed in pushing it upstream. But it's my hope someone else might be able to pick up the torch here.

The volatile volatile ranges patch set

Posted Jun 19, 2014 5:22 UTC (Thu) by jstultz (subscriber, #212) [Link]

Oops. Mis-linked to the discussions there...

Again, my objections listed here:
http://article.gmane.org/gmane.linux.kernel.mm/116959

Minchan's agreement here:
http://article.gmane.org/gmane.linux.kernel.mm/116960

The volatile volatile ranges patch set

Posted Jun 19, 2014 6:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Why not both? Add another flag, something like MADV_FAIL_ON_MISSING to send a SIGBUS if you want this behavior.

Personally, I'd prefer a SIGBUS because it would make it possible to:
1) Distinguish between legitimate zero-filled pages and pages that have just been evicted.
2) Find code that accesses volatile regions without proper checking.

The volatile volatile ranges patch set

Posted Jun 19, 2014 8:06 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (2 responses)

Andrea Arcangeli has been working on a "userfaultfd" system call that would let you stop the faulting thread while another thread populates the page. He's using that together with anonymous memory, but perhaps this could be useful for volatile memory ranges too.

The volatile volatile ranges patch set

Posted Jun 19, 2014 16:06 UTC (Thu) by josh (subscriber, #17465) [Link] (1 responses)

Wouldn't userfaultfd just work like a modified signalfd, usable for otherwise deadly signals like SIGBUS because it blocks the signaled thread until the signal gets read?

The volatile volatile ranges patch set

Posted Jun 20, 2014 12:31 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

Kind of, because IIUC userfaultfd also applies to page faults in the kernel. In the case of KVM, this means that the asynchronous page fault machinery can kick in.

Free the whole range rather than bits

Posted Jun 19, 2014 13:44 UTC (Thu) by epa (subscriber, #39769) [Link] (5 responses)

For many use cases, if one page of the range is freed then the rest of it becomes useless. A simple web cache would not be able to serve a JPEG with four kilobytes missing in the middle. Is there a way to signal to the kernel 'if you have to free one page from this range, you might as well free the rest of it at the same time'? The alternative policy would be 'please keep as many pages as possible intact even if some are freed'.

Free the whole range rather than bits

Posted Jun 21, 2014 9:22 UTC (Sat) by niner (subscriber, #26151) [Link] (4 responses)

A bit less simple web cache, on the other hand, should be able to handle a JPEG with a page missing in the middle. It can either load the missing page from the disk cache or request the missing range from the web server. There is no need to download the ranges it still has.

Free the whole range rather than bits

Posted Jun 21, 2014 13:52 UTC (Sat) by alonz (subscriber, #815) [Link] (2 responses)

I suspect the cost of refreshing just one page from the JPEG from the 'net will tend to be almost the same as refreshing the entire image… and refreshing two (disjoint) pages will almost certainly cost more than the entire image.

Of course, these are just my estimates; they may be wrong in the particular, but I've personally designed systems that did exhibit this behavior. An obvious example is when caching a compressed version of some resource — restarting compression from the middle is far harder than redoing it all from scratch.

Free the whole range rather than bits

Posted Jun 26, 2014 6:28 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

remember that this is less for caching data from the net and more for caching uncompressed images for faster display.

so it's more likely to be 4K out of an uncompressed image, which the app would regenerate by pulling the appropriate range of data out of the .jpg that it has stored elsewhere.

Free the whole range rather than bits

Posted Jun 26, 2014 10:52 UTC (Thu) by alonz (subscriber, #815) [Link]

Even for this use case, I suspect repeating the entire decompression would still cost less (and have fewer bugs) than trying to reconstruct a single page from the middle of the image. And more modern formats (e.g. WebP) make the challenge even greater.

Free the whole range rather than bits

Posted Jul 16, 2014 11:10 UTC (Wed) by epa (subscriber, #39769) [Link]

I was imagining that the image wouldn't be cached on disk. If it were, it would probably be more efficient just to store it on disk and use the kernel's excellent disk caching to keep the pages in memory if there is room - not try to reinvent the wheel by making your own 'disk cache' using nonvolatile memory. So the use case I was thinking of is where some JPEG image (or any other piece of data) is useful in its entirety, but not with chunks missing in the middle. In that case if one piece has to be discarded the rest of it might as well be thrown out at the same time. Surely there is a way to hint this to the kernel?

The volatile volatile ranges patch set

Posted Jun 20, 2014 9:12 UTC (Fri) by dgm (subscriber, #49227) [Link]

From a user-space developer's perspective (this one's, at least), madvise() sounds like the interface that best fits the proposed behavior.

But for the sake of clarity I would prefer it to be called MADV_FREEABLE (volatile is also a C keyword with very different semantics). In fact, I think the best flag names would be MADV_FREETRYREMEMBER and MADV_TRYRECALL.

The volatile volatile ranges patch set

Posted Jun 26, 2014 21:09 UTC (Thu) by weue (guest, #96562) [Link] (1 responses)

How does this relatively simple feature take THREE $#@#$@ YEARS to not even be accepted yet?

The volatile volatile ranges patch set

Posted Jun 26, 2014 22:11 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Life really does imitate fiction.

http://www.clivebanks.co.uk/THHGTTG/THHGTTGradio6.htm

CHAIRMAN:
Yes, and, and, and the wheel. What about this wheel thingy? Sounds a terribly interesting project to me.

MARKETING GIRL:
Er, yeah, well we’re having a little, er, difficulty here…

FORD:
Difficulty?! It’s the single simplest machine in the entire universe!

MARKETING GIRL:
Well alright mister wise guy, if you’re so clever you tell us what colour it should be!

[ and don't even ask about the nasally fitted fire ]

(sorry, this comment is totally unfair to John Stultz and all the others who have worked on this - it really isn't as trivial as that at all.
My personal opinion is that there are at least two, and possibly three, separate problems that people want to solve, and no solution proposed so far solves all of them suitably. The problems need to be clearly separated and solved separately. I think the latest patch set was trying to head in this direction to some extent)

implementations, interfaces, mumble grumble

Posted Jul 9, 2014 13:22 UTC (Wed) by tomgj (guest, #50537) [Link]

Thanks for the useful article.

May I humbly suggest that the word “implementation” should be used more conservatively than this, especially as it relates to API design and interface specification.

The article says “Implementations of the volatile range concept have experienced more than the usual amount of change”. Though this is true, the way it is worded risks confusing issues around interface and implementation. This is because (i) implementations of the concept have involved different interface designs (as nicely expounded upon in the article), with (ii) implementations of those interfaces also having been through change. It would be nice, in a world where even advanced developers often don’t understand the value of well-specified interfaces, if we could, for clarity, reserve the word “implementation” for the latter kind of thing, marked (ii) above.

This confusion also arises in a couple of other places. We have “Previous volatile range implementations, though, had the property that marking a range nonvolatile could fail partway through. That meant that the interface had to be able to return two values”, which is using the term implementation to talk about the interface, and “This implementation will deliver a SIGBUS signal in that situation”, which is also better described in terms of being an interface thing.

Later in the article, we have “If the implementation and interface are not right…”, and here the term is clearly being used in the type (ii) sense as defined above. But, having set up the term “implementation” to mean both “implementation of a concept (possibly involving different interfaces)” and “implementation of an interface”, the confusion sown earlier is here reaped again.

All this comes back to a phenomenon way too common in Linux development: hack something together that “works” locally, and whatever happens to be exposed on the outside when it’s “working” then becomes the “interface”. This is the wrong way round, leads to low quality interface design, and wasted effort.

The article does describe thought going into the interface design. But how much better off would we be if interface design were more widely thought of as being the preceding process to implementation, rather than something to come later if at all? This doesn’t stop interface specification and implementation being co-iterative: the experience of actually attempting the implementation will often feed back proposed changes to the interface spec. But viewing interface specification as the fundamentally “earlier” process would have huge benefits. Most Linux-originated APIs are a wreck — here is part of the reason.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds