The volatile volatile ranges patch set
Early versions of the patch set were based on the posix_fadvise() system call. Some developers complained that it was more of an allocation-related concept, so the patch was reworked to use fallocate() instead. By 2013, the plan had shifted toward the addition of two new system calls named fvrange() and mvrange(). Version 11, released in March 2014, moved to a single system call named vrange(). During all of these iterations, there have also been concerns about user-space semantics (what happens when a process tries to access a page that has been purged, in particular) and the best way to implement volatile ranges internally. So nothing has ever been merged into the mainline kernel.
Version 14, posted by John Stultz on April 29, changes the user-space API yet again. Volatile ranges have now shifted to the madvise() system call. In particular, a call to:
    madvise(address, length, MADV_VOLATILE);
will mark the memory range of length bytes starting at address as being volatile. Once the memory range has been marked in this way, the kernel is free to reclaim the associated pages and discard their contents at any time. Should the application need access to the range in the future, it should mark it as being nonvolatile with:
    madvise(address, length, MADV_NONVOLATILE);
The return value is zero for success (the range is now nonvolatile and the previous contents remain intact), a negative number if some sort of error occurred, or one if the operation was successful but at least one of the pages has been purged.
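To make the calling sequence concrete, here is a minimal sketch of a cache that is released and reacquired with the proposed calls. It assumes the patch set's MADV_VOLATILE and MADV_NONVOLATILE definitions are available (no mainline kernel provides them); cache, cache_len, cache_release() and cache_reacquire() are hypothetical names for the application's own buffer and helpers.

    /*
     * Sketch only: MADV_VOLATILE and MADV_NONVOLATILE exist only in the
     * unmerged patch set, so this will not build against a mainline kernel.
     */
    #include <sys/mman.h>
    #include <stddef.h>
    #include <stdio.h>

    static void *cache;        /* start of a page-aligned cache region */
    static size_t cache_len;   /* length of that region in bytes */

    /* Done with the cached data for now: the kernel may reclaim it. */
    static int cache_release(void)
    {
        return madvise(cache, cache_len, MADV_VOLATILE);
    }

    /*
     * About to use the cache again: make it nonvolatile first.
     * Returns 0 if the old contents are intact, 1 if at least one page
     * was purged (the cache must be rebuilt), or -1 on error.
     */
    static int cache_reacquire(void)
    {
        int ret = madvise(cache, cache_len, MADV_NONVOLATILE);

        if (ret < 0)
            perror("madvise(MADV_NONVOLATILE)");
        return ret;
    }

The application would call cache_release() whenever the cached data becomes expendable, and must check cache_reacquire()'s return value before touching the contents again.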
The use of madvise() had been considered in the past; it makes sense, given that the purpose is to advise the kernel about the importance of a particular range of memory. Previous volatile range implementations, though, had the property that marking a range nonvolatile could fail partway through. That meant that the interface had to be able to return two values: (1) how many pages had been successfully marked, and (2) whether any of them had been purged. This time around, John found a way to make the operation atomic, in that it either succeeds or fails as a whole. In the absence of a need for a second return value, the madvise() interface is adequate for the task.
What happens if user space attempts to access a volatile page that has been purged by the kernel? This implementation will deliver a SIGBUS signal in that situation. A properly-equipped application can catch the signal and respond by obtaining the needed data from some other source; applications that are not prepared will litter the disk with unsightly core dumps instead. That may seem like an unfriendly response, but one can argue that an application should not be trying to directly access memory that, according to instructions it gave to the kernel, does not actually need to be kept around.
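As a rough illustration of what a properly-equipped application might do (this is not code from the patch set), the sketch below guards a direct read of possibly-purged memory with sigsetjmp(), catches the SIGBUS, and rebuilds the data; regenerate_cache() is a hypothetical placeholder for however the application refetches it.

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>

    /* Hypothetical helper: refetch or recompute the cached data. */
    static void regenerate_cache(void) { }

    static sigjmp_buf purge_env;

    static void sigbus_handler(int sig)
    {
        (void)sig;
        siglongjmp(purge_env, 1);
    }

    /* Read one byte from a possibly-volatile, possibly-purged cache page. */
    static char read_cached_byte(const volatile char *p)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigbus_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);   /* a real program installs this once */

        if (sigsetjmp(purge_env, 1))
            regenerate_cache();         /* SIGBUS: the page was purged */

        return *p;
    }

A real program would install the handler once at startup, and regenerate_cache() would need to mark the range nonvolatile (or otherwise repopulate it) before the access is retried.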
Minchan Kim does not like this approach; he would prefer, instead, that the application simply receive a new, zero-filled page in this situation. He is, it turns out, thinking about a slightly different use case: code that reuses memory and wants to tell the kernel that the old contents need not be preserved. In this case, the reuse should be as low-overhead as possible; Minchan would prefer to have no need for either an MADV_NONVOLATILE call or a SIGBUS signal handler. John suggested that Minchan's own MADV_FREE patch was better suited to that use case, but Minchan disagreed, noting that MADV_FREE is a one-time operation, while MADV_VOLATILE can "stick" to a range of memory through several purge-and-reuse cycles. John, however, worries that silently substituting zero-filled pages could lead to data corruption or other unpleasant surprises.
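The difference shows up in the calling patterns. The sketch below is illustrative only: MADV_FREE here refers to Minchan's then-unmerged patch, MADV_VOLATILE to John's, and fill_buffer()/consume_buffer() are hypothetical application routines that reuse the same memory repeatedly.

    #include <sys/mman.h>
    #include <stddef.h>

    void fill_buffer(void *buf, size_t len);     /* hypothetical */
    void consume_buffer(void *buf, size_t len);  /* hypothetical */

    /* One-shot hint: the advice has to be repeated after every cycle. */
    void reuse_with_madv_free(void *buf, size_t len)
    {
        for (;;) {
            fill_buffer(buf, len);          /* writing revives the pages  */
            consume_buffer(buf, len);
            madvise(buf, len, MADV_FREE);   /* droppable again from here  */
        }
    }

    /* Sticky marking: one call is meant to cover every later cycle. */
    void reuse_with_madv_volatile(void *buf, size_t len)
    {
        madvise(buf, len, MADV_VOLATILE);   /* mark once, up front        */
        for (;;) {
            fill_buffer(buf, len);          /* purged pages come back     */
            consume_buffer(buf, len);       /* zero-filled in Minchan's
                                               preferred model            */
        }
    }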
Johannes Weiner, who joined the conversation in June, also prefers that purged pages be replaced by zero-filled pages on access. He asked if the patch set could be reworked on top of MADV_FREE (which, he thinks, has a better implementation internally) to provide a choice: applications could request either the new-zero-filled-page or the SIGBUS semantics. John responded that he might give it a try someday.
John certainly cannot be faulted for a lack of effort; this patch set has been through fourteen revisions since 2011; it has also been the subject of sessions at the Kernel Summit and Linux Storage, Filesystem, and Memory Management Summit. It has seen extensive revisions in response to comments from several reviewers. But, somehow, this feature, which has real users waiting for it to show up in a mainline kernel, does not seem much closer to being merged than before.
At the same time, it is hard to fault the reviewers. The volatile ranges concept adds new user-visible memory-management behavior with some subtle aspects. If the implementation and interface are not right, the pain will be felt by developers in both kernel and user space for a long time. Memory-management changes are notoriously hard to get into the kernel for a good reason; user-visible changes are even worse. This patch set crosses two areas where, past history shows, we have a hard time getting things right, so some caution is warranted.
Still, one can't help but wonder if merging nothing at all yields the best kernel in the long run. Users will end up working with out-of-tree variants of this concept (Android's "ashmem" in particular) that the development community has even less control over. Unless somebody comes up with the time to continue trying to push this patch set forward, the mainline kernel may never acquire this feature, leaving users without a capability that they demonstrably have a need for.
Posted Jun 19, 2014 5:19 UTC (Thu)
by jstultz (subscriber, #212)
[Link] (5 responses)
Minchan's objection mentioned here was somewhat tangential and short-lived. It stems from a use-case that the Google address-sanitizer folks wanted, where volatility would be sticky and data could be purged, then rewritten and then purged again without any explicit re-marking of volatility.
I *really* don't see how this use case is at all feasible. Especially as a generic implementation. My specific objections were listed here: http://thread.gmane.org/gmane.linux.kernel.mm/116952
And after that Minchan agreed and withdrew his objection: http://thread.gmane.org/gmane.linux.kernel.mm/116952
Johannes' suggestion for zero-fill behavior is viable, and works more in-line with the existing VMM code by overloading the page clean/dirty state as a marker of volatility. My main objection is that this works well from the VMM perspective, but creates more surprising semantics for userspace.
Those semantics could become less surprising (but still not my ideal) to userspace with his additional suggestion of adding a SIGBUS option to MADV_FREE, so this is a potential route and I *really do* appreciate the feedback and suggestion (Johannes, Hugh and many other folks have been very kind and motivational in discussions at conferences). It's just that I've run a bit out of steam on this one and have other work I need to do.
I still think it's a really great and needed feature, and it shames me to feel like I've failed in pushing it upstream. But it's my hope someone else might be able to pick up the torch here.
Posted Jun 19, 2014 5:22 UTC (Thu)
by jstultz (subscriber, #212)
[Link]
Again, my objections listed here: http://article.gmane.org/gmane.linux.kernel.mm/116959
Minchan's agreement here: http://article.gmane.org/gmane.linux.kernel.mm/116960
Posted Jun 19, 2014 6:48 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Personally, I'd prefer a SIGBUS because it would allow to:
1) Distinguish between legitimate zero-filled pages and pages that have just been evicted.
2) Allow to find code that accesses the volatile regions without proper checking.
Posted Jun 19, 2014 8:06 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Posted Jun 19, 2014 16:06 UTC (Thu)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Jun 20, 2014 12:31 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link]
Free the whole range rather than bits
Posted Jun 19, 2014 13:44 UTC (Thu)
by epa (subscriber, #39769)
[Link] (5 responses)
Posted Jun 21, 2014 9:22 UTC (Sat)
by niner (subscriber, #26151)
[Link] (4 responses)
Posted Jun 21, 2014 13:52 UTC (Sat)
by alonz (subscriber, #815)
[Link] (2 responses)
I suspect the cost of refreshing just one page from the JPEG from the 'net will tend to be almost the same as refreshing the entire image… and refreshing two (disjoint) pages will almost certainly cost more than the entire image.
Of course, these are just my estimates; they may be wrong in the particular, but I've personally designed systems that did exhibit this behavior. An obvious example is when caching a compressed version of some resource — restarting compression from the middle is far harder than redoing it all from scratch.
Posted Jun 26, 2014 6:28 UTC (Thu)
by dlang (guest, #313)
[Link] (1 responses)
so it's more likely to be 4K out of an uncompressed image that the app would just need to pull the appropriate range of data out of the .jpg that it has stored elsewhere.
Posted Jun 26, 2014 10:52 UTC (Thu)
by alonz (subscriber, #815)
[Link]
Even for this use-case, I suspect repeating the entire decompression would still cost less (and have less bugs) than trying to reconstruct a single page from the middle of the image. And more modern formats (e.g. WebP) make the challenge even greater.
Posted Jul 16, 2014 11:10 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Jun 20, 2014 9:12 UTC (Fri)
by dgm (subscriber, #49227)
[Link]
But for the sake of clarity I would prefer it to be called MADV_FREEABLE (volatile is also a C keyword with very different semantics). In fact, I think the best flag names would be MADV_FREETRYREMEMBER and MADV_TRYRECALL.
Posted Jun 26, 2014 21:09 UTC (Thu)
by weue (guest, #96562)
[Link] (1 responses)
Posted Jun 26, 2014 22:11 UTC (Thu)
by neilbrown (subscriber, #359)
[Link]
http://www.clivebanks.co.uk/THHGTTG/THHGTTGradio6.htm
CHAIRMAN: Yes, and, and, and the wheel. What about this wheel thingy? Sounds a terribly interesting project to me.
MARKETING GIRL: Er, yeah, well we're having a little, er, difficulty here…
FORD: Difficulty?! It's the single simplest machine in the entire universe!
MARKETING GIRL: Well alright mister wise guy, if you're so clever you tell us what colour it should be!
[ and don't even ask about the nasally fitted fire ]
(sorry, this comment is totally unfair to John Stultz and all the others who have worked on this - it really isn't as trivial as that at all.
My personal opinion is that there are at least 2 and possibly 3 separate problems that people want to solve and no solution proposed so far solves all of them suitably. The problems need to be clearly separated and solved separately. I think that latest patchset was trying to head in this direction to some extent)
implementations, interfaces, mumble grumble
Posted Jul 9, 2014 13:22 UTC (Wed)
by tomgj (guest, #50537)
[Link]
May I humbly suggest that the word “implementation” should be used more conservatively than this, especially as it relates to API design and interface specification.
The article says “Implementations of the volatile range concept have experienced more than the usual amount of change”. Though this is true, the way it is worded risks confusing issues around interface and implementation. This is because (i) implementations of the concept have involved different interface designs (as nicely expounded upon in the article), with (ii) implementations of those interfaces also having been through change. It would be nice, in a world where even advanced developers often don’t understand the value of well-specified interfaces, if we could, for clarity, reserve the word “implementation” for the latter kind of thing, marked (ii) above.
This confusion also arises in a couple of other places. We have “Previous volatile range implementations, though, had the property that marking a range nonvolatile could fail partway through. That meant that the interface had to be able to return two values”, which is using the term implementation to talk about the interface, and “This implementation will deliver a SIGBUS signal in that situation”, which is also better described in terms of being an interface thing.
Later in the article, we have “If the implementation and interface are not right…”, and here the term is clearly being used in the type (ii) sense as defined above. But, having set up use of the term “implementation” to mean both “implementation of a concept (possibly involving different interfaces)” and “implementation of an interface”, the confusion sown earlier is here reaped again.
All this comes back to a phenomenon way too common in Linux development: hack something together that “works” locally, and whatever happens to be exposed on the outside when it’s “working” then becomes the “interface”. This is the wrong way round, leads to low quality interface design, and wasted effort.
The article does describe thought going into the interface design. But how much better off would we be if interface design were more widely thought of as being the preceding process to implementation, rather than something to come later if at all? This doesn’t stop interface specification and implementation being co-iterative: the experience of actually attempting the implementation will often feed back proposed changes to the interface spec. But viewing interface specification as the fundamentally “earlier” process would have huge benefits. Most Linux-originated APIs are a wreck — here is part of the reason.