Persistent memory

By Jake Edge
March 26, 2014

2014 LSFMM Summit

Matthew Wilcox and Kent Overstreet talked about support for persistent memory in the kernel on the first day of the 2014 Linux Storage, Filesystem, and Memory Management Summit held in Napa, California. There have been, well, persistent rumors of the imminent availability of persistent memory for some time, but Wilcox said you can actually buy some devices now. He wanted to report on some progress he had made on supporting these devices as well as to ask the assembled developers for their thoughts on some unresolved issues.

Persistent memory is supposed to be as fast as DRAM, but to retain its contents even without power. To support these devices, Wilcox has written a "direct access" block layer that is called DAX ("it has an 'X', which is cool", he said—it also is a three-letter acronym that is not used by anything else in the kernel). The idea behind DAX came from the execute-in-place (XIP) code in the kernel, not because the data accessed from persistent memory will be executed, necessarily, but because it should avoid the page cache. XIP originally came from IBM, which wanted to share executables and libraries between virtual machines, but it has also been used in the embedded world to execute directly from ROM or flash.

Since persistent memory is as fast as RAM, it doesn't make sense to put another copy into memory as a page cache entry. XIP seemed like a logical starting point, since it avoided the page cache, but it required a lot of work to make it suitable for persistent memory. So Wilcox rewrote it and renamed it. Filesystems will make calls to the direct_access() block device operation in a DAX driver to access data from the device without it ending up in the page cache. Wilcox would like to see DAX merged, so he was encouraging people in the room to look at the code and comment.

But a few problem areas remain. Currently, calling msync() to flush a range of memory to persistent storage will actually sync the entire file and its metadata. That is not required by POSIX, and Wilcox would like to change the behavior to just sync the range in question. Obviously that has a much further reach than just affecting DAX, and Peter Zijlstra cautioned that changing sync behavior can surprise user space, pointing to "fsync() wars from a few years back" as an example. User space often doesn't care what is supposed to be done; instead, it depends on the existing semantics, he said.

Wilcox said that kernel developers "suck at implementing [syncing], user space sucks at using it" and concluded that "syncing sucks". The consensus seemed to be that any application that was syncing a range, but depending on the whole file being synced, is broken. Furthermore, Chris Mason was all in favor of fixing msync() for ranges as it would "make filesystem guys look good".
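For readers who have not used ranged syncing, here is a minimal user-space sketch of the call under discussion. The helper name sync_one_page() and its parameters are made up for illustration; only mmap(), msync(), and munmap() are real interfaces. The code asks for a single page to be flushed; whether the kernel flushes only that page or the whole file is exactly the behavior being debated above, and nothing here can observe the difference.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes POSIX extras under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the NUL-terminated string `msg` to the start
 * of page `page_index` in a shared mapping of `fd` (which must be open
 * for reading and writing), then request a synchronous flush of that one
 * page only. Returns 0 on success, -1 on failure.
 */
int sync_one_page(int fd, size_t file_len, size_t page_index, const char *msg)
{
    assert(msg != NULL);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t msg_len = strlen(msg) + 1;

    /* The message must fit in one page and the page must lie in the file. */
    if (msg_len > page || (page_index + 1) * page > file_len)
        return -1;

    char *map = mmap(NULL, file_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* mmap() returns a page-aligned address, so `target` is aligned too,
     * which msync() requires of its first argument. */
    char *target = map + page_index * page;
    memcpy(target, msg, msg_len);

    /* Ask for just this page to reach stable storage before returning. */
    int rc = msync(target, page, MS_SYNC);

    munmap(map, file_len);
    return rc;
}
```

An application written this way does not rely on the rest of the file being synced, so it would keep working if msync() were narrowed to the requested range.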

Another problem area is the MAP_FIXED flag for mmap(). It has two meanings, one of which is not very well known, Wilcox said. MAP_FIXED means to map the pages at the address specified, which is expected. But it also means to unmap any pages that are in the way of that mapping, which is surprising. Someone must have wanted that behavior at one time, but no one wants it any more, he said. He has proposed a MAP_WEAK flag that would only map the memory if nothing else is occupying the address range.
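The second, lesser-known meaning of MAP_FIXED is easy to demonstrate from user space. The following is a small sketch, not anything shown at the summit; the function name map_fixed_clobbers() is invented for illustration. It writes a byte into an anonymous page, maps a fresh anonymous page over the same address with MAP_FIXED, and reports what is found there afterward.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* needed for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the first byte of the page after a second
 * MAP_FIXED mapping has been placed on top of an existing one, or -1 if
 * any mmap() call fails.
 */
int map_fixed_clobbers(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    unsigned char *first = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (first == MAP_FAILED)
        return -1;
    first[0] = 'A';

    /* Same address, now with MAP_FIXED: the old page is silently unmapped
     * and replaced. No error is reported for the collision. */
    unsigned char *second = mmap(first, page, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (second == MAP_FAILED) {
        munmap(first, page);
        return -1;
    }
    assert(second == first);

    /* Fresh anonymous memory is zero-filled, so the 'A' is gone. */
    int value = second[0];
    munmap(second, page);
    return value;
}
```

The function returns 0 rather than 'A', showing that the earlier contents were discarded without any warning to the caller.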

The get_user_pages() function cannot be used with persistent memory, because there are no struct page entries created for it. There could be a lot of pages in a persistent memory device, so wasting 64 bytes per page for a mostly unused struct page is not desirable. The call to get_user_pages() is generally for I/O, so Dave Hansen has been working on a get_user_sg() that creates a scatter-gather list for doing I/O. The crypto subsystem also wants this capability, Wilcox said.

There is a problem, though. A truncate() operation could remove blocks out from under get_user_sg(), which would leave a mess behind. Wilcox wondered if file truncation could just be blocked until the pages are no longer pinned by the I/O operation. That did not seem popular, but Overstreet had another idea.

Overstreet has been working on a direct I/O rewrite for some time and, in many ways, doing a DAX mapping and a direct I/O look similar, he said. His rewrite would create a new struct bio that would be the container for the I/O. It would get rid of the get_block() callback, which is, he said, a horrible interface. For one thing, it may have to read the mapping from disk, which should be asynchronous, but get_block() isn't. Moving to struct bio would allow the usual block-layer filesystem locking to avoid the truncate() problem.

There were some complaints that making I/O be bio-based was problematic for filesystems like NFS and CIFS that don't use the bio structure. Overstreet said that we may get to a point where buffered I/O lives atop direct I/O, which would help that problem. In addition, Mason did not think that a bio-based interface would really be that big of a problem for NFS and others. A bio is just a container of pages, Overstreet said.

In the end, no really clear conclusions were drawn. It would seem that folks need to review the DAX code (and, eventually, Overstreet's direct I/O rewrite) before reaching those conclusions.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]


Persistent memory

Posted Mar 27, 2014 16:13 UTC (Thu) by luto (guest, #39314) [Link] (5 responses)

I actually use the silly MAP_FIXED behavior. I allocate chunks of memory with guard pages, and I do it by allocating a big PROT_NONE, MAP_NORESERVE region, and then allocating real memory in the middle of it with MAP_FIXED.
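The technique described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual code: the names alloc_with_guards() and free_with_guards() are invented, and a single guard page on each side is an arbitrary choice.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* needed for MAP_ANONYMOUS and MAP_NORESERVE under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical allocator: reserve an inaccessible region, then use
 * MAP_FIXED to place usable memory in the middle of it, leaving one
 * PROT_NONE guard page on each side. Returns NULL on failure.
 */
void *alloc_with_guards(size_t usable_pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t usable = usable_pages * page;
    size_t total = usable + 2 * page;

    if (usable_pages == 0)
        return NULL;

    /* Step 1: reserve address space only; nothing here is accessible. */
    char *region = mmap(NULL, total, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED)
        return NULL;

    /* Step 2: rely on MAP_FIXED replacing what is already mapped, which
     * is safe here because this code owns the whole reserved region. */
    void *mem = mmap(region + page, usable, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (mem == MAP_FAILED) {
        munmap(region, total);
        return NULL;
    }
    return mem;
}

/* Release the usable memory together with both guard pages. */
int free_with_guards(void *mem, size_t usable_pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert(mem != NULL);
    return munmap((char *)mem - page, (usable_pages + 2) * page);
}
```

A write that runs one byte past either end of the returned buffer lands in a PROT_NONE page and faults instead of corrupting a neighboring allocation.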

Persistent memory

Posted Mar 27, 2014 20:02 UTC (Thu) by willy (subscriber, #9762) [Link]

We're not going to change any existing behaviour, just add a new flag to specify that the UNMAP ANYTHING IN OUR WAY behaviour isn't wanted. Your application will continue to work just fine.

Thanks for sharing your use case!

Persistent memory

Posted Mar 31, 2014 5:09 UTC (Mon) by dontstayhome (guest, #54412) [Link] (3 responses)

What do you need the guard regions for?

I understand how a guard region can be useful for detecting when a stack needs to be extended, but I've often seen other PROT_NONE mappings (like the one created by the dynamic linker for every shared library that's mapped in) and I don't really know why they're used.

Persistent memory

Posted Mar 31, 2014 17:42 UTC (Mon) by luto (guest, #39314) [Link] (2 responses)

I use them to make it more likely for wild writes and buffer overruns to segfault instead of causing random corruption.

Persistent memory

Posted Mar 31, 2014 17:53 UTC (Mon) by dontstayhome (guest, #54412) [Link] (1 responses)

Isn't the same effect achieved by not allocating that page/region at all? Why is it better to allocate it as PROT_NONE?

I guess maybe if you didn't explicitly allocate surrounding guard regions, your next mmap could be placed adjacent to your previous one (if you don't explicitly specify MAP_FIXED)? Does the kernel actually do this in practice?

Persistent memory

Posted Mar 31, 2014 19:14 UTC (Mon) by luto (guest, #39314) [Link]

The kernel seems to stick them all right next to each other if I don't explicitly add a PROT_NONE region in the way. And, since MAP_FIXED overwrites things, I can't just guess a desired address and try it with MAP_FIXED, since I'll crash if I guess badly.

The upshot is that an improved MAP_FIXED flag would actually improve my use case a bit.

Maybe we should have MAP_WANT_GUARD_PAGES, too.

Persistent memory

Posted Mar 29, 2014 11:13 UTC (Sat) by blackwood (guest, #44174) [Link]

get_user_sg has my vote - for 3.16 we have patches floating around for drm/i915 to map userspace ranges into the gpu address space, and atm we use get_user_pages + create an sg table from it. But within the driver we exclusively deal with sg lists since we already support non struct page backed memory (gfx stolen range reserved by the firmware), so having a get_user_sg would neatly clean up our code.

Persistent memory

Posted Mar 31, 2014 19:23 UTC (Mon) by foom (subscriber, #14868) [Link]

A new flag is almost unnecessary. The current behavior is that if you pass an address without MAP_FIXED, it will map your space there, if and only if that address is available. If the address is not available, it will map it somewhere else.

So, the workaround is trivial: you call mmap with an address, without MAP_FIXED, and then compare the address returned to the address requested. If they don't match, unmap and return a failure.
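A minimal sketch of the workaround described in this comment might look like the code below. The function name map_at_or_fail() is invented for illustration, and the sketch assumes the requested address is page-aligned, since otherwise the address returned can never match the one requested.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* needed for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to map `len` bytes of anonymous memory at
 * exactly `wanted` without disturbing anything already mapped there.
 * Returns `wanted` on success, NULL if the kernel chose another address
 * or the mapping failed.
 */
void *map_at_or_fail(void *wanted, size_t len)
{
    assert(wanted != NULL);

    /* No MAP_FIXED: the address is only a hint, so an existing mapping
     * in the way is left alone and the kernel picks another spot. */
    void *got = mmap(wanted, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (got == MAP_FAILED)
        return NULL;

    if (got != wanted) {
        /* The hint was not honored; undo the mapping and report failure. */
        munmap(got, len);
        return NULL;
    }
    return got;
}
```

One caveat of this approach is that the kernel is free to ignore the hint even when the range is unoccupied, so a NULL return means "not placed there", not necessarily "address in use".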

Byte addressability

Posted Apr 9, 2014 22:08 UTC (Wed) by mratnad (guest, #96496) [Link]

Was there any discussion on how to enable the applications to use persistent memory through byte level addressing? Is mmap the only way?
Also, anything about atomic writes to the persistent memory?
TIA.

Persistent memory

Posted Apr 14, 2014 1:36 UTC (Mon) by kmeyer (subscriber, #50720) [Link]

> Wilcox said you can actually buy some devices now

Where?! We would really like to start playing with and thinking about supporting these in FreeBSD.