
Progress on persistent memory

By Jake Edge
March 11, 2015

LSFMM 2015

It has been "the year of persistent memory" for several years now, Matthew Wilcox said with a chuckle to open his plenary session at the 2015 Storage, Filesystem, and Memory Management Summit in Boston on March 9. Persistent memory refers to devices that can be accessed like RAM, but will permanently store any data written to them. The good news is that there are some battery-backed DIMMs already available, but those have a fairly small capacity at this point (8GB, for example). Much larger devices (400GB was mentioned) are coming, but it is not known when they will be shipping. From Wilcox's talk, it is clear that the two classes of devices will have different use cases, so they may be handled differently by the kernel.

It is good news that there are "exciting new memory products" in development, he said, but it may still be some time before we see them on the market. He is not sure that we will see them this year, for example. It turns out that development delays sometimes happen when companies are dealing with "new kinds of physics".

[Matthew Wilcox]

Christoph Hellwig jumped in early in the talk to ask if Wilcox's employer, Intel, would be releasing its driver for persistent memory devices anytime soon. Wilcox was obviously unhappy with the situation, but said that the driver could not be released until the ACPI specification for how the device describes itself to the system is released. That is part of ACPI 6, which will be released "when ACPI gets around to it". As soon as that happens, Intel will release its driver.

James Bottomley noted that there is a process within UEFI (which oversees ACPI) to release portions of specifications if there is general agreement by the participants to do so. He encouraged Intel to take advantage of that process.

Another attendee asked whether it was possible to write a driver today that would work with all of the prototype devices that had been tested, but wouldn't corrupt any of the other prototypes that had not been tested. Wilcox said no; at this point that isn't the case. "It is frustrating", he said.

Persistent memory and struct page

He then moved on to a topic he thought would be of interest to the memory-management folks in attendance. With a 4KB page size, and a struct page for each page, the 400GB device he mentioned would require 6GB just to track those pages in the kernel. That is probably too much space to "waste" for those devices. But if the kernel tracks the memory with page structures, it can be treated as normal memory. Otherwise, some layer, like a block device API, will be needed to access the device.
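
The arithmetic behind that 6GB figure is easy to reproduce. A minimal sketch, assuming a 64-byte struct page (roughly its size on 64-bit kernels, though the exact size varies by configuration) and treating the device as 400GiB:

    #include <stdio.h>

    /* Back-of-the-envelope cost of tracking a 400GB persistent-memory
     * device with one struct page per 4KB page.  The 64-byte size of
     * struct page is an assumption; it varies by kernel configuration. */
    int main(void)
    {
        unsigned long long device_bytes = 400ULL << 30;  /* ~400GB */
        unsigned long long page_size = 4096;
        unsigned long long struct_page_size = 64;

        unsigned long long pages = device_bytes / page_size;
        unsigned long long overhead = pages * struct_page_size;

        printf("%llu pages, %.2f GiB of struct page overhead\n",
               pages, overhead / (double)(1ULL << 30));
        return 0;
    }

That works out to a bit over 6GiB of page structures for a single 400GB device, which is the space Wilcox described as wasted.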

Wilcox has been operating under the assumption that those kinds of devices won't use struct page. On the other hand, Boaz Harrosh (who was not present at the summit) has been pushing patches for other, smaller devices, and those patches do use struct page. That makes sense for that use case, Wilcox said, but it is not the kind of device he has been targeting.

Those larger devices have wear characteristics that are akin to those of NAND flash, but it isn't "5000 cycles and the bit is dead". The devices have wear lifetimes of 10^7 or 10^8 cycles. In terms of access times, some are even faster than DRAM, he said.

Ted Ts'o suggested that the different capacity devices might need to be treated differently. Dave Chinner agreed, saying that the battery-backed devices are effectively RAM, while the larger devices are storage, which could be handled as block devices.

Wilcox said he has some preliminary patches to replace calls to get_user_pages() for these devices with a new call, get_user_sg(), which gets a scatter/gather list, rather than pages. That way, there is no need to have all those page structures to handle these kinds of devices. Users can treat the device as a block device. They can put a filesystem on it and use mmap() for data access.
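
The get_user_sg() patches were not shown at the session, so the interface below is purely a hypothetical sketch of the idea, modeled on the then-current get_user_pages() prototype and the kernel's struct sg_table; the argument list and parameter names are assumptions:

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /*
     * Hypothetical sketch: instead of filling an array of struct page
     * pointers the way get_user_pages() does, describe the user address
     * range as a scatter/gather table that a driver can hand straight to
     * dma_map_sg(), so no struct page is needed for persistent memory.
     */
    int get_user_sg(struct task_struct *tsk, struct mm_struct *mm,
                    unsigned long start, unsigned long nr_pages,
                    int write, int force, struct sg_table *sgt);

A driver doing I/O to or from persistent memory would then walk the resulting scatterlist rather than an array of pages, which is the point of the change.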

That led to a discussion about what to do to handle a truncate() on a file that has been mapped with mmap(). Wilcox thinks that Unix, and thus Linux, has the wrong behavior in that scenario. If a program accesses memory that is no longer part of the mapped file due to the truncation, it gets a SIGSEGV. Instead, he thinks that the truncate() call should be made to wait until the memory is unmapped.

Making truncate() wait is trivial to implement, Peter Zijlstra said, but it certainly changes the current behavior. He suggested adding a flag to mmap() to request this mode of operation. That should reduce the surprise factor as it makes the behavior dependent on what is being mapped. Ts'o said that he didn't think the kernel could unconditionally block truncate operations for hours or days without breaking some applications.
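
The behavior being debated is easy to see from user space today. Below is a minimal sketch that maps a file, truncates it out from under the mapping, and then touches the no-longer-backed page; on a conventional Linux filesystem the access raises a signal (SIGBUS in practice, so the handler catches both SIGBUS and SIGSEGV) rather than making the earlier truncate wait:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void handler(int sig)
    {
        /* Not async-signal-safe, but fine for a demo. */
        printf("access faulted with signal %d (%s)\n", sig, strsignal(sig));
        _exit(0);
    }

    int main(void)
    {
        const char *path = "/tmp/trunc-demo";   /* any writable file will do */
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        signal(SIGBUS, handler);
        signal(SIGSEGV, handler);

        ftruncate(fd, 8192);                    /* give the file two pages */
        char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        ftruncate(fd, 0);                       /* drop the backing store */
        p[0] = 'x';                             /* fault: page no longer backed */

        printf("unexpectedly survived the access\n");
        return 0;
    }

Under the behavior Wilcox proposed, the second ftruncate() would instead block until the mapping went away; Zijlstra's suggestion would make that opt-in via a new mmap() flag.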

Getting back to the question of the drivers, Ts'o asked what decisions needed to be made and by when. The battery-backed devices are out there now, so patches to support them should go in soon, one attendee said. Hellwig said that it would make sense to have Harrosh's driver and the Intel driver in the kernel. People could then choose the one that made sense for their device. In general, that was agreeable, but the driver for the battery-backed devices still needs some work before it will be ready to merge. Bottomley noted that means that the group has decided to have two drivers, "one that needs cleaning up and one we haven't seen".

New instructions

Wilcox turned to three new instructions that Intel has announced for its upcoming processors that can be used to better support persistent memory and other devices. The first is clflushopt, an optimized version of the cache-line flush (clflush) instruction with relaxed ordering requirements; the main benefit is that it is faster than clflush. Cache-line writeback (clwb) is another, which writes the cache line back to memory, but still leaves it in the cache. The third is pcommit, which acts as a sort of barrier to ensure that any prior cache flushes or writebacks actually get to memory.

The effect of pcommit is global for all cores. The idea is to do all of the flushes, then pcommit; when it is done, all that data will have been written. On current processors, there is no way to be sure that everything has been stored. He said that pcommit support still needs to be added to DAX, the direct access block layer for persistent memory devices that he developed.
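
As a rough illustration of the sequence he described (flush the affected lines, then commit), the sketch below uses the _mm_clwb intrinsic, which requires compiler support for the new instruction, and emits pcommit as raw opcode bytes since intrinsics for it were not yet generally available; the encoding is taken from Intel's published extension documentation and should be treated as an assumption:

    #include <immintrin.h>   /* _mm_clwb (needs -mclwb), _mm_sfence */
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    /* Sketch of making a write to persistent memory durable: copy the
     * data, write each affected cache line back with clwb (leaving it
     * cached), fence, issue pcommit so the flushed data is guaranteed
     * to reach the persistent media, and fence again. */
    static void persist(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, len);

        for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
             p < (uintptr_t)dst + len; p += CACHELINE)
            _mm_clwb((void *)p);

        _mm_sfence();
        /* pcommit, encoded directly (66 0F AE F8); no common intrinsic yet. */
        asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
        _mm_sfence();
    }

Because the effect of pcommit is global across cores, a single commit at the end of a batch of flushes is enough.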

Ts'o asked about other processors that don't have support for those kinds of instructions, but Wilcox didn't have much of an answer for that. He works for Intel, so other vendors will have to come up with their own solutions there.

There was also a question about adding a per-CPU commit, which Wilcox said was under internal discussion. But Bottomley thought that if there were more complicated options, that could just lead to more problems. Rik van Riel noted that the scheduler could move the process to a new CPU halfway through a transaction anyway, so the target CPU wouldn't necessarily be clear. In answer to another question, Wilcox assured everyone that the flush operations would not be slower than existing solutions for SATA, SAS, and others.

Error handling

His final topic was error handling. There is no status register that gives error indications when you access a persistent memory device, since it is treated like memory. An error causes a machine check, which typically results in a reboot. But if the problem persists, it could just result in another reboot when the device is accessed again, which will not work all that well.

To combat this, there will be a log of errors for the device that can be consulted at startup. It will record the block device address where problems occur and filesystems will need to be able to map that back to a file and offset, which is "not the usual direction for a filesystem". Chinner spoke up to say that XFS would have this feature "soon". Ts'o seemed to indicate ext4 would also be able to do it.

But "crashing is not a great error discovery technique", Ric Wheeler said; it is "moderately bad" for enterprise users to have to reboot their systems that way. But handling the problems when an mmap() is done for that bad region in the device is not easy either. Several suggestions were made (a signal from the mmap() call or when the page table entry is created, for example), but any of them mean that user space needs to be able to handle the errors.

In addition, Chris Mason said that users are going to expect to be able to mmap() a large file that has one bad page and still access all of the other pages from the file. That may not be reasonable, but is what they will expect. At that point, the discussion ran out of time without reaching any real conclusion on error handling.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]

Index entries for this article
Kernel: Memory management/Nonvolatile memory
Conference: Storage, Filesystem, and Memory-Management Summit/2015



Progress on persistent memory

Posted Mar 12, 2015 13:52 UTC (Thu) by tdz (subscriber, #58733)

If I had 400GB of main memory in my computer, I wouldn't care about 'wasting' 6GiB for page handling.

If this really is a problem for some people, the device could be partitioned into 'memory' and 'storage.' 16 GiB might be used for memory and the rest is storage.

Progress on persistent memory

Posted Mar 12, 2015 14:22 UTC (Thu) by andresfreund (subscriber, #69562)

The low cache hit ratio (TLB, L1-L3) due to the massive page tables might end up annoying you considerably, though. Page faults are often a significant contributor to runtime.

Progress on persistent memory

Posted Mar 12, 2015 15:22 UTC (Thu) by mtanski (guest, #56423)

I really hope I/O errors in new memory technologies translate to semantics similar to current I/O errors in mmap regions, e.g. a SIGBUS to the process/thread making the call.

Progress on persistent memory

Posted Mar 13, 2015 15:37 UTC (Fri) by etienne (guest, #25256)

> battery-backed DIMMs

Obviously those are DIMMs; does someone know if the 400GB devices are DIMMs too, or do they hide behind a PCIe bus?
Do DIMMs have enough "address lines" for such a range?

Blocking truncate()

Posted Mar 13, 2015 17:07 UTC (Fri) by cesarb (subscriber, #6266)

> He suggested adding a flag to mmap() to request this mode of operation. That should reduce the surprise factor as it makes the behavior dependent on what is being mapped.

Having truncate() block potentially forever, without a special flag to the truncate() call, would still be surprising behavior. Say a sysadmin notices a "hung" process; how would he find that it was a truncate() waiting for an mmap(), and, most importantly, how would he find which files are involved and which process he should kill so the truncate() goes ahead?

It would be as annoying as "mandatory locking".

We already sort of have that problem with NFS, but with NFS it's easy to find (check the server for all mounted filesystems to see if any isn't responding) and solve (reboot the hung server).

Blocking truncate()

Posted Mar 13, 2015 21:16 UTC (Fri) by zlynx (guest, #2285)

strace would show the program blocking on truncate(14) for example.

Then lsof would show that file handle 14 is /nvm/datafile.

fuser -v /nvm/datafile shows the programs using it.

I think that sequence of commands would do the trick.

Blocking truncate()

Posted Mar 15, 2015 14:52 UTC (Sun) by pflugstad (subscriber, #224)

And wouldn't you want to use huge pages for something like that? That would cut down on overhead quite a bit...

Blocking truncate()

Posted Apr 9, 2015 18:53 UTC (Thu) by Jandar (subscriber, #85683)

IMHO it's not good to block the truncate() syscall; instead, remember the intended file size. I see it like an unlink() on a file which has an open fd: the resource is released after the last use has ended.

What happens if a write enlarges the file to have a new backing for the mmapped region? If the truncate is blocked, would it revert this enlargement after unmap()?

Blocking the truncating process needlessly opens a can of worms.

Swap memory

Posted Mar 19, 2015 21:34 UTC (Thu) by Tov (subscriber, #61080)

OK, so we will soon get a really fast Non-Volatile Random Access Memory with wear characteristics.

Wouldn't that be an excellent candidate for a "swap-like" memory with random read-only access and execute in place?

That would mean:
- Really fast suspend/resume!
- Read-only memory (like code segments) can be swapped out immediately on load
- Swap out based on "least written" semantics instead of "least accessed"
- Fault to DRAM on write instead of fault on access (to avoid excessive wear)
- ECC should be used together with some BadRAM-like blacklisting

It would really be a shame to convert this fast random access memory to a block storage - something like using your hard drive as a tape storage :-)


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds