
DAX on BTT

By Jake Edge
May 4, 2016

LSFMM 2016

In the final plenary session of the 2016 Linux Storage, Filesystem, and Memory-Management Summit, much of the team that works on the DAX direct-access mechanism led a discussion on how DAX should interact with the block translation table (BTT)—a mechanism aimed at making persistent memory have the atomic sector-write properties that users expect from block devices. Dan Williams took the role of ringleader, but Matthew Wilcox, Vishal Verma, and Ross Zwisler were also on-stage to participate.

Williams noted that Microsoft has adopted DAX for persistent memory and is even calling it DAX. Wilcox said that it was an indication that Microsoft is "listening to customers; they've changed".

[Matthew Wilcox, Vishal Verma, Ross Zwisler, and Dan Williams]

BTT puts block-layer-like semantics onto persistent memory so that 512-byte (sector) writes are atomic, even though the memory itself handles writes at cache-line granularity (i.e. 64 bytes). This eliminates the problem of "sector tearing", where a power or other failure interrupts a write partway through a sector and leaves a mixture of old and new data, a situation that applications (or filesystems) are probably not prepared to handle. Microsoft supports DAX on both BTT and non-BTT block devices, while Linux only supports it for non-BTT devices. Williams asked: "should we follow them [Microsoft] down that rabbit hole?"
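
To make the distinction concrete, here is a toy Python model (not the kernel's BTT code) of an in-place sector write next to a write that goes through an indirection table. The names `write_in_place`, `ToyBTT`, and `PowerFailure` are invented for illustration, and the single map-entry swap is simply assumed to be atomic; the real BTT has on-media structures and recovery logic that this sketch leaves out.

```python
CACHE_LINE = 64   # granularity at which the toy "persistent memory" persists writes
SECTOR = 512      # sector size that block-device users expect to be atomic


class PowerFailure(Exception):
    """Raised by the toy model to simulate losing power mid-write."""


def write_in_place(media, offset, data, fail_after_lines=None):
    """Copy one sector into `media` a cache line at a time, like a memcpy."""
    for i in range(0, SECTOR, CACHE_LINE):
        if fail_after_lines is not None and i // CACHE_LINE >= fail_after_lines:
            raise PowerFailure
        media[offset + i:offset + i + CACHE_LINE] = data[i:i + CACHE_LINE]


class ToyBTT:
    """Hypothetical indirection layer: logical sector -> physical block."""

    def __init__(self, nsectors):
        # One spare physical block, so a write never overwrites live data.
        self.media = bytearray((nsectors + 1) * SECTOR)
        self.map = list(range(nsectors))   # logical -> physical
        self.free = nsectors               # index of the spare block

    def read(self, lba):
        start = self.map[lba] * SECTOR
        return bytes(self.media[start:start + SECTOR])

    def write(self, lba, data, fail_after_lines=None):
        # Step 1: fill the spare block; a failure here leaves the map untouched.
        write_in_place(self.media, self.free * SECTOR, data, fail_after_lines)
        # Step 2: publish the new data by swapping one map entry
        # (assumed atomic in this model); the old block becomes the spare.
        self.map[lba], self.free = self.free, self.map[lba]
```

With `fail_after_lines=3`, the in-place path leaves 192 bytes of new data followed by 320 bytes of old data, while a `ToyBTT` read of the same logical sector still returns the old contents in full.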

The difficulty is that BTT is meant to fix a problem that arises when persistent memory is treated like a block device, which is not the use case DAX is aimed at. Using BTT only for filesystem metadata might be one approach, Zwisler said. But Ric Wheeler noted that filesystems already put a lot of work into checksumming metadata, so using BTT for that would make things much slower for little or no gain.

Jeff Moyer pointed out that sector tearing can happen on block devices like SSDs, which is not what users expect. Joel Becker suggested that something like the SCSI atomic write command could be used by filesystems or applications that are concerned about torn sectors. That command guarantees that the sector is either written in full or not at all. There is no way to "magically save applications from torn sectors" unless they take some kind of precaution, he said.
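
One form such a precaution can take, similar in spirit to the metadata checksumming Wheeler mentioned, is for the application to checksum each sector-sized record so that a torn write is at least detected on read. The sketch below is an illustration under assumptions, not something presented in the session: the record layout and the `seal()` and `unseal()` helpers are hypothetical, and a CRC32 makes an undetected tear unlikely rather than impossible.

```python
import struct
import zlib

SECTOR = 512
PAYLOAD = SECTOR - 4   # the last 4 bytes of each sector hold a CRC32


def seal(payload):
    """Pad `payload` to PAYLOAD bytes and append its CRC32, giving one sector."""
    if len(payload) > PAYLOAD:
        raise ValueError("payload does not fit in one sector")
    body = payload.ljust(PAYLOAD, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def unseal(sector):
    """Return the padded payload, or None if the checksum does not match."""
    body = sector[:PAYLOAD]
    (stored,) = struct.unpack("<I", sector[PAYLOAD:])
    return body if zlib.crc32(body) == stored else None
```

A reader that gets `None` back knows the sector cannot be trusted and can fall back to a journal or an older copy; the checksum detects the tear but does not repair it.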

There is a bit of a "hidden agenda" in supporting BTT, though, Williams said. Currently, the drivers are not aware of when DAX mappings are established and torn down, but that would change for BTT support. Wilcox said he has a patch series that addresses some parts of that by making the radix tree the source for that information.





DAX on BTT

Posted May 5, 2016 20:38 UTC (Thu) by phro (subscriber, #29295) [Link] (6 responses)

Actually, what I said was that sector tearing doesn't usually happen on SSDs due to the nature of the FTL. Traditional storage, however, never guaranteed sector atomicity, but it usually does provide it. When you switch over to a block driver on top of pmem, it's possible that there will be increased risk of tripping over torn sectors (since it's just an interrupted memcpy). I don't have any numbers to back that up, presently. If you're wondering whether applications do rely on atomic sector updates, wonder no more! There is research that shows that some applications do, in fact, expect it [1].

The real issue is that DAX and the BTT are incompatible. If you want to use DAX, you have to give up sector atomicity. If applications truly depend on that, then they can't run unmodified on a pmem device mounted with -o dax. That means that you would have to separate out your pmem mount points into those that will support legacy applications and those that will only support DAX. By combining the two, you get the best of both worlds.

I got the overwhelming impression that the room was not convinced that applications should rely on atomic sector updates. Such applications are broken and should be fixed. Thus, there is little impetus to support the mixed DAX+BTT mode that was proposed.

[1] http://research.cs.wisc.edu/wind/Publications/alice-osdi1...

DAX on BTT

Posted May 5, 2016 21:11 UTC (Thu) by stellarhopper (subscriber, #84666) [Link] (1 responses)

Are you suggesting that we should indeed pursue DAX+BTT? One of the cons we also realized was that even if we do support this hybrid model, it will preclude DAX mappings larger than a page (i.e. 2MB and 1GB mappings), and the lost performance there is probably not worth the minor gains in convenience from the hybrid mode.

DAX on BTT

Posted May 6, 2016 14:48 UTC (Fri) by phro (subscriber, #29295) [Link]

I think breaking basic assumptions of applications is bad, yes. So, pursuing DAX+BTT is interesting in that respect. I forgot about the limitation you described, however. That does put a rather serious monkeywrench in the works, but I think that it could be worked around. The question is whether anyone is willing to put the work in, and at this stage it's not clear whether it would be worth the effort.

DAX on BTT

Posted May 6, 2016 0:00 UTC (Fri) by neilbrown (subscriber, #359) [Link] (3 responses)

> Traditional storage, however, never guaranteed sector atomicity

citation needed.

My model of traditional storage includes an ECC for each block. So the options for a read after an aborted write are:
- old data
- new data
- read error (ECC reports an uncorrectable error)

How can you get a torn sector?

DAX on BTT

Posted May 6, 2016 0:15 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (1 responses)

I'd argue that a read error is an atomicity problem.

DAX on BTT

Posted May 8, 2016 13:39 UTC (Sun) by robbe (guest, #16131) [Link]

Is there any basic device that guarantees no errors, ever? (Sure, you can build stacks of redundancy that make them less and less probable.)

FWIW, ECC does not guarantee detection of errors. I don’t know what the distance of the code used for your disk is (are these values universal?), so I can’t tell what the probability of an undetected error is.

DAX on BTT

Posted May 6, 2016 14:50 UTC (Fri) by phro (subscriber, #29295) [Link]

>> Traditional storage, however, never guaranteed sector atomicity

> citation needed.

I suppose "never" is a strong word. What I meant to say was that the SCSI and ATA standards did not say anything about power-fail write atomicity of a single sector. Because they did not standardize it, you cannot rely on it.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds