
DAX semantics

By Jake Edge
May 13, 2019

LSFMM

In the filesystems track at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Ted Ts'o led a discussion about an inode flag to indicate DAX files, which is meant to be applied to files that should be directly accessed without going through the page cache. XFS has such a flag, but ext4 and other filesystems do not. The semantics of what the flag would mean are not clear to Ts'o (and probably others), so the intent of the discussion was to try to nail those down.

Dan Williams said that the XFS DAX flag is silently ignored if the device is not DAX capable; otherwise, the file must be accessed with DAX. Ts'o said there are lots of questions about what turning a DAX flag on or off might mean: does it matter, for example, whether there are already pages in the page cache? He said that he did not have a strong preference but thought that all filesystems should stick to one interpretation.
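For reference, this is roughly how the existing per-file flag is manipulated on XFS from user space, via the FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls; a minimal sketch, with error handling pared down:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <unistd.h>

    int set_dax_flag(const char *path, int enable)
    {
        struct fsxattr fsx;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
            goto fail;
        if (enable)
            fsx.fsx_xflags |= FS_XFLAG_DAX;    /* the flag under discussion */
        else
            fsx.fsx_xflags &= ~FS_XFLAG_DAX;
        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
            goto fail;
        close(fd);
        return 0;
    fail:
        close(fd);
        return -1;
    }

As the article notes, the flag set this way is silently ignored when the underlying device is not DAX capable.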

While Christoph Hellwig described things as "all broken", Ts'o was hoping that some agreement could be reached among the disparate ideas of what a DAX flag would mean. A few people think there should be no flag and that it should all be determined automatically, but most think the flag is useful. He suggested starting with something "super conservative", such as only allowing the flag to be set on zero-length files, or on empty directories whose files would inherit it. Those constraints could be relaxed later if there was a need.


Boaz Harrosh wondered why someone might want to turn DAX off for a persistent memory device. Hellwig said that the performance "could suck"; Williams noted that the page cache could be useful for some applications as well. Jan Kara pointed out that reads from persistent memory are close to DRAM speed, but that writes are not; the page cache could be helpful for frequent writes. Applications need to change to fully take advantage of DAX, Williams said; part of the promise of adding a flag is that users can do DAX on smaller granularities than a full filesystem.

When he developed DAX, he added the DAX flag as a "chicken bit", Matthew Wilcox said. The intent was that administrators could control the use of DAX on their systems; he would like to preserve that ability going forward. It may depend not only on whether the application is DAX aware, but also on the workload that the application is handling. Ts'o said that there may be applications that want to use persistent memory in its "full unexpurgated form"; requiring administrators to set a flag on a file to enable that is not particularly friendly. Wilcox agreed, saying that he did not want to make administrators' jobs harder, but he did want to preserve the ability to override what an application developer chose.

But Chuck Lever wondered why these DAX-aware applications would want to use a filesystem at all; wouldn't they rather simply get access to a chunk of persistent memory directly via mmap() or similar? Williams said that is exactly what DAX does; if you mmap() a DAX file, you get a chunk of the persistent memory mapped in. The problem is that other applications using the same filesystem may not be ready to get that same kind of access; they may be relying on filesystem semantics for file-backed memory.
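Concretely, a DAX-aware application gets at that chunk of persistent memory with mmap(); since Linux 4.15, the MAP_SYNC flag (used together with MAP_SHARED_VALIDATE so that an unsupported flag causes a failure rather than being silently ignored) asks the kernel for a mapping where stores plus cache flushes are durable. A minimal sketch; the path is hypothetical:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/data", O_RDWR);   /* hypothetical DAX file */
        if (fd < 0)
            return 1;
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
            /* The kernel or filesystem cannot provide a synchronous
             * mapping; fall back to page-cache-mediated I/O. */
            perror("mmap");
            return 1;
        }
        /* Stores through p go straight to the persistent memory. */
        return 0;
    }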

One way to look at it, Ts'o said, is that if someone buys a chunk of persistent memory for a single application that is going to use all of it, they don't need a filesystem at all. They can just point the application directly at the block device. But if someone wants to share that persistent memory with multiple applications, user IDs, and such, then a filesystem makes sense.

Lever asked why some kind of namespace would not be used to make that distinction. Williams said that administrators are used to the tools to deal with filesystems and files, so that would make a better interface. For the "I know what I want" users, device DAX solves their problems, but for other users, some other interface is needed and file-oriented interfaces make the most sense. In addition, Ts'o pointed out that a namespace or bind mount is not sufficient since a file either needs to use the page cache or it needs to stay completely out of it; trying to do both will lead to problems. Partitioning a block device into DAX and non-DAX portions would be another way to make it all work, but that lacks flexibility.

An attendee asked what it meant for an application to be DAX aware. Williams said that DAX-aware applications want to do in-place updates of their data and want to manage the CPU cache themselves; these applications want to do accesses on data that is smaller than a page in size directly in memory versus having some kind of buffering or the page cache.

Ts'o said that there are a number of papers out there that describe libraries that can use persistent memory to, for example, update a B-tree in place. These algorithms do the operations in the right order and flush the caches in such a way that a crash at any point will leave the data structure in a consistent state. It is important that these are libraries to be used by applications; Ts'o said he would not normally trust application authors to get this kind of thing right. But Hellwig expressed skepticism that any non-academic filesystem author would actually trust the CPU's memory subsystem to always get this right; that was met with a fair amount of laughter.
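The ordering discipline those libraries follow looks roughly like this: write the new data, flush it, fence, and only then set a flag marking it valid, so that a crash at any point leaves either the old or the new state visible. A sketch assuming an x86 CPU with the CLWB instruction (compile with -mclwb); the structure is purely illustrative:

    #include <immintrin.h>
    #include <stdint.h>

    struct record {
        uint64_t payload;
        uint8_t  valid;    /* real code often puts this in its own cache line */
    };

    void persist_record(struct record *r, uint64_t value)
    {
        r->payload = value;
        _mm_clwb(&r->payload);   /* write the cache line back to pmem */
        _mm_sfence();            /* order the flush before the flag store */
        r->valid = 1;
        _mm_clwb(&r->valid);
        _mm_sfence();
    }

If a crash leaves valid set, the payload flush is guaranteed to have completed first, so recovery code can trust the record.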

Amir Goldstein asked how allowing DAX and non-DAX access to a file was different from allowing buffered and direct I/O access to the same file. Hellwig said that, while mixing buffered and direct I/O is allowed, it does not give the results that users expect. Goldstein also asked about mixing direct I/O and DAX, but that does not work, Kara said; mixing buffered and direct I/O doesn't really work either, but the kernel pretends that it does. For DAX, kernel developers decided not to repeat that mistake.

Direct I/O and DAX do not work together, but they could if someone wanted to rework the existing implementation, Hellwig said. It would be useful to be able to read persistent memory without using the page cache and to write via the page cache, but it would be tremendously complex to handle the CPU cache coherency correctly, which is probably what has scared everyone away. Another problem is when a DAX-aware application gets surprised by having the page cache between it and persistence; it believes that the cache-flush instructions it issues are making the data persistent when they are not.

But Hellwig said that there is an API problem if applications are issuing "random weird instructions" and expecting them to work; there are too many other layers potentially in between the application and the storage. He is not entirely sure that making these kinds of programming models more widespread is a good idea, but if that's the path that will be taken, there should be some kind of interface at the VDSO level that applications can call where the kernel will ensure that they do the right thing. The kernel will issue the proper cache-flush instructions or whatever else is necessary. The existing model for applications "can't ever work", he said.

Ts'o said that there can be a debate about how reliable these models and cache-flush instructions truly are, but he is reasonably confident that if they don't work, the hardware vendors will fix them when customers put pressure on them. In any case, though, that is orthogonal to the question of having a per-file DAX flag and what its semantics should be.

Williams said he was uncomfortable calling it a "DAX flag", though he acknowledged that could lead directly to the bike shed. He thought that perhaps a MAP_SYNC flag on the inode would be better. Hellwig suggested that the name "DAX" should be retired because it is confusing at this point; he, of course, had his own suggestion for a name for the flag ("writethrough"), as did several others, though no real conclusion was reached.

The discussion moved on to how the flag, however named, could be set. Ts'o said he was uncomfortable restricting it to empty directories, with all of the files in them inheriting the attribute, due to hard links and renames. If that is really going to be the way forward, then filesystems need to look at restricting hard links and renames. This is why he wants to nail down the semantics before implementing it in filesystems.

Lever is uncomfortable with a "permanent sticky bit" in the filesystem that is set by administrators, however. He is concerned that administrators will turn it off when it needs to be on, or the reverse; he wondered if a flag to open() was a better path, since the application should know what it needs. But Hellwig pointed out that open() flags are not checked to see if unsupported options are being passed; applications could not be sure to get the behavior they asked for.

Williams pointed out that there already is a dax mount option, so that ship has already sailed to some extent. Ts'o also noted that open() is not the right time to specify this; it needs to be a property of the file itself. If the file is opened twice, once with DAX and once without, what would that mean? One way to handle that might be to fail an open() with the "wrong" mode if the file is already open; the "real disaster" for buffered versus direct I/O was in allowing both types of open() to succeed. Beyond that, though, Hellwig was emphatic that open() flags should never be used for data integrity purposes.

Sprinkled throughout the latter part of the discussion were more suggestions of different names for the flag, but Ts'o thinks they are stuck with the DAX name. There were also questions about how per-file flags interact with the global mount option, including whether a nodax mount option was required. Most seemed to think that option was not needed, however.

In summary, Ts'o said that he thought the overall consensus was to have a flag for empty directories that would be inherited on files created there. The flag could also be set for zero-length files and he heard no enthusiasm for allowing the flag to be cleared once it was set. He plans to summarize the discussion in a post to the relevant mailing lists (fsdevel and DAX) for further discussion.


Index entries for this article
Kernel: DAX
Kernel: Memory management/Nonvolatile memory
Conference: Storage, Filesystem, and Memory-Management Summit/2019



DAX semantics

Posted May 14, 2019 1:04 UTC (Tue) by gutschke (subscriber, #27910) [Link] (2 responses)

If regular file I/O and DAX semantics are mutually exclusive, then setting a flag on the file is going to be an administrative pain. All of a sudden, there is a file that cannot be accessed with "normal" UNIX tools. How about calling "file", "sed", "dd", "tar", or "cp" on a DAX file? These all seem reasonable administrative things to do -- at least whenever the file isn't concurrently opened in DAX mode.

I think, from a usability point of view, having a flag on open() is a lot less surprising for any object that lives inside of the filesystem namespace. I do agree, though, that files should lock out incompatible access modes if they have already been opened in another mode. There is some precedent for that: open() already returns ETXTBSY in similar situations. Currently, that can only happen for write access, but it wouldn't be too far-fetched to also do so for write access of DAX files.

DAX semantics

Posted May 14, 2019 1:13 UTC (Tue) by gutschke (subscriber, #27910) [Link]

$s/write/read/

DAX semantics

Posted May 24, 2019 18:15 UTC (Fri) by mcr (subscriber, #99374) [Link]

I want to echo gutschke's comments about the flag causing standard tools to break. They should continue to work for read-only, as getting good backups depends upon this.
Backups in general are a big issue; I have wanted the VFS layer to provide better access for backups for a long time. I would go as far as saying that file systems should be able to turn file contents plus *all* metadata (and extensions, etc.) into POSIX CPIO format directly. (Or dump format, but I think that is far less standard.) The CPIO/TAR/etc. formats can't be expected to keep up with whatever innovation file system creators are doing. This would significantly free file systems to innovate more, knowing that backup tools would continue to work.

DAX semantics

Posted May 14, 2019 13:09 UTC (Tue) by nix (subscriber, #2304) [Link] (19 responses)

> Hellwig was emphatic that open() flags should never be used for data integrity purposes

What are O_SYNC et al, if not that?

O_SYNC

Posted May 14, 2019 13:41 UTC (Tue) by corbet (editor, #1) [Link] (18 responses)

I believe that O_SYNC exactly illustrates his point. Since older kernels will simply ignore that flag, an application can never know that it's actually getting the behavior it's asking for. If you must know that your data has made it to media, you still have to call fsync() regardless of whether you opened with O_SYNC.

O_SYNC

Posted May 14, 2019 13:59 UTC (Tue) by epa (subscriber, #39769) [Link] (17 responses)

Can we patch up the broken open() interface with a system call that, given an open file descriptor, returns the flags that are in effect? So you could open() and then check that none of the flags you passed were ignored...

O_SYNC

Posted May 14, 2019 14:20 UTC (Tue) by jlayton (subscriber, #31672) [Link]

fcntl(F_GETFL, ...) ?

...though I'm not sure how accurate that is across filesystems.
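For what it's worth, that check would look something like the following; a sketch, with the caveat from the comment above that how faithfully F_GETFL reflects what the filesystem actually honors can vary:

    #include <fcntl.h>
    #include <stdio.h>

    int open_sync_checked(const char *path)
    {
        int fd = open(path, O_WRONLY | O_SYNC);
        if (fd < 0)
            return -1;
        /* Ask the kernel which flags actually stuck. */
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0 || !(flags & O_SYNC))
            fprintf(stderr, "warning: O_SYNC not reflected; "
                            "falling back to explicit fsync()\n");
        return fd;
    }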

O_SYNC

Posted May 15, 2019 13:31 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (15 responses)

And then what? If it turns out to be unsupported and the application is supposed to cope with that, it'll need to fall back to fsync at runtime. Which means that this detour can be avoided by simply using fsync all the time.

NB: I strongly suspect that a huge number of 'practical' O_SYNC uses will be entirely gratuitous performance drags because it's being employed as a "magic countermeasure" for occasional data corruption caused by writing through dangling pointers.

O_SYNC

Posted May 15, 2019 15:41 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (13 responses)

If latency is more relevant than throughput there can be decent reasons to use O_SYNC. The fact that an fsync might later need to flush large amounts of data, possibly generated by different processes, can be quite problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are reached).

Really wish there were a good posix way to tell the kernel: 1) initiate writeback very soon, I'm not going to redirty; 2) I am / am not going to read that data again soon; 3) the amount of pending dirty data is limited.

But without the perf overhead of O_SYNC (which often will end up generating much more random writes than necessary). One can get there using sync_file_range(WRITE) plus fadvise. But it's much harder than necessary.
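The sync_file_range()-plus-fadvise combination mentioned here looks roughly like the following; a sketch, with offsets, sizes, and coalescing policy left to the caller:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Kick off writeback for a range we do not intend to redirty. */
    static void start_writeback(int fd, off_t off, off_t len)
    {
        /* Asynchronous: submit the dirty pages in [off, off+len)
         * for writeback without blocking or evicting them. */
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
    }

    /* Later, once the range has been written back, hint that we will
     * not read it again soon so the kernel may drop the pages. */
    static void drop_range(int fd, off_t off, off_t len)
    {
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }

The wait flags in the second call matter: POSIX_FADV_DONTNEED cannot evict pages that are still dirty, so writeback has to complete first.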

O_SYNC

Posted May 16, 2019 0:34 UTC (Thu) by TMM (subscriber, #79398) [Link] (1 responses)

That's what fadvise is for, it works pretty well.

O_SYNC

Posted May 16, 2019 2:14 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> That's what fadvise is for, it works pretty well.

I referenced fadvise?

And no, it doesn't. You can't initiate writeback with it, without also causing the page cache contents to be thrown out. For that you need sync_file_range(SYNC_FILE_RANGE_WRITE).

And to not suck performance-wise you need to coalesce the sync_file_range() calls in userspace, and issue them over larger ranges of blocks, otherwise there'll be unnecessarily random IO. And no, just write()ing larger blocks isn't really a good solution either.

To me this is the kernel's job, and I should easily be able to specify a) a max amount of dirty data caused by a process for an fd, and b) that I'm not going to redirty data that's being written, and that writeback should happen whenever it makes sense from a granularity perspective (i.e. optimize for efficient writes, without unnecessarily keeping data dirty until the next dirty_writeback_centisecs).

O_SYNC

Posted May 16, 2019 17:19 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (10 responses)

Fsync is a system call which flushes all outstanding writes for a file referred to by a certain file descriptor. That's not the same as the sync system call.

O_SYNC

Posted May 16, 2019 17:40 UTC (Thu) by andresfreund (subscriber, #69562) [Link] (9 responses)

> Fsync is a system call which flushes all outstanding writes for a file referred to by a certain file descriptor. That's not the same as the sync system call.

Uh, yes, obviously? What does that have to do with what I wrote? You can have a lot of dirty data for a single fd?

O_SYNC

Posted May 16, 2019 18:53 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (8 responses)

>> The fact that an fsync might later need to flush large amounts of data, possibly generated by different processes, can be quite
>> problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are
>> reached).

Looks like a description of the sync system call (or rather, it looks less like something which doesn't describe sync than like something which doesn't describe fsync :->).

O_SYNC

Posted May 16, 2019 20:04 UTC (Thu) by andresfreund (subscriber, #69562) [Link] (7 responses)

I was just talking about O_SYNC, the topic of this subthread (and referenced a few words before and after). On a phone, which makes capitalizing words a pain. If you use O_SYNC the amount of dirty data that later needs to be written back (after dirty_{writeback_centisecs, background_bytes, bytes, ...} or fsync) is pretty limited. Therefore not incurring latency at the later stage where an fsync() might need to write back huge amounts of data, or where write() might suddenly block (due to dirty_bytes), or where read() will be slow because dirty_writeback_centisecs/fsync triggered a lot of IO submissions in a very short amount of time.

When using O_SYNC, or controlling writeback using sync_file_range(SYNC_FILE_RANGE_WRITE) etc, you prevent these latency type issues.

It's quite possible, although the Linux kernel has improved a fair bit in that regard in the last few years, to trigger individual read()s taking many seconds, because the kernel is busy flushing back a few gigabytes of dirty data.

I explained this because you said:
> Which means that this detour can be avoided by simply using fsync all the time.
etc. Which I think is not correct.

O_SYNC

Posted May 16, 2019 20:38 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (6 responses)

When using fsync, the amount of data which will be written back is just as limited: It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.

Dirty_writeback_centisecs is a /proc-file which can be used to change the time interval between automatically performed sync-operations by the kernel. That's still sync and not fsync.

O_SYNC

Posted May 17, 2019 4:48 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (2 responses)

You appear to be intentionally misunderstanding, and/or you're making entirely unrelated statements.

> Dirty_writeback_centisecs is a /proc-file

What on earth made you think I didn't know that?

> which can be used to change the time interval between automatically performed sync-operation by the kernel. That's still sync and not fsync.

It's neither. Writeback triggered by those controls doesn't trigger data integrity flushes, but sync(1) and fsync() do.

> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.

Well, fsync is synchronous, so constantly issuing it is less performant than the alternative I mentioned of controlling writeback via sync_file_range() (and less efficient than what I wished for), since it needs to interact with the drive cache. And where have I doubted that one can also issue fsyncs to control the amount of outstanding data?

O_SYNC

Posted May 17, 2019 17:28 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

>> Dirty_writeback_centisecs is a /proc-file
> What on earth made you think I didn't know that?

Why did you hack this sentence apart in order to insert a (wrong) conjecture about something I could have thought?

>> which can be used to change the time interval between automatically performed sync-operations by the kernel. That's
>> still sync and not fsync.
> It's neither.

Indeed. The name is obviously different. Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do; IOW, the combination "dirty_writeback_centisecs/fsync" makes no sense. For the case of a single process writing to a certain open file, the effects of O_SYNC are by definition (POSIX) identical to pairing each write(2) with an fsync(2) on the same file descriptor. As an additional benefit, fsync is more portable and it enables applications to batch several writes if so desired, e.g., when creating a file, write the complete file content and then fsync it.

O_SYNC

Posted May 21, 2019 23:10 UTC (Tue) by nix (subscriber, #2304) [Link]

> Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do

Except that's not what it does. Every dirty_writeback_centisecs, dirty inodes that became dirty longer ago than dirty_writeback_centisecs are flushed. Inodes that were dirtied more recently than that are *not* flushed. That's nothing like sync(), which flushes everything regardless of first dirty time, and you can't do it with fsync() either (even if you track the timing yourself, good luck with that) because there is extra behaviour regarding flushes that take too long which cannot be mimicked with fsync().

O_SYNC

Posted May 17, 2019 7:38 UTC (Fri) by mpr22 (subscriber, #60784) [Link] (1 responses)

Depending on the behaviour of the underlying filesystem, calling fsync() can reify rather more write()s than just the ones to the fd you're fsync()ing.

O_SYNC

Posted May 29, 2019 13:46 UTC (Wed) by Wol (subscriber, #4433) [Link]

Applications should not have to concern themselves with filesystem-specific code.

THAT caused major grief at the ext3/4 transition, iirc. ext3 was "save your data"-friendly by accidental default; ext4 was not. ext3 was NOT sync-friendly, ext4 was.

So you had a choice: write an ext4-friendly program and watch performance go through the floor on ext3, or an ext3-friendly program and watch crashes corrupt your data. NOT nice.

I don't want to know what filesystem my application is running on. I don't want serious differences in behaviour based on that fact. And as someone who is interested in databases and logging and all that, I also probably don't care whether my data made it to disk okay or not; I just want to know that whatever is on disk is consistent!

So the ability to put a "write barrier" in the queue, between the "write the log" and "update the files", is actually important to me. If the log is not updated properly, then the transaction is lost. If the log is safe, and the files are not, I can recover the transaction. Either way, the data is consistent. But if the filesystem moves HALF the file updates before the log is properly saved, I'm screwed.

Cheers,
Wol
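Lacking a true barrier primitive, the ordering Wol describes is approximated today with an explicit flush between the two phases: make the log durable first, and only then touch the files it covers. A minimal sketch; the descriptors and buffers are hypothetical:

    #include <sys/types.h>
    #include <unistd.h>

    int commit(int log_fd, int data_fd,
               const void *log_rec, size_t log_len,
               const void *update, size_t upd_len, off_t upd_off)
    {
        /* Phase 1: the log record must reach stable storage first. */
        if (write(log_fd, log_rec, log_len) != (ssize_t)log_len)
            return -1;
        if (fsync(log_fd) < 0)
            return -1;

        /* Phase 2: only now may the in-place file update proceed;
         * after a crash, a durable log lets us redo this step. */
        if (pwrite(data_fd, update, upd_len, upd_off) != (ssize_t)upd_len)
            return -1;
        return fsync(data_fd);
    }

The cost of this approximation is the synchronous wait in the middle, which is exactly what a queued write barrier would avoid.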

O_SYNC

Posted May 21, 2019 23:03 UTC (Tue) by nix (subscriber, #2304) [Link]

> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor

That's not actually true (though in practice it usually is). It's outstanding modifications to a certain open file, this referring to a file description (not descriptor). This is not a single-process entity but can be shared among processes, all of which can be writing to the file description simultaneously: all that data needs synchronous flushing if O_SYNC is on, and the other writing processes will all be blocked if they try to write to it while that write is ongoing.

O_SYNC

Posted May 19, 2019 11:49 UTC (Sun) by epa (subscriber, #39769) [Link]

I meant for open flags in general, not just O_SYNC. Possibly it would be useful to introspect these things anyway, as when a process inherits open filehandles from its parent.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds