LWN: Comments on "DAX semantics" https://lwn.net/Articles/787973/ This is a special feed containing comments posted to the individual LWN article titled "DAX semantics". Fri, 19 Sep 2025 07:20:16 +0000
O_SYNC https://lwn.net/Articles/789687/ Wol
<div class="FormattedComment"> Applications should not have to concern themselves with filesystem-specific code.<br> <p> THAT caused major grief at the ext3/4 transition, iirc. ext3 was "save your data - friendly" by accidental default. ext4 was not. ext3 was NOT sync-friendly, ext4 was.<br> <p> So you had a choice - write an ext4-friendly program and watch performance go through the floor on ext3, or an ext3-friendly program and watch crashes corrupt your data. NOT nice.<br> <p> I don't want to know what filesystem my application is running on. I don't want serious differences in behaviour based on that fact. And as someone who is interested in databases and logging and all that, I also probably don't care whether my data made it to disk okay or not, I just want to know that whatever is on disk is consistent!<br> <p> So the ability to put a "write barrier" in the queue, between the "write the log" and "update the files", is actually important to me. If the log is not updated properly, then the transaction is lost. If the log is safe, and the files are not, I can recover the transaction. Either way, the data is consistent. But if the filesystem moves HALF the file updates before the log is properly saved, I'm screwed.<br> <p> Cheers,<br> Wol<br> </div>
Wed, 29 May 2019 13:46:30 +0000
DAX semantics https://lwn.net/Articles/789380/ mcr
<div class="FormattedComment"> I want to echo gutschke's comments about the flag causing standard tools to break. 
They should continue to work for read-only access, as getting good backups depends upon this.<br> Backups in general are a big issue; I have wanted the VFS layer to provide better access to backups for a long time. I would go as far as saying that file systems should be able to turn file contents plus *all* metadata (and extensions, etc.) into POSIX CPIO format directly. (Or dump format, but I think that is far less standard.) CPIO, TAR, etc. can't be expected to keep up with whatever innovation file system creators are doing. This would significantly free file systems to innovate more, knowing that backup tools would continue to work.<br> </div>
Fri, 24 May 2019 18:15:55 +0000
O_SYNC https://lwn.net/Articles/789052/ nix
<blockquote> Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do </blockquote> Except that's not what it does. Every dirty_writeback_centisecs, dirty inodes that became dirty longer ago than dirty_writeback_centisecs are flushed. Inodes that were dirtied more recently than that are *not* flushed. That's nothing like sync(), which flushes everything regardless of first dirty time, and you can't do it with fsync() either (even if you track the timing yourself, good luck with that) because there is extra behaviour regarding flushes that take too long which cannot be mimicked with fsync().
Tue, 21 May 2019 23:10:38 +0000
O_SYNC https://lwn.net/Articles/789051/ nix
<blockquote> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor </blockquote> That's not actually true (though in practice it usually is). It's outstanding modifications to a certain open file, this referring to a file <i>description</i> (not descriptor). 
This is not a single-process entity but can be shared among processes, all of which can be writing to the file description simultaneously: all that data needs synchronous flushing if O_SYNC is on, and the other writing processes will all be blocked if they try to write to it while that write is ongoing.
Tue, 21 May 2019 23:03:38 +0000
O_SYNC https://lwn.net/Articles/788847/ epa
<div class="FormattedComment"> I meant for open flags in general, not just O_SYNC. Possibly it would be useful to introspect these things anyway, as when a process inherits open filehandles from its parent. <br> </div>
Sun, 19 May 2019 11:49:41 +0000
O_SYNC https://lwn.net/Articles/788799/ rweikusat2
<div class="FormattedComment"> <font class="QuotedText">&gt;&gt; Dirty_writeback_centics is a /proc-file</font><br> <font class="QuotedText">&gt; What on earth made you think I didn't know that?</font><br> <p> Why did you hack this sentence apart in order to insert a (wrong) conjecture about something I could have thought?<br> <p> <font class="QuotedText">&gt;&gt; which can be used to change the time interval between automatically performed sync-operation by the kernel. That's </font><br> <font class="QuotedText">&gt;&gt; still sync and not fsync.</font><br> <font class="QuotedText">&gt; It's neither. </font><br> <p> Indeed. The name is obviously different. Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do, IOW, the combination "dirty_writeback_centisecs/fsync" makes no sense. For the case of a single process writing to a certain open file, the effects of O_SYNC are by definition (POSIX) identical to pairing each write(2) with an fsync(2) on the same file descriptor. 
As an additional benefit, fsync is more portable and enables applications to batch several writes if so desired, e.g., when creating a file, write the complete file content and then fsync it.<br> <p> <p> </div>
Fri, 17 May 2019 17:28:37 +0000
O_SYNC https://lwn.net/Articles/788742/ mpr22
<div class="FormattedComment"> Depending on the behaviour of the underlying filesystem, calling fsync() can reify rather more write()s than just the ones to the fd you're fsync()ing.<br> </div>
Fri, 17 May 2019 07:38:23 +0000
O_SYNC https://lwn.net/Articles/788738/ andresfreund
<div class="FormattedComment"> You appear to be intentionally misunderstanding, and/or you're making entirely unrelated statements. <br> <p> <font class="QuotedText">&gt; Dirty_writeback_centics is a /proc-file </font><br> <p> What on earth made you think I didn't know that?<br> <p> <p> <font class="QuotedText">&gt; which can be used to change the time interval between automatically performed sync-operation by the kernel. That's still sync and not fsync.</font><br> <p> It's neither. Writeback triggered by those controls doesn't trigger data integrity flushes, but sync(1) and fsync() do.<br> <p> <p> <font class="QuotedText">&gt; It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.</font><br> <p> Well, fsync is synchronous, so constantly emitting it is less performant than the alternative I mentioned with controlling writeback via sync_file_range() (and less efficient than what I wished for) - it needs to interact with the drive cache. 
And where have I doubted that one can also issue fsyncs to control the amount of outstanding data?<br> <p> <p> <p> <p> </div>
Fri, 17 May 2019 04:48:27 +0000
O_SYNC https://lwn.net/Articles/788718/ rweikusat2
<div class="FormattedComment"> When using fsync, the amount of data which will be written back is just as limited: It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.<br> <p> Dirty_writeback_centics is a /proc-file which can be used to change the time interval between automatically performed sync-operation by the kernel. That's still sync and not fsync.<br> </div>
Thu, 16 May 2019 20:38:35 +0000
O_SYNC https://lwn.net/Articles/788717/ andresfreund
<div class="FormattedComment"> I was just talking about O_SYNC, the topic of this subthread (and referenced a few words before and after). On a phone, which makes capitalizing words a pain. If you use O_SYNC the amount of dirty data that later needs to be written back (after dirty_{writeback_centisecs, background_bytes, bytes, ...} or fsync) is pretty limited. 
That avoids incurring latency at the later stage where an fsync() might need to write back huge amounts of data, or where write() might suddenly block (due to dirty_bytes), or where read() will be slow because dirty_writeback_centisecs/fsync triggered a lot of IO submissions in a very short amount of time.<br> <p> When using O_SYNC, or controlling writeback using sync_file_range(SYNC_FILE_RANGE_WRITE) etc., you prevent these latency issues.<br> <p> It's quite possible, although the Linux kernel has improved a fair bit in that regard in the last few years, to trigger individual read()s taking many seconds, because the kernel is busy flushing back a few gigabytes of dirty data. <br> <p> <p> I explained this because you said:<br> <font class="QuotedText">&gt; Which means that this detour can be avoided by simply using fsync all the time.</font><br> etc. Which I think is not correct.<br> </div>
Thu, 16 May 2019 20:04:54 +0000
O_SYNC https://lwn.net/Articles/788713/ rweikusat2
<div class="FormattedComment"> <font class="QuotedText">&gt;&gt; The fact that an fsync might later need to flush large amounts of data, possibly generated by different processes, can be quite </font><br> <font class="QuotedText">&gt;&gt; problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are reached). </font><br> <p> Looks like a description of the sync system call (or rather, it looks less like something which doesn't describe sync than like something which doesn't describe fsync :-&gt;).<br> </div>
Thu, 16 May 2019 18:53:20 +0000
O_SYNC https://lwn.net/Articles/788705/ andresfreund
<div class="FormattedComment"> <font class="QuotedText">&gt; Fsync is a system call which flushes all outstanding writes for a file referred to by a certain file descriptor. That's not the same as the sync system call.</font><br> <p> Uh, yes, obviously? 
What does that have to do with what I wrote? You can have a lot of dirty data for a single fd?<br> </div>
Thu, 16 May 2019 17:40:38 +0000
O_SYNC https://lwn.net/Articles/788704/ rweikusat2
<div class="FormattedComment"> Fsync is a system call which flushes all outstanding writes for a file referred to by a certain file descriptor. That's not the same as the sync system call.<br> </div>
Thu, 16 May 2019 17:19:33 +0000
O_SYNC https://lwn.net/Articles/788592/ andresfreund
<div class="FormattedComment"> <font class="QuotedText">&gt; That's what fadvise is for, it works pretty well.</font><br> <p> I referenced fadvise?<br> <p> And no, it doesn't. You can't initiate writeback with it, without also causing the page cache contents to be thrown out. For that you need sync_file_range(SYNC_FILE_RANGE_WRITE).<br> <p> And to not suck performance-wise you need to coalesce the sync_file_range() calls in userspace, and issue them over larger ranges of blocks, otherwise there'll be unnecessarily random IO. And no, just write()ing larger blocks isn't really a good solution either.<br> <p> To me this is the kernel's job, and I should easily be able to specify a) the maximum amount of dirty data caused by a process for an fd, and b) that I'm not going to redirty data that's being written, and that writeback should write back whenever it makes sense from a granularity perspective (i.e. optimize for efficient writes, without unnecessarily keeping data dirty till the next dirty_writeback_centisecs).<br> <p> </div>
Thu, 16 May 2019 02:14:17 +0000
O_SYNC https://lwn.net/Articles/788587/ TMM
<div class="FormattedComment"> That's what fadvise is for, it works pretty well.<br> </div>
Thu, 16 May 2019 00:34:26 +0000
O_SYNC https://lwn.net/Articles/788523/ andresfreund
<div class="FormattedComment"> If latency is more relevant than throughput there can be decent reasons to use O_SYNC. 
The fact that an fsync might later need to flush large amounts of data, possibly generated by different processes, can be quite problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are reached). <br> <p> <p> Really wish there were a good POSIX way to tell the kernel: 1) initiate writeback very soon, I'm not going to redirty; 2) I am / am not going to read that data again soon; 3) the amount of pending dirty data is limited.<br> <p> But without the perf overhead of O_SYNC (which often will end up generating many more random writes than necessary). One can get there using sync_file_range(WRITE) plus fadvise. But it's much harder than necessary.<br> <p> <p> </div>
Wed, 15 May 2019 15:41:46 +0000
O_SYNC https://lwn.net/Articles/788425/ rweikusat2
<div class="FormattedComment"> And then what? If it turns out to be unsupported and the application is supposed to cope with that, it'll need to fall back to fsync at runtime. Which means that this detour can be avoided by simply using fsync all the time.<br> <p> NB: I strongly suspect that a huge number of 'practical' O_SYNC uses will be entirely gratuitous performance drags because it's being employed as a "magic countermeasure" for occasional data corruption caused by writing through dangling pointers.<br> <p> </div>
Wed, 15 May 2019 13:31:34 +0000
O_SYNC https://lwn.net/Articles/788340/ jlayton
<div class="FormattedComment"> fcntl(F_GETFL, ...) ?<br> <p> ...though I'm not sure how accurate that is across filesystems.<br> </div>
Tue, 14 May 2019 14:20:33 +0000
O_SYNC https://lwn.net/Articles/788337/ epa
<div class="FormattedComment"> Can we patch up the broken open() interface with a system call that, given an open file descriptor, returns the flags that are in effect? 
So you could open() and then check that none of the flags you passed were ignored...<br> </div>
Tue, 14 May 2019 13:59:04 +0000
O_SYNC https://lwn.net/Articles/788334/ corbet
I believe that <tt>O_SYNC</tt> exactly illustrates his point. Since older kernels will simply ignore that flag, an application can never know that it's actually getting the behavior it's asking for. If you must know that your data has made it to media, you still have to call <tt>fsync()</tt> regardless of whether you opened with <tt>O_SYNC</tt>.
Tue, 14 May 2019 13:41:50 +0000
DAX semantics https://lwn.net/Articles/788332/ nix
<blockquote> Hellwig was emphatic that open() flags should never be used for data integrity purposes </blockquote> What are O_SYNC et al, if not that?
Tue, 14 May 2019 13:09:33 +0000
DAX semantics https://lwn.net/Articles/788311/ gutschke
<div class="FormattedComment"> $s/write/read/<br> </div>
Tue, 14 May 2019 01:13:40 +0000
DAX semantics https://lwn.net/Articles/788310/ gutschke
<div class="FormattedComment"> If regular file I/O and DAX semantics are mutually exclusive, then setting a flag on the file is going to be an administrative pain. All of a sudden, there is a file that cannot be accessed with "normal" UNIX tools. How about calling "file", "sed", "dd", "tar", or "cp" on a DAX file? These all seem reasonable administrative things to do -- at least whenever the file isn't concurrently opened in DAX mode.<br> <p> I think, from a usability point of view, having a flag on open() is a lot less surprising for any object that lives inside of the filesystem namespace. I do agree though, that files should lock out incompatible access modes, if they have already been opened in another mode. There is some precedent for that. open() already returns ETXTBSY in similar situations. 
Currently, that can only happen for write access, but it wouldn't be too far-fetched to also do so for write access of DAX files.<br> </div> Tue, 14 May 2019 01:04:35 +0000
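A minimal sketch of the pattern raised in the O_SYNC subthread above (open with O_SYNC, ask the kernel via fcntl(F_GETFL) which flags it recorded, and call fsync() after writing anyway), assuming Linux and Python's standard os and fcntl modules. The helper names open_sync, durable_write and demo, and the file name journal.log, are hypothetical. Note the caveat from the thread: a flag reported by F_GETFL shows only what was recorded for the open file description, not that the filesystem actually honours it, which is why the sketch does not rely on it.

```python
import fcntl
import os
import tempfile


def open_sync(path):
    """Open path for writing with O_SYNC; report whether F_GETFL shows the flag."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    # O_SYNC may be a multi-bit value (it includes O_DSYNC on Linux),
    # so compare against the full mask rather than testing for non-zero.
    sync_reported = (flags & os.O_SYNC) == os.O_SYNC
    return fd, sync_reported


def durable_write(fd, data):
    """Write all of data, then fsync() regardless of how fd was opened."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # write() may be partial; loop until done
        view = view[written:]
    os.fsync(fd)


def demo():
    """Hypothetical usage: write one record to a scratch file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.log")
        fd, sync_reported = open_sync(path)
        try:
            durable_write(fd, b"transaction 1\n")
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return sync_reported, f.read()
```

Calling fsync() unconditionally costs an extra system call when O_SYNC is in effect, but it keeps the durability guarantee independent of whether the flag was honoured.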