O_SYNC

Posted May 16, 2019 17:40 UTC (Thu) by andresfreund (subscriber, #69562)
In reply to: O_SYNC by rweikusat2
Parent article: DAX semantics

> Fsync is a system call which flushes all outstanding writes for a file referred to by a certain file descriptor. That's not the same as the sync system call.

Uh, yes, obviously? What does that have to do with what I wrote? You can have a lot of dirty data for a single fd?

O_SYNC

Posted May 16, 2019 18:53 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (8 responses)

>> The fact that an fsync might later need to flush large amounts of data, possibly generated by different processes, can be quite
>> problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are
>> reached).

Looks like a description of the sync system call (or rather, it looks less like something which doesn't describe sync than like something which doesn't describe fsync :->).

O_SYNC

Posted May 16, 2019 20:04 UTC (Thu) by andresfreund (subscriber, #69562) [Link] (7 responses)

I was just talking about O_SYNC, the topic of this subthread (and referenced a few words before and after). I'm on a phone, which makes capitalizing words a pain. If you use O_SYNC, the amount of dirty data that later needs to be written back (after dirty_{writeback_centisecs, background_bytes, bytes, ...} or fsync) is pretty limited. You therefore don't incur latency at the later stage where an fsync() might need to write back huge amounts of data, or where write() might suddenly block (due to dirty_bytes), or where read() will be slow because dirty_writeback_centisecs/fsync triggered a lot of IO submissions in a very short amount of time.

When using O_SYNC, or controlling writeback using sync_file_range(SYNC_FILE_RANGE_WRITE) etc., you prevent these kinds of latency issues.
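
A minimal sketch of the second approach, assuming Linux and a libc that exports sync_file_range(). Python's os module has no wrapper for that call, so it is reached through ctypes here. The helper names, the hard-coded flag value, and the fallback to fsync() are assumptions made for illustration, not part of any existing API. The idea is simply to ask for writeback of each chunk right after writing it, so that the final fsync() has little left to do.

```python
import ctypes
import os

# Flag value as defined in <linux/fs.h>; hard-coded here (assumption) because
# Python's os module does not export it.
SYNC_FILE_RANGE_WRITE = 2


def start_writeback(fd, offset, nbytes):
    """Ask the kernel to start writing back a byte range of fd.

    Returns the name of the mechanism used. Falls back to os.fsync() when
    sync_file_range() cannot be reached or reports an error.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fn = libc.sync_file_range
    except (OSError, AttributeError, TypeError):
        os.fsync(fd)
        return "fsync"
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    if fn(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0:
        os.fsync(fd)
        return "fsync"
    return "sync_file_range"


def write_with_bounded_dirty_data(path, chunks):
    """Write chunks to path, requesting writeback after each one.

    Returns (total bytes written, set of mechanisms used).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    methods = set()
    offset = 0
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while len(view) > 0:
                written = os.write(fd, view)
                methods.add(start_writeback(fd, offset, written))
                offset += written
                view = view[written:]
        # Final data-integrity flush; most data should already be on its way.
        os.fsync(fd)
    finally:
        os.close(fd)
    return offset, methods
```

SYNC_FILE_RANGE_WRITE on its own only initiates writeback and gives no durability guarantee, which is why the sketch still ends with an fsync().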

It's quite possible, although the Linux kernel has improved a fair bit in that regard in the last few years, to trigger individual read()s taking many seconds, because the kernel is busy flushing back a few gigabytes of dirty data.

I explained this because you said:
> Which means that this detour can be avoided by simply using fsync all the time.
etc. Which I think is not correct.

O_SYNC

Posted May 16, 2019 20:38 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (6 responses)

When using fsync, the amount of data which will be written back is just as limited: It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.

dirty_writeback_centisecs is a /proc file which can be used to change the time interval between sync operations automatically performed by the kernel. That's still sync and not fsync.

O_SYNC

Posted May 17, 2019 4:48 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (2 responses)

You appear to be intentionally misunderstanding, and/or you're making entirely unrelated statements.

> dirty_writeback_centisecs is a /proc file

What on earth made you think I didn't know that?

> which can be used to change the time interval between sync operations automatically performed by the kernel. That's still sync and not fsync.

It's neither. Writeback triggered by those controls doesn't trigger data integrity flushes, but sync(1) and fsync() do.

> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.

Well, fsync is synchronous and needs to interact with the drive cache, so constantly emitting it is less performant than the alternative I mentioned of controlling writeback via sync_file_range() (and less efficient than what I wished for). And where have I doubted that one can also issue fsyncs to control the amount of outstanding data?

O_SYNC

Posted May 17, 2019 17:28 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

>> dirty_writeback_centisecs is a /proc file
> What on earth made you think I didn't know that?

Why did you hack this sentence apart in order to insert a (wrong) conjecture about something I could have thought?

>> which can be used to change the time interval between sync operations automatically performed by the kernel. That's
>> still sync and not fsync.
> It's neither.

Indeed. The name is obviously different. Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do. IOW, the combination "dirty_writeback_centisecs/fsync" makes no sense. For the case of a single process writing to a certain open file, the effects of O_SYNC are by definition (POSIX) identical to pairing each write(2) with an fsync(2) on the same file descriptor. As an additional benefit, fsync is more portable and it enables applications to batch several writes if so desired, e.g., when creating a file, write the complete file content and then fsync it.
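
As a rough sketch of the two patterns being compared, assuming a POSIX-like system where Python's os.O_SYNC is available: the function names are made up for illustration, and short writes are ignored for brevity.

```python
import os


def write_each_synchronously(path, records):
    """Open with O_SYNC: every write() returns only after its data is flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        for record in records:
            os.write(fd, record)  # each call pays the flush cost
    finally:
        os.close(fd)


def write_batch_then_fsync(path, records):
    """Open without O_SYNC: write everything, then flush once with fsync()."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        for record in records:
            os.write(fd, record)  # data may sit in the page cache for now
        os.fsync(fd)  # one flush for the whole batch
    finally:
        os.close(fd)
```

Both leave the same bytes in the file; they differ in when the flush cost is paid and in how much dirty data can accumulate in between.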

O_SYNC

Posted May 21, 2019 23:10 UTC (Tue) by nix (subscriber, #2304) [Link]

> Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do

Except that's not what it does. Every dirty_writeback_centisecs, dirty inodes that became dirty longer ago than dirty_writeback_centisecs are flushed. Inodes that were dirtied more recently than that are *not* flushed. That's nothing like sync(), which flushes everything regardless of first dirty time, and you can't do it with fsync() either (even if you track the timing yourself, good luck with that) because there is extra behaviour regarding flushes that take too long which cannot be mimicked with fsync().
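
A toy model of that difference, following the rule exactly as stated in this comment. This is not kernel code: the function names and the dictionary of first-dirty timestamps are invented for illustration, and the extra behaviour for flushes that take too long is not modelled at all.

```python
def due_for_periodic_writeback(first_dirtied, now, interval):
    """Inodes a periodic wakeup at time `now` would pick in this toy model.

    first_dirtied maps an inode name to the time (in seconds) it first became
    dirty; only inodes dirty for at least `interval` seconds are selected.
    """
    return sorted(
        name for name, dirtied_at in first_dirtied.items()
        if now - dirtied_at >= interval
    )


def due_for_sync(first_dirtied):
    """sync() takes everything that is dirty, regardless of age."""
    return sorted(first_dirtied)


if __name__ == "__main__":
    dirty = {"a": 0.0, "b": 4.0, "c": 9.0}
    print(due_for_periodic_writeback(dirty, now=10.0, interval=5.0))
    print(due_for_sync(dirty))
```

With those sample timestamps, "c" has only been dirty for one second at the wakeup, so the periodic pass skips it while sync() would still flush it.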

O_SYNC

Posted May 17, 2019 7:38 UTC (Fri) by mpr22 (subscriber, #60784) [Link] (1 responses)

Depending on the behaviour of the underlying filesystem, calling fsync() can reify rather more write()s than just the ones to the fd you're fsync()ing.

O_SYNC

Posted May 29, 2019 13:46 UTC (Wed) by Wol (subscriber, #4433) [Link]

Applications should not have to concern themselves with filesystem-specific code.

THAT caused major grief at the ext3/4 transition, iirc. ext3 was "save your data - friendly" by accidental default. ext4 was not. ext3 was NOT sync-friendly, ext4 was.

So you had a choice - write an ext4-friendly program and watch performance go through the floor on ext3, or an ext3-friendly program and watch crashes corrupt your data. NOT nice.

I don't want to know what filesystem my application is running on. I don't want serious differences in behaviour based on that fact. And as someone who is interested in databases and logging and all that, I also probably don't care whether my data made it to disk okay or not, I just want to know that whatever is on disk is consistent!

So the ability to put a "write barrier" in the queue, between the "write the log" and "update the files", is actually important to me. If the log is not updated properly, then the transaction is lost. If the log is safe, and the files are not, I can recover the transaction. Either way, the data is consistent. But if the filesystem moves HALF the file updates before the log is properly saved, I'm screwed.
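
As far as I know there is no portable call that gives an application a pure ordering barrier like this, so the usual stand-in is to fsync() the log before touching the files, paying for a full synchronous flush just to get the ordering. A minimal sketch of that pattern, with made-up function and file names; replacing the data file's contents stands in for "update the files", and syncing the containing directory after creating the log is left out.

```python
import os


def commit(log_path, data_path, record):
    """Log the record durably first, and only then update the data file."""
    # Step 1: append to the log and force it to stable storage.
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(log_fd, record + b"\n")
        os.fsync(log_fd)  # acts as the "barrier": nothing below runs before this returns
    finally:
        os.close(log_fd)

    # Step 2: the log is safe, so the file update may now reach disk in any order.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(data_fd, record)
    finally:
        os.close(data_fd)
```

If a crash hits during step 2, the record is still in the log and the update can be replayed; if it hits during step 1, the data file has not been touched yet.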

Cheers,
Wol

O_SYNC

Posted May 21, 2019 23:03 UTC (Tue) by nix (subscriber, #2304) [Link]

> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor

That's not actually true (though in practice it usually is). It's outstanding modifications to a certain open file, this referring to a file description (not descriptor). This is not a single-process entity but can be shared among processes, all of which can be writing to the file description simultaneously: all that data needs synchronous flushing if O_SYNC is on, and the other writing processes will all be blocked if they try to write to it while that write is ongoing.
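
A small sketch of the sharing, assuming a Unix system where os.fork() is available; the function name is made up for illustration. The child inherits the descriptor, writes through it, and the parent then sees the file offset moved, because both descriptors refer to the same open file description (which also carries the O_SYNC flag).

```python
import os


def child_write_moves_parent_offset(path):
    """Return the parent's file offset before and after a child writes."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        before = os.lseek(fd, 0, os.SEEK_CUR)
        pid = os.fork()
        if pid == 0:
            # Child: write through the inherited descriptor, then exit at once.
            status = 1
            try:
                os.write(fd, b"child")
                status = 0
            finally:
                os._exit(status)
        os.waitpid(pid, 0)
        # Parent: the offset lives in the shared file description, so the
        # child's write is visible here without the parent having written.
        after = os.lseek(fd, 0, os.SEEK_CUR)
        return before, after
    finally:
        os.close(fd)
```

The sketch only shows that the state is shared; it does not try to demonstrate concurrent writers blocking each other.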