O_SYNC
Posted May 15, 2019 13:31 UTC (Wed) by rweikusat2 (subscriber, #117920)
In reply to: O_SYNC by epa
Parent article: DAX semantics
NB: I strongly suspect that a huge number of 'practical' O_SYNC uses are entirely gratuitous performance drags, because it's being employed as a "magic countermeasure" against occasional data corruption caused by writing through dangling pointers.
Posted May 15, 2019 15:41 UTC (Wed) by andresfreund (subscriber, #69562)
Really wish there were a good POSIX way to tell the kernel: 1) initiate writeback very soon, I'm not going to redirty; 2) I am / am not going to read that data again soon; 3) the amount of pending dirty data is limited.
But without the perf overhead of O_SYNC (which often will end up generating many more random writes than necessary). One can get there using sync_file_range(SYNC_FILE_RANGE_WRITE) plus fadvise. But it's much harder than necessary.
Posted May 16, 2019 0:34 UTC (Thu) by TMM (subscriber, #79398)
Posted May 16, 2019 2:14 UTC (Thu) by andresfreund (subscriber, #69562)
I referenced fadvise?
And no, it doesn't. You can't initiate writeback with it, without also causing the page cache contents to be thrown out. For that you need sync_file_range(SYNC_FILE_RANGE_WRITE).
And to not suck performance-wise you need to coalesce the sync_file_range() calls in userspace, and issue them over larger ranges of blocks, otherwise there'll be unnecessarily random IO. And no, just write()ing larger blocks isn't really a good solution either.
To me this is the kernel's job, and I should easily be able to specify a) the maximum amount of dirty data caused by a process for an fd, and b) that I'm not going to redirty data that's being written, and that writeback should happen whenever it makes sense from a granularity perspective (i.e. optimize for efficient writes, without unnecessarily keeping data dirty until the next dirty_writeback_centisecs).
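A rough sketch of the userspace coalescing described above, in Python. It assumes Linux with a libc that exports sync_file_range(). The names coalesce, start_writeback and max_gap are hypothetical, and since Python's os module has no wrapper for sync_file_range(), the call goes through ctypes. Treat it as an illustration under those assumptions, not as what any particular application does.

```python
import ctypes
import ctypes.util
import os

# Flag value from <linux/fs.h>; Linux-specific.
SYNC_FILE_RANGE_WRITE = 2


def coalesce(ranges, max_gap=0):
    """Merge (offset, length) ranges that overlap, touch, or lie within
    max_gap bytes of each other, so fewer and larger requests are issued."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous range instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


def start_writeback(fd, ranges, max_gap=0):
    """Ask the kernel to start writeback for the coalesced ranges.

    This only initiates writeback; it is not a data-integrity operation
    and does not flush the drive cache (fsync() is still needed for that).
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.sync_file_range.argtypes = [
        ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
    ]
    libc.sync_file_range.restype = ctypes.c_int
    for offset, nbytes in coalesce(ranges, max_gap):
        if libc.sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
```

A caller would record an (offset, length) pair after each write() and call start_writeback() once enough ranges have accumulated. Merging across a small gap is harmless here because only dirty pages in the range are written.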
Posted May 16, 2019 17:19 UTC (Thu) by rweikusat2 (subscriber, #117920)
Posted May 16, 2019 17:40 UTC (Thu) by andresfreund (subscriber, #69562)
Uh, yes, obviously? What does that have to do with what I wrote? You can have a lot of dirty data for a single fd?
Posted May 16, 2019 18:53 UTC (Thu) by rweikusat2 (subscriber, #117920)
Looks like a description of the sync system call (or rather, it looks less like something which doesn't describe sync than like something which doesn't describe fsync :->).
Posted May 16, 2019 20:04 UTC (Thu) by andresfreund (subscriber, #69562)
When using O_SYNC, or controlling writeback using sync_file_range(SYNC_FILE_RANGE_WRITE) etc, you prevent these latency type issues.
It's quite possible, although the Linux kernel has improved a fair bit in that regard in the last few years, to trigger individual read()s taking many seconds, because the kernel is busy flushing back a few gigabytes of dirty data.
I explained this because you said:
Posted May 16, 2019 20:38 UTC (Thu) by rweikusat2 (subscriber, #117920)
dirty_writeback_centisecs is a /proc file which can be used to change the time interval between automatically performed sync operations by the kernel. That's still sync and not fsync.
Posted May 17, 2019 4:48 UTC (Fri) by andresfreund (subscriber, #69562)
> dirty_writeback_centisecs is a /proc file
What on earth made you think I didn't know that?
> which can be used to change the time interval between automatically performed sync operations by the kernel. That's still sync and not fsync.
It's neither. Writeback triggered by those controls doesn't trigger data integrity flushes, but sync(1) and fsync() do.
> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor. This can be as little or as much as an individual application desires it to be as it's up to the application to call fsync whenever this makes sense.
Well, fsync is synchronous and needs to interact with the drive cache, so constantly issuing it is less performant than the alternative I mentioned of controlling writeback via sync_file_range() (and less efficient than what I wished for). And where have I doubted that one can also issue fsyncs to control the amount of outstanding data?
Posted May 17, 2019 17:28 UTC (Fri) by rweikusat2 (subscriber, #117920)
Why did you hack this sentence apart in order to insert a (wrong) conjecture about something I could have thought?
>> which can be used to change the time interval between automatically performed sync operations by the kernel. That's still sync and not fsync.
> It's neither.
Indeed. The name is obviously different. Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do. IOW, the combination "dirty_writeback_centisecs/fsync" makes no sense. For the case of a single process writing to a certain open file, the effects of O_SYNC are by definition (POSIX) identical to pairing each write(2) with an fsync(2) on the same file descriptor. As an additional benefit, fsync is more portable and it enables an application to batch several writes if so desired, e.g., when creating a file, write the complete file content and then fsync it.
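The two approaches compared above might look like this in Python. The function names are hypothetical and this is only a sketch: whether the two are strictly equivalent depends on the platform, and neither variant fsyncs the containing directory, which a newly created file may also need.

```python
import os


def _write_all(fd, data):
    """Call write() until every byte has been accepted."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def write_each_synced(path, chunks):
    """O_SYNC variant: every write() returns only after its data is stable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
    finally:
        os.close(fd)


def write_then_fsync(path, chunks):
    """Batched variant: buffered writes, then a single fsync() at the end."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
        os.fsync(fd)  # one synchronous flush covers all preceding writes
    finally:
        os.close(fd)
```

Both leave the same bytes in the file. The difference is how many synchronous flushes are paid for along the way.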
Posted May 21, 2019 23:10 UTC (Tue) by nix (subscriber, #2304)
Posted May 17, 2019 7:38 UTC (Fri) by mpr22 (subscriber, #60784)
Posted May 29, 2019 13:46 UTC (Wed) by Wol (subscriber, #4433)
THAT caused major grief at the ext3/4 transition, iirc. ext3 was "save your data - friendly" by accidental default. ext4 was not. ext3 was NOT sync-friendly, ext4 was.
So you had a choice - write an ext4-friendly program and watch performance go through the floor on ext3, or an ext3-friendly program and watch crashes corrupt your data. NOT nice.
I don't want to know what filesystem my application is running on. I don't want serious differences in behaviour based on that fact. And as someone who is interested in databases and logging and all that, I also probably don't care whether my data made it to disk okay or not, I just want to know that whatever is on disk is consistent!
So the ability to put a "write barrier" in the queue, between the "write the log" and "update the files", is actually important to me. If the log is not updated properly, then the transaction is lost. If the log is safe, and the files are not, I can recover the transaction. Either way, the data is consistent. But if the filesystem moves HALF the file updates before the log is properly saved, I'm screwed.
Cheers,
Wol
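The log-before-data ordering described above has no portable "barrier" call. A common stand-in is to fsync() the log before touching the data files, which costs a synchronous flush but gives the ordering. A minimal Python sketch follows; the commit function and the one-line-per-record log format are hypothetical, and replaying the log after a crash is left out.

```python
import os


def _write_all(fd, data):
    """Call write() until every byte has been accepted."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def commit(log_path, data_path, offset, payload):
    """Record the intended update durably, then apply it to the data file."""
    # Hypothetical record format: "<offset>:<hex payload>\n"
    record = b"%d:%s\n" % (offset, payload.hex().encode())

    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        _write_all(log_fd, record)
        # Stand-in for a write barrier: nothing below runs until the
        # log record has been reported as stable.
        os.fsync(log_fd)
    finally:
        os.close(log_fd)

    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(data_fd, offset, os.SEEK_SET)
        _write_all(data_fd, payload)
        # The data file may be flushed lazily; after a crash the log
        # can be replayed to redo this update.
    finally:
        os.close(data_fd)
```

If the crash happens before the fsync() returns, the transaction is simply lost. If it happens afterwards, the log holds enough to redo the update. Either way, what is on disk stays consistent.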
Posted May 21, 2019 23:03 UTC (Tue) by nix (subscriber, #2304)
Posted May 19, 2019 11:49 UTC (Sun) by epa (subscriber, #39769)
>> problematic. Using sync instead puts the cost at the writer, instead of a more or less random time later (i.e. when dirty limits are reached).
> Which means that this detour can be avoided by simply using fsync all the time.
etc. Which I think is not correct.
> Nevertheless, it refers to a timeout for periodically flushing all dirty page cache pages to stable storage, just like the sync but not the fsync system call would do
Except that's not what it does. Every dirty_writeback_centisecs, dirty inodes that became dirty longer ago than dirty_writeback_centisecs are flushed. Inodes that were dirtied more recently than that are *not* flushed. That's nothing like sync(), which flushes everything regardless of first dirty time, and you can't do it with fsync() either (even if you track the timing yourself, good luck with that) because there is extra behaviour regarding flushes that take too long which cannot be mimicked with fsync().
> It's outstanding modifications to a certain open file, this referring to a per-process data structure referred to by a file descriptor
That's not actually true (though in practice it usually is). It's outstanding modifications to a certain open file, this referring to a file description (not descriptor). This is not a single-process entity but can be shared among processes, all of which can be writing to the file description simultaneously: all that data needs synchronous flushing if O_SYNC is on, and the other writing processes will all be blocked if they try to write to it while that write is ongoing.
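The descriptor/description distinction can be seen from a single process. The Python sketch below uses a hypothetical helper, description_demo, and uses dup() as a stand-in for the sharing that fork() produces across processes: the duplicate shares the offset and the status flags (including O_SYNC), while a second open() of the same path gets its own description.

```python
import fcntl
import os


def description_demo(path):
    """Show that dup() shares an open file description but open() does not."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
    dup_fd = os.dup(fd)                    # same description as fd
    other_fd = os.open(path, os.O_WRONLY)  # new description, no O_SYNC
    try:
        os.write(fd, b"abc")  # advances the offset stored in the description

        def is_sync(f):
            flags = fcntl.fcntl(f, fcntl.F_GETFL)
            return (flags & os.O_SYNC) == os.O_SYNC

        return {
            "dup_offset": os.lseek(dup_fd, 0, os.SEEK_CUR),
            "other_offset": os.lseek(other_fd, 0, os.SEEK_CUR),
            "dup_is_sync": is_sync(dup_fd),
            "other_is_sync": is_sync(other_fd),
        }
    finally:
        for f in (fd, dup_fd, other_fd):
            os.close(f)
```

The duplicate reports the offset moved by the write made through the original descriptor, while the independently opened descriptor still sits at the start of the file.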