what's needed is a application barrier() call

Posted Sep 9, 2009 20:53 UTC (Wed) by dlang (guest, #313)
In reply to: what's needed is a application barrier() call by vaurora
Parent article: POSIX v. reality: A position on O_PONIES

I'm not proposing full transactions (which to me imply rollback if the transaction is aborted)

simply

write1 write2 write3 barrier write4 write5

will guarantee that writes 1-3 will hit the disk before writes 4 and 5 but says nothing about the ordering or timeing of the two seperate sets.

this is _much_ easier to implement.

what's needed is a application barrier() call

Posted Sep 9, 2009 21:49 UTC (Wed) by njs (subscriber, #40338) [Link] (5 responses)

Yes, that sort of ordering is what she's talking about. (The Featherstitch papers are worth checking out.)

IIUC, the trickiness in implementing it is that you need to keep track of which writes depend on which other writes, and which intermediate states are allowed, but then you have to keep non-dependent writes as disentangled as possible, to avoid the slowdowns and soft-updates craziness that she describes in the original article -- but you can't be *too* clever about it, or your accounting overhead will become a bottleneck.

Featherstitch's solution involves careful optimizations and giant graphs in kernel memory.

what's needed is a application barrier() call

Posted Sep 9, 2009 23:44 UTC (Wed) by dlang (guest, #313) [Link] (4 responses)

my gut feeling is that they are trying too hard.

if you have a way to prevent memory blocks that have been submitted for I/O from being changed before the I/O is completed this is just a matter of prohibiting reordering in the device stack.

this would be overkill for what is being asked for (the user cares about ordering the changes to their file, this would order changes to the entire device), but it would do the job without having to worry about tracing dependancies

if the device stack were to mark all buffers it has pending as COW when it gets a barrier() call, this doesn't seem that hard to do.

now, it does open up the possibility of running out of memory and having problems writing something to swap (until the currently pending writes complete), but if you do the barrier on a per-partition basis this shouldn't be that bad (this assumes that the device stack can easily tell what partition pending writes would go to)

what's needed is a application barrier() call

Posted Sep 10, 2009 6:51 UTC (Thu) by njs (subscriber, #40338) [Link] (3 responses)

I dunno, I'm not an expert, though "prohibiting reordering in the device stack" raises a big red flag for me -- reordering is pretty much the entire way you get speed out of spinning-disk hard drives!

But whether my analysis is accurately pin-pointing the problem or not, with all respect, I think I'll trust Val's word over yours that there *is* a problem :-).

what's needed is a application barrier() call

Posted Sep 10, 2009 7:01 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

it's only prohibiting reordering around the barrier, not all reordering.

Val is definantly right about the difficulty in doing it the best possible way, as I noted my approach would run some possibility of the COW causing out of memory problems, but I suspect that it's a 90% solution.

what's needed is a application barrier() call

Posted Sep 10, 2009 7:35 UTC (Thu) by njs (subscriber, #40338) [Link] (1 responses)

But now you're synchronizing all writes by all processes everywhere on the disk, not just the two blocks that actually got written to in this one file somewhere. If apps used that sort of barrier() too often then I think you'd end up with disk throughput that looks like you mounted with -o sync -- very little reordering would ever be allowed.

I guess this is complicated by the question of when dirty blocks get flushed, and in how large batches; maybe it's solvable. But I don't think memory is the main concern, at least.

what's needed is a application barrier() call

Posted Sep 10, 2009 7:48 UTC (Thu) by dlang (guest, #313) [Link]

you are absolutely correct, that's why I say it's only a 90% solution, but it may be something that gives you most of the benefit for a fraction of the effort.

an alternative to the application barrier() call

Posted Sep 11, 2009 15:05 UTC (Fri) by anton (subscriber, #25547) [Link] (9 responses)

write1 write2 write3 barrier write4 write5
will guarantee that writes 1-3 will hit the disk before writes 4 and 5 but says nothing about the ordering or timeing of the two seperate sets.

An alternative would be to just extend POSIX logical ordering guarantees (as visible by other processes) to the post-recovery state. That would mean that the file system would implicitly put a barrier between any of the writes in your example.

The question is: how much would this guarantee cost compared to what you have in mind? In a copy-on-write filesystem it could cost very little, if anything. The file system could still perform the user writes in any order (all of them, not just a subset), but just would never commit a write for which the earlier writes have not been performed yet. For journaled file systems the reasoning is more complex, but I believe that in the usual case (writing new data) the cost is also very small.

The benefits of this guarantee are that it makes programming easier, and especially testing easier: If your files are always consistent as seen by other processes, they will also be consistent in case of a crash or power outage; no need to pull the power plug in order to test the crash resilience of your application.

an alternative to the application barrier() call

Posted Sep 11, 2009 16:29 UTC (Fri) by dlang (guest, #313) [Link] (8 responses)

you don't want to put an implicit barrier between any two writes because it would prevent a lot of very useful write merging and reordering (so the performance cost would be very high)

since most writes are less than a sector, multiple writes would be even more expensive for a COW system

An alternative to the application barrier() call

Posted Sep 11, 2009 17:06 UTC (Fri) by anton (subscriber, #25547) [Link] (6 responses)

Barriers certainly don't prevent write merging. Why would they? A barrier just means that logically later writes are not committed before logically earlier writes, but they can become visible at the same time. So you can merge as many writes across barriers as you want.

Ok, your formulation of barriers exclude the same-time option, but apart from the lower performance, how could an external observer tell whether two logical writes happened one after another or at the same time? Once they are both committed, there is no difference.

As my posting explains, they also don't prevent reordering of physical data writes, they only restrict which sets of writes are committed by a commit.

Multiple small writes can be merged together into a large one.

BTW, most writes probably happen through libc buffers, and are typically larger than one sector (unless most of your files are smaller than one sector).

An alternative to the application barrier() call

Posted Sep 11, 2009 19:26 UTC (Fri) by dlang (guest, #313) [Link] (5 responses)

I have seen a LOT of code that does

work...work
write one line, or a couple words of a line
work..work
write a little more
etc

enforcing a barrier between all of these writes would kill you

remember that you don't know the storage stack below you, what you submit as one write may be broken up into multiple writes, and you have no guarantee of what order those multiple writes could be done in (think a raid array where your write spans drives as one example)

as a result a barrier needs to prohibit merging across the barrier as well as just reordering across the barrier.

An alternative to the application barrier() call

Posted Sep 13, 2009 17:46 UTC (Sun) by anton (subscriber, #25547) [Link] (4 responses)

Code that writes a few characters here and a few characters there usually uses the FILE * based interface, which performs user-space buffering and then typically performs write() (or somesuch) calls of 4k or 8k at a time; just strace one of these programs. That's done to reduce the system call overhead. But even if such programs perform a write() for each of the application writes, having barriers between each of them does not kill performance, because a sequence of such writes can be merged.

Concerning the block device below, if that does not heed the block device barriers or other block device ordering mechanisms that the file system requests, then you get no guarantee at all of any consistency on crash/power failure. It's not just that merged writes won't work, your style of merge-preventing barriers won't work, either, and neither will the guarantees that fsync()/fdatasync are supposed to provide; that's because all of them require that the block device ordering mechanism(s) that the file system uses actually work, and all of them will produce inconsistent states if the writes happen in an order that violates the ordering requests. So, if you want any consistency guarentees at all, you need an appropriate block device, and then you can implement mergeable writes just as well as anything else.

As for an array where a write spans drives, implementing a barrier or other ordering mechanism on the array level certainly requires something more involved than just doing barriers on the individual block devices, but the device has to provide these facilities, or you can forget about crash consistency on that device (i.e., just don't use it).

An alternative to the application barrier() call

Posted Sep 13, 2009 20:23 UTC (Sun) by dlang (guest, #313) [Link] (3 responses)

my point is that enforcing a barrier through all these layers can be expensive (on a multi-disk array you would need to make sure that one disk has completed it's work before submitting the write to the next disk)

this isn't always needed, so don't try to do it for every write (and I've straced a lot of code that does lots of wuite() calls)

do it when the programmer says that it's important. 99+% of the time it won't be (the result is not significantly more usable after a crash with part of the file if it's not all there, or this really is performance sensitive enought to risk it)

you would be amazed at the amount of risk that people are willing to take to get performance. talk to the database gurus at MySQL or postgres about the number of people they see disabling f*sync on production databases in the name of speed.

An alternative to the application barrier() call

Posted Sep 14, 2009 22:16 UTC (Mon) by anton (subscriber, #25547) [Link] (2 responses)

Fortunately writes on the file system level can be merged across file system barriers, resulting in few barriers that have to be passed to the block device level. So there is no need to pass a block device barrier down for every file system barrier.

And since it is possible to implement these implicit barriers between each write efficiently (by merging writes), why burden programmers with inserting explicit file system barriers? Look at how long the Linux kernel hackers needed to use block device barriers in the file system code. Do you really expect application developers to do it at all? And if they did, how would they test it? This has the same untestability properties as asking application programmers to use fsync.

Concerning the risk-loving performance freaks, they will use the latest and greatest file system by Ted T'so instead of one that offers either implicit or explicit barriers, but of course they will not use fsync() on that file system:-).

BTW, if you also implement block device writes by avoiding overwriting live sectors and by using commit sectors, then you can implement mergeable writes at the block device level, too (e.g., for making them cheaper in an array). However, the file system will not request a block device barrier often, so there is no need to go to such complexity (unless you need it for other purposes, such as when your block device is a flash device).

An alternative to the application barrier() call

Posted Sep 20, 2009 5:22 UTC (Sun) by runekock (subscriber, #50229) [Link] (1 responses)

> Fortunately writes on the file system level can be merged across file system barriers, resulting in few barriers that have to be passed to the block device level.

But what about eliminating repeated writes to the same place? Take this contrived example:

repeat 1000 times:
write first byte of file A
write first byte of file B

A COW file system may well be able to merge the writes, but it would require a lot of intelligence for it to see that most of the writes could actually be skipped. And a traditional file system would be even worse off.

An alternative to the application barrier() call

Posted Sep 20, 2009 18:38 UTC (Sun) by anton (subscriber, #25547) [Link]

For a copy-on-write file system that example would be easy: Do all the writes in memory (in proper order), and when the system decides that it's time to commit the stuff to disk, just do a commit of the new logical state to disk (e.g., by writing the first block each of file A and file B and the respective metadata to new locations, and finally a commit sector that makes the new on-disk state visible.

An update-in-place file system (without journal) would indeed have to perform all the writes in order to have the on-disk state reflect one of the logical POSIX states at all times (assuming that there are no repeating patterns in the two values that are written; if there are, it is theoretically possible to skip the writes between two equal states).

an alternative to the application barrier() call

Posted Sep 12, 2009 0:39 UTC (Sat) by spitzak (guest, #4593) [Link]

barriers do not force write ordering. If you do "write 1, barrier, write 2", all that is required is that anybody looking at the file will see *one* of these three states: nothing, write 1, or write 1+2. But that does not imply that all three states have to somehow exist, if only nothing,write1+2 ever exist, you have fulfilled the requirements.

An actual implementation may be "allocate temporary space, write 2, write 1, make the file point at temporary space". Notice that write 2 is done BEFORE write 1, but we have fulfilled the requirements of barrier.