You still don't get my point, though. I'd agree that when all the writes
going on are just the system chewing to itself, all you need is
consistency across crashes.
But when the system has just written out my magnum opus, by damn I want
that to hit persistent storage right now! The fsync() should bypass all
other disk I/O as much as possible and hit the disk absolutely as fast as
it can. Slowing to disk speeds is fine: we're talking human reaction time
here, which is much slower. I don't care if writing out my tax records
takes five seconds, 'cos I just spent three hours slaving over them and five
seconds is nothing. But waiting behind a vast number of unimportant writes
(which were all asynchronous until our fsync() forced them out because of
filesystem infelicities) is not fine: if we have to wait for minutes for
our stuff to get out, we may as well have done an asynchronous write.
With btrfs, this golden vision of fast fsync() even under high disk write
load is possible. With ext*, it mostly isn't (you have to force earlier
stuff to the disk even if I don't give a damn about it and nobody ever
fsync()ed it), and in ext3 without data=writeback, fsync() is so slow when
contending with write loads that app developers were tempted to drop this
whole requirement and leave my magnum opus hanging about in transient
storage for many seconds. With ext4 at least fsync() doesn't stall my apps
merely because bloody Firefox decided to drop another 500MB hairball.
Again: I'm not interested in fsync() to prevent filesystem corruption
(that mostly doesn't happen, thanks to the journal, even if the power
suddenly goes out). I'm interested in saving *the contents of particular
files* that I just saved. If you're writing a book, and you save a
chapter, you care much more about preserving that chapter in case of power
failure than you care about some random FS corruption making off
with /usr/bin; fixing the latter is one reinstall away, but there's
nothing you can reinstall to get your data back.
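To make concrete what "saving the contents of a particular file" means from the application side, here is a minimal sketch in Python. The function name save_durably and the file mode are my own choices for illustration, not taken from any particular editor. It only shows the call sequence; how long the fsync() takes under write load is entirely up to the file system, which is the whole argument above.

```python
import os

def save_durably(path, data):
    """Write data (bytes) to path; return only after fsync() has succeeded.

    Hypothetical helper for illustration. It does not fsync the containing
    directory, so a newly created file name may still be lost in a crash.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data to persistent storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Something like save_durably("chapter7.txt", text.encode()) is the call whose latency I care about.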
Posted Nov 8, 2009 21:53 UTC (Sun) by anton (guest, #25547)
Sure, if the only thing you care about in a file system is that
fsync()s complete quickly and still hit the disk, use a file system
that gives you that.
OTOH, I care more about data consistency. If we want to combine
these two concerns, we get to some interesting design choices:
Committing the fsync()ed file before earlier writes to other files
would break the ordering guarantee that makes a file system good (of
course, we would only see this in the case of a crash between the time
of the fsync() and the next regular commit). If the file system wants
to preserve the write order, then fsync() pretty much becomes sync(),
i.e., the performance behaviour that you do not want.
One can argue that an application that uses fsync() knows what it
is doing, so it will do the fsync()s in an order that guarantees data
consistency for its data anyway.
Counterarguments: 1) The crash case probably has not been tested
extensively for this application, so it may have gotten the order of
fsync()s wrong and doing the fsync()s right away may compromise the
data consistency after all. 2) This application may interact with
others in a way that makes the ordering of its writes relative to the
others important; committing these writes in a different order opens a
data inconsistency window.
Depending on the write volume of the applications on the machine,
on the trust in the correctness of the fsync()s in all the
applications, and on the way the applications interact with the users,
the following are reasonable choices: 1) fsync() as sync (slowest); 2)
fsync() as out-of-order commit; 3) fsync() as noop.
BTW, I find your motivating example still unconvincing: If you edit
your magnum opus or your tax records, wouldn't you use an editor
that autosaves regularly? Ok, your editor does not fsync() the
autosaves, so with a bad file system you will lose the work, but on a
good file system you won't, so you will also use a good file system
for that, won't you? So it does not really matter how long you
slaved away on the file; a crash will only lose very little data. Or
if you work in a way that can lose everything, why were the tax records
not important enough to merit more precautions after 2h59', when after
3h a fast fsync() is more important than anything else?
An example where a synchronous commit is really needed is a remote
"cvs commit" (and maybe similar operations in other version control
systems): Once a file is committed on the remote machine, the file's
version number is updated on the local machine, so the remote commit
had better stay committed, even if the remote machine crashes in the
meantime. Of course, the problem here is that a cvs commit can easily
commit hundreds of files; if it fsync()s every one of them separately,
the cumulative waiting for the disk may be quite noticeable. Doing the
equivalent for all the files at once could be faster, but we have no
good way to tell that to the file system (AFAIK CVS works a file at a
time, so it wouldn't matter for CVS, but there may be other
applications where it does). Hmm, if there are few writes by other
applications at the same time, and all the fsync()s were done at the
end, then fsync()-as-sync could be faster than out-of-order fsync()s:
The first fsync() would commit all the files, and the other fsync()s
would just return immediately.
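The "all the fsync()s at the end" pattern could be sketched like this in Python. The function name write_then_fsync_all and the dict-of-paths interface are made up for the example, and this is not how CVS actually works. Whether the later fsync()s really return immediately depends on the file system, as discussed above; the code only shows the ordering of the calls.

```python
import os

def write_then_fsync_all(files):
    """files: dict mapping path -> bytes. Returns the number of files synced.

    Hypothetical helper: write every file first, then fsync() each one.
    Keeps all descriptors open until the end, so a very large batch could
    run into the per-process open-file limit.
    """
    fds = []
    try:
        # Phase 1: issue all the writes without waiting for the disk.
        for path, data in files.items():
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            view = memoryview(data)
            while view:
                view = view[os.write(fd, view):]
        # Phase 2: fsync() everything at the end.
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
    return len(fds)
```

On a file system that treats fsync() as sync(), the first call in phase 2 would do nearly all the waiting.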