
JLS2009: A Btrfs update

Posted Nov 1, 2009 20:01 UTC (Sun) by anton (subscriber, #25547)
In reply to: JLS2009: A Btrfs update by nix
Parent article: JLS2009: A Btrfs update

An fsync() that really synchronously writes to the disk is always going to be slow, because it makes the program wait for the disk(s). And with a good file system it's completely unnecessary for an application like an editor; editors just call it as a workaround for bad file systems.



JLS2009: A Btrfs update

Posted Nov 1, 2009 20:32 UTC (Sun) by nix (subscriber, #2304) [Link]

So, er, you're suggesting that a good filesystem, what, calls sync() every
second? I can't see any way in which you could get the guarantees fsync()
does for files you really care about without paying some kind of price for
it in latency for those files.

And I don't really like the idea of calling sync() every second (or every
five, thank you ext3).

Being able to fsync() the important stuff *without* forcing everything
else to disk, like btrfs promises, seems very nice. Now my editor files
can be fsync()ed without also requiring me to wait for a few hundred Mb of
who-knows-what breadcrumb crud from FF to also be synced.

JLS2009: A Btrfs update

Posted Nov 1, 2009 21:37 UTC (Sun) by anton (subscriber, #25547) [Link]

In a good file system, the state after recovery is the logical state of the file system of some point in time (typically a few seconds) before the crash. It's possible to implement that efficiently (especially in a copy-on-write file system).

For an editor that does not fsync(), that would mean that you lose a few seconds of work (at worst the few seconds between autosaves plus the few seconds that the file system delays writing).

For every application (including editors), it would mean that if the developers ensure the consistency of the persistent data in case of a process kill, they will also have ensured it in case of a system crash or power outage. So they do not have to do extra work on consistency against crashes, which would also be extremely impractical to test.

It should not be too hard to turn Btrfs into a good file system. Unfortunately, the Linux file systems seem to regress into the dark ages (well, the 1980s) when it comes to data consistency (e.g., in the defaults for ext3). And some things I have read from Chris Mason lead me to believe that Btrfs will be no better.

As for the guarantees that fsync() gives, it gives no useful guarantee. It's just a prayer to the file system, and most file systems actually listen to this prayer in more or less the way you expect; but some require more prayers than others, and some ignore the prayer. I wonder why Ted Ts'o does not apologize for implementing fsync() in a somewhat useful way instead of the fastest way that still satisfies the letter of the POSIX specification.

JLS2009: A Btrfs update

Posted Nov 2, 2009 0:22 UTC (Mon) by nix (subscriber, #2304) [Link]

Ah, you're assuming an editor that saves the state of the program on
almost every keystroke plus a filesystem that preserves *some* consistent
state, but not necessarily the most recent one.

In that case, I agree: editors should not fsync() their autosave state if
they're preserving it every keystroke or so (and the filesystem should not
destroy the contents of the autosave file: thankfully neither ext3 nor
ext4 do so, now that they recognize rename() as implying a
block-allocation ordering barrier). But I certainly don't agree that
editors shouldn't fsync() files *when you explicitly asked it to save
them*! No, I don't think it's acceptable to lose work, even a few seconds'
work, after I tell an editor 'save now dammit'. That's what 'save'
*means*.
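The save path nix is describing, with the rename() ordering behaviour mentioned above, is the classic write-to-temp, fsync, rename pattern. A minimal sketch (the temp-file naming and helper name are illustrative, not from any particular editor):

```python
import os

def atomic_save(path, data):
    """Save `data` to `path` so a crash leaves either the complete old
    contents or the complete new contents, never a torn mixture."""
    tmp = path + ".tmp"              # hypothetical temp name for illustration
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                    # push Python's buffer into the kernel
        os.fsync(f.fileno())         # force the new file's blocks to disk
    os.rename(tmp, path)             # atomically replace the old file
    # To make the rename itself durable, fsync the containing directory:
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

The fsync() before the rename is exactly the step an editor can skip on a file system that already orders data writes before the rename commit; on one that doesn't, skipping it risks renaming a zero-length file into place.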

And that's why Ted's gone to some lengths to make fsync() fast in ext4:
because he wants people to actually *use* it.

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:09 UTC (Mon) by anton (subscriber, #25547) [Link]

Using fsync() does not prevent losing a bit of work when you press save, because the system can crash between the time when you hit save and when the application actually calls and completes fsync(). The only thing that fsync() buys is that the save takes longer, and once it's finished and the application lets you work again, you won't lose the few seconds. That may be worth the cost for you, but I wonder why?

As for Ted Ts'o, I would have preferred it if he went to some lengths to make ext4 a good file system; then fsync() would not be needed as much. Hmm, makes me wonder if he made fsync() fast because ext4 is bad, or if he made ext4 bad in order to encourage use of fsync().

JLS2009: A Btrfs update

Posted Nov 2, 2009 8:37 UTC (Mon) by njs (guest, #40338) [Link]

> In a good file system, the state after recovery is the logical state of the file system of some point in time (typically a few seconds) before the crash. It's possible to implement that efficiently (especially in a copy-on-write file system).

[Citation needed] -- or in other words, if this is so possible, why are no modern filesystem experts working on it, AFAICT? How are you going to be efficient when the requirement you stated requires that arbitrary requests be handled in serial order, forcing you to wait for disk seek latencies?

> I wonder why Ted Ts'o does not apologize for implementing fsync() in a somewhat useful way instead of the fastest way that still satisfies the letter of the POSIX specification.

Err, why should he apologize for implementing things in a useful way?

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:34 UTC (Mon) by anton (subscriber, #25547) [Link]

> [Citation needed]

Yes, I have wanted to write this down for some time. Real soon now, promised!

> if this is so possible, why are no modern filesystem experts working on it, AFAICT?

Maybe they are, or they consider it a solved problem and have moved on to other challenges. As for those file system experts that we read about on LWN (e.g., Ted Ts'o), they are not modern as far as data consistency is concerned; instead they are regressing to the 1980s. And they are so stuck in that mindset that they don't see the need for something better. Probably something like: "Sonny, when we were young, we did not need data consistency from the file system; and if fsync() was good enough for us, it's certainly good enough for you!"

> Err, why should [Ted Ts'o] apologize for implementing things in a useful way?

He has done so before.
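anton's claim that only the commit record needs ordering, not the individual writes, can be illustrated with a toy copy-on-write store (the class and file names here are invented for illustration; real file systems do this with block pointers and barriers, not JSON manifests):

```python
import os, json

class ToyCowStore:
    """Toy copy-on-write store: new file versions go into fresh block
    files, written in any order; a commit is a single atomic rename of a
    small manifest pointing at the current blocks.  A crash before the
    rename leaves the previous manifest, i.e. the previous consistent
    state, intact."""

    def __init__(self, root):
        self.root = root
        self.seq = 0
        self.pending = {}            # name -> block file of uncommitted version
        os.makedirs(root, exist_ok=True)

    def write(self, name, data):
        self.seq += 1
        block = os.path.join(self.root, "block.%d" % self.seq)
        with open(block, "wb") as f:
            f.write(data)            # data blocks may reach disk in any order
        self.pending[name] = block

    def commit(self):
        manifest = self._load()
        manifest.update(self.pending)
        tmp = os.path.join(self.root, "manifest.tmp")
        with open(tmp, "w") as f:
            json.dump(manifest, f)
            f.flush()
            os.fsync(f.fileno())     # all writes are durable before...
        os.rename(tmp, os.path.join(self.root, "manifest"))  # ...one atomic publish
        self.pending = {}

    def _load(self):
        path = os.path.join(self.root, "manifest")
        if not os.path.exists(path):
            return {}
        with open(path) as f:
            return json.load(f)

    def read(self, name):
        with open(self._load()[name], "rb") as f:
            return f.read()
```

The one ordering constraint is the comment in commit(): everything must be on disk before the manifest rename; between commits, the block writes are free to be reordered.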

JLS2009: A Btrfs update

Posted Nov 3, 2009 19:56 UTC (Tue) by nix (subscriber, #2304) [Link]

That was an apology for introducing appalling latencies, not an apology
for doing things right.

I find it odd that one minute you're complaining that filesystems are
useless because problems occur if you don't fsync(), then the next moment
you're complaining that it's too slow, then the next moment you're
complaining about the precise opposite.

If you want the total guarantees you're aiming for, write an FS atop a
relational database. You *will* experience an enormous slowdown. This is
why all such filesystems (and there have been a few) have tanked: crashes
are rare enough that basically everyone is willing to trade off the chance
of a little rare corruption against a huge speedup all the time. (I can't
remember the last time I had massive filesystem corruption due to power
loss or system crashes. I've had filesystem corruption due to buggy drive
firmware, and filesystem corruption due to electrical storms... but
neither of these would be cured by your magic all-consistent filesystem,
because in both cases the drive wasn't writing what it was asked to write.
And *that* is more common than the sort of thing you're agonizing over. In
fact it seems to be getting more common all the time.)

JLS2009: A Btrfs update

Posted Nov 5, 2009 13:49 UTC (Thu) by anton (subscriber, #25547) [Link]

I understood Ted Ts'o's apology as follows: he thinks that applications should use fsync() in lots of places; by contributing to a better file system in which fsync() is not needed as much, he let application developers go unpunished by the file system (as, in his opinion, they should be punished), and he apologized for spoiling them in this way.

> I find it odd that one minute you're complaining that filesystems are useless because problems occur if you don't fsync(), then the next moment you're complaining that it's too slow, then the next moment you're complaining about the precise opposite.

Are you confusing me with someone else, are you trying to put up a straw man, or was my position so hard to understand? Anyway, here it is again:

On data consistency: a good file system guarantees good data consistency across crashes without needing fsync() or any other prayers (unless synchronous persistence is also required).

On fsync(): a useful implementation of fsync() requires a disk access, and the application waits for it, so it slows the application down from CPU speeds to disk speeds. If the file system provides no data consistency guarantees and applications compensate for that with extensive use of fsync() (the situation that Ted Ts'o strives for), the overall system will be slow because of all these required synchronous disk accesses. With a good file system, where most applications don't need to fsync() all the time, the overall system will be faster.

Your relational database file system is a straw man; I hope you had good fun beating it up.

If crashes are as irrelevant as you claim, why should anybody use fsync()? And why are you and Ted Ts'o agonizing over fsync() speed? Just turn it into a noop, and it will be fast.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:35 UTC (Thu) by nix (subscriber, #2304) [Link]

I'm probably confusing you with someone else, or with myself, or something
like that. Sorry.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:44 UTC (Thu) by nix (subscriber, #2304) [Link]

You still don't get my point, though. I'd agree that when all the writes
going on are just the system chewing to itself, all you need is
consistency across crashes.

But when the system has just written out my magnum opus, by damn I want
that to hit persistent storage right now! The fsync() should bypass all
other disk I/O as much as possible and hit the disk absolutely as fast as
it can: slowing to disk speeds is fine, we're talking human reaction time
here which is much slower: I don't care if writing out my tax records
takes five seconds 'cos I just spent three hours slaving over them, five
seconds is nothing. But waiting behind a vast number of unimportant writes
(which were all asynchronous until our fsync() forced them out because of
filesystem infelicities) is not fine: if we have to wait for minutes for
our stuff to get out, we may as well have done an asynchronous write.

With btrfs, this golden vision of fast fsync() even under high disk write
load is possible. With ext*, it mostly isn't (you have to force earlier
stuff to the disk even if I don't give a damn about it and nobody ever
fsync()ed it), and in ext3 without data=writeback, fsync() is so slow when
contending with write loads that app developers were tempted to drop this
whole requirement and leave my magnum opus hanging about in transient
storage for many seconds. With ext4 at least fsync() doesn't stall my apps
merely because bloody firefox decided to drop another 500Mb hairball.

Again: I'm not interested in fsync() to prevent filesystem corruption
(that mostly doesn't happen, thanks to the journal, even if the power
suddenly goes out). I'm interested in saving *the contents of particular
files* that I just saved. If you're writing a book, and you save a
chapter, you care much more about preserving that chapter in case of power
fail than you care about some random FS corruption making off
with /usr/bin; fixing the latter is one reinstall away, but there's
nothing you can reinstall to get your data back.

I hope that's clearer :)

JLS2009: A Btrfs update

Posted Nov 8, 2009 21:53 UTC (Sun) by anton (subscriber, #25547) [Link]

Sure, if the only thing you care about in a file system is that fsync()s complete quickly and still hit the disk, use a file system that gives you that.

OTOH, I care more about data consistency. If we want to combine these two concerns, we get to some interesting design choices:

Committing the fsync()ed file before earlier writes to other files would break the ordering guarantee that makes a file system good (of course, we would only see this in the case of a crash between the time of the fsync() and the next regular commit). If the file system wants to preserve the write order, then fsync() pretty much becomes sync(), i.e., the performance behaviour that you do not want.

One can argue that an application that uses fsync() knows what it is doing, so it will do the fsync()s in an order that guarantees data consistency for its data anyway.

Counterarguments: 1) The crash case probably has not been tested extensively for this application, so it may have gotten the order of fsync()s wrong and doing the fsync()s right away may compromise the data consistency after all. 2) This application may interact with others in a way that makes the ordering of its writes relative to the others important; committing these writes in a different order opens a data inconsistency window.

Depending on the write volume of the applications on the machine, on the trust in the correctness of the fsync()s in all the applications, and on the way the applications interact with the users, the following are reasonable choices: 1) fsync() as sync (slowest); 2) fsync() as out-of-order commit; 3) fsync() as noop.

BTW, I find your motivating example still unconvincing: if you edit your magnum opus or your tax records, wouldn't you use an editor that autosaves regularly? Ok, your editor does not fsync() the autosaves, so with a bad file system you will lose the work, but on a good file system you won't, so you will also use a good file system for that, won't you? So it does not really matter how long you slaved away on the file; a crash will only lose very little data. Or if you work in a way that can lose everything, why were the tax records after 2h59' not important enough to merit more precautions, but after 3h a fast fsync() is more important than anything else?

An example where a synchronous commit is really needed is a remote "cvs commit" (and maybe similar operations in other version control systems): once a file is committed on the remote machine, the file's version number is updated on the local machine, so the remote commit had better stay committed, even if the remote machine crashes in the meantime. Of course, the problem here is that a cvs commit can easily commit hundreds of files; if it fsync()s every one of them separately, the cumulative waiting for the disk may be quite noticeable. Doing the equivalent for all the files at once could be faster, but we have no good way to tell that to the file system (AFAIK CVS works a file at a time, so it wouldn't matter for CVS, but there may be other applications where it does). Hmm, if there are few writes by other applications at the same time, and all the fsync()s were done at the end, then fsync()-as-sync could be faster than out-of-order fsync()s: the first fsync() would commit all the files, and the other fsync()s would just return immediately.
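The two strategies in that last paragraph, one synchronous flush per file versus writing everything and flushing once, can be sketched as follows (the function names are invented; os.sync() flushes all dirty data system-wide, i.e. the fsync-as-sync behaviour discussed above):

```python
import os

def commit_files_individually(paths_and_data):
    """Per-file durability: each file waits for its own flush, so N
    files cost roughly N synchronous disk operations."""
    for path, data in paths_and_data:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # wait for this one file's blocks

def commit_files_batched(paths_and_data):
    """Batched durability: write everything first, then flush once.
    os.sync() flushes *all* dirty data system-wide, which can beat N
    separate fsync()s when little unrelated write traffic is queued,
    but stalls behind that traffic when there is a lot of it."""
    for path, data in paths_and_data:
        with open(path, "wb") as f:
            f.write(data)
    os.sync()                        # one flush covers every file above
```

Which one wins is exactly the trade-off described above: with a quiet machine, the single sync is cheaper; under heavy unrelated write load, it drags all of that load into the commit.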


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds