
JLS2009: A Btrfs update

JLS2009: A Btrfs update

Posted Oct 30, 2009 21:09 UTC (Fri) by nix (subscriber, #2304)
In reply to: JLS2009: A Btrfs update by anton
Parent article: JLS2009: A Btrfs update

I think I *do* want my text editor to fsync() stuff I just wrote to disk,
thank you very much. I don't want the OS deciding to hold on to it for 30s
or 300s or whatever before flushing it back. It's not keeping the FS
consistent across crashes I care about: it's preserving *the stuff I just
saved* across crashes!

(not relevant for me, battery-backed RAID arrays now hold absolutely
everything I care about at home and at work, bwahaha)



JLS2009: A Btrfs update

Posted Oct 31, 2009 22:01 UTC (Sat) by anton (subscriber, #25547) [Link]

I don't mind losing a few seconds of my work on a crash, if I learn about the crash right away (as mentioned, remote servers can be a different issue). I do mind it if the file system loses an hour of my work, as has happened to me, and as the ext4 author Ted Ts'o believes file systems should behave.

Many people don't want to wait for slow fsync()s; but if you only want to continue working after the fsync() has finished, just configure your system to stay with the slow fsync()s; fine with me.

BTW, your battery-backed RAID arrays will not help you when the kernel crashes, and the file system decides that it should empty or zero the files you have worked on when doing the fsck or journal replay.

JLS2009: A Btrfs update

Posted Nov 1, 2009 7:51 UTC (Sun) by Cato (subscriber, #7643) [Link]

Interesting example - presumably ext3 with data=journal would ensure that the data and metadata hit the disk together. This should avoid the scenario mentioned that metadata for the main and autosave files hit the disk, causing the OS to empty the autosave file, while the main file's data remains in memory and is wiped by the system crash.

JLS2009: A Btrfs update

Posted Nov 1, 2009 19:55 UTC (Sun) by anton (subscriber, #25547) [Link]

Yes, data=journal should be ok, unless they introduce one of the file system corruption bugs like the one I read about (for data=journal) some years ago. I guess this was not noticed during development because it's a non-default mode, so it's tested by few (and typically those people who do use such hopefully-safer, slower features don't run bleeding-edge kernels).

The former default ext3 behaviour (data=ordered) should also be ok for simple cases such as this (i.e., no overwriting of existing blocks involved). Unfortunately, Ted Ts'o, the current maintainer of ext3, wants to degrade ext3's default functionality to the lowest common denominator (i.e., at least as bad as UFS), with better functionality available through mount options; will this work out any better than the non-default data=journal?
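For readers who want to experiment with these modes: the journalling mode is selected per mount, e.g. with a line like one of these in /etc/fstab (the device and mount point here are placeholders):

```
# journal file data as well as metadata (safest, slowest)
/dev/sda2  /home  ext3  defaults,data=journal  0  2
# or the former default: write data blocks before committing metadata
/dev/sda2  /home  ext3  defaults,data=ordered  0  2
```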

JLS2009: A Btrfs update

Posted Nov 1, 2009 13:13 UTC (Sun) by nix (subscriber, #2304) [Link]

True, I realised I misspoke there a second before hitting publish. Of
course battery-backed disk storage doesn't help if in the absence of
fsync() the OS hasn't pushed any data anywhere near said storage yet!

(And I do tend to assume that the journalling layer doesn't have lethal
data-eating bugs, or at least none that bite me. Should they do so, well,
that sort of rare disaster is what backups are for.)

If fsync() is slow, the problem is that fsync() is slow; the solution is
to speed it up, not rip the calls out of things like your text editor. (FF
using fsync() for transient-but-bulky state like the awesome bar is nuts,
agreed.)

FWIW I use KDE4 and ext4 and have turned barriers off (battery-backed RAID
array, again) and have had not a single instance of sudden death by
zeroing. So it doesn't happen to everyone.

(Of course my system doesn't crash often either.)

JLS2009: A Btrfs update

Posted Nov 1, 2009 20:01 UTC (Sun) by anton (subscriber, #25547) [Link]

An fsync() that really synchronously writes to the disk is always going to be slow, because it makes the program wait for the disk(s). And with a good file system it's completely unnecessary for an application like an editor; editors just call it as a workaround for bad file systems.

JLS2009: A Btrfs update

Posted Nov 1, 2009 20:32 UTC (Sun) by nix (subscriber, #2304) [Link]

So, er, you're suggesting that a good filesystem, what, calls sync() every
second? I can't see any way in which you could get the guarantees fsync()
does for files you really care about without paying some kind of price for
it in latency for those files.

And I don't really like the idea of calling sync() every second (or every
five, thank you ext3).

Being able to fsync() the important stuff *without* forcing everything
else to disk, like btrfs promises, seems very nice. Now my editor files
can be fsync()ed without requiring me to wait for a few hundred MB of
who-knows-what breadcrumb crud from FF to also be synced.

JLS2009: A Btrfs update

Posted Nov 1, 2009 21:37 UTC (Sun) by anton (subscriber, #25547) [Link]

In a good file system, the state after recovery is the logical state of the file system of some point in time (typically a few seconds) before the crash. It's possible to implement that efficiently (especially in a copy-on-write file system).

For an editor that does not fsync(), that would mean that you lose a few seconds of work (at worst the few seconds between autosaves plus the few seconds that the file system delays writing).

For every application (including editors), it would mean that if the developers ensure the consistency of the persistent data in case of a process kill, they will also have ensured it in case of a system crash or power outage. So they do not have to do extra work on consistency against crashes, which would also be extremely impractical to test.

It should not be too hard to turn Btrfs into a good file system. Unfortunately, the Linux file systems seem to regress into the dark ages (well, the 1980s) when it comes to data consistency (e.g., in the defaults for ext3). And some things I have read from Chris Mason lead me to believe that Btrfs will be no better.

As for the guarantees that fsync() gives: it gives no useful guarantee. It's just a prayer to the file system, and most file systems actually listen to this prayer in more or less the way you expect; but some require more prayers than others, and some ignore the prayer. I wonder why Ted Ts'o does not apologize for implementing fsync() in a somewhat useful way instead of the fastest way that still satisfies the letter of the POSIX specification.

JLS2009: A Btrfs update

Posted Nov 2, 2009 0:22 UTC (Mon) by nix (subscriber, #2304) [Link]

Ah, you're assuming an editor that saves the state of the program on
almost every keystroke plus a filesystem that preserves *some* consistent
state, but not necessarily the most recent one.

In that case, I agree: editors should not fsync() their autosave state if
they're preserving it every keystroke or so (and the filesystem should not
destroy the contents of the autosave file: thankfully neither ext3 nor
ext4 do so, now that they recognize rename() as implying a
block-allocation ordering barrier). But I certainly don't agree that
editors shouldn't fsync() files *when you explicitly asked it to save
them*! No, I don't think it's acceptable to lose work, even a few seconds'
work, after I tell an editor 'save now dammit'. That's what 'save'
*means*.

And that's why Ted's gone to some lengths to make fsync() fast in ext4:
because he wants people to actually *use* it.
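The save-really-means-save discipline described here is usually implemented as a write/fsync/rename sequence. A minimal Python sketch (the helper name and the directory-fsync step are illustrative, not taken from any particular editor):

```python
import os
import tempfile

def durable_save(path, data):
    """Atomically replace `path` with `data` and force it to stable storage."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so rename() is atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # wait until the new contents hit the disk
        os.rename(tmp, path)      # atomically swap the new file into place
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)         # make the rename itself durable as well
        finally:
            os.close(dfd)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A crash at any point leaves either the complete old file or the complete new one; only a crash before fsync() returns can lose the save the user just asked for.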

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:09 UTC (Mon) by anton (subscriber, #25547) [Link]

Using fsync() does not prevent losing a bit of work when you press save, because the system can crash between the time when you hit save and when the application actually calls and completes fsync(). The only thing that fsync() buys is that the save takes longer, and once it's finished and the application lets you work again, you won't lose the few seconds. That may be worth the cost for you, but I wonder why?

As for Ted Ts'o, I would have preferred it if he had gone to some lengths to make ext4 a good file system; then fsync() would not be needed as much. Hmm, makes me wonder if he made fsync() fast because ext4 is bad, or if he made ext4 bad in order to encourage use of fsync().

JLS2009: A Btrfs update

Posted Nov 2, 2009 8:37 UTC (Mon) by njs (guest, #40338) [Link]

> In a good file system, the state after recovery is the logical state of the file system of some point in time (typically a few seconds) before the crash. It's possible to implement that efficiently (especially in a copy-on-write file system).

[Citation needed] -- or in other words, if this is so possible, why are no modern filesystem experts working on it, AFAICT? How are you going to be efficient when the requirement you stated requires that arbitrary requests be handled in serial order, forcing you to wait for disk seek latencies?

> I wonder why Ted Ts'o does not apologize for implementing fsync() in a somewhat useful way instead of the fastest way that still satisfies the letter of the POSIX specification.

Err, why should he apologize for implementing things in a useful way?

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:34 UTC (Mon) by anton (subscriber, #25547) [Link]

> [Citation needed]

Yes, I have wanted to write this down for some time. Real soon now, promised!

> if this is so possible, why are no modern filesystem experts working on it, AFAICT?

Maybe they are, or they consider it a solved problem and have moved on to other challenges. As for those file system experts that we read about on LWN (e.g., Ted Ts'o), they are not modern as far as data consistency is concerned; instead they are regressing to the 1980s. And they are so stuck in that mindset that they don't see the need for something better. Probably something like: "Sonny, when we were young, we did not need data consistency from the file system; and if fsync() was good enough for us, it's certainly good enough for you!"

> How are you going to be efficient when the requirement you stated requires that arbitrary requests be handled in serial order,

It doesn't. All the changes between two commits can be written out in arbitrary order; only the commit has to come after all these writes.

> Err, why should [Ted Ts'o] apologize for implementing things in a useful way?

He has done so before.

JLS2009: A Btrfs update

Posted Nov 3, 2009 19:56 UTC (Tue) by nix (subscriber, #2304) [Link]

That was an apology for introducing appalling latencies, not an apology
for doing things right.

I find it odd that one minute you're complaining that filesystems are
useless because problems occur if you don't fsync(), then the next moment
you're complaining that it's too slow, then the next moment you're
complaining about the precise opposite.

If you want the total guarantees you're aiming for, write an FS atop a
relational database. You *will* experience an enormous slowdown. This is
why all such filesystems (and there have been a few) have tanked: crashes
are rare enough that basically everyone is willing to trade off the chance
of a little rare corruption against a huge speedup all the time. (I can't
remember the time I last had massive filesystem corruption due to power
loss or system crashes. I've had filesystem corruption due to buggy drive
firmware, and filesystem corruption due to electrical storms... but
neither of these would be cured by your magic all-consistent filesystem,
because in both cases the drive wasn't writing what it was asked to write.
And *that* is more common than the sort of thing you're agonizing over. In
fact it seems to be getting more common all the time.)

JLS2009: A Btrfs update

Posted Nov 5, 2009 13:49 UTC (Thu) by anton (subscriber, #25547) [Link]

I understood Ted Ts'o's apology as follows: he thinks that applications should use fsync() in lots of places; by contributing to a better file system, where that is not as necessary, application developers were not punished by the file system as (in his opinion) they should be, and he apologized for spoiling them in this way.

> I find it odd that one minute you're complaining that filesystems are useless because problems occur if you don't fsync(), then the next moment you're complaining that it's too slow, then the next moment you're complaining about the precise opposite.

Are you confusing me with someone else, are you trying to put up a straw man, or was my position so hard to understand? Anyway, here it is again:

On data consistency: A good file system guarantees good data consistency across crashes without needing fsync() or any other prayers (unless synchronous persistence is also required).

On fsync(): A useful implementation of fsync() requires a disk access, and the application waits for it, so it slows the application down from CPU speeds to disk speeds. If the file system provides no data consistency guarantees and applications compensate with extensive use of fsync() (the situation that Ted Ts'o strives for), the overall system will be slow because of all these required synchronous disk accesses. With a good file system, where most applications don't need to fsync() all the time, the overall system will be faster.
Your relational database file system is a straw man; I hope you had good fun beating it up.

If crashes are as irrelevant as you claim, why should anybody use fsync()? And why are you and Ted Ts'o agonizing over fsync() speed? Just turn it into a no-op, and it will be fast.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:35 UTC (Thu) by nix (subscriber, #2304) [Link]

I'm probably confusing you with someone else, or with myself, or something
like that. Sorry.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:44 UTC (Thu) by nix (subscriber, #2304) [Link]

You still don't get my point, though. I'd agree that when all the writes
going on are just the system chewing to itself, all you need is
consistency across crashes.

But when the system has just written out my magnum opus, by damn I want
that to hit persistent storage right now! The fsync() should bypass all
other disk I/O as much as possible and hit the disk absolutely as fast as
it can: slowing to disk speeds is fine, we're talking human reaction time
here which is much slower: I don't care if writing out my tax records
takes five seconds 'cos I just spent three hours slaving over them, five
seconds is nothing. But waiting behind a vast number of unimportant writes
(which were all asynchronous until our fsync() forced them out because of
filesystem infelicities) is not fine: if we have to wait for minutes for
our stuff to get out, we may as well have done an asynchronous write.

With btrfs, this golden vision of fast fsync() even under high disk write
load is possible. With ext*, it mostly isn't (you have to force earlier
stuff to the disk even if I don't give a damn about it and nobody ever
fsync()ed it), and in ext3 without data=writeback, fsync() is so slow when
contending with write loads that app developers were tempted to drop this
whole requirement and leave my magnum opus hanging about in transient
storage for many seconds. With ext4 at least fsync() doesn't stall my apps
merely because bloody firefox decided to drop another 500MB hairball.

Again: I'm not interested in fsync() to prevent filesystem corruption
(that mostly doesn't happen, thanks to the journal, even if the power
suddenly goes out). I'm interested in saving *the contents of particular
files* that I just saved. If you're writing a book, and you save a
chapter, you care much more about preserving that chapter in case of power
fail than you care about some random FS corruption making off
with /usr/bin; fixing the latter is one reinstall away, but there's
nothing you can reinstall to get your data back.

I hope that's clearer :)

JLS2009: A Btrfs update

Posted Nov 8, 2009 21:53 UTC (Sun) by anton (subscriber, #25547) [Link]

Sure, if the only thing you care about in a file system is that fsync()s complete quickly and still hit the disk, use a file system that gives you that.

OTOH, I care more about data consistency. If we want to combine these two concerns, we get to some interesting design choices:

Committing the fsync()ed file before earlier writes to other files would break the ordering guarantee that makes a file system good (of course, we would only see this in the case of a crash between the time of the fsync() and the next regular commit). If the file system wants to preserve the write order, then fsync() pretty much becomes sync(), i.e., the performance behaviour that you do not want.

One can argue that an application that uses fsync() knows what it is doing, so it will do the fsync()s in an order that guarantees data consistency for its data anyway.

Counterarguments: 1) The crash case probably has not been tested extensively for this application, so it may have gotten the order of fsync()s wrong and doing the fsync()s right away may compromise the data consistency after all. 2) This application may interact with others in a way that makes the ordering of its writes relative to the others important; committing these writes in a different order opens a data inconsistency window.

Depending on the write volume of the applications on the machine, on the trust in the correctness of the fsync()s in all the applications, and on the way the applications interact with the users, the following are reasonable choices: 1) fsync() as sync (slowest); 2) fsync() as out-of-order commit; 3) fsync() as noop.

BTW, I find your motivating example still unconvincing: if you edit your magnum opus or your tax records, wouldn't you use an editor that autosaves regularly? OK, your editor does not fsync() the autosaves, so with a bad file system you will lose the work; but on a good file system you won't, so you will also use a good file system for that, won't you? So it does not really matter how long you slaved away on the file; a crash will only lose very little data. Or, if you work in a way that can lose everything, why were the tax records after 2h59' not important enough to merit more precautions, but after 3h a fast fsync() is more important than anything else?

An example where a synchronous commit is really needed is a remote "cvs commit" (and maybe similar operations in other version control systems): once a file is committed on the remote machine, the file's version number is updated on the local machine, so the remote commit had better stay committed, even if the remote machine crashes in the meantime. Of course, the problem here is that a cvs commit can easily commit hundreds of files; if it fsync()s every one of them separately, the cumulated waiting for the disk may be quite noticeable. Doing the equivalent for all the files at once could be faster, but we have no good way to tell that to the file system (AFAIK CVS works a file at a time, so it wouldn't matter for CVS, but there may be other applications where it does). Hmm, if there are few writes by other applications at the same time, and all the fsync()s were done at the end, then fsync()-as-sync could be faster than out-of-order fsync()s: the first fsync() would commit all the files, and the other fsync()s would just return immediately.
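The difference between per-file fsync() and writing everything first can be sketched as follows (a rough illustration; the timing behaviour, not the semantics, is the point):

```python
import os

def commit_each(paths_and_data):
    """Write and fsync() each file individually: one disk wait per file."""
    for path, data in paths_and_data:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # synchronous: wait for this file before the next
        finally:
            os.close(fd)

def commit_batch(paths_and_data):
    """Write everything first, then fsync() at the end.  On an
    fsync()-as-sync file system the first fsync() would commit the whole
    batch and the remaining calls would return immediately."""
    fds = []
    for path, data in paths_and_data:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, data)
        fds.append(fd)
    for fd in fds:
        os.fsync(fd)
        os.close(fd)
```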

JLS2009: A Btrfs update

Posted Nov 2, 2009 8:45 UTC (Mon) by njs (guest, #40338) [Link]

I disabled fsync in emacs[1] because otherwise, when working on battery, hitting save makes the whole editor block for a second or more waiting for the disk to spin up :-/. I have laptop-mode set for 10 minutes maximum lost work on battery failure (IIRC this is the default), and I'm pretty sure I hit save more than 600 times between battery failures. Actually, I'm not sure when the last time I had a battery failure was...

[1] (setq write-region-inhibit-fsync t)

JLS2009: A Btrfs update

Posted Nov 2, 2009 17:11 UTC (Mon) by nix (subscriber, #2304) [Link]

Yeah, laptops are a case where perhaps you want to force fsync() to do nothing at all, as your largest failure case is normally power failure (not much of an issue with a laptop battery 'UPS'). You do still have the oops-the-OS-crashes problem, but hopefully Linux doesn't crash too much. :/ If you have a crashy OS *and* a hard disk that has to spin up from a dead stop, I don't think you have any good answers.

(Did the force-fsync()-to-do-nothing patch ever get lumped into laptop_mode as people were suggesting? I don't have a laptop so I don't follow this sort of thing closely...)

JLS2009: A Btrfs update

Posted Nov 2, 2009 17:39 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

fsync() is expected to provide certain guarantees. The kernel shouldn't preempt that just because
it assumes it knows better than applications - the applications should either change behaviour
themselves, or have an LD_PRELOADed library that makes fsync() behaviour conditional on battery
state.
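A rough user-space sketch of that conditional behaviour (the sysfs path is a common Linux convention, not guaranteed on every machine; a real per-application override would be the LD_PRELOADed C wrapper around fsync() suggested above):

```python
import os

# Assumed sysfs node; the exact path varies between machines.
AC_ONLINE = "/sys/class/power_supply/AC/online"

def on_battery():
    """Best-effort check; assume mains power if the node is absent."""
    try:
        with open(AC_ONLINE) as f:
            return f.read().strip() == "0"
    except OSError:
        return False

def maybe_fsync(fd, battery=None):
    """fsync() unless we are on battery and would rather not spin up the disk."""
    if battery is None:
        battery = on_battery()
    if not battery:
        os.fsync(fd)
```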

JLS2009: A Btrfs update

Posted Nov 2, 2009 19:11 UTC (Mon) by foom (subscriber, #14868) [Link]

Of course the kernel shouldn't make such assumptions by itself, but if the user configures it
intentionally to break fsync...

What difference does it make if it's implemented in the kernel or in an LD_PRELOAD library?

JLS2009: A Btrfs update

Posted Nov 2, 2009 19:17 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

It lets you control it per-application.

JLS2009: A Btrfs update

Posted Nov 2, 2009 20:39 UTC (Mon) by njs (guest, #40338) [Link]

But I don't want fsync() to do nothing at all, because there are lots of cases where a poorly-timed crash can cause you to lose not 10 minutes of work, but your entire data store. This applies to basically anything using a more complex data storage strategy than "rewrite the entire data store every time", e.g. dbm, sqlite, databases generally. They all have to transition through a state where their data structures are inconsistent, and if your rollback log hasn't hit disk yet, well...

It's really *annoying* that firefox/sqlite issues fsync()s when storing history information, but I actually find that history information valuable enough that I don't want it all blown away on every crash, and there's really no way to avoid that without fsync().

I would love to see an API that allowed sqlite to express its data integrity requirements without forcing the disk to spin up, but this is not simple: http://www.sqlite.org/atomiccommit.html
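sqlite does already expose a coarse knob for this trade-off: PRAGMA synchronous controls how eagerly it calls fsync(), at the price of durability (OFF leaves you entirely at the mercy of the file system across crashes). A small sketch, with a hypothetical helper name:

```python
import sqlite3

def open_history_db(path, fsync=True):
    """Open a history store; with fsync=False sqlite skips its fsync() calls."""
    conn = sqlite3.connect(path)
    # FULL: fsync() at every critical point.  OFF: never fsync() -- fast,
    # but a system crash can corrupt the database on a bad file system.
    conn.execute("PRAGMA synchronous = %s" % ("FULL" if fsync else "OFF"))
    return conn
```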

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:58 UTC (Mon) by anton (subscriber, #25547) [Link]

> But I don't want fsync() to do nothing at all, because there are lots of cases where a poorly-timed crash can cause you to lose not 10 minutes of work, but your entire data store. This applies to basically anything using a more complex data storage strategy than "rewrite the entire data store every time", e.g. dbm, sqlite, databases generally.

If these applications don't corrupt their storage when they crash on their own or are killed, they won't corrupt it on a good file system even on a system crash. So it's only on bad file systems that the absence of fsync() would cause consistency problems. And how can you be sure that the fsync()s called from these applications are sufficient? Testing this stuff is pretty hard.

There is a different reason for syncing in such applications: A remote user won't notice that the database server lost power or crashed right after his transaction went through, so the database should better ensure that the data is in permanent storage before reporting completion to remote users.

As for the firefox history, a good file system would be a way to avoid losing it completely, without requiring fsync().

JLS2009: A Btrfs update

Posted Nov 2, 2009 23:14 UTC (Mon) by njs (guest, #40338) [Link]

You're right that durability and atomicity are different, that fsync provides both, and that an ideal file system would provide atomicity by default. But there are no filesystems available that do make that guarantee (maybe one of those obscure flash-targeted ones does?), so the properties of what you call a "good filesystem" are unfortunately irrelevant.

JLS2009: A Btrfs update

Posted Nov 3, 2009 23:11 UTC (Tue) by anton (subscriber, #25547) [Link]

I think that ext3 with data=journal or data=ordered is pretty close to a good file system for applications that don't overwrite files in place (e.g., editors). But I would be more confident if some file system developer actually made data consistency a design goal and gave some explicit guarantees.

JLS2009: A Btrfs update

Posted Nov 4, 2009 0:01 UTC (Wed) by nix (subscriber, #2304) [Link]

Unfortunately, both of those are only good filesystems if you really don't
care at all about either read or write speed. The latency figures Linus
posted (from one process dd(1)ing and another writing tiny files and
fsync()ing them) are appalling. We're not talking a mere few seconds,
we're talking over a minute at times.

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:04 UTC (Thu) by anton (subscriber, #25547) [Link]

ext3 with data=ordered is fast enough in my experience (which includes several multi-user servers).

What you write about these figures [citation needed] reminds me of my experiences with copying stuff to flash devices. However, no writing to an ext3 file system was involved there, and I suspect that the problem is sitting at a lower level than the msdos or vfat file system.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:08 UTC (Thu) by nix (subscriber, #2304) [Link]

Yeah, that's (as you know from the comment you linked to) a problem that
the per-bdi writeback fix should solve. I saw it back in the days before
cheap USB hard drives, when I ran backups onto pcdrw...

JLS2009: A Btrfs update

Posted Nov 4, 2009 8:40 UTC (Wed) by njs (guest, #40338) [Link]

Never overwriting data in place is a pretty huge constraint, though. There are some interesting data storage applications that can be efficiently implemented using append-only files, but they're a tiny minority...

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:09 UTC (Thu) by nye (guest, #51576) [Link]

>Never overwriting data in place is a pretty huge constraint, though

Nevertheless, it's generally a requirement for consistency in the face of application crashes (never mind system crashes or power cuts), unless you want to be dealing with full-blown transactional operations at the application level - which could be very little work if performed using facilities provided by the filesystem, but then wouldn't be portable.

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:14 UTC (Thu) by anton (subscriber, #25547) [Link]

Most applications don't even append; they just write a new file in one go (and some then rename it, unlinking the old one). I think that ext3 data=ordered is a good file system for these applications.
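That write-then-rename pattern, relying on the file system's ordering rather than on fsync(), is just:

```python
import os

def save_via_rename(path, data):
    """Replace `path` without fsync(): on a file system that commits data
    before the rename (e.g. ext3 data=ordered), a crash leaves either the
    old file or the complete new one, never an empty or truncated file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)  # no fsync(): we accept losing the last few seconds,
                       # but not the previously saved version
    os.rename(tmp, path)  # atomic swap; the old file is unlinked by rename()
```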

Of course, for applications that overwrite stuff in place (usually databases) it's not a good file system, and these applications need fsync() with it.

JLS2009: A Btrfs update

Posted Nov 8, 2009 2:36 UTC (Sun) by butlerm (guest, #13312) [Link]

Ext3 is *great* for these applications, other than the fact that it is rather
slow for a number of important use cases.

Most importantly a high performance filesystem needs to be able to sync the
data of one file independent of all the pending data for every other open
file. That is the whole problem with ext3 - it doesn't do that, so an fsync
under competing write load is very slow.

Ext4 fixes these problems, but either requires an fsync or inserts one to
make a rename replacement an atomic operation. That delay could be avoided
with some reasonable internal modifications (keeping the old inode around
until the new inode's data commits, and then undoing the rename if necessary
on journal recovery), but I am not aware of any filesystem that actually does
that. You have to call fsync to make your code portable anyway, but there
are a number of applications where that is too expensive.

JLS2009: A Btrfs update

Posted Nov 8, 2009 22:04 UTC (Sun) by anton (subscriber, #25547) [Link]

I don't see that fsync() makes my code (or anyone else's) portable. POSIX gives no useful guarantees on fsync(); different file systems have different requirements for what you have to fsync() in order to really commit a file. So use of fsync() is inherently non-portable.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds