JLS2009: A Btrfs update

Posted Nov 2, 2009 8:45 UTC (Mon) by njs (guest, #40338)
In reply to: JLS2009: A Btrfs update by nix
Parent article: JLS2009: A Btrfs update

I disabled fsync in emacs[1] because otherwise, when working on battery, hitting save makes the whole editor block for a second or more waiting for the disk to spin up :-/. I have laptop-mode set for 10 minutes maximum lost work on battery failure (IIRC this is the default), and I'm pretty sure I hit save more than 600 times between battery failures. Actually, I'm not sure when I last had a battery failure...

[1] (setq write-region-inhibit-fsync t)


JLS2009: A Btrfs update

Posted Nov 2, 2009 17:11 UTC (Mon) by nix (subscriber, #2304) [Link]

Yeah, laptops are a case where perhaps you want to force fsync() to do nothing at all, as your largest failure case is normally power failure (not much of an issue with a laptop battery 'UPS'). You do still have the oops-OS-crashes problem, but hopefully Linux doesn't crash too much :/ If you have a crashy OS *and* a hard disk that has to spin up from a dead stop, I don't think you have any good answers.

(Did the force-fsync()-to-do-nothing patch ever get lumped into laptop_mode as people were suggesting? I don't have a laptop so I don't follow this sort of thing closely...)

JLS2009: A Btrfs update

Posted Nov 2, 2009 17:39 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

fsync() is expected to provide certain guarantees. The kernel shouldn't preempt that just because
it assumes it knows better than applications - the applications should either change behaviour
themselves, or have an LD_PRELOADed library that makes fsync() behaviour conditional on battery
state.
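
A minimal sketch of the "applications change behaviour themselves" option, done at application level rather than as an LD_PRELOAD shim. It assumes the Linux sysfs layout under /sys/class/power_supply (a `type` file reading `Mains` and an `online` file reading `0` or `1`); the function names `on_mains_power` and `maybe_fsync` are made up for illustration, and when the power state cannot be determined the sketch errs on the side of syncing.

```python
import os
from pathlib import Path

DEFAULT_SUPPLY_DIR = "/sys/class/power_supply"  # assumed Linux sysfs location


def on_mains_power(supply_dir=DEFAULT_SUPPLY_DIR):
    """Return False only if an AC adapter is listed and none is online.

    Any uncertainty (missing directory, unreadable files, no adapter
    listed at all) is treated as "on mains", so fsync() stays enabled.
    """
    try:
        entries = list(Path(supply_dir).iterdir())
    except OSError:
        return True
    saw_offline_adapter = False
    for entry in entries:
        try:
            if (entry / "type").read_text().strip() != "Mains":
                continue
            if (entry / "online").read_text().strip() == "1":
                return True
            saw_offline_adapter = True
        except OSError:
            continue
    return not saw_offline_adapter


def maybe_fsync(fd, supply_dir=DEFAULT_SUPPLY_DIR):
    """fsync(fd) unless running on battery; return True if fsync was issued."""
    if on_mains_power(supply_dir):
        os.fsync(fd)
        return True
    return False
```

An application would call `maybe_fsync(f.fileno())` wherever it currently calls `os.fsync()`. Doing this per application is the point: an editor can opt in while a database keeps its real fsync().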

JLS2009: A Btrfs update

Posted Nov 2, 2009 19:11 UTC (Mon) by foom (subscriber, #14868) [Link]

Of course the kernel shouldn't make such assumptions by itself, but if the user configures it
intentionally to break fsync...

What difference does it make if it's implemented in the kernel or in an LD_PRELOAD library?

JLS2009: A Btrfs update

Posted Nov 2, 2009 19:17 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

It lets you control it per-application.

JLS2009: A Btrfs update

Posted Nov 2, 2009 20:39 UTC (Mon) by njs (guest, #40338) [Link]

But I don't want fsync() to do nothing at all, because there are lots of cases where a poorly-timed crash can cause you to lose not 10 minutes of work, but your entire data store. This applies to basically anything using a more complex data storage strategy than "rewrite the entire data store every time", e.g. dbm, sqlite, databases generally. They all have to transition through a state where their data structures are inconsistent, and if your rollback log hasn't hit disk yet, well...

It's really *annoying* that firefox/sqlite issues fsyncs when storing history information, but I actually find that history information valuable enough that I don't want it all blown away on every crash, and there's really no way to avoid that without fsync.

I would love to see an API that allowed sqlite to express its data integrity requirements without forcing the disk to spin up, but this is not simple: http://www.sqlite.org/atomiccommit.html
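
To make the ordering problem above concrete, here is a heavily simplified rollback-journal sketch. It is not how sqlite actually does it (the linked page describes the real protocol, which also handles torn journals and directory syncs); the `.journal` suffix, the 8-byte offset header, and the function names are all assumptions for illustration, and the overwrite is assumed to lie entirely within the existing file.

```python
import os


def _write_synced(path, data):
    """Write data to path and force it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def journaled_overwrite(path, offset, new_bytes):
    """Overwrite bytes in place, saving the old bytes to a journal first."""
    journal = path + ".journal"
    with open(path, "r+b") as f:
        f.seek(offset)
        old = f.read(len(new_bytes))
        # 1. The rollback data must hit disk BEFORE the store is touched.
        _write_synced(journal, offset.to_bytes(8, "big") + old)
        # 2. Now the store may pass through an inconsistent state.
        f.seek(offset)
        f.write(new_bytes)
        f.flush()
        os.fsync(f.fileno())
    # 3. Removing the journal marks the change as committed.
    os.remove(journal)


def recover(path):
    """After a crash: if a journal is present, restore the saved bytes."""
    journal = path + ".journal"
    if not os.path.exists(journal):
        return False
    with open(journal, "rb") as j:
        raw = j.read()
    offset, old = int.from_bytes(raw[:8], "big"), raw[8:]
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(old)
        f.flush()
        os.fsync(f.fileno())
    os.remove(journal)
    return True
```

If the fsync in step 1 is turned into a no-op, nothing stops the in-place write from reaching the disk before the journal does, and a crash in that window leaves nothing to roll back with.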

JLS2009: A Btrfs update

Posted Nov 2, 2009 21:58 UTC (Mon) by anton (subscriber, #25547) [Link]

>But I don't want fsync() to do nothing at all, because there are lots of cases where a poorly-timed crash can cause you to lose not 10 minutes of work, but your entire data store. This applies to basically anything using a more complex data storage strategy than "rewrite the entire data store every time", e.g. dbm, sqlite, databases generally.
If these applications don't corrupt their storage when they crash on their own or are killed, they won't corrupt it on a good file system even on a system crash. So it's only on bad file systems where the absence of fsync() would cause consistency problems. And how can you be sure that the fsync()s called from these applications are sufficient? Testing this stuff is pretty hard.

There is a different reason for syncing in such applications: A remote user won't notice that the database server lost power or crashed right after his transaction went through, so the database had better ensure that the data is in permanent storage before reporting completion to remote users.

As for the firefox history, a good file system would be a way to avoid losing it completely, without requiring fsync().

JLS2009: A Btrfs update

Posted Nov 2, 2009 23:14 UTC (Mon) by njs (guest, #40338) [Link]

You're right that durability and atomicity are different, that fsync provides both, and that an ideal file system would provide atomicity by default. But there are no filesystems available that do make that guarantee (maybe one of those obscure flash-targeted ones does?), so the properties of what you call a "good filesystem" are unfortunately irrelevant.

JLS2009: A Btrfs update

Posted Nov 3, 2009 23:11 UTC (Tue) by anton (subscriber, #25547) [Link]

I think that ext3 with data=journal or data=ordered is pretty close to a good file system for applications that don't overwrite files in place (e.g., editors). But I would be more confident if some file system developer actually made data consistency a design goal and gave some explicit guarantees.

JLS2009: A Btrfs update

Posted Nov 4, 2009 0:01 UTC (Wed) by nix (subscriber, #2304) [Link]

Unfortunately, both of those are only good filesystems if you really don't
care at all about either read or write speed. The latency figures Linus
posted (from one process dd(1)ing and another writing tiny files and
fsync()ing them) are appalling. We're not talking a mere few seconds,
we're talking over a minute at times.
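
For anyone who wants to see this sort of latency on their own machine, a rough sketch of the "tiny files plus fsync()" half of such a test follows. This is not the test Linus ran, just an approximation of the idea; the function name, file names, and default sizes are made up for illustration.

```python
import os
import time


def fsync_latencies(directory, count=20, size=64):
    """Write `count` tiny files in `directory`, fsync each one, and
    return the time in seconds that each fsync() call took."""
    payload = b"x" * size
    latencies = []
    for i in range(count):
        path = os.path.join(directory, f"tiny-{i}.tmp")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()  # push Python's buffer to the kernel first
            start = time.monotonic()
            os.fsync(f.fileno())
            latencies.append(time.monotonic() - start)
        os.remove(path)
    return latencies
```

Run something like `print(max(fsync_latencies(".")))` once on an idle system and once while a large `dd if=/dev/zero of=bigfile bs=1M count=4096` is writing to the same filesystem in another terminal; the numbers will depend heavily on the filesystem, mount options, and kernel version.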

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:04 UTC (Thu) by anton (subscriber, #25547) [Link]

ext3 with data=ordered is fast enough in my experience (which includes several multi-user servers).

What you write about these figures [citation needed] reminds me of my experiences with copying stuff to flash devices. However, no writing to an ext3 file system was involved there, and I suspect that the problem is sitting at a lower level than the msdos or vfat file system.

JLS2009: A Btrfs update

Posted Nov 5, 2009 18:08 UTC (Thu) by nix (subscriber, #2304) [Link]

Yeah, that's (as you know from the comment you linked to) a problem that
the per-bdi writeback fix should solve. I saw it back in the days before
cheap USB hard drives, when I ran backups onto pcdrw...

JLS2009: A Btrfs update

Posted Nov 4, 2009 8:40 UTC (Wed) by njs (guest, #40338) [Link]

Never overwriting data in place is a pretty huge constraint, though. There are some interesting data storage applications that can be efficiently implemented using append-only files, but they're a tiny minority...

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:09 UTC (Thu) by nye (guest, #51576) [Link]

>Never overwriting data in place is a pretty huge constraint, though

Nevertheless, it's generally a requirement for consistency in the face of application crashes (never mind system crashes or power cuts), unless you want to be dealing with full-blown transactional operations at the application level - which could be very little work if performed using facilities provided by the filesystem, but then wouldn't be portable.

JLS2009: A Btrfs update

Posted Nov 5, 2009 14:14 UTC (Thu) by anton (subscriber, #25547) [Link]

Most applications don't even append, they just write a new file in one go (and some then rename it, unlinking the old one). I think that ext3 data=ordered is a good file system for these applications.

Of course, for applications that overwrite stuff in place (e.g., usually databases) it's not a good file system, and these applications need fsync() with it.
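
The write-a-new-file-then-rename pattern described above can be sketched as follows. The `durable` flag and the function name `replace_file` are assumptions for illustration: with `durable=False` the code relies on the file system to order the data before the rename (the data=ordered behaviour discussed here), while `durable=True` adds the explicit fsync() calls that other file systems may need.

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Replace `path` with `data` by writing a temp file and renaming it.

    Readers see either the old or the new contents, never a mix. Note
    that mkstemp() creates the file with mode 0600, so permissions of
    the old file are not preserved in this sketch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            if durable:
                os.fsync(f.fileno())  # data on disk before the rename
        os.replace(tmp, path)  # atomic within one file system
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    if durable:
        # Also sync the directory so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The temp file is created in the same directory as the target because a rename is only atomic within a single file system.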

JLS2009: A Btrfs update

Posted Nov 8, 2009 2:36 UTC (Sun) by butlerm (guest, #13312) [Link]

Ext3 is *great* for these applications, other than the fact that it is rather
slow for a number of important use cases.

Most importantly a high performance filesystem needs to be able to sync the
data of one file independent of all the pending data for every other open
file. That is the whole problem with ext3 - it doesn't do that, so an fsync
under competing write load is very slow.

Ext4 fixes these problems, but either requires an fsync or inserts one to
make a rename replacement an atomic operation. That delay could be avoided
with some reasonable internal modifications (keeping the old inode around
until the new inode's data commits, and then undoing the rename if necessary
on journal recovery), but I am not aware of any filesystem that actually does
that. You have to call fsync to make your code portable anyway, but there
are a number of applications where that is too expensive.

JLS2009: A Btrfs update

Posted Nov 8, 2009 22:04 UTC (Sun) by anton (subscriber, #25547) [Link]

I don't see that fsync() makes my code (or anyone else's) portable. POSIX gives no useful guarantees on fsync(); different file systems have different requirements for what you have to fsync() in order to really commit a file. So use of fsync() is inherently non-portable.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds