
Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 17:59 UTC (Tue) by iabervon (subscriber, #722)
In reply to: Ext4 filesystem hits Android, no need to fear data loss (ars technica) by mjg59
Parent article: Ext4 filesystem hits Android, no need to fear data loss (ars technica)

POSIX doesn't make any guarantees at all about what happens when the system stops and starts again, so it would be hard for Linux to be useful and not make stronger guarantees. Or, rather, Linux filesystems don't guarantee anything more (the reason your system crashes could conceivably be that your video driver has started sending data to your hard drive instead of your video card, entirely wiping out your storage), but the behavior that Linux rarely diverges from is substantially better than POSIX requires.

fsync() is kind of weird in that, if the system never crashes, it has no visible effect; and if the system ever crashes, it might not come back at all or it might do something random that happens to undo all the effects of fsync().



Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 19:46 UTC (Tue) by Trelane (subscriber, #56877)

So it sounds like fsync would work except that it flushes data and metadata, thereby causing performance issues, yes? If so, a nice solution may be to introduce a metadata-only sync?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 20:46 UTC (Tue) by quotemstr (subscriber, #45331)

*sigh* Not this argument again. Go reread the old threads: LWN ran carefully balanced coverage of the Great FSync War.

To respond to your particular comment, the issue is really that we need a write barrier, while fsync specifies a full flush, a much stronger and unnecessary operation.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 22:55 UTC (Tue) by neilbrown (subscriber, #359)

This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.

fsync isn't a "full flush" (except in ext3), it is a controlled flush of a single file. "sync" is "full flush".

What you really need is dependencies. Journalling filesystems use dependencies a lot to make sure that things get written in the "right" order. They submit lots of antecedent writes, then a flush, then the dependent writes. e.g. metadata-to-journal, journal-commit-block, metadata-to-filesystem.
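The antecedent/flush/dependent ordering described above can be sketched from user space. This is only a toy illustration using ordinary files and `os.fsync`; the function name, the record format, and the use of a separate journal file are assumptions made for the example, not how any real filesystem journal is laid out.

```python
import os


def journaled_update(journal_path, target_path, payload):
    """Toy ordering: journal record, flush, commit marker, flush, then the real write."""
    with open(journal_path, "ab") as journal:
        # Antecedent write: describe the change in the journal first.
        journal.write(b"RECORD " + payload + b"\n")
        journal.flush()
        os.fsync(journal.fileno())  # record must be stable before the commit marker

        # Commit marker: declares the record above complete.
        journal.write(b"COMMIT\n")
        journal.flush()
        os.fsync(journal.fileno())  # commit must be stable before the dependent write

    # Dependent write: only now is the real location modified.
    with open(target_path, "wb") as target:
        target.write(payload)
        target.flush()
        os.fsync(target.fileno())
```

Each `os.fsync` call here plays the role of the flush between a group of writes and the writes that depend on them.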

If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.

Despite comments above and elsewhere, I don't think write/close/rename is guaranteed to provide atomicity in Linux. Ted Ts'o recently wrote:
(http://www.spinics.net/linux/lists/linux-ext4/msg22395.html)

> The implementors of a number of mainstream file systems (i.e., ext4 btrfs, XFS) have agreed to do the equivalent of #1 (i.e., initiating writeback, but not necessarily waiting for the writeback to complete) in the case of a rename that replaces an existing file.

As there is no wait or dependency, there is no guarantee (as I read it).
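For an application that does not want to depend on that behaviour, the usual pattern is to add an explicit fsync between the write and the rename. A minimal sketch, assuming a POSIX system; the function name and the `.tmp` suffix are hypothetical choices for the example:

```python
import os


def replace_file_with_fsync(path, data):
    """Write new contents to a temporary file, fsync it, then rename over the old name."""
    tmp_path = path + ".tmp"  # hypothetical temporary-name convention
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # wait until the kernel reports the data as written
    # The rename happens only after the new data has been flushed, so the
    # name should never refer to a file whose contents were not yet written.
    os.rename(tmp_path, path)
```

Without the `os.fsync` line this is the plain write/close/rename sequence under discussion.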

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 2:09 UTC (Wed) by quotemstr (subscriber, #45331)

> This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.

And the second part of that plan is to use threads and internal queues to achieve the same overall effect; that's not really an option for userspace.

As for citing Ts'o: his views on the subject are widely known; that doesn't make them correct. There is nothing wrong with an application relying on transactional rename semantics, and asking userspace to not use reasonable, convenient behavior that's worked for years is simply not a tenable position.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 3:01 UTC (Wed) by neilbrown (subscriber, #359)

1/ I wasn't citing Ts'o's opinion. I was citing his report on a meeting of several filesystem developers. He wasn't saying what "should" be. He was saying what "is".

2/ The transactional rename semantics that ext3 provided were really a bug (or mis-feature or short-sighted design or whatever). I think it was only ext3 that provided them, certainly not ext2, probably not XFS, not sure about reiserfs. Any userspace that relies on them (without checking that the filesystem in use is ext3) is buggy.

That said - I think that it would be good if Linux did provide some transactional guarantees that didn't require fsync and that user-space could rely on across all filesystems. I suspect that enhancing rename would be a good start and could be done with little or no performance cost. It might even be good to provide something more explicit. But I don't think that 'barriers' is the right model.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 4:45 UTC (Wed) by iabervon (subscriber, #722)

The thing is that transactional rename semantics work for all POSIX-compliant filesystems when your computer doesn't crash. Anything that's only needed if your computer crashes isn't going to be used consistently, because developers' computers crash so rarely and it's awkward to include in an application regression test suite.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 8:59 UTC (Wed) by neilbrown (subscriber, #359)

fsync is only needed if your computer crashes. Not using it works perfectly when your computer doesn't crash.

By your argument, fsync will not be used consistently, and that appears to be true to some extent.

The implied conclusion seems to be that all filesystems should be mounted "-o sync" so that fsync is not needed. Strangely, I have not heard that conclusion being proposed explicitly....

I certainly agree that interfaces should be hard to misuse (the Rusty principle) but that must be balanced against the dictum against making things fool proof as then only a fool would use them. In this case, while it is quite possible to misuse fsync (e.g. by not using it), it really is an appropriate interface.

And while regression test suites are incredibly valuable and there should be more of them etc etc, one hopes that developers don't depend on them as the sole means of ensuring correctness, but also read documentation, try to understand the systems they work with, and write code accordingly. (one also hopes for a pony.... but no, only a piece of coal in my stocking again)

Syncing has significant performance penalties.

Posted Dec 29, 2010 14:26 UTC (Wed) by gmatht (guest, #58961)

The applications that actually need durability of data written in the last few seconds don't seem to be that common... one might hope that the people who wrote the database your bank uses would know about fsync. Even on ext4 a trivial fsync takes 50ms[1], which is forever in computer time and far more than the ~0.1ms needed to create a new file, but losing everything because the filesystem decided to write out the metadata before the data (violating atomicity) isn't that great either.
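A rough way to measure this on a given machine is to time repeated small fsync calls. The sketch below is not the benchmark linked at [1]; the function name, scratch file name, and round count are assumptions, and the result depends heavily on the filesystem and hardware (on a tmpfs, for instance, fsync is nearly free).

```python
import os
import tempfile
import time


def average_fsync_seconds(directory, rounds=20):
    """Return the mean wall-clock time of an fsync after a one-byte write."""
    path = os.path.join(directory, "fsync_probe.tmp")  # hypothetical scratch file
    total = 0.0
    with open(path, "wb") as f:
        for _ in range(rounds):
            f.write(b"x")
            f.flush()                      # hand the byte to the kernel
            start = time.perf_counter()
            os.fsync(f.fileno())           # time only the sync itself
            total += time.perf_counter() - start
    os.unlink(path)
    return total / rounds


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(f"average fsync: {average_fsync_seconds(scratch) * 1000:.3f} ms")
```

Point `directory` at the filesystem you actually care about rather than the default temporary directory.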

Apparently xsyncfs provides a system that is synchronous from the user's point of view with an overhead of only 3%-7%. That might be an acceptable default. Xsyncfs doesn't seem very well documented, but it is just one avenue for providing good reliability properties without the massive performance penalties of fsync/-o sync.

[1] www.ucc.asn.au/~mccabedj/fsync_benchmark.c

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 17:42 UTC (Wed) by iabervon (subscriber, #722)

Using fsync only really makes sense if you're trying to get stuff written to disk before sending a message out of the system; otherwise, it won't be possible to tell whether the fsync didn't actually do anything or the system crashed before it returned. So you need to use fsync after writing a received email message to disk and before telling the remote server that you've got it.

I believe that the model that most people have of filesystems is that what's recovered after a system crash is like a snapshot of the filesystem that you would see in a running system if you were taking the snapshot with ordinary system calls and could therefore see all the race conditions you can see between programs; however, there is arbitrary random damage because the system crashed, and the latest snapshot may not be particularly recent.

With this model, fsync is easy to (know to) use in cases where you want to make sure that the snapshot is sufficiently recent, but not for cases where it is necessary to avoid the recovered state being something that couldn't have been a snapshot.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 30, 2010 0:14 UTC (Thu) by neilbrown (subscriber, #359)

> I believe that the model that most people have of filesystems is that ...

That is an overly naive model of a filesystem. It assumes almost complete linearisation of operations on their way to storage. Any re-ordering in the page cache before writeback or in the device queue via an IO scheduler will invalidate that model, and as you can imagine such re-ordering happens a lot.

The correct model is "nothing is safe until you call sync or fsync or some other variant", with the understanding that 'sync' is effectively called every 30 seconds or so.

I'm glad it is obvious that you need to call fsync (on both the file and the directory you created the file in) before acknowledging the receipt of a file (e.g. an email) over a network connection.
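As a sketch of that ordering: store the message, fsync the file, fsync the directory that holds the new name, and only then acknowledge. The function name and the `send_ack` callback are hypothetical stand-ins for whatever the network protocol requires, and opening a directory read-only in order to fsync it is assumed to work as it does on Linux.

```python
import os


def store_then_acknowledge(spool_dir, name, message, send_ack):
    """Make a received message durable (file and directory entry) before acking it."""
    path = os.path.join(spool_dir, name)
    with open(path, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())      # the file's contents

    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # the directory entry naming the file
    finally:
        os.close(dir_fd)

    send_ack()                    # only now tell the sender we have it
```

If the system crashes before `send_ack` runs, the sender never saw an acknowledgement and can retry.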

However, exactly the same is true when moving a file by copying it. If you copy a file (possibly transforming it on the way) and then remove the original, you really must fsync the new copy before unlinking the old. You should also fsync the directory, though if you rename the new (after fsyncing it) to replace the old, then the fsync of the directory is not required.

Note that "mv" doesn't do the fsync when moving a file between filesystems (which requires a copy/unlink). So if you use mv and then crash you could quite possibly lose both copies. And mv doesn't even have an option to request the fsync.
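A copy-based move that does perform those syncs could look like the following. This is a minimal sketch, not what `mv` does; the function name is made up for the example, and fsyncing a directory through a read-only descriptor is assumed to behave as on Linux.

```python
import os
import shutil


def move_by_copy_with_fsync(src, dst):
    """Copy src to dst, make the copy durable, and only then unlink src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())   # new copy's data must be written first

    # Sync the destination directory so the new name is written too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    os.unlink(src)                # the original goes away last
```

A crash at any point before the `os.unlink` call leaves the original file in place.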

Now you might suggest that this should "just work" without mv needing to call fsync. But I think you would find it quite difficult to design the filesystem semantics that would allow this to always be safe, especially as you need interaction between two separate filesystems (unlink in one must not commit until writes in the other have committed). ... other than mounting everything with '-o sync' of course.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 30, 2010 1:33 UTC (Thu) by iabervon (subscriber, #722)

My model can't really fail to be accurate, since it includes the possibility of arbitrary deviations from the predicted outcome. And, actually, nothing is safe at all: your storage medium might fail, your video driver might scribble over your disk or your dirty pages, your hard drive might read garbage out of memory that is losing power and write it with the power left in its capacitors. I actually suspect that, under the model I stated, things that syncing couldn't have helped with are a more common and more extensive source of differences from some potential snapshot than things that syncing could have helped with (with the exception of ext4 having a particularly common and obvious divergence).

It also hasn't been that long, in the UNIX tradition, that you could be reasonably confident that a power failure shortly after you changed something in a directory wouldn't trash other things in the directory, which made it kind of irrelevant whether you'd called fsync on the directory to make sure that the disk was correct before it got corrupted.

In general, there's a tradeoff among filesystem complexity, slowness, and deviation from non-crash state. None of these go to zero without making the others terrible, even if you call sync all the time.

(In fact, my model does require an fsync when moving a file by copying it, at least across directories; the snapshotting process could read the destination directory before you write the file and the source directory after you unlink it.)

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Jan 1, 2011 18:54 UTC (Sat) by butlerm (subscriber, #13312)

> If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.

User specified dependencies are completely unnecessary. All that is necessary is for filesystems to exercise a little bit of additional effort to preserve POSIX semantics across system crashes. That means atomic renames, among other things.

Very simple: log the rename in the journal. Keep the old inode around until the data associated with the replacement inode commits to disk. If the system crashes in the meantime, on recovery undo the rename, thereby restoring the association between the name and the old inode.

No need to commit the new inode data implicitly or explicitly prior to writing any metadata journal entries. No need to call fsync unless you actually want synchronous behavior. No need to start writeout on the new inode immediately, either.
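To make the idea concrete, here is a toy in-memory model of the scheme: renames are journalled together with the inode they displaced, and recovery undoes any rename whose new inode's data never committed. Every name in it (`ToyFS`, `commit_data`, `recover`) is invented for illustration, and it ignores everything a real filesystem would have to handle, including restoring the source name.

```python
class ToyFS:
    """Toy model: journal renames and undo them on recovery if the data never committed."""

    def __init__(self):
        self.names = {}       # name -> inode number
        self.committed = {}   # inode number -> True once its data is on disk
        self.journal = []     # (name, displaced inode or None, new inode)
        self._next_inode = 1

    def create(self, name, committed=False):
        inode = self._next_inode
        self._next_inode += 1
        self.names[name] = inode
        self.committed[inode] = committed
        return inode

    def rename(self, src, dst):
        new_inode = self.names.pop(src)
        # Log the rename and keep the displaced inode around for a possible undo.
        self.journal.append((dst, self.names.get(dst), new_inode))
        self.names[dst] = new_inode

    def commit_data(self, inode):
        # The data reached disk, so renames to this inode no longer need undo records.
        self.committed[inode] = True
        self.journal = [rec for rec in self.journal if rec[2] != inode]

    def recover(self):
        # After a crash: undo, newest first, every rename whose data never committed.
        for name, old_inode, new_inode in reversed(self.journal):
            if not self.committed[new_inode]:
                if old_inode is None:
                    self.names.pop(name, None)
                else:
                    self.names[name] = old_inode
        self.journal.clear()
```

In this model, a crash right after `rename("config.tmp", "config")` leaves `config` pointing at the old inode once `recover()` has run, while a crash after `commit_data` leaves it pointing at the new one.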

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 21:44 UTC (Tue) by iabervon (subscriber, #722)

No, actually, a metadata-only sync would *increase* the chance of causing problems over not using anything at all. The necessary operation is to *prevent* writing the metadata until the data has been written.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Jan 1, 2011 19:22 UTC (Sat) by butlerm (subscriber, #13312)

> The necessary operation is to *prevent* writing the metadata until the data has been written.

That can lead to severe performance issues, unless you implement some sort of multiversion concurrency (a la BSD style "soft updates") on all your metadata. There are better ways.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds