Ext4 filesystem hits Android, no need to fear data loss (ars technica)
Posted Dec 28, 2010 20:46 UTC (Tue) by quotemstr (subscriber, #45331)
To respond to your particular comment, the issue is really that we need a write barrier, while fsync specifies a full flush, a much stronger and unnecessary operation.
Posted Dec 28, 2010 22:55 UTC (Tue) by neilbrown (subscriber, #359)
fsync isn't a "full flush" (except in ext3), it is a controlled flush of a single file. "sync" is "full flush".
What you really need is dependencies. Journalling filesystems use dependencies a lot to make sure that things get written in the "right" order. They submit lots of antecedent writes, then a flush, then the dependent writes. e.g. metadata-to-journal, journal-commit-block, metadata-to-filesystem.
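As a rough user-space analogy of that ordering (not how any real filesystem implements its journal), the sketch below uses fsync as the "flush" between antecedent and dependent writes. The function names and the BEGIN/COMMIT record format are made up for illustration, and no recovery code is shown.

```python
import os


def _append_and_flush(path, record):
    """Append a record and wait until it has been flushed to storage."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()               # move Python's buffer into the kernel
        os.fsync(f.fileno())    # the "flush" step: wait for the data to reach storage


def journalled_update(journal_path, data_path, new_contents):
    """Toy write-ahead update: journal entry, commit record, then the real write."""
    # 1. Antecedent write: describe the change in the journal, then flush.
    _append_and_flush(journal_path, b"BEGIN " + new_contents + b"\n")
    # 2. The commit record depends on step 1, so it is only written after that flush.
    _append_and_flush(journal_path, b"COMMIT\n")
    # 3. Dependent write: touch the real file only once the commit is on storage.
    with open(data_path, "wb") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())
```

Each step only starts after the previous flush has returned, which is the "antecedent writes, flush, dependent writes" pattern in its simplest form.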
If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.
Despite comments above and elsewhere, I don't think write/close/rename is guaranteed to provide atomicity in Linux. Ted Ts'o recently wrote:
> The implementors of a number of mainstream file systems (i.e., ext4 btrfs, XFS) have agreed to do the equivalent of #1 (i.e., initiating writeback, but not necessarily waiting for the writeback to complete) in the case of a rename that replaces an existing file.
As there is no wait or dependency, there is no guarantee (as I read it).
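An application that does not want to depend on that unguaranteed behaviour can fsync the new file itself before the rename. A minimal sketch in Python, assuming a POSIX system; the name `replace_file_durably` is hypothetical:

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` so that a crash should leave either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # move Python's buffer into the kernel
            os.fsync(tmp.fileno())   # wait for the new contents to reach storage
        os.replace(tmp_path, path)   # rename over the old file
    except BaseException:
        # Best-effort cleanup of the temporary file if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This does not fsync the directory, so the rename itself may not have reached storage when the function returns; the point is only that the new name never refers to unwritten data.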
Posted Dec 29, 2010 2:09 UTC (Wed) by quotemstr (subscriber, #45331)
This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.
As for citing Ts'o: his views on the subject are widely known, but that doesn't make them correct. There is nothing wrong with an application relying on transactional rename semantics, and asking userspace to not use reasonable, convenient behavior that's worked for years is simply not a tenable position.
Posted Dec 29, 2010 3:01 UTC (Wed) by neilbrown (subscriber, #359)
2/ The transactional rename semantics that ext3 provided were really a bug (or mis-feature or short-sighted design or whatever). I think it was only ext3 that provided them, certainly not ext2, probably not XFS, not sure about reiserfs. Any userspace that relies on them (without checking that the filesystem in use is ext3) is buggy.
That said - I think that it would be good if Linux did provide some transactional guarantees that didn't require fsync and that user-space could rely on across all filesystems. I suspect that enhancing rename would be a good start and could be done with little or no performance cost. It might even be good to provide something more explicit. But I don't think that 'barriers' is the right model.
Posted Dec 29, 2010 4:45 UTC (Wed) by iabervon (subscriber, #722)
Posted Dec 29, 2010 8:59 UTC (Wed) by neilbrown (subscriber, #359)
By your argument, fsync will not be used consistently, and that appears to be true to some extent.
The implied conclusion seems to be that all filesystems should be mounted "-o sync" so that fsync is not needed. Strangely, I have not heard that conclusion being proposed explicitly....
I certainly agree that interfaces should be hard to misuse (the Rusty principle) but that must be balanced against the dictum against making things fool proof as then only a fool would use them. In this case, while it is quite possible to misuse fsync (e.g. by not using it), it really is an appropriate interface.
And while regression test suites are incredibly valuable and there should be more of them etc etc, one hopes that developers don't depend on them as the sole means of ensuring correctness, but also read documentation, try to understand the systems they work with, and write code accordingly. (one also hopes for a pony.... but no, only a piece of coal in my stocking again)
Syncing has significant performance penalties.
Posted Dec 29, 2010 14:26 UTC (Wed) by gmatht (guest, #58961)
Apparently xsyncfs provides a system that is synchronous from the user's point of view with an overhead of only 3%-7%. That might be an acceptable default. Xsyncfs doesn't seem very well documented, but it is just one avenue for providing good reliability properties without the massive performance penalties of fsync/-o sync.
Posted Dec 29, 2010 17:42 UTC (Wed) by iabervon (subscriber, #722)
I believe that the model that most people have of filesystems is that what's recovered after a system crash is like a snapshot of the filesystem that you would see in a running system if you were taking the snapshot with ordinary system calls and could therefore see all the race conditions you can see between programs; however, there is arbitrary random damage because the system crashed, and the latest snapshot may not be particularly recent.
With this model, fsync is easy to (know to) use in cases where you want to make sure that the snapshot is sufficiently recent, but not for cases where it is necessary to avoid the recovered state being something that couldn't have been a snapshot.
Posted Dec 30, 2010 0:14 UTC (Thu) by neilbrown (subscriber, #359)
That is an overly naive model of a filesystem. It assumes almost complete linearisation of operations on their way to storage. Any re-ordering in the page cache before writeback or in the device queue via an IO scheduler will invalidate that model, and as you can imagine such re-ordering happens a lot.
The correct model is "nothing is safe until you call sync or fsync or some other variant", with the understanding that 'sync' is effectively called every 30 seconds or so.
I'm glad it is obvious that you need to call fsync (on both the file and the directory you created the file in) before acknowledging the receipt of a file (e.g. an email) over a network connection.
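For concreteness, that sequence might look roughly like this in Python. This is a sketch assuming a Linux/POSIX system where a directory can be opened and fsynced; the name `store_before_ack` is hypothetical:

```python
import os


def store_before_ack(directory, name, data):
    """Store a newly received file; return only once it should survive a crash."""
    path = os.path.join(directory, name)
    with open(path, "xb") as f:      # "x": refuse to overwrite an existing file
        f.write(data)
        f.flush()                    # move Python's buffer into the kernel
        os.fsync(f.fileno())         # 1. the file's contents
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # 2. the directory entry naming the file
    finally:
        os.close(dir_fd)
    return path
```

Only after this returns would the receiver send its acknowledgement over the network.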
However, exactly the same is true when moving a file by copying it. If you copy a file (possibly transforming it on the way) and then remove the original, you really must fsync the new copy before unlinking the old. You should also fsync the directory, though if you rename the new (after fsyncing it) to replace the old, then the fsync of the directory is not required.
Note that "mv" doesn't do the fsync when moving a file between filesystems (which requires a copy/unlink). So if you use mv and then crash you could quite possibly lose both copies. And mv doesn't even have an option to request the fsync.
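A move-by-copy that does take those precautions could be sketched as follows. This is an illustration in Python, assuming a Linux/POSIX system, not a description of how any existing tool works; `move_by_copy` and `_fsync_dir` are hypothetical names:

```python
import os
import shutil


def _fsync_dir(path):
    """fsync a directory so that a newly created entry in it is on storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def move_by_copy(src, dst):
    """Copy src to dst, make the copy durable, and only then unlink src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()                 # move Python's buffer into the kernel
        os.fsync(fout.fileno())      # the new copy's contents
    _fsync_dir(os.path.dirname(os.path.abspath(dst)))  # the new directory entry
    os.unlink(src)                   # the original goes away only after both fsyncs
```

A crash at any point before the unlink leaves the original in place, so at least one complete copy should always survive.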
Now you might suggest that this should "just work" without mv needing to call fsync. But I think you would find it quite difficult to design the filesystem semantics that would allow this to always be safe, especially as you need interaction between two separate filesystems (unlink in one must not commit until writes in the other have committed). ... other than mounting everything with '-o sync' of course.
Posted Dec 30, 2010 1:33 UTC (Thu) by iabervon (subscriber, #722)
It also hasn't been all that long, in the UNIX tradition, that you could be reasonably confident that a power failure shortly after you changed something in a directory wouldn't trash other things in the directory; before that, it was kind of irrelevant whether you'd called fsync on the directory to make sure that the disk was correct before it got corrupted.
In general, there's a tradeoff among filesystem complexity, slowness, and deviation from non-crash state. None of these go to zero without making the others terrible, even if you call sync all the time.
(In fact, my model does require an fsync when moving a file by copying it, at least across directories; the snapshotting process could read the destination directory before you write the file and the source directory after you unlink it.)
Posted Jan 1, 2011 18:54 UTC (Sat) by butlerm (subscriber, #13312)
Posted Dec 28, 2010 21:44 UTC (Tue) by iabervon (subscriber, #722)
Posted Jan 1, 2011 19:22 UTC (Sat) by butlerm (subscriber, #13312)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds