Doesn't Ext4 rely on a generic block layer? Isn't that going to significantly affect the reliability and life of the flash?
Ext4 filesystem hits Android, no need to fear data loss (ars technica)
Posted Dec 28, 2010 2:33 UTC (Tue) by rich0 (guest, #55509)
I was amused by the fsync banter. The last thing I want on a phone with a slow flash drive is every other function calling fsync. When you write data it should either end up being written, or not end up being written - messing up data already in storage without writing new data should not be a typical failure mode. I don't care if my DVR successfully appended 25kB to my growing 784MB recording file when the power fails, but I don't want to lose half the recording over a failure to call fsync, or lose recordings on a different tuner because the disk is busy churning because three different threads are having to fsync every 5kb write to the drive (whether fsync is implemented properly or not it kills seek performance).
Posted Dec 28, 2010 3:38 UTC (Tue) by Nick (guest, #15060)
If you have a volatile writeback cache, the app needs to specify at what point it expects the data to be durable. If every other function requires data to be durable, then every other function needs to call fsync. If you don't want that, then you can't use a writeback cache.
Unrelated files will not be corrupted or lost if fsync is not used (it's a perfectly valid mode to leave maximum amount of dirty data in writeback cache).
If the disk is not bandwidth constrained, it will write back data older than a particular time. For guarantees though, fsync should be used. There wouldn't be much problem in fsyncing every 2 minutes of video to ensure no more than that is lost after power failure.
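The policy Nick describes — accept a bounded loss window rather than fsync on every write — can be sketched in a few lines. This is a hypothetical helper, not anything from the thread; the function name and the default interval are my assumptions:

```python
import os
import time

def append_with_periodic_fsync(path, chunks, sync_every=120):
    """Append chunks to a file, forcing them to stable storage at most
    every `sync_every` seconds, so a crash loses a bounded window of data."""
    last_sync = time.monotonic()
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
            if time.monotonic() - last_sync >= sync_every:
                f.flush()             # push the userspace buffer to the kernel
                os.fsync(f.fileno())  # ask the kernel to commit it to media
                last_sync = time.monotonic()
        f.flush()
        os.fsync(f.fileno())          # make the final chunk durable too
```

With `sync_every=120`, a power failure costs at most about two minutes of recording, while the common path stays entirely in the writeback cache.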
Posted Dec 28, 2010 13:42 UTC (Tue) by butlerm (subscriber, #13312)
Posted Dec 28, 2010 14:01 UTC (Tue) by mjg59 (subscriber, #23239)
That's true, but the fact that it's POSIXly ok to transform the "Write to temporary file, rename over original" sequence into "rename over original, then commit new data to disk" means that the definition of "unrelated" is not always what application developers expect. The question is whether it's expected that Linux filesystems give stronger guarantees than POSIX, and I think the outcome of previous discussions means that (at least as far as general purpose filesystems go) the answer is yes.
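The "write to temporary file, rename over original" sequence mjg59 mentions is usually written with an explicit fsync between the write and the rename, so the rename can never be committed ahead of the data. A minimal sketch (the helper name is mine, not from the thread):

```python
import os

def atomic_replace(path, data):
    """Write data under a temporary name, fsync it, then rename it over
    the original, so readers see either the old or the new contents."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # new contents are on disk before the rename...
    finally:
        os.close(fd)
    os.rename(tmp, path)          # ...so the rename can't expose garbage
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the rename itself durable
    finally:
        os.close(dfd)
```

Without the first fsync, POSIX permits exactly the reordering mjg59 describes: the rename hits disk, the data doesn't, and a crash leaves a zero-length or garbage file under the original name.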
Posted Dec 28, 2010 17:59 UTC (Tue) by iabervon (subscriber, #722)
fsync() is kind of weird in that, if the system never crashes, it has no visible effect; and if the system ever crashes, it might not come back at all or it might do something random that happens to undo all the effects of fsync().
Posted Dec 28, 2010 19:46 UTC (Tue) by Trelane (subscriber, #56877)
Posted Dec 28, 2010 20:46 UTC (Tue) by quotemstr (subscriber, #45331)
To respond to your particular comment, the issue is really that we need a write barrier, while fsync specifies a full flush, a much stronger and unnecessary operation.
Posted Dec 28, 2010 22:55 UTC (Tue) by neilbrown (subscriber, #359)
fsync isn't a "full flush" (except in ext3), it is a controlled flush of a single file. "sync" is "full flush".
What you really need is dependencies. Journalling filesystems use dependencies a lot to make sure that things get written in the "right" order. They submit lots of antecedent writes, then a flush, then the dependent writes. e.g. metadata-to-journal, journal-commit-block, metadata-to-filesystem.
If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.
Despite comments above and elsewhere, I don't think write/close/rename is guaranteed to provide atomicity in Linux. Ted Ts'o recently wrote:
> The implementors of a number of mainstream file systems (i.e., ext4 btrfs, XFS) have agreed to do the equivalent of #1 (i.e., initiating writeback, but not necessarily waiting for the writeback to complete) in the case of a rename that replaces an existing file.
As there is no wait or dependency, there is no guarantee (as I read it).
Posted Dec 29, 2010 2:09 UTC (Wed) by quotemstr (subscriber, #45331)
This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.
As for citing Ts'o: his views on the subject are widely known; that doesn't make them correct. There is nothing wrong with an application relying on transactional rename semantics, and asking userspace not to use reasonable, convenient behavior that's worked for years is simply not a tenable position.
Posted Dec 29, 2010 3:01 UTC (Wed) by neilbrown (subscriber, #359)
2/ The transactional rename semantics that ext3 provided were really a bug (or mis-feature, or short-sighted design, or whatever). I think it was only ext3 that provided it; certainly not ext2, probably not XFS, not sure about reiserfs. Any userspace that relies on it (without checking that the filesystem in use is ext3) is buggy.
That said - I think that it would be good if Linux did provide some transactional guarantees that didn't require fsync and that user-space could rely on across all filesystems. I suspect that enhancing rename would be a good start and could be done with little or no performance cost. It might even be good to provide something more explicit. But I don't think that 'barriers' is the right model.
Posted Dec 29, 2010 4:45 UTC (Wed) by iabervon (subscriber, #722)
Posted Dec 29, 2010 8:59 UTC (Wed) by neilbrown (subscriber, #359)
By your argument, fsync will not be used consistently, and that appears to be true to some extent.
The implied conclusion seems to be that all filesystems should be mounted "-o sync" so that fsync is not needed. Strangely, I have not heard that conclusion being proposed explicitly....
I certainly agree that interfaces should be hard to misuse (the Rusty principle) but that must be balanced against the dictum against making things fool proof as then only a fool would use them. In this case, while it is quite possible to misuse fsync (e.g. by not using it), it really is an appropriate interface.
And while regression test suites are incredibly valuable and there should be more of them etc etc, one hopes that developers don't depend on them as the sole means of ensuring correctness, but also read documentation, try to understand the systems they work with, and write code accordingly. (one also hopes for a pony.... but no, only a piece of coal in my stocking again)
Syncing has significant performance penalties.
Posted Dec 29, 2010 14:26 UTC (Wed) by gmatht (guest, #58961)
Apparently xsyncfs provides a system that is synchronous from the user's point of view with an overhead of only 3%-7%. That might be an acceptable default. Xsyncfs doesn't seem very well documented, but it is just one avenue of providing good reliability properties without the massive performance penalties of fsync/-o sync.
Posted Dec 29, 2010 17:42 UTC (Wed) by iabervon (subscriber, #722)
I believe that the model that most people have of filesystems is that what's recovered after a system crash is like a snapshot of the filesystem that you would see in a running system if you were taking the snapshot with ordinary system calls and could therefore see all the race conditions you can see between programs; however, there is arbitrary random damage because the system crashed, and the latest snapshot may not be particularly recent.
With this model, fsync is easy to (know to) use in cases where you want to make sure that the snapshot is sufficiently recent, but not for cases where it is necessary to avoid the recovered state being something that couldn't have been a snapshot.
Posted Dec 30, 2010 0:14 UTC (Thu) by neilbrown (subscriber, #359)
That is an overly naive model of a filesystem. It assumes almost complete linearisation of operations on their way to storage. Any re-ordering in the page cache before writeback or in the device queue via an IO scheduler will invalidate that model, and as you can imagine such re-ordering happens a lot.
The correct model is "nothing is safe until you call sync or fsync or some other variant", with the understanding that 'sync' is effectively called every 30 seconds or so.
I'm glad it is obvious that you need to call fsync (on both the file and the directory you created the file in) before acknowledging the receipt of a file (e.g. an email) over a network connection.
However exactly the same is true when moving a file by copying it. If you copy a file (possibly transforming it on the way) and then remove the original, you really must fsync the new copy before unlinking the old. You should also fsync the directory, though if you rename the new (after fsyncing it) to replace the old, then the fsync of the directory is not required.
Note that "mv" doesn't do the fsync when moving a file between filesystems (which requires a copy/unlink). So if you use mv and then crash you could quite possibly lose both copies. And mv doesn't even have an option to request the fsync.
Now you might suggest that this should "just work" without mv needing to call fsync. But I think you would find it quite difficult to design the filesystem semantics that would allow this to always be safe, especially as you need interaction between two separate filesystems (unlink in one must not commit until writes in the other have committed). ... other than mounting everything with '-o sync' of course.
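The copy-then-unlink discipline neilbrown describes can be sketched as follows — fsync the new copy (and its directory) before removing the original. The helper name is hypothetical; as he notes, `mv` itself does not do this:

```python
import os
import shutil

def safe_move(src, dst):
    """Copy src to dst, make the copy durable, and only then unlink the
    original, so a crash cannot lose both copies at once."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())          # the new copy is on disk first
    dfd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dfd)                    # the directory entry for dst too
    finally:
        os.close(dfd)
    os.unlink(src)                       # now it is safe to drop the original
```

Reverse the order — unlink before the fsyncs — and a crash in between can leave you with neither file, which is exactly the failure mode described above.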
Posted Dec 30, 2010 1:33 UTC (Thu) by iabervon (subscriber, #722)
There's also not been that long a stretch in the UNIX tradition when you could be reasonably confident that a power failure shortly after you changed something in a directory wouldn't trash other things in the directory, making it kind of irrelevant whether you'd called fsync on the directory to make sure that the disk was correct before it got corrupted.
In general, there's a tradeoff among filesystem complexity, slowness, and deviation from non-crash state. None of these go to zero without making the others terrible, even if you call sync all the time.
(In fact, my model does require an fsync when moving a file by copying it, at least across directories; the snapshotting process could read the destination directory before you write the file and the source directory after you unlink it.)
Posted Jan 1, 2011 18:54 UTC (Sat) by butlerm (subscriber, #13312)
Posted Dec 28, 2010 21:44 UTC (Tue) by iabervon (subscriber, #722)
Posted Jan 1, 2011 19:22 UTC (Sat) by butlerm (subscriber, #13312)
Posted Dec 28, 2010 16:06 UTC (Tue) by man_ls (subscriber, #15091)
The funny part about Ts'o's fsync obsession is that he considers fsync to be a requirement for good filesystem programming. Instead of solving the atomicity problems and shutting up, he insists on the use of fsync:
it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.
Posted Dec 28, 2010 17:29 UTC (Tue) by mjg59 (subscriber, #23239)
Posted Dec 29, 2010 6:42 UTC (Wed) by Nick (guest, #15060)
It's true that the POSIX filesystem API is not trivial to write a crash-recoverable protocol on top of, when you are talking about multiple files and directory structure, but it's not rocket science. I think a lot of userspace developers believe that it is terribly difficult and non-performant to implement in their apps, but that it would somehow become free of complexity and cost if implemented in the kernel.
Not only would it not be free, but it would add a lot of complexity and divergence (between fses, other OSes) to a layer that is currently quite simple and used by all sorts of apps that do not need such semantics.
I'm not saying it would be a useless feature. But I have not seen anywhere somebody show code and numbers showing that it is compelling, over alternatives. Alternatives include implementing your own atomicity/cleanup protocol on the posix API (which really isn't too hard, and can absolutely include asynchronous writeback until durability is required), or using sqlite or bdb or something to either store all your data or at least store a log of the state of your file/directory structure. And mind you, the alternatives are very well tested and will work on all OSes.
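One of the alternatives Nick mentions — delegating the crash-recovery protocol to SQLite — looks like this in practice. The schema and function name are illustrative only; the point is that a committed transaction is atomic, and durable under SQLite's default synchronous settings:

```python
import sqlite3

def store_durably(db_path, key, value):
    """Let SQLite handle crash recovery: a committed transaction is
    atomic, and durable with the default synchronous/journal settings."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
                (key, value))
    finally:
        conn.close()
```

After a crash, either the old or the new value is present — never a torn mixture — and the application never touches fsync, rename ordering, or directory syncs itself.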
This is funny...
Posted Dec 29, 2010 8:47 UTC (Wed) by khim (subscriber, #9252)
The most popular SQL database in the most popular mode only offers atomicity but not durability.
The fact that it remains the most popular despite that says volumes about what people need (as opposed to want).
People ask for durability when they need atomicity - but that's because, as you've said, developers usually don't know what they need (they usually know what they want, but that's a different thing).
Posted Dec 30, 2010 12:16 UTC (Thu) by Nick (guest, #15060)
But the article didn't really mention durability. It mentioned lack of integrity, whereas durability seems like it was implied ("MyISAM tables effectively always operate in autocommit = 1 mode").
You can definitely turn off fsync on RDBMS which I think is used to do bulk populates of a new database. But there is definitely that point of populating the thing to transitioning to live data when you need a durable point too. But this is because crash recovery at this point is trivial (recreate the db) compared to the larger cost of fsync.
Posted Dec 29, 2010 13:04 UTC (Wed) by man_ls (subscriber, #15091)
I was probably not clear enough when speaking about "developers", since they are not the determining force here but merely a vehicle for users' needs. Users want robust filesystems which do not corrupt their data; application developers just want to do their stuff without being bothered by the underlying filesystem. It is the filesystem developers who need to provide for those seemingly disjoint requirements, by providing atomicity always (again, merely a vehicle for robust filesystems) and durability sometimes (only when asked for explicitly).
If POSIX does not require atomicity in renames and appends, are filesystem developers then free to corrupt user files and still claim to be POSIX-compliant? Ask the XFS developers, who did corrupt files once in a while for a few years and lost most of their users in the process, what good this compliance does. If ext4 did this then the whole Linux ecosystem would suffer in the process, as it is the default filesystem.
Implementing filesystems on a database-like layer is what Microsoft Longhorn set out to do; it is one of the reasons why it was delayed some six years, and also why the final Windows Vista was so utterly bad: a lot of resources were spent on what was, simply put, bad engineering. I would have said "in retrospect", but for many people it was obviously a mistake from the beginning. There have been a few other contenders in the filesystem-on-a-database category and IIRC all failed.
Posted Dec 30, 2010 12:07 UTC (Thu) by Nick (guest, #15060)
Write the file (to a new name, obviously, don't overwrite the old one until the new one is durable), and optionally kick off an asynchronous writeback of that file. At some point in future when durability is required (eg. if you guarantee some window or interval on your backed up data), then you would fsync (which should be mostly noop if the asynchronous writeback already completed). And then perform some checkpoint such as a lock file or sqlite transaction or something that indicates the file is now durable.
It's really not a big deal to implement some simple transactional-like recovery protocol on top of the POSIX API, and it has been done in these ways for a long time. It seems like a new wave of programmers has just forgotten about this.
Or, if you want to do slow, synchronous operations without blocking the whole app, that's trivial too -- just use processes or threads to do those slow operations.
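Nick's protocol can be sketched roughly as follows. Kicking off writeback from a thread stands in for interfaces like sync_file_range that aren't portably exposed; everything here, names included, is an illustrative assumption:

```python
import os
import threading

def write_then_checkpoint(path, data, checkpoint):
    """Write under a new name, start writeback in the background, and
    defer fsync + checkpoint until durability is actually required."""
    tmp = path + ".new"          # never overwrite the old copy in place
    with open(tmp, "wb") as f:
        f.write(data)
        fd = os.dup(f.fileno())  # keep a handle past the close
    # Kick off asynchronous writeback without blocking the caller.
    t = threading.Thread(target=os.fsync, args=(fd,))
    t.start()

    def make_durable():
        t.join()                 # mostly a no-op if writeback already ran
        os.close(fd)
        os.rename(tmp, path)     # atomically replace the old copy
        checkpoint()             # e.g. a lock file or an sqlite row
    return make_durable
```

The caller invokes the returned `make_durable` only at its chosen durability point; until then the old copy stays intact and recovery after a crash is just "delete the `.new` file".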
Posted Dec 30, 2010 13:43 UTC (Thu) by man_ls (subscriber, #15091)
You can paper over these inefficiencies with new APIs; you can require all developers to use them before you make guarantees about not corrupting data; or you can force them to use threads or processes. But those are not very elegant ways to deal with the problem. I prefer to have a filesystem which silently deals with the problem without corrupting my data, like we did with ext3 in the good old times.
But once again, please read the original articles where these issues were truly beaten to death.
Posted Dec 31, 2010 1:00 UTC (Fri) by Nick (guest, #15060)
If there is a crash, then you don't have to bother with atomicity, just delete the files (which is possible because you didn't require durability).
These aren't inefficiencies, they are actually efficiencies, because they expose an API that is simple and efficient to implement for the memory management and filesystems. You can build arbitrarily more complex protocols on top of that.
And yet again here we are
Posted Dec 31, 2010 14:08 UTC (Fri) by man_ls (subscriber, #15091)
Apparently the atomic rename is too confusing, or I am not explaining the use case clearly. Think about atomic appends to files then. When you append some sectors atomically, you want either the old file without the new sectors or the file with the new sectors at the end. What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.
An atomic append would guarantee that the new sectors would either be present with the expected contents or not present at all, but never contain random garbage. We don't need an fsync after every sector, nor would it solve the problem: after the write but before the fsync the filesystem would be in an incoherent state anyway, albeit for a short time.
Posted Jan 5, 2011 12:03 UTC (Wed) by nye (guest, #51576)
To be fair there was more wrong with XFS than that - it could also corrupt *entirely unrelated* files. Personally I stopped using it when a power cut trashed /etc/passwd (and presumably a load of other files that were less obvious) despite nothing having it open at the time.
Posted Dec 31, 2010 16:08 UTC (Fri) by etienne (subscriber, #25256)
Posted Dec 31, 2010 16:21 UTC (Fri) by man_ls (subscriber, #15091)
Posted Dec 31, 2010 19:05 UTC (Fri) by etienne (subscriber, #25256)
Well, for decades you did not have hard drives with internal command queueing.
To have better performance you need to keep the queue full.
Because you cannot tell the hard drive that updating this sector is more important than that one, that information is probably not managed at all in the driver.
Moreover, you asked for this behaviour by wanting only metadata journaling of the filesystem, explicitly wanting a coherent filesystem (i.e. no fsck after crash) even if it means data inside files may be corrupted.
You can run with data journaling, but people/distributions think it is not worth the performance hit.
Once again, reenacting history
Posted Dec 31, 2010 20:21 UTC (Fri) by man_ls (subscriber, #15091)
The funny thing is that XFS developers eventually realized their folly and solved the atomicity issues, but now people don't trust them with their data anymore.
Posted Dec 31, 2010 22:14 UTC (Fri) by neilbrown (subscriber, #359)
which can easily leave nothing called "file". This is all that "atomic rename" means, or at least all it meant before ext3 gave rename unfortunate (though useful) semantics.
Though I cannot know the intention of the author of that post you linked to, there is no prima facie reason to believe they mean anything more than the atomicity of names (not of contents) that rename has always had in Unix.
(and half-written files are easy to detect by writing a checksum at the end. If you suffix each file with a timestamp it is easy to know which is the most recent. And files older than a few minutes will be safe on disk, so you are always free to clean up any file older than the youngest file that is older than a few minutes)
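The trailing-checksum trick mentioned above is easy to demonstrate. A sketch using SHA-256 (the choice of hash and the function names are mine):

```python
import hashlib
import os

def write_with_checksum(path, payload):
    """Append a SHA-256 digest so a reader can detect a torn write."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(hashlib.sha256(payload).digest())
        f.flush()
        os.fsync(f.fileno())

def read_if_complete(path):
    """Return the payload only if the trailing checksum matches."""
    with open(path, "rb") as f:
        data = f.read()
    payload, digest = data[:-32], data[-32:]
    if hashlib.sha256(payload).digest() == digest:
        return payload
    return None   # half-written: treat the file as absent
```

A crash mid-write leaves a file whose digest does not match, so recovery code can simply discard it and fall back to the previous (timestamped) copy.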
Posted Dec 28, 2010 3:49 UTC (Tue) by mjg59 (subscriber, #23239)
Posted Dec 28, 2010 12:37 UTC (Tue) by drag (subscriber, #31333)
Looks like this is guaranteed behavior on Linux now. :)
Well, it's quite an explicit decision...
Posted Dec 28, 2010 13:37 UTC (Tue) by khim (subscriber, #9252)
This has been true for more than a year: see here. Filesystem developers don't like to support such semantics, but they all agree that "the application developers outnumber us", so it's simpler to explicitly add such a guarantee rather than try to educate said application developers.
Posted Dec 28, 2010 17:23 UTC (Tue) by Wol (guest, #4433)
Plus the fact that on some fs types you need the fsync and it's cheap, on others you don't need it and it's costly. The app should not need to modify its behaviour depending on which file system it's running on.
Posted Dec 28, 2010 8:33 UTC (Tue) by timokokk (subscriber, #52029)
Yes, ext4 relies on a generic block layer and it will not work on top of an mtd device. And no, it does not affect the reliability and life of the flash. If the device provides a generic block device interface, then it will also hide the raw flash layer and take care of the wear leveling for you. For example, eMMC memories work like that.
If you only have a raw flash device that does not hide the characteristics of the flash media, you will also need to use a file system that is designed to work with such media (JFFS2, UBIFS, whatever..).
Posted Dec 28, 2010 12:39 UTC (Tue) by drag (subscriber, #31333)
So they often don't offer access to raw flash even if you wanted to use MTD directly.
Posted Dec 28, 2010 17:36 UTC (Tue) by mikov (subscriber, #33179)
I doubt that the builtin SD controllers can perform very sophisticated wear leveling. It is an extremely difficult problem.
Posted Dec 28, 2010 14:37 UTC (Tue) by pflugstad (subscriber, #224)
The problem with this is exactly that, however - you're depending on the quality of the eMMC firmware, which can be dodgy. Look at the problems SSD drives have had with falling off a performance cliff. Also, we've had issues with eMMC reliability from some vendors. I don't know if you can update that firmware on the fly or not.
We're to the point of putting bare NAND back on our boards and using UBIFS instead.
Posted Dec 28, 2010 18:14 UTC (Tue) by mikov (subscriber, #33179)
Posted Dec 28, 2010 20:42 UTC (Tue) by robert_s (subscriber, #42402)
Not necessarily. You could always be using a software MTD-to-block translation/wear-levelling driver. I think a toy one is even shipped with the MTD layer.
Not sure how likely this solution is though.
Posted Dec 28, 2010 10:06 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Recent Samsung phones have wear-levelling in the hardware (they present their flash drives as block devices), so ext4 is perfectly OK.
Posted Dec 28, 2010 11:09 UTC (Tue) by mikov (subscriber, #33179)
Posted Dec 28, 2010 11:18 UTC (Tue) by Los__D (guest, #15263)
Posted Dec 28, 2010 17:34 UTC (Tue) by mikov (subscriber, #33179)
I have had horrible experiences with CompactFlash cards, even expensive ones from respected vendors.
I think few people realize how truly difficult it is to do good wear leveling. Even reads cause wear! Plus, the obvious "solutions" have a write amplification factor of about 30...
This problem has been largely ignored recently to the extent that it isn't even mentioned at all in supposedly technical articles.
Posted Dec 29, 2010 4:15 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246)
I think few people realize how truly difficult it is to do good wear leveling. Even reads cause wear!
Really? I've never heard that reads cause any appreciable wear. Can you share a reference?
Posted Dec 29, 2010 15:14 UTC (Wed) by busterb (subscriber, #560)
Posted Dec 29, 2010 20:16 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246)
atime updates are writes (even though the app only does a read). The comment I was replying to seemed to suggest that pure reads wear out flash, i.e. that if I mounted a volume read-only, I could shorten its life dramatically by reading it regularly.
While I'm sure there's some impact to reading a flash cell, I've got to believe it's a few orders of magnitude smaller than the effect due to writes, and therefore generally ignorable.
Posted Dec 28, 2010 12:28 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
There IS a specialized controller. However, it doesn't need a lot of RAM and a fast CPU, because it can offload some of the tasks on the main CPU.
Posted Dec 28, 2010 17:38 UTC (Tue) by mikov (subscriber, #33179)
Posted Dec 29, 2010 0:18 UTC (Wed) by swetland (subscriber, #63414)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds