Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Ars technica reports that Google's new Nexus S smartphone will be the first Android device to use the Ext4 filesystem. "Most Android devices currently use YAFFS, a lightweight filesystem that is optimized for flash storage and is commonly used in mobile and embedded devices. The problem with YAFFS, [Ted] Ts'o explained in his blog entry, is that it is single-threaded and would likely 'have been a bottleneck on dual-core systems.' Concurrency will be important on next-generation Android devices that use multi-core ARM processors. We expect to see dual-core Android devices, including tablets, announced as early as CES."

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 2:19 UTC (Tue) by mikov (subscriber, #33179) [Link]

I posted this question on Ars, but I think I have a better chance of getting a good answer here:

Doesn't Ext4 rely on a generic block layer? Isn't that going to significantly affect the reliability and life of the flash?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 2:33 UTC (Tue) by rich0 (guest, #55509) [Link]

I suspect that wearing out the flash earlier so that you have to buy a new phone is considered a feature, not a bug. These are the guys who stop releasing security patches for phones that contain web browsers less than a year after they have been sold to people with 2-year contracts.

I was amused by the fsync banter. The last thing I want on a phone with a slow flash drive is every other function calling fsync. When you write data it should either end up being written, or not end up being written - messing up data already in storage without writing new data should not be a typical failure mode. I don't care if my DVR successfully appended 25kB to my growing 784MB recording file when the power fails, but I don't want to lose half the recording over a failure to call fsync, or lose recordings on a different tuner because the disk is busy churning because three different threads are having to fsync every 5kB write to the drive (whether fsync is implemented properly or not it kills seek performance).

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 3:38 UTC (Tue) by Nick (guest, #15060) [Link]

What's difficult about fsync?

If you have a volatile writeback cache, the app needs to specify at what point it expects the data to be durable. If every other function requires data to be durable, then every other function needs to call fsync. If you don't want that, then you can't use a writeback cache.

Unrelated files will not be corrupted or lost if fsync is not used (leaving the maximum amount of dirty data in the writeback cache is a perfectly valid mode).

If the disk is not bandwidth constrained, it will write back data older than a particular time. For guarantees though, fsync should be used. There wouldn't be much problem in fsyncing every 2 minutes of video to ensure no more than that is lost after power failure.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 13:42 UTC (Tue) by butlerm (subscriber, #13312) [Link]

> What's difficult about fsync?

The problem is that fsync is a much slower, more heavyweight operation than is actually needed in most cases. Usually what you need on recovery is consistency, not durability.

Rename-replace points are an unusually convenient place to provide consistency. The FS just needs to go a little out of its way to make sure that upon recovery after a rename-replace you either get the old version or the new version of the file being replaced.

To be sure, the quick and dirty way of providing consistent rename replace is for the FS to force the new version to permanent storage before committing the rename transaction. That is not really necessary though. You can also create rename undo records and on recovery roll back to the last completely written version. No sync serialization required, let alone the user visible thumb twiddling variety.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 14:01 UTC (Tue) by mjg59 (subscriber, #23239) [Link]

"Unrelated files will not be corrupted or lost if fsync is not used"

That's true, but the fact that it's POSIXly ok to transform the "Write to temporary file, rename over original" sequence into "rename over original, then commit new data to disk" means that the definition of "unrelated" is not always what application developers expect. The question is whether it's expected that Linux filesystems give stronger guarantees than POSIX, and I think the outcome of previous discussions means that (at least as far as general purpose filesystems go) the answer is yes.
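
For readers who have not seen it spelled out, the "write to temporary file, rename over original" sequence might look roughly like the Python sketch below. The function name `atomic_replace`, the `durable` flag and the temporary-file prefix are assumptions made for illustration, not anything from this thread. With `durable=False` it is exactly the pattern whose crash behaviour is being debated here; with `durable=True` it adds the fsync calls that a strict reading of POSIX requires.

```python
import os
import tempfile


def atomic_replace(path, data, durable=True):
    """Replace `path` with `data` via write-to-temp-then-rename.

    With durable=True the new contents are fsynced before the rename, so a
    crash should leave either the old file or the complete new one.
    With durable=False the outcome after a crash depends on the filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # within one filesystem. Note that mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if durable:
        # Flush the directory entry too (works on Linux; assumed elsewhere).
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A call such as `atomic_replace("settings.conf", b"key=value\n")` would then either leave the previous `settings.conf` in place or install the complete new one.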

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 17:59 UTC (Tue) by iabervon (subscriber, #722) [Link]

POSIX doesn't make any guarantees at all about what happens when the system stops and starts again, so it would be hard for Linux to be useful and not make stronger guarantees. Or, rather, Linux filesystems don't guarantee anything more (the reason your system crashes could conceivably be that your video driver has started sending data to your hard drive instead of your video card, entirely wiping out your storage), but the behavior that Linux rarely diverges from is substantially better than POSIX requires.

fsync() is kind of weird in that, if the system never crashes, it has no visible effect; and if the system ever crashes, it might not come back at all or it might do something random that happens to undo all the effects of fsync().

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 19:46 UTC (Tue) by Trelane (guest, #56877) [Link]

So it sounds like fsync would work except that it flushes data and metadata, thereby causing performance issues, yes? If so, a nice solution may be to introduce a metadata-only sync?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 20:46 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

*sigh* Not this argument again. Go reread the old threads: LWN ran carefully balanced coverage of the Great FSync War.

To respond to your particular comment, the issue is really that we need a write barrier, while fsync specifies a full flush, a much stronger and unnecessary operation.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 22:55 UTC (Tue) by neilbrown (subscriber, #359) [Link]

This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.

fsync isn't a "full flush" (except in ext3), it is a controlled flush of a single file. "sync" is "full flush".

What you really need is dependencies. Journalling filesystems use dependencies a lot to make sure that things get written in the "right" order. They submit lots of antecedent writes, then a flush, then the dependent writes. e.g. metadata-to-journal, journal-commit-block, metadata-to-filesystem.

If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.

Despite comments above and elsewhere, I don't think write/close/rename is guaranteed to provide atomicity in Linux. Ted Ts'o recently wrote:
(http://www.spinics.net/linux/lists/linux-ext4/msg22395.html)

> The implementors of a number of mainstream file systems (i.e., ext4 btrfs, XFS) have agreed to do the equivalent of #1 (i.e., initiating writeback, but not necessarily waiting for the writeback to complete) in the case of a rename that replaces an existing file.

As there is no wait or dependency, there is no guarantee (as I read it).

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 2:09 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

> This observation is somewhat amusing in light of the fact that barriers in the Linux FS/Block layer were recently replaced with controlled flushing.

And the second part of that plan is to use threads and internal queues to achieve the same overall effect; that's not really an option for userspace.

As for citing Ts'o: his views on the subject are widely known; that doesn't make them correct. There is nothing wrong with an application relying on transactional rename semantics, and asking userspace to not use reasonable, convenient behavior that's worked for years is simply not a tenable position.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 3:01 UTC (Wed) by neilbrown (subscriber, #359) [Link]

1/ I wasn't citing Ts'o's opinion. I was citing his report on a meeting of several filesystem developers. He wasn't saying what "should" be. He was saying what "is".

2/ The transactional rename semantics that ext3 provided were really a bug (or mis-feature or short-sighted design or whatever). I think it was only ext3 that provided it, certainly not ext2, probably not XFS, not sure about reiserfs. Any userspace that relies on it (without checking that the filesystem in use is ext3) is buggy.

That said - I think that it would be good if Linux did provide some transactional guarantees that didn't require fsync and that user-space could rely on across all filesystems. I suspect that enhancing rename would be a good start and could be done with little or no performance cost. It might even be good to provide something more explicit. But I don't think that 'barriers' is the right model.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 4:45 UTC (Wed) by iabervon (subscriber, #722) [Link]

The thing is that transactional rename semantics works for all POSIX-compliant filesystems when your computer doesn't crash. Anything that's only needed if your computer crashes isn't going to be used consistently, because developers' computers crash so rarely and it's awkward to include in an application regression test suite.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 8:59 UTC (Wed) by neilbrown (subscriber, #359) [Link]

fsync is only needed if your computer crashes. Not using it works perfectly when your computer doesn't crash.

By your argument, fsync will not be used consistently, and that appears to be true to some extent.

The implied conclusion seems to be that all filesystems should be mounted "-o sync" so that fsync is not needed. Strangely, I have not heard that conclusion being proposed explicitly....

I certainly agree that interfaces should be hard to misuse (the Rusty principle) but that must be balanced against the dictum against making things fool proof as then only a fool would use them. In this case, while it is quite possible to misuse fsync (e.g. by not using it), it really is an appropriate interface.

And while regression test suites are incredibly valuable and there should be more of them etc etc, one hopes that developers don't depend on them as the sole means of ensuring correctness, but also read documentation, try to understand the systems they work with, and write code accordingly. (one also hopes for a pony.... but no, only a piece of coal in my stocking again)

Syncing has significant performance penalties.

Posted Dec 29, 2010 14:26 UTC (Wed) by gmatht (subscriber, #58961) [Link]

The applications that actually really need durability of data written in the last few seconds don't seem to be that common... one might hope that the people who wrote the database your bank uses would know about fsync. Even on ext4 a trivial fsync takes 50ms[1], which is forever in computer time, much more even than the ~0.1ms for creating new files, but losing everything because the filesystem decided to write out the metadata before the data (violating atomicity) isn't that great either.

Apparently xsyncfs provides a system that is synchronous from the user's point of view with an overhead of only 3%-7%. That might be an acceptable default. Xsyncfs doesn't seem very well documented, but is just one avenue of providing good reliability properties without the massive performance penalties from fsync/-o sync.

[1] www.ucc.asn.au/~mccabedj/fsync_benchmark.c
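
The cost of a trivial fsync mentioned above can be measured roughly with a sketch like the following. This is not the benchmark linked at [1], just a minimal Python approximation; the function name `time_small_writes` and its parameters are made up for the example. The numbers depend heavily on the storage and filesystem, and a temporary directory backed by tmpfs will make fsync look almost free.

```python
import os
import tempfile
import time


def time_small_writes(count=20, use_fsync=True):
    """Return the average seconds per 1-byte write, optionally fsynced."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, b"x")
            if use_fsync:
                os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / count


if __name__ == "__main__":
    print("with fsync:    %.3f ms" % (time_small_writes(use_fsync=True) * 1e3))
    print("without fsync: %.3f ms" % (time_small_writes(use_fsync=False) * 1e3))
```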

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 17:42 UTC (Wed) by iabervon (subscriber, #722) [Link]

Using fsync only really makes sense if you're trying to get stuff written to disk before sending a message out of the system; otherwise, it won't be possible to tell whether the fsync didn't actually do anything or the system crashed before it returned. So you need to use fsync after writing a received email message to disk and before telling the remote server that you've got it.

I believe that the model that most people have of filesystems is that what's recovered after a system crash is like a snapshot of the filesystem that you would see in a running system if you were taking the snapshot with ordinary system calls and could therefore see all the race conditions you can see between programs; however, there is arbitrary random damage because the system crashed, and the latest snapshot may not be particularly recent.

With this model, fsync is easy to (know to) use in cases where you want to make sure that the snapshot is sufficiently recent, but not for cases where it is necessary to avoid the recovered state being something that couldn't have been a snapshot.
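
The mail example above (make the message durable, then tell the sender) could be sketched as follows. The names `store_then_ack` and `send_ack` are hypothetical, and the extra fsync of the containing directory is an assumption about what is needed for the new name itself to survive a crash on Linux.

```python
import os


def store_then_ack(directory, name, payload, send_ack):
    """Write `payload` to directory/name, make it durable, then acknowledge."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # file contents and inode
    finally:
        os.close(fd)
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # the new directory entry
    finally:
        os.close(dir_fd)
    send_ack()  # only now tell the sender that we have the message
    return path
```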

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 30, 2010 0:14 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> I believe that the model that most people have of filesystems is that ...

That is an overly naive model of a filesystem. It assumes almost completely linearisation of operations on their way to storage. Any re-ordering in the page cache before writeback or in the device queue via an IO scheduler will invalidate that model, and as you can imagine such re-ordering happens a lot.

The correct model is "nothing is safe until you call sync or fsync or some other variant", with the understanding that 'sync' is effectively called every 30 seconds or so.

I'm glad it is obvious that you need to call fsync (on both the file and the directory you created the file in) before acknowledging the receipt of a file (e.g. an email) over a network connection.

However exactly the same is true when moving a file by copying it. If you copy a file (possibly transforming it on the way) and then remove the original you really must fsync the new copy before unlinking the old. You should also fsync the directory, though if you rename the new (after fsyncing it) to replace the old, then the fsync of the directory is not required.

Note that "mv" doesn't do the fsync when moving a file between filesystems (which requires a copy/unlink). So if you use mv and then crash you could quite possibly lose both copies. And mv doesn't even have an option to request the fsync.

Now you might suggest that this should "just work" without mv needing to call fsync. But I think you would find it quite difficult to design the filesystem semantics that would allow this to always be safe, especially as you need interaction between two separate filesystems (unlink in one must not commit until writes in the other have committed). ... other than mounting everything with '-o sync' of course.
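
As an illustration of the copy/fsync/unlink ordering described above, here is a minimal Python sketch. The names `fsync_path` and `durable_move` are invented for the example, and it assumes that calling fsync on a read-only descriptor is accepted, which is the case on Linux.

```python
import os
import shutil


def fsync_path(path):
    """fsync a file or directory by path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_move(src, dst):
    """Move src to dst by copying, syncing the copy before unlinking src."""
    shutil.copyfile(src, dst)
    fsync_path(dst)                                     # the new copy's data
    fsync_path(os.path.dirname(os.path.abspath(dst)))   # its directory entry
    os.unlink(src)                                      # only now drop the original
```

If the system crashes before the `os.unlink`, the original is still present; if it crashes afterwards, the copy has already been flushed.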

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 30, 2010 1:33 UTC (Thu) by iabervon (subscriber, #722) [Link]

My model can't really fail to be accurate, since it includes the possibility of arbitrary deviations from the predicted outcome. And, actually, nothing is safe at all; your storage medium might fail, your video driver might scribble over your disk or your dirty pages, your hard drive might read garbage out of memory that is losing power and write it with the power left in its capacitors. I actually suspect that, based on the model I stated, a more common and more extensive source of differences from some potential snapshot is things that syncing couldn't have helped with than things that syncing could have helped with (with the exception of ext4 having a particularly common and obvious divergence).

There's also been not that long in the UNIX tradition when you could be reasonably confident that a power failure shortly after you changed something in a directory wouldn't trash other things in the directory, making it kind of irrelevant whether you'd called fsync on the directory to make sure that the disk was correct before it got corrupted.

In general, there's a tradeoff among filesystem complexity, slowness, and deviation from non-crash state. None of these go to zero without making the others terrible, even if you call sync all the time.

(In fact, my model does require an fsync when moving a file by copying it, at least across directories; the snapshotting process could read the destination directory before you write the file and the source directory after you unlink it.)

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Jan 1, 2011 18:54 UTC (Sat) by butlerm (subscriber, #13312) [Link]

> If you really wanted to export journalling-style protection to user-space apps, I would look at allowing dependencies to be specified, either implicitly (close -> rename) or explicitly.

User specified dependencies are completely unnecessary. All that is necessary is for filesystems to exercise a little bit of additional effort to preserve POSIX semantics across system crashes. That means atomic renames, among other things.

Very simple: log the rename in the journal. Keep the old inode around until the data associated with the replacement inode commits to disk. If the system crashes in the meantime, on recovery undo the rename, thereby restoring the association between the name and the old inode.

No need to commit the new inode data implicitly or explicitly prior to writing any metadata journal entries. No need to call fsync unless you actually want synchronous behavior. No need to start writeout on the new inode immediately, either.
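
To make the undo idea above concrete, here is a toy in-memory model in Python. It is only a sketch of the bookkeeping, not how any real filesystem journal is implemented; the class `ToyJournal` and all of its fields are invented for illustration.

```python
class ToyJournal:
    """Toy model: names map to inode numbers; data may lag behind metadata."""

    def __init__(self):
        self.names = {}             # name -> inode number (metadata)
        self.data_written = set()   # inodes whose data has reached "disk"
        self.undo = []              # (name, old_inode, new_inode), oldest first

    def rename_replace(self, name, new_inode):
        """Point `name` at new_inode and log how to undo that."""
        self.undo.append((name, self.names.get(name), new_inode))
        self.names[name] = new_inode

    def data_committed(self, inode):
        """The inode's data is on disk: its undo records are no longer needed."""
        self.data_written.add(inode)
        self.undo = [u for u in self.undo if u[2] != inode]

    def recover(self):
        """After a crash, roll back renames whose new data never hit disk."""
        for name, old, new in reversed(self.undo):
            if self.names.get(name) == new and new not in self.data_written:
                if old is None:
                    del self.names[name]
                else:
                    self.names[name] = old
        self.undo.clear()
```

In this model a crash between `rename_replace` and `data_committed` leaves the name pointing at the old inode after `recover`, and no fsync is involved at any point.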

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 21:44 UTC (Tue) by iabervon (subscriber, #722) [Link]

No, actually, a metadata-only sync would *increase* the chance of causing problems over not using anything at all. The necessary operation is to *prevent* writing the metadata until the data has been written.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Jan 1, 2011 19:22 UTC (Sat) by butlerm (subscriber, #13312) [Link]

> The necessary operation is to *prevent* writing the metadata until the data has been written.

That can lead to severe performance issues, unless you implement some sort of multiversion concurrency (a la BSD style "soft updates") on all your metadata. There are better ways.

Not again

Posted Dec 28, 2010 16:06 UTC (Tue) by man_ls (subscriber, #15091) [Link]

This particular horse has got its share of postmortem beatings before, but there it goes: what developers usually want is atomicity, not durability. Durability (once written stay written) is a different requirement, but in this particular instance we need atomicity: do the rename in one step, so it's either finished or not done. The same goes for appending to an existing file: either append or do not append, but at no point in the process leave a corrupt file.

The funny part about Ts'o's fsync obsession is that he considers fsync to be a requirement for good filesystem programming. Instead of solving the atomicity problems and shutting up, he insists on the use of fsync:

> it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

and proposes "product QA" as a substitute for proper filesystem-level atomicity. Weird.

Not again

Posted Dec 28, 2010 17:29 UTC (Tue) by mjg59 (subscriber, #23239) [Link]

Indeed. It's ironic that the outcome of optimising certain performance characteristics of a filesystem would have required userspace to behave in a way that's almost precisely pathological on the previous generation of the same filesystem, but the rational approach to this is "Optimise filesystems for the way that real-world applications work" not "Optimise filesystems for a benchmark". POSIX doesn't provide a mechanism for atomicity without durability, so if application developers want one without the other then they're going to have to rely on implementation details. Filesystems that don't understand that are doomed to irrelevance.

Not again

Posted Dec 29, 2010 6:42 UTC (Wed) by Nick (guest, #15060) [Link]

I don't know that developers want atomicity but not durability. I'm not quite sure if developers know what they want, most of the time. At least, there seem to be a lot of misconceptions and misuse of these things. I don't think SQL provides atomicity without durability, does it? (maybe an RDBMS specific extension, but I'm not aware of them).

It's true that the POSIX filesystem API is not trivial to write a crash recoverable protocol on top of, when you are talking about multiple files and directory structure, but it's not rocket science. I think a lot of userspace developers believe that it is terribly difficult and non-performant to implement in their apps, but it would somehow become free of complexity and cost if implemented in the kernel.

Not only would it not be free, but it would add a lot of complexity and divergence (between fses, other OSes) to a layer that is currently quite simple and used by all sorts of apps that do not need such semantics.

I'm not saying it would be a useless feature. But I have not seen anywhere somebody show code and numbers showing that it is compelling, over alternatives. Alternatives include implementing your own atomicity/cleanup protocol on the posix API (which really isn't too hard, and can absolutely include asynchronous writeback until durability is required), or using sqlite or bdb or something to either store all your data or at least store a log of the state of your file/directory structure. And mind you, the alternatives are very well tested and will work on all OSes.

This is funny...

Posted Dec 29, 2010 8:47 UTC (Wed) by khim (guest, #9252) [Link]

The most popular SQL database in the most popular mode only offers atomicity but not durability.

The fact that it remains the most popular despite that says volumes about what people need (as opposed to want).

People ask for durability when they need atomicity - but that's because, as you've said, usually developers don't know what they need (they usually know what they want, but that's a different thing).

This is funny...

Posted Dec 30, 2010 12:16 UTC (Thu) by Nick (guest, #15060) [Link]

Well I thought MyISAM's lack of ACID was widely derided and apparently updates weren't even atomic versus crash recovery, and they now default to using the InnoDB engine which is full ACID.

But the article didn't really mention durability. It mentioned lack of integrity, whereas durability seems like it was implied ("MyISAM tables effectively always operate in autocommit = 1 mode").

You can definitely turn off fsync on RDBMS which I think is used to do bulk populates of a new database. But there is definitely that point of populating the thing to transitioning to live data when you need a durable point too. But this is because crash recovery at this point is trivial (recreate the db) compared to the larger cost of fsync.

Not again

Posted Dec 29, 2010 13:04 UTC (Wed) by man_ls (subscriber, #15091) [Link]

It's not that developers don't want durability: quite often they do, for example after saving a file to disk. But in those instances it is safe and clean to provide it by using fsync(). I don't care if my application pauses for a few seconds after saving a file to disk, but at that point I want durability. I don't want it to stop every minute when it saves an emergency backup file, but neither do I want it to corrupt said emergency file; at those moments I want atomicity.

I was probably not clear enough when speaking about "developers", since they are not the determinant force here but merely a vehicle for users' needs. Users want robust filesystems which do not corrupt their data; application developers just want to do their stuff without being bothered by the underlying filesystem. It is filesystem developers who need to provide for those seemingly disjoint requirements by providing atomicity always (again, merely a vehicle for robust filesystems) and durability sometimes (only when asked for explicitly).

If POSIX does not require atomicity in renames and appends, then are filesystem developers free to corrupt user files and claim to be POSIX-compliant? Ask XFS developers, who did corrupt files once in a while for a few years and lost most of their users in the process, what good this compliance does. If ext4 did this then the whole Linux ecosystem would suffer in the process as it is the default filesystem.

Implementing filesystems on a database-like layer is what Microsoft Longhorn set out to do, one of the reasons why it was delayed some six years, and also why the final Windows Vista was so utterly bad: a lot of resources were spent in what was, simply put, bad engineering. I would have said "in retrospect" but for many people it was obviously a mistake from the beginning. There have been a few other contenders in the filesystem-on-a-database category and IIRC all failed.

Not again

Posted Dec 30, 2010 12:07 UTC (Thu) by Nick (guest, #15060) [Link]

Then you use the normal posix API to write your emergency backup files. I still fail to see what is missing here.

Write the file (to a new name, obviously, don't overwrite the old one until the new one is durable), and optionally kick off an asynchronous writeback of that file. At some point in future when durability is required (eg. if you guarantee some window or interval on your backed up data), then you would fsync (which should be mostly noop if the asynchronous writeback already completed). And then perform some checkpoint such as a lock file or sqlite transaction or something that indicates the file is now durable.

It's really not a big deal to implement some simple transactional-like recovery protocol on top of the POSIX API, and it has been done in these ways for a long time. It seems like a new wave of programmers has just forgotten about this.

Or, if you want to do slow, synchronous operations without blocking the whole app, that's trivial too -- just use processes or threads to do those slow operations.
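
The protocol described above could be sketched roughly as follows. The names `write_backup`, `checkpoint` and `recover`, and the `CURRENT` marker file standing in for "a lock file or sqlite transaction or something", are all assumptions made for this example. The optional asynchronous writeback step is left out for brevity.

```python
import os


def _fsync_path(path):
    """fsync a file or directory by path (read-only fd; accepted on Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_backup(directory, seq, data):
    """Write a backup under a fresh name; no fsync, so this returns quickly."""
    path = os.path.join(directory, "backup.%06d" % seq)
    with open(path, "wb") as f:
        f.write(data)
    return path


def checkpoint(directory, path):
    """Make `path` durable, then record it as the latest known-good backup."""
    _fsync_path(path)
    marker_tmp = os.path.join(directory, "CURRENT.tmp")
    with open(marker_tmp, "w") as f:
        f.write(os.path.basename(path))
        f.flush()
        os.fsync(f.fileno())
    os.replace(marker_tmp, os.path.join(directory, "CURRENT"))
    _fsync_path(directory)


def recover(directory):
    """Return the last checkpointed backup path, or None if there is none."""
    try:
        with open(os.path.join(directory, "CURRENT")) as f:
            name = f.read().strip()
    except FileNotFoundError:
        return None
    return os.path.join(directory, name)
```

Backups written after the last `checkpoint` are simply ignored by `recover`, so they never need to be consistent after a crash.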

Not again

Posted Dec 30, 2010 13:43 UTC (Thu) by man_ls (subscriber, #15091) [Link]

What is missing is that you don't really want all the files you write to be durable, even if you think you do: many of them are just temporary files which won't perhaps be needed in the future, and for the rest the write operation can be delayed a few seconds with no ill effects. You just want your renames to be atomic: either keep the old file or overwrite it with the new one. And the same with appends. It's not really important which version (old or new) is left on disk, but never leave corrupted files (empty files or files with uninitialized sectors).

You can paper over these inefficiencies with new APIs; you can require all developers to use them before you make guarantees about not corrupting data; or you can force them to use threads or processes. But those are not very elegant ways to deal with the problem. I prefer to have a filesystem which silently deals with the problem without corrupting my data, like we did with ext3 in the good old times.

But once again, please read the original articles where these issues were truly beaten to death.

Not again

Posted Dec 31, 2010 1:00 UTC (Fri) by Nick (guest, #15060) [Link]

If you don't need files to be durable, then don't fsync them!

If there is a crash, then you don't have to bother with atomicity, just delete the files (which is possible because you didn't require durability).

These aren't inefficiencies, they are actually efficiencies, because they expose an API that is simple and efficient to implement for the memory management and filesystems. You can build arbitrarily more complex protocols on top of that.

And yet again here we are

Posted Dec 31, 2010 14:08 UTC (Fri) by man_ls (subscriber, #15091) [Link]

What I want is not a durable file, but an atomic rename. Not an empty file, but either the old version or the new.

Apparently the atomic rename is too confusing, or I am not explaining the use case clearly. Think about atomic appends to files then. When you append some sectors atomically, you want either the old file without the new sectors or the file with the new sectors at the end. What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

An atomic append would guarantee that the new sectors would either be present with the expected contents or not present at all, but never contain random garbage. We don't need an fsync after every sector, nor would it solve the problem: after the write but before the fsync the filesystem would be in an incoherent state anyway, albeit for a short time.

And yet again here we are

Posted Jan 5, 2011 12:03 UTC (Wed) by nye (guest, #51576) [Link]

>What XFS did (and I experienced first hand) was add some uninitialized sectors to the file, and afterwards write the new content to the sectors. This behavior is apparently allowed by POSIX, and yet extremely annoying to users, who moved to other filesystems in droves.

To be fair there was more wrong with XFS than that - it could also corrupt *entirely unrelated* files. Personally I stopped using it when a power cut trashed /etc/passwd (and presumably a load of other files that were less obvious) despite nothing having it open at the time.

Not again

Posted Dec 31, 2010 16:08 UTC (Fri) by etienne (subscriber, #25256) [Link]

How about, when you want to save the new configuration file, not trying to replace the old config file - just keep it in case the new config file is incomplete, still not created or corrupted (after crash)?
Should work with any filesystem.

Not again

Posted Dec 31, 2010 16:21 UTC (Fri) by man_ls (subscriber, #15091) [Link]

And, the next time? Would you keep all the old files? You can also keep two of them, but how do you know which one of them was the last one written? What if one of them is half-written? That way leads to madness. Just to avoid a feature which has worked on ext3 (and probably all other popular POSIX filesystems) for decades, and which other OSs are trying to copy.

Not again

Posted Dec 31, 2010 19:05 UTC (Fri) by etienne (subscriber, #25256) [Link]

> Just to avoid a feature which has worked on ext3 for decades

Well, for decades you did not have hard drives with internal command queueing.
To have better performance you need to keep the queue full.
Because you cannot tell the hard drive that updating this sector is more important than that one, that information is probably not managed at all in the driver.
Moreover, you asked for this behaviour by wanting only metadata journaling of the filesystem, explicitly wanting a coherent filesystem (i.e. no fsck after crash) even if it means data inside files may be corrupted.
You can run with data journaling, but people/distributions think it is not worth the performance hit.

Once again, reenacting history

Posted Dec 31, 2010 20:21 UTC (Fri) by man_ls (subscriber, #15091) [Link]

It didn't happen that way. I did explicitly use a journaling filesystem thinking that "journaling" meant that it did not lose data in the event of a crash. When I found out that only metadata was guaranteed to be consistent, I thought "What a sham". Then I (and millions of other people) immediately switched to another filesystem: ext3 with data=ordered, which did offer better guarantees (data journaling) at the cost of performance. Who wants performance when your files are being corrupted?

The funny thing is that XFS developers eventually realized their folly and solved the atomicity issues, but now people don't trust them with their data anymore.

Not again

Posted Dec 31, 2010 22:14 UTC (Fri) by neilbrown (subscriber, #359) [Link]

1/ ext3 is the only filesystem I know of that forces an fsync before committing a rename.
2/ ext3 is about 10 years old, so it hasn't been around for "decades" unless you mean "0.9 decades".
3/ When rename was first introduced into Unix in BSD, it was atomic in the sense that even in the event of a crash there would always be a file with the destination name, either the original or the new. This is in contrast to the previous behaviour, which required:
- create "file.tmp"
- unlink "file"
- link "file.tmp" to "file"
- unlink "file.tmp"

which can easily leave nothing called "file". This is all that "atomic rename" means, or at least all it meant before ext3 gave rename unfortunate (though useful) semantics.
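
As a rough illustration of the rename-based pattern described above, here is a minimal Python sketch. The helper name replace_atomically and the ".tmp" suffix are made up for this example; os.replace maps to rename(2) on POSIX systems. The fsync before the rename is only there to protect the contents on filesystems that do not order data ahead of the rename; the name itself is switched in one step either way.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Write data to a temporary file, then rename it over path.

    Hypothetical helper for illustration: after a crash, `path` names
    either the complete old file or the complete new one.
    """
    tmp_path = path + ".tmp"  # assumed naming convention, same directory
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the contents to disk before the rename
    os.replace(tmp_path, path)  # rename(2): the name switches in one step


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "file")
    replace_atomically(target, b"old contents\n")
    replace_atomically(target, b"new contents\n")
    with open(target, "rb") as f:
        print(f.read())
```

Unlike the four-step unlink/link sequence, there is no moment at which nothing is called "file".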

Though I cannot know the intention of the author of that post you linked to, there is no prima facie reason to believe they mean anything more than the atomicity of names (not of contents) that rename has always had in Unix.

(And half-written files are easy to detect by writing a checksum at the end. If you suffix each file with a timestamp it is easy to know which is the most recent. Any file older than a few minutes will be safe on disk, so you are always free to clean up any file older than the youngest file that is older than a few minutes.)
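
A minimal Python sketch of the checksum-and-timestamp idea, under some assumptions that are not in the comment itself: SHA-256 as the checksum, a fixed-size digest appended as the last bytes of the file, and versions named "prefix.NUMBER" where the number stands in for the timestamp. All function names are hypothetical.

```python
import hashlib
import os
import tempfile

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def write_with_checksum(path, payload):
    """Append a SHA-256 digest of payload so truncation is detectable."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(hashlib.sha256(payload).digest())


def read_if_complete(path):
    """Return the payload, or None if the trailing digest does not match."""
    with open(path, "rb") as f:
        blob = f.read()
    if len(blob) < DIGEST_SIZE:
        return None
    payload, digest = blob[:-DIGEST_SIZE], blob[-DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return payload


def newest_complete(directory, prefix):
    """Pick the newest intact version among files named prefix.NUMBER."""
    versions = []
    for name in os.listdir(directory):
        stem, _, suffix = name.rpartition(".")
        if stem == prefix and suffix.isdigit():
            versions.append((int(suffix), name))
    # Walk from newest to oldest and return the first one that verifies.
    for _, name in sorted(versions, reverse=True):
        payload = read_if_complete(os.path.join(directory, name))
        if payload is not None:
            return payload
    return None
```

A reader that finds the newest version truncated simply falls back to the previous one; cleaning up old versions is left out of the sketch.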

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 3:49 UTC (Tue) by mjg59 (subscriber, #23239) [Link]

Any filesystem that doesn't guarantee the ordering of operations in the typical "write and then rename over the old file" case is effectively unusable in the desktop case, and as far as I know no significant distribution released with ext4 as the default and with those bugs still present. As long as you're not modifying files in place then you should be able to manage without fsync in pretty much every case other than explicit saves.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 12:37 UTC (Tue) by drag (subscriber, #31333) [Link]

On a side note, with BTRFS the write-then-rename without fsync is a safe operation.

Looks like this is guaranteed behavior on Linux now. :)

Well, it's quite an explicit decision...

Posted Dec 28, 2010 13:37 UTC (Tue) by khim (guest, #9252) [Link]

This has been true for more than a year: see here. Filesystem developers don't like to support such semantics, but they all agree that "the application developers outnumber us", so it's simpler to explicitly add such a guarantee rather than try to educate said application developers.

Well, it's quite an explicit decision...

Posted Dec 28, 2010 17:23 UTC (Tue) by Wol (guest, #4433) [Link]

It's not that filesystem developers are outnumbered by app developers. It's that forcing app writers to do an fsync will simply KILL performance as pretty much every app will forever be hanging on file-system access.

Plus the fact that on some fs types you need the fsync and it's cheap, on others you don't need it and it's costly. The app should not need to modify its behaviour depending on which file system it's running on.

Cheers,
Wol

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 8:33 UTC (Tue) by timokokk (subscriber, #52029) [Link]

"Doesn't Ext4 rely on a generic block layer? Isn't that going to significantly affect the reliability and life of the flash?"

Yes, ext4 relies on a generic block layer and it will not work on top of an mtd device. And no, it does not affect the reliability and life of the flash. If the device provides a generic block device interface, then it will also hide the raw flash layer and take care of the wear leveling for you. For example, eMMC memories work like that.

If you only have a raw flash device that does not hide the characteristics of the flash media, you will also need to use a file system that is designed to work with such media (JFFS2, UBIFS, whatever..).
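
To make "hide the raw flash layer and take care of the wear leveling" a little more concrete, here is a toy translation layer in Python. It is only a sketch under heavy assumptions: one payload per block, no pages, no garbage collection, no bad blocks, and the class name ToyFTL is invented. It does not describe how any real eMMC firmware works.

```python
class ToyFTL:
    """Toy flash translation layer: every logical write is redirected to
    the least-erased free physical block, so repeated writes to one
    logical block are spread over the whole device."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.data = [None] * physical_blocks
        self.mapping = {}  # logical block -> physical block
        self.free = set(range(physical_blocks))

    def write(self, logical, payload):
        if not self.free:
            # Needs at least one spare block (over-provisioning).
            raise RuntimeError("no free physical block left")
        # Pick the free block that has been erased the fewest times.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        self.data[target] = payload
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            # The stale copy is erased and returned to the free pool.
            self.data[old] = None
            self.erase_counts[old] += 1
            self.free.add(old)

    def read(self, logical):
        return self.data[self.mapping[logical]]


if __name__ == "__main__":
    ftl = ToyFTL(physical_blocks=8)
    for i in range(1000):
        ftl.write(0, i)  # hammer a single logical block
    print(ftl.read(0), ftl.erase_counts)
```

The filesystem above only ever sees "logical block 0", which is why a block-oriented filesystem like ext4 can sit on top without knowing about erase counts.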

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 12:39 UTC (Tue) by drag (subscriber, #31333) [Link]

Correct me if I am wrong, but it seems like most phones now with large amounts of on-board storage use an SD-like device soldered to the mainboard.

So they often don't offer access to raw flash even if you wanted to use MTD directly.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 17:36 UTC (Tue) by mikov (subscriber, #33179) [Link]

Maybe so - it is probably much easier for the manufacturer. The question is how it affects reliability.

I doubt that the builtin SD controllers can perform very sophisticated wear leveling. It is an extremely difficult problem.

eMMC

Posted Dec 28, 2010 14:37 UTC (Tue) by pflugstad (subscriber, #224) [Link]

I expect a lot of phones are using eMMC, due to its lower pin counts compared to raw flash. eMMC presents a SCSI/block interface to the OS and so has its own block translation and wear-leveling firmware which hides the underlying flash.

The problem with this is exactly that however - you're depending on the quality of the eMMC firmware, which can be dodgy. Look at the problems the SSD drives have had of falling off a performance cliff. Also, we've had issues with eMMC reliability from some vendors. I don't know if you can update that firmware on the fly or not.

We're to the point of putting bare NAND back on our boards and using UBIFS instead.

eMMC

Posted Dec 28, 2010 18:14 UTC (Tue) by mikov (subscriber, #33179) [Link]

We are experiencing the same kind of problems, only unfortunately using bare NAND is not an option for us. That's why it seemed strange to me that large manufacturers like Samsung would voluntarily use the worse solution.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 20:42 UTC (Tue) by robert_s (subscriber, #42402) [Link]

"If you only have a raw flash device that does not hide the characteristics of the flash media, you will also need to use a file system that is designed to work with such media (JFFS2, UBIFS, whatever..)."

Not necessarily. You could always be using a software MTD-to-block translation/wear-levelling driver. I think a toy one is even shipped with the MTD layer.

Not sure how likely this solution is though.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 10:06 UTC (Tue) by Cyberax (subscriber, #52523) [Link]

>Doesn't Ext4 rely on a generic block layer? Isn't that going to significantly affect the reliability and life of the flash?

Recent Samsung phones have wear-levelling in the hardware (they present their flash drives as block devices), so ext4 is perfectly OK.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 11:09 UTC (Tue) by mikov (subscriber, #33179) [Link]

There is no wear leveling in hardware. There will have to be a separate controller with plenty of RAM and a fast CPU. Plus the trim command must be supported. Is it?
It seems like a horribly inefficient solution given that the phone itself already has lots of RAM and a fast CPU.
The wear leveling in standalone storage like compact flash is pretty bad for those reasons.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 11:18 UTC (Tue) by Los__D (subscriber, #15263) [Link]

While the flexibility of a pure software solution is greater, there is absolutely nothing inefficient about dedicated hardware to do wear leveling, just like there is nothing inefficient about RAID controllers, GPUs or other tasks done in specialized hardware instead of software.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 17:34 UTC (Tue) by mikov (subscriber, #33179) [Link]

Of course there is lots of inefficiency in an independent controller. Above all, wear leveling needs free space - without understanding the file system layout, and without the trim command, the so-called "hardware" wear leveling (it is a misnomer, of course) is very inefficient. AFAIK, the trim command itself is problematic because it is synchronous.

I have had horrible experiences with CompactFlash cards, even expensive ones from respected vendors.

I think few people realize how truly difficult it is to do good wear leveling. Even reads cause wear! Plus, the obvious "solutions" have a write multiplication factor of about 30...
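
For a sense of where a write multiplication factor of that order can come from, here is a small worked calculation in Python. The sizes are assumptions chosen for illustration (a 4 KiB logical write and a 128 KiB erase block), and the scheme modelled is the naive one where every small write forces a whole erase block to be rewritten; the comment does not say how its figure of about 30 was obtained.

```python
def write_amplification(logical_bytes, erase_block_bytes):
    """Worst-case factor for a naive read-modify-write scheme: every
    erase block touched by an aligned write is rewritten in full."""
    blocks_touched = -(-logical_bytes // erase_block_bytes)  # ceiling division
    return blocks_touched * erase_block_bytes / logical_bytes


if __name__ == "__main__":
    # Assumed sizes: 4 KiB write into 128 KiB erase blocks.
    print(write_amplification(4 * 1024, 128 * 1024))  # 32.0
```

With writes as large as the erase block the factor drops to 1, which is why free space and knowledge of the filesystem layout matter so much to the controller.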

This problem has been largely ignored recently to the extent that it isn't even mentioned at all in supposedly technical articles.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 4:15 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246) [Link]

I think few people realize how truly difficult it is to do good wear leveling. Even reads cause wear!

Really? I've never heard that reads cause any appreciable wear. Can you share a reference?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 15:14 UTC (Wed) by busterb (subscriber, #560) [Link]

Just use the 'noatime' mount option.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 20:16 UTC (Wed) by jzbiciak (✭ supporter ✭, #5246) [Link]

atime updates are writes (even though the app only does a read). The comment I was replying to seemed to suggest that pure reads wear out flash. ie. if I mounted a volume read-only, that I could shorten its life dramatically by reading it regularly.

While I'm sure there's some impact to reading a flash cell, I've got to believe it's a few orders of magnitude smaller than the effect due to writes, and therefore generally ignorable.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 12:28 UTC (Tue) by Cyberax (subscriber, #52523) [Link]

>There is no wear leveling in hardware. There will have to be a separate controller with plenty of RAM and a fast CPU.

There IS a specialized controller. However, it doesn't need a lot of RAM and a fast CPU, because it can offload some of the tasks on the main CPU.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 17:38 UTC (Tue) by mikov (subscriber, #33179) [Link]

Which tasks? It is presented as ATA interface, so AFAIK it can't offload anything.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 0:18 UTC (Wed) by swetland (subscriber, #63414) [Link]

Well, looks like an SD/MMC device in the case of most devices we're working with, but yeah, no magical offload of the work to the host CPU. Looks like a block device. Behaves (mostly) like a block device. Probably supports TRIM. Etc.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 15:06 UTC (Tue) by petegn (guest, #847) [Link]

Remind me NOT to buy an Android device until they get shut of the EXT filing system. Every single time I have used it, it has caused me no end of problems. Thanks but no thanks, which is a pity because I was quite liking the Android phones.


Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 15:59 UTC (Tue) by clump (subscriber, #27801) [Link]

The ext* filesystems have worked very well for me, especially since ext3. Your experience with ext4 will likely be very different with a dedicated device. Hopefully you'd not pass over a Linux device merely because of its filesystem.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 16:55 UTC (Tue) by Los__D (subscriber, #15263) [Link]

Don't take anything Petegn writes too seriously. google "petegn site:lwn.net" for (a long list of) reasons.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 18:17 UTC (Tue) by SEJeff (subscriber, #51588) [Link]

Thanks for the heads up, +1 to my blocklist

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 23:55 UTC (Tue) by bronson (subscriber, #4806) [Link]

A lot of people lost data to ext4's unexpected behavior, then lost confidence from Ted's unfortunate initial response.

petegn may have chosen a bad way of expressing himself but, in this case, he's not alone.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 20:37 UTC (Tue) by clump (subscriber, #27801) [Link]

Well, it's not entirely clear petegn is a troll. He/she may just be a little slow.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 3:03 UTC (Wed) by bk (guest, #25617) [Link]

"petegn site:lwn.net"
No standard web pages containing all your search terms were found.

Your search - petegn site:lwn.net - did not match any documents.

Suggestions:

Make sure all words are spelled correctly.
Try different keywords.
Try more general keywords.
Try fewer keywords.

Perhaps *you* are the troll?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 4:07 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

No, your search did a "did you mean?" attempt at spelling correction, and found nothing. If you insist on Google searching for what you actually typed, you find plenty.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 18:17 UTC (Wed) by sorpigal (subscriber, #36106) [Link]

That's really irritating. I'm getting more and more annoyed with Google silently introducing bad behavior like this which, like site previews, I don't see a way to globally disable. Do I have to start prefixing every search term with +?

Google

Posted Jan 1, 2011 13:30 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

Google (and pretty much everyone) ignores + or treats it as a hint at most

This is topical with the thread above about filesystem behaviour. Once people are using something, it's too late to decide how they should use it. Google's reality is that millions of people every second are asking Google for things like "ma DONah" and expecting to find pages about the pop singer Madonna, and for "870 + 43" expecting 913 - while it's probably as much as seconds at a time between requests by (let's call them) nerds for whom Fu8E7 +frobnicate site:verytechnical.example is very precisely the exact search they wanted to perform.

So Google optimise for the former, not the latter, trusting that (a) nerds will find the way to do whatever they need to do even if you don't make it obvious, so there's no need (b) the nerds don't read adverts and thus don't generate any significant revenue.

As with XKCD's imaginary secret tech support password, it would be nice if Google provided a raw search facility for people who can spell, know what they're looking for and understand what "not found" means, but it doesn't necessarily make commercial sense. If you ask users "Are you an expert?" they say "Yes" even if they know nothing. And then they complain that they don't understand what happened next. If you provide a "secret" expert mode, "helpful" journalists tell ordinary users about it, and you're back to the same situation. I bet that fully half the quoted search queries on Google are users who don't know what quotes do, but believe that it somehow makes the search give more results, or better results.

Google

Posted Jan 6, 2011 9:04 UTC (Thu) by dw (subscriber, #12017) [Link]

Double quoting terms that should not be "corrected" works with Google: even though it still offers you a correction, the initial set of results isn't corrected.

Prefixing with plus turns off correction with the Bing, which quite politely doesn't offer some obnoxious correction even after explicitly being told not to.

As an aside, Bing's fancy augmented search results page is much less annoying than Google's, even though it still has behaviour that triggers on mouse-over.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 19:42 UTC (Tue) by mcj220 (guest, #53162) [Link]

ext4 vs yaffs seems to be a spurious debate in this context: YAFFS2 is suited to raw nand storage devices which have very different requirements to conventional hard disk-like media (such as SD cards) which require no wear levelling, ecc and external accounting for bad-blocks.

Many current generation android devices are configured with a smallish raw nand device at the root of the filesystem + a larger micro SD card (e.g. Motorola Droid) for user multimedia data etc. The trend seems to be going towards fitting phones with a large SD card-like flash chip (Google nexus S), which requires no external wear levelling and thus is suited to conventional filesystems like ext2,3,4 etc.

The quality and reliability of the firmware in these SD card-like flash chips (MoviNAND, iNAND etc) varies of course.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 5:29 UTC (Wed) by xxiao (subscriber, #9631) [Link]

Can we disable ext4 and enable YAFFS instead..how hard will that be? What makes Android so special that only one FS is chosen?

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 28, 2010 23:15 UTC (Tue) by eds (subscriber, #69511) [Link]

I don't think it's a given that the built-in FTL would screw up so badly as to cause a big increase in program/erase counts; this is a relatively expensive general-purpose computer, not a generic $5 USB stick. It's certainly worth investigating what the performance tradeoffs are with various hardware FTL implementations versus a sophisticated kernel-based approach. Storage performance seems likely to be a part of user experience, so one would hope smartphone designers are keeping an eye on the write amplification and latency when they make these decisions.

The idea of a "flash filesystem" that's built for low-performance applications is one I'm happy to see die. One of the big advantages of flash storage is the potential for parallel operations. I think moving to full-featured filesystems, potentially with tweaks to be friendly to solid state storage, will produce a much better overall result than you'd get by starting with a software FTL and then trying to grow a specialized filesystem on top.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 0:16 UTC (Wed) by swetland (subscriber, #63414) [Link]

The biggest downside to baked-into-hardware FTL (which really means code-in-ROM for an embedded ARM7 or the like nine times out of ten) is that bugs in the translation layer can be subtle, hard to find, and difficult or impossible (depending on implementation, patchability, etc) to fix.

The upside is you get something that behaves like a block device, and you can use plain ol' filesystems on it with little or no special handling.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 4:12 UTC (Wed) by foom (subscriber, #14868) [Link]

"Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync."

Um okay, but people do remove batteries from devices *all the time* and won't think about properly shutting off their phone first...

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 8:10 UTC (Wed) by cmccabe (subscriber, #60281) [Link]

Hmm. Personally, I shut down the phone by holding down the power button. And since it has a battery, it never loses power abruptly.

I don't know anyone who removes batteries from his or her phone "all the time." First of all, you only turn off the phone when it locks up, which is extremely rare-- I think it's happened like once in my year of owning a Nexus One. Second of all, you'd have to know how to get at the battery. This is already beyond the skill level of grandma-type users. Keep in mind that some phones don't even have a removable battery, like the iPhone. Thirdly, you have to be dumb enough to not know the normal way of turning off the phone. Do all three of these factors really line up that often? I'm skeptical.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 10:43 UTC (Wed) by gowen (guest, #23914) [Link]

And since it has a battery, it never loses power abruptly.
Never dropped your phone and had the back fly off and the battery fall out? I've done that at least once with a phone I've owned. Now granted, I like small, sleek, slim (and fragile) phones over more robust smartphones, but I still wouldn't rule it out.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 11:45 UTC (Wed) by drag (subscriber, #31333) [Link]

> I don't know anyone who removes batteries from his or her phone "all the time."

I think doing it once and then having the FS corrupted would be enough to piss off most people, and would cause upset feelings and costs on both the customer service and customer side of things.

I couldn't imagine recovering from a non-functioning firmware would be fun for most people. Probably would require mailing the thing in or a drive to the service center. It is not like you can just plug in your Ubuntu CD and go into recovery mode or anything like that.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 18:10 UTC (Wed) by cmccabe (subscriber, #60281) [Link]

I think the failure mode that Ted Ts'o was talking about was data loss, not filesystem corruption. The assumption is that you will be able to replay the journal to get the filesystem back into a known state without fsck.

The only wildcard is the behavior of the flash firmware. If it lies to you and says things are committed to flash when they really aren't, really anything could happen. However, it's hard to see how any filesystem can get around this problem, if it exists. Keep in mind you can't rely on ordering either, since the flash firmware needs to reorder your requests to do wear leveling and so forth.

Ext4 filesystem hits Android, no need to fear data loss (ars technica)

Posted Dec 29, 2010 15:26 UTC (Wed) by foom (subscriber, #14868) [Link]

> I don't know anyone who removes batteries from his or her phone "all the time."

Ever wanted to swap out a sim card? Or an SD card? You often need to remove the battery to do that. Remembering to properly shut down the phone when you're doing that is certainly not likely to happen all the time.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds