From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
"One of our primary design goals was that it should be painlessly easy to upgrade from ext3 to ext4. You might not get all of the benefits of ext4 unless you do a backup/reformat/restore of your filesystem, but you would get at least some of the benefits by simply remounting the filesystem using ext4 and enabling some of ext4's features."
Posted Mar 31, 2009 0:21 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (36 responses)
I came away from the "recent controversy" uncertain of exactly what the real world implications of data=ordered vs data=writeback actually were in the context of ext4 with the patches destined for 2.6.30. Could someone clearly state the reliability implications of those modes in that context?
Thanks.
Posted Mar 31, 2009 3:28 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (32 responses)
As per mount manual page, ordered mode of ext3 does this:
"All data is forced directly out to the main file system prior to its metadata being committed to the journal."
So, in terms of reliability (i.e. the situation after a crash), the file will always have data in it, because the metadata is always committed after the data. There will be no inodes without correct data blocks. With writeback mode, this ordering is not guaranteed, and you may see "old data appear in files after a crash and journal recovery" (also from the manual).
AFAIK, ext4 does delayed allocation by default. This means that sometimes the metadata can hit the disk before the data, leaving the file with no blocks.
One can completely disable delayed allocation on ext4 (the nodelalloc option), which should then avoid the above, at a considerable performance penalty. This is the big hammer approach. I think Ted also talked about the possibility of another, similar option (data=alloc-on-commit), but I don't know whether that went ahead or not. Anyhow, it is similar in its effect to nodelalloc.
The patches are about writing blocks (data) before metadata only in certain situations. In Ted's words:
"These three patches (with git ids bf1b69c0, f32b730a, and 8411e347) will cause a file to have any delayed allocation blocks to be allocated immediately when a file is replaced. This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file."
Meaning, the most troublesome cases of missing data are worked around, but generally speaking delayed allocation is still in action, so one may still end up with inodes that point nowhere, because their data had not been committed before the crash, either implicitly by the kernel or explicitly by fsync().
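To make that concrete, here is a minimal C sketch of the write-to-temp, fsync(), rename() pattern the thread keeps circling around. It is my illustration rather than code from any of the posters; the helper name replace_file() is made up and error handling is abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace `path` with `len` bytes from `buf`, via a temp file.
 * The fsync() before rename() closes the window in which a crash could
 * leave `path` pointing at an inode with no allocated data blocks. */
int replace_file(const char *path, const char *tmp,
                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) < 0) {        /* force the data (and its allocation) to disk */
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;
    return rename(tmp, path);   /* readers now see either old or new bytes */
}

Without the fsync(), delayed allocation means the rename's metadata can be journalled long before the data blocks exist, which is exactly the zero-length-file case described above.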
Posted Mar 31, 2009 6:41 UTC (Tue)
by nix (subscriber, #2304)
[Link] (2 responses)
(And combine it with fs-cache and running everything else over NFS, and you get the storage reliability of RAID and read speeds almost local-disk-equivalent. Only writes and metadata reads are down, and I assume that in time the latter in particular will be cacheable too.)
Posted Mar 31, 2009 7:45 UTC (Tue)
by khim (subscriber, #9252)
[Link] (1 responses)
The whole discussion started with a software crash (nVidia drivers are
very helpful here). I fail to see how these "new century" toys can help
against this.
Posted Apr 3, 2009 13:55 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 31, 2009 7:03 UTC (Tue)
by man_ls (guest, #15091)
[Link] (27 responses)
Posted Mar 31, 2009 7:52 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
We have many, many arrogant people in the Free software world, and key parts of any Linux system depend on their code. If you can find any technical incompetence that results in issues going unfixed, it might be worth considering, but I don't see you pointing out any such issues.
Posted Mar 31, 2009 19:38 UTC (Tue)
by man_ls (guest, #15091)
[Link] (8 responses)
Posted Mar 31, 2009 21:02 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (7 responses)
Posted Mar 31, 2009 21:40 UTC (Tue)
by man_ls (guest, #15091)
[Link] (5 responses)
Posted Mar 31, 2009 21:50 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
The very same blog post that describes the problems also mentions that fixes have already been queued. Technically, I don't know what more you could ask for. To be clear, there are other potential issues present but the ones you are talking about were fixed even before the blog post was written.
Posted Mar 31, 2009 22:47 UTC (Tue)
by man_ls (guest, #15091)
[Link] (3 responses)
There are few black and white issues, but a filesystem developer saying that corrupting user data is fine would seem to qualify. Later committing a fix to "work around" the problem while a hundred thousand developers fix their code is hardly enough. Technically, I am not even sure a public flogging would be enough.
And now, ladies and gentlemen, with your kind permission I will just call Ts'o a nazi in a half-assed invocation of Godwin's law to jump out of this discussion and go to sleep.
Posted Mar 31, 2009 23:52 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (2 responses)
Posted Apr 1, 2009 0:04 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (1 responses)
Actually, you don't even have to look at other file systems. ext3 in writeback mode is sufficient, because metadata can go to disk before data. You may end up with garbage in your files after the crash.
Posted Apr 1, 2009 6:52 UTC (Wed)
by man_ls (guest, #15091)
[Link]
Posted Apr 3, 2009 14:03 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 31, 2009 9:38 UTC (Tue)
by regala (guest, #15745)
[Link] (4 responses)
Posted Mar 31, 2009 9:47 UTC (Tue)
by regala (guest, #15745)
[Link] (3 responses)
Posted Mar 31, 2009 18:21 UTC (Tue)
by man_ls (guest, #15091)
[Link] (2 responses)
That reminds me of the old joke. A reckless driver on the highway is listening to the radio: "Attention, attention, there is a crazy man driving against the traffic on the highway", and he says: "One? All of 'em!"
Posted Mar 31, 2009 19:32 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
I wonder if you've been using the same Internet I have, really.
Posted Mar 31, 2009 21:37 UTC (Tue)
by man_ls (guest, #15091)
[Link]
Posted Mar 31, 2009 14:02 UTC (Tue)
by clugstj (subscriber, #4020)
[Link] (8 responses)
Posted Mar 31, 2009 15:40 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (7 responses)
There is competence, and there is judgment. And the two are distinct. I think that it is his judgment on this matter that is in question. I've been waiting for Linus to speak on the matter. I would be very interested in his view of this matter. Of course, the distros have the final say as to what are the effective defaults, even down to the patches they choose to apply. And *savvy* users have the ultimate decision as to the configuration of their systems. Unsavvy users, of course, are stuck with what they get.
Posted Mar 31, 2009 19:18 UTC (Tue)
by sbergman27 (guest, #10767)
[Link]
========================
Isn't that the same fix? ext4 just defaults to the crappy "writeback"
behavior, which is insane.
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
Linus
=======================
Posted Mar 31, 2009 19:18 UTC (Tue)
by man_ls (guest, #15091)
[Link] (5 responses)
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
And expecting every app to do fsync() is also crazy talk, especially with
the major filesystems _sucking_ so bad at it (it's actually a lot more
realistic with ext2 than it is with ext3).
So look for a middle ground. Not this crazy militant "user apps must do
fsync()" crap. Because that is simply not a realistic scenario.
And ext3 with "data=writeback" does the same, no?
Both of which are - as far as I can tell - total braindamage. At least
with ext3 it's not the _default_ mode.
Posted Mar 31, 2009 19:39 UTC (Tue)
by oak (guest, #2786)
[Link]
If /dev/null writes aren't zero-copy, it's journaled too!
The window for data retrieval is (infinitely) small though.
Posted Mar 31, 2009 22:37 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (3 responses)
Major filesystems being "ext3 in ordered mode only", of course. The rest could be just fine with fsync(), as we can see above from his ext2 comment. And as Ted pointed out, ext4 doesn't have a big penalty on fsync(), because it doesn't have to flush out MBs of stuff that are unrelated to this particular fsync(), every time this system call is used.
Just as Linus says that ext4 is brain damaged for doing delayed allocation by default, so it can be claimed that ext3 is brain damaged for locking up people's machines for a few seconds on a perfectly reasonable system call: fsync(). We have seen this from the FF fiasco. In fact, when Linux says that having an interactive application do fsync() is impossible, he must mean on ext3 in ordered mode, because that's what the FF complaints were about. As Alan Cox and Ted pointed out, one can already do fsync() in another thread and be fully interactive.
As for the configuration files of KDE (which is where the problem started), the library can trivially back these files up on startup and _never_ use fsync() after that. Other problems should probably be solved by a proper system call that does guarantee ordering (I think Ted provisionally called it fbarrier() or something). Then we'd have a real guarantee of the behaviour, instead of relying on the whims of implementations.
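As a rough sketch of that backup-on-startup idea (my own illustration, not KDE's actual code; it assumes saves replace the file via rename(), i.e. with a fresh inode, so the hard link keeps referring to the old contents):

#include <errno.h>
#include <unistd.h>

/* At startup, hard-link the current (known-good) config to a backup name.
 * Later saves that rename() a new file over `path` leave this link intact,
 * so a crash-truncated config can be recovered from `bak` without any
 * fsync() in the save path. */
int backup_config(const char *path, const char *bak)
{
    unlink(bak);                          /* drop any previous backup */
    if (link(path, bak) < 0 && errno != ENOENT)
        return -1;                        /* ENOENT: no config yet, fine */
    return 0;
}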
Claiming that rename() always did "data before metadata" commits is ahistorical. So, the crazy talk ain't that crazy after all. We just got caught with our pants down.
Surely, Linus is "tha man" when it comes to Linux and what he says will eventually go. But exempting what he says from any criticism is just arse licking, IMNSHO.
Posted Mar 31, 2009 22:43 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (1 responses)
Gee, he should have called it something else. It is impossible to get the man's name right after having "Linux" :-)
Posted Apr 12, 2009 7:59 UTC (Sun)
by Duncan (guest, #6647)
[Link]
Actually, "he" (Linus) did call it something else, "Freeix". It was
(Just google freeix linux for more. "I'm feeling lucky" does it for me.)
Duncan
Posted Apr 2, 2009 12:01 UTC (Thu)
by renox (guest, #23785)
[Link]
Or the other possibility is to use a FS which does the operations in order, which simplifies application programming a lot.
Posted Mar 31, 2009 19:26 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Mar 31, 2009 19:28 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 3, 2009 6:52 UTC (Fri)
by efexis (guest, #26355)
[Link]
Posted Mar 31, 2009 15:34 UTC (Tue)
by sbergman27 (guest, #10767)
[Link]
Posted Mar 31, 2009 20:57 UTC (Tue)
by mfleetwo (guest, #57754)
[Link] (2 responses)
The ext3 FAQ says this about data=ordered:
It seems that Ted Ts'o's comments in his blog say that because ext4 is performing delayed allocation, data will not be allocated to blocks and written to disk before the metadata is written to the journal, thus breaking the expectation. I would have hoped that before the metadata is committed in the journal, outstanding data for all inodes being committed in the journal would be allocated and flushed to disk. With a 60 second commit by default a lot of data can be written. If very large files are being written and fragmentation is a concern, then fallocate() can be used to pre-allocate all the space in a single extent, as Ted points out in this article. If the user wants delayed allocation beyond each journal commit, then that is what data=writeback is for.
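For the large-file case, a hedged sketch of the pre-allocation Ted suggests (my example; the file name and the 1 GiB size are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Reserve 1 GiB up front; the blocks are allocated now, ideally as a
     * single extent, instead of trickling out under delayed allocation. */
    int err = posix_fallocate(fd, 0, 1024LL * 1024 * 1024);
    if (err) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }
    /* ... write the data ... */
    close(fd);
    return 0;
}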
Posted Mar 31, 2009 21:35 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (1 responses)
Posted Mar 31, 2009 22:57 UTC (Tue)
by bojan (subscriber, #14302)
[Link]
> Correct. Journaled data mode has the side-effect of maintaining a strict order for data writes, both with respect to each other (ie. writes in a given order will always preserve that order after a crash), and with respect to metadata such as timestamps. That's not a data integrity issue, but it is certainly a consistency issue; Unix semantics basically don't give you any consistency guarantees whatsoever unless the application is requesting consistent checkpoints via fsync/O_SYNC etc; but journaled data mode provides extra consistency nonetheless.
I think more than one person understands the _real_ semantics here.
Posted Apr 1, 2009 5:51 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (2 responses)
It simply involves putting rename replacement undo records in the filesystem journal and, on recovery, after rolling the journal forward, undoing any rename replacements for which the data of the replacement version did not make it to disk.
This could be done with O_TRUNC too, but that would be much more complex,
Posted Apr 1, 2009 6:48 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 3, 2009 21:43 UTC (Fri)
by spitzak (guest, #4593)
[Link]
My opinion on this: POSIX guarantees that if you write and close a file and rename it, anybody trying to open the destination name will see either the old data or the new data, not anything else (such as an empty file). POSIX says "I don't guarantee anything on a crash". But the whole point of ext4 is to "guarantee" something. I do not see any logical reason for this guarantee to be something other than what POSIX guarantees while it is running. So the current behavior of ext4 on a crash is wrong.
Posted Oct 21, 2016 13:54 UTC (Fri)
by damnyoulinux (guest, #111878)
[Link]
I inherited a system where someone had put ext4 on several hundred workstations. I believe this may have happened automatically during a distribution upgrade; however, even new installs that followed used ext4 and the same options as the upgraded systems. These systems are used in many ways, including a fair few database applications. I would semi-regularly, perhaps once or twice a day, have to attend to file corruption issues with these. It would usually require purging and reimporting datasets. These workstations were not treated too delicately, and this was outside of my control. They would, for example, very often be turned off at the switch at the end of the day. In my case power loss support was essential.
Considering something other than simple hardware faults (cosmic rays, loose cables, etc.) was delayed because the file system itself never appeared to be corrupted. When the number of workstations doubled, then quadrupled and so on, the failures scaled with them. One widely used application in particular would often have failures, and it made no sense, because it was chosen specifically for using a power-loss-safe means of saving data. It basically relied on move being atomic. My first thought was that perhaps they lied, but after deducing that the applications which wrote more data than others had more corruption, and that it was fairly proportionate to write load, I started to consider the file system or storage media. These were not non-standard applications but common applications being used by millions of people throughout the world, so you would expect them to be reasonably resilient and power safe, especially if they claim to be (to be fair though, they are normally run on more stable servers). In some cases empty files would be common. The thing about the file system is that it was relying on defaults that were established as acceptably safe with ext3, which didn't produce such a high rate of errors; ext4 had the same settings. Some guides today still specify those settings if you're vague with your search on things like safe mount options; people assume they will still be safe. I didn't want to go down the rabbit hole of issues with storage media, so I focused on the file system and found out about data=ordered not being safe. On face value, everything on the system looked fine. If you search for rename and it being atomic you will find lots of reinforcement for it. If you do C, anyone familiar with the rename function will have the belief that it's supposed to be atomic. It's an operation that seems like it could easily be atomic. It's also very useful as a poor man's safe file update. When everything looked power loss safe, though, and this application was relying on a rename operation, I started to question my assertion and belief that rename is always atomic. With this I eventually found myself here.
Unfortunately, even if the culprit has been found, at least for me, I think the problem still exists on some levels. There are probably a lot of people out there who still have the ticking time bomb of bad mount options, as well as many who have had to restore a backup and don't really know why. With the information out there, people today may have a very hard time avoiding this mistake unless they take extensive efforts to avoid it. When setting up a system, it's not necessarily straightforward to appreciate the damage that mount options which once were safe can do.
More of the problem also comes from trying to find the right information. It's like following a trail of breadcrumbs through a labyrinth just to get mount options that are reasonably power safe and that you can understand well. The file system is something that is sacred. It deserves a lot of attention, so such things should, in an ideal world, be far more forthcoming, well presented and delivered by an authoritative expert source. People expect the reliable behaviour and have come to depend on it. You would expect man pages enumerating the options to say in big bold letters that this issue exists. If you think how important databases and backups are, this is easily just as important. The trail I followed started with the claim that you need "data=journal" because it disables delayed allocation. Few places explain, though, that you can't just change that. A Google search for the option nodelalloc returns a first page seemingly entirely of comments. You need to change the options in your bootloader, or tune the filesystem with journal_data as a default mount option. It's also hard to find out things such as how data=ordered compares with nodelalloc. Can they be used together? Going deeper, the answer now is that you only need nodelalloc. So why data=journal? What about the results that say to use data=writeback? How do these things compare on performance and data integrity? You also learn that there are other journal safety options no one uses because of a bug a while back. Are they safe now? Is no one using them because they were filtered out as an option after the original bug? Do I have to understand the filesystem fully, read the implementation, run my own benchmarks, run my own tests and so on, just to set options that I know are safe, or safe in the right way, and give the best bang for my buck on performance? What's the final conclusion on this topic, and the best solution to the problem?
If you don't have such a busy schedule, this kind of thing might not be as frustrating to you as it is to me. The corruption is recoverable, as backups are taken appropriately. It still becomes very time consuming, however, to have to keep restoring them at a relatively high frequency. It betrays Linux's track record of being a solid system for data storage applications. While it's also part of the tradition that "Linux is hard", I don't think it should be this hard for something as fundamentally crucial as your data and being able to get certain guarantees, reliable consistent information and so on.
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
data=ordered also has some implied data safety issues for badly written applications which don't bother to call fsync(), which has been the subject of recent controversy
"""
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
cards. 256Mb+ of battery-backed cache RAM. Barriers? Data loss on power
failure? That's *so* last century.
you get the storage reliability of RAID and read speeds almost
local-disk-equivalent. Only writes and metadata reads are down, and I
assume that in time the latter in particular will be cacheable too.)
Power failure was not even in the picture before your rant
A software crash is less severe than a power failure, because file
systems that don't use barriers properly (e.g., ext3 by default) will
see all their writes come through to the disk drive, but on a power
failure some writes may not have been carried out, whereas some
logically later writes may have been carried out. As a result, such a
file system can become inconsistent on power failure even if it does
not get inconsistent on a software crash.
Power failure was not even in the picture before your rant
Thanks for an excellent summary. Let me explain two more possible consequences:
Two more
Linux has never been about correctness (however one might define it), but about quality and performance. I wonder if Linus, the benevolent dictator, should benevolently revoke Mr Ts'o's commit rights, or something.
Two more
You are right, "commit rights" was meant in a purely rhetorical sense. Saying "Linus should not pull nor even cherry-pick from Mr Ts'o any more" just doesn't carry the same strength.
Two more
If you can find any technical incompetence that results in issues going unfixed, it might be worth considering, but I don't see you pointing out any such issues.
Sorry, I don't buy that. Technical competence to me is not just a matter of not leaving issues unfixed; it includes the ability to see the consequences of your actions. When a guy makes a change and suggests that thousands compensate for it for no good reason, that is a pretty good sign of incompetence. As sbergman27 pointed out below (and as he quoted a few jiffies before I did), Linus did choose the word "incompetent".
Two more
Just for one reason: because Mr Ts'o never admitted to being wrong. In Catholic terms, what good is reparation without repentance? Or, how can you ever learn from your mistakes if you don't admit them in the first place?
Workarounds
Workarounds
What other filesystems are you talking about? On ext2 and other filesystems without a journal, sure, users know the risks and live with them. But applications seem to work fine on most other journaling filesystems: ext3, reiserfs, hfs+, zfs, even xfs was fixed years ago. Cygwin on ntfs works fine.
Workarounds
Workarounds
Workarounds
Writeback mode? FAT?!? Please leave your (metaphorical) commit rights in the reception on your way out. Both of you.
Workarounds
Two more
Ted suggested that it was an application usage problem, but added hacks to work around the issues anyway.
It's a question of trust. Do I trust my data to a file system whose
developer has the attitude that Ted Ts'o has? Not if I have an
alternative.
Two more
who's been contributing since September 1991?
Two more
What I'd like you to do is to think about what you said. I don't think anyone can say Ted was ever arrogant in these dreadful flame threads around Launchpad, Ubuntu and here on LWN. He's been quite understanding, never calling anybody anything while being insulted by an angry mob.
Would you please stop? He's not arrogant; you are. Even suggesting that Linus might start to mistrust his judgement is ridiculous.
Have you ever had anyone say that your code is "badly written" because he understood a spec in a rather peculiar manner? That amply qualifies as an insult to me. Given that most people in the world understand the spec differently, it's not bad for arrogance either.
Two more
Two more
kernel development (in fact, in free software development, period). I may
sometimes disagree with what he says, but he's *always* worth listening
to, and always well reasoned.
nix, I highly value your opinion, and Mr Ts'o can be a patron saint of the arts, but he has behaved like a jerk over this issue. Just look at his own E pur si muove:
Good people behaving badly
This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now. Personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.
I probably got too carried away with the discussion (and my own indignation). Probably he did not mean to insult anyone, and he did express himself with manners. But this tirade is not well reasoned; it has a lot of holes and is in general a lot of rubbish. More's the pity if he is such a worthy individual as you say.
Two more
Two more
It is his competence that matters.
"""
Two more
On Tue, 24 Mar 2009, Theodore Tso wrote:
>
> Try ext4, I think you'll like it. :-)
>
> Failing that, data=writeback for single-user machines is probably your
> best bet.
Isn't that the same fix? ext4 just defaults to the crappy "writeback"
behavior, which is insane.
Sure, it makes things _much_ smoother, since now the actual data is no
longer in the critical path for any journal writes, but anybody who thinks
that's a solution is just incompetent.
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
I had the impression that Linus had already spoken against data loss, and he has indeed:
Where competence meets judgment
Sure, it makes things _much_ smoother, since now the actual data is no
longer in the critical path for any journal writes, but anybody who thinks
that's a solution is just incompetent.
Gods how I enjoyed that quote. And:
But I also think that the "we write meta-data synchronously, but then the
actual data shows up at some random later time" is just crazy talk. That's
simply insane. It _guarantees_ that there will be huge windows of times
where data simply will be lost if something bad happens.
And:
Doesn't at least ext4 default to the _insane_ model of "data is less
important than meta-data, and it doesn't get journalled"?
Linus is tha man.
Speed doesn't matter if you cannot trust it
cat >/dev/null
Where competence meets judgment
Where competence meets judgment
Where competence meets judgment
above "he should have called it something else" was simply a figure of
speech, but maybe the below will be new to the newbies at least.
It was Linus' colleague who put it up on the FTP site, in a directory he named "linux", and so history was made.
Judgments must take users into account
Which means that, whatever the FS, if you must use fsync() to get the correct behaviour, then to avoid showing a freeze to the user you must go into the dreaded multi-threaded world.
Sure, the FS can provide a (Linux-specific) write barrier, but it's very likely that nobody will use it.
There may be a small performance cost; somehow I doubt that users will care.
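A minimal sketch of what that multi-threaded fsync() looks like in practice (my illustration; struct save_job and the function names are invented, and the strings must stay valid until the worker finishes):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct save_job {
    const char *tmp;    /* fully written temp file */
    const char *dest;   /* final name to rename() onto */
};

/* Worker: do the blocking fsync() and rename() off the UI thread. */
static void *sync_and_rename(void *arg)
{
    struct save_job *job = arg;
    int fd = open(job->tmp, O_RDONLY);
    if (fd >= 0) {
        fsync(fd);                   /* may stall for seconds on ext3; the UI doesn't care */
        close(fd);
        rename(job->tmp, job->dest);
    }
    free(job);
    return NULL;
}

int save_in_background(const char *tmp, const char *dest)
{
    struct save_job *job = malloc(sizeof *job);
    if (!job)
        return -1;
    job->tmp = tmp;
    job->dest = dest;
    pthread_t tid;
    /* Fire-and-forget; a real application would track completion or use a queue. */
    if (pthread_create(&tid, NULL, sync_and_rename, job) != 0) {
        free(job);
        return -1;
    }
    return pthread_detach(tid);
}

Only the worker blocks, so the UI stays responsive, and the rename() still provides the old-bytes-or-new-bytes guarantee once the data is on disk.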
Two more
coreutils now (merged with what used to be sh-utils and textutils).
Two more
loss: the remaining instances don't seem major to me (extending existing
files, for instance, is much rarer than writing out new ones).
"extending existing
files, for instance, is much rarer than writing out new ones"
Two more
My system, apache and database replay log directories would disagree on that one.
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
Ext4 breaking the promise of data=ordered ?
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
Ext4 breaking the promise of data=ordered ?
Ext4 breaking the promise of data=ordered ?
Reliable, fast rename replacements
fast and reliable in this case - i.e. never truncate files on crashes after
rename replacements without being forced to commit all data from the
replacement to disk before finishing the rename.
filesystem journal, and on recovery, after rolling the journal forward,
undoing any rename replacements for which the data of the replacement
version did not make it to disk. See discussion in comments to Ted's
recent blog entries on the subject for more information.
And, contra Linus, I don't see how anyone can rationally expect not to get a zero-length file on recovery if an application explicitly specifies that that is what it wants (before proceeding further).
Reliable, fast rename replacements
Reliable, fast rename replacements
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)