From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
"One of our primary design goals was that it should be painlessly easy to upgrade from ext3 to ext4. You might not get all of the benefits of ext4 unless you do a backup/reformat/restore of your filesystem, but you would get at least some of the benefits by simply remounting the filesystem using ext4 and enabling some of ext4's features."
Posted Mar 31, 2009 0:21 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (36 responses)
I came away from the "recent controversy" uncertain of exactly what the real world implications of data=ordered vs data=writeback actually were in the context of ext4 with the patches destined for 2.6.30. Could someone clearly state the reliability implications of those modes in that context?
Thanks.
Posted Mar 31, 2009 3:28 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (32 responses)
As per mount manual page, ordered mode of ext3 does this:
"All data is forced directly out to the main file system prior to its metadata being committed to the journal."
So, in terms of reliability (i.e. the situation after a crash), the file will always have data in it, because the metadata is always committed after the data. There will be no inodes without correct data blocks. With writeback mode, this ordering is not guaranteed, and you may see "old data appear in files after a crash and journal recovery" (also from the manual).
AFAIK, ext4 does delayed allocation by default. This means that sometimes the metadata can hit the disk before the data, leaving the file with no blocks.
One can completely disable delayed allocation on ext4 (the nodelalloc option), which should then avoid the above, at a considerable performance penalty. This is the big hammer approach. I think Ted also talked about the possibility of another, similar option (data=alloc-on-commit), but I don't know whether that went ahead or not. Anyhow, it is similar in its effect to nodelalloc.
The patches are about writing blocks (data) before metadata only in certain situations. In Ted's words:
"These three patches (with git ids bf1b69c0, f32b730a, and 8411e347) will cause a file to have any delayed allocation blocks to be allocated immediately when a file is replaced. This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file."
Meaning, the most troublesome cases of missing data are worked around, but generally speaking delayed allocation is still in action, so one may still end up with inodes that point nowhere, because their data had not been committed before the crash, either implicitly by the kernel or explicitly by fsync().
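To make that concrete, here is a minimal C sketch of the write-to-temp, fsync(), rename() pattern the thread keeps circling around. It is my illustration rather than code from any of the posters; the helper name replace_file() is made up and error handling is abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace `path` with `len` bytes from `buf`, via a temp file.
 * The fsync() before rename() closes the window in which a crash could
 * leave `path` pointing at an inode with no allocated data blocks. */
int replace_file(const char *path, const char *tmp,
                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) < 0) {        /* force the data (and its allocation) to disk */
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;
    return rename(tmp, path);   /* readers now see either old or new bytes */
}

Without the fsync(), delayed allocation means the rename's metadata can be journalled long before the data blocks exist, which is exactly the zero-length-file case described above.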
Posted Mar 31, 2009 6:41 UTC (Tue)
by nix (subscriber, #2304)
[Link] (2 responses)
(And combine it with fs-cache and running everything else over NFS, and you get the storage reliability of RAID and read speeds almost local-disk-equivalent. Only writes and metadata reads are down, and I assume that in time the latter in particular will be cacheable too.)
Posted Mar 31, 2009 7:45 UTC (Tue)
by khim (subscriber, #9252)
[Link] (1 responses)
The whole discussion started with a software crash (nVidia drivers are
very helpful here). I fail to see how these "new century" toys can help
against this.
Posted Apr 3, 2009 13:55 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 31, 2009 7:03 UTC (Tue)
by man_ls (guest, #15091)
[Link] (27 responses)
Posted Mar 31, 2009 7:52 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
We have many, many arrogant people in the Free software world, and key parts of any Linux system depend on their code. If you can find any technical incompetence that results in issues going unfixed, it might be worth considering, but I don't see you pointing out any such issues.
Posted Mar 31, 2009 19:38 UTC (Tue)
by man_ls (guest, #15091)
[Link] (8 responses)
Posted Mar 31, 2009 21:02 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (7 responses)
Posted Mar 31, 2009 21:40 UTC (Tue)
by man_ls (guest, #15091)
[Link] (5 responses)
Posted Mar 31, 2009 21:50 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
The very same blog post that describes the problems also mentions that fixes have already been queued. Technically, I don't know what more you could ask for. To be clear, there are other potential issues present but the ones you are talking about were fixed even before the blog post was written.
Posted Mar 31, 2009 22:47 UTC (Tue)
by man_ls (guest, #15091)
[Link] (3 responses)
There are few black and white issues, but a filesystem developer saying that corrupting user data is fine would seem to qualify. Later committing a fix to "work around" the problem while a hundred thousand developers fix their code is hardly enough. Technically, I am not even sure a public flogging would be enough.
And now, ladies and gentlemen, with your kind permission I will just call Ts'o a nazi in a half-assed invocation of Godwin's law to jump out of this discussion and go to sleep.
Posted Mar 31, 2009 23:52 UTC (Tue)
by rahulsundaram (subscriber, #21946)
[Link] (2 responses)
Posted Apr 1, 2009 0:04 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (1 responses)
Actually, you don't even have to look at other file systems. ext3 in writeback mode is sufficient, because metadata can go to disk before data. You may end up with garbage in your files after the crash.
Posted Apr 1, 2009 6:52 UTC (Wed)
by man_ls (guest, #15091)
[Link]
Posted Apr 3, 2009 14:03 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 31, 2009 9:38 UTC (Tue)
by regala (guest, #15745)
[Link] (4 responses)
Posted Mar 31, 2009 9:47 UTC (Tue)
by regala (guest, #15745)
[Link] (3 responses)
Posted Mar 31, 2009 18:21 UTC (Tue)
by man_ls (guest, #15091)
[Link] (2 responses)
That reminds me of the old joke. A reckless driver on the highway is listening to the radio: "Attention, attention, there is a crazy man driving against the traffic on the highway", and he says: "One? All of 'em!"
Posted Mar 31, 2009 19:32 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
I wonder if you've been using the same Internet I have, really.
Posted Mar 31, 2009 21:37 UTC (Tue)
by man_ls (guest, #15091)
[Link]
Posted Mar 31, 2009 14:02 UTC (Tue)
by clugstj (subscriber, #4020)
[Link] (8 responses)
Posted Mar 31, 2009 15:40 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (7 responses)
There is competence, and there is judgment. And the two are distinct. I think that it is his judgment on this matter that is in question. I've been waiting for Linus to speak on the matter. I would be very interested in his view of this matter. Of course, the distros have the final say as to what are the effective defaults, even down to the patches they choose to apply. And *savvy* users have the ultimate decision as to the configuration of their systems. Unsavvy users, of course, are stuck with what they get.
Posted Mar 31, 2009 19:18 UTC (Tue)
by sbergman27 (guest, #10767)
[Link]
========================
Isn't that the same fix? ext4 just defaults to the crappy "writeback"
behavior, which is insane.
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
Linus
=======================
Posted Mar 31, 2009 19:18 UTC (Tue)
by man_ls (guest, #15091)
[Link] (5 responses)
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
And expecting every app to do fsync() is also crazy talk, especially with
the major filesystems _sucking_ so bad at it (it's actually a lot more
realistic with ext2 than it is with ext3).
So look for a middle ground. Not this crazy militant "user apps must do
fsync()" crap. Because that is simply not a realistic scenario.
And ext3 with "data=writeback" does the same, no?
Both of which are - as far as I can tell - total braindamage. At least
with ext3 it's not the _default_ mode.
Posted Mar 31, 2009 19:39 UTC (Tue)
by oak (guest, #2786)
[Link]
If /dev/null writes aren't zero-copy, it's journaled too!
The window for data retrieval is (infinitely) small though.
Posted Mar 31, 2009 22:37 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (3 responses)
Major filesystems being "ext3 in ordered mode only", of course. The rest could be just fine with fsync(), as we can see above from his ext2 comment. And as Ted pointed out, ext4 doesn't have a big penalty on fsync(), because it doesn't have to flush out MBs of stuff that are unrelated to this particular fsync(), every time this system call is used.
Just as Linus says that ext4 is brain damaged for doing delayed allocation by default, so it can be claimed that ext3 is brain damaged for locking up people's machines for a few seconds on a perfectly reasonable system call: fsync(). We have seen this from the FF fiasco. In fact, when Linux says that having an interactive application do fsync() is impossible, he must mean on ext3 in ordered mode, because that's what the FF complaints were about. As Alan Cox and Ted pointed out, one can already do fsync() in another thread and be fully interactive.
As for the configuration files of KDE (which is where the problem started), the library can trivially back these files up on startup and _never_ use fsync() after that. Other problems should probably be solved by a proper system call that does guarantee ordering (I think Ted provisionally called it fbarrier() or something). Then we'd have a real guarantee of the behaviour, instead of relying on the whims of implementations.
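As a rough sketch of that backup-on-startup idea (my own illustration, not KDE's actual code; it assumes saves replace the file via rename(), i.e. with a fresh inode, so the hard link keeps referring to the old contents):

#include <errno.h>
#include <unistd.h>

/* At startup, hard-link the current (known-good) config to a backup name.
 * Later saves that rename() a new file over `path` leave this link intact,
 * so a crash-truncated config can be recovered from `bak` without any
 * fsync() in the save path. */
int backup_config(const char *path, const char *bak)
{
    unlink(bak);                          /* drop any previous backup */
    if (link(path, bak) < 0 && errno != ENOENT)
        return -1;                        /* ENOENT: no config yet, fine */
    return 0;
}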
Claiming that rename() always did "data before metadata" commits is ahistorical. So, the crazy talk ain't that crazy after all. We just got caught with our pants down.
Surely, Linus is "tha man" when it comes to Linux and what he says will eventually go. But exempting what he says from any criticism is just arse licking, IMNSHO.
Posted Mar 31, 2009 22:43 UTC (Tue)
by bojan (subscriber, #14302)
[Link] (1 responses)
Gee, he should have called it something else. It is impossible to get the man's name right after having "Linux" :-)
Posted Apr 12, 2009 7:59 UTC (Sun)
by Duncan (guest, #6647)
[Link]
Actually, "he" (Linus) did call it something else, "Freeix". It was
(Just google freeix linux for more. "I'm feeling lucky" does it for me.)
Duncan
Posted Apr 2, 2009 12:01 UTC (Thu)
by renox (guest, #23785)
[Link]
Or the other possibility is to use a FS which does the operations in order, which simplifies application programming a lot.
Posted Mar 31, 2009 19:26 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Mar 31, 2009 19:28 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 3, 2009 6:52 UTC (Fri)
by efexis (guest, #26355)
[Link]
Posted Mar 31, 2009 15:34 UTC (Tue)
by sbergman27 (guest, #10767)
[Link]
Posted Mar 31, 2009 20:57 UTC (Tue)
by mfleetwo (guest, #57754)
[Link] (2 responses)
The ext3 FAQ says this about data=ordered:
It seems that Ted Ts'o's comments in his blog say that because ext4 is performing delayed allocation, data will not be allocated to blocks and written to disk before the metadata is written to the journal, thus breaking the expectation. I would have hoped that before the metadata is committed in the journal, outstanding data for all inodes being committed in the journal would be allocated and flushed to disk. With a 60 second commit by default a lot of data can be written. If very large files are being written and fragmentation is a concern, then fallocate() can be used to pre-allocate all the space in a single extent, as Ted points out in this article. If the user wants delayed allocation beyond each journal commit, then that is what data=writeback is for.
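For the large-file case, a hedged sketch of the pre-allocation Ted suggests (my example; the file name and the 1 GiB size are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Reserve 1 GiB up front; the blocks are allocated now, ideally as a
     * single extent, instead of trickling out under delayed allocation. */
    int err = posix_fallocate(fd, 0, 1024LL * 1024 * 1024);
    if (err) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }
    /* ... write the data ... */
    close(fd);
    return 0;
}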
Posted Mar 31, 2009 21:35 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (1 responses)
Posted Mar 31, 2009 22:57 UTC (Tue)
by bojan (subscriber, #14302)
[Link]
> Correct. Journaled data mode has the side-effect of maintaining a strict order for data writes, both with respect to each other (ie. writes in a given order will always preserve that order after a crash), and with respect to metadata such as timestamps. That's not a data integrity issue, but it is certainly a consistency issue; Unix semantics basically don't give you any consistency guarantees whatsoever unless the application is requesting consistent checkpoints via fsync/O_SYNC etc; but journaled data mode provides extra consistency nonetheless.
I think more than one person understands the _real_ semantics here.
Posted Apr 1, 2009 5:51 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (2 responses)
It simply involves putting rename replacement undo records in the filesystem journal and, on recovery, after rolling the journal forward, undoing any rename replacements for which the data of the replacement version did not make it to disk.
This could be done with O_TRUNC too, but that would be much more complex,
Posted Apr 1, 2009 6:48 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 3, 2009 21:43 UTC (Fri)
by spitzak (guest, #4593)
[Link]
My opinion on this: POSIX guarantees that if you write and close a file and rename it, anybody trying to open the destination name will see either the old data or the new data, not anything else (such as an empty file). POSIX says "I don't guarantee anything on a crash". But the whole point of ext4 is to "guarantee" something. I do not see any logical reason for this guarantee to be something other than what POSIX guarantees while it is running. So the current behavior of ext4 on a crash is wrong.
Posted Oct 21, 2016 13:54 UTC (Fri)
by damnyoulinux (guest, #111878)
[Link]
I inherited a system where someone had put ext4 on several hundred workstations. I believe this may have happened automatically during a distribution upgrade; however, even new installs that followed used ext4 and the same options as the upgraded systems. These systems are used in many ways, including a fair few database applications. I would semi-regularly, perhaps once or twice a day, have to attend to file corruption issues with these. It would usually require purging and reimporting datasets. These workstations were not treated too delicately, and this was outside of my control. They would, for example, very often be turned off at the switch at the end of the day. In my case power loss support was essential.
Considering something other than simple hardware faults (cosmic rays, loose cables, etc.) was delayed because the file system itself never appeared to be corrupted. When the number of workstations doubled, then quadrupled and so on, the failures scaled with them. One widely used application in particular would often have failures, and it made no sense, because it was chosen specifically for using a power-loss-safe means of saving data. It basically relied on move being atomic. My first thought was that perhaps they lied, but after deducing that the applications which wrote more data than others had more corruption, and that it was fairly proportionate to write load, I started to consider the file system or storage media. These were not non-standard applications but common applications being used by millions of people throughout the world, so you would expect them to be reasonably resilient and power safe, especially if they claim to be (to be fair though, they are normally run on more stable servers). In some cases empty files would be common. The thing about the file system is that it was relying on defaults that were established as acceptably safe with ext3, which didn't produce such a high rate of errors; ext4 had the same settings. Some guides today still specify those settings if you're vague with your search on things like safe mount options; people assume they will still be safe. I didn't want to go down the rabbit hole of issues with storage media, so I focused on the file system and found out about data=ordered not being safe. On face value, everything on the system looked fine. If you search for rename and it being atomic you will find lots of reinforcement for it. If you do C, anyone familiar with the rename function will have the belief that it's supposed to be atomic. It's an operation that seems like it could easily be atomic. It's also very useful as a poor man's safe file update. When everything looked power loss safe, though, and this application was relying on a rename operation, I started to question my assertion and belief that rename is always atomic. With this I eventually found myself here.
Unfortunately, even if the culprit has been found, at least for me, I think the problem still exists on some levels. There are probably a lot of people out there who still have the ticking time bomb of bad mount options, as well as many who have had to restore a backup and don't really know why. With the information out there, people today may have a very hard time avoiding this mistake unless they take extensive efforts to avoid it. When setting up a system, it's not necessarily straightforward to appreciate the damage that mount options which once were safe can do.
More of the problem also comes from trying to find the right information. It's like following a trail of breadcrumbs through a labyrinth just to get mount options that are reasonably power safe and that you can understand well. The file system is something that is sacred. It deserves a lot of attention, so such things should, in an ideal world, be far more forthcoming, well presented and delivered by an authoritative expert source. People expect the reliable behaviour and have come to depend on it. You would expect man pages enumerating the options to say in big bold letters that this issue exists. If you think how important databases and backups are, this is easily just as important. The trail I followed started with the claim that you need "data=journal" because it disables delayed allocation. Few places explain, though, that you can't just change that. A Google search for the option nodelalloc returns a first page seemingly entirely of comments. You need to change the options in your bootloader, or tune the filesystem with journal_data as a default mount option. It's also hard to find out things such as how data=ordered compares with nodelalloc. Can they be used together? Going deeper, the answer now is that you only need nodelalloc. So why data=journal? What about the results that say to use data=writeback? How do these things compare on performance and data integrity? You also learn that there are other journal safety options no one uses because of a bug a while back. Are they safe now? Is no one using them because they were filtered out as an option after the original bug? Do I have to understand the filesystem fully, read the implementation, run my own benchmarks, run my own tests and so on, just to set options that I know are safe, or safe in the right way, and give the best bang for my buck on performance? What's the final conclusion on this topic, and the best solution to the problem?
If you don't have such a busy schedule, this kind of thing might not be as frustrating to you as it is to me. The corruption is recoverable, as backups are taken appropriately. It still becomes very time consuming, however, to have to keep restoring them at a relatively high frequency. It betrays Linux's track record of being a solid system for data storage applications. While it's also part of the tradition that "Linux is hard", I don't think it should be this hard for something as fundamentally crucial as your data and being able to get certain guarantees, reliable consistent information and so on.
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
data=ordered also has some implied data safety issues for badly written applications which don't bother to call fsync(), which has been the subject of recent controversy
"""
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
cards. 256Mb+ of battery-backed cache RAM. Barriers? Data loss on power
failure? That's *so* last century.
you get the storage reliability of RAID and read speeds almost
local-disk-equivalent. Only writes and metadata reads are down, and I
assume that in time the latter in particular will be cacheable too.)
Power failure was not even in the picture before your rant
A software crash is less severe than a power failure, because file
systems that don't use barriers properly (e.g., ext3 by default) will
see all their writes come through to the disk drive, but on a power
failure some writes may not have been carried out, whereas some
logically later writes may have been carried out. As a result, such a
file system can become inconsistent on power failure even if it does
not get inconsistent on a software crash.
Power failure was not even in the picture before your rant
Thanks for an excellent summary. Let me explain two more possible consequences:
Two more
Linux has never been about correctness (however one might define it), but about quality and performance. I wonder if Linus, the benevolent dictator, should benevolently revoke Mr Ts'o's commit rights, or something.
Two more
You are right, "commit rights" was meant in a purely rhetorical sense. Saying "Linus should not pull nor even cherry-pick from Mr Ts'o any more" just doesn't carry the same strength.
Two more
If you can find any technical incompetence that results in issues going unfixed, it might be worth considering, but I don't see you pointing out any such issues.
Sorry, I don't buy that. Technical competence to me is not just a matter of not leaving issues unfixed; it includes the ability to see the consequences of your actions. When a guy makes a change and suggests that thousands compensate for it for no good reason, that is a pretty good sign of incompetence. As sbergman27 pointed out below (and as he quoted a few jiffies before I did), Linus did choose the word "incompetent".
Two more
Just for one reason: because Mr Ts'o never admitted to being wrong. In Catholic terms, what good is reparation without repentance? Or, how can you ever learn from your mistakes if you don't admit them in the first place?
Workarounds
Workarounds
What other filesystems are you talking about? On ext2 and other filesystems without a journal, sure, users know the risks and live with them. But applications seem to work fine on most other journaling filesystems: ext3, reiserfs, hfs+, zfs, even xfs was fixed years ago. Cygwin on ntfs works fine.
Workarounds
Workarounds
Workarounds
Writeback mode? FAT?!? Please leave your (metaphorical) commit rights in the reception on your way out. Both of you.
Workarounds
Two more
Ted suggested that it was an application usage problem, but added hacks to work around the issues anyway.
It's a question of trust. Do I trust my data to a file system whose
developer has the attitude that Ted Ts'o has? Not if I have an
alternative.
Two more
who's been contributing since September 1991?
Two more
What I'd like you to do is to think about what you said. I don't think anyone can say Ted was ever arrogant in these dreadful flame threads around Launchpad, Ubuntu and here on LWN. He's been quite understanding, never calling anybody anything while being insulted by an angry mob.
Would you please stop? He's not arrogant; you are. Even suggesting that Linus might start to mistrust his judgement is ridiculous.
Have you ever had anyone say that your code is "badly written" because he understood a spec in a rather peculiar manner? That amply qualifies as an insult to me. Given that most people in the world understand the spec differently, it's not bad for arrogance either.
Two more
Two more
kernel development (in fact, in free software development, period). I may
sometimes disagree with what he says, but he's *always* worth listening
to, and always well reasoned.
nix, I highly value your opinion, and Mr Ts'o can be a patron saint of the arts, but he has behaved like a jerk over this issue. Just look at his own E pur si muove:
Good people behaving badly
This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now. Personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.
I probably got too carried away with the discussion (and my own indignation). Probably he did not mean to insult anyone, and he did express himself with manners. But this tirade is not well reasoned; it has a lot of holes and is in general a lot of rubbish. More's the pity if he is such a worthy individual as you say.
Two more
Two more
It is his competence that matters.
"""
Two more
On Tue, 24 Mar 2009, Theodore Tso wrote:
>
> Try ext4, I think you'll like it. :-)
>
> Failing that, data=writeback for single-user machines is probably your
> best bet.
Isn't that the same fix? ext4 just defaults to the crappy "writeback"
behavior, which is insane.
Sure, it makes things _much_ smoother, since now the actual data is no
longer in the critical path for any journal writes, but anybody who thinks
that's a solution is just incompetent.
We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.
I had the impression that Linus had already spoken against data loss, and he has indeed:
Where competence meets judgment
Sure, it makes things _much_ smoother, since now the actual data is no
longer in the critical path for any journal writes, but anybody who thinks
that's a solution is just incompetent.
Gods how I enjoyed that quote. And:
But I also think that the "we write meta-data synchronously, but then the
actual data shows up at some random later time" is just crazy talk. That's
simply insane. It _guarantees_ that there will be huge windows of times
where data simply will be lost if something bad happens.
And:
Doesn't at least ext4 default to the _insane_ model of "data is less
important than meta-data, and it doesn't get journalled"?
Linus is tha man.
Speed doesn't matter if you cannot trust it
cat >/dev/null
Where competence meets judgment
Where competence meets judgment
Where competence meets judgment
above "he should have called it something else" was simply a figure of
speech, but maybe the below will be new to the newbies at least.
It was Linus' colleague who put it up on the FTP site, in a directory he named "linux", and so history was made.
Judgments must take users into account
Which means that, whatever the FS, if you must use fsync() to get the correct behaviour, then to avoid showing a freeze to the user you must go into the dreaded multi-threaded world.
Sure, the FS can provide a (Linux-specific) write barrier, but it's very likely that nobody will use it.
There may be a small performance cost; somehow I doubt that users will care.
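A minimal sketch of what that multi-threaded fsync() looks like in practice (my illustration; struct save_job and the function names are invented, and the strings must stay valid until the worker finishes):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct save_job {
    const char *tmp;    /* fully written temp file */
    const char *dest;   /* final name to rename() onto */
};

/* Worker: do the blocking fsync() and rename() off the UI thread. */
static void *sync_and_rename(void *arg)
{
    struct save_job *job = arg;
    int fd = open(job->tmp, O_RDONLY);
    if (fd >= 0) {
        fsync(fd);                   /* may stall for seconds on ext3; the UI doesn't care */
        close(fd);
        rename(job->tmp, job->dest);
    }
    free(job);
    return NULL;
}

int save_in_background(const char *tmp, const char *dest)
{
    struct save_job *job = malloc(sizeof *job);
    if (!job)
        return -1;
    job->tmp = tmp;
    job->dest = dest;
    pthread_t tid;
    /* Fire-and-forget; a real application would track completion or use a queue. */
    if (pthread_create(&tid, NULL, sync_and_rename, job) != 0) {
        free(job);
        return -1;
    }
    return pthread_detach(tid);
}

Only the worker blocks, so the UI stays responsive, and the rename() still provides the old-bytes-or-new-bytes guarantee once the data is on disk.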
Two more
coreutils now (merged with what used to be sh-utils and textutils).
Two more
loss: the remaining instances don't seem major to me (extending existing
files, for instance, is much rarer than writing out new ones).
"extending existing
files, for instance, is much rarer than writing out new ones"
Two more
My system, apache and database replay log directories would disagree on that one.
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)
Ext4 breaking the promise of data=ordered ?
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
Ext4 breaking the promise of data=ordered ?
Ext4 breaking the promise of data=ordered ?
Reliable, fast rename replacements
fast and reliable in this case - i.e. never truncate files on crashes after
rename replacements without being forced to commit all data from the
replacement to disk before finishing the rename.
filesystem journal, and on recovery, after rolling the journal forward,
undoing any rename replacements for which the data of the replacement
version did not make it to disk. See discussion in comments to Ted's
recent blog entries on the subject for more information.
And, contra Linus, I don't see how anyone can rationally expect not to get a zero-length file on recovery if an application explicitly specifies that that is what it wants (before proceeding further).
Reliable, fast rename replacements
Reliable, fast rename replacements
From ext3 to ext4: An Interview with Theodore Ts'o (Linux Magazine)