LWN: Comments on "An update on the ext4 corruption issue" https://lwn.net/Articles/521701/ This is a special feed containing comments posted to the individual LWN article titled "An update on the ext4 corruption issue". en-us Tue, 09 Sep 2025 05:42:43 +0000 Tue, 09 Sep 2025 05:42:43 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Pretty sure I've found & fixed the root cause of this now. https://lwn.net/Articles/522299/ https://lwn.net/Articles/522299/ nix <div class="FormattedComment"> It hit nontechnical news sources? I was rather interested in how far it spread (for obvious reasons relating to negative egoboo[1]) but I never found a non-trade-press source reproducing it (counting Heise's English-language IT section as 'trade press', and also traducing LWN's reputation as Linux paper-of-record by lumping it into the same category).<br> <p> [1] that feeling of 'oh god stop it tell me there is no more' that any true introvert gets when they create, accidentally or otherwise, a huge splash<br> <p> </div> Wed, 31 Oct 2012 23:15:53 +0000 Pretty sure I've found & fixed the root cause of this now. https://lwn.net/Articles/522290/ https://lwn.net/Articles/522290/ rahvin <div class="FormattedComment"> I'd say hurricane before splash. It hit non-technical news sources where Linux is now some bizarre system that crashes disks (and eats babies) among people who have never run or even downloaded Linux in their entire life.<br> <p> In all honesty it felt like some paid "sponsors" (or shills as others call them) took this bug report and ran with it for political reasons. It was an obscure bug, yet the way it was presented it was a bug everyone had experienced with data (and babies) being eaten. The overblown way it was presented felt like some companies press office wrote stories and had the shills running around submitting it to every news source available. <br> </div> Wed, 31 Oct 2012 22:33:09 +0000 An update on the ext4 corruption issue https://lwn.net/Articles/522021/ https://lwn.net/Articles/522021/ nix <div class="FormattedComment"> Or you could use an old Unix trick of the trade: use four or five smaller filesystems rather than one huge one and not put yourself completely in the crapper should one of them get corrupted. (I'd have been in much worse trouble if I hadn't split the frequently-written-to /var off from the rest, for instance.)<br> </div> Tue, 30 Oct 2012 10:35:40 +0000 An update on the ext4 corruption issue https://lwn.net/Articles/522011/ https://lwn.net/Articles/522011/ kleptog <div class="FormattedComment"> XFS isn't everything. The checker for XFS requires something like 1GB of RAM for each 1TB of disk space. This doesn't sound bad until you want to install a machine with 96TB of disk and 64GB of RAM. You then find out that the ext4 user space tools can't init a disk that big and XFS can't check a disk that big. :(<br> <p> Using xfs_repair -n is a workaround, but still (IMHO) a hack.<br> </div> Tue, 30 Oct 2012 07:46:41 +0000 Pretty sure I've found & fixed the root cause of this now. 
Pretty sure I've found & fixed the root cause of this now.
https://lwn.net/Articles/521788/
theophrastus:

Hm, just did a git pull and there was an EXTRAVERSION change...

    VERSION = 3
    PATCHLEVEL = 7
    SUBLEVEL = 0
    EXTRAVERSION = -rc3
    NAME = Terrified Chipmunk

...yet nothing about ext4 in the logs [shrug].

In any case, I appreciate the apparent bug fix that's eventually on the way (patience is a virtue). Thankee!

Sun, 28 Oct 2012 20:03:24 +0000

Pretty sure I've found & fixed the root cause of this now.
https://lwn.net/Articles/521780/
nix:

I also learnt why I should re-examine, a few years down the line, whether my ugly hack workarounds using barely-documented mount options are still needed.

Still, I guess it's a good thing I reported this rather than just saying 'oh, I'll turn it off since it seems to be broken now', even if the media splash was more than slightly disconcerting.

Sun, 28 Oct 2012 19:14:00 +0000

Compound bugs
https://lwn.net/Articles/521778/
man_ls:

> For those of us lucky enough that our software can't kill anyone if it goes wrong, the same approach still likely makes sense.

Very true. When a software error is caused by two unlikely conditions, the best course of action is always to fix *both* conditions so that neither arises again. The usual case is that an interface returns a strange value (condition A) and the consumer does not interpret it correctly (condition B); a robust solution will avoid the strange value and also make the consumer invulnerable to it. Note that neither condition can be considered a bug on its own: the specification for the interface is usually not comprehensive enough to cover all possible exceptions.

This is an area where a purely numeric approach to good engineering fails miserably: you can count the number of bugs and the number of fixes, but it is all but impossible to correlate different error causes, or to make the resolution fine-grained enough that a robust solution is preferred.

Sun, 28 Oct 2012 18:44:01 +0000

Pretty sure I've found & fixed the root cause of this now.
https://lwn.net/Articles/521774/
sandeen:

Oh, and "how to get better regression coverage for changes" as well. This episode will probably lead to a better framework for journal recovery testing in ext[34], which could then be leveraged by xfstests (which already does extensive recovery testing for XFS).

Sun, 28 Oct 2012 18:03:05 +0000

Pretty sure I've found & fixed the root cause of this now.
https://lwn.net/Articles/521770/
sandeen:

The very first email about this problem listed the mount options in use, which included journal_async_commit and journal_checksum - two non-default, lightly tested options, which carry some risk and which, to me, were suspect.

I decided to test recovery with journal_checksum enabled, and every journal replay I tried failed with a bad checksum in the log. (This used to work; long ago I fixed a journal_checksum error Linus ran into, and we turned the option off by default after that.) I went searching for when this new regression happened, and landed on a commit present in kernel 3.4, 119c0d4460b001e44b41dcf73dc6ee794b98bd31, "ext4: fold ext4_claim_inode into ext4_new_inode". This change resulted in an un-journaled metadata update, which caused the bad journal checksum, which caused the "corruption" (really, just an unplayable/unplayed log) the reporter experienced.

Anyway, I really expect the patch I sent last night, "[PATCH] ext4: fix unjournaled inode bitmap modification", to fix it; the original reporter found that it fixed it for him.

It appears that the corruption problem everyone was worried about was confined to users who had the non-default journal_checksum option turned on, which then resulted in an unplayable log.

There's a lot to be learned from this whole episode - about how to report bugs, how to triage bugs, and how to write news articles about bugs, I think. :) Anyway, in the end it is fixed, I believe, and not as scary or widespread as originally feared.

-Eric

Sun, 28 Oct 2012 17:58:47 +0000
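For context, the options sandeen names are ordinary (if non-default) ext4 mount options. A minimal sketch of setting up a scratch filesystem with them, for testing only - the device and mount point below are placeholders and the disk's contents are destroyed - looks like this:

    # Create and mount a throwaway ext4 filesystem with the two implicated,
    # non-default journal options. /dev/sdc1 and /mnt/scratch are placeholders.
    mkfs.ext4 /dev/sdc1
    mkdir -p /mnt/scratch
    mount -t ext4 -o journal_checksum,journal_async_commit /dev/sdc1 /mnt/scratch

Recovery testing of the kind sandeen describes then amounts to writing to such a filesystem, forcing an unclean shutdown, and seeing whether the journal replays cleanly on the next mount.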
An update on the ext4 corruption issue
https://lwn.net/Articles/521758/
bluss:

I don't think you read the item.

Sun, 28 Oct 2012 16:42:05 +0000

Storm formation
https://lwn.net/Articles/521752/
cesarb:

I looked up the commit mentioned in that patch (119c0d4) using git tag --contains, and if that commit was the cause, then the bug was introduced in the 3.4 merge window.

If that is the case, it is impressive that the bug remained hidden for so long. Either the people saying it was very hard to hit were right, or it did not cause any real problems beyond breaking the journal checksum (which then cascaded into the full corruption).

Sun, 28 Oct 2012 16:17:33 +0000
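Cesarb's check can be repeated against a clone of the mainline kernel tree; the commit ID is the one sandeen quoted above.

    # List every release tag whose history contains the suspect commit...
    git tag --contains 119c0d4460b001e44b41dcf73dc6ee794b98bd31
    # ...or ask for the earliest tag that contains it.
    git describe --contains 119c0d4460b001e44b41dcf73dc6ee794b98bd31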
Storm formation
https://lwn.net/Articles/521744/
nix:

Well, the fix under review <http://lkml.indiana.edu/hypermail/linux/kernel/1210.3/02671.html> did indeed address a bug that does not depend on, but is merely highlighted by, journal checksumming: a place where the inode bitmap was being modified outside any transaction (and was thus not being journalled).

Of course, whether a journal abort and *more* filesystem corruption is actually the right way to indicate a bug of this nature is another matter. :)

Sun, 28 Oct 2012 15:49:29 +0000

An update on the ext4 corruption issue
https://lwn.net/Articles/521737/
th0ma7:

Glad to have stuck with XFS after all... ;)

Sun, 28 Oct 2012 13:04:00 +0000

Storm formation
https://lwn.net/Articles/521733/
tialaramex:

In aviation (and in some, but frustratingly not all, other failure-is-death systems), once the majority of actual accidents were "freak" outliers, the accident investigators began to throw effort at non-accidents, in which something goes wrong but does not result in a reportable accident. Good work on these incidents means that the next time three things would have gone wrong, you have already prevented at least one of them, and everybody lives to write the story of another near miss.

The fixes that result from accidents (e.g. re-training pilots and traffic control to accept TCAS resolutions over instructions from a distant human where the two conflict, as a result of the Überlingen accident) get more headlines, but today the fixes made without a single person even getting bruised are probably just as important to safety.

For those of us lucky enough that our software can't kill anyone if it goes wrong, the same approach still likely makes sense. A bug you fix today is one less bug that might contribute to the symptoms reported tomorrow. I will be surprised if this ext4 bug does not end up with at least two, and probably more, separate "fixes" that would each have completely prevented any corruption - and perhaps some of them could even have been identified as correct fixes without nix reporting the corruption in the first place.

Sun, 28 Oct 2012 11:41:30 +0000

Storm formation
https://lwn.net/Articles/521728/
man_ls:

On the other hand, an asteroid impact might be a bit too much to compensate for a couple of timely backups, depending on the size of the asteroid; an extinction-level event, at least, would definitely be overblown. Perhaps originating a news storm around a relatively minor bug is enough "punishment"? Actually, I think that having to run a bzr repo is adequate compensation.

Sun, 28 Oct 2012 07:48:12 +0000

Storm formation
https://lwn.net/Articles/521715/
nix:

Nah, for me this was a moderate problem. I just had to run a backup before each reboot. It was very annoying, but that was all. (Now, the first restart, when I found the corruption for the first time -- that could have been worse; it was only by chance that I'd run a backup the day before. Take that, Murphy!)

Your aviation point is well made: you need a very long concatenation of unlikely circumstances to trip this bug. If Eric Sandeen is right, it's been around since v3.4 came out. I suspect that if v3.4 and v3.5 hadn't been so stable that I hardly ever rebooted, I'd have tripped it a few months ago, in June or July -- right in the middle of a DTrace for Linux deadline crunch. But they were stable. Murphy misses again!

This is the second problem I've had in the last two months that was resolved by having up-to-date backups, after years of never needing them. (The first was a power spike that blew up my desktop box, last month.) I've seen comments on lots of forums from people saying that this has caused them to run backups more often -- so this has had a major positive impact, in my eyes. Perhaps I should arrange to find a new horrible-but-in-hindsight-obscure filesystem corruption problem every year or so, just to keep people backing up properly. :)

I note that, counting the desktop box failure without data loss, this is the third Murphy's Law swing-and-miss in six months. I'm in for some *massive* negative karma now.

What will it be?

An asteroid impact?

Sat, 27 Oct 2012 23:40:05 +0000

Storm formation
https://lwn.net/Articles/521711/
man_ls:

It is a rare opportunity to see a piece of news develop from its origin into a huge storm, and probably back into a tempest in a teapot. Once it has been established that it requires an esoteric combination of software and hardware, the "but ext4 is so unstable" camp will lose its feeble ammunition over this non-issue.

Of course, for you this is a huge problem, and it is getting well-deserved attention from the kernel devs.
But this reminds me of aviation: once all accidents are really freak accidents (a weird combination of three or more very unusual factors), you know that commercial aviation is really safe.

It is a huge relief because there isn't anywhere else to go: btrfs is immature, and ext3 would probably not be a good idea on today's disks. Perhaps XFS could be a valid option? After its period of widely reported unsafeness there have been good reports here on LWN, if I remember correctly.

Sat, 27 Oct 2012 22:13:02 +0000

An update on the ext4 corruption issue
https://lwn.net/Articles/521708/
nix:

"Hits so rarely" is perhaps a bad way of putting it: I can reproduce it consistently now. It hits very few people, for sure.

My latest tests indicate that unless you use journal_checksum or journal_async_commit (which implies journal_checksum), neither of which is the default, you appear to be safe. Unless you use nobarrier *too* (which you really shouldn't unless you have suitable hardware, meaning battery-backed disk controllers: a PSU isn't good enough), you will see a journal abort and a read-only remount of the filesystem at the next mount. You need both journal_async_commit *and* nobarrier to get no journal abort and silent filesystem corruption.

There may be additional conditions I haven't yet combed out: for all I know the latter case also requires the arcmsr driver or something. It's kind of hard for me to test that, because if I configure that driver out, my machine has no accessible disks to corrupt filesystems on. :)

Sat, 27 Oct 2012 21:18:23 +0000

An update on the ext4 corruption issue
https://lwn.net/Articles/521707/
nix:

Talk about Sod's Law and bad timing on my part. At the time he posted that update, I was unable to see it because my machine was in reboot-and-fsck-over-and-over hell, running the tests he asked for. :)

Sat, 27 Oct 2012 21:06:16 +0000
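For anyone who wants to check whether their own machines use the option combination nix describes above, the active mount options are visible in /proc/mounts; a rough sketch, with a placeholder mount point:

    # Show the active mount options for a filesystem; /home is a placeholder.
    grep ' /home ' /proc/mounts
    # If the output (or the matching /etc/fstab entry) mentions journal_checksum,
    # journal_async_commit or nobarrier, the non-default options discussed in
    # this thread are in use.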