
An update on the ext4 corruption issue

Posted Oct 27, 2012 21:18 UTC (Sat) by nix (subscriber, #2304)
Parent article: An update on the ext4 corruption issue

"Hits so rarely" is perhaps a bad way of putting it: I can reproduce it consistently now. It hits very few people, for sure.

My latest tests indicate that unless you use journal_checksum or journal_async_commit (which implies journal_checksum), neither of which is the default, you appear to be safe. Unless you use nobarrier *too* (which you really shouldn't unless you have suitable hardware, meaning battery-backed disk controllers: a UPS isn't good enough), you will see a journal abort and a read-only remount of the fs at next mount. You need both journal_async_commit *and* nobarrier to get no journal abort and silent fs corruption.
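To make the combinations concrete, here's a minimal sketch of the three cases as mount(2) calls; the device and mount point are placeholders, and normally you'd pass these as -o options to mount(8) rather than doing it from C:

    #include <sys/mount.h>

    int main(void)
    {
        /* Default options: appears safe. */
        /* mount("/dev/sdXn", "/mnt", "ext4", 0, NULL); */

        /* journal_async_commit (implies journal_checksum) without
         * nobarrier: journal abort and read-only remount at the next
         * mount after a crash. */
        /* mount("/dev/sdXn", "/mnt", "ext4", 0, "journal_async_commit"); */

        /* journal_async_commit *and* nobarrier: no journal abort,
         * silent fs corruption. */
        return mount("/dev/sdXn", "/mnt", "ext4", 0,
                     "journal_async_commit,nobarrier");
    }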

There may be additional conditions I haven't yet combed out: for all I know, the latter case also requires the arcmsr driver or something. It's kind of hard for me to test that, because if I configure that driver out, my machine has no accessible disks to corrupt filesystems on. :)



Storm formation

Posted Oct 27, 2012 22:13 UTC (Sat) by man_ls (guest, #15091) (6 responses)

It is a rare opportunity to see a piece of news develop from its origin to a huge storm, and probably back to a tempest in a teapot. Once it has been established that it requires an esoteric combination of software and hardware, the "but ext4 is so unstable" camp will lose its feeble ammunition over this non-issue.

Of course, for you this is a huge problem, and the attention from the kernel devs is well deserved. But this reminds me of aviation: once all accidents are really freak accidents (a weird combination of three or more very unusual factors), you know that commercial airlines are really safe.

It is a huge relief, because there isn't anywhere else to go: btrfs is immature, and ext3 would probably not be a good idea on today's disks. Perhaps XFS could be a valid option? After its period of widely reported unsafeness there have been good reports here on LWN, if I remember correctly.

Storm formation

Posted Oct 27, 2012 23:40 UTC (Sat) by nix (subscriber, #2304) (1 response)

Nah, for me this was a moderate problem. I just had to run a backup before each reboot. It was very annoying, but that was all. (Now, the first restart, when I found the corruption for the first time -- that could have been worse: it was only by chance that I'd run a backup the day before. Take that, Murphy!)

Your aviation point is well-made: you need a very long concatenation of unlikely circumstances to trip this bug. If Eric Sandeen is right, it's been around since v3.4 came out. I suspect that if v3.4 and v3.5 hadn't been so stable that I hardly ever rebooted, I'd have tripped it a few months ago, in June or July -- right in the middle of a DTrace for Linux deadline crunch. But they were stable. Murphy misses again!

This is the second problem I've had in the last two months that was resolved by having up-to-date backups, after years of never needing them. (The first was a power spike that blew up my desktop box last month.) I've seen comments on lots of forums from people saying that this has caused them to run backups more often -- so this has had a major positive impact, in my eyes. Perhaps I should arrange to find a new horrible-but-in-hindsight-obscure filesystem corruption problem every year or so, just to keep people backing up properly. :)

I note that, counting the desktop box failure without data loss, this is the third Murphy's Law swing-and-miss in six months. I'm in for some *massive* negative karma now.

What will it be?

An asteroid impact?

Storm formation

Posted Oct 28, 2012 7:48 UTC (Sun) by man_ls (guest, #15091)

On the other hand, an asteroid impact might be a bit too much to compensate for a couple of timely backups, depending on the size of the asteroid; an extinction-level event, at least, would definitely be overblown. Perhaps originating a news storm around a relatively minor bug is enough "punishment"? Actually, I think that having to run a bzr repo is adequate compensation.

Storm formation

Posted Oct 28, 2012 11:41 UTC (Sun) by tialaramex (subscriber, #21167) (3 responses)

In aviation (and in some, but frustratingly not all, other failure-is-death systems), once the majority of actual accidents were "freak" outliers, the accident investigators began to throw effort at non-accidents: incidents in which something goes wrong but does not result in a reportable accident. Good work on these incidents means that the next time three things would have gone wrong, you have already prevented at least one of them, and everybody lives to write the story of another near miss.

The fixes that result from accidents (e.g. re-training pilots and air traffic controllers to accept TCAS resolutions over instructions from a distant human where the two conflict, as a result of the Überlingen accident) get more headlines, but today the fixes made without a single person even getting bruised are probably just as important to safety.

For those of us lucky enough that our software can't kill anyone if it goes wrong, the same approach still likely makes sense. A bug you fix today is one less bug that might contribute to the symptoms reported tomorrow. I will be surprised if this ext4 bug does not have at least two, and probably more, separate "fixes" that would each have completely prevented any corruption -- some of which could perhaps even have been identified as correct fixes without nix ever reporting the corruption in the first place.

Storm formation

Posted Oct 28, 2012 15:49 UTC (Sun) by nix (subscriber, #2304) (1 response)

Well, the fix under review <http://lkml.indiana.edu/hypermail/linux/kernel/1210.3/026...> did indeed find a bug that does not depend on journal checksumming but is merely highlighted by it: a place where the inode bitmap was being modified outside any transaction (and was thus not being journalled).
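For illustration only -- this is a schematic of the jbd2 discipline such a fix restores, not the actual patch, and the buffer, bit, and credit count are placeholders -- metadata like the inode bitmap has to be modified under a running handle, so the change lands in the journal and is covered by its checksum (kernel context, not a standalone program):

    handle_t *handle;
    int err;

    handle = ext4_journal_start(inode, 1 /* credits: placeholder */);
    if (IS_ERR(handle))
        return PTR_ERR(handle);

    /* Join the bitmap buffer to the transaction *before* touching it... */
    err = ext4_journal_get_write_access(handle, bitmap_bh);
    if (!err) {
        /* ...then make the actual modification... */
        ext4_set_bit(bit, bitmap_bh->b_data);
        /* ...and dirty it through the journal, never bypassing it. */
        err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
    }
    ext4_journal_stop(handle);
    return err;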

Of course, whether a journal abort and *more* filesystem corruption is actually the right way to indicate a bug of this nature is another matter :)

Storm formation

Posted Oct 28, 2012 16:17 UTC (Sun) by cesarb (subscriber, #6266)

I looked up the commit mentioned in that patch (119c0d4) using git tag --contains, and, if that commit was the cause, then the bug was introduced in the 3.4 merge window.

If that is the case, it is impressive that the bug remained hidden for so long. Either the ones saying it was very hard to hit were right, or it did not cause any real problems beyond breaking the journal checksum (which then cascaded into the full corruption).

Compound bugs

Posted Oct 28, 2012 18:44 UTC (Sun) by man_ls (guest, #15091)

"For those of us lucky enough that our software can't kill anyone if it goes wrong, the same approach still likely makes sense."

Very true. When a software error is caused by two unlikely conditions, the best course of action is always to fix both conditions so that neither arises again. The usual case is that an interface returns a strange value (condition A) and the consumer does not interpret it correctly (condition B); a robust solution will avoid emitting the strange value and also make the consumer invulnerable to receiving it. Note that neither condition can be considered a bug on its own: the specification for the interface is usually not comprehensive enough to cover all possible exceptions.
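As a contrived sketch (all names and values hypothetical), a two-sided fix might look like this in C: the interface stops emitting the strange value, and the consumer validates its input anyway, in case the interface ever regresses:

    #include <stdio.h>

    /* Condition A, fixed at the interface: report failure explicitly
     * instead of returning a "strange" in-band value like -1. */
    static int read_depth(int raw, int *depth)
    {
        if (raw < 0)
            return -1;      /* explicit error path, no odd sentinel */
        *depth = raw;
        return 0;
    }

    /* Condition B, fixed at the consumer: validate the result anyway. */
    int main(void)
    {
        int depth;

        if (read_depth(-1, &depth) != 0 || depth < 0) {
            fprintf(stderr, "unexpected depth, falling back to 0\n");
            depth = 0;
        }
        printf("depth = %d\n", depth);
        return 0;
    }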

This is an area where a numeric approach to good engineering fails miserably: you can count the number of bugs and the number of fixes, but it is practically impossible to correlate distinct error causes, or to make the accounting fine-grained enough that the robust, two-sided solution is preferred.

