Pretty sure I've found & fixed the root cause of this now.

Posted Oct 28, 2012 17:58 UTC (Sun) by sandeen (guest, #42852)
Parent article: An update on the ext4 corruption issue

The very first email about this problem listed the mount options in use, which included journal_async_commit and journal_checksum: two non-default and lightly tested options, which carry some risk and, to me, were suspect.

I decided to test recovery with journal_checksum enabled, and every journal replay I tried failed with a bad checksum in the log. (This used to work; long ago I fixed a journal_checksum error Linus ran into, and we turned it off by default after that.) I went searching for when this new regression happened, and landed on a commit present in kernel 3.4, 119c0d4460b001e44b41dcf73dc6ee794b98bd31, "ext4: fold ext4_claim_inode into ext4_new_inode". This change resulted in an un-journaled metadata update, which caused the bad journal checksum, which caused the "corruption" (really, just an unplayable / unplayed log) the reporter experienced.
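As a rough illustration of that chain of events, here is a toy model in Python. It is not the jbd2 code: the `checksum` and `replay_ok` helpers and the 8-byte "bitmap" are hypothetical, and the timing (a block changing after the checksum is computed but before it reaches the log) is an assumption made only to show why a modification the journal does not know about can leave the log and its checksum in disagreement.

```python
import zlib

def checksum(blocks):
    """CRC32 over all blocks in order (a stand-in for the commit checksum)."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc

def replay_ok(logged_blocks, commit_checksum):
    """Replay is only allowed when the logged blocks match the commit checksum."""
    return checksum(logged_blocks) == commit_checksum

# Hypothetical 8-byte "inode bitmap" block.
bitmap = bytearray(8)

# Journaled path: the block is stable between checksumming and logging.
commit_crc = checksum([bytes(bitmap)])
logged_good = [bytes(bitmap)]

# Unjournaled path (assumed timing): a bit is set behind the journal's back
# after the checksum was computed but before the block reaches the log.
bitmap[0] |= 0x01
logged_bad = [bytes(bitmap)]

good = replay_ok(logged_good, commit_crc)  # log matches its checksum
bad = replay_ok(logged_bad, commit_crc)    # log does not match; replay refused
```

In this sketch the data itself is not damaged by the checksum failure; the log is simply refused at replay time, which matches the "unplayable / unplayed log" description above.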

Anyway, I really expect the patch I sent last night, "[PATCH] ext4: fix unjournaled inode bitmap modification" to fix it; the original reporter found that it fixed it for him.

It appears that the corruption problem everyone was worried about was confined to users who had the non-default journal_checksum option turned on, thus resulting in an unplayable log.
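For anyone wanting to check whether a machine is using either option, a minimal sketch follows. It assumes the options show up in the fourth field of `/proc/mounts`-style lines when they are enabled; the function name `risky_ext4_mounts` and the sample text are made up for illustration.

```python
RISKY_OPTIONS = {"journal_checksum", "journal_async_commit"}

def risky_ext4_mounts(mounts_text):
    """Return {mount_point: sorted risky options} for ext4 entries."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        options = set(fields[3].split(","))
        hits = sorted(options & RISKY_OPTIONS)
        if hits:
            found[fields[1]] = hits
    return found

# Made-up sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /home ext4 rw,noatime,journal_checksum,journal_async_commit 0 0
tmpfs /tmp tmpfs rw 0 0
"""

result = risky_ext4_mounts(SAMPLE)
```

On a live system the same function could be fed `open("/proc/mounts").read()`; an empty result would suggest none of the mounted ext4 filesystems has these options turned on.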

There's a lot to be learned from this whole episode - about how to report bugs, how to triage bugs, and how to write news articles about bugs, I think. :) Anyway, in the end, fixed I believe, and not as scary or widespread as originally feared.

-Eric

Posted Oct 28, 2012 18:03 UTC (Sun) by sandeen (guest, #42852)

Oh, and "how to get better regression coverage for changes" as well. This episode will probably lead to a better framework for journal recovery testing in ext[34], which could then be leveraged by xfstests (which already does extensive recovery testing for XFS).

Posted Oct 28, 2012 19:14 UTC (Sun) by nix (subscriber, #2304)

I also learnt why I should re-examine whether my ugly hack workarounds using barely-documented mount options are still needed a few years down the line.

Still, I guess it's a good thing I reported this rather than just saying 'oh, I'll turn it off since it seems to be broken now', even if the media splash was more than slightly disconcerting.

Posted Oct 31, 2012 22:33 UTC (Wed) by rahvin (guest, #16953)

I'd say hurricane before splash. It hit non-technical news sources, where Linux is now some bizarre system that crashes disks (and eats babies) among people who have never run or even downloaded Linux in their entire lives.

In all honesty it felt like some paid "sponsors" (or shills as others call them) took this bug report and ran with it for political reasons. It was an obscure bug, yet the way it was presented, it was a bug everyone had experienced, with data (and babies) being eaten. The overblown way it was presented felt like some company's press office wrote stories and had the shills running around submitting them to every news source available.

Posted Oct 31, 2012 23:15 UTC (Wed) by nix (subscriber, #2304)

It hit nontechnical news sources? I was rather interested in how far it spread (for obvious reasons relating to negative egoboo[1]) but I never found a non-trade-press source reproducing it (counting Heise's English-language IT section as 'trade press', and also traducing LWN's reputation as Linux paper-of-record by lumping it into the same category).

[1] that feeling of 'oh god stop it tell me there is no more' that any true introvert gets when they create, accidentally or otherwise, a huge splash

Posted Oct 28, 2012 20:03 UTC (Sun) by theophrastus (guest, #80847)

hm, just did a git pull and there was an EXTRAVERSION change...

VERSION = 3
PATCHLEVEL = 7
SUBLEVEL = 0
EXTRAVERSION = -rc3
NAME = Terrified Chipmunk

...yet nothing about ext4 in the logs [shrug].

in any case i appreciate the apparent bug fix that's eventually on the way (patience is a virtue). thankee!