In aviation (and in some, but frustratingly not all, other failure-is-death systems), once the majority of actual accidents were "freak" outliers, the accident investigators began to put effort into non-accidents: incidents in which something goes wrong but the result is not a reportable accident. Good work on these incidents means that the next time three things would have gone wrong, you have already prevented at least one of them, and everybody lives to write the story of another near miss.
The fixes that result from accidents (e.g. re-training pilots and traffic control to accept TCAS resolutions over instructions from a distant human where the two conflict, as a result of the Überlingen accident) get more headlines, but today the fixes made without a single person even getting bruised are probably just as important to safety.
For those of us lucky enough that our software can't kill anyone if it goes wrong the same approach still likely makes sense. A bug you fix today is one less bug that might contribute to the symptoms reported tomorrow. I will be surprised if this ext4 bug does not have at least two, and probably more, separate "fixes" that would each have completely prevented any corruption. Some of them could perhaps have been identified as correct fixes before nix even reported the corruption.
Posted Oct 28, 2012 15:49 UTC (Sun) by nix (subscriber, #2304)
Well, the fix under review <http://lkml.indiana.edu/hypermail/linux/kernel/1210.3/026...> did indeed find a bug that does not depend on journal checksumming but is merely highlighted by it: a place where the inode bitmap was being modified outside any transaction (and was thus not being journalled).
Of course, whether a journal abort and *more* filesystem corruption is actually the right way to indicate a bug of this nature is another matter :)
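As a rough illustration of the class of bug described above (a structure modified outside any transaction, so the change never reaches the journal), here is a toy sketch in Python. It is not ext4 or jbd2 code; `Journal`, `Bitmap` and their methods are hypothetical names invented for the example. It shows one way such a bug could be reported at the point of modification instead of much later.

```python
class Journal:
    """Toy journal: records changes made while a transaction is open."""

    def __init__(self):
        self.active = False
        self.log = []

    def begin(self):
        self.active = True

    def commit(self):
        self.active = False


class Bitmap:
    """Toy bitmap whose writes are supposed to be journalled."""

    def __init__(self, size, journal):
        self.bits = [False] * size
        self.journal = journal

    def set(self, i):
        # Refuse the write if no transaction is open, so the bug is
        # reported where it happens rather than when the journal is replayed.
        if not self.journal.active:
            raise RuntimeError(f"bit {i} modified outside a transaction")
        self.journal.log.append(("set", i))
        self.bits[i] = True


journal = Journal()
bitmap = Bitmap(8, journal)

try:
    bitmap.set(3)              # the buggy path: no transaction is open
except RuntimeError as err:
    print("caught:", err)

journal.begin()
bitmap.set(3)                  # the correct path: the change is journalled
journal.commit()
```

In this toy model the unjournalled write fails loudly and leaves the bitmap untouched. Whether a real filesystem should warn, abort, or do something else at that point is exactly the open question.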
Storm formation
Posted Oct 28, 2012 16:17 UTC (Sun) by cesarb (subscriber, #6266)
I looked up the commit mentioned in that patch (119c0d4) using git tag --contains. If that commit was the cause, then the bug was introduced in the 3.4 merge window.
If that is the case, it is impressive that the bug remained hidden for so long. Either the ones saying it was very hard to hit were right, or it did not cause any real problems beyond breaking the journal checksum (which then cascaded into the full corruption).
Compound bugs
Posted Oct 28, 2012 18:44 UTC (Sun) by man_ls (subscriber, #15091)
For those of us lucky enough that our software can't kill anyone if it goes wrong the same approach still likely makes sense.
Very true. When a software error is caused by two unlikely conditions, the best course of action is always to fix both conditions so that neither arises again. The usual case is that an interface returns a strange value (condition A) and the consumer does not interpret it correctly (condition B); a robust solution will avoid the strange value and also make the consumer cope with receiving it. Note that neither condition can be considered a bug on its own: the specification for the interface is usually not comprehensive enough to cover all possible exceptions.
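A minimal sketch of that A-plus-B pattern in Python, with both sides fixed. All the function names are hypothetical and invented for the example. The "strange value" here is -1, which happens to be a valid list index in Python.

```python
from typing import List, Optional


# Condition A: the interface signals "not found" with -1,
# a value the consumer can mistake for a real index.
def find_index_fragile(items: List[str], wanted: str) -> int:
    try:
        return items.index(wanted)
    except ValueError:
        return -1


# Condition B: the consumer trusts whatever index it is given.
def pick_fragile(items: List[str], wanted: str) -> str:
    # An index of -1 silently selects the last item.
    return items[find_index_fragile(items, wanted)]


# Fix A: the interface no longer produces the ambiguous value.
def find_index(items: List[str], wanted: str) -> Optional[int]:
    try:
        return items.index(wanted)
    except ValueError:
        return None


# Fix B: the consumer validates the index, whoever produced it.
def pick(items: List[str], wanted: str) -> Optional[str]:
    idx = find_index(items, wanted)
    if idx is None or not 0 <= idx < len(items):
        return None
    return items[idx]


print(pick_fragile(["a", "b"], "z"))  # wrong answer, no error raised
print(pick(["a", "b"], "z"))          # None: the miss is reported
```

Either fix alone would stop this particular failure. Applying both means that a later regression on one side is still caught by the other.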
This is an area where a numeric approach to good engineering fails miserably: you can count the number of bugs and the number of fixes, but it is all but impossible to correlate different error causes, or to make the counting fine-grained enough that a robust solution comes out as the preferred one.