This last line seems to me the key to the whole problem.
You can't (further) corrupt a file system if you don't
write to it. There's a great deal of checking that you can
only afford to do when some program isn't waiting for read()
to come back. On btrfs, if it's designed right, you should
be able to run a consistency checker in the background on a
live file system.
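A read-only pass of that kind might look like the sketch below. The flat block list and stored CRCs are a toy stand-in (a real btrfs scrub walks the on-disk trees and uses its own checksum metadata); the point is only that a scan that never writes cannot make the damage worse, and that its output already constrains where the corruption lives.

```python
import zlib

def scan(blocks):
    """Return indices of blocks whose stored checksum does not match.

    `blocks` is a list of (data, stored_crc) pairs.  The scan only
    reads, so it cannot further corrupt anything, and the returned
    indices bound the locus of corruption: only those blocks need
    to be restored from backup.
    """
    return [i for i, (data, crc) in enumerate(blocks)
            if zlib.crc32(data) != crc]

good = (b"hello", zlib.crc32(b"hello"))
bad = (b"hellp", zlib.crc32(b"hello"))   # bit-rotted payload
print(scan([good, bad, good]))           # → [1]
```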
It's no fun to be informed that your file system is corrupt
and, further, that it can't be fixed, but that's much better
than *not* being informed that your file system is corrupt,
when it is. The sooner you find out, the fewer backups will
also be corrupt. A tool that can only constrain the locus of
the corruption would still be helpful; only the faulty part
needs to be reloaded from backups.

A widely used checker would result in better bug reports for the
file system proper, as corruption is found early. How many bugs
are still waiting to be found just because nothing is looking?

The way forward, then, is to release a pure checker first,
and then add repair capabilities one at a time as
they become ready. If the repair tool generated a journal of
changes without writing them to the file system proper, then
you could run a full check on the sum of fs+journal, and only
commit the changes if the result is clearly better than before.
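That check-before-commit loop can be sketched on the same toy image as above; the overlay dict plays the role of the journal, and "clearly better" is reduced here to a strictly lower error count (all names are illustrative, not btrfs interfaces):

```python
import zlib

def errors(blocks, overlay=None):
    """Count checksum mismatches in the image, with the journalled
    changes (overlay) applied on top but not yet written back."""
    view = dict(enumerate(blocks))
    view.update(overlay or {})
    return sum(1 for data, crc in view.values()
               if zlib.crc32(data) != crc)

def try_repair(blocks, overlay):
    """Commit the journalled fixes only if fs+journal checks out
    strictly better than the file system alone."""
    if errors(blocks, overlay) < errors(blocks):
        for i, entry in overlay.items():
            blocks[i] = entry               # commit the journal
        return True
    return False                            # discard the journal

blocks = [(b"hello", zlib.crc32(b"hello")),
          (b"hellp", zlib.crc32(b"hello"))]  # block 1 is corrupt
fix = {1: (b"hello", zlib.crc32(b"hello"))}
print(try_repair(blocks, fix), errors(blocks))  # → True 0
```

A "repair" that made things worse would leave the image untouched, which is exactly the property you want from a tool that is released before it is fully trusted.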
Ideally the repair machinery would be the same well-tested
code that, in production, integrates ordinary changes into
the file system.