When I worked at JPL on a fault-tolerant supercomputer designed to fly in space, a lot of work went into fault injectors. The problem was that we could zap an application and sometimes it would continue unharmed, sometimes it would produce garbage, and sometimes it would barf.
Since you couldn't reproduce a given failure, it was virtually impossible to debug. The big brain folks didn't change their code as a result of the faults, so the fault injectors were useless for code development and debugging.
So I decided to write ERFI, the Exact Repeatable Fault Injector. You had to instrument the code to specify the data areas you might be injecting faults into. You also specified the code regions where a fault might be injected. You seeded the ERFI random number generators and specified the fault injection frequency. The big win was that when a fault caused problems with the error correcting algorithms, you could debug and fix the code, *and then verify the fix* by injecting the exact same fault at the exact same time.
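To make the idea concrete, here is a rough sketch of how a seeded, repeatable injector could look. This is not the original ERFI code; the class `RepeatableFaultInjector`, its methods, and the `run_once` driver are all hypothetical names made up for illustration, and "time" here simply means the order in which instrumented code sites are reached.

```python
import random


class RepeatableFaultInjector:
    """Hypothetical sketch of a seeded bit-flip injector (not the original ERFI)."""

    def __init__(self, seed, fault_probability):
        # A private RNG, so nothing else in the program can disturb the sequence.
        self._rng = random.Random(seed)
        self._p = fault_probability
        self._regions = {}   # region name -> bytearray that may be corrupted
        self.log = []        # record of every injected fault, in order

    def register(self, name, buf):
        """Declare a data area that faults may be injected into."""
        if len(buf) == 0:
            raise ValueError("cannot inject faults into an empty region")
        self._regions[name] = buf

    def maybe_inject(self, site):
        """Call at an instrumented code site; flips one bit with probability p."""
        if not self._regions or self._rng.random() >= self._p:
            return None
        name = self._rng.choice(sorted(self._regions))
        buf = self._regions[name]
        bit = self._rng.randrange(len(buf) * 8)
        buf[bit // 8] ^= 1 << (bit % 8)
        fault = (site, name, bit)
        self.log.append(fault)
        return fault


def run_once(seed, fault_probability):
    """Toy workload: 100 instrumented steps over one 16-byte data area."""
    state = bytearray(16)
    injector = RepeatableFaultInjector(seed, fault_probability)
    injector.register("state", state)
    for step in range(100):
        injector.maybe_inject(f"step-{step}")
    return injector.log, bytes(state)
```

Calling `run_once` twice with the same seed and frequency should give the same fault log and the same corrupted state, which is what lets you rerun the exact fault after a fix. The sketch assumes the program reaches the instrumented sites in the same order on every run; anything nondeterministic in the application itself would break the repeatability.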
I think a modified strategy would be needed for the kernel, but it seems to me that it is important to be able to re-do the fault just as it was done before in order to verify that the fix *really* fixed the problem in the code.
Copyright © 2017, Eklektix, Inc.