
Injecting faults into the kernel

Some kernel developers, doubtless, feel that their systems fail too often as it is; they certainly would not go out looking for ways to make more trouble. Others, however, are most interested in how their code behaves when things go wrong. As your editor recently discovered to his chagrin, error paths tend to be debugged rather less well than the "normal" code. One can try to anticipate possible failures and code the right response, but it can be hard to actually test that code. So error-handling paths can be incorrect (or missing) but the code will appear to work - until something blows up.

In an attempt to help test kernel error handling, Akinobu Mita has been working for some time on a framework for injecting faults into a running kernel. By causing things to go wrong occasionally, the fault injection code should help to ensure that error situations are handled - and handled correctly. This mechanism has found its way into 2.6.19-rc5-mm2 where, hopefully, it will be employed by developers to make sure that their code is bulletproof. Hopefully.

The framework can cause memory allocation failures at two levels: in the slab allocator (where it affects kmalloc() and most other small-object allocations) and at the page allocator level (where it affects everything, eventually). There are also hooks to cause occasional disk I/O operations to fail, which should be useful for filesystem developers. In each case, there is a flexible runtime configuration infrastructure, based on debugfs, which lets developers focus fault injection on a specific part of the kernel.
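
As a purely illustrative sketch of what this framework exercises (the structure and function names below are invented), consider a typical two-step allocation in a driver; the cleanup branch is exactly the sort of code that may never run until a fault is injected:

    #include <linux/slab.h>
    #include <linux/errno.h>

    /* Purely illustrative: the kind of error path failslab can force. */
    struct foo_dev {
            char *buf;
            char *scratch;
    };

    static int foo_setup(struct foo_dev *dev)
    {
            dev->buf = kmalloc(4096, GFP_KERNEL);
            if (!dev->buf)
                    return -ENOMEM;

            dev->scratch = kmalloc(512, GFP_KERNEL);
            if (!dev->scratch) {
                    /* On most systems this branch never runs during
                       testing; fault injection makes it execute. */
                    kfree(dev->buf);
                    return -ENOMEM;
            }
            return 0;
    }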

Your editor built a version of 2.6.19-rc5-mm2 with the fault injection capability turned on. For whatever reason, the configuration system insisted that the locking validator be enabled too; perhaps somebody injected a fault into the config scripts. In any case, the resulting kernel exports a directory (in debugfs) for each of the available fault injection capabilities.

So, for example, the slab allocation capability has a directory failslab. At system boot, failure injection is turned off; slab failures can be enabled by writing an integer value to the failslab/probability file. The value written there will be interpreted as the percent probability that any given allocation will fail; so writing "5" will cause a 5% failure rate. For situations where a failure rate of less than 1% (but greater than zero) is needed, there is a separate interval value which further filters the result. So a 0.1% failure rate could be had by setting interval to 1000 and probability to 100 - preferably in that order. There is also a times variable which puts an upper limit on the number of failures which will be simulated.
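
Those knobs can, of course, be poked with echo from a shell; as a purely illustrative alternative, a small user-space C program can do the same. This sketch assumes debugfs is mounted at /sys/kernel/debug, and the helper function is invented:

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented helper: write one value to a fault-injection knob. */
    static void set_knob(const char *path, const char *value)
    {
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    exit(1);
            }
            fprintf(f, "%s\n", value);
            fclose(f);
    }

    int main(void)
    {
            /* interval first, then probability - a 0.1% failure rate:
               only every 1000th allocation is considered, and it then
               fails with 100% probability. */
            set_knob("/sys/kernel/debug/failslab/interval", "1000");
            set_knob("/sys/kernel/debug/failslab/probability", "100");
            /* Cap the number of simulated failures. */
            set_knob("/sys/kernel/debug/failslab/times", "100");
            return 0;
    }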

As it happens, randomly injecting failures into the kernel as a whole does not necessarily lead to a lot of useful information for a developer, who is probably interested in the behavior of a specific subsystem. There is only so long that one can put up with basic shell commands failing while trying to make something happen in one particular driver. So there are a number of options which can be used to focus the faults on a particular part of the kernel. These include:

  • task-filter: if this variable is set to a positive value, faults will only be injected when specially-marked processes are running. To enable this marking, each process has a new flag (make-it-fail) in its /proc directory; setting that value to one will cause faults to be injected into that process (a sketch of this marking appears after the list).

  • address-start and address-stop: if these values are set, fault injection will be concentrated on the code found within the address range specified. As long as any entry within the call chain is inside that address range, the fault injection code will consider causing a failure.

  • ignore-gfp-wait: if this value is set to one, only non-waiting (GFP_ATOMIC) allocations will potentially fail. There is also an ignore-gfp-highmem option which will cause failures not to be injected into high-memory allocations.
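
As mentioned in the task-filter entry above, marking a process is just a matter of writing to a couple of files. Here is a minimal, purely illustrative C sketch (again assuming debugfs is mounted at /sys/kernel/debug; the helper is invented):

    #include <stdio.h>
    #include <unistd.h>

    /* Invented helper: write one value to a file. */
    static void write_file(const char *path, const char *value)
    {
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return;
            }
            fprintf(f, "%s\n", value);
            fclose(f);
    }

    int main(void)
    {
            char path[64];

            /* Mark this process as a fault-injection target... */
            snprintf(path, sizeof(path), "/proc/%d/make-it-fail", getpid());
            write_file(path, "1");

            /* ...and restrict slab failures to marked processes only. */
            write_file("/sys/kernel/debug/failslab/task-filter", "1");

            /* exec replaces the program but keeps the process, so the
               marking should persist into the command being tested. */
            execl("/bin/ls", "ls", (char *)NULL);
            perror("execl");
            return 1;
    }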

Various other options exist; there is also a set of boot options for turning on injection which might be useful for debugging early system initialization. The documentation file has the details. Also found in the documentation directory are a couple of scripts for concentrating faults on a specific command or module.

The end result of all this is a useful tool. One need no longer simply hope that the error-recovery paths in a piece of kernel code will work properly; it is now possible to actually run them and see what happens. This should lead to a better tested, more robust kernel in the near future, and that can only be a good thing.



Injecting faults into the kernel

Posted Nov 16, 2006 8:21 UTC (Thu) by simlo (subscriber, #10866)

I dislike this kind of random test. You hit a rare error, but it can be very hard to debug because the triggering is random and coupled to timing issues that are difficult to reproduce. I also dislike this because it makes changes to the actual running kernel. It is not a black-box test.

The solution I prefer is unit testing. You take your subsystem, isolate it by stubbing out all external calls, and make a test suite that runs in user space. This test program should explicitly exercise all the border cases. You can use a coverage tool to see that you hit a high fraction of the code with your test. And it should do so deterministically.
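
As a rough sketch of what such a stub can look like (all names here are invented), a fake kmalloc() lets a user-space test fail the Nth allocation deterministically:

    #include <stdlib.h>
    #include <assert.h>

    static int alloc_count;
    static int fail_at = -1;        /* fail the Nth allocation; -1 = never */

    /* Stand-in for kmalloc(); the subsystem under test would be compiled
       with its allocation calls redirected here. */
    static void *test_kmalloc(size_t size)
    {
            if (++alloc_count == fail_at)
                    return NULL;
            return malloc(size);
    }

    /* A toy subsystem standing in for the real code under test. */
    static int subsystem_init(void)
    {
            void *a = test_kmalloc(64);
            void *b;

            if (!a)
                    return -1;
            b = test_kmalloc(64);
            if (!b) {
                    free(a);        /* the error path under test */
                    return -1;
            }
            free(a);
            free(b);
            return 0;
    }

    int main(void)
    {
            fail_at = 2;            /* deterministically fail allocation #2 */
            assert(subsystem_init() == -1);
            return 0;
    }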

This gives some huge benefits to development:

  • You get a much faster development cycle, because you don't have to recompile and boot the kernel for each change, only your subsystem and test program. You can easily debug it in gdb.

  • To do this you have to have loosely coupled subsystems. So when you start forcing yourself to work this way, you automatically get a better architecture.

  • You feel safe about changing the system, because you know that many of the bugs you might introduce will be caught by the test suite. Thus you avoid "coding in fear", which always produces bad code.

So here is my suggestion:
Make a test directory in the kernel source. Put all kinds of unit test suites in there. All kernel patches should pass all tests. A patch to the kernel should also contain changes to the tests, as they are developed along with the kernel code.

I made such a "TestRTMutex" to code on the rt-mutex. It worked really well. I could do at least some SMP coding without having actual SMP hardware. Unfortunately, it isn't maintained along with the kernel code and is thus not usable now.

At work I decided to do this kind of unit testing on a project. My boss was a bit worried about why it took so long to write the code, but when I merged it into our application there were almost no errors, because almost every line of code had been tested in detail.

If unit tests were to get established within the kernel, you would see the number of "oops, that was a mistake" releases get much, much smaller. You would have tests for many error paths in the code. You still need an integration test, of course, but there is no need for injecting faults into a running kernel. That is much better done in unit tests in user space.

Injecting faults into the kernel

Posted Nov 16, 2006 10:51 UTC (Thu) by mokki (subscriber, #33200)

I do agree that unit tests would be the way to go in an ideal world.

But in the real world there will always be code in the kernel that is not fully covered by unit tests (and even 100% coverage does not guarantee anything).

What this fault injection provides is a way for third parties to test whole-system or partial-system failures independently. I think such a feature can only be helpful, and it does not in any way prevent applying other testing methods.

Injecting faults into the kernel vs unit testing

Posted Nov 17, 2006 23:19 UTC (Fri) by giraffedata (subscriber, #1954)

I think the main value of this fault injection over unit testing is cost. It takes a significant amount of time and boredom to write scaffolding for a module of the kernel, but a complete scaffold already exists -- the rest of the real kernel. All it lacks is the controls to manipulate all the inputs and outputs to get a full test, so fault injection adds a faint whisper of those.

I too agree that unit testing (and modular programming in general) gives a better result. But I understand why people find it not worth the cost.

Injecting faults into the kernel

Posted Dec 14, 2006 16:52 UTC (Thu) by PaulMcKenney (subscriber, #9624)

The really cool thing about fault injection is that it can make errors happen more quickly. In one example some years back, a race-condition bug that was taking about 24 hours to reproduce under heavy load was flushed out in under 10 minutes using fault-injection code. Think about this a bit. How long would you have to test the original system to be 99.99% confident that you had in fact fixed the bug? Now, how about the fault-injected system?
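
(A rough back-of-the-envelope, treating reproduction as a constant-rate random process with mean time-to-reproduce T:

    P(no reproduction in time t) = e^(-t/T)
    99.99% confidence:  e^(-t/T) <= 10^-4   =>   t >= T * ln(10^4) ~= 9.2 T

    Original system:       t ~= 9.2 * 24 hours   ~= 9 days of testing
    Fault-injected system: t ~= 9.2 * 10 minutes ~= 1.5 hours)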

Let's just say that your users will likely be a lot happier with you if you are using fault injection.

That said, I also really like unit tests as well, kernel/rcutorture.c being a case in point.

The fact is that we need both fault injection and unit tests.

Injecting faults into the kernel

Posted Feb 5, 2007 19:47 UTC (Mon) by lopgok (guest, #43164)

When I worked at JPL, on a fault tolerant supercomputer designed to fly in space, there was much work done on fault injectors. The problem was that we could zap some application and sometimes it would continue unharmed, sometimes it would produce garbage, and sometimes it would barf.

It was virtually impossible to debug. The big brain folks didn't change their code as a result of the faults. The fault injectors were useless for code development/debugging.

So I decided to write ERFI, the Exact Repeatable Fault Injector. You had to instrument the code to specify the data areas you might be injecting faults into. You also specified code regions where you might inject a fault. You seeded the ERFI random number generators and specified the fault-injection frequency. The big win was that when a fault caused problems with the error-correcting algorithms, you could debug and fix the code, *and then verify the fix* by injecting the exact same fault at the exact same time.
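
A minimal sketch of that idea (the names are invented, not ERFI's actual interface): a seeded generator plus a per-fault sequence number makes every injection reproducible and replayable:

    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long fault_seq;

    static void erfi_seed(unsigned int seed)
    {
            srand(seed);            /* same seed => same fault sequence */
            fault_seq = 0;
    }

    /* Returns nonzero when a fault should be injected at this site. */
    static int erfi_should_fail(const char *site, int one_in_n)
    {
            if (rand() % one_in_n == 0) {
                    /* Log the site and sequence number, so the same
                       fault can be re-injected to verify a fix. */
                    fprintf(stderr, "fault #%lu at %s\n", ++fault_seq, site);
                    return 1;
            }
            return 0;
    }

    int main(void)
    {
            int i;

            erfi_seed(42);
            for (i = 0; i < 10000; i++) {
                    char *p = erfi_should_fail("buffer_alloc", 1000)
                            ? NULL : malloc(64);
                    if (!p)
                            continue;       /* the error path under test */
                    free(p);
            }
            return 0;
    }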

I think a modified strategy would be needed for the kernel, but it seems to me that it is important to be able to re-do the fault just as it was done before in order to verify that the fix *really* fixed the problem in the code.

Injecting faults into the kernel

Posted Apr 10, 2007 16:45 UTC (Tue) by aab@cichlid.com (guest, #44579)

Some repeatability could be obtained by the fault injector remembering the call trace.
