Fault injection and unexpected requirement injection

By Jonathan Corbet
December 2, 2009

Good developers carefully write their code to handle error conditions which may arise. This code frequently suffers from one problem, though: test coverage is hard. Many of the anticipated errors never come about, so the error-handling code never gets exercised. So when things go wrong for real, recovery does not work as expected. For a few years, the Linux kernel has had a fault injection framework designed to help in the debugging of some types of error-handling code. By forcing specific things (memory allocations in particular) to go wrong, the fault injection framework can help developers ensure that errors are really handled as expected.

Sripathi Kodi recently posted a patch adding certain types of futex failures to the fault injection framework. Ingo Molnar responded with a potentially surprising request:

Instead of this unacceptably ugly and special-purpose debugfs interface, please extend perf events to allow event injection. Some other places in the kernel (which deal with rare events) want/need this capability too.

This "unacceptably ugly" interface has existed as part of the fault injection framework since 2006, so it is a little surprising to hear, now, that it cannot be used. Ingo is firm about this point, though, and appears unwilling to back down.

Extending perf events for fault injection might be the right long-term solution. But this situation highlights a trap for developers which certainly acts to make participation in the development process harder. In his travels, your editor has heard complaints from developers who set out to accomplish a specific task, only to be told that they must undertake a much larger cleanup to get their code merged. The topic also came up at the 2009 kernel summit; there, the consensus seemed to be that this kind of request can quickly become unreasonable.

In this case, Sripathi has not been asked to fix the remainder of the fault injection framework code. But adding a new functionality to the perf events subsystem still likely goes rather beyond the scope of the original project. Sripathi has not responded to this request, so it's not clear whether we'll see a futex fault injection mechanism reworked to fit the new requirements, or whether this code will just fade away.

Index entries for this article
Kernel	Development model

Fault injection and unexpected requirement injection

Posted Dec 3, 2009 2:15 UTC (Thu) by jcm (subscriber, #18262) [Link] (7 responses)

This is all absolutely true. Thanks for saying it!

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 8:32 UTC (Thu) by mingo (guest, #31122) [Link] (6 responses)

Please allow me to explain my position on this in more detail.

Fault-injection certainly looked fine back in 2006 in its then-limited incarnation - i remember having reviewed it, and i liked it, and i even tested it and reported bugs in it.

These particular patches to the futex subsystem look ugly and intrusive though. Futexes are performance sensitive and complex - which both raise the bar to extra instrumentation and extra complexity.

The fact that some debugging infrastructure exists in the kernel does not give automatic acceptance of its use in any arbitrary place in the kernel - quality judgements have to be done (and are being done every day) for specific debugging patches.

For example new tracepoints are not accepted into other subsystems automatically (Andrew Morton has repeatedly rejected new mm tracepoints), without the maintainers actively accepting them - and similar boundaries exists for other existing debugging infrastructure as well.

Another example are debugobjects: each new debugobject conversion has to be separately accepted and acked by the respective maintainer. A debugging facility can be fine for some subsystems - and less suitable for others.

A third example is lockdep coverage: there were cases in the past where patches were rejected by subsystem maintainers, on the grounds of intrusiveness and other cost/benefit arguments.

These kinds of rejections are always case by case and they happened several times for debug infrastructure i authored over the past decade - and i expect them to happen in the future too.

Why should fault-injection to be treated in a preferential way to tracepoints, lockdep, debugobjects and all the other debugging facilities that exist in the upstream kernel today?

Also note that while in the discussion i suggested perf events as a cleaner solution (and i like that solution for obvious reasons and i know it would help other areas of the kernel too) - others might exist as well. One thing is sure though, the cost/benefit ratio of this particular patch-set looked too high for me.

Thanks,

Ingo

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 9:39 UTC (Thu) by johill (subscriber, #25196) [Link] (2 responses)

These particular patches to the futex subsystem look ugly and intrusive though. Futexes are performance sensitive and complex - which both raise the bar to extra instrumentation and extra complexity.

Umm?

For the benefit of other readers, the impact on the code is essentially this in a number of places:

+	if (futex_should_fail(&fail_futex_efault, 1))
+		return -EFAULT;

Beauty may be in the eye of the beholder, but even though you could argue that some of the parameters could be hidden by macros I'd never call it "butt-ugly" like you did.

Also, what should such an injection framework look like? It pretty much needs to have hooks in place to be able to fail at those specific points, and it also needs to impact the flow control in the function to be able to drop locks -- the above was a simple version from a place that can just return.

I'll have to agree with Jon and Peter, you're making an entire unreasonable requirement, and it appears to me (and probably many others) to be because you want perf to rule the world, rather than solve get the actual functionality in. If you really cared only about style issues you could, after all, ask them to create wrapper macros or something so that the code looks neater.

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 9:51 UTC (Thu) by jcm (subscriber, #18262) [Link]

Let's not turn this into an attack on perf. The real problem here is the general one of people getting turned off when they're told they have to re-architect something much larger to get their patch accepted. Obviously, there's not a lot that can be done when you truly believe that the existing code needs to be ripped out and replaced, but I think that (collectively) a better job could be done on explaining why when such responses come up.

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 10:01 UTC (Thu) by mingo (guest, #31122) [Link]

For the benefit of other readers, the impact on the code is essentially this in a number of places:

If you read the patches (i have done that) in reality those hooks are not just added 'in a number of places' (i personally wouldn't argue much about 1-2 call-sites although FYI many other maintainers will argue even about 1-2 hooks), the patches add literally a dozen such fault injection call-sites, sprinkled all around kernel/futex.c.

That is too ugly to me. There has to be a higher pay-off for us to accept that kind of intrusion: comprehensive, well-thought-out instrumentation of futex.c. Not just for the limited purpose of fault injection but also for other instrumentation.

So, as i've outlined it in my original reply as well, i'm not opposed to this kind of instrumentation in principle, it simply has to be done better.

Thanks,

Ingo

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 9:54 UTC (Thu) by jcm (subscriber, #18262) [Link] (2 responses)

btw, in agreeing with Jon, I wasn't attacking you personally Ingo. I was just happy to see someone bring up the issue of people posting patches being told they should go do some other major overhaul :)

Does a debugging subsystem have the right to spread arbitrarily?

Posted Dec 3, 2009 10:18 UTC (Thu) by mingo (guest, #31122) [Link] (1 responses)

btw, in agreeing with Jon, I wasn't attacking you personally Ingo.

Hey i didn't feel being attacked personally (or being attacked at all) :)

I was just happy to see someone bring up the issue of people posting patches being told they should go do some other major overhaul :)

Been there, done that ;-)

I don't have enough fingers (nor enough toes) to count the number of times Linus or some other maintainer rejected some patch of mine in the past, because it was not complete/clean enough yet. (Quite a few of those rejections were of the "not ever!" non-negotiable sort - which hurt far more than the kind of "not this way" rejection we are talking about here.)

Think about the positive side: a large majority of 'major overhauls' that were ever done in Linux happened due to someone saying 'no' first.

It wouldn't be so easy to add new hacks today if we weren't standing on the shoulders of giants who did all the past series of major overhauls and kept Linux so pleasantly hackable.

(In fact i bravely claim that we probably wouldn't have major overhauls at all if people weren't be saying 'no'. It's so much easier to go for the fast hack than to fix things for real.)

You are arguing for people to not insist on clean approaches and major overhauls when a new patch justifies such an overhaul - and i hope i'm not misrepresenting your point by paraphrasing it like that.

If that is your argument, what other mechanism should drive the overhauls in your opinion, if we eliminate/weaken the current scheme of using people's desire to implement new features? Or do you suggest we should let the kernel's general code quality degrade over time, in favour of getting more features done more quickly?

I for one welcome our stringent kernel overlords.

Posted Dec 3, 2009 21:41 UTC (Thu) by dmag (guest, #17775) [Link]

In Open Source, nobody (not even Linus!) can "assign" work. Instead, they can merely "signal" it via patch rejection and hints. I agree with our esteemed Editor that this case seems a bit harsh, but I also agree with Ingo that it needs to be done.

Patch refusal from an obstinate kernel developer has lead to a better overall architecture in tons of cases (i.e. TALPA, Linus rejecting a USB stack patch causing a full rewrite, PID virtualization discussion, etc.)

Even if this method "holds up" development of a feature occasionally, I think that's fine. Linux is in more danger of "flying apart a the seams" than "stalling due to lack of interest". And I'll bet a box of doughnuts that this rejection spurs someone (maybe not Sripathi) to "get it done" sooner rather than later.

Keep up the good work, Ingo!

Fault injection and unexpected requirement injection

Posted Dec 3, 2009 17:12 UTC (Thu) by dvhart (guest, #19636) [Link]

For what its worth, Sripathi and I have discussed Ingo's position and we are in no way "turned off" or even discouraged by it. As maintainer Ingo has veto power and we respect his opinion enough to accept it without much argument. I don't agree with the "butt-ugly" statement, but his other arguments are sufficient in my opinion. Depending on the scope of the new work, this may or may not be something we will be able to commit to working on. This is the risk we take for quality, and it isn't necessarily a bad one. If we were getting a lot of requests for this feature we might push back harder, but as it is, this is a useful feature for our development, and we can use it as is for now. If others really need it, they can apply this patch to their builds or help in the effort to integrate fault injection with perf.