
LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 23:49 UTC (Thu) by iabervon (subscriber, #722)
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

I obviously have a somewhat unusual experience, but the only Linux kernel bug I've run into worked like this: I have an Ethernet card which, in a certain configuration, sends interrupts on the IRQ for the hard drive controller in addition to the interrupts the kernel expects. If the system ran for a while without any hard drive traffic, the kernel would decide that some unknown device was screwing with that interrupt, and shut it off. At that point the hard drive stops working, because its interrupts are ignored.
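To make the failure mode concrete, here is a toy Python model of a stuck-interrupt detector on a shared IRQ line. It is only a sketch under stated assumptions: the names (`IrqLine`, `disk_handler`) and the thresholds `WINDOW` and `UNHANDLED_LIMIT` are hypothetical and chosen for illustration, not taken from the kernel. The detector disables the line when nearly every interrupt in a window goes unclaimed, which is exactly what happens when misrouted NIC interrupts arrive while the disk is idle.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical, illustrative thresholds (not the kernel's real values).
WINDOW = 1000          # interrupts examined per decision
UNHANDLED_LIMIT = 990  # more unclaimed interrupts than this => "stuck"

@dataclass
class IrqLine:
    handlers: List[Callable[[], bool]] = field(default_factory=list)
    enabled: bool = True
    count: int = 0
    unhandled: int = 0

    def fire(self) -> None:
        if not self.enabled:
            return  # interrupts on a disabled line are never delivered
        handled = False
        for h in self.handlers:
            handled |= h()  # every handler on a shared line gets a look
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count >= WINDOW:
            if self.unhandled > UNHANDLED_LIMIT:
                self.enabled = False  # blame "some unknown device"
            self.count = self.unhandled = 0

disk_pending: List[str] = []  # disk requests waiting for a completion IRQ

def disk_handler() -> bool:
    """Claims the interrupt only if a disk request is outstanding."""
    if disk_pending:
        disk_pending.pop()
        return True
    return False  # not our interrupt

line = IrqLine(handlers=[disk_handler])

# The NIC misroutes interrupts onto the disk's IRQ while the disk is idle.
for _ in range(WINDOW):
    line.fire()

# Now the disk really needs its interrupt, but the line has been shut off.
disk_pending.append("read")
line.fire()
print(line.enabled, disk_pending)  # False ['read']
```

Note that the disk driver does nothing wrong here; it correctly declines interrupts that aren't its own, and it is still the one that ends up stranded.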

There are a number of belt-and-suspenders ways that Linux could keep the system stable: if an interrupt isn't handled but stops anyway, that could count as handled for the purposes of stuck-interrupt detection; and if a shared interrupt that has handlers is stuck and gets disabled, the kernel could call those handlers from the timer interrupt or something, which would be bad for performance but would keep the system running, slowly. But I don't see a way that a microkernel could have helped. The bug would always trigger on this particular system, but it would have no effect on a system where the misdirected interrupt didn't land on an IRQ with a significant device on it. When the bug showed symptoms, the driver having trouble wasn't wrong at all, and restarting it would have had no effect. The bug didn't affect the misbehaving hardware or the driver that put it into the non-compliant state. And the bug involved only standard access to the device's own I/O space, in a way the PCI spec says is correct.
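The two fallbacks described above can be sketched in the same toy style. Again, everything here is hypothetical: `IrqLine`, `timer_tick`, the `still_asserted` flag (standing in for some way of asking the hardware whether the line stayed high after the handlers ran) and the thresholds are made up for illustration, and this is not how any real kernel implements it. Fallback 1 skips the "unhandled" count when the line goes quiet on its own; fallback 2 polls the handlers of a disabled line from a periodic tick.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical, illustrative thresholds (not the kernel's real values).
WINDOW = 1000
UNHANDLED_LIMIT = 990

@dataclass
class IrqLine:
    handlers: List[Callable[[], bool]] = field(default_factory=list)
    enabled: bool = True
    count: int = 0
    unhandled: int = 0

    def _run_handlers(self) -> bool:
        handled = False
        for h in self.handlers:
            handled |= h()
        return handled

    def fire(self, still_asserted: bool = True) -> None:
        """Deliver one interrupt; still_asserted is a hypothetical report of
        whether the line stayed high after the handlers ran."""
        if not self.enabled:
            return
        handled = self._run_handlers()
        self.count += 1
        # Fallback 1: an unclaimed interrupt that went away on its own
        # does not count as evidence that the line is stuck.
        if not handled and still_asserted:
            self.unhandled += 1
        if self.count >= WINDOW:
            if self.unhandled > UNHANDLED_LIMIT:
                self.enabled = False
            self.count = self.unhandled = 0

    def timer_tick(self) -> None:
        # Fallback 2: a disabled line that still has handlers gets polled,
        # so its devices keep working, just slowly.
        if not self.enabled and self.handlers:
            self._run_handlers()

disk_pending: List[str] = []

def disk_handler() -> bool:
    """Claims the interrupt only if a disk request is outstanding."""
    if disk_pending:
        disk_pending.pop()
        return True
    return False

# Fallback 1: misrouted interrupts that deassert anyway (say, once the NIC
# driver acknowledges the card on its own IRQ) never trip the detector.
quiet = IrqLine(handlers=[disk_handler])
for _ in range(WINDOW):
    quiet.fire(still_asserted=False)
print(quiet.enabled)  # True

# Fallback 2: a line that really stays asserted is still disabled,
# but the timer tick keeps servicing the disk.
stuck = IrqLine(handlers=[disk_handler])
for _ in range(WINDOW):
    stuck.fire(still_asserted=True)
disk_pending.append("read")
stuck.timer_tick()
print(stuck.enabled, disk_pending)  # False []
```

Both fallbacks live entirely in the interrupt core and need no cooperation from the misbehaving NIC driver, which is the point: the fix is defensive handling in the shared code, not isolating or restarting a driver.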

Probably not everybody's pet bug looks like this, but it seems to me that, while there is a clear benefit to defensive designs in which each component doubts the correctness of the rest of the system, works around dysfunction, and reports (for debugging) things that are wrong but survivable, I'm not convinced that a microkernel design is the ideal implementation of this practice.



Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds