KHB: Recovering Device Drivers: From Sandboxing to Surviving

January 12, 2007

This article was contributed by Valerie Henson

Drivers are the dominant source of crashes and bugs in operating systems. This is especially disturbing given the proportion of operating system code that is driver code. In Linux, approximately two thirds of the source lines of code are in drivers (depending on the version). Full-time kernel developers often bemoan the quality of code in drivers; one study [PDF] found that the bug rate in drivers was actually three to seven times higher than in core kernel code. Binary drivers (hopefully being phased out) are an especially nasty source of bugs. Unfortunately, the companies and programmers writing these drivers have neither the expertise nor the incentive to write beautiful, clean, well-behaved drivers.

Efforts to limit the effects of driver bugs on the core operating system have been going on for decades, with limited success. One of the motivations behind microkernels was the desire to isolate parts of the kernel so that they could not, for example, stomp on the memory of other parts of the kernel. Safe behind the message passing interface between microkernel modules, each module only had to validate the input from other modules in order to ensure that external bugs would not interfere with its proper working. In reality, completely validating messages is harder than it looks, and the performance overhead of message passing, MMU tricks, and the code to work around them turned out to be prohibitive. A variety of more limited sandboxing techniques, isolating only likely troublemakers such as device drivers, reduced operating system crashes significantly, but left the system with a non-functioning, possibly crucial device (such as the network card). While the OS was still up and running, the system reliability, viewed from an application standpoint, was not particularly improved. From the point of view of a web server, a crashed system and a system with no network access due to a safely sandboxed but crashed network driver are practically identical.

The Solution

What we really need is a lightweight, unintrusive system to not only catch device driver errors, but to recover and restart the device driver while simultaneously covering for the device while it is re-initializing. Michael Swift, Muthukaruppan Annamalai, Brian Bershad, and Henry Levy implemented such a system for Linux 2.4.18, as described in their 2004 OSDI paper, Recovering Device Drivers [PDF]. The key idea in this paper is the shadow driver: a driver that wraps around the original hardware driver, records requests sent to it, monitors the health of the driver, and restarts the driver if it crashes, replaying any missed requests collected while the driver was restarting. You can think of a shadow driver as a substitute teacher temporarily filling in while the real driver is out sick. Each class of device drivers (sound, network, disk, etc.) requires the writing of only one shadow driver.
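The wrapping idea can be illustrated with a small user-space sketch in Python. It is only an analogy, not the paper's implementation: `ToyNetDriver`, `PassiveShadow`, and `DriverFault` are hypothetical names, and a Python exception stands in for whatever fault signal the isolation layer would actually deliver.

```python
class DriverFault(Exception):
    """Hypothetical signal that the wrapped driver has crashed."""


class ToyNetDriver:
    """Hypothetical stand-in for a real network driver."""

    def __init__(self):
        self.settings = {}
        self.sent = []

    def configure(self, key, value):
        self.settings[key] = value

    def send(self, packet):
        if packet == "poison":          # simulated driver bug
            raise DriverFault("driver crashed")
        self.sent.append(packet)


class PassiveShadow:
    """Wraps a driver, records configuration requests, and notices failures."""

    def __init__(self, driver):
        self.driver = driver
        self.config_log = []            # configuration requests that could be replayed
        self.healthy = True

    def configure(self, key, value):
        self.config_log.append((key, value))
        self.driver.configure(key, value)

    def send(self, packet):
        try:
            self.driver.send(packet)
        except DriverFault:
            self.healthy = False        # a real shadow would begin recovery here
            return False
        return True


shadow = PassiveShadow(ToyNetDriver())
shadow.configure("duplex", "full")
shadow.send("hello")
shadow.send("poison")
print(shadow.config_log, shadow.healthy)
```

During normal operation the shadow only observes and forwards, so the extra work per request stays small; the recorded configuration only becomes useful once a failure is detected.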

Shadow drivers are built on the Nooks driver isolation system, outlined in a paper in the 19th SOSP, Improving the Reliability of Commodity Operating Systems [PDF]. Nooks provides most of the benefits of the microkernel architecture for a relatively low cost. The four main services are (1) memory isolation - drivers run with most of the kernel memory read-only, (2) wrappers around data transfer between the kernel and drivers, (3) tracking of kernel objects used by the driver, and (4) a recovery manager. The Nooks architecture is simplified by the (perfectly reasonable) assumption that kernel modules are not malicious, but merely buggy, and so doesn't need to take special steps to, for example, prevent a device driver from deliberately altering memory permissions.
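Two of these services, memory isolation and object tracking, can be loosely modeled in a few lines of Python. This is a sketch under stated assumptions: Nooks works with hardware memory protection inside the kernel, whereas the read-only `MappingProxyType` view below is only an analogy, and `ObjectTracker`, `kernel_state`, and the object names are hypothetical.

```python
from types import MappingProxyType


class ObjectTracker:
    """Remembers which kernel objects a driver currently holds (hypothetical)."""

    def __init__(self):
        self._held = {}
        self._next_handle = 0

    def allocate(self, kind):
        handle = self._next_handle
        self._next_handle += 1
        self._held[handle] = kind
        return handle

    def release(self, handle):
        self._held.pop(handle)

    def release_all(self):
        """Recovery path: reclaim everything the failed driver left behind."""
        leaked = sorted(self._held.values())
        self._held.clear()
        return leaked


kernel_state = {"jiffies": 1000}
driver_view = MappingProxyType(kernel_state)    # the driver sees kernel data read-only

tracker = ObjectTracker()
irq = tracker.allocate("irq")
buf = tracker.allocate("dma_buffer")
tracker.release(irq)                            # the driver frees one object properly

try:
    driver_view["jiffies"] = 0                  # a buggy write is rejected
    write_blocked = False
except TypeError:
    write_blocked = True

leaked = tracker.release_all()                  # what a recovery manager would clean up
print(write_blocked, leaked, kernel_state)
```

The point of the tracker is that recovery does not have to trust the crashed driver to describe what it was holding; the bookkeeping lives outside the driver.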

When a shadow driver detects that a device driver has failed, it begins to actively proxy for the device driver (queuing up requests, etc.) and begins recovery of the driver. First it safely shuts down the driver, which may require some delicate work given that the driver has crashed. For example, it may need to explicitly disable interrupts on the device since a crashed driver can no longer acknowledge them. Then it reloads the driver and reconfigures it. The shadow driver will have recorded any prior configuration requests (such as "set full-duplex mode") and replays them if necessary. Then it replays any queued-up requests that accumulated during the recovery phase. Depending on the device and type of request, it may make more sense to drop the requests; for example, a shadow driver for sound will just drop any requests to play sound, since they are real-time and aren't useful to save up to play when the driver recovers. (With the Audigy sound card and driver evaluated in the paper, this resulted in a gap in the audio of about one-tenth of a second.)
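The recovery sequence described above (proxy, shut down, reload, replay configuration, then replay or drop queued requests) can be sketched as follows. This is a simplified user-space model, not the authors' code: `ToyDriver`, `RecoveringShadow`, and the `drop_while_recovering` flag are hypothetical, and shutting down the driver is reduced to discarding the old object.

```python
from collections import deque


class ToyDriver:
    """Hypothetical driver that just remembers what it was asked to do."""

    def __init__(self):
        self.settings = {}
        self.handled = []

    def configure(self, key, value):
        self.settings[key] = value

    def submit(self, request):
        self.handled.append(request)


class RecoveringShadow:
    def __init__(self, make_driver, drop_while_recovering=False):
        self.make_driver = make_driver
        self.driver = make_driver()
        self.config_log = []                 # recorded configuration requests
        self.pending = deque()               # requests held during recovery
        self.recovering = False
        self.drop = drop_while_recovering    # e.g. True for real-time sound

    def configure(self, key, value):
        self.config_log.append((key, value))
        if not self.recovering:
            self.driver.configure(key, value)

    def submit(self, request):
        if self.recovering:
            if not self.drop:
                self.pending.append(request)     # proxy on the driver's behalf
            return
        self.driver.submit(request)

    def begin_recovery(self):
        self.recovering = True
        self.driver = None                   # stand-in for a safe shutdown

    def finish_recovery(self):
        self.driver = self.make_driver()             # reload the driver
        for key, value in self.config_log:           # replay configuration
            self.driver.configure(key, value)
        while self.pending:                          # replay queued requests
            self.driver.submit(self.pending.popleft())
        self.recovering = False


net = RecoveringShadow(ToyDriver)
net.configure("duplex", "full")
net.submit("pkt1")
net.begin_recovery()
net.submit("pkt2")                  # arrives while the driver is down
net.finish_recovery()
print(net.driver.settings, net.driver.handled)

sound = RecoveringShadow(ToyDriver, drop_while_recovering=True)
sound.begin_recovery()
sound.submit("sample")              # stale audio is dropped, not queued
sound.finish_recovery()
print(sound.driver.handled)
```

Whether to queue or drop is a per-class decision, which fits the observation that only one shadow driver is needed per class of device.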

The authors compared vanilla Linux, Nooks, and shadow drivers by adding bugs by hand to three drivers: a network driver (e1000), a sound card driver (audigy), and a disk driver (ide-disk). The bugs were based on real bugs reported on mailing lists in order to be as realistic as possible. They then tested the reliability of each system from the application's point of view. The results are summarized in the table below; shadow drivers were able to transparently recover from all tested driver bugs, which would normally crash the entire machine, without interrupting the application.

Application Behavior

Device Driver        Application Activity    Linux-Native   Linux-Nooks   Linux-SD
Sound                mp3 player              CRASH          MALFUNCTION   SUCCESS
(audigy driver)      audio recorder          CRASH          MALFUNCTION   SUCCESS
                     speech synthesizer      CRASH          SUCCESS       SUCCESS
                     strategy game           CRASH          MALFUNCTION   SUCCESS
Network              network file transfer   CRASH          SUCCESS       SUCCESS
(e1000 driver)       remote window manager   CRASH          SUCCESS       SUCCESS
                     network analyzer        CRASH          MALFUNCTION   SUCCESS
IDE                  compiler                CRASH          CRASH         SUCCESS
(ide-disk driver)    encoder                 CRASH          CRASH         SUCCESS
                     database                CRASH          CRASH         SUCCESS
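
As a quick cross-check, the outcomes in the table above can be tallied per configuration. The short Python snippet below simply counts the entries as transcribed from the table; the variable names are arbitrary.

```python
from collections import Counter

# Outcomes per configuration, in the row order of the table above.
results = {
    "Linux-Native": ["CRASH"] * 10,
    "Linux-Nooks": [
        "MALFUNCTION", "MALFUNCTION", "SUCCESS", "MALFUNCTION",   # sound
        "SUCCESS", "SUCCESS", "MALFUNCTION",                      # network
        "CRASH", "CRASH", "CRASH",                                # IDE
    ],
    "Linux-SD": ["SUCCESS"] * 10,
}

tally = {system: Counter(outcomes) for system, outcomes in results.items()}
for system, counts in tally.items():
    print(system, dict(counts))
```

Of the ten scenarios, Nooks alone keeps the application working in three, leaves it malfunctioning in four, and still crashes in the three disk cases, while shadow drivers succeed in all ten.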

What does this mean for Linux?

Linux developers have a number of ways to reduce the impact of buggy drivers. Given that the limiting factor is usually human eyeball time, we should choose methods that rely on automation as much as possible. Some of these methods are automatic bug checking, compiler-level checks, and code-level asserts. Adding an automatic driver sandbox and recovery system would be an excellent investment in kernel developer time in return for overall system stability, particularly for distribution vendors. Even implementing a subset of the features in the shadow driver system would be helpful. More than likely, Linux 2.6 has frameworks which would easily lend themselves to implementing some of these features.


KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 18, 2007 10:45 UTC (Thu) by nix (subscriber, #2304) [Link]

The sandboxing idea would work for a lot of classes of failure, notably those where the driver goes quietly moribund without breaking anything else on the way down. But it's stuck if the driver messes up core kernel data structures before it goes down, and it's *really* stuck if it leaves the hardware's state unclean. (This is probably especially important for graphics cards, but any card with a complex stateful protocol is likely to have this problem.)

This is fixable, but you'd need shadow initialization/shutdown code for *every* driver that needs special shadowing support. (Even there, some cards have complex enough internal states that it can be very hard to deduce how to reset the card if you can't trust the driver's internal state. Again graphics cards are the big villains here, but some SCSI cards I've used have been notable for bizarre state machines which are at some points hard to reset.)

KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 18, 2007 14:21 UTC (Thu) by liljencrantz (guest, #28458) [Link]

According to the article, the sandboxing implementation used actually makes most kernel memory read-only. The way to deal with stateful hardware is to reset the hardware to a known state. Restoring the pre-crash state after that usually requires cooperation with userspace, but it should be possible.

KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 18, 2007 12:08 UTC (Thu) by tsr2 (subscriber, #4293) [Link]

"From the point of view of a web server, a crashed system and a system with no network access due to a safely sandboxed but crashed network driver are practically identical."

I do not agree that they are practically identical. A crashed system will usually reboot and return to operation in a short period of time. A system that stays up, but is unable to communicate with the outside world will require external intervention. Also, if it's in an inconvenient location, you can't log in remotely and reboot, so all in all a crash is probably preferable in this scenario.

KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 18, 2007 18:05 UTC (Thu) by bronson (subscriber, #4806) [Link]

...which was Val's point. If you want to split hairs, I think she meant, "From the point of view of a web server, a *hung* system and a system with no network access due to a safely sandboxed but crashed network driver are practically identical."

If a crashed system magically reboots, consider yourself very lucky. The majority of crashers that I've seen result in zombie computers unable to, say, read from their disk arrays. Until the watchdog steps in, of course, but watchdogs work just fine for sandboxed systems too. Automatically-rebooting crashes are so 1990s.

I don't think the two situations are quite as different as you imply.

Ummm... KHB?

Posted Jan 23, 2007 15:03 UTC (Tue) by Max.Hyre (subscriber, #1054) [Link]

Boy do I feel ignorant. What does `KHB' stand for? ``Kernel Hackers' Bxxx'' seems likely, or maybe it's a conference, à la LCA, but neither a search of LWN, Google, nor Wikipedia supplies a satisfactory answer. (I suspect it's not the Karnataka Housing Board, which is Google's first offering. :-)

Ummm... KHB?

Posted Jan 23, 2007 15:16 UTC (Tue) by corbet (editor, #1) [Link]

Kernel Hacker's Bookshelf - described in earlier articles in the series, but not reinforced in recent times.

Ummm... KHB?

Posted Jan 23, 2007 16:37 UTC (Tue) by Max.Hyre (subscriber, #1054) [Link]

Thanks. Whatever it's called, it's a fascinating series, and I really enjoy it.

KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 26, 2007 13:38 UTC (Fri) by alext (guest, #7589) [Link]

"What we really *don't* need is a lightweight, unintrusive system to not only catch device driver ..." (my change in bold) but it will do when we can't have what we really need, which is drivers that work properly.

KHB: Recovering Device Drivers: From Sandboxing to Surviving

Posted Jan 29, 2007 23:09 UTC (Mon) by slamb (guest, #1070) [Link]

You're saying we wouldn't need fault isolation/tolerance/recovery if we didn't have faults. That's true, but unless someone comes up with a way to prevent all faults, it's not a useful statement. Barring that, schemes like this are at least interesting, and I'm not sure why parts of it aren't in place. In particular, if someone has a way to make most of the core kernel's memory read-only to drivers at low cost, I'm all for it.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds