January 12, 2007
This article was contributed by Valerie Henson
Drivers are the dominant source of crashes and bugs in operating
systems. This is especially disturbing given the proportion of
operating system code that is driver code. In Linux, approximately
two thirds of the source lines of code are in drivers (depending on
the version). Full-time kernel developers often bemoan the quality of
code in drivers; one study [PDF]
found that the bug rate in drivers was actually three to seven times
higher than in core kernel code. Binary drivers (hopefully being
phased out) are an especially nasty source of bugs. Unfortunately,
the companies and programmers writing these drivers have neither the
expertise nor the incentive to write beautiful, clean, well-behaved
drivers.
Efforts to limit the effects of driver bugs on the core operating
system have been going on for decades, with limited success. One of
the motivations behind microkernels was the desire to isolate parts of
the kernel so that they could not, for example, stomp on the memory of
other parts of the kernel. Safe behind the message passing interface
between microkernel modules, each module only had to validate the
input from other modules in order to ensure that external bugs would
not interfere with its proper working. In reality, completely
validating messages is harder than it looks, and the performance
overhead of message passing, MMU tricks, and the code to work around
them turned out to be prohibitive. A variety of more limited
sandboxing techniques, isolating only likely troublemakers such as
device drivers, reduced operating system crashes significantly, but
left the system with a non-functioning, possibly crucial device (such
as the network card). While the OS was still up and running, the
system reliability, viewed from an application standpoint, was not
particularly improved. From the point of view of a web server, a
crashed system and a system with no network access due to a safely
sandboxed but crashed network driver are practically identical.
The Solution
What we really need is a lightweight, unintrusive system to not only
catch device driver errors, but to recover and restart the device
driver while simultaneously covering for the device while it is
re-initializing. Michael Swift, Muthukaruppan Annamalai, Brian
Bershad, and Henry Levy implemented such a system for Linux 2.4.18, as
described in their 2004 OSDI paper,
Recovering Device Drivers [PDF]. The key idea in this paper is the
shadow driver, a driver that wraps around the original hardware driver
and records requests sent to it, monitors the health of the driver,
and restarts the driver if it crashes, replaying any missed requests
collected while the driver was restarting. You can think of a shadow
driver as a substitute teacher temporarily filling in while the real
driver is out sick. Each class of device drivers (sound, network,
disk, etc.) requires the writing of only one shadow driver.
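To make the wrapping idea concrete, here is a minimal user-space sketch in Python. It is not the paper's implementation (which lives inside the Linux 2.4.18 kernel); `ToyNetDriver`, `DriverCrashed`, and `ShadowDriver` are hypothetical names invented for illustration, and a raised exception stands in for whatever failure detection the real system uses.

```python
class DriverCrashed(Exception):
    """Stands in for a detected driver failure (assumption for this sketch)."""


class ToyNetDriver:
    """Hypothetical stand-in for a real hardware driver."""

    def __init__(self):
        self.config = {}
        self.sent = []
        self.fail_next = False  # test hook: make the next send() "crash"

    def configure(self, key, value):
        self.config[key] = value

    def send(self, packet):
        if self.fail_next:
            self.fail_next = False
            raise DriverCrashed("simulated bug")
        self.sent.append(packet)


class ShadowDriver:
    """Wraps a driver, records configuration, and hides a crash from callers."""

    def __init__(self, driver_factory):
        self._factory = driver_factory
        self.driver = driver_factory()
        self._config_log = []  # configuration requests seen so far
        self.restarts = 0

    def configure(self, key, value):
        self._config_log.append((key, value))  # record, then pass through
        self.driver.configure(key, value)

    def send(self, packet):
        try:
            self.driver.send(packet)
        except DriverCrashed:
            self._recover()
            self.driver.send(packet)  # replay the request that was in flight

    def _recover(self):
        self.restarts += 1
        self.driver = self._factory()  # "reload" a fresh driver instance
        for key, value in self._config_log:  # replay recorded configuration
            self.driver.configure(key, value)


shadow = ShadowDriver(ToyNetDriver)
shadow.configure("duplex", "full")
shadow.send("p1")
shadow.driver.fail_next = True  # inject a fault into the wrapped driver
shadow.send("p2")               # the caller never sees the crash
```

The caller only ever talks to `shadow`, which is why one such wrapper per device class can cover every driver in that class.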
Shadow drivers are built on the Nooks driver isolation system,
outlined in a paper in the 19th SOSP, Improving the
Reliability of Commodity Operating Systems [PDF]. Nooks provides most
of the benefits of the microkernel architecture for a relatively low
cost. The four main services are (1) memory isolation - drivers run
with most of the kernel memory read-only, (2) wrappers around data
transfer between the kernel and drivers, (3) tracking of kernel
objects used by the driver, and (4) a recovery manager. Nooks is
simplified by the (perfectly reasonable) assumption that kernel
modules are not malicious, but merely buggy, so it doesn't need to
take special steps to, for example, prevent a device driver from
deliberately altering memory permissions.
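Service (3), object tracking, is perhaps the easiest of the four to picture in code. The sketch below is a toy in Python, not the Nooks interface: `ObjectTracker` and the resource names are hypothetical, and it assumes that the point of tracking is to let a recovery manager release whatever a failed driver still holds.

```python
class ObjectTracker:
    """Toy registry of the resources each driver currently holds."""

    def __init__(self):
        self._live = {}  # driver name -> list of (resource, release function)

    def track(self, driver, resource, release):
        """Record that `driver` acquired `resource`."""
        self._live.setdefault(driver, []).append((resource, release))

    def untrack(self, driver, resource):
        """Forget a resource the driver released on its own."""
        held = self._live.get(driver, [])
        self._live[driver] = [(r, f) for r, f in held if r != resource]

    def release_all(self, driver):
        """Release everything a failed driver still holds, newest first."""
        released = []
        for resource, release in reversed(self._live.pop(driver, [])):
            release(resource)
            released.append(resource)
        return released


freed = []
tracker = ObjectTracker()
tracker.track("e1000", "irq", freed.append)          # hypothetical resources
tracker.track("e1000", "dma-buffer", freed.append)
tracker.track("audigy", "irq", freed.append)
cleaned = tracker.release_all("e1000")               # e1000 has "crashed"
```

Only the failed driver's resources are released; the entry for the other driver is left alone.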
When a shadow driver detects that a device driver has failed, it
begins to actively proxy for the device driver (queuing up requests,
etc.) and begins recovery of the driver. First it safely shuts down
the driver, which may require some delicate work given that the driver
has crashed. For example, it may need to explicitly disable
interrupts on the device since a crashed driver can no longer
acknowledge them. Then it reloads the driver and reconfigures it.
The shadow driver will have recorded any prior configuration requests (such as
"set full-duplex mode") and replays them if necessary. Then it
replays any queued up requests that accumulated during the recovery
phase. Depending on the device and type of request, it may make more
sense to drop the requests; for example, a shadow driver for sound
will just drop any requests to play sound, since they are real-time
and aren't useful to save up to play when the driver recovers. (With
the Audigy sound card and driver evaluated in the paper, this resulted
in a gap in the audio of about one-tenth of a second.)
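The recovery sequence just described can be sketched as a single function. This is an illustrative Python model under simplifying assumptions, not the paper's code: `ToyDriver`, `recover`, and the `replay_pending` flag are hypothetical, and setting a boolean stands in for disabling interrupts on real hardware.

```python
from collections import deque


class ToyDriver:
    """Hypothetical driver that just remembers what it was asked to do."""

    def __init__(self):
        self.handled = []
        self.interrupts_enabled = True

    def handle(self, request):
        self.handled.append(request)


def recover(failed, load_driver, config_log, pending, replay_pending):
    """Run the recovery steps in order; return (fresh driver, dropped count).

    failed         -- the crashed driver instance
    load_driver    -- callable returning a freshly loaded driver
    config_log     -- configuration requests recorded before the crash
    pending        -- deque of requests queued while the driver was down
    replay_pending -- False for real-time classes such as sound
    """
    # 1. Quiesce the device: a crashed driver cannot acknowledge interrupts.
    failed.interrupts_enabled = False
    # 2. Reload the driver.
    driver = load_driver()
    # 3. Reconfigure it from the recorded configuration requests.
    for request in config_log:
        driver.handle(request)
    # 4. Replay or drop whatever queued up during recovery.
    dropped = 0
    while pending:
        request = pending.popleft()
        if replay_pending:
            driver.handle(request)
        else:
            dropped += 1
    return driver, dropped


net, net_dropped = recover(ToyDriver(), ToyDriver, ["set full-duplex mode"],
                           deque(["pkt1", "pkt2"]), replay_pending=True)
snd, snd_dropped = recover(ToyDriver(), ToyDriver, ["set volume"],
                           deque(["play A", "play B"]), replay_pending=False)
```

The only per-class decision in this sketch is step 4: the network case replays its queue, while the sound case discards stale playback requests.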
The authors compared vanilla Linux, Nooks, and shadow drivers by
adding bugs by hand to three drivers, a network driver (e1000), a
sound card driver (audigy), and a disk driver (ide-disk). The bugs
were based on real bugs reported on mailing lists in order to be as
realistic as possible. They then tested the reliability of the system
from the application point of view on each system. The results are
summarized in the table below; shadow drivers were able to
transparently recover from all tested driver bugs which normally crash
the entire machine, without interrupting the application.
| | | Application Behavior | | |
| Device Driver | Application Activity | Linux-Native | Linux-Nooks | Linux-SD |
| Sound | mp3 player | CRASH | MALFUNCTION | SUCCESS |
| (audigy driver) | audio recorder | CRASH | MALFUNCTION | SUCCESS |
| | speech synthesizer | CRASH | SUCCESS | SUCCESS |
| | strategy game | CRASH | MALFUNCTION | SUCCESS |
| Network | network file transfer | CRASH | SUCCESS | SUCCESS |
| (e1000 driver) | remote window manager | CRASH | SUCCESS | SUCCESS |
| | network analyzer | CRASH | MALFUNCTION | SUCCESS |
| IDE | compiler | CRASH | CRASH | SUCCESS |
| (ide-disk driver) | encoder | CRASH | CRASH | SUCCESS |
| | database | CRASH | CRASH | SUCCESS |
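For a quick per-system summary, the table can be tallied mechanically. The short Python snippet below simply transcribes the ten rows above and counts outcomes; the variable names are arbitrary.

```python
from collections import Counter

SYSTEMS = ("Linux-Native", "Linux-Nooks", "Linux-SD")

# Rows transcribed from the table above: application -> outcome per system.
RESULTS = {
    "mp3 player":            ("CRASH", "MALFUNCTION", "SUCCESS"),
    "audio recorder":        ("CRASH", "MALFUNCTION", "SUCCESS"),
    "speech synthesizer":    ("CRASH", "SUCCESS", "SUCCESS"),
    "strategy game":         ("CRASH", "MALFUNCTION", "SUCCESS"),
    "network file transfer": ("CRASH", "SUCCESS", "SUCCESS"),
    "remote window manager": ("CRASH", "SUCCESS", "SUCCESS"),
    "network analyzer":      ("CRASH", "MALFUNCTION", "SUCCESS"),
    "compiler":              ("CRASH", "CRASH", "SUCCESS"),
    "encoder":               ("CRASH", "CRASH", "SUCCESS"),
    "database":              ("CRASH", "CRASH", "SUCCESS"),
}

# Count how often each outcome occurs in each system's column.
tally = {
    system: Counter(row[i] for row in RESULTS.values())
    for i, system in enumerate(SYSTEMS)
}

for system in SYSTEMS:
    print(system, dict(tally[system]))
```

Of the ten cases, Nooks alone kept three applications working, left four malfunctioning, and still crashed on all three disk workloads.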
What does this mean for Linux?
Linux developers have a number of ways to reduce the impact of buggy
drivers. Given that the limiting factor is usually human eyeball
time, we should choose methods that rely on automation as much as
possible. Some of these methods are automatic bug checking,
compiler-level checks, and code-level asserts. Adding an automatic
driver sandbox and recovery system would be an excellent investment of
kernel developer time in return for overall system stability,
particularly for distribution vendors. Even implementing a subset of
the features in the shadow driver system would be helpful. More than
likely, Linux 2.6 has frameworks which would easily lend themselves to
implementing some of these features.