Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.20-rc5, released by Linus on January 12. It contains a number of fixes, and might be the last -rc release before 2.6.20. No patches have hit the mainline git repository since -rc5; this situation will probably not change until after Linus returns from linux.conf.au.

The current -mm tree is 2.6.20-rc4-mm1. Recent changes to -mm include the e1000 development tree, the HID development tree, unionfs, and the asynchronous filesystem I/O patches.
Kernel development news
LCA: The state of the Nouveau project
At any conference, there comes a time when one has to wonder what the people who do the talk scheduling were thinking. For lca2007, that moment came when your editor realized that the talks on OLPC (Jim Gettys), realtime (Ted Ts'o), and Nouveau were all scheduled together. Nouveau won out, but it was not an easy decision.
The Nouveau project is an effort to develop a set of free 3D drivers for NVidia chipsets. NVidia has long annoyed the free software community with its refusal to release free drivers or programming information for its video chipsets. The Nouveau folks have had enough of that, and they are doing something about it. Dave Airlie used his slot at linux.conf.au to talk about the project and its current status.
Nouveau got its start in February 2005, though serious work did not begin until June of that year. The project was announced at FOSDEM 2006, at which point others started to help. There are currently about six developers doing serious work on Nouveau.
The project is relying on reverse engineering for the information needed to write free drivers. To that end, the developers have put together a set of tools. At the top of the list is renouveau, which is designed to reveal the commands sent to the card in response to specific operations. Using the existing binary drivers, renouveau sets up a context, then scans the process's mappings until it finds the command FIFO. It then requests an operation and observes how the FIFO changes. Given enough operations, the developers can build up a good picture of how the adapter is programmed for specific tasks. This was not a trivial tool; the better part of a year went into its development.
Renouveau is useful for examining the FIFO, but it doesn't help with reads and writes to I/O registers. For that, there is another set of tools, starting with valgrind-mmt - a version of valgrind designed to trap I/O memory operations. Libsegfault provides a modified version of mmap() which does not actually create the mappings the caller asked for; it then traps the resulting segmentation faults and dumps out the attempted operations. Another tool, called kmmio, performs a similar task for register operations done in kernel space. Finally, the project uses a BIOS tracer which runs BIOS code in x86emu and traps I/O register accesses.
All of the information obtained from these tools is supplemented with hints from the old, free nv driver. There is also, says Dave, information "which shouldn't be there" to be found on some Russian web sites.
Where has all of this information led the project? Basic tasks, like the allocation of instance RAM and FIFO initialization, are working. Hardware context switching works - on little-endian machines. There is 2D support derived from the nv driver; it offers basic EXA and RandR 1.2 support. On the 3D front, the Mesa TCL (transform, clipping, and lighting) driver mostly works. Textures and objects do not, however. It is possible to run glxgears on nv4x chips. It has taken some time to get to this point, but Dave thinks that things will start to move a lot faster from here.
The next milestone would be to run Quake 3. That is, says Dave, an obligatory step on the roadmap. Getting there will involve texture support, a better memory manager, and better locking in the kernel DRM code. The developers (Dave in particular) are aiming for RandR 1.2 multi-head support. Once all of this is in place, the nouveau driver will have reached a reasonably capable state.
There are a lot of people asking when this will be; Dave says that the project's IRC channel is often overwhelmed by spectators looking for news. There is no wish to push the code out ahead of its time; among other things, an early release would nail down the API between the kernel and the X server, making things harder to change. The current hope is to have some sort of alpha release toward the end of 2007.
For people wanting to help, Dave had a simple message: the project needs developers. At this point, there is not much to do for people who cannot work on driver code. Graphics drivers, he says, are not as hard as people think. Finally, he addressed the issue of the $10K pledge drive for the project. It rather took the developers by surprise; they had not endorsed the drive, and had held some doubts as to whether it would be successful. How the pledge money will be handled is still being worked out; it looks like it will mostly be used for hardware purchases.
Lack of support for 3D video adapters has stalled the community for years; there has been a long wait in the hope that the vendors would come to their senses. That wait is just about over. The Nouveau project (along with various others) shows that we have the resources to figure out how our hardware works, even in the face of complex devices and uncooperative vendors. It would be better if we did not have to take things into our own hands this way, but it is nice to see how well we can do it when the need arises.
KHB: Recovering Device Drivers: From Sandboxing to Surviving
Drivers are the dominant source of crashes and bugs in operating systems. This is especially disturbing given the proportion of operating system code that is driver code. In Linux, approximately two thirds of the source lines of code are in drivers (depending on the version). Full-time kernel developers often bemoan the quality of code in drivers; one study [PDF] found that the bug rate in drivers was actually three to seven times higher than in core kernel code. Binary drivers (hopefully being phased out) are an especially nasty source of bugs. Unfortunately, the companies and programmers writing these drivers have neither the expertise nor the incentive to write beautiful, clean, well-behaved drivers.

Efforts to limit the effects of driver bugs on the core operating system have been going on for decades, with limited success. One of the motivations behind microkernels was the desire to isolate parts of the kernel so that they could not, for example, stomp on the memory of other parts of the kernel. Safe behind the message-passing interface between microkernel modules, each module only had to validate the input from other modules in order to ensure that external bugs would not interfere with its proper working. In reality, completely validating messages is harder than it looks, and the performance overhead of message passing, MMU tricks, and the code to work around them turned out to be prohibitive. A variety of more limited sandboxing techniques, isolating only likely troublemakers such as device drivers, reduced operating system crashes significantly, but left the system with a non-functioning, possibly crucial device (such as the network card). While the OS was still up and running, system reliability, viewed from an application standpoint, was not particularly improved. From the point of view of a web server, a crashed system and a system with no network access due to a safely sandboxed but crashed network driver are practically identical.
The Solution
What we really need is a lightweight, unintrusive system that not only catches device driver errors, but recovers and restarts the device driver while simultaneously covering for the device while it is re-initializing. Michael Swift, Muthukaruppan Annamalai, Brian Bershad, and Henry Levy implemented such a system for Linux 2.4.18, as described in their 2004 OSDI paper, Recovering Device Drivers [PDF]. The key idea in this paper is the shadow driver: a driver that wraps around the original hardware driver, records requests sent to it, monitors the driver's health, and restarts the driver if it crashes, replaying any requests missed while the driver was restarting. You can think of a shadow driver as a substitute teacher temporarily filling in while the real driver is out sick. Only one shadow driver needs to be written for each class of device driver (sound, network, disk, etc.).

Shadow drivers are built on the Nooks driver isolation system, outlined in a paper in the 19th SOSP, Improving the Reliability of Commodity Operating Systems [PDF]. Nooks provides most of the benefits of the microkernel architecture at a relatively low cost. The four main services are: (1) memory isolation - drivers run with most of kernel memory read-only; (2) wrappers around data transfers between the kernel and drivers; (3) tracking of kernel objects used by the driver; and (4) a recovery manager. The Nooks architecture is simplified by the (perfectly reasonable) assumption that kernel modules are not malicious, but merely buggy, and so it need not take special steps to, for example, prevent a device driver from deliberately altering memory permissions.
When a shadow driver detects that a device driver has failed, it begins to actively proxy for the device driver (queuing up requests, etc.) and begins recovery of the driver. First it safely shuts down the driver, which may require some delicate work given that the driver has crashed. For example, it may need to explicitly disable interrupts on the device, since a crashed driver can no longer acknowledge them. Then it reloads the driver and reconfigures it; the shadow driver will have recorded any prior configuration requests (such as "set full-duplex mode") and replays them if necessary. Then it replays any queued-up requests that accumulated during the recovery phase. Depending on the device and type of request, it may make more sense to drop the requests; for example, a shadow driver for sound will simply drop any requests to play sound, since they are real-time and there is no point in saving them up to play after the driver recovers. (With the Audigy sound card and driver evaluated in the paper, this resulted in a gap in the audio of about one-tenth of a second.)
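The record-and-replay idea can be modeled in ordinary user-space code. The sketch below is purely illustrative (all names are hypothetical; a real shadow driver lives in the kernel and is far more involved): the shadow records configuration requests so it can replay them after the wrapped driver has been reloaded.

```c
#include <string.h>

/* Hypothetical user-space model of a shadow driver's record-and-replay
 * logic.  All structure and function names here are illustrative. */

#define MAX_LOG 16

struct driver {
    int full_duplex;          /* example configuration state */
    int alive;                /* zero after a simulated crash */
};

struct shadow {
    struct driver *drv;
    int log[MAX_LOG];         /* recorded configuration requests */
    int nlog;
};

/* Forward a configuration request to the driver, recording it first
 * so it can be replayed after a restart. */
static void shadow_set_duplex(struct shadow *sh, int full_duplex)
{
    if (sh->nlog < MAX_LOG)
        sh->log[sh->nlog++] = full_duplex;
    sh->drv->full_duplex = full_duplex;
}

/* Recover from a crash: "reload" the driver into a fresh state, then
 * replay the log so it ends up configured as applications expect. */
static void shadow_recover(struct shadow *sh)
{
    memset(sh->drv, 0, sizeof(*sh->drv));   /* fresh driver state */
    sh->drv->alive = 1;
    for (int i = 0; i < sh->nlog; i++)
        sh->drv->full_duplex = sh->log[i];
}
```

After a call to shadow_set_duplex(&sh, 1), a simulated crash followed by shadow_recover(&sh) leaves the reloaded driver alive with full_duplex restored to 1, without the "application" having to know anything happened.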
The authors compared vanilla Linux, Nooks, and shadow drivers by adding bugs by hand to three drivers: a network driver (e1000), a sound card driver (audigy), and a disk driver (ide-disk). The bugs were based on real bugs reported on mailing lists, in order to be as realistic as possible. They then tested the reliability of each system from the application's point of view. The results are summarized in the table below; shadow drivers were able to transparently recover from all of the tested driver bugs, which normally crash the entire machine, without interrupting the application.
Application Behavior

  Device driver       Application activity     Linux-Native   Linux-Nooks   Linux-SD
  Sound               mp3 player               CRASH          MALFUNCTION   SUCCESS
  (audigy driver)     audio recorder           CRASH          MALFUNCTION   SUCCESS
                      speech synthesizer       CRASH          SUCCESS       SUCCESS
                      strategy game            CRASH          MALFUNCTION   SUCCESS
  Network             network file transfer    CRASH          SUCCESS       SUCCESS
  (e1000 driver)      remote window manager    CRASH          SUCCESS       SUCCESS
                      network analyzer         CRASH          MALFUNCTION   SUCCESS
  IDE                 compiler                 CRASH          CRASH         SUCCESS
  (ide-disk driver)   encoder                  CRASH          CRASH         SUCCESS
                      database                 CRASH          CRASH         SUCCESS
What does this mean for Linux?
Linux developers have a number of ways to reduce the impact of buggy drivers. Given that the limiting factor is usually human eyeball time, we should choose methods that rely on automation as much as possible. Some of these methods are automatic bug checking, compiler-level checks, and code-level asserts. Adding an automatic driver sandbox and recovery system would be an excellent investment in kernel developer time in return for overall system stability, particularly for distribution vendors. Even implementing a subset of the features in the shadow driver system would be helpful. More than likely, Linux 2.6 has frameworks which would easily lend themselves to implementing some of these features.
RCU and Unloadable Modules
RCU (read-copy update) is a synchronization mechanism that can be thought of as a replacement for reader-writer locking (among other things), but with very low-overhead readers that are immune to deadlock, priority inversion, and unbounded latency. RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT kernels, generate no code whatsoever.

This means that RCU writers are unaware of the presence of concurrent readers, so that RCU updates to shared data must be undertaken quite carefully, leaving an old version of the data structure in place until all pre-existing readers have finished. These old versions are needed because such readers might hold references to them. RCU updates can therefore be rather expensive, and RCU is thus best suited for read-mostly situations.
How can an RCU writer possibly determine when all readers are finished, given that readers might well leave absolutely no trace of their presence? There is a synchronize_rcu() primitive that blocks until all pre-existing readers have completed. An updater wishing to delete an element p from a linked list might do the following, while holding an appropriate lock, of course:
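A minimal sketch of that pattern, assuming p is linked into a standard kernel list via a list field and has an embedded rcu_head named rcu:

```c
/* Synchronous form: wait in place for pre-existing readers. */
list_del_rcu(&p->list);   /* unlink p; pre-existing readers may still see it */
synchronize_rcu();        /* wait for all pre-existing readers to complete */
kfree(p);                 /* safe: no reader can now hold a reference to p */

/* Asynchronous form: post a callback instead of blocking. */
list_del_rcu(&p->list);
call_rcu(&p->rcu, p_callback);   /* p_callback() will kfree(p) after the
                                    grace period has elapsed */
```

The key point is that the kfree() must not happen until no reader can possibly still hold a reference to p; call_rcu() achieves the same effect asynchronously, by invoking a callback such as p_callback() once the grace period ends.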
Unloading Modules That Use call_rcu()
But what if p_callback is defined in an unloadable module? If we unload the module while some RCU callbacks are pending, the CPUs executing these callbacks are going to be severely disappointed when they are later invoked, as fancifully depicted on the right.
We could try placing a synchronize_rcu() in the module-exit code path, but this is not sufficient. Although synchronize_rcu() does wait for a grace period to elapse, it does not wait for the callbacks to complete.
One might be tempted to try several back-to-back synchronize_rcu() calls, but this is still not guaranteed to work. If there is a very heavy RCU-callback load, then some of the callbacks might be deferred in order to allow other processing to proceed. Such deferral is required in realtime kernels in order to avoid excessive scheduling latencies.
rcu_barrier()
We instead need the rcu_barrier() primitive. This primitive is similar to synchronize_rcu(), but instead of waiting solely for a grace period to elapse, it also waits for all outstanding RCU callbacks to complete. Pseudo-code using rcu_barrier() is as follows:

- Prevent any new RCU callbacks from being posted.
- Execute rcu_barrier().
- Allow the module to be unloaded.
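In a module's exit function, those steps might look roughly like this (the foo_* names are hypothetical placeholders for whatever mechanism the module uses to stop posting callbacks):

```c
static void __exit foo_exit(void)
{
	foo_stop_posting_callbacks();   /* hypothetical: no new call_rcu() after this */
	rcu_barrier();                  /* wait for all outstanding callbacks to run */
	/* the module's callback functions may now safely disappear */
}
module_exit(foo_exit);
```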
Quick Quiz #2: Why is there no rcu_barrier_bh()?
The rcutorture module makes use of rcu_barrier() in its exit function as follows:
Lines 7-50 stop all the kernel tasks associated with the rcutorture module. Therefore, once execution reaches line 53, no more rcutorture RCU callbacks will be posted. The rcu_barrier() call on line 53 waits for any pre-existing callbacks to complete.
Then lines 55-62 print status and do operation-specific cleanup, and then return, permitting the module-unload operation to be completed.
Quick Quiz #3: Is there any other situation where rcu_barrier() might be required?
Your module might have additional complications. For example, if your module invokes call_rcu() from timers, you will need to first cancel all the timers, and only then invoke rcu_barrier() to wait for any remaining RCU callbacks to complete.
Implementing rcu_barrier()
Dipankar Sarma's implementation of rcu_barrier() makes use of the fact that RCU callbacks are never reordered once queued on one of the per-CPU callback queues. His implementation queues an RCU callback on each of the per-CPU callback queues, and then waits until they have all started executing, at which point all earlier RCU callbacks are guaranteed to have completed.

The code for rcu_barrier() is as follows:
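A sketch of the 2.6.20-era code in kernel/rcupdate.c (approximate; consult the actual kernel source for the authoritative version):

```c
static DEFINE_MUTEX(rcu_barrier_mutex);
static struct completion rcu_barrier_completion;
static atomic_t rcu_barrier_cpu_count;

void rcu_barrier(void)
{
	BUG_ON(in_interrupt());
	/* Exclude CPU hotplug and concurrent rcu_barrier() calls. */
	mutex_lock(&rcu_barrier_mutex);
	init_completion(&rcu_barrier_completion);
	atomic_set(&rcu_barrier_cpu_count, 0);
	/* Run rcu_barrier_func() on every CPU, waiting for each to start. */
	on_each_cpu(rcu_barrier_func, NULL, 0, 1);
	wait_for_completion(&rcu_barrier_completion);
	mutex_unlock(&rcu_barrier_mutex);
}
```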
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu() to post an RCU callback, as follows:
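A sketch of rcu_barrier_func(), again approximately as in the 2.6.20 kernel; it increments the global count and posts a callback on the local CPU's queue, behind any callbacks already there:

```c
static void rcu_barrier_func(void *notused)
{
	int cpu = smp_processor_id();
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
	struct rcu_head *head = &rdp->barrier;

	/* This callback is queued behind all callbacks already posted
	 * on this CPU, so it will run only after they have completed. */
	atomic_inc(&rcu_barrier_cpu_count);
	call_rcu(head, rcu_barrier_callback);
}
```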
The rcu_barrier_callback() function simply atomically decrements the rcu_barrier_cpu_count variable and finalizes the completion when it reaches zero, as follows:
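A sketch of the callback itself:

```c
static void rcu_barrier_callback(struct rcu_head *notused)
{
	/* The last CPU's barrier callback to run wakes up rcu_barrier(). */
	if (atomic_dec_and_test(&rcu_barrier_cpu_count))
		complete(&rcu_barrier_completion);
}
```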
Quick Quiz #4: What happens if CPU 0's rcu_barrier_func() executes immediately (thus incrementing rcu_barrier_cpu_count to the value one), but the other CPU's rcu_barrier_func() invocations are delayed for a full grace period? Couldn't this result in rcu_barrier() returning prematurely?
rcu_barrier() Summary
The rcu_barrier() primitive has seen relatively little use, since most code using RCU is in the core kernel rather than in modules. However, if you are using RCU from an unloadable module, you need to use rcu_barrier() so that your module may be safely unloaded.
Answers to Quick Quizzes
Quick Quiz #1: Why is there no srcu_barrier()?

Since there is no call_srcu(), there can be no outstanding SRCU callbacks. Therefore, there is no need to wait for them.
Quick Quiz #2: Why is there no rcu_barrier_bh()?
Because no one has needed it yet. As soon as someone needs to use call_rcu_bh() from within an unloadable module, they will need an rcu_barrier_bh().
Quick Quiz #3: Is there any other situation where rcu_barrier() might be required?
Interestingly enough, rcu_barrier() was not originally implemented for module unloading. Nikita Danilov was using RCU in a filesystem, which resulted in a similar situation at filesystem-unmount time. Dipankar Sarma coded up rcu_barrier() in response, so that Nikita could invoke it during the filesystem-unmount process.
Much later, yours truly hit the RCU module-unload problem when implementing rcutorture, and found that rcu_barrier() solves this problem as well.
Quick Quiz #4: What happens if CPU 0's rcu_barrier_func() executes immediately (thus incrementing rcu_barrier_cpu_count to the value one), but the other CPU's rcu_barrier_func() invocations are delayed for a full grace period? Couldn't this result in rcu_barrier() returning prematurely?
This cannot happen. The reason is that on_each_cpu() has its last argument, the wait flag, set to "1". This flag is passed through to smp_call_function() and further to smp_call_function_on_cpu(), causing the latter to spin until the cross-CPU invocation of rcu_barrier_func() has completed. This by itself would prevent a grace period from completing on non-CONFIG_PREEMPT kernels, since each CPU must undergo a context switch (or other quiescent state) before the grace period can complete. However, this is of no use in CONFIG_PREEMPT kernels.
Therefore, on_each_cpu() disables preemption across its call to smp_call_function() and also across the local call to rcu_barrier_func(). This prevents the local CPU from context switching, again preventing grace periods from completing. This means that all CPUs have executed rcu_barrier_func() before the first rcu_barrier_callback() can possibly execute, in turn preventing rcu_barrier_cpu_count from prematurely reaching zero.
Currently, -rt implementations of RCU keep but a single global queue for RCU callbacks, and thus do not suffer from this problem. However, when the -rt RCU eventually does have per-CPU callback queues, things will have to change. One simple change is to add an rcu_read_lock() before line 8 of rcu_barrier() and an rcu_read_unlock() after line 8 of this same function. If you can think of a better change, please let me know!
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Virtualization and containers
Miscellaneous
Page editor: Forrest Cook
