Brief items
The current 2.6 prepatch is 2.6.20-rc5,
released by Linus on
January 12. It contains a number of fixes, and might be the last -rc
release before 2.6.20. No patches have hit the mainline git repository
since -rc5; this situation will probably not change until after Linus
returns from linux.conf.au.
The current -mm tree is 2.6.20-rc4-mm1. Recent changes
to -mm include the e1000 development tree,
the HID development tree, unionfs, and the asynchronous filesystem I/O
patches.
Comments (none posted)
Kernel development news
In any conference, there comes a time when one has to wonder what the
people who do the talk scheduling were thinking. For lca2007, that moment
came when your editor realized that the talks on OLPC (Jim Gettys), real
![[Dave Airlie]](/images/conf/lca2007/Airlie-sm.jpg)
time (Ted Ts'o), and Nouveau were all scheduled together. Nouveau won out,
but it was not an easy decision.
The Nouveau project is an
effort to develop a set of free 3D drivers for NVidia chipsets. NVidia has
long annoyed the free software community with its refusal to release free
drivers or programming information for its video chipsets. The Nouveau
folks have had enough of that, and they are doing something about it.
Dave Airlie used his slot at linux.conf.au to talk about the project and
its current status.
Nouveau got its start in February 2005, though serious work did not begin
until June of that year. The project was announced at FOSDEM 2006, at
which point others started to help. There are currently about six
developers doing serious work on Nouveau.
The project is relying on reverse engineering for the information needed to
write free drivers. To that end, the developers have put together a set of
tools. At the top of the list is renouveau, which is designed to
reveal the commands sent to the card in response to specific operations.
Using the existing binary drivers, renouveau sets up a context, then scans
the process's mappings until it finds the command FIFO. It then requests
an operation and sees how the FIFO changes. With enough operations, a
pretty good idea of how the adapter is programmed to specific ends can be
had. This was not a trivial tool, and the better part of a year was put
into its development.
Renouveau is useful for examining the FIFO, but it doesn't help with reads
and writes to I/O registers. For that, there's another set of tools,
starting with valgrind-mmt - a version of valgrind designed to trap I/O
memory operations. Libsegfault is a modified version of mmap()
which doesn't actually do the mappings as the caller would like; it traps
the subsequent segmentation faults and dumps out the operations. There is
another tool, called kmmio, which performs a similar task for register
operations done in kernel space. Finally, the project uses a BIOS tracer
which runs BIOS
code in x86emu and traps I/O register accesses.
All of the information obtained from these tools is supplemented with hints
from the old, free nv driver. There is also, says Dave, information
"which shouldn't be there" to be found on some Russian web sites.
Where has all of this information led the project? Basic tasks, like the
allocation of instance RAM and FIFO initialization, are working. Hardware
context switching works - on little-endian machines. There is 2D support
derived from the nv driver; it offers basic EXA and RandR 1.2 support. On
the 3D front, the Mesa TCL (transform, clipping, and lighting) driver
mostly works. Textures and objects do not, however. It is possible to run
glxgears on nv4x chips. It has taken some time to get to this point, but
Dave thinks that things will start to move a lot faster from here.
The next milestone would be to run Quake 3. That is, says Dave, an
obligatory step on the roadmap. Getting there will involve texture
support, a better memory manager, and better locking in the kernel DRM
code. The developers (Dave in particular) are aiming for RandR 1.2
multi-head support. Once all of this is in place, the nouveau driver will
have reached a reasonably capable state.
There are a lot of people asking when this will be; Dave says that the
project's IRC channel is often overwhelmed by spectators looking for news.
There is no wish to push the code out ahead of its time; among other
things, that would nail down the API between the kernel and the X server, making
things harder to change. The current hope is to have some sort alpha
release toward the end of 2007.
For people wanting to help, Dave had a simple message: they need
developers. There's not much for people who can't work on driver code to
do at this point. Graphics drivers, he says, are not as hard as people
think. Finally, he addressed the issue of the $10K pledge for the
project. It rather took the developers by surprise; they had not endorsed
this drive, and had held some doubts as to whether it would be successful.
How the pledge money will be handled is still being worked out; it looks
like it will mostly be used for hardware purchases.
Lack of support for 3D video adapters has stalled the community for years;
there has been a long wait in the hope that the vendors would come to their
senses. That wait is just about over. The Nouveau project (along with
various others) shows that we have the resources to figure out how our
hardware works, even in the face of complex devices and uncooperative
vendors. It would be better if we did not have to take things into our own
hands this way, but it is nice to see how well we can do it when the need
arises.
Comments (42 posted)
January 12, 2007
This article was contributed by Valerie Henson
Drivers are the dominant source of crashes and bugs in operating
systems. This is especially disturbing given the proportion of
operating system code that is driver code. In Linux, approximately
two thirds of the source lines of code are in drivers (depending on
the version). Full-time kernel developers often bemoan the quality of
code in drivers; one
study [PDF]
found that the bug rate in drivers was actually three to seven times
higher than in core kernel code. Binary drivers (hopefully being
phased out) are an especially nasty source of bugs. Unfortunately,
the companies and programmers writing these driver have neither the
expertise nor the incentive to write beautiful, clean, well-behaved
drivers.
Efforts to limit the effects of driver bugs on the core operating
system have been going on for decades, with limited success. One of
the motivations behind microkernels was the desire to isolate parts of
the kernel so that they could not, for example, stomp on the memory of
other parts of the kernel. Safe behind the message passing interface
between microkernel modules, each module only had to validate the
input from other modules in order to ensure that external bugs would
not interfere with its proper working. In reality, completely
validating messages is harder than it looks, and the performance
overhead of message passing, MMU tricks, and the code to work around
them turned out to be prohibitive. A variety of more limited
sandboxing techniques, isolating only likely troublemakers such as
device drivers, reduced operating system crashes significantly, but
left the system with a non-functioning, possibly crucial device (such
as the network card). While the OS was still up and running, the
system reliability, viewed from an application standpoint, was not
particularly improved. From the point of view of a web server, a
crashed system and a system with no network access due to a safely
sandboxed but crashed network driver are practically identical.
The Solution
What we really need is a lightweight, unintrusive system to not only
catch device driver errors, but to recover and restart the device
driver while simultaneously covering for the device while it is
re-initializing. Michael Swift, Muthukaruppan Annamalai, Brian
Bershad, and Henry Levy implemented such a system for Linux 2.4.18, as
described in their 2004 OSDI paper,
Recovering
Device Drivers [PDF]. The key idea in this paper is
shadow
drivers, a driver that wraps around the original hardware driver
and records requests sent to it, monitors the health of the driver,
and restarts the driver if it crashes, replaying any missed requests
collected while the driver was restarting. You can think of a shadow
driver as a substitute teacher temporarily filling in while the real
driver is out sick. Each class of device drivers (sound, network,
disk, etc.) requires the writing of only one shadow driver.
Shadow drivers are built on the Nooks driver isolation system,
outlined in a paper in the 19th SOSP, Improving the
Reliability of Commodity Operating Systems [PDF]. Nooks provides most
of the benefits of the microkernel architecture for a relatively low
cost. The four main services are (1) memory isolation - drivers run
with most of the kernel memory read-only, (2) wrappers around data
transfer between the kernel and drivers, (3) tracking of kernel
objects used by the driver, and (4) a recovery manager. The Nooks
architecture is simplified by the (perfectly reasonable) assumption
that kernel modules are not malicious, but merely buggy, and so
doesn't need to take special steps to, for example, prevent a device
driver from deliberately altering memory permissions.
When a shadow driver detects that a device driver has failed, it
begins to actively proxy for the device driver (queuing up requests,
etc.) and begins recovery of the driver. First it safely shuts down
the driver, which may require some delicate work given that the driver
has crashed. For example, it may need to explicitly disable
interrupts on the device since a crashed driver can no longer
acknowledge them. Then it reloads the driver and reconfigures it.
The shadow driver will have recorded any prior configuration requests (such as
"set full-duplex mode") and replays them if necessary. Then it
replays any queued up requests that accumulated during the recover
phase. Depending on the device and type of request, it may make more
sense to drop the requests; for example, a shadow driver for sound
will just drop any requests to play sound, since they are real-time
and aren't useful to save up to play when the driver recovers. (With
the Audigy sound card and driver evaluated in the paper, this resulted
in a gap in the audio of about one-tenth of a second.)
The authors compared vanilla Linux, Nooks, and shadow drivers by
adding bugs by hand to three drivers, a network driver (e1000), a
sound card driver (audigy), and a disk driver (ide-disk). The bugs
were based on real bugs reported on mailing lists in order to be as
realistic as possible. They then tested the reliability of the system
from the application point of view on each system. The results are
summarized in the table below; shadow drivers were able to
transparently recover from all tested driver bugs which normally crash
the entire machine, without interrupting the application.
| Application Behavior |
| Device Driver |
Application Activity |
Linux-Native | Linux-Nooks | Linux-SD |
| Sound | mp3 player | CRASH | MALFUNCTION | SUCCESS |
| (audigy driver) | audio recorder | CRASH | MALFUNCTION | SUCCESS |
| | speech synthesizer | CRASH | SUCCESS | SUCCESS |
| | strategy game | CRASH | MALFUNCTION | SUCCESS |
| Network | network file transfer | CRASH | SUCCESS | SUCCESS |
| (e1000 driver) | remote window manager | CRASH | SUCCESS | SUCCESS |
| | network analyzer | CRASH | MALFUNCTION | SUCCESS |
| IDE | compiler | CRASH | CRASH | SUCCESS |
| (ide-disk driver) | encoder | CRASH | CRASH | SUCCESS |
| | database | CRASH | CRASH | SUCCESS |
What does this mean for Linux?
Linux developers have a number of ways to reduce the impact of buggy
drivers. Given that the limiting factor is usually human eyeball
time, we should choose methods that rely on automation as much as
possible. Some of these methods are automatic bug checking,
compiler-level checks, and code-level asserts. Adding an automatic
driver sandbox and recovery system would be an excellent investment in
kernel developer time in return for overall system stability,
particularly for distribution vendors. Even implementing a subset of
the features in the shadow driver system would be helpful. More than
likely, Linux 2.6 has frameworks which would easily lend themselves to
implementing some of these features.
Comments (8 posted)
January 14, 2007
This article was contributed by Paul McKenney
RCU (read-copy update) is a synchronization mechanism that can be thought
of as a replacement for read-writer locking (among other things),
but with very low-overhead readers that are immune to deadlock,
priority inversion, and unbounded latency.
RCU read-side critical sections are delimited by
rcu_read_lock()
and
rcu_read_unlock(), which, in non-
CONFIG_PREEMPT
kernels, generate no code whatsoever.
This means that RCU writers are unaware of the presence of concurrent
readers, so that RCU updates to shared data must be undertaken quite
carefully, leaving an old version of the data structure in place
until all pre-existing readers have finished.
These old versions are needed because
such readers might hold a reference to them.
RCU updates can therefore be rather expensive, and RCU is thus best
suited for read-mostly situations.
How can an RCU writer possibly determine when all readers are finished, given
that readers might well leave absolutely no trace of their presence?
There is a synchronize_rcu() primitive that blocks until
all pre-existing readers have completed.
An updater wishing to delete an element p from a linked
list might do the following, while holding an appropriate lock,
of course:
list_del_rcu(p);
synchronize_rcu();
kfree(p);
But the above code cannot be used in IRQ context -- the
call_rcu() primitive must be used instead.
This primitive takes a pointer to an
rcu_head struct placed
within the RCU-protected data structure and another pointer
to a function that may be invoked later to free that structure.
Code to delete an element
p from the linked list from IRQ
context might then be as follows:
list_del_rcu(p);
call_rcu(&p->rcu, p_callback);
Since
call_rcu() never blocks, this code can safely be used
from within IRQ context.
The function
p_callback() might be defined as follows:
static void p_callback(struct rcu_head *rp)
{
struct pstruct *p = container_of(rp, struct pstruct, rcu);
kfree(p);
}
Unloading Modules That Use call_rcu()
But what if
p_callback is defined in an unloadable module?
If we unload the module while some RCU callbacks are pending, the CPUs
executing these
callbacks are going to be severely disappointed when they are later invoked,
as fancifully depicted on the right.
We could try placing a synchronize_rcu() in the module-exit
code path, but this is not sufficient.
Although synchronize_rcu() does wait for a grace period
to elapse, it does not wait for the callbacks to complete.
One might be tempted to try several back-to-back
synchronize_rcu() calls, but this is still not guaranteed to work.
If there is a very heavy RCU-callback load, then some of the callbacks might
be deferred in order to allow other processing to proceed.
Such deferral is required in realtime kernels in order to avoid excessive
scheduling latencies.
rcu_barrier()
We instead need the
rcu_barrier() primitive.
This primitive is similar to
synchronize_rcu(), but instead of
waiting solely for a grace period to elapse, it also waits for
all outstanding RCU callbacks to complete.
Pseudo-code using
rcu_barrier() is as follows:
- Prevent any new RCU callbacks from being posted.
- Execute rcu_barrier().
- Allow the module to be unloaded.
Quick Quiz #1: Why is there no
srcu_barrier()?
Quick Quiz #2: Why is there no rcu_barrier_bh()?
The rcutorture module makes use of rcu_barrier in its exit function
as follows:
1 static void
2 rcu_torture_cleanup(void)
3 {
4 int i;
5
6 fullstop = 1;
7 if (shuffler_task != NULL) {
8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
9 kthread_stop(shuffler_task);
10 }
11 shuffler_task = NULL;
12
13 if (writer_task != NULL) {
14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
15 kthread_stop(writer_task);
16 }
17 writer_task = NULL;
18
19 if (reader_tasks != NULL) {
20 for (i = 0; i < nrealreaders; i++) {
21 if (reader_tasks[i] != NULL) {
22 VERBOSE_PRINTK_STRING(
23 "Stopping rcu_torture_reader task");
24 kthread_stop(reader_tasks[i]);
25 }
26 reader_tasks[i] = NULL;
27 }
28 kfree(reader_tasks);
29 reader_tasks = NULL;
30 }
31 rcu_torture_current = NULL;
32
33 if (fakewriter_tasks != NULL) {
34 for (i = 0; i < nfakewriters; i++) {
35 if (fakewriter_tasks[i] != NULL) {
36 VERBOSE_PRINTK_STRING(
37 "Stopping rcu_torture_fakewriter task");
38 kthread_stop(fakewriter_tasks[i]);
39 }
40 fakewriter_tasks[i] = NULL;
41 }
42 kfree(fakewriter_tasks);
43 fakewriter_tasks = NULL;
44 }
45
46 if (stats_task != NULL) {
47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
48 kthread_stop(stats_task);
49 }
50 stats_task = NULL;
51
52 /* Wait for all RCU callbacks to fire. */
53 rcu_barrier();
54
55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
56
57 if (cur_ops->cleanup != NULL)
58 cur_ops->cleanup();
59 if (atomic_read(&n_rcu_torture_error))
60 rcu_torture_print_module_parms("End of test: FAILURE");
61 else
62 rcu_torture_print_module_parms("End of test: SUCCESS");
63 }
Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves.
This will not be necessary in most cases, since RCU callbacks rarely
include calls to
call_rcu().
However, the rcutorture module is an exception to this rule, and
therefore needs to set this global variable.
Lines 7-50 stop all the kernel tasks associated with the
rcutorture module.
Therefore, once execution reaches line 53, no more rcutorture
RCU callbacks will be posted.
The rcu_barrier() call on line 53 waits for any pre-existing
callbacks to complete.
Then lines 55-62 print status and do operation-specific cleanup,
and then return, permitting the module-unload operation to be completed.
Quick Quiz #3: Is there any other situation where
rcu_barrier() might be required?
Your module might have additional complications.
For example, if your module invokes call_rcu() from
timers, you will need to first cancel all the timers, and only
then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.
Implementing rcu_barrier()
Dipankar Sarma's implementation of
rcu_barrier()
makes use of the fact that RCU callbacks are never reordered once
queued on one of the per-CPU queues.
His implementation queues an RCU callback on each of
the per-CPU callback queues, and then waits until they have all
started executing, at which point, all earlier RCU callbacks are
guaranteed to have completed.
The code for rcu_barrier() is as follows:
1 void rcu_barrier(void)
2 {
3 BUG_ON(in_interrupt());
4 /* Take cpucontrol mutex to protect against CPU hotplug */
5 mutex_lock(&rcu_barrier_mutex);
6 init_completion(&rcu_barrier_completion);
7 atomic_set(&rcu_barrier_cpu_count, 0);
8 on_each_cpu(rcu_barrier_func, NULL, 0, 1);
9 wait_for_completion(&rcu_barrier_completion);
10 mutex_unlock(&rcu_barrier_mutex);
11 }
Line 3 verifies that the caller is in process context, and
lines 5 and 10 use
rcu_barrier_mutex to ensure that
only one
rcu_barrier() is using the global completion
and counters at a time, which are initialized on lines 6 and 7.
Line 8 causes each CPU to invoke
rcu_barrier_func(),
which is shown below.
Note that the final "1" in
on_each_cpu()'s argument list
ensures that all the calls to
rcu_barrier_func() will
have completed before
on_each_cpu() returns.
Line 9 then waits for the completion.
The rcu_barrier_func() runs on each CPU, where it invokes
call_rcu() to post an RCU callback, as follows:
1 static void rcu_barrier_func(void *notused)
2 {
3 int cpu = smp_processor_id();
4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
5 struct rcu_head *head;
6
7 head = &rdp->barrier;
8 atomic_inc(&rcu_barrier_cpu_count);
9 call_rcu(head, rcu_barrier_callback);
10 }
Lines 3 and 4 locate RCU's internal per-CPU
rcu_data structure,
which contains the
struct rcu_head that needed for the later
call to
call_rcu().
Line 7 picks up a pointer to this
struct rcu_head, and line 8
increments a global counter.
This counter will later be decremented by the callback.
Line 9 then registers the
rcu_barrier_callback()
on the current CPU's queue.
The rcu_barrier_callback() function simply atomically
decrements the rcu_barrier_cpu_count variable and finalizes
the completion when it reaches zero, as follows:
1 static void rcu_barrier_callback(struct rcu_head *notused)
2 {
3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
4 complete(&rcu_barrier_completion);
5 }
Quick Quiz #4: What happens if CPU 0's rcu_barrier_func()
executes immediately (thus incrementing rcu_barrier_cpu_count
to the value one), but the other CPU's rcu_barrier_func()
invocations are delayed for a full grace period?
Couldn't this result in rcu_barrier() returning prematurely?
rcu_barrier() Summary
The rcu_barrier() primitive has seen relatively little use,
since most code using RCU is in the core kernel rather than in modules.
However, if you are using RCU from an unloadable module, you need to use
rcu_barrier() so that your module may be safely unloaded.
Answers to Quick Quizzes
Quick Quiz #1: Why is there no
srcu_barrier()?
Since there is no call_srcu(), there can be no outstanding
SRCU callbacks.
Therefore, there is no need to wait for them.
Quick Quiz #2: Why is there no rcu_barrier_bh()?
Because no one has needed it yet.
As soon as someone needs to use call_rcu_bh() from within an
unloadable module, they will need an rcu_barrier_bh().
Quick Quiz #3: Is there any other situation where
rcu_barrier() might be required?
Interestingly enough, rcu_barrier() was not originally
implemented for module unloading.
Nikita Danilov was using RCU in a filesystem, which resulted in a
similar situation at filesystem-unmount time.
Dipankar Sarma coded up rcu_barrier() in response, so that
Nikita could invoke it during the filesystem-unmount process.
Much later, yours truly hit the RCU module-unload problem
when implementing rcutorture, and found that rcu_barrier()
solves this problem as well.
Quick Quiz #4: What happens if CPU 0's rcu_barrier_func()
executes immediately (thus incrementing rcu_barrier_cpu_count
to the value one), but the other CPU's rcu_barrier_func()
invocations are delayed for a full grace period?
Couldn't this result in rcu_barrier() returning prematurely?
This cannot happen.
The reason is that on_each_cpu() has its last argument,
the wait flag, set to "1".
This flag is passed through to smp_call_function() and
further to smp_call_function_on_cpu(), causing this latter
to spin until the cross-CPU invocation of rcu_barrier_func() has completed.
This by itself would prevent a grace period from completing on
non-CONFIG_PREEMPT kernels, since each CPU must undergo a context switch
(or other quiescent state) before the grace period can complete.
However, this is of no use in CONFIG_PREEMPT kernels.
Therefore, on_each_cpu() disables preemption across its
call to smp_call_function() and also across the local
call to rcu_barrier_func().
This prevents the local CPU from context switching, again preventing
grace periods from completing.
This means that all CPUs have executed rcu_barrier_func()
before the first rcu_barrier_callback() can possibly execute,
in turn preventing rcu_barrier_cpu_count from prematurely
reaching zero.
Currently, -rt implementations of RCU keep but a single global queue
for RCU callbacks, and thus do not suffer from this problem.
However, when the -rt RCU eventually does have per-CPU callback
queues, things will have to change.
One simple change is to add an rcu_read_lock() before line 8
of rcu_barrier() and an rcu_read_unlock() after
line 8 of this same function.
If you can think of a better change, please let me know!
Comments (5 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Virtualization and containers
Miscellaneous
Page editor: Forrest Cook
Next page: Distributions>>