Brief items
The current 2.6 prepatch is 2.6.24-rc7,
released by Linus on
January 6. It contains a fair number of fixes and an implementation
of
/proc/slabinfo for the SLUB allocator (which was
discussed in last week's Kernel
Page). About the long release cycle, he says "
I'll be charitable and
claim it's because it's all stabilizing, and not because we've all been in
a drunken stupor over the holidays." The short-form changelog can be
found in the release announcement; see
the
long-format changelog for all the details.
The mainline git repository contains, as of this writing, a few dozen
post-rc7 patches.
The current stable 2.6 kernel is 2.6.23.13, released on January 9. This update is only of interest to people using the w83627ehf
hardware monitoring driver, but they should be very interested: "I
have had a private report that this bug might have caused permanent
hardware damage. There is no definitive proof at this point, but
unfortunately due to the lack of documentation I really can't rule it
out."
For older kernels: 2.6.16.58-rc1 was released on
January 6 with about a dozen fixes, a few of which are
security-related.
Comments (none posted)
Kernel development news
What guarantees that it doesn't happen before we get to callback? AFAICS,
nothing whatsoever...
And if it does happen, we'll get rdev happily freed (by rdev_free(), as
->release() of &rdev->kobj) by the time we get to delayed_delete(). Which
explains what's going on just fine.
--
Al Viro shows how to debug kernel
problems
I consider the fact that I
can spend full-time working on Linux to be a blessing. But if you
don't feel that way, my condolences, and please do what you need to do
so you can stay in your happy place.
--
Ted Ts'o shows how to respond with class
to trolls
Comments (4 posted)
By Jonathan Corbet
January 9, 2008
As of this writing, the 2.6.24 kernel is getting close to a release -
though there is likely to be one more -rc version to look at first. The
rate of change has slowed significantly, though, and the final regressions
are being chased down. So it seems like a suitable time to look at the
patches which went into this kernel and where they came from.
This is, in many ways, a record-breaking development cycle. Over 10,000
individual changesets have been merged this time around, with a net growth
of almost 300,000 lines of code. 950 developers contributed this code; of
those, 358 contributed just one patch. By comparison, the previous cycle
(2.6.23) merged some 6200 patches from about 860 developers. Given that,
it's not surprising that the 2.6.24 cycle has been a little longer than
some of its predecessors.
Without further ado, here is the list of top contributors to this kernel:
| Most active 2.6.24 developers |
| By changesets |
| Thomas Gleixner | 362 | 3.6% |
| Bartlomiej Zolnierkiewicz | 205 | 2.0% |
| Adrian Bunk | 190 | 1.9% |
| Ralf Baechle | 176 | 1.8% |
| Pavel Emelyanov | 146 | 1.5% |
| Ingo Molnar | 141 | 1.4% |
| Tejun Heo | 138 | 1.4% |
| Paul Mundt | 131 | 1.3% |
| Johannes Berg | 119 | 1.2% |
| Al Viro | 116 | 1.2% |
| Takashi Iwai | 115 | 1.1% |
| Jeff Garzik | 107 | 1.1% |
| David S. Miller | 102 | 1.0% |
| Matthew Wilcox | 97 | 1.0% |
| Jens Axboe | 89 | 0.9% |
| Krzysztof Helt | 89 | 0.9% |
| Stephen Hemminger | 86 | 0.9% |
| Rusty Russell | 86 | 0.9% |
| Alan Cox | 85 | 0.8% |
| Herbert Xu | 84 | 0.8% |
|
| By changed lines |
| Thomas Gleixner | 46358 | 5.9% |
| Zhu Yi | 35133 | 4.5% |
| Auke Kok | 25861 | 3.3% |
| Michael Buesch | 24480 | 3.1% |
| Ivo van Doorn | 22178 | 2.8% |
| Matthew Wilcox | 20416 | 2.6% |
| Adrian Bunk | 19050 | 2.4% |
| Larry Finger | 15003 | 1.9% |
| David S. Miller | 14315 | 1.8% |
| Andy Gospodarek | 13814 | 1.8% |
| Nathanael Nerode | 12821 | 1.6% |
| Jeff Dike | 11103 | 1.4% |
| Johannes Berg | 10118 | 1.3% |
| Ralf Baechle | 9555 | 1.2% |
| Scott Wood | 9328 | 1.2% |
| Krzysztof Helt | 8162 | 1.0% |
| Kumar Gala | 8002 | 1.0% |
| Jeff Garzik | 7689 | 1.0% |
| David Gibson | 7284 | 0.9% |
| Michael Hennerich | 7181 | 0.9% |
|
By either method of counting, Thomas Gleixner comes out at the top of the
list by virtue of his work on the i386/x86_64 architecture merger.
Bringing those architectures together and making the result work well was a
huge job; this effort will continue into future development cycles. (For
the curious, simply renamed files were not counted as "changed lines" in
the generation of these numbers). Note that many of these patches also
carry a signoff by Ingo Molnar, but git only stores the name of a single
"author" for a changeset.
Other contributors of large numbers of changesets in 2.6.24 include
Bartlomiej Zolnierkiewicz (lots of IDE driver patches), Adrian Bunk
(cleanups all over the kernel tree), Ralf Baechle (MIPS architecture work),
Pavel Emelyanov (mostly network and PID namespaces), Tejun Heo (serial ATA
and a number of sysfs cleanups), Johannes Berg (wireless networking), and
Al Viro (mostly annotation patches and related fixes). If one looks at the
number of changed lines, the list of developers changes almost completely:
Zhu Yi (iwlwifi driver), Auke Kok (e1000 driver), Michael Buesch (wireless
networking and the b43 driver), Ivo van Doorn (rt2x00 wireless driver),
Matthew Wilcox (SCSI, especially advansys and sym53c8xx drivers), Adrian
Bunk (cleanups and code deletions), Larry Finger (mainly addition of the
b43 legacy driver), and David Miller (networking and SPARC64).
If one assigns developers' contributions to employers and totals the
results, the following numbers emerge (note that these tables have been
updated since initial publication to fix an error):
| Most active 2.6.24 employers |
| By changesets |
| (None) | 1417 | 14.1% |
| (Unknown) | 1108 | 11.1% |
| Red Hat | 1045 | 10.4% |
| IBM | 819 | 8.2% |
| Novell | 680 | 6.8% |
| Intel | 446 | 4.5% |
| linutronix | 369 | 3.7% |
| Oracle | 240 | 2.4% |
| SWsoft | 212 | 2.1% |
| CERN | 205 | 2.0% |
| Movial | 190 | 1.9% |
| Linux Foundation | 190 | 1.9% |
| MIPS Technologies | 176 | 1.8% |
| Renesas Technology | 140 | 1.4% |
| (Academia) | 132 | 1.3% |
| Freescale | 126 | 1.3% |
| MontaVista | 122 | 1.2% |
| Analog Devices | 115 | 1.1% |
| (Consultant) | 112 | 1.1% |
| NetApp | 101 | 1.0% |
|
| By lines changed |
| (None) | 140730 | 18.0% |
| (Unknown) | 121511 | 15.5% |
| Intel | 114990 | 14.7% |
| Red Hat | 58858 | 7.5% |
| IBM | 51777 | 6.6% |
| linutronix | 47968 | 6.1% |
| Novell | 29856 | 3.8% |
| Movial | 19093 | 2.4% |
| Freescale | 15262 | 1.9% |
| Analog Devices | 14971 | 1.9% |
| MIPS Technologies | 11726 | 1.5% |
| SWsoft | 8331 | 1.1% |
| Linux Foundation | 7917 | 1.0% |
| Oracle | 7777 | 1.0% |
| Atmel | 7125 | 0.9% |
| CERN | 6618 | 0.8% |
| Renesas Technology | 6414 | 0.8% |
| Google | 6373 | 0.8% |
| MontaVista | 6026 | 0.8% |
| NetApp | 5620 | 0.7% |
|
In many ways, these lists look similar to those posted for past kernels.
But there are a few things which jump out this time around:
- Intel has made it to the top of the "by lines changed" list - and
not just by a little bit. This happened by virtue of the work done by
four of the top-20 developers, but also by dozens of others who
contributed to the 2.6.24 kernel. Intel has a lot of people
working on the kernel, many of whom spend little time in the
limelight.
- Movial found its way onto the list
for the first time as a result of having hired a very active
developer.
- The amount of work done by people known to be hacking on their own
time has grown a bit. This change is mostly a result of more complete
information on our side - many developers have moved out of the
"unknown" category. Quite a bit of the no-employer work this time
around was done on the wireless networking tree; since much of the
interesting work in this area currently involves reverse engineering,
perhaps it is not surprising that relatively few companies are willing
to sponsor it.
All told, some 130 distinct employers were identified for the contributors
to 2.6.24. That is a lot of companies to be working on one body of code.
Looking at the Signed-off-by headers of patches is always interesting; if
one removes the signoffs added by the authors themselves, what is left is a
list of the gatekeepers - those who channel the code into the mainline.
The people who signed off on the most patches which they did not write are:
| Sign-offs in the 2.6.24 kernel |
| By developer |
| Andrew Morton | 1679 | 17.6% |
| David S. Miller | 894 | 9.4% |
| Jeff Garzik | 631 | 6.6% |
| Ingo Molnar | 626 | 6.6% |
| John W. Linville | 413 | 4.3% |
| Mauro Carvalho Chehab | 367 | 3.9% |
| Greg Kroah-Hartman | 337 | 3.5% |
| Paul Mackerras | 305 | 3.2% |
| Jaroslav Kysela | 284 | 3.0% |
| James Bottomley | 260 | 2.7% |
| Linus Torvalds | 250 | 2.6% |
| Thomas Gleixner | 216 | 2.3% |
| Bryan Wu | 166 | 1.7% |
| Takashi Iwai | 115 | 1.2% |
| Jens Axboe | 113 | 1.2% |
| Len Brown | 113 | 1.2% |
| Avi Kivity | 107 | 1.1% |
| Roland Dreier | 107 | 1.1% |
| Ralf Baechle | 96 | 1.0% |
| Adrian Bunk | 88 | 0.9% |
|
| By employer |
| Red Hat | 2935 | 30.2% |
| Linux Foundation | 1929 | 19.9% |
| (None) | 823 | 8.5% |
| (Unknown) | 736 | 7.6% |
| Novell | 636 | 6.6% |
| IBM | 584 | 6.0% |
| Intel | 318 | 3.3% |
| linutronix | 216 | 2.2% |
| Analog Devices | 175 | 1.8% |
| SGI | 141 | 1.5% |
| Oracle | 133 | 1.4% |
| Cisco | 107 | 1.1% |
| Qumranet | 107 | 1.1% |
| NetApp | 106 | 1.1% |
| MIPS Technologies | 96 | 1.0% |
| Movial | 88 | 0.9% |
| (Consultant) | 85 | 0.9% |
| Renesas Technology | 84 | 0.9% |
| Cendio | 43 | 0.4% |
| CERN | 40 | 0.4% |
|
There are not a lot of changes here from previous development cycles.
While quite a few developers add signoffs to code and pass it on, they work
for a relatively small number of companies - 7 employers account for
70% of the non-author signoffs.
Finally, given that we are starting a new year, it is worth taking a quick
look back at the entirety of 2007. In 2007, Linus merged just over 30,000
changesets (more than 80 per day, every day) from 1900 developers working
for (at least) 200 companies. All
told, they changed over 2 million lines of code, growing the kernel by
more than 750,000 lines. The kernel developers are, in other words,
touching over 5,000 lines of code every day - that is a high rate of
change.
The top contributors over the course of the year
(by changesets) were:
| Top contributors in 2007 |
| By developer |
| Ralf Baechle | 507 | 1.7% |
| Thomas Gleixner | 485 | 1.6% |
| David S. Miller | 468 | 1.6% |
| Adrian Bunk | 439 | 1.5% |
| Tejun Heo | 394 | 1.3% |
| Ingo Molnar | 351 | 1.2% |
| Paul Mundt | 351 | 1.2% |
| Al Viro | 337 | 1.1% |
| Bartlomiej Zolnierkiewicz | 330 | 1.1% |
| Andrew Morton | 319 | 1.1% |
| Stephen Hemminger | 302 | 1.0% |
| Patrick McHardy | 277 | 0.9% |
| Alan Cox | 270 | 0.9% |
| Takashi Iwai | 269 | 0.9% |
| Trond Myklebust | 256 | 0.9% |
| David Brownell | 254 | 0.8% |
| Avi Kivity | 229 | 0.8% |
| Jeff Dike | 227 | 0.8% |
| Jeff Garzik | 216 | 0.7% |
| Jean Delvare | 215 | 0.7% |
|
| By employer |
| (None) | 4881 | 16.2% |
| Red Hat | 3441 | 11.4% |
| (Unknown) | 2933 | 9.7% |
| IBM | 2379 | 7.9% |
| Novell | 2054 | 6.8% |
| Intel | 1060 | 3.5% |
| Linux Foundation | 784 | 2.6% |
| Oracle | 677 | 2.2% |
| (Consultant) | 631 | 2.1% |
| MIPS Technologies | 507 | 1.7% |
| linutronix | 507 | 1.7% |
| Renesas Technology | 394 | 1.3% |
| (Academia) | 392 | 1.3% |
| SWsoft | 384 | 1.3% |
| SGI | 368 | 1.2% |
| MontaVista | 342 | 1.1% |
| CERN | 330 | 1.1% |
| Freescale | 291 | 1.0% |
| NetApp | 279 | 0.9% |
| Astaro | 277 | 0.9% |
|
It should be noted that the employer numbers are more approximate than
usual. Some developers changed employers in 2007, but LWN, as a matter of
policy, does not maintain a database of developers and their employers over
time. Still, the picture is relatively constant - the same companies
continue to contribute approximately the same percentage of the patches
going into the kernel over relatively long periods of time.
Overall, the picture that results from all these numbers is one of a
widespread and healthy development community. There appears to be no
shortage of jobs for kernel developers, but also room for those who work
outside of the office. The kernel truly is a common resource, with
literally thousands of people working to improve it. And it shows no signs
of slowing down anytime soon.
Your editor would like to profusely thank Greg Kroah-Hartman for his help
in improving these statistics.
Comments (6 posted)
By Jake Edge
January 9, 2008
Instrumenting a running kernel for debugging or profiling is on the
wish list of many administrators and developers. Advocates of OpenSolaris
like to point to DTrace as a
feature that Linux lacks, though SystemTap has started to close
that gap. The Linux Trace Toolkit next
generation (LTTng) takes a different approach and was recently
submitted for inclusion in the kernel (in two patches: arch independent and arch dependent).
LTTng relies upon kernel
markers to provide static probe points for its kernel tracing
activities. It also provides the ability to trace userspace programs and
combine that data with kernel tracing data to give a detailed view of
the internals of the system. Unlike other tools, LTTng takes a
post-processing approach, storing the data away as efficiently as possible
for later analysis. This is in contrast to SystemTap and DTrace which have their own
mini-languages that specify what to do as each trace point is reached.
One of the major design goals of LTTng is to have as little impact on the
system as possible, not only when it is actually tracing events, but also
when it is disabled. Kernel hackers are quite resistant to debugging
solutions that add any significant performance penalty when not in use. In addition, any
significant delays while enabled may change the system timing such that the bug or
condition being studied does not occur. For this reason, LTTng does not
take the path that various dynamic tracing solutions have used and avoids
the expense of a breakpoint interrupt by using the static markers.
Another major design goal is to provide monotonically increasing timestamp
values for events. The original LTT uses timestamps derived from the
kernel Network Time Protocol (NTP) time, which can fluctuate somewhat as
adjustments are made – sometimes going backward. LTTng uses a
timestamp derived from the hardware clocks that will work on various
processor architectures and clock speeds. In addition, the timestamps can
be correlated between different processors in a multi-processor system.
As LTTng gathers its data, it uses relayfs to get the data to a
userspace daemon (lttd) that writes the data to disk. The daemon
is started from the lttctl command-line tool, which controls the
tracing settings in the kernel via a netlink socket. A user wishing to
investigate tracing could use lttctl to start and stop a trace;
once the trace is complete, the data could be viewed and analyzed.
The LTT viewer (LTTV) is the program that is used to analyze the data
gathered. It provides both GUI and text-based viewers to interpret the
binary data generated by LTTng and present it to the user. Multi-gigabyte
files of tracing data are not uncommon when using LTTng, so a tool like
LTTV is indispensable for visualization and filtering to allow the user to
focus on the events of interest. LTTV has a plugin mechanism that allows
users to develop their own display and analysis tools, while using the LTTV
framework and filtering capabilities.
An advantage of using static probe points – though some may see it as
a disadvantage – is that they can be maintained with the kernel code
they are targeting. If the kernel markers patch is merged, subsystems can
add probe points at places they find interesting or useful and those
markers will be carried along in the kernel source; updated as the
kernel changes. Other solutions rely on matching an external list of
probes with the version of the running kernel, which can result in
mismatches and incorrect traces. Also, SystemTap will be able to use any
markers that get added to the kernel as is, so users who want the abilities
that it provides will also benefit.
LTTng is being developed at the École Polytechnique de
Montréal with support from quite a few Linux companies. It
has the looks of a very well thought out framework that builds upon the
tracing work that has been done before. It certainly won't make it into
2.6.24, but it would seem to have a good chance of making it into a future
mainline kernel.
Comments (2 posted)
January 7, 2008
This article was contributed by Paul McKenney
[
Editor's note: this is the third and final installment in Paul
McKenney's "What is RCU?" series. The first and second parts remain available
for those who might have missed them. Many thanks to Paul for letting LWN
run these articles.]
Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU is most frequently described as a replacement for reader-writer locking,
but has also been used in a number of other ways.
RCU is notable in that RCU readers do not directly synchronize with
RCU updaters,
which makes RCU read paths extremely fast, and also
permits RCU readers to accomplish useful work even
when running concurrently with RCU updaters.
This leads to the question "what exactly is RCU?", a question that this
document addresses from the viewpoint of the Linux kernel's RCU API.
-
RCU has a Family of Wait-to-Finish APIs
-
RCU has Publish-Subscribe and Version-Maintenance APIs
-
So, What is RCU Really?
These sections are followed by a
references section and the
answers to the Quick Quizzes.
The most straightforward answer to "what is RCU" is that RCU is
an API used in the Linux kernel, as summarized by the pair of tables
in this section
(the first table shows the wait-for-RCU-readers portions of the API,
while the second table shows the publish/subscribe portions of the API).
Or, more precisely, RCU is a family of APIs as shown in the first table,
with each column corresponding to a member of the RCU API family.
If you are new to RCU, you might consider focusing on just one
of the columns in the following table.
For example, if you are primarily interested in understanding how RCU
is used in the Linux kernel, "RCU Classic" would be the place to start,
as it is used most frequently.
On the other hand, if you want to understand RCU for its own sake,
"SRCU" has the simplest API.
You can always come back for the other columns later.
If you are already familiar with RCU, the following pair of tables can
serve as a useful reference.
| Attribute |
RCU Classic |
RCU BH |
RCU Sched |
Realtime RCU |
SRCU |
QRCU |
| Purpose |
Original |
Prevent DDoS attacks |
Wait for hardirqs and NMIs |
Realtime response |
Sleeping readers |
Sleeping readers and fast grace periods |
| Availability |
2.5.43 |
2.6.9 |
2.6.12 |
Aug 2005 -rt |
2.6.19 |
|
| Read-side primitives |
rcu_read_lock() rcu_read_unlock() |
rcu_read_lock_bh()
rcu_read_unlock_bh() |
preempt_disable()
preempt_enable()
(and friends) |
rcu_read_lock()
rcu_read_unlock() |
srcu_read_lock()
srcu_read_unlock() |
qrcu_read_lock()
qrcu_read_unlock() |
Update-side primitives (synchronous) |
synchronize_rcu()
synchronize_net() |
|
synchronize_sched() |
synchronize_rcu()
synchronize_net() |
synchronize_srcu() |
synchronize_qrcu() |
Update-side primitives (asynchronous/callback) |
call_rcu() |
call_rcu_bh() |
|
call_rcu() |
N/A |
N/A |
Update-side primitives (wait for callbacks) |
rcu_barrier() |
|
|
rcu_barrier() |
N/A |
N/A |
| Read side constraints |
No blocking |
No irq enabling |
No blocking |
No blocking except preemption and lock acquisition |
No synchronize_srcu() |
No synchronize_qrcu() |
| Read side overhead |
Preempt disable/enable (free on non-PREEMPT) |
BH disable/enable |
Preempt disable/enable (free on non-PREEMPT) |
Simple instructions, irq disable/enable |
Simple instructions, preempt disable/enable |
Atomic increment and
decrement of shared variable |
Asynchronous update-side overhead
(for example, call_rcu()) |
sub-microsecond |
sub-microsecond |
|
sub-microsecond |
N/A |
N/A |
| Grace-period latency |
10s of milliseconds |
10s of milliseconds |
10s of milliseconds |
10s of milliseconds |
10s of milliseconds |
10s of nanoseconds in absence of readers |
| Non-PREEMPT_RT implementation |
RCU Classic |
RCU BH |
RCU Classic |
N/A |
SRCU |
N/A |
| PREEMPT_RT implementation |
N/A |
Realtime RCU |
Forced Schedule on all CPUs |
Realtime RCU |
SRCU |
N/A |
Quick Quiz 1:
Why are some of the cells in the above table colored green?
The "RCU Classic" column corresponds to the original RCU implementation,
in which RCU read-side critical sections are delimited by
rcu_read_lock() and rcu_read_unlock(), which
may be nested.
The corresponding synchronous update-side primitives,
synchronize_rcu(), along with its synonym
synchronize_net(), wait for any currently executing
RCU read-side critical sections to complete.
The length of this wait is known as a "grace period".
The asynchronous update-side primitive, call_rcu(),
invokes a specified function with a specified argument after a
subsequent grace period.
For example, call_rcu(p,f); will result in
the "RCU callback" f(p)
being invoked after a subsequent grace period.
There are situations,
such as when unloading a module that uses call_rcu(),
when it is necessary to wait for all
outstanding RCU callbacks to complete.
The rcu_barrier() primitive does this job.
In the "RCU BH" column, rcu_read_lock_bh() and
rcu_read_unlock_bh() delimit RCU read-side critical
sections, and call_rcu_bh() invokes the specified
function and argument after a subsequent grace period.
Note that RCU BH does not have a synchronous synchronize_rcu_bh()
interface,
though one could easily be added if required.
Quick Quiz 2:
What happens if you mix and match?
For example, suppose you use rcu_read_lock() and
rcu_read_unlock() to delimit RCU read-side critical
sections, but then use call_rcu_bh() to post an
RCU callback?
In the "RCU Sched" column, anything that disables preemption
acts as an RCU read-side critical section, and synchronize_sched()
waits for the corresponding RCU grace period.
This RCU API family was added in the 2.6.12 kernel, which split the
old synchronize_kernel() API into the current
synchronize_rcu() (for RCU Classic) and
synchronize_sched() (for RCU Sched).
Note that RCU Sched does not have an asynchronous
call_rcu_sched() interface,
though one could be added if required.
Quick Quiz 3:
What happens if you mix and match RCU Classic and RCU Sched?
The "Realtime RCU" column has the same API as does
RCU Classic, the only difference being that RCU read-side critical
sections may be preempted and may block while acquiring spinlocks.
The design of Realtime RCU is described in the LWN article
The design of preemptible read-copy-update.
Quick Quiz 4:
What happens if you mix and match Realtime RCU and RCU Classic?
The "SRCU" column displays a specialized RCU API that permits
general sleeping in RCU read-side critical sections, as was
described in the LWN article
Sleepable RCU.
Of course,
use of synchronize_srcu() in an SRCU read-side
critical section can result in
self-deadlock, so should be avoided.
SRCU differs from earlier RCU implementations in that the caller
allocates an srcu_struct for each distinct SRCU
usage.
This approach prevents SRCU read-side critical sections from blocking
unrelated synchronize_srcu() invocations.
In addition, in this variant of RCU, srcu_read_lock()
returns a value that must be passed into the corresponding
srcu_read_unlock().
The "QRCU" column presents an RCU implementation with the same
API structure as SRCU, but optimized for extremely low-latency
grace periods in absence of readers, as described in the LWN article
Using Promela and Spin to verify parallel algorithms.
As with SRCU, use of synchronize_qrcu() can result in
self-deadlock, so should be avoided.
Although QRCU has not yet been accepted into the Linux kernel, it
is worth mentioning given that it is the only RCU implementation
that can boast deep sub-microsecond grace-period latencies.
Quick Quiz 5:
Why do both SRCU and QRCU lack asynchronous call_srcu()
or call_qrcu() interfaces?
Quick Quiz 6:
Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?
The Linux kernel currently has a surprising number of RCU APIs and
implementations.
There is some hope of reducing this number, evidenced by the fact
that a given build of the Linux kernel currently has at most
three implementations behind four APIs (given that RCU Classic
and Realtime RCU share the same API).
However, careful inspection and analysis will be required, just as
would be required for one of the many locking APIs.
Fortunately, the RCU publish-subscribe and version-maintenance
primitives shown in the following
table apply to all of the variants of RCU discussed above.
This commonality can in some cases allow more code to be shared,
which certainly reduces the API proliferation that would otherwise
occur.
| Category |
Primitives |
Availability |
Overhead |
| List traversal |
list_for_each_entry_rcu() |
2.5.59 |
Simple instructions (memory barrier on Alpha) |
| List update |
list_add_rcu() |
2.5.44 |
Memory barrier |
list_add_tail_rcu() |
2.5.44 |
Memory barrier |
list_del_rcu() |
2.5.44 |
Simple instructions |
list_replace_rcu() |
2.6.9 |
Memory barrier |
list_splice_init_rcu() |
2.6.21 |
Grace-period latency |
| Hlist traversal |
hlist_for_each_entry_rcu() |
2.6.8 |
Simple instructions (memory barrier on Alpha) |
| Hlist update |
hlist_add_after_rcu() |
2.6.14 |
Memory barrier |
hlist_add_before_rcu() |
2.6.14 |
Memory barrier |
hlist_add_head_rcu() |
2.5.64 |
Memory barrier |
hlist_del_rcu() |
2.5.64 |
Simple instructions |
hlist_replace_rcu() |
2.6.15 |
Memory barrier |
| Pointer traversal |
rcu_dereference() |
2.6.9 |
Simple instructions (memory barrier on Alpha) |
| Pointer update |
rcu_assign_pointer() |
2.6.10 |
Memory barrier |
The first pair of categories operate on Linux
struct list_head lists, which are circular, doubly-linked
lists.
The list_for_each_entry_rcu() primitive traverses an
RCU-protected list in a type-safe manner, while also enforcing
memory ordering for situations where a new list element is inserted
into the list concurrently with traversal.
On non-Alpha platforms, this primitive incurs little or no performance
penalty compared to list_for_each_entry().
The list_add_rcu(), list_add_tail_rcu(),
and list_replace_rcu() primitives are analogous to
their non-RCU counterparts, but incur the overhead of an additional
memory barrier on weakly-ordered machines.
The list_del_rcu() primitive is also analogous to its
non-RCU counterpart, but oddly enough is very slightly faster due to the
fact that it poisons only the prev pointer rather than
both the prev and next pointers as
list_del() must do.
Finally, the list_splice_init_rcu() primitive is similar
to its non-RCU counterpart, but incurs a full grace-period latency.
The purpose of this grace period is to allow RCU readers to finish
their traversal of the source list before completely disconnecting
it from the list header -- failure to do this could prevent such
readers from ever terminating their traversal.
Quick Quiz 7:
Why doesn't list_del_rcu() poison both the next
and prev pointers?
The second pair of categories operate on Linux's
struct hlist_head, which is a linear linked list.
One advantage of struct hlist_head over
struct list_head is that the former requires only
a single-pointer list header, which can save significant memory in
large hash tables.
The struct hlist_head primitives in the table
relate to their non-RCU counterparts in much the same way as do the
struct list_head primitives.
The final pair of categories operate directly on pointers, and
are useful for creating RCU-protected non-list data structures,
such as RCU-protected arrays and trees.
The rcu_assign_pointer() primitive ensures that any
prior initialization remains ordered before the assignment to the
pointer on weakly ordered machines.
Similarly, the rcu_dereference() primitive ensures that subsequent
code dereferencing the pointer will see the effects of initialization code
prior to the corresponding rcu_assign_pointer() on
Alpha CPUs.
On non-Alpha CPUs, rcu_dereference() documents which pointer
dereferences are protected by RCU.
Quick Quiz 8:
Normally, any pointer subject to rcu_dereference() should
always be updated using rcu_assign_pointer().
What is an exception to this rule?
Quick Quiz 9:
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
At its core, RCU is nothing more nor less than an API that supports
publication and subscription for insertions, waiting for all RCU readers
to complete, and maintenance of multiple versions.
That said, it is possible to build higher-level constructs
on top of RCU, including the reader-writer-locking, reference-counting,
and existence-guarantee constructs listed in the companion article.
Furthermore, I have no doubt that the Linux community will continue to
find interesting new uses for RCU,
just as they do for any of a number of synchronization
primitives throughout the kernel.
Finally, a complete view of RCU would also include
all of the things you can do with these APIs.
Acknowledgements
We are all indebted to Andy Whitcroft, Jon Walpole, and Gautham Shenoy,
whose review of an early draft of this document greatly improved it.
I owe thanks to the members of the Relativistic Programming project
and to members of PNW TEC for many valuable discussions.
I am grateful to Dan Frye for his support of this effort.
This work represents the view of the author and does not necessarily
represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or
service marks of others.
This section gives a short annotated bibliography describing using RCU,
Linux-kernel RCU implementations, background, and historical perspectives.
For more information, see
Paul E. McKenney's RCU Page.
Using RCU
-
Overview of Linux-Kernel Reference Counting (McKenney,
January 2007) [PDF].
Overview of Linux-kernel reference counting (including RCU)
prepared for the
Concurrency Working Group of the C/C++ standards committee.
-
RCU and Unloadable Modules (McKenney, January 2007).
Describes how to unload modules that use
call_rcu(),
so as to avoid RCU callbacks trying to use the module after it
has been unloaded.
-
Recent Developments in SELinux Kernel Performance.
James Morris describes a performance problem in the SELinux
Access Vector Cache (AVC), and its resolution via RCU in
a patch by Kaigai Kohei.
-
Using Read-Copy-Update Techniques for System V IPC in the
Linux 2.5 Kernel (Arcangeli et al., June 2003) [PDF].
Describes how RCU is used in the Linux kernel's System V IPC
implementation.
Linux-Kernel RCU Implementations
-
The design of preemptible read-copy-update (McKenney, October 2007).
Describes a high-performance RCU implementation for realtime use.
- Sleepable RCU (McKenney,
October 2006).
Description of SRCU.
-
Using Promela and Spin to verify parallel algorithms (McKenney,
August 2007).
Description of the QRCU patch.
-
RCU dissertation (McKenney, July 2004) [PDF].
- Section 2.2.20 (pages 62-64) gives a history of RCU-like
mechanisms, a very brief summary of which can be found
below.
- Chapter 4 (pages 71-98) and Appendix C (pages 326-345) review
a number of different types of RCU implementations, summarizing
a number of earlier papers.
- Chapter 5 (pages 137-178) gives an overview of a number of
"design patterns" guiding use of RCU.
- Chapter 6 (pages 179-234) describes some early uses of RCU.
-
Using RCU in the Linux 2.5 Kernel (October 2003).
Brief summary of why RCU can be helpful, along with
an analogy between RCU and reader-writer locking.
- Anyone who is laboring under the misapprehension that
the Linux community would never have
independently invented RCU should read this
netdev posting and
this one as well.
Both postings pre-date the earliest known introduction of RCU to the
Linux community.
Background
-
Real-Time Linux Wiki.
Provides much valuable information on the -rt patchset for both
kernel and application developers.
-
Home of the -rt kernel patchsets.
-
Memory Ordering in Modern Microprocessors (McKenney, August 2005) [PDF].
Gives an overview of how Linux's memory-ordering primitives work
on a number of computer architectures.
Historical Perspectives on RCU and Related Mechanisms
-
Tornado: Maximizing Locality and Concurrency in a
Shared Memory Multiprocessor Operating System
(Gamsa et al., February 1999) [PDF].
Independent invention of a mechanism very similar to RCU.
Tornado is a research operating system developed at the
University of Toronto.
This operating system uses its analog to RCU pervasively.
Some of the University of Toronto students brought this operating
system with them to IBM Research, where it was developed as part of the
K42 project.
-
Read-Copy Update: Using Execution History to Solve Concurrency
Problems (McKenney and Slingwine, October 1998) [PDF].
First non-patent publication of DYNIX/ptx's RCU implementation.
-
Passive Serialization in a Multitasking Environment
(Hennessey et al., February 1989).
This patent describes an RCU-like mechanism that was apparently
used in IBM's VM/XA mainframe hypervisor.
This is the earliest known production use of an RCU-like mechanism.
-
Concurrent Manipulation of Binary Search Trees (Kung and Lehman,
September 1980).
The earliest known publication of an RCU-like mechanism,
using a garbage collector to implicitly compute grace periods.
Quick Quiz 1:
Why are some of the cells in the above table colored green?
Answer: The green API members (rcu_read_lock(),
rcu_read_unlock(), and call_rcu()) were the
only members of the Linux RCU API that Paul E. McKenney was aware of back
in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.
Back to Quick Quiz 1.
Quick Quiz 2:
What happens if you mix and match?
For example, suppose you use rcu_read_lock() and
rcu_read_unlock() to delimit RCU read-side critical
sections, but then use call_rcu_bh() to post an
RCU callback?
Answer: If there happened to be no RCU read-side critical
sections delimited by rcu_read_lock_bh() and
rcu_read_unlock_bh() at the time call_rcu_bh()
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by rcu_read_lock()
and rcu_read_unlock() is vulnerable to this failure mode.
This vulnerability disappears in -rt kernels, where
RCU Classic and RCU BH both map onto a common implementation.
Back to Quick Quiz 2.
Quick Quiz 3:
What happens if you mix and match RCU Classic and RCU Sched?
Answer: In a non-PREEMPT or a PREEMPT kernel, mixing these
two works "by accident" because in those kernel builds, RCU Classic and RCU
Sched map to the same implementation.
However, this mixture is fatal in PREEMPT_RT builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
synchronize_sched() to return before the
RCU read-side critical section reached its rcu_read_unlock()
call.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired
by the need for preemptible RCU read-side critical sections.
Back to Quick Quiz 3.
Quick Quiz 4:
What happens if you mix and match Realtime RCU and RCU Classic?
Answer: That would be up to you, because you would have
to code up changes to the kernel to make such mixing possible.
Currently, any kernel running with RCU Classic cannot access
Realtime RCU and vice versa.
Back to Quick Quiz 4.
Quick Quiz 5:
Why do both SRCU and QRCU lack asynchronous call_srcu()
or call_qrcu() interfaces?
Answer: Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
synchronize_srcu() and synchronize_qrcu()
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.
Back to Quick Quiz 5.
Quick Quiz 6:
Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?
Answer: In principle, you can use
synchronize_srcu() with a given srcu_struct
within an SRCU read-side critical section that uses some other
srcu_struct.
In practice, however, doing this is almost certainly a bad idea.
In particular, the following could still result in deadlock:
idx = srcu_read_lock(&ssa);
synchronize_srcu(&ssb);
srcu_read_unlock(&ssa, idx);
/* . . . */
idx = srcu_read_lock(&ssb);
synchronize_srcu(&ssa);
srcu_read_unlock(&ssb, idx);
Back to Quick Quiz 6.
Quick Quiz 7:
Why doesn't list_del_rcu() poison both the next
and prev pointers?
Answer: Poisoning the next pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the prev
pointer, so it may safely be poisoned.
Back to Quick Quiz 7.
Quick Quiz 8:
Normally, any pointer subject to rcu_dereference() must
always be updated using rcu_assign_pointer().
What is an exception to this rule?
Answer: One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single rcu_assign_pointer() is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
rcu_assign_pointer(), though any such assignments that
happen after the structure is globally visible must use
rcu_assign_pointer().
However, unless this initialization code is on an impressively hot
code-path, it is probably wise to use rcu_assign_pointer()
anyway, even though it is in theory unnecessary.
It is all too easy for a "minor" change to invalidate your cherished
assumptions about the initialization happening privately.
Back to Quick Quiz 8.
Quick Quiz 9:
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
Answer: It can sometimes be difficult for automated
code checkers such as "sparse" (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the following:
rcu_read_lock();
preempt_disable();
p = rcu_dereference(global_pointer);
/* . . . */
preempt_enable();
rcu_read_unlock();
Is the rcu_dereference() primitive in an RCU Classic
or an RCU Sched critical section?
What would you have to do to figure this out?
Back to Quick Quiz 9.
Comments (4 posted)
Patches and updates
Kernel trees
Build system
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>