Brief items
The current development kernel is 2.6.33-rc6, released on January 29.
"Give it a go. Hopefully we've fixed a number of regressions, we're
getting to that stage of the release cycle where things mostly should 'just
work' and people who still see regressions should start making loud
noises." Full details can be found in the full changelog.
Stable updates: 2.6.32.7 and 2.6.27.45 were released on January 28. The 2.6.32.7
update is rather large, consisting of 98 patches, which Greg Kroah-Hartman
explains as follows: "This release is brought to you by the very
appreciated efforts of the
Debian, Gentoo, and Novell kernel teams, who spent a lot of time to
flush out patches that were in their trees to me for inclusion. Special
thanks goes to Ben Hutchings for doing a lot of this work." A
footnote in the 2.6.32.7 review
announcement makes it clear that Kroah-Hartman was the Gentoo
and Novell kernel team member responsible.
Ancient kernels: 2.4.37.8 was released on
January 31; it contains an e1000 security fix and a few other
updates. 2.4.37.9 followed
the next day with a fix for the e1000 fix.
It's really very simple: overcommit off you must have enough RAM
and swap to hold all allocations requested. Overcommit on - you
don't need this but if you do use more than is available on the
system something has to go.
It's kind of like banking overcommit off is proper banking, overcommit
on is modern western banking.
--
Alan Cox
Consider the fact that i get 1000 times more bugreports aided by
strace, which has 1000 times more overhead than even the slowest of
uprobes approaches.
This simple fact tell us that while performance matters, it is of
little use if good utility and a clean design is not there. (in
fact sane and clean design will almost automatically result in good
performance too down the line, but i digress.) Faster crap is still
crap.
--
Ingo Molnar
Forks aren't always great, but I honestly don't think of forks as
being a bad thing and I've tried to instill in Google the same
ethic.
In fact, I'd say that the various forks of Linux, and how the Linux
maintainers have roped back in some forks (and let others go on
their merry way) is what made the Linux kernel great and not just a
BSD rehash.
--
Chris DiBona
Kernel development news
By Jonathan Corbet
February 3, 2010
Readahead is the process of speculatively reading file data into the page
cache in the hope that it will be useful to an application in the near
future. When readahead works well, it can significantly improve the
performance of I/O bound applications by avoiding the need for those
applications to wait for data and by increasing I/O transfer size. On the
other hand, readahead risks making performance worse as well: if it guesses
wrong, scarce memory and I/O bandwidth will be wasted on data which will
never be used. So, as is the case with memory management in general,
readahead algorithms are both performance-critical and heavily based on
heuristics.
As is also generally the case with such code, few people dare to wander
into the readahead logic; it tends to be subtle and quick to anger. One of
those who dare is Wu Fengguang, who has worked on readahead a few times
over the years. His latest contribution is this set of patches which tries
to improve readahead performance in the general case while also making it
more responsive to low-memory situations.
The headline feature of this patch set is an increase in the maximum
readahead size from 128KB to 512KB. Given the size of today's files and
storage devices, 512KB may well seem a bit small. But there are costs to
readahead, including the amount of memory required to store the data and
the amount of I/O bandwidth required to read it. If a larger readahead
buffer causes other useful data to be paged out, it could cause a net loss
in system performance even if all of the readahead data proves to be
useful. Larger readahead
operations will occupy the storage device for longer, causing I/O latencies
to increase. And one should remember that there can be a readahead buffer
associated with every open file descriptor - of which there can be
thousands - in the system. Even a small increase in the amount of
readahead can have a large impact on the behavior of the system.
The 512KB number was reached by way of an extensive series of benchmark runs
using both rotating and solid-state storage devices. With rotating disks,
bumping the maximum readahead size to 512KB nearly tripled I/O
throughput with a modest increase in I/O latency; any further increases,
while increasing throughput again, caused latency increases that were
deemed to be unacceptable. On solid-state devices the throughput increase
was less (on a percentage basis) but still significant.
These numbers hold for a device with reasonable performance, though. A
typical USB thumb drive, not being a device with reasonable performance,
can run into real trouble with an increased readahead size. To address
this problem, the patch set puts a cap on the readahead window size for
small devices. For a 2MB device (assuming such a thing can be found),
readahead is limited to 4KB; for a 2GB drive, the limit is 128KB. Only at
32GB does the full 512KB readahead window take effect.
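The three data points quoted above (2MB gives 4KB, 2GB gives 128KB, 32GB gives 512KB) are consistent with a limit that grows as the square root of the device capacity. The following is a minimal sketch under that assumption; the formula sqrt(8 * capacity) is fitted to those three points and is not taken from the patch set itself, and the names isqrt_u64() and ra_cap_bytes() are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RA_MIN_BYTES (4u * 1024)    /* limit quoted for a 2MB device */
#define RA_MAX_BYTES (512u * 1024)  /* proposed maximum readahead size */

/* Integer square root: the largest r such that r * r <= n. */
uint64_t isqrt_u64(uint64_t n)
{
    uint64_t r = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= r + bit) {
            n -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/*
 * Hypothetical cap on the readahead window, in bytes, for a device of
 * the given capacity. ASSUMPTION: square-root scaling fitted to the
 * three sizes quoted in the text, clamped to [4KB, 512KB].
 */
uint64_t ra_cap_bytes(uint64_t capacity_bytes)
{
    uint64_t ra;

    assert(capacity_bytes <= UINT64_MAX / 8);   /* avoid overflow below */
    ra = isqrt_u64(8 * capacity_bytes);
    if (ra < RA_MIN_BYTES)
        ra = RA_MIN_BYTES;
    if (ra > RA_MAX_BYTES)
        ra = RA_MAX_BYTES;
    return ra;
}
```

A capacity-only rule like this cannot tell a slow thumb drive from a small but fast device, since capacity is its only input.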
This heuristic is not perfect. Jens Axboe protested that some solid-state devices are
relatively small in capacity, but they can be quite fast. Such devices may
not perform as well as they could with a larger readahead size.
Another part of this patch set is the "context readahead" code which tries
to prevent the system from performing more readahead than its memory can
handle. For a typical file stream with no memory contention, the contents
of the page cache can be visualized (within your editor's poor drawing
skills) like this:
Here, we are looking at a representation of a stream of pages containing
the file's data; the green pages are those which are in the page cache at
the moment. Several recently-consumed pages behind the offset have not yet
been evicted, and the full readahead window is waiting for the application
to get around to consuming it.
If memory is tight, though, we could find a situation more like this:
Because the system is scrambling for memory, it has been much more
aggressive about evicting this file's pages from the page cache. There is
much less history there, but, more importantly, a number of pages which
were brought in via readahead have been pushed back out before the
application was able to actually make use of them. This sort of thrashing
behavior is harmful to system performance; the readahead occupied memory
when it was needed elsewhere, and that data will have to be read a second time in the
near future. Clearly, when this sort of behavior is seen, the system
should be doing less readahead.
Thrashing behavior is easily detected; if pages which have already been
read in via readahead are missing when the application tries to actually
read them, things are going amiss. When that happens, the code will get an
estimate of the amount of memory it can safely use by counting the number
of history pages (those which have already been consumed by the
application) which remain in the page cache. If some history remains, the
number of history pages is taken as a guess for what the size of the
readahead window should be.
If, instead, there's no history at all, the readahead size is halved. In
this case, the readahead code will also carefully shift any readahead pages
which are still in memory to the head of the LRU list, making it less
likely that they will be evicted immediately prior to their use. The file
descriptor will be marked as "thrashed," causing the kernel to continue to
use the history size as a guide for the readahead window size in the
future. That, in turn, will cause the window to expand and contract as
memory conditions warrant.
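The resizing rule described above can be sketched in a few lines of C. This is a simplified model, not the code from the patch set: the structure and function names are invented, the one-page floor is an assumption, and the LRU manipulation is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file readahead state; not the kernel's structure. */
struct ra_state_sketch {
    size_t window_pages;   /* current readahead window, in pages */
    bool thrashed;         /* set once thrashing has been observed */
};

/*
 * Called when a page that readahead had already brought in turns out
 * to be missing from the page cache. history_pages is the number of
 * already-consumed pages of this file still found in the cache.
 */
void ra_handle_thrash(struct ra_state_sketch *ra, size_t history_pages)
{
    assert(ra != NULL);

    if (history_pages > 0)
        ra->window_pages = history_pages;   /* history is the estimate */
    else
        ra->window_pages /= 2;              /* no history: halve */

    if (ra->window_pages < 1)               /* ASSUMPTION: one-page floor */
        ra->window_pages = 1;

    /* ASSUMPTION: the mark is applied whichever branch was taken. */
    ra->thrashed = true;
}
```

Starting from a 32-page window, a thrash with eight history pages left shrinks the window to eight pages; a second thrash with no history at all halves it to four.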
Readahead changes can be hard to get into the mainline. The heuristics can
be tricky, and, as Linus has noted, it can
be easy to optimize the system for a subset of workloads:
The problem is, it's often easier to test/debug the "good" cases,
ie the cases where we _want_ read-ahead to trigger. So that
probably means that we have a tendency to read-ahead too
aggressively, because those cases are the ones where people can
most easily look at it and say "yeah, this improves throughput of a
'dd bs=8192'".
The stated goal of this patch set is to make readahead more aggressive by
increasing the maximum size of the readahead window. But, in truth, much
of the work goes in the other direction, constraining the readahead
mechanism in situations where too much readahead can do harm. Whether
these new heuristics reliably improve performance will not be known until a
considerable amount of benchmarking has been done.
By Jonathan Corbet
February 2, 2010
As of this writing, there have not yet been any distributor updates for the
vulnerability which will become known as CVE-2010-0307. This particular
bug does not (as far as your editor knows) allow a complete takeover of a
system, but it can be used for
denial-of-service attacks, or in a situation where an attacker with
unprivileged local access wishes to force a reboot. It is also an
illustration of the hazards which come with old and tricky code.
Mathias Krause reported the problem at the
end of January. It seems that, on an x86_64 system, a kernel panic can be
forced by trying (and failing) to exec() a 64-bit program while
running in 32-bit mode, then triggering a core dump. There does not seem
to be a way to exploit this bug to run arbitrary code - but those who would
take over systems have shown enough creativity in situations like this that
one can never be sure. Even without that, though, the ability to take any
64-bit x86 system down is not a good thing. Current kernels are affected,
as are older ones; your editor is not aware of anybody having taken the
time to determine when the problem first appeared, but Mathias has shown that
2.6.26 kernels contained the bug.
The execve() system call is the means by which a process stops
running one program and starts running a new one. It must clean up most
(but not all) of the state associated with the old program, resetting
things for the new one. In this process, there is a "point of no return":
the place where the system call is committed to making the change and can
no longer back out. Before this point, any sort of failure should lead to
an error return from the system call (which otherwise is not expected to
return at all); afterward, the only recourse is to kill the process
outright.
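The error-return half of that contract is visible from user space. The sketch below uses only the standard execve() call; the helper name try_exec() is invented for illustration. It shows that a failure detected before the point of no return leaves the caller running, with errno describing what went wrong.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try to replace the current program with the one at path. On success
 * this function never returns. If the kernel rejects the request
 * before its point of no return, execve() returns -1, the caller is
 * still running, and the errno value is handed back.
 */
int try_exec(const char *path)
{
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    assert(path != NULL);
    execve(path, argv, envp);
    return errno;   /* reached only when execve() failed */
}
```

Calling try_exec() on a path that does not exist should yield ENOENT, and the calling program carries on as if nothing had happened, which is exactly the behavior the kernel has to preserve for every failure before the point of no return.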
Sometime after the point of no return, execve() must adjust the
"personality" of the process to match the new executable image. For
example, a 64-bit process switching to a 32-bit image must go into the
32-bit personality. In the past, personalities have also been used to
emulate other operating environments - running SYSV binaries, for example.
The personality changes a number of aspects of the environment the program
runs in, though, as we'll see, fewer than it once did.
In the past, personality changes have included filesystem namespace
changes. That was necessary because the process of starting the new
executable could require looking up other images, such as an "interpreter"
image to run the new program. The lookup clearly had to happen prior to
the point of no return; if the lookup fails then the system call should
fail. So some aspects of the new image's environment had to be present
while the process was still running in the context of the old image.
The solution, at the time, was to put some brutal hacks into the low-level
SET_PERSONALITY() macro. This macro's job is to switch the
process to a new personality, but, post-hack, it no longer did that.
Instead, it would make the namespace changes, but leave most of the
environment unchanged, setting the special TIF_ABI_PENDING task
flag to remind the kernel that, at a later point, it needed to complete the
personality change. Over time, the namespace changes were removed from the
kernel, but this two-step personality switch mechanism remained.
This hackery allowed SET_PERSONALITY() to be called before the
point of no return without breaking the process of tearing down the old
image. What was missing, though, was any mechanism for fully restoring the
old personality should things fail after the SET_PERSONALITY()
call. In effect, that call became the real point of no return,
since the kernel had no way of going back to how things were before.
There aren't too many ways that execve() could fail in the window
between the SET_PERSONALITY() call and the official point of no
return. But one is all it takes, and one easily accessible failure mode is
an inability to find the "interpreter" for the new image. The interpreter
need not be an executable; it's really the execution environment as a
whole. As it happens, there's no means by which a 32-bit process can run a
64-bit image; trying to do so leads to a failure in just the wrong part of
the execve() call. Control will return to the calling program,
but with a partially-corrupted personality setup.
As it happens, the most common response to an execve() failure is
to inform the user and exit; the calling program wasn't expecting to be
running any more, so it will normally just bail out. So the schizophrenic
personality it's running under will likely never be noticed. But if the
calling program instead takes a signal which forces a core dump, the
confused personality information will lead to an equally confused kernel and a
panic.
In summary, what we have here is a combination of tricky code, made worse
by inter-architecture compatibility concerns, implementing behavior which
is no longer needed - and doing it wrong. For added fun, it's worth noting
that this problem was reported in December,
but it fell through the cracks and remained unfixed.
The initial solution proposed by Linus was
to simply remove the early SET_PERSONALITY() call. After a bit of
discussion, though, Linus and H. Peter Anvin concluded that it was better
to fix the code for real. The result was a pair of patches, the
first of which splits flush_old_exec() (which contained the point
of no return deeply within) into two functions meant to run before and
after that point. This patch also gets rid of the early
SET_PERSONALITY() call. The
second patch then eliminates the TIF_ABI_PENDING hack, simply
doing the full personality change at the point of no return.
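The difference between the two orderings can be shown with a toy model. Nothing below is kernel code: the structure, the abi_pending field (a stand-in for TIF_ABI_PENDING), and both functions are invented for illustration, and the "interpreter lookup" is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum abi_sketch { ABI_32BIT, ABI_64BIT };

/* Toy task state; abi_pending stands in for TIF_ABI_PENDING. */
struct task_sketch {
    enum abi_sketch abi;   /* ABI the task is actually set up for */
    bool abi_pending;      /* "finish the personality switch later" */
};

/* Old ordering: start the switch before the fallible lookup. */
int exec_old_order(struct task_sketch *t, enum abi_sketch new_abi,
                   bool interp_found)
{
    assert(t != NULL);
    if (new_abi != t->abi)
        t->abi_pending = true;   /* early, partial personality change */
    if (!interp_found)
        return -1;               /* error return; nothing clears the flag */
    /* point of no return */
    t->abi = new_abi;
    t->abi_pending = false;
    return 0;
}

/* New ordering: change nothing until every fallible step has passed. */
int exec_new_order(struct task_sketch *t, enum abi_sketch new_abi,
                   bool interp_found)
{
    assert(t != NULL);
    if (!interp_found)
        return -1;               /* task state is untouched */
    /* point of no return: do the whole personality change here */
    t->abi = new_abi;
    t->abi_pending = false;
    return 0;
}
```

In the old ordering, a failed exec of a 64-bit image from a 32-bit task returns with abi_pending still set; in the new ordering the same failure leaves the task exactly as it was.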
These changes were merged just prior to the release of 2.6.33-rc6. This is
a fairly significant pair of patches to put into the core kernel at this
late stage in the 2.6.33 development cycle. And, indeed, they have caused
some problems, especially with non-x86 architectures. Distributors looking
to backport this fix into older kernels may well find themselves looking
for a way to simplify it. But security fixes are important, and fixes
which get rid of cobweb-encrusted code which could be hiding other problems
are even better. The remaining problems should be cleaned up in short
order, and the 2.6.33 kernel will be better for it.
February 1, 2010
This article was contributed by Paul McKenney
Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU improves scalability
by allowing readers to execute concurrently with writers.
In contrast, conventional locking primitives require that readers
wait for ongoing writers and vice versa.
RCU ensures read coherence by
maintaining multiple versions of data structures and ensuring that they are not
freed until all pre-existing read-side critical sections complete.
RCU relies on efficient and scalable mechanisms for publishing
and reading new versions of an object, and also for deferring the collection
of old versions.
These mechanisms distribute the work among read and
update paths in such a way as to make read paths extremely fast. In some
cases (non-preemptable kernels), RCU's read-side primitives have zero
overhead.
RCU updates can be expensive, so RCU is in general best-suited to
read-mostly data structures.
RCU readers execute in RCU read-side critical sections
that begin with rcu_read_lock() and end with
rcu_read_unlock().
The Linux kernel has
multiple flavors of RCU,
and each flavor uses its own flavor of rcu_read_lock() and
rcu_read_unlock().
Anything outside of an RCU read-side critical section is a
quiescent state, and a grace period is any time
period in which every CPU (or task, for real-time RCU implementations)
passes through at least one quiescent state.
Taken together, these rules guarantee that any RCU read-side critical section
that is executing at the beginning of a given grace period must
complete before that grace period can be permitted to end.
This guarantee is surprisingly useful, allowing RCU to act as a
high-performance scalable replacement for reader-writer locking,
among other things.
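The grace-period definition above can be illustrated with a toy model in which each CPU counts the quiescent states it has passed through. This is a sketch for intuition only: the names are invented, there is no concurrency, and real RCU implementations are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4

/* Per-CPU count of quiescent states passed through so far. */
unsigned long qs_count[NR_CPUS_SKETCH];

/* Snapshot of the per-CPU counts taken when a grace period starts. */
struct gp_sketch {
    unsigned long snap[NR_CPUS_SKETCH];
};

/* A CPU reports that it is outside any read-side critical section. */
void note_quiescent_state(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    qs_count[cpu]++;
}

/* Begin a grace period by recording every CPU's current count. */
void gp_start(struct gp_sketch *gp)
{
    assert(gp != NULL);
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        gp->snap[cpu] = qs_count[cpu];
}

/* The grace period may end once every CPU's count has advanced. */
bool gp_done(const struct gp_sketch *gp)
{
    assert(gp != NULL);
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        if (qs_count[cpu] == gp->snap[cpu])
            return false;   /* this CPU might still be in a reader */
    return true;
}
```

Any reader that was running when gp_start() took its snapshot must finish before its CPU can report a quiescent state, so gp_done() cannot return true while such a reader is still executing.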
But this guarantee is sufficient only for systems
with sequentially consistent memory ordering, which are quite rare.
Even strongly ordered architectures such as x86 or s390
will allow later reads to execute ahead of prior writes, and compilers
can reorder code quite freely.
Therefore, RCU needs an additional
publish-subscribe
guarantee, which is provided by rcu_assign_pointer()
and rcu_dereference().
Uses of rcu_assign_pointer() are typically protected
by the update-side lock, and uses of rcu_dereference()
must typically be within an RCU-read-side critical section.
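For readers more familiar with user-space code, the publish-subscribe guarantee can be approximated with C11 atomics. This is an analogy, not the kernel implementation: a release store plays the role of rcu_assign_pointer(), and an acquire load stands in for rcu_dereference() (which needs only the weaker dependency ordering, so acquire is a conservative substitute). The structure and function names are invented, and read-side critical sections and deferred freeing are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo_sketch {
    int a;
    int b;
};

/* The shared pointer that readers subscribe to; NULL until published. */
_Atomic(struct foo_sketch *) global_foo;

/*
 * Updater: initialize the structure completely, then publish it. The
 * release store keeps the initializing writes from being reordered
 * after the pointer becomes visible.
 */
void publish_foo(struct foo_sketch *p, int a, int b)
{
    assert(p != NULL);
    p->a = a;
    p->b = b;
    atomic_store_explicit(&global_foo, p, memory_order_release);
}

/*
 * Reader: fetch the pointer once, then use only that copy. Returns -1
 * if nothing has been published yet, otherwise the sum of the fields.
 */
int read_foo_sum(void)
{
    struct foo_sketch *p =
        atomic_load_explicit(&global_foo, memory_order_acquire);

    if (p == NULL)
        return -1;
    return p->a + p->b;
}
```

A reader that sees the new pointer is thereby guaranteed to see the initialized fields as well, which is the property the grace-period guarantee alone does not provide on weakly ordered systems.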
Unfortunately for this simple rule on use of
rcu_dereference(), there is quite a bit of code that
is used by both RCU readers and updaters.
A more accurate rule is that rcu_dereference() must
either be:
- within an RCU read-side critical section,
- protected by the update-side lock, or
- inaccessible to RCU readers.
The remainder of this article is as follows:
- Why Bother With lockdep-Enabling RCU?
- RCU API for lockdep.
- RCU lockdep Usage Examples.
- RCU lockdep Implementation.
- RCU API for lockdep: Quick Reference.
These sections are followed by Conclusions and Future Directions
and Answers to Quick Quizzes.
Compliance with the usage rule for rcu_dereference()
is verified by manual code inspection.
And this manual code inspection worked great back in 2.6.10,
when there were a grand total of 38 occurrences of
rcu_dereference().
However, given that there are now more than 350 occurrences of
rcu_dereference() in 2.6.32, it appears the day
of sole reliance on manual code inspection is long over.
Additional evidence on this point
was provided by Thomas Gleixner when he trained his eagle eye on a
few rcu_dereference() instances in mainline.
It is clearly time to bring lockdep-style checking to
rcu_dereference().
Unfortunately, because rcu_dereference() can be
used in such a wide variety of environments, simply adding lockdep
checking to the current API fails, producing reams of false positives
while ignoring potentially dangerous bugs.
Quick Quiz 1:
How can you be so sure that there is no clever lockdep-check
strategy given the current API?
Some major goals of any API change are to minimize impact on existing
code, patches in flight, and ongoing debugging efforts.
Because the most common use of rcu_dereference()
is for accesses that are strictly within a vanilla RCU read-side
critical section, rcu_dereference() should check
only for being in a vanilla RCU read-side critical section.
This minimizes impact on existing code, including patches in flight.
This means that other rcu_dereference() API members
must be created.
However, these other API members cannot be defined in terms
of rcu_dereference() because these other members
must be usable outside of vanilla RCU read-side critical sections.
Therefore, a raw interface named rcu_dereference_raw()
inherits the implementation that used to belong to
rcu_dereference().
In other words, if you “know what you are doing”, just use
rcu_dereference_raw() and lockdep will never complain about
it.
(But you just might hear a few questions from me!)
The underlying API for the other forms of rcu_dereference()
is rcu_dereference_check(), which takes two arguments.
The first argument is an RCU-protected pointer, the same as that
of rcu_dereference() and the new
rcu_dereference_raw().
The second argument is a boolean expression that evaluates to zero if there is
a problem, in which case, if RCU lockdep is enabled, you will get
a WARN_ON_ONCE() on your console log.
The other dereferencing APIs are rcu_dereference(),
rcu_dereference_sched(), rcu_dereference_bh(),
and srcu_dereference(), each of which checks to make sure that
it is being used in the corresponding flavor of RCU read-side critical
section, giving your console log a WARN_ON_ONCE() otherwise
(again, assuming that RCU lockdep is enabled).
All of these take a single RCU-protected pointer as an argument,
except for srcu_dereference(), which also takes a pointer to
a struct srcu_struct.
This additional argument permits srcu_dereference() to
distinguish among multiple SRCU domains.
These four dereferencing APIs use corresponding APIs that check
for being in the corresponding flavor of RCU read-side critical
section:
rcu_read_lock_held(),
rcu_read_lock_bh_held(),
rcu_read_lock_sched_held(), and
srcu_read_lock_held().
Of these, only srcu_read_lock_held() takes an argument,
namely a struct srcu_struct, again permitting distinguishing
among multiple SRCU domains.
The prototypical use of these new APIs is as follows:
    rcu_read_lock();
    p = rcu_dereference(gp->data);
    do_something_with(p);
    rcu_read_unlock();
The alert reader may have noticed that this is no different from
the old usage of these APIs.
This situation is strictly intentional.
Similar code may be written for other flavors of RCU, for example:
    idx = srcu_read_lock(sp);
    p = srcu_dereference(gp->data, sp);
    do_something_with(p);
    srcu_read_unlock(sp, idx);
These examples work well when used inside RCU read-side critical
sections, but fail completely for code that is invoked both by
readers and updaters.
Although we could insert artificial RCU read-side critical sections
in updaters, these can cause much confusion.
Instead, we use rcu_dereference_check(), for example,
in the files_fdtable() macro:
    #define files_fdtable(files) \
            (rcu_dereference_check((files)->fdt, \
                                   rcu_read_lock_held() || \
                                   lockdep_is_held(&(files)->file_lock) || \
                                   atomic_read(&(files)->count) == 1))
This statement fetches the RCU-protected pointer
(files)->fdt, but requires that
files_fdtable() be invoked
within an RCU read-side critical section,
with (files)->file_lock held, or
with the (files)->count reference counter equal to one
(in other words, if inaccessible to RCU readers).
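The behavior of such a compound condition can be tried out in a user-space toy model. Everything here is invented for illustration: the toy_ names, the nesting counter standing in for lockdep's notion of an RCU read-side critical section, and the boolean standing in for lockdep_is_held(). Unlike WARN_ON_ONCE(), this sketch counts every failed check rather than reporting only the first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

int toy_reader_nesting;   /* depth of toy read-side critical sections */
int toy_warnings;         /* number of failed checks seen so far */

void toy_read_lock(void)
{
    toy_reader_nesting++;
}

void toy_read_unlock(void)
{
    assert(toy_reader_nesting > 0);
    toy_reader_nesting--;
}

bool toy_read_lock_held(void)
{
    return toy_reader_nesting > 0;
}

/* Record a warning if cond is false; return the pointer either way. */
void *toy_dereference_check(void *p, bool cond)
{
    if (!cond)
        toy_warnings++;
    return p;
}

struct toy_files {
    void *fdt;
    bool file_lock_held;   /* stand-in for lockdep_is_held(&file_lock) */
    int count;             /* reference count */
};

/* Same three-way condition as files_fdtable(), in toy form. */
void *toy_files_fdtable(struct toy_files *files)
{
    assert(files != NULL);
    return toy_dereference_check(files->fdt,
                                 toy_read_lock_held() ||
                                 files->file_lock_held ||
                                 files->count == 1);
}
```

With a reference count of two, no lock held, and no read-side critical section, the access is flagged; satisfying any one of the three conditions silences the check, and the pointer is returned in every case.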
Quick Quiz 2:
Suppose that an access to an RCU-protected pointer gp
must be either inside an RCU-bh read-side critical section, an
SRCU read-side critical section for SRCU domain sp, or
with mylock held.
How do you code this?
The basic change underlying the RCU lockdep implementation is
a set of per-RCU-flavor lockdep maps (in the case of SRCU, a per-SRCU-domain
lockdep map, ->dep_map, in each struct srcu_struct):
    extern struct lockdep_map rcu_lock_map;
    # define rcu_read_acquire() \
            lock_acquire(&rcu_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release() lock_release(&rcu_lock_map, 1, _THIS_IP_)

    extern struct lockdep_map rcu_bh_lock_map;
    # define rcu_read_acquire_bh() \
            lock_acquire(&rcu_bh_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release_bh() lock_release(&rcu_bh_lock_map, 1, _THIS_IP_)

    extern struct lockdep_map rcu_sched_lock_map;
    # define rcu_read_acquire_sched() \
            lock_acquire(&rcu_sched_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release_sched() \
            lock_release(&rcu_sched_lock_map, 1, _THIS_IP_)

    # define srcu_read_acquire(sp) \
            lock_acquire(&(sp)->dep_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define srcu_read_release(sp) \
            lock_release(&(sp)->dep_map, 1, _THIS_IP_)
These are used to implement
rcu_read_lock_held(),
rcu_read_lock_bh_held(),
rcu_read_lock_sched_held(),
and
srcu_read_lock_held():
    static inline int rcu_read_lock_held(void)
    {
            if (debug_locks)
                    return lock_is_held(&rcu_lock_map);
            return 1;
    }

    static inline int rcu_read_lock_bh_held(void)
    {
            if (debug_locks)
                    return lock_is_held(&rcu_bh_lock_map);
            return 1;
    }

    static inline int rcu_read_lock_sched_held(void)
    {
            int lockdep_opinion = 0;

            if (debug_locks)
                    lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
            return lockdep_opinion || preempt_count() != 0;
    }

    static inline int srcu_read_lock_held(struct srcu_struct *sp)
    {
            if (debug_locks)
                    return lock_is_held(&sp->dep_map);
            return 1;
    }
In each case, if lockdep is enabled, we consult the corresponding
lockdep_map, otherwise, we (conservatively) guess that
we are in the appropriate RCU read-side critical section.
This permits
WARN_ON_ONCE(!rcu_read_lock_held())
to be used freely.
Quick Quiz 3:
How do these work if lockdep is not configured at all?
The non-checking variant of rcu_dereference() is
rcu_dereference_raw(), which is defined as follows:
    #define rcu_dereference_raw(p) ({ \
            typeof(p) _________p1 = ACCESS_ONCE(p); \
            smp_read_barrier_depends(); \
            (_________p1); \
    })
Then
rcu_dereference_check() is implemented in terms
of
rcu_dereference_raw() as follows:
    #define rcu_dereference_check(p, c) \
    ({ \
            if (debug_locks) \
                    WARN_ON_ONCE(!(c)); \
            rcu_dereference_raw(p); \
    })
However, if lockdep is not configured, the following alternative
implementation is used:
    #define rcu_dereference_check(p, c) rcu_dereference_raw(p)
Quick Quiz 4:
Why not include a ((void)(c)) to the non-lockdep version
of rcu_dereference_check() in order to detect compiler
errors in the “c” argument?
The remainder of the primitives are defined as follows:
    #define rcu_dereference(p) \
            rcu_dereference_check(p, rcu_read_lock_held())

    #define rcu_dereference_bh(p) \
            rcu_dereference_check(p, rcu_read_lock_bh_held())

    #define rcu_dereference_sched(p) \
            rcu_dereference_check(p, rcu_read_lock_sched_held())

    #define srcu_dereference(p, sp) \
            rcu_dereference_check(p, srcu_read_lock_held(sp))
Quick Quiz 5:
What are the non-lockdep definitions of these primitives?
| Name | CONFIG_PROVE_RCU | !CONFIG_PROVE_RCU |
| rcu_dereference(p) | returns p, warns if not in RCU read-side critical section | returns p, never warns |
| rcu_dereference_bh(p) | returns p, warns if not in RCU-bh read-side critical section | returns p, never warns |
| rcu_dereference_sched(p) | returns p, warns if not in RCU-sched read-side critical section | returns p, never warns |
| srcu_dereference(p, sp) | returns p, warns if not in SRCU read-side critical section for sp | returns p, never warns |
| rcu_dereference_check(p, c) | returns p, warns if !c | returns p, never warns |
| rcu_dereference_raw(p) | returns p, never warns | returns p, never warns |
| | | |
| rcu_read_lock_held() | non-zero if in RCU read-side critical section | always non-zero |
| rcu_read_lock_bh_held() | non-zero if in RCU-bh read-side critical section | always non-zero |
| rcu_read_lock_sched_held() | non-zero if in RCU-sched read-side critical section | always non-zero |
| srcu_read_lock_held(sp) | non-zero if in SRCU read-side critical section for sp | always non-zero |
These are early days for the lockdep-enabled RCU primitives.
They have been applied to some of the networking, VFS, scheduler,
radix tree, and IDR code.
Thus far, things are going well, but here are some possible future
directions:
- The RCU list macros, radix tree, and IDR implementations
currently use
rcu_dereference_raw().
At some point, it may be necessary to produce checked
variants.
Given that this will require yet more APIs, need must
be demonstrated before the API explosion is undertaken.
list_for_each_rcu(), list_for_each_rcu_bh(),
list_for_each_rcu_sched(),
list_for_each_srcu(),
list_for_each_rcu_check(), and
list_for_each_rcu_raw(), anyone?
- Thus far, it has been easy to generate
rcu_dereference_check()'s boolean expressions.
Nevertheless, I am a bit nervous about code that is called
both in RCU read-side critical sections and by initialization
code.
In some cases, it might be difficult to detect the initialization
case, but such cases will be dealt with as they come up.
- The
rcu_assign_pointer() primitive remains unchecked.
It is used primarily under locks, which are quite a bit more
familiar, and for which there is already lockdep available.
Regardless of how the future unfolds, lockdep-enabled RCU should
be very helpful in detecting RCU-usage bugs.
Acknowledgments
I am grateful to Peter Zijlstra and Thomas Gleixner for sharing their
experiences applying lockdep checking to rcu_dereference().
I owe thanks to Eric Dumazet for helping me work out how to handle some
difficult rcu_dereference() instances in the networking code,
to Ingo Molnar for much encouragement and advice,
and to Kathy Bennett for her support of this effort.
This work represents the view of the author and does not necessarily
represent the view of IBM.
Quick Quiz 1:
How can you be so sure that there is no clever lockdep-check
strategy given the current API?
Answer:
Because if there was a clever lockdep-check strategy given the current
RCU API, Peter Zijlstra would have implemented it!
If you know of one, please don't keep it a secret — but please
do yourself the favor of reading the rest of this article before
deciding whether or not you do have a solution.
Quick Quiz 2:
Suppose that an access to an RCU-protected pointer gp
must be either inside an RCU-bh read-side critical section, an
SRCU read-side critical section for SRCU domain sp, or
with mylock held.
How do you code this?
Answer:
One approach is as follows:
    rcu_dereference_check(gp,
                          rcu_read_lock_bh_held() ||
                          srcu_read_lock_held(sp) ||
                          lockdep_is_held(&mylock));
Quick Quiz 3:
How do these work if lockdep is not configured at all?
Answer:
As follows:
    static inline int rcu_read_lock_held(void)
    {
            return 1;
    }

    static inline int rcu_read_lock_bh_held(void)
    {
            return 1;
    }

    static inline int rcu_read_lock_sched_held(void)
    {
            return preempt_count() != 0;
    }

    static inline int srcu_read_lock_held(struct srcu_struct *sp)
    {
            return 1;
    }
Quick Quiz 4:
Why not include a ((void)(c)) to the non-lockdep version
of rcu_dereference_check() in order to detect compiler
errors in the “c” argument?
Answer:
Because lockdep_is_held() is defined only in lockdep
builds of the kernel.
Therefore, ((void)(c)) would give you lots of false
alarms.
So, just make sure that you do at least one build-and-test cycle
with lockdep enabled.
Quick Quiz 5:
What are the non-lockdep definitions of these primitives?
Answer:
They are exactly the same as the lockdep definitions!
The implementations of rcu_dereference_check()
remove the need for duplicate definitions for
rcu_dereference(), rcu_dereference_bh(),
rcu_dereference_sched(), and srcu_dereference().
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
- Bartlomiej Zolnierkiewicz: ide2libata .
(January 30, 2010)
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet