Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc6, released on January 29. "Give it a go. Hopefully we've fixed a number of regressions, we're getting to that stage of the release cycle where things mostly should 'just work' and people who still see regressions should start making loud noises." Full details can be found in the full changelog.

Stable updates: 2.6.32.7 and 2.6.27.45 were released on January 28. The 2.6.32.7 update is rather large, consisting of 98 patches, which Greg Kroah-Hartman explains as follows: "This release is brought to you by the very appreciated efforts of the Debian, Gentoo, and Novell kernel teams, who spent a lot of time to flush out patches that were in their trees to me for inclusion. Special thanks goes to Ben Hutchings for doing a lot of this work." A footnote in the 2.6.32.7 review announcement makes it clear that Kroah-Hartman was the Gentoo and Novell kernel team member responsible.

Ancient kernels: 2.4.37.8 was released on January 31; it contains an e1000 security fix and a few other updates. 2.4.37.9 followed the next day with a fix for the e1000 fix.

Quotes of the week

It's really very simple: overcommit off you must have enough RAM and swap to hold all allocations requested. Overcommit on - you don't need this but if you do use more than is available on the system something has to go.

It's kind of like banking overcommit off is proper banking, overcommit on is modern western banking.

-- Alan Cox

Consider the fact that i get 1000 times more bugreports aided by strace, which has 1000 times more overhead than even the slowest of uprobes approaches.

This simple fact tell us that while performance matters, it is of little use if good utility and a clean design is not there. (in fact sane and clean design will almost automatically result in good performance too down the line, but i digress.) Faster crap is still crap.

-- Ingo Molnar

Forks aren't always great, but I honestly don't think of forks as being a bad thing and I've tried to instill in Google the same ethic.

In fact, I'd say that the various forks of Linux, and how the Linux maintainers have roped back in some forks (and let others go on their merry way) is what made the Linux kernel great and not just a BSD rehash.

-- Chris DiBona

Kernel development news

Improving readahead

By Jonathan Corbet
February 3, 2010
Readahead is the process of speculatively reading file data into the page cache in the hope that it will be useful to an application in the near future. When readahead works well, it can significantly improve the performance of I/O bound applications by avoiding the need for those applications to wait for data and by increasing I/O transfer size. On the other hand, readahead risks making performance worse as well: if it guesses wrong, scarce memory and I/O bandwidth will be wasted on data which will never be used. So, as is the case with memory management in general, readahead algorithms are both performance-critical and heavily based on heuristics.

As is also generally the case with such code, few people dare to wander into the readahead logic; it tends to be subtle and quick to anger. One of those who dare is Wu Fengguang, who has worked on readahead a few times over the years. His latest contribution is this set of patches which tries to improve readahead performance in the general case while also making it more responsive to low-memory situations.

The headline feature of this patch set is an increase in the maximum readahead size from 128KB to 512KB. Given the size of today's files and storage devices, 512KB may well seem a bit small. But there are costs to readahead, including the amount of memory required to store the data and the amount of I/O bandwidth required to read it. If a larger readahead buffer causes other useful data to be paged out, it could cause a net loss in system performance even if all of the readahead data proves to be useful. Larger readahead operations will occupy the storage device for longer, causing I/O latencies to increase. And one should remember that there can be a readahead buffer associated with every open file descriptor - of which there can be thousands - in the system. Even a small increase in the amount of readahead can have a large impact on the behavior of the system.

The 512KB figure was reached by way of an extensive series of benchmark runs using both rotating and solid-state storage devices. With rotating disks, bumping the maximum readahead size to 512KB nearly tripled I/O throughput with a modest increase in I/O latency; any further increases, while increasing throughput again, caused latency increases that were deemed to be unacceptable. On solid-state devices the throughput increase was smaller (on a percentage basis) but still significant.

These numbers hold for a device with reasonable performance, though. A typical USB thumb drive, not being a device with reasonable performance, can run into real trouble with an increased readahead size. To address this problem, the patch set puts a cap on the readahead window size for small devices. For a 2MB device (assuming such a thing can be found), readahead is limited to 4KB; for a 2GB drive, the limit is 128KB. Only at 32GB does the full 512KB readahead window take effect.
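The three data points above (2MB gives 4KB, 2GB gives 128KB, 32GB gives 512KB) are consistent with a limit that grows with the square root of the device size. The sketch below encodes that reading only; it is an assumption drawn from those numbers rather than the code in the patch set, and the names ra_limit_kb() and isqrt_u64() are hypothetical.

```c
#include <stdint.h>

/* Integer square root, computed one bit pair at a time. */
static uint64_t isqrt_u64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/*
 * Hypothetical cap, in KB, for a device of device_mb megabytes:
 * 4KB for a 2MB device, scaled by sqrt(size / 2MB), and clamped
 * to the 4KB..512KB range described in the text.
 */
static uint64_t ra_limit_kb(uint64_t device_mb)
{
    uint64_t limit = 4 * isqrt_u64(device_mb / 2);

    if (limit < 4)
        limit = 4;
    if (limit > 512)
        limit = 512;
    return limit;
}
```

Under this rule, quadrupling the device size doubles the readahead limit until the 512KB ceiling is reached.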

This heuristic is not perfect. Jens Axboe protested that some solid-state devices are relatively small in capacity, but they can be quite fast. Such devices may not perform as well as they could with a larger readahead size.

Another part of this patch set is the "context readahead" code which tries to prevent the system from performing more readahead than its memory can handle. For a typical file stream with no memory contention, the contents of the page cache can be visualized (within your editor's poor drawing skills) like this:

[Readahead diagram]

Here, we are looking at a representation of a stream of pages containing the file's data; the green pages are those which are in the page cache at the moment. Several recently-consumed pages behind the offset have not yet been evicted, and the full readahead window is waiting for the application to get around to consuming it.

If memory is tight, though, we could find a situation more like this:

[Readahead diagram]

Because the system is scrambling for memory, it has been much more aggressive about evicting this file's pages from the page cache. There is much less history there, but, more importantly, a number of pages which were brought in via readahead have been pushed back out before the application was able to actually make use of them. This sort of thrashing behavior is harmful to system performance; the readahead occupied memory when it was needed elsewhere, and that data will have to be read a second time in the near future. Clearly, when this sort of behavior is seen, the system should be doing less readahead.

Thrashing behavior is easily detected; if pages which have already been read in via readahead are missing when the application tries to actually read them, things are going amiss. When that happens, the code will get an estimate of the amount of memory it can safely use by counting the number of history pages (those which have already been consumed by the application) which remain in the page cache. If some history remains, the number of history pages is taken as a guess for what the size of the readahead window should be.

If, instead, there's no history at all, the readahead size is halved. In this case, the readahead code will also carefully shift any readahead pages which are still in memory to the head of the LRU list, making it less likely that they will be evicted immediately prior to their use. The file descriptor will be marked as "thrashed," causing the kernel to continue to use the history size as a guide for the readahead window size in the future. That, in turn, will cause the window to expand and contract as memory conditions warrant.
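The window-sizing rule described in the last two paragraphs can be sketched in a few lines. This is a simplified userspace model, not the kernel's code: the structure, its field names, and ra_handle_thrash() are all hypothetical, and the LRU adjustment is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file readahead state; field names are illustrative. */
struct ra_state {
    size_t window_pages;   /* current readahead window size, in pages */
    bool thrashed;         /* readahead pages were evicted before use */
};

/*
 * Called when a page that readahead had already brought in turns out
 * to be missing. history_pages is the number of already-consumed pages
 * that still remain in the page cache.
 */
static void ra_handle_thrash(struct ra_state *ra, size_t history_pages)
{
    if (history_pages > 0)
        ra->window_pages = history_pages;   /* surviving history is the guess */
    else if (ra->window_pages > 1)
        ra->window_pages /= 2;              /* no history at all: halve */
    ra->thrashed = true;                    /* keep consulting history later */
}
```

For example, a 128-page window with 40 surviving history pages shrinks to 40 pages; a later thrash with no history left halves that to 20.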

Readahead changes can be hard to get into the mainline. The heuristics can be tricky, and, as Linus has noted, it can be easy to optimize the system for a subset of workloads:

The problem is, it's often easier to test/debug the "good" cases, ie the cases where we _want_ read-ahead to trigger. So that probably means that we have a tendency to read-ahead too aggressively, because those cases are the ones where people can most easily look at it and say "yeah, this improves throughput of a 'dd bs=8192'".

The stated goal of this patch set is to make readahead more aggressive by increasing the maximum size of the readahead window. But, in truth, much of the work goes in the other direction, constraining the readahead mechanism in situations where too much readahead can do harm. Whether these new heuristics reliably improve performance will not be known until a considerable amount of benchmarking has been done.

The x86_64 DOS hole

By Jonathan Corbet
February 2, 2010
As of this writing, there have not yet been any distributor updates for the vulnerability which will become known as CVE-2010-0307. This particular bug does not (as far as your editor knows) allow a complete takeover of a system, but it can be used for denial-of-service attacks, or in a situation where an attacker with unprivileged local access wishes to force a reboot. It is also an illustration of the hazards which come with old and tricky code.

Mathias Krause reported the problem at the end of January. It seems that, on an x86_64 system, a kernel panic can be forced by trying (and failing) to exec() a 64-bit program while running in 32-bit mode, then triggering a core dump. There does not seem to be a way to exploit this bug to run arbitrary code - but those who would take over systems have shown enough creativity in situations like this that one can never be sure. Even without that, though, the ability to take any 64-bit x86 system down is not a good thing. Current kernels are affected, as are older ones; your editor is not aware of anybody having taken the time to determine when the problem first appeared, but Mathias has shown that 2.6.26 kernels contained the bug.

The execve() system call is the means by which a process stops running one program and starts running a new one. It must clean up most (but not all) of the state associated with the old program, resetting things for the new one. In this process, there is a "point of no return": the place where the system call is committed to making the change and can no longer back out. Before this point, any sort of failure should lead to an error return from the system call (which otherwise is not expected to return at all); afterward, the only recourse is to kill the process outright.

Sometime after the point of no return, execve() must adjust the "personality" of the process to match the new executable image. For example, a 64-bit process switching to a 32-bit image must go into the 32-bit personality. In the past, personalities have also been used to emulate other operating environments - running SYSV binaries, for example. The personality changes a number of aspects of the environment the program runs in, though, as we'll see, fewer than it once did.

In the past, personality changes have included filesystem namespace changes. That was necessary because the process of starting the new executable could require looking up other images, such as an "interpreter" image to run the new program. The lookup clearly had to happen prior to the point of no return; if the lookup fails then the system call should fail. So some aspects of the new image's environment had to be present while the process was still running in the context of the old image.

The solution, at the time, was to put some brutal hacks into the low-level SET_PERSONALITY() macro. This macro's job is to switch the process to a new personality, but, post-hack, it no longer did that. Instead, it would make the namespace changes, but leave most of the environment unchanged, setting the special TIF_ABI_PENDING task flag to remind the kernel that, at a later point, it needed to complete the personality change. Over time, the namespace changes were removed from the kernel, but this two-step personality switch mechanism remained.

This hackery allowed SET_PERSONALITY() to be called before the point of no return without breaking the process of tearing down the old image. What was missing, though, was any mechanism for fully restoring the old personality should things fail after the SET_PERSONALITY() call. In effect, that call became the real point of no return, since the kernel had no way of going back to how things were before.

There aren't too many ways that execve() could fail in the window between the SET_PERSONALITY() call and the official point of no return. But one is all it takes, and one easily accessible failure mode is an inability to find the "interpreter" for the new image. The interpreter need not be an executable; it's really the execution environment as a whole. As it happens, there's no means by which a 32-bit process can run a 64-bit image; trying to do so leads to a failure in just the wrong part of the execve() call. Control will return to the calling program, but with a partially-corrupted personality setup.

As it happens, the most common response to an execve() failure is to inform the user and exit; the calling program wasn't expecting to be running any more, so it will normally just bail out. So the schizophrenic personality it's running under will likely never be noticed. But if the calling program instead takes a signal which forces a core dump, the confused personality information will lead to an equally confused kernel and a panic.

In summary, what we have here is a combination of tricky code, made worse by inter-architecture compatibility concerns, implementing behavior which is no longer needed - and doing it wrong. For added fun, it's worth noting that this problem had already been reported in December, but that report fell through the cracks and the bug remained unfixed.

The initial solution proposed by Linus was to simply remove the early SET_PERSONALITY() call. After a bit of discussion, though, Linus and H. Peter Anvin concluded that it was better to fix the code for real. The result was a pair of patches, the first of which splits flush_old_exec() (which contained the point of no return deeply within) into two functions meant to run before and after that point. This patch also gets rid of the early SET_PERSONALITY() call. The second patch then eliminates the TIF_ABI_PENDING hack, simply doing the full personality change at the point of no return.
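The difference between the old and new ordering can be illustrated with a toy model. Nothing here is kernel code: struct toy_task, the abi_pending field (standing in for TIF_ABI_PENDING), and both functions are hypothetical, and the interpreter lookup is reduced to a boolean.

```c
#include <stdbool.h>

enum personality { PER_32BIT, PER_64BIT };

/* Toy stand-in for a task; not the kernel's task_struct. */
struct toy_task {
    enum personality personality;
    bool abi_pending;   /* models the TIF_ABI_PENDING flag */
};

/* Old ordering: the pending flag is set before a step that can still fail. */
static int toy_exec_old(struct toy_task *t, enum personality next,
                        bool interp_found)
{
    if (next != t->personality)
        t->abi_pending = true;      /* early SET_PERSONALITY() */
    if (!interp_found)
        return -1;                  /* error return; the flag is left behind */
    /* point of no return */
    t->personality = next;
    t->abi_pending = false;
    return 0;
}

/* Fixed ordering: nothing is touched until the exec can no longer fail. */
static int toy_exec_fixed(struct toy_task *t, enum personality next,
                          bool interp_found)
{
    if (!interp_found)
        return -1;                  /* task state is untouched */
    /* point of no return */
    t->personality = next;
    return 0;
}
```

In the old ordering, a failed lookup returns to the caller with abi_pending still set, which is the half-switched state that a later core dump trips over. In the fixed ordering a failure leaves the task exactly as it was.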

These changes were merged just prior to the release of 2.6.33-rc6. This is a fairly significant pair of patches to put into the core kernel at this late stage in the 2.6.33 development cycle. And, indeed, they have caused some problems, especially with non-x86 architectures. Distributors looking to backport this fix into older kernels may well find themselves looking for a way to simplify it. But security fixes are important, and fixes which get rid of cobweb-encrusted code which could be hiding other problems are even better. The remaining problems should be cleaned up in short order, and the 2.6.33 kernel will be better for it.

Lockdep-RCU

February 1, 2010

This article was contributed by Paul McKenney

Introduction

Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU improves scalability by allowing readers to execute concurrently with writers. In contrast, conventional locking primitives require that readers wait for ongoing writers and vice versa. RCU ensures read coherence by maintaining multiple versions of data structures and ensuring that they are not freed until all pre-existing read-side critical sections complete. RCU relies on efficient and scalable mechanisms for publishing and reading new versions of an object, and also for deferring the collection of old versions. These mechanisms distribute the work among read and update paths in such a way as to make read paths extremely fast. In some cases (non-preemptable kernels), RCU's read-side primitives have zero overhead. RCU updates can be expensive, so RCU is in general best-suited to read-mostly data structures.

RCU readers execute in RCU read-side critical sections that begin with rcu_read_lock() and end with rcu_read_unlock(). The Linux kernel has multiple flavors of RCU, and each flavor uses its own flavor of rcu_read_lock() and rcu_read_unlock(). Anything outside of an RCU read-side critical section is a quiescent state, and a grace period is any time period in which every CPU (or task, for real-time RCU implementations) passes through at least one quiescent state. Taken together, these rules guarantee that any RCU read-side critical section that is executing at the beginning of a given grace period must complete before that grace period can be permitted to end.

This guarantee is surprisingly useful, allowing RCU to act as a high-performance scalable replacement for reader-writer locking, among other things. But this guarantee is sufficient only for systems with sequentially consistent memory ordering, which are quite rare. Even strongly ordered architectures such as x86 or s390 will allow later reads to execute ahead of prior writes, and compilers can reorder code quite freely. Therefore, RCU needs an additional publish-subscribe guarantee, which is provided by rcu_assign_pointer() and rcu_dereference(). Uses of rcu_assign_pointer() are typically protected by the update-side lock, and uses of rcu_dereference() must typically be within an RCU-read-side critical section.

Unfortunately for this simple rule on use of rcu_dereference(), there is quite a bit of code that is used by both RCU readers and updaters. A more accurate rule is that rcu_dereference() must either be:

  1. within an RCU read-side critical section,
  2. protected by the update-side lock, or
  3. inaccessible to RCU readers.
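For readers more at home in userspace, the publish-subscribe guarantee can be approximated with C11 atomics. The sketch below is an analogy only: it uses a release store where the kernel uses rcu_assign_pointer() and an acquire load where the kernel uses rcu_dereference() (acquire is stronger than the dependency ordering the kernel relies on), and it does not model grace periods or reclamation at all. The struct config type and the publish() and read_sum() functions are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int a;
    int b;
};

static _Atomic(struct config *) gp;   /* shared pointer, initially NULL */

/* Updater side: initialize the object fully, then publish it. */
static void publish(struct config *p, int a, int b)
{
    p->a = a;
    p->b = b;
    /* Release ordering keeps the field stores ahead of the pointer store. */
    atomic_store_explicit(&gp, p, memory_order_release);
}

/* Reader side: fetch the pointer once, ordered before the field reads. */
static int read_sum(void)
{
    struct config *p = atomic_load_explicit(&gp, memory_order_acquire);

    return p ? p->a + p->b : -1;
}
```

A reader that sees the new pointer is thereby guaranteed to see the initialized fields as well, which is the property that neither the compiler nor a weakly ordered CPU provides by default.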

The remainder of this article is as follows:

  1. Why Bother With lockdep-Enabling RCU?
  2. RCU API for lockdep.
  3. RCU lockdep Usage Examples.
  4. RCU lockdep Implementation.
  5. RCU API for lockdep: Quick Reference.
These sections are followed by Conclusions and Future Directions and Answers to Quick Quizzes.

Why Bother With lockdep-Enabling RCU?

Compliance with the usage rule for rcu_dereference() is verified by manual code inspection. And this manual code inspection worked great back in 2.6.10, when there were a grand total of 38 occurrences of rcu_dereference(). However, given that there are now more than 350 occurrences of rcu_dereference() in 2.6.32, it appears the day of sole reliance on manual code inspection is long over. Additional evidence on this point was provided by Thomas Gleixner when he trained his eagle eye on a few rcu_dereference() instances in mainline.

It is clearly time to bring lockdep-style checking to rcu_dereference(). Unfortunately, because rcu_dereference() can be used in such a wide variety of environments, simple addition of lockdep checking to the current API fails, producing reams of false positives while ignoring potentially dangerous bugs.

Quick Quiz 1: How can you be so sure that there is no clever lockdep-check strategy given the current API? Answer

RCU API for lockdep

A major goal of any API change is to minimize the impact on existing code, patches in flight, and ongoing debugging efforts.

Because the most common use of rcu_dereference() is for accesses that are strictly within a vanilla RCU read-side critical section, rcu_dereference() should check only for being in a vanilla RCU read-side critical section. This minimizes impact on existing code, including patches in flight. This means that other rcu_dereference() API members must be created.

However, these other API members cannot be defined in terms of rcu_dereference() because these other members must be usable outside of vanilla RCU read-side critical sections. Therefore, a raw interface named rcu_dereference_raw() inherits the implementation that used to belong to rcu_dereference(). In other words, if you “know what you are doing”, just use rcu_dereference_raw() and lockdep will never complain about it. (But you just might hear a few questions from me!)

The underlying API for the other forms of rcu_dereference() is rcu_dereference_check(), which takes two arguments. The first argument is an RCU-protected pointer, the same as that of rcu_dereference() and the new rcu_dereference_raw(). The second argument is a boolean expression that evaluates to zero if there is a problem, in which case, if RCU lockdep is enabled, you will get a WARN_ON_ONCE() on your console log.

The other dereferencing APIs are rcu_dereference(), rcu_dereference_sched(), rcu_dereference_bh(), and srcu_dereference(), each of which checks to make sure that it is being used in the corresponding flavor of RCU read-side critical section, giving your console log a WARN_ON_ONCE() otherwise (again, assuming that RCU lockdep is enabled). All of these take a single RCU-protected pointer as an argument, except for srcu_dereference(), which also takes a pointer to a struct srcu_struct. This additional argument permits srcu_dereference() to distinguish among multiple SRCU domains.

These four dereferencing APIs use corresponding APIs that check for being in the corresponding flavor of RCU read-side critical section: rcu_read_lock_held(), rcu_read_lock_bh_held(), rcu_read_lock_sched_held(), and srcu_read_lock_held(). Of these, only srcu_read_lock_held() takes an argument, namely a struct srcu_struct, again permitting distinguishing among multiple SRCU domains.

RCU lockdep Usage Examples

The prototypical use of these new APIs is as follows:

    rcu_read_lock();
    p = rcu_dereference(gp->data);
    do_something_with(p);
    rcu_read_unlock();

The alert reader may have noticed that this is no different from the old usage of these APIs. This situation is strictly intentional.

Similar code may be written for other flavors of RCU, for example:

    idx = srcu_read_lock(sp);   /* SRCU readers name their domain and keep the index */
    p = srcu_dereference(gp->data, sp);
    do_something_with(p);
    srcu_read_unlock(sp, idx);

These examples work well when used inside RCU read-side critical sections, but fail completely for code that is invoked both by readers and updaters. Although we could insert artificial RCU read-side critical sections in updaters, these can cause much confusion. Instead, we use rcu_dereference_check(), for example, in the files_fdtable() macro:

    #define files_fdtable(files) \
      (rcu_dereference_check((files)->fdt, \
                             rcu_read_lock_held() || \
                             lockdep_is_held(&(files)->file_lock) || \
                             atomic_read(&(files)->count) == 1))

This statement fetches the RCU-protected pointer (files)->fdt, but requires that files_fdtable() be invoked within an RCU read-side critical section, with (files)->file_lock held, or with the (files)->count reference counter equal to one (in other words, when the structure is inaccessible to RCU readers).

Quick Quiz 2: Suppose that an access to an RCU-protected pointer gp must be either inside an RCU-bh read-side critical section, an SRCU read-side critical section for SRCU domain sp, or with mylock held. How do you code this? Answer

RCU lockdep Implementation

The basic change underlying the RCU lockdep implementation is a set of per-RCU-flavor lockdep maps (in the case of SRCU, a per-SRCU-domain lockdep map, ->dep_map, in each struct srcu_struct):

    extern struct lockdep_map rcu_lock_map;
    # define rcu_read_acquire() \
        lock_acquire(&rcu_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release()  lock_release(&rcu_lock_map, 1, _THIS_IP_)

    extern struct lockdep_map rcu_bh_lock_map;
    # define rcu_read_acquire_bh() \
        lock_acquire(&rcu_bh_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release_bh()  lock_release(&rcu_bh_lock_map, 1, _THIS_IP_)

    extern struct lockdep_map rcu_sched_lock_map;
    # define rcu_read_acquire_sched() \
        lock_acquire(&rcu_sched_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define rcu_read_release_sched() \
        lock_release(&rcu_sched_lock_map, 1, _THIS_IP_)

    # define srcu_read_acquire(sp) \
        lock_acquire(&(sp)->dep_map, 0, 0, 2, 1, NULL, _THIS_IP_)
    # define srcu_read_release(sp) \
        lock_release(&(sp)->dep_map, 1, _THIS_IP_)

These are used to implement rcu_read_lock_held(), rcu_read_lock_bh_held(), rcu_read_lock_sched_held(), and srcu_read_lock_held():

    static inline int rcu_read_lock_held(void)
    {
      if (debug_locks)
        return lock_is_held(&rcu_lock_map);
      return 1;
    }

    static inline int rcu_read_lock_bh_held(void)
    {
      if (debug_locks)
        return lock_is_held(&rcu_bh_lock_map);
      return 1;
    }

    static inline int rcu_read_lock_sched_held(void)
    {
      int lockdep_opinion = 0;

      if (debug_locks)
        lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
      return lockdep_opinion || preempt_count() != 0;
    }

    static inline int srcu_read_lock_held(struct srcu_struct *sp)
    {
      if (debug_locks)
        return lock_is_held(&sp->dep_map);
      return 1;
    }

In each case, if lockdep is enabled, we consult the corresponding lockdep_map; otherwise, we (conservatively) guess that we are in the appropriate RCU read-side critical section. The one exception is rcu_read_lock_sched_held(), which also accepts a non-zero preempt_count(), since disabling preemption begins an RCU-sched read-side critical section. This permits WARN_ON_ONCE(!rcu_read_lock_held()) to be used freely.

Quick Quiz 3: How do these work if lockdep is not configured at all? Answer

The non-checking variant of rcu_dereference() is rcu_dereference_raw(), which is defined as follows:

    #define rcu_dereference_raw(p)  ({ \
            typeof(p) _________p1 = ACCESS_ONCE(p); \
            smp_read_barrier_depends(); \
            (_________p1); \
            })

Then rcu_dereference_check() is implemented in terms of rcu_dereference_raw() as follows:

    #define rcu_dereference_check(p, c) \
      ({ \
        if (debug_locks) \
          WARN_ON_ONCE(!(c)); \
        rcu_dereference_raw(p); \
      })

However, if lockdep is not configured, the following alternative implementation is used:

    #define rcu_dereference_check(p, c)     rcu_dereference_raw(p)
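The shape of the check can be tried out in userspace with a toy model. Everything below is hypothetical and carries a toy_ prefix to make that clear: a nesting counter stands in for the lockdep map, the warning fires every time rather than once (unlike WARN_ON_ONCE()), and the memory-ordering half of rcu_dereference_raw() is omitted entirely.

```c
#include <stdio.h>

static int toy_read_nesting;   /* stand-in for the lockdep map */
static int toy_warnings;       /* number of warnings emitted so far */

static void toy_read_lock(void)     { toy_read_nesting++; }
static void toy_read_unlock(void)   { toy_read_nesting--; }
static int  toy_read_lock_held(void) { return toy_read_nesting > 0; }

/* Complain (every time, for simplicity) when the condition is false. */
static int toy_warn_unless(int ok)
{
    if (!ok) {
        toy_warnings++;
        fprintf(stderr, "suspicious pointer access\n");
    }
    return ok;
}

/* Evaluate the condition for its warning, then yield the pointer. */
#define toy_dereference_check(p, c) ((void)toy_warn_unless(c), (p))

/* Loosely modeled on files_fdtable(): reader, or sole owner. */
struct toy_files {
    int *fdt;
    int count;
};

static int *toy_files_fdtable(struct toy_files *f)
{
    return toy_dereference_check(f->fdt,
                                 toy_read_lock_held() || f->count == 1);
}
```

Calling toy_files_fdtable() outside a read-side section on a structure with two references produces a warning; calling it between toy_read_lock() and toy_read_unlock(), or with count equal to one, does not.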

Quick Quiz 4: Why not add a ((void)(c)) to the non-lockdep version of rcu_dereference_check() in order to detect compiler errors in the “c” argument? Answer

The remainder of the primitives are defined as follows:

    #define rcu_dereference(p) \
        rcu_dereference_check(p, rcu_read_lock_held())

    #define rcu_dereference_bh(p) \
        rcu_dereference_check(p, rcu_read_lock_bh_held())

    #define rcu_dereference_sched(p) \
        rcu_dereference_check(p, rcu_read_lock_sched_held())

    #define srcu_dereference(p, sp) \
        rcu_dereference_check(p, srcu_read_lock_held(sp))

Quick Quiz 5: What are the non-lockdep definitions of these primitives? Answer

RCU API for lockdep: Quick Reference

Name                         CONFIG_PROVE_RCU                                                   !CONFIG_PROVE_RCU
rcu_dereference(p)           returns p, warns if not in RCU read-side critical section          returns p, never warns
rcu_dereference_bh(p)        returns p, warns if not in RCU-bh read-side critical section       returns p, never warns
rcu_dereference_sched(p)     returns p, warns if not in RCU-sched read-side critical section    returns p, never warns
srcu_dereference(p, sp)      returns p, warns if not in SRCU read-side critical section for sp  returns p, never warns
rcu_dereference_check(p, c)  returns p, warns if !c                                             returns p, never warns
rcu_dereference_raw(p)       returns p, never warns                                             returns p, never warns

rcu_read_lock_held()         non-zero if in RCU read-side critical section                      always non-zero
rcu_read_lock_bh_held()      non-zero if in RCU-bh read-side critical section                   always non-zero
rcu_read_lock_sched_held()   non-zero if in RCU-sched read-side critical section                always non-zero
srcu_read_lock_held(sp)      non-zero if in SRCU read-side critical section for sp              always non-zero

Conclusions and Future Directions

These are early days for the lockdep-enabled RCU primitives. They have been applied to some of the networking, VFS, scheduler, radix tree, and IDR code. Thus far, things are going well, but here are some possible future directions:
  1. The RCU list macros, radix tree, and IDR implementations currently use rcu_dereference_raw(). At some point, it may be necessary to produce checked variants. Given that this will require yet more APIs, need must be demonstrated before the API explosion is undertaken. list_for_each_rcu(), list_for_each_rcu_bh(), list_for_each_rcu_sched(), list_for_each_srcu(), list_for_each_rcu_check(), and list_for_each_rcu_raw(), anyone?

  2. Thus far, it has been easy to generate rcu_dereference_check()'s boolean expressions. Nevertheless, I am a bit nervous about code that is called both in RCU read-side critical sections and by initialization code. In some cases, it might be difficult to detect the initialization case, but such cases will be dealt with as they come up.

  3. The rcu_assign_pointer() primitive remains unchecked. It is used primarily under locks, which are quite a bit more familiar, and for which there is already lockdep available.

Regardless of how the future unfolds, lockdep-enabled RCU should be very helpful in detecting RCU-usage bugs.

Acknowledgments

I am grateful to Peter Zijlstra and Thomas Gleixner for sharing their experiences applying lockdep checking to rcu_dereference(). I owe thanks to Eric Dumazet for helping me work out how to handle some difficult rcu_dereference() instances in the networking code, to Ingo Molnar for much encouragement and advice, and to Kathy Bennett for her support of this effort.

This work represents the view of the author and does not necessarily represent the view of IBM.

Answers to Quick Quizzes

Quick Quiz 1: How can you be so sure that there is no clever lockdep-check strategy given the current API?

Answer: Because if there was a clever lockdep-check strategy given the current RCU API, Peter Zijlstra would have implemented it! If you know of one, please don't keep it a secret — but please do yourself the favor of reading the rest of this article before deciding whether or not you do have a solution.

Quick Quiz 2: Suppose that an access to an RCU-protected pointer gp must be either inside an RCU-bh read-side critical section, an SRCU read-side critical section for SRCU domain sp, or with mylock held. How do you code this?

Answer: One approach is as follows:

    rcu_dereference_check(gp,
                          rcu_read_lock_bh_held() ||
                          srcu_read_lock_held(sp) ||
                          lockdep_is_held(&mylock));

Quick Quiz 3: How do these work if lockdep is not configured at all?

Answer: As follows:

    static inline int rcu_read_lock_held(void)
    {
      return 1;
    }

    static inline int rcu_read_lock_bh_held(void)
    {
      return 1;
    }

    static inline int rcu_read_lock_sched_held(void)
    {
      return preempt_count() != 0;
    }

    static inline int srcu_read_lock_held(struct srcu_struct *sp)
    {
      return 1;
    }

Quick Quiz 4: Why not add a ((void)(c)) to the non-lockdep version of rcu_dereference_check() in order to detect compiler errors in the “c” argument?

Answer: Because lockdep_is_held() is defined only in lockdep builds of the kernel. Therefore, ((void)(c)) would give you lots of false alarms. So, just make sure that you do at least one build-and-test cycle with lockdep defined.

Quick Quiz 5: What are the non-lockdep definitions of these primitives?

Answer: They are exactly the same as the lockdep definitions! The implementations of rcu_dereference_check() remove the need for duplicate definitions for rcu_dereference(), rcu_dereference_bh(), rcu_dereference_sched(), and srcu_dereference().

Patches and updates

Device drivers

  • Bartlomiej Zolnierkiewicz: ide2libata. (January 30, 2010)

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds