Weekly Edition Return to the Kernel page |
What is RCU, Fundamentally?
[Editor's note: this is the first in a three-part series on how the
read-copy-update mechanism works. Many thanks to Paul McKenney and
Jonathan Walpole for allowing us to publish these articles. The remaining
two sections will appear in future weeks.]
Part 1 of 3 of What is RCU, Really? Paul E. McKenney, IBM Linux Technology Center
IntroductionRead-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU achieves scalability improvements by allowing reads to occur concurrently with updates. In contrast with conventional locking primitives that ensure mutual exclusion among concurrent threads regardless of whether they be readers or updaters, or with reader-writer locks that allow concurrent reads but not in the presence of updates, RCU supports concurrency between a single updater and multiple readers. RCU ensures that reads are coherent by maintaining multiple versions of objects and ensuring that they are not freed up until all pre-existing read-side critical sections complete. RCU defines and uses efficient and scalable mechanisms for publishing and reading new versions of an object, and also for deferring the collection of old versions. These mechanisms distribute the work among read and update paths in such a way as to make read paths extremely fast. In some cases (non-preemptable kernels), RCU's read-side primitives have zero overhead. Quick Quiz 1: But doesn't seqlock also permit readers and updaters to get work done concurrently? This leads to the question "what exactly is RCU?", and perhaps also to the question "how can RCU possibly work?" (or, not infrequently, the assertion that RCU cannot possibly work). This document addresses these questions from a fundamental viewpoint; later installments look at them from usage and from API viewpoints. This last installment also includes a list of references. RCU is made up of three fundamental mechanisms, the first being used for insertion, the second being used for deletion, and the third being used to allow readers to tolerate concurrent insertions and deletions. These mechanisms are described in the following sections, which focus on applying RCU to linked lists:
These sections are followed by concluding remarks and the answers to the Quick Quizzes. Publish-Subscribe MechanismOne key attribute of RCU is the ability to safely scan data, even
though that data is being modified concurrently.
To provide this ability for concurrent insertion,
RCU uses what can be thought of as a publish-subscribe mechanism.
For example, consider an initially
1 struct foo {
2 int a;
3 int b;
4 int c;
5 };
6 struct foo *gp = NULL;
7
8 /* . . . */
9
10 p = kmalloc(sizeof(*p), GFP_KERNEL);
11 p->a = 1;
12 p->b = 2;
13 p->c = 3;
14 gp = p;
Unfortunately, there is nothing forcing the compiler and CPU to execute
the last four assignment statements in order.
If the assignment to 1 p->a = 1; 2 p->b = 2; 3 p->c = 3; 4 rcu_assign_pointer(gp, p); The However, it is not sufficient to only enforce ordering at the updater, as the reader must enforce proper ordering as well. Consider for example the following code fragment:
1 p = gp;
2 if (p != NULL) {
3 do_something_with(p->a, p->b, p->c);
4 }
Although this code fragment might well seem immune to misordering,
unfortunately, the
DEC
Alpha CPU [PDF]
and value-speculation compiler optimizations can, believe it or not,
cause the values of Clearly, we need to prevent this sort of skullduggery on the
part of both the compiler and the CPU.
The
1 rcu_read_lock();
2 p = rcu_dereference(gp);
3 if (p != NULL) {
4 do_something_with(p->a, p->b, p->c);
5 }
6 rcu_read_unlock();
The Although
Adapting the pointer-publish example for the linked list gives the following:
1 struct foo {
2 struct list_head list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 list_add_rcu(&p->list, &head);
Line 15 must be protected by some synchronization mechanism (most
commonly some sort of lock) to prevent multiple Subscribing to an RCU-protected list is straightforward:
1 rcu_read_lock();
2 list_for_each_entry_rcu(p, head, list) {
3 do_something_with(p->a, p->b, p->c);
4 }
5 rcu_read_unlock();
The Quick Quiz 2:
What prevents the Linux's other doubly linked list, the hlist, is a linear list, which means that it needs only one pointer for the header rather than the two required for the circular list. Thus, use of hlist can halve the memory consumption for the hash-bucket arrays of large hash tables.
Publishing a new element to an RCU-protected hlist is quite similar to doing so for the circular list:
1 struct foo {
2 struct hlist_node *list;
3 int a;
4 int b;
5 int c;
6 };
7 HLIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 hlist_add_head_rcu(&p->list, &head);
As before, line 15 must be protected by some sort of synchronization mechanism, for example, a lock. Subscribing to an RCU-protected hlist is also similar to the circular list:
1 rcu_read_lock();
2 hlist_for_each_entry_rcu(p, q, head, list) {
3 do_something_with(p->a, p->b, p->c);
4 }
5 rcu_read_unlock();
Quick Quiz 3:
Why do we need to pass two pointers into
The set of RCU publish and subscribe primitives are shown in the following table, along with additional primitives to "unpublish", or retract:
Note that the These questions are addressed in the following section. Wait For Pre-Existing RCU Readers to CompleteIn its most basic form, RCU is a way of waiting for things to finish. Of course, there are a great many other ways of waiting for things to finish, including reference counts, reader-writer locks, events, and so on. The great advantage of RCU is that it can wait for each of (say) 20,000 different things without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes using explicit tracking. In RCU's case, the things waited on are called
"RCU read-side critical sections".
An RCU read-side critical section starts with an
RCU accomplishes this feat by indirectly determining when these other things have finished, as has been described elsewhere for RCU Classic and realtime RCU. In particular, as shown in the following figure, RCU is a way of waiting for pre-existing RCU read-side critical sections to completely finish, including memory operations executed by those critical sections.
However, note that RCU read-side critical sections that begin after the beginning of a given grace period can and will extend beyond the end of that grace period. The following pseudocode shows the basic form of algorithms that use RCU to wait for readers:
The following code fragment, adapted from those in the
previous section,
demonstrates this process, with field
1 struct foo {
2 struct list_head list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = search(head, key);
12 if (p == NULL) {
13 /* Take appropriate action, unlock, and return. */
14 }
15 q = kmalloc(sizeof(*p), GFP_KERNEL);
16 *q = *p;
17 q->b = 2;
18 q->c = 3;
19 list_replace_rcu(&p->list, &q->list);
20 synchronize_rcu();
21 kfree(p);
Lines 19, 20, and 21 implement the three steps called out above. Lines 16-19 gives RCU ("read-copy update") its name: while permitting concurrent reads, line 16 copies and lines 17-19 do an update. The There is a trick, and the trick is that RCU Classic read-side critical
sections delimited by Thus, RCU Classic's 1 for_each_online_cpu(cpu) 2 run_on(cpu); Here, Of course, the actual implementation in the Linux kernel is much more complex, as it is required to handle interrupts, NMIs, CPU hotplug, and other hazards of production-capable kernels, but while also maintaining good performance and scalability. Realtime implementations of RCU must additionally help provide good realtime response, which rules out implementations (like the simple two-liner above) that rely on disabling preemption. Although it is good to know that there is a simple conceptual
implementation of Maintain Multiple Versions of Recently Updated ObjectsThis section demonstrates how RCU maintains multiple versions of lists to accommodate synchronization-free readers. Two examples are presented showing how a an element that might be referenced by a given reader must remain intact while that reader remains in its RCU read-side critical section. The first example demonstrates deletion of a list element, and the second example demonstrates replacement of an element. Example 1: Maintaining Multiple Versions During DeletionTo start the "deletion" example, we will modify lines 11-21 in the example in the previous section as follows:
1 p = search(head, key);
2 if (p != NULL) {
3 list_del_rcu(&p->list);
4 synchronize_rcu();
5 kfree(p);
6 }
The initial state of the list, including the pointer
The triples in each element represent the values of fields After the
Please note that readers are not permitted to maintain references to
element
At this point, the
At this point, we have completed the deletion of
element Example 2: Maintaining Multiple Versions During ReplacementTo start the replacement example, here are the last few lines of the example in the previous section: 1 q = kmalloc(sizeof(*p), GFP_KERNEL); 2 *q = *p; 3 q->b = 2; 4 q->c = 3; 5 list_replace_rcu(&p->list, &q->list); 6 synchronize_rcu(); 7 kfree(p); The initial state of the list, including the pointer
As before,
the triples in each element represent the values of fields Line 1
Line 2 copies the old element to the new one:
Line 3 updates
Line 4 updates
Now, line 5 does the replacement, so that the new element is
finally visible to readers.
At this point, as shown below, we have two versions of the list.
Pre-existing readers might see the
After the
After the
Despite the fact that RCU was named after the replacement case, the vast majority of RCU usage within the Linux kernel relies on the simple deletion case shown in the previous section. DiscussionThese examples assumed that a mutex was held across the entire update operation, which would mean that there could be at most two versions of the list active at a given time. Quick Quiz 4: How would you modify the deletion example to permit more than two versions of the list to be active? Quick Quiz 5: How many RCU versions of a given list can be active at any given time? This sequence of events shows how RCU updates use multiple versions to safely carry out changes in presence of concurrent readers. Of course, some algorithms cannot gracefully handle multiple versions. There are techniques [PDF] for adapting such algorithms to RCU, but these are beyond the scope of this article. ConclusionThis article has described the three fundamental components of RCU-based algorithms:
Quick Quiz 6:
How can RCU updaters possibly delay RCU readers, given that the
These three RCU components allow data to be updated in face of concurrent readers, and can be combined in different ways to implement a surprising variety of different types of RCU-based algorithms, some of which will be the topic of the next installment in this "What is RCU, Really?" series. AcknowledgementsWe are all indebted to Andy Whitcroft, Gautham Shenoy, and Mike Fulton, whose review of an early draft of this document greatly improved it. We owe thanks to the members of the Relativistic Programming project and to members of PNW TEC for many valuable discussions. We are grateful to Dan Frye for his support of this effort. Finally, this material is based upon work supported by the National Science Foundation under Grant No. CNS-0719851. This work represents the view of the authors and does not necessarily represent the view of IBM or of Portland State University. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others. Answers to Quick QuizzesQuick Quiz 1: But doesn't seqlock also permit readers and updaters to get work done concurrently? Answer:
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the In contrast, RCU readers can perform useful work even in presence of concurrent RCU updaters. Quick Quiz 2:
What prevents the Answer: On all systems running Linux, loads from and stores
to pointers are atomic, that is, if a store to a pointer occurs at
the same time as a load from that same pointer, the load will return
either the initial value or the value stored, never some bitwise mashup
of the two.
In addition, the Quick Quiz 3:
Why do we need to pass two pointers into
Answer: Because in an hlist it is necessary to check for
NULL rather than for encountering the head.
(Try coding up a single-pointer Quick Quiz 4: How would you modify the deletion example to permit more than two versions of the list to be active? Answer: One way of accomplishing this is as follows:
spin_lock(&mylock);
p = search(head, key);
if (p == NULL)
spin_unlock(&mylock);
else {
list_del_rcu(&p->list);
spin_unlock(&mylock);
synchronize_rcu();
kfree(p);
}
Note that this means that multiple concurrent deletions might be
waiting in Quick Quiz 5: How many RCU versions of a given list can be active at any given time? Answer: That depends on the synchronization design. If a semaphore protecting the update is held across the grace period, then there can be at most two versions, the old and the new. However, if only the search, the update, and the
Quick Quiz 6:
How can RCU updaters possibly delay RCU readers, given that the
Answer: The modifications undertaken by a given RCU updater will cause the corresponding CPU to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU readers to incur expensive cache misses. (Can you design an algorithm that changes a data structure without inflicting expensive cache misses on concurrent readers? On subsequent readers?) (Log in to post comments)
example code nitpicks Posted Dec 20, 2007 21:15 UTC (Thu) by ntl (subscriber, #40518) [Link] Shouldn't the list_head and hlist_node structs be embedded in struct foo, instead of pointers? That is,
struct foo {
- struct list_head *list;
+ struct list_head list;
...
}
Also, no need to cast the return value of kmalloc :)
example code nitpicks Posted Dec 21, 2007 1:03 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link] Good catch in both cases!
example code nitpicks Posted Dec 21, 2007 17:08 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link] And many thanks to Jon Corbet for applying the fixes for these problems!
Problem with ex 2 ? Posted Dec 23, 2007 23:23 UTC (Sun) by xav (subscriber, #18536) [Link] It looks to me that example 2 is wrong: access to p is unlocked, so it can changed under its feet if preempted: *q = *p can access freed memory.
Problem with ex 2 ? Posted Dec 27, 2007 18:57 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link] Indeed! As with the preceding example, there must be some sort of mutual exclusion in play, and as in the preceding example, this mutual exclusion is not shown explicitly. One approach, as you say, would be to add locking, perhaps similar to that described in the answer to Quick Quiz 4 (but for the deletion example). Another approach would be to hold a mutex (in the old style, "semaphore") across the entire code segment. Yet a third approach would be to permit only a single designated task to do updates (in which case the code would remain as is). There are other approaches as well. In any case, I am glad to see that people are sensitized to the need for mutual exclusion in parallel code!
What is RCU, Fundamentally? Posted Dec 26, 2007 12:54 UTC (Wed) by union (subscriber, #36393) [Link] Since nobody else said it. Thanks for a great article. I liked the articles about memory and this looks like another great miniseries. While I spend most of my days programming python and Java, I believe that understanding such lower level concepts makes me better programmer. I hope that there will be more such background or fundamentals articles in the future. Luka
What is RCU, Fundamentally? Posted Dec 27, 2007 19:04 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link] Glad you liked it!!! ;-) Garbage-collected languages such as Java implement RCU's "wait for pre-existing RCU readers to complete" implicitly via the garbage collector. However, in Java, you must make careful use of atomic variables in order to correctly implement the publish-subscribe mechanism. Last I knew, Java would emit memory barriers for the subscribe operation, even when unnecessary, but perhaps that has changed by now. However, in many cases, the memory-barrier overhead might well be acceptable (e.g., when avoiding contention). So you might well be able to use RCU techniques in Java! (For whatever it is worth, Kung and Lehman described the gist of using garbage collectors in this fashion in their paper entitled "Concurrent Maintenance of Binary Search Trees" in "ACM Transactions on Database Systems" back in September 1980.)
What is RCU, Fundamentally? Posted Dec 27, 2007 23:47 UTC (Thu) by scottt (subscriber, #5028) [Link] A link to the paper.With people doing all these advanced concurrent algorithms in the eighties I wonder why we're only just now getting a formal memory model for the C/C++ programming languages.
What is RCU, Fundamentally? Posted Dec 28, 2007 15:22 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link] Thank you for providing the link -- much better than my old hard copy! It is very good indeed to see some of these older papers becoming available on the web. There was indeed some memory-consistency-model research done some decades ago. Although I cannot claim comprehensive knowledge of memory-model research, as near as I can tell, almost all of this older research was swept aside by the notion of sequential consistency in 1979. The lion's share of later research assumed sequential consistency, which rendered this research less than helpful on weakly-ordered machines, where "weakly ordered" includes x86. Algorithms that fail to work on x86 cannot be said to have much practical value, in my opinion. There were nevertheless some fundamental papers published in the 90s (e.g., Sarita Adve and Kourosh Gharachorloo, among others), but many of the researchers focused on "how to emulate sequential consistency on weak-memory machines" as opposed to "how best to express memory-ordering constraints while allowing efficient code to be generated for systems with a wide variety of memory consistency models". It is this latter statement that the C/C++ standards committee must address.
Backlinks Posted Dec 28, 2007 12:37 UTC (Fri) by mmutz (subscriber, #5642) [Link] I'm wondering whether the omission of the backlinks in the examples is a good thing. The omission makes the technique trivial, since publishing only involves one replacing one pointer. What about the second, back, one? Without support for atomic two-pointer updates, how can both the p->prev->next = q and p->next->prev = q updates be performed without risking clients to see an inconsistent view of the doubly linked list? Or is that not a problem in practice? Thanks for the article, though. Looking forward to the next installment!
Backlinks Posted Dec 28, 2007 15:35 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link] Glad you liked the article, and thank you for the excellent question! I could give any number of answers, including:
Given the example above, we could support a level of consistency equivalent to the double-pointer atomic update simply by assigning the pointers in sequence -- just remove the prev-pointer poisoning from list_del_rcu(), for example. But doing this would sacrifice the ability to catch those bugs that pointer-poisoning currently catches. So, there might well come a time when the Linux kernel permits RCU-protected traversal of linked lists in both directions, but we need to see a compelling need for this before implementing it.
Forcing the compiler and CPU to execute assignment statements in order ? Posted Dec 30, 2007 16:51 UTC (Sun) by Brenner (subscriber, #28232) [Link] Thanks for the article; In the 'Publish-Subscribe Mechanism', the article states : "Unfortunately, there is nothing forcing the compiler and CPU to execute the last four assignment statements in order." This breaks my mind. I understand concurrency problems with multiple tasks, but here only one task is considered (or did I misunderstand something ?). Why would the CPU not execute the last four assignment statements in the order given by the compiler, and why would the compiler not keep the order given by the programmer ??? Even with multiple tasks, I thought that the order was guaranteed (I agree that many things can happen between two code lines of one given task if other tasks are running though). Can anyone enlighten me ?
Forcing the compiler and CPU to execute assignment statements in order ? Posted Jan 1, 2008 23:10 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link] You are welcome! Your question about ordering is a good one, and code reordering by both CPU and compiler can be grossly counter-intuitive at times. So an explanation is indeed in order. Let's start with the compiler. The code was as follows: 11 p->a = 1; 12 p->b = 2; 13 p->c = 3; 14 gp = p; It is possible that the compiler holds the value of So special non-standard compiler directives ( On to the CPU. Suppose that the compiler generates the code in order, and that the CPU executes it in order. What could possibly go wrong? The store buffer. The CPU will likely commit the new values of In addition, superscalar CPUs routinely execute code out of order. Such beasts might become less common as power-consumption concerns rise to the fore, but there are still quite a few such CPUs out there, dating from the mid-1990s for commodity microprocessors and to the mid-1960s for supercomputers. For example, Tomasulo's Algorithm, dating from the mid-1960s, is specifically designed to allow CPUs to execute instructions out of order. And there were super-scalar supercomputers pre-dating Tomasulo's Algorithm. In short, if ordering is important to you, use primitives to enforce ordering. Such primitives include locking primitives (e.g., Hey, you asked!!! And Happy New Year!!!
Forcing the compiler and CPU to execute assignment statements in order ? Posted Jan 2, 2008 0:09 UTC (Wed) by Brenner (subscriber, #28232) [Link] Thanks a lot for your very informative answer. And Happy New Year!!!
Userland Posted Jan 4, 2008 4:06 UTC (Fri) by eduardo.juan (guest, #47737) [Link] Thanks for this excellent article! I have a fundamentally question! :D Is any way to use RCU on userland programs? I mean, any lib or implementation of this constraints in userland? Thanks and regards!
Using RCU in userland Posted Jan 4, 2008 18:13 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link] RCU has been used experimentally in user-mode code, and one benchmarking framework may be found at U. of Toronto.In addition, I vaguely recall at least one user-level RCU implementation being posted on LKML as a programming/debugging aid. There are some others that I am unfortunately unable to release at this time. As noted in an earlier comment, garbage-collected languages automatically provide much of the wait-for-reader functionality for free -- however, it is still necessary to correctly handle the publish-subscribe operation correctly. One way to do this in Java is to use the recently added "volatile" field specifier (no relation to "volatile" in C/C++) for the pointer that is being published. In other words, instead of using the Linux-kernel rcu_assign_pointer() and rcu_dereference() to publish and subscribe, you instead mark the pointer itself using Java's "volatile" keyword.
Some usages of RCU in the kernel code Posted Jan 6, 2008 10:42 UTC (Sun) by fbh (guest, #49754) [Link] Hi, I'm probably missing something but I looked at users of RCU in the kernel and found that kernel/kprobes.c doesn't call rcu_read_lock() function before entering in a critical section. Does it have any good reasons to do so ? Another point in include/linux/list.h: sometimes it seems that RCU API is not used for no good reasons. For example __list_add_rcu() does not use rcu_dereference(). However by using it, the code could have clearly showed which pointer is RCU protected... The Alpha memory ordering and value-speculation compiler optimizations done in this architecture sounds incredibely weird to me. Any avalaible pointers on this subject ? Anyways, thanks for this article.
Some usages of RCU in the kernel code Posted Jan 6, 2008 19:04 UTC (Sun) by PaulMcKenney (subscriber, #9624) [Link] The reason that kernel/kprobes.c does not use The reason that The discussion of Figure 2 in this article and its bibliography is a good place to start on Alpha's memory ordering. I am not all that familiar with value-speculation compiler-optimization research, but one place to start might be here.
Some usages of RCU in the kernel code Posted Jan 7, 2008 10:14 UTC (Mon) by fbh (guest, #49754) [Link]
Regarding the second point, I'm sorry, my fingers typed rcu_dereference() whereas my brain was
thinking to rcu_assign_pointer(). rcu_dereference() could be used in update-side code, as you
pointed out, but obviously not in __list_add_rcu().
Therefore __list_add_rcu() could be:
new->next = next;
new->prev = prev;
rcu_assign_pointer(prev->next, new);
next->prev = new;
The main advantage of this is readability IMHO.
You said that __list_add_rcu() must be protected by appropriate locking, and a memory barrier
is implied by lock acquisition which should be enough. But __list_add_rcu() has a memory
barrier in its implementation.
Thanks.
Some usages of RCU in the kernel code Posted Jan 7, 2008 15:28 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link] Good point! Would you like to submit a patch?
Some usages of RCU in the kernel code Posted Jan 7, 2008 16:03 UTC (Mon) by fbh (guest, #49754) [Link] Well, I actually raised 2 points and I'm not sure which one is the good one... Could you please clarify ? To answer your question, I don't have any problems for submitting a patch. --- PS: This interface is pretty hard for discussing on an article. It's like bottom-posting except we loose all its advantages... Just my 2 cents.
Some usages of RCU in the kernel code Posted Jan 7, 2008 16:52 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link] I was inviting a patch for incorporating the smp_wmb() into the rcu_assign_pointer. The lock acquisition makes rcu_dereference() unnecessary, but cannot enforce the ordering provided by the smp_wmb()/rcu_assign_pointer().
Some usages of RCU in the kernel code Posted Jan 8, 2008 11:02 UTC (Tue) by jarkao2 (guest, #41960) [Link] Thanks for this (next) great article too! BTW, probably 'a' tiny fix needed: "Maintain Multiple Versions of Recently Updated Objects [...] Two examples are presented showing how a an element [...]"
Some usages of RCU in the kernel code Posted Jan 8, 2008 14:13 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link] Good eyes!!!
|
Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
Powered by Rackspace Managed Hosting.