
The RCU API, 2014 Edition

September 4, 2014

This article was contributed by Paul McKenney

Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently described as a replacement for reader-writer locking, but has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits readers to accomplish useful work even when running concurrently with updaters.

Although the basic idea behind RCU has not changed in the two decades since its introduction into DYNIX/ptx in 1993, the RCU API has evolved significantly, even in the nearly four years since a 2010 article covered the RCU API, most recently driven by software-engineering concerns. The following sections document this evolution.

  1. Summary of RCU API additions
  2. RCU has a family of wait-to-finish and data-access APIs
  3. RCU has list-based publish-subscribe and version-maintenance APIs
  4. Kernel configuration parameters
  5. How did those 2010 predictions turn out?
  6. What next for the RCU API?

These sections are followed by answers to the Quick Quizzes.

Summary of RCU API additions

Quick Quiz 1: Why is extreme caution required for call_srcu() and srcu_barrier()?
Answer

The largest change to the RCU API since 2010 has been to SRCU (Sleepable RCU), which was reimplemented from scratch by Peter Zijlstra, myself, and finally by Lai Jiangshan. The new implementation offers much lower-latency grace periods (which was important for KVM), and, unlike other RCU implementations, allows readers in the idle loop and even in offline CPUs. In addition, this new SRCU implementation provides the full RCU API, including the call_srcu() and srcu_barrier() functions that were omitted from the previous version. That said, these new APIs should be used with extreme caution.

Another important addition is kfree_rcu(), which allows “fire and forget” freeing of RCU-protected data. Given a structure p with an rcu_head field imaginatively named rh, you can now free a structure pointed to by p as follows:

kfree_rcu(p, rh);

Before kfree_rcu() was available, something like this was instead required:

static void free_by_callback(struct rcu_head *rhp)
{
	struct foo *p = container_of(rhp, struct foo, rh);

	kfree(p);
}

...

	call_rcu(&p->rh, free_by_callback);

Quick Quiz 2: If kfree_rcu() is so popular, why isn't there a kfree_rcu_bh(), kfree_rcu_sched(), or kfree_srcu()? For that matter, why not a kmem_cache_free_rcu()?
Answer

Use of kfree_rcu() thus saves a bit of code, which helps explain why this API now has almost 200 uses in the Linux kernel. Equally important, if your loadable module uses call_rcu(), you will need to invoke rcu_barrier() at module-unload time, as described here. If you can convert all of your module's call_rcu() invocations to kfree_rcu(), then the rcu_barrier() is not needed.

A new bitlocked linked list (hlist_bl_head and hlist_bl_node) was added. The bitlocked linked list is useful when you need a lock associated with each linked list, but memory pressures prohibit associating a spinlock_t with each list. As the name suggests, a bitlocked linked list reduces the size of the lock down to a single bit that is placed in the low-order bit of a pre-existing pointer. Bitlocked linked lists required the RCU-safe accessors hlist_bl_for_each_entry_rcu(), hlist_bl_first_rcu(), hlist_bl_add_head_rcu(), hlist_bl_del_rcu(), hlist_bl_del_init_rcu(), and hlist_bl_set_first_rcu().
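
As a rough sketch of how these pieces fit together (the struct foo, hash_table, and variable names here are all hypothetical), a reader might traverse one hash bucket under RCU protection while an updater serializes using the bucket's embedded bit lock:

struct foo {
	struct hlist_bl_node node;
	int key;
};
static struct hlist_bl_head hash_table[1024];

/* Reader: RCU protects the traversal. */
rcu_read_lock();
hlist_bl_for_each_entry_rcu(p, pos, &hash_table[h], node)
	if (p->key == key)
		break;
rcu_read_unlock();

/* Updater: the low-order bit of the bucket pointer is the lock. */
hlist_bl_lock(&hash_table[h]);
hlist_bl_add_head_rcu(&new_p->node, &hash_table[h]);
hlist_bl_unlock(&hash_table[h]);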

Performance issues in networking led to the addition of RCU_INIT_POINTER(), which can be used in place of rcu_assign_pointer() in a few special cases, omitting rcu_assign_pointer()'s memory barrier and volatile cast. Ugly-code issues led to the addition of RCU_POINTER_INITIALIZER(), which may be used to initialize RCU-protected pointers in structures at compile time.

The rcu_access_index() primitive was requested as an RCU-protected-index counterpart to rcu_access_pointer() for integers used as array indexes. The rcu_access_index() and rcu_access_pointer() functions may be used in cases where the index or pointer will not be dereferenced, for example, when it is just being compared. They therefore need not use smp_read_barrier_depends(); however, given that smp_read_barrier_depends() is free on most platforms, this is not much of a motivation. The real motivation is that these primitives can be safely used outside of RCU read-side critical sections, thus avoiding the need to worry about lockdep-RCU complaints.

The rcu_dereference_raw_notrace(), rcu_is_watching(), and hlist_for_each_entry_rcu_notrace() primitives were needed for special cases in the tracing code. The motivation here is that the tracing code uses RCU, but also needs to be able to trace RCU's primitives. Providing these _notrace variants allows the tracing implementation to more easily avoid self-recursion.

Lower-level list APIs for stepwise traversal of RCU-protected lists were added: list_first_or_null_rcu(), list_next_rcu(), hlist_first_rcu(), hlist_next_rcu(), hlist_pprev_rcu(), hlist_nulls_first_rcu(), and hlist_nulls_next_rcu(). In addition, the hlist-nulls primitive hlist_nulls_del_init_rcu() was added as a counterpart to the hlist hlist_del_init_rcu() primitive.

The rcu_lockdep_assert() primitive allows functions to insist that they be invoked within the specified RCU and locking contexts: Experience indicates that RCU-lockdep splats get the prompt attention required to ensure that such functions are called in the required environment.

Finally, RCU_NONIDLE() may be used to protect RCU read-side critical sections in idle loops, which would otherwise be illegal. RCU_NONIDLE() is not used much because almost all the idle-loop uses of RCU are due to tracing, which supplies trace functions with an _rcuidle suffix for idle-loop use. However, non-tracing uses of RCU within the idle loop should use RCU_NONIDLE(). There is some discussion of restricting the region of idle-loop code that RCU considers to be idle, and if this region becomes small enough, it might be possible to dispense with both RCU_NONIDLE() and the _rcuidle suffix.

The next sections discuss aspects of the RCU API, highlighting recent changes.

RCU has a family of wait-to-finish and data-access APIs

The most straightforward answer to “what is RCU?” is that RCU is an API used in the Linux kernel, as summarized by the big API table and the following discussion. Or, more precisely, RCU is a four-member family of APIs as shown in the table, with one column for each family member and a final column containing generic APIs that apply across the whole family.

If you are new to RCU, you might consider focusing on just one of the columns in the big RCU API table. For example, if you are primarily interested in understanding how RCU is most frequently used in the Linux kernel, “RCU” would be the place to start. On the other hand, if you want to understand RCU for its own sake, “SRCU” has the simplest API. In both cases, you will need to refer to the final “Generic” column. You can always come back to the other columns later. If you are already familiar with RCU, this table can serve as a useful reference.

The green-colored RCU API members are those that existed back in the 1990s, a time when I was under the delusion that I knew all that there is to know about RCU. The blue-colored cells correspond to the RCU API members that are new since the 2010 RCU API documentation came out.

The “RCU” column corresponds to the original RCU implementation, in which RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), which may be nested. RCU-protected data is accessed using rcu_dereference() and rcu_dereference_check(), with the former used within RCU read-side critical sections and the latter used by code shared between readers and updaters. In both cases, the pointers must be C-language lvalues. These read-side APIs are lightweight, although the two data-access APIs must execute a memory barrier on DEC Alpha. RCU's performance and scalability advantages stem from the lightweight nature of these read-side APIs.
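
For concreteness, here is the canonical reader pattern, using a hypothetical RCU-protected global pointer gp:

struct foo *p;

rcu_read_lock();
p = rcu_dereference(gp);
if (p)
	do_something_with(p);	/* Must not block in this flavor of RCU. */
rcu_read_unlock();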

The corresponding synchronous update-side primitive, synchronize_rcu(), along with its synonym synchronize_net(), waits for any currently executing RCU read-side critical sections to complete. The length of this wait is known as a “grace period”. If grace periods are too long for you, synchronize_rcu_expedited() speeds things up by about an order of magnitude, but at the expense of significant CPU overhead and of latency spikes on all CPUs, even the CPUs that are currently idle.

The asynchronous update-side primitive, call_rcu(), invokes a specified function with a specified argument after a subsequent grace period. For example, call_rcu(p,f); will result in the “RCU callback” f(p) being invoked after a subsequent grace period. There are situations, such as when unloading a module that uses call_rcu(), when it is necessary to wait for all outstanding RCU callbacks to complete. The rcu_barrier() primitive does this job. The kfree_rcu() primitive serves as a shortcut for an RCU callback that does nothing but free the structure passed to it. Use of kfree_rcu() can both simplify code and reduce the need for rcu_barrier(). Finally, rcu_read_lock_held() may be used in assertions and lockdep expressions to verify that RCU read-side protection is in fact being provided. This primitive is conservative, and thus can produce false negatives, particularly in kernels built with CONFIG_PROVE_RCU=n.

The “RCU BH” column contains the RCU-bh primitives. RCU-bh differs from RCU in that RCU-bh read-side critical sections (rcu_read_lock_bh() and rcu_read_unlock_bh()) disable bottom-half (i.e. softirq) processing. RCU-bh also features somewhat lower-latency grace periods in CONFIG_PREEMPT=n kernels due to the fact that any point in the code where bottom halves are enabled is an RCU-bh quiescent state. The overall effect is that RCU-bh trades off slightly increased read-side overhead to gain shorter and more predictable grace periods. These shorter grace periods allow the system to avoid out-of-memory (OOM) conditions in the face of network-based denial-of-service attacks.

Quick Quiz 3: What happens if you mix and match RCU and RCU-Sched?
Answer

In the “RCU Sched” column, anything that disables preemption acts as an RCU read-side critical section; however, rcu_read_lock_sched() and rcu_read_unlock_sched() are the official read-side primitives. Other than that, the RCU-sched primitives are analogous to their RCU counterparts, though RCU-sched lacks counterparts to synchronize_net() and kfree_rcu(). This RCU API family was added in the 2.6.12 kernel, which split the old synchronize_kernel() API into the current synchronize_rcu() (for RCU) and synchronize_sched() (for RCU Sched). There are also call_rcu_sched(), synchronize_sched_expedited(), and rcu_barrier_sched() primitives, which are analogous to their “RCU” counterparts.

Quick Quiz 4: Can synchronize_srcu() be safely used within an SRCU read-side critical section? If so, why? If not, why not?
Answer

The "SRCU" column displays a specialized RCU API that permits general sleeping in RCU read-side critical sections, as was described in the LWN article “Sleepable RCU”. SRCU is also the only RCU flavor whose read-side primitives may be freely invoked from the idle loop and from offline CPUs. Of course, use of synchronize_srcu() in an SRCU read-side critical section can result in self-deadlock, so it should be avoided. SRCU differs from earlier RCU implementations in that the caller allocates an srcu_struct for each distinct SRCU usage, which must either be statically allocated using DEFINE_SRCU() or be initialized after dynamic allocation using init_srcu_struct().

Quick Quiz 5: Why isn't there an smp_mb__after_rcu_read_unlock(), smp_mb__after_rcu_bh_read_unlock(), or smp_mb__after_rcu_sched_read_unlock()?
Answer

This approach prevents SRCU read-side critical sections from blocking unrelated synchronize_srcu() invocations. In addition, in this variant of RCU, srcu_read_lock() returns a value that must be passed into the corresponding srcu_read_unlock(). There is also an smp_mb__after_srcu_read_unlock() that, when combined with an immediately prior srcu_read_unlock(), provides a full memory barrier. Finally, as with RCU-bh and RCU-sched, there is no counterpart to RCU's synchronize_net() and kfree_rcu() primitives.
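
Putting the pieces together, a minimal SRCU sketch (the srcu_struct named ss, the gp pointer, and the helper functions are hypothetical) might look as follows:

DEFINE_SRCU(ss);

int idx;

/* Reader, which is permitted to block. */
idx = srcu_read_lock(&ss);
p = srcu_dereference(gp, &ss);
if (p)
	do_something_that_might_sleep(p);
srcu_read_unlock(&ss, idx);

/* Updater, which waits only for readers of this srcu_struct. */
rcu_assign_pointer(gp, new_p);
synchronize_srcu(&ss);
kfree(old_p);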

The final column contains a few additional RCU APIs that apply equally to all four flavors.

The following primitives do initialization:

  • RCU_INIT_POINTER() may be used instead of rcu_assign_pointer() to assign a value to an RCU-protected pointer in a few special cases where reordering from both the compiler and the CPU can be tolerated. These special cases are as follows:

    • You are assigning NULL to the pointer, or
    • You have prevented RCU readers from accessing the pointer, for example, during initialization when RCU readers do not yet have a path to the pointer in question, or
    • The pointed-to data structure whose pointer is being assigned has already been exposed to readers, and

      • You have not made any reader-visible changes to the pointed-to data structure since then, or
      • It is OK for readers to see the old state of the structure.
      An example of this third case is when removing an element from an RCU-protected linked list, in which case that element's successor has already been exposed to readers.
  • RCU_POINTER_INITIALIZER() is used for compile-time initialization of RCU-protected pointers within a structure. (A brief sketch of both primitives follows this list.)
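
A brief sketch combining the two primitives, using hypothetical foo and bar structures:

struct foo {
	int a;
};

struct bar {
	struct foo __rcu *fp;
};

static struct foo default_foo;

/* Compile-time initialization of an RCU-protected pointer. */
static struct bar b = {
	RCU_POINTER_INITIALIZER(fp, &default_foo),
};

/* NULL assignment: the first of the special cases listed above. */
RCU_INIT_POINTER(b.fp, NULL);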

The following primitives access RCU-protected data:

  • rcu_access_index() fetches an RCU-protected index in cases where ordering is not required, for example, when the only use of the value fetched is for a comparison.
  • rcu_access_pointer() fetches an RCU-protected pointer in cases where ordering is not required. This primitive may be used instead of one of the rcu_dereference() group of primitives when only the value of the RCU-protected pointer is used without being dereferenced; for example, the RCU-protected pointer might simply be compared against NULL. There is, therefore, no need to protect against concurrent updates, and there is also no need to be under the protection of rcu_read_lock() and friends.
  • rcu_dereference_index_check() fetches an RCU-protected index, and takes a lockdep expression identifying the locks and types of RCU that protect the access. Unfortunately, this primitive also disables sparse-based checking, so it is possible that this primitive will be deprecated in the future; any remaining uses should shift from using indexes to the equivalent pointers. (So if you know of an RCU-protected index that cannot be easily converted to an RCU-protected pointer, this would be a really good time to speak up.)
  • rcu_dereference_protected() is used to access RCU-protected pointers from update-side code. Because the update-side code is using some other synchronization mechanism (locks, atomic operations, single updater thread, etc.), it does not need to put RCU read-side protections in place. This primitive also takes a lockdep expression that can be used to assert that the right locks are held and that any other necessary conditions hold. (A brief sketch follows this list.)
  • rcu_dereference_raw() disables lockdep checking, which allows it to be used in cases where the lockdep correctness condition cannot be expressed in a reasonably simple way. For example, the RCU list macros might be protected by any combination of RCU flavors and locks, so they use rcu_dereference_raw(). That said, some _bh list-macro variants have appeared, so it is possible that lockdep-enabled variants of these macros will appear in the future. However, where you use rcu_dereference_raw(), please include a comment saying why its use is safe and why other forms of rcu_dereference() cannot be used.
  • rcu_dereference_raw_notrace() is similar to rcu_dereference_raw(), but additionally disables function tracing.
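
Two of these deserve a brief sketch (the gp pointer and my_lock are hypothetical):

/* rcu_access_pointer(): the value is only tested, never dereferenced,
 * so neither rcu_read_lock() nor any ordering is required. */
if (rcu_access_pointer(gp) == NULL)
	return -ENOENT;

/* rcu_dereference_protected(): update-side access, with lockdep
 * verifying that my_lock really is held. */
p = rcu_dereference_protected(gp, lockdep_is_held(&my_lock));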

The following primitive updates RCU-protected data:

  • rcu_assign_pointer() acts like an assignment statement, but does debug checking and enforces ordering on both compiler and CPU as needed.
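
The canonical update-side sequence therefore pairs rcu_dereference_protected() and rcu_assign_pointer() with a deferred free. In this sketch (a fragment of a hypothetical update function), gp, my_lock, and struct foo, with its rcu_head imaginatively named rh as in the earlier kfree_rcu() example, are all assumptions:

new_p = kmalloc(sizeof(*new_p), GFP_KERNEL);
if (!new_p)
	return -ENOMEM;
new_p->a = 1;

spin_lock(&my_lock);
old_p = rcu_dereference_protected(gp, lockdep_is_held(&my_lock));
rcu_assign_pointer(gp, new_p);	/* Publish: orders initialization first. */
spin_unlock(&my_lock);

if (old_p)
	kfree_rcu(old_p, rh);	/* Free only after a grace period. */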

Finally, the following primitives do validation:

  • __rcu is used to tag RCU-protected pointers, allowing sparse to check for misuse of such pointers.
  • init_rcu_head_on_stack() initializes an on-stack rcu_head structure for debug-objects use. The debug-objects subsystem checks for memory-allocation usage bugs, for example, double-kfree(). If the kernel is built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, this checking is extended to double call_rcu(). Although debug-objects automatically sets up its state for global variables and heap memory, explicit setup is required for on-stack variables, hence init_rcu_head_on_stack().
  • destroy_rcu_head_on_stack() must be used on any on-stack variable passed to init_rcu_head_on_stack() before returning from the function containing that on-stack variable.
  • init_rcu_head() and destroy_rcu_head() also initialize objects for debug-objects use. These are normally not needed because the first call to call_rcu() will implicitly set up debug-objects state for non-stack memory. However, if that call_rcu() occurs in the memory allocator or in some other function used by debug-objects, this implicit call_rcu()-time invocation can result in deadlock. Functions called by debug-objects that also use call_rcu() should therefore manually invoke init_rcu_head() during initialization in order to break such deadlocks.
  • rcu_is_watching() checks to see if the current code may legally contain RCU read-side critical sections. Examples of places where RCU read-side critical sections are not legal include the idle loop (but see RCU_NONIDLE() below) and offline CPUs. Note that SRCU read-side critical sections are legal anywhere, including in the idle loop and from offline CPUs.
  • rcu_lockdep_assert() is used to verify that the code has the needed protection. For example:
        rcu_lockdep_assert(rcu_read_lock_held(), "Need rcu_read_lock()!");
    
    is a way to enforce the toothless comments stating that the current function must be invoked within an RCU read-side critical section. But please note that the kernel must be built with CONFIG_PROVE_RCU=y for this enforcement to take effect.
  • RCU_NONIDLE() takes a C statement as its argument. It informs RCU that this CPU is momentarily non-idle, executes the statement, then informs RCU that this CPU is once again idle. Note that event tracing uses RCU, which means that if you are doing event tracing from the idle loop, you must use the _rcuidle form of the tracing functions, for example: trace_rcu_dyntick_rcuidle().
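
For instance, a non-tracing use from within the idle loop might look like this (do_idle_work() is hypothetical):

/* Momentarily mark this CPU non-idle so that the RCU read-side
 * critical section within do_idle_work() is legal. */
RCU_NONIDLE(do_idle_work());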

The Linux kernel currently has a surprising number of RCU APIs and implementations. There is some hope of reducing this number, but careful inspection and analysis will be required before removing either an implementation or any API members, just as would be required before removing one of the many locking APIs in the kernel. Besides which, recent trends have been very much in the opposite direction. The next section describes RCU's list-based APIs, which have seen some growth since 2010.

RCU has list-based publish-subscribe and version-maintenance APIs

Although in theory rcu_dereference() and rcu_assign_pointer() are sufficient to implement pretty much any data structure, in practice this approach would be time-consuming and error-prone. RCU therefore provides specialized list-based publish-subscribe and version-maintenance APIs. Fortunately, most of RCU's list-based publish-subscribe and version-maintenance primitives shown in this table apply to all of the variants of RCU discussed above. This commonality can, in some cases, allow more code to be shared, which certainly reduces the API proliferation that would otherwise occur. However, it is quite likely that software-engineering considerations will eventually result in variants of these list-handling primitives that are specialized for each given flavor of RCU, as has, in fact, happened with hlist_for_each_entry_rcu_bh() and hlist_for_each_entry_continue_rcu_bh().

The APIs in the first column of the table operate on the Linux kernel's struct list_head lists, which are circular, doubly-linked lists. These primitives permit lists to be modified in the face of concurrent traversals by readers. The list-traversal primitives are implemented with simple instructions, so are extremely lightweight, although they also execute a memory barrier on DEC Alpha. The list-update primitives that add elements to a list incur memory-barrier overhead, while those that only remove elements from a list are implemented using simple instructions. The list_splice_init_rcu() primitive incurs not only memory-barrier overhead, but also grace-period latency, and is therefore the only blocking primitive shown in the table.

The APIs in the second column of the table operate on the Linux kernel's struct hlist_head, which is a linear doubly linked list. One advantage of struct hlist_head over struct list_head is that the former requires only a single-pointer list header, which can save significant memory in large hash tables. The struct hlist_head primitives in the table relate to their non-RCU counterparts in much the same way as do the struct list_head primitives. Their overheads are similar to those of their struct list_head counterparts in the first two categories in the table.

Quick Quiz 6: Why would anyone need to distinguish lists based on their NULL pointers? Why not just remember which list you started searching?
Answer

The APIs in the third column of the table operate on Linux-kernel hlist-nulls lists, which are made up of hlist_nulls_head and hlist_nulls_node structures. These lists have special multi-valued NULL pointers, which have the low-order bit set to 1 with the upper bits available to the programmer to distinguish different lists. There are hlist-nulls interfaces for non-RCU-protected lists as well.

A major advantage of hlist-nulls lists is that updaters can free elements to SLAB_DESTROY_BY_RCU slab caches without waiting for an RCU grace period to elapse. However, readers must be extremely careful when traversing such lists: Not only must they conduct their searches within a single RCU read-side critical section, but, because any element might be freed and then reallocated at any time, readers must also validate each element that they encounter during their traversal.

Quick Quiz 7: Why is there no hlist_nulls_add_tail_rcu()?
Answer

The APIs in the fourth and final column of the table operate on Linux-kernel hlist-bitlocked lists, which are made up of hlist_bl_head and hlist_bl_node structures. These lists use the low-order bit of the pointer to the first element as a lock, which allows per-bucket locks on large hash tables while still maintaining a reasonable memory footprint.

List Initialization

The new INIT_LIST_HEAD_RCU() API member allows a normal list to be initialized even when there are concurrent readers. This is useful for constructing list-splicing functions.

Full Traversal

The macros for full list traversal must be used within an RCU read-side critical section. These macros map to a C-language for loop, just as their non-RCU counterparts do.

The individual API members are as follows, with a usage sketch after the list:

  • list_for_each_entry_rcu(): Iterate over an RCU-protected list from the beginning.
  • hlist_for_each_entry_rcu(): Iterate over an RCU-protected hlist from the beginning.
  • hlist_for_each_entry_rcu_bh(): Iterate over an RCU-bh-protected hlist from the beginning.
  • hlist_for_each_entry_rcu_notrace(): Iterate over an RCU-protected hlist from the beginning, but disable tracing. This macro can thus be safely used within the tracing code.
  • hlist_nulls_for_each_entry_rcu(): Iterate over an RCU-protected hlist-nulls from the beginning.
  • hlist_bl_for_each_entry_rcu(): Iterate over an RCU-protected bitlocked hlist from the beginning.
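
For example, a reader might scan an RCU-protected list as follows (struct foo, its list member, and the head and key variables are hypothetical):

rcu_read_lock();
list_for_each_entry_rcu(p, &head, list) {
	if (p->key == key) {
		do_something_with(p);
		break;
	}
}
rcu_read_unlock();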

Resume Traversal

The macros for resuming list traversal must also be used within an RCU read-side critical section. It is the caller's responsibility to make sure that the list element where traversal has started remains in existence, and the usual way to do this is to make sure that the same RCU read-side critical section covers both the original traversal and the resumption.

The individual API members are as follows:

  • list_for_each_entry_continue_rcu(): Iterate over an RCU-protected list from the specified element.
  • hlist_for_each_entry_continue_rcu(): Iterate over an RCU-protected hlist from the specified element.
  • hlist_for_each_entry_continue_rcu_bh(): Iterate over an RCU-bh-protected hlist from the specified element.

Stepwise Traversal

The macros for doing stepwise traversal are used when more control is required. When repeatedly invoking these macros to step through a list, the full set of macro invocations must be enclosed in an RCU read-side critical section. If you must split the traversal into more than one RCU read-side critical section, you must also do something else to guarantee the existence of the relevant elements during the time between successive RCU read-side critical sections.

The individual API members are as follows, with a short example after the list:

  • list_entry_rcu(): Given a pointer to a list_head structure, return a pointer to the enclosing element.
  • list_first_or_null_rcu(): Return a pointer to the first element on an RCU-protected list, or NULL if the list is empty.
  • list_next_rcu(): Return a pointer to the next element on the list, which must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader.
  • hlist_first_rcu(): Return a pointer to the first element on an RCU-protected hlist, or NULL if the hlist is empty. If non-NULL, it must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader.
  • hlist_next_rcu(): Return a pointer to the next element on an RCU-protected hlist, or NULL if at the end of the hlist. If non-NULL, it must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader.
  • hlist_pprev_rcu(): Return a pointer to the previous element's pointer to the current element. This must not be used by RCU readers because the ->pprev pointer is poisoned at deletion time.
  • hlist_nulls_first_rcu(): Return a pointer to the first element on an RCU-protected hlist-nulls, or NULL if the hlist is empty. If non-NULL, it must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader.
  • hlist_nulls_next_rcu(): Return a pointer to the next element on an RCU-protected hlist-nulls, or NULL if at the end of the hlist. If non-NULL, it must be handed to one of the rcu_dereference() family of primitives if used by an RCU reader.
  • hlist_bl_first_rcu(): Return a pointer to the first element on the bitlocked hlist, applying rcu_dereference(). The caller must either be in an RCU read-side critical section or have locked the hlist. Note that this cannot be an RCU-bh, RCU-sched, or SRCU read-side critical section; only an RCU read-side critical section will do.
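
As a simple stepwise example, the following examines only the first element of a list, if there is one (names again hypothetical):

rcu_read_lock();
p = list_first_or_null_rcu(&head, struct foo, list);
if (p)
	do_something_with(p);
rcu_read_unlock();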

List Addition

The list-addition APIs require some form of mutual exclusion, for example, locking or a single designated updater task; a brief example follows the list.

The individual API members are as follows:

  • list_add_rcu(): Add an element at the head of the RCU-protected list.
  • list_add_tail_rcu(): Add an element at the tail of the RCU-protected list.
  • hlist_add_after_rcu(): Add an element after the specified element in the RCU-protected hlist.
  • hlist_add_before_rcu(): Add an element before the specified element in the RCU-protected hlist.
  • hlist_add_head_rcu(): Add an element to the beginning of the RCU-protected hlist.
  • hlist_nulls_add_head_rcu(): Add an element to the beginning of the RCU-protected hlist-nulls.
  • hlist_bl_add_head_rcu(): Add an element to the beginning of the RCU-protected bitlocked hlist.
  • hlist_bl_set_first_rcu(): Add an element to the beginning of the RCU-protected bitlocked hlist, which must initially be empty.
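
For example, adding an element under a hypothetical mylock looks like this:

spin_lock(&mylock);
list_add_rcu(&p->list, &head);	/* Readers may see p immediately. */
spin_unlock(&mylock);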

List Deletion

The list-deletion APIs also require some form of mutual exclusion, for example, using locking or a single designated updater task.

The individual API members are as follows:

  • list_del_rcu(): Delete the specified element from its RCU-protected list.
  • hlist_del_rcu(): Delete the specified element from its RCU-protected hlist.
  • hlist_del_init_rcu(): Delete the specified element from its RCU-protected hlist, initializing its pointer to form an empty list.
  • hlist_nulls_del_rcu(): Delete the specified element from its RCU-protected hlist-nulls.
  • hlist_nulls_del_init_rcu(): Delete the specified element from its RCU-protected hlist-nulls, initializing its pointer to form an empty list.
  • hlist_bl_del_rcu(): Delete the specified element from its RCU-protected bitlocked hlist.
  • hlist_bl_del_init_rcu(): Delete the specified element from its RCU-protected bitlocked hlist, initializing its pointer to form an empty list.
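
Deletion typically pairs with a deferred free, as in this sketch (again assuming a hypothetical struct foo containing an rcu_head named rh):

spin_lock(&mylock);
list_del_rcu(&p->list);		/* Readers may still be referencing p... */
spin_unlock(&mylock);
kfree_rcu(p, rh);		/* ...so defer the free until they finish. */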

List Replacement

The list-replacement APIs replace an existing element with a new version. The caller is responsible for disposing of the old (existing) element. These also require some form of mutual exclusion, for example, using locking or a single designated updater task.

The individual API members are as follows:

  • list_replace_rcu(): Replace an element in an RCU-protected list.
  • hlist_replace_rcu(): Replace an element in an RCU-protected hlist.
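
A replacement sketch following the classic read-copy-update sequence (all names hypothetical):

new_p = kmalloc(sizeof(*new_p), GFP_KERNEL);
if (!new_p)
	return -ENOMEM;
*new_p = *old_p;	/* Copy the old version... */
new_p->a = 1;		/* ...then update the copy. */

spin_lock(&mylock);
list_replace_rcu(&old_p->list, &new_p->list);
spin_unlock(&mylock);
kfree_rcu(old_p, rh);	/* Dispose of the old version. */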

List Splice

The list-splice APIs join a pair of lists into a single combined list. This also requires some form of mutual exclusion, for example, using locking or a single designated updater task.

The sole API member is as follows:

  • list_splice_init_rcu(): Splice a pair of lists together, initializing the source list to empty. Note that this primitive waits for a grace period to elapse. You determine the RCU flavor by passing in the corresponding grace-period-wait primitive, for example, synchronize_rcu() for RCU and synchronize_rcu_bh() for RCU-bh.
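
For example, given hypothetical src and dst list heads protected by vanilla RCU:

/* Blocks for a grace period, so cannot be called with spinlocks held
 * or from within an RCU read-side critical section. */
list_splice_init_rcu(&src, &dst, synchronize_rcu);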

Kernel configuration parameters

RCU's kernel configuration parameters can be considered to be part of the RCU API, most especially from the viewpoint of someone building a kernel intended for a specialized device or workload. This section summarizes the RCU-related configuration parameters.

The first set of Kconfig parameters controls the underlying behavior of the RCU implementation itself, and is defined in init/Kconfig.

  • CONFIG_PREEMPT=n and CONFIG_SMP=y imply CONFIG_TREE_RCU, selecting the non-preemptible tree-based RCU implementation that is appropriate for server-class SMP builds. It can accommodate a very large number of CPUs, but scales down sufficiently well for all but the most memory-constrained systems. CONFIG_TREE_RCU provides the following boot parameters:
    • rcutree.blimit= sets the maximum number of RCU callbacks to process in one batch, which defaults to ten callbacks. This limit does not apply to offloaded CPUs.
    • rcutree.qhimark= sets the threshold of queued RCU callbacks beyond which rcutree.blimit will be ignored. This defaults to 10,000 callbacks.
    • rcutree.qlowmark= sets the threshold of queued RCU callbacks below which rcutree.blimit will once again have effect. This defaults to 100 callbacks.
    • rcutree.jiffies_till_first_fqs= sets the number of jiffies to wait between grace-period initialization and the first force-quiescent-state scan that checks (among other things) for idle CPUs. The default depends on the value of HZ and the number of CPUs on the system. All systems wait for at least one jiffy, with one additional jiffy for HZ greater than 250, an additional jiffy for HZ greater than 500, and one additional jiffy for each 256 CPUs on the system. For example, a 512-CPU system running with HZ=1000 would default to 1+1+1+2=5 jiffies.
    • rcutree.jiffies_till_next_fqs= sets the number of jiffies to wait between successive force-quiescent-state scans. The default is the same as for rcutree.jiffies_till_first_fqs.
    • rcutree.jiffies_till_sched_qs= sets the number of jiffies that a grace period will wait before soliciting help from rcu_note_context_switch(), which will involve sending IPIs to the holdout CPUs.
    • rcupdate.rcu_expedited= causes normal grace-period primitives to act like their expedited counterparts. For example, invoking synchronize_rcu() will act as if synchronize_rcu_expedited() had been invoked.
  • CONFIG_PREEMPT=y implies CONFIG_TREE_PREEMPT_RCU, selecting the preemptible tree-based RCU implementation that is appropriate for real-time and low-latency SMP builds. It can also accommodate a very large number of CPUs, and also scales down sufficiently well for all but the most memory-constrained systems. The boot parameters for CONFIG_TREE_RCU also apply to CONFIG_TREE_PREEMPT_RCU.
  • CONFIG_PREEMPT=n and CONFIG_SMP=n imply CONFIG_TINY_RCU, selecting the non-preemptible uniprocessor RCU implementation that is appropriate for non-real-time UP builds. It has the smallest memory footprint of any of the current in-kernel RCU implementations. In fact, its memory footprint is so small that it doesn't even have any kernel boot parameters.

Quick Quiz 8: Why doesn't the rcupdate.rcu_expedited= boot parameter also apply to CONFIG_TINY_RCU?
Answer

The second set of Kconfig parameters controls RCU's energy-efficiency features. These are also defined in init/Kconfig.

  • CONFIG_RCU_FAST_NO_HZ=y improves RCU's energy efficiency by reducing the number of times that RCU wakes up idle CPUs. The downside of this approach is that it increases RCU grace-period latency somewhat.
    • rcutree.rcu_idle_gp_delay= specifies the number of jiffies an idle CPU with callbacks should remain idle before rechecking RCU state. The default is four jiffies.
    • rcutree.rcu_idle_lazy_gp_delay= specifies the number of jiffies an idle CPU with callbacks, where all callbacks are lazy, should remain idle before rechecking RCU state. (A “lazy” callback is one that RCU knows will do nothing other than free memory.) The default is six seconds, or 6*HZ jiffies.
  • CONFIG_RCU_NOCB_CPU=y fortuitously improves RCU's energy efficiency [PDF] by eliminating wakeups due to RCU callback processing. However, it was intended for real-time use, so it is covered in the next section.

Please note that these features do not pertain to CONFIG_TINY_RCU, whose job description emphasizes small memory footprint over energy efficiency.

The third set of Kconfig parameters controls RCU's real-time features, which are also defined in init/Kconfig.

  • CONFIG_RCU_NOCB_CPU=y allows callback processing to be offloaded from selected CPUs (“NOCB” stands for “no callbacks”) onto rcuo kthreads. The CPUs to offload can be specified at boot time, or by one of the following Kconfig parameters, all of which depend on CONFIG_RCU_NOCB_CPU:

    • CONFIG_RCU_NOCB_CPU_NONE=y: No CPUs will be offloaded unless specified at boot time.
    • CONFIG_RCU_NOCB_CPU_ZERO=y: CPU 0 will be offloaded, and other CPUs can be specified as offloaded at boot time.
    • CONFIG_RCU_NOCB_CPU_ALL=y: All CPUs will be offloaded.

    The following kernel boot parameters also apply to callback offloading:

    • rcu_nocbs= may be used to specify offloaded CPUs at boot time. For example, rcu_nocbs=1-3,7 would cause CPUs 1, 2, 3, and 7 to have their callbacks offloaded to rcuo kthreads. The set of offloaded CPUs cannot be changed at runtime; in every case thus far where runtime changes might have been useful, simply offloading all CPUs has sufficed.
    • rcu_nocb_poll= also relieves the offloaded CPUs of the need to do wakeup operations. On the other hand, this means that all of the rcuo kthreads must poll, which probably is not what you want on a battery-powered system.
    • rcutree.rcu_nocb_leader_stride= sets the number of NOCB kthread groups, which defaults to the square root of the number of CPUs. Larger numbers reduce the wakeup overhead on the per-CPU grace-period kthreads, but increase that same overhead on each group's leader.
  • CONFIG_NO_HZ_FULL=y implies CONFIG_RCU_USER_QS, which causes RCU to treat userspace execution as an extended quiescent state similar to RCU's handling of dyntick-idle CPUs.
  • CONFIG_RCU_BOOST=y enables RCU priority boosting. This could be considered a debugging option, but it is one that pertains primarily to real-time kernels, so is included in the real-time section. This Kconfig parameter causes blocked RCU readers to be priority-boosted in order to avoid indefinite prolongation of the current RCU grace period. The following Kconfig parameters control the boosting process:

    • CONFIG_RCU_BOOST_PRIO= specifies the real-time priority to boost to, and defaults to priority one, the least-important real-time priority level. You should set this priority level to be greater than the highest-priority real-time CPU-bound thread. The default priority is appropriate for the common case where there are no CPU-bound threads running at real-time priority.
    • CONFIG_RCU_BOOST_DELAY= specifies how long RCU will allow a grace period to be delayed before starting RCU priority boosting. The default is 300 milliseconds, which seems to work quite well in practice.

Please note that these Kconfig options do not pertain to CONFIG_TINY_RCU, which again is focused on small memory footprint, even at the expense of real-time response.

The fourth set of Kconfig parameters may also be specified to tune the data-structure layout of CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU:

  • CONFIG_RCU_FANOUT= controls the fanout of non-leaf nodes of the tree. Lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values have always sufficed with the exception of RCU stress testing.
  • CONFIG_RCU_FANOUT_LEAF= controls the fanout of leaf nodes of the tree. As for CONFIG_RCU_FANOUT, lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values are sufficient in most cases, but very large systems (thousands of CPUs) will want to set this to the largest possible value, namely 64. Such systems will also need to boot with skew_tick=1 to avoid massive lock contention on the leaf rcu_node ->lock fields. This fanout can also be set at boot time:

    • rcutree.rcu_fanout_leaf= sets the number of CPUs to assign to each leaf-level rcu_node structure. This defaults to 16 CPUs. Very large systems (many hundreds of CPUs) can benefit from setting this to its maximum (64 on 64-bit systems), but such systems should also set skew_tick=1.
  • CONFIG_RCU_FANOUT_EXACT=y forces the tree to be as balanced as possible. Again, to the best of my knowledge, the default values have always been sufficient.

The fifth set of kernel configuration parameters controls debugging options:

  • CONFIG_RCU_TRACE=y enables debugfs-based tracing as well as event tracing. In CONFIG_TINY_RCU kernels, it also enables RCU CPU stall warnings; see CONFIG_RCU_CPU_STALL_TIMEOUT below for more details.
  • CONFIG_SPARSE_RCU_POINTER=y enables sparse-based checks of proper use of RCU-protected pointers. Please note that this is a build-time check: use “make C=1” to cause sparse to check source files that would have been rebuilt by “make”, and use “make C=2” to cause sparse to unconditionally check source files.
  • CONFIG_DEBUG_OBJECTS_RCU_HEAD=y enables debug-objects checking of multiple invocations of call_rcu() (and friends) on the same structure.
  • CONFIG_PROVE_RCU=y enables lockdep-RCU checking. If CONFIG_PROVE_RCU_REPEATEDLY is also specified, then the lockdep-RCU checking can output multiple lockdep-RCU “splats”, otherwise only a single lockdep-RCU splat will be emitted per boot.
  • CONFIG_RCU_TORTURE_TEST=y or CONFIG_RCU_TORTURE_TEST=m enables RCU torture testing. This is a tri-state parameter, permitting rcutorture.c to be compiled into the kernel, built as a module, or omitted entirely. When rcutorture.c is built into the kernel (CONFIG_RCU_TORTURE_TEST=y), then CONFIG_RCU_TORTURE_TEST_RUNNABLE starts RCU torture testing during boot. Please don't try this on production systems.
  • CONFIG_RCU_CPU_STALL_TIMEOUT= specifies the maximum grace-period duration that RCU will tolerate without complaint. Excessively long grace periods are usually caused by CPUs or tasks failing to find their way out of an RCU read-side critical section in a timely manner. CPU stalls can be caused by a number of bugs, as described in Documentation/RCU/stallwarn.txt. This Kconfig variable defaults to 21 seconds.

    • rcupdate.rcu_cpu_stall_suppress= suppresses RCU CPU stall warning messages.
    • rcupdate.rcu_cpu_stall_timeout= overrides the build-time CONFIG_RCU_CPU_STALL_TIMEOUT setting.
  • CONFIG_RCU_CPU_STALL_VERBOSE=y causes detailed per-task information to be printed when a CPU stall is encountered.
  • CONFIG_RCU_CPU_STALL_INFO=y prints out additional per-CPU diagnostic information when a CPU stall is encountered.

If you are working with code that uses RCU, please do us all a favor and test that code with CONFIG_PROVE_RCU and CONFIG_DEBUG_OBJECTS_RCU_HEAD enabled. Please also consider running sparse with CONFIG_SPARSE_RCU_POINTER. If you are modifying the RCU implementation itself, you will need to run rcutorture, with multiple runs covering the relevant kernel configuration parameters. A one-hour rcutorture run on an 8-CPU machine qualifies as light rcutorture testing. The automated scripts invoked by tools/testing/selftests/rcutorture/bin/kvm.sh can be quite helpful.

Yes, running extra tests can be a hassle, but I am here to tell you that extra testing is much easier than trying to track down bugs in your RCU code.

How did those 2010 predictions turn out?

How good were those old predictions?
  • “Complete implementation of RCU priority boosting (TINY_RCU submission slated for 2.6.38, TREE_RCU implementation in progress).”

    RCU priority boosting is now in mainline, although the TINY_PREEMPT_RCU implementation is now gone entirely. Uniprocessor preemptible kernels now use TREE_PREEMPT_RCU. Uniprocessor non-preemptible kernels still use TINY_RCU for its small memory footprint.

  • “Support for running certain types of user applications without scheduler ticks. There was a spirited discussion on this topic at Linux Plumbers Conference, but at this writing it appears that only small tweaks to RCU will be required.”

    This work is mostly completed, with many thanks to Frédéric Weisbecker for doing much of the heavy lifting. The RCU work included unplanned support for callback offloading and energy efficiency.

  • “Merge the implementation of SRCU into TINY_RCU and TREE_RCU. A design for this is mostly in place. This effort is likely to result in call_srcu() and srcu_barrier(). If it does, please be very careful with these primitives!!!”

    This never happened. Instead, SRCU was rewritten from scratch by Peter Zijlstra, Lai Jiangshan, and myself, with Lai's implementation being accepted into the kernel. That said, you should still be very careful with the new call_srcu() and srcu_barrier() primitives.

  • “Make RCU_FAST_NO_HZ work for TREE_PREEMPT_RCU.”

    This work was completed. In fact, the RCU_FAST_NO_HZ code was reworked several times, culminating in the current implementation, which actually provides measurable energy savings on real hardware.

  • “Drive the choice between TINY_PREEMPT_RCU, TINY_RCU, TREE_PREEMPT_RCU, and TREE_RCU entirely off of SMP and PREEMPT. This would allow cutting code and test scenarios, but first TINY_PREEMPT_RCU must prove itself.”

    This is done. Now only the most dedicated kernel hackers manually choose the RCU implementation, editing the Kconfig files to impose their will on the kernel.

  • “It is possible that rcu_dereference_index_check() will be retired if it is reasonable to convert all current use of RCU-protected indexes into RCU-protected pointers.”

    Nope, rcu_dereference_index_check() is still with us.

  • “It is quite possible that large systems might encounter problems with synchronize_rcu_expedited() scalability.”

    There have been complaints about the realtime latency impact of starting new normal grace periods (problem described here [PDF], fix described here), but no complaints about the scalability of synchronize_rcu_expedited(), so no scalability work on expedited grace periods has been done.

  • “Make RCU be more aggressive about entering dyntick-idle state when running on lightly loaded systems with four or more CPUs.”

    The aforementioned RCU_FAST_NO_HZ and callback-offloading work has done just this.

So, as predictions go, they weren't too bad. :–)

It is also illuminating to list the unexpected changes. Some of these are hinted at above, but bear repeating:

  • Getting rid of TINY_PREEMPT_RCU.
  • Callback offloading.
  • Although the need for increased energy efficiency was no surprise, the need for energy efficiency while running applications without scheduler-clock ticks certainly was.
  • The need for SRCU to be usable in the idle loop and from offline CPUs.
  • The need for deep sub-millisecond realtime response on systems with 4096 CPUs.
  • The need to handle interrupts that never return, as well as the need to return from interrupts that never happened.
  • The fact that the RCU CPU stall warning code needed to be written with just as much careful attention to concurrency as the rest of RCU. My attitude in 2010 was that if the stall had been proceeding for a full 60 seconds, there was no need to be dainty. I have spent much time in the intervening four years learning, one bug at a time, just how much daintiness is required.
  • The heavy use of RCU from within the idle loop, which resulted in the RCU_NONIDLE() macro and the _rcuidle suffix for event tracing.
  • That I would have eliminated all day-one bugs from RCU. Of course, the way I eliminated all day-one bugs was to eliminate all the day-one code, which some might consider to be cheating.

What next for the RCU API?

The most honest answer is that I do not know. The next steps for the RCU API will be decided as they always have been, by the needs of RCU's users and by the limits of the technology at the time. That said, the following seem to be a few of the more likely directions:

  • It is possible that rcu_dereference_index_check() will be retired if it is reasonable to convert all current use of RCU-protected indexes into RCU-protected pointers. Yes, I am doubling down on this one.

  • It is quite possible that large systems might encounter problems with synchronize_rcu_expedited() scalability. I am doubling down on this one as well, and extending it to normal grace-period operations. For example, it might be necessary to parallelize grace-period operations. For another example, it might be necessary to make synchronize_rcu_expedited() stop interrupting dyntick-idle CPUs.

  • Additional diagnostics will be required, for example, detecting pointers leaking out of RCU read-side critical sections.

  • A kmem_cache_free_rcu() counterpart to kfree_rcu() will likely be required.

  • Inlining of TREE_PREEMPT_RCU's rcu_read_lock() primitive.

But if the past is any guide, new use cases and workloads will place unanticipated demands on RCU.

Acknowledgments

We are all indebted to a huge number of people who have used, abused, poked at, and otherwise helped to improve the RCU API. I am grateful to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Why is extreme caution required for call_srcu() and srcu_barrier()?

Answer: Because SRCU readers are allowed to block indefinitely, these two primitives might take a long time to return, if they return at all. So if you use either call_srcu() or srcu_barrier(), it is your responsibility to make sure that readers complete in a timely fashion, lest large numbers of call_srcu() calls OOM the kernel or your srcu_barrier() refuse to ever return. Or, alternatively, it is your responsibility to make sure that your code does not care that call_srcu() and srcu_barrier() take forever to find the end of the grace period.

Back to Quick Quiz 1.

Quick Quiz 2: If kfree_rcu() is so popular, why isn't there a kfree_rcu_bh(), kfree_rcu_sched(), or kfree_srcu()? For that matter, why not a kmem_cache_free_rcu()?

Answer: Because there are a lot fewer uses of call_rcu_bh(), call_rcu_sched(), and call_srcu() than there are of call_rcu(), there was much greater motivation for kfree_rcu() than for the others. In particular, there are only about four uses of call_rcu_bh() that could instead use a kfree_rcu_bh(), only one use of call_rcu_sched() that could instead use a kfree_rcu_sched(), and no uses of call_srcu() that could instead use a kfree_srcu(). Should any of these reach ten uses, it might be time to introduce the corresponding kfree_-style primitive.

A kmem_cache_free_rcu() might make sense given a sufficiently simple and fast implementation. There are not yet enough uses for any of kmem_cache_free_rcu_bh(), kmem_cache_free_rcu_sched(), or kmem_cache_free_srcu() to be worth adding.

Back to Quick Quiz 2.

Quick Quiz 3: What happens if you mix and match RCU and RCU-Sched?

Answer: In a CONFIG_TREE_RCU or a CONFIG_TINY_RCU kernel, mixing these two works "by accident" because in those kernel builds, RCU and RCU-Sched map to the same implementation. However, this mixture is fatal in CONFIG_TREE_PREEMPT_RCU builds, due to the fact that RCU's read-side critical sections can then be preempted, which would permit synchronize_sched() to return before the RCU read-side critical section reached its rcu_read_unlock() call. This could, in turn, result in a data structure being freed before the read-side critical section was finished with it, which could, in turn, greatly increase the actuarial risk experienced by your kernel.
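
A sketch of the broken combination (the gp pointer, mylock, and the helper functions are hypothetical):

/* Reader uses RCU... */
rcu_read_lock();
p = rcu_dereference(gp);
do_something_with(p);	/* May be preempted in TREE_PREEMPT_RCU... */
rcu_read_unlock();

/* ...but the updater waits only for RCU-Sched readers. */
q = rcu_dereference_protected(gp, lockdep_is_held(&mylock));
rcu_assign_pointer(gp, NULL);
synchronize_sched();	/* ...so this can return too early... */
kfree(q);		/* ...resulting in a use-after-free. */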

Even in CONFIG_TREE_RCU and CONFIG_TINY_RCU builds, such mixing and matching is of course very strongly discouraged. Mixing and matching other flavors of RCU is even worse: it can result in hard-to-debug bad-pointer bugs.

Back to Quick Quiz 3.

Quick Quiz 4: Can synchronize_srcu() be safely used within an SRCU read-side critical section? If so, why? If not, why not?

Answer: In theory, you can use synchronize_srcu() with a given srcu_struct within an SRCU read-side critical section that uses some other srcu_struct. In practice, however, such use is almost certainly a bad idea, as it means that the SRCU readers take a long time to complete. Worse yet, the following could still result in deadlock:

/* Task 1. */
idx = srcu_read_lock(&ssa);
synchronize_srcu(&ssb);	/* Waits for ssb readers to complete... */
srcu_read_unlock(&ssa, idx);

/* . . . */

/* Task 2. */
idx = srcu_read_lock(&ssb);
synchronize_srcu(&ssa);	/* ...which in turn wait for ssa readers. */
srcu_read_unlock(&ssb, idx);

The reason that this code fragment can result in deadlock is that we have a cycle: the first task's ssa read-side critical section waits on an ssb grace period, which waits on the second task's ssb read-side critical section, which contains a synchronize_srcu() that in turn waits on the first task's ssa read-side critical section.

So if you do include synchronize_srcu() in SRCU read-side critical sections, make sure to avoid cycles. Of course, the simplest way to avoid cycles is to avoid using synchronize_srcu() in SRCU read-side critical sections in the first place.

Back to Quick Quiz 4.

Quick Quiz 5: Why isn't there an smp_mb__after_rcu_read_unlock(), smp_mb__after_rcu_bh_read_unlock(), or smp_mb__after_rcu_sched_read_unlock()?

Answer: Because rcu_read_unlock(), rcu_read_unlock_bh(), and rcu_read_unlock_sched() never imply any sort of barrier, so the corresponding smp_mb__after_...() primitive would have to supply a full smp_mb() anyway, gaining nothing over using smp_mb() directly. In contrast, the current implementation of srcu_read_unlock() actually does imply a full barrier, so smp_mb__after_srcu_read_unlock() can be an informative no-op.

Back to Quick Quiz 5.

Quick Quiz 6: Why would anyone need to distinguish lists based on their NULL pointers? Why not just remember which list you started searching?

Answer: Suppose that CPU 0 is traversing such a list within an RCU read-side critical section, where the elements are allocated from a SLAB_DESTROY_BY_RCU slab cache. The elements could therefore be freed and reallocated at any time. If CPU 0 is referencing an element while CPU 1 is freeing that element, and if CPU 1 then quickly reallocates that same element and adds it to some other list, then CPU 0 will be transported to that new list along with the element. In this case, remembering the starting list would clearly be unhelpful.

To make matters worse, suppose that CPU 0 searches a list and fails to find the element that it was looking for. Was that because the element did not exist? Or because CPU 0 got transported to some other list in the meantime? Readers traversing SLAB_DESTROY_BY_RCU lists must carefully validate each element and check for being moved to another list. One way to check for being moved to another list is for each list to have its own value for the NULL pointer. These checks are subtle and easy to get wrong, so please be careful.

Back to Quick Quiz 6.

Quick Quiz 7: Why is there no hlist_nulls_add_tail_rcu()?

Answer: Suppose that CPU 0 is traversing an hlist-nulls list under RCU protection. Suppose that while CPU 0 is referencing list element A, CPU 1 frees it and reallocates it, adding it to another list, which CPU 0 unwittingly starts traversing. Suppose further that while CPU 0 is referencing an element B in the new list, CPU 2 frees and reallocates it, moving it back to the original list. When CPU 0 comes to the end of the original list, it sees that the NULL pointer has the proper value, so does not realize that it has been moved.

If CPU 2 had added element B at the tail of the list, CPU 0 would be within its rights to conclude that it had fully searched this list when in fact it had not. But given that it is only possible to add elements to the head of an hlist-nulls list, any CPU coming to the end of the same list it started traversing can be sure that it really did search the entire list. Possibly several times, if it was extremely unlucky.

Therefore, there is not and should never be a primitive to add to the middle or the end of an RCU-protected hlist-nulls list. Except maybe at initialization time, before any readers have access to the list.

Back to Quick Quiz 7.

Quick Quiz 8: Why doesn't the rcupdate.rcu_expedited= boot parameter also apply to CONFIG_TINY_RCU?

Answer: Because CONFIG_TINY_RCU's normal grace-period primitives are no-ops, they are already maximally expedited. Nothing is faster than doing nothing.

Back to Quick Quiz 8.
