
The RCU API, 2024 edition

September 13, 2024

This article was contributed by Paul McKenney

Read-copy-update (RCU) is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently used as a replacement for reader-writer locking, but is also used in a number of other ways. This mechanism is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. Those wishing an in-depth introduction to RCU are invited to consult the LWN series here, here, and here.

This article covers recent changes to the RCU API; it was contributed by Paul McKenney, Boqun Feng, Frederic Weisbecker, Joel Fernandes, Neeraj Upadhyay, and Uladzislau Rezki.

Although the basic idea behind RCU has not changed during the three decades following its introduction into DYNIX/ptx, the RCU API has evolved significantly since the 2010, 2014, and 2019 editions of the Linux-kernel RCU API. The most recent five years of this evolution is documented by the following sections.

  1. Summary of RCU API changes
  2. How did those 2019 predictions turn out?
  3. What next for the RCU API?

If that is not enough RCU for you, there are a lot more details to be found in the associated background-material article.

But first, a few announcements:

  1. People familiar with the 2019 edition of the RCU API will note several additional names on the byline. These people have taken up the challenge of learning the various Linux-kernel RCU implementations, each investing significant time over a period of years. Not that Paul intends to go anywhere anytime soon, his father having taken early retirement only recently, but Mother Nature might force him into her retirement program at any time. Paul therefore asks you to please take them seriously, because if something happens to him, they are your Linux-kernel RCU maintainers.
  2. It is wise to CC rcu@vger.kernel.org on any RCU-related email that might otherwise have been sent privately to Paul, who is likely to become less aggressive about checking email during off-hours.
  3. In a not-unrelated change, the source of truth for RCU commits has moved from Paul's venerable -rcu tree to a shared RCU tree. His -rcu tree will continue to exist, but will be one of several feeding into the shared RCU tree and elsewhere.

Summary of RCU API changes

These sections summarize some of the most visible changes to RCU:

  1. Lazy RCU grace periods
  2. Reworking of kfree_rcu()
  3. Polled RCU grace periods
  4. Tasks Rude RCU and Tasks Trace RCU
  5. New SRCU read-side critical sections
  6. Read-side guards
  7. RCU callback dynamic (de-)offloading
  8. Miscellaneous

Lazy RCU grace periods

The addition of lazy RCU grace periods was prompted by energy-efficiency concerns on battery-powered platforms, most notably Android and ChromeOS. This laziness affects callbacks queued via call_rcu(), and is not to be confused with the lazy processing of RCU callbacks from kfree_rcu(), which is covered in the next section.

New-age lazy call_rcu() grace periods are controlled at build time by the new CONFIG_RCU_LAZY kernel configuration option and, in kernels built with CONFIG_RCU_LAZY=y, at run time by the new rcutree.enable_rcu_lazy kernel boot parameter. An additional CONFIG_RCU_LAZY_DEFAULT_OFF configuration option changes the default setting of this boot parameter to disabled. In other words, when the boot parameter is not specified, kernels built with CONFIG_RCU_LAZY=y but without CONFIG_RCU_LAZY_DEFAULT_OFF=y will have lazy call_rcu() callbacks, while kernels built with both CONFIG_RCU_LAZY=y and CONFIG_RCU_LAZY_DEFAULT_OFF=y will have non-lazy (hurried) call_rcu() callbacks.

The kernel boot parameter overrides the build-time default, so that (for example) booting a kernel built with CONFIG_RCU_LAZY=y and CONFIG_RCU_LAZY_DEFAULT_OFF=y with rcutree.enable_rcu_lazy=1 will nevertheless result in lazy call_rcu() callbacks. Lazy callbacks will wait up to ten seconds before starting a grace period, thus greatly reducing the number of grace periods (and their associated wakeups of idle CPUs) on an almost-idle device. However, call_rcu() callbacks will only ever be lazy on rcu_nocbs CPUs. The reason for this is that systems with non-rcu_nocbs CPUs consume more energy than their rcu_nocbs counterparts, so those concerned with battery lifetime would be well-advised to first enable rcu_nocbs. This can be done by building the kernel with CONFIG_RCU_NOCB_CPU=y and booting it with rcu_nocbs=all.
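
Putting this together, a battery-conscious configuration might look like the following sketch, combining a Kconfig fragment with the corresponding kernel boot parameters (exact option placement varies across kernel versions):

    CONFIG_RCU_NOCB_CPU=y
    CONFIG_RCU_LAZY=y

    # Kernel boot parameters:
    rcu_nocbs=all rcutree.enable_rcu_lazy=1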

Testing of lazy call_rcu() grace periods showed energy-efficiency improvements well in excess of 10%.

Some uses of call_rcu() cannot tolerate laziness, for example, when the callback function does a wakeup. These uses should instead invoke the new call_rcu_hurry() function. Note that invoking call_rcu_hurry() on a given CPU will hurry along any earlier lazy callbacks that were previously queued using call_rcu().
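
For example, a callback that wakes up a waiter must not be delayed for seconds at a time, so it should be queued with call_rcu_hurry(), as in this minimal sketch (the my_data structure and my_wakeup_cb() function are hypothetical):

    struct my_data {
        struct rcu_head rh;
        struct completion *done;
    };

    static void my_wakeup_cb(struct rcu_head *rhp)
    {
        struct my_data *mdp = container_of(rhp, struct my_data, rh);

        complete(mdp->done); /* Wakes a waiter, so must not be lazy. */
    }

    /* Queue with call_rcu_hurry(&mdp->rh, my_wakeup_cb), not call_rcu(). */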

When suspending or hibernating, it is also important for all callbacks to hurry. (But don't take our word for it, ask a user about adding a ten-second delay to suspend!) RCU therefore has an rcu_pm_notify() function that hurries callbacks at the start of a suspend or hibernation operation. This same function re-enables laziness when this operation completes.

Reworking of kfree_rcu()

From the viewpoint of pre-existing, working code, kfree_rcu() works just like it always did. However, under the covers, performance has increased significantly through use of pages of pointers to track the memory and through use of kfree_bulk(). Both of these changes greatly improve cache locality, which provides up to a 12% increase in performance. Performance was increased still further for battery-powered devices via carefully developed heuristics that govern how long a partially populated page of pointers waits before being submitted to RCU.

In addition, kfree_rcu() can now handle memory from kmem_cache_alloc(), not due to any changes to kfree_rcu(), but rather due to changes to kfree(). However, please note that rcu_barrier() does not wait for memory being freed via kfree_rcu(), so that there currently is no way to safely invoke kmem_cache_destroy() on module exit if that module ever used kfree_rcu() on memory from the corresponding kmem_cache structure. A fix is in the works. In the meantime, such modules should continue using call_rcu() for this use case.
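
That call_rcu() workaround might look like the following sketch, in which foo, foo_cache, and foo_free_cb() are hypothetical names: because rcu_barrier() does wait for call_rcu() callbacks, the module-exit function can safely invoke kmem_cache_destroy() afterwards.

    static struct kmem_cache *foo_cache;

    struct foo {
        int data;
        struct rcu_head rh;
    };

    static void foo_free_cb(struct rcu_head *rhp)
    {
        struct foo *fp = container_of(rhp, struct foo, rh);

        kmem_cache_free(foo_cache, fp);
    }

    /* Deferred free: call_rcu(&fp->rh, foo_free_cb); */

    static void __exit foo_exit(void)
    {
        rcu_barrier();                 /* Wait for all foo_free_cb() invocations. */
        kmem_cache_destroy(foo_cache);
    }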

One disadvantage of kfree_rcu() is that an rcu_head structure must be embedded within the structure to be freed, costing 16 additional bytes on 64-bit systems. For example, a structure referenced by p where the rcu_head structure is in a field named rh can be RCU-deferred-freed using:

    kfree_rcu(p, rh)

This issue is addressed by a new kfree_rcu_mightsleep() function, which takes a single pointer to the beginning of the object to be freed, as in kfree_rcu_mightsleep(p), thus avoiding the need for that rcu_head structure and its 16 bytes of memory.

Of course, there is always a catch, and in this case the catch is that, as the name suggests, kfree_rcu_mightsleep() might sleep. In fact, if it is unable to allocate the memory needed to track the object being deferred-freed, it will simply invoke synchronize_rcu(), which blocks for some tens of milliseconds, and then directly invokes kfree(). In contrast, kfree_rcu() reacts to low memory by invoking call_rcu(), thus giving up cache locality but not caller-visible latency. As always, choose wisely!
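
To recap the tradeoff, again assuming a structure referenced by p with an rcu_head field named rh:

    kfree_rcu(p, rh);          /* Safe in atomic context, but needs ->rh. */
    kfree_rcu_mightsleep(p);   /* No rcu_head needed, but might sleep. */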

There are versions of the kernel that have a single-argument variant of kfree_rcu() instead of kfree_rcu_mightsleep(). This single-argument approach proved to be a serious mistake that led to subtle bugs in which people forgot to specify the second argument, which fails (but only sometimes) from atomic contexts such as interrupt handlers. Therefore, recent kernels provide only the double-argument variant of kfree_rcu(). Where the old single-argument version would have been used, use kfree_rcu_mightsleep() instead.

There are also kvfree_rcu() and kvfree_rcu_mightsleep() functions that operate on vmalloc() memory.

In short, if you have RCU callbacks that do nothing but kfree(), vfree(), or kmem_cache_free() to immortal kmem_cache structures, these functions might save you both CPU time and a few lines of code. Many of these have already been converted, courtesy of a Coccinelle script and patches from Julia Lawall, but new code is written all the time.

Polled RCU grace periods

The historic RCU API does quite a bit of work for the user, who can simply invoke synchronize_rcu() and, upon its return, know that all pre-existing readers are done. Or, if the synchronize_rcu() function's blocking is problematic, pass a pointer and a callback function to call_rcu() and upon invocation of that callback function, know that all pre-existing readers are done. Better yet, if that callback function is doing nothing but invoking kfree() or kmem_cache_free(), just pass a pointer and rcu_head offset to kfree_rcu(), and rely on RCU to do everything, including the freeing. Or even better still, if within a latency-tolerant, non-atomic context, dispense with the rcu_head and pass just the pointer itself to kfree_rcu_mightsleep().

However, as reported here, there are situations in which these conveniences become counterproductive. For example, in caching situations, it might be that an object that has been queued for deferred free is once again needed. In such cases, it might be helpful to cancel a synchronize_rcu(), prevent a call_rcu() callback from being invoked, or to prevent an object passed to kfree_rcu() or kfree_rcu_mightsleep() from being freed. For another example, in situations that invoke many instances of call_rcu() in short time periods, RCU's choice of software-interrupt context for callback invocation might be suboptimal. Unfortunately, providing for these situations would further complicate the RCU API and implementation, and would also result in performance degradation.

Quick Quiz 1: Why would call_rcu() be helpful in scheduling polling for a particular grace period?

Click for answer Because the callback passed to call_rcu() tends to be invoked shortly after a grace period has completed, which is an excellent time to do polling for the end of a grace period. Alternatively, if the RCU-callback softirq context is inconvenient, you can instead use queue_rcu_work() to schedule a workqueue handler to execute shortly after a grace period completes.

In recent Linux kernels, RCU instead provides a complete polling API for managing RCU grace periods. This permits RCU users to take full control over all aspects of waiting for grace periods and responding to the end of a particular grace period. For example, the get_state_synchronize_rcu() function returns a cookie that can be passed to poll_state_synchronize_rcu(). This latter function will return true if a grace period has elapsed in the meantime. The user may choose any convenient method to schedule the polling for the end of the grace period, from mod_timer() up to and including use of call_rcu() itself. Once the grace period in question has elapsed, the user may choose any convenient context from which to free memory, or to undertake whatever other processing is required.
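
A minimal sketch of this pattern, assuming a hypothetical cached object referenced by p:

    unsigned long cookie;

    cookie = get_state_synchronize_rcu();  /* Snapshot grace-period state. */

    /* ... later, from any convenient context ... */

    if (poll_state_synchronize_rcu(cookie)) {
        /* A full grace period has elapsed since the snapshot. */
        kfree(p);
    } else {
        /* Not yet: reschedule the poll, for example via mod_timer(). */
    }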

For a bit more background on RCU's polled grace-period API, please see Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers or slides 34-on in the Reclamation Interactions with RCU LSFMM+BPF 2024 presentation.

Tasks Rude RCU and Tasks Trace RCU

The 2019 edition of the RCU API described the addition of Tasks RCU for use by ftrace and kprobes. These facilities use trampolines containing tracepoint code, and Tasks RCU is used to synchronize removal of a trampoline with tasks that might still be executing within it. Because ftrace and kprobes trampolines never do context switches, nor do they invoke functions that do context switches, a voluntary context switch suffices as a Tasks RCU quiescent state. This edition describes Tasks Rude RCU, which was consolidated from an open-coded implementation provided by Steve Rostedt, and also Tasks Trace RCU, which was added for use by sleepable BPF programs that might block.

Tasks Rude RCU augments Tasks RCU by handling the idle tasks that Tasks RCU ignores, thus permitting trampolines to be installed in the idle loop. One could instead prohibit tracepoints and kprobes in the idle loop, but the increasing quantity of power-management code living there makes such a prohibition unpalatable.

Therefore, true to its name, Tasks Rude RCU uses the schedule_on_each_cpu() function to force a context switch on each CPU, and thus force each CPU out of the idle loop. Those using battery-powered systems might well consider the resulting wakeups of deep-idle-state CPUs to be quite rude, hence the name.

Like Tasks RCU, Tasks Rude RCU has no read-side markers. It has synchronize_rcu_tasks_rude() and call_rcu_tasks_rude() functions to wait for a grace period, synchronously and asynchronously, respectively. The rcu_barrier_tasks_rude() function waits for the invocation of all callbacks queued by previous invocations of call_rcu_tasks_rude(), which is needed when unloading modules such as rcutorture that invoke call_rcu_tasks_rude(). However, neither call_rcu_tasks_rude() nor rcu_barrier_tasks_rude() is used in current mainline, which is likely to lead to their removal sooner rather than later.

Peter Zijlstra and Thomas Gleixner reworked the x86 entry/exit code, using noinstr and inlining, so that any function that RCU is not watching cannot be traced. On any architecture where this is completed, where the CPU-hotplug architecture-specific offline code path has been addressed, and where the maintainers feel confident that it will stay completed, synchronize_rcu_tasks_rude() can become a no-op and call_rcu_tasks_rude() can invoke its callback immediately from a clean context.

BPF programs use a combination of RCU, Tasks RCU, and Tasks Rude RCU for various purposes, including synchronizing uses and removals of trampolines. RCU is used to protect entire BPF programs, which works well, but which prohibits BPF programs from blocking. This prohibition in turn prevents BPF programs from unconditionally loading the contents of user-space memory, because user-space accesses might result in page faults, which might in turn result in blocking while waiting for that user-space data to be paged back in. Therefore, BPF programs have been given conditional access to user-space memory, which either completes the access or indicates failure.

These failure indications can be quite inconvenient, so a new special-purpose Tasks Trace RCU flavor has been created that allows limited blocking within its read-side critical sections. As such, Tasks Trace RCU can be thought of as a variant of sleepable RCU (SRCU) with low-overhead read-side markers. Although, like SRCU, the implementation can tolerate arbitrary blocking, by convention Tasks Trace RCU readers are permitted to block only for long enough to handle a major page fault.

The Tasks Trace RCU read-side markers are rcu_read_lock_trace() and rcu_read_unlock_trace(), and there is a lockdep-enabled rcu_read_lock_trace_held() function that indicates whether or not the caller is within a Tasks Trace RCU read-side critical section. (When lockdep is disabled, this function always returns the value one.) Synchronous grace-period waits are provided by synchronize_rcu_tasks_trace() and asynchronous grace-period waits by call_rcu_tasks_trace(). The rcu_barrier_tasks_trace() function waits for the invocation of all callbacks queued by previous invocations of call_rcu_tasks_trace(). It is not unusual for BPF code to need to wait for both an RCU and a Tasks Trace RCU grace period, and a current accident of implementation means that any Tasks Trace RCU grace period is also a plain RCU grace period. This could of course change at any time, so there is an rcu_trace_implies_rcu_gp() function (which currently unconditionally returns true) that specifies whether or not this happy accident is still in effect.
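
A Tasks Trace RCU reader might therefore look like the following sketch, in which gp is a hypothetical Tasks-Trace-RCU-protected pointer and do_something_with() stands in for the reader-side processing:

    rcu_read_lock_trace();
    p = rcu_dereference_check(gp, rcu_read_lock_trace_held());
    if (p)
        do_something_with(p);  /* Might block briefly, e.g., on a page fault. */
    rcu_read_unlock_trace();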

Again, Tasks Trace RCU is quite specialized, so those wishing to use it should consult not only with its maintainers, but also with its current users.

New SRCU read-side critical sections

SRCU read-side critical sections use this_cpu_inc(), which excludes interrupt and software-interrupt handlers, but is not guaranteed to exclude non-maskable interrupt (NMI) handlers. Therefore, SRCU read-side critical sections may not be used in NMI handlers, at least not in portable code. This restriction became problematic for printk(), which is frequently called upon to do stack backtraces from NMI handlers. This situation motivated adding srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe(). These new API members instead use atomic_long_inc(), which can be more expensive than this_cpu_inc(), but which does exclude NMI handlers.

However, SRCU will complain if you use both the traditional and the NMI-safe API members on the same srcu_struct structure. In theory, it is possible to mix and match but, in practice, the rules for safely doing so are not consistent with good software-engineering practice. So if you need any of a given srcu_struct structure's read-side critical sections to appear in an NMI handler, use srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() to mark all of that srcu_struct structure's read-side critical sections. When lockdep is enabled, the kernel will complain bitterly if you attempt to mix and match NMI-safe and non-NMI-safe SRCU readers on the same srcu_struct structure.
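
Usage is otherwise identical to that of the traditional SRCU read-side markers, as in this sketch (my_srcu and my_reader() are hypothetical):

    DEFINE_SRCU(my_srcu);  /* All of its readers must use the _nmisafe() markers. */

    static void my_reader(void)
    {
        int idx;

        idx = srcu_read_lock_nmisafe(&my_srcu);
        /* Read-side accesses, possibly from an NMI handler. */
        srcu_read_unlock_nmisafe(&my_srcu, idx);
    }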

An SRCU read-side critical section must be wholly contained within a given task. Discussions led to the belief that this restriction was too severe, resulting in the new srcu_down_read() and srcu_up_read() API members, by analogy to down_read() and up_read(). However, these APIs have not yet seen any use. If they continue to be unused, they will be removed.

In addition, the list_for_each_entry_srcu() and hlist_for_each_entry_srcu() iterators were added, and are actually in use.

Finally, the cleanup_srcu_struct_quiesced() function was removed because the deadlock issue that led to its creation was resolved by adding WQ_MEM_RECLAIM workqueues. Therefore, any code that would previously have used cleanup_srcu_struct_quiesced() can now use cleanup_srcu_struct() instead.

Read-side guards

Zijlstra introduced read-side guards for RCU and SRCU, and Johannes Berg made the RCU read-side guards safe for the sparse static-analysis tool. These guards use the __cleanup__ attribute to cause a read-side critical section to be exited as soon as the scope ends. This enables the RAII (resource acquisition is initialization) pattern for RCU and SRCU, for example. From fs/libfs.c:

 1 static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
 2 {
 3   struct dentry *dentry;
 4
 5   guard(rcu)();
 6   dentry = READ_ONCE(stashed);
 7   if (!dentry)
 8     return NULL;
 9   if (!lockref_get_not_dead(&dentry->d_lockref))
10     return NULL;
11   return dentry;
12 }
Quick Quiz 2: What is with the extra pair of parentheses on line 5?

Click for answer For lock-based guards, these would specify which lock to acquire. But RCU is global in nature, so does not need anything between the second pair of parentheses.

Those willing to look more deeply under the covers will see that the (rcu) is the argument to the guard() macro and the () is an argument to the constructor function that enters the RCU read-side critical section.

Line 5 creates an RCU read-side critical section that extends to the end of the enclosing scope, that is, to the end of the function body.

An SRCU read-side guard must specify which srcu_struct to use, for example, as follows:

 1 static void gpiochip_setup_devs(void)
 2 {
 3   struct gpio_device *gdev;
 4   int ret;
 5
 6   guard(srcu)(&gpio_devices_srcu);
 7
 8   list_for_each_entry_srcu(gdev, &gpio_devices, list,
 9          srcu_read_lock_held(&gpio_devices_srcu)) {
10     ret = gpiochip_setup_dev(gdev);
11     if (ret)
12       dev_err(&gdev->dev,
13         "Failed to initialize gpio device (%d)\n", ret);
14   }
15 }

Here, line 6 enters an SRCU read-side critical section using the srcu_struct structure named gpio_devices_srcu, and this critical section extends to the end of the enclosing scope.

But sometimes it is necessary to exit the critical section prior to the end of the enclosing scope. For this purpose, scoped_guard() creates an SRCU read-side critical section that covers only the following statement, which will often be a compound statement, for example, as follows:

 1 struct gpio_desc *gpio_to_desc(unsigned gpio)
 2 {
 3   struct gpio_device *gdev;
 4
 5   scoped_guard(srcu, &gpio_devices_srcu) {
 6     list_for_each_entry_srcu(gdev, &gpio_devices, list,
 7         srcu_read_lock_held(&gpio_devices_srcu)) {
 8       if (gdev->base <= gpio &&
 9           gdev->base + gdev->ngpio > gpio)
10         return &gdev->descs[gpio - gdev->base];
11     }
12   }
13
14   if (!gpio_is_valid(gpio))
15     pr_warn("invalid GPIO %d\n", gpio);
16
17   return NULL;
18 }
Quick Quiz 3: So why aren't there read-side guards for the three Tasks RCU variants?

Click for answer For Tasks RCU and Tasks Rude RCU, the readers are implicit, which means that there is no useful way to create a guard. For Tasks Trace RCU, read-side guards are easy to implement, and will be implemented should a use case arise.

Here, line 5 enters an SRCU read-side critical section that extends from line 6 through line 12.

In all cases, use of read-side guards avoids bugs in which code enters a critical section but then fails to exit it.

RCU callback dynamic (de-)offloading

RCU callbacks may be offloaded, which means that, instead of being invoked in software-interrupt context (usually on the CPU that queued them), they are instead processed (shepherded through the end of the grace period) in the context of rcuog kthreads, then invoked in the context of per-CPU rcuoc kthreads. Callback offloading can provide substantial improvements for both HPC and real-time workloads by removing the "noise" of callback invocation from the CPUs doing time-critical work, or, failing that, by letting the scheduler decide when and where the RCU callbacks should be invoked. These kthreads may be controlled by the system administrator using the usual set of scheduling facilities, ranging from taskset to control groups.

Quick Quiz 4: Why not always offload RCU callback invocation?

Click for answer Because offloaded callbacks require that call_rcu() explicitly synchronize with whatever CPU the corresponding rcuog kthread might be running on. This synchronization is not free, and poses the usual choice between high throughput (callbacks not offloaded) and real-time response (callbacks offloaded). And because RCU has no way of knowing which approach is best for a given workload, it must therefore defer to the better judgment of the system administrator.

By default, CPUs are never offloaded. Setting the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL kernel configuration option causes all CPUs to be offloaded. Either build-time default may be overridden by the nohz_full kernel boot parameter, by the rcu_nocbs kernel boot parameter, or by both, if you feel that your future self will need the additional confusion.
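
For example, a 16-CPU system dedicating CPUs 4-15 to a time-critical workload might be booted with something like the following (the CPU list is purely illustrative):

    rcu_nocbs=4-15 nohz_full=4-15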

As the number of CPUs on the typical computer system has increased, so has the number of workloads running on such a system, and in turn, the need to dynamically adjust those workloads. There is thus now an in-kernel facility to offload and de-offload RCU callbacks on specific CPUs. However, this facility has not yet been made available to user space due to persistent issues, including race conditions, deadlocks, and other hangs.

The current direction is to allow run-time offloading and de-offloading, but only for offline CPUs. This is expected to significantly simplify the code while supporting all known use cases. This facility will likely be made available to user space with the keenly anticipated run-time adjustment of the nohz_full kernel boot parameter.

Miscellaneous

The CONFIG_RCU_FAST_NO_HZ kernel configuration option was intended to improve energy efficiency, but a survey in 2021 showed that the only users of this option also offloaded RCU callbacks. In that case, all CONFIG_RCU_FAST_NO_HZ does is to provide a slight slowdown for transitions to and from idle. This configuration option was therefore removed.

The data-access APIs added rcu_dereference_raw_check(), rcu_replace_pointer(), and unrcu_pointer(), but also removed rcu_dereference_raw_notrace() and rcu_swap_protected().

The updater-validation APIs removed RCU_NONIDLE() because Zijlstra's and Gleixner's idle-loop rework removed the need for it.

The RCU list APIs added list_tail_rcu(), hlists_swap_heads_rcu(), hlist_nulls_add_tail_rcu(), and hlist_nulls_add_fake(), but removed hlist_bl_del_init_rcu().

How did those 2019 predictions turn out?

The 2019 article included a set of predictions about the future of the RCU API. Five years later, a look at how they turned out seems warranted.

  1. A kmem_cache counterpart to kfree_rcu() will likely be required.

    This had been predicted in 2014 as well but, yet again, it did not happen. But something even better happened, namely that kfree() now handles memory obtained from kmem_cache_alloc(). This means that there is no longer a need for something like a kmem_cache_free_rcu(), because kfree_rcu() now simply handles this case.

    Almost. But that is the subject of a new prediction.

  2. Inlining of TREE_PREEMPT_RCU's rcu_read_lock() primitive.

    And yet again, this did not happen.

    But not for lack of effort on the part of Lai Jiangshan, who provided not one but two patch series along these lines (2024, 2019).

    Quick Quiz 32: Inlining rcu_read_lock() sounds quite valuable. So why aren't Jiangshan's patches upstream?

    Click for answer They do remove function-call and task-structure-access overhead from rcu_read_lock(), but at the cost of additional code on the context-switch fast path. This might well be a good tradeoff, but actual performance results are required. Please feel free to give this patch series a spin on your favorite workload! (And of course to post the resulting performance results.)

  3. Additional forward-progress work, both in rcutorture and in RCU proper.

    There was significant work in this area, along with upgrades to the callback-flooding testing in rcutorture to help ensure that the improvements stay improved.

  4. Better handling of vCPU preemption within RCU readers.

    There have been some interesting experiments in this area, but no commits have yet made it to mainline.

  5. Adding rcutorture to kselftests, that is, adding a Makefile to tools/testing/selftests/rcutorture that carries out a quick rcutorture-based smoke test of RCU.

    This has been done. Even better, there are now more people who make frequent use of rcutorture.

  6. Disentangling rcu_barrier() from CPU hotplug operations, which could permit this function to be invoked from CPU-hotplug notifiers.

    This has been completed.

Quick Quiz 33: What happened to quick quizzes 5-31?

Click for answer They can be found in the background-material supplemental article, for everybody who wants more RCU.

Three and two halves out of six, which is not as good as one might hope, but better than they usually are.

It is also illuminating to list the unexpected changes. Some of these are hinted at above, but bear repeating:

  1. There is now a Tasks Rude RCU and Tasks Trace RCU.
  2. There are a number of energy-efficiency improvements, including lazy RCU callbacks and lazy kfree_rcu() processing.
  3. There is now a (more) complete set of polled RCU grace-period APIs.
  4. SRCU read-side critical sections may now be in NMI handlers (using the new srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() functions) and may also span tasks (using the new srcu_down_read() and srcu_up_read() functions).
  5. RAII guards are now available for RCU and SRCU, courtesy of Zijlstra and Berg.
  6. There is now a kernel configuration option to reduce synchronize_rcu() latency during heavy RCU-callback loads.
  7. Expedited RCU uses kthread workers instead of workqueues, which enables priority boosting to also boost expedited grace-period processing.
  8. The expedited RCU CPU stall-warning timeout can now be set in milliseconds, and some users set it to 20 milliseconds. And so it is that Linux's stall warning timeout finally beats the 1990s DYNIX/ptx timeout of 1.5 seconds. For expedited grace periods, anyway.
  9. It is now OK to disable interrupts across rcu_read_unlock() even if the corresponding RCU read-side critical section might have been preempted. This is a welcome side-effect of Jiangshan's first attempt to inline rcu_read_lock() and rcu_read_unlock().
  10. RCU code now takes greater advantage of the Kernel Concurrency Sanitizer (KCSAN).
  11. CPUs may have their callbacks offloaded and de-offloaded at runtime, though this capability has not yet been made available to user space.
  12. Debug-objects testing for double call_rcu() bugs can now print out more information on the memory passed in.
  13. The rcutorture tests of RCU priority boosting are now much more stringent and resistant to false positives.
  14. There is now a kvm-remote.sh rcutorture test facility that spreads rcutorture tests over many remote systems.
  15. There is now a torture.sh test facility that does an overnight test of various torture tests.
  16. There is now a trivial textbook RCU implementation in rcutorture, just to keep the slides and textbooks honest.
  17. The lockdep facility now checks for SRCU-based deadlocks.
  18. The Linux-kernel memory model's (LKMM's) model of SRCU is now much more realistic.
  19. A great many much-appreciated features and fixes from more than 100 Linux-kernel developers.

What next for the RCU API?

As always, the most honest answer is that I do not know. That said, here are a few things that might happen:

  1. The slab allocator will defer acting on kmem_cache_destroy() until after memory sent to kfree_rcu() has been freed. This would permit a module to use kfree_rcu() on memory obtained from kmem_cache allocators that this module passes to kmem_cache_destroy() at module-unload time.
  2. TREE_PREEMPT_RCU's rcu_read_lock() primitive will be inlined. After all, why not triple down?
  3. Hazard pointers will be added to the Linux kernel's deferred-free toolbox.
  4. RCU callbacks will benefit from concurrent expedited grace periods.
  5. Although RCU is now capable of changing the callback-offloaded status of a given CPU at runtime, this has not been made available to user space. It seems likely that user space will gain this capability sooner rather than later, albeit with some restrictions.
  6. Further upgrades to RCU's energy-efficiency, latency, and simplicity.
  7. Someone will notice that rcu_barrier() is no longer guaranteed to wait for the grace periods corresponding to prior calls to synchronize_rcu(). (Late-breaking news: Vlastimil Babka has already noticed.)

But now as always, new use cases and workloads will place unanticipated demands on RCU.

Acknowledgments

We are all indebted to a huge number of people who have used, abused, poked at, and otherwise helped to improve the RCU API. Paul is grateful to Dan Kelley for his support of this effort.

This work represents the view of the authors and does not necessarily represent the view of the authors' respective employers.


Hazard pointers

Posted Sep 13, 2024 19:59 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (7 responses)

What is the intended use case for hazard pointers? Do you envision them used by "normal" code just like RCU, or only under special circumstances?

Hazard pointers

Posted Sep 14, 2024 13:28 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

The thing that made it clear that hazard pointers are needed was the recent reworking of reference-count acquisition to better handle high contention for a case in which a reference needed to be acquired within an RCU read-side critical section.

Of course, we have other ways of dealing with this, for example per-CPU refcounts. Which work very well and thus see a lot of use. But in this case, their per-object memory overhead was too much for traffic to bear. A similar effect may often be achieved using SRCU, but only when it is OK for long readers to block updates. All in all, we have taken RCU quite a bit farther than I would have guessed possible 25 years ago, but we are starting to find areas where more is needed.

Hazard pointers is a good match for this situation. A reference may be acquired without contention, and no per-object storage is required.

To be sure, hazard pointers is limited in its own way: (1) Heavy barriers are needed for acquiring a reference, or, alternatively, IPIs or RCU readers. (2) Unlike RCU, hazard pointers can say "no" to a reference acquisition. But that was already the case in the aforementioned situation. (3) Like RCU, hazard pointers involves deferred reclamation, but is often more able to do emergency reclamations. (4) Hazard pointers is new code and will require some work to stabilize, just like RCU did, and sometimes still does.

So, I expect hazard pointers to initially be used only under special circumstances. Longer term, who knows?

Hazard pointers

Posted Oct 1, 2024 19:48 UTC (Tue) by jseigh (guest, #173778) [Link] (5 responses)

The implementation of an asymmetric memory barrier for hazard pointers, as well as other things, is going to have a vcpu preemption problem also.

Asymmetric memory barriers, which are used to eliminate the need for the expensive memory barrier in a hazard-pointer read, look for events on the other processors that would constitute a memory barrier on those processors. Examples would be context switches, a CPU in a wait state, and IPI interrupt handling. The former two are used as quiescent states by RCU (if I'm guessing correctly), hence the use of RCU in an asymmetric memory-barrier implementation. A vcpu wait state would also constitute a memory barrier, so, if such states were observable, they too could be used to avoid vcpu-preemption issues in an asymmetric memory-barrier implementation.

Vcpu wait-state detection should also be used for IPI implementations, since IPIs are also affected by vcpu preemption.

Hazard pointers continued and vcpu preemption

Posted Oct 2, 2024 14:27 UTC (Wed) by jseigh (guest, #173778) [Link]

A few additional comments.

Regarding the hazard-pointer implementation in the kernel: is there an API for that? Also, if it's new, are they looking for Rust or C?

For the VM ballooning problem, some VMs are using pause-loop detection for spin locks (although they apparently are not that good at figuring out the right target vcpu to switch to). If it was a global spin lock, eventually most of the CPUs would trigger it, so the VM would find the target by process of elimination. For RCU, if a processor looked unresponsive, you could do a spin loop with pauses on the vcpu's quiescent state, either until it changed or just long enough to trigger the PLE event. You might have to do this for all vcpus via IPI to get the VM to find the right target vcpu.


Hazard pointers

Posted Oct 2, 2024 19:43 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (3 responses)

Here is the most recent patch series: https://lore.kernel.org/all/20241002010205.1341915-1-math...

There are earlier ones from Boqun Feng and Neeraj Upadhyay.

Mathieu is also working on a userspace hazard-pointer implementation, which I would expect to use sys_membarrier().

If I understand correctly, for hazard pointers and vCPU preemption, this is a performance problem rather than a correctness problem. If a vCPU is preempted, this will delay the RCU grace period, the IPI response, or the sys_membarrier() response, as the case may be. There have been a number of prototyped solutions to vCPU preemption, for example, based on priority boosting. So, should vCPU preemption become a problem in practice, there are a number of starting points for a solution.

But what is your preferred starting point?

Hazard pointers

Posted Oct 2, 2024 22:39 UTC (Wed) by jseigh (guest, #173778) [Link] (2 responses)

Re hazard pointer API. I was mainly curious, since the C++26 hazard-pointer API seems to be more complex than I would expect. If I did anything, it would be in user space, although I have a hazard-pointer-based proxy-collector implementation that suits most of my needs. I don't use deferred reclamation with data structures requiring high rates of modification, e.g., a linked-list lock-free queue, because:
1) deferred reclamation would kill throughput, and
2) I already have a lock-free queue that is ABA-proof, i.e., it doesn't require deferred reclamation to work correctly.

Re vcpu preemption. I don't really have any preferred starting point; just curious about that also. I did propose a new machine instruction for restartable sequences about 20 years ago. If we had something like that, deferred-reclamation schemes based on it would have automatically worked in VMs, much the same way LL/SC works in VMs without any special handling.

Hazard pointers

Posted Oct 3, 2024 12:01 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (1 responses)

The C++ Hazard Pointers API was informed by some years of production experience with the implementation in the Folly library. As you might guess, Maged Michael was the force behind this implementation, the experience with it, and of course with the adjustments undertaken in response to that experience. On the other hand, and to your point, it is not unusual for such experience to result in increased complexity. Linux-kernel RCU is most certainly another example of this tendency, after all. ;-)

That said, I completely agree that if you can achieve your goals simply, then embrace simplicity. In particular, if a simple lock-free queue does the job for you, then by all means use that queue!

At one point, I had hoped that the various hardware transactional memory mechanisms would be useful for many things, including vCPU preemption. And perhaps someday they will. But at present, they seem to have fallen victim to the same "complexity surprise" called out above.

Hazard pointers

Posted Oct 17, 2024 12:32 UTC (Thu) by jseigh (guest, #173778) [Link]

I gave up on the idea of a hazard pointer implementation as soon as I realized it would involve writing a tracing garbage collector of sorts as part of that.

I did remove the hazard pointer logic from smrproxy. Absent the load of the address of the reader lock object, a lock action is two machine instructions, a register load followed by a register store. There is no loop, so it is formally wait-free. Unlock is the same as before, a store of zero to memory. I'm starting on porting it to C++, which also involves figuring out what a C++ deferred-reclamation API looks like.


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds