Kernel configuration parameters for RCU

[Posted January 21, 2019 by jake]

This sidebar is part of Paul McKenney's 2019 update to the RCU API.

Kernel configuration parameters

RCU's Kconfig options and kernel boot parameters can be considered to be part of the RCU API, most especially from the viewpoint of someone building a kernel intended for a specialized device or workload. This section summarizes the RCU-related Kconfig options and the more commonly used kernel boot parameters, but please note that many of the Kconfig options require that the CONFIG_RCU_EXPERT Kconfig option be set.

The first set of Kconfig parameters controls the underlying behavior of the RCU implementation itself, and is defined in kernel/rcu/Kconfig.

CONFIG_PREEMPT=n and CONFIG_SMP=y imply CONFIG_TREE_RCU, thus selecting the non-preemptible tree-based RCU implementation that is appropriate for server-class SMP builds. It can accommodate a very large number of CPUs, but scales down sufficiently well for all but the most memory-constrained systems. CONFIG_TREE_RCU provides the following boot parameters:
- rcutree.blimit= sets the maximum number of RCU callbacks to process in one batch, which defaults to ten callbacks. This limit does not apply to offloaded CPUs.
- rcutree.qhimark= sets the threshold of queued RCU callbacks beyond which rcutree.blimit= will be ignored. This defaults to 10,000 callbacks.
- rcutree.qlowmark= sets the threshold of queued RCU callbacks below which rcutree.blimit= will once again have effect. This defaults to 100 callbacks.
- rcutree.jiffies_till_first_fqs= sets the number of jiffies to wait between grace-period initialization and the first force-quiescent-state scan that checks (among other things) for idle CPUs. The default depends on the value of HZ and the number of CPUs on the system. By default, all systems wait for at least one jiffy, with one additional jiffy for HZ greater than 250, an additional jiffy for HZ greater than 500, and one additional jiffy for each 256 CPUs on the system. This value may be manually set to zero, which can be useful for specialty systems that tend to have idle CPUs, that need fast grace periods, and that don't mind burning a little extra CPU during grace-period initialization.
- rcutree.jiffies_till_next_fqs= sets the number of jiffies to wait between successive force-quiescent-state scans. The default is the same as for rcutree.jiffies_till_first_fqs=.
- rcutree.jiffies_till_sched_qs= sets the number of jiffies that a grace period will wait before soliciting help from rcu_note_context_switch() and cond_resched().
- rcutree.rcu_kick_kthreads causes the grace-period kthread to get an extra wake-up if it sleeps more than three times longer than specified.
- rcupdate.rcu_expedited= causes normal grace-period primitives to act like their expedited counterparts. For example, invoking synchronize_rcu() will act as if synchronize_rcu_expedited() had been invoked.
- rcupdate.rcu_normal= causes expedited grace-period primitives to act like their normal counterparts. This kernel boot parameter overrides rcupdate.rcu_expedited= except during very early boot.
- rcupdate.rcu_normal_after_boot= causes expedited grace-period primitives to act like their normal counterparts once init has spawned. Real-time systems desiring fast boot but wishing to avoid run-time IPIs from expedited grace periods would therefore set both rcupdate.rcu_expedited= and rcupdate.rcu_normal_after_boot=.
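
The interplay among rcutree.blimit=, rcutree.qhimark=, and rcutree.qlowmark= described above amounts to a simple hysteresis, which can be sketched as the following toy model. This is not kernel code: the BatchLimiter class and its batch_size() method are hypothetical names, and treating an unlimited batch as "drain the whole queue" is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class BatchLimiter:
    """Toy model of the blimit/qhimark/qlowmark hysteresis (not kernel code)."""
    blimit: int = 10        # rcutree.blimit= default
    qhimark: int = 10_000   # rcutree.qhimark= default
    qlowmark: int = 100     # rcutree.qlowmark= default
    limited: bool = True    # is blimit currently being honored?

    def batch_size(self, queued: int) -> int:
        """Return how many callbacks to invoke for the given queue length."""
        if queued > self.qhimark:
            self.limited = False   # queue too long: ignore blimit
        elif queued < self.qlowmark:
            self.limited = True    # queue has drained: honor blimit again
        # Assumption: with blimit ignored, the whole queue is one batch.
        return min(queued, self.blimit) if self.limited else queued
```

Note that a queue length between the two thresholds leaves the current state unchanged, so a CPU that has exceeded qhimark keeps processing large batches until its queue falls below qlowmark.
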
CONFIG_PREEMPT=y implies CONFIG_PREEMPT_RCU, thus selecting the preemptible tree-based RCU implementation that is appropriate for real-time and low-latency SMP builds. It can also accommodate a very large number of CPUs, and scales down sufficiently well for all but the most memory-constrained systems. The boot parameters for CONFIG_TREE_RCU also apply to CONFIG_PREEMPT_RCU.
CONFIG_PREEMPT=n and CONFIG_SMP=n imply CONFIG_TINY_RCU, selecting the non-preemptible uniprocessor (UP) RCU implementation that is appropriate for non-real-time UP builds. It has the smallest memory footprint of any of the current in-kernel RCU implementations.
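
Returning to the tree-based implementations, the default for rcutree.jiffies_till_first_fqs= (and thus for rcutree.jiffies_till_next_fqs=) follows directly from the rule given in the list above. The sketch below restates that rule; the function names are hypothetical, and rounding "each 256 CPUs" down to whole multiples is an assumption.

```python
def default_jiffies_till_first_fqs(hz: int, nr_cpus: int) -> int:
    """Default wait, in jiffies, before the first force-quiescent-state scan."""
    jiffies = 1                       # every system waits at least one jiffy
    jiffies += 1 if hz > 250 else 0   # one more for HZ greater than 250
    jiffies += 1 if hz > 500 else 0   # and another for HZ greater than 500
    jiffies += nr_cpus // 256         # assumption: one per full 256 CPUs
    return jiffies


def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Convert a jiffy count to milliseconds for a given HZ."""
    return jiffies * 1000.0 / hz
```

For example, an 8-CPU system built with HZ=1000 would wait three jiffies, or three milliseconds, while a 1024-CPU system with the same HZ would wait seven.
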

The second set of Kconfig parameters controls RCU's energy-efficiency features. These are also defined in kernel/rcu/Kconfig.

CONFIG_RCU_FAST_NO_HZ=y improves RCU's energy efficiency by reducing the number of times that RCU wakes up idle CPUs. The downside of this approach is that it increases RCU grace-period latency somewhat.
- rcutree.rcu_idle_gp_delay= specifies the number of jiffies an idle CPU with callbacks should remain idle before rechecking RCU state. The default is four jiffies.
- rcutree.rcu_idle_lazy_gp_delay= specifies the number of jiffies an idle CPU with callbacks, where all callbacks are lazy, should remain idle before rechecking RCU state. (A "lazy" callback is one that RCU knows will do nothing other than free memory.) The default is six seconds, or 6*HZ jiffies.
CONFIG_RCU_NOCB_CPU=y fortuitously improves RCU's energy efficiency by eliminating wakeups due to RCU callback processing. However, it was intended for real-time use, so it is covered in the next section.
srcutree.exp_holdoff= controls the auto-expediting of the first SRCU grace period starting after an extended idle period, and defaults to 25 microseconds. If it has been longer since the end of the last SRCU grace period (for the same srcu_struct structure), the new SRCU grace period will be expedited. Please note that the units of this kernel boot parameter are nanoseconds.
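
Because the default is quoted in microseconds but the parameter takes nanoseconds, it is easy to be off by a factor of 1,000 when setting srcutree.exp_holdoff=. The following sketch shows the unit conversion and the decision described above. The function name is hypothetical, the strict comparison is an assumption, and any special-case values of the parameter are not modeled.

```python
# 25 microseconds expressed in the parameter's units (nanoseconds).
DEFAULT_EXP_HOLDOFF_NS = 25 * 1000


def srcu_gp_auto_expedited(now_ns: int, last_gp_end_ns: int,
                           exp_holdoff_ns: int = DEFAULT_EXP_HOLDOFF_NS) -> bool:
    """Would a new SRCU grace period starting now be auto-expedited?

    True when more than exp_holdoff_ns has elapsed since the previous
    grace period for the same srcu_struct ended (assumed strict).
    """
    return now_ns - last_gp_end_ns > exp_holdoff_ns
```

So a grace period starting 30 microseconds after the previous one ended would be expedited by default, while one starting after only 10 microseconds would not.
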

Please note that these features do not pertain to CONFIG_TINY_RCU, whose job description emphasizes small memory footprint over energy efficiency.

The third set of Kconfig parameters controls RCU's real-time features, which are also defined in kernel/rcu/Kconfig.

CONFIG_RCU_NOCB_CPU=y allows callback processing to be offloaded from selected CPUs, with the "NOCB" standing for "no callbacks". The CPUs to offload can be specified at boot time, as can a couple of other things:
- rcu_nocbs= may be used to specify offloaded CPUs at boot time. For example, rcu_nocbs=1-3,7 would cause CPUs 1, 2, 3, and 7 to have their callbacks offloaded to rcuo kthreads. The set of offloaded CPUs cannot be changed at runtime. However, experience thus far indicates that when at least one CPU needs to be offloaded, it is just fine to offload all of them. As a result, there has not yet been a strong need for runtime changes to the set of offloaded CPUs. Of course, if you do have a workload that really needs the set of offloaded CPUs to be changed at runtime, please let me know.
- rcu_nocb_poll also offloads the need to do wakeup operations from the offloaded CPUs. On the other hand, this means that all of the rcuo kthreads must poll, which probably is not what you want on a battery-powered system.
- rcutree.rcu_nocb_leader_stride= sets the number of NOCB kthread groups, which defaults to the square root of the number of CPUs. Larger numbers reduce the wakeup overhead on the per-CPU grace-period kthreads, but increase that same overhead on each group's leader.
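
To make the rcu_nocbs=1-3,7 example concrete, here is a minimal sketch of how such a CPU list expands. The parse_cpu_list() function is a hypothetical helper, not the kernel's parser, and it handles only the comma-separated numbers and simple ranges used in the example above.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a CPU list such as "1-3,7" into sorted CPU numbers.

    Only plain numbers and first-last ranges are handled; this is an
    illustrative sketch rather than the kernel's own parser.
    """
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))  # inclusive range
        else:
            cpus.add(int(part))
    return sorted(cpus)


offloaded = parse_cpu_list("1-3,7")  # CPUs 1, 2, 3, and 7
```

A helper like this can be handy in boot scripts that need to check that the offloaded set matches the CPUs reserved for a real-time workload.
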
CONFIG_NO_HZ_FULL causes RCU to treat user-space execution as an extended quiescent state similar to RCU's handling of dyntick-idle CPUs.
CONFIG_RCU_BOOST=y enables RCU priority boosting. This could be considered a debugging option, but it is one that pertains primarily to real-time kernels, so it is included in the real-time section. This Kconfig parameter causes blocked RCU readers to be priority-boosted in order to avoid indefinite prolongation of the current RCU grace period. The following Kconfig and boot parameters control the boosting process:
- CONFIG_RCU_BOOST_DELAY specifies how long RCU will allow a grace period to be delayed before starting RCU priority boosting. The default is 500 milliseconds, which seems to work quite well in practice.
- rcutree.kthread_prio= specifies the real-time priority to boost to, and defaults to priority one, the least-important real-time priority level. You should set this priority level to be greater than the highest-priority real-time CPU-bound thread. The default priority is appropriate for the common case where there are no CPU-bound threads running at real-time priority.

Please note that these Kconfig options do not pertain to CONFIG_TINY_RCU, which again is focused on small memory footprint, even at the expense of real-time response.

The fourth set of Kconfig parameters may also be specified to tune the data-structure layout of CONFIG_TREE_RCU and CONFIG_PREEMPT_RCU:

CONFIG_RCU_FANOUT controls the fanout of non-leaf nodes of the tree. Lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values have always sufficed with the exception of RCU stress testing.
CONFIG_RCU_FANOUT_LEAF controls the fanout of leaf nodes of the tree. As for CONFIG_RCU_FANOUT, lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values are sufficient in most cases, but very large systems (thousands of CPUs) will want to set this to the largest possible value, namely 64. Such systems will also need to boot with skew_tick=1 to avoid massive lock contention on the leaf rcu_node ->lock fields. This fanout can also be set at boot time:
- rcutree.rcu_fanout_leaf= sets the number of CPUs to assign to each leaf-level rcu_node structure. This defaults to 16 CPUs. Very large systems (many hundreds of CPUs) can benefit from setting this to its maximum (64 on 64-bit systems), but such systems should also set skew_tick=1.
- rcutree.rcu_fanout_exact= disables autobalancing of the rcu_node combining tree. To the best of my knowledge, the autobalancing has always worked well.
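
The memory-versus-contention tradeoff of the two fanout settings can be seen by counting rcu_node structures per level. The sketch below is a rough model only: rcu_node_levels() is a hypothetical name, the non-leaf fanout of 64 is an assumed value for a 64-bit system, and the autobalancing mentioned above is not modeled.

```python
def rcu_node_levels(nr_cpus: int, fanout_leaf: int = 16,
                    fanout: int = 64) -> list[int]:
    """Return the rcu_node count per level, leaf level first, root last.

    fanout_leaf is the number of CPUs per leaf rcu_node and fanout is
    the number of children per non-leaf rcu_node (64 is an assumption).
    """
    levels = [-(-nr_cpus // fanout_leaf)]        # ceiling division
    while levels[-1] > 1:
        levels.append(-(-levels[-1] // fanout))  # next level up the tree
    return levels


small_leaves = rcu_node_levels(4096, fanout_leaf=16)  # [256, 4, 1]
large_leaves = rcu_node_levels(4096, fanout_leaf=64)  # [64, 1]
```

On a 4096-CPU system, the default leaf fanout of 16 thus gives 261 rcu_node structures in three levels, while a leaf fanout of 64 gives 65 structures in two levels, at the cost of four times as many CPUs sharing each leaf's lock.
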

The fifth set of kernel configuration and boot parameters controls the RCU-tasks flavor:

CONFIG_PREEMPT=y enables this flavor of RCU. In CONFIG_PREEMPT=n kernels, call_rcu_tasks() maps to call_rcu() and synchronize_rcu_tasks() maps to synchronize_rcu().
The rcupdate.rcu_task_stall_timeout= kernel boot parameter gives a stall-warning timeout that defaults to ten minutes.

The sixth set of kernel configuration parameters controls debugging options:

CONFIG_RCU_TRACE enables RCU event tracing.
CONFIG_SPARSE_RCU_POINTER no longer exists. Instead, sparse unconditionally checks for proper use of RCU-protected pointers. Please note that this is a build-time check: Use "make C=1" to cause sparse to check source files that would have been rebuilt by "make", and use "make C=2" to cause sparse to unconditionally check source files.
CONFIG_DEBUG_OBJECTS_RCU_HEAD enables debug-objects checking of multiple invocations of call_rcu() (and friends) on the same structure.
CONFIG_PROVE_RCU enables lockdep-RCU checking. Note that only a single lockdep-RCU splat will be emitted per boot.
CONFIG_RCU_TORTURE_TEST enables RCU torture testing, also known as rcutorture. This is a tri-state parameter, permitting rcutorture.c to be compiled into the kernel, built as a module, or omitted entirely. When rcutorture.c is built into the kernel (CONFIG_RCU_TORTURE_TEST=y), then CONFIG_RCU_TORTURE_TEST_RUNNABLE starts RCU torture testing during boot. Please don't try this on production systems.
CONFIG_RCU_PERF_TEST enables RCU performance testing, which operates in a manner similar to rcutorture.
CONFIG_RCU_CPU_STALL_TIMEOUT specifies the maximum grace-period duration that RCU will tolerate without complaint. Excessively long grace periods are usually caused by CPUs or tasks failing to find their way out of an RCU read-side critical section in a timely manner. CPU stalls can be caused by a number of bugs, as described in Documentation/RCU/stallwarn.txt. This Kconfig variable defaults to 21 seconds. Note that if a grace period persists for more than half of this RCU CPU stall warning timeout, holdout CPUs will start receiving festive interprocessor interrupts in an attempt to get them to report quiescent states.
- rcupdate.rcu_cpu_stall_suppress= suppresses RCU CPU stall warning messages.
- rcupdate.rcu_cpu_stall_timeout= overrides the build-time CONFIG_RCU_CPU_STALL_TIMEOUT setting.
CONFIG_RCU_EQS_DEBUG causes RCU to check idle-state entry and exit. It is useful to enable this when adding new interrupt paths to your architecture, including when adding new architectures. This Kconfig option can help you find things like calls to rcu_irq_enter() that lack a matching rcu_irq_exit().
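
The two thresholds associated with CONFIG_RCU_CPU_STALL_TIMEOUT, namely the half-timeout point at which holdout CPUs start receiving interprocessor interrupts and the full timeout at which RCU complains, can be summarized as follows. The stall_state() function and its return strings are hypothetical, and the strict comparisons are an assumption.

```python
def stall_state(gp_age_s: float, stall_timeout_s: float = 21.0) -> str:
    """Classify a grace period by its age, in seconds.

    stall_timeout_s models CONFIG_RCU_CPU_STALL_TIMEOUT (default 21) or
    its rcupdate.rcu_cpu_stall_timeout= boot-time override.
    """
    if gp_age_s > stall_timeout_s:
        return "stall-warning"   # RCU CPU stall warning is due
    if gp_age_s > stall_timeout_s / 2:
        return "ipi-holdouts"    # holdout CPUs start receiving IPIs
    return "ok"
```

With the default of 21 seconds, a grace period that is 11 seconds old is already past the half-way point, so the holdout CPUs would be receiving IPIs well before any stall warning appears.
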

If you are working with code that uses RCU, please do us all a favor and test that code with CONFIG_PROVE_RCU and CONFIG_DEBUG_OBJECTS_RCU_HEAD enabled. If you are modifying the RCU implementation itself, you will need to run rcutorture, with multiple runs covering the relevant kernel configuration parameters. A one-hour rcutorture run on an 8-CPU machine qualifies as light rcutorture testing. The automated scripts invoked by tools/testing/selftests/rcutorture/bin/kvm.sh can be quite helpful.

Yes, running extra tests can be a hassle, but I am here to tell you that extra testing is much easier than trying to track down bugs in your RCU code.