Kernel configuration parameters for RCU
This sidebar is part of Paul McKenney's 2019 update to the RCU API.
Kernel configuration parameters
RCU's Kconfig options and kernel boot parameters can be considered
to be part of the RCU API, most especially from the viewpoint of someone
building a kernel intended for a specialized device or workload.
This section summarizes the RCU-related Kconfig options and the more
commonly used kernel boot parameters, but
please note that many of the Kconfig options require that the
CONFIG_RCU_EXPERT Kconfig option be set.
The first set of Kconfig parameters controls the underlying behavior of
the RCU implementation itself, and is defined in kernel/rcu/Kconfig.
- CONFIG_PREEMPT=n and CONFIG_SMP=y imply CONFIG_TREE_RCU, thus selecting the non-preemptible tree-based RCU implementation that is appropriate for server-class SMP builds. It can accommodate a very large number of CPUs, but scales down sufficiently well for all but the most memory-constrained systems. CONFIG_TREE_RCU provides the following boot parameters:
  - rcutree.blimit= sets the maximum number of RCU callbacks to process in one batch, which defaults to ten callbacks. This limit does not apply to offloaded CPUs.
  - rcutree.qhimark= sets the threshold of queued RCU callbacks beyond which rcutree.blimit= will be ignored. This defaults to 10,000 callbacks.
  - rcutree.qlowmark= sets the threshold of queued RCU callbacks below which rcutree.blimit= will once again have effect. This defaults to 100 callbacks.
  - rcutree.jiffies_till_first_fqs= sets the number of jiffies to wait between grace-period initialization and the first force-quiescent-state scan that checks (among other things) for idle CPUs. The default depends on the value of HZ and the number of CPUs on the system. By default, all systems wait for at least one jiffy, with one additional jiffy for HZ greater than 250, an additional jiffy for HZ greater than 500, and one additional jiffy for each 256 CPUs on the system. This value may be manually set to zero, which can be useful for specialty systems that tend to have idle CPUs, that need fast grace periods, and that don't mind burning a little extra CPU during grace-period initialization.
  - rcutree.jiffies_till_next_fqs= sets the number of jiffies to wait between successive force-quiescent-state scans. The default is the same as for rcutree.jiffies_till_first_fqs=.
  - rcutree.jiffies_till_sched_qs= sets the number of jiffies that a grace period will wait before soliciting help from rcu_note_context_switch() and cond_resched().
  - rcutree.rcu_kick_kthreads causes the grace-period kthread to get an extra wake-up if it sleeps more than three times longer than specified.
  - rcupdate.rcu_expedited= causes normal grace-period primitives to act like their expedited counterparts. For example, invoking synchronize_rcu() will act as if synchronize_rcu_expedited() had been invoked.
  - rcupdate.rcu_normal= causes expedited grace-period primitives to act like their normal counterparts. This kernel boot parameter overrides rcupdate.rcu_expedited= except during very early boot.
  - rcupdate.rcu_normal_after_boot= causes expedited grace-period primitives to act like their normal counterparts once init has spawned. Real-time systems desiring fast boot but wishing to avoid run-time IPIs from expedited grace periods would therefore set both rcupdate.rcu_expedited= and rcupdate.rcu_normal_after_boot=.
- CONFIG_PREEMPT=y implies CONFIG_PREEMPT_RCU, thus selecting the preemptible tree-based RCU implementation that is appropriate for real-time and low-latency SMP builds. It can also accommodate a very large number of CPUs, and scales down sufficiently well for all but the most memory-constrained systems. The boot parameters for CONFIG_TREE_RCU also apply to CONFIG_PREEMPT_RCU.
- CONFIG_PREEMPT=n and CONFIG_SMP=n imply CONFIG_TINY_RCU, selecting the non-preemptible uniprocessor (UP) RCU implementation that is appropriate for non-real-time UP builds. It has the smallest memory footprint of any of the current in-kernel RCU implementations.
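The default for rcutree.jiffies_till_first_fqs= described in the list above is simple arithmetic, and a small Python model may make it easier to predict for a given build. This is only a sketch of the rule as stated in the text. The function name is hypothetical, rounding down for partial groups of 256 CPUs is an assumption, and the kernel source remains the authority.

```python
def default_jiffies_till_first_fqs(hz, nr_cpus):
    """Model of the documented default; not the kernel's actual code."""
    jiffies = 1                      # every system waits at least one jiffy
    jiffies += 1 if hz > 250 else 0  # one more for HZ greater than 250
    jiffies += 1 if hz > 500 else 0  # and another for HZ greater than 500
    jiffies += nr_cpus // 256        # one per 256 CPUs (assumed to round down)
    return jiffies


if __name__ == "__main__":
    for hz, cpus in [(100, 4), (1000, 8), (250, 512), (1000, 4096)]:
        wait = default_jiffies_till_first_fqs(hz, cpus)
        print(f"HZ={hz} CPUs={cpus}: {wait} jiffies")
```

Under this model, a HZ=1000 kernel on an 8-CPU system waits three jiffies before the first scan, while the same kernel on 4096 CPUs waits 19.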
The second set of Kconfig parameters controls RCU's energy-efficiency
features.
These are also defined in kernel/rcu/Kconfig.
- CONFIG_RCU_FAST_NO_HZ=y improves RCU's energy efficiency by reducing the number of times that RCU wakes up idle CPUs. The downside of this approach is that it increases RCU grace-period latency somewhat.
  - rcutree.rcu_idle_gp_delay= specifies the number of jiffies an idle CPU with callbacks should remain idle before rechecking RCU state. The default is four jiffies.
  - rcutree.rcu_idle_lazy_gp_delay= specifies the number of jiffies an idle CPU with callbacks, where all callbacks are lazy, should remain idle before rechecking RCU state. (A "lazy" callback is one that RCU knows will do nothing other than free memory.) The default is six seconds, or 6*HZ jiffies.
- CONFIG_RCU_NOCB_CPU=y fortuitously improves RCU's energy efficiency by eliminating wakeups due to RCU callback processing. However, it was intended for real-time use, so is covered in the next section.
- srcutree.exp_holdoff= controls the auto-expediting of the first SRCU grace period starting after an extended idle period, and defaults to 25 microseconds. If it has been longer than this since the end of the last SRCU grace period (for the same srcu_struct structure), the new SRCU grace period will be expedited. Please note that the units of this kernel boot parameter are nanoseconds.
Please note that these features do not pertain to CONFIG_TINY_RCU,
whose job description emphasizes
small memory footprint over energy efficiency.
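Because srcutree.exp_holdoff= is described in microseconds but specified in nanoseconds, the conversion is easy to get wrong. The following Python sketch shows the conversion and the auto-expediting rule as stated above. Both function names are hypothetical, and the strict "longer than" comparison is an assumption taken from the wording of the text rather than from the kernel source.

```python
DEFAULT_EXP_HOLDOFF_NS = 25 * 1000  # the 25-microsecond default, in nanoseconds


def exp_holdoff_param(microseconds):
    """Format the boot parameter, converting microseconds to nanoseconds."""
    return f"srcutree.exp_holdoff={microseconds * 1000}"


def would_auto_expedite(idle_ns, holdoff_ns=DEFAULT_EXP_HOLDOFF_NS):
    """True if the time since the last SRCU grace period exceeds the holdoff."""
    return idle_ns > holdoff_ns


if __name__ == "__main__":
    print(exp_holdoff_param(25))
    print(would_auto_expedite(30_000), would_auto_expedite(10_000))
```

For example, a 50-microsecond holdoff would be requested as srcutree.exp_holdoff=50000.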
The third set of Kconfig parameters controls RCU's real-time features,
which are also defined in kernel/rcu/Kconfig.
- CONFIG_RCU_NOCB_CPU=y allows callback processing to be offloaded from selected CPUs, with the "NOCB" standing for "no callbacks". The CPUs to offload can be specified at boot time, as can a couple of other things:
  - rcu_nocbs= may be used to specify offloaded CPUs at boot time. For example, rcu_nocbs=1-3,7 would cause CPUs 1, 2, 3, and 7 to have their callbacks offloaded to rcuo kthreads. The set of offloaded CPUs cannot be changed at runtime. However, experience thus far indicates that when at least one CPU needs to be offloaded, it is just fine to offload all of them. As a result, there has not yet been a strong need for runtime changes to the set of offloaded CPUs. Of course, if you have a workload that really needs the set of offloaded CPUs to be changed at runtime, please let me know.
  - rcu_nocb_poll also offloads the need to do wakeup operations from the offloaded CPUs. On the other hand, this means that all of the rcuo kthreads must poll, which probably is not what you want on a battery-powered system.
  - rcutree.rcu_nocb_leader_stride= sets the number of NOCB kthread groups, which defaults to the square root of the number of CPUs. Larger numbers reduce the wakeup overhead on the per-CPU grace-period kthreads, but increase that same overhead on each group's leader.
- CONFIG_NO_HZ_FULL causes RCU to treat user-space execution as an extended quiescent state similar to RCU's handling of dyntick-idle CPUs.
- CONFIG_RCU_BOOST=y enables RCU priority boosting. This could be considered a debugging option, but it is one that pertains primarily to real-time kernels, so is included in the real-time section. This Kconfig parameter causes blocked RCU readers to be priority-boosted in order to avoid indefinite prolongment of the current RCU grace period. The following Kconfig and boot parameters control the boosting process:
  - CONFIG_RCU_BOOST_DELAY specifies how long RCU will allow a grace period to be delayed before starting RCU priority boosting. The default is 500 milliseconds, which seems to work quite well in practice.
  - rcutree.kthread_prio= specifies the real-time priority to boost to, and defaults to priority one, the least-important real-time priority level. You should set this priority level to be greater than the highest-priority real-time CPU-bound thread. The default priority is appropriate for the common case where there are no CPU-bound threads running at real-time priority.
Please note that these Kconfig options do not pertain to CONFIG_TINY_RCU,
which again is focused on small
memory footprint, even at the expense of real-time response.
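To double-check which CPUs a given rcu_nocbs= value covers, the CPU list can be expanded with a few lines of Python. This sketch handles only the comma-separated numbers and ranges used in the rcu_nocbs=1-3,7 example above. The function name is hypothetical, and any richer list syntax the kernel may accept is deliberately not handled.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "1-3,7" into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty fields such as a trailing comma
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"bad range: {part!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(parse_cpu_list("1-3,7"))
```

Running it on the example from the text yields CPUs 1, 2, 3, and 7, matching the description above.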
The fourth set of Kconfig parameters may also be specified to tune the
data-structure layout of CONFIG_TREE_RCU and CONFIG_PREEMPT_RCU:
- CONFIG_RCU_FANOUT controls the fanout of non-leaf nodes of the tree. Lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values have always sufficed with the exception of RCU stress testing.
- CONFIG_RCU_FANOUT_LEAF controls the fanout of leaf nodes of the tree. As for CONFIG_RCU_FANOUT, lower fanout values reduce lock contention, but also consume more memory and increase the overhead of grace-period computations. The default values are sufficient in most cases, but very large systems (thousands of CPUs) will want to set this to the largest possible value, namely 64. Such systems will also need to boot with skew_tick=1 to avoid massive lock contention on the leaf rcu_node->lock fields. This fanout can also be set at boot time:
  - rcutree.rcu_fanout_leaf= sets the number of CPUs to assign to each leaf-level rcu_node structure. This defaults to 16 CPUs. Very large systems (many hundreds of CPUs) can benefit from setting this to its maximum (64 on 64-bit systems), but such systems should also set skew_tick=1.
  - rcutree.rcu_fanout_exact= disables autobalancing of the rcu_node combining tree. To the best of my knowledge, the autobalancing has always worked well.
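The effect of the two fanout settings on the shape of the rcu_node combining tree can be estimated with a simplified Python model. This is a sketch under stated assumptions, not the kernel's geometry computation: the function name is hypothetical, the non-leaf fanout of 64 is an assumed value rather than one given in the text, and autobalancing is ignored.

```python
def rcu_node_levels(nr_cpus, fanout_leaf=16, fanout=64):
    """Return rcu_node counts per level, root first, for a simplified tree."""
    if nr_cpus < 1 or fanout_leaf < 1 or fanout < 2:
        raise ValueError("need nr_cpus >= 1, fanout_leaf >= 1, fanout >= 2")
    # Leaf level: each leaf rcu_node handles up to fanout_leaf CPUs.
    counts = [-(-nr_cpus // fanout_leaf)]  # integer ceiling division
    # Add parent levels until a single root node remains.
    while counts[0] > 1:
        counts.insert(0, -(-counts[0] // fanout))
    return counts


if __name__ == "__main__":
    print(rcu_node_levels(4096))                  # leaf fanout of 16
    print(rcu_node_levels(4096, fanout_leaf=64))  # leaf fanout of 64
```

In this model, 4096 CPUs with the default leaf fanout of 16 need 256 leaf nodes across three levels, while a leaf fanout of 64 needs only 64 leaf nodes across two levels. That is the memory and grace-period-overhead saving that very large systems are after, at the price of more CPUs contending for each leaf's lock.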
The fifth set of kernel configuration and boot parameters controls the RCU-tasks flavor:
- CONFIG_PREEMPT=y enables this flavor of RCU. In CONFIG_PREEMPT=n kernels, call_rcu_tasks() maps to call_rcu() and synchronize_rcu_tasks() maps to synchronize_rcu().
- The rcupdate.rcu_task_stall_timeout= kernel boot parameter gives a stall-warning timeout that defaults to ten minutes.
The sixth set of kernel configuration parameters controls debugging options:
- CONFIG_RCU_TRACE enables RCU event tracing.
- CONFIG_SPARSE_RCU_POINTER no longer exists. Instead, sparse unconditionally checks for proper use of RCU-protected pointers. Please note that this is a build-time check: Use "make C=1" to cause sparse to check source files that would have been rebuilt by "make", and use "make C=2" to cause sparse to unconditionally check source files.
- CONFIG_DEBUG_OBJECTS_RCU_HEAD enables debug-objects checking of multiple invocations of call_rcu() (and friends) on the same structure.
- CONFIG_PROVE_RCU enables lockdep-RCU checking. Note that only a single lockdep-RCU splat will be emitted per boot.
- CONFIG_RCU_TORTURE_TEST enables RCU torture testing, also known as rcutorture. This is a tri-state parameter, permitting rcutorture.c to be compiled into the kernel, built as a module, or omitted entirely. When rcutorture.c is built into the kernel (CONFIG_RCU_TORTURE_TEST=y), then CONFIG_RCU_TORTURE_TEST_RUNNABLE starts RCU torture testing during boot. Please don't try this on production systems.
- CONFIG_RCU_PERF_TEST enables RCU performance testing, which operates in a manner similar to rcutorture.
- CONFIG_RCU_CPU_STALL_TIMEOUT specifies the maximum grace-period duration that RCU will tolerate without complaint. Excessively long grace periods are usually caused by CPUs or tasks failing to find their way out of an RCU read-side critical section in a timely manner. CPU stalls can be caused by a number of bugs, as described in Documentation/RCU/stallwarn.txt. This Kconfig variable defaults to 21 seconds. Note that if a grace period persists for more than half of this RCU CPU stall warning timeout, holdout CPUs will start receiving festive interprocessor interrupts in an attempt to get them to report quiescent states.
  - rcupdate.rcu_cpu_stall_suppress= suppresses RCU CPU stall warning messages.
  - rcupdate.rcu_cpu_stall_timeout= overrides the build-time CONFIG_RCU_CPU_STALL_TIMEOUT setting.
- CONFIG_RCU_EQS_DEBUG causes RCU to check idle-state entry and exit. It is useful to enable this when adding new interrupt paths to your architecture, including when adding new architectures. This Kconfig option can help you find things like calls to rcu_irq_enter() that lack a matching rcu_irq_exit().
If you are working with code that uses RCU, please do us all
a favor and test that code with CONFIG_PROVE_RCU
and CONFIG_DEBUG_OBJECTS_RCU_HEAD enabled.
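One way to avoid forgetting these two options is to check the kernel's .config before testing. The Python sketch below scans .config-style text and reports which of the recommended options are not set to "y". The function name and the sample text are hypothetical, and the check assumes the usual NAME=y line format.

```python
RECOMMENDED = ("CONFIG_PROVE_RCU", "CONFIG_DEBUG_OBJECTS_RCU_HEAD")


def missing_rcu_debug_options(config_text, wanted=RECOMMENDED):
    """Return the wanted options that are not set to "y" in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips blanks and "# CONFIG_FOO is not set" lines
        name, sep, value = line.partition("=")
        if sep and value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]


if __name__ == "__main__":
    sample = "CONFIG_PROVE_RCU=y\n# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set\n"
    print(missing_rcu_debug_options(sample))
```

With the sample text shown, only CONFIG_DEBUG_OBJECTS_RCU_HEAD is reported as missing. To check a real build, pass in the contents of its .config file instead.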
If you are modifying the RCU implementation itself, you will need to
run rcutorture, with multiple runs covering the relevant kernel
configuration parameters.
A one-hour rcutorture run on an 8-CPU machine qualifies as light
rcutorture testing.
The automated scripts invoked by
tools/testing/selftests/rcutorture/bin/kvm.sh can be quite helpful.
Yes, running extra tests can be a hassle, but I am here to tell you that extra testing is much easier than trying to track down bugs in your RCU code.