Rationalizing CPU hotplugging
Hotplugging is a familiar concept when one thinks about keyboards, printers, or storage devices, but it is a bit less so for CPUs: USB-attached add-on processors are still relatively rare in the market. Even so, the kernel has had support for CPU hotplug for some time; the original version of Documentation/cpu-hotplug.txt was added in 2006 for the 2.6.16 kernel. That document mentioned a couple of use cases for this feature: high-end NUMA hardware that truly has runtime-pluggable processors, and the ability to disable a faulty CPU in a high-reliability system. Other uses have since come along, including system suspend operations (where all CPUs but one are "unplugged" prior to suspending the system) and virtualization, where virtual CPUs can be given to (or taken from) guests at will.
So CPU hotplug is a useful feature, but the current implementation in the kernel is not well loved; in a recent patch set intended to improve the situation, Thomas Gleixner remarked that "the current CPU hotplug implementation has become an increasing nightmare full of races and undocumented behaviour." CPU hotplug shows a lot of the signs of a feature that has evolved significantly over time without high-level oversight; among other things, the sequence of steps followed for an unplug operation is not the reverse of the steps to plug in a new CPU. But much of the trouble associated with CPU hotplug is blamed on its extensive use of notifiers.
The kernel's notifier mechanism is a way for kernel code to request a callback when an event of interest happens. Notifiers are, in a sense, general-purpose hooks that anybody in the kernel can use, and, it seems, just about anybody does. There have been a lot of complaints about notifiers, typified by a pointed comment from Linus in response to Thomas's patch set.
Notifiers also make the code hard to understand because there is no easy way to know what will happen when a notifier chain (which is a run-time construct) is invoked: there could be an arbitrary set of notifiers in the chain, in any order. The ordering requirements of specific notifiers can add some fun challenges of their own.
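To make that concrete, here is a rough, minimal sketch of how a subsystem hooks into the CPU hotplug notifier chain in kernels of this era; the "example_" names are made up for illustration, while struct notifier_block, register_cpu_notifier(), and NOTIFY_OK are the existing interfaces. The callback body is left empty here; the kind of work it would do is sketched in the next example.

    #include <linux/cpu.h>
    #include <linux/notifier.h>
    #include <linux/init.h>

    /* Placeholder callback; a real one would act on the CPU_* action codes. */
    static int example_cpu_callback(struct notifier_block *nb,
                                    unsigned long action, void *hcpu)
    {
            return NOTIFY_OK;
    }

    static struct notifier_block example_cpu_nb = {
            .notifier_call = example_cpu_callback,
            /* The only control over where this callback runs relative to
             * everything else on the chain is this priority value. */
            .priority = 10,
    };

    static int __init example_init(void)
    {
            return register_cpu_notifier(&example_cpu_nb);
    }

Nothing in this registration says what else is on the chain or when this callback will run relative to it; that is only knowable by examining every other caller at run time, which is exactly the complaint.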
The process of unplugging a CPU requires a surprisingly long list of actions. The scheduler must be informed so it can migrate processes off the affected CPU and shut down the relevant run queue. Per-CPU kernel threads need to be told to exit or "park" themselves. CPU frequency governors need to be told to stop worrying about that processor. Almost anything with per-CPU variables will need to make arrangements for one CPU to go away. Timers running on the outgoing CPU need to be relocated. The read-copy-update subsystem must be told to stop tracking the CPU and to ensure that any RCU callbacks for that CPU get taken care of. Every architecture has its own low-level details to take care of. The perf events subsystem has an impressive set of requirements of its own. And so on; this list is nowhere near comprehensive.
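As a further illustrative sketch (registered the same way as the example above), a subsystem with a per-CPU kernel thread and per-CPU queued work might handle the offline-side notifications roughly as follows. The my_* names and the my_adopt_work() helper are hypothetical; the CPU_* action codes, kthread_park(), and kthread_unpark() are real interfaces.

    #include <linux/cpu.h>
    #include <linux/notifier.h>
    #include <linux/kthread.h>
    #include <linux/percpu.h>
    #include <linux/smp.h>

    static DEFINE_PER_CPU(struct task_struct *, my_percpu_thread);

    /* Hypothetical helper: move work queued on the dead CPU elsewhere. */
    static void my_adopt_work(unsigned int from_cpu, unsigned int to_cpu)
    {
            /* Details elided; a real subsystem would requeue items here. */
    }

    static int my_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
            unsigned int cpu = (unsigned long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_DOWN_PREPARE:
                    /* Park this subsystem's kthread before the CPU stops. */
                    kthread_park(per_cpu(my_percpu_thread, cpu));
                    break;
            case CPU_DOWN_FAILED:
                    /* The offline attempt was aborted; resume normal service. */
                    kthread_unpark(per_cpu(my_percpu_thread, cpu));
                    break;
            case CPU_DEAD:
                    /* The CPU is gone; adopt whatever work it left behind. */
                    my_adopt_work(cpu, smp_processor_id());
                    break;
            }
            return NOTIFY_OK;
    }

Multiply that pattern by every subsystem in the list above, each with its own callback and its own ordering constraints, and the scale of the problem becomes apparent.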
All of these actions are currently accomplished by way of a set of notifier callbacks which, with luck, get called in the right order. Meanwhile, plugging in a new CPU requires an analogous set of operations, but those are handled in an asymmetric manner with a different set of callbacks. The end result is that the mechanism is fragile and that few people have any real understanding of all the steps needed to plug or unplug a CPU.
Thomas's objective is not to rewrite all those notifier functions or fundamentally change what is done to implement a CPU hotplug operation — at least, not yet. Instead, he is focused on imposing some order on the whole process so that it can be understood by looking at the code. To that end, he has replaced the current set of notifier chains with a linear sequence of states to be worked through when bringing up or shutting down a CPU. There is a single array of cpuhp_step structures, one per state:
    struct cpuhp_step {
        int (*startup)(unsigned int cpu);
        int (*teardown)(unsigned int cpu);
    };
The startup() function will be called when passing through the state as a new CPU is brought online, while teardown() is called when things are moving in the other direction. Many states only have one function or the other in the current implementation; the eventual goal is to make the process more symmetrical. In the initial patch set, the set of states is:
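To illustrate the idea, the core hotplug code can then walk that array in order, calling startup() functions on the way up and teardown() functions, in reverse order, on the way down. The following is a sketch only, not code from the patch set; cpuhp_steps[], CPUHP_STATE_MAX, and the two walker functions are invented for the example.

    /* Illustrative only: these names are invented for this sketch. */
    #define CPUHP_STATE_MAX 64      /* stand-in for the real state count */

    static struct cpuhp_step cpuhp_steps[CPUHP_STATE_MAX];

    static int cpuhp_walk_up(unsigned int cpu)
    {
            int state, ret;

            for (state = 0; state < CPUHP_STATE_MAX; state++) {
                    if (!cpuhp_steps[state].startup)
                            continue;
                    ret = cpuhp_steps[state].startup(cpu);
                    if (ret) {
                            /* Undo what has been done so far, in reverse. */
                            while (--state >= 0)
                                    if (cpuhp_steps[state].teardown)
                                            cpuhp_steps[state].teardown(cpu);
                            return ret;
                    }
            }
            return 0;
    }

    static void cpuhp_walk_down(unsigned int cpu)
    {
            int state;

            for (state = CPUHP_STATE_MAX - 1; state >= 0; state--)
                    if (cpuhp_steps[state].teardown)
                            cpuhp_steps[state].teardown(cpu);
    }

The interesting property is the symmetry: whatever sequence is used to bring a CPU up is, by construction, run in reverse to take it back down, and the whole sequence can be read out of a single table.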
| State | startup | teardown |
|---|---|---|
| CPUHP_CREATE_THREADS | ✔ | |
| CPUHP_PERF_X86_UNCORE_PREP | ✔ | ✔ |
| CPUHP_PERF_X86_PREPARE | ✔ | ✔ |
| CPUHP_PERF_BFIN | ✔ | |
| CPUHP_PERF_POWER | ✔ | |
| CPUHP_PERF_SUPERH | ✔ | |
| CPUHP_PERF_PREPARE | ✔ | ✔ |
| CPUHP_SCHED_MIGRATE_PREP | ✔ | ✔ |
| CPUHP_WORKQUEUE_PREP | ✔ | |
| CPUHP_RCUTREE_PREPARE | ✔ | ✔ |
| CPUHP_HRTIMERS_PREPARE | ✔ | ✔ |
| CPUHP_TIMERS_PREPARE | ✔ | ✔ |
| CPUHP_PROFILE_PREPARE | ✔ | ✔ |
| CPUHP_X2APIC_PREPARE | ✔ | ✔ |
| CPUHP_SMPCFD_PREPARE | ✔ | ✔ |
| CPUHP_SMPCFD_PREPARE | ✔ | |
| CPUHP_SLAB_PREPARE | ✔ | ✔ |
| CPUHP_NOTIFY_PREPARE | ✔ | |
| CPUHP_NOTIFY_DEAD | | ✔ |
| CPUHP_CPUFREQ_DEAD | | ✔ |
| CPUHP_SCHED_DEAD | | ✔ |
| CPUHP_CLOCKEVENTS_DEAD | | ✔ |
| CPUHP_BRINGUP_CPU | ✔ | |
| CPUHP_AP_OFFLINE | *Application processor states* | |
| CPUHP_AP_SCHED_STARTING | ✔ | |
| CPUHP_AP_PERF_X86_UNCORE_STARTING | ✔ | |
| CPUHP_AP_PERF_X86_AMD_IBS_STARTING | ✔ | ✔ |
| CPUHP_AP_PERF_X86_STARTING | ✔ | ✔ |
| CPUHP_AP_PERF_ARM_STARTING | ✔ | |
| CPUHP_AP_ARM_VFP_STARTING | ✔ | ✔ |
| CPUHP_AP_ARM64_TIMER_STARTING | ✔ | ✔ |
| CPUHP_AP_KVM_STARTING | ✔ | ✔ |
| CPUHP_AP_X86_TBOOT_DYING | | ✔ |
| CPUHP_AP_S390_VTIME_DYING | | ✔ |
| CPUHP_AP_CLOCKEVENTS_DYING | | ✔ |
| CPUHP_AP_RCUTREE_DYING | | ✔ |
| CPUHP_AP_SCHED_NOHZ_DYING | | ✔ |
| CPUHP_AP_SCHED_MIGRATE_DYING | | ✔ |
| CPUHP_AP_MAZ | *End marker for AP states* | |
| CPUHP_TEARDOWN_CPU | | ✔ |
| CPUHP_PERCPU_THREADS | ✔ | ✔ |
| CPUHP_SCHED_ONLINE | ✔ | ✔ |
| CPUHP_PERF_ONLINE | ✔ | ✔ |
| CPUHP_SCHED_MIGRATE_ONLINE | ✔ | |
| CPUHP_WORKQUEUE_ONLINE | ✔ | ✔ |
| CPUHP_CPUFREQ_ONLINE | ✔ | ✔ |
| CPUHP_RCUTREE_ONLINE | ✔ | ✔ |
| CPUHP_NOTIFY_ONLINE | ✔ | |
| CPUHP_PROFILE_ONLINE | ✔ | |
| CPUHP_SLAB_ONLINE | ✔ | ✔ |
| CPUHP_NOTIFY_DOWN_PREPARE | | ✔ |
| CPUHP_PERF_X86_UNCORE_ONLINE | ✔ | ✔ |
| CPUHP_PERF_X86_ONLINE | ✔ | |
| CPUHP_PERF_S390_ONLINE | ✔ | ✔ |
Looking at that list, one begins to see why the current CPU hotplug mechanism is hard to understand. Things are messy enough that Thomas is not really trying to change anything fundamental in how CPU hotplug works; most of the existing notifier callbacks are still there, they are just invoked in a different way. The purpose of the exercise, Thomas said, is to impose enough order on the process that later cleanups become possible.
Once some high-level order has been brought to the CPU hotplug mechanism, one can think about trying to clean things up. The eventual goal is to have a much smaller set of externally visible states; for drivers and filesystems, there will only be "prepare" and "enable" states available, with no ordering between subsystems. Also, notably, drivers and filesystems will not be allowed to cause a hotplug operation (in either direction) to fail. When the process is complete, the hotplug subsystem should be much more predictable, with a lot more of the details hidden from the rest of the kernel.
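None of that driver-facing interface exists yet; the following is just one way to imagine what such a reduced, externally visible interface could look like once the cleanup is complete. Every name here is invented for illustration and does not come from the patch set.

    /*
     * Invented illustration of a possible future driver-facing interface:
     * only "prepare" and "enable" style callbacks, no ordering control,
     * and void returns because drivers and filesystems would not be able
     * to fail a hotplug operation.
     */
    struct cpuhp_client {
            void (*prepare)(unsigned int cpu);    /* before the CPU runs */
            void (*enable)(unsigned int cpu);     /* CPU fully online */
            void (*disable)(unsigned int cpu);    /* before offlining */
            void (*unprepare)(unsigned int cpu);  /* after the CPU is gone */
    };

    int cpuhp_register_client(struct cpuhp_client *client);
    void cpuhp_unregister_client(struct cpuhp_client *client);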
That is all work for a future series, though; the first step is to get the
infrastructure set up. Chances are that will require at least one more
iteration of Thomas's "Episode 1" patch set, meaning that it is
unlikely to be 3.9 material. Starting around 3.10, though, we may well see
significant changes to how CPU hotplugging is handled; the result should be
more comprehensible and reliable code.