
Rationalizing CPU hotplugging

By Jonathan Corbet
February 12, 2013
One of the leading sources of code churn in the 3.8 development cycle was the removal of the __devinit family of macros. These macros marked code and data that were only needed during device initialization and which, thus, could be disposed of once initialization was complete. These macros are being removed for a simple reason: hardware has become so dynamic that initialization is never complete; something new can always show up, and there is no longer any point in building a kernel that cannot cope with transient devices. Even in this world, though, CPUs are generally seen as being static. But CPUs, too, can come and go, and that is motivating changes in how the kernel manages them.
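
As a reminder of what is going away, a pre-3.8 driver might have marked its probe routine like this (the driver and function names here are hypothetical):

    #include <linux/init.h>	/* provided __devinit until its 3.8 removal */
    #include <linux/pci.h>

    /* Probe routines run only at device-discovery time; __devinit let
     * kernels built without hotplug support discard them after boot. */
    static int __devinit mydev_probe(struct pci_dev *pdev,
				     const struct pci_device_id *ent)
    {
	    /* set up the newly discovered device */
	    return 0;
    }

With hotplug support effectively always needed, the annotations saved nothing, so they were dropped.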

Hotplugging is a familiar concept when one thinks about keyboards, printers, or storage devices, but it is a bit less so for CPUs: USB-attached add-on processors are still relatively rare in the market. Even so, the kernel has had support for CPU hotplug for some time; the original version of Documentation/cpu-hotplug.txt was added in 2006 for the 2.6.16 kernel. That document mentioned a couple of use cases for this feature: high-end NUMA hardware that truly has runtime-pluggable processors, and the ability to disable a faulty CPU in a high-reliability system. Other uses have since come along, including system suspend operations (where all CPUs but one are "unplugged" prior to suspending the system) and virtualization, where virtual CPUs can be given to (or taken from) guests at will.

So CPU hotplug is a useful feature, but the current implementation in the kernel is not well loved; in a recent patch set intended to improve the situation, Thomas Gleixner remarked that "the current CPU hotplug implementation has become an increasing nightmare full of races and undocumented behaviour." CPU hotplug shows a lot of the signs of a feature that has evolved significantly over time without high-level oversight; among other things, the sequence of steps followed for an unplug operation is not the reverse of the steps to plug in a new CPU. But much of the trouble associated with CPU hotplug is blamed on its extensive use of notifiers.

The kernel's notifier mechanism is a way for kernel code to request a callback when an event of interest happens. Notifiers are, in a sense, general-purpose hooks that anybody in the kernel can use — and, it seems, just about anybody does. There have been a lot of complaints about notifiers, as is typified by this comment from Linus in response to Thomas's patch set:

Notifiers are a disgrace, and almost all of them are a major design mistake. They all have locking problems, [they] introduce internal arbitrary API's that are hard to fix later (because you have random people who decided to hook into them, which is the whole *point* of those notifier chains).

Notifiers also make the code hard to understand because there is no easy way to know what will happen when a notifier chain (which is a run-time construct) is invoked: there could be an arbitrary set of notifiers in the chain, in any order. The ordering requirements of specific notifiers can add some fun challenges of their own.
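
To make the pattern concrete, here is roughly how a subsystem hooks into the CPU hotplug notifier chain today; the callback and its contents are hypothetical, but register_cpu_notifier() and the CPU_* action codes are the actual current interface:

    #include <linux/cpu.h>
    #include <linux/init.h>
    #include <linux/notifier.h>

    /* Called for every CPU hotplug event; "action" says where in the
     * process we are, and hcpu identifies the CPU involved. */
    static int my_cpu_callback(struct notifier_block *nb,
			       unsigned long action, void *hcpu)
    {
	    unsigned int cpu = (unsigned long)hcpu;

	    switch (action) {
	    case CPU_UP_PREPARE:
		    /* allocate per-CPU resources before the CPU arrives */
		    break;
	    case CPU_DEAD:
		    /* release them once the CPU is gone */
		    break;
	    }
	    return NOTIFY_OK;
    }

    static struct notifier_block my_cpu_notifier = {
	    .notifier_call = my_cpu_callback,
    };

    static int __init my_init(void)
    {
	    register_cpu_notifier(&my_cpu_notifier);
	    return 0;
    }

Nothing here says when this callback will run relative to the others on the chain; that ordering exists only at run time, which is exactly the complaint.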

The process of unplugging a CPU requires a surprisingly long list of actions. The scheduler must be informed so it can migrate processes off the affected CPU and shut down the relevant run queue. Per-CPU kernel threads need to be told to exit or "park" themselves. CPU frequency governors need to be told to stop worrying about that processor. Almost anything with per-CPU variables will need to make arrangements for one CPU to go away. Timers running on the outgoing CPU need to be relocated. The read-copy-update subsystem must be told to stop tracking the CPU and to ensure that any RCU callbacks for that CPU get taken care of. Every architecture has its own low-level details to take care of. The perf events subsystem has an impressive set of requirements of its own. And so on; this list is nowhere near comprehensive.
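
The "parking" mechanism referred to above is the smpboot infrastructure merged in 3.7; a minimal (and hypothetical) user of it looks something like the following, with smpboot itself creating, parking, and unparking the thread as CPUs come and go:

    #include <linux/init.h>
    #include <linux/percpu.h>
    #include <linux/sched.h>
    #include <linux/smpboot.h>

    static DEFINE_PER_CPU(struct task_struct *, my_task);

    /* Return nonzero when there is work for this CPU's thread to do. */
    static int my_should_run(unsigned int cpu)
    {
	    return 0;
    }

    /* Do this CPU's pending work. */
    static void my_thread_fn(unsigned int cpu)
    {
    }

    static struct smp_hotplug_thread my_threads = {
	    .store			= &my_task,
	    .thread_should_run	= my_should_run,
	    .thread_fn		= my_thread_fn,
	    .thread_comm		= "my_worker/%u",
    };

    static int __init my_init(void)
    {
	    return smpboot_register_percpu_thread(&my_threads);
    }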

All of these actions are currently accomplished by way of a set of notifier callbacks which, with luck, get called in the right order. Meanwhile, plugging in a new CPU requires an analogous set of operations, but those are handled in an asymmetric manner with a different set of callbacks. The end result is that the mechanism is fragile and that few people have any real understanding of all the steps needed to plug or unplug a CPU.

Thomas's objective is not to rewrite all those notifier functions or fundamentally change what is done to implement a CPU hotplug operation — at least, not yet. Instead, he is focused on imposing some order on the whole process so that it can be understood by looking at the code. To that end, he has replaced the current set of notifier chains with a linear sequence of states to be worked through when bringing up or shutting down a CPU. There is a single array of cpuhp_step structures, one per state:

    struct cpuhp_step {
	int (*startup)(unsigned int cpu);
	int (*teardown)(unsigned int cpu);
    };

The startup() function will be called when passing through the state as a new CPU is brought online, while teardown() is called when things are moving in the other direction. Many states only have one function or the other in the current implementation; the eventual goal is to make the process more symmetrical. In the initial patch set, the set of states (a sketch of how the core might walk them follows the list) is:

CPUHP_CREATE_THREADS
CPUHP_PERF_X86_UNCORE_PREP
CPUHP_PERF_X86_PREPARE
CPUHP_PERF_BFIN
CPUHP_PERF_POWER
CPUHP_PERF_SUPERH
CPUHP_PERF_PREPARE
CPUHP_SCHED_MIGRATE_PREP
CPUHP_WORKQUEUE_PREP
CPUHP_RCUTREE_PREPARE
CPUHP_HRTIMERS_PREPARE
CPUHP_TIMERS_PREPARE
CPUHP_PROFILE_PREPARE
CPUHP_X2APIC_PREPARE
CPUHP_SMPCFD_PREPARE
CPUHP_SLAB_PREPARE
CPUHP_NOTIFY_PREPARE
CPUHP_NOTIFY_DEAD
CPUHP_CPUFREQ_DEAD
CPUHP_SCHED_DEAD
CPUHP_CLOCKEVENTS_DEAD
CPUHP_BRINGUP_CPU
CPUHP_AP_OFFLINE (application processor states begin here)
CPUHP_AP_SCHED_STARTING
CPUHP_AP_PERF_X86_UNCORE_STARTING
CPUHP_AP_PERF_X86_AMD_IBS_STARTING
CPUHP_AP_PERF_X86_STARTING
CPUHP_AP_PERF_ARM_STARTING
CPUHP_AP_ARM_VFP_STARTING
CPUHP_AP_ARM64_TIMER_STARTING
CPUHP_AP_KVM_STARTING
CPUHP_AP_X86_TBOOT_DYING
CPUHP_AP_S390_VTIME_DYING
CPUHP_AP_CLOCKEVENTS_DYING
CPUHP_AP_RCUTREE_DYING
CPUHP_AP_SCHED_NOHZ_DYING
CPUHP_AP_SCHED_MIGRATE_DYING
CPUHP_AP_MAX (end marker for AP states)
CPUHP_TEARDOWN_CPU
CPUHP_PERCPU_THREADS
CPUHP_SCHED_ONLINE
CPUHP_PERF_ONLINE
CPUHP_SCHED_MIGRATE_ONLINE
CPUHP_WORKQUEUE_ONLINE
CPUHP_CPUFREQ_ONLINE
CPUHP_RCUTREE_ONLINE
CPUHP_NOTIFY_ONLINE
CPUHP_PROFILE_ONLINE
CPUHP_SLAB_ONLINE
CPUHP_NOTIFY_DOWN_PREPARE
CPUHP_PERF_X86_UNCORE_ONLINE
CPUHP_PERF_X86_ONLINE
CPUHP_PERF_S390_ONLINE
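
With the states in a single array, the core's job reduces to stepping through it. The following is a sketch, not code from the patch set; cpuhp_steps[] and CPUHP_MAX are assumed names for the state array and its end marker:

    /* Sketch: bring a CPU online by invoking each state's startup()
     * callback in order. If one fails, unwind by calling teardown()
     * for the states already passed, in reverse order. */
    static int cpuhp_bringup(unsigned int cpu)
    {
	    int state, ret = 0;

	    for (state = 0; state < CPUHP_MAX; state++) {
		    if (!cpuhp_steps[state].startup)
			    continue;
		    ret = cpuhp_steps[state].startup(cpu);
		    if (ret)
			    break;
	    }
	    if (!ret)
		    return 0;

	    /* Roll back what has been done so far. */
	    while (--state >= 0)
		    if (cpuhp_steps[state].teardown)
			    cpuhp_steps[state].teardown(cpu);
	    return ret;
    }

Unplugging a CPU would walk the same array in the opposite direction, calling the teardown() callbacks; the symmetry that the notifier-based code lacks falls out of the data structure itself.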

Looking at that list, one begins to see why the current CPU hotplug mechanism is hard to understand. Things are messy enough that Thomas is not really trying to change anything fundamental in how CPU hotplug works; most of the existing notifier callbacks are still there, just invoked in a different way. The purpose of the exercise, Thomas said, was:

It's about making the ordering constraints clear. It's about documenting the existing horror in a way, that one can understand the hotplug process w/o hallucinogenic drugs.

Once some high-level order has been brought to the CPU hotplug mechanism, one can think about trying to clean things up. The eventual goal is to have a much smaller set of externally visible states; for drivers and filesystems, there will only be "prepare" and "enable" states available, with no ordering between subsystems. Also, notably, drivers and filesystems will not be allowed to cause a hotplug operation (in either direction) to fail. When the process is complete, the hotplug subsystem should be much more predictable, with a lot more of the details hidden from the rest of the kernel.
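
What that reduced interface might look like is not specified in this patch set; purely as an illustration, with names invented here, it could be as simple as two void callbacks, the void returns reflecting the rule that drivers cannot veto a hotplug operation:

    /* Hypothetical: the only hooks a driver or filesystem would get. */
    struct cpuhp_client {
	    void (*prepare)(unsigned int cpu);	/* before the transition */
	    void (*enable)(unsigned int cpu);	/* once the CPU is usable */
    };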

That is all work for a future series, though; the first step is to get the infrastructure set up. Chances are that will require at least one more iteration of Thomas's "Episode 1" patch set, meaning that it is unlikely to be 3.9 material. Starting around 3.10, though, we may well see significant changes to how CPU hotplugging is handled; the result should be more comprehensible and reliable code.

Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license