
When real validation begins

By Jonathan Corbet
January 21, 2015

No computer-oriented conference is complete without a good war-story presentation or two. Paul McKenney's LCA 2015 talk on the implications of enabling full dynamic tick support for all users fit the bill nicely. The result was an overview of what can happen when your code is unexpectedly enabled on vast numbers of machines — and some lessons for how to avoid disasters in the future.

Some history

Paul started by noting that, in the 1990s, there was little concern about CPU energy efficiency. In fact, in those days, an idle CPU tended to consume more power than one that was doing useful work. That's because an idle processor would sit in a tight loop waiting for something to do; there were no cache misses, so the CPU ran without a break. Delivering regular clock interrupts to an idle processor thus increased its energy efficiency; it was still doing nothing useful, but it would burn less power in the process.

Things have changed since then, he continued. CPUs are designed to be powered off when they have nothing to do, so clock interrupts to an idle CPU are bad news. But until the early 2000s, that's exactly what was happening on Linux systems. One of the changes merged for the 2.6.21 release in 2007 was partial dynamic tick support, which removed those idle clock interrupts.

That was a good step forward, but was not a full solution; delivering regular scheduling interrupts to a busy CPU can also be a problem. Realtime developers don't like clock ticks because they can introduce unwanted latency. High-performance computing users also complain; they are trying to get the most out of their CPUs, and any work that is not directed toward their problems is just overhead. Beyond that, high-performance computing workloads often communicate results between processors; delaying work on one processor can cause others to wait, creating cascading delays. The scheduling interrupt is often necessary, but, in a high-performance environment, there will only be one process running on a given CPU and no other work to do, so those interrupts can only slow things down.

Full dynamic tick support was first prototyped by Josh Triplett in 2009; it resulted in a 3% performance gain for CPU-intensive workloads. For people determined to get maximal performance from their systems, 3% is a big deal. But this patch, which was mostly a proof of concept, had some problems. Without the scheduling interrupt, a single task could monopolize the CPU and starve others. There was no process accounting, and read-copy-update (RCU) grace periods could go on forever, with the result that the system could run out of memory. So Frederic Weisbecker's decision to work on a production-ready version of the patch was welcome.

That code was merged for the 3.10 kernel. It works well, in that there will be no scheduler interrupt while only one task is running on the CPU. There is a residual once-per-second interrupt that, Paul said, serves as a sort of security blanket to make sure nothing slips through the cracks. It can be disabled, but that is not recommended at this time.

Paul did some of the work to ensure that RCU worked properly in a full dynamic tick environment. He had thought of full dynamic tick as a specialty feature that would only be enabled by users building their own kernels. So he was pleasantly surprised to hear that the feature had been enabled for all users in the RHEL7 kernel. But, he said ruefully, you would think he would know better after his many years of experience in this industry. Turning on the feature in a major distribution means that everybody is using it. That, he said, is when the real validation begins — validation by users running workloads that he had not thought to test his patches against.

The fun begins

He soon got an email from Rik van Riel asking why the rcu_sched process was taking 40% of the CPU. This was happening on a workload with lots of context switches — a completely different environment from the one the dynamic tick feature was designed for. Paul's first thought was that grace periods might be completing too quickly in the presence of all those context switches, increasing the amount of grace-period processing that needed to be done. He tried artificially slowing down grace-period completion, but that did not help. Thus, he said, he was forced to actually analyze what was going on.

The real problem had to do with the RCU callback offloading mechanism, which moves RCU cleanup work off the CPUs that are being used in the dynamic-tick mode. This cleanup work is done in dedicated kernel threads that can be run on whichever CPU makes the most sense. It's a useful feature for high-performance workloads, but it isn't all that useful for everybody else; indeed, it appeared to be causing problems for other workloads. To address that problem, Paul put in a patch to only enable callback offloading if the nohz_full boot parameter (requesting full dynamic tick behavior) is set.
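As a minimal sketch (not the actual kernel implementation), the decision might look something like the following, where rcu_nocb_mask is an illustrative cpumask invented for this example, while tick_nohz_full_enabled() and tick_nohz_full_cpu() are the kernel's tests for the nohz_full boot parameter:

    /*
     * Illustrative sketch only, not the real RCU code: offload callbacks
     * only for CPUs that the user has placed in the nohz_full set on the
     * kernel command line, and offload nothing otherwise.
     */
    #include <linux/init.h>
    #include <linux/cpumask.h>
    #include <linux/tick.h>

    /* Illustrative cpumask naming the CPUs whose callbacks are offloaded. */
    static struct cpumask rcu_nocb_mask;

    static void __init rcu_choose_offloaded_cpus(void)
    {
            int cpu;

            cpumask_clear(&rcu_nocb_mask);
            if (!tick_nohz_full_enabled())  /* no nohz_full= boot parameter */
                    return;                 /* so offload nothing at all */

            for_each_possible_cpu(cpu)
                    if (tick_nohz_full_cpu(cpu))
                            cpumask_set_cpu(cpu, &rcu_nocb_mask);
    }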

According to Paul, industry experience shows that one out of six fixes introduces a new bug of its own. This was, he said, one of those fixes. It turns out that RCU is used earlier in the boot process than he had thought, and the switch to the offloaded mode would cause early callbacks from the offloaded CPUs to be lost. The result would certainly be leaked memory, but it could also be a full system hang if processes were waiting for a specific callback to complete. So another fix went in to make the decision about which CPUs to offload earlier in the boot process.

So "now that the bleeding was stopped," he said, it was time to fix the real bug. After all, 40% CPU usage on an 80-CPU system is a bit excessive, and the problem would get worse as the number of CPUs increases. By the time the CPU count got up to 4000 or so, the system simply would not be able to keep up with the load. Since he already gets complaints about RCU performance on 4096-CPU machines, this was a real problem in need of a solution.

It turned out that a big part of the overhead was the simple process of waking up all of the offload threads at the beginning and end of grace periods. So he decided to hide the problem a bit; rather than wake all threads from the central RCU scheduling thread, he organized them into a tree and made a subset of threads responsible for waking the rest. The idea was to spread the load around the system a bit, but it also happened to reduce the total number of wakeups, since it turned out that only the first level of threads needed to be woken at the start of each grace period.
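A bit of arithmetic shows why this helps. The sketch below is a purely illustrative userspace program (the group size is invented for illustration, not the value RCU actually uses, and it assumes one offload thread per CPU for simplicity) comparing how many wakeups the central grace-period thread must issue under the flat and tree schemes:

    /*
     * Purely illustrative userspace sketch, not kernel code.  With the
     * offload threads organized into groups, the grace-period kthread
     * wakes only the group leaders; each leader then wakes the other
     * members of its own group.
     */
    #include <stdio.h>

    #define NCPUS      80   /* the 80-CPU system from the article */
    #define GROUP_SIZE 16   /* hypothetical fanout, for illustration only */

    int main(void)
    {
            int leaders = (NCPUS + GROUP_SIZE - 1) / GROUP_SIZE;

            printf("flat scheme: central thread wakes all %d offload threads\n",
                   NCPUS);
            printf("tree scheme: central thread wakes %d leaders;\n", leaders);
            printf("             each leader wakes at most %d followers\n",
                   GROUP_SIZE - 1);
            return 0;
    }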

One of six fixes may introduce a new bug, but in this case, Paul admitted, it was two out of six. Some callbacks that were posted early in the life of the system were not being executed, leading to occasional system hangs. Yet another fix ensured that they got run, and everything was well again.

At least, all was well until somebody looked at their system and wondered why there were hundreds of callback-offload threads on a machine with a handful of CPUs. It turns out that some systems have firmware that lies about the number of installed CPUs, and "RCU was stupid enough to believe it." Changing the callback-offload code to defer starting the offload threads until the relevant CPU actually comes online dealt with that one.

At this point, the callback-offload code passed all of its tests. At least, it passed them all if loadable kernel modules were not enabled — the situation on Paul's machines. The problem was that a module could post callbacks that would still be outstanding when the module was removed. That would lead to the kernel jumping into code that was no longer present — an "embarrassing failure" that can lead to calls (of the telephone variety) back to the relevant kernel developers. The solution was to wait for all existing callbacks to be invoked before completing the removal of the module; that wait is done by posting a special callback on each CPU in the system and waiting for them all to report completion.
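That description corresponds to what rcu_barrier() provides; a module that posts callbacks with call_rcu() is expected to invoke it on the way out so that no callback can outlive the module's code. A simplified, hypothetical example module (the foo structure and function names are invented for illustration) might look like this:

    /*
     * Simplified, hypothetical example module.  Any module that frees
     * its data structures via call_rcu() must wait for those callbacks
     * to run before its code is unloaded; rcu_barrier() in the exit
     * function is what provides that guarantee.
     */
    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/rcupdate.h>

    struct foo {
            struct rcu_head rcu;
            int data;
    };

    static struct foo __rcu *global_foo;

    static void foo_reclaim(struct rcu_head *head)
    {
            kfree(container_of(head, struct foo, rcu));
    }

    static int __init foo_init(void)
    {
            struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

            if (!p)
                    return -ENOMEM;
            p->data = 42;
            rcu_assign_pointer(global_foo, p);
            return 0;
    }

    static void __exit foo_exit(void)
    {
            struct foo *p = rcu_dereference_protected(global_foo, 1);

            RCU_INIT_POINTER(global_foo, NULL);
            if (p)
                    call_rcu(&p->rcu, foo_reclaim);
            /* Wait for every callback this module has posted to finish. */
            rcu_barrier();
    }

    module_init(foo_init);
    module_exit(foo_exit);
    MODULE_LICENSE("GPL");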

As mentioned above, the code had been fixed to run callback-offload threads only for CPUs that actually come online. But the module-removal fix posted callbacks on all CPUs, including those that were offline. Since an offline CPU has no offload thread, those callbacks would wait forever. So yet another fix ensured that callbacks are never posted to CPUs that have never been online.

Lessons learned

At this point, as far as anybody knows, things have stabilized and no remaining bugs lurk to attack innocent users. There are, Paul said, a number of lessons that one can learn from his experience. The first of these is to limit the scope of all changes to avoid putting innocent bystanders at risk. Turning on full dynamic tick behavior for all users went against that lesson with unfortunate consequences. We should also recognize that the Linux kernel serves a wide variety of workloads; it will never be possible to test them all.

Fixes can — and will — generate more bugs. Fixes for minor bugs require extra caution before they are applied; since they address a problem seen by only a small subset of users, they have a high probability of creating unforeseen problems for the much larger group of users who never saw the original bug. And, Paul said, it is not enough to simply check one's assumptions; one may have built "towers of logic" upon those assumptions and formed habits of thought that are hard to break out of. In this case, the assumption that all users of the dynamic tick code would be building their own kernels led to some unfortunate consequences. And finally, he said, people probably trust him too much.

[Your editor would like to thank linux.conf.au for funding his travel to the event.]

Index entries for this article
Kernel: Read-copy-update
Conference: linux.conf.au/2015



When real validation begins

Posted Jan 22, 2015 3:19 UTC (Thu) by JdGordy (subscriber, #70103) [Link] (18 responses)

"At this point, as far as anybody knows, things have stabilized and no remaining bugs lurk to attack innocent users. There are, Paul said, a number of lessons that one can learn from his experience. The first of these is to limit the scope of all changes to avoid putting innocent bystanders at risk. Turning on full dynamic tick behavior for all users went against that lesson with unfortunate consequences. We should also recognize that the Linux kernel serves a wide variety of workloads; it will never be possible to test them all."

This seems contradictory. Either you assume your code is too buggy for anyone to use except yourself, or you assume it's safe for everyone. In the first case no one will ever use it, so what's the point? And in the second case you fix bugs which need fixing and which you wouldn't have found without that extra help testing.

That is one way to look at it, but...

Posted Jan 22, 2015 10:03 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (17 responses)

There is a rather wide continuum of reasonable assumptions lying between your "assume your code is too buggy for anyone" and your "assume it is safe for everyone". One such reasonable assumption is that your code is safe enough for people who really badly need it, but not safe enough for people who do not need it quite so badly. This assumption might lead you to enable your code for people who really badly need it and disable it for the rest. Then, over time, as the people who really badly need the code find bugs in it and these bugs are fixed, it might (or might not) make sense to enable your code for additional classes of users, for example, those who have only moderate need of it.

It all depends on what fraction of the users need the new code, how much risk it poses to users not needing the new code, how aggressive your user base is, and how much effort has been put into validating the new code. In this case, a rather small fraction of the users needed the new code, there was moderate risk to other users, many of the other users were anything but aggressive, and my validation had (intentionally) not covered these other users' workloads. Not always an easy decision!

That is one way to look at it, but...

Posted Jan 22, 2015 14:22 UTC (Thu) by error27 (subscriber, #8346) [Link] (16 responses)

We really don't have a process for gradual rollouts. I think people barely test anything that's not in a distro kernel.

That is one way to look at it, but...

Posted Jan 22, 2015 15:05 UTC (Thu) by Limdi (guest, #100500) [Link] (8 responses)

How would a gradual rollout work?

# apt-kernel variant /1/
apt-kernel install <labels> <version>

vendor=debian.org
variant=stable,testing,pristine-upstream,barebone
name=linux

apt-kernel install variant=stable 3.19.1
apt-kernel install variant=testing 3.20-RC1
apt-kernel install vendor=debian.org,variant=pristine-upstream 3.20-RC2
apt-kernel install vendor=debian.org,variant=lightpatched 3.20-RC2
apt-kernel install vendor=debian.org,variant=barebone 3.20-RC2

# apt-kernel variant /2/
apt-kernel install <labels> <kernel-name>

vendor = debian.org(default) -> tells where to get it from
rc(release-candidate) = [1-10]
version = 3.19.1|latest
author = linus|linuxteam
variant=stable(default),testing,unstable,experimental

apt-kernel install vendor=debian.org,author=linuxteam,version=latest linux
apt-kernel install vendor=lkml.org,author=linus,version=latest linux

# apt-kernel-modules install <labels> <which-module>
One could expect the current kernel to be already installed.

version => for which kernel to install
rc => release-candidate
vendor=debian.org(default)

apt-kernel-modules install version=3.20,rc=4 overlayfs

## Install variant based on the existence of a feature.
If the module with this feature rises from experimental to unstable/testing, those versions could be used.

apt-kernel-modules install features=multiple-ro overlayfs

Then one could choose, either via dpkg-reconfigure or via some config file:
default-variant=experimental,unstable,testing,stable

And update the installed kernels/kernel modules maybe via
apt-kernel update
apt-kernel upgrade *

Just an idea written down. Something like that would maybe make it easier to use experimental kernels and kernel modules.

What do you think of it?

That is one way to look at it, but...

Posted Jan 22, 2015 19:20 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

I have to defer to others who understand packaging and installation much better than do I.

That is one way to look at it, but...

Posted Jan 23, 2015 8:03 UTC (Fri) by error27 (subscriber, #8346) [Link] (6 responses)

Kernel modules are already modular and you don't have to turn them on if you don't use them. The problem with the other options is that you have to recompile everything. So if you have options A and B, you have to compile four kernels: "disabled", "A", "B", and "A + B". If you have three boolean options you need eight kernels, and so on; it's 2^(number of options).

Your idea is basically to make it easier for people to get exactly the kernel they want. I think at that point people need to compile their own kernels. The problem is that "make menuconfig/xconfig" is total garbage, so configuring your kernel is crazy difficult.

I can never find anything. Back in the day, my problem was that I couldn't figure out how to enable the Broadcom wireless drivers. I could see five drivers and I wasn't positive which one supported my hardware. And there was some overlap, because we had the reverse-engineered drivers and the ones that Broadcom wrote later. In the end, the driver I wanted was "invisible" because I didn't have the BCMA bus enabled. In those days BCMA wasn't shown under wireless; it was in a completely different menu.

Last week I wanted to enable lustre to compile-test a file. I couldn't find it because it was invisible. It depends on BROKEN. I still can't find how to enable BROKEN. (It might be deliberate, so people are forced to edit Kconfig files to enable broken stuff).

There is a search feature in menuconfig which tells you the location, but there isn't a "search in page" and quite a few of these lists of drivers are three pages long. If they were in alphabetical order that might help.

Very few people manually configure kernels. Kernel developers like me just generate their configs using custom scripts. Menuconfig doesn't really have a maintainer, even though it's such an important tool.

Automation!!! It is the only way...

Posted Jan 23, 2015 9:57 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Indeed, all your points are valid reasons why most people don't bother building their own kernels.

And that is why I made the RCU callback offloading code automatically determine at boot time whether NO_HZ_FULL needs it or not. With that in place, most NO_HZ_FULL users simply don't need to worry about RCU callback offloading. It enables itself when they need it and only on the CPUs that they need it on, and stays out of the way otherwise.

Or at least that is the theory. We will soon see how it plays out in practice.

But regardless of whether or not I have additional bugs in this code (and Murphy of course says that I do), where reasonable we do need to try to automate configuration. And much else besides. ;-)

That is one way to look at it, but...

Posted Jan 23, 2015 17:37 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (4 responses)

> There is a search feature in menuconfig which tells you the location, but there isn't a "search in page" and quite a few of these lists of drivers are three pages long. If they were in alphabetical order that might help.

nconfig is the new menuconfig; it has both global search for symbols and incremental search for labels within the current menu.

First I had heard of "make nconfig"

Posted Jan 23, 2015 21:59 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Very nice, thank you!

That is one way to look at it, but...

Posted Jan 29, 2015 14:44 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

Doesn't seem terribly functional to me. F1 to F4 (allegedly Help, SymInfo, Help 2 and ShowAll) serve to go back or quit; F9 (allegedly Quit) does nothing.

I bet it's my terminal (old Konsole, TERM=xterm-color) or something. I guess I'll have to do some debugging...

That is one way to look at it, but...

Posted Jan 29, 2015 15:50 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

There is certainly room for improvement, as always. For example, search works nicely, but it would be even better to be able to directly change the values of the things found by the search, as can be done with xconfig. Still, nconfig is much faster than xconfig, which is welcome.

F9 works for me. I must confess that I didn't try most of the others.

That is one way to look at it, but...

Posted Jan 29, 2015 22:15 UTC (Thu) by vbabka (subscriber, #91706) [Link]

With make menuconfig, search results have prefixes like (1), (2), etc. Press the corresponding number key and it jumps directly to the option.

That is one way to look at it, but...

Posted Jan 22, 2015 19:19 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

To your point, I don't know of a general approach to gradually roll out new functionality. There have been several attempts in the past (such as CONFIG_EXPERIMENTAL), but these were often worked around. However, that doesn't mean that we cannot come up with specific approaches as needed for specific situations.

For example, in the case covered by this LWN article, the (eventual!) gradual rollout approach was to disable the relevant portions of the new functionality unless the user explicitly passed in a particular boot-time kernel parameter. Over time, we might be less restrictive about exposing new functionality, perhaps enabling it for additional use cases as they arise.

That is one way to look at it, but...

Posted Jan 22, 2015 22:44 UTC (Thu) by error27 (subscriber, #8346) [Link] (5 responses)

We could introduce a new config option to turn them on automatically.

config MY_OPTION
        bool
        depends on EXPERIMENTAL || VERSION > "3.21"

Right now the Kconfig parser only understands = and != so we'd have to update it to understand '>'. I'm brainstorming here so there are no bad ideas by the way, in case you were wondering. ;)

That is one way to look at it, but...

Posted Jan 23, 2015 0:31 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (4 responses)

That might work in some cases, but in the case in this LWN article, the determining factor wasn't the kernel version, but rather the type of workload. People with certain types of HPC or realtime workloads will specify the nohz_full= kernel boot parameter, so I can key off of that to enable RCU callback offloading.

There was a long email thread some years back on how to keep normal users from using new experimental functionality. Dave Jones suggested making the experimental feature splat on boot as the most reliable way to keep most distros from turning it on by default. ;-)

That is one way to look at it, but...

Posted Jan 29, 2015 14:56 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

No no. Make it detect when it's being run by a distro QA team, and *crash* on boot. Anything else will get overlooked. :)

That is one way to look at it, but...

Posted Jan 29, 2015 15:52 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

You do seem to have fully internalized the old saying "Murphy was an optimist"! ;-)

That is one way to look at it, but...

Posted Feb 4, 2015 15:06 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

I was being snarky. Excessively so: the QA people I work with (for Oracle Linux) are better than any other QA people I have ever worked with in any previous job. I can rely on them to find bugs! I can rely on them to proactively think of evil ways to break stuff to find more bugs! This is, to me, amazing: coworkers better at breaking my own stuff than I am, and I don't need to argue for weeks to convince them to do it, either.

(I'm sure all Linux vendors have similarly good QA teams -- I'm just only familiar with the one, and perhaps have had my expectations unduly lowered by awful QA teams in other jobs. I'm sure in e.g. aerospace the QA is even more effective.)

That is one way to look at it, but...

Posted Feb 4, 2015 15:43 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

"If it is not broken, fix your tests!!!" ;-)

When real validation begins

Posted Jan 23, 2015 18:02 UTC (Fri) by josh (subscriber, #17465) [Link] (3 responses)

> Full dynamic tick support was first prototyped by Josh Triplett in 2009; it resulted in a 3% performance gain for CPU-intensive workloads.

The major goal of that prototype wasn't the 3% performance gain; it was the consistency and real-time response. Without processes being interrupted, they consistently executed the same amount of work every time. And avoiding interruptions puts a better bound on latency.

Right you are!

Posted Jan 24, 2015 21:30 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

Indeed, I was definitely thinking in terms of real-time use cases early on, and others suggested HPC.

However, as far as I know, we never did get a good measurement of the real-time benefits, and many of the most enthusiastic early adopters that I know of are more in the HPC arena than in the real-time arena. Though there are some that are to some extent in both arenas, and there might well be people using NO_HZ_FULL to implement the CPU-bound polling-loop style of real-time applications.

Or have you seen some measures of the real-time benefits or come across people using NO_HZ_FULL for real-time workloads?

If not, this would certainly not be the first project that I have been involved with whose major value turned out to be its unintended consequences. :-)

Right you are!

Posted Jan 25, 2015 6:56 UTC (Sun) by josh (subscriber, #17465) [Link] (1 responses)

The original graph I presented at that Plumbers BOF showed a consistent number of operations done per unit time. That doesn't just mean consistency for HPC; it also means the task didn't get interrupted, at all. Similarly, with the appropriate kernel options turned on, a single userspace process running with no contention on its runqueue ought to get 100% of the CPU with no interruptions.

Right you are!

Posted Jan 27, 2015 16:36 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

If I understand you correctly, I agree: Although we don't have a direct measurement of improved real-time response, we do have a lot of indirect measurements that should lead one to believe that real-time response would be improved. On the other hand, Carsten Emde's measurements of an early version of Frederic's patch showed no improvement, possibly because the 1 Hz residual tick was in effect. So I take a conservative approach and focus on HPC workloads and on polling-loop-style real-time workloads, for the moment, anyway.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds