When real validation begins
Some history
Paul started by noting that, in the 1990s, there was little concern about CPU energy efficiency. In fact, in those days, an idle CPU tended to consume more power than one that was doing useful work. That's because an idle processor would sit in a tight loop waiting for something to do; there were no cache misses, so the CPU ran without a break. Delivering regular clock interrupts to an idle processor thus increased its energy efficiency; it was still doing nothing useful, but it would burn less power in the process.
Things have changed since then, he continued. CPUs are now designed to be powered down when they have nothing to do, so clock interrupts to an idle CPU are bad news. But, until 2007, that is exactly what was happening on Linux systems; one of the changes merged for the 2.6.21 release that year was partial dynamic tick support, which removed those idle clock interrupts.
That was a good step forward, but it was not a full solution; delivering regular scheduling interrupts to a busy CPU can also be a problem. Realtime developers don't like clock ticks because they can introduce unwanted latency. High-performance computing users also complain; they are trying to get the most out of their CPUs, and any work that is not directed toward their problems is just overhead. Beyond that, high-performance computing workloads often communicate results between processors; delaying work on one processor can cause others to wait, creating cascading delays. The scheduling interrupt is often necessary, but, in a high-performance environment, there will often be only one process running on a given CPU and no other work to do, so those interrupts can only slow things down.
Full dynamic tick support was first prototyped by Josh Triplett in 2009; it resulted in a 3% performance gain for CPU-intensive workloads. For people determined to get maximal performance from their systems, 3% is a big deal. But this patch, which was mostly a proof of concept, had some problems. Without the scheduling interrupt, a single task could monopolize the CPU and starve others. There was no process accounting, and read-copy-update (RCU) grace periods could go on forever, with the result that the system could run out of memory. So Frederic Weisbecker's decision to work on a production-ready version of the patch was welcome.
That code was merged for the 3.10 kernel. It works well, in that there will be no scheduler interrupt while only one task is running on the CPU. There is a residual once-per-second interrupt that, Paul said, serves as a sort of security blanket to make sure nothing slips through the cracks. It can be disabled, but that is not recommended at this time.
Paul did some of the work to ensure that RCU worked properly in a full dynamic tick environment. He had thought of full dynamic tick as a specialty feature that would only be enabled by users building their own kernels, so he was pleasantly surprised to hear that the feature had been enabled for all users in the RHEL7 kernel. But, he said ruefully, you would think he would know better after his many years of experience in this industry. Turning on the feature in a major distribution means that everybody is using it. That, he said, is when the real validation begins: validation by users running workloads that he had not thought to test his patches against.
The fun begins
He soon got an email from Rik van Riel asking why the rcu_sched process was taking 40% of the CPU. This was happening on a workload with lots of context switches, a completely different environment than the one the dynamic tick feature was designed for. Paul's first thought was that grace periods might be completing too quickly in the presence of so many context switches, increasing the amount of grace-period processing that needed to be done. He tried artificially slowing down grace-period completion, but that did not help. Thus, he said, he was forced to actually analyze what was going on.
The real problem had to do with the RCU callback offloading mechanism, which moves RCU cleanup work off the CPUs that are being used in the dynamic-tick mode. This cleanup work is done in dedicated kernel threads that can be run on whichever CPU makes the most sense. It's a useful feature for high-performance workloads, but it isn't all that useful for everybody else; indeed, it appeared to be causing problems for other workloads. To address that problem, Paul put in a patch to only enable callback offloading if the nohz_full boot parameter (requesting full dynamic tick behavior) is set.
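In practice, that means the offloading machinery stays out of the way unless it is asked for on the kernel command line. As a purely illustrative example (the CPU range is made up, though nohz_full= and rcu_nocbs= are the actual boot parameters), a 16-CPU machine might reserve CPU 0 for housekeeping and run its latency-sensitive work on the rest with something like:

    # Appended to the kernel command line in the boot loader configuration.
    # rcu_nocbs= can also be given explicitly to offload RCU callbacks from
    # the same set of CPUs.
    nohz_full=1-15 rcu_nocbs=1-15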
According to Paul, industry experience shows that one out of six fixes introduces a new bug of its own. This was, he said, one of those fixes. It turns out that RCU is used earlier in the boot process than he had thought, and the switch to the offloaded mode would cause early callbacks from the offloaded CPUs to be lost. The result would certainly be leaked memory, but it could also be a full system hang if a process were waiting for a specific callback to complete. So another fix went in to make the decision on which CPUs to offload earlier in the boot process.
So "now that the bleeding was stopped," he said, it was time to fix the real bug. After all, 40% CPU usage on an 80-CPU system is a bit excessive, and the problem would get worse as the number of CPUs increases. By the time the CPU count got up to 4000 or so, the system simply would not be able to keep up with the load. Since he already gets complaints about RCU performance on 4096-CPU machines, this was a real problem in need of a solution.
It turned out that a big part of the overhead was the simple process of waking up all of the offload threads at the beginning and end of grace periods. So he decided to hide the problem a bit; rather than wake all threads from the central RCU scheduling thread, he organized them into a tree and made a subset of threads responsible for waking the rest. The idea was to spread the load around the system a bit, but it also happened to reduce the total number of wakeups since it turned out to only be necessary to wake the first level of threads at the beginning of the grace period.
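The real RCU code is considerably more involved, but the shape of the idea can be sketched in a few lines of C; all of the identifiers below are invented for illustration and are not the kernel's own:

    /*
     * Illustrative sketch only: instead of the grace-period thread waking
     * every callback-offload thread directly, it wakes a small set of
     * "leader" threads, and each leader then wakes its own followers.
     */
    #define FOLLOWERS_PER_LEADER 8

    struct offload_thread {
        struct offload_thread *followers[FOLLOWERS_PER_LEADER];
        int nr_followers;
    };

    /* Stand-in for the kernel's wake_up_process(). */
    static void wake_offload_thread(struct offload_thread *t)
    {
        /* ... wake the kthread that drains this CPU's callbacks ... */
    }

    /* Run by the central grace-period thread: only the leaders are woken
     * here, so the number of wakeups done centrally stays small. */
    static void wake_leaders(struct offload_thread **leaders, int nr_leaders)
    {
        for (int i = 0; i < nr_leaders; i++)
            wake_offload_thread(leaders[i]);
    }

    /* Each leader, once running, fans the wakeup out to its followers. */
    static void leader_wake_followers(struct offload_thread *self)
    {
        for (int i = 0; i < self->nr_followers; i++)
            wake_offload_thread(self->followers[i]);
    }

The central thread's work thus drops from one wakeup per offload thread to one per leader, with the remaining wakeups spread across the leaders rather than concentrated in a single place.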
One of six fixes may introduce a new bug, but in this case, Paul admitted, it was two out of six. Some callbacks that were posted early in the life of the system were not being executed, leading to occasional system hangs. Yet another fix ensured that they got run, and everything was well again.
At least, all was well until somebody looked at their system and wondered why there were hundreds of callback-offload threads on a machine with a handful of CPUs. It turns out that some systems have firmware that lies about the number of installed CPUs, and "RCU was stupid enough to believe it." Changing the callback-offload code to defer starting the offload threads until the relevant CPU actually comes online dealt with that one.
At this point, the callback-offload code passed all of its tests. At least, it passed them all if loadable kernel modules were not enabled — the situation on Paul's machines. The problem was that a module could post callbacks that would still be outstanding when the module was removed. That would lead to the kernel jumping into code that was no longer present — an "embarrassing failure" that can lead to calls (of the telephone variety) back to the relevant kernel developers instead. The solution was to wait for all existing callbacks to be invoked before completing the removal of the module; that wait is done by posting a special callback on each CPU in the system and waiting for them all to report completion.
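That wait is available to modules as the kernel's rcu_barrier() function; a module that posts callbacks is expected to call it on the way out, roughly as in this simplified sketch (the my_* names are invented for illustration):

    #include <linux/module.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct my_item {
        struct rcu_head rcu;
        int value;
    };

    /* Invoked after a grace period; frees an item that was retired
     * elsewhere in the module with call_rcu(&item->rcu, my_free_item). */
    static void my_free_item(struct rcu_head *head)
    {
        kfree(container_of(head, struct my_item, rcu));
    }

    static int __init my_module_init(void)
    {
        return 0;
    }
    module_init(my_module_init);

    static void __exit my_module_exit(void)
    {
        /* Stop posting new callbacks first, then wait for every callback
         * already posted to be invoked before the module text goes away. */
        rcu_barrier();
    }
    module_exit(my_module_exit);

    MODULE_LICENSE("GPL");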
As mentioned above, the code had been fixed to only run callback-offload threads for online CPUs, but the module-removal fix just described posts callbacks on all CPUs, including those that are currently offline. Since an offline CPU has no offload thread, those callbacks would wait forever. So yet another fix ensured that callbacks would not be posted to CPUs that had never been online.
Lessons learned
At this point, as far as anybody knows, things have stabilized and no remaining bugs lurk to attack innocent users. There are, Paul said, a number of lessons that one can learn from his experience. The first of these is to limit the scope of all changes to avoid putting innocent bystanders at risk. Turning on full dynamic tick behavior for all users went against that lesson with unfortunate consequences. We should also recognize that the Linux kernel serves a wide variety of workloads; it will never be possible to test them all.
Fixes can — and will — generate more bugs. Fixes for minor bugs require more caution before they are applied; since they address a problem seen by only a small subset of users, they have a high probability of creating unforeseen problems for the larger majority. And, Paul said, it is not enough to simply check one's assumptions; one may have built "towers of logic" upon those assumptions and formed habits of thought that are hard to break out of. In this case, the assumption that all users of the dynamic tick code would be building their own kernels led to some unfortunate consequences. And finally, he said, people probably trust him too much.
[Your editor would like to thank linux.conf.au for funding his travel to the event.]
Index entries for this article
Kernel: Read-copy-update
Conference: linux.conf.au/2015
Posted Jan 22, 2015 3:19 UTC (Thu)
by JdGordy (subscriber, #70103)
[Link] (18 responses)
This seems contradictory. Either you assume your code is too buggy for anyone to use except yourself, or you assume it's safe for everyone. In the first case no one will ever use it, so what's the point? And in the second case you fix bugs which need fixing and which you wouldn't have found without that extra help testing.
Posted Jan 22, 2015 10:03 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (17 responses)
It all depends on what fraction of the users need the new code, how much risk it poses to users not needing the new code, how aggressive your user base is, and how much effort has been put into validating the new code. In this case, a rather small fraction of the users needed the new code, there was moderate risk to other users, many of the other users were anything but aggressive, and my validation had (intentionally) not covered these other users' workloads. Not always an easy decision!
Posted Jan 22, 2015 14:22 UTC (Thu)
by error27 (subscriber, #8346)
[Link] (16 responses)
Posted Jan 22, 2015 15:05 UTC (Thu)
by Limdi (guest, #100500)
[Link] (8 responses)
# apt-kernel variant /1/
apt-kernel install <labels> <version>
vendor=debian.org
variant=stable,testing,pristine-upstream,barebone
name=linux

apt-kernel install variant=stable 3.19.1
apt-kernel install variant=testing 3.20-RC1
apt-kernel install vendor=debian.org,variant=pristine-upstream 3.20-RC2
apt-kernel install vendor=debian.org,variant=lightpatched 3.20-RC2
apt-kernel install vendor=debian.org,variant=barebone 3.20-RC2

# apt-kernel variant /2/
apt-kernel install <labels> <kernel-name>
vendor = debian.org(default) -> tells where to get it from
author = linus|linuxteam
version = 3.19.1|latest
rc(release-candidate) = [1-10]
variant=stable(default),testing,unstable,experimental

apt-kernel install vendor=debian.org,author=linuxteam,version=latest linux
apt-kernel install vendor=lkml.org,author=linus,version=latest linux

# apt-kernel-modules install <labels> <which-module>
One could expect the current kernel to be already installed.
vendor=debian.org(default)
version => for which kernel to install
rc => release-candidate

apt-kernel-modules install version=3.20,rc=4 overlayfs

## Install variant based on the existence of a feature.
apt-kernel-modules install features=multiple-ro overlayfs
In case the module with this feature rises from experimental to unstable/testing, these versions could be used.

Then one could choose the default either via dpkg-reconfigure or via some config:
default-variant=experimental,unstable,testing,stable

And update the installed kernels/kernel modules maybe via:
apt-kernel update
apt-kernel upgrade *

Just an idea written down. Something like that would maybe make it easier to use experimental kernels and kernel modules.

What do you think of it?
Posted Jan 22, 2015 19:20 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Jan 23, 2015 8:03 UTC (Fri)
by error27 (subscriber, #8346)
[Link] (6 responses)
Your idea is basically to make it easier for people to get exactly the kernel they want. I think at that point people need to compile their own kernels. The problem is that "make menuconfig/xconfig" is total garbage, so configuring your kernel is crazy difficult.
I can never find anything. Back in the day, my problem was that I couldn't figure out how to enable Broadcom wireless drivers. I could see five drivers and I wasn't positive which one supported my hardware. There was some overlap, because we had the reverse-engineered drivers and the ones that Broadcom wrote later. In the end, the driver I wanted was "invisible" because I didn't have the BCMA bus enabled. In those days BCMA wasn't shown under wireless; it was in a completely different menu.
Last week I wanted to enable lustre to compile-test a file. I couldn't find it because it was invisible. It depends on BROKEN. I still can't find how to enable BROKEN. (It might be deliberate, so people are forced to edit Kconfig files to enable broken stuff.)
There is a search feature in menuconfig which tells you the location, but there isn't a "search in page" and quite a few of these lists of drivers are three pages long. If they were in alphabetical order that might help.
Very few people manually configure kernels. Kernel developers like me just generate their configs using custom scripts. Menuconfig doesn't really have a maintainer, but it's such an important tool.
Posted Jan 23, 2015 9:57 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
And that is why I made the RCU callback offloading code automatically determine at boot time whether NO_HZ_FULL needs it or not. With that in place, most NO_HZ_FULL users simply don't need to worry about RCU callback offloading. It enables itself when they need it and only on the CPUs that they need it on, and stays out of the way otherwise.
Or at least that is the theory. We will soon see how it plays out in practice.
But regardless of whether or not I have additional bugs in this code (and Murphy of course says that I do), where reasonable we do need to try to automate configuration. And much else besides. ;-)
Posted Jan 23, 2015 17:37 UTC (Fri)
by BenHutchings (subscriber, #37955)
[Link] (4 responses)
nconfig is the new menuconfig; it has both global search for symbols and incremental search for labels within the current menu.
Posted Jan 23, 2015 21:59 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Jan 29, 2015 14:44 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
I bet it's my terminal (old Konsole, TERM=xterm-color) or something. I guess I'll have to do some debugging...
Posted Jan 29, 2015 15:50 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
F9 works for me. I must confess that I didn't try most of the others.
Posted Jan 29, 2015 22:15 UTC (Thu)
by vbabka (subscriber, #91706)
[Link]
Posted Jan 22, 2015 19:19 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (6 responses)
For example, in the case covered by this LWN article, the (eventual!) gradual rollout approach was to disable the relevant portions of the new functionality unless the user explicitly passed in a particular boot-time kernel parameter. Over time, we might be less restrictive about exposing new functionality, perhaps enabling it for additional use cases as they arise.
Posted Jan 22, 2015 22:44 UTC (Thu)
by error27 (subscriber, #8346)
[Link] (5 responses)
config MY_OPTION
	bool
	depends on EXPERIMENTAL || VERSION > "3.21"

Right now the Kconfig parser only understands = and != so we'd have to update it to understand '>'. I'm brainstorming here so there are no bad ideas by the way, in case you were wondering. ;)
Posted Jan 23, 2015 0:31 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (4 responses)
There was a long email thread some years back on how to keep normal users from using new experimental functionality. Dave Jones suggested making the experimental feature splat on boot as the most reliable way to keep most distros from turning it on by default. ;-)
Posted Jan 29, 2015 14:56 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Jan 29, 2015 15:52 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
Posted Feb 4, 2015 15:06 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
(I'm sure all Linux vendors have similarly good QA teams -- I'm just only familiar with the one, and perhaps have had my expectations unduly lowered by awful QA teams in other jobs. I'm sure in e.g. aerospace the QA is even more effective.)
Posted Feb 4, 2015 15:43 UTC (Wed)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Jan 23, 2015 18:02 UTC (Fri)
by josh (subscriber, #17465)
[Link] (3 responses)
The major goal of that prototype wasn't the 3% performance gain; it was the consistency and real-time response. Without processes being interrupted, they consistently executed the same amount of work every time. And avoiding interruptions puts a better bound on latency.
Posted Jan 24, 2015 21:30 UTC (Sat)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
However, as far as I know, we never did get a good measurement of the real-time benefits, and many of the most enthusiastic early adopters that I know of are more in the HPC arena than in the real-time arena. Though there are some that are to some extent in both arenas, and there might well be people using NO_HZ_FULL to implement the CPU-bound polling-loop style of real-time applications.
Or have you seen some measures of the real-time benefits or come across people using NO_HZ_FULL for real-time workloads?
If not, this would certainly not be the first project that I have been involved with whose major value turned out to be its unintended consequences. :-)
Posted Jan 25, 2015 6:56 UTC (Sun)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Jan 27, 2015 16:36 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]