Notes from the LPC scheduler microconference
Workload simulation
First up was Juri Lelli, who talked about the rt-app tool that tests the scheduler by simulating specific types of workloads. Work on rt-app has been progressing; recent additions include the ability to model mutexes, condition variables, memory-mapped I/O, periodic workloads, and more. There is a new JSON grammar that can be used to specify each task's behavior. Rt-app now generates a log file for each task, making it easy to extract statistics on what happened during a test run; for example, it can be used to watch how the CPU-frequency governor reacts when a "small" task suddenly starts requiring a lot more CPU time. Developers have been using it to see how scheduler patches change the behavior of the system; it is a sort of unit test for the scheduler.
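To give a sense of the format, a minimal workload description in rt-app's JSON grammar might look something like this; the sketch below is not from the talk, and the keys follow the rt-app documentation from memory (run, sleep, and period values are in microseconds, duration in seconds), so details may differ:

    {
        "tasks" : {
            "small_task" : {
                "loop" : -1,
                "run" : 2000,
                "sleep" : 18000
            },
            "audio_thread" : {
                "policy" : "SCHED_FIFO",
                "priority" : 10,
                "loop" : -1,
                "run" : 1000,
                "timer" : { "ref" : "tick", "period" : 10000 }
            }
        },
        "global" : {
            "duration" : 10,
            "default_policy" : "SCHED_OTHER",
            "logdir" : "./"
        }
    }

Each named task becomes a thread that repeats its sequence of events: here "audio_thread" is a periodic SCHED_FIFO task running for 1ms out of every 10ms, while "small_task" runs for 2ms and sleeps for 18ms in a loop.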
In summary, he said, rt-app has become a flexible tool for the simulation
of many types of workloads. The actual workloads are generated from traces
taken when running the use cases of interest. It is possible (and done),
he said, to compare the simulated runs with the original traces. The
rt-app repository itself has a set of relatively simple workloads modeling
browser and audio-playback use cases, for example. New workloads are
usually created when somebody finds something that doesn't work well and
wants a way to reproduce the problem.
Rafael Wysocki said that he often has trouble with kernel changes that affect CPU-frequency behavior indirectly. He would like to have a way to evaluate patches for this kind of unwanted side effect, preferably no later than when they are sitting in linux-next and are still relatively easy to fix. Might it be possible to use rt-app for this kind of testing? Josef Bacik added that his group (at Facebook) is constantly fixing issues that come up with each new kernel; it gets tiresome, so he, too, would like to find and fix these problems earlier.
He went on to state that everybody seems to agree that rt-app is the tool for this job. There would be value in having all developers pool their workloads into a comprehensive battery of tests, but where is the right place to put them? The rt-app project itself isn't necessarily the right place for all these workloads, so it would seem that a different repository is indicated. The LISA project was suggested as one possible home. Steve Rostedt, however, suggested that these workloads could just go into the kernel tree directly. Lelli asked whether having rt-app as an external dependency would be acceptable; Rostedt said that it would.
Wysocki wondered about how the community could ensure that these workloads get run on a wide variety of hardware; it's not possible to emulate everything. Bacik replied that it's not possible to test everything either, and that attempts to do so would be a case of the perfect being the enemy of the good. If each interested developer runs the tests on their own system, the overall coverage will be good enough. It's an approach that works out well for the xfstests filesystem-testing suite, he said.
Peter Zijlstra complained that rt-app doesn't produce any direct feedback — there is no result saying whether the right thing happened or not. As a result, interpreting rt-app output "requires having a brain". Rostedt suggested adding a feature to compare runs against previous output and look for regressions. Patrick Bellasi noted, though, that to work well in this mode, the tests need to be fully reproducible; that requires care in setting up the tests. At this point, he said, rt-app is not a continuous-integration tool, but it could maybe be evolved in that direction.
Reworking push/pull
Rostedt gave a brief presentation on what he called a "first-world problem". When running realtime workloads on systems with a large number (over 60) of CPUs — something he said people actually want to do — realtime tasks do not always migrate between CPUs in an optimal manner. That is due to shortcomings in how that migration is handled.
When the last running realtime task on any given CPU goes idle, he said, the CPU will examine the system to see if any other CPUs are overloaded with realtime tasks. If it finds a CPU running more than one realtime process, it will pull one of those processes over and run it locally. Once upon a time, this examination was done by grabbing locks on the remote CPUs, but that does not scale well. Now, instead, the idle CPU will send an inter-processor interrupt (IPI) to tell other CPUs that they can push an extra realtime task in its direction.
That is an improvement, but imagine a situation with many CPUs, one of which is overloaded. All of the other CPUs go idle more-or-less simultaneously, which does happen at times. They will all send IPIs to the busy CPU. One of them will get the extra process from that CPU but, having pushed that process away, the CPU still has to process a pile of useless IPIs. That adds extra load to the one CPU in the system that was already known to be overloaded, leading to observed high latencies — the one thing a realtime system is supposed to avoid at all costs.
The proposed solution is to have only the first (for some definition of "first") idle CPU send the IPI. That IPI can then be forwarded on to other overloaded CPUs if needed. The result would be the elimination of the IPI storms. Nobody seemed to think that this was a bad solution. It was noted that it would be nice to have statistics indicating how well this mechanism is working, and the conversation devolved quickly into a discussion of tracepoints (or the lack thereof) in the scheduler code. Zijlstra said that he broke user space once as a result of tracepoints, so he will not allow the addition of any more. This is a topic that was to return toward the end of the session.
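What follows is a rough pseudo-C sketch of that idea; every name in it is made up for illustration and none of it corresponds to the kernel's actual RT push/pull implementation.

    /*
     * Illustrative sketch only.  The first CPU to go idle sends a single
     * "push a task toward the idle CPUs" IPI; a CPU receiving it pushes
     * one task and forwards the IPI to the next overloaded CPU, if any,
     * rather than every idle CPU interrupting the busy CPU directly.
     */
    static atomic_t push_ipi_in_flight = ATOMIC_INIT(0);

    /* Called when the last realtime task on this CPU goes away. */
    void rt_cpu_went_idle(void)
    {
            /* Only the first CPU to go idle starts an IPI chain. */
            if (atomic_cmpxchg(&push_ipi_in_flight, 0, 1) == 0)
                    send_push_ipi(first_overloaded_cpu());
    }

    /* Runs on an overloaded CPU that received the push IPI. */
    void handle_push_ipi(void)
    {
            int next;

            push_one_rt_task();     /* push one extra RT task to an idle CPU */

            /* Forward the request instead of letting more IPIs pile up here. */
            next = next_overloaded_cpu();
            if (next >= 0)
                    send_push_ipi(next);
            else
                    atomic_set(&push_ipi_in_flight, 0);     /* chain done */
    }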
Multi-core scheduling
Things took a different turn when Jean-Pierre Lozi stood up to talk about multi-core scheduling issues. Lozi is the lead author of the paper called The Linux Scheduler: a Decade of Wasted Cores [PDF], which described a number of issues with the CPU scheduler. An attempt to go through a list of those bugs drew a strong response from Zijlstra, who claimed that most of them were fixed some time ago.
The biggest potential exception is "work conservation" — ensuring that no task languishes in a CPU run queue if there are idle CPUs available elsewhere in the system. On larger systems, CPUs will indeed sit idle while tasks wait, and Zijlstra said that things will stay that way. When there are a lot of cores, he said, ensuring perfect work conservation is unacceptably expensive; it requires the maintenance of a global state and simply does not scale. Any cure would be worse than the disease.
Lozi's proposed solution is partitioning the system, essentially turning it into a set of smaller systems that can be scheduled independently. Zijlstra expressed skepticism that such an idea would be accepted, leading to a suggestion from Lozi that this work may not be intended for the mainline kernel. There was a quick and contentious discussion on the wisdom of the idea and whether it could already be done using the existing CPU-isolation mechanisms. Mathieu Desnoyers eventually intervened to pull the discussion back to its original focus: a proposal to allow the creation of custom schedulers in the kernel. This work is based on Bossa, which was originally developed some ten years ago. It uses a domain-specific language to describe a scheduler and how its decisions will be made; different partitions in the system could then run different scheduling algorithms adapted to their specific tasks.
It was pointed out that the idea of enabling loadable schedulers was firmly shot down quite a few years ago. Even so, there was a brief discussion on how they could be enabled. The likely approach would be to create a new scheduler class that would sit below the realtime classes on the priority scale, but above the SCHED_OTHER class where most work runs. The discussion ran out of time, but it seems likely that this idea will eventually return; stocking up on popcorn for that event might be advisable. Zijlstra, meanwhile, insists that the incident with the throwable microphone was entirely accidental.
Workload analysis with tracing
Desnoyers started the final topic of the session by stating that there is a real need for better instrumentation of the scheduler so that its decisions can be understood and improved. He would like exact information on process switches, wakeups, priority changes, etc. It is important to find a way to add this information without creating ABI issues that could impede future scheduler development. That means that the events have to be created in a way that allows them to evolve without breaking user-space tools.
Zijlstra said that it would not be possible to add all of the desired
information even to the existing tracepoints; that would bloat them too
much. Desnoyers suggested adding a new version file to each
tracepoint; an application could write "2" to it to get a new,
more complete output format. Rostedt complained about the use of version
numbers, though, and suggested writing a descriptive string to the existing
format file instead. Zijlstra said that tracepoints should default to the
newest format, but Rostedt said that would break existing tools. The only
workable way is to default to the old format and allow aware tools to
request a change.
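For illustration only, an "aware" tool opting in under that scheme might do something like the following; the path is the existing sched_switch format file, but writing to it (and the string used here) is purely hypothetical:

    #include <fcntl.h>
    #include <unistd.h>

    static void request_extended_format(void)
    {
            /* Hypothetical: the format file is read-only today, and the
             * kernel attaches no meaning to the string written here. */
            int fd = open("/sys/kernel/debug/tracing/events/sched/sched_switch/format",
                          O_WRONLY);

            if (fd >= 0) {
                    write(fd, "extended", 8);
                    close(fd);
            }
    }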
That was about the point where Zijlstra (semi-jokingly) declared his intent to remove all of the tracepoints from the scheduler. "I didn't want this pony", he said.
There was a wandering discussion on how it might be possible to design a mechanism that resembles tracepoints, but which is not subject to the kernel's normal ABI guarantees. Bacik said that a lot of the existing scripts in this area use kprobes; they are simply updated whenever they break. Perhaps a new sort of "tracehook" could be defined that resembles a kprobe: it is a place to hook into a given spot in the code, but without any sort of predefined output. A function could be attached to that hook to create tracepoint-style output; it could be located in a separate kernel module or be loaded as an eBPF script. Either way, that glue code would be treated like an ordinary kernel module; it is using an internal API and, as a result, must adapt if that API changes.
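As a rough sketch of how such a hook might look (the DECLARE_TRACEHOOK(), tracehook_call(), and tracehook_attach() names are invented here; nothing like them exists in the kernel), the hook itself would carry no user-visible format, and all of the formatting would live in glue code that is treated like any other kernel module:

    /* Hypothetical: a bare hook point with no output format attached. */
    DECLARE_TRACEHOOK(sched_switch_hook,
                      struct task_struct *prev, struct task_struct *next);

    /* In __schedule(), the scheduler would simply fire the hook: */
    tracehook_call(sched_switch_hook, prev, next);

    /*
     * In a separate module (or an eBPF program): the glue that turns the
     * hook into tracepoint-style output.  It dereferences internal kernel
     * structures directly, so it must be adapted whenever they change.
     */
    static void my_switch_probe(struct task_struct *prev,
                                struct task_struct *next)
    {
            trace_printk("switch %d -> %d\n", prev->pid, next->pid);
    }

    static int __init my_glue_init(void)
    {
            return tracehook_attach(sched_switch_hook, my_switch_probe);
    }
    module_init(my_glue_init);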
The developers in the room seemed to like this idea, suggesting even that it might be a way to finally get some tracepoints into the virtual filesystem layer — a part of the kernel that does not allow tracepoints at all. Your editor was not convinced, though, that the ABI issues could be so easily dodged. As the session ended, it was resolved that the idea would be presented to Linus Torvalds for a definitive opinion, most likely during the Maintainers Summit in October.
[Your editor would like to thank the Linux Foundation, LWN's travel
sponsor, for supporting his travel to LPC 2017].
Index entries for this article
    Kernel: Development tools/rt-app
    Kernel: Scheduler
    Conference: Linux Plumbers Conference/2017
Posted Sep 21, 2017 18:41 UTC (Thu) by karim (subscriber, #114)
I suspect that, as Peter seemed to hint, if trace points are to be considered ABI then they're an undue burden. I also don't believe that just marking the locations is sufficient. The whole purpose of having trace points in the kernel is for them to be self-descriptive/contained.
Posted Sep 22, 2017 17:01 UTC (Fri) by karim (subscriber, #114)
But from my point of view, they should all be seen as "can go away at any time". I would think that that should make it more palatable for maintainers to accept them -- and easier to garbage-collect them at any time. I bet it's even possible to create a script that allows user-space-tool maintainers to identify trace point deltas from one version to the next and adjust their code accordingly.
Still, I see your point. In fact, that's the point I argued on the LKML circa 2005/2006 when I was still getting pushback on the LTT patches. I essentially showed that the trace points I had in 1999 and 2005/6 were essentially the same set and, hence, the whole argument about unmaintainability was overblown. IOW, some events are always going to take place in any Unix-like kernel: system calls, sched changes, interrupt entries, etc.
It's a good argument for having the trace points part of the sources. It doesn't mean they should be considered ABI. In fact, trace points are likely closer to the driver API than the system call API in terms of long-term consistency guarantees. The former can change at any time, the latter never should.
Posted Sep 25, 2017 10:35 UTC (Mon) by nix (subscriber, #2304)
More generally, I think, the stability of a tracepoint is the same as the stability of the thing it is tracing. Function-based tracing is totally unstable because nobody expects the names of functions in the kernel to remain the same forever. Syscall tracing or proc:: process-state tracing is stable because, in the one case, there is an explicit guarantee about the stability of syscalls (at least, those that actually get used); in the other case, it is difficult to imagine a Unix-like kernel not implementing a way to create processes, or fork, and you'd attach those tracepoints to wherever it was you did that.
So I think we are in violent agreement here. (DTrace for Linux has gone further: the tracepoints, their type information, and the information about which D types they are translated to -- possibly turning one argument into many -- are all recorded nowhere but in the tracepoint declarations in the source code, and thus in a specialized section which is sucked out and parsed into argument types at DTrace module load time.)
Posted Sep 25, 2017 11:08 UTC (Mon) by nix (subscriber, #2304)
I should not post before coffee.
Posted Sep 22, 2017 17:58 UTC (Fri) by nevets (subscriber, #11875)
This has already happened with powertop. We had to keep 4 bytes of zeros in *every* event because powertop hard-coded the offsets. Luckily for me, that hard coding broke when a 32-bit powertop ran on a 64-bit kernel. Then I was able to fix powertop to use the proper parsing and finally remove those 4 blank bytes. But that took years to get done.
There's a wake-up "success" byte in the wake-up code that is now just a hard-coded 1. It can't be removed, because a lot of tools look at it to see whether the wake-up was successful or not, even though it is always 1.
Posted Sep 22, 2017 18:07 UTC (Fri) by karim (subscriber, #114)
It would probably be better to hide the ugliness into some kind of user-space library and have the user-space tools depend on that. This way the tools wouldn't need to change as much, but they'd need to use a version of the library that understands the kernel they're running on.
Having to maintain insane padding in events just because a user-space tool misunderstood it is broken by design from my standpoint. I'm in no position to push this myself, but I really think that your time would be much better spent making tracing better than being tied to legacy issues such as this. And, more importantly, I think users would be much better off if the trace points were considered a "constant moving target".
Posted Sep 26, 2017 18:10 UTC (Tue) by bfields (subscriber, #19510)
This sort of approach worries me because it seems to assume a single linear set of kernel versions when in fact there's a huge forest of distro kernels which pick and choose from upstream.
So if possible you'd like the interface to be self-describing somehow.
RE: Reworking push/pull
Could the problem be solved via delegation? I.e. something like:
Here [i] can clearly change, and memory of recently signaled CPUs isn't shared. It's also not going to be as quick owing to the extra hop, and it may well still have pathological cases, but is possibly better..?
There's no reason, for instance, that user space couldn't have tables that know that versions prior to X have a specific list of events and a corresponding set of parameters, that versions X through Y have different parameters/values for the foo() trace point, that versions Y onward differ again in whatever way they do, and so on.