Notes from the LPC scheduler microconference
Workload simulation
First up was Juri Lelli, who talked about the rt-app tool that tests the scheduler by simulating specific types of workloads. Work on rt-app has been progressing; recent additions include the ability to model mutexes, condition variables, memory-mapped I/O, periodic workloads, and more. There is a new JSON grammar that can be used to specify each task's behavior. Rt-app now generates a log file for each task, making it easy to extract statistics on what happened during a test run; for example, it can be used to watch how the CPU-frequency governor reacts when a "small" task suddenly starts requiring a lot more CPU time. Developers have been using it to see how scheduler patches change the behavior of the system; it is a sort of unit test for the scheduler.
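To give a sense of the format, a minimal workload description in rt-app's JSON grammar might look something like this; the sketch below is not from the talk, and the keys follow the rt-app documentation from memory (run, sleep, and period values are in microseconds, duration in seconds), so details may differ:

    {
        "tasks" : {
            "small_task" : {
                "loop" : -1,
                "run" : 2000,
                "sleep" : 18000
            },
            "audio_thread" : {
                "policy" : "SCHED_FIFO",
                "priority" : 10,
                "loop" : -1,
                "run" : 1000,
                "timer" : { "ref" : "tick", "period" : 10000 }
            }
        },
        "global" : {
            "duration" : 10,
            "default_policy" : "SCHED_OTHER",
            "logdir" : "./"
        }
    }

Each named task becomes a thread that repeats its sequence of events: here "audio_thread" is a periodic SCHED_FIFO task running for 1ms out of every 10ms, while "small_task" runs for 2ms and sleeps for 18ms in a loop.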
In summary, he said, rt-app has become a flexible tool for the simulation
of many types of workloads. The actual workloads are generated from traces
taken when running the use cases of interest. It is possible (and done),
he said, to compare the simulated runs with the original traces. The
rt-app repository itself has a set of relatively simple workloads modeling
browser and audio-playback use cases, for example. New workloads are
usually created when somebody finds something that doesn't work well and
wants a way to reproduce the problem.
Rafael Wysocki said that he often has trouble with kernel changes that affect CPU-frequency behavior indirectly. He would like to have a way to evaluate patches for this kind of unwanted side effect, preferably no later than when they are sitting in linux-next and are still relatively easy to fix. Might it be possible to use rt-app for this kind of testing? Josef Bacik added that his group (at Facebook) is constantly fixing issues that come up with each new kernel; it gets tiresome, so he, too, would like to find and fix these problems earlier.
He went on to state that everybody seems to agree that rt-app is the tool for this job. There would be value in having all developers pool their workloads into a comprehensive battery of tests, but where is the right place to put them? The rt-app project itself isn't necessarily the right place for all these workloads, so it would seem that a different repository is indicated. The LISA project was suggested as one possible home. Steve Rostedt, however, suggested that these workloads could just go into the kernel tree directly. Lelli asked whether having rt-app as an external dependency would be acceptable; Rostedt said that it would.
Wysocki wondered about how the community could ensure that these workloads get run on a wide variety of hardware; it's not possible to emulate everything. Bacik replied that it's not possible to test everything either, and that attempts to do so would be a case of the perfect being the enemy of the good. If each interested developer runs the tests on their own system, the overall coverage will be good enough. It's an approach that works out well for the xfstests filesystem-testing suite, he said.
Peter Zijlstra complained that rt-app doesn't produce any direct feedback — there is no result saying whether the right thing happened or not. As a result, interpreting rt-app output "requires having a brain". Rostedt suggested adding a feature to compare runs against previous output and look for regressions. Patrick Bellasi noted, though, that to work well in this mode, the tests need to be fully reproducible; that requires care in setting up the tests. At this point, he said, rt-app is not a continuous-integration tool, but it could maybe be evolved in that direction.
Reworking push/pull
Rostedt gave a brief presentation on what he called a "first-world problem". When running realtime workloads on systems with a large number (over 60) of CPUs — something he said people actually want to do — realtime tasks do not always migrate between CPUs in an optimal manner. That is due to shortcomings in how that migration is handled.
When the last running realtime task on any given CPU goes idle, he said, the CPU will examine the system to see if any other CPUs are overloaded with realtime tasks. If it finds a CPU running more than one realtime process, it will pull one of those processes over and run it locally. Once upon a time, this examination was done by grabbing locks on the remote CPUs, but that does not scale well. Now, instead, the idle CPU will send an inter-processor interrupt (IPI) to tell other CPUs that they can push an extra realtime task in its direction.
That is an improvement, but imagine a situation with many CPUs, one of which is overloaded. All of the other CPUs go idle more-or-less simultaneously, which does happen at times. They will all send IPIs to the busy CPU. One of them will get the extra process from that CPU but, having pushed that process away, the CPU still has to process a pile of useless IPIs. That adds extra load to the one CPU in the system that was already known to be overloaded, leading to observed high latencies — the one thing a realtime system is supposed to avoid at all costs.
The proposed solution is to have only the first (for some definition of "first") idle CPU send the IPI. That IPI can then be forwarded on to other overloaded CPUs if needed. The result would be the elimination of the IPI storms. Nobody seemed to think that this was a bad solution. It was noted that it would be nice to have statistics indicating how well this mechanism is working, and the conversation devolved quickly into a discussion of tracepoints (or the lack thereof) in the scheduler code. Zijlstra said that he broke user space once as a result of tracepoints, so he will not allow the addition of any more. This is a topic that was to return toward the end of the session.
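What follows is a rough pseudo-C sketch of that idea; every name in it is made up for illustration and none of it corresponds to the kernel's actual RT push/pull implementation.

    /*
     * Illustrative sketch only.  The first CPU to go idle sends a single
     * "push a task toward the idle CPUs" IPI; a CPU receiving it pushes
     * one task and forwards the IPI to the next overloaded CPU, if any,
     * rather than every idle CPU interrupting the busy CPU directly.
     */
    static atomic_t push_ipi_in_flight = ATOMIC_INIT(0);

    /* Called when the last realtime task on this CPU goes away. */
    void rt_cpu_went_idle(void)
    {
            /* Only the first CPU to go idle starts an IPI chain. */
            if (atomic_cmpxchg(&push_ipi_in_flight, 0, 1) == 0)
                    send_push_ipi(first_overloaded_cpu());
    }

    /* Runs on an overloaded CPU that received the push IPI. */
    void handle_push_ipi(void)
    {
            int next;

            push_one_rt_task();     /* push one extra RT task to an idle CPU */

            /* Forward the request instead of letting more IPIs pile up here. */
            next = next_overloaded_cpu();
            if (next >= 0)
                    send_push_ipi(next);
            else
                    atomic_set(&push_ipi_in_flight, 0);     /* chain done */
    }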
Multi-core scheduling
Things took a different turn when Jean-Pierre Lozi stood up to talk about multi-core scheduling issues. Lozi is the lead author of the paper called The Linux Scheduler: a Decade of Wasted Cores [PDF], which described a number of issues with the CPU scheduler. An attempt to go through a list of those bugs drew a strong response from Zijlstra, who claimed that most of them were fixed some time ago.
The biggest potential exception is "work conservation" — ensuring that no task languishes in a CPU run queue if there are idle CPUs available elsewhere in the system. On larger systems, CPUs will indeed sit idle while tasks wait, and Zijlstra said that things will stay that way. When there are a lot of cores, he said, ensuring perfect work conservation is unacceptably expensive; it requires the maintenance of a global state and simply does not scale. Any cure would be worse than the disease.
Lozi's proposed solution is partitioning the system, essentially turning it into a set of smaller systems that can be scheduled independently. Zijlstra expressed skepticism that such an idea would be accepted, leading to a suggestion from Lozi that this work may not be intended for the mainline kernel. There was a quick and contentious discussion on the wisdom of the idea and whether it could already be done using the existing CPU-isolation mechanisms. Mathieu Desnoyers eventually intervened to pull the discussion back to its original focus: a proposal to allow the creation of custom schedulers in the kernel. This work is based on Bossa, which was originally developed some ten years ago. It uses a domain-specific language to describe a scheduler and how its decisions will be made; different partitions in the system could then run different scheduling algorithms adapted to their specific tasks.
It was pointed out that the idea of enabling loadable schedulers was firmly shot down quite a few years ago. Even so, there was a brief discussion on how they could be enabled. The likely approach would be to create a new scheduler class that would sit below the realtime classes on the priority scale, but above the SCHED_OTHER class where most work runs. The discussion ran out of time, but it seems likely that this idea will eventually return; stocking up on popcorn for that event might be advisable. Zijlstra, meanwhile, insists that the incident with the throwable microphone was entirely accidental.
Workload analysis with tracing
Desnoyers started the final topic of the session by stating that there is a real need for better instrumentation of the scheduler so that its decisions can be understood and improved. He would like exact information on process switches, wakeups, priority changes, etc. It is important to find a way to add this information without creating ABI issues that could impede future scheduler development. That means that the events have to be created in a way that allows them to evolve without breaking user-space tools.
Zijlstra said that it would not be possible to add all of the desired
information even to the existing tracepoints; that would bloat them too
much. Desnoyers suggested adding a new version file to each
tracepoint; an application could write "2" to it to get a new,
more complete output format. Rostedt complained about the use of version
numbers, though, and suggested writing a descriptive string to the existing
format file instead. Zijlstra said that tracepoints should default to the
newest format, but Rostedt said that would break existing tools. The only
workable way is to default to the old format and allow aware tools to
request a change.
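For illustration only, an "aware" tool opting in under that scheme might do something like the following; the path is the existing sched_switch format file, but writing to it (and the string used here) is purely hypothetical:

    #include <fcntl.h>
    #include <unistd.h>

    static void request_extended_format(void)
    {
            /* Hypothetical: the format file is read-only today, and the
             * kernel attaches no meaning to the string written here. */
            int fd = open("/sys/kernel/debug/tracing/events/sched/sched_switch/format",
                          O_WRONLY);

            if (fd >= 0) {
                    write(fd, "extended", 8);
                    close(fd);
            }
    }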
That was about the point where Zijlstra (semi-jokingly) declared his intent to remove all of the tracepoints from the scheduler. "I didn't want this pony", he said.
There was a wandering discussion on how it might be possible to design a mechanism that resembles tracepoints, but which is not subject to the kernel's normal ABI guarantees. Bacik said that a lot of the existing scripts in this area use kprobes; they are simply updated whenever they break. Perhaps a new sort of "tracehook" could be defined that resembles a kprobe: it is a place to hook into a given spot in the code, but without any sort of predefined output. A function could be attached to that hook to create tracepoint-style output; it could be located in a separate kernel module or be loaded as an eBPF script. Either way, that glue code would be treated like an ordinary kernel module; it is using an internal API and, as a result, must adapt if that API changes.
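As a rough sketch of how such a hook might look (the DECLARE_TRACEHOOK(), tracehook_call(), and tracehook_attach() names are invented here; nothing like them exists in the kernel), the hook itself would carry no user-visible format, and all of the formatting would live in glue code that is treated like any other kernel module:

    /* Hypothetical: a bare hook point with no output format attached. */
    DECLARE_TRACEHOOK(sched_switch_hook,
                      struct task_struct *prev, struct task_struct *next);

    /* In __schedule(), the scheduler would simply fire the hook: */
    tracehook_call(sched_switch_hook, prev, next);

    /*
     * In a separate module (or an eBPF program): the glue that turns the
     * hook into tracepoint-style output.  It dereferences internal kernel
     * structures directly, so it must be adapted whenever they change.
     */
    static void my_switch_probe(struct task_struct *prev,
                                struct task_struct *next)
    {
            trace_printk("switch %d -> %d\n", prev->pid, next->pid);
    }

    static int __init my_glue_init(void)
    {
            return tracehook_attach(sched_switch_hook, my_switch_probe);
    }
    module_init(my_glue_init);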
The developers in the room seemed to like this idea, suggesting even that it might be a way to finally get some tracepoints into the virtual filesystem layer — a part of the kernel that does not allow tracepoints at all. Your editor was not convinced, though, that the ABI issues could be so easily dodged. As the session ended, it was resolved that the idea would be presented to Linus Torvalds for a definitive opinion, most likely during the Maintainers Summit in October.
[Your editor would like to thank the Linux Foundation, LWN's travel
sponsor, for supporting his travel to LPC 2017].
Index entries for this article
    Kernel: Development tools/rt-app
    Kernel: Scheduler
    Conference: Linux Plumbers Conference/2017
Posted Sep 21, 2017 18:41 UTC (Thu) by karim (subscriber, #114)
I suspect that, as Peter seemed to hint, if trace points are to be considered ABI then they're an undue burden. I also don't believe that just marking the locations is sufficient. The whole purpose of having trace points in the kernel is for them to be self-descriptive/contained.
Posted Sep 22, 2017 17:01 UTC (Fri) by karim (subscriber, #114)
But from my point of view, they should all be seen as "can go away at any time". I would think that that should make it more palatable for maintainers to accept them -- and easier to garbage-collect them at any time. I bet it's even possible to create a script that allows user-space-tool maintainers to identify trace point deltas from one version to the next and adjust their code accordingly.
Still, I see your point. In fact, that's the point I argued on the LKML circa 2005/2006 when I was still getting pushback on the LTT patches. I essentially showed that the trace points I had in 1999 and 2005/6 were essentially the same set and, hence, the whole argument about unmaintainability was overblown. IOW, some events are always going to take place in any Unix-like kernel: system calls, sched changes, interrupt entries, etc.
It's a good argument for having the trace points part of the sources. It doesn't mean they should be considered ABI. In fact, trace points are likely closer to the driver API than the system call API in terms of long-term consistency guarantees. The former can change at any time, the latter never should.
Posted Sep 25, 2017 10:35 UTC (Mon) by nix (subscriber, #2304)
More generally, I think, the stability of a tracepoint is the same as the stability of the thing it is tracing. Function-based tracing is totally unstable because nobody expects the names of functions in the kernel to remain the same forever. Syscall tracing or proc:: process-state tracing is stable because, in the one case, there is an explicit guarantee about the stability of syscalls (at least, those that actually get used); in the other case, it is difficult to imagine a Unix-like kernel not implementing a way to create processes, or fork, and you'd attach those tracepoints to wherever it was you did that.
So I think we are in violent agreement here. (DTrace for Linux has gone further: the tracepoints, their type information, and the information about which D types they are translated to -- possibly turning one argument into many -- are all recorded nowhere but in the tracepoint declarations in the source code, and thus in a specialized section which is sucked out and parsed into argument types at DTrace module load time.)
Posted Sep 25, 2017 11:08 UTC (Mon) by nix (subscriber, #2304)
I should not post before coffee.
Posted Sep 22, 2017 17:58 UTC (Fri) by nevets (subscriber, #11875)
This has already happened with powertop. We had to keep 4 bytes of zeros in *every* event because powertop hard-coded the offsets. Luckily for me, that hard coding broke when a 32-bit powertop ran on a 64-bit kernel. Then I was able to fix powertop to use the proper parsing and finally remove those 4 blank bytes. But that took years to get done.
There's a wake-up "success" byte in the wake-up code that is now just a hard-coded 1. It can't be removed, because a lot of tools look at it to see whether the wake-up was successful or not, even though it is always 1.
Posted Sep 22, 2017 18:07 UTC (Fri) by karim (subscriber, #114)
It would probably be better to hide the ugliness into some kind of user-space library and have the user-space tools depend on that. This way the tools wouldn't need to change as much, but they'd need to use a version of the library that understands the kernel they're running on.
Having to maintain insane padding in events just because a user-space tool misunderstood it is broken by design from my standpoint. I'm in no position to push this myself, but I really think that your time would be much better spent making tracing better than being tied to legacy issues such as this. And, more importantly, I think users would be much better off if the trace points were considered a "constant moving target".
Posted Sep 26, 2017 18:10 UTC (Tue) by bfields (subscriber, #19510)
This sort of approach worries me because it seems to assume a single linear set of kernel versions when in fact there's a huge forest of distro kernels which pick and choose from upstream.
So if possible you'd like the interface to be self-describing somehow.
RE: Reworking push/pull
Could the problem be solved via delegation? I.e. something like:
Here [i] can clearly change, and memory of recently signaled CPUs isn't shared. It's also not going to be as quick owing to the extra hop, and it may well still have pathological cases, but is possibly better..?
There's no reason, for instance, that user space couldn't have tables that know that versions prior to X have a specific list of events and a corresponding set of parameters, that versions X through Y have different parameters/values for the foo() trace point, that versions Y onward differ again in whatever way they do, and so on.