LWN: Comments on "Reconsidering the scheduler's wake_wide() heuristic"
https://lwn.net/Articles/728942/
This is a special feed containing comments posted to the individual LWN article titled "Reconsidering the scheduler's wake_wide() heuristic".

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729993/
Posted by walkerlala on Sun, 06 Aug 2017 03:18:24 +0000

I am wondering whether it is possible to provide heuristics for the scheduler from userspace. That would make it a lot easier to tackle these tasks, I guess.

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729735/
Posted by flussence on Thu, 03 Aug 2017 12:40:49 +0000

> I agree that one must be very careful not to come up with an algorithm that benefits an MPI-like loads-of-communicating-processes model if it penalizes the much more common "two tasks chattering frequently" model.

Maybe we could slim that heuristic down to “anything added to the scheduler should not further widen MuQSS's advantage”? :-)

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729280/
Posted by josefbacik on Mon, 31 Jul 2017 05:55:05 +0000

Sorry, for some reason I missed the follow-up conversations; I'll go back and read through them shortly and respond on list.

However, I did come up with a different solution while looking at a CPU-imbalance problem (https://josefbacik.github.io/kernel/scheduler/cgroup/2017/07/24/scheduler-imbalance.html). Mike is right: any messing with the heuristic here is likely to end in tears. A problem with wake_wide() is that it overloads the waker's CPU when it decides we need affinity, even on heavily loaded systems. Instead of messing with wake_wide() and trying to make it smarter, I just addressed the problem it sometimes creates: ping-ponging. One of my patches provides a way to detect when we are trying to wake-affine something that has recently been load balanced, and to skip the wake affine in that case. This gets us the same behavior as if wake_wide() had returned 1, and it sidesteps the problem of trying to build a one-size-fits-most heuristic into wake_wide().
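To make the shape of that approach concrete, here is a minimal userspace sketch of the kind of check Josef describes. The per-task timestamp, its field name, and the 1 ms threshold are all invented here for illustration only; they are not taken from the actual patches.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-in for a task's scheduling state; the field name
     * last_lb_migration_ns is hypothetical, not a real task_struct member. */
    struct task_info {
        uint64_t last_lb_migration_ns;  /* when the load balancer last moved it */
    };

    /* Skip the affine wakeup (i.e. behave as if wake_wide() returned 1) when
     * the load balancer placed this task very recently, so the waker does not
     * immediately pull it back and start a ping-pong. */
    static int skip_wake_affine(const struct task_info *t, uint64_t now_ns)
    {
        const uint64_t threshold_ns = 1000000;  /* 1 ms, arbitrary for the sketch */
        return (now_ns - t->last_lb_migration_ns) < threshold_ns;
    }

    int main(void)
    {
        struct task_info t = { .last_lb_migration_ns = 5000000 };

        /* Woken 0.4 ms after a balance: too soon, leave the task where it is. */
        printf("skip affine: %d\n", skip_wake_affine(&t, 5400000));
        /* Woken 4 ms after a balance: the affine pull is allowed again. */
        printf("skip affine: %d\n", skip_wake_affine(&t, 9000000));
        return 0;
    }

The only point of the sketch is the decision itself: if the load balancer spread the task out moments ago, act as though wake_wide() had said "wake wide" and leave the task on its previous CPU rather than pulling it back to the waker.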
Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729274/
Posted by glenn on Sun, 30 Jul 2017 22:51:33 +0000

I agree: the CPU cache size should be considered, but the amount of data shared between producers and consumers is important as well.

I researched (https://tinyurl.com/y7lxzcy4) enhancing a deadline-based scheduler with cache-topology-aware CPU selection, and I studied the potential benefits for workloads where producer/consumer processes can be described as a directed graph (you see workloads like this in video and computer-vision pipelines). I hesitate to generalize too much from my scheduler and experiments, but I think some of the broader findings can be applied to Linux's general-purpose scheduler.

To my surprise, I discovered something obvious that I should have realized earlier in my research: (1) for producers/consumers that share little data, cache locality is not very important, because the overhead due to lost cache affinity is negligible; and (2) for producers/consumers that share a LOT of data, cache locality is not very important either, because most of the shared data gets self-evicted (or evicted by unrelated work executing concurrently) from the cache anyhow. In cases (1) and (2), getting scheduled on an available CPU is more important. Cache-aware scheduling is useful only for producers/consumers that share a moderate amount of data (“goldilocks workloads”). Moreover, you must strive to schedule a consumer soon after its producer(s) produce, or the shared data may be evicted from the cache by a concurrently scheduled, unrelated workload.

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729196/
Posted by garloff on Sat, 29 Jul 2017 09:31:36 +0000

So it does seem that an application should be able to tell the kernel how hard to try to place communicating processes close to each other. That probably should not be a binary setting, but should allow for several steps.
The question is whether this can be done efficiently at process-group scope or whether it needs to be system-wide. Maybe cgroup-wide?

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729108/
Posted by ejr on Fri, 28 Jul 2017 14:23:43 +0000

Most do, IIRC, although that becomes interesting with mixed MPI+OpenMP+GPU codes. You do end up treating NUMA systems as a cluster on a fast interconnect.

This patch does not look inspired by MPI codes. But there are some odd phrases in the article's introduction that probably triggered the original post's worries. Many large-scale parallel codes very much work in lock step in critical areas. Consider the all-to-all reduction used to compute a residual (a stand-in for the error term) and determine convergence. That is the opposite of the article's statement that parallel programs respond randomly...

I also have plenty of confusing data on packing vs. spreading for irregular data analysis (e.g. graph algorithms). The cache locality is already kind of meh, and you often get better performance from engaging more memory controllers simultaneously and having the larger aggregate L3 across all CPUs. But not always, and there is no clear indicator that I can see from my codes. I also haven't had the time (or a student) to dig into the choices. No heuristic will be right for everyone.
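The lock-step pattern ejr describes looks roughly like the following MPI sketch: every rank computes a local residual, then blocks in an all-to-all reduction before any rank can decide whether to iterate again, so nobody proceeds until the slowest rank has arrived. The solver step and the numbers are placeholders, not a real workload.

    #include <math.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local_residual_sq = 1.0, global_residual_sq = 0.0;
        const double tol = 1e-6;

        for (int iter = 0; iter < 100; iter++) {
            /* ... a real solver step would update local_residual_sq here ... */
            local_residual_sq *= 0.5;            /* placeholder for real work */

            /* Every rank waits here until all ranks have contributed. */
            MPI_Allreduce(&local_residual_sq, &global_residual_sq, 1,
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            if (sqrt(global_residual_sq) < tol)
                break;                           /* all ranks break together */
        }

        if (rank == 0)
            printf("converged, residual %g\n", sqrt(global_residual_sq));
        MPI_Finalize();
        return 0;
    }

Because MPI_Allreduce delivers the same result on every rank, all ranks take the convergence branch in the same iteration; the whole job advances in lock step rather than "responding randomly".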
Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729103/
Posted by mjthayer on Fri, 28 Jul 2017 12:59:11 +0000

My naivety will probably show here, but if you are running a custom MPI workload that is important enough to justify a patch to the kernel scheduler, what speaks against the MPI library controlling CPU affinity directly from user space?

Reconsidering the scheduler's wake_wide() heuristic
https://lwn.net/Articles/729100/
Posted by nix on Fri, 28 Jul 2017 11:19:39 +0000

I agree that one must be very careful not to come up with an algorithm that benefits an MPI-like loads-of-communicating-processes model if it penalizes the much more common "two tasks chattering frequently" model. There are a *lot* of those: everything from compilers to anything at all that uses an X server on the same machine. Even on headless and server-class machines they are probably a more common workload than the MPI model is.

Equally, one should probably factor the CPU cache size in *somewhere*, though with modern topologies it's hard to figure out how: probably all levels of cache should influence the computation somehow (preferring to move things more locally unless the cache might be overloaded or a wide-scale search for a different NUMA node is called for). But since the number of levels and their relationship to cores is all rather arch-dependent, it's hard to even think of a heuristic that doesn't rapidly degrade into a muddy mess.

I'm wondering... I know scheduler knobs are strongly deprecated, but if you're running a huge MPI workload you probably *know* you are. This seems like a perfect place for a knob that MPI itself flips to say "we expect to use all nodes in a constantly-chattering pattern, ignore cache-locality concerns". There aren't all that many libraries that would need adjusting, either... Users not running huge MPI workloads (and libraries other than things like MPI) would not need to know about this knob.
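For reference, the userspace pinning that mjthayer asks about (and that, as ejr notes, most MPI implementations already perform) needs no new kernel interface at all; a rank can bind itself with sched_setaffinity(2). The sketch below is deliberately minimal and assumes the launcher passes the target CPU on the command line; real MPI launchers implement much richer binding policies than this.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* Assume the launcher passes this rank's target CPU as argv[1]. */
        int cpu = (argc > 1) ? atoi(argv[1]) : 0;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* pid 0 means "the calling process". */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pid %d pinned to CPU %d\n", (int)getpid(), cpu);
        /* ... the rank's actual work would run here, staying on that CPU ... */
        return 0;
    }

What a static affinity mask cannot express is the hint nix describes: "we expect to use all nodes in a constantly-chattering pattern, ignore cache-locality concerns". That is the gap a scheduler-side knob, whether per process group or cgroup-wide as garloff suggests, would have to fill.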