
Reconsidering the scheduler's wake_wide() heuristic

Posted Jul 28, 2017 12:59 UTC (Fri) by mjthayer (guest, #39183)
In reply to: Reconsidering the scheduler's wake_wide() heuristic by nix
Parent article: Reconsidering the scheduler's wake_wide() heuristic

My naivety will probably show here, but if you are running a custom MPI workload important enough to justify a patch to the kernel scheduler, what speaks against having the MPI runtime control CPU affinity directly from user space?
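For what it's worth, pinning from user space is not much code. A minimal sketch using sched_setaffinity() (the CPU number here is an arbitrary example, and a real MPI runtime would pick it per rank from the topology):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   /* pin the calling process to CPU 2 (arbitrary choice) */

        /* pid 0 means "the calling thread" */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }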

Reconsidering the scheduler's wake_wide() heuristic

Posted Jul 28, 2017 14:23 UTC (Fri) by ejr (subscriber, #51652) (1 response)

Most do, IIRC, although that becomes interesting with mixed MPI+OpenMP+GPU codes. You do end up treating NUMA systems as a cluster on a fast interconnect.

This patch does not look inspired by MPI codes. But there are some odd phrases in the article's introduction that probably triggered the original post's worries. Many large-scale parallel codes very much work in lockstep in critical areas. Consider the all-to-all reduction for computing a residual (a stand-in for the error term) and determining convergence. That is the opposite of the article's statement that parallel programs respond randomly...
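To illustrate that lockstep pattern (this is just a generic sketch, not anything from the article or the patch), a convergence check in an iterative solver typically looks something like this, with every rank blocking until the global sum arrives:

    #include <math.h>
    #include <mpi.h>

    /* Each rank contributes its local squared residual; all ranks must
     * wait for the global reduction before deciding whether to stop. */
    int converged(double local_sq_residual, double tol, MPI_Comm comm)
    {
        double global_sq = 0.0;

        /* All-to-all reduction: every rank blocks here until all have arrived. */
        MPI_Allreduce(&local_sq_residual, &global_sq, 1,
                      MPI_DOUBLE, MPI_SUM, comm);

        return sqrt(global_sq) < tol;
    }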

I also have plenty of confusing data on packing v. spreading for irregular data analysis (e.g. graph algorithms). The cache locality already is kinda meh, and you often get better performance from engaging more memory controllers simultaneously and having the larger aggregate L3 across all CPUs. But not always, and there is no clear indicator that I can see from my codes. I also haven't had the time / student to dig into the choices. No heuristic will be right for everyone.

Reconsidering the scheduler's wake_wide() heuristic

Posted Jul 29, 2017 9:31 UTC (Sat) by garloff (subscriber, #319)

So it indeed seems that an application should be able to tell the kernel how hard it should try to place communicating processes close to each other. That probably should not be binary, but should allow for different degrees.
The question is whether this can be done efficiently at process-group scope or whether it needs to be system-wide. Maybe cgroup-wide?
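The closest existing knob is probably the cpuset controller, which is a hard cgroup-wide confinement rather than the graded hint described above. A rough sketch of what that looks like from user space (the mount point, group name and CPU/memory lists are assumptions, and the group must already exist with the cpuset controller enabled in the parent's cgroup.subtree_control):

    #include <stdio.h>
    #include <unistd.h>

    /* Helper (illustrative only): write a string to a cgroup control file. */
    static int cg_write(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(value, f) >= 0) ? 0 : -1;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        /* Assumed cgroup v2 mount point and pre-created group "mpi_job". */
        cg_write("/sys/fs/cgroup/mpi_job/cpuset.cpus", "0-7");  /* one node's CPUs */
        cg_write("/sys/fs/cgroup/mpi_job/cpuset.mems", "0");    /* and its memory node */

        char pid[32];
        snprintf(pid, sizeof(pid), "%d", getpid());
        cg_write("/sys/fs/cgroup/mpi_job/cgroup.procs", pid);   /* move this process in */
        return 0;
    }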

