I'm curious about the _real_ location of the data being touched. Surely a small but significant portion of the working set sits in the CPU caches. If you have multiple tasks cooperating/competing for data within the same page or, better yet, the same cachelines (think locks), it seems to make more sense to get the cooperating/competing (you choose the point of view) tasks running on the same CPU, or at least within the same node. For many real-world workloads the cache-to-cache latency between CPUs in different nodes is a bigger hit than bringing the data in from main memory. Of course this is workload dependent. Perhaps the scheduler could have logic to identify cooperating tasks and slowly colocate them. A combination of memory migration and task colocation could provide enough of a performance gain to justify a small bit of bloat in the main scheduling code path.
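
To make the "identify cooperating tasks and colocate them slowly" idea a bit more concrete, here is a toy user-space sketch in Python. It is not kernel code and not how any existing scheduler works. The class name, the `touch`/`balance_once` methods, the sharing threshold, and the "move at most one task per pass" rule are all assumptions made up for illustration. It only models task placement and ignores memory migration entirely.

```python
from collections import defaultdict


class ColocationSketch:
    """Toy model of gradual task colocation. All names are hypothetical."""

    def __init__(self, task_nodes, threshold=3):
        self.node = dict(task_nodes)        # task -> NUMA node it runs on
        self.threshold = threshold          # min sharing score before acting
        self.page_users = defaultdict(set)  # page -> tasks that touched it
        self.shared = defaultdict(int)      # frozenset({a, b}) -> sharing score

    def touch(self, task, page):
        """Record an access; bump the score with every other user of the page."""
        for other in self.page_users[page]:
            if other != task:
                self.shared[frozenset((task, other))] += 1
        self.page_users[page].add(task)

    def balance_once(self):
        """Move at most one task per pass so that colocation stays gradual.

        Returns (task, new_node) for the move made, or None if nothing moved.
        """
        candidates = [
            (score, sorted(pair))
            for pair, score in self.shared.items()
            if score >= self.threshold
        ]
        # Consider the strongest sharers first; ties are broken by task name.
        for _score, (a, b) in sorted(candidates, key=lambda c: (-c[0], c[1])):
            if self.node[a] != self.node[b]:
                # Arbitrary choice for the sketch: pull b over to a's node.
                self.node[b] = self.node[a]
                return (b, self.node[a])
        return None


sched = ColocationSketch({"A": 0, "B": 1, "C": 1})
for _ in range(3):
    sched.touch("A", 100)  # A and B keep hitting the same page
    sched.touch("B", 100)
sched.touch("C", 200)      # C works on its own page
sched.balance_once()       # moves B next to A, leaves C alone
```

A real implementation would have to sample accesses cheaply and weigh the move against the cost of leaving the task's memory behind, which is where the memory-migration half would come in.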