> I imagine there are other libraries which provide threadpool abstractions, again based on the number of cores, but I admit I have no real knowledge here.
Yes, I think that's fairly common. (And not perfect.)
> So what I was actually worrying about was: what happens? And what should happen? I can imagine people in future coming up with requirements around wanting to know when the number of CPUs changes to scale up/down their thread pools.
Thread pools could indeed benefit from adjustments at runtime. The underlying problem is that userspace is making resource-usage decisions. Parallelized programs do that to a larger extent than single-threaded ones because they have to decide how many threads to use. Userspace can't tell whether none/few/many threads would yield optimal performance because it doesn't know the lower-level trade-offs (e.g., everything discussed in the article). Likewise, the kernel can't find the optimum either: it doesn't know the program's utility function, and the userspace code that runs in the single-threaded vs. multi-threaded case is often different (e.g., due to differences in synchronization) and has different performance characteristics.
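
To make the "adjust at runtime" idea concrete, here is a minimal sketch of what a pool could do today with the little information it has. The names `available_cpus` and `next_pool_size` are hypothetical, and the grow-only policy is just one assumption for illustration; the affinity mask tells us which CPUs we may run on, not which ones are actually idle, which is exactly the missing information discussed above.

```python
import os


def available_cpus() -> int:
    """Number of CPUs this process is currently allowed to run on.

    Uses the affinity mask where the platform offers it (Linux),
    otherwise falls back to the total CPU count.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def next_pool_size(current: int, cpus: int) -> int:
    """Hypothetical policy: grow the pool to match the CPU count, never shrink.

    Idle threads are cheap, so scaling down is not worth signalling;
    scaling up is what a pool with a 1:1 thread-to-CPU mapping would want.
    """
    return max(current, cpus, 1)
```

A pool would have to call `available_cpus()` periodically (or be notified by some mechanism that doesn't exist yet) and resize to `next_pool_size(current, cpus)`.
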
> I think in theory it’s not really justifiable to have too much communication when scaling down. If performance monitoring code decides we are underusing the CPU resources and scales down, then clearly the threadpool is not using the cores much, so who cares. However, if you were bringing more CPUs online and the threadpool is a big contributor to load, and has a 1-1 matching with the number of CPUs, then growing the pool could make sense.
I agree that we shouldn't worry about having unused threads lying around. However, if the kernel knows, for example, that only one big core will be free of other processes (and thus available to our program), then the program can benefit from being told so, because it knows it doesn't have to try hard to run stuff in parallel. The parallelization overheads of course vary depending on the parallelization approach that userspace follows and on the actual application/workload (e.g., task-based parallelism combined with work-stealing is more robust than other approaches in cases of load imbalance).
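
As a rough sketch of "not trying hard" when little parallelism is available: the hypothetical helper `run_all` below takes the number of usable CPUs as an input and simply skips the pool (and its overhead) when that number is one. Where the `cpus` value comes from is the open question; here it is just a parameter, and the example is about the structure of the decision, not about measured speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(func, items, cpus):
    """Apply func to every item, in parallel only if it can pay off.

    cpus is assumed to be the number of CPUs actually available to
    this program (a hint the program would need from somewhere).
    """
    items = list(items)
    # With one usable CPU (or one item), run the plain serial path
    # and avoid thread creation and synchronization entirely.
    if cpus <= 1 or len(items) <= 1:
        return [func(x) for x in items]
    # Otherwise use at most one worker per available CPU.
    with ThreadPoolExecutor(max_workers=min(cpus, len(items))) as pool:
        return list(pool.map(func, items))
```

For example, `run_all(lambda x: x * x, [1, 2, 3], cpus=1)` takes the serial path, while the same call with `cpus=4` goes through the pool; both return the results in input order.
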
Overall, I think we should work on getting userspace and kernel to try to optimize this jointly, because IMO this is an end-to-end optimization problem (program utility function down to HW performance characteristics) and because I doubt that we'll find a catch-all tuning policy that's feasible to add to the kernel.