Extending restartable sequences with virtual CPU IDs
Posted Feb 28, 2022 22:52 UTC (Mon) by calumapplepie (guest, #143655)
Obviously, adding more compile options is inadvisable in general, but on a system with only a few cores and a fairly large amount of memory, the benefits from virtual CPU IDs are negligible, while that 2% performance cost stays the same in relative magnitude. The option wouldn't need to be complex: just make the virtual CPU ID the same as the real one, and update both parts of the structure simultaneously. It's really easy to implement: wrap all the V_CPU logic in an #ifndef, and add an #ifdef assignment from the CPU ID to the V_CPU. No distribution will enable it by default, but ricers and embedded programmers probably will.
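A rough sketch of what I have in mind; CONFIG_RSEQ_VCPU and the helper name here are placeholders I made up, not the actual symbols from the patch series:

```c
#include <stdint.h>

/* Hypothetical compile option: with CONFIG_RSEQ_VCPU disabled, the
 * "virtual" CPU ID is simply the physical CPU ID, so both fields of
 * the user-visible rseq area can be filled in with the same value
 * and all of the vcpu-allocation machinery compiles away. */
#ifndef CONFIG_RSEQ_VCPU
static inline uint32_t rseq_vcpu_id(uint32_t cpu_id)
{
	return cpu_id;	/* no indirection: vcpu == cpu */
}
#else
/* Real vcpu-allocation logic would live here. */
uint32_t rseq_vcpu_id(uint32_t cpu_id);
#endif
```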
If you want to be extra nice, this optimization could be automatically enabled at boot on all machines with 8 or fewer cores and no CPU hotplug (e.g., my laptop's quad-core processor with SMT on). That'd be significantly harder than just adding a few #ifdef statements, but it'd deliver a minor speed improvement to a lot of users.
(not a kernel dev, so there might be some egregious flaw in my reasoning)
Posted Mar 1, 2022 1:34 UTC (Tue)
by compudj (subscriber, #43335)
It's unclear to me whether there would be any overall gain in disabling this feature on systems that have a user-space that makes use of it. The memory allocators (glibc, tcmalloc) will be some of the first heavy users, so I expect we will end up in a situation where having the VCPU_DOMAIN feature enabled will actually bring performance benefits to the overall system (including user-space) due to improved memory allocation/access locality.
Also, you are presuming that the virtual CPU IDs will bring "negligible" performance benefits on systems with few cores but a large amount of memory. I would like to see numbers here, because even though a system has a lot of memory, the locality of cache accesses still matters, and I expect that having the memory allocator pools spread over fewer vcpu ids (compared to the number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.
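As a toy model of that locality argument (the names and layout here are illustrative, not taken from any real allocator): per-ID pools indexed by physical CPU ID are spread across as many slots as there are possible CPUs, while vcpu IDs keep the indices dense and bounded by the number of concurrently running threads.

```c
#include <stdint.h>

/* A machine may have many possible CPUs even if few are in use. */
#define POSSIBLE_CPUS 128

struct pool {
	void *freelist;	/* stand-in for per-CPU allocator metadata */
};

static struct pool pools[POSSIBLE_CPUS];

/* With vcpu IDs, two threads migrating across the machine still
 * index pools[0] and pools[1] (adjacent slots); with physical IDs
 * they may touch any of the 128 slots, spreading the allocator's
 * hot metadata across many cache lines. */
static struct pool *pool_for(uint32_t id)
{
	return &pools[id];
}
```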
You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up, or did you benchmark the patch series discussed in the article?
Posted Mar 2, 2022 2:20 UTC (Wed)
by calumapplepie (guest, #143655)
A lot depends on how they use VCPU_DOMAIN. If they're indexing into an array of pointers, it wouldn't make a difference for cache locality on a few-core system: 8 cores * 8 bytes/pointer is 64 bytes, conveniently close to where I drew my arbitrary line. Of course, that assumes they align these arrays to cache lines, but I think that's a perfectly reasonable optimization to expect.
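The layout math above can be checked directly (the struct name and alignment attribute are mine, purely for illustration): on an 8-core system a cache-line-aligned array of per-(v)CPU pointers occupies exactly one 64-byte line, so compacting the IDs can't reduce the number of lines touched.

```c
#include <stddef.h>

#define NCPUS		8	/* the few-core case from the comment */
#define CACHELINE	64	/* typical x86 cache line size */

/* 8 pointers * 8 bytes each = 64 bytes = one cache line. */
struct percpu_ptrs {
	void *pool[NCPUS];
} __attribute__((aligned(CACHELINE)));

static size_t cachelines_spanned(void)
{
	return (sizeof(struct percpu_ptrs) + CACHELINE - 1) / CACHELINE;
}
```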
Of course, if they're using them in any other way, then the cache optimizations are up for grabs.
> I would like to see numbers here
Sadly, like any annoying internet commentator, I am not tooled up to provide any numbers to back up my wild assertions.
> I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.
That is an excellent argument that I had not considered: how migration between cores interacts with cache locality. I think there are some problems with assuming that locality would naturally increase with VCPU IDs versus a standard ID, however.
Firstly, I believe (though I could be very wrong) that the scheduler will try to preferentially place a process on the same core repeatedly, so it can benefit from L1 caches. If that is the case, VCPU makes fairly little difference in practice: the indexed structure will be repeatedly accessed at the same location, holding the same cache lines hot.
Also, because the differences only matter when tasks migrate cores, by necessity the L1 and (possibly) the L2 caches won't matter. The difference made by caches is thus limited sharply.
> You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article ?
Completely and utterly made up. These are the idle theories of a procrastinating college student.