Extending restartable sequences with virtual CPU IDs
Posted Feb 28, 2022 22:52 UTC (Mon) by calumapplepie (guest, #143655)
Obviously, adding more compile options is inadvisable in general, but on a system with only a few cores and a fairly large amount of memory, the benefits from virtual CPU IDs are negligible, while that 2% performance cost stays the same in relative magnitude. The option wouldn't need to be complex: just make the virtual CPU ID the same as the real one, and update both parts of the structure simultaneously. It's really easy to implement: wrap all the V_CPU logic in an #ifndef, and add an #ifdef assignment from the CPU ID to the V_CPU. No distribution will enable it by default, but ricers and embedded programmers probably will.
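A rough sketch of what I have in mind; CONFIG_RSEQ_VCPU and the helper name here are placeholders I made up, not the actual symbols from the patch series:

```c
#include <stdint.h>

/* Hypothetical compile option: with CONFIG_RSEQ_VCPU disabled, the
 * "virtual" CPU ID is simply the physical CPU ID, so both fields of
 * the user-visible rseq area can be filled in with the same value
 * and all of the vcpu-allocation machinery compiles away. */
#ifndef CONFIG_RSEQ_VCPU
static inline uint32_t rseq_vcpu_id(uint32_t cpu_id)
{
	return cpu_id;	/* no indirection: vcpu == cpu */
}
#else
/* Real vcpu-allocation logic would live here. */
uint32_t rseq_vcpu_id(uint32_t cpu_id);
#endif
```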
If you want to be extra nice, this optimization could be automatically enabled at boot on all machines with 8 or fewer cores and no CPU hotplug (e.g., my laptop's quad-core processor with SMT on). That'd be significantly harder than just adding a few #ifdef statements, but it'd deliver a minor speed improvement to a lot of users.
(not a kernel dev, so there might be some egregious flaw in my reasoning)
Posted Mar 1, 2022 1:34 UTC (Tue)
by compudj (subscriber, #43335)
It's unclear to me whether there would be any overall gain in disabling this feature on systems that have a user-space that makes use of it. The memory allocators (glibc, tcmalloc) will be some of the first heavy users, so I expect we will end up in a situation where having the VCPU_DOMAIN feature enabled will actually bring performance benefits to the overall system (including user-space) due to improved memory allocation/access locality.
Also, you are presuming that the virtual CPU IDs will bring "negligible" performance benefits on systems with few cores but a large amount of memory. I would like to see numbers here, because even though a system has a lot of memory, the locality of cache accesses still matters, and I expect that having the memory allocator pools spread over fewer vcpu ids (compared to the number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.
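As a toy model of that locality argument (the names and layout here are illustrative, not taken from any real allocator): per-ID pools indexed by physical CPU ID are spread across as many slots as there are possible CPUs, while vcpu IDs keep the indices dense and bounded by the number of concurrently running threads.

```c
#include <stdint.h>

/* A machine may have many possible CPUs even if few are in use. */
#define POSSIBLE_CPUS 128

struct pool {
	void *freelist;	/* stand-in for per-CPU allocator metadata */
};

static struct pool pools[POSSIBLE_CPUS];

/* With vcpu IDs, two threads migrating across the machine still
 * index pools[0] and pools[1] (adjacent slots); with physical IDs
 * they may touch any of the 128 slots, spreading the allocator's
 * hot metadata across many cache lines. */
static struct pool *pool_for(uint32_t id)
{
	return &pools[id];
}
```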
You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up, or did you benchmark the patch series discussed in the article?
Posted Mar 2, 2022 2:20 UTC (Wed)
by calumapplepie (guest, #143655)
A lot depends on how they use VCPU_DOMAIN. If they're indexing into an array of pointers, it wouldn't make a difference for cache locality on a few-core system: 8 cores * 8 bytes/pointer is 64 bytes, conveniently close to where I drew my arbitrary line. Of course, that assumes they align these arrays to cache lines, but I think that's a perfectly reasonable optimization to expect.
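The layout math above can be checked directly (the struct name and alignment attribute are mine, purely for illustration): on an 8-core system a cache-line-aligned array of per-(v)CPU pointers occupies exactly one 64-byte line, so compacting the IDs can't reduce the number of lines touched.

```c
#include <stddef.h>

#define NCPUS		8	/* the few-core case from the comment */
#define CACHELINE	64	/* typical x86 cache line size */

/* 8 pointers * 8 bytes each = 64 bytes = one cache line. */
struct percpu_ptrs {
	void *pool[NCPUS];
} __attribute__((aligned(CACHELINE)));

static size_t cachelines_spanned(void)
{
	return (sizeof(struct percpu_ptrs) + CACHELINE - 1) / CACHELINE;
}
```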
Of course, if they're using them in any other way, then the cache optimizations are up for grabs.
> I would like to see numbers here
Sadly, like any annoying internet commentator, I am not tooled up to provide any numbers to back up my wild assertions.
> I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.
That is an excellent argument that I had not considered: how migration between cores interacts with cache locality. I think there are some problems with assuming that locality would naturally increase with VCPU IDs versus a standard ID, however.
Firstly, I believe (though I could be very wrong) that the scheduler will try to preferentially place a process on the same core repeatedly, so it can benefit from L1 caches. If that is the case, VCPU makes fairly little difference in practice: the indexed structure will be repeatedly accessed at the same location, holding the same cache lines hot.
Also, because the differences only matter when tasks migrate cores, by necessity the L1 and (possibly) the L2 caches won't matter. The difference made by caches is thus limited sharply.
> You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article ?
Completely and utterly made up. These are the idle theories of a procrastinating college student.