LWN: Comments on "Extending restartable sequences with virtual CPU IDs"

Extending restartable sequences with virtual CPU IDs

Paf — Thu, 03 Mar 2022 05:02:40 +0000

Sure, you can do this with flags - sometimes you end up with a flag that basically says “new version”, but it can be done.

Extending restartable sequences with virtual CPU IDs

Paf — Thu, 03 Mar 2022 05:01:03 +0000

Ah, thank you for that clarification - that’s quite an extra ball of complexity. Interesting :o

Extending restartable sequences with virtual CPU IDs

compudj — Wed, 02 Mar 2022 15:59:35 +0000

struct rseq is quite different from the usual system call input/output parameters.

struct rseq is meant to: have its fields populated/read by both the kernel and user-space, be allocated by a single "owner" library (e.g. glibc), and be used by the application executable as well as by various shared objects.

So it's not as simple as having the kernel support various versions, because all users of rseq within a process (main executable and shared libraries) need to agree on its size and feature set, because there is only a single struct rseq per thread.

Therefore, the solution proposed in the patch set exposes the "feature size" supported by the kernel through auxiliary vectors, which allows glibc to allocate enough memory in the per-thread area and register it with the kernel through sys_rseq. This way, all rseq users within the process can agree on the size of the supported rseq feature set by looking at both glibc's __rseq_size and the auxiliary vector rseq feature size.
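As a rough illustration of that agreement, here is a minimal C sketch of how a library might decide whether a given struct rseq field is usable. It rests on several assumptions not stated in this comment: the AT_RSEQ_FEATURE_SIZE value of 27 is the constant that later appeared in the kernel's uapi headers, the 20-byte fallback is taken to be the space used by the original fields, and the helper names and the registered_size parameter (standing in for glibc's __rseq_size) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Assumption: value of the auxiliary vector key as later merged upstream.
 * Only defined here if the installed headers do not provide it. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif

/* Assumption: bytes occupied by the original struct rseq fields. */
#define ORIG_RSEQ_FEATURE_SIZE 20

/* Feature size advertised by the kernel; getauxval() returns 0 when the
 * entry is absent, in which case only the original fields are assumed. */
static unsigned long kernel_rseq_feature_size(void)
{
    unsigned long sz = getauxval(AT_RSEQ_FEATURE_SIZE);
    return sz ? sz : ORIG_RSEQ_FEATURE_SIZE;
}

/* A field ending at byte offset field_end is usable only if it fits in
 * both the kernel's feature size and the size the owner registered. */
static int rseq_field_usable(size_t field_end, unsigned long kernel_size,
                             size_t registered_size)
{
    return field_end <= kernel_size && field_end <= registered_size;
}
```

A caller would pass the end offset of the field it wants, kernel_rseq_feature_size(), and the registered size; every rseq user in the process reaches the same answer because both inputs are process-wide.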

If many struct rseq per thread were a possibility, things would be very much different and then version numbering would be possible, but it's been decided otherwise for the sake of keeping the kernel implementation simple and time-bounded.

So independently of the preference for version vs size-based extensibility, a version-based extensibility scheme for struct rseq simply won't work, because all user-space binaries linked into a process need to agree on the layout.

Extending restartable sequences with virtual CPU IDs

smurf — Wed, 02 Mar 2022 12:07:34 +0000

You could just set a flag bit. Or add a flag field if there isn't one already.

Extending restartable sequences with virtual CPU IDs

farnz — Wed, 02 Mar 2022 11:37:12 +0000

Same layout but different semantics can be covered by a flags field (with the kernel rejecting requests where the flags are unknown); this means that the same fields can be interpreted differently by different kernel versions, depending on which flags you set.
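A minimal user-space sketch of that idea in C follows. Everything in it is hypothetical (the struct, the flag names, and the "doubling" behaviour are made up for illustration and are not rseq's interface); it only shows the pattern of rejecting unknown flag bits and letting a known flag select how an existing field is interpreted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flags; not part of any real kernel interface. */
#define DEMO_FLAG_A             (1u << 0)
#define DEMO_FLAG_NEW_SEMANTICS (1u << 1)
#define DEMO_KNOWN_FLAGS        (DEMO_FLAG_A | DEMO_FLAG_NEW_SEMANTICS)

struct demo_args {
    uint32_t flags;
    uint32_t value;
};

/* Stand-in for a syscall handler: same layout, semantics chosen by flags. */
static int demo_handle(const struct demo_args *a, uint32_t *out)
{
    /* Reject any bit this "kernel" does not know, so that the bit can be
     * given a meaning later without misinterpreting old callers. */
    if (a->flags & ~DEMO_KNOWN_FLAGS)
        return -EINVAL;

    if (a->flags & DEMO_FLAG_NEW_SEMANTICS)
        *out = a->value * 2;   /* new interpretation of 'value' */
    else
        *out = a->value;       /* original interpretation */
    return 0;
}
```

A caller built against a newer header that sets an unknown bit gets -EINVAL from an older handler and can fall back, rather than silently getting the old semantics.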

Extending restartable sequences with virtual CPU IDs

Paf — Wed, 02 Mar 2022 02:21:50 +0000

By the way, this is (for my money) exactly the point of a version number - backwards compatibility by supporting multiple versions inside the API provider.

Extending restartable sequences with virtual CPU IDs

Paf — Wed, 02 Mar 2022 02:20:47 +0000

Well, you’ve decided it’s not struct rseq anymore. That’s just something you’ve decided as a definitional line in the sand - it could just as easily be struct rseq v2, with the same layout but different semantics because you decided the earlier semantics were bad.

To the second part: well, yes - you’d have to carry support for multiple versions in the kernel. That’s all it means. Other projects do this all the time.

The opposition to this is just a matter of preferring a new syscall with almost identical semantics or an extra field which changes semantics - which is what would happen if a major deficiency in the semantics were found - to an explicit versioning scheme. And that’s... a valid preference, though it’s definitely not mine.

I’m not asking you to fight this fight in the kernel, the choice has been made by others, but I do know which side I fall on.

Extending restartable sequences with virtual CPU IDs

calumapplepie — Wed, 02 Mar 2022 02:20:25 +0000

> The memory allocators (glibc, tcmalloc) will be some of the first heavy users, so I expect we will end up in a situation where having the VCPU_DOMAIN feature enabled will actually bring performance benefits to the overall system (including user-space) due to improved memory allocation/access locality.

It'd be very important how they use VCPU_DOMAIN. If they're indexing into an array of pointers, it wouldn't make a difference for cache locality on a few-core system: 8 cores * 8 bytes/pointer is 64 bytes, conveniently close to where I drew my arbitrary line. Of course, that's assuming they align these arrays to cache lines, but I think that's a perfectly reasonable optimization to expect.

Of course, if they're using them in any other way, then the cache optimizations are up for grabs.
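The arithmetic above can be written down as a small C sketch. The 64-byte cache line is an assumption (it is common but not universal), and the struct and helper names are hypothetical; no allocator is claimed to be laid out this way.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: typical cache line size in bytes */
#define NR_SLOTS   8    /* one pointer per CPU on an 8-core system */

/* Hypothetical per-CPU pointer table, aligned to a cache line. */
struct slot_table {
    alignas(CACHE_LINE) void *slot[NR_SLOTS];
};

/* The alignment pads the struct to exactly one cache line. */
static_assert(sizeof(struct slot_table) == CACHE_LINE,
              "table should occupy exactly one cache line");

/* Number of cache lines spanned by nr_slots pointers of ptr_size bytes,
 * assuming the array starts on a cache-line boundary. */
static size_t cache_lines_for(size_t nr_slots, size_t ptr_size)
{
    return (nr_slots * ptr_size + CACHE_LINE - 1) / CACHE_LINE;
}
```

With 8-byte pointers, 8 slots fit in a single line whichever index is used, whereas a table sized for 128 possible CPUs would span 16 lines.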

> I would like to see numbers here

Sadly, like any annoying internet commentator, I am not tooled up to provide any numbers to back up my wild assertions.

> I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.

That is an excellent argument that I did not think of: how migration between cores would interact with the cache locality. I think there are some problems with assuming that locality would naturally increase using VCPU versus a standard ID, however.

Firstly, I believe (though I could be very wrong) that the scheduler will try to preferentially place a process on the same core repeatedly, so it can benefit from L1 caches. If that is the case, VCPU makes fairly little difference in practice: the indexed structure will be repeatedly accessed at the same location, holding the same cache lines hot.

Also, because the differences only matter when tasks migrate cores, by necessity the L1 and (possibly) the L2 caches won't matter. The difference made by caches is thus limited sharply.

> You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article ?

Completely and utterly made up. These are the idle theories of a procrastinating college student.

Extending restartable sequences with virtual CPU IDs

compudj — Tue, 01 Mar 2022 19:27:04 +0000

If an existing struct rseq field needs to change semantics/behavior, then it is not struct rseq anymore, and it would be named something else, and possibly registered through a new system call or with specific flags set when calling sys_rseq. The extensibility scheme for struct rseq is "append only" on purpose, so user-space applications can rely on having the exposed structure content unchanged in future kernels.

An explicit version number that would be expected to change the semantics of existing struct rseq fields whenever it is bumped would not be practical: an application supporting the current version number could not hope to support newer versions until it is recompiled, which is a no-go in terms of backward compatibility of kernel ABIs exposed to user-space.

Extending restartable sequences with virtual CPU IDs

Paf — Tue, 01 Mar 2022 15:38:47 +0000

The - huge - disadvantage is that any version changes must be size modifying. What if there’s a bug or a desire to change the behavior of an existing field? Well, we can’t handle it with versioning unless we want to blow out the size.

Full stop, end of story. Sadness and clever workarounds ensue.

Extending restartable sequences with virtual CPU IDs

farnz — Tue, 01 Mar 2022 13:24:55 +0000

The length of the struct is a version number; each "API level" has a fixed size struct, where you only access the elements that you're supposed to, given the known struct size.

If user space passes in a struct whose size isn't one of the known values, whether it be very large and invalid, or just from a newer kernel version, the kernel rejects it, same as it would if the version number is wrong.

The extensions are ordered by presence in the kernel; if two extensions want extra data, then that's two separate struct fields, and the struct as a whole grows (it's a struct, not a union). The flags tell the kernel which fields are valid.

The advantage of size as opposed to version number is that C makes it easy to get right. I call the syscall with a pointer to the struct, and sizeof(user-space version of struct), and the compiler will assist me in getting it right (failing if I try to fill in fields I don't have, not letting me give a version number that's larger than the struct). All I have to do is ensure that the compiler can see the right definition for the struct, and I'm golden.
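To make the size-as-version scheme concrete, here is a minimal user-space C sketch. The structs, field names, and the demo_call() handler are all hypothetical and only mimic the behaviour described above (append-only growth, unknown sizes rejected); this is not the actual rseq registration path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "API level 1" layout. */
struct demo_v1 {
    uint32_t flags;
    uint32_t cpu;
};

/* Hypothetical "API level 2": append-only, v1 is a strict prefix. */
struct demo_v2 {
    uint32_t flags;
    uint32_t cpu;
    uint32_t node;
    uint32_t pad;
};

/* Stand-in for the kernel side: accept only sizes it knows about and
 * zero-fill the fields an older caller did not provide. */
static int demo_call(const void *uarg, size_t size, struct demo_v2 *out)
{
    if (size != sizeof(struct demo_v1) && size != sizeof(struct demo_v2))
        return -EINVAL;

    memset(out, 0, sizeof(*out));
    memcpy(out, uarg, size);
    return 0;
}
```

The caller passes sizeof() of whatever definition it was compiled against, so an old binary keeps passing the v1 size and the compiler refuses any attempt to fill in a field that its struct definition lacks.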

Extending restartable sequences with virtual CPU IDs

xecycle — Tue, 01 Mar 2022 11:02:57 +0000

Oh, we are "extending" it to fit in a smaller space :)

Extending restartable sequences with virtual CPU IDs

Cyberax — Tue, 01 Mar 2022 10:18:16 +0000

> What happens if a future extension wants two versions of the call, each with some extra data that happens to be the same size?

- Doctor, it hurts when I do that!
- Then don't do it.

Gradual steps, gradual steps.

Extending restartable sequences with virtual CPU IDs

taladar — Tue, 01 Mar 2022 08:25:38 +0000

Is it just me or does that whole business of identifying the version of the struct by its size smell quite a bit?

Why not include a version number in the struct?

What happens if user-space passes in an invalid size (e.g. very large)?

What happens if a future extension wants two versions of the call, each with some extra data that happens to be the same size?

This just seems like a way to implement this that is broken by design.

Extending restartable sequences with virtual CPU IDs

compudj — Tue, 01 Mar 2022 01:34:40 +0000

I agree with your statement about not multiplying the number of semi-useless config options, and this is indeed one of my objectives. So currently CONFIG_VCPU_DOMAIN is only enabled if CONFIG_RSEQ and CONFIG_SMP are enabled, but it is not exposed as an explicit Kconfig option. This means building a kernel without this feature is as simple as configuring with RSEQ=n. And indeed with CONFIG_VCPU_DOMAIN=n, the implementation of task_mm_vcpu_id() uses raw_smp_processor_id(), so if we ever want to expose this as an explicit Kconfig option, we can, but I'm not convinced this is a good idea.
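The compile-time fallback described here could look roughly like the following user-space mock. The names are hypothetical stand-ins (DEMO_VCPU_DOMAIN for the config symbol, mock_raw_cpu_id() for raw_smp_processor_id(), demo_task_vcpu_id() for task_mm_vcpu_id()); it is a sketch of the pattern, not the code from the patch set.

```c
#include <assert.h>

/* Mock of the raw CPU number the task is currently running on. */
static int mock_raw_cpu_id(void)
{
    return 3;
}

#ifdef DEMO_VCPU_DOMAIN
/* Feature enabled: a per-process virtual ID would be handed out here.
 * The constant is a placeholder for a real allocation scheme. */
static int mock_alloc_vcpu_id(void)
{
    return 0;
}

static int demo_task_vcpu_id(void)
{
    return mock_alloc_vcpu_id();
}
#else
/* Feature disabled: the virtual ID is simply the raw CPU number, so
 * callers need no #ifdef of their own. */
static int demo_task_vcpu_id(void)
{
    return mock_raw_cpu_id();
}
#endif
```

Built without -DDEMO_VCPU_DOMAIN, demo_task_vcpu_id() returns the raw CPU number, which is the behaviour the comment describes for a kernel with the feature compiled out.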

It's unclear to me whether there would be any overall gain in disabling this feature on systems that have a user-space that makes use of it. The memory allocators (glibc, tcmalloc) will be some of the first heavy users, so I expect we will end up in a situation where having the VCPU_DOMAIN feature enabled will actually bring performance benefits to the overall system (including user-space) due to improved memory allocation/access locality.

Also, you are presuming that the virtual CPU IDs will bring "negligible" performance benefits on systems with few cores but with a large amount of memory. I would like to see numbers here, because even though a system has a lot of memory, the locality of cache accesses still matters, and I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.

You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article ?

Extending restartable sequences with virtual CPU IDs

calumapplepie — Mon, 28 Feb 2022 22:52:31 +0000

For those ricers who want every last gram of performance, wouldn't it make sense to allow a compilation option disabling the virtual CPU IDs?

Obviously, more compile options are inadvisable in general, but on a system with only a few cores and a fairly large amount of memory, the benefits from virtual CPU ids are negligible, while that 2% performance change doesn't change in relative magnitude. The option wouldn't need to be complex: just make the virtual CPU ID the same as the real one, and update both parts of the structure simultaneously. It's really easy to implement: wrap all the V_CPU logic in a #ifndef, and add a #ifdef assignment from the CPU id to the V_CPU. No distribution will enable it by default, but ricers and embedded programmers probably will.

If you want to be extra nice, this optimization could be automatically enabled at boot on all machines with 8 or fewer cores and no CPU hotplug (e.g., my laptop's quad-core processor with SMT on). That'd be significantly harder than just adding a few #ifdef statements, but it'd deliver a minor speed improvement to a lot of users.

(not a kernel dev, so there might be some egregious flaw in my reasoning)

Extending restartable sequences with virtual CPU IDs

compudj — Mon, 28 Feb 2022 18:28:46 +0000

If an application has few threads, using either TLS or a per-vcpu-id approach should typically provide similar results, except perhaps on NUMA systems: in this situation the per-vcpu-id approach would help reduce the amount of CPU affinity tweaks required to provide good NUMA locality of TLS accesses.

The major use-case where the vcpu-id provides a clear gain is for applications with many threads which run on a subset of the system's cores, through use of cgroup cpusets or sched affinity. This kind of workload is typical of applications running in containers on machines that have a large number of physical cores. In this case, neither a per-core nor a per-thread approach provides an efficient use of the system's memory.

But there are also other scenarios where the virtual CPU IDs improve things. For instance, even a single-threaded application running in a NUMA system can leverage the virtual CPU IDs to make sure all per-vcpu-id data structure accesses are NUMA-node-local.

Extending restartable sequences with virtual CPU IDs

Bigos — Mon, 28 Feb 2022 17:44:34 +0000

Aren't programs that don't use too many threads better served by just using Thread Local Storage? I always thought that restartable sequences were a tool to reduce the number of "aggregators" (e.g. counters) for software that was running more threads than there were CPUs.