Extending restartable sequences with virtual CPU IDs
See the above-linked article for an overview of how restartable sequences work. As a reminder, any thread using restartable sequences must first make use of the rseq() system call to register a special structure with the kernel. That structure is used to point to the rseq_cs structure describing the current critical section (if any); the kernel also ensures that it contains the ID number of the current CPU whenever the thread is running. Consistent with the pattern used in many relatively recent system calls, rseq() requires the caller to also provide the size of the rseq structure being passed in.
That length parameter exists to support future extensions to the system call. New features will generally require new data, increasing the size of the rseq structure. By looking at the size passed by user space, the kernel can tell which version of the rseq() API the calling process expects. When carefully used, this mechanism allows existing system calls to be extended in a way that preserves compatibility with older programs.
That still leaves an open question for programs that need to discover which API version they are dealing with as a way of knowing which features are available. One possibility is to invoke the system call with the most recent version of the structure and fall back to an earlier version if the call fails. Another is to simply have the kernel say which structure size it is prepared to accept. The rseq() patches take the latter approach, making the maximum accepted structure size available via getauxval().
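As a sketch of that discovery step, the probe below asks getauxval() for the rseq feature size. AT_RSEQ_FEATURE_SIZE may be absent from older headers, so a fallback definition is supplied (27, the value used in the uapi patches; treat the exact number as an assumption if your headers differ):

```c
#include <sys/auxv.h>

#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27   /* value from the uapi patches */
#endif

/* Returns the rseq feature size advertised by the kernel, or 0 on
 * kernels that predate the extension mechanism (getauxval() returns
 * zero for AT_* entries it does not know about). */
static unsigned long rseq_feature_size(void)
{
    return getauxval(AT_RSEQ_FEATURE_SIZE);
}
```

A program would branch on the result: zero means only the original 20-byte feature set is available.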
Having added this extension mechanism, the patch set goes on to add two extensions without actually using it. These extensions add two 32-bit values to struct rseq, which does extend its length. But, due to the way that the structure was defined (with 32-byte alignment), it will already have a 32-byte allocated size, even though the (pre-extension) structure only required 20 bytes. That said, user space will still be able to tell whether the new values are supported by looking at the return value from getauxval(). Since the new value (AT_RSEQ_FEATURE_SIZE) did not exist before this patch set showed up, getauxval() will return zero on older kernels.
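The size arithmetic can be checked directly. The structure below is a local demonstration copy of the layout (not the uapi header itself): the pre-extension fields end at byte 20, yet the 32-byte alignment forces the allocated size to 32, leaving room for the two new 32-bit fields.

```c
#include <stdint.h>
#include <stddef.h>

/* Demonstration copy of the struct rseq layout; the real uapi
 * definition lives in <linux/rseq.h>. */
struct rseq_demo {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
	/* New fields added by the patch set: */
	uint32_t node_id;
	uint32_t vm_vcpu_id;
} __attribute__((aligned(32)));

/* The pre-extension structure used only the first 20 bytes... */
_Static_assert(offsetof(struct rseq_demo, node_id) == 20,
	       "pre-extension used size");
/* ...but the 32-byte alignment already forced a 32-byte allocation,
 * so the extended structure fits in the same space. */
_Static_assert(sizeof(struct rseq_demo) == 32, "allocated size");
```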
The first of the new values in struct rseq is called node_id and it contains exactly that: the ID number of the NUMA node on which the current thread is running. This is evidently useful for some memory allocators and, as noted in the patch changelog, supports (in conjunction with the already-present CPU ID) an entirely user-space implementation of getcpu().
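A user-space getcpu() along those lines might look like the sketch below. The rseq area here is a mock structure containing only the two fields of interest, populated by hand rather than by the kernel; in a real program it would be the thread's registered struct rseq, whose fields the kernel keeps current whenever the thread runs.

```c
#include <stdint.h>

/* Mock view of the two struct rseq fields a user-space getcpu()
 * needs; in real code this would be the registered rseq area. */
struct rseq_view {
	uint32_t cpu_id;
	uint32_t node_id;
};

/* Since the kernel updates these fields on every context switch onto
 * a CPU, a plain load observes the values for the CPU the thread is
 * currently running on. */
static int my_getcpu(const volatile struct rseq_view *r,
		     unsigned int *cpu, unsigned int *node)
{
	if (cpu)
		*cpu = r->cpu_id;
	if (node)
		*node = r->node_id;
	return 0;
}
```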
The other new value is a bit further off the beaten path: it is called vm_vcpu_id. Like the cpu_id field in the same structure, it contains an integer ID number identifying the CPU on which the thread is running. But, while cpu_id contains the CPU's ID number as known by the kernel (and the rest of the system), vm_vcpu_id has no connection with the actual CPU number; it is a virtual number managed by the kernel in a process-private number space.
This new CPU ID appears to be aimed at the needs of programs running threads on a relatively small number of CPUs in a large system. Remember that rseq() is aimed at helping programs access per-CPU data structures; such structures usually take the form of an array indexed by the current CPU ID number. That array must be large enough to hold an entry for every CPU in the system, and every entry must be properly initialized and maintained.
That is just part of the task of working with per-CPU data structures. But imagine a smallish program, with a mere dozen threads or so, running on a large server with, say, 128 CPUs. Those threads may migrate over those CPUs as they run, or they may be bound to a specific subset of CPUs; either way, that per-CPU data structure must be set up for all 128 CPUs, which is not particularly efficient. It would be much nicer to match the "per-CPU" array size to the size of the program rather than that of the system it happens to be running on.
That is the purpose of the virtual CPU ID number. These numbers are assigned by the kernel when a thread is scheduled onto a (real) CPU; the kernel takes pains to ensure that all concurrently running threads in the same process have different virtual CPU ID numbers. Those numbers are assigned from their own space, though, and are chosen to be close to zero. That leaves the program with fewer possible CPU numbers to deal with while preserving the benefits of working with per-CPU data structures.
That does raise an interesting question, though: how does an application developer know what the range of possible virtual-CPU numbers is? When asked, Desnoyers explained:
I would expect the user-space code to use some sensible upper bound as a hint about how many per-vcpu data structure elements to expect (and how many to pre-allocate), but have a "lazy initialization" fall-back in case the vcpu id goes up to the number of configured processors - 1.
One might expect the virtual-CPU ID to be bounded by the number of running threads, but the full story is more complicated than that. Using this feature will thus require a bit of additional complexity on the user-space side.
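The lazy-initialization fallback Desnoyers describes could be structured as below: size the slot array for the worst case (the number of configured processors), but allocate a slot only the first time a given virtual CPU ID is observed. The names are illustrative, not from the patch set, and this single-threaded sketch omits the atomic compare-and-swap a multithreaded allocator would need when installing a slot.

```c
#include <stdlib.h>
#include <unistd.h>

struct pool { long counter; };

static struct pool **pools;	/* one slot per possible vcpu id */
static long nr_slots;

/* Pre-size for the upper bound: configured CPUs - 1 is the largest
 * vcpu id the kernel can hand out. */
static int pools_init(void)
{
	nr_slots = sysconf(_SC_NPROCESSORS_CONF);
	pools = calloc(nr_slots, sizeof(*pools));
	return pools ? 0 : -1;
}

/* Allocate a pool lazily, on first use of a given vcpu id; only the
 * handful of ids actually seen by a small program cost any memory. */
static struct pool *pool_get(unsigned int vcpu_id)
{
	if (vcpu_id >= (unsigned long)nr_slots)
		return NULL;
	if (!pools[vcpu_id])
		pools[vcpu_id] = calloc(1, sizeof(struct pool));
	return pools[vcpu_id];
}
```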
Managing these virtual CPU IDs has a potential downside on the kernel side of the API as well: a certain amount of the work must be done in the scheduler's context-switch path, which is one of the hottest and most performance-critical paths in the kernel. Adding overhead there is not welcome. Desnoyers has duly taken a number of steps to minimize that overhead; they are described in this patch changelog. For example, a context switch between two threads of the same program just moves the virtual CPU ID from the outgoing thread to the incoming one, with no atomic operations required. Single-threaded programs are handled specially, and there is a special cache of virtual CPU IDs attached to each run queue which can be used to avoid atomic operations as well.
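The assignment policy itself can be modeled in a few lines: hand out the lowest free bit in a per-process bitmap, so that concurrently running threads always hold distinct, near-zero IDs. This is a user-space illustration of the policy, not the kernel's implementation, which adds the per-run-queue caching and single-threaded shortcuts described in the changelog.

```c
#define MAX_VCPUS 64

static unsigned long long vcpu_bitmap;	/* bit n set => id n in use */

/* Assign the lowest free virtual CPU id, keeping ids close to zero. */
static int vcpu_id_alloc(void)
{
	for (int id = 0; id < MAX_VCPUS; id++) {
		if (!(vcpu_bitmap & (1ULL << id))) {
			vcpu_bitmap |= 1ULL << id;
			return id;
		}
	}
	return -1;			/* all ids in use */
}

/* Release an id when its thread is switched out, making it available
 * to the next thread that runs. */
static void vcpu_id_free(int id)
{
	vcpu_bitmap &= ~(1ULL << id);
}
```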
Benchmarks included in that changelog show that the performance impact of these changes is small in most cases. Whether that will be enough to get the patches past the scheduler maintainers remains to be seen, though; they have yet to comment on this version of the series. Should this mechanism eventually be merged, it will be another tool available to developers looking for the best scalability possible in multithreaded applications.
Index entries for this article: Kernel/Restartable sequences
Posted Feb 28, 2022 17:44 UTC (Mon)
by Bigos (subscriber, #96807)
[Link] (1 responses)
Posted Feb 28, 2022 18:28 UTC (Mon)
by compudj (subscriber, #43335)
[Link]
The major use-case where the vcpu-id provides a clear gain is for applications with many threads which run on a subset of the system's cores, through use of cgroup cpusets or sched affinity. This kind of workload is typical of applications running in containers on machines that have a large number of physical cores. In this case, neither a per-core nor a per-thread approach provides an efficient use of the system's memory.
But there are also other scenarios where the virtual CPU IDs improve things. For instance, even a single-threaded application running in a NUMA system can leverage the virtual CPU IDs to make sure all per-vcpu-id data structure accesses are NUMA-node-local.
Posted Feb 28, 2022 22:52 UTC (Mon)
by calumapplepie (guest, #143655)
[Link] (2 responses)
Obviously, more compile options are inadvisable in general, but on a system with only a few cores and a fairly large amount of memory, the benefits from virtual CPU IDs are negligible, while that 2% performance cost doesn't change in relative magnitude. The option wouldn't need to be complex: just make the virtual CPU ID the same as the real one, and update both parts of the structure simultaneously. It's really easy to implement: wrap all the V_CPU logic in an #ifndef, and add an #ifdef assignment from the CPU ID to the V_CPU. No distribution will enable it by default, but ricers and embedded programmers probably will.
If you want to be extra nice, this optimization could be automatically enabled at boot on all machines with 8 or fewer cores and no CPU hotplug (eg, my laptop's quad-core processor with SMT on). That'd be significantly harder than just adding a few #ifdef statements, but it'd deliver a minor speed improvement to a lot of users.
(not a kernel dev, so there might be some egregious flaw in my reasoning)
Posted Mar 1, 2022 1:34 UTC (Tue)
by compudj (subscriber, #43335)
[Link] (1 responses)
It's unclear to me whether there would be any overall gain in disabling this feature on systems whose user space makes use of it. The memory allocators (glibc, tcmalloc) will be some of the first heavy users, so I expect we will end up in a situation where having the VCPU_DOMAIN feature enabled will actually bring performance benefits to the overall system (including user space) due to improved memory allocation/access locality.

Also you are presuming that the virtual CPU IDs will bring "negligible" performance benefits on systems with few cores but a large amount of memory. I would like to see numbers here, because even though a system has a lot of memory, the locality of cache accesses still matters, and I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.

You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article?
Posted Mar 2, 2022 2:20 UTC (Wed)
by calumapplepie (guest, #143655)
[Link]
It'd be very important how they use VCPU_DOMAIN. If they're indexing into an array of pointers, it wouldn't make a difference for cache locality on a few-core system: 8 cores * 8 bytes/pointer is 64 bytes, conveniently close to where I drew my arbitrary line. Of course, that's assuming they align these arrays to cache lines, but I think that's a perfectly reasonable optimization to expect.
Of course, if they're using them in any other way, then the cache optimizations are up for grabs.
> I would like to see numbers here
Sadly, like any annoying internet commentator, I am not tooled up to provide any numbers to back up my wild assertions.
> I expect that having the memory allocator pools spread over fewer vcpu ids (compared to number of possible cpus) will provide significant gains on workloads that have few concurrently running threads that migrate between cores.
That is an excellent argument that I had not considered: how migration between cores would interact with cache locality. I think there are some problems with assuming that locality would naturally increase using VCPU versus a standard ID, however.
Firstly, I believe (though I could be very wrong) that the scheduler will try to preferentially place a process on the same core repeatedly, so it can benefit from L1 caches. If that is the case, VCPU makes fairly little difference in practice: the indexed structure will be repeatedly accessed at the same location, holding the same cache lines hot.
Also, because the differences only matter when tasks migrate cores, by necessity the L1 and (possibly) the L2 caches won't matter. The difference made by caches is thus limited sharply.
> You are also presuming that virtual CPU IDs bring a 2% performance degradation. Is that number made up or did you benchmark the patch series discussed in the article?
Completely and utterly made up. These are the idle theories of a procrastinating college student.
Posted Mar 1, 2022 8:25 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (11 responses)
Why not include a version number in the struct?
What happens if user-space passes in an invalid size (e.g. very large)?
What happens if a future extension wants two versions of the call, each with some extra data that happens to be the same size?
This just seems like a way to implement this that is broken by design.
Posted Mar 1, 2022 10:18 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
- Doctor, it hurts when I do that!
- Then don't do it.

Gradual steps, gradual steps.
Posted Mar 1, 2022 13:24 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (9 responses)
The length of the struct is a version number; each "API level" has a fixed size struct, where you only access elements that you're supposed to given the known struct size.
If user space passes in a struct whose size isn't one of the known values, whether it be very large and invalid, or just from a newer kernel version, the kernel rejects it, same as it would if the version number is wrong.
The extensions are ordered by presence in the kernel; if two extensions want extra data, then that's two separate struct fields, and the struct as a whole grows (it's a struct, not a union). The flags tell the kernel which fields are valid.
The advantage of size as opposed to version number is that C makes it easy to get right. I call the syscall with a pointer to the struct, and sizeof(user-space version of struct), and the compiler will assist me in getting it right (failing if I try to fill in fields I don't have, not letting me give a version number that's larger than the struct). All I have to do is ensure that the compiler can see the right definition for the struct, and I'm golden.
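The kernel-side half of that scheme is a dispatch on the accepted sizes, each acting as an API level; anything else is rejected, exactly as a bad version number would be. A sketch, with hypothetical sizes (the real rseq values differ):

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical API levels for illustration; each known struct size
 * corresponds to one level. */
#define API_LEVEL_1_SIZE 32	/* original layout */
#define API_LEVEL_2_SIZE 64	/* extended layout */

static int check_struct_len(uint32_t len)
{
	switch (len) {
	case API_LEVEL_1_SIZE:
	case API_LEVEL_2_SIZE:
		return 0;	/* known API level: accept */
	default:
		return -EINVAL;	/* too small, too large, or from the future */
	}
}
```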
Posted Mar 1, 2022 15:38 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (8 responses)
Full stop, end of story. Sadness and clever workarounds ensue.
Posted Mar 1, 2022 19:27 UTC (Tue)
by compudj (subscriber, #43335)
[Link] (6 responses)
An explicit version number that would be expected to change the semantics of existing struct rseq fields whenever it is bumped would not be practical: an application supporting the current version number could not hope to support newer versions until it is recompiled, which is a no-go in terms of backward compatibility of kernel ABIs exposed to user space.
Posted Mar 2, 2022 2:20 UTC (Wed)
by Paf (subscriber, #91811)
[Link] (5 responses)
To the second part: well, yes - you’d have to carry support for multiple versions in the kernel. That’s all it means. Other projects do this all the time.
The opposition to this is just a matter of preferring a new syscall with almost identical semantics or an extra field which changes semantics - which is what would happen if a major deficiency in the semantics were found - to an explicit versioning scheme. And that’s …. It’s a valid preference, though it’s definitely not mine.
I’m not asking you to fight this fight in the kernel, the choice has been made by others, but I do know which side I fall on.
Posted Mar 2, 2022 2:21 UTC (Wed)
by Paf (subscriber, #91811)
[Link]
Posted Mar 2, 2022 11:37 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (1 responses)
Same layout but different semantics can be covered by a flags field (with the kernel rejecting requests where the flags are unknown); this means that the same fields can be interpreted differently by different kernel versions, depending on which flags you set.
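The usual shape of that check, sketched below with illustrative flag values: the kernel masks off the bits it understands and rejects the rest, so a program setting a new-semantics flag on an old kernel fails cleanly instead of being silently misinterpreted.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits; these are the only ones this "kernel"
 * version understands. */
#define KNOWN_FLAGS (0x1u | 0x2u)

static int check_flags(uint32_t flags)
{
	if (flags & ~KNOWN_FLAGS)
		return -EINVAL;	/* unknown semantics requested: reject */
	return 0;
}
```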
Posted Mar 3, 2022 5:02 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Mar 2, 2022 15:59 UTC (Wed)
by compudj (subscriber, #43335)
[Link] (1 responses)
struct rseq is meant to: have its fields populated/read by both the kernel and user-space, be allocated by a single "owner" library (e.g. glibc), and be used by the application executable as well as by various shared objects.
So it's not as simple as having the kernel support various versions, because all users of rseq within a process (main executable and shared libraries) need to agree on its size and feature set, because there is only a single struct rseq per thread.
Therefore, the solution proposed in the patch set exposes the "feature size" supported by the kernel through the auxiliary vector, which allows glibc to allocate enough memory in the per-thread area and register it with the kernel through sys_rseq. This way, all rseq users within the process can agree on the size of the supported rseq feature set by looking at both glibc's __rseq_size and the auxiliary vector rseq feature size.
If multiple struct rseq areas per thread were a possibility, things would be very different, and version numbering would then be workable; but it's been decided otherwise for the sake of keeping the kernel implementation simple and time-bounded.
So independently of the preference for version vs size-based extensibility, a version-based extensibility scheme for struct rseq simply won't work, because all user-space binaries linked into a process need to agree on the layout.
Posted Mar 3, 2022 5:01 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Mar 2, 2022 12:07 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
Posted Mar 1, 2022 11:02 UTC (Tue)
by xecycle (subscriber, #140261)
[Link]