
User-managed concurrency groups

Posted Dec 28, 2021 16:51 UTC (Tue) by ms (subscriber, #41272)
Parent article: User-managed concurrency groups

This feels a lot like green threads, which a lot of languages have (e.g. Go, Erlang). My question is: how does getting the kernel involved help? I've read that one of the reasons green threads have fast context switching is that the kernel isn't involved. About the only benefit I can think of is that you'd gain pre-emptive eviction, which is not nothing, but I'm curious whether it's worth it, or whether there are other benefits too.



User-managed concurrency groups

Posted Dec 28, 2021 17:11 UTC (Tue) by khim (subscriber, #9252) [Link]

> I've read that one of the reasons green threads have fast context switching is that the kernel isn't involved.

Switching to the kernel and back is not that slow (even green threads make syscalls). What's slow is waiting for the kernel to schedule the next thread and let it run.

> My question is: how does getting the kernel involved help?

It's basically a band-aid for the fact that, many years ago, GNU/Linux rejected NGPT and went with NPTL.

If you give each “green thread” what is essentially a dedicated kernel thread, then you can use all the syscalls and libraries that are allowed for regular threads: the parts of the program where “green threads” cooperate work like proper “green threads”, but if you call some “heavy” function, then instead of freezing your whole “green thread” machinery you just take a one-time hit while the kernel saves your bacon and gives you a chance to remove the misbehaving “green thread” from the cooperating pool.

> About the only benefit I can think of is that you'd gain pre-emptive eviction, which is not nothing, but I'm curious whether it's worth it, or whether there are other benefits too.

A proper TLS area is another benefit. In systems (like Windows) where fibers (AKA “green threads”) have their own private storage but share the TLS of the underlying “kernel threads”, it's much easier to mess things up.

Of course that one is possible without kernel help, but you get it for free if you use the kernel thread machinery as a “safety net” for misbehaving fibers.

User-managed concurrency groups

Posted Dec 29, 2021 22:39 UTC (Wed) by nksingh (subscriber, #94354) [Link] (6 responses)

This mechanism, and the very similar Windows UMS one (added in 2009 and ripped out in 2020), help userspace control thread scheduling in the face of arbitrary system calls with arbitrary blocking.

With traditional M:N scheduling like fibers, if the user-threaded code blocks, no code in the app gets control to choose what's going to run next, unless the blocking goes through the userspace threading library. This is a major part of the reason that Go and libuv wrap all of the I/O calls, so that they can control their green-thread scheduling.
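
As a rough illustration of that wrapping, here is a minimal sketch (mine, not Go's or libuv's actual code; the poller type and its park-on-a-channel trick are invented for the example, and it is Linux-only). The wrapped Read never lets the OS thread block in read(2): on EAGAIN it parks the current task, and the only place the runtime ever blocks in the kernel is its own poll loop, so the userspace scheduler stays in control.

// Sketch of the "wrap the I/O call" pattern: Read never blocks the OS thread;
// the only blocking point is the poll loop, so the userspace scheduler keeps
// control whenever data isn't ready yet. Illustrative only.
package greenio

import (
    "sync"
    "syscall"
)

type poller struct {
    epfd    int
    mu      sync.Mutex
    waiters map[int]chan struct{} // fd -> channel closed when fd is readable
}

func newPoller() (*poller, error) {
    epfd, err := syscall.EpollCreate1(0)
    if err != nil {
        return nil, err
    }
    return &poller{epfd: epfd, waiters: make(map[int]chan struct{})}, nil
}

// Read is what the threading library exposes instead of the raw syscall.
// The fd must already be in non-blocking mode.
func (p *poller) Read(fd int, buf []byte) (int, error) {
    for {
        n, err := syscall.Read(fd, buf)
        if err != syscall.EAGAIN {
            return n, err // data was ready (or a real error); nothing blocked
        }
        // Not ready: register interest, then park this task. In a real
        // runtime "park" would switch to another green thread; a channel
        // stands in for that here.
        ch := make(chan struct{})
        p.mu.Lock()
        p.waiters[fd] = ch
        p.mu.Unlock()
        ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fd)}
        err = syscall.EpollCtl(p.epfd, syscall.EPOLL_CTL_ADD, fd, &ev)
        if err != nil && err != syscall.EEXIST {
            return 0, err
        }
        <-ch // woken by the poll loop once the fd is readable
    }
}

// poll is the single place the runtime allows a blocking call into the kernel.
func (p *poller) poll() {
    events := make([]syscall.EpollEvent, 16)
    for {
        n, err := syscall.EpollWait(p.epfd, events, -1)
        if err != nil {
            continue // e.g. EINTR
        }
        for i := 0; i < n; i++ {
            fd := int(events[i].Fd)
            p.mu.Lock()
            if ch, ok := p.waiters[fd]; ok {
                delete(p.waiters, fd)
                close(ch) // unpark the task waiting on this fd
            }
            p.mu.Unlock()
        }
    }
}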

UMS allows such a runtime to effectively get a callback to decide what to do next (e.g. schedule a new lightweight task) when something blocks. This is a great idea if you have a set of syscalls from your task that may or may not block in an unpredictable manner, like a read from the pagecache where you don't know if you'll miss. You can be optimistic, knowing that you'll get a callback if something goes wrong.
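
Here is a toy model of that control flow (purely illustrative; it is not the UMCG or Windows UMS interface, the "kernel notification" is faked by the blocking call itself, and goroutines stand in for OS threads). The scheduler hands tasks to workers and, the moment one of them reports a block, starts another worker so the remaining runnable tasks keep moving.

// Toy model of "get a callback when a worker blocks". Illustrative only:
// goroutines stand in for OS threads, and simBlockingSyscall delivers the
// notification that the kernel would deliver in a real UMS/UMCG setup.
package main

import (
    "fmt"
    "time"
)

// events stands in for the kernel telling the scheduler "this worker just
// blocked"; completions counts finished tasks so main knows when to exit.
var (
    events      = make(chan int, 16)
    completions = make(chan int, 16)
)

// simBlockingSyscall models a system call that turns out to block: the
// scheduler hears about it immediately, then the worker is stuck for d.
func simBlockingSyscall(worker int, d time.Duration) {
    events <- worker
    time.Sleep(d)
}

const numTasks = 5

func main() {
    // Runnable tasks queued up behind the scheduler.
    tasks := make(chan int, numTasks)
    for i := 0; i < numTasks; i++ {
        tasks <- i
    }
    close(tasks)

    // startWorker hands the next runnable task to a fresh "OS thread".
    startWorker := func(id int) {
        go func() {
            task, ok := <-tasks
            if !ok {
                return // nothing left to run
            }
            fmt.Printf("worker %d: running task %d\n", id, task)
            simBlockingSyscall(id, 50*time.Millisecond)
            completions <- task
        }()
    }

    // The scheduler: run one worker, and each time the "kernel" reports a
    // block, immediately start another one so queued tasks keep moving
    // instead of sitting behind the blocked thread.
    startWorker(0)
    go func() {
        next := 1
        for blocked := range events {
            fmt.Printf("scheduler: worker %d blocked; starting worker %d\n", blocked, next)
            startWorker(next)
            next++
        }
    }()

    for i := 0; i < numTasks; i++ {
        fmt.Printf("task %d finished\n", <-completions)
    }
}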

However, using this mechanism requires significant investment in the language ecosystem, as has been done with goroutines. And I don't think there's a massive performance uplift even in the best of cases, but perhaps Google has measured something worthwhile in their workloads.

User-managed concurrency groups

Posted Dec 29, 2021 22:43 UTC (Wed) by nksingh (subscriber, #94354) [Link]

Here's the paper from the early 90s, where this idea was introduced:
https://dl.acm.org/doi/abs/10.1145/146941.146944

It's not clear that the premises have aged well, since threads are quite popular and do actually perform well enough.

User-managed concurrency groups

Posted Dec 29, 2021 23:08 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> However, using this mechanism requires significant investment in the language ecosystem, as has been done with goroutines

This kind of scheduling is not very useful for Go, because it needs to manage stacks for goroutines. Basically, Go reuses a pool of system threads, switching the current stack when needed instead of just letting the thread go to sleep.

User-managed concurrency groups

Posted Dec 29, 2021 23:49 UTC (Wed) by nksingh (subscriber, #94354) [Link] (3 responses)

I think it still helps because Go can't always control all sources of blocking. Think, for instance, about a page fault. Being able to switch stacks and TLS vectors in user mode is independent of being able to know about all sources of blocking. If the Go runtime gets the scheduler activation due to blocking, it can grab another system thread to run any remaining runnable goroutines immediately, rather than waiting for another thread to wake up and notice the blockage.

User-managed concurrency groups

Posted Dec 31, 2021 10:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> I think it still helps because Go can't always control all sources of blocking.

It doesn't really need to.

If a goroutine blocks for some unforeseen reason (because the underlying physical thread is processing a signal, for example), then the queued work (goroutines ready to run) associated with the physical thread will be "stolen" by other threads.

Additionally, if a goroutine is blocked inside a syscall or a C library call, it isn't counted towards the GOMAXPROCS limit, so the Go scheduler can launch a new thread to replace the blocked one.
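
That behaviour is easy to see in a small standalone program: with GOMAXPROCS set to 1, a goroutine stuck in a raw blocking read(2) on a never-written pipe does not stop the main goroutine from running.

// Self-contained demonstration of the point above: with GOMAXPROCS(1), a
// goroutine stuck in a blocking read(2) does not prevent other goroutines
// from running, because the runtime hands its processor slot to another OS
// thread once it notices the block. Linux-only.
package main

import (
    "fmt"
    "runtime"
    "syscall"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // only one goroutine may execute Go code at a time

    // A pipe that is never written to: read(2) on the read end blocks forever.
    var fds [2]int
    if err := syscall.Pipe(fds[:]); err != nil {
        panic(err)
    }

    // This goroutine blocks inside the raw syscall (bypassing the netpoller),
    // pinning one OS thread in the kernel indefinitely.
    go func() {
        buf := make([]byte, 1)
        syscall.Read(fds[0], buf) // never returns
    }()

    // Despite GOMAXPROCS(1), the main goroutine keeps making progress: the
    // blocked goroutine is not counted against the limit while it sits in
    // the kernel.
    for i := 0; i < 3; i++ {
        fmt.Println("still running:", i)
        time.Sleep(100 * time.Millisecond)
    }
}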

It's theoretically possible to end up in a situation where all threads are blocked, but I can't think of how that would actually happen.

User-managed concurrency groups

Posted Dec 31, 2021 16:11 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

The mechanism Go uses for this today is ugly. It must surround every syscall that might block with "entersyscall"/"exitsyscall". "Enter" effectively sets a timer: if "exit" isn't reached within 20μs and there are runnable goroutines waiting, it's assumed that the syscall has blocked and an additional OS thread should be started (or resumed) to let one of those other goroutines run.

Yet a syscall could take longer than that on-CPU (without blocking), in which case you've over-scheduled work relative to the number of CPUs. Alternatively, a syscall might block immediately, in which case you've wasted time during which you could have run another goroutine.

To ameliorate those issues in common cases, there are two other variants of syscall entry: one for invoking a syscall that "never" blocks, and another for syscalls that are considered very likely to block immediately, which resumes another goroutine right away.
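
For concreteness, here is a rough sketch of that guessing in Go (illustrative only, not the runtime's real code; apart from the "entersyscall"/"exitsyscall" bracketing and the 20μs figure described above, the worker, monitor and startThread names and all other details are made up).

// Sketch of the timer-based guess: syscalls are bracketed, and a monitor
// assumes any call still outstanding after ~20μs has blocked, handing the
// worker's queued tasks to another OS thread. Not the Go runtime's real code.
package sched

import (
    "sync/atomic"
    "time"
)

// syscallGrace is the "has it blocked yet?" threshold described above.
const syscallGrace = 20 * time.Microsecond

// worker models one OS thread owned by a green-thread runtime.
type worker struct {
    // 0 while running user code; otherwise the UnixNano timestamp at which
    // the current syscall was entered. Atomic so the monitor can read it.
    syscallStart atomic.Int64
    runq         chan func() // runnable tasks queued behind this worker
}

// enterSyscall/exitSyscall are the brackets placed around every call that
// might block (the names mirror the comment above; the bodies are made up).
func (w *worker) enterSyscall() { w.syscallStart.Store(time.Now().UnixNano()) }
func (w *worker) exitSyscall()  { w.syscallStart.Store(0) }

// monitor is the guessing part: it cannot tell a slow-but-running syscall
// from a blocked one, so it waits out the grace period and, if runnable work
// is still queued behind the worker, hands that work to a fresh OS thread.
func monitor(workers []*worker, startThread func(chan func())) {
    for {
        now := time.Now().UnixNano()
        for _, w := range workers {
            start := w.syscallStart.Load()
            if start != 0 && now-start > int64(syscallGrace) && len(w.runq) > 0 {
                w.syscallStart.Store(0) // don't trigger twice for this call
                startThread(w.runq)     // another thread drains the queue
            }
        }
        time.Sleep(syscallGrace)
    }
}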

This mechanism clearly works, but it really doesn't seem ideal. If the kernel could, instead, notify the Go scheduler when a thread has blocked, all of this guessing could be eliminated.

User-managed concurrency groups

Posted Dec 31, 2021 22:47 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> This mechanism clearly works, but it really doesn't seem ideal. If the kernel could, instead, notify the Go scheduler when a thread has blocked, all of this guessing could be eliminated.

I'm not sure it would help. You still need a scheduler thread, and checking for a stuck thread right now is pretty fast. I guess one thing that might be helpful is the ability to force-preempt threads. Right now Go uses signals for that, and signals have their drawbacks.

