
User-managed concurrency groups

By Jonathan Corbet
December 28, 2021
The kernel's thread model is relatively straightforward and performs reasonably well, but that's not enough for all users. Specifically, there are use cases out there that benefit from a lightweight threading model that gives user space control over scheduling decisions. Back in May 2021, Peter Oskolkov posted a patch set implementing an abstraction known as user-managed concurrency groups, or UMCG. Several revisions later, many observers still lack a clear idea of what this patch is supposed to do, much less whether it is a good idea for the kernel. Things have taken a turn, though, with Peter Zijlstra's reimplementation of UMCG.

One developer reimplementing another's patch set is likely to raise eyebrows. Zijlstra's motivation for doing that work can perhaps be seen in this message, where he notes that the UMCG code looked little like the rest of the scheduler code. He also remarked that it required "reverse engineering" to figure out how UMCG was meant to be used. By the time that work was done, perhaps, it was just easier to recast the code in the form he thought it should take.

In truth, the documentation for UMCG is no better than before — a significant problem for a major proposed addition to the system-call API. But it is possible to dig through the code (and a "pretty rough" test application posted by Zijlstra) to get a sense for what is going on. In short, UMCG calls for a multi-threaded application to divide itself into "server" and "worker" threads, where there is likely to be one server thread for each CPU on the system. Server threads make scheduling decisions, while workers run according to those decisions and get the actual work done. The advantage of a system like UMCG is that scheduling can happen quickly and with little overhead from the kernel — assuming the server threads are properly implemented, of course.

Setting up

UMCG introduces three new system calls and one new structure that handles most of the communication with the kernel. Every thread participating in UMCG must have a umcg_task structure, which looks like this:

    struct umcg_task {
	__u32	state;
	__u32	next_tid;
	__u32	server_tid;
	__u64	runnable_workers_ptr;
	/* ... */
    };

Some fields have been omitted here. Note that this structure, as it will eventually be provided by the C libraries, is likely to look different. The specific fields will be discussed as they become relevant.

The first new system call is umcg_ctl(), which is used to register and unregister threads with the UMCG subsystem:

    int umcg_ctl(unsigned int flags, struct umcg_task *self, clockid_t which_clock);

The flags argument describes the operation to be performed, self is the umcg_task structure corresponding to the current thread, and which_clock controls the clock used for timestamps for this thread.

If flags contains UMCG_CTL_REGISTER, then this call is registering a new thread with the subsystem. There are two alternatives, depending on which type of thread is being registered:

  • If flags contains UMCG_CTL_WORKER, then this is a new worker task. In this case, self->state must be UMCG_TASK_BLOCKED, indicating that the worker is not initially running. The thread ID of the server that will handle this worker must be provided in server_tid.
  • Otherwise, this is a server task. Its initial state must be UMCG_TASK_RUNNING (since it is indeed running) and server_tid must be the calling thread's ID.

Workers and servers must be threads of the same process (more specifically, they must share the same address space). The system call returns zero if all goes well. For workers, though, that return will be delayed, as the calling thread will be blocked until the server schedules it to run. Registering a new worker will cause the indicated server to wake up.
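
Since no C-library wrappers exist yet, registration has to go through syscall() directly. The sketch below shows roughly what registering a server and a worker might look like; the __NR_umcg_ctl number is an assumption (the system calls are not merged), error handling is omitted, and the flag and state names are taken from the description above.

    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    /* Register the calling thread as a UMCG server. */
    static struct umcg_task server_task;

    static void register_server(void)
    {
        server_task.state = UMCG_TASK_RUNNING;      /* servers start out running */
        server_task.server_tid = syscall(SYS_gettid);  /* the calling thread itself */
        syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &server_task, CLOCK_MONOTONIC);
    }

    /* Register the calling thread as a worker owned by the given server;
       this call will not return until that server schedules the worker. */
    static __thread struct umcg_task worker_task;

    static void register_worker(pid_t server_tid)
    {
        worker_task.state = UMCG_TASK_BLOCKED;      /* workers start out blocked */
        worker_task.server_tid = server_tid;
        syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
                &worker_task, CLOCK_MONOTONIC);
    }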

The other thing that happens when a worker is registered is that its state is set to UMCG_TASK_RUNNABLE and it is added to the server's singly-linked list of available workers. The list is implemented using the runnable_workers_ptr field in each task's umcg_task structure. The kernel will push a new task onto the head of the list with a compare-and-exchange operation; the server will normally use a similar operation to take tasks off the list.
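
One plausible way for a server to consume that list, given the structure shown above, is to detach the whole chain with a single atomic exchange (a close relative of the compare-and-exchange operation just mentioned) and then walk it. In this sketch, runnable_workers_ptr is assumed to hold a pointer to the next umcg_task in the chain, and enqueue_locally() is a purely hypothetical per-server run-queue helper; the actual protocol is defined by the patches themselves.

    #include <stdint.h>

    static void drain_runnable_workers(struct umcg_task *server)
    {
        /* Atomically take the whole list, leaving an empty one behind
           for the kernel to push new workers onto. */
        uint64_t head = __atomic_exchange_n(&server->runnable_workers_ptr,
                                            0, __ATOMIC_ACQUIRE);

        while (head) {
            struct umcg_task *worker = (struct umcg_task *)(uintptr_t)head;

            head = worker->runnable_workers_ptr;
            enqueue_locally(worker);    /* hypothetical per-server run queue */
        }
    }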

Scheduling

Most scheduling is done with calls to umcg_wait():

    int umcg_wait(unsigned int flags, unsigned long timeout);

The flags field must be zero in the current patches. The calling thread must be registered as a UMCG thread or the call will fail. If the caller is a worker thread, the timeout must also be zero; this call will suspend execution of the worker and wake the associated server process for the next scheduling decision. If the worker's state is UMCG_TASK_RUNNING (as it should be if the task is running to make this call), it will be set back to UMCG_TASK_RUNNABLE and the task will be added to the server's runnable_workers_ptr list. Thus, for a worker task, a call to umcg_wait() is a way to yield to another thread while remaining runnable.
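
From the worker's side, then, a cooperative yield might look something like this (a sketch only; the __NR_umcg_wait number is an assumption and error handling is omitted):

    /* Give up the CPU but stay runnable; the worker goes back onto its
       server's list and runs again when the server next schedules it. */
    static void worker_yield(void)
    {
        syscall(__NR_umcg_wait, 0, 0UL);    /* flags = 0, timeout = 0 */
    }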

In the case of the server, the usual reason for calling umcg_wait() is to schedule a new worker to run; this is done by setting the worker's thread ID in the next_tid field of the server's umcg_task structure before the call. If this is done, and the indicated thread is a UMCG worker in the UMCG_TASK_RUNNABLE state, it will be queued to run. The server, instead, will be blocked until either some sort of wakeup event happens or the specified timeout (if it is not zero) expires.

One important detail is that the kernel, once it successfully wakes the new worker thread, will set the server's next_tid field to zero. That allows the server to quickly check, on return from umcg_wait(), whether the thread was actually scheduled or not.
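
Putting those pieces together, a server's main loop might look roughly like the sketch below, reusing the drain_runnable_workers() sketch from above. The pick_next_worker() and handle_failed_schedule() functions stand in for whatever policy the application implements (both are hypothetical), and __NR_umcg_wait is again an assumption.

    static void server_loop(struct umcg_task *server)
    {
        for (;;) {
            /* Application-defined policy: returns the thread ID of the
               worker to run next, or zero if there is nothing to run. */
            pid_t next = pick_next_worker();

            server->next_tid = next;
            syscall(__NR_umcg_wait, 0, 0UL);    /* block until a wakeup event */

            /* The kernel clears next_tid once the worker has actually been
               started; if it is still set, the scheduling attempt failed. */
            if (server->next_tid != 0)
                handle_failed_schedule(server);

            /* Pull in any workers that became runnable while we slept. */
            drain_runnable_workers(server);
        }
    }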

There are a few events that will cause a server to wake. If the current running worker blocks in a system call, for example, its state will be changed to UMCG_TASK_BLOCKED; the server can detect this by looking at the (previously) running worker's umcg_task structure. As noted above, a new task becoming runnable will cause a wakeup. If your editor's reading of the code is correct, there does not currently appear to be a way to notify the server that a worker task has exited entirely.

Preemption

The timeout parameter to umcg_wait() can be used by server threads to implement forced preemption after a worker has run for a period of time. If umcg_wait() returns ETIMEDOUT, the server knows that the current worker has been running for the full timeout period; the server may then choose to make it surrender the CPU. That is done in a two-step process, the first of which is to add the UMCG_TF_PREEMPT flag to the running worker's state field (again, using a compare-and-exchange operation). Then a call should be made to the third new system call:

    int umcg_kick(unsigned int flags, pid_t tid);

Where flags must be zero and tid is the thread ID of the worker to be preempted. This call will cause the worker to re-enter the scheduler, at which point the UMCG_TF_PREEMPT flag will be noticed, the worker will be suspended, and it will be placed back onto the server's runnable_workers_ptr list. Once that completes, the server will wake again to schedule a new thread.
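
In code, the two-step sequence might look something like the following sketch; the compare-and-exchange is expressed with GCC atomic builtins, __NR_umcg_kick is an assumption, and a real implementation would also want to verify that the worker is still actually running before setting the flag.

    #include <linux/types.h>    /* for __u32, matching the structure above */

    static void preempt_worker(struct umcg_task *worker, pid_t worker_tid)
    {
        __u32 old_state = __atomic_load_n(&worker->state, __ATOMIC_ACQUIRE);
        __u32 new_state;

        /* Step one: add UMCG_TF_PREEMPT to the worker's state word. */
        do {
            new_state = old_state | UMCG_TF_PREEMPT;
        } while (!__atomic_compare_exchange_n(&worker->state, &old_state,
                                              new_state, 0, __ATOMIC_RELEASE,
                                              __ATOMIC_ACQUIRE));

        /* Step two: kick the worker into the kernel, where the flag will
           be noticed and the worker put back on the runnable list. */
        syscall(__NR_umcg_kick, 0, worker_tid);
    }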

That is pretty much the entirety of the new API at this point. This work is still clearly in an early state, though, and it would not be surprising to see a fair amount of evolution take place before it is considered for merging. UMCG arises out of Google's internal systems and reflects its use case, but there will almost certainly be other use cases for this sort of functionality, and those users have not yet made their needs known. As awareness of this work spreads, that situation can be expected to change.

Oskolkov, meanwhile, has, as one might expect, required some convincing that his work really needed to be rewritten by somebody else or that the new implementation is better. He expressed discomfort with some of the changes, most notably Zijlstra's switch from a single queue of runnable workers to per-server queues. In the end, though, he said "I'm OK with having it your way if all needed features are covered". So it seems fair to assume that Zijlstra's patch reflects the future of this work. Time will tell where it goes from here.

Index entries for this article
Kernel: Scheduler/User-managed concurrency groups
Kernel: User-managed concurrency groups



User-managed concurrency groups

Posted Dec 28, 2021 16:01 UTC (Tue) by ejr (subscriber, #51652) [Link]

Anyone else reading this and thinking of Cilk? The mapping from multi-queue to a single-queue idea is through work stealing.

User-managed concurrency groups

Posted Dec 28, 2021 16:45 UTC (Tue) by khim (subscriber, #9252) [Link]

I find it really strange that the word “fiber” is never mentioned in the context of these patchsets, even though it's clearly mentioned in the very first message.

Basically the whole thing, it seems, rests on the following observation: Windows reserves 12KiB of stack for “Stack Overflow” exception handling — but that's not needed on GNU/Linux, and you may use that same amount of memory to equip each fiber with its own, personal thread (8KiB in the kernel + 4KiB in glibc).

Add a couple of syscalls and now you have fibers which are full-blown threads, so they can be used with tools that work on threads; there is no need for a special “fibers” API (like GetFiberData) and so on.

They are probably still a tiny bit heavier than purely-userspace fibers, but the advantage sounds really compelling: no need to use only fiber-aware libraries, etc… I guess there should be some experience with Windows Fibers which would be relevant when this patchset is considered.

User-managed concurrency groups

Posted Dec 28, 2021 16:51 UTC (Tue) by ms (subscriber, #41272) [Link] (8 responses)

This feels a lot like green threads, which a lot of languages have, e.g. Go, Erlang. My question is: how can it help to get the kernel involved? I've sometimes read that one of the reasons green threads have fast context switching is that the kernel isn't involved. About the only reason I can think of is that you'd gain pre-emptive eviction, which is not nothing, but I'm curious if it's worth it, or if there are other benefits too?

User-managed concurrency groups

Posted Dec 28, 2021 17:11 UTC (Tue) by khim (subscriber, #9252) [Link]

> I've sometimes read one of the reasons green threads have fast context switching is because the kernel isn't involved.

Switching to the kernel and back is not that slow (even green threads do syscalls). What's slow is waiting for the next thread to be scheduled by the kernel and become executable.

> My question is how can it help, getting the kernel involved?

Basically a band-aid for the fact that many years ago GNU/Linux rejected NGPT and went with NPTL.

If you allocate, essentially, a dedicated kernel thread for your “green thread”, then you may use all syscalls and libraries which are allowed for regular threads: parts of the program where “green threads” are cooperating would work like proper “green threads”, but if you call some “heavy” function then, instead of freezing your whole “green thread” machinery, you would just get a one-time hit when the kernel would save your bacon and give you a chance to remove the misbehaving “green thread” from the cooperating pool.

> About the only reason I can think of is you'd gain pre-emptive eviction, which is not nothing, but I'm curious if it's worth it, or if there are other benefits too?

A proper TLS area is another benefit. In systems where (like in Windows) fibers (AKA “green threads”) have their own private storage but share the TLS of their “kernel threads”, it's much easier to mess things up.

Of course that one is possible without kernel help, but you get it for free if you use the kernel thread machinery as a “safety net” for misbehaving fibers.

User-managed concurrency groups

Posted Dec 29, 2021 22:39 UTC (Wed) by nksingh (subscriber, #94354) [Link] (6 responses)

This mechanism, and the very similar Windows UMS one that was added in 2009 and ripped out in 2020, helps userspace control thread scheduling in the face of arbitrary system calls with arbitrary blocking.

With traditional M:N scheduling like fibers, if the user threaded code blocks, no code in the app gets control to choose what's going to run next, unless the blocking is going through the userspace threading library. This is a major part of the reason that Go or LibUV wrap all of the IO calls, so that they can control their green thread scheduling.

UMS allows such a runtime to effectively get a callback to decide what to do next (e.g. schedule a new lightweight task) when something blocks. This is a great idea if you have a set of syscalls from your task that may or may not block in an unpredictable manner, like a read from the pagecache where you don't know if you'll miss. You can be optimistic, knowing that you'll get a callback if something goes wrong.

However using this mechanism requires some significant investment in the language ecosystem, like has been done with GoRoutines. And I don't think there's a massive performance uplift in the best of cases, but perhaps Google has measured something worthwhile in their workloads.

User-managed concurrency groups

Posted Dec 29, 2021 22:43 UTC (Wed) by nksingh (subscriber, #94354) [Link]

Here's the paper from the early 90s, where this idea was introduced:
https://dl.acm.org/doi/abs/10.1145/146941.146944

It's not clear that the premises have aged well, since threads are quite popular and do actually perform well enough.

User-managed concurrency groups

Posted Dec 29, 2021 23:08 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> However using this mechanism requires some significant investment in the language ecosystem, like has been done with GoRoutines

This kind of scheduling is not very useful for Go, because it needs to manage stacks for goroutines. Basically, Go reuses a pool of system threads, switching the current stack when needed instead of just letting the thread go to sleep.

User-managed concurrency groups

Posted Dec 29, 2021 23:49 UTC (Wed) by nksingh (subscriber, #94354) [Link] (3 responses)

I think it still helps because Go can't always control all sources of blocking. Think, for instance, about a page fault. Being able to switch stacks and TLS vectors in usermode is independent of being able to know about all sources of blocking. If the Go runtime gets the scheduler activation due to blocking, it can grab another system thread to run any remaining runnable goroutines immediately, rather than waiting for another thread to wake up and notice the blockage.

User-managed concurrency groups

Posted Dec 31, 2021 10:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> I think it still helps because Go can't always control all sources of blocking.

It doesn't really need to.

If a goroutine blocks for some unforeseen reason (because the underlying physical thread is processing a signal, for example), then the queued work (goroutines ready to run) associated with the physical thread will be "stolen" by other threads.

Additionally, if the goroutine got blocked inside a syscall or a C library call, it won't be counted towards GOMAXPROCS limit, so the Go scheduler will be able to launch a new thread to replace the blocked one.

It's theoretically possible to have a situation where all threads are blocked for some reason, but I can't think of a reason why.

User-managed concurrency groups

Posted Dec 31, 2021 16:11 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

The mechanism go uses for this today is ugly. It must surround every syscall which might block with "entersyscall"/"exitsyscall". "Enter" effectively sets a timer -- if "exit" isn't reached within 20μs and there are runnable goroutines waiting, it's assumed that the syscall blocked and an additional OS thread should be started/resumed to allow one of those other goroutines to run.

Yet, a syscall could take longer than that on-cpu (without blocking), in which case you've over-scheduled work vs number of cpus. Alternately, a syscall might block immediately, in which case you've wasted time, where you could've run another goroutine.

To ameliorate those issues in common cases, there are two other variants of syscall entry: one for invoking a syscall that "never" blocks, and another for syscalls that are considered very likely to block immediately, which resumes another goroutine right away.

This mechanism clearly works, but it really doesn't seem ideal. If the kernel could, instead, notify the go scheduler when a thread has blocked, all this guessing and heuristics could be eliminated.

User-managed concurrency groups

Posted Dec 31, 2021 22:47 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> This mechanism clearly works, but it really doesn't seem ideal. If the kernel could, instead, notify the go scheduler when a thread has blocked, all this guessing and heuristics could be eliminated.

I'm not sure if it would help. You still need a scheduler thread and checking for a stuck thread right now is pretty fast. I guess one thing that might be helpful is ability to force-preempt threads. Right now Go uses signals for that, and signals have their drawbacks.

User-managed concurrency groups

Posted Dec 28, 2021 23:30 UTC (Tue) by camhusmj38 (subscriber, #99234) [Link] (5 responses)

Windows introduced User Mode Scheduling as a similar feature in Windows 7, I believe. Its purpose is to overcome the issue with fibres caused by the fact that a lot of Windows kernel APIs are thread-identity dependent in some way. I have no idea how widespread its use is.

User-managed concurrency groups

Posted Dec 29, 2021 22:24 UTC (Wed) by nksingh (subscriber, #94354) [Link] (4 responses)

UMS was recently ripped out of Windows due to lack of usage. This new Linux mechanism is extremely similar to the Windows one.

User-managed concurrency groups

Posted Dec 30, 2021 18:24 UTC (Thu) by glenn (subscriber, #102223) [Link] (3 responses)

It could be quite challenging for applications to code against this API. A popular userspace library on top of it would help facilitate adoption. A Cilk runtime, as ejr calls out, might be a good fit. Unfortunately, the framework's recent advocate, Intel, appears to have given up on it. Cilk support has also been dropped from gcc8.

What kind of ecosystem grew up around the Windows framework? Did it fail despite an ecosystem?

User-managed concurrency groups

Posted Dec 30, 2021 19:26 UTC (Thu) by nksingh (subscriber, #94354) [Link] (1 responses)

There was concrt and SQL Server. SQL didn't see that big of a gain for their use case at the time, and concrt wasn't heavily adopted.

This was way before nodejs and Go, and the dotnet tasking and async library was still in its infancy.

User-managed concurrency groups

Posted Dec 30, 2021 19:27 UTC (Thu) by camhusmj38 (subscriber, #99234) [Link]

Concrt removed support for UMS pretty quickly. They found no significant performance boost in their use cases either.

User-managed concurrency groups

Posted Dec 30, 2021 22:21 UTC (Thu) by ejr (subscriber, #51652) [Link]

Merely pining for the fjords: https://cilk.mit.edu/

User-managed concurrency groups

Posted Dec 29, 2021 14:18 UTC (Wed) by jezuch (subscriber, #52988) [Link]

Waaait... No ebpf involved??

(But seriously, doesn't making a scheduling decision like this feel like the perfect use case for it?)

User-managed concurrency groups

Posted Dec 30, 2021 13:21 UTC (Thu) by taladar (subscriber, #68407) [Link] (7 responses)

This seems like the kind of feature one platform adds and then nobody uses it because they all need to support half a dozen platforms which don't have anything similar and you would have to restructure your entire application architecture to make use of it.

User-managed concurrency groups

Posted Dec 31, 2021 2:02 UTC (Fri) by wahern (guest, #37304) [Link]

The same could be said of io_uring, at least prior to Microsoft's clone--https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/. (Also, to be fair, io_uring is easier to implement as a compatibility library given the lack of a need to shim old syscalls. Still need to rearchitect applications, though.)

User-managed concurrency groups

Posted Dec 31, 2021 17:58 UTC (Fri) by martinfick (subscriber, #4455) [Link] (2 responses)

This is a serious question, but are there really any other server platforms besides Linux used in the real world anymore? At a scale that matters? If so, are there a half dozen?

User-managed concurrency groups

Posted Jan 3, 2022 11:16 UTC (Mon) by Sesse (subscriber, #53779) [Link] (1 responses)

Windows is highly relevant as a server platform.

User-managed concurrency groups

Posted Jan 3, 2022 14:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Not really. It's been relegated to a few legacy areas (AD controllers, Exchange servers, etc.)

User-managed concurrency groups

Posted Jan 2, 2022 18:15 UTC (Sun) by thoughtpolice (subscriber, #87455) [Link]

Is that really unusual? Linux has tons of functionality that is (broadly speaking) most useful to e.g. service providers or other specialized use cases. People who write cross platform "systems" applications that will utilize this have already had to support/architect their applications to work across broad environments for many years, this is just One More Thing, like io_uring or sandboxing or what have you. The average bog standard utility will probably keep getting written in Python or whatever so it doesn't matter honestly.

In fact, part of the reason this work is useful is that it actually allows *more* code reuse than alternative options on Linux when you want finer control over the scheduling policy — libraries no longer need to be fiber/user-mode-thread aware — and the alternatives require restructuring your applications anyway...

User-managed concurrency groups

Posted Jan 2, 2022 23:26 UTC (Sun) by flussence (guest, #85566) [Link] (1 responses)

That could be said about e.g. user-mode network stacks. I don't get the point myself but people are evidently getting enough mileage out of that to justify having it in Linux.

Sometimes you can't scale vertically with faster hardware or horizontally with more hardware (especially in present times), and then the only option left is to scale inwards by re-examining old software assumptions. Or scaling backwards, by asking for less - but that doesn't lead to cool hacks...

User-managed concurrency groups

Posted Jan 2, 2022 23:29 UTC (Sun) by Wol (subscriber, #4433) [Link]

Or look back in time for a better solution that's been forgotten in all the hype about "new!" "shiny!" (don't get me started!).

Cheers,
Wol

User-managed concurrency groups

Posted Dec 31, 2021 3:13 UTC (Fri) by Ptolemy (subscriber, #153718) [Link]

The restriction that userspace can only manage threads in the same process strictly conforms to the definition of M:N threading. However, for practical purposes, why not consider cross-process thread management? In other words, in userspace, one process could control the scheduling of another process.
Didn't the previous BPF hook also consider exposing the scheduling capability of the kernel to userspace? Therefore, this is a good opportunity to extend kernel scheduling, so that the kernel can support user-defined scheduling algorithms or scheduling strategies.

User-managed concurrency groups

Posted Jan 1, 2022 23:59 UTC (Sat) by alkbyby (subscriber, #61687) [Link] (1 responses)

Indeed, I think what is needed most is for someone from the Google fibers team, perhaps with the help of some expert writers, to post an in-depth description of how this will be used. Otherwise people get all kinds of ideas.

Indeed, this is basically an attempt to implement Go-style "cheap" concurrency in a C setting. With a few twists.

There is a video from the 2013 LPC conference with some background: https://www.youtube.com/watch?v=KXuZi9aeGTw. The slides are sadly gone. (https://blog.linuxplumbersconf.org/2013/ocw/system/presen... linked from https://blog.linuxplumbersconf.org/2013/ocw/events/LPC201...)

First, fibers (the equivalent of goroutines) are regular POSIX threads. I.e. all the usual stuff various C code is used to doing, like calling gettid, sending signals, finding all threads via procfs (GC implementations do that and send signals), all works as before. And gdb will see all the fibers as threads too, without any special modifications. And core dumps have all fibers' stack traces present too, in the usual way. This also implies all the thread-local storage works as usual.

Second, it provides userspace with the ability to schedule CPU time itself, which also includes the ability to limit actual parallelism in some cases. Some of it is the cheapness of scheduling itself: the Linux kernel scheduler has to do all kinds of heuristics which cost CPU cycles and are not always useful. Paul's presentation in 2013 has lots on this (of course these days syscalls are not cheap anymore after all the security crap, but there is hope that madness will end with time). Some of it is limiting actual parallelism in order to help caching effects. I.e. when you have a thread/fiber/goroutine-per-request model, you often do stuff like sending RPCs to your backends. Concurrently. So with "child" fibers. And those child fibers not bouncing around all the hundreds of cores we have these days helps efficiency a lot. There are other uses too, so IMHO whoever represents Google in the discussion should post a decent list with specific examples and numbers when possible. Otherwise people should/will ask why not simply make the kernel scheduler cheap and "right" enough.

And third, there is this essentially scheduler-activations implementation to deal with fibers blocking inside syscalls, so that IO doesn't suck as much as it would otherwise. I.e. this is the classic and well-known challenge of M:N or green threads.

Notably, this is what Google-internal fibers don't have yet (as Peter points out too), as inside Google most processes only do "IO" by RPCing to other services. And pretty much all the actual blocking/sleeping is via the abseil mutex/condition-variable, which has facilities to call into the fiber scheduler. I.e. those functions are weak for a reason: https://github.com/abseil/abseil-cpp/blob/1ae9b71c474628d...

And those services that actually handle local SSDs and disks use proper async IO facilities to do their job. Stuff like blocking on page faults/mm syscalls and whatnot is indeed not that important in practice to schedule around anyway.

But obviously, to get this kind of threading implementation ready for the non-datacenter world, it definitely requires an efficient and general way to deal with blocking syscalls.

Basically, when all this works as intended, we'll simply get a cheaper and faster threading facility that will be not too much slower than goroutines, but doesn't have suck-ful IO even in corner cases.

User-managed concurrency groups

Posted Jan 2, 2022 21:37 UTC (Sun) by ejr (subscriber, #51652) [Link]

Anyone else reading this and thinking of Erlang?

User-managed concurrency groups

Posted Jan 3, 2022 17:48 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

Here we go again: as soon as somebody learns that 1:1 threading is both useful and works in the real world, someone else discovers how much more efficient m:n or even 1:n threading must be. Always coming to a cinema near you as an exciting new movie, for the last 40 years.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds