Restartable sequences and ops vectors
The core idea behind restartable sequences has not changed. An application defines a special region of code that, it is hoped, will run without interruption. This code performs some operation of interest on a per-CPU data structure that can be committed with a single instruction at the end. For example, it may prepare to remove an item from a list, with the final instruction setting a pointer that actually effects this change and makes it visible to other threads running on the same CPU. If the thread is preempted in the middle of this work, it may contend with another thread working on the same data structure. In this case, the kernel will cause the thread to jump to an abort sequence once it runs again; the thread can then clean up and try again (the "restart" part of the name). Most of the time, though, preemption does not happen, and the restartable sequence will implement a per-CPU, atomic operation at high speed.
Restartable sequences have been around for some time and are evidently in use at Google and elsewhere. But there are some rough edges, one of which is debugging. Single-stepping through a restartable sequence with a debugger will be an exercise in frustration, since the sequence will never run uninterrupted and, thus, always abort. Fixing this problem requires the implementation of some way to execute the sequence as a single step.
The solution in the current patch set is a new system call:
int cpu_opv(struct cpu_op *ops, int opcount, int cpu, int flags);
The purpose of this system call is to accept a sequence of operations (an "ops vector") and execute it atomically. Each entry in the ops array is a single operation; the array has a maximum length of sixteen operations. The available operations include comparisons, memory copies, and basic arithmetic. The amount of data that can be operated on is bounded (to limit the maximum execution time of the vector), and all of that data is locked into memory before execution of the ops vector begins. The vector is run on the CPU indicated by cpu; the flags field must be zero in the current implementation.
The ops vector is meant to be used as a fallback when a restartable sequence aborts; it can be run during single-stepping or in any other situation where the sequence itself is unable to complete successfully (it is not a suitable replacement for the sequence entirely; as a system call, it will be quite a bit slower). Users of restartable sequences would thus need to create a second implementation of their algorithm in this new language and run it when the original sequence fails.

This idea, Desnoyers said, came to him in the shower one day. It is, he said, a relatively simple solution to the problem.
This was the point where your editor was unable to resist raising his hand and asking whether, rather than adding yet another interpreter to the kernel, Desnoyers could use the existing BPF language and interpreter. The existing BPF verifier could likely be adapted to the needs of the ops-vector mechanism. Desnoyers replied that BPF carries a lot of weight that is not needed here and, in any case, the ops vector should almost never actually run in real-world use. But then he went on to say that ops vectors could also be employed for simple housekeeping tasks that may not need a full restartable-sequence implementation.
Andy Lutomirski jumped in to say that BPF seemed like a reasonable solution to the problem; the BPF interpreter's context mechanism could be used to manage operands to the vector, for example. Peter Zijlstra pointed out that BPF programs have a large kernel-space context associated with them; a program might have one hundred restartable sequences, which would add up to a lot of overhead.
Lutomirski then said that he has his own version of restartable sequences that he has been working on. Rather than abort when preemption occurs, it aborts when an actual data conflict happens. Single-stepping works in this implementation, he said. Desnoyers replied that such an approach would make the implementation more complex, but Lutomirski said that it is still better than requiring every user to implement their algorithms twice. The slow path will be poorly tested at best, and developers will often get it wrong, he said. Zijlstra replied that there would be a library that would take care of the details for most uses, though, and Ben Herrenschmidt said that only developers who truly care about restartable sequences will deal with things at that level.
Desnoyers moved on to the use cases for restartable sequences — an important topic, since Linus Torvalds has made it clear that he will not merge this code without clear evidence that it will be used. The LTTng tracing code can use this feature for fast user-space tracing across processes, Desnoyers said; he would also like it for his user-space read-copy-update implementation. The jemalloc and GNU C Library malloc() implementations can speed things up with restartable sequences. There is a use case for per-CPU statistics counters. Matthew Wilcox added that the developers of the DPDK user-space driver system also want this mechanism. Herrenschmidt said that, in the end, all of the concurrency issues that apply to the kernel also apply to user space.
The final part of the discussion wandered over various topics, including the details of how multiple, independent users can share a restartable-sequences region and whether maybe the classic BPF interpreter might be a better tool for the ops-vector job than extended BPF. Desnoyers said that he would look into the BPF option; expect the conversation to continue on the mailing lists.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].
| Index entries for this article | |
|---|---|
| Kernel | Restartable sequences |
| Conference | Kernel Summit/2017 |
Posted Oct 31, 2017 9:07 UTC (Tue)
by luto (guest, #39314)
[Link]
Instead, the latest patch series is getting slightly tweaked to be more easily extended down the road, and the event_counter mechanism is being removed. The idea is to merge it for 4.15 without the event_counter and to wait and see if anyone actually needs it. Most use cases of event_counter seem to be using it unnecessarily or are slightly racy. If this turns out to be incorrect, it's easy to add back in.
Posted Oct 31, 2017 11:47 UTC (Tue)
by hkario (subscriber, #94864)
[Link] (1 responses)
j/k
Posted Oct 31, 2017 17:47 UTC (Tue)
by wahern (subscriber, #37304)
[Link]
In any event, such criticism doesn't make it any less useful or attractive....
Posted Oct 31, 2017 14:07 UTC (Tue)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
I'm imagining a syscall that tells the kernel about a memory address containing a per-task pre-emption-disable flag, which the scheduler then checks before de-scheduling one task for another. Obviously disabling pre-emption for a long time would be bad, so the scheduler would need to track the case where a task has excessively prevented pre-emption and then perhaps kill it with something along the lines of SIGXCPU.
Having said that, this is starting to sound like a very coarse futex. So maybe a futex is the logical conclusion?
Posted Nov 1, 2017 20:52 UTC (Wed)
by luto (guest, #39314)
[Link]
disable_user_preemption();
read a swapped-out memory address;
write something to memory somewhere;
enable_user_preemption();

What is supposed to happen?
Posted Nov 1, 2017 4:37 UTC (Wed)
by jthill (subscriber, #56558)
[Link] (1 responses)

I don't understand why it's worth any effort to singlestep these. You can't singlestep through a LL/SC sequence and yet code that uses those isn't regarded as un-debuggable, why do the techniques used to debug those not work acceptably here?
Posted Nov 4, 2017 21:10 UTC (Sat)
by compudj (subscriber, #43335)
[Link]
I'm providing a __rseq_table section with the rseq fast paths in user space, which contains the start address, length, and abort address required by gdb to understand how to better handle single-stepping through those critical sections. Architectures that have to single-step by placing an explicit breakpoint on each instruction will need to use this to tell gdb that the kernel may branch off elsewhere within those critical sections. Those will require improvements to debuggers to properly handle having the kernel branch out of a rseq critical section. It's one thing to improperly single-step an application (skipping instructions), but it's another to make the application hang in a rseq critical section when executed under single-stepping. The first case can be fixed by teaching debuggers about rseq; however, we don't want to cause existing applications to hang because they are single-stepped.

For architectures that provide hardware-assisted single-stepping, the only thing we need is a fallback that allows some kind of progress even though the instruction sequence is being restarted (cpu_opv provides this).

There are also other unlikely scenarios that could cause the instruction sequence to always be restarted: for instance, unexpected page faults, which could be caused by memory protection from a memory-access analyzer tool, or simply by the kernel trapping a fast-path instruction that we would not have expected to trap from a user-space perspective.

Therefore, having a fallback that ensures progress is a very good guarantee of robustness for programs using restartable sequences. Given that some major use cases for rseq are memory allocators, we can expect glibc and other libraries (jemalloc) to start using it under the hood without the applications' knowledge. The last thing we want is for a glibc upgrade to cause applications to hang in a rseq loop when being debugged.
Posted Nov 2, 2017 16:36 UTC (Thu)
by valarauca (guest, #109490)
[Link]
CPU shields, changing interrupt handling, tweaking cgroups, and schedulers cover most of the bases for your hard to hard-ish realtime requirements. This is even the approach the DPDK folks suggest: http://dpdk.org/ml/archives/dev/2014-January/001163.html

The whole malloc use case seems like a reach. Jemalloc does per-NUMA pooling, or you can set that policy. So even if you miss the local lock in your L1/L2, you are only likely to have to reach out to L3. The underlying allocator _forcing_ itself to be rescheduled on another core doesn't strike me as wise; my gut tells me this will end up being worse than faulting off to another NUMA node's L3. I guess there are other single-locking allocators, so maybe(?)

Also, emulating transactional memory in software was bad when C++ did it, and I don't see how shoving it into kernel space will make it any better. Hell, Intel's TSX manages to be slower than praying to the `lock` and `mfence` gods in most scenarios.