Restartable sequences and ops vectors
The core idea behind restartable sequences has not changed. An application defines a special region of code that, it is hoped, will run without interruption. This code performs some operation of interest on a per-CPU data structure that can be committed with a single instruction at the end. For example, it may prepare to remove an item from a list, with the final instruction setting a pointer that actually effects this change and makes it visible to other threads running on the same CPU. If the thread is preempted in the middle of this work, it may contend with another thread working on the same data structure. In this case, the kernel will cause the thread to jump to an abort sequence once it runs again; the thread can then clean up and try again (the "restart" part of the name). Most of the time, though, preemption does not happen, and the restartable sequence will implement a per-CPU, atomic operation at high speed.
Restartable sequences have been around for some time and are evidently in use at Google and elsewhere. But there are some rough edges, one of which is debugging. Single-stepping through a restartable sequence with a debugger will be an exercise in frustration, since the sequence will never run uninterrupted and, thus, always abort. Fixing this problem requires the implementation of some way to execute the sequence as a single step.
The solution in the current patch set is a new system call:
int cpu_opv(struct cpu_op *ops, int opcount, int cpu, int flags);
The purpose of this system call is to accept a sequence of operations (an "ops vector") and execute it atomically. Each entry in the ops array is a single operation; the array has a maximum length of sixteen operations. The available operations include comparisons, memory copies, and basic arithmetic. The amount of data that can be operated on is bounded (to limit the maximum execution time of the vector), and all of that data is locked into memory before execution of the ops vector begins. The vector is run on the CPU indicated by cpu; the flags field must be zero in the current implementation.
The ops vector is meant to be used as a fallback when a restartable sequence aborts; it can be run during single-stepping or in any other situation where the sequence itself is unable to complete successfully (it is not a suitable replacement for the sequence entirely; as a system call, it will be quite a bit slower). Users of restartable sequences would thus need to create a second implementation of their algorithm in this new language and run it when the original sequence fails.

This idea, Desnoyers said, came to him in the shower one day. It is, he said, a relatively simple solution to the problem.
This was the point where your editor was unable to resist raising his hand and asking whether, rather than adding yet another interpreter to the kernel, Desnoyers could use the existing BPF language and interpreter. The existing BPF verifier could likely be adapted to the needs of the ops-vector mechanism. Desnoyers replied that BPF carries a lot of weight that is not needed here and, in any case, the ops vector should almost never actually run in real-world use. But then he went on to say that ops vectors could also be employed for simple housekeeping tasks that may not need a full restartable-sequence implementation.
Andy Lutomirski jumped in to say that BPF seemed like a reasonable solution to the problem; the BPF interpreter's context mechanism could be used to manage operands to the vector, for example. Peter Zijlstra pointed out that BPF programs have a large kernel-space context associated with them; a program might have one hundred restartable sequences, which would add up to a lot of overhead.
Lutomirski then said that he has his own version of restartable sequences that he has been working on. Rather than abort when preemption occurs, it aborts when an actual data conflict happens. Single-stepping works in this implementation, he said. Desnoyers replied that such an approach would make the implementation more complex, but Lutomirski said that it is still better than requiring every user to implement their algorithms twice. The slow path will be poorly tested at best, and developers will often get it wrong, he said. Zijlstra replied that there would be a library that would take care of the details for most uses, though, and Ben Herrenschmidt said that only developers who truly care about restartable sequences will deal with things at that level.
Desnoyers moved on to the use cases for restartable sequences — an important topic, since Linus Torvalds has made it clear that he will not merge this code without clear evidence that it will be used. The LTTng tracing code can use this feature for fast user-space tracing across processes, Desnoyers said; he would also like it for his user-space read-copy-update implementation. The jemalloc and GNU C Library malloc() implementations can speed things up with restartable sequences. There is a use case for per-CPU statistics counters. Matthew Wilcox added that the developers of the DPDK user-space driver system also want this mechanism. Herrenschmidt said that, in the end, all of the concurrency issues that apply to the kernel also apply to user space.
The final part of the discussion wandered over various topics, including the details of how multiple, independent users can share a restartable-sequences region and whether maybe the classic BPF interpreter might be a better tool for the ops-vector job than extended BPF. Desnoyers said that he would look into the BPF option; expect the conversation to continue on the mailing lists.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].
| Index entries for this article | |
|---|---|
| Kernel | Restartable sequences |
| Conference | Kernel Summit/2017 |
Posted Oct 31, 2017 9:07 UTC (Tue)
by luto (guest, #39314)
[Link]
Instead, the latest patch series is getting slightly tweaked to be more easily extended down the road, and the event_counter mechanism is being removed. The idea is to merge it for 4.15 without the event_counter and to wait and see if anyone actually needs it. Most use cases of event_counter seem to be using it unnecessarily or are slightly racy. If this turns out to be incorrect, it's easy to add back in.
Posted Oct 31, 2017 11:47 UTC (Tue)
by hkario (subscriber, #94864)
[Link] (1 responses)
j/k
Posted Oct 31, 2017 17:47 UTC (Tue)
by wahern (subscriber, #37304)
[Link]
In any event, such criticism doesn't make it any less useful or attractive....
Posted Oct 31, 2017 14:07 UTC (Tue)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
I'm imagining a syscall that tells the kernel about a memory address containing a per-task pre-emption-disable flag, which the scheduler then checks before de-scheduling one task for another. Obviously disabling pre-emption for a long time would be bad, so the scheduler would need to track the case where a task has excessively prevented pre-emption and then perhaps kill it with something along the lines of SIGXCPU.
Having said that, this is starting to sound like a very coarse futex. So maybe a futex is the logical conclusion?
Posted Nov 1, 2017 20:52 UTC (Wed)
by luto (guest, #39314)
[Link]
disable_user_preemption();
read a swapped-out memory address;
write something to memory somewhere;
enable_user_preemption();

What is supposed to happen?
Posted Nov 1, 2017 4:37 UTC (Wed)
by jthill (subscriber, #56558)
[Link] (1 responses)

I don't understand why it's worth any effort to singlestep these. You can't singlestep through a LL/SC sequence and yet code that uses those isn't regarded as un-debuggable, why do the techniques used to debug those not work acceptably here?
Posted Nov 4, 2017 21:10 UTC (Sat)
by compudj (subscriber, #43335)
[Link]
I'm providing a __rseq_table section with the rseq fast paths in user space, which contains the start address, length, and abort address required by gdb to understand how to better handle single-stepping through those critical sections. Architectures that have to single-step by placing an explicit breakpoint on each instruction will need to use this to tell gdb that the kernel may branch off elsewhere within those critical sections. Those will require improvements to debuggers to properly handle having the kernel branch out of a rseq critical section. It's one thing to improperly single-step an application (skipping instructions), but it's another to make the application hang in a rseq critical section when executed under single-stepping. The first case can be fixed by teaching debuggers about rseq; however, we don't want to cause existing applications to hang because they are single-stepped.

For architectures that provide hardware-assisted single-stepping, the only thing we need is a fallback that allows some kind of progress even though the instruction sequence is being restarted (cpu_opv provides this).

There are also other unlikely scenarios that could cause the instruction sequence to always be restarted: for instance, unexpected page faults, which could be caused by memory protection from a memory-access analyzer tool, or simply by the kernel trapping a fast-path instruction that we would not have expected to trap from a user-space perspective.

Therefore, having a fallback that ensures progress is a very good guarantee of robustness for programs using restartable sequences. Given that some major use cases for rseq are memory allocators, we can expect glibc and other libraries (jemalloc) to start using it under the hood without the applications' knowledge. The last thing we want is for a glibc upgrade to cause applications to hang in a rseq loop when being debugged.
Posted Nov 2, 2017 16:36 UTC (Thu)
by valarauca (guest, #109490)
[Link]
CPU shields, changing interrupt handling, tweaking cgroups, and schedulers cover most of the bases for your hard to hard-ish realtime requirements. This is even the approach the DPDK folks suggest: http://dpdk.org/ml/archives/dev/2014-January/001163.html

The whole malloc use case seems like a reach. Jemalloc does per-NUMA pooling, or you can set that policy. So even if you miss the local lock in your L1/L2, you are only likely to have to reach out to L3. The underlying allocator _forcing_ itself to be rescheduled on another core doesn't strike me as wise; my gut tells me this will end up being worse than faulting off to another NUMA node's L3. I guess there are other single-locking allocators, so maybe(?)

Also, emulating transactional memory in software was bad when C++ did it, and I don't see how shoving it into kernel space will make it any better. Hell, Intel's TSX manages to be slower than praying to the `lock` and `mfence` gods in most scenarios.