Lockless algorithms for mere mortals
Posted Jul 31, 2020 22:09 UTC (Fri) by itsmycpu (guest, #139639)
In reply to: Lockless algorithms for mere mortals by PaulMcKenney
Parent article: Lockless algorithms for mere mortals
My first impression of the described use cases of memory_order_relaxed is that the situation is worse than I thought. Not because I doubt that there are algorithms that make good use of it (like seqlock), but because it seems really difficult to navigate the potential problems in most cases.

Take, for example, the use of relaxed increments in reference counting. Even if the decrements are non-relaxed, it isn't immediately obvious to me that the increments can be relaxed. I am well aware that, in the handover of a reference from thread A to thread B, there needs to be a moment where an original reference is still in place, and that the atomic value itself is sequential for RMW operations like increment and decrement (which are already expensive enough). Even so, I am not sure that thread B necessarily executes the required barriers after its relaxed increment and before it starts loading values from the referenced object. Without them, the compiler (or the hardware) would seem to be free to move the effective load to before the increment becomes effective.

So the logic that prevents this is probably very tricky (at least from my point of view, unless I'm simply missing some principle that makes verification much easier). My guess would be that the necessary barriers are the ones involved in assuring thread A that it can release its reference. But I'm not aware of a theoretical construct that would guarantee this expectation to always hold, in general, for unknown use cases.
Posted Jul 31, 2020 22:47 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (8 responses)
On the reference count increment, the trick is that although the increment can be relaxed, the act of passing the reference somewhere else must be properly ordered. This proper ordering is often supplied by something else, though, such as thread-creation primitives. Whether this is worthwhile can depend on the CPU.
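A rough C11 sketch of that pattern (not code from the comments above; the names struct obj, shared, producer() and consumer() are invented for illustration): the increment itself is relaxed, while the handoff of the pointer is ordered by a release store paired with an acquire load, and the final decrement carries the ordering needed before freeing.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
            atomic_int refcount;
            int payload;
    };

    static _Atomic(struct obj *) shared;    /* hypothetical handoff slot */

    static void get_ref(struct obj *p)
    {
            /* Relaxed is enough: the caller already holds a reference, so
               the object cannot go away; ordering is supplied by the
               handoff below, not by the increment itself. */
            atomic_fetch_add_explicit(&p->refcount, 1, memory_order_relaxed);
    }

    static void put_ref(struct obj *p)
    {
            /* The final decrement must order all prior accesses to *p
               before the free(); acq_rel on the RMW provides that. */
            if (atomic_fetch_sub_explicit(&p->refcount, 1,
                                          memory_order_acq_rel) == 1)
                    free(p);
    }

    static void *producer(void *arg)
    {
            struct obj *p = malloc(sizeof(*p));

            (void)arg;
            atomic_init(&p->refcount, 1);
            p->payload = 42;
            get_ref(p);                     /* reference for the consumer */
            /* This release store is the "properly ordered" act of passing
               the reference somewhere else. */
            atomic_store_explicit(&shared, p, memory_order_release);
            put_ref(p);                     /* drop the producer's own ref */
            return NULL;
    }

    static void *consumer(void *arg)
    {
            struct obj *p;

            (void)arg;
            /* The acquire load pairs with the producer's release store. */
            while (!(p = atomic_load_explicit(&shared, memory_order_acquire)))
                    ;
            printf("payload = %d\n", p->payload);
            put_ref(p);
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, producer, NULL);
            pthread_create(&b, NULL, consumer, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
    }

Here the producer takes the extra reference before publishing the pointer, so the consumer never has to increment a count it does not already indirectly own.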
Posted Aug 1, 2020 4:24 UTC (Sat) by itsmycpu (guest, #139639) [Link] (7 responses)
For example, if thread A increases the count before passing the pointer, and in advance of dropping its own reference, then of course relaxed is sufficient. If thread B is the one to increase the count, I'd think there needs to be a direct or indirect synchronization point between B's increment and A's decrement; otherwise each increment is atomic, but one can't be sure which happens first. However, that synchronization point doesn't have to be the counter itself. If it isn't, then the question arises whether decrementing can't be relaxed as well. :) Especially if the count reaching zero can be taken as an indication that all read/write access has already been given up.

So yes, I think that's a valid example for relaxed. Had to think about it, though. :-)
Here http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n... you wrote that a compiler is allowed to replace

    while (tmp = atomic_load_explicit(a, memory_order_relaxed))
        do_something_with(tmp);

with

    while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
    }

If that is still the case according to current C/C++ definitions, why don't you argue for the definition of memory_order_relaxed to be refined to match READ_ONCE and WRITE_ONCE? Or are you already?
Posted Aug 1, 2020 12:47 UTC (Sat) by itsmycpu (guest, #139639) [Link] (1 responses)
In 2.3, it describes an atomic_thread_fence with acquire:

> The atomic_thread_fence() function can be used to order multiple sets of accesses, for example, by replacing a series of acquire loads with relaxed loads followed by an atomic_thread_fence(memory_order_acquire)

In its shortness, this sentence suggests the sequence:

1: A series of relaxed loads
2: An acquire fence

However, reading https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence suggests:

1: Reading a single synchronization variable (probably atomic relaxed, as in the example)
2: An acquire fence
3: A series of relaxed or non-atomic loads.

If you don't mind me pointing it out, it's probably more like a typo.
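To make the three steps concrete, here is a minimal C11 sketch (the names data, ready, writer() and reader() are invented): the reader performs a relaxed load of the single synchronization variable, then an acquire fence, and only then the ordinary loads of the payload.

    #include <stdatomic.h>

    static int data;                /* payload, plain (non-atomic)  */
    static atomic_int ready;        /* synchronization variable     */

    void writer(void)
    {
            data = 42;                              /* plain store  */
            atomic_store_explicit(&ready, 1,
                                  memory_order_release);  /* publish */
    }

    int reader(void)
    {
            /* Step 1: relaxed load of the synchronization variable. */
            while (!atomic_load_explicit(&ready, memory_order_relaxed))
                    ;
            /* Step 2: the acquire fence pairs with the writer's release
               store through the relaxed load above. */
            atomic_thread_fence(memory_order_acquire);
            /* Step 3: relaxed or non-atomic loads of the protected data. */
            return data;
    }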
Posted Aug 2, 2020 16:43 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
I have therefore expanded this section to make step 3 explicit. Thank you for pointing this out.
There will be a P2055R1 at some point, but in the meantime you can access the LaTeX at https://github.com/paulmckrcu/WG21-relaxedguide.git.
Posted Aug 1, 2020 14:32 UTC (Sat) by itsmycpu (guest, #139639) [Link] (3 responses)
If the reference counter is incremented relaxed (for common use cases), that implies that read/write access is synchronized separately. So I'd expect that decrementing can be relaxed as well, as long as the thread that encounters a zero reference count still has read/write access (otherwise it should be enough to use a simple load-acquire on the synchronization variable).

However, the (separate) synchronization of read/write access, which can't be relaxed, probably makes the gain look small in comparison, even on those platforms where it is larger. Which makes me wonder how large it actually is.
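For comparison, a commonly cited decrement pattern (a sketch, not something from this thread; struct obj and obj_put() are invented names) uses a release decrement plus an acquire fence taken only by the thread that drops the count to zero. A fully relaxed decrement is usually considered safe only when some other mechanism already orders all accesses before the free, which is exactly the separate synchronization being discussed here.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct obj {
            atomic_int refcount;
            /* payload fields would go here */
    };

    void obj_put(struct obj *p)
    {
            /* Release: this thread's earlier accesses to *p happen before
               the decrement becomes visible to the thread that frees. */
            if (atomic_fetch_sub_explicit(&p->refcount, 1,
                                          memory_order_release) == 1) {
                    /* Acquire: only the thread that saw the count reach
                       zero pays for this fence before tearing down. */
                    atomic_thread_fence(memory_order_acquire);
                    free(p);
            }
    }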
Posted Aug 2, 2020 16:45 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)
Of course, the results will likely vary not only across CPU families, but also within CPU families.
Posted Aug 2, 2020 17:05 UTC (Sun) by itsmycpu (guest, #139639) [Link] (1 responses)
Posted Aug 2, 2020 17:27 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
There is a lot of existing code for doing such measurements, but on the other hand, creating your own can be quite instructive.
Posted Aug 2, 2020 16:56 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
It turns out that in C, all the atomic operations are already volatile, including atomic_load_explicit(), which is then pretty close to READ_ONCE().
In contrast, in C++, atomic_load_explicit() can take either a volatile or a non-volatile pointer, which means that use of C++ atomic_load_explicit() on non-volatile pointers (the common case) will be subject to the transformation called out in N4444. This working paper is for C++ rather than for C, in case you were wondering why it leaves out the C-language case.
And yes, JF and I are already calling for something like READ_ONCE() and WRITE_ONCE() in C++: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p...
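For readers wondering what READ_ONCE()-like behavior amounts to, here is a simplified sketch of the underlying idea (not the kernel's exact definitions, which have grown additional machinery over time): a load or store through a volatile-qualified lvalue, which the compiler may neither fuse, elide, nor re-execute. The names flag, wait_for_flag() and set_flag() are invented for the usage example.

    /* Simplified sketches of the idea behind the kernel's READ_ONCE() and
       WRITE_ONCE(); __typeof__ is a GCC/Clang extension, as used in the
       kernel itself.  In the kernel's memory model, such a volatile access
       stands in for a relaxed atomic access. */
    #define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

    static int flag;

    void wait_for_flag(void)
    {
            /* The volatile access forces one load per iteration, so the
               compiler cannot fuse the loads or hoist them out of the loop
               the way it may for a non-volatile relaxed atomic in C++. */
            while (!READ_ONCE(flag))
                    ;
    }

    void set_flag(void)
    {
            WRITE_ONCE(flag, 1);
    }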