Lockless algorithms for mere mortals
Posted Jul 31, 2020 22:09 UTC (Fri) by itsmycpu (guest, #139639)
In reply to: Lockless algorithms for mere mortals by PaulMcKenney
Parent article: Lockless algorithms for mere mortals
My first impression of the described use cases of memory_order_relaxed is that the situation is worse than I thought. Not because I doubt that there are algorithms that make good use of it (like seqlock), but because it seems really difficult to navigate the potential problems in most cases.

Take, for example, the use of relaxed increments in reference counting. Even if the decrements are non-relaxed, it isn't immediately obvious to me that the increments can be relaxed. I am well aware that, in the handover of a reference from thread A to thread B, there needs to be a moment where an original reference is still in place, and that the atomic value itself is sequential for RMW operations like increment and decrement (which are already expensive enough). Even so, I am not sure that thread B necessarily executes the required barriers after its relaxed increment and before it starts loading values from the referenced object. Without them, the compiler (or the hardware) would seem to be free to move the effective load to before the increment becomes effective.

So the logic that prevents this is probably very tricky (at least from my point of view, unless I'm simply missing some principle that makes verification much easier). My guess would be that the necessary barriers are the ones involved in assuring thread A that it can release its reference. But I'm not aware of a theoretical construct that would guarantee this expectation to always hold, in general, for unknown use cases.
Posted Jul 31, 2020 22:47 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (8 responses)
On the reference count increment, the trick is that although the increment can be relaxed, the act of passing the reference somewhere else must be properly ordered. This proper ordering is often supplied by something else, though, such as thread-creation primitives. Whether this is worthwhile can depend on the CPU.
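A rough C11 sketch of that pattern (not code from the comments above; the names struct obj, shared, producer() and consumer() are invented for illustration): the increment itself is relaxed, while the handoff of the pointer is ordered by a release store paired with an acquire load, and the final decrement carries the ordering needed before freeing.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
            atomic_int refcount;
            int payload;
    };

    static _Atomic(struct obj *) shared;    /* hypothetical handoff slot */

    static void get_ref(struct obj *p)
    {
            /* Relaxed is enough: the caller already holds a reference, so
               the object cannot go away; ordering is supplied by the
               handoff below, not by the increment itself. */
            atomic_fetch_add_explicit(&p->refcount, 1, memory_order_relaxed);
    }

    static void put_ref(struct obj *p)
    {
            /* The final decrement must order all prior accesses to *p
               before the free(); acq_rel on the RMW provides that. */
            if (atomic_fetch_sub_explicit(&p->refcount, 1,
                                          memory_order_acq_rel) == 1)
                    free(p);
    }

    static void *producer(void *arg)
    {
            struct obj *p = malloc(sizeof(*p));

            (void)arg;
            atomic_init(&p->refcount, 1);
            p->payload = 42;
            get_ref(p);                     /* reference for the consumer */
            /* This release store is the "properly ordered" act of passing
               the reference somewhere else. */
            atomic_store_explicit(&shared, p, memory_order_release);
            put_ref(p);                     /* drop the producer's own ref */
            return NULL;
    }

    static void *consumer(void *arg)
    {
            struct obj *p;

            (void)arg;
            /* The acquire load pairs with the producer's release store. */
            while (!(p = atomic_load_explicit(&shared, memory_order_acquire)))
                    ;
            printf("payload = %d\n", p->payload);
            put_ref(p);
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, producer, NULL);
            pthread_create(&b, NULL, consumer, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
    }

Here the producer takes the extra reference before publishing the pointer, so the consumer never has to increment a count it does not already indirectly own.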
Posted Aug 1, 2020 4:24 UTC (Sat) by itsmycpu (guest, #139639) [Link] (7 responses)
For example, if thread A increases the count before passing the pointer, and in advance of dropping its own reference, then of course relaxed is sufficient. If thread B is the one to increase the count, I'd think there needs to be a direct or indirect synchronization point between B's increment and A's decrement; otherwise each increment is atomic, but one can't be sure which happens first. However, that synchronization point doesn't have to be the counter itself. If it isn't, then the question arises whether decrementing can't be relaxed as well. :) Especially if the count reaching zero can be taken as an indication that all read/write access has already been given up.

So yes, I think that's a valid example for relaxed. Had to think about it, though. :-)
Here http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n... you wrote that a compiler is allowed to replace

    while (tmp = atomic_load_explicit(a, memory_order_relaxed))
        do_something_with(tmp);

with

    while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
    }

If that is still the case according to current C/C++ definitions, why don't you argue for the definition of memory_order_relaxed to be refined to match READ_ONCE and WRITE_ONCE? Or are you already?
Posted Aug 1, 2020 12:47 UTC (Sat) by itsmycpu (guest, #139639) [Link] (1 responses)
In 2.3, it describes an atomic_thread_fence with acquire:

> The atomic_thread_fence() function can be used to order multiple sets of accesses, for example, by replacing a series of acquire loads with relaxed loads followed by an atomic_thread_fence(memory_order_acquire)

In its shortness, this sentence suggests the sequence:

1: A series of relaxed loads
2: An acquire fence

However, reading https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence suggests:

1: Reading a single synchronization variable (probably atomic relaxed, as in the example)
2: An acquire fence
3: A series of relaxed or non-atomic loads.

If you don't mind me pointing it out, it's probably more like a typo.
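To make the three steps concrete, here is a minimal C11 sketch (the names data, ready, writer() and reader() are invented): the reader performs a relaxed load of the single synchronization variable, then an acquire fence, and only then the ordinary loads of the payload.

    #include <stdatomic.h>

    static int data;                /* payload, plain (non-atomic)  */
    static atomic_int ready;        /* synchronization variable     */

    void writer(void)
    {
            data = 42;                              /* plain store  */
            atomic_store_explicit(&ready, 1,
                                  memory_order_release);  /* publish */
    }

    int reader(void)
    {
            /* Step 1: relaxed load of the synchronization variable. */
            while (!atomic_load_explicit(&ready, memory_order_relaxed))
                    ;
            /* Step 2: the acquire fence pairs with the writer's release
               store through the relaxed load above. */
            atomic_thread_fence(memory_order_acquire);
            /* Step 3: relaxed or non-atomic loads of the protected data. */
            return data;
    }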
Posted Aug 2, 2020 16:43 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
I have therefore expanded this section to make step 3 explicit. Thank you for pointing this out.
There will be a P2055R1 at some point, but in the meantime you can access the LaTeX at https://github.com/paulmckrcu/WG21-relaxedguide.git.
Posted Aug 1, 2020 14:32 UTC (Sat) by itsmycpu (guest, #139639) [Link] (3 responses)
If the reference counter is incremented relaxed (for common use cases), that implies that read/write access is synchronized separately. So I'd expect that decrementing can be relaxed as well, as long as the thread that encounters a zero reference count still has read/write access (otherwise it should be enough to use a simple load-acquire on the synchronization variable).

However, the (separate) synchronization of read/write access, which can't be relaxed, probably makes the gain look small in comparison, even on those platforms where it is larger. Which makes me wonder how large it actually is.
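For comparison, a commonly cited decrement pattern (a sketch, not something from this thread; struct obj and obj_put() are invented names) uses a release decrement plus an acquire fence taken only by the thread that drops the count to zero. A fully relaxed decrement is usually considered safe only when some other mechanism already orders all accesses before the free, which is exactly the separate synchronization being discussed here.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct obj {
            atomic_int refcount;
            /* payload fields would go here */
    };

    void obj_put(struct obj *p)
    {
            /* Release: this thread's earlier accesses to *p happen before
               the decrement becomes visible to the thread that frees. */
            if (atomic_fetch_sub_explicit(&p->refcount, 1,
                                          memory_order_release) == 1) {
                    /* Acquire: only the thread that saw the count reach
                       zero pays for this fence before tearing down. */
                    atomic_thread_fence(memory_order_acquire);
                    free(p);
            }
    }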
Posted Aug 2, 2020 16:45 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)
Of course, the results will likely vary not only across CPU families, but also within CPU families.
Posted Aug 2, 2020 17:05 UTC (Sun) by itsmycpu (guest, #139639) [Link] (1 responses)
Posted Aug 2, 2020 17:27 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
There is a lot of existing code for doing such measurements, but on the other hand, creating your own can be quite instructive.
Posted Aug 2, 2020 16:56 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link]
It turns out that in C, all the atomic operations are already volatile, including atomic_load_explicit(), which is then pretty close to READ_ONCE().
In contrast, in C++, atomic_load_explicit() can take either a volatile or a non-volatile pointer, which means that use of C++ atomic_load_explicit() on non-volatile pointers (the common case) will be subject to the transformation called out in N4444. This working paper is for C++ rather than for C, in case you were wondering why it leaves out the C-language case.
And yes, JF and I are already calling for something like READ_ONCE() and WRITE_ONCE() in C++: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p...
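For readers wondering what READ_ONCE()-like behavior amounts to, here is a simplified sketch of the underlying idea (not the kernel's exact definitions, which have grown additional machinery over time): a load or store through a volatile-qualified lvalue, which the compiler may neither fuse, elide, nor re-execute. The names flag, wait_for_flag() and set_flag() are invented for the usage example.

    /* Simplified sketches of the idea behind the kernel's READ_ONCE() and
       WRITE_ONCE(); __typeof__ is a GCC/Clang extension, as used in the
       kernel itself.  In the kernel's memory model, such a volatile access
       stands in for a relaxed atomic access. */
    #define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

    static int flag;

    void wait_for_flag(void)
    {
            /* The volatile access forces one load per iteration, so the
               compiler cannot fuse the loads or hoist them out of the loop
               the way it may for a non-volatile relaxed atomic in C++. */
            while (!READ_ONCE(flag))
                    ;
    }

    void set_flag(void)
    {
            WRITE_ONCE(flag, 1);
    }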