Wait until all futexes are available?

Posted Jun 20, 2020 1:12 UTC (Sat) by itsmycpu (guest, #139639)
In reply to: Wait until all futexes are available? by glenn
Parent article: Rethinking the futex API

> Has any thought been given to waiting until _all_ the futexes are available?
> Some sorting of mutexes could be imparted to avoid deadlock issues.
> Yes, these locks can be obtained serially in userspace, but this may result in as many as N-1 pointless wake-ups of N futexes are needed.

This problem relates to a specific approach, not to user space solutions in general. The goal is to determine contention without any kernel call at all.

The way you ask this question suggests that your life doesn't depend on the answer.
My suggestion: keep it that way.

Wait until all futexes are available?

Posted Jun 20, 2020 7:25 UTC (Sat) by glenn (subscriber, #102223) [Link] (5 responses)

Perhaps I need to rephrase my question: What can the kernel do to reduce the system overheads associated with nested locking?

Wait until all futexes are available?

Posted Jun 23, 2020 7:12 UTC (Tue) by ncm (guest, #165) [Link] (4 responses)

See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.

Wait until all futexes are available?

Posted Jun 25, 2020 8:39 UTC (Thu) by glenn (subscriber, #102223) [Link] (3 responses)

> The way you ask this question suggests that your life doesn't depend on the answer.

> See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.

Thanks for the life advice. I’ll bring it up with my therapist. Shall we get back to science and engineering?

Let me talk about some challenges I’ve run into developing a pipelined real-time task-graph runtime where I used the futex API directly to manage thread waiting and signalling. My workload is modeled by a set of nodes connected by directed edges. A node is a consumer of input edges, and it is a producer on output edges. Each node encapsulates a unit of work that is backed by a dedicated thread. This thread is ready to run when data is available on _all_ of its input edges.

Being a task-graph system, some nodes are join/fan-in points. These nodes consume data from several edges. How does a producer determine when it should wake the consumer? One method is to use a futex for each edge. The consumer can loop over these futexes until all inputs are satisfied. This may work, but it’s not particularly efficient: it doesn’t scale well (suppose your node has 10s or even 100s of inputs) as a thread may wake up many times before it does any real work. Recall that this is a runtime for a real-time application. These trivial wakeups can preempt other real-time work (e.g., active threads of other nodes) and induce threads to migrate to other CPUs (Linux is very aggressive about migrating preempted real-time threads to available CPUs), inducing cache affinity loss and additional context switch overheads. Also, this wakeup pattern is _very_ problematic for SCHED_DEADLINE. This scheduler allocates budget assuming that the thread in question exhibits a periodic/sporadic execution pattern. Our consumer thread does _not_ exhibit this pattern because it is constantly dealing with trivial wakeups. A SCHED_DEADLINE consumer can exhaust its budget dealing with trivial wakeups before the thread is ready to do any real work! This is due to SCHED_DEADLINE's budget reclaiming algorithm (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt).

An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied. The last producer to satisfy these inputs wakes the consumer. This leads to a more efficient solution, but it is not without its trade-offs. A producer needs insight into the availability of data on the other edges of its consumer. This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.

My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.

In my runtime, I also explored the use of eventfd and epoll. I encountered the same problem: select, poll, and epoll all wake the caller when any one file descriptor is ready, not when all are ready. I am unaware of any synchronization or IPC mechanism in Linux that would let me design my software in a composable way while simultaneously avoiding the problems of trivial wakeups. I haven’t revisited my problem since io_uring landed--perhaps there’s a solution for me there. Does anyone have a suggestion? No need to suggest a therapist. I already have one. ;)

Wait until all futexes are available?

Posted Jun 25, 2020 11:22 UTC (Thu) by itsmycpu (guest, #139639) [Link] (1 responses)

> An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied. The last producer to satisfy these inputs wakes the consumer. This leads to a more efficient solution, but it is not without its trade-offs. A producer needs insight into the availability of data on the other edges of its consumer.

Your situation may have specifics that I am not aware of, so I can only hope that this is going to make sense:

Each producer keeps track of if a consumer has already been informed about the availability of data, in that consumer's list entry. That list entry of course also has a reference to the consumer. The consumer itself has an atomic counter for how many inputs it is still missing. When a producer is able to provide a new input, it reduces the consumer's (atomic) counter by 1. In case it reaches zero at that point, it wakes the consumer.

Instead of a counter, for example an atomic bitset can be used. Then the consumer's list entry contains the specific bit that this producer should set. When a producer sees that it has set the last necessary bit, it wakes the consumer (or just sets that flag if the consumer isn't waiting). This is what I have used in some cases. In my case, the consumer is often doing other work while the flag is set, so that it often does not have to wait, and the whole process takes place in user space.

> This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.

As far as I know, using futexes between processes always requires shared memory, one way or the other. That probably makes them fastest solution when blocking threads is required on the slow path.

(Of course futexes should only be used on the slow path.)

Wait until all futexes are available?

Posted Jun 30, 2020 15:42 UTC (Tue) by ncm (guest, #165) [Link]

I endorse the above. For nodes with a large number of inputs, each bit can represent another bitmap, recursively, until there are enough bits for all the inputs. When any bitmap is filled, the producer sets the bit representing it at the next level up. When the topmost bitmap is filled, the producer that filled it wakes the thread.

Bits are set by compare-and-swap. atomically, with no system call needed.

To do it with a counter, the producer atomically decrements a counter, and wakes the thread when it sees the transition to zero.

Waking the thread might mean writing a byte to a pipe it is sleeping on.

Or, the producer's thread might just add the consumer to a scoreboard of runnable tasks, and some already-running thread will get to it. That way, the kernel is never involved at all. You keep exactly as many threads as you have cores on the job, and they spin when there is no work. Isolate those cores (isolcpus, nohz_full, rcu_nocbs) so the kernel does not steal them.

Wait until all futexes are available?

Posted Jun 27, 2020 7:49 UTC (Sat) by ras (subscriber, #33059) [Link]

> An alternative approach is to have the consumer wait on a single futex.

Waking up precisely one waiter if always the fastest choice.

> This implies that the producers must run within the same process or share some state in shared memory.

Yes, but futex's only work with shared memory. Communicating via shared memory is orders of magnitude faster than another other method. That's where futex's get their speed from. So if you want speed shared memory unavoidable.

If speed is not such a big issue there are as you say lots of choices, including epoll(), pipes, semop()'s ...

> My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.

Of course it would be. Things are always simpler when there is a pre-existing library that does exactly what you want. What triggered my interest here is the claim that library should be in the kernel.