Wait until all futexes are available?
Posted Jun 25, 2020 8:39 UTC (Thu) by glenn (subscriber, #102223)
In reply to: Wait until all futexes are available? by ncm
Parent article: Rethinking the futex API
> See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.
Thanks for the life advice. I’ll bring it up with my therapist. Shall we get back to science and engineering?
Let me talk about some challenges I’ve run into developing a pipelined real-time task-graph runtime where I used the futex API directly to manage thread waiting and signalling. My workload is modeled by a set of nodes connected by directed edges. A node is a consumer of input edges, and it is a producer on output edges. Each node encapsulates a unit of work that is backed by a dedicated thread. This thread is ready to run when data is available on _all_ of its input edges.
Being a task-graph system, some nodes are join/fan-in points that consume data from several edges. How does a producer determine when it should wake the consumer? One method is to use a futex for each edge: the consumer loops over these futexes until all inputs are satisfied. This may work, but it’s not particularly efficient, and it scales poorly (suppose your node has tens or even hundreds of inputs), because a thread may wake up many times before it does any real work.

Recall that this is a runtime for a real-time application. These trivial wakeups can preempt other real-time work (e.g., active threads of other nodes) and induce threads to migrate to other CPUs (Linux is very aggressive about migrating preempted real-time threads to available CPUs), causing cache affinity loss and additional context-switch overhead. This wakeup pattern is also _very_ problematic for SCHED_DEADLINE. That scheduler allocates budget assuming that the thread in question exhibits a periodic/sporadic execution pattern, and our consumer thread does _not_ exhibit this pattern because it is constantly dealing with trivial wakeups. Because of SCHED_DEADLINE's budget-reclaiming algorithm (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt), a SCHED_DEADLINE consumer can exhaust its budget dealing with trivial wakeups before the thread is ready to do any real work!
An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied, and the last producer to satisfy them wakes the consumer. This is more efficient, but it is not without trade-offs. A producer needs insight into the availability of data on the other edges of its consumer, which implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage, and there are security implications as well.
My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.
In my runtime, I also explored the use of eventfd and epoll. I encountered the same problem: select, poll, and epoll all wake the caller when any one file descriptor is ready, not when all are ready. I am unaware of any synchronization or IPC mechanism in Linux that would let me design my software in a composable way while simultaneously avoiding the problems of trivial wakeups. I haven’t revisited my problem since io_uring landed--perhaps there’s a solution for me there. Does anyone have a suggestion? No need to suggest a therapist. I already have one. ;)
