The second step is to look into exposing to userspace the information that currently only the kernel has. That would give userspace the same advantage while avoiding the overhead of the syscalls. One tricky part is choosing where in memory to put this data. Another potential drawback is that it may only work for process-private futexes, so only threads of a single process could use it, whereas the kernel implementation works across both threads and processes.
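
To make the private versus shared distinction concrete, here is a minimal sketch of how the two kinds of futex are selected through the existing Linux `futex(2)` interface. It does not show the proposed userspace-visible data, whose layout and location are still open. The helper names `futex_op`, `futex_wait`, and `futex_wake` are hypothetical wrappers introduced only for illustration; the constants come from `<linux/futex.h>`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: glibc exports no futex() function,
 * so the call goes through syscall(2). No timeout is used here. */
static long futex_op(uint32_t *uaddr, int op, uint32_t val)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Block while *uaddr == expected. With shared == 0 the _PRIVATE
 * variant is used, which only matches waiters in the same process. */
static long futex_wait(uint32_t *uaddr, uint32_t expected, int shared)
{
    int op = shared ? FUTEX_WAIT : FUTEX_WAIT_PRIVATE;
    return futex_op(uaddr, op, expected);
}

/* Wake up to n waiters on uaddr; returns the number actually woken. */
static long futex_wake(uint32_t *uaddr, int n, int shared)
{
    int op = shared ? FUTEX_WAKE : FUTEX_WAKE_PRIVATE;
    return futex_op(uaddr, op, (uint32_t)n);
}
```

A private futex word can live anywhere in the process's address space, while a shared one has to sit in memory mapped by every participating process. That difference is part of why deciding where to place any extra kernel-provided data is harder in the shared case.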