As far as I understand, waiting on a private futex should be faster (by one atomic operation) than waiting on a glibc mutex, right? I ran the wait/wake cycle tens of millions of times and saw no statistically significant difference between the two. Any ideas why that might be?
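For reference, here is a minimal sketch of the kind of private-futex wait/wake cycle I mean. It is not my exact benchmark. The names `futex_wait_private`, `futex_wake_private`, `wait_for_turn`, `pass_turn` and `run_pingpong` are made up for illustration, and it assumes Linux with `<linux/futex.h>` available. Two threads hand a "turn" word back and forth, so each handoff is one wait and one wake.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no futex() wrapper, so go through syscall(2).
 * The _PRIVATE variants tell the kernel the word is not shared
 * across processes. */
static long futex_wait_private(_Atomic uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static long futex_wake_private(_Atomic uint32_t *addr, int nwaiters) {
    return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, nwaiters, NULL, NULL, 0);
}

struct pingpong {
    _Atomic uint32_t turn; /* 0: main thread's turn, 1: partner's turn */
    long iterations;
    long handoffs;         /* written by the main thread only */
};

/* Sleep until *turn == me. FUTEX_WAIT returns immediately if the word
 * no longer equals the value we saw, so a wake-up cannot be lost. */
static void wait_for_turn(_Atomic uint32_t *turn, uint32_t me) {
    uint32_t seen;
    while ((seen = atomic_load(turn)) != me)
        futex_wait_private(turn, seen);
}

/* Publish the new owner, then wake at most one sleeper. */
static void pass_turn(_Atomic uint32_t *turn, uint32_t next) {
    atomic_store(turn, next);
    futex_wake_private(turn, 1);
}

static void *partner(void *arg) {
    struct pingpong *pp = arg;
    for (long i = 0; i < pp->iterations; i++) {
        wait_for_turn(&pp->turn, 1);
        pass_turn(&pp->turn, 0);
    }
    return NULL;
}

/* Runs `iterations` round trips; returns the number completed,
 * or -1 if the partner thread could not be created. */
long run_pingpong(long iterations) {
    struct pingpong pp = { .turn = 0, .iterations = iterations, .handoffs = 0 };
    pthread_t t;
    if (pthread_create(&t, NULL, partner, &pp) != 0)
        return -1;
    for (long i = 0; i < iterations; i++) {
        wait_for_turn(&pp.turn, 0);
        pp.handoffs++;
        pass_turn(&pp.turn, 1);
    }
    pthread_join(t, NULL);
    return pp.handoffs;
}
```

Calling `run_pingpong(n)` between two `clock_gettime(CLOCK_MONOTONIC, ...)` readings gives a per-cycle time; build with `-pthread`. Note that in this sketch every `pass_turn` makes a system call, and so does every `wait_for_turn` that actually has to sleep. The mutex variant I compared against would be the same loop with the handoff built on `pthread_mutex_lock`/`pthread_mutex_unlock` instead.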