I actually implemented conditional user-space spinning in 2.4 some years ago and it had really nice performance. It relied on having a "callboard", i.e. a piece of memory that indicated, for each thread in which you were interested, whether it was running or not. The memory is registered with the kernel, which is responsible for updating it when the process state changes.
So, the idea is that you store the thread ID (or an index for the thread) in the spinlock. When you want the lock, you do a conditional load and store with your tid/thread index. If you get back zero, nobody had the lock and you're done. Otherwise, you use the tid/thread index you got back to check the callboard to see whether the thread holding the lock is running. If so, you loop again. If the thread holding the lock isn't running, you make a system call that sleeps until the spinlock value is zero (you pass the address to check in the system call).
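To make the loop above concrete, here is a rough user-space sketch in C11. It is not the original 2.4 code. Everything named here (`callboard`, `MAX_THREADS`, `cb_spinlock`, `cb_lock`, `cb_trylock`, `cb_unlock`, `wait_until_zero`) is made up for illustration. The "conditional load and store" is modelled as a compare-and-swap, which is an assumption, and the sleeping system call is replaced by a stub that just yields, since the real interface isn't described beyond "sleep until the value at this address is zero". Thread indices start at 1 so that 0 can mean "unlocked".

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 64  /* hypothetical callboard size */

/* Callboard: one flag per thread index, true while that thread is running.
 * In the scheme described, the kernel keeps this up to date; here it is an
 * ordinary array that the program sets itself. */
atomic_bool callboard[MAX_THREADS];

/* Lock word: 0 means free, otherwise the owner's thread index (1-based). */
typedef struct { atomic_int owner; } cb_spinlock;

/* Stand-in for the system call that sleeps until *addr is zero.
 * This version only yields the CPU and re-checks. */
void wait_until_zero(atomic_int *addr)
{
    while (atomic_load_explicit(addr, memory_order_acquire) != 0)
        sched_yield();
}

/* One attempt: returns true if we took the lock. */
bool cb_trylock(cb_spinlock *l, int me)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->owner, &expected, me,
        memory_order_acquire, memory_order_relaxed);
}

void cb_lock(cb_spinlock *l, int me)
{
    for (;;) {
        int seen = 0;
        if (atomic_compare_exchange_weak_explicit(
                &l->owner, &seen, me,
                memory_order_acquire, memory_order_relaxed))
            return;  /* the lock word was zero: we own it now */

        /* On failure, seen holds the owner's index (or 0 if the weak
         * compare-and-swap failed spuriously). */
        if (seen != 0 && !atomic_load_explicit(&callboard[seen],
                                               memory_order_relaxed))
            wait_until_zero(&l->owner);  /* owner not running: block */
        /* Otherwise the owner is running, so spin and try again. */
    }
}

void cb_unlock(cb_spinlock *l, int me)
{
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == me);
    atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

A thread with index 1 would mark `callboard[1]` as running (the kernel's job in the real thing), then bracket its critical section with `cb_lock(&lock, 1)` and `cb_unlock(&lock, 1)`.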
The performance of this was good, but the nicest part was that there wasn't a significant performance drop-off as the number of contenders for the lock went up. I no longer recall whether I was using a 4- or 8-processor machine, but I *think* it was 8. From the caching standpoint, the callboard is read often/written rarely, always a good thing. If your conditional load and store only causes a cache conversion to modified if the write actually happens, you also have good cache behavior there. (When I was working on this, the processor I was using would actually record a cache modification even if the condition wasn't met. Ick.)
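On a processor that dirties the cache line even when the condition isn't met, one common workaround (not something I'm claiming the original code did) is to do a plain load first and only attempt the conditional store when the lock looks free. A minimal sketch, again modelling the operation as a C11 compare-and-swap, with the hypothetical name `try_acquire_read_first`:

```c
#include <stdatomic.h>

/* Try to take a lock word (0 = free) for thread index `me`.
 * Returns 0 on success, otherwise the owner index that was observed. */
int try_acquire_read_first(atomic_int *word, int me)
{
    /* Plain read: waiters can all share the line without writing to it. */
    int seen = atomic_load_explicit(word, memory_order_relaxed);
    if (seen != 0)
        return seen;  /* held: skip the write attempt entirely */

    /* The lock looked free, so only now attempt the conditional store. */
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            word, &expected, me,
            memory_order_acquire, memory_order_relaxed))
        return 0;
    return expected;  /* someone else got in between the load and the store */
}
```

The return value doubles as the index to look up in the callboard, so the spin loop stays read-only for as long as the lock is held.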
The work never got pushed back and nothing ever came of it that I know of, which was kinda sad. Oracle had requested that we do this.