What about separating locks and data?

Posted Jan 4, 2013 18:58 UTC (Fri) by daney (guest, #24551)
In reply to: What about separating locks and data? by renox
Parent article: Improving ticket spinlocks

There are really two issues:

1) Each load from the Now-Serving counter memory location creates traffic on the internal CPU buses. This traffic decreases the amount of useful work that can be done. Since the threads blocked waiting for their Now-Serving number to come up are not doing anything useful, decreasing the amount of bus traffic they are generating makes everything else go faster.

2) Cache line contention/bouncing caused by Ticket-Counter modifications and modifications to the data protected by the lock.

The first issue is (mostly) the one addressed by the patch in question. Increasing the size/alignment of arch_spinlock_t to occupy an entire cache line might be beneficial for some use cases, but it would increase the size of many structures, thus causing increased cache pressure.

What about separating locks and data?

Posted Jan 5, 2013 10:38 UTC (Sat) by renox (guest, #23785) [Link]

Thanks for this informative reply.

For the second issue, what you're describing (having the spinlock occupying an entire cache line) isn't always necessary: in some cases you could put 'cold' data in the same cache line as the lock to get the best performance without using too much memory.

What about separating locks and data?

Posted Jan 10, 2013 18:30 UTC (Thu) by ksandstr (guest, #60862) [Link]

Having considered this topic for a bit, I wonder: given that there are two uses of CAS involved in ticket spinlocks, i.e. one on next-ticket (lock immediately or wait), and one on now-serving (unlock), which is as many as with regular locked/not spinlocks, the issue is clearly the increased _non-local_ write traffic on the data structure under contention. This seems to suggest a solution where spinlocks associated with objects smaller than a cacheline are moved out of the data and into, say, a hashed group keyed like the objects' parent data structure, trading some false contention for space.

That'd protect the significant cachelines from not only write-bouncing from ticket-acquisition, but from any spinlock-related "oops, had to flush this exclusive cache line to RAM in the meantime" cases due to read traffic also. I'm guessing that an operation to acquire locks on N objects without ordering foulups could fit on top of that as well.