User: Password:
Subscribe / Log in / New account

What about separating locks and data?

What about separating locks and data?

Posted Jan 4, 2013 16:54 UTC (Fri) by renox (subscriber, #23785)
Parent article: Improving ticket spinlocks

If the cause of the issue is that the lock is in the same cache line as the data, I wonder if it would be possible to ensure that the lock would be in a different place(different cache line)?

Either the lock would be handled separately or the data would have a pointer to the "real" lock.
It strikes me as quite funny that adding a level of indirection could help performance in this case.

(Log in to post comments)

What about separating locks and data?

Posted Jan 4, 2013 17:04 UTC (Fri) by corbet (editor, #1) [Link]

I'd pondered that a bit as I was writing the article. It would be painful.

If you move the lock out of the structure it is protecting, you have to allocate it in a separate step. The allocation could maybe be hidden in spin_lock_init(), but the freeing would require explicit action (in vast numbers of call sites) or some severe magic. Probably don't want to do that.

Regardless of whether the lock is relocated or not, the only way to avoid cache problems with surrounding data is to expand the lock to fill an entire cache line. That would bloat spinlocks considerably, and there are a lot of spinlocks in a running kernel. That overhead would hurt; among other things, the bloat, alone, would slow things by causing additional cache utilization.

Maybe I'm missing something (I often do), but, from what I can tell, trying to isolate the locks isn't really a viable solution.

What about separating locks and data?

Posted Jan 4, 2013 17:34 UTC (Fri) by renox (subscriber, #23785) [Link]

I agree with you that the management of deallocation is a significant issue/change.

As for the memory bloat, I think that these "separated spinlock" would be useful for only the most contended spinlocks which would somewhat mitigate the issue.
Of course contention isn't the same on all the systems and developers working on embedded or supercomputers would probably disagree on which spinlocks has to be "separated"..

What about separating locks and data?

Posted Jan 5, 2013 22:53 UTC (Sat) by dlang (subscriber, #313) [Link]

There will then be the problem of exactly how much space do you allocate for the spinlock

how do you know what the best cache size is to use on the processor that someone will buy next year and run the binary you compiled today on.

What about separating locks and data?

Posted Jan 5, 2013 23:50 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Aren't there runtime facilities in the kernel for tailoring behavior to the actual cache line size? Something like a cache line allocator?

That would make this kind of memory layout feasible until they invent a more complex form of caching.

What about separating locks and data?

Posted Jan 5, 2013 18:07 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link]

Putting the lock and the data together increases performance at low levels of lock contention. After all, if the lock and data are in separate cache lines, you will take two cache misses, whereas if the lock and data are in the same cache line, you will only take one cache miss.

Of course, and noted earlier in this thread, the opposite holds true at high levels of contention. Life is like that sometimes! ;-)

What about separating locks and data?

Posted Jan 4, 2013 18:58 UTC (Fri) by daney (subscriber, #24551) [Link]

There are really two issues:

1) Each load from the Now-Serving counter memory location creates traffic on the internal CPU buses. This traffic decreases the amount of useful work that can be done. Since the threads blocked waiting for their Now-Serving number to come up are not doing anything useful, decreasing the amount of bus traffic they are generating makes everything else go faster.

2) Cache line contention/bouncing caused by Ticket-Counter modifications and modifications to the data protected by the lock.

The first issue is (mostly) the one addressed by the patch in question. Increasing the size/alignment of arch_spinlock_t to occupy an entire cache line might be beneficial for some use cases, but it would increase the size of many structures, thus causing increased cache pressure.

What about separating locks and data?

Posted Jan 5, 2013 10:38 UTC (Sat) by renox (subscriber, #23785) [Link]

Thanks for this informative reply.

For the second issue, what you're describing (having the spinlock occupying an entire cache line) isn't always necessary: in some cases you could put 'cold' data in the same cache line as the lock to get the best performance without using too much memory.

What about separating locks and data?

Posted Jan 10, 2013 18:30 UTC (Thu) by ksandstr (guest, #60862) [Link]

Having considered this topic for a bit, I wonder: given that there are two uses of CAS involved in ticket spinlocks, i.e. one on next-ticket (lock immediately or wait), and one on now-serving (unlock), which is as many as with regular locked/not spinlocks, the issue is clearly the increased _non-local_ write traffic on the data structure under contention. This seems to suggest a solution where spinlocks associated with objects smaller than a cacheline are moved out of the data and into, say, a hashed group keyed like the objects' parent data structure, trading some false contention for space.

That'd protect the significant cachelines from not only write-bouncing from ticket-acquisition, but from any spinlock-related "oops, had to flush this exclusive cache line to RAM in the meantime" cases due to read traffic also. I'm guessing that an operation to acquire locks on N objects without ordering foulups could fit on top of that as well.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds