For the second issue, what you're describing (having the spinlock occupying an entire cache line) isn't always necessary: in some cases you could put 'cold' data in the same cache line as the lock to get the best performance without using too much memory.