Putting the lock and the data together improves performance at low levels of lock contention. After all, if the lock and the data are in separate cache lines, acquiring the lock and then touching the data can cost you two cache misses, whereas if they share a cache line, you will take only one.
Of course, as noted earlier in this thread, the opposite holds true at high levels of contention. Life is like that sometimes! ;-)
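To make the two layouts concrete, here is a minimal C11 sketch. The names (`counter_colocated`, `counter_split`, `colocated_add`, `same_cache_line`) are made up for illustration, and `CACHE_LINE_SIZE` of 64 bytes is an assumption that fits many current CPUs but not all, so check your hardware. The lock is a toy test-and-set spinlock, not a recommendation.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. Adjust for your CPU. */
#define CACHE_LINE_SIZE 64

/* Low-contention layout: lock and data share one cache line,
 * so acquiring the lock also pulls in the data (one miss). */
struct counter_colocated {
    alignas(CACHE_LINE_SIZE) atomic_flag lock;
    long value;
};

/* High-contention layout: each member starts its own cache line,
 * so CPUs hammering on the lock do not keep stealing the line
 * that the lock holder is trying to update. */
struct counter_split {
    alignas(CACHE_LINE_SIZE) atomic_flag lock;
    alignas(CACHE_LINE_SIZE) long value;
};

/* True if two byte offsets within a cache-line-aligned object
 * fall in the same cache line. */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE_SIZE == b / CACHE_LINE_SIZE;
}

/* Check the layouts at compile time. */
static_assert(offsetof(struct counter_colocated, value) < CACHE_LINE_SIZE,
              "lock and value should share a cache line");
static_assert(offsetof(struct counter_split, value) >= CACHE_LINE_SIZE,
              "value should start on its own cache line");

/* Toy test-and-set spinlock around an update of the co-located counter. */
static void colocated_add(struct counter_colocated *c, long n)
{
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ; /* spin until the lock is free */
    c->value += n;
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}
```

Note that the split layout costs a full extra cache line per counter, which is part of why it only pays off when contention is actually high.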