1) Each load from the Now-Serving counter memory location creates traffic on the internal CPU buses. This traffic decreases the amount of useful work that can be done. Since the threads blocked waiting for their Now-Serving number to come up are not doing anything useful, decreasing the amount of bus traffic they are generating makes everything else go faster.
2) Cache line contention/bouncing caused by Ticket-Counter modifications and modifications to the data protected by the lock.
The first issue is (mostly) the one addressed by the patch in question. Increasing the size/alignment of arch_spinlock_t to occupy an entire cache line might be beneficial for some use cases, but it would increase the size of many structures, thus causing increased cache pressure.