The big kernel lock strikes again
The AIM benchmark attempts to measure system throughput by running a large number of tasks (perhaps thousands of them), each of which is exercising some part of the kernel. Yanmin Zhang reported that his AIM results got about 40% worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the problem; the guilty patch turned out to be the generic semaphores code. Reverting that patch made the performance regression go away - at the cost of restoring over 7,000 lines of old, unlamented code. The thought of bringing back the previous semaphore implementation was enough to inspire a few people to look more deeply at the problem.
It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few years ago. That part of the process was easy - there aren't a whole lot of other semaphores left in the kernel, especially in performance-critical places. But the BKL stubbornly remains in a number of core places, including the fcntl() system call, a number of ioctl() implementations, the TTY code, and open() for char devices. That's enough for a badly-performing BKL to create larger problems, especially when running VFS-heavy benchmarks with a lot of contention.
Ingo Molnar tracked down the problem in the new semaphore code. In short: the new semaphore code is too fair for its own good. When a semaphore is released, and there is another thread waiting for it, the semaphore is handed over to the new thread (which is then made runnable) at that time. This approach ensures that threads obtain the semaphore in something close to the order in which they asked for it.
The problem is that fairness can be expensive. The thread waiting for the semaphore may be on another processor, its cache could be cold, and it might be at a low enough priority that it will not even begin running for some time. Meanwhile, another thread may request the semaphore, but it will get put at the end of the queue behind the new owner, which may not be running yet. The result is a certain amount of dead time where no running thread holds the semaphore. And, in fact, Yanmin's experience with the AIM benchmark showed this: his system was running idle almost 50% of the time.
The solution is to bring in a technique from the older semaphore code: lock stealing. If a thread tries to acquire a semaphore, and that semaphore is available, that thread gets it regardless of whether a different thread is patiently waiting in the queue. Or, in other words, the thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown. In a multiprocessor computer, though, it makes the difference between acceptable and unacceptable performance - even a thread which gets its lock stolen will benefit in the long run.
Interestingly, the patch which implements this change was merged into the mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the revert was that the patch broke semaphores in other situations; for some usage patterns, the semaphore code could fail to wake a thread when the semaphore became available. This bug could certainly have been fixed, but it appears that things will not go that way - there is a bit more going on here.
What is happening instead is that Linus has committed a patch which simply turns the BKL into a spinlock. By shorting out the semaphore code entirely, this patch fixes the AIM regression while leaving the slow (but fair) semaphore code in place. This change also makes the BKL non-preemptible, which will not be entirely good news for those who are concerned with latency issues - especially the real time tree.
The reasoning behind this course of action would appear to be this: both semaphores and the BKL are old, deprecated mechanisms which are slated for minimization (semaphores) or outright removal (BKL) in the near future. Given that, it is not worth adding more complexity back into the semaphore code, which was dramatically simplified for 2.6.26. And, it seems, Linus is happy with a sub-optimal BKL:
So the end result of all this may be a reinvigoration of the effort to
remove the big kernel lock from the kernel. It still is not something
which is likely to happen over the next few kernel releases: there is a lot
of code which can subtly depend on BKL semantics, and there is no way to be
sure that it is safe without auditing it in detail. And that is not a
small job. Alan Cox has been reworking the TTY code for some time, but he
has some ground to cover yet - and the TTY code is only part of the
problem. So the BKL will probably be with us for a while yet.
Index entries for this article | |
---|---|
Kernel | Big kernel lock |
Kernel | Latency |
Kernel | lock_kernel() |
Kernel | Semaphores |
Posted May 15, 2008 2:17 UTC (Thu)
by dwheeler (guest, #1216)
[Link]
Posted May 15, 2008 11:55 UTC (Thu)
by k3ninho (subscriber, #50375)
[Link] (3 responses)
Posted May 15, 2008 14:25 UTC (Thu)
by dmag (guest, #17775)
[Link]
Posted May 16, 2008 17:48 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
How could there possibly be higher throughput with the ordered queue? Because it takes time to figure out who's next in the mess of people? Because that doesn't have an analog in these lock mechanisms.
This analogy suggests a sophisticated optimal way to address the issue. At the kiosk, I don't think this effect actually happens with the ordered queue because you can see your turn coming up, and you get ready. If a waiter for a lock could, shortly before the lock is available, transform from waiting on a semaphore to spinning, he would be ready to use the lock the moment it becomes available but not be able to jump much ahead of his fair ordering. Now if the dispatcher could just reload caches at each dispatch, it would be great.
Posted May 18, 2008 1:18 UTC (Sun)
by vonbrand (subscriber, #4458)
[Link]
For simple queueing models at least (Poisson arrivals, exponential service times) the only thing that matters is the number of customers in the queue, not the order in which they are served.
The big kernel lock strikes again
I like Ingo Molnar's approach to the BKL:
http://lwn.net/Articles/282319/
Basically, it's REALLY hard right now to eliminate the BKL, so Ingo's first step to make it
much easier.
The big kernel lock strikes again
<i>[T]the thread at the head of the queue only gets the semaphore once it starts running and
actually claims it; if it's too slow, somebody else might get there first. In human
interactions, this sort of behavior is considered impolite (in some cultures, at least),
though it is far from unknown.</i>
I can't make my mind up whether comparable systems in real life (e.g. bars or food kiosks)
have greater throughput when there's an ordered queue or a mess of people waiting. I suspect
that having people wait in line allows them to think of other things and not be ready when
their turn comes -- a phenomenon with parallels to the cold caches mentioned above.
K3n.
The big kernel lock strikes again
I think a better analogy would be a restaurant that gives you a buzzer that alerts you when
your table is ready.
Only in this case, the buzzer is really a pager that works anywhere in the world. So customers
will go home or go run errands while waiting. This causes a lot of latency (tables unoccupied
for huge stretches because the 'next' customer is not close by).
You can solve the problem by not having a long range buzzer (i.e. lock waiting programs in
memory to prevent them from being swapped out -- but this would waste memory, since it could
be hours before the resource is ready, and programs that don't need the resource could use the
extra memory), or you could simply use the buzzer to say "the next table free, if you can't
come quickly we'll give it to someone else and buzz you later".
The big kernel lock strikes again
I can't make my mind up whether comparable systems in real life (e.g. bars or food kiosks)
have greater throughput when there's an ordered queue or a mess of people waiting.
I suspect
that having people wait in line allows them to think of other things and not be ready when
their turn comes -- a phenomenon with parallels to the cold caches mentioned above.
The big kernel lock strikes again