The big kernel lock strikes again
The AIM benchmark attempts to measure system throughput by running a large number of tasks (perhaps thousands of them), each of which is exercising some part of the kernel. Yanmin Zhang reported that his AIM results got about 40% worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the problem; the guilty patch turned out to be the generic semaphores code. Reverting that patch made the performance regression go away - at the cost of restoring over 7,000 lines of old, unlamented code. The thought of bringing back the previous semaphore implementation was enough to inspire a few people to look more deeply at the problem.
It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few years ago. That part of the process was easy - there aren't a whole lot of other semaphores left in the kernel, especially in performance-critical places. But the BKL stubbornly remains in a number of core places, including the fcntl() system call, a number of ioctl() implementations, the TTY code, and open() for char devices. That's enough for a badly-performing BKL to create larger problems, especially when running VFS-heavy benchmarks with a lot of contention.
Ingo Molnar tracked down the problem in the new semaphore code. In short: the new semaphore code is too fair for its own good. When a semaphore is released and there is another thread waiting for it, the semaphore is handed over to that waiting thread (which is then made runnable) at that time. This approach ensures that threads obtain the semaphore in something close to the order in which they asked for it.
The problem is that fairness can be expensive. The thread waiting for the semaphore may be on another processor, its cache could be cold, and it might be at a low enough priority that it will not even begin running for some time. Meanwhile, another thread may request the semaphore, but it will get put at the end of the queue behind the new owner, which may not be running yet. The result is a certain amount of dead time where no running thread holds the semaphore. And, in fact, Yanmin's experience with the AIM benchmark showed this: his system was running idle almost 50% of the time.
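To make the dead-time problem concrete, here is a rough sketch of a strictly fair release path, loosely modeled on the 2.6.26 generic semaphore idea; the names (`fair_sem` and friends) are invented for illustration and this is not the actual kernel/semaphore.c code. The key point is that the count is never made available to late arrivals; the semaphore is handed directly to the oldest waiter, which may not run for a while.

```c
/*
 * Sketch of a strictly fair semaphore release.  Illustrative names;
 * not the actual kernel code.
 */
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/sched.h>

struct fair_sem {
	spinlock_t lock;
	unsigned int count;
	struct list_head wait_list;
};

struct fair_sem_waiter {
	struct list_head list;
	struct task_struct *task;
	int up;			/* set when the semaphore has been handed to us */
};

static void fair_sem_up(struct fair_sem *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	if (list_empty(&sem->wait_list)) {
		sem->count++;		/* nobody waiting: make it available */
	} else {
		/*
		 * Hand the semaphore directly to the oldest waiter.  The
		 * count stays at zero, so a thread arriving right now
		 * cannot take it -- that is what makes the scheme fair,
		 * and also what leaves the semaphore idle until the
		 * waiter actually gets scheduled and starts running.
		 */
		struct fair_sem_waiter *waiter =
			list_first_entry(&sem->wait_list,
					 struct fair_sem_waiter, list);
		list_del(&waiter->list);
		waiter->up = 1;
		wake_up_process(waiter->task);
	}
	spin_unlock_irqrestore(&sem->lock, flags);
}
```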
The solution is to bring in a technique from the older semaphore code: lock stealing. If a thread tries to acquire a semaphore, and that semaphore is available, that thread gets it regardless of whether a different thread is patiently waiting in the queue. Or, in other words, the thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown. In a multiprocessor computer, though, it makes the difference between acceptable and unacceptable performance - even a thread which gets its lock stolen will benefit in the long run.
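A lock-stealing release, by contrast, simply makes the semaphore available and wakes the oldest waiter; the acquire path then lets whichever thread gets there first take it. The sketch below reuses the hypothetical struct fair_sem from above and shows only the general shape of the idea; it is not the patch that was actually merged (and later reverted).

```c
/*
 * Sketch of the lock-stealing variant, reusing the illustrative
 * struct fair_sem defined in the previous sketch.
 */
static void stealing_sem_up(struct fair_sem *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	sem->count++;		/* available to whoever gets here first */
	if (!list_empty(&sem->wait_list)) {
		struct fair_sem_waiter *waiter =
			list_first_entry(&sem->wait_list,
					 struct fair_sem_waiter, list);
		/* Wake the oldest waiter, but it still has to race for it. */
		wake_up_process(waiter->task);
	}
	spin_unlock_irqrestore(&sem->lock, flags);
}

/*
 * Fast path of the acquire side: take the semaphore if it is free,
 * regardless of how many threads are already queued behind it.
 */
static int stealing_sem_trydown(struct fair_sem *sem)
{
	unsigned long flags;
	int acquired = 0;

	spin_lock_irqsave(&sem->lock, flags);
	if (sem->count > 0) {
		sem->count--;
		acquired = 1;
	}
	spin_unlock_irqrestore(&sem->lock, flags);
	return acquired;
}
```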
Interestingly, the patch which implements this change was merged into the mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the revert was that the patch broke semaphores in other situations; for some usage patterns, the semaphore code could fail to wake a thread when the semaphore became available. This bug could certainly have been fixed, but it appears that things will not go that way - there is a bit more going on here.
What is happening instead is that Linus has committed a patch which simply turns the BKL into a spinlock. By shorting out the semaphore code entirely, this patch fixes the AIM regression while leaving the slow (but fair) semaphore code in place. This change also makes the BKL non-preemptible, which will not be entirely good news for those who are concerned with latency issues - especially the real time tree.
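Conceptually, a spinlock-based BKL is simple: one global spinlock plus a per-task recursion depth, so that nested lock_kernel() calls from the same task do not deadlock. The following is only a rough sketch along those lines, with the preemption handling and other details of the real lib/kernel_lock.c left out.

```c
/*
 * Rough sketch of a spinlock-based BKL: one global spinlock plus a
 * per-task recursion depth (current->lock_depth starts at -1), so a
 * task may call lock_kernel() recursively.  Not the actual patch.
 */
#include <linux/spinlock.h>
#include <linux/sched.h>

static DEFINE_SPINLOCK(kernel_flag);

void lock_kernel(void)
{
	int depth = current->lock_depth + 1;

	if (likely(!depth))		/* first acquisition by this task */
		spin_lock(&kernel_flag);
	current->lock_depth = depth;
}

void unlock_kernel(void)
{
	BUG_ON(current->lock_depth < 0);
	if (likely(--current->lock_depth < 0))	/* outermost unlock */
		spin_unlock(&kernel_flag);
}
```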
The reasoning behind this course of action would appear to be this: both semaphores and the BKL are old, deprecated mechanisms which are slated for minimization (semaphores) or outright removal (BKL) in the near future. Given that, it is not worth adding more complexity back into the semaphore code, which was dramatically simplified for 2.6.26. And, it seems, Linus is happy with a sub-optimal BKL.
So the end result of all this may be a reinvigoration of the effort to remove the big kernel lock from the kernel. It still is not something which is likely to happen over the next few kernel releases: there is a lot of code which can subtly depend on BKL semantics, and there is no way to be sure that it is safe without auditing it in detail. And that is not a small job. Alan Cox has been reworking the TTY code for some time, but he has some ground to cover yet - and the TTY code is only part of the problem. So the BKL will probably be with us for a while yet.
The big kernel lock strikes again
Posted May 15, 2008 2:17 UTC (Thu) by dwheeler (guest, #1216) [Link]

I like Ingo Molnar's approach to the BKL: http://lwn.net/Articles/282319/

Basically, it's REALLY hard right now to eliminate the BKL, so Ingo's first step is to make it much easier.

The big kernel lock strikes again
Posted May 15, 2008 11:55 UTC (Thu) by k3ninho (subscriber, #50375) [Link] (3 responses)

> [T]he thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown.

I can't make my mind up whether comparable systems in real life (e.g. bars or food kiosks) have greater throughput when there's an ordered queue or a mess of people waiting. I suspect that having people wait in line allows them to think of other things and not be ready when their turn comes -- a phenomenon with parallels to the cold caches mentioned above.

K3n.

The big kernel lock strikes again
Posted May 15, 2008 14:25 UTC (Thu) by dmag (guest, #17775) [Link]

I think a better analogy would be a restaurant that gives you a buzzer that alerts you when your table is ready.

Only in this case, the buzzer is really a pager that works anywhere in the world. So customers will go home or go run errands while waiting. This causes a lot of latency (tables unoccupied for huge stretches because the 'next' customer is not close by).

You can solve the problem by not having a long-range buzzer (i.e. lock waiting programs in memory to prevent them from being swapped out -- but this would waste memory, since it could be hours before the resource is ready, and programs that don't need the resource could use the extra memory), or you could simply use the buzzer to say "the next table is free; if you can't come quickly we'll give it to someone else and buzz you later".

The big kernel lock strikes again
Posted May 16, 2008 17:48 UTC (Fri) by giraffedata (guest, #1954) [Link]

> I can't make my mind up whether comparable systems in real life (e.g. bars or food kiosks) have greater throughput when there's an ordered queue or a mess of people waiting. I suspect that having people wait in line allows them to think of other things and not be ready when their turn comes -- a phenomenon with parallels to the cold caches mentioned above.

How could there possibly be higher throughput with the ordered queue? Because it takes time to figure out who's next in the mess of people? Because that doesn't have an analog in these lock mechanisms.

This analogy suggests a sophisticated optimal way to address the issue. At the kiosk, I don't think this effect actually happens with the ordered queue because you can see your turn coming up, and you get ready. If a waiter for a lock could, shortly before the lock is available, transform from waiting on a semaphore to spinning, he would be ready to use the lock the moment it becomes available but not be able to jump much ahead of his fair ordering. Now if the dispatcher could just reload caches at each dispatch, it would be great.

The big kernel lock strikes again
Posted May 18, 2008 1:18 UTC (Sun) by vonbrand (subscriber, #4458) [Link]

For simple queueing models at least (Poisson arrivals, exponential service times) the only thing that matters is the number of customers in the queue, not the order in which they are served.
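In rough textbook terms, that observation corresponds to the standard M/M/1 results: with Poisson arrivals at rate λ and exponential service at rate μ, the mean queue length and mean wait depend only on those two rates, not on the order in which waiting customers are served. The sketch below is just those standard formulas, written out for reference.

```latex
% M/M/1: Poisson arrivals at rate \lambda, exponential service at rate \mu.
\rho = \frac{\lambda}{\mu}, \qquad
L = \frac{\rho}{1 - \rho}, \qquad
W = \frac{L}{\lambda} = \frac{1}{\mu - \lambda}
% None of these depend on the service order, so FIFO, LIFO and random-order
% queues all have the same mean wait; only the variance of the wait differs.
```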