How scalable is too much?
The problem with the BKL, of course, is that multiple processors often want to run concurrently in kernel code. Most of the time, those processors are working on entirely different tasks and would not interfere with each other. The more processors you have, the worse the problem gets; a Linux kernel with just one big lock (i.e., the 2.0 kernel) really did not function all that well with more than two processors. Any additional CPUs would just spend their time waiting to get into the kernel code.
Scalability to larger systems, thus, requires finer locking. The BKL can be split into a memory management lock, a networking lock, a filesystem lock, etc. In the 2.1 development series, for example, the block I/O subsystem adopted its own lock (io_request_lock) to keep the block code and drivers from getting into trouble. Scalability was improved, since the block code no longer needed the BKL, and could execute concurrently with other kernel code.
But io_request_lock serialized all block request handling. A process submitting requests for one drive could not run concurrently with a different process working with a different device. Floppy operations contended for the same lock as performance-critical disk requests. The I/O request lock improved scalability, but, once a system had enough processors and drives, it became a bottleneck in its own right. So one of the first steps in the 2.5 block subsystem work was to replace io_request_lock with a per-queue lock, one for each device. The result will be better performance on large, disk-intensive systems.
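To make the change concrete, here is a minimal userspace sketch of the two schemes; it uses pthread mutexes to stand in for kernel spinlocks, and the structure layout is an assumption for illustration, not the actual 2.5 code:

    /*
     * Old scheme: every request, for every device, takes one global
     * lock.  New scheme: each device queue carries its own lock, so
     * requests for different devices can proceed in parallel.
     */
    #include <pthread.h>

    /* Old scheme: one lock serializes all block I/O in the system. */
    pthread_mutex_t io_request_lock = PTHREAD_MUTEX_INITIALIZER;

    struct request_queue {
        pthread_mutex_t queue_lock;   /* new scheme: per-device lock */
        /* ... per-device request list would live here ... */
    };

    void submit_request_old(struct request_queue *q)
    {
        pthread_mutex_lock(&io_request_lock);   /* contends with every device */
        /* enqueue the request on q ... */
        pthread_mutex_unlock(&io_request_lock);
    }

    void submit_request_new(struct request_queue *q)
    {
        pthread_mutex_lock(&q->queue_lock);     /* contends only within this queue */
        /* enqueue the request on q ... */
        pthread_mutex_unlock(&q->queue_lock);
    }

With the per-queue variant, a floppy operation and a disk request touch different locks and never wait on each other.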
Most other kernel subsystems have been going through a similar development process: global locks are replaced by multiple locks, each protecting a smaller data structure. This increasingly fine-grained locking makes the kernel scale to more and more processors, but it brings real costs of its own. Most of us do not run Linux on huge systems, and probably never will; SMP remains rare in the embedded world as well. On uniprocessor builds the spinlocks compile away to nothing, but every SMP system, however small, pays the runtime cost of all those lock operations while seeing little of the benefit.
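The uniprocessor optimization works roughly as follows; this is a simplified sketch of the mechanism used in the kernel headers of that era, not the literal source:

    /*
     * Simplified sketch of how spinlocks vanish on uniprocessor
     * builds.  The macro and function names here are condensed
     * stand-ins for the real (messier) kernel definitions.
     */
    #ifdef CONFIG_SMP

    typedef struct { volatile unsigned int lock; } spinlock_t;
    #define spin_lock(l)    arch_spin_lock(l)      /* real atomic operation */
    #define spin_unlock(l)  arch_spin_unlock(l)

    #else  /* uniprocessor: no other CPU can race with us */

    typedef struct { } spinlock_t;                 /* empty type, zero bytes */
    #define spin_lock(l)    do { (void)(l); } while (0)  /* compiles to nothing */
    #define spin_unlock(l)  do { (void)(l); } while (0)

    #endif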
The real cost, however, is in the complexity of the kernel code. As the kernel becomes populated with thousands of little locks, it becomes increasingly difficult to write correct kernel code. Which lock(s) must you have to access a given data structure, or to call a given function? In which order should locks be taken? Consider two code paths, both of which need locks L1 and L2. The first thread takes L1, the second takes L2; each then tries to take the other lock. The result is a deadlocked system. Avoiding this problem requires specifying ordering relationships for every lock in the system - and the number of those relationships grows exponentially with the number of locks.
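This scenario is the classic "ABBA" deadlock. Here is a minimal, hypothetical userspace demonstration (pthread mutexes standing in for kernel locks); run it, and the two threads will reliably wedge, each holding one lock and waiting forever for the other:

    /*
     * ABBA deadlock: two threads take the same two locks in
     * opposite order.  Demonstration code, not from the kernel.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    static void *thread_a(void *unused)
    {
        pthread_mutex_lock(&L1);     /* A holds L1 ... */
        usleep(100000);              /* give B time to grab L2 */
        pthread_mutex_lock(&L2);     /* ... and now waits forever for L2 */
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
        return NULL;
    }

    static void *thread_b(void *unused)
    {
        pthread_mutex_lock(&L2);     /* B holds L2 ... */
        usleep(100000);              /* give A time to grab L1 */
        pthread_mutex_lock(&L1);     /* ... and now waits forever for L1 */
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);       /* never returns: the threads are deadlocked */
        pthread_join(b, NULL);
        printf("never reached\n");
        return 0;
    }

An agreed rule such as "always take L1 before L2" removes the cycle; that rule is exactly the kind of ordering relationship the article is talking about.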
One can try to document the locking requirements of each data structure and function in the kernel, and every lock ordering constraint. But, even if one honestly believed that such a document would be created (and, importantly, maintained), it would be a very thick, complicated manual. A kernel with many locks will be a kernel that is difficult to program.
Some people (Larry McVoy, notably) have been arguing for years that Linux should not chase the "scalability" goal too far. Down that road lies a kernel that is twisted beyond maintainability, and, once you realize that this has happened, it is too late to go back. For the most part, scalability work has continued in the face of those warnings, but there are signs that things are beginning to change. For example, a recent patch which removed the BKL from the driverfs code was shouted down in fairly strong terms. Alexander Viro stated, in characteristic fashion:
So, while there has been no definitive statement of policy, it looks like at least some kernel developers are thinking that locking in the kernel is complex enough. There may be no 64-processor Linux in our future...
...at least, not in the classic SMP form. Larry McVoy has been pushing "cache-coherent clusters" as an alternative approach for some time. A CC/cluster takes a large machine and divides it into small groups of (say, four) processors; each group runs an independent Linux kernel. The kernels have minimal interactions with each other, so locking issues fade into the background. Nobody has yet implemented such a cluster, though many of the pieces are there. If somebody runs with this idea, Linux could yet be the most scalable system of them all.
How scalable is too much?
Posted Jul 11, 2002 18:21 UTC (Thu) by iabervon (subscriber, #722)

A minor point: the number of required ordering relationships between locks grows only linearly with the number of locks, because it has to be a total order (i.e., a simple list of all the locks in order). What grows exponentially is the number of orderings you could pick.

At least in the simple cases (global locks, ordered lists of objects with locks), it should be possible to check automatically that a set of predefined rules is followed. (The hard problem is determining whether a piece of code is actually free of deadlocks; but it is a bug if locks are taken in the wrong order, even if the code would not deadlock anyway.) I expect the kernel checker to determine lock safety (if it doesn't already) pretty soon; if the checker can't figure it out, it's probably too complicated and likely to get broken.
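Such a rule checker can indeed be sketched simply. The following hypothetical C fragment (not any actual kernel checker; all names are invented for illustration) shows the core idea: give every lock a rank in the global order, track what each thread holds, and complain whenever locks are taken out of order, deadlock or not:

    /*
     * Hypothetical runtime lock-order checker, for illustration only.
     * A thread may only acquire locks of strictly increasing rank.
     */
    #include <assert.h>
    #include <pthread.h>

    struct checked_lock {
        pthread_mutex_t mutex;
        int rank;                     /* position in the global lock order */
    };

    #define MAX_HELD 16
    static __thread int held_ranks[MAX_HELD];  /* ranks held by this thread */
    static __thread int n_held = 0;

    void checked_lock_acquire(struct checked_lock *l)
    {
        /* Out-of-order acquisition is a bug even if it happens not
         * to deadlock on this particular run. */
        assert(n_held == 0 || l->rank > held_ranks[n_held - 1]);
        pthread_mutex_lock(&l->mutex);
        held_ranks[n_held++] = l->rank;
    }

    void checked_lock_release(struct checked_lock *l)
    {
        /* Simplification: assumes locks are released in LIFO order. */
        assert(n_held > 0 && held_ranks[n_held - 1] == l->rank);
        pthread_mutex_unlock(&l->mutex);
        n_held--;
    }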
How scalable is too much?
Posted Jul 12, 2002 20:21 UTC (Fri) by DeletedUser2560 ((unknown), #2560)

Each lock should have an identifier associated with it. If a piece of code needs multiple locks, it could call a single system call and specify all the locks it needs as parameters. It would be the job of this system call to grab the locks in order (or block until all the requested locks are available, as one atomic operation).
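A minimal sketch of that suggestion (hypothetical code; the names are invented): sort the requested locks by their identifier, then acquire them in that canonical order, so that any two callers asking for overlapping sets cannot deadlock against each other:

    /*
     * Hypothetical "grab several locks at once" helper.  Sorting by
     * identifier before acquiring imposes a global order, so callers
     * with overlapping lock sets cannot deadlock.
     */
    #include <pthread.h>
    #include <stdlib.h>

    struct idlock {
        pthread_mutex_t mutex;
        int id;                       /* unique identifier for ordering */
    };

    static int cmp_by_id(const void *a, const void *b)
    {
        const struct idlock *x = *(const struct idlock * const *)a;
        const struct idlock *y = *(const struct idlock * const *)b;
        return (x->id > y->id) - (x->id < y->id);
    }

    /* Acquire all of the given locks, always in identifier order. */
    void lock_many(struct idlock **locks, int n)
    {
        qsort(locks, n, sizeof(locks[0]), cmp_by_id);
        for (int i = 0; i < n; i++)
            pthread_mutex_lock(&locks[i]->mutex);
    }

Strictly speaking, this achieves deadlock freedom by ordering rather than by a truly atomic all-or-nothing acquisition, but it delivers the essential property the comment asks for: callers never need to know the global lock order themselves.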
Larry McVoy and his BitKeeper advertisements
Posted Jul 15, 2002 10:21 UTC (Mon) by leandro (guest, #1460)

I guess it's the guy's right to advertise his proprietary system, and it's Linus' right to use the tools he wants to.

But what's fishy is that, even in a totally unrelated presentation, he keeps advertising his proprietary stuff. And few people probably realise that, if Linus had taken the same kind of decision, Linux itself wouldn't be copylefted, not even free, but some kind of freeware or nagware, and we would all be using BSD or the Hurd. Linux probably would never have gotten much further than Minix did, because Linux succeeded basically by becoming the unofficial GNU kernel, and that wouldn't have been possible under LM's BitKeeper licensing model. Linus would have had to create all the userland tools he got for free from the GNU Project. OTOH, LM proposes to make his code free only to free software, but it's not really free in itself; taken to its logical conclusion, if everyone followed suit there would be no more free software, only freeware and nagware, and we would be back to the old "free only for non-commercial use" or some variation of it. Or was it just me who realised that BitKeeper couldn't be used by the community to develop another BitKeeper, since that would require paying to develop a proprietary tool? So Larry is actually preempting the appearance of a free software competitor. Disgusting.