
Kernel development

Brief items

Release status

The current development kernel is 2.5.25, which was announced by Linus on July 5. It includes a 1000 HZ internal clock on x86 processors (though that may change, the real point of interest is that the internal clock has been detached from the HZ seen in user space), some SCSI midlayer work (see last week's LWN Kernel Page for a description of the plan for SCSI), a bunch of filesystem and VM layer cleanups, an NTFS update, more kbuild tweaks, and many other changes. Those wanting details can look at the long-format changelog.

Linus's BitKeeper tree for 2.5.26 contains only a small set of fixes as of this writing.

The latest prepatch from Dave Jones is 2.5.25-dj1, which catches up to the 2.5.25 kernel and throws in a number of fixes and a "fatfs crapectomy."

The latest 2.5 status summary from Guillaume Boissiere is dated July 10.

The current stable kernel is 2.4.18; Marcelo has not released any new 2.4.19 release candidates over the last week.

Alan Cox has released 2.4.19-rc1-ac1, which catches up to the first 2.4.19 release candidate and adds a small set of additional fixes.

Kernel development news

The end of the road for kiobufs?

Andrew Morton's "direct-to-BIO for O_DIRECT" patch is another step in the process of converting the file I/O subsystem over to the new BIO request structure. Files opened with O_DIRECT are a bit of a special case, in that I/O happens directly to or from a userspace buffer. Andrew's patch sets up a BIO request pointing directly to that buffer; for large operations, the result is a significant speedup.

That sort of optimization is certainly worthwhile. The really interesting part of this patch, however, is that it shorts out the "kiobuf" layer for O_DIRECT, and for the raw block I/O devices as well. Kiobufs were initially implemented to support that sort of raw I/O; they were intended to be a generic abstraction for a collection of physical pages in I/O operations. Kiobufs have been gradually falling out of favor over the last couple of years, however, as their limitations have come to light. They are a relatively heavyweight data structure, with high setup and teardown costs. Kiobufs also break down operations into relatively small chunks which must be processed sequentially, slowing down large requests.

The direct-to-BIO patch has eliminated the original and largest use of kiobufs within the kernel. That leads to the obvious question: is it time to remove kiobufs from 2.5? The answer seems to be "yes," and some patches removing the last remaining uses of kiobufs have started appearing. Kiobufs, it seems, are on the way out.

The only gap left if kiobufs are removed would be direct I/O support for character devices. There are devices which can benefit from direct I/O: consider the SCSI generic layer, video devices, or high-speed tape drives. Requests have been posted for a function which would map a userspace buffer into a "scatterlist," a data structure representing memory which has been set up for DMA operations. This capability would take almost all of the pain out of supporting direct I/O in character devices; no such patch has yet been posted, though.
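To make the request concrete, here is a minimal sketch of the bookkeeping such a mapping function would have to do: walk a userspace buffer and split it at page boundaries into (page, offset, length) segments. This is plain Python, not kernel code. The 4096-byte page size and the names `Segment` and `split_into_segments` are assumptions made for illustration, and a real implementation would also have to pin the pages and set up the DMA mappings.

```python
from typing import List, NamedTuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Segment(NamedTuple):
    page: int    # page number (address // PAGE_SIZE)
    offset: int  # byte offset of the data within that page
    length: int  # number of bytes covered in that page


def split_into_segments(addr: int, length: int) -> List[Segment]:
    """Split the byte range [addr, addr + length) at page boundaries."""
    segments = []
    end = addr + length
    while addr < end:
        offset = addr % PAGE_SIZE
        # Take bytes up to the end of this page or the end of the buffer,
        # whichever comes first.
        chunk = min(PAGE_SIZE - offset, end - addr)
        segments.append(Segment(addr // PAGE_SIZE, offset, chunk))
        addr += chunk
    return segments
```

A buffer that starts partway into a page yields a short first segment, full-page segments in the middle, and a short tail; a driver could then hand one such entry per page to its DMA engine.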

2.5 IDE considered harmful

The volume of the complaints about the 2.5 IDE subsystem is increasing. Consider this posting from Russell King:

If stuff in 2.5 wasn't soo broken (looking at IDE here) then more people would be using it, and less people would be wanting the 2.5 features back ported to 2.4. IMHO, at the moment 2.5 has a major problem. It is not getting the testing it deserves because things like IDE and such like aren't reasonably stable enough.

...or this one from Andi Kleen...

Testing 2.5 (in this case with x86-64) is a major problem unless you're lucky enough to find a SCSI adapter and a SCSI disk. IDE just deadlocks and hangs too often. This prevents testing everything else and stops development in 2.5 for many things.

The state of the IDE code is seen by many as a drag on the 2.5 development process as a whole. For those who are concerned, there are a few things worth looking at.

Part of the problem, apparently, is that the 2.5.25 kernel is missing several of the more recent patches, which fix serious problems. As Martin Dalecki puts it:

My plan is to provide a 98 soon which will be cummulative against 2.5.25, just to geive people a chance to work on it again. But as it stands - *plain* 2.5.25 is indeed very dangerous in this regard.

Martin's IDE-98 patch has not been posted as of this writing; those wanting to run 2.5.25 on an IDE system in the meantime and actually keep their files should apply this set of patches.

Interestingly, most of those patches were not posted by Martin (who has been on vacation). Instead, the recent IDE patches have been produced by Bartlomiej Zolnierkiewicz. Bartlomiej seems to take a somewhat more cautious approach, and even has the respect of former IDE maintainer Andre Hedrick. With luck, he will be more involved in future IDE work. Few people contest the need to "clean up" the IDE layer, but this work needs to be done in a very careful way.

Meanwhile, a different approach has been taken by Jens Axboe. It is normal for interesting features in the current development series to be backported to the previous stable kernel. Thus, for example, Alan Cox's 2.4.19-ac patch includes the O(1) scheduler from 2.5. Jens has gone the other direction and posted a patch (since updated) which "foreports" the 2.4 IDE layer to 2.5. His purpose was to have a stable platform to work on; the patch will be maintained until the 2.5 IDE layer becomes a little more trustworthy. It is not intended to be a long-term replacement for that layer.

With luck, the 2.5 IDE issues will settle out soon. Meanwhile, caution (or a SCSI system) is suggested for people running 2.5.

How scalable is too much?

In the beginning Alan Cox created the big kernel lock (BKL), and Linux became SMP-capable. The BKL ensured that only one processor could be running kernel code at any given time, thus keeping the processors from stepping on each other. It was an effective way of bringing SMP support to a kernel which had not been designed for multiple processors.

The problem with the BKL, of course, is that multiple processors often want to run concurrently in kernel code. Most of the time, those processors are working on entirely different tasks and would not interfere with each other. The more processors you have, the worse the problem gets; the Linux kernel with just one big lock (i.e. 2.0) really did not function all that well with more than two processors. Any additional CPUs would just spend their time waiting to be able to get into the kernel code.

Scalability to larger systems, thus, requires finer locking. The BKL can be split into a memory management lock, a networking lock, a filesystem lock, etc. In the 2.1 development series, for example, the block I/O subsystem adopted its own lock (io_request_lock) to keep the block code and drivers from getting into trouble. Scalability was improved, since the block code no longer needed the BKL, and could execute concurrently with other kernel code.

But the io_request_lock serialized all block request handling. A process submitting requests for one drive could not run concurrently with a different process working with a different device. Floppy operations contended for the same lock as performance-critical disk requests. The I/O request lock improved scalability but, on a system with enough processors and drives, it was still a bottleneck. So one of the first steps in the 2.5 block subsystem work was to replace io_request_lock with per-queue locks, one for each device. The result will be better performance on large, disk-intensive systems.
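The difference between one global lock and a lock per queue can be sketched in a few lines of user-space Python. This is only a structural illustration, not the kernel's block layer: the `RequestQueue` class and the helper names are hypothetical, and Python's own interpreter lock means the sketch shows which submitters contend with each other, not any real speedup.

```python
import threading


class RequestQueue:
    """A toy request queue that carries its own lock."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.lock = threading.Lock()  # per-queue lock, one per device
        self.requests = []

    def submit(self, request: str) -> None:
        # Only submitters to *this* queue contend for this lock.
        with self.lock:
            self.requests.append(request)


def submit_many(queue: RequestQueue, count: int) -> None:
    for i in range(count):
        queue.submit(f"{queue.name}-req{i}")


def run_queue_demo(count: int = 1000) -> dict:
    """Submit to two queues from two threads; return requests per queue."""
    disk = RequestQueue("disk")
    floppy = RequestQueue("floppy")
    threads = [
        threading.Thread(target=submit_many, args=(q, count))
        for q in (disk, floppy)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {q.name: len(q.requests) for q in (disk, floppy)}
```

With a single shared lock in place of `self.lock`, the floppy thread and the disk thread would wait on each other for every request; with one lock per queue they never touch the same lock.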

Most other kernel subsystems have been going through a similar development process: global locks are replaced by multiple locks which protect smaller data structures. This increasingly fine-grained locking makes the kernel scalable to more and more processors, but it also brings some real costs. For example, most of us do not run Linux on huge systems, and probably never will. Embedded SMP systems are also rare. All that locking will have a cost, even though the compiler optimizes it out on uniprocessor systems.

The real cost, however, is in the complexity of the kernel code. As the kernel becomes populated with thousands of little locks, it becomes increasingly difficult to write correct kernel code. Which lock(s) must you have to access a given data structure, or to call a given function? In which order should locks be taken? Consider two code paths, both of which need locks L1 and L2. A thread on the first path takes L1 while a thread on the second takes L2; each then tries to take the other lock. The result is a deadlocked system. Avoiding this problem requires specifying ordering relationships for every lock in the system, and, since any pair of locks may need a defined order, the number of those relationships grows with the square of the number of locks.
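The L1/L2 scenario, and the usual cure of a single agreed-upon ordering, can be sketched in user space. The code below is an illustration in Python, not kernel code; the helper names are hypothetical, and ordering by `id()` is simply one convenient global order chosen for the sketch.

```python
import threading


def acquire_in_order(*locks):
    """Acquire locks in one global order (here: by id()) and return them."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release locks in the reverse of the order they were taken."""
    for lock in reversed(ordered):
        lock.release()


def run_ordering_demo(iterations: int = 1000) -> int:
    l1, l2 = threading.Lock(), threading.Lock()
    counter = [0]

    def worker(first, second):
        for _ in range(iterations):
            held = acquire_in_order(first, second)
            try:
                counter[0] += 1  # protected by both locks
            finally:
                release_all(held)

    # The two threads name the locks in opposite orders, but both end up
    # acquiring them in the same order, so neither can hold one lock
    # while waiting forever for the other.
    threads = [
        threading.Thread(target=worker, args=(l1, l2)),
        threading.Thread(target=worker, args=(l2, l1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Had each worker instead taken its locks in the order given (`first`, then `second`), the two threads could each grab one lock and wait indefinitely for the other. The sketch sidesteps the problem with a mechanical rule; in a kernel with thousands of locks the ordering has to be agreed on, documented, and followed by hand.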

One can try to document the locking requirements of each data structure and function in the kernel, and every lock ordering constraint. But, even if one honestly believed that such a document would be created (and, importantly, maintained), it would be a very thick, complicated manual. A kernel with many locks will be a kernel that is difficult to program.

Some people (i.e. Larry McVoy) have been arguing for years that Linux should not chase the "scalability" goal too far. Down that road lies a kernel that is twisted beyond maintainability, and, once you realize that this has happened, it is too late to go back. For the most part, scalability work has continued in the face of those warnings, but there are signs that things are beginning to change. For example, a recent patch which removed the BKL from the driverfs code was shouted down in a fairly strong way. Alexander Viro stated, in characteristic fashion:

"Zillion little spinlocks" means that kernel is scaled into oblivion. Literally. If you want to play with resulting body - feel free, but I like it less kinky.

So, while there has been no definitive statement of policy, it looks like at least some kernel developers are thinking that locking in the kernel is complex enough. There may be no 64-processor Linux in our future...

...at least, not in the classic SMP form. Larry McVoy has been pushing "cache-coherent clusters" as an alternative approach for some time. A CC/cluster takes a large machine and divides it into small groups of (say, four) processors; each group runs an independent Linux kernel. The kernels have minimal interactions with each other, so locking issues fade to the background. Nobody has yet implemented such a cluster, though a lot of the pieces are there. If somebody runs with this idea, Linux could yet be the most scalable system of them all.

Patches and updates

Kernel trees

Andrea Arcangeli: 2.4.19rc1aa2
J.A. Magallon: Linux 2.4.19-rc1-jam1

Architecture-specific

Naohiko Shimizu: Super Page patch for 2.4.18. "Super page" support for the Alpha architecture.
Greg Ungerer: Announce: 2.5.25uc0 patch for mmu-less CPU's. A 2.5 version of uClinux.
Tom Rini: A generic RTC driver (for the m68k architecture).

Core kernel code

William Lee Irwin III: lazy_buddy-2.5.25-1. Defers coalescing of adjacent pages in the buddy allocator as a way of making some operations go faster.
Rusty Russell: cpu_mask_t. "This fixes the last of my cpu_online_map damage, completing the abstraction."
Ingo Molnar: Re: O(1) batch scheduler. A new version of the SCHED_BATCH patch.

Development tools

Device drivers

Marc Boucher: New hcfpcimodem-0.97mbsibeta02070500 release. A Conexant HCF 'linmodem' driver.
Jens Axboe: 2.4 IDE core for 2.5. "I needed stable IDE for 2.5 testing and it was/is clear that 2.5 just isn't quite there yet. I intend to maintain this patch set until I deem 2.5 IDE stable enough (in code) that I'm willing to spend time on that instead."
Patrick Mochel: Driverfs updates
Douglas Gilbert: sg driver against lk 2.5.25

Filesystems and block I/O

Memory management

Rik van Riel: minimal rmap for 2.5 - akpm tested. "If you have some time left this weekend and feel brave, please test the patch..."

Networking

Dmitry Kasatkin: Affix-1_00pre6 Stack. A BlueTooth stack for Linux.

Miscellaneous

Karim Yaghmour: Adeos now supports SMP

Page editor: Jonathan Corbet


Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds