Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.5-rc3, which was announced by Linus on March 29. Additions this time around include lots of architecture updates, an AGPGART update, a few networking tweaks, an ACPI update, and various fixes. "Nothing earth-shattering," says Linus; things seem to be slowly settling down toward a real 2.6.5 release. See the long-format changelog for the details.

Linus's BitKeeper repository, as of this writing, contains an ALSA update, some PowerPC updates, and various other fixes.

The current tree from Andrew Morton is 2.6.5-rc3-mm2. Recent additions to -mm include some architecture updates, more scheduler work, a reworked laptop mode patch, support for huge serial ATA requests (see below), and lots of fixes.

The current 2.4 prepatch is 2.4.26-rc1, announced by Marcelo on March 28. Previously, 2.4.26-pre6 had come out on March 25. Recent changes include lots of fixes and support for Intel's AMD64-like IA32e architecture.

Kernel development news

Merging the virtual memory work

The LWN Kernel Page has included several articles over the last month on the work to improve the scalability of the virtual memory subsystem by eliminating the reverse mapping chains currently used by the 2.6 kernel. That work reached a milestone on March 26, when Andrea Arcangeli released 2.6.5-rc2-aa3 with more virtual memory changes and a comment:

Ok, this seems feature complete. Both nonlinear swapping and prio_tree are available now. I believe objrmap-core+anon-vma+prio_tree can be merged into mainline after a bit more of testing, certainly they looks good enough for -mm.

Andrea raised the issue again when he released 2.6.5-rc3-aa1. Andrew Morton finally replied at that point:

It's a bit early for that, I feel. I'd like to see thing settle down a little more at your end first, then see that Rajesh, Hugh and if possible Ingo have had a good go through everything.

And then there are the mechanics of swallowing a largely-undocumented 4,600-line patch which touches 60 files and tosses 30-odd rejects across 16 files.

It is not surprising that Andrew would hesitate to rush into merging major virtual memory changes in the middle of a stable kernel series. Most 2.6 users will, one imagines, be relieved to see that some caution is being applied here - regardless of the eventual value of this work. Andrea, however, is in more of a hurry: "Keep in mind this whole thing is going in production in a matter of a week, so please test and review now." Those words suggest that SUSE Linux 9.1 will include the new VM code. One can only hope that Andrea's high level of confidence in that code is justified.

COW Links

Free software hackers often find themselves cloning a large tree full of source files; with a duplicate tree, it is easy to see which files have been changed and to generate patch files. Creating such a tree, populated with hard links rather than copies, can be as easy as typing:

    cp -rl old-tree new-tree

This technique works well if you use a tool (emacs, say) which moves files aside before rewriting them. By moving the file, emacs breaks the link and leaves the original copy (in the old tree) unchanged. If, however, the tool rewrites the file in place (as vi tends to do), the file, as seen in both trees, will be changed.
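The difference between the two editing styles can be seen in a small experiment. What follows is a minimal sketch, not a model of how emacs or vi actually save files: the helper names are invented for illustration, and the "move aside" style is approximated by writing a new file and renaming it over the old name.

```python
import os
import tempfile


def rewrite_in_place(path, text):
    """Overwrite the existing inode, as an in-place editor would."""
    with open(path, "w") as f:
        f.write(text)


def rewrite_via_rename(path, text):
    """Write a new file, then rename it over the old name.

    The old inode is left alone, so any other link to it is unchanged.
    """
    tmp = path + ".tmp"  # hypothetical temporary name
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


def demo():
    """Return what the 'old tree' copy contains after each kind of rewrite."""
    results = {}
    styles = (("in_place", rewrite_in_place), ("rename", rewrite_via_rename))
    with tempfile.TemporaryDirectory() as d:
        for name, rewrite in styles:
            old = os.path.join(d, name + "_old")
            new = os.path.join(d, name + "_new")
            with open(old, "w") as f:
                f.write("original")
            os.link(old, new)          # what "cp -l" does for each file
            rewrite(new, "modified")   # edit the copy in the new tree
            with open(old) as f:
                results[name] = f.read()
    return results


if __name__ == "__main__":
    print(demo())
```

The in-place rewrite changes the file in both trees, while the rename-based rewrite leaves the old tree's copy untouched.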

As a solution to this problem, Jörn Engel has been working on a patch which implements "cowlinks." The idea behind a COW (copy-on-write) link is that, if the file linked to is written to, a copy will be made (thus breaking the link) and the write will be performed on the copy. With this capability, somebody wishing to duplicate and modify a tree of files could use COW links; the duplicate files would share the same blocks on disk until one was modified. And it would all work regardless of the tool being used to perform the modifications.

In fact, COW links could be used for any copy operations within the same filesystem. The result would be faster copies and, perhaps, substantial savings of disk space.

The current cowlink patch does not actually implement this behavior, however. It implements a COW bit in the inode structure, but, rather than actually perform the copy, it simply fails any attempt to write a file with more than one link. User space is then expected to notice the error and do the right thing. This is not the long-term planned behavior; from a comment in the code:

Yes, this breaks the kernel interface and is simply wrong. This is intended behaviour, so Linus will not merge the code before it is complete. Or will he?
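With the interim behavior, breaking the link falls to user space. The sketch below shows one way an application might cope; it is a guess at the general shape, not code from the patch. The article does not say which error the kernel returns, so the example treats any OSError on a multiply-linked file as the signal, and it uses a stand-in opener (refusing_opener, invented here) to imitate the patched kernel.

```python
import os
import shutil
import tempfile


def break_link(path):
    """Give `path` a private copy of its data, leaving other links untouched."""
    tmp = path + ".cowtmp"  # hypothetical temporary name
    shutil.copy2(path, tmp)
    os.replace(tmp, path)   # swap the private copy into place


def open_for_write(path, opener=open):
    """Open `path` for update, breaking a shared link if the open is refused."""
    try:
        return opener(path, "r+")
    except OSError:
        # Assumption: the refusal shows up as some OSError; the errno used
        # by the patch is not stated in the article.
        if os.stat(path).st_nlink <= 1:
            raise  # not a link problem, so pass the error along
        break_link(path)
        return opener(path, "r+")


def refusing_opener(path, mode):
    """Stand-in for a cowlink kernel: refuse writes to multiply-linked files."""
    if os.stat(path).st_nlink > 1:
        raise PermissionError(path)
    return open(path, mode)


def demo():
    """Write through one link and return (old contents, new contents)."""
    with tempfile.TemporaryDirectory() as d:
        old, new = os.path.join(d, "old"), os.path.join(d, "new")
        with open(old, "w") as f:
            f.write("original")
        os.link(old, new)
        with open_for_write(new, opener=refusing_opener) as f:
            f.write("modified")
        with open(old) as f_old, open(new) as f_new:
            return f_old.read(), f_new.read()


if __name__ == "__main__":
    print(demo())
```

Every writing application would need logic of this kind, which is presumably why the failing write is not the planned long-term behavior.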

The full behavior has not yet been implemented because it requires some tricky filesystem-level programming. There is also the issue that the right behavior for COW links has not yet been worked out. One obvious implementation would have COW links behave just like regular, "hard" links, with the file being truly copied when the first write is done. With that approach, however, the file will change its inode number after the writing application has opened it. That is just the sort of anomalous, nonstandard behavior that can break applications in strange and unexpected places.

An alternative would be for two COW-linked files to have separate inode numbers from the beginning, even though they share the same on-disk data. If COW links are implemented this way, no application will notice when the link is broken. What will break, however, is any application which depends on inode numbers to detect identical files. Recursive diffs will be much slower, "du" will give wrong numbers, and tar could do the wrong thing. Fixing all of these applications would require the addition of a nonstandard system call and fixing the programs to use it.
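The inode-number dependence is easy to illustrate. The following sketch shows the usual test for "same file" (matching device and inode numbers) and a du-like total that counts each inode once. The function names are invented for this example, and file sizes are summed rather than disk blocks to keep the numbers predictable.

```python
import os
import tempfile


def same_file(a, b):
    """Two names refer to one file if device and inode numbers both match."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def total_size(paths):
    """Sum file sizes, counting each (device, inode) pair only once."""
    seen = set()
    total = 0
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # another link to a file already counted
        seen.add(key)
        total += st.st_size
    return total


def demo():
    """Compare a hard link and a separate copy of a 1000-byte file."""
    with tempfile.TemporaryDirectory() as d:
        orig = os.path.join(d, "orig")
        link = os.path.join(d, "link")
        copy = os.path.join(d, "copy")
        with open(orig, "wb") as f:
            f.write(b"x" * 1000)
        os.link(orig, link)            # shares the inode with orig
        with open(copy, "wb") as f:    # separate inode, same contents
            f.write(b"x" * 1000)
        return (same_file(orig, link),
                same_file(orig, copy),
                total_size([orig, link, copy]))


if __name__ == "__main__":
    print(demo())
```

COW-linked files with distinct inode numbers would look like the separate copy here: the identity test fails and the total counts their shared blocks twice.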

Linus has made his opinion known:

I think the correct thing to do is to just admit that cowlinks aren't POSIX, and instead see the inode number as a way to see whether the link has been broken or not. Ie just accept the inode number potentially changing.

That opinion makes it likely that development will go in that direction, but, until the code shows up, nobody knows for sure.

Big block transfers: good or bad?

Users of serial ATA drives on Linux will be familiar with Jeff Garzik's "libata" driver, which provides solid support for those drives with several controllers. Jeff recently posted a patch which has the potential to make SATA users happier; with this patch, libata will use the "LBA48" mode, which can perform transfers of up to 32MB in length. Says Jeff:

With this simple patch, the max request size goes from 128K to 32MB... so you can imagine this will definitely help performance. Throughput goes up. Interrupts go down. Fun for the whole family.

Interestingly, the whole family was not entirely thrilled by the idea. The problem is latency: most SATA drives will take the better part of a second to perform a 32MB transfer, during which no other requests are being processed. Several people complained, saying that a 32MB limit is far too high, and that, in any case, the performance benefits of transfers above around 1MB are minimal at best. Jeff's explanation that, in reality, transfers would be limited to 8MB with the current libata driver did little to slow the debate.
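The numbers in this discussion follow from simple arithmetic. The request size limits come from the width of the ATA sector count field (8 bits in the older addressing mode, 16 bits with LBA48, with 512-byte sectors), and the latency follows from the drive's sustained transfer rate. The 50MB/s rate used below is an assumption chosen for illustration, not a figure from the discussion.

```python
SECTOR_BYTES = 512


def max_transfer_bytes(count_bits):
    """Largest request expressible with a sector count field of this width.

    An all-zeros count is read as the maximum, so the field covers
    2**count_bits sectors.
    """
    return (2 ** count_bits) * SECTOR_BYTES


def transfer_ms(nbytes, bytes_per_second):
    """Time, in whole milliseconds, to move nbytes at a sustained rate."""
    return nbytes * 1000 // bytes_per_second


if __name__ == "__main__":
    old_limit = max_transfer_bytes(8)     # 128KB
    new_limit = max_transfer_bytes(16)    # 32MB
    assumed_rate = 50_000_000             # assumption: 50MB/s sustained
    print(old_limit, new_limit, transfer_ms(new_limit, assumed_rate))
```

At the assumed rate, a full-sized request occupies the drive for roughly two thirds of a second, which is in line with the "better part of a second" estimate above.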

The issue being debated is not whether 32MB transfers could create latency problems; everybody agrees on that point. The difference of opinion is over where the decision on transfer sizes should be made. A device driver's job, according to Jeff, is to make the full capabilities of the device available to the kernel without imposing arbitrary limits. He would rather see the block layer deal with maximum transfer size issues. Jens Axboe, the maintainer of the block layer, responds that the block layer has no idea of the performance characteristics of any individual device, while the driver does. The driver, thus, is in the best position to make decisions about maximum transfer sizes.

In truth, the driver doesn't know the right number, either; it can depend on individual drives, the controller being used, etc. As a result, the final outcome looks like it will involve some sort of adaptive, dynamic tuning. The block layer will track the execution time of requests and note when that time gets to be too long; at that point, it will have the information needed to put a lid on request size. The same timing information could also be used to tweak the maximum tagged command queueing depth (the number of requests which can be fed simultaneously to the drive), since a number of similar issues come up there.
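No code for this tuning had been posted at the time of writing, so the following is only a sketch of the idea under invented assumptions: a per-device limiter that is told how long each request took and lowers its size limit when a request overruns a latency target. The class name, the proportional scaling rule, and the 128KB floor are all hypothetical.

```python
class RequestSizeLimiter:
    """Track request completion times and cap the maximum request size."""

    def __init__(self, max_bytes, target_ms, floor_bytes=128 * 1024):
        self.max_bytes = max_bytes      # current lid on request size
        self.target_ms = target_ms      # longest acceptable request time
        self.floor_bytes = floor_bytes  # never shrink below this size

    def record(self, nbytes, elapsed_ms):
        """Note one completed request; return the (possibly lowered) limit."""
        if elapsed_ms > self.target_ms and nbytes > self.floor_bytes:
            # Scale the request size down in proportion to the overrun,
            # assuming time grows roughly linearly with size.
            scaled = nbytes * self.target_ms // elapsed_ms
            self.max_bytes = max(self.floor_bytes,
                                 min(self.max_bytes, scaled))
        return self.max_bytes


if __name__ == "__main__":
    limiter = RequestSizeLimiter(max_bytes=32 * 1024 * 1024, target_ms=100)
    # A 32MB request that took 800ms pulls the limit down to 4MB.
    print(limiter.record(32 * 1024 * 1024, 800))
```

This version only ever lowers the limit; a real implementation would presumably also need a way to raise it again, and a similar feedback loop could adjust the tagged command queueing depth.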

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds