
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel was released on November 10; it contains about a dozen fixes, including one for the zero-length datagram bug. It does not yet contain a fix for the file lease denial of service bug, however.

The current 2.6 prepatch is 2.6.15-rc1, announced by Linus on November 11. Says Linus:

It's hard to go through in any great detail, because even the shortlog is actually almost five thousand lines and about 200kB in size, and would thus run afoul of the mailing list limits so I can't include it here. The same is true of the diffstat, only even more so. The unidiff is about a million lines in size, just the diffstat is 300+kB.

The changes are really pretty much all over the place, with over four thousand commits merged in the two weeks since 2.6.14...

The changes in 2.6.15 have been listed in detail here last week and the week before. Significant changes merged since last week's article, but before 2.6.15-rc1, include a new, simpler, type-safe netlink API, a new netfilter connection tracking implementation (which understands IPv6 and will eventually replace the current code), the removal of some SCSI subsystem typedefs (Scsi_Device, Scsi_Pointer, and Scsi_Host_Template), the removal of the owner field from struct pci_driver, and a new "platform driver" interface.

For those who are just tuning in: some of the more significant changes in 2.6.15 will include device model changes, basic hotplug memory support, a reworked NTFS implementation with improved write behavior, a big CIFS update, a number of block improvements (and a reorganization of the block layer into its own top-level directory), the Open-iSCSI initiator, InfiniBand SCSI RDMA, RapidIO support, a number of scheduler tweaks, the shared subtrees patch, four-level page tables for the ia64 architecture, and much more. See the short log for a long list of changes, or the long-format changelog for the details.

2.6.15-rc1 marks the closing of the window for new features, so Linus's git repository contains mostly fixes. It does, however, also include a generic cmpxchg implementation for i386, new Omnikey Cardman 4000 and 4040 drivers, and a new DMA32 zone for the x86-64 architecture.

The current -mm tree is 2.6.14-mm2. Recent changes to -mm include some relayfs enhancements, some scheduler tweaks, and various fixes.

The current 2.4 kernel is 2.4.32, released by Marcelo on November 16. 2.4 is in deep maintenance mode, so there is not much in the way of new features in 2.4.32.


Kernel development news

HOWTO do Linux kernel development

Greg Kroah-Hartman has gotten tired of answering the same questions about Linux kernel development. So he has put together a HOWTO document to get people started. The goals are ambitious: "This is the be-all, end-all document on this topic. It contains instructions on how to become a Linux kernel developer and how to learn to work with the Linux kernel development community." Now people just have to read it...


4K stacks - again

The question of whether the i386 architecture should move to using 4K kernel stacks by default has been raised a few times; LWN last covered the 4K stack issue in September. Adrian Bunk has started the discussion anew with this proposal that the -mm tree go to 4K stacks (only) now, with an eye toward changing the mainline for 2.6.16.

Most of the technical issues have not changed since September, so those arguments will not be repeated here. It is worth noting that layered block devices and filesystems have mostly been fixed. In past kernels, highly stacked devices (think of a combination of RAID, encryption, and network filesystems) could end up with very long call chains in the kernel, and, as a result, overflow the kernel stack. Most of these calls have since been serialized, so block-layer stacking should not be a problem.

The issue that remains is NDISwrapper, the glue layer which allows Windows NDIS drivers to be loaded into a Linux kernel. Windows runs with a much larger kernel stack size, so NDIS driver writers have no reason to be as careful about stack usage. And, of course, these drivers cannot be fixed to work properly with Linux. Some have argued that breaking NDISwrapper is not a possibility, since many users rely upon it to make their wireless network adapters work. But patience with this line of thought is running thin, as can be seen in this outburst from Dave Jones:

If we continue down this path, we'll have no native wireless drivers for Linux. The answer is not to complain to linux-kernel for breaking ndiswrapper, but complain to the vendors for not releasing specifications for native drivers to be written.

The good news is that the wireless situation is not as bad as one might think. There is documentation for Broadcom chips available now, and a Broadcom driver is in the works. There is also an Atheros driver which is "nearly done." Once these drivers are complete and joined with the Intel drivers already in the mainline, Linux will have much better support for wireless devices, and far fewer systems will have any reason to use NDISwrapper.

There are a number of reasons for going with the 4K stack mode, including better performance and higher reliability. Some distributions (e.g. Fedora Core and RHEL) have been shipping 4K kernels for a while now. So, while nobody has committed to moving the mainline (or -mm) toward 4K-only yet, chances are improving that it will happen sometime in the not-too-distant future.


VM followup: page migration and fragmentation avoidance

Page migration is the act of moving a process's pages from one part of the system to another. Often, the motivation is moving pages between NUMA nodes in the hope of improving performance. When this page last looked at the page migration patch set, it worked by forcing target pages out to the swap device; when the owning process later faulted them back in, the pages would be allocated on the desired node. This technique works, but it is not optimal: it would be nicer to avoid having to write the pages to disk and read them back in.

Christoph Lameter has now followed up with the direct migration patch set, which does away with the side-trip to the swap device. A look at the patch shows why things were not done this way in the first place; direct page migration involves rather more than simply copying the data over. The first step, after choosing a target page, is to lock that page so that nobody else will mess with it. There might currently be I/O active which involves that page, so the kernel must wait for any such I/O to complete. Only then can the real migration work begin.

The kernel must establish a swap cache entry for the page, even though it intends to avoid writing the page to swap. This entry will cause the right thing to happen if a process faults on the page while it is being moved. Then all references to the page (page table entries) are unmapped. With luck, all references will go away; if references remain for any reason, the page cannot be moved.

Actually moving the page involves copying a subset of the page status bits over, copying the page data itself, then copying the rest of the status bits. The old page is cleared out and freed. If any writeback has been queued up for the new page, it is set in motion. Then it's just a matter of cleaning up, and the page has been successfully moved.

If the kernel runs out of free pages on the target node, it will fall back to the swap-based mechanism. So that stage of this patch's evolution remains useful.

With this code in place, the kernel has the support it needs to try to keep a process's pages in local memory. The migration code might also prove useful for hotplug memory uses, where all pages must be vacated from a given region. Indeed, some of this code was originally written for hotplug applications. But, at this point, migration is done on a best-effort basis. On NUMA systems, failure to move a page results in worse performance, but nothing particularly severe. For hotplug memory, however, such a failure will block a memory removal operation altogether. Moving all pages out of a region with 100% certainty remains a difficult problem without a complete solution at this time.

One of the pieces of such a solution might be active memory defragmentation which, among other things, works to keep non-movable memory allocations out of memory regions which might be removed. When we looked at active defragmentation last week, that patch set looked like it was in trouble. The overhead of the defragmentation code seemed to be too high, and a number of developers (Linus included) felt that this sort of functionality should be implemented using the kernel's zone system, rather than with a new layer in the memory allocator.

Defragmentation hacker Mel Gorman doesn't give up that easily, however. He has posted a new, "light" version of the defragmentation patch which, he hopes, will be better received. As he describes it:

This is a much simplified anti-defragmentation approach that simply tries to keep kernel allocations in groups of 2^(MAX_ORDER-1) and easily reclaimed allocations in groups of 2^(MAX_ORDER-1). It uses no balancing, tunables, or special reserves, and it introduces no new branches in the main path. For small memory systems, it can be disabled via a config option. In total, it adds 275 new lines of code with minimum changes made to the main path.

In this version of the patch, a new GFP flag (__GFP_EASYRCLM) is added; its presence indicates an allocation which the kernel can easily get back should the need arise. It is used for user-space pages (which can usually be forced out to backing store) and in a few other situations, such as for some kernel buffers. The buddy allocator already keeps track of memory in large chunks; the new code simply steers reclaimable allocations toward some chunks, while keeping the non-reclaimable allocations in others. In this way, it is hoped, there will be no situations where one non-movable page blocks the freeing of the large, contiguous region in which it is located.

The patch works by creating a "usemap" array tracking which kind of allocation is being done from each large chunk of memory. Mel also had to split the per-CPU free lists which are used to perform fast single-page allocations; now there are two such lists, one for each allocation type. From there, it is just a matter of taking allocations from the proper pile, depending on the __GFP_EASYRCLM flag.

This version certainly reduces the footprint and overhead of the defragmentation patches. It is still not the zone-based approach that others were pushing for, however. So it remains to be seen whether "active defragmentation lite" is, in the end, better received than its predecessors.


The end of isa_readb() and friends

The kernel has long had a series of functions which read and write memory locations in the legacy ISA memory range. These functions, with names like isa_readb(), require no special preparation to use, but they work only within the ISA hole. They have also been obsolete and deprecated for quite some time.

Recently, there has been an effort to finally get rid of isa_readb() and friends. To that end, Al Viro has posted a set of "isaectomy" patches which fix up the remaining callers (they are made to use ioremap() and the not quite as obsolete readb() family of functions) so that the old stuff can be deleted. One would think that this work would be uncontroversial, but Linus, it turns out, is unconvinced:

Hmm.. I actually believe that the isa_read() functions are more portable and easier to use than ioremap().

The reason? A platform will always know where any legacy ISA bus resides, while the "ioremap()" thing will depend on platform PCI code to have set the right offsets (and thus the resource addresses) for whatever bus the PCI device is on.

The fact is, however, that very little in-tree code still uses these functions. They are a deprecated interface to a very old and obsolete hardware standard, and they have few defenders. So anybody maintaining out-of-tree code which still uses these functions might want to take warning: they probably will not stay around for much longer.


A software suspend decision point

The relative calm which has settled around the software suspend subsystem may be about to come to an end. This part of the kernel, which has never worked to everybody's satisfaction, remains subject to different ideas of how the problem should be solved.

Pavel Machek's user-space software suspend patch was covered here in September. Pavel has now posted a new version of the patch with a request that it be merged for 2.6.16. The user-space approach is, clearly, the way Pavel thinks that software suspend should go. Beyond getting some code out of the kernel, this approach makes a number of add-on features, such as graphical displays, image compression, image encryption, network-based suspend, etc., easier to implement. If you want to hang a big pile of features onto the suspend mechanism, you eventually have to get into user space.

One of the first responses came from Dave Jones, who said:

Just for info: If this goes in, Red Hat/Fedora kernels will fork swsusp development, as this method just will not work there.

The main issue is that the user-space approach uses /dev/kmem to repopulate memory at resume time. Red Hat and Fedora kernels do not allow memory to be overwritten in this way; almost nothing other than rootkits needs that capability. Allowing user space to overwrite arbitrary physical pages is, to Dave, not worth it, no matter how many software suspend features it enables. Says Dave: "I'll take 'rootkit doesnt work' over 'bells and whistles'."

Nigel Cunningham, the author of the Suspend2 patches, also has some thoughts on the matter. He has been busily cleaning up the Suspend2 patches with an eye toward making them more palatable for merging into the mainline. It turns out that Nigel has a set of 225 patches which he will soon make available. Since few people have seen the new patch set, it's not clear what sort of reception it will get. It can be said, though, that 225 patches is a large pile of code. Anybody trying to get a patch set of that size merged needs to have some fairly convincing arguments in hand.

At some point, Nigel's code mountain will become available, and some sort of decision will have to be made. Software suspend could be transformed into suspend2, or moved partially to user space. Or it could be left more-or-less as it is now. These are three very distinct choices - especially as nobody wants to see a repeat of the situation where the mainline kernel supported more than one software suspend implementation. With luck, when the dust settles, Linux will have a more featureful and reliable software suspend implementation than it does now. But expect some interesting discussion between now and then.


Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers

  • dmitry pervushin: SPI. (November 11, 2005)


Filesystems and block I/O

Memory management



Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds