
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.5-rc1, which was announced by Linus on March 15. This prepatch includes the incorporation of the netpoll interface (see below), some virtual memory performance improvements, the new "kref" reference counting mechanism (see below), a big ALSA update, a new Prism54 wireless driver, an NFS update, a DMA API change (see below yet again), and many fixes. See the long-format changelog for the details.

2.6.4 was released on March 10; very few fixes went in after the last release candidate. Changes since 2.6.3 include support for the Intel "ia32e" architecture, a UTF-8 tty mode, dynamic PTY allocation, sysfs support for SCSI tapes and bluetooth devices, support for large numbers of groups, a generic kernel thread infrastructure, an HFS filesystem rewrite, an R128 DRI driver security fix, the groundwork for the hotplug CPU code, and many, many fixes. The long-format changelog has the details.

Patches in Linus's BitKeeper repository include several architecture updates, a set of fixes to make the Intermezzo filesystem work again, an IDE update, asynchronous I/O support for reiserfs, and lots of fixes.

The current tree from Andrew Morton is 2.6.5-rc1-mm1. Recent additions to the -mm tree include a plug-and-play subsystem update, a patch to enable 4K kernel stacks on the x86, the per-address-space block queue unplugging code (discussed here last week), an NFS update, a bunch of page cache work ("It seems to work OK here, but I suggest people not rush out and convert all of the corporate finance department's servers to 2.6.4-mm1."), and many fixes.

The current 2.4 kernel is 2.4.25; Marcelo released two 2.4.26 prepatches over the last week. 2.4.26-pre3 included a fair number of architecture and networking fixes; 2.4.26-pre4 (released March 16) is a much smaller patch with just a few fixes.


Kernel development news

The DMA API changes

The 2.6 kernel is a stable series which, in theory, should be dedicated to the fixing of bugs rather than changing APIs. Anybody who risks thinking that things have become too stable, however, need only look at this massive patch from David Miller, which changes the DMA API and touches a full 100 files. This patch had done a little time in the -mm tree, but had never really been discussed on the mailing lists before its inclusion.

The change is in the "synchronization" calls that the DMA layer provides for streaming mappings. A streaming mapping is a short-lived structure set up to support one or more direct memory access operations; depending on the architecture, setting up a streaming mapping can involve creating bounce buffers, programming I/O memory management unit (IOMMU) registers, flushing processor caches, and more. These mappings have strict rules about the "ownership" of the buffer; when a streaming mapping is created, it is owned by the device, and the processor cannot touch it. If a device driver ignores that rule, it risks corrupting data in a number of ways.

It is sometimes necessary, however, to allow the processor to access a mapped streaming DMA buffer. To that end, the DMA layer has long provided a set of functions (like dma_sync_single() and pci_sync_single()) which transfer ownership of the buffer to the CPU. What has always been lacking, however, is a way to transfer ownership back to the device. To fill in that gap, the various synchronization functions have been split in two; instead of dma_sync_single() a driver must now call one or both of:

    void dma_sync_single_for_cpu(struct device *dev,
                                 dma_addr_t dma_handle,
                                 size_t size,
                                 enum dma_data_direction direction);

    void dma_sync_single_for_device(struct device *dev,
                                    dma_addr_t dma_handle,
                                    size_t size,
                                    enum dma_data_direction direction);

dma_sync_single_for_cpu() gives ownership of the DMA buffer back to the processor. After that call, driver code can read or modify the buffer, but the device should not touch it. A call to dma_sync_single_for_device() is required to allow the device to access the buffer again. The other synchronization functions (for scatter/gather and DAC mappings) have been changed as well.
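
The ownership hand-off can be illustrated with a small userspace model. This is only a sketch of the rule described above, not kernel code: the names stream_buf, model_map(), model_sync_for_cpu(), model_sync_for_device() and cpu_write() are invented for illustration, and the model simply refuses CPU access whenever the device owns the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: none of these names are kernel APIs. */
enum buf_owner { OWNER_DEVICE, OWNER_CPU };

struct stream_buf {
    unsigned char data[64];
    enum buf_owner owner;
};

/* Models creating a streaming mapping: the device owns the buffer first. */
static void model_map(struct stream_buf *b)
{
    b->owner = OWNER_DEVICE;
}

/* Stands in for dma_sync_single_for_cpu(): the CPU may now touch the data. */
static void model_sync_for_cpu(struct stream_buf *b)
{
    b->owner = OWNER_CPU;
}

/* Stands in for dma_sync_single_for_device(): ownership returns to the device. */
static void model_sync_for_device(struct stream_buf *b)
{
    b->owner = OWNER_DEVICE;
}

/* CPU-side store; returns 0 on success, -1 if the CPU does not own the buffer. */
static int cpu_write(struct stream_buf *b, size_t off, unsigned char value)
{
    if (b->owner != OWNER_CPU || off >= sizeof(b->data))
        return -1;
    b->data[off] = value;
    return 0;
}
```

In a real driver the two sync calls bracket the CPU's access to the buffer in the same way; the model just makes the "who owns it now" state explicit.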

As might be expected from a change like this, the result was a lot of broken drivers. The patch fixes the in-tree users of the discontinued DMA functions. Out-of-tree and binary-only drivers, however, will have to be fixed separately.


The debut of kref

When Patrick Mochel added the "kobject" type to the 2.5.45 kernel, he described it this way:

This is not meant to be fancy; just something simple for which we can control the refcount and other common functionality using common code.

In the 2.6 kernel, the kobject type has become, via its kset and parent pointers, the glue which holds the entire device model structure together. It is the core object implementing every entry in the sysfs virtual filesystem. Kobjects also handle the generation of hotplug events when devices come and go.

Oh, yes. Kobjects also handle reference counting.

The kobject type has clearly grown past its original mandate into something fairly fancy. To address the needs of kernel hackers who only want a simple reference counter, Greg Kroah-Hartman has created a new type called kref. A kref is, indeed, a simple thing:

	struct kref {
		atomic_t refcount;
		void (*release)(struct kref *kref);
	};

A kref comes with the usual functions one would expect: kref_init() to set it up, and kref_get() and kref_put() to manage the reference count. Once that count drops to zero, the release function is called to clean things up. All told, it's quite simple.
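
For readers who want to see the idiom in action, here is a userspace analogue built on C11 atomics. It is a sketch under assumptions rather than the kernel implementation: the sketch_kref names, the widget structure and its released flag are all hypothetical, and the count is assumed to start at one when the object is initialized.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analogue of the structure above; all names are illustrative. */
struct sketch_kref {
    atomic_int refcount;
    void (*release)(struct sketch_kref *kref);
};

/* Set up the counter with one reference held by the creator (assumption). */
static void sketch_kref_init(struct sketch_kref *k,
                             void (*release)(struct sketch_kref *kref))
{
    atomic_init(&k->refcount, 1);
    k->release = release;
}

/* Take an additional reference. */
static void sketch_kref_get(struct sketch_kref *k)
{
    atomic_fetch_add(&k->refcount, 1);
}

/* Drop a reference; the release function runs when the count reaches zero. */
static void sketch_kref_put(struct sketch_kref *k)
{
    /* atomic_fetch_sub() returns the old value, so 1 means "now zero". */
    if (atomic_fetch_sub(&k->refcount, 1) == 1)
        k->release(k);
}

/* A hypothetical object that embeds the counter. */
struct widget {
    struct sketch_kref ref;
    int released;
};

/* Recover the enclosing widget from the embedded counter and clean up. */
static void widget_release(struct sketch_kref *k)
{
    struct widget *w =
        (struct widget *)((char *)k - offsetof(struct widget, ref));
    w->released = 1;
}
```

The point of the exercise is that the get/put logic lives in one place; an object only supplies its release function.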

In fact, it would appear to be too simple for some kernel hackers, who have questioned whether there is any need for kref at all. Why not simply manipulate a reference count directly with atomic_t operations and avoid adding the space required for the release() pointer to every reference-counted object? The answer that comes back is that buggy reference counting implementations in the kernel are far from unknown, and that the overhead of using kref is tiny. As Andrew Morton put it:

I care more about being able to say "ah, it uses kref. I understand that refcounting idiom, I know it's well debugged and I know that it traps common errors". That's better than "oh crap, this thing implements its own refcounting - I need to review it for the usual errors".

Andrew's approval is sufficient; the kref patch showed up in 2.6.5-rc1.

For the future, Greg has a patch which converts the kobject reference counting mechanism over to krefs. That change may be a harder sell, however; it will expand the size of every kobject in the system (because kobjects, currently, do not store the release() function pointer directly). So that change will wait for 2.7, and may be part of a larger-scale cleanup and refactoring of the kobject type.


Lots of SCSI disks

One of the motivations for increasing the size of the dev_t device number type in 2.6 was to allow the use of huge numbers of SCSI disks. In the 2.6.4 kernel, however, that promise remains unfulfilled; the SCSI subsystem makes no use of the expanded device number range. That will change in 2.6.5, however; a patch has been merged which allows the enumeration of up to 1 million SCSI disks.

The authors of this patch had an interesting problem to solve: they wanted to be able to enumerate all of those disks without breaking existing systems. In other words, all of the existing SCSI device numbers have to work as they do in 2.4 and prior kernels. The solution is expressed in the following function, which turns a device index (the "nth disk") and a partition number into its associated device number:

static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
{
	return  (part & 0xf) | ((sd_nr & 0xf) << 4) |
		(sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0xfff00);
}

LWN readers will, no doubt, immediately understand what is going on here. Your editor, however, had to stare at it for a little while. Then, as a way of avoiding doing real work, he made the following diagram to show how a device index and partition number are transmogrified into a device number.

[SCSI numbering diagram]

The "remap" operation takes four bits from the device index and uses them to index into an array of the 16 major numbers which have been assigned for some time to SCSI disks: 8, 65-71, and 128-135. The lowest four bits of the device index move directly down into the minor number. The result is that the first 256 SCSI disks will get exactly the same major and minor numbers that they have in 2.4 kernels.

Once that space has been exhausted, however, the four red bits in the diagram will return to zero, the major number will go back to 8, the highest-order bits in the device index are routed back into the minor number, and, as a result, the 257th disk will be given device number 8:256. The 273rd disk will advance again to the next major number; it will be given number 65:256. Additional disks will be distributed across the available major numbers indefinitely until their combined power load flips a breaker somewhere.
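
The arithmetic can be checked in userspace. The sketch below copies the make_sd_dev() expression and supplies a table-driven sd_major() built from the major numbers listed above; that table form, and the dev_major() and dev_minor() helpers (which assume the 20-bit minor field implied by the shift), are illustrative and not taken from the patch.

```c
#include <assert.h>

/* Map a 4-bit index onto the 16 SCSI disk majors: 8, 65-71, 128-135. */
static unsigned int sd_major(unsigned int major_idx)
{
    static const unsigned int majors[16] = {
        8, 65, 66, 67, 68, 69, 70, 71,
        128, 129, 130, 131, 132, 133, 134, 135
    };
    return majors[major_idx & 0xf];
}

/* Same expression as in the text: disk index + partition -> device number. */
static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
{
    return (part & 0xf) | ((sd_nr & 0xf) << 4) |
           (sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0xfff00);
}

/* Hypothetical helpers: split a device number, assuming 20 minor bits. */
static unsigned int dev_major(unsigned int dev)
{
    return dev >> 20;
}

static unsigned int dev_minor(unsigned int dev)
{
    return dev & 0xfffff;
}
```

With zero-based indexes, disk 16 lands on 65:0 as in 2.4, disk 256 (the 257th) on 8:256, and disk 272 (the 273rd) on 65:256.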

The result is a scheme which might be a little hard for humans to follow, but, when you are dealing with thousands of disks, that will be the case anyway. Meanwhile, most of the main design goals - support lots of disks without breaking existing systems - have been met. There is one remaining issue, however: some SCSI users have been asking for the ability to have more than 15 partitions on one drive. Supporting a larger partition space and simultaneously preserving compatibility is not currently possible because the block layer expects partitions to be assigned contiguous minor numbers. Fixing that will require tweaks to the gendisk code.


Netpoll is merged

One of the many new things merged into 2.6.5-rc1 is the "netpoll" infrastructure. Netpoll exists to support low-level kernel functions which may need to be able to send and receive packets over the network without involving the entire networking subsystem and without enabling interrupts. Examples include kgdbeth (which allows kernel debugging over the net), and netconsole, which enables remote, network-based consoles. The patches have been around (and in the -mm tree) for some time, but have only now found their way into the mainline. Netconsole was merged as well, but kgdbeth users will still have to apply patches for now.

Supporting netpoll in network drivers turns out to be relatively easy - for most adaptors. There is a new net_device method called poll_controller(); its job is to catch up with whatever the device has been doing. For many devices, this method looks like this:

    static void poll_my_card(struct net_device *dev);

Netpoll, in other words, is simulating device interrupts from within the kernel. Some device interrupt handlers may need tweaks to ensure that they do all of the necessary work without a real hardware interrupt, but most seem to work as they are.
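
A common way to write such a method is to mask the device's interrupt, call the driver's ordinary interrupt handler by hand, and unmask the interrupt again. The following is a userspace model of that pattern, not driver code: model_net_device, the disable_irq_model() and enable_irq_model() stand-ins, and the simplified two-argument handler are all assumptions made so the sketch can run outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for a network device; every field is hypothetical. */
struct model_net_device {
    int irq;
    int irq_masked;          /* nonzero while the IRQ line is "disabled" */
    int pending;             /* events the hardware has queued up */
    int handled;             /* events the handler has processed */
    int handled_while_masked;
};

/* Stand-ins for the kernel's disable_irq() and enable_irq(). */
static void disable_irq_model(struct model_net_device *dev)
{
    dev->irq_masked++;
}

static void enable_irq_model(struct model_net_device *dev)
{
    dev->irq_masked--;
}

/* The driver's normal interrupt handler (signature simplified). */
static void my_card_interrupt_handler(int irq, void *dev_id)
{
    struct model_net_device *dev = dev_id;

    (void)irq;
    if (dev->irq_masked)
        dev->handled_while_masked += dev->pending;
    dev->handled += dev->pending;
    dev->pending = 0;
}

/* Polling method: simulate an interrupt without a real one arriving. */
static void poll_my_card(struct model_net_device *dev)
{
    disable_irq_model(dev);
    my_card_interrupt_handler(dev->irq, dev);
    enable_irq_model(dev);
}
```

Masking the interrupt around the call keeps a real interrupt from re-entering the handler while the poll is in progress.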


Which is the real software suspend?

Laptop users may well have noticed that there are no less than three competing software suspend implementations for the 2.6 kernel. Two of them (pmdisk and swsusp) are in the kernel itself; the third (swsusp2) is not, but is also the implementation which has seen the most work over the last several months. Unfortunately, none of these implementations could be said to be production-level code. It is possible to make a Linux system suspend to disk and resume into something that still runs, but making it work is not yet for the faint of heart.

The software suspend discussion began anew when Pavel Machek, the maintainer of the in-kernel swsusp code, asked where things were going. Pavel's preference, not surprisingly, would be to remove the pmdisk code and stick with swsusp. Pavel is not alone in feeling this way. The pmdisk implementation is a fork of the swsusp code created by Patrick Mochel, who was not enjoying good relations with Pavel at the time. By some accounts, the pmdisk code is better, but it suffers from a major problem: Patrick has gotten a new job and has vanished from the kernel development world. As a result, pmdisk has seen no development work for several months, and it is a rare user who can make it work reliably. Unless Patrick surfaces and starts working on the code again, it is likely to go away fairly soon.

The real question is what to do about swsusp2. This version of the suspend code has seen significant work by Nigel Cunningham and others. It has a number of features that others lack: the ability to abort a suspend operation, a "nice display," compression of the saved image (which can speed suspends and resumes on systems with slow disks), etc. The real difference, though, is that swsusp2 is, for many people, the only version that works at all reliably. So there is some real desire to see the swsusp2 work merged into 2.6, and further development efforts concentrated there.

The hangup seems to be the fact that the swsusp2 patch is large, and it touches a great many core files. Many of those changes are aimed at making the "refrigerator" work better. Before a system can be suspended, all processes must be put into a quiet, known state. This works by setting a "freeze" flag and sending a signal to every process telling it to put itself into the refrigerator. Once all processes are nicely chilled, the system can save its state and suspend itself.

Processes will not refrigerate themselves immediately; they must first get to a point where they hold no important resources. Sometimes, a process must get something from another process before it can be refrigerated; the example that is often raised is a process waiting for a response from an NFS server process. If the NFS server is refrigerated first, the other process will never get to where it can be frozen, and the suspend operation will fail. To avoid this sort of situation, the swsusp2 developers have gone to great lengths to identify places where a process should not, yet, be refrigerated. The result is a great many macros with names like SWSUSP_ACTIVITY_STARTING sprinkled widely through the code. If software suspend is not configured into the kernel, these macros simply vanish, so the actual changes to the core kernel are smaller than a look at a simple diffstat listing would indicate. Swsusp2 remains a large patch, however.
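
The ordering problem can be shown with a toy model. Nothing here comes from the swsusp2 patch: model_task, its waits_on field and freeze_in_order() are invented for illustration, and the model assumes a task blocked on an already-frozen task can never reach the refrigerator.

```c
#include <assert.h>

/* Toy task: waits_on is the index of a task it needs an answer from, or -1. */
struct model_task {
    int waits_on;
    int frozen;
};

/*
 * Try to refrigerate tasks in the given order. A task that is waiting on an
 * already-frozen task is stuck and never freezes. Returns 1 if every task
 * ended up frozen (suspend can proceed), 0 otherwise.
 */
static int freeze_in_order(struct model_task *tasks, const int *order, int n)
{
    int frozen = 0;

    for (int i = 0; i < n; i++)
        tasks[i].frozen = 0;

    for (int i = 0; i < n; i++) {
        struct model_task *t = &tasks[order[i]];

        if (t->waits_on >= 0 && tasks[t->waits_on].frozen)
            continue;   /* its server is already chilled: stuck */
        t->frozen = 1;
        frozen++;
    }
    return frozen == n;
}
```

With task 0 as the NFS server and task 1 as a client waiting on it, freezing the server first fails while freezing the client first succeeds, which is the case the extra macros are meant to steer around.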

Nigel has offered to provide a version of swsusp2 which lacks the intrusive refrigerator changes, though he warns that it will eventually become clear that those changes are needed. Andrew Morton has indicated that this would be a step in the right direction, but he is asking for more:

Even happier would be a series of small, well explained patches which bring swsusp into a final shape upon which more than one developer actually agrees.

These wholesale replacements and deletions are an indication that something has gone wrong with the development process here.

What clearly needs to happen is that the swsusp2 work needs to be broken down into a long series of patches of the type that the kernel developers like to see: small and focused. That will be a significant effort, and the swsusp2 developers appear to lack the time to do that anytime soon. Now, perhaps, is the time for people who are concerned about a working software suspend solution (which Linux really does need) to get together to bring an end to the current, confused situation.


Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds