Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.14, released on October 27. A very small number of patches went in since 2.6.14-rc5. Major changes in 2.6.14 include a new version of the wireless extensions, the HostAP system (which allows a Linux system to function as a wireless access point), relayfs, the DCCP network protocol, the filesystems in user space patch, v9fs, securityfs, and more.

The first 2.6.15 prepatch has not yet been released, and may not be until the window for new features closes. A major pile of patches has been merged into the mainline git repository; see the separate article, below, for a list of some of the more interesting ones.

There have been no -mm releases in the last week.

The current 2.4 prepatch is 2.4.32-rc2, released by Marcelo on Halloween. It contains a small set of fixes, mostly in the networking subsystem.

Kernel development news

Quote of the week

It'd help more if people focused more on testing their own shit before submitting it than complaining about -mm. If it's the same people breaking the tree all the time, I'm sure we can find a recycled set of stocks somewhere.

-- Martin Bligh

The newest development model and 2.6.14

The 2005 Kernel Summit made some tweaks to the kernel development model with the aim of producing higher-quality releases in a more timely manner. To that end, it was said that major changes would only be allowed during the first two weeks of each development cycle; after that, only bug fixes could go in. The hope was that this rule would eliminate destabilizing patches late in the cycle and concentrate developers' minds on making things work.

The 2.6.14 kernel is the first to go through the entire cycle since the kernel summit. This kernel, released on October 27, came out almost exactly two months after 2.6.13, which showed up on August 29. That is relatively fast by 2.6 standards, but still too slow for some developers. The complainers feel that the freeze period puts too much of a damper on development, and that, somehow, the kernels should come out faster.

2.6.14 would have come out sooner were it not for a final delay to fix some remaining bugs (some of which turned out not to be real). Linus, however, is pretty happy with how 2.6.14 worked. A number of significant changes were merged, but regressions in the released kernels seem to be within reasonable limits. As a result, Linus doesn't see the need to make further changes to the process at this time:

So I'm planning on continuing with it unchanged for now. Two-week merge window until -rc1, and then another -rc kernel roughly every week until release. With the goal being 6 weeks, and 8 weeks being ok.

Andrew Morton, meanwhile, has an answer for those who think the development cycle is still too long:

a) you're sitting around feeling very very very bored while

b) the kernel is in long freeze due to the lack of kernel developer attention to known bugs

The solution seems fairly obvious to me?

It was pointed out that many bugs relate to hardware which most developers do not have. The response was that sometimes developers have to talk to users who encounter bugs and try to track them down anyway. In any case, the ongoing effort to get developers to fix bugs seems likely to be necessary for some time to come.

One other branch of the discussion, meanwhile, took on the question of whether the kernel has gotten too big. Prompted initially by Roman Zippel, Andrew Morton did some compile tests and came out with some disturbing numbers: the size of kernels with similar configurations went from about 600K (2.5.71) to over 800K (2.6.8). He also noted that the use of a current version of gcc adds almost 100K to the final kernel size when compared to gcc 2.95.4. Clearly, some serious inflation is going on somewhere.

Except that it's not quite so clear. Adrian Bunk demonstrated that, by using the -Os compile option (which instructs gcc to optimize for size), current compilers can make kernels which are quite a bit smaller than those made with the old 2.95 release. The resulting discussion suggests that the kernel developers may try making -Os the default for kernel builds in the future. Fedora already builds its kernels this way. The interesting thing is that, in the past, kernels built with -Os have often performed as well as (or even better than) those optimized for speed. Cache effects have a huge impact on kernel performance, and a smaller kernel is more cache friendly.

Compiler issues aside, there truly has been some growth in the kernel. Linus is not surprised by this:

On the other hand, I do believe that bloat is a fact of life.... The fact is, we do do more, and we're more complex. Our VM is a _lot_ more complex, and our VFS layer has grown a lot due to all the support it has for situations that simply weren't an issue before. And even when not used, that support is there.

Expect an increase in de-bloating work in the near future. In some areas, this work has been ongoing for a while - consider, for example, the effort to shrink the sk_buff structure used to represent packets in the networking subsystem. For a more extreme example, see Matt Mackall's SLOB allocator, a replacement for the slab subsystem which is much smaller, but which does not perform as well on larger systems. SLOB is not for everybody (it's mainly intended for embedded systems), but it almost certainly foreshadows a surge in Linux weight reduction patches.

What's going into 2.6.15

The release of the 2.6.14 kernel opened the door for new changes. Many developers have been quick to submit their patches, with the result that nearly 2000 commits have been merged for 2.6.15. The door will remain open for two weeks - until around November 11 - at which point the kernel should return to stabilization mode.

Many of the patches merged are fixes, and quite a few of them are in architecture-specific code. Among the rest, however, are the following, starting with user-visible changes:

  • An update to the generic 802.11 code which includes, among other things, quality-of-service support, the ability to use hardware crypto and fragmentation offload functions, and "wireless spy" support.

  • A driver for Marvell serial ATA controllers. There is also a new "ATA passthrough" ioctl() allowing arbitrary ATA commands to be sent to devices.

  • The old "bluetty" driver has been removed. Everybody should be using the bluez stack for Bluetooth devices at this point.

  • As a result of the device model changes, the 2.6.15 kernel will require version 071 (or higher) of the udev utility.

  • A new uevent device attribute in sysfs can be used to manually force the creation of a hotplug event for an existing device. This feature can be used to regenerate hotplug events for devices which were present when the system was booted.

  • The PowerPC 4xx on-chip Ethernet driver has been replaced with a completely rewritten, more efficient version.

  • A new driver for the Freescale Ethernet devices found in some embedded systems.

  • Support for the old Cobalt servers has been restored.

  • Basic support for hot-pluggable memory.

  • A big NTFS rework with much-improved write support.

  • A big InfiniBand update, with support for a wider range of userspace verbs.

  • Support for ARM "RealView" boards.

  • A large CIFS filesystem update, with support for change notifications, mounting from "legacy" servers, case-independent file names, and more.

  • DRM support for Radeon PCI Express cards.

API changes and other internal patches visible to kernel developers include:

  • The nested class devices patch and associated input subsystem patches. For those who are curious about where the device model work will go from here, Greg Kroah-Hartman has posted a roadmap on his weblog.

  • More conversions of internal function prototypes to use the gfp_t type introduced in 2.6.14.

  • A number of block layer patches, including a rework of the elevator switch code and the generic dispatch queue patch. The new I/O barrier code has not been merged as of this writing.

  • A big rework of the remote procedure call code, and a number of associated NFS updates.

  • Some power management changes, including a driver API change; see this article for details.

  • A new mechanism allowing code to be notified when USB busses and devices come and go. Drivers do not normally need to use these notifiers, but some of the core code benefits from them.

  • The driver model class "interface" add() and remove() methods have picked up a new parameter: a pointer to the actual interface structure.

  • There is a new reader/writer semaphore function rwsem_is_locked(), which tests whether the rwsem is read locked without blocking.

  • There is a new variant of vmalloc():

        void *vmalloc_node(unsigned long size, int node);

    As one might expect, it allocates memory on a specific NUMA node.

  • The "reserved" bit for memory pages - used to mark pages which are not managed by the kernel page allocator (kernel text, non-memory areas, etc.) - has been all but removed. No core code uses it now, with the exception of software suspend, and that will get fixed eventually. There are reports that this change breaks VMware.

  • A set of Linux security module hooks for the (relatively) new key management functions.

  • A new kernel thread function:

        int kthread_stop_sem(struct task_struct *kt, struct semaphore *s);

    This function will stop a kernel thread which might be waiting on the given semaphore.

  • A "torture test" module for the read-copy-update mechanism.

Stay tuned: there is still time for quite a few more changes to be merged before the 2.6.15 window closes.

Some power management changes for 2.6.15

The 2.6.15 kernel will bring with it a few changes to the power management API. The first of these has to do with the suspend() and resume() methods found in struct device_driver. These methods used to be called three times for each suspend and resume operation, in order to maintain compatibility with an older version of the API. The new versions are called once, and have different prototypes:

    int (*suspend) (struct device *dev, pm_message_t state);
    int (*resume) (struct device *dev);

This change required updates to a fair number of drivers, so the patch is relatively large.
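To make the shape of the new methods concrete, here is a minimal userspace mock; it is not kernel code. The type definitions are simplified stand-ins for the real ones, and every toy_* name, along with the event value passed in, is hypothetical. The sketch assumes only what is stated above: each method is called once per transition.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; the real definitions differ. */
typedef struct { int event; } pm_message_t;

struct device {
    void *driver_data;              /* driver-private state */
};

struct device_driver {
    int (*suspend)(struct device *dev, pm_message_t state);
    int (*resume)(struct device *dev);
};

/* Hypothetical per-device state kept by the toy driver. */
struct toy_state {
    int powered;
    int suspend_calls;
};

static int toy_suspend(struct device *dev, pm_message_t state)
{
    struct toy_state *ts = dev->driver_data;

    (void)state;                    /* a real driver would inspect this */
    ts->suspend_calls++;
    ts->powered = 0;
    return 0;
}

static int toy_resume(struct device *dev)
{
    struct toy_state *ts = dev->driver_data;

    ts->powered = 1;
    return 0;
}

static struct device_driver toy_driver = {
    .suspend = toy_suspend,
    .resume  = toy_resume,
};

/* Stand-in for the core: invoke each method exactly once per transition. */
static int toy_suspend_resume_cycle(struct device_driver *drv,
                                    struct device *dev)
{
    pm_message_t msg = { .event = 1 };  /* hypothetical event value */
    int err;

    assert(drv && dev);
    err = drv->suspend ? drv->suspend(dev, msg) : 0;
    if (err)
        return err;
    return drv->resume ? drv->resume(dev) : 0;
}
```

After one pass through toy_suspend_resume_cycle(), the toy driver has seen a single suspend() call and the device is powered again.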

The other change is for devices which can supply "wakeup events" to the kernel. These devices include network adapters with "wake-on-LAN" capability, keyboards, and simple power switches. The power management core has been reworked to enable these devices to perform their wakeup functions while providing overall control to the system administrator.

The dev_pm_info structure (found inside struct device) has gotten two new, single-bit fields. Drivers for devices which can create wakeup events should set the can_wakeup field to one. The actual issuance of such events, however, should be controlled by the may_wakeup field. If that field is zero, the power management core has decreed that wakeups should not be issued. A device_may_wakeup() helper function has been added to make testing the may_wakeup bit easy.
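The interplay between the two bits can be sketched in a few lines of userspace C. This is a mock rather than the kernel's definition: the structures are cut down to the two fields named above, the helper is assumed to require both bits (the text only says it tests may_wakeup), and toy_should_arm_wakeup() is a hypothetical driver-side check.

```c
#include <assert.h>
#include <stdbool.h>

/* Cut-down stand-in holding only the two new single-bit fields. */
struct dev_pm_info {
    unsigned can_wakeup:1;   /* set by the driver: device can signal wakeups */
    unsigned may_wakeup:1;   /* set by the PM core: wakeups are permitted */
};

struct device {
    struct dev_pm_info power;
};

/*
 * Mock of the helper. Assumption: a device which cannot generate
 * wakeup events is never reported as allowed to issue them.
 */
static bool device_may_wakeup(const struct device *dev)
{
    assert(dev);
    return dev->power.can_wakeup && dev->power.may_wakeup;
}

/* Hypothetical check a driver might make before arming wake-on-LAN. */
static bool toy_should_arm_wakeup(const struct device *dev)
{
    return device_may_wakeup(dev);
}
```

A driver would set can_wakeup once, at probe time, and consult the helper each time it is about to put the device to sleep.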

The patch adds a new wakeup field in sysfs. When read, it will return enabled or disabled (or an empty string if the device is not capable of generating wakeup events at all). The system administrator can also write a new value to allow (or disallow) the generation of wakeup events from the device.
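The read and write behavior of that attribute might be modeled as follows. This is a userspace sketch with hypothetical names; the strings returned on read come from the description above, while the strings accepted on write and the -1 error return are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the two wakeup bits. */
struct toy_pm_bits {
    unsigned can_wakeup:1;
    unsigned may_wakeup:1;
};

/* What reading the attribute would return. */
static const char *toy_wakeup_show(const struct toy_pm_bits *pm)
{
    assert(pm);
    if (!pm->can_wakeup)
        return "";                   /* device cannot wake the system */
    return pm->may_wakeup ? "enabled" : "disabled";
}

/*
 * Handle a write from the administrator. Returns 0 on success, or -1
 * for unknown input or a device which cannot generate wakeups.
 * Assumption: writes use the same two words that reads return.
 */
static int toy_wakeup_store(struct toy_pm_bits *pm, const char *buf)
{
    assert(pm && buf);
    if (!pm->can_wakeup)
        return -1;
    if (strcmp(buf, "enabled") == 0)
        pm->may_wakeup = 1;
    else if (strcmp(buf, "disabled") == 0)
        pm->may_wakeup = 0;
    else
        return -1;
    return 0;
}
```

The point of the split is visible here: the driver owns can_wakeup, while the administrator's write only ever touches may_wakeup.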

The driver core code has been merged, along with support for wakeups from USB devices. As of this writing, however, the PCI wakeup code has some outstanding issues with G5 systems which have prevented it from going into the mainline.

Fragmentation avoidance

Mel Gorman's fragmentation avoidance patches were covered here last February. This patch set divides all memory allocations into three categories: "user reclaimable," "kernel reclaimable," and "kernel non-reclaimable." The idea is to support multi-page contiguous allocations by grouping reclaimable allocations together. If no contiguous memory ranges are available, one can be created by forcing out reclaimable pages. Since non-reclaimable pages have been segregated into their own area, the chance of such a page blocking the creation of a contiguous set of free pages is relatively small.
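The grouping idea can be shown with a toy model. This is not how the patch set is implemented (its internals are not described here); it assumes, purely for illustration, that memory is carved into fixed-size blocks and that each block serves one allocation category at a time. All names and sizes are hypothetical.

```c
#include <assert.h>

/* The three categories named above, plus a marker for unclaimed blocks. */
enum toy_type { TOY_FREE = 0, TOY_USER_RCLM, TOY_KERN_RCLM, TOY_KERN_NORCLM };

#define TOY_BLOCKS          4
#define TOY_PAGES_PER_BLOCK 4

struct toy_mem {
    enum toy_type block_type[TOY_BLOCKS];  /* category owning each block */
    int used[TOY_BLOCKS];                  /* pages in use in each block */
};

/* Allocate one page of the given category; returns its block, or -1. */
static int toy_alloc_page(struct toy_mem *m, enum toy_type type)
{
    int i;

    assert(m && type != TOY_FREE);
    /* First choice: a block already dedicated to this category. */
    for (i = 0; i < TOY_BLOCKS; i++) {
        if (m->block_type[i] == type && m->used[i] < TOY_PAGES_PER_BLOCK) {
            m->used[i]++;
            return i;
        }
    }
    /* Otherwise claim an entirely free block for this category. */
    for (i = 0; i < TOY_BLOCKS; i++) {
        if (m->block_type[i] == TOY_FREE) {
            m->block_type[i] = type;
            m->used[i] = 1;
            return i;
        }
    }
    return -1;   /* no fallback between categories in this toy */
}

/* A block can be emptied on demand unless it holds non-reclaimable pages. */
static int toy_block_reclaimable(const struct toy_mem *m, int block)
{
    assert(m && block >= 0 && block < TOY_BLOCKS);
    return m->block_type[block] != TOY_KERN_NORCLM;
}
```

Interleaved requests for user-reclaimable and non-reclaimable pages land in different blocks, so the reclaimable block can later be cleared out as a whole.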

Mel recently posted version 19 of the fragmentation avoidance patch and requested that it be included in the -mm kernel. That request started a lengthy discussion on whether this patch set is a good idea or not. There is, it seems, a fair amount of uncertainty over whether this code belongs in the kernel. There are a few reasons for wanting fragmentation avoidance, and the arguments differ for each of them.

The first of these reasons is to increase the probability that high-order (multi-page) allocations in the kernel will succeed. Nobody denies that Mel's patch achieves that goal, but there are developers who claim that a better approach is to simply eliminate any such allocations. In fact, most multi-page allocations were dealt with some time ago. A few remain, however, including the two-page kernel stacks still used by default on most systems. When the kernel stack allocation fails, it blocks the creation of a new process. The kernel may eventually move to single-page stacks in all situations, but a few higher-order allocations will remain. It is not always possible to break required memory into single-page chunks.
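Why a two-page allocation can fail with plenty of memory free is easy to demonstrate. The sketch below is a userspace toy, not the kernel's allocator: it scans a hypothetical page-usage map for a free run of 2^order pages, and it assumes such a run must start on a multiple of its own size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first free, naturally aligned run of
 * 2^order pages, or -1 if there is none. A nonzero in_use[i]
 * means that page i is taken.
 */
static long toy_find_free_run(const unsigned char *in_use, size_t npages,
                              unsigned order)
{
    size_t run = (size_t)1 << order;
    size_t start, i;

    assert(in_use && order < 16);
    for (start = 0; start + run <= npages; start += run) {
        for (i = 0; i < run; i++)
            if (in_use[start + i])
                break;
        if (i == run)
            return (long)start;   /* every page in the run is free */
    }
    return -1;
}
```

With the map {1,0,1,0,1,0,1,0}, half of memory is free, yet an order-1 request (two pages, as for a kernel stack) finds nothing. With {1,1,1,1,0,0,0,0} the same amount of free memory satisfies order-1 and order-2 requests alike.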

The next reason, strongly related to the first, is huge pages. The huge page mechanism is used to improve performance for certain applications on large systems; there are few users currently, but that could change if huge pages were easier to work with. Huge pages cannot be allocated for applications in the absence of a large - and suitably aligned - region of contiguous memory. In practice, they are very difficult to create on systems which have been running for any period of time. Failure to allocate a huge page is relatively benign; the application simply has to get by with regular pages and take the performance hit. But, given that you have a huge page mechanism, making it work more reliably would be worthwhile.

The fragmentation avoidance patches can help with both high-order allocations and huge pages. There is some debate over whether they are the right solution to the problem, however. The often-discussed alternative would be to create one or more new memory zones set aside for reclaimable memory. This approach would make use of the zone system already built into the kernel, thus avoiding the creation of a new layer. A zone-based system might also avoid the perceived (though somewhat unproven) performance impact of the fragmentation avoidance patches. Given that this impact is said to be felt in that most crucial of workloads - kernel compiles - this argument tends to resonate with the kernel developers.

The zone-based approach is not without problems, however. Memory zones, currently, are static; as a result, somebody would have to decide how to divide memory between the reclaimable and non-reclaimable zones. This adjustment looks like it would be hard to get right in any sort of reliable way. In the past, the zone system has also been the source of a number of performance problems, mostly related to balancing of allocations between the zones. Increasing the complexity of the zone system and adding more zones could well bring those problems back.

There is another motivation for fragmentation avoidance which brings a different set of constraints: support for hot-pluggable memory. This feature is useful on high-availability systems, but it is also heavily used in association with virtualization. A host running a number of virtualized Linux instances can, by way of the hotplug mechanism, shift its memory resources between those instances in response to the demands of each.

Before memory can be removed from a running system, its contents must be moved elsewhere - at least, if one wants to still have a running system afterward. The fragmentation avoidance patches can help by putting only reclaimable allocations in the parts of memory which might be removed. As long as all the pages in a region can be reclaimed, that region is removable.

A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to support hot-pluggable memory be able to provide a 100% success rate. The current code need not live up to that metric, but there needs to be a clear path toward that goal. Otherwise, the kernel developers risk advertising a feature which they may not ever be able to support in a reliable way. The backers of fragmentation avoidance would like to merge the patches, solving 90% of the problem, and leave the other 90% for later. Ingo, instead, fears that second 90%, and wants to know how it will get done.

Why can't the current patches offer 100% reliability if they only put reclaimable memory in hot-pluggable regions? There are ways to lock down pages which were once reclaimable; these include DMA operations and pages explicitly locked by user space. There is also the issue of what happens when the kernel runs out of non-reclaimable memory. Rather than fail a non-reclaimable allocation attempt, the kernel will allocate a page from the reclaimable region. This fallback is necessary to avoid inflicting reliability problems on the rest of the kernel. But the presence of a non-reclaimable page in a reclaimable region will prevent the system from vacating that region.
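The fallback problem can be reduced to a small userspace model. Everything here is hypothetical (two regions, simple counters, the toy_* names); it only encodes the behavior described above, in which a non-reclaimable request spills into the reclaimable region rather than failing, after which that region can no longer be vacated.

```c
#include <assert.h>

struct toy_region {
    int free_pages;
    int pinned_pages;   /* non-reclaimable pages living in this region */
};

struct toy_system {
    struct toy_region norclm;   /* meant for non-reclaimable allocations */
    struct toy_region rclm;     /* meant to stay removable */
};

/* Allocate one non-reclaimable page; returns 0, or -1 if memory is gone. */
static int toy_alloc_norclm(struct toy_system *s)
{
    assert(s);
    if (s->norclm.free_pages > 0) {
        s->norclm.free_pages--;
        s->norclm.pinned_pages++;
        return 0;
    }
    /* Fallback: rather than fail, take a page from the other region. */
    if (s->rclm.free_pages > 0) {
        s->rclm.free_pages--;
        s->rclm.pinned_pages++;
        return 0;
    }
    return -1;
}

/* A region can be vacated (and unplugged) only if nothing in it is pinned. */
static int toy_region_removable(const struct toy_region *r)
{
    assert(r);
    return r->pinned_pages == 0;
}
```

Starting with one free page in the non-reclaimable region, the first allocation leaves the reclaimable region removable; the second succeeds only by falling back, and a single pinned page is enough to make that region unremovable.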

This problem can be solved by getting rid of non-reclaimable allocations altogether. And that can be done by changing how the kernel's address space works. Currently, the kernel runs in a single, contiguous virtual address space which is mapped directly onto physical memory - often using a single, large page table entry. (The vmalloc() region is a special exception, but it is not an issue here). If the kernel were, instead, to use normal-sized pages like the rest of the system, its memory would no longer need to be physically contiguous. Then, if a kernel page gets in the way, it can simply be moved to a more convenient location.

Beyond the fact that this approach fundamentally changes the kernel's memory model, there are a couple of little issues with it. There would be a performance hit caused by the higher translation buffer use, and an increase in the amount of memory needed to store the kernel's page tables. Certain kernel operations - DMA in particular - cannot tolerate physical addresses which might change at arbitrary times. So there would have to be a new API where drivers could request physically-nailed regions - and be told by the kernel to give them up. In other words, breaking up the kernel's address space opens a substantial barrel of worms. It is not the sort of change which would be accepted in the absence of a fairly strong motivation, and it is not clear that hot-pluggable memory is a sufficiently compelling cause.

So no conclusions have been reached on the inclusion of the fragmentation avoidance patches. In the short term, Andrew Morton's controversy avoidance mechanisms are likely to keep the patch out of the -mm tree, however. But there are legitimate reasons for wanting this capability in the kernel, and the issue is unlikely to go away. Unless somebody comes up with a better solution, it could be hard to keep Mel's patch out forever.

Levanta's MapFS released

Once upon a time, kernel developers would post their contributions on the linux-kernel mailing list. Now they issue press releases instead. Along those lines, Levanta (the company once known as Linuxcare) has announced the availability of MapFS. This GPL-licensed module allows a read-only filesystem to be mounted locally for write access, with any changes being kept on the local system. It looks like another implementation of the "translucent filesystem" idea.

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds