Kernel development
Kernel release status
The current 2.6 kernel is 2.6.14, released on October 27. A very small number of patches went in since 2.6.14-rc5. Major changes in 2.6.14 include a new version of the wireless extensions, the HostAP system (which allows a Linux system to function as a wireless access point), relayfs, the DCCP network protocol, the filesystems in user space patch, v9fs, securityfs, and more.
The first 2.6.15 prepatch has not yet been released, and may not be until the window for new features closes. A major pile of patches has been merged into the mainline git repository; see the separate article, below, for a list of some of the more interesting ones.
There have been no -mm releases in the last week.
The current 2.4 prepatch is 2.4.32-rc2, released by Marcelo on Halloween. It contains a small set of fixes, mostly in the networking subsystem.
Kernel development news
Quote of the week
It'd help more if people focused more on testing their own shit
before submitting it than complaining about -mm. If it's the same
people breaking the tree all the time, I'm sure we can find a
recycled set of stocks somewhere.
-- Martin Bligh
The newest development model and 2.6.14
The 2005 Kernel Summit made some tweaks to the kernel development model with the aim of producing higher-quality releases in a more timely manner. To that end, it was said that major changes would only be allowed during the first two weeks of each development cycle; after that, only bug fixes could go in. The hope was that this rule would eliminate destabilizing patches late in the cycle and concentrate developers' minds on making things work.
The 2.6.14 kernel is the first to go through the entire cycle since the kernel summit. This kernel, released on October 27, came out almost exactly two months after 2.6.13, which showed up on August 29. That is relatively fast by 2.6 standards, but still too slow for some developers. The complainers feel that the freeze period puts too much of a damper on development, and that, somehow, the kernels should come out faster. 2.6.14 would have come out sooner were it not for a final delay to fix some remaining bugs (some of which turned out not to be real).
Linus, however, is pretty happy with how 2.6.14 worked. A number of significant changes were merged, but regressions in the released kernels seem to be within reasonable limits. As a result, Linus doesn't see the need to make further changes to the process at this time:
So I'm planning on continuing with it unchanged for now. Two-week
merge window until -rc1, and then another -rc kernel roughly every
week until release. With the goal being 6 weeks, and 8 weeks being
ok.
Andrew Morton, meanwhile, has an answer for those who think the development cycle is still too long:
a) you're sitting around feeling very very very bored while
b) the kernel is in long freeze due to the lack of kernel developer attention to known bugs
The solution seems fairly obvious to me?
It was pointed out that many bugs relate to hardware which most developers do not have. The response was that sometimes developers have to talk to users who encounter bugs and try to track them down anyway. In any case, the ongoing effort to get developers to fix bugs seems likely to be necessary for some time to come.
One other branch of the discussion, meanwhile, took on the question of whether the kernel has gotten too big. Prompted initially by Roman Zippel, Andrew Morton did some compile tests and came out with some disturbing numbers: the size of kernels with similar configurations went from about 600K (2.5.71) to over 800K (2.6.8). He also noted that the use of a current version of gcc adds almost 100K to the final kernel size when compared to gcc 2.95.4. Clearly, some serious inflation is going on somewhere.
Except that it's not quite so clear. Adrian Bunk demonstrated that, by using the -Os compile option (which instructs gcc to optimize for size), current compilers can make kernels which are quite a bit smaller than those made with the old 2.95 release. The resulting discussion suggests that the kernel developers may try making -Os the default for kernel builds in the future. Fedora already builds its kernels this way. The interesting thing is that, in the past, kernels built with -Os have often performed as well as (or even better than) those optimized for speed. Cache effects have a huge impact on kernel performance, and a smaller kernel is more cache friendly.
Compiler issues aside, there truly has been some growth in the kernel. Linus is not surprised by this:
On the other hand, I do believe that bloat is a fact of life....
The fact is, we do do more, and we're more complex. Our VM is a _lot_
more complex, and our VFS layer has grown a lot due to all the
support it has for situations that simply weren't an issue
before. And even when not used, that support is there.
Expect an increase in de-bloating work in the near future. In some areas, this work has been ongoing for a while - consider, for example, the effort to shrink the sk_buff structure used to represent packets in the networking subsystem. For a more extreme example, see Matt Mackall's SLOB allocator, a replacement for the slab subsystem which is much smaller, but which does not perform as well on larger systems. SLOB is not for everybody (it's mainly intended for embedded systems), but it almost certainly foreshadows a surge in Linux weight reduction patches.
What's going into 2.6.15
The release of the 2.6.14 kernel opened the door for new changes. Many developers have been quick to submit their patches, with the result that nearly 2000 commits have been merged for 2.6.15. The door will remain open for two weeks - until around November 11 - at which point the kernel should return to stabilization mode.
Many of the patches merged are fixes, and quite a few of them are in architecture-specific code. Among the rest, however, are the following, starting with user-visible changes:
API changes and other internal patches visible to kernel developers include:
Stay tuned: there is still time for quite a few more changes to be merged before the 2.6.15 window closes.
Some power management changes for 2.6.15
The 2.6.15 kernel will bring with it a few changes to the power management API. The first of these has to do with the suspend() and resume() methods found in struct device_driver. These methods used to be called three times for each suspend and resume operation, in order to maintain compatibility with an older version of the API. The new versions are called once, and have different prototypes:
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
This change required updates to a fair number of drivers, so the patch is relatively large.
The other change is for devices which can supply "wakeup events" to the kernel. These devices include network adapters with "wake-on-LAN" capability, keyboards, and simple power switches. The power management core has been reworked to enable these devices to perform their wakeup functions while providing overall control to the system administrator.
The dev_pm_info structure (found inside struct device) has gotten two new, single-bit fields. Drivers for devices which can create wakeup events should set the can_wakeup field to one. The actual issuance of such events, however, should be controlled by the may_wakeup field. If that field is zero, the power management core has decreed that wakeups should not be issued. A device_may_wakeup() helper function has been added to make testing the may_wakeup bit easy.
The patch adds a new wakeup field in sysfs. When read, it will return enabled or disabled (or an empty string if the device is not capable of generating wakeup events at all). The system administrator can also write a new value to allow (or disallow) the generation of wakeup events from the device.
The driver core code has been merged, along with support for wakeups from USB devices. As of this writing, however, the PCI wakeup code has some outstanding issues with G5 systems which have prevented it from going into the mainline.
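The wakeup rules described above can be illustrated with a small userspace sketch. This is not kernel code: the structures below are stand-ins which contain only the fields named in the text, and example_suspend(), example_resume(), wakeup_show(), wakeup_store(), and the wake_armed field are hypothetical names invented for the illustration. The mock device_may_wakeup() also checks can_wakeup, which is an assumption of this sketch rather than a statement about the real helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; layouts are assumptions. */
typedef struct { int event; } pm_message_t;

struct dev_pm_info {
    unsigned int can_wakeup:1;  /* driver: hardware can raise wakeup events */
    unsigned int may_wakeup:1;  /* PM core/administrator: events are allowed */
};

struct device {
    struct dev_pm_info power;
    int wake_armed;             /* hypothetical driver state for this sketch */
};

/* Mock of the helper: may the device issue wakeup events right now? */
int device_may_wakeup(const struct device *dev)
{
    assert(dev != NULL);
    return dev->power.can_wakeup && dev->power.may_wakeup;
}

/* New-style suspend callback: called once, with the two-argument prototype. */
int example_suspend(struct device *dev, pm_message_t state)
{
    (void)state;                              /* unused in this sketch */
    dev->wake_armed = device_may_wakeup(dev); /* arm wakeups only if allowed */
    return 0;
}

/* New-style resume callback: called once, takes only the device. */
int example_resume(struct device *dev)
{
    assert(dev != NULL);
    dev->wake_armed = 0;
    return 0;
}

/* Reading the sysfs wakeup field: "enabled", "disabled", or empty. */
const char *wakeup_show(const struct device *dev)
{
    assert(dev != NULL);
    if (!dev->power.can_wakeup)
        return "";
    return dev->power.may_wakeup ? "enabled" : "disabled";
}

/* Writing the sysfs wakeup field; returns 0 on success, -1 if rejected. */
int wakeup_store(struct device *dev, const char *value)
{
    assert(dev != NULL && value != NULL);
    if (!dev->power.can_wakeup)
        return -1;              /* device cannot generate wakeups at all */
    if (strcmp(value, "enabled") == 0)
        dev->power.may_wakeup = 1;
    else if (strcmp(value, "disabled") == 0)
        dev->power.may_wakeup = 0;
    else
        return -1;
    return 0;
}
```

In this model the driver owns can_wakeup and the administrator owns may_wakeup, so a suspend callback only needs to ask device_may_wakeup() before arming the hardware.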
Fragmentation avoidance
Mel Gorman's fragmentation avoidance patches were covered here last February. This patch set divides all memory allocations into three categories: "user reclaimable," "kernel reclaimable," and "kernel non-reclaimable." The idea is to support multi-page contiguous allocations by grouping reclaimable allocations together. If no contiguous memory ranges are available, one can be created by forcing out reclaimable pages. Since non-reclaimable pages have been segregated into their own area, the chances of such a page blocking the creation of a contiguous set of free pages are relatively small.
Mel recently posted version 19 of the fragmentation avoidance patch and requested that it be included in the -mm kernel. That request started a lengthy discussion on whether this patch set is a good idea or not. There is, it seems, a fair amount of uncertainty over whether this code belongs in the kernel. There are a few reasons for wanting fragmentation avoidance, and the arguments differ for each of them.
The first of these reasons is to increase the probability that high-order (multi-page) allocations in the kernel will succeed. Nobody denies that Mel's patch achieves that goal, but there are developers who claim that a better approach is to simply eliminate any such allocations. In fact, most multi-page allocations were dealt with some time ago. A few remain, however, including the two-page kernel stacks still used by default on most systems. When the kernel stack allocation fails, it blocks the creation of a new process. The kernel may eventually move to single-page stacks in all situations, but a few higher-order allocations will remain. It is not always possible to break required memory into single-page chunks.
The next reason, strongly related to the first, is huge pages. The huge page mechanism is used to improve performance for certain applications on large systems; there are few users currently, but that could change if huge pages were easier to work with. Huge pages cannot be allocated for applications in the absence of a large - and suitably aligned - region of contiguous memory. In practice, they are very difficult to create on systems which have been running for any period of time. Failure to allocate a huge page is relatively benign; the application simply has to get by with regular pages and take the performance hit. But, given that you have a huge page mechanism, making it work more reliably would be worthwhile.
The fragmentation avoidance patches can help with both high-order allocations and huge pages. There is some debate over whether they are the right solution to the problem, however. The often-discussed alternative would be to create one or more new memory zones set aside for reclaimable memory. This approach would make use of the zone system already built into the kernel, thus avoiding the creation of a new layer. A zone-based system might also avoid the perceived (though somewhat unproven) performance impact of the fragmentation avoidance patches. Given that this impact is said to be felt in that most crucial of workloads - kernel compiles - this argument tends to resonate with the kernel developers.
The zone-based approach is not without problems, however. Memory zones, currently, are static; as a result, somebody would have to decide how to divide memory between the reclaimable and non-reclaimable zones. This adjustment looks like it would be hard to get right in any sort of reliable way. In the past, the zone system has also been the source of a number of performance problems, mostly related to balancing of allocations between the zones. Increasing the complexity of the zone system and adding more zones could well bring those problems back.
There is another motivation for fragmentation avoidance which brings a different set of constraints: support for hot-pluggable memory. This feature is useful on high-availability systems, but it is also heavily used in association with virtualization. A host running a number of virtualized Linux instances can, by way of the hotplug mechanism, shift its memory resources between those instances in response to the demands of each.
Before memory can be removed from a running system, its contents must be moved elsewhere - at least, if one wants to still have a running system afterward. The fragmentation avoidance patches can help by putting only reclaimable allocations in the parts of memory which might be removed. As long as all the pages in a region can be reclaimed, that region is removable.
A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to support hot-pluggable memory be able to provide a 100% success rate. The current code need not live up to that metric, but there needs to be a clear path toward that goal. Otherwise, the kernel developers risk advertising a feature which they may not ever be able to support in a reliable way. The backers of fragmentation avoidance would like to merge the patches, solving 90% of the problem, and leave the other 90% for later. Ingo, instead, fears that second 90%, and wants to know how it will get done.
Why can't the current patches offer 100% reliability if they only put reclaimable memory in hot-pluggable regions? There are ways to lock down pages which were once reclaimable; these include DMA operations and pages explicitly locked by user space. There is also the issue of what happens when the kernel runs out of non-reclaimable memory. Rather than fail a non-reclaimable allocation attempt, the kernel will allocate a page from the reclaimable region. This fallback is necessary to avoid inflicting reliability problems on the rest of the kernel. But the presence of a non-reclaimable page in a reclaimable region will prevent the system from vacating that region.
This problem can be solved by getting rid of non-reclaimable allocations altogether. And that can be done by changing how the kernel's address space works. Currently, the kernel runs in a single, contiguous virtual address space which is mapped directly onto physical memory - often using a single, large page table entry. (The vmalloc() region is a special exception, but it is not an issue here). If the kernel were, instead, to use normal-sized pages like the rest of the system, its memory would no longer need to be physically contiguous. Then, if a kernel page gets in the way, it can simply be moved to a more convenient location.
Beyond the fact that this approach fundamentally changes the kernel's memory model, there are a couple of little issues with it. There would be a performance hit caused by the higher translation buffer use, and an increase in the amount of memory needed to store the kernel's page tables. Certain kernel operations - DMA in particular - cannot tolerate physical addresses which might change at arbitrary times. So there would have to be a new API where drivers could request physically-nailed regions - and be told by the kernel to give them up. In other words, breaking up the kernel's address space opens a substantial barrel of worms. It is not the sort of change which would be accepted in the absence of a fairly strong motivation, and it is not clear that hot-pluggable memory is a sufficiently compelling cause.
So no conclusions have been reached on the inclusion of the fragmentation avoidance patches. In the short term, Andrew Morton's controversy avoidance mechanisms are likely to keep the patch out of the -mm tree, however. But there are legitimate reasons for wanting this capability in the kernel, and the issue is unlikely to go away. Unless somebody comes up with a better solution, it could be hard to keep Mel's patch out forever.
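The grouping idea, and the fallback problem that worries Ingo, can be seen in a toy model. What follows is a minimal userspace sketch, not Mel's patch: the region layout, the sizes, and every name in it (struct memory, alloc_page_typed(), region_removable(), reclaim_region()) are hypothetical. It assumes that region 0 is reserved for non-reclaimable pages, and that any allocation which cannot be satisfied from a matching region falls back to whatever region has room.

```c
#include <assert.h>
#include <stddef.h>

/* The three categories from the patch set, plus a marker for free pages. */
enum alloc_type { PAGE_FREE = 0, USER_RCLM, KERN_RCLM, KERN_NORCLM };

#define NR_REGIONS       4   /* toy sizes, chosen for the example */
#define PAGES_PER_REGION 8

struct memory {
    enum alloc_type page[NR_REGIONS][PAGES_PER_REGION];
};

/* Assumption: region 0 holds non-reclaimable pages, the rest reclaimable. */
static int region_is_reclaimable_area(int region)
{
    return region != 0;
}

/* Claim the first free page in a region; returns its index or -1. */
static int take_page(struct memory *m, int region, enum alloc_type type)
{
    for (int i = 0; i < PAGES_PER_REGION; i++) {
        if (m->page[region][i] == PAGE_FREE) {
            m->page[region][i] = type;
            return i;
        }
    }
    return -1;
}

/* Allocate one page; returns the region used, or -1 if memory is full. */
int alloc_page_typed(struct memory *m, enum alloc_type type)
{
    assert(m != NULL && type != PAGE_FREE);
    int want_reclaimable = (type != KERN_NORCLM);

    /* First pass: only regions set aside for this kind of allocation. */
    for (int r = 0; r < NR_REGIONS; r++)
        if (region_is_reclaimable_area(r) == want_reclaimable &&
            take_page(m, r, type) >= 0)
            return r;

    /* Fallback: use any region with room rather than fail the request. */
    for (int r = 0; r < NR_REGIONS; r++)
        if (take_page(m, r, type) >= 0)
            return r;
    return -1;
}

/* A region can be vacated only if it holds no non-reclaimable page. */
int region_removable(const struct memory *m, int region)
{
    assert(m != NULL && region >= 0 && region < NR_REGIONS);
    for (int i = 0; i < PAGES_PER_REGION; i++)
        if (m->page[region][i] == KERN_NORCLM)
            return 0;
    return 1;
}

/* Force out reclaimable pages; returns how many pages are free afterward. */
int reclaim_region(struct memory *m, int region)
{
    assert(m != NULL && region >= 0 && region < NR_REGIONS);
    int free_pages = 0;
    for (int i = 0; i < PAGES_PER_REGION; i++) {
        if (m->page[region][i] != KERN_NORCLM)
            m->page[region][i] = PAGE_FREE;
        if (m->page[region][i] == PAGE_FREE)
            free_pages++;
    }
    return free_pages;
}
```

Once region 0 fills up, the next non-reclaimable request lands in region 1. From that point on, region_removable() reports region 1 as stuck, and reclaim_region() can no longer empty it completely, which is the situation that keeps the real patches from promising 100% reliability.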
Levanta's MapFS released
Once upon a time, kernel developers would post their contributions on the linux-kernel mailing list. Now they issue press releases instead. Along those lines, Levanta (the company once known as Linuxcare) has announced the availability of MapFS. This GPL-licensed module allows a read-only filesystem to be mounted locally for write access, with any changes being kept on the local system. It looks like another implementation of the "translucent filesystem" idea.
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds