Brief items
The current 2.6 kernel is 2.6.14,
released on October 27. A
very small number of patches went in since 2.6.14-rc5. Major changes in
2.6.14 include a new version of the wireless extensions, the
HostAP system (which allows a Linux
system to function as a wireless access point), relayfs, the
DCCP network protocol, the
filesystems in user space patch,
v9fs,
securityfs, and more.
The first 2.6.15 prepatch has not yet been released, and may not be until
the window for new features closes. A major pile of patches has been
merged into the mainline git repository; see the separate article, below,
for a list of some of the more interesting ones.
There have been no -mm releases in the last week.
The current 2.4 prepatch is 2.4.32-rc2, released by Marcelo on
Halloween. It contains a small set of fixes, mostly in the networking
subsystem.
Kernel development news
It'd help more if people focused more on testing their own shit
before submitting it than complaining about -mm. If it's the same
people breaking the tree all the time, I'm sure we can find a
recycled set of stocks somewhere.
-- Martin Bligh
The
2005 Kernel Summit made some
tweaks to the kernel development model with the aim of producing
higher-quality releases in a more timely manner. To that end, it was said
that major changes would only be allowed during the first two weeks of each
development cycle; after that, only bug fixes could go in. The hope was
that this rule would eliminate destabilizing patches late in the cycle and
concentrate developers' minds on making things work.
The 2.6.14 kernel is the first to go through the entire cycle since the
kernel summit. This kernel, released on October 27, came out almost
exactly two months after 2.6.13, which showed up on August 29. That
is relatively fast by 2.6 standards, but still
too slow for some developers. The complainers feel that the
freeze period puts too much of a damper on development, and that, somehow,
the kernels should come out faster.
2.6.14 would have come out sooner were it not for a final delay to fix some
remaining bugs (some of which turned out not to be real). Linus, however,
is pretty happy with how 2.6.14 worked. A
number of significant changes were merged, but regressions in the released
kernels seem to be within reasonable limits. As a result, Linus doesn't
see the need to make further changes to the process at this time:
So I'm planning on continuing with it unchanged for now. Two-week
merge window until -rc1, and then another -rc kernel roughly every
week until release. With the goal being 6 weeks, and 8 weeks being
ok.
Andrew Morton, meanwhile, has an answer for
those who think the development cycle is still too long:
a) you're sitting around feeling very very very bored while
b) the kernel is in long freeze due to the lack of kernel developer
attention to known bugs
The solution seems fairly obvious to me?
It was pointed out that many bugs relate to hardware which most developers
do not have. The response was that sometimes developers have to talk to
users who encounter bugs and try to track them down anyway. In any case,
the ongoing effort to get developers to fix bugs seems likely to be
necessary for some time to come.
One other branch of the discussion, meanwhile, took on the question of
whether the kernel has gotten too big. Prompted initially by Roman
Zippel, Andrew Morton did some compile
tests and came out with some disturbing numbers: the size of kernels
with similar configurations went from about 600K (2.5.71) to over 800K
(2.6.8). He also noted that the use of a current version of gcc adds
almost 100K to the final kernel size when compared to gcc 2.95.4.
Clearly, some serious inflation is going on somewhere.
Except that it's not quite so clear. Adrian Bunk demonstrated that, by using the -Os
compile option (which instructs gcc to optimize for size), current
compilers can make kernels which are quite a bit smaller than those made
with the old 2.95 release. The resulting discussion suggests that the
kernel developers may try making -Os the default for kernel builds
in the future. Fedora already builds its kernels this way. The
interesting thing is that, in the past, kernels built with -Os
have often performed as well as (or even better than) those optimized for
speed. Cache effects have a huge impact on kernel performance, and a
smaller kernel is more cache friendly.
Compiler issues aside, there truly has been some growth in the kernel.
Linus is not surprised by this:
On the other hand, I do believe that bloat is a fact of life....
The fact is, we do do more, and we're more complex. Our VM is a _lot_
more complex, and our VFS layer has grown a lot due to all the
support it has for situations that simply weren't an issue
before. And even when not used, that support is there.
Expect an increase in de-bloating work in the near future. In some areas,
this work has been ongoing for a while - consider, for example, the effort
to shrink the sk_buff structure used to represent packets in the
networking subsystem. For a more extreme example, see Matt Mackall's SLOB allocator, a
replacement for the slab subsystem which is much smaller, but which does
not perform as well on larger systems. SLOB is not for everybody (it's
mainly intended for embedded systems), but it almost certainly foreshadows
a surge in Linux weight reduction patches.
The release of the 2.6.14 kernel opened the door for new changes. Many
developers have been quick to submit their patches, with the result that
nearly 2000 commits have been merged for 2.6.15. The door will remain open
for two weeks - until around November 11 - at which point the kernel
should return to stabilization mode.
Many of the patches merged are fixes, and quite a few of them are in
architecture-specific code. Among the rest, however, are the following,
starting with user-visible changes:
- An update to the generic 802.11 code which includes, among other
things, quality-of-service support, the ability to use hardware crypto
and fragmentation offload functions, and "wireless spy" support.
- A driver for Marvell serial ATA controllers. There is also a new "ATA
passthrough" ioctl() allowing arbitrary ATA commands to be sent to
devices.
- The old "bluetty" driver has been removed. Everybody should be using
the bluez stack for Bluetooth devices at this point.
- As a result of the device model changes, the 2.6.15 kernel will
require version 071 (or higher) of the udev utility.
- A new uevent device attribute in sysfs can be used to
manually force the creation of a hotplug event for an existing
device. This feature can be used to regenerate hotplug events for
devices which were present when the system was booted.
- The PowerPC 4xx on-chip Ethernet driver has been replaced with a
completely rewritten, more efficient version.
- A new driver for the Freescale Ethernet devices found in some
embedded systems.
- Support for the old Cobalt servers has been restored.
- Basic support for hot-pluggable memory.
- A big NTFS rework with much-improved write support.
- A big InfiniBand update, with support for a wider range of userspace
verbs.
- Support for ARM "RealView" boards.
- A large CIFS filesystem update, with support for change notifications,
mounting from "legacy" servers, case-independent file names, and more.
- DRM support for Radeon PCI Express cards.
API changes and other internal patches visible to kernel developers include:
- The nested class devices
patch and associated input subsystem patches. For those who are
curious about where the device model work will go from here, Greg
Kroah-Hartman has posted a roadmap on his
weblog.
- More conversions of internal function prototypes to use the
gfp_t type
introduced in 2.6.14.
- A number of block layer patches, including a rework of the elevator
switch code and the generic
dispatch queue patch. The new I/O barrier code has not been
merged as of this writing.
- A big rework of the remote procedure call code, and a number of
associated NFS updates.
- Some power management changes, including a driver API change; see the power management article below for details.
- A new mechanism allowing code to be notified when USB
busses and devices come and go. Drivers do not normally need to use these
notifiers, but some of the core code benefits from them.
- The driver model class "interface" add() and
remove() methods have picked up a new parameter: a pointer to
the actual interface structure.
- There is a new reader/writer semaphore function
rwsem_is_locked(), which tests, without blocking, whether the rwsem
is read locked.
- There is a new variant of vmalloc():
void *vmalloc_node(unsigned long size, int node);
As one might expect, it allocates memory on a specific NUMA node.
- The "reserved" bit for memory pages - used to mark pages which are not
managed by the kernel page allocator (kernel text, non-memory areas,
etc.) - has been all but removed. No core code uses it now, with the
exception of software suspend, and that will get fixed eventually.
There are reports that this change breaks VMware.
- A set of Linux security module hooks for the (relatively) new
key management functions.
- A new kernel thread function:
int kthread_stop_sem(struct task_struct *kt, struct semaphore *s);
This function will stop a kernel thread which might be waiting on the
given semaphore.
- A "torture test" module for the read-copy-update mechanism.
Stay tuned: there is still time for quite a few more changes to be merged
before the 2.6.15 window closes.
The 2.6.15 kernel will bring with it a few changes to the power management
API. The first of these has to do with the
suspend() and
resume() methods found in
struct device_driver. These
methods used to be called three times for each suspend and resume operation,
in order to maintain compatibility with an older version of the API. The
new versions are called once, and have different prototypes:
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
This change required updates to a fair number of drivers, so the patch is
relatively large.
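As a rough illustration of what a converted driver looks like, here is a user-space sketch built around the two prototypes shown above. Everything else in it is an assumption made so the example can stand alone: the simplified pm_message_t, struct device, and struct device_driver are stand-ins for the real kernel definitions, and the "foo" driver and demo_suspend_resume_cycle() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types so the sketch builds in user space. */
typedef struct { int event; } pm_message_t;   /* placeholder, not the kernel's definition */
struct device { const char *name; void *driver_data; };

struct device_driver {
    const char *name;
    /* The two single-call prototypes described in the text. */
    int (*suspend)(struct device *dev, pm_message_t state);
    int (*resume)(struct device *dev);
};

/* Hypothetical per-device state for an imaginary "foo" device. */
struct foo_state { int powered; int suspend_calls; int resume_calls; };

static int foo_suspend(struct device *dev, pm_message_t state)
{
    struct foo_state *foo = dev->driver_data;

    (void)state;            /* a real driver might inspect the target state */
    foo->suspend_calls++;
    foo->powered = 0;       /* all suspend work happens in this one call */
    return 0;
}

static int foo_resume(struct device *dev)
{
    struct foo_state *foo = dev->driver_data;

    foo->resume_calls++;
    foo->powered = 1;
    return 0;
}

static struct device_driver foo_driver = {
    .name    = "foo",
    .suspend = foo_suspend,
    .resume  = foo_resume,
};

/* Stand-in for the PM core: each method is invoked exactly once. */
static int demo_suspend_resume_cycle(struct device_driver *drv, struct device *dev)
{
    pm_message_t msg = { .event = 1 };   /* arbitrary value for the demo */
    int ret;

    assert(drv->suspend != NULL && drv->resume != NULL);
    ret = drv->suspend(dev, msg);
    if (ret)
        return ret;
    return drv->resume(dev);
}
```

The point of interest is that each method does all of its work in a single invocation; a driver written this way no longer needs to work out which of three passes it is being called for.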
The other change is for devices which can supply "wakeup events" to the
kernel. These devices include network adapters with "wake-on-LAN"
capability, keyboards, and simple power switches. The power management
core has been reworked to enable these devices to perform their wakeup
functions while providing overall control to the system administrator.
The dev_pm_info structure (found inside struct device)
has gotten two new, single-bit fields. Drivers for devices which can
create wakeup events should set the can_wakeup field to one. The
actual issuance of such events, however, should be controlled by the
may_wakeup field. If that field is zero, the power management
core has decreed that wakeups should not be issued. A
device_may_wakeup() helper function has been added to make testing
the may_wakeup bit easy.
The patch adds a new wakeup field in sysfs. When read, it will
return enabled or disabled (or an empty string if the
device is not capable of generating wakeup events at all). The system
administrator can also write a new value to allow (or disallow) the
generation of wakeup events from the device.
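The interaction between the two bits and the sysfs file can be modeled in a few lines of user-space C. This is only a sketch of the behavior described above, not the kernel's code: the structures are cut-down stand-ins, wakeup_show() and wakeup_store() are hypothetical names, the helper is assumed to check both bits, and rejecting writes for devices which cannot generate wakeups is likewise an assumption.

```c
#include <assert.h>
#include <string.h>

/* Cut-down stand-ins holding only the two new single-bit fields. */
struct dev_pm_info {
    unsigned int can_wakeup:1;   /* set by the driver: device can signal wakeups */
    unsigned int may_wakeup:1;   /* set by policy: wakeups are currently allowed */
};

struct device { struct dev_pm_info power; };

/* Modeled on the helper named in the text; the real one may differ. */
static int device_may_wakeup(const struct device *dev)
{
    return dev->power.can_wakeup && dev->power.may_wakeup;
}

/* What reading the sysfs "wakeup" file reports, per the description. */
static const char *wakeup_show(const struct device *dev)
{
    if (!dev->power.can_wakeup)
        return "";               /* device cannot generate wakeups at all */
    return dev->power.may_wakeup ? "enabled" : "disabled";
}

/* What writing the file does; returns 0 on success, -1 if rejected. */
static int wakeup_store(struct device *dev, const char *buf)
{
    assert(buf != NULL);
    if (!dev->power.can_wakeup)
        return -1;               /* assumption: incapable devices refuse writes */
    if (strcmp(buf, "enabled") == 0)
        dev->power.may_wakeup = 1;
    else if (strcmp(buf, "disabled") == 0)
        dev->power.may_wakeup = 0;
    else
        return -1;
    return 0;
}
```

A driver would set can_wakeup once at probe time and then consult device_may_wakeup() in its suspend path before arming the hardware; the administrator's writes change only the policy bit.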
The driver core code has been merged, along with support for wakeups from
USB devices. As of this writing, however, the PCI wakeup code has some
outstanding issues with G5 systems which have prevented it from going into
the mainline.
Mel Gorman's fragmentation avoidance patches were covered here
last February. This patch set
divides all memory allocations into three categories: "user reclaimable,"
"kernel reclaimable," and "kernel non-reclaimable." The idea is to support
multi-page contiguous allocations by grouping reclaimable allocations
together. If no contiguous memory ranges are available, one can be created
by forcing out reclaimable pages. Since non-reclaimable pages have been
segregated into their own area, the chance of such a page blocking the
creation of a contiguous set of free pages is relatively small.
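To see why grouping matters, consider a toy model in which memory is just an array of page types. The sketch below has nothing to do with how Mel's patch is implemented; the enum names, can_free_block(), and the two eight-page layouts are all invented for illustration. It simply asks whether any naturally aligned block of pages could be emptied by reclaim alone.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page types mirroring the three categories, plus free pages. */
enum page_kind { PAGE_FREE, USER_RCLM, KERN_RCLM, KERN_NORCLM };

/*
 * Return 1 if some naturally aligned block of (1 << order) pages holds no
 * non-reclaimable page, meaning it could be freed by forcing out its
 * contents; return 0 otherwise.
 */
static int can_free_block(const enum page_kind *map, size_t npages,
                          unsigned int order)
{
    size_t block = (size_t)1 << order;

    for (size_t start = 0; start + block <= npages; start += block) {
        size_t i;

        for (i = 0; i < block; i++)
            if (map[start + i] == KERN_NORCLM)
                break;
        if (i == block)
            return 1;
    }
    return 0;
}

/* The same eight pages, first interleaved and then grouped by type. */
static const enum page_kind mixed[8] = {
    USER_RCLM, KERN_NORCLM, USER_RCLM, KERN_RCLM,
    KERN_RCLM, USER_RCLM, KERN_NORCLM, USER_RCLM,
};

static const enum page_kind grouped[8] = {
    KERN_NORCLM, KERN_NORCLM, KERN_RCLM, KERN_RCLM,
    USER_RCLM, USER_RCLM, USER_RCLM, USER_RCLM,
};
```

With the interleaved layout, can_free_block(mixed, 8, 2) fails because each four-page block contains one pinned page; the grouped layout holds exactly the same pages, but its second four-page block can be reclaimed in full.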
Mel recently posted version 19 of
the fragmentation avoidance patch and requested that it be included in
the -mm kernel. That request started a lengthy discussion on whether this
patch set is a good idea or not. There is, it seems, a fair amount of
uncertainty over whether this code belongs in the kernel.
There are a few reasons for wanting fragmentation avoidance, and the
arguments differ for each of them.
The first of these reasons is to increase the probability of high-order
(multi-page) allocations in the kernel. Nobody denies that Mel's patch
achieves that goal, but there are developers who claim that a better
approach is to simply eliminate any such allocations. In fact, most
multi-page allocations were dealt with some time ago. A few remain,
however, including the two-page kernel stacks still used by default on most
systems. When the kernel stack allocation fails, it blocks the creation of
a new process. The kernel may eventually move to single-page stacks in all
situations, but a few higher-order allocations will remain. It is not
always possible to break required memory into single-page chunks.
The next reason, strongly related to the first, is huge pages. The huge
page mechanism is used to improve performance for certain applications on
large systems; there are few users currently, but that could change if huge
pages were easier to work with. Huge pages cannot be allocated for
applications in the absence of a large - and suitably aligned - region of
contiguous memory. In practice, they are very difficult to create on
systems which have been running for any period of time. Failure to
allocate a huge page is relatively benign; the application simply has to
get by with regular pages and take the performance hit. But, given that
you have a huge page mechanism, making it work more reliably would be
worthwhile.
The fragmentation avoidance patches can help with both high-order
allocations and huge pages. There is some debate over whether it is the
right solution to the problem, however. The often-discussed alternative
would be to create one or more new memory zones set aside for reclaimable
memory. This approach would make use of the zone system already built into
the kernel, thus avoiding the creation of a new layer. A zone-based system
might also avoid the perceived (though somewhat unproven) performance
impact of the fragmentation avoidance patches. Given that this impact is
said to be felt in that most crucial of workloads - kernel compiles - this
argument tends to resonate with the kernel developers.
The zone-based approach is not without problems, however. Memory zones,
currently, are static; as a result, somebody would have to decide how to
divide memory between the reclaimable and non-reclaimable zones. This
adjustment looks like it would be hard to get right in any sort of reliable
way. In the past, the zone system has also been the source of a number of
performance problems, mostly related to balancing of allocations between
the zones. Increasing the complexity of the zone system and adding more
zones could well bring those problems back.
There is another motivation for fragmentation avoidance which brings a
different set of constraints: support for hot-pluggable memory. This
feature is useful on high-availability systems, but it is also heavily used
in association with virtualization. A host running a number of virtualized
Linux instances can, by way of the hotplug mechanism, shift its memory
resources between those instances in response to the demands of each.
Before memory can be removed from a running system, its contents must be
moved elsewhere - at least, if one wants to still have a running system
afterward. The fragmentation avoidance patches can help by putting only
reclaimable allocations in the parts of memory which might be removed. As
long as all the pages in a region can be reclaimed, that region is
removable.
A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to
support hot-pluggable memory be able to provide a 100% success rate. The
current code need not live up to that metric, but there needs to be a clear
path toward that goal. Otherwise, the kernel developers risk advertising a
feature which they may not ever be able to support in a reliable way. The
backers of fragmentation avoidance would like to merge the patches, solving
90% of the problem, and leave the other 90%
for later. Ingo, instead, fears that second 90%, and wants to know how it
will get done.
Why can't the current patches offer 100% reliability if they only put
reclaimable memory in hot-pluggable regions? There are ways to lock down
pages which were once reclaimable; these include DMA operations and pages
explicitly locked by user space. There is also the issue of what happens
when the kernel runs out of non-reclaimable memory. Rather than fail a
non-reclaimable allocation attempt, the kernel will allocate a page from
the reclaimable region. This fallback is necessary to avoid inflicting
reliability problems on the rest of the kernel. But the presence of a
non-reclaimable page in a reclaimable region will prevent the system from
vacating that region.
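The fallback problem is easy to reproduce in miniature. The following user-space sketch is a deliberately crude model, with invented names throughout (struct region, region_alloc(), alloc_non_reclaimable(), region_removable()) and a four-page region size chosen only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_PAGES 4   /* tiny on purpose; real regions are far larger */

enum page_kind { PAGE_FREE, RECLAIMABLE, NON_RECLAIMABLE };

struct region { enum page_kind page[REGION_PAGES]; };

/* Place 'kind' in the first free page of r; return 0, or -1 if r is full. */
static int region_alloc(struct region *r, enum page_kind kind)
{
    for (size_t i = 0; i < REGION_PAGES; i++) {
        if (r->page[i] == PAGE_FREE) {
            r->page[i] = kind;
            return 0;
        }
    }
    return -1;
}

/*
 * Non-reclaimable allocations prefer the fixed region, but fall back to
 * the removable one rather than fail.
 */
static int alloc_non_reclaimable(struct region *fixed, struct region *removable)
{
    if (region_alloc(fixed, NON_RECLAIMABLE) == 0)
        return 0;
    return region_alloc(removable, NON_RECLAIMABLE);
}

/* A region can be vacated only if nothing non-reclaimable lives in it. */
static int region_removable(const struct region *r)
{
    for (size_t i = 0; i < REGION_PAGES; i++)
        if (r->page[i] == NON_RECLAIMABLE)
            return 0;
    return 1;
}
```

Fill the fixed region with four non-reclaimable pages and the removable region still passes region_removable(); the fifth such allocation succeeds only by landing in the removable region, which can then no longer be vacated. That single stray page is the gap between "usually works" and the 100% success rate Ingo is asking for.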
This problem can be solved by getting rid of non-reclaimable allocations
altogether. And that can be done by changing how the kernel's address
space works. Currently, the kernel runs in a single, contiguous virtual
address space which is mapped directly onto physical memory - often using a
single, large page table entry. (The vmalloc() region is a
special exception, but it is not an issue here). If the kernel were,
instead, to use normal-sized pages like the rest of the system, its memory
would no longer need to be physically contiguous. Then, if a kernel page
gets in the way, it can simply be moved to a more convenient location.
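The difference between the two schemes comes down to how a kernel virtual address is turned into a physical one. The sketch below contrasts them in user space; it is a simplification under stated assumptions, not kernel code. The PAGE_OFFSET value is the one commonly seen on 32-bit x86 and is used purely for illustration, the single-level frame_of[] table stands in for real page tables, and all of the function names are invented.

```c
#include <assert.h>

#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xC0000000UL   /* illustrative value only */
#define NR_VPAGES    8

/*
 * Direct mapping: translation is pure arithmetic, so a virtual address is
 * permanently tied to one physical address.
 */
static unsigned long linear_virt_to_phys(unsigned long vaddr)
{
    return vaddr - PAGE_OFFSET;
}

/* Per-page mapping: a toy one-level table from virtual page to frame. */
static unsigned long frame_of[NR_VPAGES] = { 0, 1, 2, 3, 4, 5, 6, 7 };

static unsigned long mapped_virt_to_phys(unsigned long vaddr)
{
    unsigned long vpn = (vaddr - PAGE_OFFSET) >> PAGE_SHIFT;
    unsigned long off = vaddr & ((1UL << PAGE_SHIFT) - 1);

    assert(vpn < NR_VPAGES);
    return (frame_of[vpn] << PAGE_SHIFT) | off;
}

/*
 * Moving a page only means updating its table entry; the virtual address
 * seen by the rest of the kernel stays the same. A real implementation
 * would also have to copy the data and flush stale translations.
 */
static void migrate_page(unsigned long vpn, unsigned long new_frame)
{
    assert(vpn < NR_VPAGES);
    frame_of[vpn] = new_frame;
}
```

After migrate_page(3, 100), the same virtual address resolves to a different physical frame through mapped_virt_to_phys(), while linear_virt_to_phys() has no way to express such a move.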
Beyond the fact that this approach fundamentally changes the kernel's
memory model, there are a couple of little issues with it. There would be
a performance hit caused by the higher translation buffer use, and an
increase in the amount of memory needed to store the kernel's page tables.
Certain kernel operations - DMA in particular - cannot tolerate physical
addresses which might change at arbitrary times. So there would have to be
a new API where drivers could request physically-nailed regions - and be
told by the kernel to give them up. In other words, breaking up the
kernel's address space opens a substantial barrel of worms. It is not the
sort of change which would be accepted in the absence of a fairly strong
motivation, and it is not clear that hot-pluggable memory is a sufficiently
compelling cause.
So no conclusions have been reached on the inclusion of the fragmentation
avoidance patches. In the short term, Andrew Morton's controversy
avoidance mechanisms are likely to keep the patch out of the -mm tree,
however. But there are legitimate reasons for wanting this capability in
the kernel, and the issue is unlikely to go away. Unless somebody comes up
with a better solution, it could be hard to keep Mel's patch out forever.
Once upon a time, kernel developers would post their contributions on the linux-kernel mailing list. Now they issue press releases instead. Along those lines, Levanta (the company once known as Linuxcare) has
announced
the availability of MapFS. This GPL-licensed module allows a read-only filesystem to be mounted locally for write access, with any changes being kept on the local system. It looks like another implementation of the "translucent filesystem" idea.
Page editor: Jonathan Corbet