Brief items
The current stable 2.6 kernel is 2.6.14.2,
released on November 10.
2.6.14.2 contains about a dozen fixes, including one for the zero-length
datagram bug. It does not contain a fix for the
file lease denial of service bug yet, however.
The current 2.6 prepatch is 2.6.15-rc1, announced by Linus on
November 11. Says Linus:
It's hard to go through in any great detail, because even the
shortlog is actually almost five thousand lines and about 200kB in
size, and would thus run afoul of the mailing list limits so I
can't include it here. The same is true of the diffstat, only even
more so. The unidiff is about a million lines in size, just the
diffstat is 300+kB.
The changes are really pretty much all over the place, with over
four thousand commits merged in the two weeks since 2.6.14...
The changes in 2.6.15 have been listed in detail here last week and the week before. Significant
changes merged since last week's article, but before 2.6.15-rc1, include a
new, simpler, type-safe netlink API, a new netfilter connection tracking
implementation (which understands IPv6 and will eventually replace the
current code), the removal of some SCSI subsystem typedefs
(Scsi_Device, Scsi_Pointer, and
Scsi_Host_Template), the removal of the owner field from
struct pci_driver, and a new "platform driver" interface.
For those who are just tuning in: some of the more significant changes in
2.6.15 will include device model changes, basic hotplug memory support, a
reworked NTFS implementation with improved write behavior, a big CIFS
update, a number of block improvements (and a reorganization of the block
layer into its own top-level directory), the Open-iSCSI initiator,
InfiniBand SCSI RDMA, RapidIO
support, a number of scheduler tweaks, the shared subtrees patch,
four-level page tables for the ia-64 architecture, and much more. See the
short log for a long list of changes, or the long-format changelog for the
details.
2.6.15-rc1 marks the closing of the window for new features, so Linus's git
repository contains mostly fixes. It does, however, also include a generic
cmpxchg implementation for i386, new Omnikey Cardman 4000 and 4040
drivers, and a new DMA32 zone for the x86-64 architecture.
The current -mm tree is 2.6.14-mm2. Recent changes to
-mm include some relayfs enhancements, some scheduler tweaks, and various
fixes.
The current 2.4 kernel is 2.4.32, released by Marcelo on
November 16. 2.4 is in deep maintenance mode, so there are not many
new features in 2.4.32.
Kernel development news
Greg Kroah-Hartman has gotten tired of answering the same questions about
Linux kernel development. So he has put together
a HOWTO document to get people
started. The goals are ambitious: "This is the be-all, end-all
document on this topic. It contains instructions on how to become a Linux
kernel developer and how to learn to work with the Linux kernel development
community." Now people just have to read it...
The question of whether the i386 architecture should move to using 4K
kernel stacks by default has been raised a few times; LWN last
covered the 4K stack issue in
September. Adrian Bunk has started the discussion anew with
this proposal that the -mm tree
go to 4K stacks (only) now, with an eye toward changing the mainline for
2.6.16.
Most of the technical issues have not changed since September, so those
arguments will not be repeated here. It is worth noting that layered block
devices and filesystems have mostly been fixed. In past kernels, highly
stacked devices (think of a combination of RAID, encryption, and network
filesystems) could end up with very long call chains in the kernel, and, as
a result, overflow the kernel stack. Most of these calls have since been
serialized, so block-layer stacking should not be a problem.
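To see why serializing these calls helps, consider the following toy model in Python. It is not kernel code: the layer names and both functions are invented for illustration. The recursive version uses one call frame per stacked layer, while the serialized version queues work for the next layer and processes it from a single loop, so the call depth stays constant no matter how many layers are stacked.

```python
import inspect
from collections import deque

# Hypothetical stack of layered devices, top to bottom.
LAYERS = ["network filesystem", "encryption", "RAID", "disk"]


def frame_depth():
    """Count the call frames currently on the Python stack."""
    frame = inspect.currentframe()
    depth = 0
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth


def submit_recursive(layers, index=0, base=None):
    """Each layer calls directly into the layer below it.

    Returns the number of layer frames stacked up at the deepest point.
    """
    here = frame_depth()
    if base is None:
        base = here
    if index == len(layers) - 1:
        return here - base + 1
    return submit_recursive(layers, index + 1, base)


def submit_serialized(layers):
    """Each layer queues the request for the next layer, then returns.

    A single loop drains the queue, so no layer ever runs on top of
    another layer's call frame.
    """
    base = frame_depth()
    pending = deque([0])      # indices of layers with work to do
    deepest = 0

    def handle(index):
        nonlocal deepest
        deepest = max(deepest, frame_depth() - base)
        if index < len(layers) - 1:
            pending.append(index + 1)   # hand off instead of calling down

    while pending:
        handle(pending.popleft())
    return deepest
```

With the four layers above, the recursive path stacks four frames at its deepest point and the serialized path only one; adding more layers increases the first number but not the second.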
The issue that remains is NDISwrapper, the glue layer which allows Windows
NDIS drivers to be loaded into a Linux kernel. Windows runs with a much
larger kernel stack size, so NDIS driver writers have no reason to be as
careful about stack usage. And, of course, these drivers cannot be fixed
to work properly with Linux. Some have argued that breaking NDISwrapper is
not a possibility, since many users rely upon it to make their wireless
network adapters work. But patience with this line of thought is running
thin, as can be seen in this outburst from
Dave Jones:
If we continue down this path, we'll have no native wireless
drivers for Linux. The answer is not to complain to linux-kernel
for breaking ndiswrapper, but complain to the vendors for not
releasing specifications for native drivers to be written.
The good news is that the wireless situation is
not as bad as one might think. There is documentation for Broadcom
chips available now, and a
Broadcom driver is in the works. There is also an
Atheros driver which is "nearly done." Once these drivers are complete
and joined with the Intel drivers already in the mainline, Linux will have
much better support for wireless devices, and far fewer systems will have
any reason to use NDISwrapper.
There are a number of reasons for going with the 4K stack mode, including
better performance and higher reliability. Some distributions (e.g. Fedora
Core and RHEL) have been shipping 4K kernels for a while now. So, while
nobody has committed to moving the mainline (or -mm) toward 4K-only yet,
chances are improving that it will happen sometime in the not-too-distant
future.
Page migration is the act of moving a process's pages from one part of the
system to another. Often, the motivation is moving pages between NUMA nodes
in the hope of improving performance. When this page last
looked at the page migration patch
set, it worked by forcing target pages out to the swap device. When
the owning process later faults them in, these pages will end up on the
desired node. This technique works, but it is not optimal: it would be
nicer to avoid having to write the pages to disk and read them back in.
Christoph Lameter has now followed up with the direct migration patch set,
which does away with the side-trip to the swap device. A look at the patch
shows why things were not done this way in the first place; direct page
migration involves rather more than simply copying the data over. The
first step, after choosing a target page, is to lock that page so that
nobody else will mess with it. There might currently be I/O active which
involves that page, so the kernel must wait for any such I/O to complete.
Only then can the real migration work begin.
The kernel must establish a swap cache entry for the page, even though it
intends to avoid writing the page to swap. This entry will cause the right
thing to happen if a process faults on the page while it is being moved.
Then all references
to the page (page table entries) are unmapped. With luck, all references
will go away; if references remain for any reason, the page cannot be
moved.
Actually moving the page involves copying a subset of the page status bits
over, copying the page data itself, then copying the rest of the status
bits. The old page is cleared out and freed. If any writeback has been
queued up for the new page, it is set in motion. Then it's just a matter
of cleaning up, and the page has been successfully moved.
If the kernel runs out of free pages on the target node, it will fall back
to the swap-based mechanism. So that stage of this patch's evolution
remains useful.
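The sequence of steps described above can be sketched as a small Python model. This is only an illustration of the ordering, not the patch itself: the Page class, its fields, and the migrate() function are all hypothetical names, and waiting for I/O is reduced to a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    node: int
    data: bytes = b""
    flags: set = field(default_factory=set)  # status bits, e.g. {"dirty"}
    mappings: int = 0        # page table entries that can be unmapped
    other_refs: int = 0      # references that unmapping cannot remove
    locked: bool = False
    in_swap_cache: bool = False


def migrate(page, target_node, free_pages):
    """Toy walk through the migration steps.

    free_pages maps a node number to its count of free pages.
    Returns (result, page); result is "direct", "swap" or "failed".
    """
    # No free page on the target node: fall back to the swap-based path.
    if free_pages.get(target_node, 0) == 0:
        return "swap", page

    page.locked = True             # lock the page
    # (the kernel would wait here for any in-flight I/O to finish)
    page.in_swap_cache = True      # lets faults during the move be handled
    page.mappings = 0              # unmap the page table entries
    if page.other_refs:            # references remain: page cannot move
        page.in_swap_cache = False
        page.locked = False
        return "failed", page

    # Copy status bits and data to a new page, then free the old one.
    free_pages[target_node] -= 1
    new_page = Page(node=target_node, data=page.data, flags=set(page.flags))
    free_pages[page.node] = free_pages.get(page.node, 0) + 1
    page.data, page.flags = b"", set()
    page.in_swap_cache = False
    page.locked = False
    return "direct", new_page
```

The model makes the three possible outcomes explicit: a direct move, a fallback to swap when the target node is full, and a failure when references to the page remain.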
With this code in place, the kernel has the support it needs to try to keep
a process's pages in local memory. The migration code might also prove
useful for hotplug memory uses, where all pages must be vacated from a
given region. Indeed, some of this code was originally written for hotplug
applications. But, at this point, the migration is done on a best-effort
basis. For NUMA systems, failure to move a page results in worse
performance, but nothing particularly severe. For hotplug memory, however,
this sort of failure will block a memory remove operation altogether.
Moving all pages in a region with 100% certainty remains a difficult
problem without a complete solution at this time.
One of the pieces of such a solution might be active memory defragmentation
which, among other things, works to keep non-movable memory allocations out
of memory regions which might be removed. When we looked at active
defragmentation last week,
that patch set looked like it was in trouble. The overhead of the
defragmentation code seemed to be too high, and a number of developers
(Linus included) felt that this sort of functionality should be implemented
using the kernel's zone system, rather than with a new layer in the memory
allocator.
Defragmentation hacker Mel Gorman doesn't give up that easily, however. He
has posted a new, "light" version
of the defragmentation patch which, he hopes, will be better received.
As he describes it:
This is a much simplified anti-defragmentation approach that simply
tries to keep kernel allocations in groups of 2^(MAX_ORDER-1) and
easily reclaimed allocations in groups of 2^(MAX_ORDER-1). It uses
no balancing, tunables special reserves and it introduces no new
branches in the main path. For small memory systems, it can be
disabled via a config option. In total, it adds 275 new lines of
code with minimum changes made to the main path.
In this version of the patch, a new GFP flag (__GFP_EASYRCLM) is
added; its presence indicates an allocation which the kernel can easily get
back should the need arise. It is used for user-space pages (which can
usually be forced out to backing store) and in a few other situations, such
as for some kernel buffers. The buddy allocator already keeps track of
memory in large chunks; the new code simply steers reclaimable allocations
toward some chunks, while keeping the non-reclaimable allocations in
others. In this way, it is hoped, there will be no situations where one
non-movable page blocks the freeing of the large, contiguous region in
which it is located.
The patch works by creating a "usemap" array tracking which kind of allocation is
being done from each large chunk of memory. Mel also had to split the
per-CPU free lists which are used to perform fast single-page allocations;
now there are two such lists, one for each allocation type. From there, it
is just a matter of taking allocations from the proper pile, depending on
the __GFP_EASYRCLM flag.
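The steering idea can be illustrated with a toy allocator in Python. Everything here is an assumption made for the sketch: the class, its method names, and the tiny MAX_ORDER value are invented, and only the usemap is modeled (the per-CPU lists are left out). Memory is divided into blocks of 2^(MAX_ORDER-1) pages, and each block serves only one kind of allocation once it has been claimed.

```python
MAX_ORDER = 3                         # toy value, chosen for readability
BLOCK_PAGES = 2 ** (MAX_ORDER - 1)    # pages per tracked block (4 here)

KERNEL, EASYRCLM = "kernel", "easyrclm"


class ToyAllocator:
    def __init__(self, num_blocks):
        self.usemap = [None] * num_blocks   # allocation type of each block
        self.used = [0] * num_blocks        # pages handed out per block

    def alloc_page(self, easy_reclaim=False):
        """Return (block, page_index), or None if no suitable block exists."""
        kind = EASYRCLM if easy_reclaim else KERNEL
        # Prefer a block that already holds this kind of allocation.
        for block, owner in enumerate(self.usemap):
            if owner == kind and self.used[block] < BLOCK_PAGES:
                return self._take(block)
        # Otherwise claim an untouched block for this kind.
        for block, owner in enumerate(self.usemap):
            if owner is None:
                self.usemap[block] = kind
                return self._take(block)
        # This sketch simply gives up rather than mixing types in a block.
        return None

    def _take(self, block):
        page = self.used[block]
        self.used[block] += 1
        return block, page
```

If reclaimable and non-reclaimable requests arrive interleaved, the two kinds still end up in separate blocks, so freeing every reclaimable page leaves whole blocks empty. As a point of scale, assuming a MAX_ORDER of 11 and 4KB pages, each block would be 1024 pages, or 4MB.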
This version certainly reduces the footprint and overhead of the
defragmentation patches. It is still not the zone-based approach that
others were pushing for, however. So it remains to be seen whether "active
defragmentation lite" is, in the end, better received than its
predecessors.
The kernel has long had a series of functions which read and write memory
locations in the legacy ISA memory range. These functions, with names like
isa_readb(), require no special preparation to use, and they only
work in the ISA hole. They also have been obsolete and deprecated for
quite some time.
Recently, there has been an effort to finally get rid of
isa_readb() and friends. To that end, Al Viro has posted a set of
"isaectomy" patches which fix up the remaining callers (they are made to
use ioremap() and the not quite as obsolete readb()
family of functions) so that the old stuff can be deleted. One would think
that this work would be uncontroversial, but Linus, it turns out, is unconvinced:
Hmm.. I actually believe that the isa_read() functions are more
portable and easier to use than ioremap().
The reason? A platform will always know where any legacy ISA bus
resides, while the "ioremap()" thing will depend on platform PCI
code to have set the right offsets (and thus the resource
addresses) for whatever bus the PCI device is on.
The fact is, however, that very little in-tree code still uses these
functions. They are a deprecated interface to a very old and obsolete
hardware standard, and they have few defenders. So anybody maintaining
out-of-tree code which still uses these functions might want to take warning:
they probably will not stay around for much longer.
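The difference between the two calling styles can be shown with a toy Python model. This is not kernel code: ToyBus and its methods only borrow the names of the kernel functions, the addresses and byte values are example data, and the real readb() takes a single address rather than a mapping and an offset. The point is only that the old interface needs no setup step, while the replacement requires a mapping to be created first.

```python
class ToyBus:
    """Hypothetical stand-in for bus memory; not a kernel interface."""

    def __init__(self, contents):
        self.contents = dict(contents)   # bus address -> byte value
        self.mappings = []               # (start, size) ranges from ioremap

    def ioremap(self, start, size):
        """Set up a mapping and return a cookie for later accesses."""
        self.mappings.append((start, size))
        return len(self.mappings) - 1

    def readb(self, cookie, offset):
        """Read one byte through a mapping made by ioremap()."""
        start, size = self.mappings[cookie]
        if not 0 <= offset < size:
            raise ValueError("access outside the mapped range")
        return self.contents.get(start + offset, 0)

    def isa_readb(self, address):
        """Old style: read an ISA address directly, with no mapping step."""
        return self.contents.get(address, 0)


bus = ToyBus({0xC0000: 0x55, 0xC0001: 0xAA})   # example data only

old_style = bus.isa_readb(0xC0000)             # one call, no preparation

rom = bus.ioremap(0xC0000, 2)                  # converted caller maps first
new_style = bus.readb(rom, 0)
```

Both paths read the same byte; the conversion trades the convenience of the first for an explicit mapping that does not assume where the ISA range lives.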
The relative calm which has settled around the software suspend subsystem
may be about to come to an end. This part of the kernel, which has never
worked to everybody's satisfaction, remains subject to different ideas of
how the problem should be solved.
Pavel Machek's user-space software suspend patch was covered here in September.
Pavel has now posted a
new version of the patch with a request that it be merged for 2.6.16.
The user-space approach is, clearly, the way Pavel thinks that software
suspend should go. Beyond getting some code out of the kernel, this
approach makes a number of add-on features, such as graphical displays,
image compression, image encryption, network-based suspend, etc., easier to
implement. If you want to hang a big pile of features onto the suspend
mechanism, you eventually have to get into user space.
One of the first responses came from Dave
Jones, who said:
Just for info: If this goes in, Red Hat/Fedora kernels will fork
swsusp development, as this method just will not work there.
The main issue is the fact that the user-space approach uses
/dev/kmem to repopulate memory at resume time. Red Hat and Fedora
kernels do not allow memory to be overwritten in this way; there are no
other applications which need that capability, with the exception of
rootkits. Allowing user space to overwrite arbitrary physical pages is, to
Dave, not worth it, no matter how many software suspend features it
enables. Says Dave: "I'll take 'rootkit doesnt work' over 'bells and
whistles'."
Nigel Cunningham, the author of the Suspend2 patches, also has some
thoughts on the matter. He has been busily cleaning up the suspend2
patches with an eye toward making them more palatable for merging into the
mainline. It turns out that Nigel has a set of
225 patches which he will soon make available. Since few people have
seen the new patch set, it's not clear what sort of reception it will get.
It can be said, though, that 225 patches is a large pile of code. Anybody
trying to get a patch set of that size merged needs to have some fairly
convincing arguments in hand.
At some point, Nigel's code mountain will become available, and some sort
of decision will have to be made. Software suspend could be transformed
into suspend2, or moved partially to user space. Or it could be left
more-or-less as it is now. These are three very distinct choices -
especially as nobody wants to see a repeat of the situation where the
mainline kernel supported more than one software suspend implementation.
With luck, when the dust settles, Linux will have a more featureful and
reliable software suspend implementation than it does now. But expect some
interesting discussion between now and then.
Patches and updates
Device drivers
- dmitry pervushin: SPI. (November 11, 2005)
Page editor: Jonathan Corbet