Brief items
The current 2.6 prepatch remains 2.6.10-rc2.
Linus's BitKeeper repository continues to accumulate patches, though they
are mostly of a bugfix nature. They include some x86 single-stepping
fixes, a number of "sparse" annotations, the token-based memory management fix,
a memory technology device (and JFFS2) update, a frame buffer device
update, some user-mode Linux patches, some page allocator tuning, and a few
architecture updates.
The current patch from Andrew Morton is 2.6.10-rc2-mm3. Recent changes to -mm include
a number of disk quota fixes, a big DVB update, and some fixes to make SELinux
and reiserfs work together.
The current ultra-stable patch from Alan Cox is 2.6.9-ac11.
The current 2.4 kernel is 2.4.28; Marcelo seems in no hurry to start
the 2.4.29 process. For those wanting a security-hardened 2.4 kernel, Solar
Designer has released 2.4.28-ow1, which
includes a number of security fixes.
Kernel development news
Xen is a free virtualization system designed to allow multiple virtual machines
to be run on a single host system with high performance. The Xen system
(version 2.0 was released recently) offers a number of interesting features,
including flexible
networking between virtual machines and the ability to transparently move
virtual machines between physical hosts while they are running. Xen's
authors claim that the performance hit from running under Xen is only "a
few percent."
Now that the 2.0 release is out, the Xen developers would like to merge
their code into the mainline kernel. The bulk of this code adds the new
Xen "architecture," which enables the kernel to run on the virtual machine
provided by Xen itself. The architecture code is available
from the Xen site for those who are interested.
Another significant chunk is a set
of drivers which provide Xen-hosted systems with network interfaces,
file-backed block devices, and console devices.
Inclusion of both of those patch sets should be relatively uncontroversial;
they do not affect any code which is not actually built for the Xen
architecture, and thus should not risk breaking anything. The final set,
however, will have to be looked at more closely; these are the patches to
the core kernel itself. Most of these patches make the kernel work with
Xen's very different way of managing and allocating memory; they include a
new sk_buff structure allocation function, a change to how
/dev/mem works on the Xen architecture, and a new
ptep_establish_new() function which optimizes the instantiation of
new pages. The most controversial patch is probably the one which changes how
the architecture-specific arch_free_page() function works: under Xen,
this function might actually short-circuit the rest of the page allocator
and dispose of the page itself. This technique allows Xen to
manage a single page pool for multiple virtual machines, but not everybody
liked changing the interface to arch_free_page() in that way.
That said, there appears to be no strong opposition to the inclusion of
these patches. It would not be surprising to see them go into -mm sometime
after 2.6.10 comes out.
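The arch_free_page() change described above can be pictured with a small toy model. This is only a sketch of the control flow, not kernel code: the class names, the hook's return convention (True meaning "the page has already been disposed of"), and the shared pool are all assumptions made for illustration.

```python
class SharedPool:
    """Hypothetical pool shared by all virtual machines (illustrative only)."""

    def __init__(self):
        self.pages = []

    def reclaim(self, page):
        self.pages.append(page)


class GuestAllocator:
    """Toy stand-in for one guest kernel's page allocator."""

    def __init__(self, arch_free_page=None):
        self.free_list = []
        # Assumed convention: the hook returns True when it has taken the page.
        self.arch_free_page = arch_free_page or (lambda page: False)

    def free_page(self, page):
        if self.arch_free_page(page):
            return  # short-circuit: the generic free path never runs
        self.free_list.append(page)


pool = SharedPool()


def xen_like_hook(page):
    """Hand the page to the shared pool instead of the guest's free list."""
    pool.reclaim(page)
    return True


native = GuestAllocator()
native.free_page(0x1000)

guest = GuestAllocator(arch_free_page=xen_like_hook)
guest.free_page(0x2000)
```

In this model the native allocator keeps its freed page on its own free list, while the guest's page ends up in the shared pool; that bypass of the normal free path is the part of the interface change which drew objections.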
The Filesystems in User Space (FUSE) patch has been around for some time.
FUSE acts as a kernel filesystem which
turns around and passes all VFS requests out to a user-space daemon, which
is expected to do something reasonable with them. There are
numerous projects using FUSE to implement interesting filesystems in user
space. The FUSE developers have now requested that FUSE be
merged into the 2.6 kernel. They may yet get there, but some obstacles
stand in the way.
Linus started by complaining that FUSE was
"too messy." Some of his objections, it turns out, may have been based on
a reading of old code: they were aimed at parts of the 2.4 version of the
patch which are not present in the version
being put forward for inclusion.
There is, however, one show-stopping problem which remains in the code. If
the system's memory gets to be full of dirty pages which must be written to
a FUSE filesystem, and the user-space process which implements that
filesystem has been swapped out, the system can deadlock. It cannot clean
up those dirty pages until they have been written to the backing store, it
cannot write those pages until the user-space daemon has been paged in, and
it cannot page in the daemon until the dirty pages are cleaned. The system
comes to a screeching halt and the users reconsider the whole idea of
user-space filesystems.
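The circular wait described above can be written down as a tiny "waits for" graph. The sketch below is purely illustrative: the step names are informal labels taken from the paragraph, and find_cycle() is a hypothetical helper, not anything found in the kernel or in FUSE.

```python
def find_cycle(waits_for, start):
    """Follow single 'waits for' edges from start; return the loop, or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ends, so something can make progress
    return path[path.index(node):] + [node]


# Each step cannot proceed until the step it maps to has completed.
waits_for = {
    "reclaim dirty pages": "write pages to FUSE filesystem",
    "write pages to FUSE filesystem": "page in user-space daemon",
    "page in user-space daemon": "reclaim dirty pages",
}
cycle = find_cycle(waits_for, "reclaim dirty pages")

# If the daemon is already resident, the last dependency disappears.
resident = dict(waits_for)
del resident["page in user-space daemon"]
no_cycle = find_cycle(resident, "reclaim dirty pages")
```

With the daemon swapped out, the chain loops back on itself and nothing can proceed; with the daemon resident, the chain terminates and the pages can eventually be written.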
The problem is most easily demonstrated through the use of shared writable
mappings. With such mappings, user space can create vast numbers of dirty
pages without the operating system knowing about it. Andrew Morton demonstrated that this is not just a
theoretical problem; it can be made to happen on real systems. The problem
can also be made to happen by simply writing too much data to the
filesystem. All this led Linus to lecture
on the topic:
Guys, there is a _reason_ why microkernels suck. This is an example
of how things are _not_ "independent". The filesystems depend on
the VM, and the VM depends on the filesystem. You can't just split
them up as if they were two separate things (or rather: you _can_
split them up, but they still very much need to know about each
other in very intimate ways).
In this case, the worst problems can be avoided by simply disallowing
shared, writable mappings. That limitation will not, in fact, bother too
many people; these mappings are not heavily used. It's also necessary to
take steps like limiting the number of pages currently queued for writing
out. This limit will affect users, in that it will reduce performance. It
has been noted, however, that deadlocks tend to have an even worse impact
on performance.
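For readers who have not met them, a shared, writable mapping looks roughly like this from user space. The sketch uses Python's standard mmap module on a Unix system; the function name and the temporary file are assumptions for illustration, and nothing here is specific to FUSE.

```python
import mmap
import os
import tempfile


def dirty_shared_mapping(path, size=4096):
    """Dirty one page of `path` through a shared, writable mapping."""
    with open(path, "r+b") as f:
        f.truncate(size)  # make sure the file is large enough to map
        with mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[0:5] = b"dirty"  # an ordinary memory store, not a write() call
            m.flush()          # msync(): ask for the page to be written back
    with open(path, "rb") as f:
        return f.read(5)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
result = dirty_shared_mapping(demo_path)
os.unlink(demo_path)
```

The store into the mapping is a plain memory access, which is why a process can dirty a great many pages this way before the kernel has any opportunity to throttle it.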
In response to the above concerns (and others), the FUSE patches have been
reworked. Among other things, the shared, writable mapping support has
been split out into a separate, optional patch. There's no word on whether
it will be merged, though Linus did suggest
that it might:
I'm a sucker. Ask anybody. I'll accept the exact same patch that I
rejected earlier if you just do it the right way. I'm convinced
that some people actually do it on purpose just for the amusement
value ("Look, he did it _again_. What a doofus!")
Whether Andrew Morton is so gullible remains to be seen.
After a long period of development, the OpenIB Alliance has posted
an initial set of InfiniBand patches for review. The
current patch set is not proposed for inclusion, though the project has
made it clear that merging into a not-too-distant 2.6 kernel is something
they would like. The initial comments suggest that there may not be much
opposition to that.
The patch set is large, reflecting the complexity of the InfiniBand
specification. At the bottom layer, a driver for Mellanox adapters is
included with the patch set; it's some 9,000 lines of sparsely-commented
code. The core "midlayer" manages InfiniBand ports and makes access to the
fabric available for the upper layers. The midlayer also allows for
user-space administration by facilitating the passing of "MADs"
("management datagrams") back and forth.
The upper layers of the InfiniBand specification envision support for a number of features,
including MPI (message passing interface, heavily used in clustered
applications), SDP (socket direct protocol: a networking standard based on
remote DMA), SRP (remote SCSI), and IP over InfiniBand using the classic
socket interface. The current OpenIB patches concentrate on full IP (both
IPv4 and IPv6) support; most of the other high-level protocols are not yet
implemented.
The comments on the InfiniBand code have been relatively minor, so far.
The project's choices for device names (deeply nested names like
/dev/infiniband/mthca0/ports/1/mad) will likely be changed. The
project also went with dynamic device number assignment. This technique
works well on systems running a tool like udev to create the
device nodes, but it makes life difficult on systems where device nodes
must be created manually by the administrator. For now, at least, plenty
of such systems exist, so static device numbers are needed. The OpenIB
drivers also rely on ioctl() calls for a number of administrative
functions; questions were raised, but the current interface is not likely
to be changed in any significant way.
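To see why dynamic device numbers are awkward without udev, consider what an administrator would have to do by hand: look up the major number the kernel happened to assign (it is listed in /proc/devices) and only then create the node. The sketch below parses text in the /proc/devices layout; the driver name "example_mad" and the sample contents are made up for illustration and are not the names used by the OpenIB code.

```python
def find_char_major(proc_devices_text, driver_name):
    """Return the character-device major listed for driver_name, or None."""
    in_char_section = False
    for line in proc_devices_text.splitlines():
        line = line.strip()
        if line == "Character devices:":
            in_char_section = True
        elif line == "Block devices:":
            in_char_section = False
        elif in_char_section and line:
            major, name = line.split(None, 1)
            if name == driver_name:
                return int(major)
    return None


# Made-up sample in the same layout as /proc/devices.
SAMPLE = """Character devices:
  1 mem
 10 misc
254 example_mad

Block devices:
  8 sd
"""

major = find_char_major(SAMPLE, "example_mad")
# An administrator would then run mknod with this major number; with a
# static assignment the number is known in advance and this step goes away.
```

Since a dynamically assigned number can differ from one boot to the next, any hand-made device nodes would have to be checked and possibly recreated each time.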
Perhaps the most surprising complaint, to many, was the objection to the
dual GPL/BSD license carried by the OpenIB code. BSD-licensed code is not
normally a problem in the kernel; it can be included in a larger,
GPL-licensed program without any sort of infringement. The OpenIB code
uses read-copy-update (RCU), however, and
that usage brings an additional constraint. IBM holds a patent on RCU, and
has licensed that patent for use with GPL-licensed code. As is the case
with many of these patent licenses, BSD-licensed code is not covered. So
the OpenIB developers may find themselves having to (1) drop the BSD license from
their code, (2) stop using RCU, or (3) get some sort of special
exemption from IBM. It appears that they
will choose the second option.
One issue which has not come up is concern over the licensing of the
InfiniBand specification or any patents which may apply to it. The
InfiniBand developers seem to have resolved those concerns through
a combination of easing access to the specification and pointing out that
the InfiniBand patent agreement is closely aligned with the agreements
which apply to other standards, such as PCI. There may well be patented
technologies lurking within the InfiniBand specification, but InfiniBand
should not present a higher risk of patent difficulties than any other part
of the kernel.
Andrew Tridgell has been hacking away on Samba 4 for a while now; that
project has gotten to the point that he has
started doing some performance testing. His
first set of results looked like this (numbers in MB/sec):
| Filesystem | No xattr | With xattr |
| ext2 | 68 | 64 |
| ext3 | 67 | 58 |
| xfs | 62 | 40 |
| xfs 2K inode | 63 | 58 |
| tmpfs | 69 | -- |
| jfs | 36 | 29 |
| reiser3 | 58 | 44 |
These results show that every filesystem tested with extended attributes
slows down when they are used. This matters for Samba 4 because Windows
filesystems make heavy use of extended attributes. As Tridge put it:
The high cost of xattr support is a bit of a problem.... I hope we can
reduce the cost of xattrs as otherwise Samba4 is going to be
seriously disadvantaged when full windows compatibility is
needed. I'm guessing that nearly all Samba installs will be using
xattrs by this time next year, as we can't do basic security
features like WinXP security zones without them, so making them
perform well will be important.
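The relative cost of extended attributes can be worked out directly from the table above. The short calculation below uses only the posted MB/sec figures; the function and variable names are just for illustration, and tmpfs is left out because no xattr result was given for it.

```python
# (no xattr, with xattr) throughput in MB/sec, copied from the table.
RESULTS_MB_PER_SEC = {
    "ext2": (68, 64),
    "ext3": (67, 58),
    "xfs": (62, 40),
    "xfs 2K inode": (63, 58),
    "jfs": (36, 29),
    "reiser3": (58, 44),
}


def xattr_slowdown_percent(no_xattr, with_xattr):
    """Percentage of throughput lost when extended attributes are used."""
    return 100.0 * (no_xattr - with_xattr) / no_xattr


slowdowns = {
    fs: round(xattr_slowdown_percent(plain, xattr), 1)
    for fs, (plain, xattr) in RESULTS_MB_PER_SEC.items()
}

for fs, pct in sorted(slowdowns.items(), key=lambda item: item[1]):
    print(f"{fs:14s} {pct:5.1f}% slower with xattrs")
```

By this measure, default XFS loses roughly 35% of its throughput, while the same filesystem with 2K inodes loses about 8%; ext2 is the least affected at around 6%.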
The cause of the performance problems is not particularly mysterious. Most
filesystems store extended attributes in a special data block, away from
the rest of the associated file's metadata. So working with a file's
extended attributes forces the filesystem to go out and read another block
from the drive. The extra transfers and seeks take their toll on
performance, as can be seen in the numbers above.
A pointer to the solution can be seen there as well. The "xfs 2K inode"
results were obtained by turning on the XFS large inode option. This
option expands the size of the on-disk inode structure, making room for the
extended attributes to be stored there. When the inode is read from the
drive, the extended attributes come with it, and no separate I/O is
required to work with them. When this option is enabled, the performance
hit for using extended attributes with XFS is much reduced.
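For reference, an extended attribute is just a small name/value pair attached to a file. A minimal sketch of setting and reading one from user space follows, using Python's os.setxattr() and os.getxattr(), which are available on Linux only. The attribute name "user.example" and the helper function are assumptions for illustration; this is not how Samba stores its data.

```python
import os
import tempfile


def xattr_round_trip(path, name="user.example", value=b"demo"):
    """Set one extended attribute and read it back; None if unsupported."""
    if not hasattr(os, "setxattr"):
        return None  # platform without the Linux xattr calls
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name)
    except OSError:
        return None  # filesystem does not accept user xattrs here


with tempfile.NamedTemporaryFile() as tmp:
    result = xattr_round_trip(tmp.name)
```

Each such call is where the extra I/O described above can come in: if the attribute lives in a separate block, the filesystem must read that block in addition to the inode.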
It turns out that a large inode patch for
ext3 has been in the works for a while; it has passed muster with the
ext3 developers, but has not yet been pushed into the mainline. Tridge tried this patch and was pleased with the
results:
Using a 256 byte inode on ext3 gained a factor of up to 7x in
performance, and only lost a very small amount when xattrs were not
used. It took ext3 from a very mediocre performance to being the
clear winner among current Linux journaled filesystems for
performance when xattrs are used. Eventually I think that larger
inodes should become the default.
First, however, the patch must be merged. With testimonials like this,
that merger is likely to happen in the relatively near future.
One interesting mystery remains: Tridge gets notably better results with
2.6.10-rc2-mm2 than with 2.6.10-rc2. As of this writing,
nobody seems to have an explanation for why ext3 should perform that much
better in the -mm kernel. Inquiring minds very much want to know,
and Andrew Morton is working on finding out which patch makes the
difference.
Page editor: Jonathan Corbet