Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.10-rc2.

Linus's BitKeeper repository continues to accumulate patches, though they are mostly of a bugfix nature. They include some x86 single-stepping fixes, a number of "sparse" annotations, the token-based memory management fix, a memory technology device (and JFFS2) update, a frame buffer device update, some user-mode Linux patches, some page allocator tuning, and a few architecture updates.

The current patch from Andrew Morton is 2.6.10-rc2-mm3. Recent changes to -mm include a number of disk quota fixes, a big DVB update, and some fixes to make SELinux and reiserfs work together.

The current ultra-stable patch from Alan Cox is 2.6.9-ac11.

The current 2.4 kernel is 2.4.28; Marcelo seems in no hurry to start the 2.4.29 process. For those wanting an extra-hard 2.4 kernel, Solar Designer has released 2.4.28-ow1, which includes a number of security fixes.

Kernel development news

Xen is coming

Xen is a free virtualization system designed to allow multiple virtual machines to be run on a single host system with high performance. The Xen system (version 2.0 was released recently) offers a number of interesting features, including flexible networking between virtual machines and the ability to transparently move virtual machines between physical hosts while they are running. Xen's authors claim that the performance hit from running under Xen is only "a few percent."

Now that the 2.0 release is out, the Xen developers would like to merge their code into the mainline kernel. The bulk of this code adds the new Xen "architecture," which enables the kernel to run on the virtual machine provided by Xen itself. The architecture code is available from the Xen site for those who are interested. Another significant chunk is a set of drivers which provide Xen-hosted systems with network interfaces, file-backed block devices, and console devices.

Inclusion of both of those patch sets should be relatively uncontroversial; they do not affect any code which is not actually built for the Xen architecture, and thus should not risk breaking anything. The final set, however, will have to be looked at more closely; these are the patches to the core kernel itself. Most of these patches make the kernel work with Xen's very different way of managing and allocating memory; they include a new sk_buff structure allocation function, a change to how /dev/mem works on the Xen architecture, and a new ptep_establish_new() function which optimizes the instantiation of new pages. Perhaps the most controversial is a change to the architecture-specific arch_free_page() function: under Xen, this function might actually short-circuit the rest of the page allocator and dispose of the page itself. This technique allows Xen to manage a single page pool for multiple virtual machines, but not everybody liked changing the interface to arch_free_page() in that way.
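
The idea of a free-path hook that can consume the page can be illustrated with a small toy model. This is only a sketch in Python, not the kernel's C interface: the class, the hook's boolean return convention, and the "every even page" policy are all assumptions made up for the example.

```python
class PageAllocator:
    """Toy page allocator with an architecture hook on the free path."""

    def __init__(self, arch_free_page=None):
        self.free_list = []
        # The hook returns True when it has disposed of the page itself
        # (an assumed convention, chosen for this illustration).
        self.arch_free_page = arch_free_page or (lambda page: False)

    def free_page(self, page):
        if self.arch_free_page(page):
            return  # hook consumed the page; skip the normal free path
        self.free_list.append(page)


hypervisor_pool = []


def xen_like_hook(page):
    # Hypothetical policy: hand even-numbered pages back to a pool
    # shared by all virtual machines, keep the rest locally.
    if page % 2 == 0:
        hypervisor_pool.append(page)
        return True
    return False


native = PageAllocator()
guest = PageAllocator(arch_free_page=xen_like_hook)
for p in range(4):
    native.free_page(p)
    guest.free_page(p)

print(native.free_list, guest.free_list, hypervisor_pool)
```

In the model, the native allocator keeps every freed page, while the guest's allocator never sees the pages that the hook returned to the shared pool.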

That said, there appears to be no strong opposition to the inclusion of these patches. It would not be surprising to see them go into -mm sometime after 2.6.10 comes out.

Should FUSE be merged?

The Filesystem in Userspace (FUSE) patch has been around for some time. FUSE acts as a kernel filesystem which turns around and passes all VFS requests out to a user-space daemon, which is expected to do something reasonable with them. There are numerous projects using FUSE to implement interesting filesystems in user space. The FUSE developers have now requested that FUSE be merged into the 2.6 kernel. They may yet get there, but some obstacles stand in the way.

Linus started by complaining that FUSE was "too messy." Some of his impressions, it turns out, may have been based on a reading of old code. Some of the things he was complaining about were parts of the 2.4 version of the patch; they are not present in the version being put forward for inclusion.

There is, however, one show-stopping problem which remains in the code. If the system's memory gets to be full of dirty pages which must be written to a FUSE filesystem, and the user-space process which implements that filesystem has been swapped out, the system can deadlock. It cannot clean up those dirty pages until they have been written to the backing store, it cannot write those pages until the user-space daemon has been paged in, and it cannot page in the daemon until the dirty pages are cleaned. The system comes to a screeching halt and the users reconsider the whole idea of user-space filesystems.
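
The circular dependency described above can be written down as a small wait-for graph. The following is a toy model only; the node names are labels invented for this sketch and do not correspond to kernel data structures.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle found, or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so nothing is stuck
    # Close the loop so the result starts and ends on the same node.
    return path[path.index(node):] + [node]


# Each step cannot proceed until the step it points at has completed.
waits_for = {
    "free memory": "write dirty pages",
    "write dirty pages": "page in daemon",
    "page in daemon": "free memory",
}

cycle = find_cycle(waits_for, "free memory")
print(" -> ".join(cycle))
```

Every step in the chain waits on another step in the same chain, so none of them can ever complete.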

The problem is most easily demonstrated through the use of shared writable mappings. With such mappings, user space can create vast numbers of dirty pages without the operating system knowing about it. Andrew Morton demonstrated that this is not just a theoretical problem; it can be made to happen on real systems. The problem can also be made to happen by simply writing too much data to the filesystem. All this led Linus to lecture on the topic:

Guys, there is a _reason_ why microkernels suck. This is an example of how things are _not_ "independent". The filesystems depend on the VM, and the VM depends on the filesystem. You can't just split them up as if they were two separate things (or rather: you _can_ split them up, but they still very much need to know about each other in very intimate ways).
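
The shared writable mappings at the root of the problem are easy to create from user space. The sketch below, which assumes a Unix-like system and uses a temporary file as a stand-in for a file on any filesystem, dirties several pages with plain memory stores; no write() call is made, so the pages only reach the file when they are written back (forced here with flush(), which performs an msync()).

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
NPAGES = 8  # arbitrary size chosen for the example

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, PAGE * NPAGES)
    m = mmap.mmap(fd, PAGE * NPAGES, flags=mmap.MAP_SHARED,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    for i in range(NPAGES):
        m[i * PAGE] = 0x41  # one store per page, no write() system call
    m.flush()  # force the dirty pages back to the file
    m.close()
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.close(fd)
    os.unlink(path)

dirtied = sum(1 for i in range(NPAGES) if data[i * PAGE] == 0x41)
print(dirtied, "pages were dirtied through the mapping")
```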

In this case, the worst problems can be avoided by simply disallowing shared, writable mappings. That limitation will not, in fact, bother too many people; these mappings are not heavily used. It's also necessary to take steps like limiting the number of pages currently queued for writing out. This limit will affect users, in that it will reduce performance. It has been noted, however, that deadlocks tend to have an even worse impact on performance.
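
Limiting the number of pages queued for writing amounts to a counting limit on in-flight work. The following is a minimal user-space sketch of that idea using a semaphore; the class name, method names, and the limit of two are hypothetical and are not taken from the FUSE patches.

```python
import threading


class WritebackLimiter:
    """Cap the number of pages that may be queued for writeout at once."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_submit(self):
        # Claim a slot without blocking; False means the limit is reached
        # and the caller must wait (this is where performance is lost).
        return self._slots.acquire(blocking=False)

    def complete(self):
        # Called once a queued page has actually been written.
        self._slots.release()


limiter = WritebackLimiter(max_in_flight=2)
accepted = [limiter.try_submit() for _ in range(3)]
limiter.complete()
accepted.append(limiter.try_submit())
print(accepted)
```

The third submission is refused until an earlier write completes, which bounds how much dirty data can ever be waiting on the user-space daemon.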

In response to the above concerns (and others), the FUSE patches have been reworked. Among other things, the shared, writable mapping support has been split out into a separate, optional patch. There's no word on whether it will be merged, though Linus did suggest that it might:

I'm a sucker. Ask anybody. I'll accept the exact same patch that I rejected earlier if you just do it the right way. I'm convinced that some people actually do it on purpose just for the amusement value ("Look, he did it _again_. What a doofus!")

Whether Andrew Morton is so gullible remains to be seen.

InfiniBand arrives

After a long period of development, the OpenIB Alliance has posted an initial set of patches for review. The current patch set is not proposed for inclusion, though the project has made it clear that merging into a not-too-distant 2.6 kernel is something they would like. The initial comments suggest that there may not be much opposition to that.

The patch set is large, reflecting the complexity of the InfiniBand specification. At the bottom layer, a driver for Mellanox adapters is included with the patch set; it's some 9,000 lines of sparsely-commented code. The core "midlayer" manages InfiniBand ports and makes access to the fabric available for the upper layers. The midlayer also allows for user-space administration by facilitating the passing of "MADs" ("management datagrams") back and forth.

The upper layers of the InfiniBand specification envision support for a number of features, including MPI (message passing interface, heavily used in clustered applications), SDP (socket direct protocol: a networking standard based on remote DMA), SRP (remote SCSI), and IP over InfiniBand using the classic socket interface. The current OpenIB patches concentrate on full IP (both IPv4 and IPv6) support; most of the other high-level protocols are not yet implemented.

The comments on the InfiniBand code have been relatively minor, so far. The project's choices for device names (deeply nested names like /dev/infiniband/mthca0/ports/1/mad) will likely be changed. The project also went with dynamic device number assignment. This technique works well on systems running a tool like udev to create the device nodes, but it makes life difficult on systems where device nodes must be created manually by the administrator. For now, at least, plenty of such systems exist, so static device numbers are needed. The OpenIB drivers also rely on ioctl() calls for a number of administrative functions; questions were raised, but the current interface is not likely to be changed in any significant way.
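
To see why static numbers matter to an administrator creating nodes by hand, consider what must be known in advance: the major and minor numbers. The sketch below uses major 240, which lies in the range reserved for local and experimental use; it is a placeholder, not a number assigned to InfiniBand, and the port-to-minor mapping is likewise invented for the example.

```python
import os

# Hypothetical static assignment, for illustration only.
HYPOTHETICAL_MAJOR = 240
PORT_MINOR_BASE = 0


def mad_device_number(port):
    """Combine the assumed major with a per-port minor number."""
    return os.makedev(HYPOTHETICAL_MAJOR, PORT_MINOR_BASE + port)


dev = mad_device_number(1)
print(os.major(dev), os.minor(dev))

# With a fixed number, root could create the node by hand, e.g.:
#   os.mknod("/dev/example_mad1", stat.S_IFCHR | 0o600, dev)
# With dynamic assignment, the number is unknown until the driver loads.
```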

Perhaps the most surprising complaint, to many, was the objection to the dual GPL/BSD license carried by the OpenIB code. BSD-licensed code is not normally a problem in the kernel; it can be included in a larger, GPL-licensed program without any sort of infringement. The OpenIB code uses read-copy-update (RCU), however, and that usage brings an additional constraint. IBM holds a patent on RCU, and has licensed that patent for use with GPL-licensed code. As is the case with many of these patent licenses, BSD-licensed code is not covered. So the OpenIB developers may find themselves having to (1) drop the BSD license from their code, (2) stop using RCU, or (3) get some sort of special exemption from IBM. It appears that they will choose the second option.

One issue which has not come up is concern over the licensing of the InfiniBand specification or any patents which may apply to it. The InfiniBand developers seem to have resolved those concerns through a combination of easing access to the specification and pointing out that the InfiniBand patent agreement is closely aligned with the agreements which apply to other standards, such as PCI. There may well be patented technologies lurking within the InfiniBand specification, but InfiniBand should not present a higher risk of patent difficulties than any other part of the kernel.

Which filesystem for Samba4?

Andrew Tridgell has been hacking away on Samba 4 for a while now; that project has gotten to the point that he has started doing some performance testing. His first set of results looked like this (numbers in MB/sec):

Filesystem      No xattr   With xattr
xfs 2K inode    63         58

These results show that all filesystems slow down when extended attributes are used. This matters for Samba 4 because Windows filesystems make heavy use of extended attributes. As Tridge put it:

The high cost of xattr support is a bit of a problem.... I hope we can reduce the cost of xattrs as otherwise Samba4 is going to be seriously disadvantaged when full windows compatibility is needed. I'm guessing that nearly all Samba installs will be using xattrs by this time next year, as we can't do basic security features like WinXP security zones without them, so making them perform well will be important.

The cause of the performance problems is not particularly mysterious. Most filesystems store extended attributes in a special data block, away from the rest of the associated file's metadata. So working with a file's extended attributes forces the filesystem to go out and read another block from the drive. The extra transfers and seeks take their toll on performance, as can be seen in the numbers above.
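
For readers unfamiliar with extended attributes, the following sketch sets and reads one back on Linux. The attribute name and value are made up for the example, and the helper returns None when the platform or the filesystem holding the temporary file does not support user attributes.

```python
import os
import tempfile


def roundtrip_xattr(path, name, value):
    """Set one extended attribute and read it back; None if unsupported."""
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name)
    except (OSError, AttributeError):
        # OSError: the filesystem refuses user xattrs.
        # AttributeError: this platform lacks os.setxattr entirely.
        return None


with tempfile.NamedTemporaryFile() as f:
    # "user.example.zone" is a hypothetical attribute name.
    result = roundtrip_xattr(f.name, b"user.example.zone", b"3")

print(result)
```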

A pointer to the solution can be seen there as well. The "xfs 2K inode" results were obtained by turning on the XFS large inode option. This option expands the size of the on-disk inode structure, making room for the extended attributes to be stored there. When the inode is read from the drive, the extended attributes come with it, and no separate I/O is required to work with them. When this option is enabled, the performance hit for using extended attributes with XFS is much reduced.
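
The effect can be approximated with a back-of-the-envelope model. The sketch below assumes a cold cache, one inode read per file, and one extra block read per file whenever the attributes live outside the inode; real filesystems cache and batch I/O, so the numbers are illustrative only.

```python
def block_reads(files, use_xattrs, xattrs_in_inode):
    """Count disk reads needed to fetch metadata for a number of files."""
    reads = 0
    for _ in range(files):
        reads += 1  # the inode itself
        if use_xattrs and not xattrs_in_inode:
            reads += 1  # separate extended attribute block
    return reads


no_xattrs = block_reads(1000, use_xattrs=False, xattrs_in_inode=False)
separate = block_reads(1000, use_xattrs=True, xattrs_in_inode=False)
inline = block_reads(1000, use_xattrs=True, xattrs_in_inode=True)
print(no_xattrs, separate, inline)
```

Under these assumptions, storing the attributes in the inode brings the read count back to what it was without extended attributes.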

It turns out that a large inode patch for ext3 has been in the works for a while; it has passed muster with the ext3 developers, but has not yet been pushed into the mainline. Tridge tried this patch and was pleased with the results:

Using a 256 byte inode on ext3 gained a factor of up to 7x in performance, and only lost a very small amount when xattrs were not used. It took ext3 from a very mediocre performance to being the clear winner among current Linux journaled filesystems for performance when xattrs are used. Eventually I think that larger inodes should become the default.

First, however, the patch must be merged. With testimonials like this, that merger is likely to happen in the relatively near future.

One interesting mystery remains, however: Tridge gets notably better results with 2.6.10-rc2-mm2 than with 2.6.10-rc2. As of this writing, nobody seems to have an explanation for why ext3 should perform that much better in the -mm kernel. Inquiring minds very much want to know, however, and Andrew Morton is working at finding out which patch makes the difference.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds