Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.9-rc1, announced by Linus on August 24. Note that this patch applies against 2.6.8, not 2.6.8.1. Changes merged include a bunch of gcc-3.5 fixes, a big serial ATA update, a number of NT filesystem improvements, block I/O barrier support for several filesystems and transports, the limited ability for normal processes to lock memory, lots of CPU frequency controller patches, some read-copy-update improvements, a netfilter update, an ACPI update, the token-based thrashing control patch (see the August 4 Kernel Page), a new USB storage block driver, lots of architecture updates, and lots of fixes. The long-format changelog has the details.

Linus has continued merging patches at a high rate; his BitKeeper repository contains, as of this writing, numerous network driver updates, some random number generator fixes, a fix for the audio CD writing memory leak, some VFS interface improvements, executable support in hugetlb mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a number of asynchronous I/O fixes and improvements, a User-mode Linux update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler tweaks, the removal of the very last suser() call, and lots of fixes.

The current patch from Andrew Morton is 2.6.8.1-mm4. Recent changes to -mm include the return of the kexec code, a change in the copy_*_user() interface (see below), Nick Piggin's CPU scheduler ("to see what happens"), and the reiser4 filesystem (see below).

The current 2.4 prepatch is still 2.4.28-pre1; Marcelo has released no prepatches since August 15.

Kernel development news

Quote of the week

ReiserFS V3 is the stablest Linux filesystem, and V4 is the fastest.

In regards to claims by ext2 that they are the de facto standard Linux filesystem, the most polite thing to say is that many persons disagree, and it is interesting that those persons seem to include the distros that are growing in market share. See http://www.namesys.com/benchmarks.html for why many disagree.

-- From the reiser4 configuration help text

Looking at reiser4

The reiser4 filesystem came one step closer to inclusion when it was added to 2.6.8.1-mm2. This filesystem was covered here in July, 2003; those interested in a lengthy writeup with lots of details and weird artwork can find it at namesys.com. In short, reiser4's claims include very high performance, high-level transactional capability, enhanced security, and a flexible plugin architecture which should make it possible to do truly different and interesting things.

Actually playing with reiser4 involves getting a recent -mm kernel (or downloading the reiser4 patch separately and applying it to another kernel). The tools for building and checking reiser4 filesystems must be obtained separately. There is a shared library ("libaal") which must be built first, followed by the "reiser4progs" package. If the reiser4progs configuration process tells you that you lack the proper version of libaal, it probably means you forgot to run ldconfig between the two steps.

We ran some very simple tests using the only benchmark that really matters: working with the kernel source tree. The first step was to look at raw space usage; reiser4 claims to be more efficient in that regard. This table indicates how much space was used (in KB) at various points in the kernel build process:

Filesystem   Space usage (KB)
             Empty    New kernel tree   Built kernel tree
reiser4        188            206,000             659,000
ext3        32,800            271,000             727,000

An empty ext3 filesystem has a fair amount of overhead (almost 33MB on a 2GB partition) that is not seen on reiser4; the reason is that reiser4 does not need to pre-allocate any inode tables. That saves some space; it also means that reiser4 filesystems will never run out of inodes. Reiser4 is also clearly more efficient in its file layout; even with that fixed overhead subtracted, an unbuilt kernel tree takes about 15% less space than on ext3.

The next step was a set of highly unscientific timing tests involving various tasks: untarring a kernel, building that kernel, grepping dirty words out of the kernel source, and two find commands: one which tests on file names only, and one requiring a stat() of each file. The tests were run on some bleeding-edge hardware: an otherwise unused 4GB IDE disk on a dual Pentium-450 system. The filesystem was unmounted between tests to clear its pages out of the cache. Here are the results; each entry gives two times, elapsed and system.

Filesystem   Test (elapsed/system time)
             Untar    Build      Grep    find (name)   find (stat)
reiser4      67/41    1583/386   78/12   12.5/1.3      15.2/4.0
ext3         55/24    1400/217   62/8    10.4/1.1      12.1/2.5

Anybody who tries to draw any real conclusions from the above results should probably think again. That said, it would seem that reiser4's claim to being the fastest Linux filesystem remains unproven. Incidentally, here's another quote from the reiser4 configuration help text:

If using a kernel made by a distro that thinks they are our competitor (sigh) rather than made by Linus, always check each release to make sure they have not turned this on to make us look slow as was done once in the past.

This text describes a debugging option; that option was not enabled for these tests.

Meanwhile, the inclusion of reiser4 into -mm has, as desired, increased the number of developers looking at the code. Many of them are not entirely happy with what they see. The first problem is that reiser4 will fail horribly with 4K kernel stacks; it seems that quite a few large data structures are kept on the stack. The reiser4 hackers will be looking at reworking memory allocation to get around that particular problem.

Rik van Riel was the first to stumble across the sys_reiser4() system call. The code to implement sys_reiser4() is present (and built) in -mm, but the actual call is not added to the system call table. A patch comes with the source to make that addition, however.

According to the documentation:

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls.... Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4 will use a syntax suitable for evolving into Reiser5() syntax with its set theoretic naming.

This syntax, it seems, is implemented via a yacc-generated parser, which is duly stuffed into the kernel. As Rik notes, this approach is likely to be controversial, even before people start thinking about what the new operations actually do.

Reiser4 blurs the distinction between files and directories as part of Hans Reiser's general view of how filesystems should be used. For example, extended attributes, according to Hans, should not exist in their own namespace; they should just look like more files. With the right plugins, it should also be possible to do things like treat a tar archive as a directory tree and move around within it. There are, it seems, some immediate problems with this idea. As Christoph Hellwig pointed out, reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes, and is unlikely to go over well. Al Viro noted some severe locking problems (leading to easy denial of service attacks) with the file-as-directory implementation as well.

Reiser4, it seems, may have a bit of a rough road on its way into the kernel. Hans's approach to PR is unlikely to help in this regard, though it should be noted that Linus likes some of the reiser4 features. One hopes that reiser4 will get into the kernel eventually. It would surely be a mistake to believe that the optimal set of filesystem semantics has been achieved. The reiser4 project is arguably the place where the most thinking is happening about where filesystems should go in the future. If Linux is unwilling to host the results of that work (after the obvious problems are fixed), it may eventually find itself trying to catch up with some other kernel which proves to be more accepting.

API changes under consideration

There are two relatively significant API changes which are currently being tossed around for possible inclusion. Forewarned is forearmed, and all that, so here's a quick summary of what is being looked at.

2.6.8.1-mm4 included a patch which changes how copy_to_user() and copy_from_user() return a failure status. These functions have, for a long time, returned the number of bytes which they failed to copy to or from user space. This interface differs from what kernel programmers normally expect, and has caused confusion and bugs many times in the past. As David Miller put it:

People who are experts and work every day on their platform get this stuff wrong, myself included. This means we are too dumb to debug this code, according to The Practice of Programming :-)

Rusty Russell also expressed his opinion on the copy_*_user() interface, as only Rusty can, a couple of years ago.

Andrew Morton has decided that, perhaps, the time has come to fix the interface. In 2.6.8.1-mm4, the copy functions return the usual negative error code when things fail - at least, on the i386 platform. The change is overtly experimental: "It's a see-what-breaks thing." So far, reports of breakage are relatively scarce.

On the other front, consider remap_page_range(). This function is prototyped as:

    int remap_page_range(struct vm_area_struct *vma, unsigned long virt,
                         unsigned long phys, unsigned long size, 
                         pgprot_t prot);

Its primary use is mapping memory found on I/O controllers into the virtual address space of a process. This function is accompanied by io_remap_page_range(), which is more explicitly intended for I/O areas. On almost every architecture, io_remap_page_range() is simply another name for remap_page_range(), but the SPARC architecture is different; it can make use of that architecture's I/O space to do things more efficiently.

Paul Jackson recently noticed another difference: the SPARC versions of io_remap_page_range() have six arguments, while everybody else has only five. Needless to say, this is a curious discrepancy; it also makes it hard to write platform-independent code which uses io_remap_page_range().

The extra argument on the SPARC architecture is an integer "space" value which, it turns out, specifies the "I/O space" into which the pages are to be mapped. It is a response to a problem with the remap_page_range() interface: the physical address which is to be the target of the mapping is typed as an unsigned long, so a target address which requires more than 32 bits cannot be specified on 32-bit systems. SPARC I/O space addresses are above the 32-bit range, so the extra argument is required there simply to provide the upper 32 bits of the physical address.

Various options for smoothing out the difference were considered. In the end, the idea that seems to be winning is to change the remap_page_range() API slightly: instead of passing the target address as an address, that value should be expressed as a page frame number. That change gets rid of the 12 address bits used for the offset within the page (which are unused in remap_page_range() since that function deals in whole pages) and lets them be used for additional high-end bits, effectively extending the address range to 44 bits - which is enough.

William Lee Irwin has put together a patch which implements this change for most architectures. Since the change breaks every caller of remap_page_range(), the patch touches a lot of files. Should the patch ever be merged, externally-maintained drivers will have to be fixed as well. This transition will not be helped by the fact that the compiler will not be able to detect unfixed code: the argument remains an unsigned long, so a caller which still passes a physical address will compile without complaint.

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.9-rc1
Andrew Morton: 2.6.8.1-mm2
Andrew Morton: 2.6.8.1-mm3
Andrew Morton: 2.6.8.1-mm4
maximilian attems: 2.6.8.1-kjt2
Nick Piggin: 2.6.8.1-np1
Con Kolivas: 2.6.8.1-ck3
Con Kolivas: 2.6.8.1-ck4

Janitorial

Jonathan Corbet: Remove struct bus_type->add()
Dave Jones: includes cleanup.
William Lee Irwin III: WAITQUEUE_DEBUG crapectomy

Security-related

Michal Ludvig: /dev/crypto for Linux

Benchmarks and bugs

Martin J. Bligh: Performance of -mm2 and -mm4

Miscellaneous

Mariusz Mazur: linux-libc-headers 2.6.8.0
Stephen Hemminger: iproute2 for 2.6.8 (and 2.4.27)
Jeff Garzik: rng-tools updated

Page editor: Jonathan Corbet


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds