Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.9-rc1, announced by Linus on August 24. Note that this patch applies against 2.6.8, not 2.6.8.1. Changes merged include a bunch of gcc-3.5 fixes, a big serial ATA update, a number of NT filesystem improvements, block I/O barrier support for several filesystems and transports, the limited ability for normal processes to lock memory, lots of CPU frequency controller patches, some read-copy-update improvements, a netfilter update, an ACPI update, the token-based thrashing control patch (see the August 4 Kernel Page), a new USB storage block driver, lots of architecture updates, and lots of fixes. The long-format changelog has the details.
Linus has continued merging patches at a high rate; his BitKeeper repository contains, as of this writing, numerous network driver updates, some random number generator fixes, a fix for the audio CD writing memory leak, some VFS interface improvements, executable support in hugetlb mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a number of asynchronous I/O fixes and improvements, a User-mode Linux update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler tweaks, the removal of the very last suser() call, and lots of fixes.
The current patch from Andrew Morton is 2.6.8.1-mm4. Recent changes to -mm include the return of the kexec code, a change in the copy_*_user() interface (see below), Nick Piggin's CPU scheduler ("to see what happens"), and the reiser4 filesystem (see below).
The current 2.4 prepatch is still 2.4.28-pre1; Marcelo has released no prepatches since August 15.
Kernel development news
Quote of the week
In regards to claims by ext2 that they are the de facto standard Linux filesystem, the most polite thing to say is that many persons disagree, and it is interesting that those persons seem to include the distros that are growing in market share. See http://www.namesys.com/benchmarks.html for why many disagree.
-- From the reiser4 configuration help text
Looking at reiser4
The reiser4 filesystem came one step closer to inclusion when it was added to 2.6.8.1-mm2. This filesystem was covered here in July, 2003; those interested in a lengthy writeup with lots of details and weird artwork can find it at namesys.com. In short, reiser4's claims include very high performance, high-level transactional capability, enhanced security, and a flexible plugin architecture which should make it possible to do truly different and interesting things.
Actually playing with reiser4 involves getting a recent -mm kernel (or downloading it separately and applying it to another kernel). The tools for building and checking reiser4 filesystems can be found over here. There is a shared library ("libaal") which must be built first, followed by the "reiser4progs" package. If the reiser4progs configuration process tells you that you lack the proper version of libaal, it probably means you forgot to run ldconfig between the two steps.
We ran some very simple tests using the only benchmark that really matters: working with the kernel source tree. The first step was to look at the simple usage of space; reiser4 claims to be more efficient in that regard. This table indicates how much space was used (in KB) at various points in the kernel build process:
| Filesystem | Empty (KB) | New kernel tree (KB) | Built kernel tree (KB) |
|---|---|---|---|
| reiser4 | 188 | 206,000 | 659,000 |
| ext3 | 32,800 | 271,000 | 727,000 |
An empty ext3 filesystem has a fair amount of overhead (almost 33MB on a 2GB partition) that is not seen on reiser4; the reason is that reiser4 does not need to pre-allocate any inode tables. That saves some space; it also means that reiser4 filesystems will never run out of inodes. Reiser4 is also clearly more efficient in its file layout; an unbuilt kernel tree takes about 15% less space than on ext3.
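The article does not say how the "about 15%" figure was derived. As a rough check, and assuming nothing beyond the table's own numbers, the sketch below computes the saving two ways: directly from the two "new kernel tree" figures (roughly 24%), and after subtracting each filesystem's empty-state usage (roughly 14%). The second adjustment is an assumption about the method, not something the text confirms.

```c
#include <assert.h>

/* Figures from the space-usage table, in KB. */
enum {
    REISER4_EMPTY = 188,   REISER4_TREE = 206000,
    EXT3_EMPTY    = 32800, EXT3_TREE    = 271000
};

/* Percentage by which `a` is smaller than `b`. */
double percent_less(double a, double b)
{
    assert(b > 0.0);
    return 100.0 * (b - a) / b;
}

/* Raw comparison of the two "new kernel tree" figures. */
double raw_saving(void)
{
    return percent_less(REISER4_TREE, EXT3_TREE);
}

/* Comparison after subtracting the empty-filesystem usage from each
 * figure (an assumed adjustment; the article does not state it). */
double net_saving(void)
{
    return percent_less(REISER4_TREE - REISER4_EMPTY,
                        EXT3_TREE - EXT3_EMPTY);
}
```

Calling raw_saving() and net_saving() from a small test program shows how much of ext3's apparent disadvantage comes from its pre-allocated inode tables rather than from file layout.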
The next step was a set of highly unscientific timing tests involving various tasks: untarring a kernel, building that kernel, grepping dirty words out of the kernel source, and two find commands: one which tests on file names only, and one requiring a stat() of each file. The tests were run on some bleeding-edge hardware: an otherwise unused 4GB IDE disk on a dual Pentium-450 system. The filesystem was unmounted between tests to clear its pages out of the cache. Here are the results; two times are presented for each test, in the form elapsed/system.
| Filesystem | Untar | Build | Grep | find (name) | find (stat) |
|---|---|---|---|---|---|
| reiser4 | 67/41 | 1583/386 | 78/12 | 12.5/1.3 | 15.2/4.0 |
| ext3 | 55/24 | 1400/217 | 62/8 | 10.4/1.1 | 12.1/2.5 |
Anybody who tries to draw any real conclusions from the above results should probably think again. That said, it would seem that reiser4's claim to being the fastest Linux filesystem remains unproven. Incidentally, here's another quote from the reiser4 configuration help text:
This text describes a debugging option; that option was not enabled for these tests.
Meanwhile, the inclusion of reiser4 into -mm has, as desired, increased the number of developers looking at the code. Many of them are not entirely happy with what they see. The first problem is that reiser4 will fail horribly with 4K kernel stacks; it seems that quite a few large data structures are kept on the stack. The reiser4 hackers will be looking at reworking memory allocation to get around that particular problem.
Rik van Riel was the first to stumble across the sys_reiser4() system call. The code to implement sys_reiser4() is present (and built) in -mm, but the actual call is not added to the system call table. A patch comes with the source to make that addition, however.
According to the documentation:
This syntax, it seems, is implemented via a yacc-generated parser, which is duly stuffed into the kernel. As Rik notes, this approach is likely to be controversial, even before people start thinking about what the new operations actually do.
Reiser4 blurs the distinction between files and directories as part of Hans Reiser's general view of how filesystems should be used. For example, extended attributes, according to Hans, should not exist in their own namespace; they should just look like more files. With the right plugins, it should also be possible to do things like treat a tar archive as a directory tree and move around within it. There are, it seems, some immediate problems with this idea. As Christoph Hellwig pointed out, reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes, and is unlikely to go over well. Al Viro noted some severe locking problems (leading to easy denial of service attacks) with the file-as-directory implementation as well.
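To make the O_DIRECTORY concern concrete, here is a minimal user-space sketch of the guarantee that callers rely on. On a conventional filesystem, open() with O_DIRECTORY fails with ENOTDIR when the target is not a directory. The helper names open_dir_only() and rejects_non_directory() are hypothetical, invented for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` only if it is a directory.
 * Returns a file descriptor on success, or -1 with errno set
 * (ENOTDIR when the target exists but is not a directory).
 */
int open_dir_only(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_DIRECTORY);
}

/*
 * Returns 1 if the open of a non-directory was refused with ENOTDIR,
 * 0 if the open unexpectedly succeeded, and -1 on any other error.
 */
int rejects_non_directory(const char *file_path)
{
    int fd = open_dir_only(file_path);

    if (fd >= 0) {
        close(fd);      /* open succeeded: the guarantee does not hold */
        return 0;
    }
    return errno == ENOTDIR ? 1 : -1;
}
```

A program that depends on this check to avoid following an attacker-substituted file would expect rejects_non_directory() to return 1 for any regular file; the behavior Christoph describes corresponds to the 0 case.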
Reiser4, it seems, may have a bit of a rough road on its way into the kernel. Hans's approach to PR is unlikely to help in this regard, though it should be noted that Linus likes some of the reiser4 features. One hopes that reiser4 will get into the kernel eventually. It would surely be a mistake to believe that the optimal set of filesystem semantics has been achieved. The reiser4 project is arguably the place where the most thinking is happening about where filesystems should go in the future. If Linux is unwilling to host the results of that work (after the obvious problems are fixed), it may eventually find itself trying to catch up with some other kernel which proves to be more accepting.
API changes under consideration
There are two relatively significant API changes which are currently being tossed around for possible inclusion. Forewarned is forearmed, and all that, so here's a quick summary of what is being looked at.
2.6.8.1-mm4 included a patch which changes how copy_to_user() and copy_from_user() return a failure status. These functions have, for a long time, returned the number of bytes which they failed to copy to or from user space. This interface differs from what kernel programmers normally expect, and has caused confusion and bugs many times in the past. As David Miller put it:
Rusty Russell also expressed his opinion on the copy_*_user() interface, as only Rusty can, a couple of years ago.
Andrew Morton has decided that, perhaps, the time has come to fix the interface. In 2.6.8.1-mm4, the copy functions return the usual negative error code when things fail - at least, on the i386 platform. The change is overtly experimental: "It's a see-what-breaks thing." So far, reports of breakage are relatively scarce.
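The kind of bug the old convention invites can be sketched in user space. The functions below are hypothetical stand-ins, not kernel code: mock_copy_old() models the long-standing "bytes not copied" return value, and mock_copy_new() models the experimental behavior, with -EFAULT assumed as the negative error code. The `ok` parameter is an invented knob that says how many bytes can be copied before the simulated fault.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Old convention: return the number of bytes NOT copied (0 on success). */
unsigned long mock_copy_old(void *to, const void *from,
                            unsigned long n, unsigned long ok)
{
    unsigned long done = ok < n ? ok : n;

    assert(to != NULL && from != NULL);
    memcpy(to, from, done);
    return n - done;
}

/* Experimental convention: 0 on success, a negative error on failure.
 * -EFAULT is an assumption made for this sketch. */
long mock_copy_new(void *to, const void *from,
                   unsigned long n, unsigned long ok)
{
    return mock_copy_old(to, from, n, ok) ? -EFAULT : 0;
}

/* A caller written with the usual "negative means error" habit. */
int caller_checks_negative(long ret)
{
    return ret < 0 ? -EFAULT : 0;
}
```

Passing the result of mock_copy_old() to caller_checks_negative() reports success even when most of the buffer was never copied, because the leftover byte count is positive. The same caller handles mock_copy_new() correctly.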
On the other front, consider remap_page_range(). This function is prototyped as:
    int remap_page_range(struct vm_area_struct *vma, unsigned long virt,
                         unsigned long phys, unsigned long size,
                         pgprot_t prot);
Its primary use is mapping memory found on I/O controllers into the virtual address space of a process. This function is accompanied by io_remap_page_range(), which is more explicitly intended for I/O areas. On almost every architecture, io_remap_page_range() is simply another name for remap_page_range(), but the SPARC architecture is different; it can make use of that architecture's I/O space to do things more efficiently.
Paul Jackson recently noticed another difference: the SPARC versions of io_remap_page_range() have six arguments, while everybody else has only five. Needless to say, this is a curious discrepancy; it also makes it hard to write platform-independent code which uses io_remap_page_range().
The extra argument on the SPARC architecture is an integer "space" value; what it really is for, it turns out, is to specify the "I/O space" into which the pages are to be mapped. It is a response to a problem with the remap_page_range() interface: the physical address which is to be the target of the mapping is typed as an unsigned long. So a target address which requires more than 32 bits cannot be specified on 32-bit systems. SPARC I/O space addresses are above the 32-bit range. So the extra argument is required on the SPARC simply to provide the upper 32 bits for the physical address.
Various options for smoothing out the difference were considered. In the end, the idea that seems to be winning is to change the remap_page_range() API slightly: instead of passing the target address as an address, that value should be expressed as a page frame number. That change gets rid of the 12 address bits used for the offset within the page (which are unused in remap_page_range() since that function deals in whole pages) and lets them be used for additional high-end bits, effectively extending the address range to 44 bits - which is enough.
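The arithmetic behind the 44-bit figure can be sketched as follows. This is a user-space illustration, not the proposed kernel code: uint32_t stands in for a 32-bit unsigned long, and the names PAGE_SHIFT_4K, phys_to_pfn(), and pfn_to_phys() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12    /* 4KB pages: the 12 offset bits in the text */

/* Convert a page-aligned physical address to a page frame number that
 * fits in the 32-bit "unsigned long" of a 32-bit system. */
uint32_t phys_to_pfn(uint64_t phys)
{
    /* The address must be page aligned and no wider than 44 bits. */
    assert((phys & ((UINT64_C(1) << PAGE_SHIFT_4K) - 1)) == 0);
    assert((phys >> PAGE_SHIFT_4K) <= UINT32_MAX);
    return (uint32_t)(phys >> PAGE_SHIFT_4K);
}

/* Recover the full physical address from a 32-bit page frame number. */
uint64_t pfn_to_phys(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT_4K;
}
```

A 32-bit page frame number shifted left by 12 bits covers addresses up to 0xFFFFFFFF000, which is where the 32 + 12 = 44 bits come from.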
William Lee Irwin has put together a patch which implements this change for most architectures. Since the change breaks every caller of remap_page_range(), the patch touches a lot of files. Should the patch ever be merged, externally-maintained drivers will have to be fixed as well. This transition will not be helped by the fact that the compiler will not be able to detect unfixed code.
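The reason the compiler cannot help is that the argument keeps its C type; only its meaning changes. The sketch below uses a hypothetical stand-in, mock_remap_target(), which simply reports the physical address that would end up mapped. It is not the real remap_page_range().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the changed function: the argument is now a
 * page frame number, but it is still typed as unsigned long. */
uint64_t mock_remap_target(unsigned long pfn)
{
    return (uint64_t)pfn << 12;     /* physical address actually mapped */
}

/* Unfixed caller: still passes a physical address. This compiles
 * without any diagnostic, yet maps the wrong memory. */
uint64_t unfixed_caller(unsigned long phys)
{
    return mock_remap_target(phys);
}

/* Fixed caller: converts the page-aligned address to a frame number. */
uint64_t fixed_caller(unsigned long phys)
{
    assert((phys & 0xFFFUL) == 0);
    return mock_remap_target(phys >> 12);
}
```

For a physical address of 0x10000000, fixed_caller() maps 0x10000000 as intended, while unfixed_caller() silently targets 0x10000000000.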
Page editor: Jonathan Corbet
