Brief items
The current 2.6 prepatch is 2.6.9-rc1,
announced by Linus on August 24. Note
that this patch applies against 2.6.8, not 2.6.8.1. Changes merged include
a bunch of gcc-3.5 fixes, a big serial ATA update, a number of NT
filesystem improvements, block I/O barrier support for several filesystems
and transports, the limited ability for normal processes to lock memory,
lots of CPU frequency controller patches, some read-copy-update
improvements, a netfilter update, an ACPI update, the token-based thrashing
control patch (see
the August 4 Kernel
Page), a new USB storage block driver, lots of architecture updates,
and lots of fixes.
The long-format
changelog has the details.
Linus has continued merging patches at a high rate; his BitKeeper
repository contains, as of this writing, numerous network driver updates,
some random number generator fixes, a fix for the audio CD writing memory
leak, some VFS interface improvements, executable support in hugetlb
mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a
number of asynchronous I/O fixes and improvements, a User-mode Linux
update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler
tweaks, the removal of the very last suser() call, and lots of fixes.
The current patch from Andrew Morton is 2.6.8.1-mm4. Recent changes to -mm include
the return of the kexec code, a change in the copy_*_user()
interface (see below), Nick Piggin's CPU scheduler ("to see what
happens"), and the reiser4 filesystem (see below).
The current 2.4 prepatch is still 2.4.28-pre1; Marcelo has released no
prepatches since August 15.
Kernel development news
ReiserFS V3 is the stablest Linux filesystem, and V4 is the fastest.
In regards to claims by ext2 that they are the de facto standard
Linux filesystem, the most polite thing to say is that many persons
disagree, and it is interesting that those persons seem to include
the distros that are growing in market share. See
http://www.namesys.com/benchmarks.html for why many disagree.
-- From the reiser4 configuration help text
The reiser4 filesystem came one step closer to inclusion when it was added
to 2.6.8.1-mm2. This filesystem was covered here in July, 2003; those
interested in a
lengthy writeup with lots of details and weird artwork can find it at
namesys.com. In short, reiser4's claims
include very high performance, high-level transactional capability,
enhanced security, and a flexible plugin architecture which should make it
possible to do truly different and interesting things.
Actually playing with reiser4 involves getting a recent -mm kernel (or
downloading it separately and applying it to another kernel). The tools for
building and checking reiser4 filesystems can be found over here. There is a
shareable library ("libaal") which must be built first, followed by the
"reiser4progs" package. If the reiser4progs configuration process tells
you that you lack the proper version of libaal, it probably means you
forgot to run ldconfig between the two steps.
We ran some very simple tests using the only benchmark that really matters:
working with the kernel source tree. The first step was to look at the
simple usage of space; reiser4 claims to be more efficient in that regard.
This table indicates how much space was used (in KB) at various points in
the kernel build process:
| Filesystem | Empty (KB) | New kernel tree (KB) | Built kernel tree (KB) |
|------------|------------|----------------------|------------------------|
| reiser4    | 188        | 206,000              | 659,000                |
| ext3       | 32,800     | 271,000              | 727,000                |
An empty ext3 filesystem has a fair amount of overhead (almost 33MB on a
2GB partition) that is not seen on reiser4; the reason is that reiser4 does
not need to pre-allocate any inode tables. That saves some space; it also
means that reiser4 filesystems will never run out of inodes. Reiser4 is
also clearly more efficient in its file layout; once the empty-filesystem
overhead is subtracted, an unbuilt kernel tree takes roughly 14% less space
than on ext3 (about 24% less in raw terms).
The next step was a set of highly unscientific timing tests involving
various tasks: untarring a kernel, building that kernel, grepping dirty
words out of the kernel source, and two find commands: one which tests on
file names only, and one requiring a stat() of each file. The tests were
run on some bleeding-edge hardware: an otherwise unused 4GB IDE disk on a
dual Pentium-450 system. The filesystem was unmounted between tests to
clear its pages out of the cache. Here are the results; two times are
presented for each test: elapsed and system.
| Filesystem | Untar | Build    | Grep  | find (name) | find (stat) |
|------------|-------|----------|-------|-------------|-------------|
| reiser4    | 67/41 | 1583/386 | 78/12 | 12.5/1.3    | 15.2/4.0    |
| ext3       | 55/24 | 1400/217 | 62/8  | 10.4/1.1    | 12.1/2.5    |
Anybody who tries to draw any real conclusions from the above results
should probably think again. That said, it would seem that reiser4's claim
to being the fastest Linux filesystem remains unproven. Incidentally,
here's another quote from the reiser4 configuration help text:
If using a kernel made by a distro that thinks they are our
competitor (sigh) rather than made by Linus, always check each
release to make sure they have not turned this on to make us look
slow as was done once in the past.
This text describes a debugging option; that option was not enabled
for these tests.
Meanwhile, the inclusion of reiser4 into -mm has, as desired, increased the
number of developers looking at the code. Many of them are not entirely
happy with what they see. The first problem is that reiser4 will fail
horribly with 4K kernel stacks; it seems that quite a few large data
structures are kept on the stack. The reiser4 hackers will be looking at
reworking memory allocation to get around that particular problem.
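The general shape of such a rework can be sketched in user space. This is only an illustration of moving a large structure from the stack to the heap, not the reiser4 developers' actual fix: struct sketch_big_ctx and both functions are hypothetical, the 2048-byte size is invented, and malloc() stands in for the kernel's allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical large structure; the name and size are illustrative only
 * and are not taken from the reiser4 source. */
struct sketch_big_ctx {
    char scratch[2048];
};

/* Stack version: the whole structure lives in this function's frame,
 * which by itself would consume half of a 4K kernel stack. */
size_t sketch_on_stack(const char *name)
{
    struct sketch_big_ctx ctx;

    strncpy(ctx.scratch, name, sizeof(ctx.scratch) - 1);
    ctx.scratch[sizeof(ctx.scratch) - 1] = '\0';
    return strlen(ctx.scratch);
}

/* Heap version: only a pointer sits on the stack.  malloc() is a
 * user-space stand-in for a kernel allocation.  Returns (size_t)-1 if
 * the allocation fails, a case the stack version never had to handle. */
size_t sketch_on_heap(const char *name)
{
    struct sketch_big_ctx *ctx = malloc(sizeof(*ctx));
    size_t len;

    if (!ctx)
        return (size_t)-1;
    strncpy(ctx->scratch, name, sizeof(ctx->scratch) - 1);
    ctx->scratch[sizeof(ctx->scratch) - 1] = '\0';
    len = strlen(ctx->scratch);
    free(ctx);
    return len;
}
```

The cost of the heap version is an extra failure path and an allocation on every call, which is part of why this kind of conversion takes real work.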
Rik van Riel was the first to stumble across
the sys_reiser4() system call. The code to implement
sys_reiser4() is present (and built) in -mm, but the actual call
is not added to the system call table. A patch comes with the source to
make that addition, however.
According to the documentation:
A new system call sys_reiser4() will be implemented to support
applications that don't have to be fooled into thinking that they
are using POSIX. Through this entry point a richer set of semantics
will access the same files that are also accessible using POSIX
calls.... Reiser4() will implement all features
necessary to access ACLs as files/directories rather than as
something neither file nor directory. These include opening and
closing transactions, performing a sequence of I/Os in one system
call, and accessing files without use of file descriptors
(necessary for efficient small I/O). Reiser4 will use a syntax
suitable for evolving into Reiser5() syntax with its set theoretic
naming.
This syntax, it seems, is implemented via a yacc-generated parser, which is
duly stuffed into the kernel. As Rik notes, this approach is likely to be
controversial, even before people start thinking about what the new
operations actually do.
Reiser4 blurs the distinction between files and directories as part of Hans
Reiser's general view of how filesystems should be used. For example,
extended attributes, according to Hans, should not exist in their own
namespace; they should just look like more files. With the right plugins,
it should also be possible to do things like treat a tar archive
as a directory tree and move around within it. There are, it seems,
some immediate problems with this idea. As Christoph Hellwig pointed out, reiser4 allows an open
with the O_DIRECTORY flag to succeed even if the target is not a
directory. That defeats the use of O_DIRECTORY as a way of
avoiding race conditions and security holes, and is unlikely to go over
well. Al Viro noted some severe locking
problems (leading to easy denial of service attacks) with the
file-as-directory implementation as well.
Reiser4, it seems, may have a bit of a rough road on its way into the
kernel. Hans's approach to PR is unlikely
to help in this regard, though it should be noted that Linus likes some of the reiser4 features.
One hopes that reiser4 will get into the kernel eventually. It would surely be a
mistake to believe that the optimal set of filesystem semantics has been
achieved. The reiser4 project is arguably the place where the most
thinking is happening about where filesystems should go in the future. If
Linux is unwilling to host the results of that work (after the obvious
problems are fixed), it may eventually find itself trying to catch up with
some other kernel which proves to be more accepting.
There are two relatively significant API changes which are currently being
tossed around for possible inclusion. Forewarned is forearmed, and all
that, so here's a quick summary of what is being looked at.
2.6.8.1-mm4 included a
patch which changes how copy_to_user() and
copy_from_user() return a failure status. These functions have,
for a long time, returned the number of bytes which they failed to copy to
or from user space. This interface differs from what kernel programmers
normally expect, and has caused confusion and bugs many times in the past.
As David Miller put it:
People who are experts and work every day on their platform get
this stuff wrong, myself included. This means we are too dumb to
debug this code, according to The Practice of Programming :-)
Rusty Russell also expressed his opinion
on the copy_*_user() interface, as only Rusty can, a couple of
years ago.
Andrew Morton has decided that, perhaps, the time has come to fix the
interface. In 2.6.8.1-mm4, the copy functions return the usual negative
error code when things fail - at least, on the i386 platform. The change
is overtly experimental: "It's a see-what-breaks thing." So
far, reports of breakage are relatively scarce.
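To make the difference concrete, here is a small user-space sketch of the two conventions. The mock_copy_old() and mock_copy_new() functions are hypothetical stand-ins, not the kernel's code; the "readable" parameter exists only to simulate a fault partway through the copy, and the choice of -EFAULT as the "usual negative error code" is an assumption.

```c
#include <errno.h>
#include <string.h>

/*
 * Stand-in for the long-standing convention: returns the number of
 * bytes that could NOT be copied (0 on success).  "readable" is a
 * hypothetical parameter simulating a fault after that many bytes.
 */
unsigned long mock_copy_old(void *to, const void *from,
                            unsigned long n, unsigned long readable)
{
    unsigned long ok = n < readable ? n : readable;

    memcpy(to, from, ok);
    return n - ok;
}

/* Stand-in for the experimental convention: 0 on success, a negative
 * error code (assumed here to be -EFAULT) on failure. */
long mock_copy_new(void *to, const void *from,
                   unsigned long n, unsigned long readable)
{
    return mock_copy_old(to, from, n, readable) ? -EFAULT : 0;
}

/* Correct caller under the old convention: any nonzero return is a
 * failure and must be translated into an error code by hand. */
long careful_caller(void *to, const void *from,
                    unsigned long n, unsigned long readable)
{
    if (mock_copy_old(to, from, n, readable))
        return -EFAULT;
    return 0;
}

/* Buggy caller: it checks for a negative value, which the old
 * convention never returns, so a partial copy passes as success. */
long careless_caller(void *to, const void *from,
                     unsigned long n, unsigned long readable)
{
    long ret = (long)mock_copy_old(to, from, n, readable);

    if (ret < 0)
        return ret;
    return 0;
}
```

With an 8-byte request and only 3 readable bytes, mock_copy_old() reports 5 bytes left uncopied, careless_caller() returns 0 anyway, and mock_copy_new() returns -EFAULT. The careless pattern is exactly what a programmer used to negative error codes would write.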
On the other front, consider remap_page_range(). This function is
prototyped as:
    int remap_page_range(struct vm_area_struct *vma, unsigned long virt,
                         unsigned long phys, unsigned long size,
                         pgprot_t prot);
Its primary use is mapping memory found on I/O controllers into the virtual
address space of a process. This function is accompanied by
io_remap_page_range(), which is more explicitly intended for I/O
areas. On almost every architecture, io_remap_page_range() is
simply another name for remap_page_range(), but the SPARC
architecture is different; it can make use of that architecture's I/O space
to do things more efficiently.
Paul Jackson recently noticed another
difference: the SPARC versions of io_remap_page_range() have six
arguments, while everybody else has only five. Needless to say, this is a
curious discrepancy; it also makes it hard to write platform-independent
code which uses io_remap_page_range().
The extra argument on the SPARC architecture is an integer "space" value;
what it really is for, it turns out, is to specify the "I/O space" into
which the pages are to be mapped. It is a response to a problem with the
remap_page_range() interface: the physical address which is to be
the target of the mapping is typed as an unsigned long. So a
target address which requires more than 32 bits cannot be specified on
32-bit systems. SPARC I/O space addresses are above the 32-bit range. So
the extra argument is required on the SPARC simply to provide the upper 32
bits for the physical address.
Various options for smoothing out the difference were considered. In the
end, the idea that seems to be winning is to change the
remap_page_range() API slightly: instead of passing the target
address as an address, that value should be expressed as a page frame
number. That change gets rid of the 12 address bits used for the offset
within the page (which are unused in remap_page_range() since that
function deals in whole pages) and lets them be used for additional
high-end bits, effectively extending the address range to 44 bits - which
is enough.
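The arithmetic behind the 44-bit figure can be sketched as follows. All names here are hypothetical helpers rather than kernel interfaces, and the sketch assumes 4KB pages (hence the 12-bit shift) and a 32-bit unsigned long.

```c
#include <stdint.h>

/* Assumes 4KB pages, matching the 12 offset bits discussed above. */
#define SKETCH_PAGE_SHIFT 12

/* Combine a SPARC-style 32-bit "space" value (the upper bits) with a
 * 32-bit offset into a single physical address. */
uint64_t sketch_phys_addr(uint32_t space, uint32_t offset)
{
    return ((uint64_t)space << 32) | offset;
}

/* Drop the in-page offset bits to obtain a page frame number. */
uint64_t sketch_phys_to_pfn(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/* Would this page frame number fit in a 32-bit unsigned long? */
int sketch_pfn_fits_32bit(uint64_t pfn)
{
    return pfn <= UINT32_MAX;
}
```

The largest 44-bit address, 0xFFFFFFFFFFF, shifts down to the page frame number 0xFFFFFFFF, which just fits in 32 bits; an address one bit wider does not.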
William Lee Irwin has put together a patch
which implements this change for most architectures. Since the change
breaks every caller of remap_page_range(), the patch touches a lot
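of files. Should the patch ever be merged, externally-maintained drivers
will have to be fixed as well. This transition will not be helped by the
fact that the compiler will not be able to detect unfixed code: the argument
remains an unsigned long, so a caller which still passes a physical address
will build without complaint.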
Page editor: Jonathan Corbet