The developers have been busy, however. Linus's BitKeeper repository contains, as of this writing, more filesystem conversions to the new symbolic link resolution code (which will eventually allow an increase in the maximum link depth), a new waitid() system call implementing the POSIX call by the same name, a "fake NUMA" mode for x86-64 testing, a small-footprint tmpfs implementation, the base KProbes patch, a set of IDE updates, support for scheduler profiling (seeing where context switches come from), automatic TCP window scaling calculation, a kobject change (it uses kref now), a USB gadget interface update with "On The Go" support, a big ALSA update, the removal of the Philips webcam driver, numerous network driver updates, some random number generator fixes, a fix for the audio CD writing memory leak, some VFS interface improvements, executable support in hugetlb mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a number of asynchronous I/O fixes and improvements, a User-mode Linux update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler tweaks, the removal of the very last suser() call, and lots of fixes.
The current tree from Andrew Morton is 2.6.9-rc1-mm2. Recent changes to -mm include some scheduler fixes (Nick Piggin's scheduler is still in -mm), the removal of the resident set size limit ("pending some evidence that it does useful things"), the out-of-line spinlocks patch (for x86 and x86-64), lockmeter for x86-64, and many fixes and updates.
The current 2.4 prepatch is 2.4.28-pre2, released by Marcelo on August 25. Changes include a serial ATA update, some gcc-3.4 fixes, an NFS update, and various other fixes.
Kernel development news
Here is last week's table, with a new line for tests done starting with a reiser4-built tarball:
| Untar | Build | Grep | find (name) | find (stat) |
The results do show a significant difference in performance when the files are created in the right order - and the differences carry through all of the operations performed on the filesystem, not just the untar. In other words, the performance benefits of reiser4 are only fully available to those who manage to create their files in the right order. Future plans call for a "repacker" process to clean up after obnoxious users who insist on creating files in something other than the optimal order, but that tool is not yet available. (For what it's worth, restoring from the reiser4 tarball did not noticeably change the ext3 results).
Last week, the discussion about reiser4 got off to a rather rough start. Even so, it evolved into a lengthy but reasonably constructive technical conversation touching on many of the issues raised by reiser4.
At the top of the list is the general question of the expanded capabilities offered by this filesystem; these include transactions, the combined file/directory objects (and the general representation of metadata in the filesystem namespace), and more. The kernel developers are nervous about changes to filesystem semantics, and they are seriously nervous about creating these new semantics at the filesystem level. The general feeling is that any worthwhile enhancements offered by reiser4 should, instead, be implemented at the virtual filesystem (VFS) level, so that more filesystems could offer them. Some developers want things done that way from the start. If there is a consensus, however, it would be along the lines laid out by Andrew Morton: accept the new features in reiser4 for now (once the other problems are addressed) with the plan of shifting the worthwhile ones into the VFS layer. The reiser4 implementation would thus be seen as a sort of prototype which could be evolved into the true Linux version.
Hans Reiser doesn't like this idea:
Somehow, over the years, Hans has neglected to tell the developers that he was, in fact, planning to replace the entire VFS. That plan looks like a difficult sell, but reiser4 could become the platform that is used to shift the VFS in the directions he sees.
Meanwhile, the reiser4 approach to metadata has attracted a fair amount of attention. Imagine you have a reiser4 partition holding a kernel tree; at the top of that tree is a file called CREDITS. It's an ordinary file, but it can be made to behave in extraordinary ways:
    $ tree CREDITS/metas
    CREDITS/metas
    |-- bmap
    |-- gid
    |-- items
    |-- key
    |-- locality
    |-- new
    |-- nlink
    |-- oid
    |-- plugin
    |   |-- compression
    |   |-- crypto
    |   |-- digest
    |   |-- dir
    |   |-- dir_item
    |   |-- fibration
    |   |-- file
    |   |-- formatting
    |   |-- hash
    |   |-- perm
    |   `-- sd
    |-- pseudo
    |-- readdir
    |-- rwx
    |-- size
    `-- uid

    1 directory, 24 files
You can also type "cd CREDITS; cat ." to view the file. (One must set execute permission on the file before any of this works).
What appears to be a plain file also looks like a directory containing a number of other files. Most of these files contain information normally obtained with the stat() system call: uid is the owner, size is the length in bytes, rwx is the permissions mask, etc. Some of the others (bmap, items, oid) provide a window into how the file is represented inside the filesystem. This is all part of Hans Reiser's vision of moving everything into the namespace; rather than using a separate system call to learn about a file's metadata, just access the right pseudo file.
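To make the contrast concrete, here is a minimal sketch of the two ways an application might fetch a file's owner and size: the conventional stat() call, and a read of a pseudo file under metas. The helper names are ours, the assumption that each pseudo file holds its value as plain text is ours as well, and build_mock_object() is a hypothetical stand-in that fakes the layout with an ordinary directory so the sketch can run without a reiser4 partition.

```python
import os
import tempfile


def meta_via_stat(path):
    """Fetch owner, size and permission bits with the conventional stat() call."""
    st = os.stat(path)
    return {"uid": st.st_uid, "size": st.st_size, "rwx": st.st_mode & 0o777}


def meta_via_namespace(path, name, metas="metas"):
    """Read one attribute by opening a pseudo file such as CREDITS/metas/uid.

    Assumption: the pseudo file holds the value as text. The article does
    not describe the actual reiser4 format.
    """
    with open(os.path.join(path, metas, name)) as f:
        return f.read().strip()


def build_mock_object(root, uid, size):
    """Hypothetical stand-in: an ordinary directory that mimics the metas layout."""
    obj = os.path.join(root, "CREDITS")
    os.makedirs(os.path.join(obj, "metas"))
    for name, value in (("uid", uid), ("size", size)):
        with open(os.path.join(obj, "metas", name), "w") as f:
            f.write(f"{value}\n")
    return obj


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        obj = build_mock_object(root, os.getuid(), 0)
        print("namespace uid:", meta_via_namespace(obj, "uid"))
        print("stat() uid:   ", meta_via_stat(obj)["uid"])
```

The namespace version needs nothing beyond open() and read(), which is the appeal of the approach; the price is that the name metas is taken away from ordinary use.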
One branch of the discussion took issue with the "metas" name. Using reiser4 means that you cannot have any file named metas anywhere within the filesystem. Some people would like to change the name; ideas like ..metas, ..., and @ have been tossed around, but Hans seems disinclined to change things.
Another branch, led by Al Viro, worries about the locking considerations of this whole scheme. Linux, like most Unix systems, has never allowed hard links to directories for a number of reasons; one of those is locking. Those interested in the details can see this rather dense explanation from Al, or a translation by Linus to something resembling technical English. Linus's example is essentially this: imagine you have a directory "a" containing two subdirectories dir1 and dir2. You also have "b", which is simply a link to a. Imagine that two processes simultaneously attempt these commands:
| Process 1 | Process 2 |
| mv a/dir1 a/dir2/newdir | mv b/dir2 b/dir1/newdir |
Both commands cannot succeed, or you will have just tied your filesystem into a knot. So some sort of locking is required to serialize the above actions. Doing that kind of locking is very hard when there are multiple paths into the same directory; it is an invitation to deadlocks. The problem could be fixed by putting a monster lock around the entire filesystem, but the performance cost would be prohibitive. The usual approach has been to simply disallow this form of aliasing on directory names, and thus avoid the problem altogether.
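The "knot" can be seen in a toy model. The sketch below is ours, not anything from the kernel: directories are plain Python objects, "b" is a second link to the same object as "a", and both renames resolve their paths before either one modifies the tree, which is what can happen when nothing serializes them. It models no locking at all; it only shows the result if both operations are allowed to succeed.

```python
class Dir:
    """A toy directory: a name-to-child mapping. Not a model of any real VFS."""

    def __init__(self):
        self.entries = {}


def lookup(root, path):
    """Walk a slash-separated path down from root."""
    node = root
    for part in path.split("/"):
        node = node.entries[part]
    return node


def plan_rename(root, src, dst):
    """Resolve both parent directories up front, before anything is modified."""
    src_parent, src_name = src.rsplit("/", 1)
    dst_parent, dst_name = dst.rsplit("/", 1)
    return (lookup(root, src_parent), src_name,
            lookup(root, dst_parent), dst_name)


def apply_rename(plan):
    """Move the entry from its old parent to its new parent."""
    src_dir, src_name, dst_dir, dst_name = plan
    dst_dir.entries[dst_name] = src_dir.entries.pop(src_name)


def reachable(root):
    """Return the ids of every Dir object reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.entries.values())
    return seen


root, a, dir1, dir2 = Dir(), Dir(), Dir(), Dir()
a.entries = {"dir1": dir1, "dir2": dir2}
root.entries = {"a": a, "b": a}  # "b" is a second link to the same directory

# Both processes finish their lookups before either one changes the tree.
plan1 = plan_rename(root, "a/dir1", "a/dir2/newdir")
plan2 = plan_rename(root, "b/dir2", "b/dir1/newdir")
apply_rename(plan1)
apply_rename(plan2)

# dir1 and dir2 now contain each other and hang off nothing else.
print(dir1.entries["newdir"] is dir2, dir2.entries["newdir"] is dir1)
print(id(dir1) in reachable(root), id(dir2) in reachable(root))
```

After both renames, each directory is the other's only child and neither can be reached from the root: a loop that no path leads to.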
In the reiser4 world, all files are also directories. So hard links to files become hard links to directories, and all of these deadlock issues come to the foreground. The concern expressed by the kernel developers - which appears to be legitimate - is that the reiser4 team has not thought about these issues, and there is no plan to solve the problem. Wiring the right sort of mutual exclusion deeply into a filesystem is a hard thing to do as an afterthought. But something will have to be done; Al Viro has made it clear that he will oppose merging reiser4 until the issue has been addressed, and it is highly unlikely that it would go in over his objections (Linus: "This means that if Al Viro asks about locking and aliasing issues, you don't ignore it, you ask 'how high?'").
One way of dealing with the locking issues (and various other bits of confusion) would be to drop the "files as directories" idea and create a namespace boundary there. Files could still have attributes, but an application which wished to access them would use a separate system call to do so. The openat() interface, which is how Solaris solves the problem, seems like the favored approach. Pushing attributes into their own namespace breaks the "everything in one namespace" idea which is so fundamental to reiser4, but it would offer compatibility with Solaris and make many of the implementation issues easier to deal with. On the other hand, applications would have to be fixed to use openat() (or be run with runat).
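For readers who have not met the openat() style of interface, here is a minimal sketch of its general shape: a name is opened relative to an already-open descriptor rather than through a full path. It uses the dir_fd argument of Python's os.open(), which maps onto openat() on platforms that provide it. The helper names are ours, and the example opens an ordinary file in an ordinary directory; the Solaris attribute form is platform-specific and is not reproduced here.

```python
import os
import tempfile


def open_relative(dir_fd, name, flags=os.O_RDONLY):
    """Open `name` relative to an already-open directory descriptor."""
    return os.open(name, flags, dir_fd=dir_fd)


def read_relative(dir_path, name, limit=65536):
    """Read up to `limit` bytes of `name`, looked up relative to dir_path."""
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        fd = open_relative(dir_fd, name)
        try:
            return os.read(fd, limit)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "note"), "wb") as f:
            f.write(b"hello\n")
        print(read_relative(d, "note"))
```

The point of the design is that the second lookup starts from a descriptor, so whatever it names lives in a separate space from the ordinary path namespace.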
Another contingent sees the reiser4 files-as-directories scheme as the way to implement multi-stream files. Linux is one of the few modern operating systems without this concept. The Samba developers, in particular, would love to see a multi-stream implementation, since they have to export a multi-stream interface to the rest of the world. There are obvious simple applications of multi-stream files, such as attaching icons to things. Some people are ready to use the reiser4 plugin mechanism and go nuts, however; they would like to add streams which present compressed views of files, automatically produce and unpack archive files, etc. Linus draws the line at that sort of stuff, though:
Which means that the only _real_ technical issue for supporting named streams really ends up being things like samba, which want named streams just because the work they do fundamentally is about them, for externally dictated reasons. Doing named streams for any other reason is likely just being stupid.
Once you do decide that you have to do named streams, you might then decide to use them for convenient things like icons. But it should very much be a secondary issue at that point.
Yet another concern has to do with how user space will work with this representation of file metadata. Backup programs have no idea of how to save the metadata; cp will not copy it, etc. Fixing user space is certainly an issue. The fact is, however, that, if reiser4 or the VFS of the future changes our idea of how a file behaves, the applications will be modified to deal with the new way of doing things. Meanwhile, it has been pointed out that reiser4-style metadata is probably easier for applications to work with than the current extended attribute interface, which is also not understood by most applications.
The discussion looks likely to continue for some time. Regardless of the outcome, Hans Reiser will certainly have accomplished one of his goals: he has gotten the wider community to start to really think about our filesystems and how they affect our systems and how we use them.
The original maintainer questions the value of the GPL-only code. Without the decompression module, the camera can only be used in a very low-resolution mode. There are a couple of reasons for wanting that code back, however. One of the more interesting ones was posted by a member of the LavaRnd project. It seems that a Philips webcam, with the lens cap in place, is a good source of entropy for random number generators. In fact, the low-resolution stream is even better than the full-resolution version for this application. The LavaRnd folks would like to see the GPL driver back - and they have even volunteered to maintain it.
The other use for the GPL driver would be as a starting point while the compression protocol is reverse engineered and a completely free driver is created. There has been some speculation that this reverse engineering would be relatively easy - but it will remain speculation until somebody produces some code.
In any case, the PWC driver is likely to come back in some form; USB maintainer Greg Kroah-Hartman has stated that a conversation is in progress with Nemosoft (the original author) and that a patch is forthcoming. Getting a driver which only supports the low-resolution mode is unlikely to please many PWC owners, but it is a start. If the end result of all this is, eventually, a 100% free driver supporting full functionality, everybody will be better off.
Linux, however, has no mechanism which allows filesystems to perform local disk caching. Or, at least, it didn't have such a mechanism; David Howells's CacheFS patch changes that.
With CacheFS, the system administrator can set aside a partition on a block device for file caching. CacheFS will then present an interface which may be used by other filesystems. There is a basic registration interface, and a fairly elaborate mechanism for assigning an index to each file. Different filesystems will have different ways of creating identifiers for files, so CacheFS tries to impose as little policy as possible and let the filesystem code do what it wants. Finally, of course, there is an interface for caching a page from a file, noting changes, removing pages from the cache, etc.
CacheFS does not attempt to cache entire files; it must be able to deal with the possibility that somebody will try to work with a file which is bigger than the entire cache. It also does not actually guarantee to cache anything; it must be able to perform its own space management, and things must still function even in the absence of an actual cache device. This should not be an obstacle for most filesystems which, by their nature, must be prepared to deal with the real source for their files in the first place.
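The contract described above (page-granular caching, no guarantee that anything is actually cached, and a client filesystem that can always fall back to the real source) can be illustrated with a small sketch. Everything here is our own invention for illustration: the class, the function names, and the least-recently-used eviction policy are assumptions and do not reflect the CacheFS interface itself.

```python
from collections import OrderedDict


class PageCache:
    """Toy page cache keyed by (file index, page number).

    Evicts the least recently used page when full. A capacity of zero
    stands in for a missing cache device: nothing is ever stored.
    """

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()

    def get(self, index, page_no):
        key = (index, page_no)
        if key not in self.pages:
            return None
        self.pages.move_to_end(key)  # mark as recently used
        return self.pages[key]

    def put(self, index, page_no, data):
        if self.capacity <= 0:
            return  # no cache space: silently cache nothing
        key = (index, page_no)
        self.pages[key] = data
        self.pages.move_to_end(key)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # drop the oldest page


def read_page(cache, fetch_from_source, index, page_no):
    """Return one page, going to the real source on any cache miss."""
    data = cache.get(index, page_no) if cache is not None else None
    if data is None:
        data = fetch_from_source(index, page_no)
        if cache is not None:
            cache.put(index, page_no, data)
    return data
```

Because read_page() treats the cache as optional at every step, a file larger than the whole cache simply cycles pages through it, and a caller with no cache at all still gets correct data from the source.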
CacheFS is meant to work with other filesystems, rather than being used as a standalone filesystem in its own right. Its partitions must be mounted before use, however, and CacheFS uses the mount point to provide a view into the cached filesystem(s). The administrator can even manually force files out of the cache by simply deleting them from the mounted filesystem.
Interposing a cache between the user and the real filesystem clearly adds another failure point which could result in lost data. CacheFS addresses this issue by performing journaling on the cache contents. If things come to an abrupt halt, CacheFS will be able to replay any lost operations once everything is up and functioning again.
The current CacheFS patch is used only by the AFS filesystem, but work is in progress to adapt others as well. NFS, in particular, should benefit greatly from CacheFS, especially when NFSv4 (which is designed to allow local caching) is used. Expect this patch to have a relatively easy journey into the mainstream kernel. For those wanting more information, see the documentation file included with the patch.
GmailFS has been released. GmailFS is a fun hack which allows a Linux system to use a Gmail account as a remote storage device; it can be mounted as a normal (if, perhaps, slow) filesystem. It's a user-space filesystem written in Python.
Page editor: Jonathan Corbet
Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds