Reiser4 - the mammoth arrives

One of the remaining issues to be resolved before the Halloween feature freeze is whether the Reiser4 filesystem will be included in the 2.5 kernel. This has been a hard question to answer, however, given that almost nobody had actually seen the Reiser4 source. That situation, at least, has come to an end with the announcement of the first public Reiser4 snapshot.

Reiser4 is the latest incarnation of the ReiserFS filesystem. It is not simply an upgrade; Reiser4 has been redesigned and reimplemented from the beginning. It is a completely different filesystem than the ReiserFS (also known as "Reiser3") found in the 2.4 kernel; should it be included, the next stable kernel will contain both Reiser3 and Reiser4, as separate options.

There is a fair amount of online information available on Reiser4, though some of it makes for a bit of a challenging read. This lengthy document provides in-depth discussion of many of the Reiser4 features (not all of which are implemented yet), along with an explanation of Hans Reiser's long-term vision for filesystems, a polemic on free software, and some of the weirdest imagery to be found in software documentation anywhere. The document entitled The Infrastructure for Security Attributes in Reiser4 is actually a relatively straightforward discussion of many of the technical details behind the Reiser4 design, and might be a better starting point.

For those wanting a shorter summary, here are a few of the features to be found in Reiser4:

  • The filesystem maintains many of the basic features of Reiser3 - it is based on (mostly) balanced trees, with file data incorporated in the tree along with names. Reiser4 thus remains well suited to the handling of large numbers of very small files.

  • It is smarter about block allocation and data placement. Block allocation is delayed until file data is actually written to disk, leading to more efficient layouts. On-disk layout is done with extents. The result of these optimizations is that the filesystem's read performance is greatly improved over Reiser3.

  • "Wandering logfiles" take some techniques from log-structured filesystems to provide journaling without (always) writing data to the disk twice. In many cases, Reiser4 can write "journal" data to a disk block, then atomically swap the journal block into the file itself. The journaling code can overwrite or replace blocks, depending on which technique would provide better layout on the disk.

  • Most filesystem semantics are implemented with plugins. The normal Unix directory behavior, for example, is implemented with the "Unix directory plugin." Plugins can be used to implement security features (access control lists and such), encryption, maintenance of audit trails, and no end of strange, non-POSIX semantics. Hans Reiser remains determined to implement a lot of interesting features in his filesystem, and plugins are the mechanism by which those features will be included.

  • Reiser4 is heavily transaction-oriented, and is able to provide guarantees that operations will be performed atomically. Future plans call for the ability to perform multi-file operations in an atomic manner.

  • The Reiser4 design includes a reiser4() system call "to support applications that don't have to be fooled into thinking that they are using POSIX." This system call will accept (and parse) command strings that can describe complex operations. The reiser4() system call is not implemented in the current snapshot.
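
The "wandering log" idea in the list above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Reiser4's code: the names ToyDisk, ToyFile, update_wandering, and update_classic are hypothetical, the "disk" is an in-memory dictionary, and the pointer swap is a plain assignment standing in for whatever atomic commit the real filesystem performs. It shows why relocating a block costs one data write where a conventional journal costs two.

```python
class ToyDisk:
    """Toy block device: maps block numbers to bytes and counts writes."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0
        self.writes = 0  # number of physical block writes so far

    def allocate(self):
        """Hand out the next unused block number."""
        block_no = self.next_free
        self.next_free += 1
        return block_no

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.writes += 1


class ToyFile:
    """Toy file: a list mapping logical block index to physical block."""

    def __init__(self, disk):
        self.disk = disk
        self.block_map = []

    def append(self, data):
        block_no = self.disk.allocate()
        self.disk.write(block_no, data)
        self.block_map.append(block_no)

    def read(self, index):
        return self.disk.blocks[self.block_map[index]]

    def update_classic(self, index, data):
        """Conventional journaling: write a journal copy, then overwrite
        the block in place. The data reaches the disk twice."""
        journal_block = self.disk.allocate()
        self.disk.write(journal_block, data)
        self.disk.write(self.block_map[index], data)

    def update_wandering(self, index, data):
        """Wandering-log style: write the new data to a fresh block, then
        repoint the file at it. The data reaches the disk once."""
        new_block = self.disk.allocate()
        self.disk.write(new_block, data)
        old_block = self.block_map[index]
        self.block_map[index] = new_block  # stand-in for the atomic swap
        return old_block  # the old block could now be freed
```

In this model the choice between the two update methods corresponds to the article's "overwrite or replace" decision: relocating saves a write, while overwriting keeps the block where it was on the disk.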

As an example of the sort of uses that the Reiser4 developers eventually would like to see, consider the classic Unix password file. Each line in the file describes one account, and contains several colon-separated fields with information like the account name, user and group IDs, the user's home directory and shell, etc. In Reiser4, each field in the password file would become a file in its own right; one could obtain the home directory of a given user via a path like:

	/etc/passwd/user/home

A special-purpose plugin would aggregate the various files, so that a process reading /etc/passwd would see the same information as always. But each field file could be protected differently; a user could have write access to the file describing his or her full name, but not to the one containing the user ID value.
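
What such an aggregating plugin would do can be approximated in user space. The sketch below is an illustration only, since no Reiser4 plugin API is described here: it uses ordinary directories and files, and the function names (write_user, aggregate_passwd) and all field names other than "home" are assumptions chosen to match the traditional passwd layout.

```python
import os
import tempfile

# Assumed per-user field files, in traditional passwd column order.
# Only "home" appears in the article; the other names are hypothetical.
FIELDS = ["password", "uid", "gid", "gecos", "home", "shell"]


def write_user(root, name, values):
    """Store each passwd field of one user as its own small file."""
    user_dir = os.path.join(root, name)
    os.makedirs(user_dir, exist_ok=True)
    for field in FIELDS:
        with open(os.path.join(user_dir, field), "w") as fh:
            fh.write(values[field])


def aggregate_passwd(root):
    """Rebuild the classic colon-separated text from the field files."""
    lines = []
    for name in sorted(os.listdir(root)):
        parts = [name]
        for field in FIELDS:
            with open(os.path.join(root, name, field)) as fh:
                parts.append(fh.read().strip())
        lines.append(":".join(parts))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        write_user(root, "user", {
            "password": "x", "uid": "1000", "gid": "1000",
            "gecos": "Example User", "home": "/home/user",
            "shell": "/bin/sh",
        })
        # Reading one field is just reading one file ...
        with open(os.path.join(root, "user", "home")) as fh:
            print(fh.read())
        # ... while the aggregate view looks like the usual passwd file.
        print(aggregate_passwd(root), end="")
```

Because every field is a separate file, ordinary per-file permissions could then cover the case in the text: the "gecos" file writable by its user, the "uid" file not.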

In the Reiser4 vision, file attributes would also be stored as files. For a given file, something like file/owner would contain the UID of the user who owns that file.

Needless to say, in the long-term Reiser vision, Linux systems will behave rather differently than they do now. In the shorter term, Reiser4 promises a high-performance journaling filesystem with highly efficient handling of small files and a plugin architecture which encourages experiments with interesting new semantics.

Will it be merged? The Reiser4 team plans to submit a patch for merging at the last second, sometime before midnight on Halloween. Some developers have argued that it is too late to propose a major new feature that nobody has had a chance to look at. Hans feels this is inappropriate:

I'm the last straggler coming back from the hunt, and I've got what looks like it might be a wooly mammoth on my shoulders, and my tribesmen are complaining that I'm late for dinner. How about helping me by cutting down a tree for the roasting spit instead

Linus has not offered any public opinions on the matter. The Reiser4 patch is apparently unintrusive, however, so there is probably no real reason not to include it.


Reiser4 - the mammoth arrives

Posted Oct 31, 2002 3:52 UTC (Thu) by cpeterso (guest, #305) [Link]

Hans has made many comments on LKML that Reiser4 is "100% faster" for many benchmarks. Was Reiser3 just really slow? How does Reiser4 performance compare to ext3?

Reiser4 - the mammoth arrives

Posted Oct 31, 2002 18:12 UTC (Thu) by himi (guest, #340) [Link]

Reiserfs was a bit slower than ext2, but not by that much - on bonnie++ I saw numbers like ~19MB/s reads from ext2, and about 16-17MB/s reads from reiserfs. This is on a disk that hdparm -t lists as giving about 21MB/s from the platters. I can't remember about writes, but I think it was slower there too - I may have had tail packing enabled, though, which would screw with the numbers.

The real test will be how it performs with a) lots of small files, and b) big files: reiserfs has always been better at dealing with lots of small files than ext2 (which isn't saying much, and with directory indexing ext3 looks to be as fast), but it's had problems with very large files, and it's tended to have problems with fragmentation leading to performance degradation over time. If they've dealt with those problems, then I could see "100% faster" being a reasonable description, assuming of course it /is/ that much faster.

Of course, I'm not going to trust my data to Reiser4 until it's been in real use for a while . . . Maybe 2.6.5 or so ;-)

himi

System CPU usage

Posted Nov 1, 2002 13:02 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

One thing I've noticed in many benchmarks (well, at least the few that report this figure) is that Reiser3 seems to be very CPU heavy as compared to ext2. They may achieve about the same benchmark numbers as other filesystems when run with a disk-intensive benchmark, but (as I recall) the system CPU %age is much higher for Reiser3 than others. The effect of higher system CPU usage is to slow down any compute-intensive processes that might be running in parallel with your I/O intensive application.

I'd like to see how Reiser4 fares. It's apparently a major rewrite, so if it has fundamental changes in design, it could have completely different CPU usage patterns.

--Joe

Reiser4 - the mammoth arrives

Posted Oct 31, 2002 11:59 UTC (Thu) by mwh (guest, #582) [Link]

This sort of thing sounds interesting, but I wonder if it's too late in the day to fiddle with things on this level. I mean, I'm reluctant to tie my code to Linux alone, never mind Linux-with-a-particular-file-system.

Depressing, isn't it?

Reiser4 - the mammoth arrives

Posted Oct 31, 2002 12:47 UTC (Thu) by xanni (subscriber, #361) [Link]

The clever thing about the design is that the new features are transparent to applications that don't want to deal with the new semantics. So you can continue to write applications that just treat it as a fast, efficient POSIX filesystem while other applications can make use of the new magic on the very same files. Existing filesystems have been stuck with 1940s concepts for too long; there are plenty of ideas from the 1960s and later that I've been looking forward to having for my applications, and the Reiser plugin architecture should help make these things a lot more feasible.

Cheers,
Andrew Pam

Reiser4 - the mammoth arrives

Posted Nov 1, 2002 19:23 UTC (Fri) by radeex (guest, #765) [Link]

I'm going mad over want for an automatically CVSed ~/stuff. I looked at linux's existing virtual filesystem interfaces, and they're nasty for implementing something like this. I sure hope Reiser has a decent API for plugins (and that there will be a Python wrapper for it! Heck, maybe I'll write one myself if the API's good enough :))

Reiser4 - the mammoth arrives

Posted Nov 7, 2002 19:54 UTC (Thu) by job (guest, #670) [Link]

I hope you mean RCS and not CVS..!

Reiser4 - the mammoth arrives

Posted Nov 1, 2002 19:41 UTC (Fri) by leandro (guest, #1460) [Link]

ReiserFS seems to try to be a filesystem on a DBMS. I wonder if it would not be more powerful and easy to make that DBMS a relational one, like MS and IBM oft promised but never delivered.

Reiser4 - the mammoth arrives

Posted Nov 7, 2002 14:01 UTC (Thu) by Wol (guest, #4433) [Link]

Like IBM promised ... ?

Didn't IBM DELIVER with the AS/400 / OS/400 combo?

And take a look at Pick, which was doing exactly that in the sixties. btw, did you know that Pick was the ?first? commercial DBMS to be ported to linux, some seven or so years ago?

And take a look at open source Pick, www.maverick-dbms.org

Cheers,
Wol

Reiser4 - the mammoth arrives

Posted Aug 12, 2003 1:30 UTC (Tue) by leandro (guest, #1460) [Link]

> Didn't IBM DELIVER with the AS/400 / OS/400 combo?

No, because the OS/400 isn't relational. It is but the engine for SQL, and even SQL is in violation of several fundamental precepts and proscriptions of the relational model.

> take a look at Pick

Pick is not even up to SQL standards. It's not even a deviation from the relational model, but totally unrelated.

GNU Hurd has this already

Posted Nov 7, 2002 6:33 UTC (Thu) by kaol (subscriber, #2346) [Link]

They're called translators there. You can do magic passwd files with them, just as in this example here. Translators can provide even whole directory structures, so that 'mount' is done actually by setting a translator.

If the plugins would run in user space and could be set by any user, this would make linux much more micro-kernel-like.

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds