By Jonathan Corbet
December 2, 2008
There is a great deal of activity around Linux filesystems currently. Of
the many ongoing efforts, two receive the most attention: ext4, the
extension of ext3 expected to keep that filesystem design going for a few
more years, and btrfs, which is seen by many as the long-term filesystem of
the future. But there is another project out there which is moving quickly
and is worth a look: Daniel Phillips's Tux3 filesystem.
Daniel is not a newcomer to filesystem development. His Tux2 filesystem was
announced in 2000; it attracted a fair amount of interest until it turned out that
Network Appliance, Inc. held patents on a number of techniques used in
Tux2. There was some talk of filing for defensive patents, and Jeff Merkey
popped
up for long enough to claim to have hired a patent attorney to help
with the situation. What really happened is that Tux2 simply faded from
view.
Tux3 is built on some of the same ideas as Tux2, but many of those ideas
have evolved over the eight intervening years. The new filesystem, one
hopes, has changed enough to avoid the attention of NetApp, which has shown
a willingness to use software patents to defend its filesystem turf.
Like any self-respecting contemporary filesystem, Tux3 is based on
B-trees. The inode table is such a tree; each file stored within is also a
B-tree of blocks. Blocks are mapped using extents, of course - another
obligatory feature for new filesystems. Most of the expected features are
present. In many ways, Tux3 looks like yet another POSIX-style filesystem,
but there are some interesting differences.
Tux3 implements transactions through a forward-logging mechanism. A set of
changes to the filesystem will be batched together into a "phase," which is
then written to the journal. Once the phase is committed to the journal,
the transaction is considered to be safely completed. At some future time,
the filesystem code will "roll up" the journal changes and write them back
to the static version of the filesystem.
The logging implementation is interesting. Tux3 uses a variant of the
copy-on-write mechanism employed by Btrfs; it will not allow any filesystem
block to be overwritten in place. So writing to a block within a file will
cause a new block to be allocated, with the new data written there. That,
in turn, will require that the filesystem data structure which maps
file-logical blocks to physical blocks (the extent) will need to be changed
to reflect the new block location. Tux3
handles this by writing the new blocks directly to their final location,
then putting a "promise"
to update the metadata block into the log. At roll-up time, that promise
will be fulfilled through the allocation of a new block and, if necessary,
the logging of a promise to change the next-higher block in the tree. In
this way, changes to files propagate up through the filesystem one step at
a time, without the need to make a recursive, all-at-once change.
The end result is that the results of a specific change can remain in the
log for some time. In Tux3, the log can be thought of as an integral part
of the filesystem's metadata. This is true to the point that Tux3 doesn't
even bother to roll up the log when the filesystem is unmounted; it just
initializes its state from the log when the next mount happens. Among
other things, Daniel says, this approach ensures that the journal recovery
code will be well-tested and robust - it will be exercised at every
filesystem mount.
In most filesystems, on-disk inodes are fixed-size objects. In Tux3,
instead, their size will be variable. Inodes are essentially containers
for attributes; in Tux3, normal filesystem data and extended attributes are
treated in almost the same way. So an inode with more attributes will be
larger. Extended attributes are compressed through the use of an "atom
table" which remaps attribute names onto small integers. Filesystems with
extended attributes tend to have large numbers of files using attributes
with a small number of names, so the space savings across an entire
filesystem could be significant.
Also counted among a file's attributes are the blocks where the data is
stored. The Tux3 design envisions a number of different ways in which file
blocks can be tracked. A B-tree of extents is a common solution to this
problem, but its benefits are generally seen with larger files. For
smaller files - still the majority of files on a typical Linux system - data can be
stored either directly in the inode or at the other end of a simple block
pointer. Those representations are more compact for small files, and they
provide quicker data access as well. For the moment, though, only extents
are implemented.
Another interesting - but unimplemented - idea for Tux3 is the concept of
versioned pointers. The
btrfs filesystem implements snapshots by retaining a copy of the entire
filesystem tree; one of these copies exists for every snapshot. The
copy-on-write mechanism in btrfs ensures that those snapshots share data
which has not been changed, so it is not as bad as it sounds. Tux3 plans
to take a different approach to the problem; it will keep a single copy of
the filesystem tree, but keep track of different versions of blocks (or
extents, really) within that tree. So the versioning information is stored
in the leaves of the tree, rather than at the top.
But the versioned extents idea has been deferred for now, in favor of getting
a working filesystem together.
Also removed from the initial feature list is support for subvolumes. This
feature initially seemed like an easy thing to do, but interaction with
fsync() proved hard. So Daniel finally concluded that volume management was best left
to volume managers and dropped the subvolume feature from Tux3.
One feature which has never been on the list is checksumming of data.
Daniel once commented:
Having been checksumming filesystem data during continuous
replication for two years now on multiple machines, and having
caught exactly zero blocks of bad data passed as good in that time,
I consider the spectre of disks passing bad data as good to be
largely vendor FUD. That said, checksumming will likely appear in
the feature list at some point, I just consider it a decoration,
not an essential feature.
Tux3 development is far from the point where the developers can worry about
"decorations"; it remains, at this point, an embryonic project being pushed
by a developer with a bit of a reputation for bright ideas which never
quite reach completion. The code, thus far, has been developed in user
space using FUSE. There is,
however, an in-kernel version
which is now ready for further development. According to Daniel:
The functionality we have today is roughly like a buggy Ext2 with
missing features. While it is very definitely not something you
want to store your files on, this undeniably is Tux3 and
demonstrates a lot of new design elements that I have described in
some detail over the last few months. The variable length inodes,
the attribute packing, the btree design, the compact extent
encoding and deduplication of extended attribute names are all
working out really well.
The potential user community for a stripped-down ext2 with bugs is likely
to be relatively small. But the Tux3 design just might have enough to
offer to make it a contender eventually.
First, though, there are a few little
problems to solve. At the top of the list, arguably, is the complete lack
of locking - locking being the rocks upon which other filesystem projects
have run badly aground. The code needs some cleanups - little problems
like the almost complete lack of comments and the use of macros as formal
function parameters are likely to raise red flags on wider review. Work
on an fsck utility does not appear to have begun. There has been no real
benchmarking work done; it will be interesting to see how Daniel can manage
the "never overwrite a block" policy in a way which does not fragment files
(and thus hurt performance) over time. And so on.
That said, a lot of these problems could end up being resolved rather
quickly. Daniel has put the code out there and appears to have attracted an
energetic (if small) community of contributors. Tux3 represents the core
of a new filesystem with some interesting ideas. Code comments may be
scarce, but Daniel - never known as a tight-lipped developer - has posted a
wealth of information which can be found in the Tux3
mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a
reserved unilateral right to relicense the code to anything else - but
developers who are comfortable with that are likely to find an interesting
and fast-moving project to play in.
(
Log in to post comments)