
Unioning file systems: Architecture, features, and design choices

March 18, 2009

This article was contributed by Valerie Aurora

A unioning file system combines the namespaces of two or more file systems together to produce a single merged namespace. This is useful for things like a live CD/DVD: you can union-mount a small, writable file system on top of a read-only DVD file system and have a usable system without needing to transfer the system files from the DVD to the root file system. Another use is to export a single read-only base file system via NFS to multiple clients, each with their own small, writable overlay file system union mounted on top. Another use might be a default set of files for a home directory.

Unioning file systems of various sorts have been around since at least the Translucent File Service (or File System), written around 1988 for SunOS. BSD has had union mounts since 4.4 BSD-Lite, around 1994, and Plan 9 implemented union directories in a similar time frame. The first prototype of a union-style file system for Linux was the Inheriting File System (IFS), written by Werner Almsberger for Linux 0.99 in 1993. It was later abandoned, as Werner writes, because "I looked back at what I did and was disgusted by its complexity." Later implementations for Linux include unionfs (2003), aufs (2006), and union mounts (2004), as well as FUSE prototypes and implementations of various forms. So how is it that in 2009, Linux still does not have an in-tree unioning file system?

The short answer is that none of the existing implementations meet the kernel maintainers' standards of simplicity, clarity, and correctness. What makes it so difficult to write a unioning file system that meets these standards? In this week's article, we'll review general unioning file system concepts, and common design problems and solutions. In a subsequent article, we'll review the top contenders for a mainline kernel unioning file system for Linux, as well as unioning file systems for other operating systems.

Unioning file systems implementation guidelines

First, let's define what a useful unioning file system implementation would look like. The basic desire is to have one read-only file system, and some sort of writable overlay on top. The overlay must persistently store changes and allow arbitrary manipulation of the combined namespace (including persistent effective removal of parts of the read-only namespace - "whiteouts"). The implementation should be fast enough for use as a root file system, and should require little or no modification of underlying file systems. It should be implemented in the kernel; FUSE-based file systems have many uses but aren't appropriate for many use cases (root file systems, embedded systems, high (or even moderate) file system performance, etc.).

A successful unioning file system will be an improvement (in terms of disk space used, cost of code maintenance, deployment time, etc.) over the alternatives - copying over all the files up front, clever division of file systems into multiple mount points, or writing an entire new on-disk file system. If we gain all the features of a unioning file system, but complicate the VFS code too much, we'll have a union file system at the cost of a slow, unmaintainable, buggy VFS implementation. If a union file system includes as much code as, or more than, the underlying file system implementations, you start to wonder whether supporting unioning in each underlying file system individually would be more maintainable.

One alternative to unioning file systems is the copy-on-write block device, often used for virtualized system images. A COW block device works for many applications, but for some the overhead of a block device is much higher than that of a unioning file system. Writes to a file system on a COW block device will produce many duplicated blocks as the bitmaps, journals, directory entries, etc. are written. A change that could be encapsulated in a few bytes of directory entry in a unioning file system could require several hundred kilobytes of change at the block level. Worse, changes to a block device tend only to grow; even with content-based block merging (not a common feature), two file systems which are logically nearly identical may differ by many megabytes at the block level. An important case is one in which you delete many files; with a unioning file system this will decrease the space needed to store differences. With a COW block device, used space will increase.

One problem that should NOT be solved by unioning file systems (and file systems in general) is source control. Modern source control systems are quite good compared to the crud we had to live with back in the 1990's. Source control software back then was so buggy and slow that it actually seemed like a good idea to implement parts of it in the kernel; indeed, some commercially successful source control systems still in use today require explicit file system support. Nowadays, many fast, reliable source control systems are available and it is generally agreed that complex and policy-heavy software should be implemented outside of the kernel. (Those who disagree are welcome to write gitfs.)

General concepts

Most unioning file systems share some general concepts. This section describes branches, whiteouts, opaque directories, and file copy up.

Branches are the various file systems that are unioned together. Branch access policies can be read-only, writable, or more complex variations which depend on the permissions of other branches. Branches are ordered or stacked; usually the branch "on top" is the writable branch and the branch "on the bottom" is read-only. Branches can sometimes be re-ordered, removed, added, or their permissions changed on the fly.

A commonly required feature is that when a particular directory entry is deleted from a writable branch, that directory entry should never appear again, even if it appears in a lower branch. That is, deleting a file named "x" should result in no file named "x" in that directory afterward, even if a file named "x" exists in a lower branch. Usually this is implemented through a combination of whiteouts and opaque directories. A whiteout is a directory entry that covers up all entries of a particular name from lower branches. An opaque directory does not allow any of the namespace from the lower branches to show through from that point downwards in the namespace.
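As a sketch of how these rules interact, here is a toy user-space model of a single-name lookup across ordered branches. All names and the branch representation are hypothetical, invented for illustration; a real implementation works with dentries and inodes, not dicts:

```python
def lookup(name, branches):
    """Return the index of the topmost branch where `name` is visible,
    or None if it is absent or hidden. `branches` is ordered top-down;
    each branch is a dict with a "names" set, an optional "whiteouts"
    set, and an optional "opaque" flag."""
    for i, branch in enumerate(branches):
        if name in branch.get("whiteouts", set()):
            return None        # a whiteout covers all lower branches
        if name in branch.get("names", set()):
            return i           # visible from this branch
        if branch.get("opaque", False):
            return None        # opaque: nothing from lower branches shows
    return None
```

Note the ordering: a branch's own entries are visible even in an opaque directory; the opaque flag only stops the search from descending further.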

When a file on a read-only branch is altered, it must be copied up to some writable branch. Copy up is usually triggered either when the file is opened with write permissions, or when the first write to the file's data or metadata occurs.
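A minimal user-space analogue of copy up, assuming a two-branch union and triggering on open-for-write (the helper name and layout are hypothetical):

```python
import os
import shutil

def open_for_write(relpath, lower_root, upper_root):
    """Open `relpath` for writing: if the file exists only on the
    read-only lower branch, copy it up to the writable upper branch
    first, then open the upper copy."""
    upper = os.path.join(upper_root, relpath)
    if not os.path.exists(upper):
        lower = os.path.join(lower_root, relpath)
        os.makedirs(os.path.dirname(upper), exist_ok=True)
        shutil.copy2(lower, upper)   # copy data and metadata up
    return open(upper, "r+")
```

After the copy up, all reads and writes go to the upper branch; the lower branch is never modified.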

Common design problems

Unioning file systems often run into the same difficult design problems. Often, these problems only have a few obvious options with built-in tradeoffs, and unioning file systems can be characterized to some degree by which set of tradeoffs they choose. In this section, we'll review some of the top design problems and their common solutions.

Namespace pollution: Often, whiteouts, opaque directories, persistent inode numbers, and any other persistent file system data are stored in specially-named files. These files clutter up the available namespace and produce unexpected behavior. Some minor namespace pollution has been acceptable in UNIX as long as it is restricted to the top-level directory (think "/lost+found"), but namespace pollution on a per-directory or per-file basis is generally frowned on. Solutions to this problem include various ways of shuffling around the namespace pollution - moving it to a single subdirectory per directory or file system or storing it in an external file system - or not creating namespace pollution in the first place (which generally requires support from the underlying file system).

Whiteouts and opaque directories: While the concepts of whiteouts and opaque directories are fairly general, the implementation varies. The most common option is to create a directory entry with a special name, such as ".wh.<filename>" which indicates that the corresponding directory entry should never be returned. This can cause clashes with user file names and prevent stacking one union over another. It also makes directory removal time-consuming. An "empty" directory can have many thousands of whiteout entries which must be deleted before the rmdir() can complete. Sometimes directory removals are pushed off to a special work thread to improve rmdir() latency.

Another option is to create a special file or directory entry type to mark whiteout directory entries, and give whiteout directory entries a reserved inode number. The name in the directory entry itself is the same as the name being whited out and does not appear in directory listings. This form of whiteouts requires support from the underlying file system to store the necessary flags and from the file system utilities to accept the special entries. One more option is to make whiteout entries be hard links to special whiteout files, or symbolic links to reserved targets. The hard link solution requires handling the case of exceeding the maximum link count on the target file.

Directories can be marked opaque with either a flag in the directory inode (again requiring support from the underlying file system) or with a directory entry with a reserved name, like ".opaque".

Timing of directory copies: When a file is created on a writable branch in a directory that exists only on another branch, the entire directory path to that file must be present on the writable branch. The path can be created on-demand when the target file is copied, or each directory may be preemptively copied to the writable branch as it is opened. Creating paths on demand introduces complexity, locking issues, and additional latency on file writes. Creating paths on directory open simplifies code and improves latency on file write, but uses up additional (often negligible) space on the writable branch.

Kernel memory usage: Unioning file systems often introduce new data structures, extra copies of data structures, and a variety of cached metadata. For example, a union of two file systems may require three VFS inodes to be allocated for one logical object. The most common implementation of whiteouts and duplicate entry removal requires reading all directory entries from all branches into memory and constructing a coherent view of the directory there. If this cache is maintained across system calls, we have to worry about regenerating it when underlying branches change. When it is not cached, we have to reallocate memory and remerge the entries repeatedly. Either way, a lot of kernel memory must be allocated.
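The merge itself can be sketched in a few lines of user-space code, using the ".wh." whiteout naming convention described above (the function name is hypothetical; a branch never contains both a name and its own whiteout):

```python
WH = ".wh."   # whiteout name prefix used by several implementations

def merge_dirs(branch_listings):
    """Merge per-branch directory listings (top branch first) into one
    coherent view: drop duplicate names, drop the whiteout entries
    themselves, and hide any name covered by a whiteout."""
    seen, hidden, merged = set(), set(), []
    for listing in branch_listings:
        for entry in listing:
            if entry.startswith(WH):
                hidden.add(entry[len(WH):])   # covers lower branches
            elif entry not in seen and entry not in hidden:
                seen.add(entry)
                merged.append(entry)
    return merged
```

The kernel version of this has to do the same work, but with the added burdens of kernel memory allocation and cache invalidation when a branch changes.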

Code cleanliness: One of the main points of a unioning file system is to reuse existing file system code, with the expected benefits in ease of maintenance and future development in that code base. If the implementation is excessively complex or intrusive, the net effect will be a reduction in ease of maintenance and development. The sheer number and variety of Linux file systems (on-disk, in-memory, and pseudo) demonstrates the benefit of clean and simple file system interfaces.

Stack depth: The Linux kernel has a limited available stack depth. Calling internal file system functions from unexpected code paths or, worse yet, recursively, can easily result in exceeding the kernel stack limit. Some unioning file systems are implemented as stacked or layered file systems, which inherently add to stack depth.

Number of branches: Memory usage is often proportional to the number of branches in use. Branches may be limited to a compile-time maximum, but allocating enough memory for the maximum is prohibitive. Dynamic allocation and reallocation as branches are added and removed can be complex and introduce new opportunities for failure.

Coherency of changes: A common feature request is to allow modification of more than one branch at once. This requires some method of cache coherence between the various parts of the file system. Usually this method is absent or only partially correct. Users often request direct access to the file systems making up the branches of the union (instead of access through the unioning file system), a situation particularly difficult to deal with.

Dynamic branch management: Users often would like to add, remove, or change the policies of branches in a union while the union is still mounted. In trivial cases, this is a simple matter of parsing mount options and manipulating data structures, but may have major effects on any cache coherency implementation.

readdir() and friends: One of the great tragedies of the UNIX file system interface is the enshrinement of readdir(), telldir(), seekdir(), etc. family in the POSIX standard. An application may begin reading directory entries and pause at any time, restarting later from the "same" place in the directory. The kernel must give out 32-bit magic values which allow it to restart the readdir() from the point where it last stopped. Originally, this was implemented the same way as positions in a file: the directory entries were stored sequentially in a file and the number returned was the offset of the next directory entry from the beginning of the directory. Newer file systems use more complex schemes and the value returned is no longer a simple offset. To support readdir(), a unioning file system must merge the entries from lower file systems, remove duplicates and whiteouts, and create some sort of stable mapping that allows it to resume readdir() correctly. Support from userspace libraries can make this easier by caching the results in user memory.
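A toy model of the resume problem, assuming the union has already merged the entries into a stable order (class and method names are hypothetical, mirroring the C library calls):

```python
class UnionDir:
    """Model of readdir() resume on a union: merge once, keep a stable
    order, and hand out integer cookies (positions into the merged
    listing) so a reader can stop and later resume from the "same"
    place in the directory."""
    def __init__(self, merged_entries):
        self.entries = list(merged_entries)   # stable across calls
        self.pos = 0
    def readdir(self):
        if self.pos >= len(self.entries):
            return None
        entry = self.entries[self.pos]
        self.pos += 1
        return entry
    def telldir(self):
        return self.pos   # the "magic value" handed to the application
    def seekdir(self, cookie):
        self.pos = cookie
```

The hard part for a real union is that the cookie must remain valid across system calls, and possibly across remounts, while the underlying branches each have their own incompatible cookie schemes.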

Stable, unique inode numbers: NFS exports and some applications expect inode numbers of files not to change between mounts, reboots, etc. Applications also expect that files can be uniquely identified by a combination of the device id and the inode number. A unioning file system can't just copy up the inode number from the underlying file system, since the same inode number is very likely to be used on more than one file system. Unique (but not stable) inode numbers can be implemented fairly trivially, but stable inode numbers require some sort of persistent state mapping files from the underlying file system to assigned inode numbers.
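The identity applications rely on is the (device id, inode number) pair, visible through stat(). A small illustration (the helper name is hypothetical):

```python
import os

def file_id(path):
    """Return the (device id, inode number) pair that applications use
    to decide whether two paths name the same file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)
```

Two hard links to the same file share this pair; a union must synthesize inode numbers that preserve that property while avoiding collisions between branches that reuse the same underlying inode numbers.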

mmap(): mmap() is always the hard part. A good way to sound smart without knowing anything about file systems is to nod sagely and say, "What about mmap()?" mmap() on a file in a union file system is hard because the file may suddenly disappear and be replaced with another (copy up of a file on write from another process) or can be changed without the cooperation of the unioning file system (direct writes to the file systems backing branches). Some unioning file systems permit forms of mmap() which are not correctly implemented according to the POSIX standard.

rename() between branches: rename() is a convenient system call that allows atomic file renames on the same file system. Some unioning file systems try to emulate rename() across different branches. Others just return EXDEV, the error code for an attempted rename() across different file systems. Most applications can cope with the failure of rename() and fall back to a normal file copy.
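The application-side fallback looks roughly like this (the function name is hypothetical; note that the fallback loses rename()'s atomicity):

```python
import errno
import os
import shutil

def rename_or_copy(src, dst):
    """Try the atomic rename(); on EXDEV (a rename across devices, or
    across union branches) fall back to copy-and-delete, as most
    applications do. The fallback is not atomic."""
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)   # copy data and metadata
        os.unlink(src)
```

Returning EXDEV and letting userspace cope is the simpler design choice; emulating cross-branch rename() in the kernel means solving the same atomicity problem there.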

File system implementation differences: Different file systems have different rules about permitted file names, allowed characters, name length, case sensitivity, extended attribute names and sizes, etc. The unioning file system wants to translate from one file system's rules to another. Some union file systems just restrict the types of file systems they support rather than implement complex translation code for uncommon use cases.

Multiple hard links: A file on a lower branch may have multiple hard links; that is, multiple paths point to the same inode. When a file with multiple hard links on a read-only branch is altered, strict UNIX semantics require finding all the other paths to that file and copying them to the writable branch as well. Unfortunately, UNIX file systems generally don't provide an efficient way to find all the paths to an inode. Some union file systems keep a list of inodes with undiscovered alternate paths and copy them over when a new path is accessed; others just ignore the problem. It's an open question as to how many applications depend on the correct behavior of hard links when used in this manner.

Permissions, owner, and timestamps: These attributes are often calculated using a combination of underlying file system permissions, options specified on mount, the user and umask at time of mount, and branch management policies. Exact policies vary wildly.

Feature creep: Sometimes, the features provided by unioning file systems go beyond the actual needs of 99.9% of unioning use cases. This may be the result of fanciful user requests, or a "just because we can" approach - users only ask for two or three branches, but it's possible to support thousands of branches, so let's do it, or the code is structured such that all combinations of X, Y, and Z features are trivial to implement, even though users only want X and Y or Y and Z. Each feature may seem nearly free, but often ends up constraining the implementation or providing new opportunities for bugs. Focus is important.

Summary

Union file systems are inherently difficult to implement for many reasons, but much of the complexity and ugliness come from solving the following problems: whiteouts, readdir() support, stable inode numbers, and concurrent modifications to more than one branch at a time. Few of these problems have obviously superior solutions, only solutions with different tradeoffs.

Coming next

In the next article, we will review BSD union mounts, unionfs, aufs, and Linux union mounts. For each of these unioning file systems, we'll look at specific implementation details, including kernel data structures and their solutions to the various union file system problems described in this article. We'll also review the issues surrounding merging each solution into the mainline Linux kernel. We'll wrap up with general analysis of what does and doesn't work for unioning file systems, from both a purely technical and a software engineering point of view.

To be continued...

Acknowledgements

Many discussions with developers helped with the background of this article, but this article in particular benefited from discussions with Erez Zadok, Christoph Hellwig, and Jan Blunck.



Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 9:07 UTC (Thu) by mjthayer (guest, #39183) [Link] (1 responses)

Is it possible to provide a generic layer in the kernel that union filesystems/stacking filesystems/logical volume management could be built on top of? It seems to me that any two implementations of one of these are likely to have quite a bit in common, not least the problems that they have to deal with, as pointed out by this article. That would also fit in with the *nix "separate mechanism and policy" approach.

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 17:06 UTC (Fri) by arnd (subscriber, #8866) [Link]

The "union mounts" approach by Jan Blunck et al. does the implementation in the common VFS code and requires only minimal changes in the top mounted file systems that get used for writable branches, in order to implement features like whiteouts.

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 12:16 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

Also, a "standard" use case is to mount a root r/w ext2fs over an ISO9660 fs physically on CD-ROM/DVD, and so add devices in /dev where ISO9660 does not have the concept of devices.
I still vote for a real ext[23]fs either on a RAM disk or any other block device, full of soft links - but modify the meaning of soft links with sticky attributes in the kernel, so that link attributes overwrite the attributes of the linked-to file; and to blank out a file, use either an attribute combination on the soft link, or a special and invalid path pointed to by the soft link.

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 14:18 UTC (Thu) by masoncl (subscriber, #47138) [Link] (3 responses)

Great article Val!

Btrfs isn't a unioning filesystem, but it does have some of the same features. It ends up somewhere between these solutions and a COW block device.

Arjan had the idea to implement a btrfs seed filesystem. You basically end up with a read-only filesystem that other read/write filesystems can point to. All modifications go to the read/write filesystems, and the read-only seed can be used multiple times. Yan Zheng coded this up a while back.

Modifications happen through COW, and so we end up making more copies of metadata blocks than any of the solutions Val listed. But, unlike the COW block device, when you delete a file, the space can be reclaimed if it was allocated from one of the read/write filesystems.

Short intro:

mkfs.btrfs /dev/sdb
mount /dev/sdb /mnt
copy stuff onto /mnt
umount /mnt

# turn /dev/sdb into a read-only seed
btrfstune -S 1 /dev/sdb

mount /dev/sdb /mnt

# add a read/write device to the read only FS
btrfs-vol -a /dev/sdc /mnt

umount /mnt

# mount our new read/write filesystem
mount /dev/sdc /mnt

# COW sends all changes to /dev/sdc only, and leaves
# sdb alone

Future versions will be easier to use, with fewer unmount steps in the middle of things. This doesn't replace the union filesystems listed here since it is all btrfs specific, but it is another tool in the union box.

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 5:56 UTC (Fri) by vaurora (guest, #38407) [Link] (2 responses)

Thanks, Chris! I totally forgot to mention that file systems like btrfs give you many of these features in a clean, integrated, fast manner. btrfs makes more sense for a significant subset of the use cases proposed for unioning file systems, especially cases where you'd like to tentatively make changes that you may want to make permanent (or roll back if they don't work).

One use case I don't know how to solve with btrfs alone is the writable-over-DVD case. I suppose you could put a btrfs file system image on the CD/DVD and add a writable device?

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 14:37 UTC (Fri) by masoncl (subscriber, #47138) [Link] (1 responses)

I'm afraid the btrfs image on the dvd is the only way other than the existing union filesystems. The same goes for the unionfs use cases around network filesystems.

One problem with that is btrfs likes to leave free space hanging around, and so it isn't yet well suited to compacting images down on an iso/dvd.

Patching btrfs progs to create a compact btrfs FS for burning would be a fun project if anyone is looking for ways to fill their time ;)

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 17:26 UTC (Fri) by arnd (subscriber, #8866) [Link]

To take this thought a little bit further, this could also incorporate:

* laying out files and metadata for the performance characteristics of optical media (seek times vary a lot with location, 2kb sector boundaries, ...).

* compressing all files in advance, which may give you much better placement options than online compression.

* This could be integrated into mkisofs in the same way that HFS support is. Iso9660 leaves a significant amount of free space in the front, so it may just fit.

* In a similar way, it could be done in the same way that ext3 conversion works, but on an existing iso9660 file system, to create a hybrid file system. Bonus points for making it work with multisession writing on an already-burnt iso9660 disk.

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 16:07 UTC (Thu) by perlwolf (guest, #46060) [Link] (1 responses)

An alternate approach that could be considered would be a unioning volume.

The advantage is that it would work with any file-system type that simply expects a volume to be a series of sectors that it reads and writes. Since the meta-data is stored somewhere on those volume sectors, there is no need to worry about whiteouts, opaque directories, copy-on-write of files, etc.

In exchange, it would have a totally different set of problems. The over-volume has to keep track of which sectors it actually contains and which are transparent and need to go to the underlying volume. When the set of changed sectors is small, a hash table mapping changed sector ids to the changed content (which would be contained in a small area of contiguous sectors on the over-volume) might be the best representation. When the change set grows to a significant proportion of the entire volume, a bit map selecting over-volume or under-volume would be provided, and the over-volume would use the same sector number as the under-volume, perhaps with an offset if the bitmap is kept at the beginning of the volume. Somewhere in between, as the number of changed sectors grew, it would have to transition between the two formats (and perhaps go through other intervening ones). In either case, the union device driver would likely need to keep a fairly large amount of data in memory (either the hash table or the bit map).

Another representation might be a table mapping contiguous blocks of sectors that are in the over-volume; that would save the space of requiring a bit for every sector, and allow the over-volume to not have its sectors in the same place as the under-volume, although it would degrade horribly if there were huge numbers of non-contiguous sectors overwritten. Such a table would also be slower, requiring a binary search at O(log n) instead of a hash or bitmap lookup at O(1) to find where to get a sector.
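The sparse ("hash") representation described above can be sketched in a few lines; the class and member names are hypothetical, and real sector payloads would be fixed-size byte blocks rather than arbitrary values:

```python
class OverVolume:
    """Toy sector-level COW: a dict maps changed sector numbers to
    their new contents; reads of unchanged sectors fall through to
    the read-only under-volume."""
    def __init__(self, under_volume):
        self.under = under_volume   # sequence of sector payloads
        self.changed = {}           # sector number -> new payload
    def write(self, sector, data):
        self.changed[sector] = data
    def read(self, sector):
        return self.changed.get(sector, self.under[sector])
```

Once the changed set approaches the size of the volume, the per-entry overhead of the dict makes the bitmap representation cheaper, which is the transition point the comment describes.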

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 22:34 UTC (Thu) by martinfick (subscriber, #4455) [Link]

From an interface standpoint, aren't you describing lvm snapshots? Which are essentially a COW block device, no?

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 18:04 UTC (Thu) by foom (subscriber, #14868) [Link] (1 responses)

> An important case is one in which you delete many files; with a unioning file system this will
> decrease the space needed to store differences. With a COW block device, used space will
> increase.

It doesn't actually *have* to be the case. When Linux adds the hooks needed by flash drives to notify the device about which blocks are free, the COW block device can use those same hooks to free space.

If I were a kernel developer I'd be very wary of the endless tarpit of unionfs issues when a way to make a COW block device efficient enough is probably right around the corner.

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 6:01 UTC (Fri) by vaurora (guest, #38407) [Link]

Good point! I believe that TRIM support of some form will be integrated in the near future (in file system terms - that is, several years) and that will make COW block devices more tenable. However, even with TRIM support, you run into the problem where the logical file systems are equivalent, but the same logical data is stored in different locations between the two images. The metadata is very likely to be different too at the block level.

COW block devices are a part of the solution space, they just don't completely replace unioning file systems.

Unioning file systems: Architecture, features, and design choices

Posted Mar 19, 2009 21:34 UTC (Thu) by rvfh (guest, #31018) [Link]

Excellent article from Val, as usual. Thanks!

FUSE root fs, embedded, performance

Posted Mar 20, 2009 4:32 UTC (Fri) by szaka (guest, #12740) [Link] (1 responses)

FUSE can be used in many different ways. The below comments are just some short and minor corrections. It's a totally different issue how much they could be true for a FUSE-based union file system.

Thankfully, FUSE file systems can be used as root file systems. For instance, NTFS-3G has been used by some Linux distributions as the root file system for about two years.

FUSE is often used in embedded systems. One example is djmount, but NTFS-3G is also included in many consumer electronics: NAS devices, set-top boxes, multimedia players/recorders, and many other types of devices.

Typically the dominant performance factors of general purpose, block device based file systems are the file system design, quality of the implementation and optimization/tuning to the specific hardware platform. A FUSE file system can be high-performance too if it's used efficiently.

FUSE can be considered a kernel file system driver in which the slow paths are forwarded to user space. One could also think of it as a high-performance kernel network file system with extremely low latency and high bandwidth. For instance, the current best write performance of the NTFS-3G driver is 1.83 GB/s, and it's the third fastest in metadata operations after btrfs and ext4. Which I think is not too bad, considering that it's not fully optimized yet.

FUSE root fs, embedded, performance

Posted Mar 20, 2009 5:52 UTC (Fri) by vaurora (guest, #38407) [Link]

Thanks for the corrections! I usually learn as much from the user comments as from researching the article and this story is no exception.

Perhaps I'm just paranoid (well, I know I am), but I still don't want access to my root file system to go through any sort of userspace intermediary. I am glad to know that it is possible.

"Embedded" covers a lot of ground. I worked on an "embedded" system with 1GB of RAM back in 2002. Whenever I think that even the tiniest devices have megabytes of RAM, someone proves me wrong again...

I don't discuss performance in any detail any more. :) No matter what you say, someone else's workload performs better or worse, and the software can always be more optimized.

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 18:27 UTC (Fri) by giraffedata (guest, #1954) [Link]

Wow, that's a big job.

Has anyone worked on a simpler subset? What about a read-only union of read-only filesystems, without whiteouts?

That would at least buy you the kinds of things that are done with path environment variables (e.g. PATH) today.

Unioning file systems: Architecture, features, and design choices

Posted Mar 20, 2009 22:58 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

Can someone explain *why* seekdir() is so intrinsically hard?

I mean, POSIX is quite happy for all files created after opendir() to not
be reflected in the output from that DIR handle... so why doesn't glibc
simply remember everything it's been given from getdents() until the
closedir()? Seeking on *that* would be trivial.

(The only downside is potentially unbounded userspace memory usage, but if
you're playing with gigabyte-sized directories other things will go wrong
first: e.g. there are a *lot* of apps out there that do things that scale
as O(n log n) or even worse in the size of a directory... and it's only a
user process doing it to itself anyway. Is it just that this memory usage
isn't worth it for a call as never-used as seekdir()?)

Unioning file systems: Architecture, features, and design choices

Posted Mar 21, 2009 18:25 UTC (Sat) by jbailey (guest, #16890) [Link] (1 responses)

Off the top of my head:

* To preserve the old syscall, we need to keep this functionality anyway.

* Retrieving directory entries could be expensive over wan links and such,
taking that huge hit on opendir might be a little much (How many times have
I done ls on a large dir and found myself hammering on C-c?)

* opendir sending me past my ulimit or available ram would be an
interesting DoS attack. Too many files, and you can't ls to figure out
what you should start deleting. No globs for you either. =)

If the interface could change, it might be nice to have a timelimit, and
throw an EINTR or some such on a seekdir that amounted to "suck it up and
start again."

Unioning file systems: Architecture, features, and design choices

Posted Mar 21, 2009 19:34 UTC (Sat) by nix (subscriber, #2304) [Link]

Ah, no. I mis-explained. I expect opendir() to do just what it does now:
but readdir()ing will remember the contents. (This is fine, because you
can't seekdir() to somewhere that you haven't telldir()ed, and you can't
telldir() something you haven't readdir()ed already.)

There's no DoS problem, because the application can keep an eye on the
amount of readdir()ing it's done, and stop if need be. It makes seekdir(),
even over NFS, a doddle, and retrieving dirents is no more expensive than
it is now.

I don't understand why glibc doesn't *already* implement this. Why on
earth is seekdir() the kernel's job?

(And, yes, we'd have to preserve the old syscall, but given the number of
users --- none on my system, two in the *entire* Debian source tree when I
counted it a few years ago --- I don't think anyone would care much, or
even notice, if it rotted gently into brokenness, or completely failed to
work on new filesystems.)

Unioning file systems: Architecture, features, and design choices

Posted Mar 21, 2009 21:19 UTC (Sat) by kolyshkin (guest, #34342) [Link]

Yet another use case for unioning FS is containers or just chroots (I use the term «container» below, but all that applies to chroot as well).

You have a basic set of packages (aka "the base system") preinstalled into a special location, which you then union mount with an (initially empty) container COW/private area. When a container wants to change any file, a private COW copy is created. This setup allows for:
1. Fast container creation — all the files are already there; you only have to mkdir and mount.
2. Sharing files on disk; ideally you need N times less space, where N is the number of containers you have.
3. Sharing pages in memory; whenever a library/binary/mmapped file used from different containers is the same on disk (i.e. device and inode are the same), there will be only one copy of its read-only data in memory.

In practice, the benefits from #2 and #3 can only be achieved if the containers are identical (same distro, same exact versions of the same packages), and they will deteriorate over time as you modify files from inside those containers (e.g. when you run 'yum update' in both, you may end up with the same packages being updated, but now their files are in the COW areas; this is yet another problem to deal with).

Access timestamps

Posted Mar 22, 2009 0:44 UTC (Sun) by pjm (guest, #2080) [Link] (1 responses)

I'd just add to the section on timestamps that implementing access timestamps is more costly for some approaches than others, and some union filesystems choose not to implement atime for files that haven't been copied up to a writable filesystem.

(Exceptionally good article, btw.)

Access timestamps

Posted Mar 25, 2009 4:47 UTC (Wed) by vaurora (guest, #38407) [Link]

You're right, I somehow managed to forget to talk about atime. Thanks for pointing that out!

Unioning file systems: Architecture, features, and design choices

Posted Mar 22, 2009 0:50 UTC (Sun) by jbreiden (guest, #7090) [Link] (2 responses)

Great article. I'm very much looking forward to your next article, with practical advice about the various union filesystem options and their performance.

I suspect there is a relatively new use case for union filesystems, due to the increasing popularity of solid state drives. For example, I help run a server with many hundred gigabytes of small files on XFS, on RAID-1, on rotating rust. There is enough reading and writing (and therefore seeking) going on to push the performance limits of the storage subsystem. So I bought one of those whiz-bang Intel SSD drives, with the intention of sending all writes to the flash drive and using the rotating rust purely for reads (at least during normal operation). After some investigation of various unionfs options, I tried mhddfs, since it sounded like the simplest thing to deal with. It didn't perform so hot, with load shooting up to about 700 before I killed it (maybe it was trying to replicate the admittedly large directory structure somewhere? Who knows?). So now I'm back to the primitive "manage everything with a rat's nest of symbolic links" strategy.

Unioning file systems: Architecture, features, and design choices

Posted Mar 25, 2009 5:57 UTC (Wed) by vaurora (guest, #38407) [Link]

Performance? That's for suckers! Seriously, I avoid writing about file system benchmarks like the plague - everyone's workload is different, and everyone's file system hasn't been optimized just yet.

Have fun with your new SSD!

Unioning file systems: Architecture, features, and design choices

Posted Apr 6, 2009 12:34 UTC (Mon) by tpo (subscriber, #25713) [Link]

I am having quite severe mhddfs performance issues. The cron job that searches the filesystem to update the "locate" database regularly brings my system to a halt, and I need to kill the job to be able to move the mouse.

This IMHO points to a larger problem within Linux, but I have not looked into it in more detail.

Unioning file systems: Architecture, features, and design choices

Posted Mar 26, 2009 9:04 UTC (Thu) by anandsr21 (guest, #28562) [Link]

I am wondering why we would need the complexity of allowing branches. The union FS should (to my knowledge, which is admittedly very limited) only hold the changes to the underlying fs and nothing else. We should have user-level tools that can modify the union fs to work with a new kind of underlying system if the need arises. The need should arise only during upgrades, and then the distribution should provide it.

I would think that ... would be the ideal name for the special directory containing changes in each directory.

-anand

Unioning file systems: Architecture, features, and design choices

Posted Mar 26, 2009 21:42 UTC (Thu) by muwlgr (guest, #35359) [Link]

Plan 9 has formalized union mount semantics in a quite clear and simple, but completely non-POSIX, way. Overall, it has dropped a lot of POSIX assumptions to reach new levels of cleanness and simplicity. So, some things are quite hard to borrow back from Plan 9 into Unix/Linux.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 3:42 UTC (Tue) by sunburnt (guest, #88819) [Link] (14 responses)

How about a junction link? Currently a link can hold a $PATH,
but it's dead, as it's not recognized (by the kernel or FS?).
A link's much simpler and less kernel-intensive than mounting.
All paths in the link's $PATH would be visible and readable.
Path precedence handles duplicate dirs and files, no white-outs.
Only the first path in its $PATH could be written to, and
if it's read-only or not enough space is available, then error.

This would allow a CD + writable layer on top to work properly.
Also possible: writable junctions on dirs inside Squash files.

# A sym link can already hold a $PATH, why not make it functional?

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 7:34 UTC (Tue) by anselm (subscriber, #2796) [Link] (13 responses)

Let's assume we have files /a/x and /b/x, and /c is a symbolic link to /a:/b. What happens if the user says »rm /c/x«?

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 13:44 UTC (Tue) by nix (subscriber, #2304) [Link] (12 responses)

The first time you invoke it, rm /c/x chases the link to /a, because x is present there, and removes /a/x; the second time, it removes /b/x.

A nastier question is what rm -r does in the context of such links. A really nasty question is what wildcard expansion does: what does /c/f* do if there are files matching f* in both /a and /b? As far as I can see, the semantics sunburnt described do not allow for the obvious semantics (expand files in /a and /b) without less-than-obvious changes to every program that does globbing and fnmatching everywhere -- and without those changes, the semantics you seem likely to get are quite bizarre.

Nice idea, torpedoed by harsh reality, I fear. (At least one Unix tried something similar, but it allowed $VARIABLES in the link, not paths, so at least a link could only expand to one thing in a given context.)
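For concreteness, the fall-through semantics in question can be modelled with a toy in-memory sketch (purely hypothetical; each layer is just a dict, nothing here is real VFS code):

```python
class JunctionLink:
    """Toy model of a 'path in a symlink' union: /c -> /a:/b.

    Each layer is a dict mapping filename -> contents.  lookup() scans
    the layers in $PATH order; unlink() removes the first match, which
    exposes any same-named file in a lower layer (no whiteouts exist
    in this scheme).
    """

    def __init__(self, *layers):
        self.layers = list(layers)

    def lookup(self, name):
        for layer in self.layers:
            if name in layer:
                return layer[name]
        raise FileNotFoundError(name)

    def unlink(self, name):
        # Removes only the topmost instance, per the semantics above.
        for layer in self.layers:
            if name in layer:
                del layer[name]
                return
        raise FileNotFoundError(name)

    def listdir(self):
        # Merged listing: each name appears once, topmost instance winning.
        seen = []
        for layer in self.layers:
            for name in layer:
                if name not in seen:
                    seen.append(name)
        return sorted(seen)
```

Running "rm /c/x" twice against this model does exactly what's described above: the first unlink removes the /a copy and the /b copy shows through; only the second unlink makes the name disappear.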

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 13:55 UTC (Tue) by anselm (subscriber, #2796) [Link] (9 responses)

The first time you invoke it, rm /c/x chases the link to /a, because x is present there, and removes /a/x; the second time, it removes /b/x.

That's the obvious way of handling this. It does break the (reasonable) assumption that after a successful rm(1) of a file that file will be gone (in the sense of »open("/c/x", O_RDONLY) fails«), rather than replaced by the next one over.

One of the main goals of a union filesystem (or equivalent), in my opinion, is that the files and directories it presents behave as nearly like ones on a non-union filesystem as possible. From that point of view the »paths in links« approach is a complete non-starter.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 14:49 UTC (Tue) by hummassa (subscriber, #307) [Link] (1 response)

> From that point of view the »paths in links« approach is a complete non-starter.

Actually, I strongly disagree. This is the way it would work anyway in any implementation of a union fs: "rm /c/f" removes /a/f, but /b/f stays in its place for the next open() call (/b and /b/f are read-only, so you can't remove it again). That way, for instance, the read-only "bottom" fs can come with a default "/etc/resolv.conf", but it could be overridden by user configuration on the read-write "top" fs. If the user removes all configuration, the default config stays.

I actually think that "paths in links" and "junction links" should be the default mounting mechanism instead of "/etc/fstab", because then it would be completely portable (you get a disk with n partitions, organize them with links to one another, disconnect it from one computer, connect it to another, and voilà... everything is in its place). And everything could be done lazily (you only actually mount a device once you access the link).

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 15:03 UTC (Tue) by hummassa (subscriber, #307) [Link]

> It does break the (reasonable) assumption that after a successful rm(1) of a file that file will be gone (in the sense of »open("/c/x", O_RDONLY) fails«), rather than replaced by the next one over.

Why would this assumption be reasonable? It's not a single-process, single-user system. Once you remove a file, other processes are totally permitted to re-create it immediately.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 14:49 UTC (Tue) by nix (subscriber, #2304) [Link] (6 responses)

Yeah, but of course that reasonable assumption, while often true, is racy and thus not a valid assumption at all: it's only guaranteed safe if you are operating on a disconnected subset of the filesystem and you are certain no other thread or process has that subset as its current directory or has an fd to any directory in that subset. This seems like an extremely rare case (not least because the subset has to be connected briefly while you're setting it up, and you can't stop something else from cd'ing into it in that time period: even if you tweak permissions appropriately, a root-owned process can still get in there).

So any program that depends on an open() after an unlink() returning -ENOENT is broken in any case, and we don't need to pander to them. (They should probably be using O_CREAT|O_EXCL.)

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 15:15 UTC (Tue) by anselm (subscriber, #2796) [Link] (5 responses)

That's why I said »assumption«, not »guarantee«. It is clear that it is possible to stipulate cases where if you remove a file, another one will immediately pop up in its place – and it is also clear that one shouldn't write programs based on the idea that just because a file was removed just now, another one of the same name cannot have appeared in the meantime.

However, if – like most of the people most of the time – you're just working away in your shell somewhere below $HOME, that sort of thing is highly unlikely. In that context, chances are that if you just removed a file you'd expect it to be gone, not replaced by some older or default version. For example, I often do something like »rm *~« in order to clean up before, e.g., making a tar file, and I don't want this to unearth random files from the layer below.

The other problem is that with this approach it is impossible to make files go away completely if they exist in a read-only layer. This can be a hassle for programs that assume certain defaults if they don't have a certain configuration file. If somebody in your virtualisation base image helpfully supplied a configuration file you don't want, either you're in luck because the program doesn't mind an empty configuration file that you put on top, or else you will have to come up with one that undoes just the stuff you don't want because you can't get rid of the unwanted file altogether.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 22:22 UTC (Tue) by hummassa (subscriber, #307) [Link] (2 responses)

Why would you mount an unionfs over your /home/x?

What you really want to do is to union mount /etc, /usr/share and other places where you want to have writable configuration or data files shadowing read-only (original) files.

In that case, the semantics you need from -- for instance -- "rm /etc/resolv.conf" is exactly "replace /etc/resolv.conf with the default file".

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 22:31 UTC (Tue) by anselm (subscriber, #2796) [Link] (1 response)

In that case, the semantics you need from -- for instance -- "rm /etc/resolv.conf" is exactly "replace /etc/resolv.conf with the default file".

That's not the greatest example – not having an /etc/resolv.conf file at all also means something, and with the »path in symlink« approach you can't make a file that is present in a lower layer appear as if it wasn't there after all, which is something that union filesystems can usually handle.

Unioning file systems: Architecture, features, and design choices

Posted Jan 16, 2013 0:00 UTC (Wed) by hummassa (subscriber, #307) [Link]

You can have "negative filesystem skeletons", like you have in some unionfs implementations. That way, you do "/x -> /a:-/b:/c:-/d", and if you remove a file in /x, the implementation will touch the same name (modulo an extension or prefix) under /b, which will mark said file as nonexistent even if it exists in /c...
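As a toy illustration of such negative entries (the ".wh." prefix convention below is borrowed from unionfs/aufs-style whiteouts; everything else is a made-up in-memory model, not real filesystem code):

```python
WHITEOUT = ".wh."  # marker prefix, in the style of unionfs/aufs whiteouts

class UnionWithWhiteouts:
    """Toy union of a writable top layer over a read-only bottom layer.

    Whiteout entries in the top layer mask same-named bottom-layer files,
    so a removed file stays removed even though the bottom layer is
    read-only.  Layers are plain dicts: filename -> contents.
    """

    def __init__(self, top, bottom):
        self.top = top        # writable layer (also holds whiteout markers)
        self.bottom = bottom  # read-only layer

    def lookup(self, name):
        if WHITEOUT + name in self.top:
            raise FileNotFoundError(name)  # masked by a whiteout
        if name in self.top:
            return self.top[name]
        if name in self.bottom:
            return self.bottom[name]
        raise FileNotFoundError(name)

    def unlink(self, name):
        self.lookup(name)  # raises FileNotFoundError if absent or masked
        if name in self.top:
            del self.top[name]
        if name in self.bottom:
            # Can't touch the read-only layer: mask the file instead.
            self.top[WHITEOUT + name] = None
```

With this, removing an overridden /etc/resolv.conf makes the name disappear entirely, instead of unearthing the read-only default, which is exactly what the negative skeleton buys you.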

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 23:48 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

> That's why I said »assumption«, not »guarantee«.

Sorry, I tend to read 'assumption of X' as a statement that we had better have a guarantee of X, lest we break the code making that assumption. Miscommunication. :)

> However, if – like most of the people most of the time – you're just working away in your shell somewhere below $HOME, that sort of thing is highly unlikely.

Quite. It could perfectly well confuse the heck out of users, and I don't see a way to implement the less-confusing whiteout model using this sort of symlink-path. It just shouldn't break any software that isn't already broken.

> The other problem is that with this approach it is impossible to make files go away completely if they exist in a read-only layer.

Yeah -- but, again, it's impossible to make files go away completely (or change in any way) if they exist on a read-only filesystem, and nobody complains about that. I personally don't see this sort of symlink-path model as a viable way to implement union mounts, but they could be interesting in and of themselves regardless. It's worth thinking about -- the sort of experimentation that doesn't happen often in Linux anymore due to the drag of the installed base of software and its tedious demands that we not gratuitously break it :P

Unioning file systems: Architecture, features, and design choices

Posted Jan 16, 2013 0:06 UTC (Wed) by anselm (subscriber, #2796) [Link]

Yeah -- but, again, it's impossible to make files go away completely (or change in any way) if they exist on a read-only filesystem, and nobody complains about that.

That's because read-only filesystems don't pretend to be read-write filesystems.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 15:00 UTC (Tue) by hummassa (subscriber, #307) [Link] (1 responses)

> A nastier question is what rm -r does in the context of such links.

Exactly what it does in the case of a union fs: removes all files, recursively, from the "top" filesystem (the "bottom" filesystems are read-only), refusing to remove any files present in the "bottom" filesystems.

> A really nasty question is what wildcard expansion does: what does /c/f* do if there are files matching f* in both /a and /b? As far as I can see, the semantics sunburnt described do not allow for the obvious semantics (expand files in /a and /b) without less-than-obvious changes to every program that does globbing and fnmatching everywhere -- and without those changes, the semantics you seem likely to get are quite bizarre.

Again, no nasty question: /c/f* will bring up all files that match in the union /c, i.e., a merge of /a/f* and /b/f*. A file called f1 present both in /a and in /b will appear only once, referring obviously to /a/f1.

Unioning file systems: Architecture, features, and design choices

Posted Jan 15, 2013 23:50 UTC (Tue) by nix (subscriber, #2304) [Link]

OK, it sounds like rm will need explicit support for this, then. The semantics you describe certainly don't fall naturally out of the readdir() semantics you suggest.


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds