Union Mount Workshop: Review of Existing Approaches
(Revised on March 22, 2009.)
Hello,
this is a report from a union mount workshop held from November 3 to 4,
2008, in the SUSE/Novell offices in Nuremberg [1]. Back in November, we
discussed our requirements, the different ways in which these
requirements can be met, and more specifically union mounts, an
approach which we believe looks promising.
All of the use cases we are interested in basically boil down to the
same thing: having an image or filesystem that is used read-only (either
because it is not writable, or because writing to the image is not
desired), and pretending that this image or filesystem is writable,
storing changes somewhere else.
None of the use cases require merging arbitrary file systems (*), a
feature which union mounts provide by design. What we need is "only" a
mechanism to "overlay" a read-only image or file system with a writable
one. This means that many approaches other than union mounts may be
applicable.
(*) One use case is to produce a number of modular images which
can be plugged together in a flexible way: several base OS
images, database images, web server images, etc. It seems
desirable to come up with a solution for plugging multiple
of those images together, resulting in a working system.
This use case requires solving the problem of somehow
cleverly combining files which exist in more than one image
(like the rpm database). It is unclear how to achieve this,
and I am not convinced that combining images ad-hoc by
simply mounting them together (as opposed to having an
explicit merge step) is of relevance. We consider this use
case out of scope at this point.
At different layers of the storage stack, different solutions exist or
are being worked on, with different advantages and disadvantages. Let's
look at those layers.
THE BLOCK LAYER
At the block layer, copy-on-write could be used to construct a writable
filesystem image out of the read-only image and an exception store:
initially, blocks are read from the read-only image. As soon as they
are modified, this fact and the block's new contents are recorded in the
exception store; from that point on, a read will return the new block's
contents from the exception store. This is how device mapper snapshots
work.
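As a rough sketch of the idea (a simplified in-memory model, not device
mapper's actual on-disk format or API):

```python
# Sketch of a copy-on-write block device built from a read-only base
# image plus an exception store that records modified blocks.
# Illustrative only; the class and its block representation are
# simplifications, not device mapper internals.

class CowDevice:
    def __init__(self, base_blocks):
        self.base = base_blocks   # read-only image, never modified
        self.exceptions = {}      # block number -> new contents

    def read(self, blockno):
        # Modified blocks come from the exception store, all others
        # from the read-only base image.
        if blockno in self.exceptions:
            return self.exceptions[blockno]
        return self.base[blockno]

    def write(self, blockno, data):
        # The base image is never touched; the new contents are
        # recorded in the exception store instead.
        self.exceptions[blockno] = data

dev = CowDevice([b"aaaa", b"bbbb", b"cccc"])
dev.write(1, b"BBBB")
print(dev.read(0))  # unmodified, served from the base image
print(dev.read(1))  # served from the exception store
```

Note how this also illustrates the last disadvantage below: the
exception store is meaningless without the exact base image it was
recorded against.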
Advantages:
- Above the block layer, everything will look like before:
except for the performance degradation, filesystems and
applications will see no difference in behavior.
- Existing files will be modified at the block level, so
working with huge directories and big files will perform
similarly to how these operations perform today (after taking
the block layer slowdown into account).
Disadvantages:
- Snapshots implemented at the block layer are slow. The
current device mapper implementation scales very poorly. Too
much I/O is necessary because the block layer does not have
any meta-information about disk blocks. (Things are
improving somewhat; in particular, filesystems have recently
gained the ability to pass some meta-information down to the
block layer.
Still, some optimizations available to higher-layer
approaches are not available at the block layer.)
- Mixing different filesystem types, such as a compressed
read-only and an uncompressed writable filesystem, is not
possible with block layer approaches. (The block layer does
not know what a file system is.)
- Because they only record the deltas relative to a read-only
image, exception stores are relatively worthless in
isolation. If the read-only image is lost, becomes corrupted,
or is modified by accident, the exception stores based on it
become worthless.
Block device approaches are not limited to a single machine. Network
block devices can be used in distributed environments. (This implies
additional overheads, however.)
THE FILESYSTEM LAYER
Future filesystems based on copy-on-write designs, such as btrfs, do (or
will) support in-filesystem block device management and snapshots.
Conceptually this is similar to block-layer mechanisms, but it allows
optimizations that would be hard to achieve otherwise, which makes this
approach scale much better.
Advantages:
- Transparent to users: everything will behave as if on a
normal, writable file system.
- Faster, because more meta-information is available, changes
can be tracked at a sub-block granularity, and functionality
that the filesystem already provides does not need to be
duplicated at the block layer.
- Efficient compression is easier to implement than at the
block layer.
- At least in btrfs, snapshots are first-class objects; they do
not merely record the deltas relative to another image. The
filesystem itself guarantees consistency. Despite that, they
may be based on a read-only device, combined with a writable
device.
Disadvantages:
- These filesystems do not have network transparency built in.
Block-layer network transparency may be used though, and such
filesystems may be exposed to the network via a network
filesystem like NFS, or Zach Brown's CRFS [2] (in development).
THE VFS LAYER: MOUNTS / BIND MOUNTS
The classical approach for making part of the filesystem namespace
writable is to mount writable and read-only filesystems together "in the
right way". As an extension to ordinary mounts, bind mounts allow
mounting specific directories and even files on top of each other.
Advantages:
- Bind mounts exist today; the technology is well understood,
and simple.
Disadvantages:
- The places in the filesystem namespace that shall be writable
need to be known in advance. It may be difficult to adapt to
applications which write to files in unexpected places.
Additional places in the namespace need to be made writable
manually when necessary.
- Bind mounts do not have a concept of unified directories:
files can only be mounted over existing files; new files or
directories can not be created that way. When files need to
be added, as frequently happens in the /etc directory, for
example, the entire directory must be copied, and files in
the directory must either be copied too, individually bind
mounted, or replaced by symlinks. (In the latter two
cases, the old files referred to must be visible somewhere
else in the namespace.)
- Software updates modify files in many places which are
usually accessed read-only. Our software is rpm based, so
rpm could be modified to bind mount new files over their
previous versions, but over time, the number of bind mounts
this would lead to would become excessive, and this would not
scale well: upon system startup, a large number of mounts
would have to be set up, which would take time, and consume a
significant amount of kernel memory. Such systems would
become hard to maintain.
- Some applications may see unexpected error codes when
manipulating bind mounted files and directories: for example,
bind mounted files and directories cannot be renamed (but
neither can files on a read-only filesystem). It could be
argued that applications relying on undefined behavior are
buggy, but this would not help when legacy applications need
to be used.
THE VFS LAYER: UNION MOUNTS
Union mounts are more flexible than bind mounts: file systems are
mounted on top of each other transparently so that at the top layer,
files from the top and bottom layer are visible at the same time.
(Ordinary mounts are opaque, hiding all lower layers.) Files at the top
layer hide files at the bottom layer with the same name. The bottom
layer is never written to. The top layer may be read-only (which is
comparatively boring), or writable.
Union mounts can be stacked on top of each other and it is possible to
"see through" multiple layers. I will describe union mounts in terms of
two layers here to keep things simple.
If a file which exists at the bottom layer is deleted, a so-called
whiteout file with the same name is created at the top layer. Users
never get to see this file; it is not included in readdir results, and
trying to open it fails with errno == ENOENT. If a file with the same
name is later created, this file replaces the whiteout.
When a new directory is created in place of a whiteout and the bottom
layer contains a directory with the same name, the files from the bottom
layer would be visible. This is not desirable as newly created
directories are supposed to be empty. To fix this defect, new
directories are created with a special "opaque" flag set: this flag
indicates that bottom-layer directory contents shall remain hidden.
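The whiteout and opaque-directory rules can be sketched as follows; the
dictionary-based layers and the WHITEOUT marker are simplified
stand-ins for the real on-disk representation:

```python
# Sketch of union-mount name lookup with whiteouts and opaque
# directories. Illustrative only: real implementations store whiteouts
# and the opaque flag in filesystem-specific ways.

WHITEOUT = object()   # top-layer marker: the name was deleted

def lookup(name, top, bottom, top_opaque=False):
    """Return the visible entry for `name`, or None (ENOENT)."""
    if name in top:
        entry = top[name]
        # Whiteouts are never shown to users; the name looks deleted.
        return None if entry is WHITEOUT else entry
    # An opaque top-layer directory hides all bottom-layer contents.
    if top_opaque:
        return None
    return bottom.get(name)

bottom = {"passwd": "bottom:passwd", "motd": "bottom:motd"}
top = {"motd": WHITEOUT, "hosts": "top:hosts"}

print(lookup("hosts", top, bottom))   # found at the top layer
print(lookup("passwd", top, bottom))  # shines through from the bottom
print(lookup("motd", top, bottom))    # whited out -> None (ENOENT)
```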
When files at the bottom layer are about to be modified, since the
bottom layer is accessed read-only, they first need to be copied into
the top layer. This operation is commonly referred to as copy-up. With
union mounts, the decision when to copy files up is made at file open
time.
(Directories do not require a copy-up operation: when a directory that
does not yet exist at the top layer is modified, it is sufficient to
create an empty directory at the top layer with the "opaque" flag
cleared.)
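Copy-up at open time can be sketched as follows; the dictionary layers
and the union_open helper are illustrative names, not a real API:

```python
# Sketch of copy-up at open(2) time: read-only opens go straight to the
# bottom layer, while the first open for writing copies the file into
# the top layer. Simplified model for illustration only.

def union_open(name, mode, top, bottom):
    if name in top:
        # Once a file exists at the top layer, all opens see that copy.
        return ("top", top[name])
    if mode == "r":
        # Read-only opens do not trigger a copy-up.
        return ("bottom", bottom[name])
    # Open for writing: copy the file up. The bottom layer stays
    # untouched; subsequent opens return the top-layer copy.
    top[name] = bottom[name]
    return ("top", top[name])

bottom = {"config": "original contents"}
top = {}

print(union_open("config", "r", top, bottom))  # no copy-up
print(union_open("config", "w", top, bottom))  # triggers a copy-up
print("config" in top)                         # now shadowed by the top layer
```

This also shows the first disadvantage listed below: a read-only file
descriptor obtained before the copy-up keeps referring to the
bottom-layer file and never sees later changes.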
Advantages:
- Union mounts are very easy to set up, and they are relatively
simple conceptually.
- File system types can be arbitrarily mixed.
- Compared to the alternatives, union mounts are "relatively
easy" to implement. (But note the various problems mentioned
below which make it difficult to argue that a particular
implementation is "correct" or "good enough".)
- Even though the result may not be the expected one, the top
layer may be mounted over a bottom layer even after the
bottom layer was modified.
- Multiple filesystems can be mounted together arbitrarily,
fully merging their contents.
Disadvantages:
- When a process opens a file in read-only mode, it will get a
file descriptor on the bottom-layer filesystem. Another
process may open the same file for writing. The read-only
file is copied up, and the second process will get a file
descriptor to a different file. The first process will not
see any changes done by the second process. (Subsequent
read-only opens will return the top-layer file, of course.)
- When a file is copied up, it changes its device and inode
numbers (the st_dev and st_ino fields returned by stat(2)).
Applications making assumptions about these fields may be
surprised.
- Union mounts are name based, and hardlinks across mount
layers are not supported. When a file on the bottom layer
with a link count greater than one is opened for writing,
copying it up will drop its link count to one. When another
alias of the same bottom-layer file is opened for writing
afterwards, the top layer will end up with two independent
copies, each with a link count of one.
(It might help to change stat(2) to always return a fake link
count of one for all files on the bottom layer.)
- Operations expected to be cheap may suddenly become very
expensive. This includes changing file timestamps and
creating a hardlink to a bottom layer file, both of which
will trigger a copy-up operation. (Failing link(2) with
errno == EXDEV for files within the same directory may at
first seem acceptable, but many applications make the
assumption that hardlinks will be supported, and do not
implement the appropriate fallback code.)
- Copy-up of large files may become prohibitively slow. (Think
of fixing a few tags in your music collection.) File system
or block layer approaches scale better in this respect.
- POSIX specifies requirements for readdir that, taken
together, result in having to compute and cache the entire
directory contents, removing duplicates and whiteouts. The
reason is that processes may arbitrarily seek in the
readdir stream (see telldir(3) and seekdir(3)). This is a
CPU and memory intensive operation.
- Union mounts operate on files as a whole, and do not provide
mechanisms for merging files. When multiple filesystems are
union mounted, the file on the top layer will "win", which
may still lead to an inconsistent system. (Think of files
such as the rpm database.)
- Inotify: how notifications should behave across layers is
an open question.
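The readdir merging described above can be sketched as follows; the
dictionary layers and WHITEOUT marker are simplifications of the real
representation:

```python
# Sketch of the readdir merge that POSIX forces on union mounts: the
# whole merged listing is computed up front so that telldir/seekdir can
# seek anywhere in it, with duplicates and whiteouts removed.
# Illustrative only.

WHITEOUT = object()   # top-layer marker: the name was deleted

def union_readdir(top, bottom):
    entries = []
    seen = set()
    for name, entry in sorted(top.items()):
        seen.add(name)                # the top layer always wins
        if entry is not WHITEOUT:     # whiteouts are never listed
            entries.append(name)
    for name in sorted(bottom):
        if name not in seen:          # drop duplicates of top names
            entries.append(name)
    return entries

bottom = {"a": 1, "b": 2, "c": 3}
top = {"b": "replaced", "c": WHITEOUT, "d": 4}
print(union_readdir(top, bottom))  # -> ['b', 'd', 'a']
```

The entire result has to be materialized before the first entry can be
returned, which is where the CPU and memory cost comes from.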
BETWEEN THE FILESYSTEM AND VFS LAYERS: UNIONFS, AUFS
Unionfs and Aufs try to appear as a filesystem to the VFS, and as the
VFS to filesystems. Some of the namespace handling is done at the
filesystem level, which results in various problems (locking, cache
coherency, etc.).
Both of these implementations include a number of features which our use
cases do not immediately require (multiple writable layers, stable inode
numbers across copy-up, cross-layer hardlinks). These features lead to
a tremendous increase in the complexity of the implementation, though.
Advantages:
- Available today.
Disadvantages:
- The existing implementations are overly complex, and full of
bugs. They are not maintainable.
SOME RANDOM THOUGHTS
In many environments, the underlying read-only image needs to be updated
over time, and overlays based on a previous version of the image need to
be moved to more recent versions. This will require telling system
files from configuration, and configuration from user data. From a
kernel / base OS point of view, we can only provide the mechanisms; the
higher-level issues need to be addressed in systems management software.
One mechanism likely to help is layering, in which a snapshot is based
on another snapshot, or multiple layers are union mounted together.
System updates could go into a different layer than configuration, and
user data could go into yet another layer. All the approaches discussed
support this kind of layering to some degree; all approaches above the
block layer do so in a reasonably scalable way.
Tools will be needed for efficiently analyzing the relative changes
between two layers. This should be possible by mounting the two
versions next to each other and comparing them using traditional tools.
For some approaches, lower-level tools could be written for achieving
the same results more efficiently. (Such tools may be difficult to
write, and might not be worth the bother, though.)
CONCLUSION (AS OF MARCH 2009)
The union mount/filesystem problem doesn't have a right and several wrong
solutions; each of the discussed solutions has its own strengths and
weaknesses, even when only considering the union of one read-only and one
writable image. In the long run and where applicable, in-filesystem
snapshots look the most promising to me.
When different types of filesystems must be mixed (and until
in-filesystem snapshots on Linux have matured), union mounts may well
save the day. They could be implemented incrementally starting with the
union of two directories, then extended by whiteouts to allow deletes,
and then recursion and opaque directories could be added. Problems such
as unstable inode numbers and missing support for hardlinks across
layers would remain though; trying to solve them at the VFS layer would
make no sense. This makes union mounts appropriate only in limited
scenarios; when used beyond this, users would trip over the inherent
defects far too often.
Miklos Szeredi has recently posted an early delta filesystem prototype
based on FUSE [3]. While this approach does not unify filesystems,
it does record the deltas relative to a read-only filesystem. This
allows recording metadata-only changes without triggering full,
expensive copy-up operations, for example. Stacking a delta filesystem
on top of read-only union mounts seems interesting, too. Many of the
details of the delta filesystem design have not been fleshed out yet
though, so we will have to see.
Andreas Gruenbacher,
20 November 2008 and 22 March 2009
REFERENCES
[1] Union Mount Workshop, http://en.opensuse.org/Union_Mount_Workshop
[2] Cache Coherent Networked File System (CRFS),
http://oss.oracle.com/projects/crfs/
[3] Miklos Szeredi, delta filesystem prototype,
http://lwn.net/Articles/321391/
