
Union mount workshop notes

Union Mount Workshop: Review of Existing Approaches

(Revised on March 22, 2009.)

Hello,

this is a report from a union mount workshop held on November 3 and 4
in the SUSE/Novell offices in Nuremberg [1].  Back in November, we
discussed our requirements, the different ways in which these
requirements can be met, and more specifically union mounts, an
approach which we believe looks promising.

All of the use cases we are interested in basically boil down to the
same thing: having an image or filesystem that is used read-only (either
because it is not writable, or because writing to the image is not
desired), and pretending that this image or filesystem is writable,
storing changes somewhere else.

None of the use cases require merging arbitrary file systems (*), a
feature which union mounts provide by design.  What we need is "only" a
mechanism to "overlay" a read-only image or file system with a writable
one.  This means that many approaches other than union mounts may be
applicable.

 (*) One use case is to produce a number of modular images which
     can be plugged together in a flexible way: several base OS
     images, database images, web server images, etc.  It seems
     desirable to come up with a solution for plugging multiple
     of those images together, resulting in a working system.
     This use case requires solving the problem of somehow
     cleverly combining files which exist in more than one image
     (like the rpm database).  It is unclear how to achieve this,
     and I am not convinced that combining images ad-hoc by
     simply mounting them together (as opposed to having an
     explicit merge step) is of relevance.  We consider this use
     case out of scope at this point.

At different layers of the storage stack, different solutions exist or
are being worked on, with different advantages and disadvantages.  Let's
look at those layers.


THE BLOCK LAYER

At the block layer, copy-on-write could be used to construct a writable
filesystem image out of the read-only image and an exception store:
initially, blocks are read from the read-only image.  As soon as a
block is modified, this fact and the block's new contents are recorded
in the exception store; from that point on, reads return the block's
new contents from the exception store.  This is how device mapper
snapshots work.
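The exception-store mechanism can be illustrated with a small Python
model.  This is only a sketch of the idea (the class and names are made
up for illustration); real device mapper snapshots keep their exceptions
in on-disk chunk metadata:

```python
# Toy model of a block-layer copy-on-write device (illustrative only):
# reads fall through to the read-only image until a block is written;
# from then on, the exception store serves that block.

class CowDevice:
    def __init__(self, readonly_blocks):
        self.image = readonly_blocks     # the read-only image; never modified
        self.exceptions = {}             # block number -> new contents

    def read(self, blockno):
        # Modified blocks come from the exception store, all others
        # from the read-only image.
        if blockno in self.exceptions:
            return self.exceptions[blockno]
        return self.image[blockno]

    def write(self, blockno, data):
        # Record the new contents in the exception store only.
        self.exceptions[blockno] = data

dev = CowDevice([b"aaaa", b"bbbb"])
dev.write(1, b"BBBB")
print(dev.read(0), dev.read(1))   # the image itself stays untouched
```

Note that the exception store alone is meaningless without the exact
image it was created against, which is the fragility discussed below.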

 Advantages:

  - Above the block layer, everything will look like before:
    except for the performance degradation, filesystems and
    applications will see no difference in behavior.

  - Existing files will be modified at the block level, so
    working with huge directories and big files will perform
    similarly to how these operations perform today (after taking
    the block layer slowdown into account).

 Disadvantages:

  - Snapshots implemented at the block layer are slow.  The
    current device mapper implementation scales very poorly.  Too
    much I/O is necessary because the block layer does not have
    any meta-information about disk blocks.  (Things are
    improving somewhat; in particular, filesystems have recently
    gained the ability to pass some meta-information down to the
    block layer.  Still, some optimizations available to
    higher-layer approaches are not available at the block layer.)

  - Mixing different filesystem types, such as a compressed
    read-only and an uncompressed writable filesystem, is not
    possible with block layer approaches.  (The block layer does
    not know what a file system is.)

  - Because they only record the deltas relative to a read-only
    image, exception stores are useless in isolation.  If the
    read-only image is lost, becomes corrupted, or is modified by
    accident, all exception stores based on it become worthless.

Block device approaches are not limited to a single machine.  Network
block devices can be used in distributed environments.  (This implies
additional overheads, however.)


THE FILESYSTEM LAYER

Future filesystems based on copy-on-write designs, such as btrfs, do (or
will) support in-filesystem block device management and snapshots.
Conceptually this is similar to block-layer mechanisms, but it allows
optimizations that would be hard to achieve otherwise, which makes this
approach scale much better.

 Advantages:

  - Transparent to users: everything will behave as if on a
    normal, writable file system.

  - Faster, because more meta-information is available, changes
    can be tracked at a sub-block granularity, and functionality
    that the filesystem already provides does not need to be
    duplicated at the block layer.

  - Efficient compression is easier to implement than at the
    block layer.

  - At least in btrfs, snapshots are first-class objects; they do
    not merely record the deltas relative to another image.  The
    filesystem itself guarantees consistency.  Despite that, they
    may be based on a read-only device, combined with a writable
    device.

 Disadvantages:

  - These filesystems do not have network transparency built in.
    Block-layer network transparency may be used though, and such
    filesystems may be exposed to the network via a network
    filesystem like NFS, or Zach Brown's CRFS [2] (in development).


THE VFS LAYER: MOUNTS / BIND MOUNTS

The classical approach for making part of the filesystem namespace
writable is to mount writable and read-only filesystems together "in the
right way".  As an extension to ordinary mounts, bind mounts allow
mounting specific directories and even individual files on top of
each other.

 Advantages:

  - Bind mounts exist today; the technology is well understood,
    and simple.

 Disadvantages:

  - The places in the filesystem namespace that shall be writable
    need to be known in advance.  It may be difficult to adapt to
    applications which write to files in unexpected places.
    Additional places in the namespace need to be made writable
    manually when necessary.

  - Bind mounts do not have a concept of unified directories:
    files can only be mounted over existing files; new files or
    directories can not be created that way.  When files need to
    be added, as frequently happens in the /etc directory, for
    example, the entire directory must be copied, and files in
    the directory must either be copied too, bind mounted over
    individually, or replaced by symlinks.  (In the latter two
    cases, the old files referred to must be visible somewhere
    else in the namespace.)

  - Software updates modify files in many places which are
    usually accessed read-only.  Our software is rpm based, so
    rpm could be modified to bind mount new files over their
    previous versions, but over time, the number of bind mounts
    this would lead to would become excessive, and this would not
    scale well: upon system startup, a large number of mounts
    would have to be set up, which would take time, and consume a
    significant amount of kernel memory.  Such systems would
    become hard to maintain.

  - Some applications may see unexpected error codes when
    manipulating bind mounted files and directories: for example,
    bind mounted files and directories cannot be renamed (but
    neither can files on a read-only filesystem).  It could be
    argued that applications relying on undefined behavior are
    buggy, but this would not help when legacy applications need
    to be used.


THE VFS LAYER: UNION MOUNTS

Union mounts are more flexible than bind mounts: file systems are
mounted on top of each other transparently so that at the top layer,
files from the top and bottom layer are visible at the same time.
(Ordinary mounts are opaque, hiding all lower layers.) Files at the top
layer hide files at the bottom layer with the same name.  The bottom
layer is never written to.  The top layer may be read-only (which is
comparatively boring), or writable.

Union mounts can be stacked on top of each other and it is possible to
"see through" multiple layers.  I will describe union mounts in terms of
two layers here to keep things simple.

If a file which exists at the bottom layer is deleted, a so-called
whiteout file with the same name is created at the top layer.  Users
never get to see this file: it is not included in readdir results, and
trying to open it fails with errno == ENOENT.  If a file with the same
name is later created, it replaces the whiteout.
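The whiteout semantics can be sketched as a toy two-layer name lookup in
Python.  The helper names and the dict-based layers are invented for
illustration; a real implementation lives in the VFS:

```python
# Toy two-layer lookup with whiteouts (a sketch, not kernel code).
# A file deleted from the union is represented by a whiteout entry
# on the top layer; lookups treat it as ENOENT.

WHITEOUT = object()   # sentinel marking a whiteout entry

def lookup(name, top, bottom):
    """Return the file contents, or None to signal ENOENT."""
    if name in top:
        entry = top[name]
        return None if entry is WHITEOUT else entry
    return bottom.get(name)

def unlink(name, top, bottom):
    if name in top and top[name] is not WHITEOUT:
        del top[name]
    if name in bottom:
        top[name] = WHITEOUT   # hide the bottom-layer file

top, bottom = {}, {"passwd": "root:x:0:0"}
unlink("passwd", top, bottom)
assert lookup("passwd", top, bottom) is None   # behaves like ENOENT
top["passwd"] = "new contents"                 # replaces the whiteout
assert lookup("passwd", top, bottom) == "new contents"
```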

When a new directory is created in place of a whiteout and the bottom
layer contains a directory with the same name, the files from the bottom
layer would be visible.  This is not desirable as newly created
directories are supposed to be empty.  To fix this defect, new
directories are created with a special "opaque" flag set: this flag
indicates that bottom-layer directory contents shall remain hidden.
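The effect of the "opaque" flag on directory listings can be sketched
like this (again a toy model with assumed names):

```python
# Sketch of the "opaque" directory flag: a union directory listing
# merges both layers unless the top-layer directory is marked opaque,
# in which case the bottom-layer contents stay hidden.

def list_union_dir(top_entries, bottom_entries, top_opaque):
    if top_opaque:
        return sorted(top_entries)
    return sorted(set(top_entries) | set(bottom_entries))

# A directory recreated in place of a whiteout is created opaque,
# so it starts out empty as expected:
assert list_union_dir([], ["old.conf"], top_opaque=True) == []
# Without the flag, the two layers are merged:
assert list_union_dir(["new"], ["old.conf"], top_opaque=False) \
        == ["new", "old.conf"]
```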

When files at the bottom layer are about to be modified, since the
bottom layer is accessed read-only, they first need to be copied into
the top layer.  This operation is commonly referred to as copy-up.  With
union mounts, the decision when to copy files up is made at file open
time.

(Directories do not require a copy-up operation: when a directory
that does not already exist at the top layer is modified, it is
sufficient to create an empty directory at the top layer with the
"opaque" flag cleared.)
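The copy-up decision at open time can be sketched as follows.  This is
a toy model with invented names, not an actual API:

```python
# Sketch of copy-up at open time: opening a bottom-layer file for
# writing first copies the whole file to the top layer; read-only
# opens keep referring to the bottom layer.

def open_file(name, top, bottom, writable):
    if name in top:
        return ("top", top[name])
    if name not in bottom:
        raise FileNotFoundError(name)
    if writable:
        top[name] = bottom[name]       # copy-up: duplicate the whole file
        return ("top", top[name])
    return ("bottom", bottom[name])    # no copy for read-only access

top, bottom = {}, {"motd": "hello"}
layer, _ = open_file("motd", top, bottom, writable=False)
assert layer == "bottom"               # read-only open: no copy-up
layer, _ = open_file("motd", top, bottom, writable=True)
assert layer == "top" and "motd" in top
```

The model also shows the first disadvantage listed below: a descriptor
obtained before the copy-up keeps referring to the bottom-layer file.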

 Advantages:

  - Union mounts are very easy to set up, and they are relatively
    simple conceptually.

  - File system types can be arbitrarily mixed.

  - Compared to the alternatives, union mounts are "relatively
    easy" to implement.  (But note the various problems mentioned
    below which make it difficult to argue that a particular
    implementation is "correct" or "good enough".)

  - Even though the result may not be the expected one, the top
    layer may be mounted over a bottom layer even after the
    bottom layer was modified.

  - Multiple filesystems can be mounted together arbitrarily,
    fully merging their contents.

 Disadvantages:

  - When a process opens a file in read-only mode, it will get a
    file descriptor on the bottom-layer filesystem.  Another
    process may open the same file for writing.  The read-only
    file is copied up, and the second process will get a file
    descriptor to a different file.  The first process will not
    see any changes done by the second process.  (Subsequent
    read-only opens will return the top-layer file, of course.)

  - When a file is copied up, it changes its device and inode
    numbers (the st_dev and st_ino fields returned by stat(2)).
    Applications making assumptions about these fields may be
    surprised.

  - Union mounts are name based, and hardlinks across mount
    layers are not supported.  When a file on the bottom layer
    with a link count greater than one is opened for writing,
    copying it up will drop its link count to one.  When another
    alias of the same bottom-layer file is opened for writing
    afterwards, the top layer will end up with two independent
    copies, each with a link count of one.

    (It might help to change stat(2) to always return a fake link
    count of one for all files on the bottom layer.)

  - Operations expected to be cheap may suddenly become very
    expensive.  This includes changing file timestamps and
    creating a hardlink to a bottom layer file, both of which
    will trigger a copy-up operation.  (Failing link(2) with
    errno == EXDEV for files within the same directory may at
    first seem acceptable, but many applications make the
    assumption that hardlinks will be supported, and do not
    implement the appropriate fallback code.)

  - Copy-up of large files may become prohibitively slow.  (Think
    of fixing a few tags in your music collection.) File system
    or block layer approaches scale better in this respect.

  - POSIX specifies requirements for readdir that, taken
    together, result in having to compute and cache the entire
    directory contents, removing duplicates and whiteouts.  The
    reason is that processes may arbitrarily seek in the readdir
    stream (see telldir(3) and seekdir(3)).  This is a CPU and
    memory intensive operation.

  - Union mounts operate on files as a whole, and do not provide
    mechanisms for merging files.  When multiple filesystems are
    union mounted, the file on the top layer will "win", which
    may still lead to an inconsistent system.  (Think of files
    such as the rpm database.)

  - Inotify: ?
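The readdir caching problem in particular can be sketched as follows.
The ".wh." whiteout name prefix below is only an assumed convention
(similar naming schemes are used by unionfs and aufs); the point is
that the whole merged listing must be computed before stable seek
offsets can be handed out:

```python
# Sketch of merged readdir: the full union listing has to be computed
# and cached up front, dropping whiteouts and shadowed bottom-layer
# entries, so that telldir()/seekdir() offsets remain stable.

WH = ".wh."   # assumed on-disk whiteout name prefix (illustrative)

def merged_readdir(top_names, bottom_names):
    whiteouts = {n[len(WH):] for n in top_names if n.startswith(WH)}
    visible_top = [n for n in top_names if not n.startswith(WH)]
    # Bottom entries hidden by same-named top entries or whiteouts:
    visible_bottom = [n for n in bottom_names
                      if n not in visible_top and n not in whiteouts]
    return sorted(visible_top + visible_bottom)

listing = merged_readdir(["a", ".wh.b", "c"], ["b", "c", "d"])
assert listing == ["a", "c", "d"]   # "b" whited out, "c" deduplicated
```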


BETWEEN THE FILESYSTEM AND VFS LAYERS: UNIONFS, AUFS

Unionfs and Aufs try to appear as a filesystem to the VFS, and as the
VFS to filesystems.  Some of the namespace handling is done at the
filesystem level, which results in various problems (locking, cache
coherency, etc.).

Both of these implementations include a number of features which our use
cases do not immediately require (multiple writable layers, stable inode
numbers across copy-up, cross-layer hardlinks).  These features lead to
a tremendous increase of the complexity of the implementation though.

 Advantages:

  - Available today.

 Disadvantages:

  - The existing implementations are overly complex, and full of
    bugs.  They are not maintainable.


SOME RANDOM THOUGHTS

In many environments, the underlying read-only image needs to be updated
over time, and overlays based on a previous version of the image need to
be moved to more recent versions.  This will require telling system
files from configuration, and configuration from user data.  From a
kernel / base OS point of view, we can only provide the mechanisms; the
higher-level issues need to be addressed in systems management software.

One mechanism likely to help is layering, in which a snapshot is based
on another snapshot, or multiple layers are union mounted together.
System updates could go into a different layer than configuration, and
user data could go into yet another layer.  All the approaches discussed
support this kind of layering to some degree; all approaches above the
block layer do so in a reasonably scalable way.

Tools will be needed for efficiently analyzing the relative changes
between two layers.  This should be possible by mounting the two
versions next to each other and comparing them using traditional tools.
For some approaches, lower-level tools could be written to achieve
the same results more efficiently.  (Such tools may be difficult to
write, and might not be worth the bother, though.)


CONCLUSION (AS OF MARCH 2009)

The union mount/filesystem problem doesn't have one right and several wrong
solutions; each of the discussed solutions has its own strengths and
weaknesses, even when only considering the union of one read-only and one
writable image.  In the long run and where applicable, in-filesystem
snapshots look the most promising to me.

When different types of filesystems must be mixed (and until
in-filesystem snapshots on Linux have matured), union mounts may well
save the day.  They could be implemented incrementally starting with the
union of two directories, then extended by whiteouts to allow deletes,
and then recursion and opaque directories could be added.  Problems such
as unstable inode numbers and missing support for hardlinks across
layers would remain though; trying to solve them at the VFS layer would
make no sense.  This makes union mounts appropriate only in limited
scenarios; when used beyond this, users would trip over the inherent
defects far too often.

Miklos Szeredi has recently posted an early delta filesystem prototype
based on FUSE [3].  While this approach does not unify filesystems,
it does record the deltas relative to a read-only filesystem.  This
allows recording metadata-only changes without triggering full,
expensive copy-up operations, for example.  Stacking a delta filesystem
on top of read-only union mounts seems interesting, too.  Many of the
details of the delta filesystem design have not been fleshed out yet
though, so we will have to see.


Andreas Gruenbacher,
20 November 2008 and 22 March 2009


REFERENCES

[1] Union Mount Workshop, http://en.opensuse.org/Union_Mount_Workshop

[2] Cache Coherent Networked File System (CRFS),
    http://oss.oracle.com/projects/crfs/

[3] Miklos Szeredi, delta filesystem prototype,
    http://lwn.net/Articles/321391/




Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds