|| ||Al Viro <email@example.com>|
|| ||[RFC] shared subtrees|
|| ||Thu, 13 Jan 2005 22:18:51 +0000|
[apologies for delay - there'd been lots of unrelated crap lately]
NOTE: as far as I'm concerned, that's a beginning of VFS-2.7 branch.
All that work will stay in a separate tree, with gradual merge back
into 2.6 once the things start settling down.
OK, here comes the first draft of proposed semantics for subtree
sharing. What we want is being able to propagate events between
the parts of mount trees. Below is a description of what I think
might be a workable semantics; it does *NOT* describe the data
structures I would consider final and there are considerable
areas where we still need to figure out the right behaviour.
Let's start with introducing a notion of propagation node; I consider
it only as a convenient way to describe the desired behaviour - it
almost certainly won't be a data structure in the final variant.
1) each p-node corresponds to a group of 1 or more vfsmounts.
2) there is at most 1 p-node containing a given vfsmount.
3) each p-node owns a possibly empty set of p-nodes and vfsmounts
4) no p-node or vfsmount can be owned by more than one p-node
5) only vfsmounts that are not contained in any p-nodes might be owned.
6) no p-node can own (directly or via intermediates) itself (i.e. the
graph of p-node ownership is a forest).
These guys define propagation:
a) if vfsmounts A and B are contained in the same p-node, events
propagate from A to B
b) if vfsmount A is contained in p-node p, vfsmount B is contained
in p-node q and p owns q, events propagate from A to B
c) if vfsmount A is contained in p-node p and vfsmount B is owned
by p, events propagate from A to B
d) propagation is transitive: if events propagate from A to B and
from B to C, they propagate from A to C.
In other words, members of the same p-node are equivalent and events anywhere
in p-node are propagated to all its slaves. Note that not any transitive
relation can be represented that way; it has to satisfy the following
* A->C and B->C => A->B or B->A
All propagation setups we are going to deal with will satisfy that condition.
How do we set them up?
* we can mark a subtree sharable. Every vfsmount in the subtree
that is not already in some p-node gets a single-element p-node of its
* we can mark a subtree slave. That removes all vfsmounts in
the subtree from their p-nodes and makes them owned by said p-nodes.
p-nodes that became empty will disappear and everything they used to
own will be repossessed by their owners (if any).
* we can mark a subtree private. Same as above, but followed
by taking all vfsmounts in our subtree and making them *not* owned
Of course, namespace operations (clone, mount, etc.) affect that structure
and are affected by it (that's what it's for, after all).
That one is simple - we copy vfsmounts as usual
* if vfsmount A is contained in p-node p, then copy of A goes into
the same p-node
* if A is owned by p, then copy of A is also owned by p
* no new p-nodes are created.
We have a new vfsmount A and want to attach it to mountpoint somewhere in
vfsmount B. If B does not belong to any p-node, everything is as usual; A
doesn't become a member or slave of any p-node and is simply attached to B.
If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events
propagated from B and all p-nodes p1,...,pk that contain them.
* A gets cloned into n copies and these copies (A1,...,An) are attached
to corresponding points in B1,...,Bn.
* k new p-nodes (q1,...,qk) are created
* Ai is contained in qj <=> Bi is contained in qj
* qi owns qj <=> pi owns pj
* qi owns Aj <=> pi owns Bj
In other words, mount is propagated and propagation among the new vfsmounts
mirrors the propagation between mountpoints.
bind works almost identically to mount; new vfsmount is created for every
place that gets propagation from mountpoint and propagation is set up to
mirror that between the mountpoints. However, there is a difference: unlike
the case of mount, vfsmount we were going to attach (say it, A) has some
history - it was created as a copy of some pre-existing vfsmount V. And
that's where the things get interesting:
* if V is contained in some p-node p, A is placed into the same
p-node. That may require merging one of the p-nodes we'd just created
with p (that will be the counterpart of the p-node containing the mountpoint).
* if V is owned by some p-node p, then A (or p-node containing A)
becomes owned by p.
rbind is recursive bind, so we just do binds for everything we had in
a subtree we are binding in obvious order; everything is described
by previous case.
umount everything that gets propagation from victim.
6. mount --move
prohibited if what we are moving is in some p-node, otherwise we move
as usual to intended mountpoint and create copies for everything that
gets propagation from there (as we would do for rbind).
similar to --move
How to use all that stuff?
mount --bind /floppy /floppy
mount --make-shared /floppy
mount --rbind / /jail
<finish setting the jail up, umount whatever doesn't belong there,
mount --make-slave /jail/floppy
and we get /floppy in chroot jail slave to /floppy outside - if somebody
(u)mounts stuff on it, that will get propagated to jail.
same, but with the namespaces instead of chroots.
same subtree visible (and kept in sync) in several places - just
mark it shared and rbind; it will stay in sync
have some daemon control the stuff in a subtree sharable with many
namespaces, chroots, etc. without any magic:
mark that subtree sharable
clone with CLONE_NS
parent marks that subtree slave
child keeps working on the tree in its private namespace.
There's a lot more applications of the same idea, of course - AFS and its
ilk, autofs-like stuff (with proper handling of MNT_EXPIRE and traps - see
below), etc., etc.
Areas where we still have to figure things out:
* MNT_EXPIRE handling done right; there are some fun ideas in that area,
but they still need to be done in more details (basically, lazy expire -
mount in a slave expiring into a trap that would clone a copy from master
when stepped upon).
* traps and their sharing. What we want is an ability to use the master/slave
mechanisms for *all* cross-namespace/cross-chroot issues in autofs, so that
daemon would only need to work with the namespace of its own and no nothing
about other instances.
* implementation ;-) It certainly looks reasonably easy to do; memory
demands are linear by number of vfsmounts involved and locking appears
to be solvable.
* whatever issues that might come up from MVFS demands (and AFS, and...)
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to firstname.lastname@example.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/