New tricks for XFS
The XFS filesystem has been in the kernel for fifteen years and was used in production on IRIX systems for five years before that. But it might just be time to teach that "old dog" of a filesystem some new tricks, Dave Chinner said, at the beginning of his linux.conf.au 2018 presentation. There are a number of features that XFS lacks when compared to more modern filesystems, such as snapshots and subvolumes; but he has been thinking—and writing code—on a path to get them into XFS.
Some background
XFS is the "original B-tree filesystem", since everything that the filesystem stores is organized in B-trees. They are not actually traditional B-trees; rather, they are a form of B* tree in which each node has a sibling pointer, which allows horizontal traversal of the tree. That kind of traversal is important when looking at features like copy on write (CoW).
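For readers who want a mental picture, a simplified sketch of such a node might look like the following. This is illustrative C only, not XFS's actual on-disk format; the field names are made up for the example.

```c
/*
 * A simplified sketch of a B-tree node with sibling pointers, in the
 * spirit of (but not identical to) XFS's on-disk B-tree blocks.  The
 * left/right sibling links let a reader walk every node at one level
 * of the tree without going back up through the parents.
 */
#include <stdint.h>

#define KEYS_PER_BLOCK 16
#define NULLBLOCK ((uint64_t)-1)

struct btree_block {
    uint16_t level;                  /* 0 for leaves, increasing toward the root */
    uint16_t numrecs;                /* records actually used in this block */
    uint64_t leftsib;                /* block number of the left sibling, or NULLBLOCK */
    uint64_t rightsib;               /* block number of the right sibling, or NULLBLOCK */
    uint64_t keys[KEYS_PER_BLOCK];   /* keys (e.g. file offsets or block numbers) */
    uint64_t ptrs[KEYS_PER_BLOCK];   /* child block numbers, or records in a leaf */
};
```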
An XFS filesystem is split into allocation groups, "which are like mini-filesystems"; they have their own free-space index B-trees, inode B-trees, reverse-mapping B-trees, and so on. File data is referenced by extents, with the help of B-trees. "Directories and attributes are more B-trees"; the directory B-tree is the most complex as it is a "virtually mapped, multiple index B-tree with all sorts of hashing" for scalability.
XFS uses write-ahead journaling for crash resistance. Its journaling is checkpoint-based, which is meant to reduce the write amplification that can result from changing blocks that are already in the journal.
He followed that with a quick overview of CoW filesystems. When a CoW filesystem writes to a block of data or metadata, it first makes a copy of it; in doing so, it needs to update the index tree entries to point to the new block. That leads to modifying the block that holds those entries, which necessitates another copy, thus a modification to the parent index entry, and so on, all the way up to the root of the filesystem. All of those updates can be written together anywhere in the filesystem, which allows lots of optimizations to be done. It also provides consistent on-disk images, since the entire update can be written prior to making an atomic change to the root-level index.
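To make that "copy all the way up to the root" behavior concrete, here is a minimal in-memory sketch; it is illustrative only (error handling omitted) and is not code from any real filesystem. Modifying a leaf produces a new copy of every node on the path to the root, and the update commits by publishing the new root.

```c
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct node {
    struct node *child[FANOUT];  /* NULL in leaves */
    char data[64];               /* payload for leaves, ignored in interior nodes */
};

/* Return a modified copy of a node; the original is left untouched. */
static struct node *cow_copy(const struct node *old)
{
    struct node *new = malloc(sizeof(*new));
    memcpy(new, old, sizeof(*new));
    return new;
}

/*
 * Copy-on-write update along one path: path[0] is the root, path[depth-1]
 * is the leaf being modified, and slot[i] is which child of path[i] leads
 * to path[i+1].  Returns the new root; the old tree remains intact, which
 * is what makes snapshots and atomic "switch the root" commits possible.
 */
struct node *cow_update(struct node **path, int *slot, int depth,
                        const char *newdata)
{
    struct node *child = NULL;

    /* Walk from the leaf back up to the root, copying as we go. */
    for (int i = depth - 1; i >= 0; i--) {
        struct node *copy = cow_copy(path[i]);
        if (i == depth - 1) {
            strncpy(copy->data, newdata, sizeof(copy->data) - 1);
            copy->data[sizeof(copy->data) - 1] = '\0';
        } else {
            copy->child[slot[i]] = child;  /* point at the new child copy */
        }
        child = copy;
    }
    return child;  /* new root; commit by atomically publishing this pointer */
}
```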
All of that is great for crash recovery, he said, but the downside is that it requires that space be allocated for these on-disk updates. That allocation process requires metadata updates, which means a metadata tree update, thus more space needs to be allocated for that. That leads to the problem that the filesystem does not know exactly how much space is going to be needed for a given CoW operation. "That leads to other problems in the future."
These index tree updates are what provide many of the features that are associated with CoW filesystems, Chinner said, "sharing, snapshots, subvolumes, and so on". They are all a natural extension of having an index tree structure that reference-counts objects; that allows multiple indexes to point to the same object by just increasing the reference count on it. Snapshots are simply keeping around an index tree that has been superseded; that can be done by taking a reference to that tree. Replication is done by creating a copy of the tree and all of its objects, which is a complicated process, but "does give us the send-receive-style replication" that users are familiar with.
CoW in XFS is different. Because of the B* trees, it cannot do the leaf-to-tip update that CoW filesystems do; it would require updating laterally as well, which in the worst case means updating the entire filesystem. So CoW in XFS is data-only.
Data-only CoW limits the functionality that XFS can provide; features like deduplication and file cloning are possible, but others are not. The features it does provide are useful for projects like overlayfs and NFS, Chinner said. The advantage of data-only CoW is that there is no impact on non-shared data or metadata. In addition, XFS can always calculate how much space is needed for a CoW operation because only the data is copied; the metadata is updated in place.
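Those data-only CoW features are already reachable from user space through generic VFS ioctls. As a minimal example, cloning one file's extents into another (which is what cp --reflink=always does under the hood) looks roughly like this:

```c
/* Minimal example: clone src into dst using the VFS FICLONE ioctl,
 * which shares data extents on filesystems (such as XFS with reflink
 * enabled) that support it.  No file data is copied. */
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* Share all of src's data extents with dst. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("FICLONE");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}
```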
But, since the metadata updates are not done with CoW, crash resiliency is a bit more difficult; it is not a matter of simply writing a new tree branch and then switching to it atomically. XFS has implemented "deferred operations", which are a kind of "intent logging mechanism", Chinner said. Deferred operations were used for freeing extents in the past, but have been extended to do reference-count and reverse-mapping B-tree updates. That allows replaying CoW updates as part of recovery.
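As a purely illustrative sketch of the intent-logging pattern (this is not XFS's actual log format): the filesystem logs a record saying "I intend to do X", does the work, then logs a matching "done" record; at recovery time, an operation is replayed only if its intent reached the log but its "done" record did not.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: the shape of an intent/done pair in a write-ahead log. */
enum log_rec_type { LOG_INTENT, LOG_DONE };

struct log_rec {
    enum log_rec_type type;
    uint64_t id;        /* matches an intent with its done record */
    uint64_t op;        /* e.g. "free extent", "update refcount", "update rmap" */
    uint64_t args[4];   /* operation-specific payload */
};

/* During recovery: an operation must be replayed if its intent was
 * committed to the log but no matching done record ever made it. */
bool needs_replay(const struct log_rec *log, int nrecs, uint64_t id)
{
    bool intent_seen = false, done_seen = false;

    for (int i = 0; i < nrecs; i++) {
        if (log[i].id != id)
            continue;
        if (log[i].type == LOG_INTENT)
            intent_seen = true;
        else if (log[i].type == LOG_DONE)
            done_seen = true;
    }
    return intent_seen && !done_seen;
}
```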
What is a subvolume?
Thinking about all of that led Chinner to a number of questions about what can be done with data-only CoW. Everyone seems to want subvolume snapshots, but that seems to require CoW operations for metadata. How can the problem be repackaged so that there is a way to implement the same functionality? That is the ultimate goal, of course. He wondered how much of a filesystem was actually needed to implement a subvolume. There are other implementations to look at, so we can learn from them, he said. "What should we avoid? What do they do right?" The good ideas can be stolen—copied—"because that's the easy way".
Going back to first principles, he asked: "what is a subvolume? What does it provide?" From what he can tell, there are three attributes that define a subvolume. It has flexible capacity, so it can grow or shrink without any impact. A subvolume is also a fully functioning filesystem that allows operations like punching holes in files or cloning. The main attribute, though, is that a subvolume is the unit of granularity for snapshots. Everything else is built on top of those three attributes.
He asked: could subvolumes be implemented as a namespace construct that sits atop the filesystem? Bind mounts and mount namespaces already exist in the VFS; he wondered whether those could be used to create something that "looks like and smells like a subvolume". If you add a directory-hierarchy quota on top of a bind mount, the result is a kind of flexible space management. If you "squint hard enough", that is something like a subvolume, he said.
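As a sketch of how those existing pieces fit together (the paths and project ID below are made up, and the actual block limit would still be set separately with xfs_quota or quotactl()):

```c
/*
 * Sketch: make a directory subtree look subvolume-ish with existing pieces.
 *  1. bind-mount the subtree so it appears as its own mount point;
 *  2. tag it with an XFS project ID (with the inherit flag) so a
 *     directory-tree ("project") quota limit can be applied to it.
 * The paths and project ID are hypothetical; the quota limit itself is
 * set elsewhere, and enforcement needs the prjquota mount option.
 */
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *subtree = "/data/projects/alpha";   /* hypothetical */
    const char *mntpoint = "/srv/alpha";            /* hypothetical */

    /* Step 1: bind mount, so the subtree looks like its own filesystem. */
    if (mount(subtree, mntpoint, NULL, MS_BIND, NULL) < 0) {
        perror("mount --bind");
        return 1;
    }

    /* Step 2: assign a project ID that new children will inherit. */
    int fd = open(subtree, O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fsxattr fsx;
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSGETXATTR");
        return 1;
    }
    fsx.fsx_projid = 42;                     /* hypothetical project ID */
    fsx.fsx_xflags |= FS_XFLAG_PROJINHERIT;  /* new children inherit it */
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }

    close(fd);
    return 0;
}
```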
Similarly, a recursive copy operation using --reflink=always can create a kind of snapshot. It still replicates the metadata, but the vast majority of the structure has been cloned without copying the data. Replication can be done with rsync and tar; "sure, it's slow", but there are tools to do that sort of thing. It doesn't really resemble a Btrfs subvolume, for example, but it can still provide the same functionality, Chinner said. In addition, overlayfs copies data and replicates metadata, so it shows that you can provide "something that looks like a subvolume using data-only copy on write".
Another idea might be to implement the subvolume below the filesystem with a device construct of some sort. In fact, we already have that, he said. A filesystem image can be stored in a sparse file, then loopback mounted. That image file can be cloned with data-only CoW, which allows for fast snapshots. The space management is "somewhat flexible", but is limited by what the block layer provides and what filesystems implement. Replication is a simple file copy.
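For reference, the loopback building block amounts to the sequence below, shown as a hedged C sketch with hypothetical paths; in practice mount -o loop does the same thing.

```c
/*
 * Sketch of the "filesystem image in a sparse file" building block:
 * attach an image file to a free loop device and mount it.
 */
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *image = "/srv/volumes/subvol1.img";  /* hypothetical sparse image */
    const char *target = "/mnt/subvol1";             /* hypothetical mount point */

    int ctrl = open("/dev/loop-control", O_RDWR);
    if (ctrl < 0) { perror("loop-control"); return 1; }

    int devnr = ioctl(ctrl, LOOP_CTL_GET_FREE);      /* first unused loop device */
    if (devnr < 0) { perror("LOOP_CTL_GET_FREE"); return 1; }

    char loopdev[32];
    snprintf(loopdev, sizeof(loopdev), "/dev/loop%d", devnr);

    int loopfd = open(loopdev, O_RDWR);
    int imgfd = open(image, O_RDWR);
    if (loopfd < 0 || imgfd < 0) { perror("open"); return 1; }

    if (ioctl(loopfd, LOOP_SET_FD, imgfd) < 0) {     /* back the loop device with the image */
        perror("LOOP_SET_FD");
        return 1;
    }

    if (mount(loopdev, target, "xfs", 0, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```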
What this shows "is that what we think of as a subvolume, we're already using", Chinner said. The building blocks are there, they are just being used in ways that do not make people think of subvolumes.
The loopback filesystem solution suffers from the classic ENOSPC problem, however. If the filesystem that holds the image file runs out of space, it will communicate that by returning ENOSPC, but the filesystem inside the image will not be prepared to handle that failure and things break horribly: "blammo!". This is the same problem that thin provisioning has. It is worse than the CoW filesystem ENOSPC problem, because you can't predict when it will happen and you can't recover when it does, he said.
He returned to the idea of learning from others at that point. Overlayfs and, to a lesser extent, Btrfs have taught us that specifying subvolumes via mount options is "really, really clunky", Chinner said. Btrfs subvolumes share the same superblock, which can cause some subtle issues about how they are treated by various tools like find or backup programs. A subvolume needs to be implemented as an independent VFS entity and not just act like one. "There's only so much you can hide by lying."
The ENOSPC problem is important to solve. The root of the problem is that upper and lower volumes (however defined) have a different view of free-space availability and those two layers do not communicate about it. This problem has been talked about many times at LSFMM (for example, in 2017 and in 2016) without making any real progress. But a while back, Christoph Hellwig came up with a file layout interface for the Parallel NFS (pNFS) server running on top of XFS; it allowed the pNFS client to remotely map files from the server and to allocate blocks from the server. The actual data lives elsewhere and the client does its reads and writes to those locations; so the client is doing its filesystem allocation on the server and then doing the I/O to somewhere else. This provides a model for a cross-layer communication of space accounting and management that is "very instructive".
A new kind of subvolume
He has been factoring all of this into his thinking on a new type of subvolume; one that acts the same as the subvolumes CoW filesystems have, but is implemented quite differently. The kernel could be changed so that it can directly mount image files (rather than via the loopback device) and a device space-management API could be added. If a filesystem implements both sides of that API, image files of the same filesystem type can be used as subvolumes. The API can be used to get the mapping information, which will allow the subvolume to do its I/O directly to the host filesystem's block device. This breaks the longstanding requirement that filesystems must use block devices; with his changes, they can now use files directly.
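Chinner's API has not been posted yet, so the following is only a guess at its shape, intended to make the division of labor concrete; none of these names or signatures come from his patches. The host filesystem would export allocation, mapping, and free-space operations to the client filesystem that was mounted from an image file, and the client would use the returned mappings to submit its I/O directly to the host's block device.

```c
/*
 * Purely hypothetical sketch of a "device space management" interface
 * between a host filesystem (holding the image file) and a client
 * filesystem mounted directly from that file.
 */
#include <stdint.h>

struct space_extent {
    uint64_t file_offset;   /* offset within the image file */
    uint64_t dev_block;     /* where that range really lives on the host's device */
    uint64_t length;        /* bytes */
};

struct space_mgmt_ops {
    /* Client asks the host to allocate backing space for a file range. */
    int (*alloc)(void *host, uint64_t file_offset, uint64_t length,
                 struct space_extent *out);

    /* Client asks for the device mapping of an already-allocated range,
     * so it can submit I/O straight to the host's block device. */
    int (*map)(void *host, uint64_t file_offset, uint64_t length,
               struct space_extent *out);

    /* Client returns space it no longer needs (e.g. after a delete),
     * keeping the host's free-space accounting accurate. */
    int (*free)(void *host, uint64_t file_offset, uint64_t length);

    /* Host tells the client how much space is really available, so the
     * client can return ENOSPC *before* it has dirtied its metadata. */
    int (*query_free)(void *host, uint64_t *bytes_free);
};
```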
But this mechanism will still work for block devices, which will make it useful for thin provisioning as well. The thin-provisioned block device (such as dm-thin) can implement the host side of the space-management API; the filesystem can then use the client-side API for space accounting and I/O mapping. That way the underlying block device will report ENOSPC before the filesystem has modified its structures and issued I/O. That is something of a bonus, he said, but if his idea solves two problems at once, that gives him reason to think he is on the right track.
Snapshots are "really easy in this model". The subvolume is frozen and the image file is cloned. It is fast and efficient. In effect, the subvolume gets CoW metadata even though its filesystem does not implement it; the data-only CoW of the filesystem below (where the image file resides) provides the metadata CoW.
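In user-space terms, that sequence is roughly "freeze the subvolume, reflink its image, thaw". A sketch using the existing FIFREEZE/FITHAW and FICLONE ioctls (hypothetical paths, abbreviated error handling) might look like this:

```c
/*
 * Sketch of "snapshot = freeze the subvolume, clone its image file":
 * FIFREEZE/FITHAW quiesce the mounted subvolume while FICLONE makes a
 * cheap, shared-extent copy of the image on the host filesystem.
 */
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int mnt = open("/mnt/subvol1", O_RDONLY | O_DIRECTORY);       /* mounted subvolume */
    int img = open("/srv/volumes/subvol1.img", O_RDONLY);         /* its image file */
    int snap = open("/srv/volumes/subvol1.snap.img",
                    O_WRONLY | O_CREAT | O_EXCL, 0600);           /* the snapshot */
    if (mnt < 0 || img < 0 || snap < 0) { perror("open"); return 1; }

    if (ioctl(mnt, FIFREEZE, 0) < 0) { perror("FIFREEZE"); return 1; }

    int ret = ioctl(snap, FICLONE, img);    /* share extents; no data copied */
    if (ret < 0)
        perror("FICLONE");

    ioctl(mnt, FITHAW, 0);                  /* always thaw, even on failure */
    return ret < 0;
}
```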
Replication could be done by copying the image files, but there are better ways to do it. Two image files can be compared to determine which blocks have changed between two snapshots. It is quite simple to do and does not require any knowledge of what is in the files being replicated. He implemented a prototype, with XFS filesystems on loopback devices, in 200 lines of shell script driving xfs_io. "It's basically a delta copy" that is independent of what is in the filesystem image; if you had two snapshots of ext4 filesystems, the same code would work, he said.
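A toy version of that delta detection can be written against the generic FIEMAP ioctl: any range whose physical address differs between two reflinked images must have been rewritten since the snapshot was taken, so only those ranges need to be sent. This only illustrates the idea, at a fixed 64KB granularity with abbreviated error handling; the actual prototype used xfs_io and worked on whole extents.

```c
/*
 * Content-agnostic delta detection between two reflinked snapshot images:
 * ranges still backed by the same physical location are shared (unchanged);
 * ranges that differ were rewritten and need to be copied.
 */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Physical byte address backing logical offset 'off' in fd, or 0 for a hole. */
static unsigned long long phys_of(int fd, unsigned long long off)
{
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    unsigned long long phys = 0;

    fm->fm_start = off;
    fm->fm_length = 1;            /* just the extent covering 'off' */
    fm->fm_extent_count = 1;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1)
        phys = fm->fm_extents[0].fe_physical + (off - fm->fm_extents[0].fe_logical);
    free(fm);
    return phys;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <old.img> <new.img>\n", argv[0]);
        return 1;
    }
    int a = open(argv[1], O_RDONLY);
    int b = open(argv[2], O_RDONLY);
    if (a < 0 || b < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(b, &st);
    const unsigned long long step = 65536;   /* comparison granularity */

    for (unsigned long long off = 0; off < (unsigned long long)st.st_size; off += step) {
        /* Ranges still mapped to the same physical blocks are shared
         * between the two reflinked images, i.e. unchanged. */
        if (phys_of(a, off) != phys_of(b, off))
            printf("changed: %llu bytes at offset %llu\n", step, off);
    }
    return 0;
}
```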
There are features that people are asking for that the current CoW filesystems (e.g. Btrfs, ZFS) cannot provide, but this new scheme could. Right now, there is a lot of data shared between files on disk that is not shared once it gets to the page cache. If you have 500 containers based on the same golden image, you can have multiple snapshots being used but each container has its own version of the same file in the cache. "So you have 500 copies of /bin/bash in memory", he said. Overlayfs does this the right way since it shares the one cached version of the unmodified Bash between all of the containers.
His goal is to get that behavior for this new scheme as well. That requires sharing the data in shared extents in the page cache. It is a complex and difficult problem, Chinner said, because the page cache is indexed by file and offset, whereas the only information available for the shared extents is their physical location in the filesystem (i.e. the block number). Instead of doing an exhaustive search in the page cache to see if a shared extent is cached, he is proposing adding a buffer cache that is indexed by block number. XFS already has a buffer cache, but it doesn't have a way to share pages between multiple files. Chinner indicated that Matthew Wilcox was working on solving that particular problem; that solution would be coming "maybe next week", he said with a grin.
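A conceptual sketch of that indexing mismatch (illustrative only, not kernel code): the page cache looks pages up per file, while extent sharing is only visible at the physical-block level, so a second index keyed by block number is needed to find an already-cached copy of shared data.

```c
#include <stdint.h>

struct pagecache_key {        /* how the page cache finds a page today */
    uint64_t inode_number;    /* which file */
    uint64_t page_offset;     /* where in that file */
};

struct bufcache_key {         /* how a shared-extent cache would find it */
    uint32_t device_id;       /* which block device */
    uint64_t block_number;    /* physical location of the shared extent */
};
```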
For a long time people have been saying that you don't need encryption for subvolumes because containers are isolated, but then came Meltdown and Spectre, which broke all that isolation. He thinks that may lead some to want more layers of defense to make it harder to steal their data when that isolation breaks down. Adding the generic VFS file-encryption API to XFS will allow encrypting the image files and/or individual files within a subvolume. There might be something to be gained by adding key management into the space-management API as well.
It is looking like XFS could offer "encrypted, snapshottable, cloned subvolumes with these mechanisms", Chinner said. There is still a lot of work to do to get there, of course; it is still in its early stages.
The management interface that will be presented to users is not nailed down yet; he has been concentrating on getting the technology working before worrying about policy management. How subvolumes are represented, what the host volume looks like to users, and whether everything is a subvolume are all things that need to be worked out. There is also a need to integrate this work with tools like Anaconda and Docker.
None of the code has had any review yet; it all resides on his laptop and servers. Once it gets posted, there will be lots of discussion about the pieces he will need to push into the kernel as well as the XFS-specific parts. There will probably be "a few flame wars around that, a bit of shouting, all the usual melodrama that goes along with doing controversial things". He recommended popcorn.
He then gave a demo (starting around 36:56 in the YouTube video of the talk) of what he had gotten working so far. It is a fairly typical early stage demo, but managed to avoid living up to the names of the subvolume and snapshot, which were "blammo" and "kaboom".
After the demo, Chinner summarized the talk (and the work). He started out by looking at how to get the same functionality as subvolumes, but without implementing copy on write for metadata. The "underlying revelation" was to use files as subvolumes and to treat subvolumes as filesystems. That gives the same functionality as a CoW filesystem for that old dog XFS.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/XFS |
| Conference | linux.conf.au/2018 |
Posted Feb 22, 2018 8:55 UTC (Thu)
by jag (subscriber, #3766)
[Link]
Posted Feb 22, 2018 9:16 UTC (Thu)
by TomH (subscriber, #56149)
[Link] (2 responses)
Posted Feb 22, 2018 23:51 UTC (Thu)
by simcop2387 (subscriber, #101710)
[Link] (1 responses)
Posted Feb 23, 2018 8:44 UTC (Fri)
by TomH (subscriber, #56149)
[Link]
Posted Feb 22, 2018 10:11 UTC (Thu)
by nix (subscriber, #2304)
[Link] (22 responses)
The hard part of all this does indeed appear to be the usual space management and policy palaver. If you're storing snapshots in image files, where do those files reside? On the XFS filesystem that the snapshot is being mounted on, probably, but where? .snapshot above the initial mount point? /.snapshot? But the space-management problem is worse.
You can run out of space in a subvolume when there is plenty of space on the containing fs, because the image file is too small. One could kludge around this a bit by routinely creating subvolumes as sparse files as big as the containing fs -- but since these can obviously only grow, and the inner fs has no idea that every time it writes to a block full of zeroes it's actually doing irreversible space allocation, that then means you can steal space from the containing fs until it runs out of room, even if the subvolume is still rather small. Also, if you expand the containing fs, you probably want to expand all the subvolume files as well... the space management looks *painful* and frankly these things are starting to look not very much like conventional files at all, but like a sort of super-sparse file where the inner and outer filesystems should communicate (at least if they're the same fs) to share free space between all subvolumes and the containing fs. From the POSIX API perspective these would probably look like sparse files that could shrink again, and routinely did, as the contained fs notified the containing fs that it had deallocated a bunch of blocks.
(Oh, also privilege problems emerge, though to my eye they are minor compared to the space management problems. Unprivileged snapshots appear... hard in this model, unless you want unprivileged mount of arbitrary XFS filesystems, which is probably still a bit risky.)
Posted Feb 22, 2018 13:07 UTC (Thu)
by dgc (subscriber, #6611)
[Link] (20 responses)
> but like a sort of super-sparse file
Well, yes. The underlying principle is that subvolumes need to be abstracted from physical storage space; e.g. this is all based on patches to XFS that dissociate the physical size of the filesystem from the amount of space a subvolume presents to the user. This can't be done without sparse storage, be it a file or a thin block device, and the initial patchset is here:
https://marc.info/?l=linux-xfs&m=150900682026621&w=2
> where the inner and outer filesystems should communicate
> (at least if they're the same fs) to share free space between all subvolumes and the containing fs
Yup, this is precisely the mechanism I've implemented to handle the ENOSPC problem. More info in this thread:
https://marc.info/?l=linux-xfs&m=151722047128077&w=2
-Dave.
Posted Feb 22, 2018 13:59 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Feb 22, 2018 14:28 UTC (Thu)
by doublez13 (guest, #122213)
[Link] (2 responses)
Posted Feb 23, 2018 21:46 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
Posted Feb 24, 2018 1:30 UTC (Sat)
by rbrito (guest, #66188)
[Link]
Posted Feb 22, 2018 20:03 UTC (Thu)
by ebiederm (subscriber, #35028)
[Link] (1 responses)
This would require something to not allow access to the subvolume file except to mount it as a subvolume.
Posted Feb 23, 2018 21:52 UTC (Fri)
by dgc (subscriber, #6611)
[Link]
I don't see any fundamental problem with doing this - we're going to want to enable root inside a container to snapshot their own subvolumes, so we're going to need some kind of mechanism to allow non-init namespace access to the management interfaces. We haven't decided on how we're going to present snapshot volume management to userspace yet - this is on my list of requirements for when we start on that aspect of the problem....
Posted Feb 22, 2018 20:50 UTC (Thu)
by clump (subscriber, #27801)
[Link] (6 responses)
Posted Feb 23, 2018 17:05 UTC (Fri)
by wazoox (subscriber, #69624)
[Link] (5 responses)
Posted Feb 23, 2018 19:45 UTC (Fri)
by clump (subscriber, #27801)
[Link] (4 responses)
I also wish there was something akin to an ext5 with an upstream COW implementation.
Posted Feb 23, 2018 21:15 UTC (Fri)
by wazoox (subscriber, #69624)
[Link]
Posted Feb 23, 2018 21:23 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
So old filesystems mostly continue to use older codepaths, or aspects of newer codepaths which are heavily tested by virtue of the weight of existing filesystems out there. Only people willing to take the significant plunge of re-mkfsing are affected by the potential bug load (and, obviously, don't even do this with test filesystems on important machines until you're confident it won't oops :) ).
(The unusual thing about this set of features is that it doesn't seem to need new xfs features at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's *very* impressive to me.)
Posted Feb 23, 2018 22:01 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
> (The unusual thing about this set of features is that it doesn't seem to need new xfs features
> at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's
> *very* impressive to me.)

No, that's not the case. I think I showed the list of experimental features that the demo used in the talk. Sure, reflink and rmapbt are no longer experimental (as of 4.16-rc1), but I added a new "thinspace" feature for thin provisioning awareness and another for "subvol" support, because that also requires special flags in inodes and other such things to mark the filesystem as a subvol.
Overall, there's remarkably little change for the subvol feature - about 1400 lines of new kernel code and ~300 lines of userspace code were needed for the demo I gave during the talk. Compare that to the data-COW feature in XFS that it relies on - we added about 20,000 lines of code for that between kernel and userspace....
Posted Mar 3, 2018 13:58 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Feb 22, 2018 21:42 UTC (Thu)
by nix (subscriber, #2304)
[Link] (6 responses)
I saw the thin provisioning-aware stuff go by, but didn't realise it would have *this* consequence. I mean, thin provisioning is all very well but, unless you're running a data centre or testing filesystems, it isn't really important in your day-to-day life. But everyone likes snapshots :)
Of course this also obsoletes LVM: instead of LVM, you have a single giant XFS filesystem with each "partition" consisting of a sparse disk image containing whatever filesystem you like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.
There isn't even more metadata than there would otherwise be, despite the existence of two filesystems where before you had one, since the outermost one has minimal metadata tracking the space allocation of the contained fs and that's just about all. As lots of subvolumes are created, the extents in that fs tracking the increasingly fragmented thin file containing the subvolumes will grow ever smaller (the same problem CoW filesystems have traditionally had with crazy fragmentation), but defragmenting single files with XFS is something it's been very good at for a long time, and it's a heck of a lot easier than defragmenting a filesystem with CoW metadata would be: the metadata in the contained fs, of course, is completely unchanged by any of this, rather than having to go through rewrite-and-change hell.
I'll admit it. I want all this cool stuff *now* and I'm kicking myself that I re-mkfsed my big fileserver only eight months ago and can't convince myself to redo it like this instantly. :)
Posted Feb 22, 2018 22:12 UTC (Thu)
by rodgerd (guest, #58896)
[Link] (3 responses)
1. Convert big fileserver into two Gluster servers.
2. Reformat them with Cool Stuff one by one.
3. Beg the Gluster maintainers to integrate their tooling with the new XFS features.
Posted Feb 22, 2018 22:44 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
I am vaguely tempted to do something kinda similar with my new desktop, which will likely come with an MMC far too small to hold the fs I want to put on it, but *with* nice fast 10GbE (imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays). I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :) however, it's quite possible that NFSv4 onto a big RAID array over the 10GbE would be *faster*, particularly for writes. In practice I suspect I'll be happy with just nfsroot and ignoring the MMC entirely. (Which means I'll have to thunk to the fileserver to do snapshots etc in any case... though if I go to NFSv4.2 I will at least be able to do reflinks on the client.)
Posted Feb 23, 2018 4:52 UTC (Fri)
by rodgerd (guest, #58896)
[Link] (1 responses)
I find "being a pack rat with more cases and motherboards lying about than is reasonable" seems to work acceptably well ;). The odd number of disks is a bit trickier.
> I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :)
Good heavens. That's a dedication to complexity! I'm moving my performance-sensitive Gluster FSes to use tiering, so I'm in no position to criticise, though.
> imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays
The plunging price of network equipment continues to astonish me. They days where just management was a prohibitively expensive option are long gone, and I'm loving it.
Posted Feb 24, 2018 20:06 UTC (Sat)
by naptastic (guest, #60139)
[Link]
I've learned that the only way to know for sure that you have enough is to have too much.
Posted Feb 23, 2018 22:08 UTC (Fri)
by dgc (subscriber, #6611)
[Link] (1 responses)
> with each "partition" consisting of a sparse disk image containing whatever filesystem you
> like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing
> fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.

No, it doesn't obsolete LVM. You still need something to manage your physical storage that the base XFS filesystem is placed on. What it allows is much more flexible use and management of the space within that XFS filesystem, but it doesn't change how you manage the underlying devices. Especially if you want to grow the base filesystem in future....
Posted Mar 3, 2018 14:00 UTC (Sat)
by nix (subscriber, #2304)
[Link]

> You still need something to manage your physical storage that the base XFS filesystem is placed on.

I'm not sure why one giant filesystem completely filling the physical storage needs management -- but I suppose if it's sitting on a RAID array or something, that array is more growable than if it were a simple disk of that size, so you might still need some layer between the physical storage and the top-level XFS.
Posted Mar 8, 2018 9:32 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link]
Posted Feb 22, 2018 20:19 UTC (Thu)
by dcg (subscriber, #9198)
[Link] (4 responses)
Also, what are your next plans for XFS - add some kind of internal raid capabilities, and then fully become a ZFS-like filesystem? :P
Posted Feb 22, 2018 22:12 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
I am bubbling over thinking of things you can do with this. It's so powerful!
Posted Feb 23, 2018 0:19 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Feb 23, 2018 15:47 UTC (Fri)
by dcg (subscriber, #9198)
[Link] (1 responses)
Now let's imagine that you make a large file in the last subvolume. It will require allocating blocks for its image, which resides in the previous filesystem. If the allocation is big enough, it may also require allocating blocks for that previous subvolume, and blocks for the one below that too - all the way down to the host filesystem. Allocation performance will suffer heavily. In short, XFS needs to replicate space-management structures for each embedded subvolume (and potentially needs to update *all* of them in some extreme cases). In ZFS/btrfs, space management is shared by all subvolumes so it does not have these problems.
This is what leads me to believe that excessive subvolume embedding would be a scalability issue for this model, but may be I'm not understanding it right
Posted Feb 23, 2018 21:26 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(I'll admit that what I want isn't subvolumes at all -- it's snapshotting on a directory granularity level. No subvolumes, no notion of "this is like a smaller subfilesystem", just each subdir can be snapshotted and the snapshots can be forked freely. That's still, I think, something not that hard with a CoW tree fs but incredibly clunky with Dave's scheme. One filesystem image per directory? I don't think that'll scale too well... a complete log and fs structures per directory, uh, no. :) )
Posted Feb 23, 2018 20:51 UTC (Fri)
by tarvin (guest, #4412)
[Link]
Posted Mar 25, 2019 9:45 UTC (Mon)
by ShalokShalom (guest, #111022)
[Link] (1 responses)
Is there any news?
Posted Sep 26, 2021 12:11 UTC (Sun)
by bluss (guest, #47454)
[Link]