LWN: Comments on "New tricks for XFS" https://lwn.net/Articles/747633/ This is a special feed containing comments posted to the individual LWN article titled "New tricks for XFS". en-us Sun, 05 Oct 2025 10:29:03 +0000 Sun, 05 Oct 2025 10:29:03 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net New tricks for XFS https://lwn.net/Articles/870535/ https://lwn.net/Articles/870535/ bluss <div class="FormattedComment"> August 2021 status here: <a rel="nofollow" href="https://www.spinics.net/lists/linux-xfs/msg55316.html">https://www.spinics.net/lists/linux-xfs/msg55316.html</a><br> </div> Sun, 26 Sep 2021 12:11:56 +0000 New tricks for XFS https://lwn.net/Articles/783885/ https://lwn.net/Articles/783885/ ShalokShalom <div class="FormattedComment"> Hi Dave <br> <p> Is there any news? <br> </div> Mon, 25 Mar 2019 09:45:53 +0000 New tricks for XFS https://lwn.net/Articles/748813/ https://lwn.net/Articles/748813/ nilsmeyer <div class="FormattedComment"> Correct me if I'm wrong, but since we have discard these days, there should be no irrevocable space allocations. <br> </div> Thu, 08 Mar 2018 09:32:24 +0000 New tricks for XFS https://lwn.net/Articles/748465/ https://lwn.net/Articles/748465/ nix <blockquote> You still need something to manage your physical storage that the base XFS filesystem is placed on. </blockquote> I'm not sure why one giant filesystem completely filling the physical storage <i>needs</i> management -- but I suppose if it's sitting on a RAID array or something, that array is more growable than if it were a simple disk of that size, so you might still need some layer between the physical storage and the top-level XFS. Sat, 03 Mar 2018 14:00:53 +0000 New tricks for XFS https://lwn.net/Articles/748464/ https://lwn.net/Articles/748464/ nix <div class="FormattedComment"> Ah right, a few small additions then: still surprisingly little. A couple of new flags don't sound like the kind of thing that would require a v6 filesystem and a new mkfs, unlike the massive addition which was rmapbt and all that it implies.<br> </div> Sat, 03 Mar 2018 13:58:34 +0000 New tricks for XFS https://lwn.net/Articles/747956/ https://lwn.net/Articles/747956/ naptastic <div class="FormattedComment"> <font class="QuotedText">&gt; I find "being a pack rat with more cases and motherboards lying about than is reasonable" seems to work acceptably well ;). The odd number of disks is a bit trickier.</font><br> <p> I've learned that the only way to know for sure that you have enough is to have too much.<br> </div> Sat, 24 Feb 2018 20:06:56 +0000 New tricks for XFS https://lwn.net/Articles/747941/ https://lwn.net/Articles/747941/ rbrito <div class="FormattedComment"> Sorry for not having watched your talk yet, but is there any possibility of transparent compression on the filesystem?<br> </div> Sat, 24 Feb 2018 01:30:01 +0000 New tricks for XFS https://lwn.net/Articles/747931/ https://lwn.net/Articles/747931/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; Of course this also obsoletes LVM: instead of LVM, you have a single giant XFS filesystem</font><br> <font class="QuotedText">&gt; with each "partition" consisting of a sparse disk image containing whatever filesystem you</font><br> <font class="QuotedText">&gt; like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing</font><br> <font class="QuotedText">&gt; fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.</font><br> <p> No, it doesn't obsolete LVM. 
You still need something to manage your physical storage that the base XFS filesystem is placed on. What it allows is much more flexible use and management of the space within that XFS filesystem, but it doesn't change how you manage the underlying devices. Especially if you want to grow the base filesystem in future....<br> </div> Fri, 23 Feb 2018 22:08:52 +0000 New tricks for XFS https://lwn.net/Articles/747930/ https://lwn.net/Articles/747930/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; (The unusual thing about this set of features is that it doesn't seem to need new xfs features</font><br> <font class="QuotedText">&gt; at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's</font><br> <font class="QuotedText">&gt; *very* impressive to me.)</font><br> <p> No, that's not the case. I think I showed the list of experimental features that the demo used in the talk. Sure, reflink and rmapbt are no longer experimental (as of 4.16-rc1), but I added a new "thinspace" feature for thin provisioning awareness and another for "subvol" support, because that also requires special flags in inodes and other such things to mark the filesystem as a subvol.<br> <p> Overall, there's remarkably little change for the subvol feature - about 1400 lines of new kernel code and ~300 lines of userspace code were needed for the demo I gave during the talk. Compare that to the data-COW feature in XFS that it relies on - we added about 20,000 lines of code for that between kernel and userspace....<br> </div> Fri, 23 Feb 2018 22:01:22 +0000 Unprivileged mounts https://lwn.net/Articles/747928/ https://lwn.net/Articles/747928/ dgc <div class="FormattedComment"> That's really a separate problem. :P<br> <p> I don't see any fundamental problem with doing this - we're going to want to enable root inside a container to snapshot their own subvolumes, so we're going to need some kind of mechanism to allow non-init namespace access to the management interfaces. We haven't decided on how we're going to present snapshot volume management to userspace yet - this is on my list of requirements for when we start on that aspect of the problem....<br> </div> Fri, 23 Feb 2018 21:52:15 +0000 New tricks for XFS https://lwn.net/Articles/747927/ https://lwn.net/Articles/747927/ dgc <div class="FormattedComment"> Yes. I mentioned that in the talk.<br> </div> Fri, 23 Feb 2018 21:46:23 +0000 New tricks for XFS https://lwn.net/Articles/747926/ https://lwn.net/Articles/747926/ nix <div class="FormattedComment"> Agreed, that might be problematic if you have huge numbers of subsubsubvolumes -- but wouldn't that be problematic for other similar filesystems too? (The volume of log updates in all these layers is the other possible problem.)<br> <p> (I'll admit that what I want isn't subvolumes at all -- it's snapshotting at directory granularity. No subvolumes, no notion of "this is like a smaller subfilesystem", just each subdir can be snapshotted and the snapshots can be forked freely. That's still, I think, something not that hard with a CoW tree fs but incredibly clunky with Dave's scheme. One filesystem image per directory? I don't think that'll scale too well... a complete log and fs structures per directory, uh, no. 
:) )<br> </div> Fri, 23 Feb 2018 21:26:32 +0000 New tricks for XFS https://lwn.net/Articles/747925/ https://lwn.net/Articles/747925/ nix <div class="FormattedComment"> The way this is done with XFS is that unlike ext4, old filesystems are rarely dynamically upgradable to add new features: to gain reflinks or the reverse-mapping btree you must re-mkfs, and until the feature is declared stable (4.15 for those features) it gets a warning at mount time and requires non-default mkfs options at mkfs time. After that, the new feature becomes the default.<br> <p> So old filesystems mostly continue to use older codepaths, or aspects of newer codepaths which are heavily tested by virtue of the weight of existing filesystems out there. Only people willing to take the significant plunge of re-mkfsing are affected by the potential bug load (and, obviously, don't even do this with test filesystems on important machines until you're confident it won't oops :) ).<br> <p> (The unusual thing about this set of features is that it doesn't seem to need new xfs features at all: any fs with the existing reflink and rmapbt features can take advantage of it! That's *very* impressive to me.)<br> </div> Fri, 23 Feb 2018 21:23:16 +0000 New tricks for XFS https://lwn.net/Articles/747924/ https://lwn.net/Articles/747924/ wazoox <div class="FormattedComment"> Well, different filesystems with different philosophies. ext2/3/4 touts compatibility, XFS touts performance. For many years XFS has gradually added incompatible features (but with options to make compatible filesystems in xfsprogs).<br> </div> Fri, 23 Feb 2018 21:15:16 +0000 FS shrinking https://lwn.net/Articles/747923/ https://lwn.net/Articles/747923/ tarvin <div class="FormattedComment"> I would be happy "just" to have FS shrinking support in XFS.<br> </div> Fri, 23 Feb 2018 20:51:21 +0000 New tricks for XFS https://lwn.net/Articles/747920/ https://lwn.net/Articles/747920/ clump <div class="FormattedComment"> If you create a new version, you're free to add features like shrinking, subvolumes, etc, without a hodgepodge of arcane mount options and compatibility hacks. Users averse to change could stay on the original XFS. Witness the migration from ext3 to ext4, the latter of which happened to implement features found in XFS.<br> <p> I also wish there were something akin to an ext5 with an upstream COW implementation.<br> </div> Fri, 23 Feb 2018 19:45:49 +0000 New tricks for XFS https://lwn.net/Articles/747918/ https://lwn.net/Articles/747918/ wazoox <div class="FormattedComment"> There have already been incompatible changes in XFS in the past, quite recently in fact with the addition of metadata checksumming. As long as there are available bit flags to warn of new incompatible features, there is no reason to create such a radical change.<br> </div> Fri, 23 Feb 2018 17:05:17 +0000 New tricks for XFS https://lwn.net/Articles/747891/ https://lwn.net/Articles/747891/ dcg <div class="FormattedComment"> I am not sure I understand. What I was trying to say is that, as I understand it, there is a metadata structure that the copy-on-write mechanism can't share: space management. As you create subvolumes inside of subvolumes, each one of these "matryoshka" subvolumes will have to handle its internal space (e.g. the free space tree) in some way. At first, the CoW mechanism will share the blocks that store the free space information, of course. But once you start using the filesystems, they will differ quickly. 
With N nested subvolumes, you will have N free space trees, one for each (plus the host's).<br> <p> Now let's imagine that you make a large file in the last subvolume. It will require allocating blocks for its image residing in the previous filesystem. If the allocation is big enough, it may also require allocating blocks for that previous subvolume's image, which in turn may have to allocate blocks too - all the way up to the host filesystem. Allocation performance will suffer heavily. In short, XFS needs to replicate space management structures for each embedded subvolume (and potentially needs to update *all* of them in some extreme cases). In ZFS/btrfs, space management is shared by all subvolumes so it does not have these problems.<br> <p> This is what leads me to believe that excessive subvolume embedding would be a scalability issue for this model, but maybe I'm not understanding it right.<br> </div> Fri, 23 Feb 2018 15:47:53 +0000 New tricks for XFS https://lwn.net/Articles/747873/ https://lwn.net/Articles/747873/ TomH <div class="FormattedComment"> Well, I know HFS used B-Trees by late 1992 or early 1993 when I was writing code to read Mac floppies on RISC OS systems.<br> </div> Fri, 23 Feb 2018 08:44:06 +0000 New tricks for XFS https://lwn.net/Articles/747869/ https://lwn.net/Articles/747869/ rodgerd <div class="FormattedComment"> <font class="QuotedText">&gt; I'm not sure how you convert one big machine into two servers. With an axe?</font><br> <p> I find "being a pack rat with more cases and motherboards lying about than is reasonable" seems to work acceptably well ;). The odd number of disks is a bit trickier.<br> <p> <font class="QuotedText">&gt; I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :)</font><br> <p> Good heavens. That's a dedication to complexity! I'm moving my performance-sensitive Gluster FSes to use tiering, so I'm in no position to criticise, though.<br> <p> <font class="QuotedText">&gt; imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays</font><br> <p> The plunging price of network equipment continues to astonish me. The days when just management was a prohibitively expensive option are long gone, and I'm loving it.<br> <p> <p> </div> Fri, 23 Feb 2018 04:52:12 +0000 New tricks for XFS https://lwn.net/Articles/747860/ https://lwn.net/Articles/747860/ nix <div class="FormattedComment"> OK, having seen the talk, it's clear that what is currently implemented requires whole new fs images: the entire subvolume is the unit of snapshotting. However... I don't see why that is necessarily the case. (Though having two files which are partially shared blocks and partially CoWed blocks, which would be a requirement for doing it this way, *is* genuinely new and might have to wait for xfs v6. Or v-infinity if the idea is simply insane. :) ).<br> </div> Fri, 23 Feb 2018 00:19:09 +0000 New tricks for XFS https://lwn.net/Articles/747856/ https://lwn.net/Articles/747856/ simcop2387 <div class="FormattedComment"> HFS has existed since then, but did it actually use B-Trees from the beginning? I know it's had a number of updates and extensions since the original inception. 
And it was also eventually supplanted by HFS+.<br> </div> Thu, 22 Feb 2018 23:51:09 +0000 New tricks for XFS https://lwn.net/Articles/747851/ https://lwn.net/Articles/747851/ nix <div class="FormattedComment"> I'm not sure how you convert one big machine into two servers. With an axe? (It has an odd number of disks in a RAID array -- not something I can evenly split :) ).<br> <p> I am vaguely tempted to do something kinda similar with my new desktop, which will likely come with an MMC far too small to hold the fs I want to put on it, but *with* nice fast 10GbE (imagine my surprise to find that managed switches with a couple of 10GbE ports on have plunged in price to under £200 nowadays). I was thinking of putting Lustre on it as a way to split the metadata onto the MMC and the data onto the big machine over the 10GbE :) however, it's quite possible that NFSv4 onto a big RAID array over the 10GbE would be *faster*, particularly for writes. In practice I suspect I'll be happy with just nfsroot and ignoring the MMC entirely. (Which means I'll have to thunk to the fileserver to do snapshots etc in any case... though if I go to NFSv4.2 I will at least be able to do reflinks on the client.)<br> </div> Thu, 22 Feb 2018 22:44:14 +0000 New tricks for XFS https://lwn.net/Articles/747845/ https://lwn.net/Articles/747845/ rodgerd <div class="FormattedComment"> <font class="QuotedText">&gt; I'll admit it. I want all this cool stuff *now* and I'm kicking myself that I re-mkfsed my big fileserver only eight months ago and can't convince myself to redo it like this instantly. :)</font><br> <p> 1. Convert big fileserver into two Gluster servers.<br> 2. Reformat them with Cool Stuff one by one.<br> 3. Beg the Gluster maintainers to integrate their tooling with the new XFS features.<br> </div> Thu, 22 Feb 2018 22:12:31 +0000 New tricks for XFS https://lwn.net/Articles/747844/ https://lwn.net/Articles/747844/ nix <div class="FormattedComment"> As far as I can see, you can make subvolumes and subsubvolumes and subsubsubvolumes in a single image (with one 'containing' and one 'contained' fs) by just CoWing the data making up the subvolume and asking the container to CoW the metadata (which to it is data). After all, you don't need to CoW entire files at once: you can CoW *bits* of them, so XFS can just CoW the pieces of itself that contain the metadata you are making a subvolume out of. I *think* you don't need multiple levels (which means you don't need mount rights or root privileges, unless it is decided that you need them in order to ask the containing fs to do a CoW) but they should work just as well -- you make a sparse image the size of the fs it's contained in, and allocations trickle out through all the layers, as do trim requests when deallocations happen (which turn that bit of the non-sparse file sparse again).<br> <p> I am bubbling over thinking of things you can do with this. It's so powerful!<br> </div> Thu, 22 Feb 2018 22:12:12 +0000 New tricks for XFS https://lwn.net/Articles/747830/ https://lwn.net/Articles/747830/ nix <div class="FormattedComment"> That's *clever*. (And I clearly need to watch your talk, which has been burning a hole in my disk space for a few weeks now.)<br> <p> I saw the thin provisioning-aware stuff go by, but didn't realise it would have *this* consequence. I mean, thin provisioning is all very well but unless you're running a data centre or testing filesystems, it isn't really important in your day-to-day life. 
But everyone likes snapshots :)<br> <p> Of course this also obsoletes LVM: instead of LVM, you have a single giant XFS filesystem with each "partition" consisting of a sparse disk image containing whatever filesystem you like (as long as it supports trimming of unused regions): if it's XFS, it asks the containing fs to CoW the appropriate metadata regions whenever a subvolume is created, and bingo.<br> <p> There isn't even more metadata than there would otherwise be, despite the existence of two filesystems where before you had one, since the outermost one has minimal metadata tracking the space allocation of the contained fs and that's just about all. As lots of subvolumes are created, the extents in that fs tracking the increasingly fragmented thin file containing the subvolumes will grow ever smaller (the same problem CoW filesystems have traditionally had with crazy fragmentation), but defragmenting single files is something XFS has been very good at for a long time, and it's a heck of a lot easier than defragmenting a filesystem with CoW metadata would be: the metadata in the contained fs, of course, is completely unchanged by any of this, rather than having to go through rewrite-and-change hell.<br> <p> I'll admit it. I want all this cool stuff *now* and I'm kicking myself that I re-mkfsed my big fileserver only eight months ago and can't convince myself to redo it like this instantly. :)<br> </div> Thu, 22 Feb 2018 21:42:34 +0000 New tricks for XFS https://lwn.net/Articles/747836/ https://lwn.net/Articles/747836/ clump <div class="FormattedComment"> Forgive me if this has been covered. Would creating "XFS2" with perhaps incompatible changes make some of your work easier than modifying the existing XFS codebase?<br> </div> Thu, 22 Feb 2018 20:50:28 +0000 New tricks for XFS https://lwn.net/Articles/747835/ https://lwn.net/Articles/747835/ dcg <div class="FormattedComment"> What about subvolumes inside of subvolumes? I guess they are possible, but it will require coordinating ENOSPC management between all of them. I guess that too many embedded subvolumes will be a worst case for this model?<br> <p> Also, what are your next plans for XFS - add some kind of internal RAID capabilities, and then fully become a ZFS-like filesystem? :P<br> </div> Thu, 22 Feb 2018 20:19:13 +0000 Unprivileged mounts https://lwn.net/Articles/747833/ https://lwn.net/Articles/747833/ ebiederm <div class="FormattedComment"> Any chance this can be done in such a way that we could allow root in a user namespace to mount the subvolume?<br> <p> This would require something to not allow access to the subvolume file except to mount it as a subvolume.<br> <p> </div> Thu, 22 Feb 2018 20:03:49 +0000 New tricks for XFS https://lwn.net/Articles/747790/ https://lwn.net/Articles/747790/ doublez13 <div class="FormattedComment"> Are you currently looking into integrating fscrypt into XFS?<br> </div> Thu, 22 Feb 2018 14:28:08 +0000 New tricks for XFS https://lwn.net/Articles/747789/ https://lwn.net/Articles/747789/ Paf <div class="FormattedComment"> Quite an impressive set of tricks, Dave! Nice work.<br> </div> Thu, 22 Feb 2018 13:59:56 +0000 New tricks for XFS https://lwn.net/Articles/747781/ https://lwn.net/Articles/747781/ dgc <div class="FormattedComment"> I'm not surprised by your reaction. I've lost count of the number of people who have told me what I demonstrated in my talk can't be done.... :P<br> <p> <font class="QuotedText">&gt; but like a sort of super-sparse file </font><br> <p> Well, yes. 
The underlying principle is that subvolumes need to be abstracted from physical storage space - e.g. this is all based on patches to XFS that dissociate the physical size of the filesystem from the amount of space a subvolume presents to the user. This can't be done without sparse storage, be it a file or a thin block device, and the initial patchset is here:<br> <p> <a href="https://marc.info/?l=linux-xfs&amp;m=150900682026621&amp;w=2">https://marc.info/?l=linux-xfs&amp;m=150900682026621&amp;w=2</a><br> <p> <font class="QuotedText">&gt; where the inner and outer filesystems should communicate</font><br> <font class="QuotedText">&gt; (at least if they're the same fs) to share free space between all subvolumes and the containing fs</font><br> <p> Yup, this is precisely the mechanism I've implemented to handle the ENOSPC problem. More info in this thread:<br> <p> <a href="https://marc.info/?l=linux-xfs&amp;m=151722047128077&amp;w=2">https://marc.info/?l=linux-xfs&amp;m=151722047128077&amp;w=2</a><br> <p> -Dave.<br> </div> Thu, 22 Feb 2018 13:07:34 +0000 New tricks for XFS https://lwn.net/Articles/747776/ https://lwn.net/Articles/747776/ nix <div class="FormattedComment"> There are not many articles where I start laughing from the sheer power of the abstraction halfway through. This was one of them.<br> <p> The hard part of all this does indeed appear to be the usual space management and policy palaver. If you're storing snapshots in image files, where do those files reside? On the XFS filesystem that the snapshot is being mounted on, probably, but where? .snapshot above the initial mount point? /.snapshot? But the space-management problem is worse.<br> <p> You can run out of space in a subvolume when there is plenty of space on the containing fs, because the image file is too small. One could kludge around this a bit by routinely creating subvolumes as sparse files as big as the containing fs -- but since these can obviously only grow, and the inner fs has no idea that every time it writes to a block full of zeroes it's actually doing irreversible space allocation, this means you can steal space from the containing fs until it runs out of room, even if the subvolume is still rather small. Also, if you expand the containing fs, you probably want to expand all the subvolume files as well... the space management looks *painful* and frankly these things are starting to look not very much like conventional files at all, but like a sort of super-sparse file where the inner and outer filesystems should communicate (at least if they're the same fs) to share free space between all subvolumes and the containing fs. From the POSIX API perspective these would probably look like sparse files that could shrink again, and routinely did, as the contained fs notified the containing fs that it had deallocated a bunch of blocks.<br> <p> (Oh, also privilege problems emerge, though to my eye they are minor compared to the space management problems. Unprivileged snapshots appear... 
hard in this model, unless you want unprivileged mount of arbitrary XFS filesystems, which is probably still a bit risky.)<br> </div> Thu, 22 Feb 2018 10:11:05 +0000 New tricks for XFS https://lwn.net/Articles/747773/ https://lwn.net/Articles/747773/ TomH <div class="FormattedComment"> Apple HFS predates XFS by eight years according to Wikipedia and also uses B-Trees for the directory tree and extents (<a href="https://en.wikipedia.org/wiki/Hierarchical_File_System#Design">https://en.wikipedia.org/wiki/Hierarchical_File_System#De...</a>); indeed, strictly speaking, I think the catalog at least has horizontal pointers between the leaf nodes, making it a B+ tree.<br> </div> Thu, 22 Feb 2018 09:16:09 +0000 New tricks for XFS https://lwn.net/Articles/747772/ https://lwn.net/Articles/747772/ jag <div class="FormattedComment"> I found the paragraph on pNFS a bit hard to understand. From reading the linked site, the idea is that you go through the server for the metadata while having direct access to the (remote) data storage, thus removing the server as a bottleneck for the bulk of the reads and writes.<br> </div> Thu, 22 Feb 2018 08:55:58 +0000
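
[Editor's sketch: the thread above keeps returning to one mechanism - a subvolume lives in a sparse image file, and when the contained filesystem discards an unused region, the containing filesystem can punch a hole in the image so the space flows back out (nilsmeyer's "no irrevocable space allocations", nix's "sparse files that could shrink again"). The snippet below is a minimal illustration of that hole-punching step using the generic Linux fallocate(2) interface. It is not code from Dave Chinner's patches, and the image path and region are invented for the example.]

<pre>
/* Sketch: return a discarded region of a subvolume's sparse image file
 * to the containing filesystem by punching a hole in the image.
 * Build: cc -o punch punch.c */
#define _GNU_SOURCE
#include <fcntl.h>      /* open(), fallocate(), FALLOC_FL_* */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical sparse image backing a subvolume. */
    const char *image = "/srv/subvol/guest.img";

    int fd = open(image, O_RDWR);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Suppose the contained fs freed 64MiB at offset 1GiB and issued a
     * discard for it: punching a hole deallocates those extents in the
     * containing fs while the file's apparent size stays the same, so
     * the space really is returned rather than irrevocably allocated. */
    off_t offset = 1024LL * 1024 * 1024;
    off_t length = 64LL * 1024 * 1024;

    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, length) < 0) {
        perror("fallocate(PUNCH_HOLE)");
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}
</pre>

For what it's worth, the kernel's loop driver already performs essentially this translation, turning discard requests issued by a filesystem mounted on a loop device into hole punches on the backing file - which is why the "allocations trickle out through all the layers, as do trim requests" behaviour nix describes is plausible with today's interfaces.
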