LWN: Comments on "Changing filesystem resize patterns" https://lwn.net/Articles/894629/ This is a special feed containing comments posted to the individual LWN article titled "Changing filesystem resize patterns". en-us Fri, 19 Sep 2025 16:10:12 +0000 Fri, 19 Sep 2025 16:10:12 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Changing filesystem resize patterns https://lwn.net/Articles/896067/ https://lwn.net/Articles/896067/ sammythesnake <div class="FormattedComment"> Wouldn&#x27;t it make sense to use the same logic for the journal as for the filesystem as a whole? I.e. allocate a huge address space for it but only use as much as makes sense for the amount that&#x27;s appropriate for the size of the filesystem actually used. The &quot;spare&quot; wouldn&#x27;t need to be backed by actual storage capacity as it&#x27;s never touched.<br> <p> That way, when the journal needs to grow to suit a growing filesystem, the address space has already been reserved to keep the journal contiguous and all that needs to change is that the address space is now backed by whatever provisioning setup is being used and the FS code now using that space.<br> <p> I don&#x27;t know much about filesystem design, but perhaps a similar approach could also apply to things like inodes - allocate lots of them but only use as many as are needed, leaving the others unprovisioned.<br> </div> Tue, 24 May 2022 10:08:12 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895990/ https://lwn.net/Articles/895990/ cortana <blockquote>In addition, many cloud providers charge their customers based on the size of the virtual block device being used, which means that customers have good reason to wait until the filesystem is nearly full before expanding it. A common pattern is that once a filesystem gets, say, 99% full, another 1GB or 5GB are added; that pattern repeats over and over for the filesystem. "That tends to result in worst-case filesystem fragmentation." Most filesystems are not designed to work well when running nearly full, he said.</blockquote> <p>In the case of XFS, for instance, is it better to use lvmthin(7) to thinly provision storage, adding more PVs to the VG containing a thin pool LV before it runs out of space; or is it better to resize and then run xfs_fsr to defrag? <p>I suppose that: if you've been 99% full for a while, and have lots of fragmentation, and then only expand by a small amount; is it possible there's not enough free space to allow ideal defragmentation? Or does the defragmentation operation potentially just take a long time? Mon, 23 May 2022 11:33:44 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895971/ https://lwn.net/Articles/895971/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; That isn&#x27;t at all what I was suggesting - sorry that I wasn&#x27;t clear.</font><br> <font class="QuotedText">&gt; I was suggesting something like:</font><br> <font class="QuotedText">&gt; First Gigabyte is 10 100MB AGs</font><br> <font class="QuotedText">&gt; Next 9 Gigabytes are 9 1GB AGs</font><br> <font class="QuotedText">&gt; Next 90GB are 9 10GB AGs</font><br> <font class="QuotedText">&gt; Next 900GB are 9 100GB AGs</font><br> <font class="QuotedText">&gt; etc</font><br> <p> Yes, that&#x27;s kind of what I assumed you were suggesting. It solves some of the &quot;find the AG&quot; corner cases, but it introduces a whole new set of problems that we&#x27;d have to solve. 
Compared to dumping the filesystem on dm-thin and changing a couple of hundred lines of accounting code, it&#x27;s a massive undertaking.<br> <p> <font class="QuotedText">&gt; The AGs used for allocating files to device locations do not necessarily have to match the block groups used for placing inodes and other metadata.</font><br> <p> XFS doesn&#x27;t have &quot;block groups&quot;. It has allocation groups - they are the units that hold inodes, metadata and data.<br> <p> By default, XFS has no limitation on the AGs which can hold inodes and metadata within the filesystem. But there are user selectable allocation policies which place arbitrary constraints on where inodes and metadata can be placed. That&#x27;s why I mentioned inode32 - the implications of full 1TB sized AGs and inode32 are that only AG 0 can hold inodes. Hence the layout you suggest above will only provide 100MB of physical inode space w/ inode32 which equates to a hard cap of roughly 100k inodes no matter how large the fs is grown to. That is obviously not enough inodes for even a root disk on a modern system. Hence for inode32, we can&#x27;t consider using a physical AG 0 size of less than 10GB for a filesystem with variable sized AGs that could be grown into the TBs range. <br> <p> It also makes no sense if the filesystem starts large (e.g. we&#x27;re starting with a capacity of TBs), and so we now have two completely different filesystem layouts - one for tiny filesystems that can grow, and one for anything that starts at over 500GB. That&#x27;s not really sustainable from a maintenance and longevity POV - we need to reduce complexity and the size of the configuration matrix, not make it worse.<br> <p> <font class="QuotedText">&gt; In my above example on a 1TB device, XFS could treat the first 28 on-disk AGs as a single 100GB group for allocation, then each of the remaining AGs as other 100GB allocation groups. </font><br> <p> Unfortunately, we can&#x27;t do that easily. Locking, transaction reservations, etc all mean we have to treat AGs as independent entities. <br> <p> For allocation policy purposes, we can group sequential AGs together (I was looking at this as &quot;concatenated groups&quot; to treat multiple TB of contiguous disk space as a single free space entity back in ~2007), but we still have to treat them as individual AGs when it comes to addressing, allocation, reflink and rmap tracking, etc. Hence it doesn&#x27;t really solve any of the problems with having lots of small AGs in a large filesystem - it just hides them under another layer of abstraction.<br> <p> There are also performance implications of lots of AGs on spinning disks. At 4 AGs (default), the cost of seeking between data in different AGs is just on the edge of the cliff. We lose about 5% sustained performance on spinning disks at 4AGs vs 1AG (which is why ext4 is always a touch faster on single spinning disks than XFS). By 10 AGs the performance is halved because of the amount of seek thrashing between AGs that occurs as the data sets are spread across the internal address space. IOWs, from a performance perspective on spinning disks, we really want to avoid a large number of small AGs in a small filesystem if at all possible. 
To fix this we basically have to completely redesign the allocator algorithms to not treat all AGs equally...<br> <p> <font class="QuotedText">&gt; But as you suggest, there is a big difference between such abstract musings and the hard practical realities of changing code in a production filesystem that many bytes of data are depending on...</font><br> <p> I wish that more people understood this.<br> <p> -Dave.<br> </div> Sun, 22 May 2022 23:54:58 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895728/ https://lwn.net/Articles/895728/ neilbrown <div class="FormattedComment"> Thanks for the deeper insights Dave!<br> <p> <font class="QuotedText">&gt; to add a bunch of new metadata to support both variable size structures and structures of unpredictable, unknown locations.</font><br> <p> That isn&#x27;t at all what I was suggesting - sorry that I wasn&#x27;t clear.<br> I was suggesting something like:<br> First Gigabyte is 10 100MB AGs<br> Next 9 Gigabytes are 9 1GB AGs<br> Next 90GB are 9 10GB AGs<br> Next 900GB are 9 100GB AGs<br> etc<br> This is just illustrative - in practice you would use powers of 2 and other details might be different. E.g. the size of the smallest AG might be determined at mkfs time. So: nothing much that is unpredictable.<br> <p> This layout would then be the same no matter how big the device was - so there would be no need to change anything when you make the device larger - just add new AGs, some of which might be bigger than any existing ones. 90% of the space would always be in the largest, or second largest, AGs.<br> <p> I think this addresses your concerns about disaster recovery and mapping between inode number and the AG which stores the inode.<br> <p> Your concern about allocation strategies when there are varying sized AGs is not immediately solved. However....<br> The AGs used for allocating files to device locations do not necessarily have to match the block groups used for placing inodes and other metadata. One is a run-time concept and the other is an on-disk concept. <br> In my above example on a 1TB device, XFS could treat the first 28 on-disk AGs as a single 100GB group for allocation, then each of the remaining AGs as other 100GB allocation groups.<br> <p> But as you suggest, there is a big difference between such abstract musings and the hard practical realities of changing code in a production filesystem that many bytes of data are depending on...<br> <p> </div> Thu, 19 May 2022 04:10:00 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895706/ https://lwn.net/Articles/895706/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; Do they need to be &quot;fully dynamic&quot; ?? Don&#x27;t they just need internal structures to scale with size?</font><br> <p> They already tend to, though there are maximum size bounds for &quot;internal structures&quot; that pose limits. The problem is that once the structures are laid down on disk, they are fixed for the life of the filesystem. Growing can only replicate more of those same sized structures, and that&#x27;s where all the problems lie.<br> <p> <font class="QuotedText">&gt; Have I over-simplified too much?</font><br> <p> No, but you&#x27;ve assumed that the filesystems don&#x27;t already do that. 
The problem is this scaling happens at mkfs time and that sets a lot of things in stone that have been assumed will never change.<br> <p> <font class="QuotedText">&gt; So: could we scale these tables based on their location instead of the total size?</font><br> <p> Yes, we could, but that involves changing the on disk format to add a bunch of new metadata to support both variable size structures and structures of unpredictable, unknown locations.<br> <p> I&#x27;ll speak for XFS here, because the common question is &quot;why can&#x27;t we just implement variable size allocation groups?&quot; and that&#x27;s exactly what you are asking here.<br> <p> The size of the AG is fixed at mkfs time. On a 2GB fs, that&#x27;s 500MB. Grow that out to 10TB, and we have 20,000 AGs and it&#x27;s the AG count that causes issues for us. AGs can be up to 1TB in size - a 10TB filesystem will be laid out with 10x1TB AGs, or if it is on RAID5/6 it will have 32x~300GB AGs (because RAID has more disks and can handle more concurrency). So you can see the difference in layout between 20,000x500MB AGs vs 10x1TB AGs.<br> <p> The problem is not the size of the AG - they are all indexed by btrees, so there is very little in the way of performance difference between maintaining free space indexes in a 500MB AG and a 1TB AG. They scale with size just fine and we already support different sized AGs at runtime - they have an internal length variable in their headers and the last AG never fits neatly into the device and so is always shorter than the rest.<br> <p> But the problem here is that all the algorithms assume that all AGs are the same size. e.g. if we know the size and the AG number, right now we know exactly where it is located on the block device. We know it is equal to all other AGs, too, so algorithms do linear AG iteration because no one AG is more likely to be better than any other based on physical characteristics. As such, we have one variable in the superblock that tells us what the size of an AG is, and another that tells us the AG count. Our backup for that information for disaster recovery is in the AG headers - we have a secondary superblock in the first sector of every AG.<br> <p> So let&#x27;s look at disaster recovery - the first requirement for any filesystem design is to be able to recover from loss of key structures. Right now, if we lose the primary superblock and AG 0 headers (say someone wipes out the first 64kB of the block device) then we have to do a sector-by-sector search to find the first secondary superblock in AG #1. We optimise the search based on device size and what mkfs defaults would do (so it might only take a single IO to find a valid secondary), but if that doesn&#x27;t work we have to read every single sector out past the first TB to find the start of the second AG. Then we can recover the size of the AGs, grab all the other secondary superblocks, validate them and recover the primary.<br> <p> Now, if we have variable size AGs, how do we recover from the loss of the superblock and/or the AG location/size index tree? How do we find the start of all the AGs in the filesystem and validate they are correct? We can&#x27;t play mkfs tricks or just find the secondary SB in AG#1 - we have to read every single sector of the block device to find every AG header.<br> <p> How long does that take if we have a PB scale filesystem? At a scan rate of 1GB/s, just that scan will take roughly a day per PB. 
IOWs, even if we solve these problems, recovering from the loss or corruption of the AG indexing structure can be catastrophically bad from a disaster recovery POV - if it&#x27;s going to take a week just to find the AG headers, we may as well just throw it away and restore from backups.<br> <p> <font class="QuotedText">&gt; So: could we scale these tables based on their location instead of the total size?</font><br> <p> Sure, we could do that, but that still brings a host of other problems with it and introduces a whole set of new ones.<br> <p> e.g. allocation algorithms that select AGs with larger free spaces first, then always try exact or near locality allocations for all followup allocations to that inode. That changes the filesystem fill algorithms from low-&gt;high to high-&gt;low because all the AGs with large free spaces are high up in the address space. i.e. on physical disks we change from filling the outer edge (fastest) first, to filling from the inner edge (slowest) first. <br> <p> Changes like this also have implications for AG lock ordering, which only allows multiple AGs to be locked in ascending order. If we allocate from the highest AG first and it ENOSPCs mid operation, we need to allocate from another AG to complete it. If the only AGs we can choose from are lower down, then we can deadlock with other multi-AG allocations. The only answer there is *spurious ENOSPC*, because shutting down the filesystem when stuff like this happens is not an option. This spurious ENOSPC normally only happens when the filesystem is nearly full, so it generally isn&#x27;t an unexpected result. But variable AG sizes will bring this behaviour out much earlier when the filesystem is nowhere near full.<br> <p> Then there are the general variable AG size issues, like the internal sparse address space that XFS uses for inodes and filesystem block addressing (detect a trend here?). We encode the AG number into the high bits, and the block number in the AG into the low bits (i.e. AGNO|AGBNO). The number of bits each uses is actually variable and based on the size of the AG set at mkfs time. Inode numbers are similar in being &quot;AGNO|AGBNO|Inode in chunk&quot; - it&#x27;s a sparse encoding of its location in the address space and so we can go straight from inode number to its physical location just with a couple of shifts and some masking. <br> <p> Hence if the AG is going to be variable size, then we always have to reserve 32 bits for the AGBNO in this address space so we can always support 1TB AG sizes in the address space. This means that every AG will actually consume 1TB of address space, regardless of its actual size. <br> <p> At this point, the maximum size of the filesystem drops - it&#x27;s no longer 8EB, it&#x27;s 8EB minus all the unused address space. IOWs, even just calculating the actual physical capacity of the filesystem becomes hard - we have to add up the size of each individual AG rather than just multiplying 2 numbers together. When we support 2^32 AGs, that&#x27;s a non-trivial problem.<br> <p> Then there are inode numbers instantly going over 32 bits. If AG #0 is only 500MB in size, and the user selects the &quot;inode32&quot; allocator, they now only have a maximum of 500MB of storage for inodes in the entire filesystem. That&#x27;s because inode numbers are sparse and putting them in AG #1 will use 33 bits of address space and hence have a 33-bit inode number. 
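To make the sparse encoding described above a little more concrete, here is a rough shell sketch of how an inode number decomposes into AG number, AG block and inode slot with a couple of shifts and masks. The geometry values (agblklog, inopblog) and the inode number are invented for the example; on a real filesystem they come from the mkfs-time geometry reported by xfs_info:

    # Illustrative only - the geometry values below are assumptions, not defaults.
    agblklog=17            # log2(filesystem blocks per AG)    (assumed)
    inopblog=3             # log2(inodes per filesystem block) (assumed)
    ino=133                # an inode number, e.g. from 'ls -i'
    agno=$((  ino >> (agblklog + inopblog) ))
    agbno=$(( (ino >> inopblog) & ((1 << agblklog) - 1) ))
    slot=$((  ino & ((1 << inopblog) - 1) ))
    echo "inode $ino: AG $agno, AG block $agbno, slot $slot"

The sketch only works because the field widths are fixed at mkfs time; with variable sized AGs the AGBNO field would have to be reserved at its maximum width, which is the address space blow-out described above.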
With the current layout, there&#x27;s 200 AGs that inodes could be placed in (a full 1TB of storage) before the inode address space goes over 32 bits...<br> <p> Hence going to variable size AGs based on size of the device has user visible implications, even for users that haven&#x27;t used grow and will never use grow. Put that on top of having to make serious changes to the on disk format, tools, disaster recovery algorithms and redesign runtime algorithms to support variable sizes sanely - we are effectively talking about redesigning the entire core of the filesystem.<br> <p> <font class="QuotedText">&gt; But it seems a lot less than &quot;designing a new filesystem&quot;.</font><br> <p> The effort is on the same level as the on-disk format changes that the ext3-&gt;ext4 transition involved. I&#x27;ll leave you to decide if that&#x27;s less effort than &quot;designing a new filesystem&quot; given the years of work that involved. <br> <p> -Dave.<br> </div> Wed, 18 May 2022 23:11:18 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895621/ https://lwn.net/Articles/895621/ MrWim <div class="FormattedComment"> A file system image inside a tar file created with --sparse would have the same properties IIUC.<br> </div> Wed, 18 May 2022 17:52:24 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895330/ https://lwn.net/Articles/895330/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; Modern filesystems like btrfs or bcachefs already have fully dynamic layouts, but I don&#x27;t think we&#x27;re ever going to get there with ext4 or XFS.</font><br> <p> Do they need to be &quot;fully dynamic&quot; ?? Don&#x27;t they just need internal structures to scale with size?<br> <p> As I understand it (and my understanding is based on very old information that I haven&#x27;t taken the trouble to update), these filesystems have various tables (Inodes, free block list) that are distributed across the device (in what were once called &quot;cylinder groups&quot;). Everything else is space used for file (and indirect block) storage, which is already fully dynamic. Have I over-simplified too much?<br> <p> These tables should be smaller on smaller filesystems to avoid wastage (??) and larger on larger filesystems to improve sequential read time and to reduce fragmentation (??). And you cannot change the size of already-allocated tables when you grow the filesystem (because we don&#x27;t have fully dynamic layouts).<br> <p> So: could we scale these tables based on their location instead of the total size? So all inode groups aren&#x27;t the same size, but their sizes and locations can be determined directly from data in the superblock. The calculation would be slightly more complex than it currently is, but it would just be a different calculation.<br> <p> With this arrangement, most of the device would use a layout that is optimal for the device size (assuming the scale is closer to logarithmic than linear), and so you get most of the performance benefit.<br> <p> Obviously this is still a format change and not to be done lightly. But it seems a lot less than &quot;designing a new filesystem&quot;.<br> <p> </div> Mon, 16 May 2022 04:24:44 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895304/ https://lwn.net/Articles/895304/ rhowe <div class="FormattedComment"> It seems to me the issue is in the resize tooling. 
Perhaps they should at least warn about resize operations which lead to suboptimal filesystems?<br> </div> Sun, 15 May 2022 10:18:13 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895292/ https://lwn.net/Articles/895292/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; Is this really that different from users thinking that their space limit is min(max_cloud_size, current_cloud_allocated_size)? In other words, logical capability == current cloud allocated limit, and physical capacity = current logical fs limits.</font><br> <p> You got the first part right - logical capacity == cloud allocated limit - but the second statement doesn&#x27;t make any sense - there are no &quot;current logical fs limits&quot;. Filesystems like ext4 and XFS currently only have physical size and capacity limits.<br> <p> <font class="QuotedText">&gt; To me, this seems more like an ENOSPC reporting &amp; automatic growing problem.</font><br> <p> That&#x27;s the symptom users will see, not the cause of the problem. Filesystems already report ENOSPC just fine, and they can be grown online automatically just fine as well. IOWs, dynamic growth or even auto-grow is not an issue here - the problem is the type of storage being provided to the filesystems by the cloud infrastructure and the constraints that places on the algorithms the filesystems can use to grow capacity.<br> <p> <font class="QuotedText">&gt; The only true architectural deficiency would be with file systems that are not dynamic enough in their internal structures and have problems growing from 10 GB to 10 TB.</font><br> <p> I thought it was clear that this is the fundamental problem my proposal addresses.<br> <p> We can&#x27;t easily redesign the on-disk format of a filesystem to fix a limitation like this. In general terms, that&#x27;s called &quot;designing a new filesystem&quot;, and will take 5-10 years to get it to reliable production quality. Modern filesystems like btrfs or bcachefs already have fully dynamic layouts, but I don&#x27;t think we&#x27;re ever going to get there with ext4 or XFS.<br> <p> So for ext4 and XFS, we make them &quot;fully dynamic&quot; by using dm-thin to break the link between physical device size and available backing storage capacity. In doing so, we don&#x27;t need to dynamically grow or shrink internal structures - dm-thin just locates them in the backing store pool as it sees fit. At this point, the filesystems can be provided with a single value that tells the filesystem what the logical capacity of the backing store is. And with those additions we have capacity constrained, physical location independent ext4 and XFS filesystems.<br> <p> As a result, we&#x27;ll get much more capacity scalability from ext4 and XFS than we could get in any other way. XFS will now easily scale up to 10-20PB, and down to 100MB without any problems.<br> <p> And given that with these architecture mods we also gain *unconstrained instantaneous shrink* capability for free, scaling from 20PB back down to 1GB will be just as easy...<br> <p> Users stand to gain a *lot* from the storage management POV with this small, subtle architecture change. Problems people have been complaining about for years *just go away*, and all it takes is some relatively simple implementation changes to some filesystems and the deployment processes. 
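As a concrete taste of the dm-thin side of this, a minimal lvmthin sketch of the "big address space, small pool" idea is shown below. The device path, VG/LV names and sizes are illustrative only, and this is just the storage-side half - the logical capacity limit argued for above is the part that does not exist yet:

    vgcreate vg0 /dev/vdb                           # scratch disk used as the backing PV (illustrative)
    lvcreate --type thin-pool -L 10G -n pool0 vg0   # 10G thin pool = real capacity
    lvcreate -n data -V 1T --thinpool pool0 vg0     # 1T *virtual* (sparse) volume
    mkfs.xfs /dev/vg0/data                          # fs lays itself out for a 1T device
    mount /dev/vg0/data /mnt
    df -h /mnt                   # reports ~1T even though only 10G is physically backed
    fstrim /mnt                  # hands freed space back to the thin pool
    lvextend -L +10G vg0/pool0   # add real capacity later without resizing the fs

Today the filesystem in this setup will happily overcommit the pool and hit errors when the pool unexpectedly runs dry; the proposal is for the filesystem to report and enforce the 10G (then 20G) pool size as its logical capacity instead.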
But just changing the filesystems alone won&#x27;t get us there - higher layers in the stack need some change, too.<br> <p> -Dave.<br> </div> Sun, 15 May 2022 05:46:18 +0000 Don&#x27;t underestimate cloud image builders https://lwn.net/Articles/895291/ https://lwn.net/Articles/895291/ dgc <div class="FormattedComment"> That&#x27;s pretty much exactly what I&#x27;ve been proposing - the small mods to the filesystem I talked about and pointed to the patch series from 2017 are exactly the mods needed for the fs to &#x27;report &quot;90% used (out of 10GB available space)&quot;&#x27;....<br> <p> -Dave.<br> </div> Sat, 14 May 2022 23:34:52 +0000 We&#x27;ve been having this problem way before the cloud was even a thing... https://lwn.net/Articles/895283/ https://lwn.net/Articles/895283/ mchouque <div class="FormattedComment"> It&#x27;s true the cloud has somehow made that problem worse but it&#x27;s been a problem for a long while, way before the cloud was a thing.<br> <p> I&#x27;ve seen so many times the following cases in the different places I worked at:<br> - pv/vg/lv created many years ago on SAN and being migrated as the underlying servers and SAN bays are refreshed, lv&#x27;s being increased as time goes on<br> - lv&#x27;s which start small (because &quot;we&quot; don&#x27;t know or didn&#x27;t expect the growth or for how long this stupid app would survive)<br> <p> Eventually there comes a point where:<br> - you&#x27;re resizing a FS just to create inodes not because you need free space<br> - your resized FS doesn&#x27;t have all the fancy bits and features of a recent FS of the same type (xattr/acl becoming enabled by default by mkfs on ext comes to mind, meaning older ext fs wouldn&#x27;t have them if you didn&#x27;t have the options in your fstab)<br> - you hit some magical limit which prevents you from resizing<br> - the underlying hardware has changed meaning the magical parameters used during creation no longer apply<br> - ...<br> <p> Yes, the problem comes from the fact it&#x27;s easier to do all that rather than to start from freshly created FS&#x27; and to sync data to them but the bottom line is it&#x27;s been an issue for a while.<br> <p> I mean I even had the issue with UFS filesystems on Solaris back in the day where growfs couldn&#x27;t do its magic if the initial system was too small.<br> <p> One could argue tools should just refuse to do what is being asked of them when the end result is just too different from a regular FS (or add a --i-really-know-what-im-doing flag)...<br> <p> A different but similar problem was with ESX where if you had less than 3GB of RAM on your VM, you couldn&#x27;t hot add RAM to get past 3GB (or was it a Linux problem, can&#x27;t remember?). Or VM hardware types not being upgraded as hypervisors are being upgraded. Etc.<br> </div> Sat, 14 May 2022 19:06:08 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895274/ https://lwn.net/Articles/895274/ dcg <div class="FormattedComment"> <font class="QuotedText">&gt; That way the filesystem can be laid out at mkfs time as if it was a physical 10TB filesystem because it&#x27;s got a 10TB address space,</font><br> <p> Is this really that different from users thinking that their space limit is min(max_cloud_size, current_cloud_allocated_size)? In other words, logical capability == current cloud allocated limit, and physical capacity = current logical fs limits.<br> <p> To me, this seems more like an ENOSPC reporting &amp; automatic growing problem. 
The only true architectural deficiency would be with file systems that are not dynamic enough in their internal structures and have problems growing from 10 GB to 10 TB.<br> </div> Sat, 14 May 2022 15:12:58 +0000 Don't underestimate cloud image builders https://lwn.net/Articles/895255/ https://lwn.net/Articles/895255/ dtlin <div class="FormattedComment"> LVM thin is that layer currently - you can &quot;create 1TB volume, with initial 10GB allocation&quot;, under any filesystem. It would be more usable with a bit more integration - e.g. if the filesystem could report &quot;90% used (out of 10GB available space)&quot; rather than &quot;0.9% used (out of 1TB virtual space)&quot; - but it does work.<br> </div> Sat, 14 May 2022 08:54:17 +0000 Don't underestimate cloud image builders https://lwn.net/Articles/895252/ https://lwn.net/Articles/895252/ Wol <div class="FormattedComment"> Or have something similar to LVM. When I snapshot an image I can allocate only a small amount of space. As I carry on working, the snapshot grows as required.<br> <p> So have an lvm-style option that says &quot;create 1TB volume, with initial 10GB allocation&quot;. Then when I run mkfs it thinks its got a terabyte to play with. We have things like trim, and all that, nowadays, so the volume level has all the information it needs to only allocate and use the actual space it needs.<br> <p> This way (in line with Unix&#x27;s &quot;do one thing and do it well&quot;) we don&#x27;t need to change any filesystems or anything - we just have a layer dedicated to &quot;squashing&quot; a disk image into the smallest real disk space reasonably possible. Maybe tweak the layers above to &quot;format by allocating zeroes&quot; rather than actually writing &quot;this space is empty&quot; data, and you can have large, optimised, allocation tables that don&#x27;t actually take up any disk space.<br> <p> Cheers,<br> Wol<br> </div> Sat, 14 May 2022 07:14:57 +0000 Don't underestimate cloud image builders https://lwn.net/Articles/895248/ https://lwn.net/Articles/895248/ garloff <div class="FormattedComment"> A fair amount of cloud users use the images that are provided by the cloud operators and customize them using cloud-init mechanisms. If they bring their own images, there&#x27;s also a good chance that those are based on what distros provide as cloud images with some optional postprocessing.<br> There are a few tools used for building these images, most prominently disk-image-builder. And a few using kiwi.<br> Having one mkfs option where you can pass one parameter setting a size that the FS should be prepared to grow to and then talking to a dozen folks that provide images would take care of most cases IMVHO.<br> So, Ubuntu would build 10GiB distro images optimized to be resized to a 50GiB filesystem...<br> Thinking a bit further: That parameter could default to 2x the size of the filesystem being built, allowing for some growth by default, and one could also of course manually set it to the FS size to optimize for FSes that never grow... <br> Sidenote: Good cloud apps should not consider the root filesystem as a good place to hold persistent data.<br> </div> Sat, 14 May 2022 06:40:25 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895229/ https://lwn.net/Articles/895229/ wahern <div class="FormattedComment"> Like VirtFS? 
See <a href="https://landley.net/kdocs/ols/2010/ols2010-pages-109-120.pdf">https://landley.net/kdocs/ols/2010/ols2010-pages-109-120.pdf</a> and <a href="https://blog.linuxplumbersconf.org/2010/ocw/system/presentations/597/original/VirtFS_LPC_2010_Final.pdf">https://blog.linuxplumbersconf.org/2010/ocw/system/presen...</a><br> <p> It uses the VirtIO command ring buffer for file operations (io_uring before io_uring). Doesn&#x27;t seem to have drawn as much attention as I thought it might, though it&#x27;s been around for a while.<br> <p> </div> Fri, 13 May 2022 23:08:13 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895143/ https://lwn.net/Articles/895143/ pj <div class="FormattedComment"> At least one distro has install instructions that say something like: &#x27;download this image, dd it onto a USB stick, then expand the partition size to max&#x27;, so it&#x27;s not an uncommon use case.<br> </div> Fri, 13 May 2022 06:32:55 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895114/ https://lwn.net/Articles/895114/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; why not use tar or cpio?</font><br> <p> Because it&#x27;s much slower than a sequential copy from a linear file to a linear block device. The OS and storage devices are highly optimised for efficient/fast sequential data operations, while tar/cpio have to do heaps of metadata modifications to create all the small containers (dirents, files, etc) that end up getting spread all over the place (can result in small random metadata writeback IO) before they can do all the data copying. And even the data copy is sub-optimal in that it runs as lots of small discontiguous data writes spread all over the device.<br> <p> The whole point of using qcow2 as an image transport mechanism is that it ships as a linear file, but the internal format is a set of offset-based data regions that can be copied directly to the storage. And with image building, the initial qcow2 image will actually be written as ascending offset order regions. IOWs, instead of a single sequential data copy to the block device, it turns it into a sequential image file read and a set of smaller sequential {seek, write} (or pwrite()) operations that are run in ascending address order. That&#x27;s almost as efficient and fast as a straight linear copy, but it skips both transporting and copying all the empty space in the filesystem image.<br> <p> IOWs, using qcow2 as an image transport mechanism means we gain sparse image distribution support and we lose almost no performance at the image building stage or the &quot;copy image to storage device&quot; deployment phase. We also gain the ability to efficiently ship near empty 100TB filesystem images containing the container image so they only take up as much space as the existing linear 2GB image technique. Then if we deploy the sparse image to sparse storage devices....<br> <p> As I said, we already have almost all the pieces we need to solve this whole class of issues once and for all with simple mods to image building, transport, deployment and kernel filesystem accounting.<br> <p> -Dave.<br> </div> Thu, 12 May 2022 22:03:38 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895112/ https://lwn.net/Articles/895112/ Conan_Kudo I thought tarot cards were maybe more effective. Perhaps I was wrong... 
😛 Thu, 12 May 2022 21:37:41 +0000 Changing filesystem resize patterns https://lwn.net/Articles/895077/ https://lwn.net/Articles/895077/ Hello71 <div class="FormattedComment"> <font class="QuotedText">&gt; While the problems have been identified, solutions have not been; he was hoping that attendees had some interesting ideas. One thing that would be useful, Ts&#x27;o said, is to have a standard format for large filesystem images that could be inflated onto block devices into the full size of the filesystem. The ext4 developers have been experimenting with using the qcow format; there is a utility in e2fsprogs called e2image that can create these images. They only contain the blocks that are actually used by the filesystem, so they are substantially smaller than the filesystem they will create. The XFS developers have also been looking at xfsdump, since it has some similar capabilities, but for XFS filesystems.</font><br> <font class="QuotedText">&gt;</font><br> <font class="QuotedText">&gt; When he and Wong were talking, they agreed that some single standard format would be useful. One possibility is qcow, but it is not well-specified and the QEMU developers, who created the qcow format, discourage its use as an interchange format, he said. Perhaps there are others that could be considered, but getting agreement between the various filesystems is important. That would help move away from the idea that dd is the state-of-the-art tool for transferring filesystems. </font><br> <p> why not use tar or cpio? as i see it, the benefits of raw filesystem images are: 1. they can be directly mounted, and 2. they can be installed on a storage device using extremely simple and common programs and algorithms (useful for e.g. minimal recovery environments). if you use qcow or some other custom format, neither of these benefits applies anymore, and you may as well use tar (which is more widely supported than your new format).<br> </div> Thu, 12 May 2022 14:44:34 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894991/ https://lwn.net/Articles/894991/ dgc <div class="FormattedComment"> It doesn&#x27;t require anything outside the filesystem to prototype. I did that 5 years ago. The concept works at the filesystem level, but the problem is deployment needs buy-in from the application/storage management layers.<br> <p> Stratis isn&#x27;t the solution - it is actually part of the problem. I have proposed this model to the Stratis developers twice in the past ~3 years to address the problems their users have reported with thin provisioned storage pools going ENOSPC unexpectedly under the filesystem and things breaking as a result. There have also been other mechanisms proposed to solve their problems (e.g. directory tree quotas that match pool capacity) but none have been picked up by Stratis. The sparse address space filesystem architecture solves their problem, but it didn&#x27;t fit neatly into the Stratis storage stack model and so their problems still have not been fixed because we can&#x27;t fix them at the filesystem level in isolation.<br> <p> This is what I meant when I said &quot;changes are needed on both sides of the fence&quot;. We can make the filesystems work better using thin storage to solve problems that manifest in different environments. However, we can&#x27;t do that within the constraints of the existing architecture and management models. The architectural changes we need to make result in a storage stack that looks and behaves a bit differently. 
Management tools need to change to work with architectural changes such as sparse address space devices and device capacity management being promoted to the filesystem level. They&#x27;ll also need to change to use full sized address space sparse image files for deployments rather than tiny linear block device images - we have all the tools to do this already (qemu-img and using qcow2 as a transport format), so it&#x27;s largely just a matter of putting all the pieces in place and getting them to work together.<br> <p> -Dave.<br> </div> Thu, 12 May 2022 14:19:13 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894999/ https://lwn.net/Articles/894999/ LtWorf <div class="FormattedComment"> I think most people would just copy the files over<br> </div> Thu, 12 May 2022 13:55:18 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894990/ https://lwn.net/Articles/894990/ stefanha <div class="FormattedComment"> Can anything be done about the 2 GB journal? If a block device starts off with 10 GB then that overhead is significant.<br> </div> Thu, 12 May 2022 13:19:43 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894979/ https://lwn.net/Articles/894979/ dgc <div class="FormattedComment"> Yes.<br> <p> The prototype I wrote used 100TB as the base address space size so everything in the XFS filesystem is scaled to maximum sizes.<br> The only thing that takes significant space is 2GB for the journal - 100 sets of AG headers and btree root blocks is less than 4MB of space. However, by maximally sizing everything, the address space can be grown (yes, you can still &quot;physically&quot; grow the filesystem to use all the available device address space!) to a capacity of 10-20PB before AG count scalability limitations start to kick in.<br> <p> That should be enough capacity for the next decade or so....<br> <p> -Dave.<br> </div> Thu, 12 May 2022 13:14:00 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894978/ https://lwn.net/Articles/894978/ walters <div class="FormattedComment"> Hasn&#x27;t it always been possible to prototype such an interface in the Linux kernel between e.g. LVM and the filesystems? I think Stratis is doing some of this &quot;on the side&quot; today right?<br> </div> Thu, 12 May 2022 13:04:19 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894966/ https://lwn.net/Articles/894966/ dgc <div class="FormattedComment"> Sure, why not? <br> <p> One could say that we already have a &quot;full filesystem protocol&quot; - the kernel&#x27;s VFS - and the cloud providers could just write kernel drivers for their storage back ends. Like, say, cephfs?<br> <p> And I&#x27;ve just pointed out in another thread on fsdevel that the kernel has a generic atomic object storage interface: xattrs. They provide a name-based atomic key-value store alongside seekable data streams. It seems to me that there is little understanding of the implications of kernel provided, space, time and integrity efficient key value stores that scale to millions of objects *per inode*. Those cloudfs drivers could just map their back end object storage protocols directly to those kernel interfaces, too. <br> <p> But before that can happen, we need people to stop thinking like storage can only store data in seekable streams and be accessed like a spinning disk from the 1980s. 
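For readers who have not used that interface, a small shell sketch of the xattr key/value usage being described - the mount point, file name and attribute names are invented for the example:

    touch /mnt/object
    setfattr -n user.owner -v alice /mnt/object    # store a name/value pair
    setfattr -n user.state -v live  /mnt/object
    getfattr -n user.owner /mnt/object             # fetch one key
    getfattr -d /mnt/object                        # dump all user.* keys

Each name/value pair is stored and updated atomically alongside the file's ordinary seekable data stream, which is the property being pointed at above.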
I think we need to take the small, easy steps first to solve the immediate problems, then we have some time and space to work on bigger changes....<br> <p> -Dave.<br> </div> Thu, 12 May 2022 12:58:10 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894975/ https://lwn.net/Articles/894975/ johannbg <div class="FormattedComment"> <font class="QuotedText">&gt; Predicting the future with tea leaves has never been particularly effective.</font><br> <p> Dammit, oh well back to the drawing board using the good old reliable magic 8 ball for future predictions :)<br> </div> Thu, 12 May 2022 12:54:25 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894951/ https://lwn.net/Articles/894951/ mokki <div class="FormattedComment"> I would think there is quite a big difference.<br> <p> In the big address space case every data structure can be laid out in a sparse pattern. But initially there is a clamp on the allocator on how much the journal/tree can consume for each data structure.<br> When growing, the allocator limit is raised, but there is no need to move any data around. The data structures are all valid as is and &#x27;continuous&#x27; with the allocated space.<br> <p> With the current growing logic, either the data structure space needs to be split into regions (which increases complexity and slows things down) or it needs to be relocated (a dangerous/slow operation).<br> </div> Thu, 12 May 2022 09:10:27 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894938/ https://lwn.net/Articles/894938/ rwmj <div class="FormattedComment"> It&#x27;s unclear how this is really any different from telling the filesystem &quot;you may get access to linear addresses 0 through 10TB in future, but right now you only have linear addresses 0 through 10GB&quot;.<br> </div> Thu, 12 May 2022 07:25:32 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894937/ https://lwn.net/Articles/894937/ stefanha <div class="FormattedComment"> Does this approach solve the metadata overhead problem discussed in the article? Is the filesystem created for a 10 TB address space instead of the 100 GB that are actually available?<br> </div> Thu, 12 May 2022 07:18:58 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894931/ https://lwn.net/Articles/894931/ pabs <div class="FormattedComment"> You can resize a USB stick filesystem; dd it over to a new USB stick, resize the partition and filesystem.<br> </div> Thu, 12 May 2022 06:07:51 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894926/ https://lwn.net/Articles/894926/ Wol <div class="FormattedComment"> CAFS? Pick? Git?<br> <p> Like all of these things, I guess the extra complexity isn&#x27;t seen as worth it most of the time, and actually I think I&#x27;m in agreement. Throwing hardware at a problem merely moves the problem ... the conventional model is simple and it works for most use-cases (most of the time) ...<br> <p> Cheers,<br> Wol<br> </div> Thu, 12 May 2022 05:50:05 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894925/ https://lwn.net/Articles/894925/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; So in that case I think XFS just works out of the box - I think it reads the /sys/ files to work out what to do already - with a RAID.</font><br> <p> So I was told by the XFS guys ...<br> <p> But they did say if you want to grow the volume, don&#x27;t. 
Just create a new volume and move the data across (or grow it in increments of 100%).<br> <p> Cheers,<br> Wol<br> </div> Thu, 12 May 2022 05:45:35 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894924/ https://lwn.net/Articles/894924/ neilbrown <div class="FormattedComment"> Going from linear to sparse mapping seems like too small a step.<br> Why not an object storage protocol or even a full filesystem protocol?<br> <p> </div> Thu, 12 May 2022 05:28:08 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894923/ https://lwn.net/Articles/894923/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; Neil: Please take your tongue out of your cheek.</font><br> <p> I&#x27;m completely serious. If a filesystem is not fit for purpose it should be fixed or replaced.<br> Predicting the future with tea leaves has never been particularly effective. Doing it with hints from virtual SCSI devices is unlikely to fare better.<br> <p> </div> Thu, 12 May 2022 05:25:29 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894902/ https://lwn.net/Articles/894902/ zev <div class="FormattedComment"> e2fsprogs already has the logic for providing bundles of preset parameters for different usage patterns -- see the &#x27;-T&#x27; option to mke2fs, and the &#x27;[fs_types]&#x27; section of mke2fs.conf. The tricky part is in trying to automatically figure out which one is the &quot;right&quot; one to use, which might depend on the nature of the backing device (maybe known at mkfs-time, maybe not if you&#x27;re just prepping an image to dump to the real device) and on future usage/resizing (much harder to predict, though the backing device might provide some clues to base guesses on).<br> </div> Thu, 12 May 2022 03:38:55 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894887/ https://lwn.net/Articles/894887/ dgc <div class="FormattedComment"> It sounds like this session didn&#x27;t get anywhere near to recognising the fundamental issue: cloud storage infrastructure presents storage as traditional linear physical block devices to the user.<br> <p> Cloud storage isn&#x27;t physical - it&#x27;s thin provisioned and has a virtual mapping layer that makes it look like a normal, physical linear block address space. Hence filesystems aren&#x27;t aware they are on virtual storage that can be grown and shrunk at will. They can only behave as if they have been deployed on a linear address space, and therefore cloud storage suffers from the same problems we&#x27;ve had with growing physical storage since the mid 1990s.<br> <p> Put simply: the storage deployment architecture is based on 1990s storage architectures rather than the modern architectures that hide behind the fading 1990s facade.<br> <p> I think we really can only solve this problem properly at the architectural level - we need coordinated structural changes on both sides of the fence to make linear address spaces go away and with that all the technical problems with physical filesystem grow and shrink.<br> <p> IOWs, we need an architectural shift to storage that presents a sparse block device to the filesystem and the filesystems need to be slightly modified to provide logical capacity limits rather than physical capacity limits. 
That way the filesystem can be laid out at mkfs time as if it was a physical 10TB filesystem because it&#x27;s got a 10TB address space, but only allow the user to make use of 10GB of capacity because that&#x27;s all the storage space that has been provisioned to it.<br> <p> That is, the 10GB that the storage provider supplies for the block device is the *pool* of storage capacity that the filesystem can consume, not the *address space* it can consume. The filesystem should have a logical capacity set such that it ENOSPCs just before the pool runs out of capacity. Then the user can provision more pool space (e.g. 1GB) and grow the filesystem logical capacity by 1GB and ENOSPC goes away. Or the user can grow it to 10TB and we just don&#x27;t care.<br> <p> In doing this, we haven&#x27;t changed the physical layout of the filesystem at all. It&#x27;s &quot;physically&quot; sized for a 10TB address space, it can use space anywhere inside that 10TB address space, but the user is only allowed to consume as much space as the logical capacity limit set for it. We can use online discard functionality (or fstrim) to release consumed address space that is no longer in use at the filesystem level. Hence users don&#x27;t have to do anything to manage the used pool capacity except to provision more of it when it runs out.<br> <p> IOWs, cloud storage architecture really needs to move away from providing linear address spaces to pool-based sparse address spaces. Filesystems need logical capacity management and logical grow/shrink to reflect the provisioned pool capacity they sit on top of. This can be integrated into all existing cloud deployments, too, because dm-thinp can be used to convert the provider&#x27;s linear address spaces into a pool-based sparse address space....<br> <p> This is the architectural change we need to be driving - sparse storage address spaces solve all the problems with existing technologies, we already have most of the pieces we need implemented, it requires a minimum of expenditure and resources to implement on both the filesystem and cloud infrastructure sides, and it can be deployed into all existing clouds with just host kernel upgrades to pick up the new filesystem functionality.<br> <p> We do not need to blame cloud providers for pushing a business model that keeps them in business, nor users who have optimised their storage usage to minimise costs. We have everything we need at hand to fix this problem and have had it for a while. Our problem is that developers on both sides of the fence refuse to own the problem. This can&#x27;t be solved with fs mods alone and it can&#x27;t be fixed with cloud deployment process changes alone. We need developers on both sides of the stack to realise that small changes on both sides can be done and the whole problem goes away forever.<br> <p> I&#x27;ve been struggling to get this message across for years. Here&#x27;s the XFS prototype I wrote in 2017:<br> <p> <a href="https://lore.kernel.org/linux-xfs/20171026083322.20428-1-david@fromorbit.com/">https://lore.kernel.org/linux-xfs/20171026083322.20428-1-...</a><br> <p> -Dave.<br> </div> Thu, 12 May 2022 02:37:05 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894880/ https://lwn.net/Articles/894880/ gerdesj <div class="FormattedComment"> Neil: Please take your tongue out of your cheek.<br> <p> I do agree with you that &quot;Linux filesystems were generally designed to support being resized&quot; is not my experience. 
It&#x27;s generally quite easy to embigger them but very tricky to shrink them.<br> <p> Perhaps a bug report will fix this silly oversight.<br> </div> Thu, 12 May 2022 00:26:25 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894878/ https://lwn.net/Articles/894878/ neilbrown <div class="FormattedComment"> This sounds to me like a simple bug report.<br> <p> Bug: resize2fs doesn&#x27;t resize all of the filesystem - only some parts.<br> <p> I appreciate that resizing some parts is easier than others, so you wouldn&#x27;t bother with the hard parts until the need presents itself. It seems that the need has now presented itself. Time to fix the bug.<br> If it is impossible, rather than just hard, to resize all of the filesystem, then that puts a big question mark over the introductory claim: &quot;Linux filesystems were generally designed to support being resized&quot;<br> <p> </div> Thu, 12 May 2022 00:09:36 +0000 Changing filesystem resize patterns https://lwn.net/Articles/894865/ https://lwn.net/Articles/894865/ gerdesj <div class="FormattedComment"> The wizard approach is a great idea, I think. However, filesystems are sodding complicated. Trying to work out how to get the best out of a given setup is hard. <br> <p> I recently specified a box for around 100TB of backup (Veeam) storage on a budget. I went for a reconditioned Dell T430, rack mounted (gets you eight 3.5&quot; bays), a PERC RAID controller with 8GB RAM and eight 16TB SATA. RAID 6 takes about a day to build with nothing else going on so probably two to three days in use. Hence RAID 6 and not 5. I went for a larger stripe size than default because this thing holds a few large files. Now for the file system.<br> <p> It&#x27;s XFS for the reflinks or ReFS and Windows (no thanks)! Much searching later I realized that this hardware RAID doesn&#x27;t need to worry about the same things as when it&#x27;s on md and the like, i.e. worrying about the number of discs in the RAID, the number of parity discs and the stripe size. It seems that above a certain stripe size (1024KB I think), the PERC reports the maximum as that anyway via /sys/. It&#x27;s some sort of secret sauce in the PERC according to some responses by Dell employees in some forum postings I found.<br> <p> I ran mkfs.xfs with defaults!<br> <p> Did I get the best out of the box? Well it twiddles its thumbs waiting for data to turn up and has about 88TB (real TB) of file system. I&#x27;m quite risk averse, hence RAID 6 and not 5, even though I have a spare disc on the shelf. I went for dual PSU, 64GB RAM and dual Gold Xeon CPUs, full iDRAC etc. It cost about £5000 + VAT (sales tax). The box has three years of next business day hardware support priced in too.<br> <p> I priced up quite a few options and that works out at less than £60/TB useable space.<br> <p> So in that case I think XFS just works out of the box - I think it reads the /sys/ files to work out what to do already - with a RAID.<br> <p> However, the cases described in the article are simply mad in comparison with my rather normal, pedestrian use case. In general I want to create a 100GB to 10TB fs and bolt on more in increments of 10-50% every now and then. Mr Cloudy and his 10GB basic offering, flexing to 1PB in increments of 1GB, can design their own fs to work with that use case. <br> <p> I think that state-of-the-art fs development has already got it pretty much spot on and Mr Cloudy and co can damn well do the research and spend the money and work out what they need well away from the real world. 
Ideally they would collaborate on this in some sort of open way and we&#x27;d all benefit. There are billions of individuals who benefit from ext4, xfs, btrfs and all the rest being at the (b)leeding edge and easily available. There are billions of customers of cloud systems who would benefit from cloudy optimised file systems. There is a lot of crossover.<br> <p> Perhaps we need to see xfs.orc, ext4.ggl, btrfs.faceb00c and the like being developed. I&#x27;ve no idea how that would work or if it is even possible. <br> </div> Wed, 11 May 2022 23:52:22 +0000