Changing filesystem resize patterns

Posted May 18, 2022 23:11 UTC (Wed) by dgc (subscriber, #6611)
In reply to: Changing filesystem resize patterns by neilbrown
Parent article: Changing filesystem resize patterns

> Do they need to be "fully dynamic" ?? Don't they just need internal structures to scale with size?

They already tend to, though there are maximum size bounds for "internal structures" that pose limits. The problem is that once the structures are laid down on disk, they are fixed for the life of the filesystem. Growing can only replicate more of those same sized structures, and that's where all the problems lie.

> Have I over-simplified too much?

No, but you've assumed that the filesystems don't already do that. The problem is this scaling happens at mkfs time and that sets a lot of things in stone that have been assumed will never change.

> So: could we scale these tables based on their location instead of the total size?

Yes, we could, but that involves changing the on disk format to add a bunch of new metadata to support both variable size structures and structures of unpredictable, unknown locations.

I'll speak for XFS here, because the common question is "why can't we just implement variable size allocation groups?" and that's exactly what you are asking here.

The size of the AG is fixed at mkfs time. On a 2GB fs, that's 500MB. Grow that out to 10TB, and we have 20,000 AGs and it's the AG count that causes issues for us. AGs can be up to 1TB in size - a 10TB filesystem will be laid out with 10x1TB AGs, or if it is on RAID5/6 it will have 32x~300GB AGs (because RAID has more disks and can handle more concurrency). So you can see the difference in layout between 20,000x500MB AGs vs 10x1TB AGs.
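
To put numbers on that (purely illustrative arithmetic, not XFS code):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ag_size = 500ULL << 20;   /* 500MB AGs picked at mkfs for a 2GB fs */
        unsigned long long grown   = 10ULL << 40;    /* device later grown to 10TB */

        /* The AG size cannot change after mkfs, so only the count can grow. */
        printf("AGs after grow: %llu\n", grown / ag_size);   /* ~20,000 */
        return 0;
    }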

The problem is not the size of the AG - they are all indexed by btrees, so there is very little in the way of performance difference between maintaining free space indexes in a 500MB AG and a 1TB AG. They scale with size just fine, and we already support different-sized AGs at runtime - they have an internal length variable in their headers, and the last AG never fits neatly into the device and so is always shorter than the rest.

But the problem here is that all the algorithms assume that all AGs are the same size. e.g. if we know the AG size and the AG number, right now we know exactly where it is located on the block device. We know it is equal to all other AGs, too, so algorithms do linear AG iteration because no one AG is more likely to be better than any other based on physical characteristics. As such, we have one variable in the superblock that tells us what the size of an AG is, and another that tells us the AG count. Our backup for that information for disaster recovery is in the AG headers - we have a secondary superblock in the first sector of every AG.
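
As a purely illustrative sketch (made-up structure and field names, nothing from the real code), this is why fixed-size AGs make addressing trivial:

    /* Illustrative only: with equal-sized AGs, two superblock fields are
     * enough to locate every AG header on the block device. */
    struct toy_sb {
        unsigned long long agblocks;    /* blocks per AG - same for every AG */
        unsigned int       agcount;     /* number of AGs */
        unsigned int       blocksize;   /* bytes per filesystem block */
    };

    /* Byte offset of AG number 'agno' on the device. */
    static unsigned long long ag_offset(const struct toy_sb *sb, unsigned int agno)
    {
        return (unsigned long long)agno * sb->agblocks * sb->blocksize;
    }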

So let's look at disaster recovery - the first requirement for any filesystem design is to be able to recover from loss of key structures. Right now, if we lose the primary superblock and AG 0 headers (say someone wipes out the first 64kB of the block device) then we have to do a sector-by-sector search to find the first secondary superblock in AG #1. We optimise the search based on device size and what mkfs defaults would do (so it might only take a single IO to find a valid secondary), but if that doesn't work we have to read every single sector out past the first TB to find the start of the second AG. Then we can recover the size of the AGs, grab all the other secondary superblocks, validate them and recover the primary.
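
As an illustration only - hypothetical helpers, not the actual xfs_repair code - the "mkfs tricks" amount to something like this:

    /* Simplified sketch of the recovery search; read_sector() and
     * is_secondary_sb() are hypothetical helpers. */
    static long long find_ag1_superblock(unsigned long long dev_bytes)
    {
        /* AG counts that mkfs defaults commonly produce. */
        static const unsigned int likely_agcounts[] = { 4, 8, 16, 32 };

        for (int i = 0; i < 4; i++) {
            unsigned long long agsize = dev_bytes / likely_agcounts[i];
            if (is_secondary_sb(read_sector(agsize)))
                return agsize;          /* found in a handful of IOs */
        }

        /* Fallback: read every sector until the first secondary superblock
         * turns up - bounded by the 1TB max AG size with fixed-size AGs. */
        for (unsigned long long off = 512; off < dev_bytes; off += 512)
            if (is_secondary_sb(read_sector(off)))
                return off;
        return -1;
    }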

Now, if we have variable size AGs, how do we recover from the loss of the superblock and/or the AG location/size index tree? How do we find the start of all the AGs in the filesystem and validate they are correct? We can't play mkfs tricks or just find the secondary SB in AG#1 - we have to read every single sector of the block device to find every AG header.

How long does that take if we have a PB scale filesystem? At a scan rate of 1GB/s, just that scan will take roughly 11 days per PB. IOWs, even if we solve these problems, recovering from the loss or corruption of the AG indexing structure can be catastrophically bad from a disaster recovery POV - if it's going to take more than a week just to find the AG headers, we may as well just throw it away and restore from backups.
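
The arithmetic behind that estimate, for the record (illustration only):

    #include <stdio.h>

    int main(void)
    {
        double bytes   = 1e15;              /* 1PB */
        double rate    = 1e9;               /* 1GB/s scan rate */
        double seconds = bytes / rate;      /* 1,000,000 seconds */

        printf("%.1f days per PB\n", seconds / 86400);   /* ~11.6 days */
        return 0;
    }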

> So: could we scale these tables based on their location instead of the total size?

Sure, we could do that, but that still brings a host of other problems with it and introduces a whole set of new ones.

e.g. allocation algorithms select the AG with the largest free space first, then always try exact or near locality allocations for all follow-up allocations to that inode. That changes the filesystem fill order from low->high to high->low, because all the AGs with large free spaces are high up in the address space. i.e. on physical disks we change from filling the outer edge (fastest) first, to filling from the inner edge (slowest) first.

Changes like this also have implications for AG lock ordering, which only allows multiple AGs to be locked in ascending order. If we allocate from the highest AG first and it ENOSPCs mid operation, we need to allocate from another AG to complete it. If the only AGs we can choose from are lower down, then we can deadlock with other multi-AG allocations. The only answer there is *spurious ENOSPC*, because shutting down the filesystem when stuff like this happens is not an option. This spurious ENOSPC normally only happens when the filesystem is nearly full, so it generally isn't an unexpected result. But variable AG sizes will bring this behaviour out much earlier, when the filesystem is nowhere near full.
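
A rough sketch of why the ordering rule forces that outcome (illustrative C-style pseudocode with made-up helpers, not the real locking code):

    /* Illustrative only: AG locks may only be taken in ascending agno order,
     * so a transaction can never wait on a lower-numbered AG it skipped. */
    int alloc_more_space(struct tx *tp, unsigned int start_agno, unsigned int agcount)
    {
        for (unsigned int agno = start_agno; agno < agcount; agno++) {
            lock_ag(tp, agno);              /* ascending order: deadlock-free */
            if (alloc_from_ag(tp, agno) == 0)
                return 0;
            /* hold the lock; any further AG we lock has a higher agno */
        }
        /*
         * If we started at the highest AG and ran out of space part way
         * through the operation, the only AGs left are lower-numbered.
         * Locking one of them now could deadlock against a transaction
         * walking low->high, so the operation has to fail instead.
         */
        return -1;      /* spurious ENOSPC in the real code */
    }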

Then there are just the general variable AG size issues, like the internal sparse address space that XFS uses for inodes and filesystem block addressing (detect a trend here?). We encode the AG number into the high bits, and the block number within the AG into the low bits (i.e. AGNO|AGBNO). The number of bits each uses is actually variable and based on the size of the AG set at mkfs time. Inode numbers are similar in being "AGNO|AGBNO|inode in chunk" - it's a sparse encoding of the inode's location in the address space, so we can go straight from inode number to its physical location with just a couple of shifts and some masking.
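
A simplified decode, just to illustrate (made-up names; the kernel's XFS_INO_TO_AGNO()/XFS_INO_TO_AGBNO() macros do essentially this, using per-filesystem log2 values fixed at mkfs time):

    /* Simplified decode of the sparse inode number layout AGNO|AGBNO|offset.
     * The bit widths are fixed at mkfs time from the AG size and block size. */
    struct toy_geo {
        unsigned int agblklog;     /* log2(blocks per AG)    -> AGBNO width  */
        unsigned int inopblog;     /* log2(inodes per block) -> offset width */
    };

    static unsigned int ino_to_agno(const struct toy_geo *g, unsigned long long ino)
    {
        return ino >> (g->agblklog + g->inopblog);
    }

    static unsigned long long ino_to_agbno(const struct toy_geo *g, unsigned long long ino)
    {
        return (ino >> g->inopblog) & ((1ULL << g->agblklog) - 1);
    }

    static unsigned int ino_to_offset(const struct toy_geo *g, unsigned long long ino)
    {
        return ino & ((1U << g->inopblog) - 1);
    }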

Hence if the AG is going to be variable in size, then we always have to reserve 32 bits for the AGBNO in this address space so we can always support 1TB AG sizes. This means that every AG will actually consume 1TB of address space, regardless of its actual size.

At this point, the maximum size of the filesystem drops - it's no longer 8EB, it's 8EB minus all the unused address space. IOWs, even just calculating the actual physical capacity of the filesystem becomes hard - we have to add up the size of each individual AG rather than just multiplying two numbers together. When we support 2^32 AGs, that's a non-trivial problem.
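
The accounting difference looks trivial on paper, but only one side of it scales (illustration only):

    /* Illustrative only: filesystem capacity with fixed vs variable AG sizes. */

    /* Today: two superblock fields, one multiplication. */
    static unsigned long long capacity_fixed(unsigned long long agblocks,
                                             unsigned int agcount)
    {
        return agblocks * agcount;
    }

    /* Variable AGs: walk a per-AG length array - potentially 2^32 entries -
     * and keep the sum coherent while AGs are being grown concurrently. */
    static unsigned long long capacity_variable(const unsigned long long *ag_len,
                                                unsigned long long agcount)
    {
        unsigned long long total = 0;

        for (unsigned long long agno = 0; agno < agcount; agno++)
            total += ag_len[agno];
        return total;
    }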

Then there are inode numbers instantly going over 32 bits. If AG #0 is only 500MB in size, and the user selects the "inode32" allocator, they now only have a maximum of 500MB of storage for inodes in the entire filesystem. That's because inode numbers are sparse, and putting them in AG #1 will use 33 bits of address space and hence produce a 33-bit inode number. With the current layout, there's ~2000 AGs that inodes could be placed in (a full 1TB of storage) before the inode address space goes over 32 bits...

Hence going to variable size AGs based on the size of the device has user-visible implications, even for users that haven't used grow and will never use grow. Put that on top of having to make serious changes to the on-disk format, tools and disaster recovery algorithms, and having to redesign runtime algorithms to support variable sizes sanely, and we are effectively talking about redesigning the entire core of the filesystem.

> But it seems a lot less than "designing a new filesystem".

The effort is on the same level of on-disk format changes that the ext3->ext4 transition involved. I'll leave you to decide if that's less effort than "designing a new filesystem" given the years of work that involved.

-Dave.



Changing filesystem resize patterns

Posted May 19, 2022 4:10 UTC (Thu) by neilbrown (subscriber, #359)

Thanks for the deeper insights, Dave!

> to add a bunch of new metadata to support both variable size structures and structures of unpredictable, unknown locations.

That isn't at all what I was suggesting - sorry that I wasn't clear.
I was suggesting something like:
First Gigabyte is 10 100MB AGs
Next 9 Gigabytes are 9 1GB AGs
Next 90GB are 9 10GB AGs
Next 900GB are 9 100GB AGs
etc
This is just illustrative - in practice you would use powers of 2 and other details might be different. E.g. the size of the smallest AG might be determined at mkfs time. So: nothing much that is unpredictable.

This layout would then be the same no matter how big the device was - so there would be no need to change anything when you make the device larger - just add new AGs, some of which might be bigger than any existing ones. 90% of the space would always be in the largest, or second largest, AGs.
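
For concreteness, here's a rough sketch of the sizing rule I have in mind (illustrative only, decimal units, not a proposal for real code):

    /* Illustrative only: AG size is a pure function of where the AG starts -
     * 10 x 100MB, then 9 x 1GB, 9 x 10GB, 9 x 100GB, and so on. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long dev_size = 1000ULL * 1000 * 1000 * 1000;  /* 1TB device */
        unsigned long long ag_size  = 100ULL * 1000 * 1000;          /* smallest AG: 100MB */
        unsigned long long offset   = 0;

        for (int agno = 0; offset < dev_size; agno++) {
            if (offset == 10 * ag_size)
                ag_size *= 10;     /* each new decade uses AGs 10x larger */
            printf("AG %2d: offset %7llu MB, size %6llu MB\n",
                   agno, offset / 1000000, ag_size / 1000000);
            offset += ag_size;
        }
        return 0;
    }

A 1TB device comes out as 37 AGs, and growing the device just appends further entries to the same sequence without touching any existing AG.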

I think this addresses your concerns about disaster recovery and mapping between inode number and the AG which stores the inode.

Your concern about allocation strategies when there are varying sizes AGs is not immediately solved. However....
The AGs used for allocating files to device locations do not necessarily have to match the block groups used for placing inodes and other metadata. One is a run-time concept and the other is an on-disk concept.
In my above example on a 1TB device, XFS could treat the first 28 on-disk AGs as a single 100GB group for allocation, then each of the remaining AGs as other 100GB allocation groups.

But as you suggest, there is a big difference between such abstract musings and the hard practical realities of changing code in a production filesystem that many bytes of data are depending on...

Changing filesystem resize patterns

Posted May 22, 2022 23:54 UTC (Sun) by dgc (subscriber, #6611)

> That isn't at all what I was suggesting - sorry that I wasn't clear.
> I was suggesting something like:
> First Gigabyte is 10 100MB AGs
> Next 9 Gigabytes are 9 1GB AGs
> Next 90GB are 9 10GB AGs
> Next 900GB are 9 100GB AGs
> etc

Yes, that's kind of what I assumed you were suggesting. It solves some of the "find the AG" corner cases, but it introduces a whole new set of problems that we'd have to solve. Compared to dumping the filesystem on dm-thin and changing a couple of hundred lines of accounting code, it's a massive undertaking.

> The AGs used for allocating files to device locations do not necessarily have to match the block groups used for placing inodes and other metadata.

XFS doesn't have "block groups". It has allocation groups - they are the units that hold inodes, metadata and data.

By default, XFS has no limitation on the AGs which can hold inodes and metadata within the filesystem. But there are user-selectable allocation policies which place arbitrary constraints on where inodes and metadata can be placed. That's why I mentioned inode32 - the implications of full 1TB sized AGs and inode32 are that only AG 0 can hold inodes. Hence the layout you suggest above will only provide 100MB of physical inode space w/ inode32, which equates to a hard cap of roughly 100k inodes no matter how large the fs is grown to. That is obviously not enough inodes for even a root disk on a modern system. Hence for inode32, we can't consider using a physical AG 0 size of less than 10GB for a filesystem with variable sized AGs that could be grown into the TBs range.

It also makes no sense if the filesystem starts large (e.g. we're starting with a capacity of TBs), and so we now have two completely different filesystem layouts - one for tiny filesystems that can grow, and one for anything that starts at over 500GB. That's not really sustainable from a maintenance and longevity POV - we need to reduce complexity and the size of the configuration matrix, not make it worse.

> In my above example on a 1TB device, XFS could treat the first 28 on-disk AGs as a single 100GB group for allocation, then each of the remaining AGs as other 100GB allocation groups.

Unfortunately, we can't do that easily. Locking, transaction reservations, etc all mean we have to treat AGs as independent entities.

For allocation policy purposes, we can group sequential AGs together (I was looking at this as "concatenated groups" to treat multiple TB of contiguous disk space as a single free space entity back in ~2007), but we still have to treat them as individual AGs when it comes to addressing, allocation, reflink and rmap tracking, etc. Hence it doesn't really solve any of the problems with having lots of small AGs in a large filesystem - it just hides them under another layer of abstraction.

There are also performance implications of lots of AGs on spinning disks. At 4 AGs (the default), the cost of seeking between data in different AGs is just on the edge of the cliff. We lose about 5% sustained performance on spinning disks at 4 AGs vs 1 AG (which is why ext4 is always a touch faster on single spinning disks than XFS). By 10 AGs the performance is halved because of the amount of seek thrashing between AGs that occurs as the data sets are spread across the internal address space. IOWs, from a performance perspective on spinning disks, we really want to avoid a large number of small AGs in a small filesystem if at all possible. To fix this we basically have to completely redesign the allocator algorithms to not treat all AGs equally...

> But as you suggest, there is a big difference between such abstract musings and the hard practical realities of changing code in a production filesystem that many bytes of data are depending on...

I wish that more people understood this.

-Dave.

