
A better ext4

By Jonathan Corbet
January 23, 2008
Last week's Kernel Page may have been filesystem-heavy, but there was still a big omission: ext4. As the successor to ext3, ext4 may well be the filesystem many of us are using a few years from now. Things have been relatively quiet on that front - at least, outside of the relevant mailing lists - but the ext4 developers have not been idle. Some of their work has now come to the surface with Ted Ts'o's posting of the ext4 merge plans for 2.6.25.

One of the changes going into ext4 is a lifting of the longstanding 4KB block size limit. That does not mean that just any block size works, though, and this feature will benefit fewer people than one might think, for one specific reason: the block size must still be no larger than the page size on the host system. So those of us running x86 systems with 4KB pages will still be stuck with 4KB blocks. And, on any system, the maximum block size is now 64KB.
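
As a rough illustration of the page-size constraint, here is a tiny userspace sketch (purely illustrative; the real check happens in the kernel's mount path) that asks whether a candidate block size could be used on the running system:

    /* Minimal sketch: check a candidate ext4 block size against the
     * constraints described above (at most 64KB, and no larger than the
     * host page size). Purely illustrative. */
    #include <stdio.h>
    #include <unistd.h>

    static int block_size_usable(long block_size)
    {
        long page_size = sysconf(_SC_PAGESIZE);

        return block_size >= 1024 &&          /* ext4's minimum block size */
               block_size <= 64 * 1024 &&     /* the new ext4 maximum */
               block_size <= page_size;       /* the host page-size limit */
    }

    int main(void)
    {
        printf("64KB blocks usable here: %s\n",
               block_size_usable(64 * 1024) ? "yes" : "no");
        return 0;
    }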

One amusing effect of this change is that the size of a directory entry can now be as large as 64KB as well. But the field which holds the size of directory entries is only 16 bits wide. So a special hack has been employed to recognize 64KB directory entries and keep everything consistent.
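
The trick, roughly, is to reserve a sentinel value: since a 16-bit field cannot hold 65536, a record length of 0xFFFF on disk is taken to mean "this entry fills the whole 64KB block." Here is a hedged sketch of that encoding (helper names are made up; the kernel's actual helpers differ in detail):

    /* Sketch of squeezing a 64KB directory-entry length into a 16-bit
     * on-disk field by reserving 0xFFFF to mean "the whole block."
     * Loosely modeled on ext4's approach; names are illustrative. */
    #include <assert.h>
    #include <stdint.h>

    #define MAX_REC_LEN 0xFFFFu

    static uint16_t rec_len_to_disk(uint32_t len, uint32_t block_size)
    {
        if (len == 64 * 1024 && block_size == 64 * 1024)
            return MAX_REC_LEN;              /* sentinel: whole 64KB block */
        assert(len < 64 * 1024);
        return (uint16_t)len;
    }

    static uint32_t rec_len_from_disk(uint16_t dlen, uint32_t block_size)
    {
        if (dlen == MAX_REC_LEN)
            return block_size;               /* a 64KB entry spanning the block */
        return dlen;
    }

    int main(void)
    {
        assert(rec_len_from_disk(rec_len_to_disk(65536, 65536), 65536) == 65536);
        assert(rec_len_from_disk(rec_len_to_disk(280, 65536), 65536) == 280);
        return 0;
    }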

Some internal variables have overflow problems as well. Block group numbers, for example, are stored as signed, 32-bit quantities; that limits the maximum size of a filesystem to a mere 256PB. In 2.6.25, these values will become unsigned long variables, eliminating that intolerably low limit. Through some trickery, the inode field which stores the number of blocks associated with a file will be expanded to 48 bits, raising the maximum size of an individual file to just under 2^48 512-byte blocks.
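
For the curious, the arithmetic behind those limits works out as follows (a throwaway illustration assuming 4KB blocks and the usual one-block block bitmap per group, which gives 32,768 blocks per group):

    /* Back-of-the-envelope arithmetic for the limits mentioned above.
     * Assumes 4KB blocks and 32,768 blocks per block group. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long block_size       = 4096ULL;
        unsigned long long blocks_per_group = 8 * block_size;   /* 32768 */
        unsigned long long max_groups       = 1ULL << 31;       /* signed 32-bit group numbers */
        unsigned long long max_fs_bytes     = max_groups * blocks_per_group * block_size;

        unsigned long long max_file_sectors = 1ULL << 48;       /* 48-bit count of 512-byte units */
        unsigned long long max_file_bytes   = max_file_sectors * 512ULL;

        printf("old filesystem-size ceiling:  %llu PiB\n", max_fs_bytes >> 50);    /* 256 */
        printf("48-bit i_blocks file ceiling: %llu PiB\n", max_file_bytes >> 50);  /* 128 */
        return 0;
    }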

The work does not stop there, though: another patch redefines that field to mean the number of filesystem blocks (instead of 512-byte sectors) used by the file. This is a change which has to be handled carefully, since it is an on-disk format change which could create trouble for people with existing ext4 filesystems. Everybody who is using ext4 should certainly be doing so with the knowledge that it's a development filesystem and is only suitable for storing files which are not valuable for more than about 30 minutes - Rawhide OpenOffice.org updates, say. But it still would be nice to not trash every existing ext4 filesystem out there. So the i_blocks field will continue, by default, to hold the number of 512-byte blocks. But, if that field exceeds 32 bits and forces the use of 48-bit numbers, it is thereafter interpreted as filesystem blocks. Since no existing filesystems are yet using 48-bit numbers, this approach successfully avoids breaking them.

Journal checksums are another feature arriving for 2.6.25. If the system crashes, the journal is used to recover any transactions which were committed, but which did not actually make it to disk. It sure would be nice to know that the journal, as stored in the filesystem, is intact before using it to make changes elsewhere. The checksum enables the filesystem to ensure that the journal is good and avoid (further) corrupting the filesystem if it is not. An interesting side benefit is that the checksum loosens the constraints on how the journal is written to disk, since an incompletely-written journal will now be detected; that should help to improve filesystem performance slightly.
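
The mechanics are simple in outline: checksum a transaction's blocks at commit time, store the result in the commit record, and verify it before replaying the transaction. A simplified userspace illustration of the idea, using zlib's crc32() (the kernel's jbd2 code naturally differs in detail):

    /* Toy illustration of journal checksumming: CRC the blocks of a
     * transaction at commit time, re-check before replay. Uses zlib's
     * crc32(); the real jbd2 implementation differs in detail. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define BLOCK_SIZE 4096
    #define NBLOCKS    4

    static uint32_t checksum_transaction(unsigned char blocks[][BLOCK_SIZE], int n)
    {
        uLong crc = crc32(0L, Z_NULL, 0);
        for (int i = 0; i < n; i++)
            crc = crc32(crc, blocks[i], BLOCK_SIZE);
        return (uint32_t)crc;
    }

    int main(void)
    {
        static unsigned char journal[NBLOCKS][BLOCK_SIZE];
        memset(journal, 0xab, sizeof(journal));

        uint32_t committed = checksum_transaction(journal, NBLOCKS); /* stored in the commit block */

        /* ... crash, reboot, recovery ... */
        uint32_t recomputed = checksum_transaction(journal, NBLOCKS);
        if (recomputed != committed)
            puts("transaction incomplete or corrupt: do not replay it");
        else
            puts("transaction intact: safe to replay");
        return 0;
    }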

Note that full data checksumming is still not on the agenda for ext4. But checksumming the journal is a good (if small) step in the right direction.

Another change is a VFS API change, in that it turns the i_version field of the inode structure into an unsigned, 64-bit value on all architectures. This version number is incremented when the file is changed, and it's stored (split into two fields) in the on-disk inode. 64-bit version numbers are required by NFSv4, which uses them to provide the dreaded "stale file handle" error when things change.
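
Carrying a 64-bit counter in two 32-bit on-disk fields is mechanical enough; here is a hedged sketch (the structure and field names are invented, not ext4's actual on-disk layout):

    /* Sketch of storing a 64-bit version counter split across two 32-bit
     * on-disk fields. Names are hypothetical. */
    #include <assert.h>
    #include <stdint.h>

    struct toy_disk_inode {
        uint32_t version_lo;
        uint32_t version_hi;
    };

    static void store_version(struct toy_disk_inode *di, uint64_t v)
    {
        di->version_lo = (uint32_t)v;
        di->version_hi = (uint32_t)(v >> 32);
    }

    static uint64_t load_version(const struct toy_disk_inode *di)
    {
        return ((uint64_t)di->version_hi << 32) | di->version_lo;
    }

    int main(void)
    {
        struct toy_disk_inode di;

        store_version(&di, 0x123456789abcdef0ULL);
        assert(load_version(&di) == 0x123456789abcdef0ULL);
        return 0;
    }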

There is a new ioctl() (EXT4_IOC_MIGRATE) which can be used to explicitly request that the on-disk inode for a file be converted to the ext4 format.
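
Usage would presumably look something like the sketch below. Note that EXT4_IOC_MIGRATE lives in the kernel's ext4 headers rather than in an exported userspace header, so the definition here is a hand-copied placeholder that should be checked against your kernel sources:

    /* Hypothetical example of asking ext4 to convert a file's inode to the
     * new (extent-based) format via the migration ioctl. The request value
     * below is a placeholder; verify it against the kernel's ext4 headers. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/ioctl.h>

    #ifndef EXT4_IOC_MIGRATE
    #define EXT4_IOC_MIGRATE _IO('f', 9)   /* placeholder: check the fs/ext4 headers */
    #endif

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file-on-ext4>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, EXT4_IOC_MIGRATE) < 0)
            perror("EXT4_IOC_MIGRATE");
        close(fd);
        return 0;
    }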

The ext4 filesystem is extent-based, and has been for some time. "Extent-based" means that it tracks block allocations by extents (first block, number of blocks) rather than storing pointers to each individual block, as is done in ext3. There are a number of performance benefits to doing things this way, especially for larger files. Those benefits evaporate, though, if a file's blocks are scattered in a way that cannot be described by a small number of extents.
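
An extent record is a small, fixed-size thing; the structure below is a simplified stand-in, loosely modeled on ext4's on-disk extent (which splits the starting physical block number into low and high parts), and not the exact kernel layout:

    /* Simplified extent record: a run of contiguous blocks described by its
     * first logical block, its length, and its starting physical block.
     * Loosely modeled on ext4's on-disk extent, but not the exact layout. */
    #include <stdint.h>
    #include <stdio.h>

    struct toy_extent {
        uint32_t logical_block;   /* first file-relative block covered      */
        uint16_t length;          /* number of contiguous blocks            */
        uint16_t physical_hi;     /* high bits of the starting disk block   */
        uint32_t physical_lo;     /* low 32 bits of the starting disk block */
    };

    static uint64_t extent_physical_start(const struct toy_extent *ex)
    {
        return ((uint64_t)ex->physical_hi << 32) | ex->physical_lo;
    }

    int main(void)
    {
        /* One small record can describe, say, 32768 contiguous 4KB blocks
         * (128MB of file data) that would require 32768 separate block
         * pointers in ext3. */
        struct toy_extent ex = { .logical_block = 0, .length = 32768,
                                 .physical_hi = 0, .physical_lo = 1000000 };

        printf("extent: %u blocks starting at disk block %llu\n",
               (unsigned)ex.length, (unsigned long long)extent_physical_start(&ex));
        return 0;
    }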

One technique which greatly helps in optimizing block allocations for files is to allocate them in relatively large groups, rather than individually. In 2.6.25, ext4 will contain the multi-block allocator, which does exactly that. One might think that allocating a few blocks at a time would not be that big of a change, but the multi-block allocator is by far the most complex patch in the set. A lot of effort and heuristics go into deciding how many blocks to allocate, finding the optimal set of blocks, tracking the allocation, recovering blocks which end up never being used, ensuring that an application cannot read pre-allocated (but unwritten) blocks in search of leaked secrets, etc. It is quite a bit of code, but it is worth the trouble; multi-block allocation will be enabled by default in 2.6.25.
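
The core idea can be caricatured in a few lines: reserve a generous contiguous run of blocks up front, satisfy allocations from it as the file grows, and hand back whatever goes unused. The sketch below is only a conceptual toy with invented names; the real multi-block allocator (buddy bitmaps, per-file and per-group preallocation, locality groups) is far more involved:

    /* Conceptual toy of per-file preallocation: reserve a contiguous run,
     * hand blocks out of it, and return the unused tail at release time.
     * Every name here is invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    struct prealloc {
        uint64_t start;      /* first block of the reserved run        */
        uint64_t used;       /* blocks already handed out from the run */
        uint64_t reserved;   /* total blocks reserved                  */
    };

    static uint64_t next_free_block = 1000;   /* pretend free-space pointer */

    static void reserve_run(struct prealloc *pa, uint64_t want)
    {
        uint64_t goal = want * 8;             /* heuristically over-reserve */

        pa->start = next_free_block;
        pa->used = 0;
        pa->reserved = goal;
        next_free_block += goal;
    }

    static uint64_t alloc_block(struct prealloc *pa)
    {
        if (pa->used == pa->reserved)
            reserve_run(pa, 1);               /* run exhausted: reserve anew */
        return pa->start + pa->used++;
    }

    static void release_unused(struct prealloc *pa)
    {
        /* In a real filesystem the tail would go back to the free-space maps. */
        printf("returning %llu unused blocks starting at block %llu\n",
               (unsigned long long)(pa->reserved - pa->used),
               (unsigned long long)(pa->start + pa->used));
        pa->reserved = pa->used;
    }

    int main(void)
    {
        struct prealloc pa;

        reserve_run(&pa, 4);
        for (int i = 0; i < 10; i++)
            printf("allocated block %llu\n", (unsigned long long)alloc_block(&pa));
        release_unused(&pa);
        return 0;
    }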

As noted above, a number of these patches force changes to the on-disk data structure. According to Ted, though, these should be the last on-disk changes for ext4. There are some features which still will not have been merged when 2.6.25 comes around - delayed allocation and online defragmentation among them - but they should not require format changes. So ext4 is getting closer to the point where it is considered ready for production use.

It is not at that point yet, though, and people who use it are still doing so at their own risk. To help drive that point home, Ted has proposed a new mount flag (called test_fs) which communicates to the kernel the user's understanding that they are about to mount a developmental filesystem and will not go filing lawsuits if things go wrong. In the absence of this mount option, an ext4 filesystem will refuse to mount. One might think that child-proofing the filesystem in this way would not be necessary, but some extra care in this area can only be a good thing. Filesystem-related surprises are rarely welcome.
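
The gatekeeping itself is trivial to picture: unless the administrator explicitly passes the option, the mount is refused. A toy sketch of that logic (this is not the kernel's actual option-parsing code):

    /* Toy sketch of the "you must say test_fs" gate described above; the
     * real check would live in ext4's mount-option parsing in the kernel. */
    #include <stdio.h>
    #include <string.h>

    static int mount_allowed(const char *options)
    {
        return options && strstr(options, "test_fs") != NULL;
    }

    int main(void)
    {
        printf("-o defaults:          %s\n",
               mount_allowed("defaults") ? "mounted" : "refused (development filesystem)");
        printf("-o defaults,test_fs:  %s\n",
               mount_allowed("defaults,test_fs") ? "mounted" : "refused");
        return 0;
    }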



A better ext4

Posted Jan 24, 2008 5:27 UTC (Thu) by bfields (subscriber, #19510) [Link]

"Another change is a VFS API change, in that it turns the i_version field of the inode
structure into an unsigned, 64-bit value on all architectures. This version number is
incremented when the file is changed, and it's stored (split into two fields) in the on-disk
inode. 64-bit version numbers are required by NFSv4, which uses them to provide the dreaded
"stale file handle" error when things change."

No, you're thinking of the generation number.

The intent is to use the i_version field as what the NFSv4 spec calls the "change attribute",
which is used by the client to revalidate cached data: the client can check that this value is
still the same as it was when it last looked at the file, saving it from having to re-read it.

Currently we're using the ctime for that purpose, but it's not actually guaranteed to change
every time the file changes (especially not on ext3, which has timestamps with only 1 second
resolution!).

Seems like a number that might be worth exporting to userspace some day too--other
applications could also benefit from a quick, reliable way to determine whether a file has
changed since the last time they looked at it.

(The generation number--the i_generation field of the inode--is what's used to ensure that NFS
filehandles stay distinct even when inode numbers are reused.)

What about fsck times???

Posted Jan 24, 2008 12:58 UTC (Thu) by alonz (subscriber, #815) [Link]

One of the important issues discussed last week was fsck times - how current filesystems (including ext3) require long fsck times, a problem which will only get worse with the new, large disks (terabyte disks and disk arrays are fast becoming household items).

Are there no plans to address these issues in ext4?

What about fsck times???

Posted Jan 24, 2008 13:51 UTC (Thu) by corbet (editor, #1) [Link]

The metadata clustering proposal discussed last week would certainly be applicable to ext4. See also Val Henson's e2fsck patch, which I've not had a chance to look at yet.

What about fsck times???

Posted Jan 24, 2008 16:17 UTC (Thu) by alonz (subscriber, #815) [Link]

I'm not sure metadata clustering (and/or parallel fsck) will be sufficient to solve the fsck problems. Metadata clustering is more of an issue for ext3 (which has per-block metadata) than for ext4 (which has extents), and parallel fsck only helps when metadata is split over several disks (e.g. RAID), but not with single huge disks.

What I had hoped to see is something like (for example) Daniel Phillips' idea in http://lkml.org/lkml/2008/1/13/146: a small change to the ext4 metadata format (adding a reverse mapping), which might allow fsck to operate on-the-fly and far faster than it does now. This is (IMO) far more scalable than the other approaches, whose effects are at most linear.

What about fsck times???

Posted Jan 24, 2008 22:24 UTC (Thu) by i3839 (guest, #31386) [Link]

fsck is seek-time bound, so doing a parallel fsck would at least throw more I/O requests at the
hard disk, hopefully making it seek less on the whole. Or something like that.

What about fsck times???

Posted Jan 24, 2008 22:52 UTC (Thu) by nix (subscriber, #2304) [Link]

It seems to me that parallelizing, without substantial locking (-> 
serialization) or very clever design, is likely to *reduce* correlations 
between block requests and thus *increase* seeking substantially: and if 
you're adding enough locking to keep correlations between block requests, 
you can probably make the whole algorithm a serial one without much 
difficulty.

Certainly the naive implementations (e.g. multiple threads, one per block 
group) would hugely increase seeking. (There's lots of unavoidable 
serialization too: e.g. you can't carry out the passes all at once, or in 
a significantly different order.)

What about fsck times???

Posted Jan 25, 2008 19:48 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

It seems to me that parallelizing, without substantial locking (-> serialization) or very clever design, is likely to *reduce* correlations between block requests and thus *increase* seeking substantially

You just need enough threads. If there are 1000 threads, you'll have about 1000 requests in the device's queue at any time, and one of them is bound to be near where the head is now.

I don't know much about fsck; maybe the work can't be done with enough threads to do better than today's single thread.

What about fsck times???

Posted Jan 26, 2008 20:12 UTC (Sat) by nix (subscriber, #2304) [Link]

Yes, one of them will be in the right place: but the others will be all 
over the shop, and the device will have to satisfy those very soon as 
well, so as to free up space for more requests.

It seems to me that this is a quick route to death-by-seeking.

What about fsck times???

Posted Jan 26, 2008 22:09 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Yes, one of them will be in the right place: but the others will be all over the shop, and the device will have to satisfy those very soon as well, so as to free up space for more requests.

If space for requests is the issue, it doesn't matter which requests you satisfy -- they all return the same kind of space. You can ignore the distant requests as long as you want.

I thought you'd point out that the general purpose head scheduling algorithm stresses response time, so would satisfy those distant requests soon. But fsck doesn't need response time, and should use a head scheduling algorithm that satisfies the near requests first regardless. You can let 990 of those threads wallow while the disk cycles through requests from the 10 that happen to be working in the same region of the disk.

I would much rather see the device driver than the application doing head scheduling. Only the device driver really knows the seek issues.
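
The policy being described (service whatever is nearest the head, let distant requests wait) amounts to a greedy shortest-seek-first pick over the pending queue; a toy sketch:

    /* Toy shortest-seek-first scheduling: from a set of pending block
     * requests, always service the one nearest the current head position.
     * Purely illustrative of the policy discussed above. */
    #include <stdint.h>
    #include <stdio.h>

    static size_t pick_nearest(const uint64_t *pending, size_t n, uint64_t head)
    {
        size_t best = 0;
        uint64_t best_dist = (uint64_t)-1;

        for (size_t i = 0; i < n; i++) {
            uint64_t d = pending[i] > head ? pending[i] - head : head - pending[i];
            if (d < best_dist) {
                best_dist = d;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        uint64_t pending[] = { 900000, 1200, 1500, 450000, 1600 };
        size_t n = sizeof(pending) / sizeof(pending[0]);
        uint64_t head = 1000;

        while (n > 0) {
            size_t i = pick_nearest(pending, n, head);

            printf("seek %llu -> %llu\n",
                   (unsigned long long)head, (unsigned long long)pending[i]);
            head = pending[i];
            pending[i] = pending[--n];   /* drop the serviced request */
        }
        return 0;
    }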

What about fsck times???

Posted Jan 27, 2008 15:16 UTC (Sun) by nix (subscriber, #2304) [Link]

It's not just that the general-purpose head scheduling algorithm stresses 
response time: it's also that it has finite storage, so will *have* to 
satisfy even distant requests fairly fast if it's not to starve userspace 
of memory (at least, this is true with the volume of distant requests a 
naively-threaded fsck would generate).

Also, it doesn't `really know' the seek issues. The only circumstance in 
which it knows *any* more about seek issues than fsck is if you're fscking 
a device atop a dm or md device: and in both of those cases the simple 
rule of thumb `issue contiguous requests' will provide far more speedup 
than the block scheduler ever could wring out of a pile of non-contiguous 
requests.

The only thing that really knows the seek issues is the drive controller 
itself (which is possibly remote from the machine doing the fscking), and 
it *really* doesn't have much ability to accumulate large numbers of 
requests and answer them out of order.

The best approach for speed here remains `issue contiguous requests', 
which is of course interesting to balance with fsck memory consumption... 
of course Val knows all of this far better than do I, and now that I look
at the patch, she's catered for this, with I/O parallelized such that
individual block devices *don't* see heaps of seeks.

A better ext4

Posted Jan 24, 2008 15:33 UTC (Thu) by sayler (guest, #3164) [Link]

"One amusing effect of this change is that the size of a directory entry can now be as large
as 64KB as well. But the field which holds the size of directory entries is only 16 bits wide.
So a special hack has been employed to recognize 64KB directory entries and keep everything
consistent."

Huh?

64K byte directory entries

Posted Jan 29, 2008 3:25 UTC (Tue) by xoddam (subscriber, #2322) [Link]

modulo 2^16, 64K is equivalent to zero.

A better ext4

Posted Mar 15, 2008 18:39 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

“But, if that field exceeds 32 bits and forces the use of 48-bit numbers, it is thereafter
interpreted as filesystem blocks. Since no existing filesystems are yet using 48-bit numbers,
this approach successfully avoids breaking them.”

Reading the patches I can find on LKML, this description appears to be wrong.

Instead what's going on is...

* If the file is small enough for a 32-bit sector count, the old 32-bit field is used, the new
16-bit field is set to zero, and the EXT4_HUGE_FILE_FL inode flag is reset if present
* If the file is small enough for a 48-bit sector count, the old 32-bit and new 16-bit field
are used together to store it, and again the EXT4_HUGE_FILE_FL inode flag is reset if present
* If the file is bigger, the count is converted to a block count, stored using both the 32-bit
and 16-bit fields together and the EXT4_HUGE_FILE_FL inode flag is set, as well as a "huge
file" filesystem-wide compatibility flag.

This way of doing things ensures that no previous file could possibly be affected, because the
flag was created for this patch, and without that flag such a colossal file can't be stored on
ext4 at all. It also ensures that older versions of ext4 can't write to a filesystem
containing a huge file they don't understand, and that if they're used to mount the filesystem
read-only, they can't read the affected file (since otherwise they'd find it drastically short
of disk blocks).
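
A hedged sketch of the decode side of that scheme, following the cases laid out above (the field and flag names below are simplified stand-ins; consult the actual patches for the real on-disk layout):

    /* Sketch of interpreting the i_blocks fields per the cases above: with
     * the huge-file flag clear, the (32+16)-bit count is in 512-byte units;
     * with it set, the count is in filesystem blocks. Names are stand-ins. */
    #include <stdint.h>
    #include <stdio.h>

    #define HUGE_FILE_FL 0x1                 /* stand-in for EXT4_HUGE_FILE_FL */

    struct toy_inode {
        uint32_t blocks_lo;                  /* the old 32-bit field */
        uint16_t blocks_hi;                  /* the new 16-bit field */
        uint32_t flags;
    };

    static uint64_t inode_bytes_used(const struct toy_inode *inode, uint32_t block_size)
    {
        uint64_t count = ((uint64_t)inode->blocks_hi << 32) | inode->blocks_lo;

        if (inode->flags & HUGE_FILE_FL)
            return count * block_size;       /* count is in filesystem blocks */
        return count * 512;                  /* count is in 512-byte units    */
    }

    int main(void)
    {
        struct toy_inode small = { .blocks_lo = 8, .blocks_hi = 0, .flags = 0 };
        struct toy_inode huge  = { .blocks_lo = 0, .blocks_hi = 0x2000,
                                   .flags = HUGE_FILE_FL };

        printf("small file uses %llu bytes\n",
               (unsigned long long)inode_bytes_used(&small, 4096));
        printf("huge file uses  %llu bytes\n",
               (unsigned long long)inode_bytes_used(&huge, 4096));
        return 0;
    }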

Fast fsck by hand and by feature

Posted Apr 17, 2008 23:11 UTC (Thu) by Milan (guest, #26716) [Link]

One should pay attention to how many inodes are created on an ext3 filesystem: by reducing the
number of inodes, fsck time is reduced too. There is, however, no way to reduce or increase the
number of inodes without reformatting the filesystem, so one must figure out how many inodes to
create for one's own use and then do a backup/reformat/restore.

Ext4 will have the ability to know which inodes are unused, so fsck should be fast even if
zillions of unused inodes are wasting space on your not-tuned filesystem.
