Ext3 for large filesystems
Ext3 is based on decades of experience with Unix filesystems. As a result, it is relatively straightforward to understand and highly reliable in its operation. It is, however, also showing its age in a number of ways. One of those is the maximum size of the underlying device it can handle. This limit is a mere 8 TB. That is enough to hold most of our mail spools - even before spam filtering - but it is a limit which is already affecting some users. With the size of contemporary disks, the creation of an 8 TB array is not an entirely outlandish thing to do now, and it will only become easier over time.
There are a couple of reasons for this limit. One of them is the use of 32-bit block numbers within the filesystem - and signed 32-bit numbers at that. The ext3 code can only track 2 gigablocks, which, using a 4K block size, sets the limit at 8 TB. Switching to an unsigned type can double that limit, but that only pushes back the problem by about one year. Clearly, larger block numbers are required.
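Where the 8 TB figure comes from is simple arithmetic; here is a quick sketch (ordinary userspace C, not kernel code):

    /* Sketch of the arithmetic behind the 8 TB limit: a signed
     * 32-bit block number can index at most 2^31 blocks. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long max_blocks = 1ULL << 31;  /* signed 32-bit limit */
        unsigned long long block_size = 4096;        /* 4K blocks */
        unsigned long long bytes = max_blocks * block_size;

        /* 2^31 blocks * 2^12 bytes/block = 2^43 bytes */
        printf("max filesystem size: %llu TB\n", bytes >> 40);  /* prints 8 */
        /* an unsigned 32-bit block number would double this to 16 TB */
        return 0;
    }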
The other problem has to do with how ext3 tracks the blocks associated with any given file. The ext3 inode structure contains an array of fifteen 32-bit pointers; the first twelve of those pointers contain the indexes of the first twelve blocks of the file. Thus, with a filesystem using 4K blocks, the first twelve pointers can describe a file of up to 48KB in length. If the file exceeds that length, an "indirect block" is created. This block is a big array of block pointers, holding the indexes for the next 1024 blocks; the 13th pointer in the inode structure tracks the location of this indirect block. Should that space not suffice, the 14th pointer is used for a double-indirect block - a block holding pointers to indirect blocks. Finally, the 15th pointer will be used for a triple-indirect block if need be.
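As an illustration (a simplified sketch, not the kernel's actual mapping code, which works in terms of EXT3_NDIR_BLOCKS and EXT3_ADDR_PER_BLOCK()), here is how a logical block number falls into the direct and indirect levels with 4K blocks, and where the resulting file-size ceiling lies:

    /* Simplified sketch of the classic direct/indirect scheme with
     * 4K blocks and 32-bit block pointers. */
    #define NDIR_BLOCKS    12              /* direct pointers in the inode */
    #define PTRS_PER_BLOCK (4096 / 4)      /* 1024 pointers per 4K block */

    /* How many levels of indirection are needed to reach logical block n? */
    static int block_levels(unsigned long long n)
    {
        if (n < NDIR_BLOCKS)
            return 0;                      /* direct pointer */
        n -= NDIR_BLOCKS;
        if (n < PTRS_PER_BLOCK)
            return 1;                      /* single indirect */
        n -= PTRS_PER_BLOCK;
        if (n < (unsigned long long)PTRS_PER_BLOCK * PTRS_PER_BLOCK)
            return 2;                      /* double indirect */
        return 3;                          /* triple indirect */
    }

    /* Ceiling: (12 + 1024 + 1024^2 + 1024^3) * 4KB, a bit over 4 TB. */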
This arrangement is not too different from how Unix systems structured filesystems two decades or more ago. It imposes a per-file maximum size of about 4 TB - big, but perhaps limiting for today's hot applications (such as comprehensive, nationwide telephone call archival). It works well for small files but, as files get larger, this organization becomes increasingly inefficient. Keeping a pointer to every single block is expensive, both in terms of space usage and the time it can take to locate a specific file block. Since larger filesystems will tend to hold larger files, this overhead becomes increasingly limiting over time.
A solution to these problems can be found in the extents and 48-bit support patch set. These patches have been posted by Mingming Cao; many other developers - especially Alex Tomas - have worked on them as well. They change the way files are stored to make things more efficient, and to allow the filesystem to index the blocks on larger devices.
The core of the patch is the support for extents. An extent is simply a set of blocks which are logically contiguous within the file and also on the underlying block device. Most contemporary filesystems put considerable effort into allocating contiguous blocks for files as a way of making I/O operations faster, so blocks which are logically contiguous within the file often are also contiguous on-disk. As a result, storing the file structure as extents should result in significant compression of the file's metadata, since a single extent can replace a large number of block pointers. For a fully-contiguous 1GB file, for example, a handful of extents can stand in for the 262,144 block pointers (and a couple hundred indirect blocks) that the old scheme requires. The reduction in metadata should enable faster access as well.
An ext3 filesystem mounted with the extents option enabled will handle files stored in the old way, using block pointers, as always. New files will be created using extents, however. In these files, the fifteen-pointer array described above is overlaid with a new data structure. There is a short header, followed by a few occurrences of this structure:
    struct ext3_extent {
            __le32  ee_block;       /* first logical block extent covers */
            __le16  ee_len;         /* number of blocks covered by extent */
            __le16  ee_start_hi;    /* high 16 bits of physical block */
            __le32  ee_start;       /* low 32 bits of physical block */
    };
Here, ee_block is the index (within the file, not on disk) of the first block covered by this extent. The number of blocks in the extent is stored in ee_len, and the pointer to the first of those blocks (on disk, now) lives in the combination of ee_start and ee_start_hi. By storing physical block numbers this way, ext3 can handle 48-bit block numbers - enough to index a 1024 PB device. That should be enough to last for a couple years or so.
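A sketch of how the two fields combine, and of the arithmetic behind the 1024 PB figure (the helper name here is my own invention, not necessarily the patch's; le16_to_cpu() and le32_to_cpu() are the kernel's little-endian conversion helpers):

    /* Combine the split fields into a single 48-bit physical block
     * number. Illustrative only; the patch's own helper may differ. */
    static inline __u64 ext3_ext_physical_block(const struct ext3_extent *ex)
    {
            return ((__u64)le16_to_cpu(ex->ee_start_hi) << 32) |
                    le32_to_cpu(ex->ee_start);
    }

    /* Capacity: 2^48 blocks * 4KB/block = 2^60 bytes = 1024 PB. */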
For files with few extents, all of the information can be stored within the on-disk inode itself. As the number of extents grows, however, the available space runs out. In that case, a form of indirect blocks is used; the in-inode extents array describes ranges of blocks holding extents arrays of their own. The tree of indirect extents blocks can grow to an essentially unlimited depth, allowing the filesystem to represent even very large, highly-fragmented files.
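The on-disk structures behind the tree look roughly like the following; the field names shown are those this extents code eventually carried into ext4, so the ext3 patch set may differ in detail:

    /* Every extents block begins with a header saying what it holds. */
    struct ext3_extent_header {
            __le16  eh_magic;       /* identifies an extents block */
            __le16  eh_entries;     /* number of valid entries following */
            __le16  eh_max;         /* capacity of this block, in entries */
            __le16  eh_depth;       /* 0: entries are ext3_extent (leaf);
                                       >0: entries are ext3_extent_idx */
            __le32  eh_generation;  /* generation of the tree */
    };

    /* Interior-node entry: points to a lower-level extents block. */
    struct ext3_extent_idx {
            __le32  ei_block;       /* covers logical blocks from here on */
            __le32  ei_leaf;        /* low 32 bits of lower-level block */
            __le16  ei_leaf_hi;     /* high 16 bits of lower-level block */
            __le16  ei_unused;
    };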
Beyond extents, relatively little had to be done to prepare ext3 for 48-bit block addressing. The signed, 32-bit block numbers are gone, having been converted to the larger sector_t type. Some reserved space in the ext3 superblock has been grabbed to store the high 16 bits of some global block counts. Much of the tracking of free blocks within the filesystem is done using block numbers relative to the beginning of the block group, so that code did not need to change much at all. A few tweaks to the journaling code were required for it to be able to handle the larger block numbers.
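For reference, sector_t in 2.6 kernels is only 64 bits wide when "large block device" support is configured in; its definition is approximately:

    /* From include/linux/types.h (approximately): */
    #ifdef CONFIG_LBD
    typedef u64 sector_t;             /* 64-bit block/sector numbers */
    #else
    typedef unsigned long sector_t;   /* 32 bits on 32-bit systems */
    #endif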
The end result is an enhancement to the ext3 filesystem which enables it to work with much larger devices. Existing filesystems can use the new features immediately with no dump-and-restore cycle. It would appear to be (nearly) universally agreed that these changes turn ext3 into a better filesystem. Whether that better filesystem should still be called ext3 is controversial, but that is a subject for another article.
Ext3 for large filesystems
Posted Jun 15, 2006 15:29 UTC (Thu) by smoogen (subscriber, #97)

Actually, large numbers of cluster systems seem to run into the 4TB file size and 8TB filesystem limits these days. We had a project that needed approximately 32 petabytes for its raw data. I thought it was outlandish, but then got a couple of requests for 2-petabyte systems for genome research.

Ext3 for large filesystems
Posted Jun 15, 2006 16:58 UTC (Thu) by jzbiciak (guest, #5246)

Well, if storage requirements double every 18 months, then 1024 PB will be enough to hold them for just shy of a decade. :-)

Ext3 for large filesystems
Posted Mar 20, 2007 16:24 UTC (Tue) by BrucePerens (guest, #2510)

I think I've thrown out the 64 Megabyte 3.5 inch hard disks I bought in the early 80's. It doesn't feel that long ago.

Ext3 for large filesystems
Posted Mar 20, 2007 16:35 UTC (Tue) by jzbiciak (guest, #5246)

64MB in the early 80s? Those must have cost a pretty penny. I remember 5MB and 10MB disks costing several hundred dollars.

It still amazes me that a single desktop icon on a modern computer takes up about as much RAM as an entire video game console had back in the 80s.

Petabytes
Posted Jun 24, 2006 17:39 UTC (Sat) by man_ls (guest, #15091)

If you could expand on your post, I'm very curious about it. 2 petabytes? 32 petabytes? Where do you store that kind of information? For the former you would need something like 2048 teraservers; the power requirements alone are scary. How do you back it all up?

Petabytes
Posted Oct 15, 2006 15:55 UTC (Sun) by knan (subscriber, #3940)

"For the former you would need something like 2048 teraservers" ... or a few modern SAN storage arrays. There are 500TB off-the-shelf solutions available, and I'm sure the manufacturers will be happy to sell you bigger custom variants.

Petabytes
Posted Oct 15, 2006 21:32 UTC (Sun) by dlang (guest, #313)

Actually, you can get a petabyte of storage for ~$2m off-the-shelf that fits in ~4 racks consuming ~30KW of power: buy 40 Sun x4500 servers ( http://store.sun.com/CMTemplate/CEServlet?process=SunStor... ), each with 24TB of storage, 2 dual-core Opterons, and 16G of RAM.

So 2PB of storage is $4m in 8 racks (80 servers) consuming ~60KW. That's not too scary (although it is quite a bit).

Backups are an issue; my guess is that the backup is mirroring to another similar set.

Ext3 for large filesystems
Posted Jun 15, 2006 21:45 UTC (Thu) by xorbe (guest, #3165)

Wait a sec... what's the "ext" in ext2 and ext3 if they are just adding extents now??

Ext3 for large filesystems
Posted Jun 15, 2006 22:04 UTC (Thu) by Stephen_Beynon (guest, #4090)

This is from memory so I may be wrong...

In the beginning Linux used the Minix filesystem. This had many limitations, including (as I recall) 14-character filename limits. The EXTended filesystem was developed to address these problems. This was relatively rapidly replaced by the second extended filesystem (ext2); I do not recall what the problems with ext1 were.

Ext3 extended ext2 with new capabilities - most notably a journal. This considerably improves the resilience of the filesystem to unexpected power failures!

Ext3 for large filesystems
Posted Jun 16, 2006 0:33 UTC (Fri) by sbergman27 (guest, #10767)

Does this all mean that ext3/4 with these patches would handle 1024 PB on a 32-bit system? Or only on a 64-bit system? My understanding is that even XFS has a maximum filesystem size of only 16TB on 32-bit, even though it goes to 9 EB on 64-bit systems.

Ext3 for large filesystems
Posted Mar 12, 2007 15:38 UTC (Mon) by sandeen (guest, #42852)

> My understanding is that even XFS has a maximum filesystem size of only 16TB on 32-bit, even though it goes to 9 EB on 64-bit systems.

Though XFS is fully 64-bit capable on 32-bit machines, the 16T limit is OS-imposed. There is a 32-bit index into the (4k) page cache on x86, so 2^32 * 4096 = 16T is the maximum offset that can be addressed.

Ext3 for large filesystems
Posted Mar 20, 2007 16:32 UTC (Tue) by BrucePerens (guest, #2510)

Not too long ago, Unix systems replaced seek() with lseek(), and were able to handle longer offsets even though they were 16-bit machines. The word width of your CPU in no way limits the width of words upon which you can perform arithmetic. The new 48-bit field is not using compiler types wider than 32 bits for its implementation. They could use "long long", but I suppose that's less portable.

Bruce

ext3 block size
Posted Jun 19, 2006 21:47 UTC (Mon) by dmuino (guest, #6930)

Why is the ext3 max block size == 4k? Wouldn't just raising EXT3_MAX_BLOCK_SIZE at least temporarily alleviate this problem? Obviously the performance issues dealing with large files are still there, and extents is a feature that becomes more and more important, but if you need something now that gives you filesystems larger than 8TB this might be something to explore.

ext3 block size
Posted Jul 1, 2006 14:13 UTC (Sat) by riel (subscriber, #3142)

Raising the block size has a number of consequences for the memory management subsystem in the kernel, as well as the I/O subsystem. Furthermore, it would mean that even small files would take up an unreasonable amount of disk space...

ext3 block size
Posted Jul 1, 2006 18:26 UTC (Sat) by csnook (guest, #36935)

In the current block layer, the block size must be no larger than the page size. Since page size is 4k on nearly all architectures, 4k blocks are the sane way to go. I suppose if you gave Itanium boxes to a bunch of kernel developers to test on, they might be inclined to work on larger block sizes.

Ext3 for large filesystems
Posted Jul 7, 2006 4:00 UTC (Fri) by sitaram (guest, #5959)

Well, I waited long enough for someone else to remark on this...

"4 TB - big, but perhaps limiting for today's hot applications (such as comprehensive, nationwide telephone call archival)"

Once again, our editor's sense of humour hits home :-)

Extents
Posted Oct 13, 2006 20:28 UTC (Fri) by klog (guest, #19514)

Woohoo! IBM's "Old Iron" has worked using extents for 40 years...