Filesystem block size
Posted Sep 20, 2007 15:37 UTC (Thu) by joib
In reply to: Filesystem block size
Parent article: LinuxConf.eu wrapup
Remember, the rule of thumb for random access to blocked devices is that access time and transfer time should be about the same. For current disks, half a megabyte appears to be the sweet spot (maybe on some large terabyte disks even one megabyte);
Oracle recommends a large stripe size for RAID 10 arrays, largely based on a similar argument.
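As a back-of-the-envelope check of that rule of thumb, here is a minimal sketch. The drive figures (about 8 ms average access time, 60 MB/s sustained transfer) are illustrative assumptions, not numbers from this thread, and the function names are made up for the example.

```python
def break_even_block_size(access_time_s, transfer_rate_bytes_per_s):
    """Block size at which transfer time equals access (seek + rotation) time."""
    return access_time_s * transfer_rate_bytes_per_s


def transfer_efficiency(block_size_bytes, access_time_s, transfer_rate_bytes_per_s):
    """Fraction of each random I/O spent actually moving data."""
    transfer_time_s = block_size_bytes / transfer_rate_bytes_per_s
    return transfer_time_s / (access_time_s + transfer_time_s)


# Assumed, illustrative drive parameters (not measured values).
ACCESS_TIME_S = 0.008        # ~8 ms average access time
TRANSFER_RATE = 60e6         # ~60 MB/s sustained transfer

if __name__ == "__main__":
    size = break_even_block_size(ACCESS_TIME_S, TRANSFER_RATE)
    print(f"break-even block size: {size / 1e3:.0f} kB")
    for block in (4096, 65536, int(size)):
        eff = transfer_efficiency(block, ACCESS_TIME_S, TRANSFER_RATE)
        print(f"{block:>8} bytes: {eff:.1%} of the time spent transferring")
```

Under these assumptions the break-even point lands near half a megabyte, and a random 4 KB read spends under 1% of its time transferring data. The same arithmetic applies whether the large unit is a filesystem block, a stripe, or a run of small blocks read together.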
Linux so far limited block size to 4k, and only has recently allowed filesystems to use other block sizes, too; mostly to mount medias formatted on platforms with larger blocks (like the 8k blocks on IA64).
Patches allowing block sizes of up to 64 KB for ext2/3/4 were posted about a year ago; I don't know whether they have been merged yet.
A new file system therefore should be able to put data together in blocks that will be able to grow in future as well (transfer rate increases with the square root of density or capacity, while seek time can be assumed to be almost constant). It needs to take care that data can be packed into these larger blocks - if you have half a megabyte as minimum transfer unit, you don't put a directory into one block, but a whole directory subtree, maybe with associated inodes and everything.
I think you have identified the correct disease (seek time remaining roughly constant vs. everything else improving), but I'm not convinced your cure is the correct one.
Consider what is already being done today (some filesystems, such as XFS, have had many of these features for a while): rather than using huge block sizes, allocate many blocks at once (delayed allocation and multiblock allocation (mballoc), apparently making their way into the VFS layer) and use extents to describe runs of adjacent blocks (most filesystems do this, except for ext2/3). For small files, readahead takes care of reading many blocks once the heads have moved to a new spot.
You also put a bunch of small files into one single block, as well.
Reiser did a related thing (tail packing), but apparently the implementation is considered pretty complicated. In any case, the issue is not large vs. small block size, but rather block allocation. With mballoc and extents, small block filesystems get most of the advantage that large blocks have, without the complexity associated with tail packing.
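To make the extent point concrete, here is a minimal sketch that collapses a list of allocated block numbers into (start, length) extents. It is only an illustration of the idea, not the on-disk format or allocator of any particular filesystem, and the function name is made up for the example.

```python
def blocks_to_extents(blocks):
    """Collapse block numbers into a list of (start, length) extents.

    Adjacent block numbers are merged into a single extent, so a file
    laid out contiguously needs one entry no matter how many small
    blocks it occupies.
    """
    extents = []
    for block in sorted(set(blocks)):
        if extents and extents[-1][0] + extents[-1][1] == block:
            # Block directly follows the last extent: extend it by one.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap in the block numbers: start a new extent.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    # 128 contiguous 4 KB blocks (512 KB) plus two blocks elsewhere.
    layout = list(range(1000, 1128)) + [5000, 5001]
    print(blocks_to_extents(layout))
```

If the allocator manages to hand out blocks contiguously, which is what delayed and multiblock allocation aim for, the half-megabyte run above is described by a single entry and can be read in one sequential transfer, even though the block size stays at 4 KB.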
Essentially, as processors get faster and disks get larger, the behavior starts to shift. Main memory today reacts more like disks in the early days (cached block access instead of direct access), and disks tend to behave more like tapes (longer sequential access).
"Disk is the new tape" - Jim Gray