chunkfs

Posted Apr 26, 2007 22:03 UTC (Thu) by pimlott (guest, #1535)
Parent article: Filesystems: chunkfs and reiser4

The core idea is to take a filesystem and split it into several independent filesystems, each of which maintains its own clean/dirty state. Should things go wrong, only those sub-filesystems which were active at the time of failure need to be checked.
I'm not a filesystems hacker, but I think this article misses the real point of chunkfs. After all, the central problem is that you don't know when "things go wrong". Corruption occurs unpredictably, for any number of reasons, so you need to assume that it can happen anywhere, anytime. The exciting use of the dirty bit, as I understand, is the ability to on-line fsck the chunks that are presently not dirty. Granted, you can also fsck the dirty chunks after a system crash, but for modern filesystems this just requires journal replay, which is fast anyway. Though I suppose if a chunk is so active that it never gets fscked on-line, you want to full-fsck it whenever the filesystem is off-line.
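The per-chunk dirty bit described above can be sketched roughly as follows. This is only an illustration of the idea, not chunkfs code: the `Chunk` class, its fields, and both helper functions are hypothetical names invented for this example, and the real consistency check is reduced to setting a flag.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk of a chunkfs-style filesystem."""
    chunk_id: int
    dirty: bool = False    # True while the chunk is being modified
    checked: bool = False  # True once a check pass has covered the chunk


def chunks_safe_to_check_online(chunks):
    """Return the chunks that are not dirty and so could be checked while mounted."""
    return [c for c in chunks if not c.dirty]


def online_check_pass(chunks):
    """Check every idle chunk; return the ids of the dirty chunks that were skipped."""
    for c in chunks_safe_to_check_online(chunks):
        c.checked = True  # placeholder for a real consistency check
    return [c.chunk_id for c in chunks if c.dirty]


# Three chunks, one of which is busy at the time of the pass.
chunks = [Chunk(0), Chunk(1, dirty=True), Chunk(2)]
skipped = online_check_pass(chunks)
```

The skipped list is what would be left for a later pass, or for a full check the next time the filesystem is off-line.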

The next step is writing data with checksums or even error-correcting codes. But the real solution, for those who are serious about data integrity, is end-to-end checksums or ECC; i.e., assure the integrity of the data from the moment it is created to the moment it is consumed.
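A minimal sketch of what an end-to-end checksum means in practice, assuming SHA-256 as the checksum and a simple "digest followed by payload" record layout. Both choices, and the `seal`/`unseal` names, are assumptions made for this example; nothing here is specific to chunkfs or any real filesystem.

```python
import hashlib

DIGEST_LEN = 32  # size of a SHA-256 digest in bytes


def seal(payload: bytes) -> bytes:
    """Producer side: prepend a SHA-256 digest to the payload."""
    return hashlib.sha256(payload).digest() + payload


def unseal(record: bytes) -> bytes:
    """Consumer side: verify the digest, then return the payload or raise."""
    digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data changed after it was sealed")
    return payload


record = seal(b"important data")
restored = unseal(record)

# Simulate a single flipped bit somewhere between producer and consumer.
damaged = bytearray(record)
damaged[-1] ^= 0x01
try:
    unseal(bytes(damaged))
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

The point of doing this at the endpoints is that it does not matter which layer in between (disk, controller, filesystem, network) damaged the data; the consumer notices either way. A plain checksum only detects the damage, whereas ECC could also repair it.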



chunkfs

Posted Apr 27, 2007 4:39 UTC (Fri) by dlang (✭ supporter ✭, #313)

I agree that the paragraph you mention misses the point of chunkfs. it's not that you avoid doing the fsck on some chunks, it's that you don't have to try and track the state of the entire filesystem at once so the check is faster. if some chunks haven't been modified since they were last chaned (and can therefore be clean) that just speeds up the search

yes, failing drives can corrupt the filesystem independently of this, but even with checksums you won't find this sort of corruption until you go looking for it (by trying to access the data).

chunkfs isn't trying to address this sort of low-level problem, it's working at a higher level. there's no reason that chunkfs couldn't be integrated into any filesystem and provide approximately the same benefits for all of them. the initial proof of concept implementation is being done on ext2, not because it's the best low-level filesystem, but because it's the easiest to implement.

if things work out as hoped with the ext2 implementation I'm willing to bet that something very similar will start appearing as an option for other filesystems.

chunkfs

Posted Apr 27, 2007 5:31 UTC (Fri) by pimlott (guest, #1535)

I'm not sure if I followed exactly what you meant in some places, so let me try and ask you to correct me.
it's not that you avoid doing the fsck on some chunks, it's that you don't have to try and track the state of the entire filesystem at once so the check is faster.
Are you saying here that the total fsck time of all N chunks is less than the fsck time of a normal filesystem? That's probably true, but it doesn't solve the fundamental scaling problem of full partition fsck (does it?).
if some chunks haven't been modified since they were last chaned (and can therefore be clean) that just speeds up the search
Is "chaned" meant to be "changed"? "haven't been modified since they were last changed" doesn't make sense, however. Do you mean that if a chunk has not even been touched since the last fsck, it doesn't need to be fscked again? Well ok, but I doubt that's going to be a common case. If you mean that chunks that are not dirty after a crash don't need to be fscked, that only applies to a non-journaling filesystem (does anyone still use those?).

The premise, I thought, was that the whole filesystem needs to be fscked from time to time, because stuff happens. No matter how clever you are, stopping the world to check the whole partition will not scale. So on-line fsck of chunks looks like the real win to me.

chunkfs

Posted Apr 27, 2007 8:05 UTC (Fri) by dlang (✭ supporter ✭, #313)

sorry for the sloppy typing, I was in too much of a hurry to finish the post

yes, the time to fsck 100 x 10GB filesystems is expected to be significantly less than the time to fsck 1 x 1TB filesystem

in part this is also a memory-limited operation: when doing the fsck you need to remember all the files that you have seen to see if any of the others overlap them. besides reducing n during the O(n^2) portion by 100x (even though you have to do it 100 times) you also drastically reduce the amount of RAM needed, avoiding swap or other low-memory conditions
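The arithmetic behind that claim can be worked through with a toy cost model. This takes the O(n^2) description above at face value and assumes objects are spread evenly across chunks; the object count and both function names are made up for the example, and none of this is a measurement of a real fsck.

```python
def pairwise_cost(n: int) -> int:
    """Toy model: an O(n^2) step over n objects costs n * n units of work."""
    return n * n


def chunked_cost(n: int, chunks: int) -> int:
    """Total cost when the n objects are split evenly and each chunk is checked alone."""
    per_chunk = n // chunks
    return chunks * pairwise_cost(per_chunk)


n = 1_000_000  # hypothetical number of objects on the whole filesystem
k = 100        # number of chunks, as in the 100 x 10GB example

whole = pairwise_cost(n)    # one big check
split = chunked_cost(n, k)  # k small checks, run one after another
speedup = whole // split    # works out to k under this model

# Peak working set: the whole filesystem versus one chunk at a time.
peak_objects_whole = n
peak_objects_split = n // k
```

Under this model the total work drops by a factor of k even though the check runs k times, because k * (n/k)^2 = n^2 / k, and only one chunk's worth of state has to be held in memory at once.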

yes, you do need to do a fsck from time to time, but under a crash condition you may be able to skip doing one for the chunks that have not been changed and so are still marked clean. and yes, people do still use non-journaling filesystems. when you journal you end up doing lots of writes twice (and a lot of seeking between the journal write and the final write). if you have lots of extra disk bandwidth you may be able to afford to do this, but if you don't have the extra disk bandwidth your entire system will slow down while the journal is flushed. for some applications (data capture for example) this isn't acceptable

in addition, you can do a check of a few chunks every boot rather than doing a check of everything every 30 boots (spreading the maintenance pain over time rather than having it hit in one massive chunk)
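One way the "few chunks every boot" idea could be scheduled is a simple rotation. This is a sketch under assumptions: the `chunks_to_check` function, the boot counter, and the numbers used are all hypothetical, and a real implementation might well choose chunks differently (for example, by time since last check).

```python
def chunks_to_check(boot_number, total_chunks, per_boot):
    """Pick `per_boot` chunk ids for this boot, rotating through all chunks in order."""
    start = (boot_number * per_boot) % total_chunks
    return [(start + i) % total_chunks for i in range(per_boot)]


total_chunks = 100  # hypothetical chunk count
per_boot = 4        # hypothetical number of chunks checked at each boot

first_boot = chunks_to_check(0, total_chunks, per_boot)
second_boot = chunks_to_check(1, total_chunks, per_boot)

# Track which chunks have been checked after 25 consecutive boots.
covered = set()
for boot in range(25):
    covered.update(chunks_to_check(boot, total_chunks, per_boot))
```

With these numbers every chunk gets checked once per 25 boots, but no single boot has to pay for more than four chunks.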

did I do a better job of explaining it this time?

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds