
The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 15:26 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)
Parent article: The 2006 Linux Filesystems Workshop (Part III)

The continuation-inode idea is quite cute!!!

One question, though: is there some portion of the failure rate that is a function of the number of filesystems? For example, are there more superblocks and free-block maps to be corrupted?



The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 17:45 UTC (Thu) by piman (guest, #8957) [Link] (1 responses)

IANA filesystem developer, but it would seem to me the answer is no. The two causes of disk corruption are bugs (in the driver, kernel, filesystem, etc.) and hardware failure (disk or memory). In the case of a disk failure, you lose whatever data was corrupted regardless of whether you use one large filesystem or many small ones. In the case of memory failure or bugs, with many small filesystems you will only corrupt the ones being written to. So the many-small-filesystems approach offers an advantage here as well.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 20:40 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Two more failure modes: system crash (which loses whatever writes were in flight but not completed) and point-media failures on the disk platter.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 18:51 UTC (Thu) by arjan (subscriber, #36785) [Link] (9 responses)

So far in the analysis we haven't found any such reason.

One of the key things is that any of the "key" data for a filesystem (the superblock and such) can be duplicated many times, since it's tiny as a percentage of the fs and constant in size.
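
(As a rough illustration of "tiny as a percentage": the struct layout, replica count, and chunk size in the sketch below are invented purely for the arithmetic, not taken from the chunkfs draft.)

    /* Back-of-the-envelope only: how small the replicated "key" metadata
     * stays relative to a chunk.  The struct layout, replica count, and
     * chunk size are invented for this arithmetic, not chunkfs format. */
    #include <stdio.h>

    struct chunk_super {              /* hypothetical per-chunk superblock */
        unsigned long magic;
        unsigned long chunk_id;
        unsigned long free_blocks;
        unsigned long inode_count;
        unsigned char uuid[16];
    };

    int main(void)
    {
        const double chunk_bytes = 1024.0 * 1024 * 1024;   /* 1GB chunk  */
        const int    replicas    = 8;                      /* assumption */
        double meta_bytes = replicas * (double)sizeof(struct chunk_super);

        printf("replicated superblock data: %.0f bytes = %.6f%% of a chunk\n",
               meta_bytes, 100.0 * meta_bytes / chunk_bytes);
        return 0;
    }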

fwiw a document about chunkfs (work in progress) is at
http://www.fenrus.org/chunkfs.txt

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 0:14 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (8 responses)

Great document, interesting stuff!!! A few questions, as always...

(1) In the discussion of hard linking, it looked to me as if directories with links get replicated in the directory's chunk and in the hard-link destination's chunk. Is this the case, or am I confused (on second reading, it looks like only the directory entries linking into this chunk get replicated)? If that is the case, is there some sort of mutex that covers all chunks so that both replicas of the directory can be updated atomically?

(2) Is rename still atomic? In other words, is a single task guaranteed that if it sees the new name, a subsequent lookup won't see the old name and conversely?

(3) Is unlink still atomic?

(4) Does dcache know about the continuation inodes? (Can't see why it would need to, but had to ask...)

(5) Stupid question (just like the others, but this time I am admitting it up front!) -- for a multichunk file, why have the overhead information in the chunks that are fully consumed by their segment of the file? Why not just mark the chunk as being entirely data, and have some notation that indicates that the entire chunk is an extent? And is this enough heresy for one question, or should I try harder next time? ;-)

(6) Is intra-chunk compatibility with ext2/3 a goal?

(7) I am a bit concerned about the following sequence of events: (a) chunk zero is half full, with lots of smallish logfiles. (b) a large file is created, and starts in chunk 0 (perhaps one of the logfiles is wildly expanding or something). (c) the large file fills chunk 0 and expands to chunk 1 with a continuation inode. (d) each and every logfile expands, finds no space, and each thus needs its own continuation inode, violating the assumption that continuation inodes are rare. Or did I miss something here? If I am not too confused, one approach would be to detect large files and make them -only- have continuation inodes, with -no- data stored in a chunk shared with other files. How to detect this? Sorry, no clue!!!

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 6:35 UTC (Fri) by arjan (subscriber, #36785) [Link] (7 responses)

(1) Not so much replicated. If you think of a directory as a file-like linear stream (I know that's too simple, but readdir() sort of makes it so), what you'd do for a hard link is make a continuation inode for that stream in the chunk where the file resides, and continue the stream in that chunk, at least for the one dentry of the hard link. So there is no duplication or replication; it's continuation. (A toy sketch of this appears after these answers.)

(2) That's no different from how things are today.

(3) Same.

(4) No... it's a purely internal thing.

(5) That's not a stupid question; the one thing I've not written up is that, in principle, each chunk could have its own on-disk format variant. The "entire chunk is one file" variant was already on my list; another is "lots of small files".

(6) You mean ext2/3 layout within a chunk? Not a goal right now, although the plan is for the prototype to reuse ext2 for this. I don't want to be tied down to the exact ext2 format beforehand, though.

(7) Yes, something is needed there; the whole thing needs quite a good allocation strategy, probably including delayed allocation and so on.
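
To make answer (1) concrete, here is a toy, in-memory sketch of a directory as a stream that continues into the chunk holding a hard link's target. Everything in it is invented for illustration; it is not chunkfs code.

    /* A toy, in-memory model of a directory that starts in the chunk
     * holding its primary inode (chunk 0) and continues into chunk 1,
     * where a hard link's target lives.  All names and structures are
     * invented for illustration; this is not chunkfs code. */
    #include <stdio.h>

    #define NO_CONT -1

    struct dentry { const char *name; int target_chunk; };

    struct dir_piece {                /* one per-chunk piece of a directory */
        struct dentry entries[4];
        int nr_entries;
        int cont_chunk;               /* next piece's chunk, or NO_CONT */
    };

    /* pieces[i] is the directory piece stored in chunk i */
    static struct dir_piece pieces[2] = {
        { { { "notes.txt", 0 }, { "todo", 0 } }, 2, 1 },
        { { { "movie.avi", 1 } },                1, NO_CONT },
    };

    int main(void)
    {
        int chunk = 0;                /* start at the primary piece */

        while (chunk != NO_CONT) {
            struct dir_piece *p = &pieces[chunk];
            for (int i = 0; i < p->nr_entries; i++)
                printf("%-10s (data in chunk %d)\n",
                       p->entries[i].name, p->entries[i].target_chunk);
            chunk = p->cont_chunk;    /* readdir() follows the continuation */
        }
        return 0;
    }

Note that reading the whole directory has to walk every chunk in the chain, which is exactly the readdir performance concern raised in the follow-up below.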

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 15:26 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

(1) OK, so in a sense, directories split across chunks in exactly the same way that files do, but for different reasons. Files split across chunks because they are large, while directories split across chunks because of the location of the primary inodes (or whatever a non-continuation inode is called) of the files within a given directory. No replication. So one area that will require careful attention would be the performance of reading directories that had been split across chunks.

(2,3) Good to hear that rename and unlink are still atomic! I bet I am not the only one who feels this way. ;-)

(4) Also good to hear!

(5) Having specialized chunks could be a very good thing, though the administrative tooling will have to be -very- good at automatically handling the differences between chunks. Otherwise sysadmins will choke on it.

(6) OK to be incompatible, but my guess is that it will be very important to be able to easily migrate from existing filesystems to chunkfs. One good thing about current huge and cheap disks is that just migrating the data from one area of disk to another is much more palatable than it would have been a few decades ago.

(7) Good point -- I suppose that in the extreme case, a delayed allocation scheme might be able to figure out that the file is large enough to have whole chunks dedicated to its data.
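
To make the point in (7) concrete, here is a hypothetical placement decision of the kind delayed allocation would enable: once a file's dirty size is known at flush time, anything past some threshold is steered into chunks of its own. The cutoff and all names are assumptions for this sketch, not a proposed policy.

    /* Hypothetical helper only: with delayed allocation, by the time blocks
     * are really allocated we know how much dirty data the file has, so a
     * big file can be steered into chunks of its own.  Threshold, names,
     * and types are invented for this sketch. */
    #include <stdio.h>
    #include <stdint.h>

    #define CHUNK_BYTES  (1ULL << 30)         /* assume 1GB chunks      */
    #define DEDICATE_AT  (CHUNK_BYTES / 4)    /* invented cutoff: 256MB */

    enum placement { PLACE_SHARED_CHUNK, PLACE_DEDICATED_CHUNKS };

    /* Decide placement when delayed allocation finally flushes the file. */
    static enum placement choose_placement(uint64_t dirty_bytes)
    {
        return dirty_bytes >= DEDICATE_AT ? PLACE_DEDICATED_CHUNKS
                                          : PLACE_SHARED_CHUNK;
    }

    int main(void)
    {
        printf("4MB file   -> %s\n", choose_placement(4ULL << 20) ==
               PLACE_DEDICATED_CHUNKS ? "dedicated chunks" : "shared chunk");
        printf("600MB file -> %s\n", choose_placement(600ULL << 20) ==
               PLACE_DEDICATED_CHUNKS ? "dedicated chunks" : "shared chunk");
        return 0;
    }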

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 21:24 UTC (Fri) by arjan (subscriber, #36785) [Link] (5 responses)

For your (1) point... the good news is that hard links are relatively rare....

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 9, 2006 20:12 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (4 responses)

Good point on hard links being relatively rare (ditto mv and rename, I would guess). But the split directories they engender would persist. With N chunks, you have only 1/N chance of spontaneous repair, resulting in increasing directory splitting over time, even with low rates of ln, mv, and rename. So my guess is that there would need to be the equivalent of a defragmenter for long-lived filesystems. (A rechunker???)

I suppose that one way to do this would be to hold some chunks aside, and to periodically re-chunk into these chunks, sort of like generational garbage collection.

But perhaps see how it does in tests and benchmarks?

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 11, 2006 23:13 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

This is an area where changing the basic chunk size could have a huge effect.

The discussion was about splitting the disk into 1GB chunks, but that can result in a LOT of chunks (3000+ on my home server :-). Changing the chunk size can drastically reduce the number of chunks needed, and therefore the number of potential places for the directories to get copied.

In addition, it would be possible to make a reasonably sized chunk that holds the beginning of every file (say the first few K, potentially with a block size less than 4K) and have all directories exist on that chunk; then only files that are larger would exist in the other chunks (this would also do wonders for things like updatedb that want to scan all files).

This master chunk would be absolutely critical, so it would need to be backed up or mirrored (but it's easy enough to make the first chunk be on a RAID, even if it's just mirroring to another spot on the same drive).

This sounds like something that will spawn endless variations and tinkering once the basic capabilities are in place.
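
A back-of-the-envelope sizing of the master-chunk idea above; every number is made up, the point is only the shape of the calculation.

    /* Made-up numbers throughout; this only shows how the master chunk's
     * size would be estimated, not what it actually would be. */
    #include <stdio.h>

    int main(void)
    {
        const double files      = 2e6;     /* assumed number of files      */
        const double head_bytes = 2048;    /* assumed "first few K" kept   */
        const double dir_bytes  = 512e6;   /* assumed total directory data */

        double total_gb = (files * head_bytes + dir_bytes) / 1e9;
        printf("master chunk needs roughly %.1f GB\n", total_gb);
        return 0;
    }

With those invented numbers the master chunk already runs to several gigabytes, i.e. larger than a normal 1GB chunk, which is one more argument for letting chunk sizes vary.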

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 13, 2006 18:15 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Is there an influence on concurrency? Can operations local to a chunk proceed entirely independently of other chunks? Per-chunk log for local operation, with global log for cross-chunk operations? Now -that- should be trivial to implement, right??? Just a small matter of software... ;-)

But, yes, 3,000 chunks does seem a bit of a pain to manage -- at least unless there is some way of automatically taking care of them. But do the chunks really need to be the same size? Could the filesystem allocate and free them, sort of like super-extents that contain metadata?

If we explore all the options, will it get done before 2020? Even if we don't explore all the options, will it get done before 2020? ;-)

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 23, 2006 19:12 UTC (Sun) by rapsys (guest, #39313) [Link] (1 responses)

I was wondering if you might add the following features to your scheme:
- mark every fifth chunk as an MD5 checksum chunk (a sort of RAID 5 over chunks)
- blacklist any chunk that has produced one or two errors (because it has a weak magnetic surface, for example)

I think these two ideas are interesting because I have read some articles about future hard disks with independent read/write heads to improve concurrent reading and writing. I read about this three or four years ago in the context of video hard-disk recorders for TV (though HDTV and DRM may have killed such an improvement?).

The point is that if we can have three to five heads on a hard disk, why not reserve the surface under one of those heads for backup checksums? (This would not kill performance if the heads are independent, since disk bandwidth is far greater than what the mechanical parts can deliver.)
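
A tiny demonstration of the "checksum chunk per group of chunks" idea, done here as XOR parity (RAID-5 style) rather than literal MD5, since XOR is what allows a lost chunk to be rebuilt. Group size and chunk size are toy values, not a proposed format.

    /* Toy XOR parity over a group of "chunks": losing any one chunk can be
     * repaired from the others plus the parity chunk. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define GROUP   4                  /* data chunks per parity group */
    #define CHUNKSZ 8                  /* toy chunk size in bytes      */

    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
        for (int i = 0; i < CHUNKSZ; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        uint8_t data[GROUP][CHUNKSZ] = {
            "chunk-0", "chunk-1", "chunk-2", "chunk-3"
        };
        uint8_t parity[CHUNKSZ] = { 0 };

        for (int c = 0; c < GROUP; c++)      /* build the parity chunk   */
            xor_into(parity, data[c]);

        memset(data[2], 0, CHUNKSZ);         /* "lose" chunk 2           */

        uint8_t rebuilt[CHUNKSZ];            /* rebuild it from the rest */
        memcpy(rebuilt, parity, CHUNKSZ);
        for (int c = 0; c < GROUP; c++)
            if (c != 2)
                xor_into(rebuilt, data[c]);

        printf("recovered: %s\n", (char *)rebuilt);   /* prints "chunk-2" */
        return 0;
    }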

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 23, 2006 19:40 UTC (Sun) by rapsys (guest, #39313) [Link]

Hmm, I forgot something.

The benefit of the previous idea is that if your data in another chunk has been physically corrupted (an overpowered write, etc.), the checksum of the block (or group of blocks) will be wrong. If the checksum chunk is available and valid (see below), the kernel issues an EAGAIN/ERESTORING to the application and regenerates the data from the checksum chunk. That way you don't lose a whole movie or music file just because one stupid block in the middle has been corrupted :'(

Even more interesting, the hard disk will then remap the problematic magnetic area somewhere else, because you rewrite to the same place after it has been marked as trashed by the drive. (That is an assumption that needs to hold every time, and that drive manufacturers NEED to respect, perhaps after a few write/read test cycles scheduled by the firmware on that spot.)

The only problem I see is if an overpowered write crosses the chunk boundary. Then the checksums themselves will be corrupted, which will get the kernel into trouble in a birthday-paradox sort of case.

So we will also need block (or block-group) checksums for the checksum chunk itself. (Will it fit in the space, with the equation x (data) + 1 (checksums)?)

I made a few assumptions:
- reads/writes to that chunk are not too costly (independent heads, etc.)
- checksums are not a CPU killer (maybe the drive could have a special feature to do that job)
- checksums of whole chunks are not a CPU killer either (though we do increase the CPU time consumed for each write to disk)
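
And a sketch of the detection side described above: each block group carries a checksum, a read verifies it, and a mismatch is the trigger to rebuild that group from the parity chunk as in the earlier XOR example. The checksum function is a trivial stand-in for whatever real hash would be used.

    /* Detection sketch: verify a stored block-group checksum on read; a
     * mismatch means "rebuild this group from the parity chunk".  The
     * checksum is a toy stand-in, not MD5 or a real on-disk format. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    #define GROUP_BYTES 32              /* toy block-group size */

    static uint32_t toy_checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];    /* simple rolling hash  */
        return sum;
    }

    int main(void)
    {
        uint8_t group[GROUP_BYTES] = "some file payload";
        uint32_t stored = toy_checksum(group, sizeof group);

        group[5] ^= 0x40;               /* simulate a flipped bit on disk */

        if (toy_checksum(group, sizeof group) != stored)
            printf("checksum mismatch: rebuild this group from parity\n");
        else
            printf("group is clean\n");
        return 0;
    }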

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 23:11 UTC (Thu) by vaurora (guest, #38407) [Link] (1 responses)

To expand on Arjan's reply, it's not obvious that it would increase the probability of any failure. Sure, there are a larger number of individual bitmaps, but in terms of bits on disk they are still the same size and shouldn't have an increased likelihood of suffering an I/O error. More superblocks is a more interesting case, because the superblock is a fixed size per file system; on the other hand, most modern file systems already heavily replicate the superblock. What does seem to be true is that this scheme will limit the effect of any individual failure, as long as we are smart about handling the loss of path components.

We definitely appreciate criticism, as we would like to figure out (possibly fatal) errors BEFORE implementing anything. So if you have any more ideas about how this will fail, let us know and hopefully we can figure something out.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 0:20 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

My original question came from considering a large file spread across multiple chunks, so that loss of any of these chunks loses part of the file. So any fixed probability of chunk loss adds up (approximately, anyway). On the freelist, I agree with you, and it does seem that you can be more aggressive about replicating superblocks to reduce the probability of superblock loss (but thereby slowing superblock updates).

Idle question, but I couldn't resist asking. ;-)
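
A quick numeric check of the "adds up (approximately)" remark: if a file spans k chunks and each chunk is lost independently with probability p, the exact loss probability is 1 - (1 - p)^k, which stays close to k*p as long as k*p is small. The value of p below is arbitrary.

    /* Compare the exact loss probability for a file spanning k chunks with
     * the k*p approximation; p is an arbitrary assumed per-chunk loss rate. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double p = 1e-4;          /* assumed per-chunk loss probability */

        for (int k = 1; k <= 64; k *= 4) {
            double exact = 1.0 - pow(1.0 - p, k);
            printf("k=%2d  exact=%.6e  k*p=%.6e\n", k, exact, k * p);
        }
        return 0;
    }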

