The 2006 Linux Filesystems Workshop (Part III)
Posted Jul 7, 2006 6:35 UTC (Fri) by arjan (subscriber, #36785)
In reply to: The 2006 Linux Filesystems Workshop (Part III) by PaulMcKenney
Parent article: The 2006 Linux Filesystems Workshop (Part III)
(1) Not so much replicated. If you think of a directory as a file-like linear stream (I know that's too simple, but readdir() sort of makes it one), what you'd do for a hard link is create a continuation inode for that stream in the chunk where the file resides, and continue the stream there, at least for the one dentry of the hard link. So there is no duplication or replication; it's continuation. (A rough sketch of what such a continuation inode might look like follows after these points.)
(2) That's no different from what's currently the case.
(3) Same.
(4) No... it's a purely internal thing.
(5) That's not a stupid question; the one thing I haven't written up is that, in principle, each chunk could have its own on-disk format variant. The "entire chunk is one file" variant was already on my list; another is "lots of small files".
(6) You mean ext2/3 layout within a chunk? Not a goal right now, although the plan is for the prototype to reuse ext2 for this. I don't want to be tied down to the exact ext2 format beforehand, though.
(7) Yes, something is needed there. The whole thing needs quite a good allocation strategy, probably including delayed allocation and so on.
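As an illustration of point (1), here is a minimal sketch of what an on-disk continuation inode might carry. This is purely hypothetical: chunkfs was only a design discussion at this point, and every field name below is invented for the example.

/*
 * Hypothetical continuation-inode record for the scheme described in (1).
 * Nothing is duplicated: the primary inode in the "home" chunk and the
 * continuation in the other chunk together form one logical stream
 * (a directory or a large file).  All names are illustrative only.
 */
#include <stdint.h>

struct chunkfs_cont_inode {
	uint32_t ci_chunk;         /* chunk this continuation lives in */
	uint32_t ci_parent_chunk;  /* chunk holding the primary inode */
	uint64_t ci_parent_ino;    /* the primary inode itself */
	uint32_t ci_next_chunk;    /* next continuation, if the stream goes on */
	uint64_t ci_next_ino;      /* 0 means this is the last piece */
	uint64_t ci_offset;        /* logical offset of this piece in the stream */
	uint64_t ci_len;           /* bytes (or directory entries) stored here */
};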
Posted Jul 7, 2006 15:26 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)
(1) OK, so in a sense directories split across chunks in exactly the same way that files do, but for different reasons: files split across chunks because they are large, while directories split across chunks because of the location of the primary inodes (or whatever a non-continuation inode is called) of the files within a given directory. No replication. So one area that will require careful attention is the performance of reading directories that have been split across chunks.
(2,3) Good to hear that rename and unlink are still atomic! I bet I am not the only one who feels this way. ;-)
(4) Also good to hear!
(5) Having specialized chunks could be a very good thing, though the administrative tooling will have to be -very- good at automatically handling the differences between chunks. Otherwise sysadmins will choke on it.
(6) OK to be incompatible, but my guess is that it will be very important to be able to easily migrate from existing filesystems to chunkfs. One good thing about current huge and cheap disks is that just migrating the data from one area of disk to another is much more palatable than it would have been a few decades ago.
(7) Good point -- I suppose that in the extreme case, a delayed allocation scheme might be able to figure out that the file is large enough to have whole chunks dedicated to its data.
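To make point (7) a little more concrete, here is a toy policy sketch of how a delayed-allocation pass might decide that a file deserves whole chunks of its own rather than extents inside a shared chunk. The chunk size, the function names, and the threshold are all assumptions for illustration.

/*
 * Toy delayed-allocation policy: once the buffered, not-yet-allocated data
 * for a file exceeds a whole chunk, give it dedicated chunks; otherwise
 * carve extents out of a shared chunk near its inode.
 */
#include <stdint.h>

#define CHUNK_SIZE	(1ULL << 30)	/* 1 GiB chunks, as in the discussion */

enum alloc_target {
	ALLOC_SHARED_CHUNK,	/* small file: share a chunk with others */
	ALLOC_DEDICATED_CHUNKS	/* big file: give it whole chunks */
};

static enum alloc_target choose_alloc_target(uint64_t delalloc_bytes)
{
	return delalloc_bytes >= CHUNK_SIZE ? ALLOC_DEDICATED_CHUNKS
					    : ALLOC_SHARED_CHUNK;
}

/* Number of whole chunks a dedicated file would consume, rounding up. */
static uint64_t chunks_needed(uint64_t delalloc_bytes)
{
	return (delalloc_bytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
}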
Posted Jul 7, 2006 21:24 UTC (Fri) by arjan (subscriber, #36785)
For your (1) point... the good news is that hard links are relatively rare.
Posted Jul 9, 2006 20:12 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624)
Good point on hard links being relatively rare (ditto mv and rename, I would guess). But the split directories they engender would persist. With N chunks, you have only a 1/N chance of spontaneous repair, so directory splitting increases over time even with low rates of ln, mv, and rename. So my guess is that there would need to be the equivalent of a defragmenter for long-lived filesystems. (A rechunker???)
I suppose that one way to do this would be to hold some chunks aside, and to periodically re-chunk into these chunks, sort of like generational garbage collection.
But perhaps see how it does in tests and benchmarks?
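A background "rechunker" along the lines Paul describes might look roughly like the sketch below. Every type and helper here is hypothetical; the point is only the generational-GC-like shape of the loop (copy the live, split streams into a spare chunk, then retire the old continuations).

/*
 * Sketch of a background "rechunker": walk directories that have spilled
 * into continuation inodes in other chunks and copy them, contiguously,
 * into a spare chunk held aside for this purpose, much like copying live
 * objects into a fresh space in a generational garbage collector.
 */
#include <stdbool.h>

struct chunk;
struct directory;

/* Hypothetical helpers the real filesystem would have to provide. */
extern struct chunk *reserve_spare_chunk(void);
extern bool dir_is_split(const struct directory *dir);
extern int copy_dir_into_chunk(struct directory *dir, struct chunk *dst);
extern void retire_old_continuations(struct directory *dir);
extern struct directory *next_split_directory(struct directory *prev);

/* One pass of the rechunker; returns the number of directories repacked. */
static int rechunk_pass(void)
{
	struct chunk *spare = reserve_spare_chunk();
	struct directory *dir = NULL;
	int repacked = 0;

	if (!spare)
		return 0;	/* no spare space: try again later */

	while ((dir = next_split_directory(dir)) != NULL) {
		if (!dir_is_split(dir))
			continue;
		if (copy_dir_into_chunk(dir, spare) != 0)
			break;	/* spare chunk full; finish next pass */
		retire_old_continuations(dir);
		repacked++;
	}
	return repacked;
}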
Posted Jul 11, 2006 23:13 UTC (Tue) by dlang (guest, #313)
This is an area where changing the basic chunk size could have a huge effect.
The discussion was to split the disk into 1GB chunks, but that can result in a LOT of chunks (3000+ on my home server :-). Changing the chunk size can drastically reduce the number of chunks needed, and therefore the number of potential places for the directories to get copied.
In addition, it would be possible to make a reasonably sized chunk that holds the beginning of every file (say the first few KB, potentially with a block size of less than 4KB) and have all directories live on that chunk; then only larger files would spill into the other chunks. (This would also do wonders for things like updatedb that want to scan all files.)
This master chunk would be absolutely critical, so it would need to be backed up or mirrored (but it's easy enough to put the first chunk on RAID, even if that just means mirroring it to another spot on the same drive).
This sounds like something that will spawn endless variations and tinkering once the basic capabilities are in place.
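As a concrete reading of dlang's master-chunk idea, here is a hedged sketch of how a lookup might decide where a file's data lives. The "first 8 KB in chunk 0" threshold and all of the names are assumptions made up for this example.

/*
 * Sketch of the "master chunk" layout: chunk 0 holds every directory and
 * the first MASTER_HEAD_BYTES of every file; anything beyond that spills
 * into ordinary chunks.  A scan like updatedb would then touch only chunk 0.
 */
#include <stdint.h>

#define MASTER_CHUNK		0U
#define MASTER_HEAD_BYTES	(8U * 1024U)	/* first 8 KB of each file */

struct chunkfs_extent {
	uint32_t chunk;		/* which chunk the bytes live in */
	uint64_t offset;	/* offset of this piece within the file */
	uint64_t len;		/* length of this piece */
};

/* Which chunk should hold the byte at file_offset of a file "homed" in
 * home_chunk?  The head of every file always lives in the master chunk. */
static uint32_t chunk_for_offset(uint64_t file_offset, uint32_t home_chunk)
{
	return file_offset < MASTER_HEAD_BYTES ? MASTER_CHUNK : home_chunk;
}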
Posted Jul 13, 2006 18:15 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)
Is there an influence on concurrency? Can operations local to a chunk proceed entirely independently of other chunks? A per-chunk log for local operations, with a global log for cross-chunk operations? Now -that- should be trivial to implement, right??? Just a small matter of software... ;-)
But, yes, 3,000 chunks does seem a bit of a pain to manage -- at least unless there is some way of automatically taking care of them. But do the chunks really need to be the same size? Could the filesystem allocate and free them, sort of like super-extents that contain metadata?
If we explore all the options, will it get done before 2020? Even if we don't explore all the options, will it get done before 2020? ;-)
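Paul's two-level-log question might translate into something like the sketch below: purely local updates hit only their chunk's log, while a cross-chunk operation (a rename between chunks, say) is anchored by a record in a global log that references the per-chunk records, so recovery can replay or discard the pieces as a unit. Everything here, structures and helpers alike, is assumed for illustration.

/*
 * Two-level journaling sketch: each chunk keeps its own log for operations
 * confined to that chunk; a small global log ties together the pieces of a
 * cross-chunk operation.  All types and functions are hypothetical.
 */
#include <stdint.h>

struct chunk_log;		/* per-chunk journal */
struct global_log;		/* filesystem-wide journal for cross-chunk ops */

struct cross_chunk_txn {
	uint64_t txn_id;	/* global transaction id */
	uint32_t src_chunk;	/* e.g. chunk losing a dentry in a rename */
	uint32_t dst_chunk;	/* chunk gaining the dentry */
};

extern int chunk_log_record(struct chunk_log *log, uint64_t txn_id,
			    const void *rec, uint32_t len);
extern int global_log_begin(struct global_log *glog, struct cross_chunk_txn *txn);
extern int global_log_commit(struct global_log *glog, uint64_t txn_id);

/* Cross-chunk rename: one global record brackets the two per-chunk records. */
static int rename_across_chunks(struct global_log *glog,
				struct chunk_log *src_log,
				struct chunk_log *dst_log,
				struct cross_chunk_txn *txn,
				const void *del_rec, uint32_t del_len,
				const void *add_rec, uint32_t add_len)
{
	int err = global_log_begin(glog, txn);
	if (err)
		return err;
	err = chunk_log_record(src_log, txn->txn_id, del_rec, del_len);
	if (!err)
		err = chunk_log_record(dst_log, txn->txn_id, add_rec, add_len);
	if (err)
		return err;		/* recovery sees an uncommitted txn */
	return global_log_commit(glog, txn->txn_id);
}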
Posted Jul 23, 2006 19:12 UTC (Sun) by rapsys (guest, #39313)
I was wondering if you might add the following features to your scheme:
- have every group of five chunks carry an md5 control sum (a sort of RAID 5 over chunks)
- blacklist a chunk that has produced one or two errors (because it has a weak magnetic surface, for example)
I think these two ideas are interesting because I have read some articles about independent read/write heads in hard disks, intended to improve concurrent reading and writing. I read about this three or four years ago in the context of video hard-disk recorders for TV (though HDTV and DRM may have killed that improvement?).
The point is that if we can have three to five heads on a hard disk, why not reserve the surface under one of those heads for backing up the control sums? (This would not kill performance if the heads are independent, since the disk's interface bandwidth is far greater than what the mechanical side delivers.)
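Here is a minimal sketch of the "RAID 5 over chunks" idea above, assuming one redundancy chunk protecting a group of data chunks. The group size, the choice of XOR parity, and all names are assumptions: the comment suggests md5 sums, which would detect corruption but not rebuild it, so parity is shown for the rebuild half.

/*
 * Parity over a group of chunks: one parity block is the XOR of the
 * corresponding block in each data chunk of the group, so a single bad
 * block can be rebuilt from the others (RAID-5 style, minus the rotation).
 */
#include <stddef.h>
#include <stdint.h>

#define GROUP_CHUNKS	5	/* data chunks protected by one parity chunk */
#define BLOCK_SIZE	4096

/* parity = XOR of the same-numbered block in every data chunk of the group */
static void compute_parity_block(const uint8_t data[GROUP_CHUNKS][BLOCK_SIZE],
				 uint8_t parity[BLOCK_SIZE])
{
	for (size_t i = 0; i < BLOCK_SIZE; i++) {
		uint8_t x = 0;
		for (size_t c = 0; c < GROUP_CHUNKS; c++)
			x ^= data[c][i];
		parity[i] = x;
	}
}

/* Rebuild one lost block by XOR-ing the parity with the surviving blocks. */
static void rebuild_block(const uint8_t data[GROUP_CHUNKS][BLOCK_SIZE],
			  const uint8_t parity[BLOCK_SIZE],
			  size_t lost_chunk, uint8_t out[BLOCK_SIZE])
{
	for (size_t i = 0; i < BLOCK_SIZE; i++) {
		uint8_t x = parity[i];
		for (size_t c = 0; c < GROUP_CHUNKS; c++)
			if (c != lost_chunk)
				x ^= data[c][i];
		out[i] = x;
	}
}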
Posted Jul 23, 2006 19:40 UTC (Sun) by rapsys (guest, #39313)
Hmm, I forgot something.
The value of the previous idea is that if your data has been physically corrupted in some chunk (an overpowered write, etc.), the per-block (or per-block-group) checksum will be wrong. If the control-sum chunk is available and valid (see below), the kernel can return an EAGAIN/ERESTORING-style error to the application and regenerate the data from the control-sum chunk. That would keep you from losing a whole movie or music file just because one stupid block in the middle has been corrupted :'(
The more interesting thing is that the hard disk will then remap the problematic magnetic area somewhere else, because you rewrite the same place and the drive has marked it as bad. (That is an assumption which needs to hold every time, and which drive manufacturers would NEED to respect, perhaps after a few write/read test cycles scheduled by the firmware on that spot.)
The only problem I see is an overpowered write that crosses a chunk boundary. The control sums would then be corrupted as well, and could mislead the kernel in the event of a checksum collision (birthday paradox). So we would also need per-block (or per-block-group) checksums for the control-sum chunk itself. (Will that fit, space-wise, with the equation x (data) + 1 (control sums)?)
I made a few assumptions:
- reads and writes to that chunk are not too costly (independent head, etc.)
- the control sums are not a CPU killer (maybe the drive could have a special feature to do that job)
- the control sums of the chunks are not a CPU killer either (though we do increase the CPU time consumed by each write to disk)
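The read path rapsys describes, verify a block checksum and on mismatch rebuild from the redundancy chunk before failing the read, could be sketched as below. ERESTORING is not a real errno, and every type and helper here is hypothetical.

/*
 * Checksum-verify-and-repair read path: if a block's stored checksum does
 * not match, try to reconstruct the block from the group's redundancy
 * (parity) chunk and rewrite it, so the drive can remap a weak sector;
 * only if that fails does the read return an error.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct block_ref;	/* identifies (chunk, block) on disk */

extern int  read_block_raw(const struct block_ref *blk, uint8_t *buf);
extern bool checksum_ok(const struct block_ref *blk, const uint8_t *buf);
extern int  rebuild_from_parity(const struct block_ref *blk, uint8_t *buf);
extern int  rewrite_block(const struct block_ref *blk, const uint8_t *buf);

static int read_block_verified(const struct block_ref *blk, uint8_t *buf)
{
	int err = read_block_raw(blk, buf);
	if (err)
		return err;
	if (checksum_ok(blk, buf))
		return 0;

	/* Checksum mismatch: try to regenerate from the redundancy chunk. */
	err = rebuild_from_parity(blk, buf);
	if (err)
		return -EIO;		/* no valid redundancy either */

	/* Rewriting lets the drive remap the bad sector underneath us. */
	(void)rewrite_block(blk, buf);	/* best effort; data is already good */
	return 0;
}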