User: Password:
Subscribe / Log in / New account

Widening ext4's readdir() cookie

Widening ext4's readdir() cookie

Posted Mar 28, 2013 11:59 UTC (Thu) by neilbrown (subscriber, #359)
In reply to: Widening ext4's readdir() cookie by dlang
Parent article: Widening ext4's readdir() cookie

Do any filesystems other than ext3/4 have this problem?

Some address it with by having two indexes - one by name and one by cookie. It is more expensive, bit is actually reliable (rather than only being stochastically reliable).

My preference is internal-chaining. This adds a little more book-keeping for the case when hashes collide, but I believe that is preferable to strange errors when hashes collide. And they really don't collide often even with 32bit hashes.

(Log in to post comments)

Widening ext4's readdir() cookie

Posted Mar 28, 2013 12:17 UTC (Thu) by paulj (subscriber, #341) [Link]

I was going to post again to speculate whether or not the root of this problem was a desire to avoid keeping any index other than the name-based one! :)

A simple index over a finite, ordered ID space, with 2 operations "get unused ID" and "get next used ID in the order" would solve this problem trivially, it really seems to me. ("get unused ID" could be done efficiently in a number of ways, e.g. picking a random ID in a ringed space, then using the "next unused" operation'd probably work well in a sparsely used space).

Trying to bodge the properties of an order in a finite space out of the properties of an index of (effectively) infinite IDs seems doomed to failure and/or strange corner-cases.

Hacking the internal structure of the index also seems potentially fragile and needlessly complex - though I don't know exactly what you mean by internal chaining. It sounds possibly like adding indices inside the index. In which case, why not just keep a separate, simpler index?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:14 UTC (Thu) by etienne (guest, #25256) [Link]

> In which case, why not just keep a separate, simpler index?

There is an obvious index, which is the index of the inode in the directory as written to the hard disk sectors - but all index are worse than cookies because if you are currently using file 4 (out of 10), and someone removes the file number 2, when you will ask for file 5, you will skip a file...

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:27 UTC (Thu) by paulj (subscriber, #341) [Link]

Because you've changed the ID of file 5, because you made its ID dependent on an ephemeral order.

If you make the IDs stable, so they never change so long as the entry exists, then the problem is gone. The IDs are stable, and so the ordering between them is stable. You can do this through an additional index (in the data-structure sense).

The "pain" seems to be that the ext3+ devs are trying to achieve this stable order by hacking around in the details an existing index of an ID that does NOT have an absolute order (the hashes of names). That wouldn't be necessary at all if they'd just add an additional stable, absolutely ordered, finite ID with its own index. The namespace-hash and the finite-ID could each have their own, simple index - no nasty hacks needed.

(It seems to me, but I could be dreadfully wrong, or perhaps there are some critical details missing from the reporting ;) ).

Widening ext4's readdir() cookie

Posted Mar 29, 2013 2:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]

"jfs" has a fairly simple second index. Each directory entry contains a stable index number, and there is a table indexed by this number that leads to the name in the btree. Look for "modify_index" in jfs_dtree.c and explore uses of "struct dir_table_slot".

Some problems with this are:
- readdir will not return names in natural b-tree order, so you will get lots of seeking. Not a problem when the whole directory fits easily in cache, may be a problem for v.large directories.
- every B-tree modification (page-split, page-delete) requires updating random entries in the index table.

i.e. there is a genuine IO cost here.

Having readdir return entries in natural B-Tree order is clearly appealing, and a separate table doesn't really allow this.

By "Internal chaining" I mean that if a particular hash value is in use, you add 1 and try again. When you find a hash value to use, you store it (or the difference between the hash of the name and the chosen value, which can probably be stored with fewer bits) with the name. When you delete entries you need to be careful to leave place-holders if any subsequent entries are part of an internal chain leading from here.

In the common case of no collisions there is zero IO overhead (where as with the jfs approach there is always IO overhead). In the rarer case were there are collisions, there is an overhead that probably gets substantially worse as the chain length increases. As long as any internal chain is within one block there is a processing overhead but no IO overhead. Once it crosses a block boundary you get a real overhead, but probably comparable with the JFS overhead. Once you have a chain touching three blocks it probably gets a lot worse. But that should be extremely unlikely.

So it seems like a solution with excellent common-case performance, which degrades only in rare circumstances. Worst case performance (where all names hash to the same value) is probably not much worse than a linear search.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 13:42 UTC (Thu) by bfields (subscriber, #19510) [Link]

Do any filesystems other than ext3/4 have this problem?

I haven't heard any.

And, yes, I suspect there are any number of solutions. Just pick one of them, turn it into a patch that's bulletproof and backwards compatible, and we're done.

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds