Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:46 UTC (Thu) by dlang (✭ supporter ✭, #313)
Posted Mar 28, 2013 10:25 UTC (Thu) by paulj (subscriber, #341)
Posted Mar 28, 2013 11:59 UTC (Thu) by neilbrown (subscriber, #359)
Some address it by having two indexes - one by name and one by cookie. It is more expensive, but is actually reliable (rather than only being stochastically reliable).
My preference is internal-chaining. This adds a little more book-keeping for the case when hashes collide, but I believe that is preferable to strange errors when hashes collide. And they really don't collide often even with 32bit hashes.
Posted Mar 28, 2013 12:17 UTC (Thu) by paulj (subscriber, #341)
A simple index over a finite, ordered ID space, with two operations - "get unused ID" and "get next used ID in the order" - would solve this problem trivially, it really seems to me. ("Get unused ID" could be done efficiently in a number of ways; e.g. picking a random ID in a ringed space and then using the "next unused" operation would probably work well in a sparsely used space.)
Trying to bodge the properties of an order in a finite space out of the properties of an index of (effectively) infinite IDs seems doomed to failure and/or strange corner-cases.
Hacking the internal structure of the index also seems potentially fragile and needlessly complex - though I don't know exactly what you mean by internal chaining. It sounds possibly like adding indices inside the index. In which case, why not just keep a separate, simpler index?
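To make the proposal above concrete, here is a toy, in-memory sketch of such an index. The class and method names (`CookieIndex`, `get_unused_id`, `next_used_id`) are illustrative, not anything from ext4 or any real filesystem; it just shows that the two operations paulj describes are enough to give readdir a stable resume point.

```python
import bisect
import random

class CookieIndex:
    """Toy sketch of a finite, ordered cookie space with the two
    operations described above: 'get unused ID' and 'get next used
    ID in the order'. Purely illustrative, not a filesystem design."""

    SPACE = 2**32  # finite, ordered ID space

    def __init__(self):
        self._cookies = []  # sorted list of in-use cookies
        self._names = {}    # cookie -> directory entry name

    def get_unused_id(self):
        # Pick a random point in the ring, then walk forward to the
        # first free slot; in a sparsely used space this rarely loops.
        c = random.randrange(self.SPACE)
        while c in self._names:
            c = (c + 1) % self.SPACE
        return c

    def insert(self, name):
        c = self.get_unused_id()
        bisect.insort(self._cookies, c)
        self._names[c] = name
        return c

    def next_used_id(self, after):
        # First in-use cookie strictly greater than 'after', or None.
        i = bisect.bisect_right(self._cookies, after)
        return self._cookies[i] if i < len(self._cookies) else None

    def remove(self, cookie):
        self._cookies.remove(cookie)
        del self._names[cookie]
```

Because a cookie never changes while its entry exists, a reader can resume iteration from its last cookie with `next_used_id()`, and deleting some other entry never renumbers the survivors - which is exactly the property positional indexes lack.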
Posted Mar 28, 2013 14:14 UTC (Thu) by etienne (subscriber, #25256)
There is an obvious index, which is the index of the inode in the directory as written to the hard disk sectors - but all indexes are worse than cookies, because if you are currently using file 4 (out of 10) and someone removes file number 2, when you ask for file 5 you will skip a file...
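The skip described above is easy to demonstrate with a plain list standing in for the directory (names here are made up for the illustration):

```python
# Positional cursors break under concurrent deletion: a reader that has
# consumed entries up to position 4 expects entry 5 next, but if another
# process removes entry 2 in the meantime, everything after it shifts
# down by one, and resuming at position 5 silently skips an entry.
entries = [f"file{i}" for i in range(10)]

pos = 5                  # reader will resume at positional index 5
expected = entries[pos]  # "file5" is the entry it should see next
del entries[2]           # another process removes "file2"
resumed = entries[pos]   # resuming at 5 now yields "file6" instead
```

After the deletion, `expected` is `"file5"` but `resumed` is `"file6"` - `"file5"` was skipped, which is why a positional index cannot serve as a readdir cookie.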
Posted Mar 28, 2013 14:27 UTC (Thu) by paulj (subscriber, #341)
If you make the IDs stable, so they never change so long as the entry exists, then the problem is gone. The IDs are stable, and so the ordering between them is stable. You can do this through an additional index (in the data-structure sense).
The "pain" seems to be that the ext3+ devs are trying to achieve this stable order by hacking around in the details of an existing index of an ID that does NOT have an absolute order (the hashes of names). That wouldn't be necessary at all if they'd just add an additional stable, absolutely ordered, finite ID with its own index. The namespace-hash and the finite ID could each have their own, simple index - no nasty hacks needed.
(It seems to me, but I could be dreadfully wrong, or perhaps there are some critical details missing from the reporting ;) ).
Posted Mar 29, 2013 2:56 UTC (Fri) by neilbrown (subscriber, #359)
Some problems with this are:
- readdir will not return names in natural b-tree order, so you will get lots of seeking. Not a problem when the whole directory fits easily in cache; may be a problem for very large directories.
- every B-tree modification (page-split, page-delete) requires updating random entries in the index table.
i.e. there is a genuine IO cost here.
Having readdir return entries in natural B-Tree order is clearly appealing, and a separate table doesn't really allow this.
By "Internal chaining" I mean that if a particular hash value is in use, you add 1 and try again. When you find a hash value to use, you store it (or the difference between the hash of the name and the chosen value, which can probably be stored with fewer bits) with the name. When you delete entries you need to be careful to leave place-holders if any subsequent entries are part of an internal chain leading from here.
In the common case of no collisions there is zero IO overhead (whereas with the JFS approach there is always IO overhead). In the rarer case where there are collisions, there is an overhead that probably gets substantially worse as the chain length increases. As long as any internal chain is within one block there is a processing overhead but no IO overhead. Once it crosses a block boundary you get a real overhead, but probably comparable with the JFS overhead. Once you have a chain touching three blocks it probably gets a lot worse. But that should be extremely unlikely.
So it seems like a solution with excellent common-case performance, which degrades only in rare circumstances. Worst case performance (where all names hash to the same value) is probably not much worse than a linear search.
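The internal-chaining scheme described above can be sketched in a few lines. This is a toy, in-memory model only - the names (`ChainedDir`, `TOMBSTONE`), the slot layout, and the hash function are all illustrative, not ext4's on-disk format - but it captures the three moves neilbrown describes: probe upward on collision, store the offset from the natural hash with the entry, and leave a placeholder on delete when a later entry's chain runs through the freed slot.

```python
import zlib

HASH_BITS = 32
MASK = (1 << HASH_BITS) - 1
TOMBSTONE = object()  # placeholder left behind so probe chains stay intact

class ChainedDir:
    def __init__(self, hashfn=None):
        # Stand-in for the filesystem's directory hash; injectable so a
        # demo can force collisions deterministically.
        self.hashfn = hashfn or (lambda n: zlib.crc32(n.encode()) & MASK)
        self.slots = {}  # cookie -> (name, offset-from-natural-hash)

    def insert(self, name):
        h = self.hashfn(name)
        off = 0
        # "If a particular hash value is in use, add 1 and try again."
        # A tombstone counts as free for insertion.
        while self.slots.get((h + off) & MASK) not in (None, TOMBSTONE):
            off += 1
        cookie = (h + off) & MASK
        # Store the offset with the name; small in the common case.
        self.slots[cookie] = (name, off)
        return cookie

    def remove(self, name):
        # Probe from the natural hash until we find the entry
        # (assumes the name is present; a sketch-level simplification).
        h = self.hashfn(name)
        off = 0
        while True:
            slot = self.slots.get((h + off) & MASK)
            if slot not in (None, TOMBSTONE) and slot[0] == name:
                break
            off += 1
        cookie = (h + off) & MASK
        # If the next slot's entry probed through this one (its stored
        # offset is nonzero), leave a placeholder so its chain survives;
        # otherwise the slot can simply be freed.
        nxt = self.slots.get((cookie + 1) & MASK)
        if nxt not in (None, TOMBSTONE) and nxt[1] > 0:
            self.slots[cookie] = TOMBSTONE
        else:
            del self.slots[cookie]
```

With a hash that never collides, insertion touches exactly one slot and no extra state is kept - matching the zero-overhead common case claimed above; the bookkeeping only kicks in when two names land on the same value.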
Posted Mar 28, 2013 13:42 UTC (Thu) by bfields (subscriber, #19510)
Do any filesystems other than ext3/4 have this problem?
I haven't heard any.
And, yes, I suspect there are any number of solutions. Just pick one of them, turn it into a patch that's bulletproof and backwards compatible, and we're done.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds