User: Password:
Subscribe / Log in / New account

Widening ext4's readdir() cookie

Widening ext4's readdir() cookie

Posted Mar 28, 2013 8:58 UTC (Thu) by paulj (subscriber, #341)
Parent article: Widening ext4's readdir() cookie

It's interesting to read of the readdir offset debate, and read the hate for it. Most interesting is the "reluctantly" link in the article, - where Ted seems to make 2 claims:

  1. The readdir/telldir offset is a legacy problem (NFS), and if not for that it would not be in the kernel.
  2. It would not generally be practical for userspace to keep directories in memory while walking them for telldir/readdir.

Aren't these claims potentially contradictory? If it is not practical to keep directories in memory for telldir/readdir, why would it be practical for any other API? How would a dir-walking API work that didn't require userspace to keep a shadow copy (or didn't require the kernel to keep a shadow copy)? Is this perhaps a fundamental problem of a trade-off of where to keep state, rather than anything particular to readdir?

Also strange is the language in the SUSv3 spec which Ted quotes in another email (

" One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered."

Difficult to describe formally? That's odd. Surely it is quite trivial? You define that directory entries have an absolute ordering (lexical, numberical, whatever you want - it doesn't matter). Why are B-trees a problem? B-trees are defined to provide an index over absolutely ordered objects. Given, say, a B-tree and a rank in that order (e.g. the cookie of a previous dirent), there should be no problem to determine what the next dirent in the order is - even if the B-tree has been changed or re-balanced, even if the dirent for the cookie no longer exists! Surely? All ordered indices should have this property (I.e. hash tables do not, but pretty much all commonly used trees would, no?), surely?

Basically, I'm puzzled, because the issues as stated don't fully make sense (either not issues, or potentially not consistent) to me. So I'm wondering what the deeper issues are, that the developers and spec writers havn't explained; or else whether perhaps the developers have gotten so "in" to the details of the problem that they're overlooking some things that'd be clearer if they stepped back? (Or it could just be me having one of my amazingly stupid moments ;) - in which case, my apologies tentatively in advance).

(Log in to post comments)

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:07 UTC (Thu) by paulj (subscriber, #341) [Link]

And further, this bit of the article made me scratch my head and wonder which of my understanding, the description of the issues involved in the article, and/or the code is suboptimal:

"Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space.

Encoding the position in the tree in the cookie through a hash? But the tree is indexing something with an absolute order - the very node in the true MUST be associated with a unique value already (e.g. the name), surely? Why would you then hash this?

(I'm assuming I'm having a pre-morning-caffeine stupid moment :) ).

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:13 UTC (Thu) by paulj (subscriber, #341) [Link]

I guess the issue must be the order the B-tree is indexing is ephemeral (e.g. it's of the index in the number of entries). But then, is there no way to make that order more stable?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:23 UTC (Thu) by dark (guest, #8483) [Link]

I think the problem here is that the underlying ordered data entries are simply too big (they're whole filenames) and telldir() returns a long. No matter how you plan to map the names to long offsets, you'll find some combination of mutations to the directory that makes the offset invalid.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:32 UTC (Thu) by paulj (subscriber, #341) [Link]

Yes, names would be too big. Cpurse, you don't need an index into the space of all possible names. You need an index into the entries (not necessarily names) that existed and have existed in this directory. I wonder is there not a way to get a stable order, with a reliable "next" operation, out of that?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:57 UTC (Thu) by dlang (subscriber, #313) [Link]

remember, the OS has no idea when (or even if) the application is going to make it's next request. The spec allows the application to read one filename a week and still be guaranteed to see all files that existed when it started the read with no duplicates.

If there were no resource constraints, the system could use RCU type strategies to make a copy of the directory and just use the token as a numeric offset into this copy, but since the system has no way of ever knowing when it could release this copy, it would have to keep it forever.

remember that with NFS, a server is required to be consistent even across reboots.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 10:24 UTC (Thu) by paulj (subscriber, #341) [Link]

E.g. why not make the order a ring? Store last created ID with the directory. Use an index into that order that is easily capable of answering "What is the next value in the order?" - tree indices typically are capable of this.

To create an entry:

- locate the ID in your ordered index (B-tree, whatever)
- Determine the next free ID by walking the index to the next existing ID
- Create your entry with that free ID, update your index
- Update the directory's last created ID

Insertions have to be done atomically wrt directory and index and any readers. Removals do not need to modify the last-created ID in the directory though. But it does not matter a jot how the internal structure of the index changes.

No collisions, no hashes, no encoding the structure of indices into cookies. No nasty hacks. Just work with the grain by making the underlying order you're indexing have the properties you want.

You are limited to # of directory entries being the size of the ring, but it seems NFS compatibility would allow that ring to be (2^64)-1.

Why wouldn't that work?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 11:31 UTC (Thu) by paulj (subscriber, #341) [Link]

Actually the ring isn't necessary. Any way to pick a free ID will do.

Widening ext4's readdir() cookie

Posted Apr 4, 2013 1:56 UTC (Thu) by vasi (guest, #83946) [Link]

HFS+ could do this. Each directory entry has a unique ID (the "catalog node ID"), which fits in a long. You can lookup the CNID to get the file name, and then use that name as your position in the directory.

This is still imperfect. Because each CNID is associated with a file name, hard links require an extra level of indirection. And obviously if the file at the current position is deleted, or renamed, things won't work as expected.

This is all moot, since Linux's HFS+ driver just uses the offset as f_pos, I'm not sure why. Maybe the f_pos values are required to be increasing? CNIDs are not.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:18 UTC (Thu) by dlang (subscriber, #313) [Link]

the big problem is that the directory contents may be changed by something else while you are going through the list of files.

So your b-tree or other complex structure may change, potentially significantly.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:41 UTC (Thu) by paulj (subscriber, #341) [Link]

It doesn't matter if the ordered index gets changed, even significantly, if the order it is indexing remains stable. I just wonder if there's some better way to manufacture a stable order from directory entries, rather than hash the name?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:46 UTC (Thu) by dlang (subscriber, #313) [Link]

The filesystem authors have been looking for a good way to deal with this ordering problem for years. If there was a simple answer I would have expected that _someone_ would have found it and everyone else would copy them.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 10:25 UTC (Thu) by paulj (subscriber, #341) [Link]

Why wouldn't a ring work? (see other comment too).

Widening ext4's readdir() cookie

Posted Mar 28, 2013 11:59 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Do any filesystems other than ext3/4 have this problem?

Some address it with by having two indexes - one by name and one by cookie. It is more expensive, bit is actually reliable (rather than only being stochastically reliable).

My preference is internal-chaining. This adds a little more book-keeping for the case when hashes collide, but I believe that is preferable to strange errors when hashes collide. And they really don't collide often even with 32bit hashes.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 12:17 UTC (Thu) by paulj (subscriber, #341) [Link]

I was going to post again to speculate whether or not the root of this problem was a desire to avoid keeping any index other than the name-based one! :)

A simple index over a finite, ordered ID space, with 2 operations "get unused ID" and "get next used ID in the order" would solve this problem trivially, it really seems to me. ("get unused ID" could be done efficiently in a number of ways, e.g. picking a random ID in a ringed space, then using the "next unused" operation'd probably work well in a sparsely used space).

Trying to bodge the properties of an order in a finite space out of the properties of an index of (effectively) infinite IDs seems doomed to failure and/or strange corner-cases.

Hacking the internal structure of the index also seems potentially fragile and needlessly complex - though I don't know exactly what you mean by internal chaining. It sounds possibly like adding indices inside the index. In which case, why not just keep a separate, simpler index?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:14 UTC (Thu) by etienne (guest, #25256) [Link]

> In which case, why not just keep a separate, simpler index?

There is an obvious index, which is the index of the inode in the directory as written to the hard disk sectors - but all index are worse than cookies because if you are currently using file 4 (out of 10), and someone removes the file number 2, when you will ask for file 5, you will skip a file...

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:27 UTC (Thu) by paulj (subscriber, #341) [Link]

Because you've changed the ID of file 5, because you made its ID dependent on an ephemeral order.

If you make the IDs stable, so they never change so long as the entry exists, then the problem is gone. The IDs are stable, and so the ordering between them is stable. You can do this through an additional index (in the data-structure sense).

The "pain" seems to be that the ext3+ devs are trying to achieve this stable order by hacking around in the details an existing index of an ID that does NOT have an absolute order (the hashes of names). That wouldn't be necessary at all if they'd just add an additional stable, absolutely ordered, finite ID with its own index. The namespace-hash and the finite-ID could each have their own, simple index - no nasty hacks needed.

(It seems to me, but I could be dreadfully wrong, or perhaps there are some critical details missing from the reporting ;) ).

Widening ext4's readdir() cookie

Posted Mar 29, 2013 2:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]

"jfs" has a fairly simple second index. Each directory entry contains a stable index number, and there is a table indexed by this number that leads to the name in the btree. Look for "modify_index" in jfs_dtree.c and explore uses of "struct dir_table_slot".

Some problems with this are:
- readdir will not return names in natural b-tree order, so you will get lots of seeking. Not a problem when the whole directory fits easily in cache, may be a problem for v.large directories.
- every B-tree modification (page-split, page-delete) requires updating random entries in the index table.

i.e. there is a genuine IO cost here.

Having readdir return entries in natural B-Tree order is clearly appealing, and a separate table doesn't really allow this.

By "Internal chaining" I mean that if a particular hash value is in use, you add 1 and try again. When you find a hash value to use, you store it (or the difference between the hash of the name and the chosen value, which can probably be stored with fewer bits) with the name. When you delete entries you need to be careful to leave place-holders if any subsequent entries are part of an internal chain leading from here.

In the common case of no collisions there is zero IO overhead (where as with the jfs approach there is always IO overhead). In the rarer case were there are collisions, there is an overhead that probably gets substantially worse as the chain length increases. As long as any internal chain is within one block there is a processing overhead but no IO overhead. Once it crosses a block boundary you get a real overhead, but probably comparable with the JFS overhead. Once you have a chain touching three blocks it probably gets a lot worse. But that should be extremely unlikely.

So it seems like a solution with excellent common-case performance, which degrades only in rare circumstances. Worst case performance (where all names hash to the same value) is probably not much worse than a linear search.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 13:42 UTC (Thu) by bfields (subscriber, #19510) [Link]

Do any filesystems other than ext3/4 have this problem?

I haven't heard any.

And, yes, I suspect there are any number of solutions. Just pick one of them, turn it into a patch that's bulletproof and backwards compatible, and we're done.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds