Widening ext4's readdir() cookie
In a separate article, we explained how an ext4 change to the kernel-user-space ABI in Linux 3.4 broke the GlusterFS filesystem; here, we look in detail at the change and why it was needed. The change in question was a patch by Fan Yong that widened the readdir() "cookies" produced by ext4 from 32 to 64 bits. Understanding why Fan's patch was necessary first requires a bit of background on the readdir() API.
The readdir API consists of a number of functions that allow an application to walk through the entries in a directory list. The opendir() function opens a directory stream for a specified directory. The readdir() function returns the contents of a directory stream, one entry at a time. The telldir() and seekdir() functions provide lseek-style functionality: an application can remember its current position in a directory stream using telldir(), scan further entries with readdir(), and then return to the remembered position using seekdir().
It turns out that supporting the readdir API is a source of considerable pain for filesystem developers. The API was designed in a simpler age, when directories were essentially linear tables of filenames plus inode numbers. The first of the widely used Linux filesystems, ext2, followed that design. In such filesystems, one can meaningfully talk about an offset within a directory table.
However, in the interests of improving performance and supporting new features, modern filesystems (such as ext4) have long since adopted more complex data structures—typically B-trees (PDF)—for representing directories. The problem with B-tree structures, from the point of view of implementing the readdir() API, is that the nodes in a tree can undergo (sometimes drastic) rearrangements as entries are added to and removed from the tree. This reordering of the tree renders the concept of a directory "offset" meaningless. The lack of a stable offset value is obviously a difficulty when implementing telldir() and seekdir(). However, it is also a problem for the implementation of readdir(), which must be done in such a way that a loop using readdir() to scan an entire directory will return a list of all files in the directory, without duplicates. Consequently, readdir() must internally also maintain some kind of stable representation of a position within the directory stream.
Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space. A subsequent readdir() or seekdir() will pass the cookie back to the kernel, at which point the kernel decodes the cookie to derive a position within the directory.
Encoding the directory position as a cookie works, more or less, but has some limitations. The cookie has historically been a 31-bit hash value, because older NFS implementations could handle only 32-bit cookies. (The hash is 31-bit because the off_t type used to represent the information is defined as a signed type, and negative offsets are not allowed.) In earlier times, a 31-bit hash was not too much of a problem: filesystem limitations meant that directories were usually small, so the chance that two directory entries would hash to the same value was small.
However, modern filesystems allow for large directories—so large that the chance of two files producing the same 31-bit hash is significant. For example, in a directory with 2000 entries, the chance of a collision is around 0.1%. In a directory with 32,768 entries (the historical limit in ext2), the chance is somewhat more than 20%. (For the math behind these numbers, see the Wikipedia article on the Birthday Paradox.) Modern filesystems have much higher limits on the number of files in a directory, with a corresponding increase in the chance of hash collisions; in a directory with 100,000 entries, the probability is over 90%.
Two files that hash to the same cookie value can lead to problems when using readdir(), especially on NFS. Suppose that we want to scan all of the files in a directory. And suppose that two files, say abc and xyz, hash to the same value, and that the directory is ordered such that abc is scanned first. When a readdir() on the NFS client later reaches the file xyz, it will receive a cookie that is exactly the same as the one for abc. Upon passing that cookie back to the NFS server, the next readdir() will commence at the file following abc. The NFS client code has some logic to detect this situation; that logic causes readdir() to fail with the (somewhat counter-intuitive) error ELOOP, "Too many levels of symbolic links".
This error can be fairly easily reproduced on NFS with older kernels. One simply has to create an ext4 directory containing enough files, mount that directory over NFS, and run any program that performs a readdir() loop over the directory on the NFS client. When working with a local filesystem (no NFS involved), the same problem exists, but in a different form. One does not encounter it when using readdir(), because of the way in which that function is implemented on top of the getdents() system call. Essentially, opendir() opens a file descriptor that is used by getdents(); the kernel is able to internally associate a directory position with that file descriptor, so cookies play no part in the implementation of readdir(). By contrast, because NFS is stateless, each readdir() over NFS requires that the NFS server explicitly locate the directory position corresponding to the cookie sent by the client.
On the other hand, the problem can be observed with a local ext4 filesystem when using telldir(), because that function explicitly returns the directory "offset" cookie to the caller. If two directory entries produce the same "offset" cookie when calling telldir(), then a call to seekdir() after either of the telldir() calls will go back to the same location. A user-space loop such as the following easily reveals the problem, encountering a difficulty analogous to a readdir() loop over NFS:
```c
dirp = opendir("/path/to/ext4/dir");
while ((dirent = readdir(dirp)) != NULL) {
    ...
    seekdir(dirp, telldir(dirp));
    ...
}
```
The seekdir(dirp, telldir(dirp)) call is a seeming no-op, simply resetting the directory position to its current location. However, where a directory entry hashes to the same value as an earlier directory entry, the effect of the call will be to reset the directory position to the earlier entry with the same hash. An infinite loop thus results. Real programs would of course not use telldir() and seekdir() in this manner. However, every now and then programs that use those calls would obtain a surprising result: a seekdir() would reposition the directory stream to a completely unexpected location.
Thus, the cookie collision problem needed to be fixed for the benefit of both ext4 and (especially) NFS. The simplest way of reducing the likelihood of hash collisions is to increase the size of the hash space. That was the purpose of Fan's patch, which increased the size of the hash space for the offset cookies produced by ext4 from 31 bits to 63. (A similar change has also been merged for ext3.) With a 63-bit hash space, even a directory containing one million entries would have less than one chance in four million of producing a hash collision. Of course, a corresponding change is required in NFS, so that the NFS server is able to deal with the larger cookie sizes. That change was provided in a patch by Bernd Schubert.
Reading this article and the GlusterFS article together, one might
wonder why GlusterFS doesn't have the same problems with XFS that it has
with ext4. The answer, as noted by Dave
Chinner, is that XFS uses a rather different scheme to produce
readdir() cookies. That scheme produces cookies that require only
32 bits, and the cookies are produced in such a way as to guarantee that no
two files can generate the same cookie. XFS is able to produce unique
32-bit cookies due to the virtual mapping it overlays onto the directory
index; adding such a mapping to ext4 (which does not otherwise need it)
would be a large job.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/ext4 |
Widening ext4's readdir() cookie
Posted Mar 28, 2013 8:58 UTC (Thu) by paulj (subscriber, #341) [Link]
It's interesting to read of the readdir offset debate, and the hate for it. Most interesting is the "reluctantly" link in the article, http://lwn.net/Articles/544237/, where Ted seems to make two claims:
- The readdir/telldir offset is a legacy problem (NFS), and if not for that it would not be in the kernel.
- It would not generally be practical for userspace to keep directories in memory while walking them for telldir/readdir.
Aren't these claims potentially contradictory? If it is not practical to keep directories in memory for telldir/readdir, why would it be practical for any other API? How would a dir-walking API work that didn't require userspace to keep a shadow copy (or didn't require the kernel to keep a shadow copy)? Is this perhaps a fundamental problem of a trade-off of where to keep state, rather than anything particular to readdir?
Also strange is the language in the SUSv3 spec which Ted quotes in another email (http://lwn.net/Articles/544257/):
" One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered."
Difficult to describe formally? That's odd. Surely it is quite trivial? You define that directory entries have an absolute ordering (lexical, numerical, whatever you want - it doesn't matter). Why are B-trees a problem? B-trees are defined to provide an index over absolutely ordered objects. Given, say, a B-tree and a rank in that order (e.g. the cookie of a previous dirent), there should be no problem to determine what the next dirent in the order is - even if the B-tree has been changed or re-balanced, even if the dirent for the cookie no longer exists! Surely? All ordered indices should have this property (i.e. hash tables do not, but pretty much all commonly used trees would, no?), surely?
Basically, I'm puzzled, because the issues as stated don't fully make sense (either not issues, or potentially not consistent) to me. So I'm wondering what the deeper issues are, that the developers and spec writers haven't explained; or else whether perhaps the developers have gotten so "in" to the details of the problem that they're overlooking some things that'd be clearer if they stepped back? (Or it could just be me having one of my amazingly stupid moments ;) - in which case, my apologies tentatively in advance).
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:07 UTC (Thu) by paulj (subscriber, #341) [Link]
And further, this bit of the article made me scratch my head and wonder which of my understanding, the description of the issues involved in the article, and/or the code is suboptimal:
"Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space.
Encoding the position in the tree in the cookie through a hash? But the tree is indexing something with an absolute order - the very node in the tree MUST be associated with a unique value already (e.g. the name), surely? Why would you then hash this?
(I'm assuming I'm having a pre-morning-caffeine stupid moment :) ).
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:13 UTC (Thu) by paulj (subscriber, #341) [Link]
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:23 UTC (Thu) by dark (guest, #8483) [Link]
I think the problem here is that the underlying ordered data entries are simply too big (they're whole filenames) and telldir() returns a long. No matter how you plan to map the names to long offsets, you'll find some combination of mutations to the directory that makes the offset invalid.
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:32 UTC (Thu) by paulj (subscriber, #341) [Link]
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:57 UTC (Thu) by dlang (guest, #313) [Link]
If there were no resource constraints, the system could use RCU type strategies to make a copy of the directory and just use the token as a numeric offset into this copy, but since the system has no way of ever knowing when it could release this copy, it would have to keep it forever.
remember that with NFS, a server is required to be consistent even across reboots.
Widening ext4's readdir() cookie
Posted Mar 28, 2013 10:24 UTC (Thu) by paulj (subscriber, #341) [Link]
To create an entry:
- Locate the ID in your ordered index (B-tree, whatever)
- Determine the next free ID by walking the index to the next existing ID
- Create your entry with that free ID, update your index
- Update the directory's last created ID
Insertions have to be done atomically wrt directory and index and any readers. Removals do not need to modify the last-created ID in the directory though. But it does not matter a jot how the internal structure of the index changes.
No collisions, no hashes, no encoding the structure of indices into cookies. No nasty hacks. Just work with the grain by making the underlying order you're indexing have the properties you want.
You are limited to # of directory entries being the size of the ring, but it seems NFS compatibility would allow that ring to be (2^64)-1.
Why wouldn't that work?
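For what it's worth, the scheme sketched above can be illustrated with a toy model (all names, sizes, and the linear-scan representation are ours, purely for illustration; for simplicity it lets IDs grow monotonically rather than wrap around a ring):

```c
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 64

/* Toy directory: each entry carries a stable 64-bit ID; the cookie
 * returned to readdir() is simply that ID, so it survives any internal
 * reorganization of the index. */
struct toy_dir {
    uint64_t id[MAX_ENTRIES];
    char     name[MAX_ENTRIES][32];
    int      n;
    uint64_t last_id;   /* the directory's last created ID */
};

/* Create an entry: take the next free ID after last_id. */
uint64_t toy_create(struct toy_dir *d, const char *name)
{
    uint64_t id = d->last_id + 1;  /* free here because IDs only grow */
    d->id[d->n] = id;
    strncpy(d->name[d->n], name, 31);
    d->n++;
    d->last_id = id;
    return id;
}

/* readdir-style step: smallest used ID strictly greater than the
 * cookie; returns NULL when the scan is finished. This works even if
 * the entry the cookie came from has since been removed. */
const char *toy_next(const struct toy_dir *d, uint64_t cookie,
                     uint64_t *next_cookie)
{
    const char *best = 0;
    uint64_t best_id = 0;
    for (int i = 0; i < d->n; i++) {
        if (d->id[i] > cookie && (!best || d->id[i] < best_id)) {
            best = d->name[i];
            best_id = d->id[i];
        }
    }
    if (best)
        *next_cookie = best_id;
    return best;
}

/* Remove an entry; last_id is deliberately left untouched. */
void toy_remove(struct toy_dir *d, uint64_t id)
{
    for (int i = 0; i < d->n; i++)
        if (d->id[i] == id) {
            d->id[i] = d->id[d->n - 1];
            memcpy(d->name[i], d->name[d->n - 1], 32);
            d->n--;
            return;
        }
}
```

Because the cookie is a stable ID rather than a hash or a position, resuming from the cookie of a deleted entry still finds the next surviving entry.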
Widening ext4's readdir() cookie
Posted Mar 28, 2013 11:31 UTC (Thu) by paulj (subscriber, #341) [Link]
Widening ext4's readdir() cookie
Posted Apr 4, 2013 1:56 UTC (Thu) by vasi (subscriber, #83946) [Link]
This is still imperfect. Because each CNID is associated with a file name, hard links require an extra level of indirection. And obviously if the file at the current position is deleted, or renamed, things won't work as expected.
This is all moot, since Linux's HFS+ driver just uses the offset as f_pos; I'm not sure why. Maybe the f_pos values are required to be increasing? CNIDs are not.
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:18 UTC (Thu) by dlang (guest, #313) [Link]
So your b-tree or other complex structure may change, potentially significantly.
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:41 UTC (Thu) by paulj (subscriber, #341) [Link]
Widening ext4's readdir() cookie
Posted Mar 28, 2013 9:46 UTC (Thu) by dlang (guest, #313) [Link]
Widening ext4's readdir() cookie
Posted Mar 28, 2013 10:25 UTC (Thu) by paulj (subscriber, #341) [Link]
Widening ext4's readdir() cookie
Posted Mar 28, 2013 11:59 UTC (Thu) by neilbrown (subscriber, #359) [Link]
Some address it by having two indexes - one by name and one by cookie. It is more expensive, but is actually reliable (rather than only being stochastically reliable).
My preference is internal chaining. This adds a little more book-keeping for the case when hashes collide, but I believe that is preferable to strange errors when hashes collide. And they really don't collide often, even with 32-bit hashes.
Widening ext4's readdir() cookie
Posted Mar 28, 2013 12:17 UTC (Thu) by paulj (subscriber, #341) [Link]
A simple index over a finite, ordered ID space, with 2 operations "get unused ID" and "get next used ID in the order" would solve this problem trivially, it really seems to me. ("get unused ID" could be done efficiently in a number of ways, e.g. picking a random ID in a ringed space, then using the "next unused" operation would probably work well in a sparsely used space).
Trying to bodge the properties of an order in a finite space out of the properties of an index of (effectively) infinite IDs seems doomed to failure and/or strange corner-cases.
Hacking the internal structure of the index also seems potentially fragile and needlessly complex - though I don't know exactly what you mean by internal chaining. It sounds possibly like adding indices inside the index. In which case, why not just keep a separate, simpler index?
Widening ext4's readdir() cookie
Posted Mar 28, 2013 14:14 UTC (Thu) by etienne (guest, #25256) [Link]
There is an obvious index, which is the index of the inode in the directory as written to the hard disk sectors - but any such index is worse than cookies, because if you are currently using file 4 (out of 10), and someone removes file number 2, then when you ask for file 5 you will skip a file...
Widening ext4's readdir() cookie
Posted Mar 28, 2013 14:27 UTC (Thu) by paulj (subscriber, #341) [Link]
If you make the IDs stable, so they never change so long as the entry exists, then the problem is gone. The IDs are stable, and so the ordering between them is stable. You can do this through an additional index (in the data-structure sense).
The "pain" seems to be that the ext3+ devs are trying to achieve this stable order by hacking around in the details an existing index of an ID that does NOT have an absolute order (the hashes of names). That wouldn't be necessary at all if they'd just add an additional stable, absolutely ordered, finite ID with its own index. The namespace-hash and the finite-ID could each have their own, simple index - no nasty hacks needed.
(It seems to me, but I could be dreadfully wrong, or perhaps there are some critical details missing from the reporting ;) ).
Widening ext4's readdir() cookie
Posted Mar 29, 2013 2:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]
Some problems with this are:
- readdir will not return names in natural b-tree order, so you will get lots of seeking. Not a problem when the whole directory fits easily in cache, but it may be a problem for very large directories.
- every B-tree modification (page-split, page-delete) requires updating random entries in the index table.
i.e. there is a genuine IO cost here.
Having readdir return entries in natural B-Tree order is clearly appealing, and a separate table doesn't really allow this.
By "Internal chaining" I mean that if a particular hash value is in use, you add 1 and try again. When you find a hash value to use, you store it (or the difference between the hash of the name and the chosen value, which can probably be stored with fewer bits) with the name. When you delete entries you need to be careful to leave place-holders if any subsequent entries are part of an internal chain leading from here.
In the common case of no collisions there is zero IO overhead (whereas with the jfs approach there is always IO overhead). In the rarer case where there are collisions, there is an overhead that probably gets substantially worse as the chain length increases. As long as any internal chain is within one block there is a processing overhead but no IO overhead. Once it crosses a block boundary you get a real overhead, but probably comparable with the JFS overhead. Once you have a chain touching three blocks it probably gets a lot worse. But that should be extremely unlikely.
So it seems like a solution with excellent common-case performance, which degrades only in rare circumstances. Worst case performance (where all names hash to the same value) is probably not much worse than a linear search.
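As a toy illustration of the internal-chaining idea ("add 1 and try again"), with a deliberately tiny hash space so collisions are easy to provoke (the hash function and table here are ours and nothing like ext4's real dirindex):

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 8   /* absurdly small so collisions are easy to see */

struct chain_dir {
    char name[TABLE_SIZE][32];
    int  used[TABLE_SIZE];
};

/* Illustrative hash only, nothing like ext4's real one. */
static unsigned toy_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert a name; on collision, add 1 and try again, so the slot
 * actually assigned (the cookie) is unique. Returns the cookie, or -1
 * if the table is full. */
int chain_insert(struct chain_dir *d, const char *name)
{
    unsigned h = toy_hash(name);
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        unsigned slot = (h + probe) % TABLE_SIZE;
        if (!d->used[slot]) {
            d->used[slot] = 1;
            strncpy(d->name[slot], name, 31);
            return (int)slot;
        }
    }
    return -1;
}
```

Two names that hash to the same slot receive distinct cookies; the cost is only the extra probe on the (rare) collision.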
Widening ext4's readdir() cookie
Posted Mar 28, 2013 13:42 UTC (Thu) by bfields (subscriber, #19510) [Link]
Do any filesystems other than ext3/4 have this problem?
I haven't heard any.
And, yes, I suspect there are any number of solutions. Just pick one of them, turn it into a patch that's bulletproof and backwards compatible, and we're done.
telldir() and seekdir()
Posted Mar 28, 2013 10:43 UTC (Thu) by meuh (guest, #22042) [Link]
Surprisingly, telldir() and seekdir() use a long int (instead of an off_t) as the location (offset) value. So, from a user-space point of view, going from 32 bits to 64 bits to reduce the collision problem seen in NFS seems rather strange.
Why am I seeing a security exploit?
Posted Mar 29, 2013 4:10 UTC (Fri) by jmorris42 (guest, #2203) [Link]
So why can't someone precompute carefully constructed directory entries that will cause a collision, even with 63-bit hashes? Once you have those, drop them somewhere and wait for the infinite loop. I do believe that is called a 'denial of service'. Is there a good reason these puppies can't be injected in all sorts of places? How many Linux mail servers unpack zipped attachments to perform malware detection? Is everyone CERTAIN none will ever traverse the unpacked directory in a way that will trigger this?
No, this is bogus. Designing a filesystem that will work most of the time isn't any way to write code for production use.
Why am I seeing a security exploit?
Posted Mar 29, 2013 13:49 UTC (Fri) by bfields (subscriber, #19510) [Link]
So why can't someone precompute carefully constructed directory entries that will cause a collision, even with 63 bit hashes?
Looks like the hash algorithm takes a seed, which is per-superblock, generated at mkfs time, and should be unknown to an unprivileged user; grep for s_hash_seed in kernel and e2fsprogs source.
Why am I seeing a security exploit?
Posted Mar 29, 2013 16:37 UTC (Fri) by jmorris42 (guest, #2203) [Link]
Why am I seeing a security exploit?
Posted Apr 3, 2013 20:26 UTC (Wed) by bronson (subscriber, #4806) [Link]
More seriously, there are other sources of error in your computer that are far more worthy of your attention: http://en.wikipedia.org/wiki/Soft_error
It's true that all bets are off if an attacker can break the hash. But, if/when that happens, the fix will probably be a straightforward kernel patch.
Why am I seeing a security exploit?
Posted Apr 4, 2013 16:20 UTC (Thu) by jimparis (guest, #38647) [Link]
(numbers from http://preshing.com/20110504/hash-collision-probabilities)
Widening ext4's readdir() cookie
Posted Mar 29, 2013 11:32 UTC (Fri) by magnus (subscriber, #34778) [Link]
Widening ext4's readdir() cookie
Posted Mar 29, 2013 13:54 UTC (Fri) by bfields (subscriber, #19510) [Link]
Off the top of my head one objection is that there's no guarantee of forward progress for readers.
Widening ext4's readdir() cookie
Posted Mar 30, 2013 5:40 UTC (Sat) by Mog (subscriber, #29529) [Link]
Is it correct? I remember a limitation on the number of subdirs due to the nlink field in the stat structure being a short, but not on the total number of entries (subdirs+files+...).
Widening ext4's readdir() cookie
Posted Apr 4, 2013 19:40 UTC (Thu) by aakef (subscriber, #38030) [Link]
I think "which does not otherwise need it" is not absolutely correct. Ext4s HTREE/dirindex readdir has another issue. Entries are not returned in inode, but in hash order and for example for 'ls -l' (readdir + stat) or 'rm -rf dir' that causes lots of random disk access. Recent core utils try to eliminate this by sorting results returned by readdir(), but that logic is not used by everything. Depending on the implementation "such a mapping" from above would also eliminate the random access problem at all.
Bernd
