
Negative dentries, 20 years later

Posted Apr 11, 2022 20:35 UTC (Mon) by cesarb (subscriber, #6266)
Parent article: Negative dentries, 20 years later

> While normal dentries are limited by the number of files that actually exist, there are few limits to the number of nonexistent files.

But there is a limit to the number of _ranges_ of nonexistent files: they are limited by the number of files that actually exist, plus one. So a solution could be, instead of storing the name of the nonexistent file, to store the names of the existing files preceding and following the nonexistent one. There is precedent in the NSEC family of DNS record types (NSEC and NSEC3), which prove the absence of a record in a DNS zone by signing a record containing both the preceding and following names.

(You could also make the ranges circular to remove the "plus one" and avoid a couple of special cases.)
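A minimal sketch of the idea in Python, with an in-memory sorted list standing in for the directory's existing entries (all names and the open-end sentinels are made up for illustration):

    import bisect

    existing = sorted(["alpha", "delta", "omega"])   # names that do exist
    negative_ranges = []                             # cached (low, high) pairs

    def lookup(name):
        if name in existing:                         # positive hit
            return True
        for low, high in negative_ranges:
            if low < name < high:                    # covered by a cached range
                return False
        # Cache miss: ask the "filesystem" for the bracketing existing names.
        i = bisect.bisect_left(existing, name)
        low = existing[i - 1] if i > 0 else ""                  # "" plays -infinity
        high = existing[i] if i < len(existing) else "\uffff"   # +infinity sentinel
        negative_ranges.append((low, high))
        return False

    print(lookup("beta"))     # False; caches the range ("alpha", "delta")
    print(lookup("charlie"))  # False; answered from the cached range

With the circular variant, the two sentinels would collapse into a single wrap-around range between the last and first existing names.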



Negative dentries, 20 years later

Posted Apr 11, 2022 21:48 UTC (Mon) by nybble41 (subscriber, #55106)

> a solution could be, instead of storing the name of the nonexistent file, to store the names of the existing files preceding and following the nonexistent one

I like this. Automatically bounded, and you can potentially answer many queries with a single negative range. You could even reuse the positive dentries for the preceding and following files, so long as you can ensure they won't be freed prematurely, further reducing RAM requirements.

The main drawback, it seems to me, is that the current filesystem APIs don't provide the preceding and following files in response to failed lookups (which could be changed) and, more importantly, that some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like <https://pypi.org/project/simple-httpfs/>, which only responds to requests for specific paths and can't be enumerated to find the preceding and following positive entries. (And even if it could, that information could be obsolete before the query finishes; this can affect even simple non-range negative dentries, but storing ranges, including names not directly queried, would make it worse.)

One can imagine using a mix of ranges where filesystem support exists and simple negative entries where it doesn't, but then we still need to solve the original problem. Or we could just not cache negative lookups from filesystems that can't provide ranges, but that has its own drawbacks.

Negative dentries, 20 years later

Posted Apr 12, 2022 11:37 UTC (Tue) by grawity (subscriber, #80596)

> some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like <https://pypi.org/project/simple-httpfs/>

I think you would have the same problem with ordinary network filesystems like NFS/SMB when accessing a directory where you have +x but not +r. The server knows the nearest two files in that directory, but the client doesn't have the permissions to know that.

Negative dentries, 20 years later

Posted Apr 11, 2022 21:49 UTC (Mon) by NYKevin (subscriber, #129325)

It is my understanding that the dentry cache is currently non-exhaustive, that is, it does not contain every dentry in a given directory, but only those that have been recently accessed. I'm not clear on how you would go about creating negative dentry ranges in a non-exhaustive cache. I can think of a few ways to do it, but they all seem obviously terrible to me:

1. Make the dentry cache exhaustive; scan and pre-cache the entire directory when you want to cache any individual dentry. On ext4 and other modern FS's, this will result in hilariously large caches in some cases, because directories are allowed to have a very large number of entries. Also, the scan will take a long time to complete.
2. Scan the directory once every time a failed lookup occurs, and make a negative dentry for the range surrounding that failed lookup. This slows down failed lookups by potentially quite a lot, especially on filesystems where dentries are not sorted in any convenient way (e.g. most if not all versions of FAT, ext2).
3. Scan the directory once every time a successful lookup occurs, and make negative dentries for the two ranges adjacent to the successful lookup. This is subject to most of the same problems as (2), except that it slows down successful lookups instead of failed lookups.
4. Periodically identify directories that have a lot of cached negative dentries, scan them, and coalesce their negative dentries into ranges. But it would be far simpler to just expire the negative dentries which have not been recently used; a negative dentry does not correspond to anything on disk, and so negative dentries are always clean and can be unceremoniously discarded without disk access (i.e. discarding is very fast compared to scanning).

Negative dentries, 20 years later

Posted Apr 11, 2022 23:10 UTC (Mon) by cesarb (subscriber, #6266)

> I'm not clear on how you would go about creating negative dentry ranges in a non-exhaustive cache.

It would need help from each filesystem. I can think of three main ways of storing directory entries on a filesystem: they can be an unsorted list of filenames (as you mentioned, FAT and ext2 are good examples), or they can be a more complicated sorted structure (either a linear ordered list or something like a b-tree), or they can be an unsorted hash table.

For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.
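A small illustration of that single linear pass in Python, using a made-up unsorted directory listing (the names are hypothetical):

    def failed_lookup_range(entries, missing):
        """Return the (preceding, following) existing names around `missing`,
        found in one linear pass over an unsorted directory; None marks an
        open end of the range."""
        preceding = following = None
        for name in entries:                 # unsorted on-disk order, e.g. FAT/ext2
            if name < missing and (preceding is None or name > preceding):
                preceding = name
            elif name > missing and (following is None or name < following):
                following = name
        return preceding, following

    print(failed_lookup_range(["zebra", "apple", "mango"], "kiwi"))
    # ('apple', 'mango'): the negative range that could be cached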

For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There's an extra complication in that the range lookup would have to use the filesystem's native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).

I don't have a good solution for filesystems which use a hash table directly, though an approach that uses the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).

The worst case would be remote filesystems (which also includes things like FUSE), since not only would their API have no concept of a negative dentry range, but a range could also become invalid without warning.

Negative dentries, 20 years later

Posted Apr 12, 2022 0:17 UTC (Tue) by NYKevin (subscriber, #129325)

> For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.

That is probably acceptable.

> For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There's an extra complication in that the range lookup would have to use the filesystem's native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).

This might require more complicated tree traversal logic, in the case where one of the neighbor leaves is in a different subtree to the other neighbor leaf. At worst, that probably adds an extra O(log n) root-to-leaf traversal, which is likely acceptable in practice.

> I don't have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).

Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 < k < 1); if the hash table grows larger than that, it gets rebuilt with more buckets. This periodic rebuilding is necessary if you want a hash table to have O(1) performance on its standard operations - else it will eventually degrade to linear scans (regardless of whether you use open hashing or closed hashing). But this also means that it's possible for the hash table to be much, much sparser than k, if it has just been rebuilt (or if you've only added a small number of elements in the first place). As a result, in the worst case you could potentially scan N-1 consecutive buckets to find the boundaries (when there's only one occupied bucket). In cases where the hash table was just rebuilt, you could observe up to N - kn consecutive empty buckets, where n is the previous size (n depends on your growth algorithm, but it's probably some constant fraction of N, e.g. n = N/2). In both cases, this means our worst case performance has degraded from O(1) to O(N) (where N is the number of buckets, which is proportional to the actual number of dentries). That is likely to be unacceptable, for systems where we decided to use a hash table in the first place, because it defeats the main selling point of a hash table.
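A rough sketch of that worst case in Python, with a made-up bucket array standing in for the on-disk hash table (a freshly grown, or heavily emptied, directory with a single occupied bucket):

    def neighbour_buckets(buckets, miss_index):
        """Scan outward from the bucket the missed name hashed to and return
        the indices of the nearest occupied buckets on each side (or None)."""
        n = len(buckets)
        left = right = None
        for step in range(1, n):                     # up to N-1 probes per side
            if left is None and buckets[(miss_index - step) % n]:
                left = (miss_index - step) % n
            if right is None and buckets[(miss_index + step) % n]:
                right = (miss_index + step) % n
            if left is not None and right is not None:
                break
        return left, right

    buckets = [[] for _ in range(16)]     # 16 buckets, only one of them occupied
    buckets[3] = ["only_file"]
    print(neighbour_buckets(buckets, 11)) # (3, 3), after scanning half the table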

> The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.

That also applies to individual negative dentries. The only way negative dentry caching can possibly work is if it takes place on the server side, not the client side. Which is rather unfortunate, seeing as the network round trip is far more expensive than the disk lookup (at least in high-latency deployments, anyway).

Negative dentries, 20 years later

Posted Apr 12, 2022 0:24 UTC (Tue) by NYKevin (subscriber, #129325)

> (where N is the number of buckets, which is proportional to the actual number of dentries)

Correction: Where N is proportional to *the largest number of dentries that the directory has ever had, or the largest number since the last time the filesystem shrank the hash table, if the filesystem does that at all.* This makes things significantly worse, because it means you really can get N - 1 for large N in a non-pathological case (i.e. removing most or all of the contents of a directory, but not rmdir'ing the directory itself).

Negative dentries, 20 years later

Posted Apr 12, 2022 12:42 UTC (Tue) by Wol (subscriber, #4433)

> Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 < k < 1); if the hash table grows larger than that, it gets rebuilt with more buckets.

Or you use a modern hashing algorithm like the one that, dare I say it, Pick has used for the last 40 years ... I believe it's the same one used by Python and Sleepycat/Berkeley DB, among others.

https://en.wikipedia.org/wiki/Linear_hashing

tl;dr:
Given 2^(n-1) < N <= 2^n buckets, if the hash mod 2^n gives a valid bucket, use it; otherwise use the hash mod 2^(n-1). When creating a new bucket N+1, all the records that belong in it will currently be in bucket (N+1) mod 2^(n-1), so the cost of growing the hash table by one bucket at a time is constant, and the table can be kept at optimal fullness with minimal cost.
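A minimal sketch of that address rule in Python, assuming buckets are numbered 0..N-1 (the bucket count used below is arbitrary):

    def bucket_for(key_hash, num_buckets):
        # Pick n so that 2^(n-1) < num_buckets <= 2^n (clamped to at least 1).
        n = max(1, (num_buckets - 1).bit_length())
        candidate = key_hash % (1 << n)
        if candidate < num_buckets:          # that bucket already exists
            return candidate
        return key_hash % (1 << (n - 1))     # not split yet: use the old modulus

    # With 6 buckets (n = 3), every hash lands in a bucket that exists:
    print([bucket_for(h, 6) for h in range(8)])   # [0, 1, 2, 3, 4, 5, 2, 3]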

And being a hash table, you can return a definitive yes/no just by scanning the one bucket.

Cheers,
Wol

Negative dentries, 20 years later

Posted Apr 12, 2022 19:26 UTC (Tue) by NYKevin (subscriber, #129325)

> And being a hash table, you can return a definitive yes/no just by scanning the one bucket.

You have entirely missed the point. We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.

Furthermore, if you allow deletions, then a linearly-growing hash table does not help, because you can grow the hash table to be huge, and then depopulate it (e.g. by making lots of entries in a directory, then removing them).

Negative dentries, 20 years later

Posted Apr 12, 2022 20:54 UTC (Tue) by cesarb (subscriber, #6266)

> We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets.

Actually, now that I've thought about it some more, even that might not be necessary. A hash bucket itself is already a "range" of directory entries, since its being empty means that none of the entries hashing to that bucket are present. Since the number of hash buckets is limited, that could be enough to limit the number of negative "ranges".
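A sketch of that in Python, with a made-up on-disk bucket layout and hash function; the negative cache simply remembers which (directory, bucket) pairs are known to be empty:

    NUM_BUCKETS = 8
    empty_buckets = set()                       # cached (dir_id, bucket) pairs

    def dir_hash(name):
        return hash(name) % NUM_BUCKETS         # stand-in for the on-disk hash

    def lookup(dir_id, name, on_disk_buckets):
        bucket = dir_hash(name)
        if (dir_id, bucket) in empty_buckets:
            return False                        # negative hit, no disk access
        entries = on_disk_buckets[bucket]
        if name in entries:
            return True
        if not entries:                         # the whole bucket is the "range"
            empty_buckets.add((dir_id, bucket))
        return False

    disk = {i: [] for i in range(NUM_BUCKETS)}
    disk[dir_hash("present")] = ["present"]
    print(lookup(1, "present", disk))           # True
    print(lookup(1, "absent", disk))            # False; cached if its bucket is empty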

Negative dentries, 20 years later

Posted Apr 12, 2022 21:09 UTC (Tue) by NYKevin (subscriber, #129325)

Yes, you can do that, but it buys you less than you think:

* Under open addressing (linear/quadratic probing, cuckoo hashing, etc.), the number of buckets must be greater than the number of dentries, and for performance reasons, it should be significantly greater.
* Under closed addressing (separate chaining), the number of dentries cannot be much more than a small (constant) multiple of the number of buckets, or performance suffers.

In both cases: Assuming the hash function is a quasi-random oracle, the probability of a hash collision should be quite low, and should become lower as the hash table grows (so that the average number of collisions that each dentry suffers remains bounded above). A cached negative dentry range consisting of a single hash bucket will only help you for other dentries that would have collided with that bucket, which should not be very many in practice, or else the hash function is bad.

Negative dentries, 20 years later

Posted Apr 12, 2022 21:42 UTC (Tue) by Wol (subscriber, #4433)

> You have entirely missed the point. We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.

Except you completely missed the point I was responding to. cesarb was talking about a hashed directory, and not a cache.

And in that case, there are NO empty buckets (if you want to use that particular variant of hashing). The variant I'm used to sticks multiple entries per bucket, but there's a variant that has exactly as many buckets as entries, and one entry per bucket.

And just as it's very quick and easy to grow the table, it's just as easy to shrink it: the contents of the bucket you're getting rid of just get appended to the bucket that the (now invalid) old hash maps to.

Cheers,
Wol

Negative dentries, 20 years later

Posted Apr 13, 2022 3:00 UTC (Wed) by NYKevin (subscriber, #129325)

If every bucket is occupied, then there are no empty buckets to cache.

Negative dentries, 20 years later

Posted Apr 12, 2022 9:16 UTC (Tue) by pomac (subscriber, #94901)

You don't really want to make things exhaustive... Imagine having users creating single directories with millions of files.

It's already a DoS in terms of memory due to how Linux works, but enumerating them all into the cache... Nah. ;)

Negative dentries, 20 years later

Posted Apr 14, 2022 11:44 UTC (Thu) by nix (subscriber, #2304)

> You don't really want to make things exhaustive... Imagine having users creating single directories with millions of files.

Maildirs do this routinely. It's amazing how much trouble it causes (to backup programs in particular, since they have to enumerate every inode in the directory to figure out which ones changed). (INN news tradspools do it too, but they're not per-user and they're relatively rarely backed up at all.)

Negative dentries, 20 years later

Posted Apr 11, 2022 22:13 UTC (Mon) by willy (subscriber, #9762)

By all means try to implement this, but I think it'll be vastly inferior to the current dcache.

Lookup is currently hash(directory pointer, component) followed by a search of the global hash table. What you're proposing would be a per-directory data structure to search, so a massive change.

Also, you'd need to be able to query the filesystem for the "previous" and "next" entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).

Negative dentries, 20 years later

Posted Apr 11, 2022 23:30 UTC (Mon) by cesarb (subscriber, #6266)

> By all means try to implement this, but I think it'll be vastly inferior to the current dcache.

I don't doubt that it would be inferior; still, it's an interesting thought experiment.

> Lookup is currently hash(directory pointer, component) and search the global hash table. What you're proposing would be a per-directory data structure to search, so a massive change.

Actually, it could be a per-filesystem data structure, if the concept of range is extended to include the parent directory; but it would have to be something like hash(directory pointer) || key(component), with the definition of key() depending on the filesystem's native on-disk order (if it has one; naïve filesystems with simple unsorted directories could use hash(component) for the key). I agree that it would be a massive change, with potential gains only in an unlikely case (a large number of lookups of non-existing entries, which could be hits in a negative range cache but would be misses in a negative entry cache).
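A small sketch of that composite key in Python; key() here just hashes the component, as suggested for filesystems with unsorted directories, and everything else (the hash choice, the key widths) is purely illustrative:

    import hashlib

    def dir_key(directory_id):
        return hashlib.sha1(str(directory_id).encode()).digest()[:8]

    def component_key(name):
        # Naive filesystems: key(component) = hash(component).
        return hashlib.sha1(name.encode()).digest()[:8]

    def range_key(directory_id, name):
        # hash(directory pointer) || key(component): concatenation keeps all
        # entries (and negative ranges) of one directory contiguous.
        return dir_key(directory_id) + component_key(name)

    entries = [(42, "b"), (43, "a"), (42, "a"), (43, "b")]
    print(sorted(entries, key=lambda e: range_key(*e)))
    # Each directory's entries end up adjacent in the per-filesystem order.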

> Also, you'd need to be able to query the filesystem for the "previous" and "next" entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).

Yes, the sort would have to be filesystem-specific; I went into more detail in another comment above.

