
Negative dentries, 20 years later

Posted Apr 12, 2022 0:17 UTC (Tue) by NYKevin (subscriber, #129325)
In reply to: Negative dentries, 20 years later by cesarb
Parent article: Negative dentries, 20 years later

> For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.

That is probably acceptable.

> For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There's an extra complication in that the range lookup would have to use the filesystem's native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).

This might require more complicated tree traversal logic in the case where one of the neighbor leaves is in a different subtree from the other. At worst, that probably adds an extra O(log n) root-to-leaf traversal, which is likely acceptable in practice.
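
As a rough illustration of the "return the neighbors on a miss" idea, here is a minimal user-space sketch that uses a flat sorted array as a stand-in for whatever sorted on-disk structure the filesystem really uses (all names here are made up, and the cross-subtree complication is elided):

#include <stdio.h>
#include <string.h>

/* Returns 1 on a hit; on a miss returns 0 and sets *prev/*next to the
 * nearest preceding and following entries (or NULL at either end). */
static int lookup_with_neighbors(const char **names, int n, const char *target,
                                 const char **prev, const char **next)
{
    int lo = 0, hi = n;            /* search the half-open range [lo, hi) */

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(target, names[mid]);

        if (cmp == 0)
            return 1;              /* positive entry found */
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    /* lo is now the insertion point: names[lo-1] < target < names[lo]. */
    *prev = lo > 0 ? names[lo - 1] : NULL;
    *next = lo < n ? names[lo] : NULL;
    return 0;
}

int main(void)
{
    const char *names[] = { "Makefile", "lib.c", "main.c" };  /* strcmp order */
    const char *prev, *next;

    if (!lookup_with_neighbors(names, 3, "foo.c", &prev, &next))
        printf("miss: negative range (%s, %s)\n", prev, next);
    return 0;
}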

> I don't have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).

Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 < k < 1); if the hash table grows larger than that, it gets rebuilt with more buckets. This periodic rebuilding is necessary if you want a hash table to have O(1) performance on its standard operations - else it will eventually degrade to linear scans (regardless of whether you use open hashing or closed hashing).

But this also means that it's possible for the hash table to be much, much sparser than k, if it has just been rebuilt (or if you've only added a small number of elements in the first place). As a result, in the worst case you could potentially scan N-1 consecutive buckets to find the boundaries (when there's only one occupied bucket). In cases where the hash table was just rebuilt, you could observe up to N - kn consecutive empty buckets, where n is the previous number of buckets (n depends on your growth algorithm, but it's probably some constant fraction of N, e.g. n = N/2).

In both cases, this means our worst case performance has degraded from O(1) to O(N) (where N is the number of buckets, which is proportional to the actual number of dentries). That is likely to be unacceptable for systems where we decided to use a hash table in the first place, because it defeats the main selling point of a hash table.
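
To make that worst case concrete, here's a small user-space sketch (hypothetical, not any real filesystem's on-disk layout) of the outward scan that finding a negative *range* of buckets would require; with a single occupied bucket it touches nearly all N of them:

#include <stdbool.h>
#include <stdio.h>

/* Find the empty bucket range [*lo, *hi] around bucket idx (assumed
 * empty), and return how many buckets had to be examined. */
static size_t empty_range(const bool *occupied, size_t nbuckets, size_t idx,
                          size_t *lo, size_t *hi)
{
    size_t scanned = 1;

    *lo = *hi = idx;
    while (*lo > 0 && !occupied[*lo - 1]) {            /* walk left  */
        (*lo)--;
        scanned++;
    }
    while (*hi + 1 < nbuckets && !occupied[*hi + 1]) { /* walk right */
        (*hi)++;
        scanned++;
    }
    return scanned;
}

int main(void)
{
    bool occupied[16] = { false };
    size_t lo, hi;

    occupied[3] = true;                         /* a nearly empty table */
    size_t cost = empty_range(occupied, 16, 10, &lo, &hi);
    printf("empty range [%zu, %zu], %zu buckets scanned\n", lo, hi, cost);
    return 0;
}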

> The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.

That also applies to individual negative dentries. The only way negative dentry caching can possibly work there is if it takes place on the server side rather than the client side. Which is rather unfortunate, since the network round trip is far more expensive than the disk lookup (at least in high-latency deployments, anyway).


Negative dentries, 20 years later

Posted Apr 12, 2022 0:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> (where N is the number of buckets, which is proportional to the actual number of dentries)

Correction: Where N is proportional to *the largest number of dentries that the directory has ever had, or the largest number since the last time the filesystem shrank the hash table, if the filesystem does that at all.* This makes things significantly worse, because it means you really can get N - 1 for large N in a non-pathological case (i.e. removing most or all of the contents of a directory, but not rmdir'ing the directory itself).

Negative dentries, 20 years later

Posted Apr 12, 2022 12:42 UTC (Tue) by Wol (subscriber, #4433) [Link] (5 responses)

> Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 < k < 1); if the hash table grows larger than that, it gets rebuilt with more buckets.

Or you use a modern hash algorithm like the one that, dare I say it, Pick has used for the last 40 years ... I believe it's the same one used by Python and Sleepycat/Berkeley DB, amongst others.

https://en.wikipedia.org/wiki/Linear_hashing

tldr;
Given 2^(n-1) < N <= 2^n buckets, if hash mod 2^n gives a valid bucket, use that; otherwise use hash mod 2^(n-1). When creating a new bucket N+1, all the records that belong in it will currently be in bucket (N+1) mod 2^(n-1), so the cost of growing the hash table one bucket at a time is constant. So the hash table can be kept at optimal fullness with minimal cost.
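
A minimal sketch of that address calculation (my own names, user space, with the level bookkeeping simplified; see the Wikipedia article above for the full scheme):

#include <stdint.h>
#include <stdio.h>

/* Linear-hashing addressing: with nbuckets buckets and
 * 2^(level-1) < nbuckets <= 2^level, try hash mod 2^level and fall
 * back to hash mod 2^(level-1) if that bucket doesn't exist yet. */
static size_t lh_bucket(uint64_t hash, size_t nbuckets, unsigned level)
{
    size_t b = hash % ((uint64_t)1 << level);

    if (b >= nbuckets)
        b = hash % ((uint64_t)1 << (level - 1));
    return b;
}

int main(void)
{
    /* 6 buckets: 2^2 < 6 <= 2^3, so level = 3. */
    for (uint64_t h = 0; h < 8; h++)
        printf("hash %llu -> bucket %zu\n",
               (unsigned long long)h, lh_bucket(h, 6, 3));
    return 0;
}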

And being a hash table, you can return a definitive yes/no just by scanning the one bucket.

Cheers,
Wol

Negative dentries, 20 years later

Posted Apr 12, 2022 19:26 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (4 responses)

> And being a hash table, you can return a definitive yes/no just by scanning the one bucket.

You have entirely missed the point. We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.

Furthermore, if you allow deletions, then a linearly-growing hash table does not help, because you can grow the hash table to be huge, and then depopulate it (e.g. by making lots of entries in a directory, then removing them).

Negative dentries, 20 years later

Posted Apr 12, 2022 20:54 UTC (Tue) by cesarb (subscriber, #6266) [Link] (1 responses)

> We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets.

Actually, now that I've thought about it some more, not even that might be necessary. A hash bucket is itself already a "range" of directory entries, since its being empty means that none of the entries hashing to that bucket are present. Since the number of hash buckets is limited, that could be enough to limit the number of negative "ranges".
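
As a sketch of that idea (entirely made-up names and a stand-in hash function), the negative cache would record "this bucket of this directory is known empty" rather than caching individual names:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 64

static bool known_empty[NBUCKETS];      /* per-directory negative "ranges" */

static size_t name_bucket(const char *name)
{
    uint64_t h = 14695981039346656037ull;       /* FNV-1a as a stand-in */
    while (*name)
        h = (h ^ (unsigned char)*name++) * 1099511628211ull;
    return h % NBUCKETS;
}

/* Stand-in for the real directory lookup; pretend the name is absent. */
static bool fs_lookup(const char *name) { (void)name; return false; }

static bool lookup(const char *name)
{
    size_t b = name_bucket(name);

    if (known_empty[b]) {                       /* answered from the cache */
        printf("%s: negative (cached bucket %zu)\n", name, b);
        return false;
    }
    if (!fs_lookup(name)) {
        known_empty[b] = true;                  /* cache the whole bucket */
        printf("%s: negative (bucket %zu now cached)\n", name, b);
        return false;
    }
    return true;
}

int main(void)
{
    lookup("foo");
    lookup("foo");      /* second miss never reaches the filesystem */
    return 0;
}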

Negative dentries, 20 years later

Posted Apr 12, 2022 21:09 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

Yes, you can do that, but it buys you less than you think:

* Under open addressing (linear/quadratic probing, cuckoo hashing, etc.), the number of buckets must be greater than the number of dentries, and for performance reasons, it should be significantly greater.
* Under closed addressing (separate chaining), the number of dentries cannot be much more than a small (constant) multiple of the number of buckets, or performance suffers.

In both cases: Assuming the hash function is a quasi-random oracle, the probability of a hash collision should be quite low, and should become lower as the hash table grows (so that the average number of collisions that each dentry suffers remains bounded above). A cached negative dentry range consisting of a single hash bucket will only help you for other dentries that would have collided with that bucket, which should not be very many in practice, or else the hash function is bad.

Negative dentries, 20 years later

Posted Apr 12, 2022 21:42 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> You have entirely missed the point. We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.

Except you completely missed the point I was responding to. cesarb was talking about a hashed directory, and not a cache.

And in that case, there are NO empty buckets (if you want to use that particular variant of hashing). The variant I'm used to sticks multiple entries per bucket, but there's a variant that has exactly as many buckets as entries, and one entry per bucket.

And just as it's very quick and easy to grow the table, it's just as easy to shrink it - the contents of the bucket you're getting rid of just get tacked onto the end of the bucket that the (now out-of-range) hash value falls back to.
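
For concreteness, a tiny sketch (hypothetical, user-space) of which bucket that is under the mod 2^n / mod 2^(n-1) addressing described above:

#include <stdio.h>

int main(void)
{
    unsigned level = 3, nbuckets = 6;      /* 2^2 < 6 <= 2^3 */
    unsigned removed = nbuckets - 1;       /* last bucket, being dropped */
    unsigned buddy = removed % (1u << (level - 1));

    /* Keys that addressed the removed bucket via hash mod 2^level now
     * fall back to hash mod 2^(level-1), i.e. this single buddy bucket. */
    printf("records from bucket %u are appended to bucket %u\n",
           removed, buddy);
    return 0;
}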

Cheers,
Wol

Negative dentries, 20 years later

Posted Apr 13, 2022 3:00 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

If every bucket is occupied, then there are no empty buckets to cache.

