LWN: Comments on "Negative dentries, 20 years later" https://lwn.net/Articles/890025/ This is a special feed containing comments posted to the individual LWN article titled "Negative dentries, 20 years later". en-us Tue, 04 Nov 2025 01:43:36 +0000 Tue, 04 Nov 2025 01:43:36 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Intuitive ways to exit text editors https://lwn.net/Articles/891749/ https://lwn.net/Articles/891749/ Jandar <div class="FormattedComment"> <font class="QuotedText">&gt; CTRL+Z</font><br> <font class="QuotedText">&gt; $ killall emacs vi</font><br> <p> Neither vi in insert-mode nor emacs in x11 mode stops with CTRL-Z ;-)<br> <p> vi inserts the CTRL-Z and emacs executes suspend-frame.<br> </div> Mon, 18 Apr 2022 20:41:24 +0000 Intuitive ways to exit text editors https://lwn.net/Articles/891678/ https://lwn.net/Articles/891678/ Spack <div class="FormattedComment"> CTRL+Z<br> $ killall emacs vi<br> </div> Sun, 17 Apr 2022 22:59:52 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891323/ https://lwn.net/Articles/891323/ nix <div class="FormattedComment"> <font class="QuotedText">&gt; You don&#x27;t really want to make things exhaustive... Imagine having users creating single directories with millions of files.</font><br> <p> Maildirs do this routinely. It&#x27;s amazing how much trouble it causes (to backup programs in particular, since they have to enumerate every inode in the directory to figure out which ones changed). (INN news tradspools do it too, but they&#x27;re not per-user and they&#x27;re relatively rarely backed up at all.)<br> </div> Thu, 14 Apr 2022 11:44:42 +0000 Intuitive ways to exit text editors https://lwn.net/Articles/891301/ https://lwn.net/Articles/891301/ SiB <div class="FormattedComment"> My way of starting emacs is `emacsclient -c -d &quot;$DISPLAY&quot; --alternate-editor=&#x27;&#x27; &quot;$1&quot;`<br> <p> That has the same effect on how to kill the emacs process. Ctrl-X Ctrl-C just closes the frame.<br> <p> </div> Thu, 14 Apr 2022 07:40:03 +0000 Intuitive ways to exit text editors https://lwn.net/Articles/891275/ https://lwn.net/Articles/891275/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; emacs: Ctrl-G Ctrl-G Ctrl-G Ctrl-X Ctrl-C</font><br> <p> This actually doesn&#x27;t work for me ..... because I deliberately disabled it as I was somehow typing it accidentally.<br> Now I use : alt-x k i l l - e m a c s enter<br> I haven&#x27;t yet typed that by mistake.<br> <p> </div> Wed, 13 Apr 2022 23:50:38 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891111/ https://lwn.net/Articles/891111/ NYKevin <div class="FormattedComment"> If every bucket is occupied, then there are no empty buckets to cache.<br> </div> Wed, 13 Apr 2022 03:00:06 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891102/ https://lwn.net/Articles/891102/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; You have entirely missed the point. We&#x27;re not trying to figure out whether the bucket is occupied, we&#x27;re trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.</font><br> <p> Except you completely missed the point I was responding to. cesarb was talking about a hashed directory, and not a cache.<br> <p> And in that case, there are NO empty buckets (if you want to use that particular variant of hashing). 
The variant I&#x27;m used to sticks multiple entries per bucket, but there&#x27;s a variant that has exactly as many buckets as entries, and one entry per bucket.<br> <p> And just as it&#x27;s very quick and easy to grow the table, it&#x27;s just as easy to shrink it - the entire bucket you&#x27;re getting rid of just tags on the end of the bucket the (now invalid) old hash hashes to.<br> <p> Cheers,<br> Wol<br> </div> Tue, 12 Apr 2022 21:42:39 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891099/ https://lwn.net/Articles/891099/ NYKevin <div class="FormattedComment"> Yes, you can do that, but it buys you less than you think:<br> <p> * Under open addressing (linear/quadratic probing, cuckoo hashing, etc.), the number of buckets must be greater than the number of dentries, and for performance reasons, it should be significantly greater.<br> * Under closed addressing (separate chaining), the number of dentries cannot be much more than a small (constant) multiple of the number of buckets, or performance suffers.<br> <p> In both cases: Assuming the hash function is a quasi-random oracle, the probability of a hash collision should be quite low, and should become lower as the hash table grows (so that the average number of collisions that each dentry suffers remains bounded above). A cached negative dentry range consisting of a single hash bucket will only help you for other dentries that would have collided with that bucket, which should not be very many in practice, or else the hash function is bad.<br> </div> Tue, 12 Apr 2022 21:09:23 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891098/ https://lwn.net/Articles/891098/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; We&#x27;re not trying to figure out whether the bucket is occupied, we&#x27;re trying to find its nearest occupied neighbors so that we can find an empty range of buckets.</font><br> <p> Actually, now that I thought about it better, not even that might be necessary. A hash bucket itself is already a &quot;range&quot; of directory entries, since it being empty means that none of the entries hashing to that bucket are present. Since the number of hash buckets is limited, that could be enough to limit the number of negative &quot;ranges&quot;.<br> </div> Tue, 12 Apr 2022 20:54:26 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891092/ https://lwn.net/Articles/891092/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; And being a hash table, you can return a definitive yes/no just by scanning the one bucket.</font><br> <p> You have entirely missed the point. We&#x27;re not trying to figure out whether the bucket is occupied, we&#x27;re trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.<br> <p> Furthermore, if you allow deletions, then a linearly-growing hash table does not help, because you can grow the hash table to be huge, and then depopulate it (e.g. by making lots of entries in a directory, then removing them).<br> </div> Tue, 12 Apr 2022 19:26:06 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891080/ https://lwn.net/Articles/891080/ vgoyal <div class="FormattedComment"> We indeed raise open file limit to 1M by default. But that requires CAP_SYS_RESOURCE. 
We also have been looking at running virtiofsd unprivileged and don&#x27;t have CAP_SYS_RESOURCE and hence can&#x27;t raise the limit.<br> </div> Tue, 12 Apr 2022 16:06:30 +0000 Intuitive ways to exit text editors https://lwn.net/Articles/891078/ https://lwn.net/Articles/891078/ dskoll <p>It can be hard to get out of emacs too, sometimes. The emergency cheat-sheet: <p>vi: <tt>Esc Esc Esc :q!</tt> <p>emacs: <tt>Ctrl-G Ctrl-G Ctrl-G Ctrl-X Ctrl-C</tt> Tue, 12 Apr 2022 15:50:02 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891059/ https://lwn.net/Articles/891059/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 &lt; k &lt; 1); if the hash table grows larger than that, it gets rebuilt with more buckets.</font><br> <p> Or you use a modern hash algorithm like, dare I say it, Pick has used for the last 40 years ... I believe it&#x27;s the same one as used by Python and Sleepycat/Berkeley amongst others.<br> <p> <a href="https://en.wikipedia.org/wiki/Linear_hashing">https://en.wikipedia.org/wiki/Linear_hashing</a><br> <p> tldr;<br> Given 2^(n-1) &lt; N &lt;= 2^n, if mod(2^n) gives a valid bucket use that otherwise use mod(2^(n-1)). When creating a new bucket N+1, all the records that belong in the new bucket will currently be in bucket (N+1)mod(2^(n-1)), so the cost of growing the hash table by one bucket at a time is constant. So the hash table can be kept at optimal fullness with minimal cost.<br> <p> And being a hash table, you can return a definitive yes/no just by scanning the one bucket.<br> <p> Cheers,<br> Wol<br> </div> Tue, 12 Apr 2022 12:42:55 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891058/ https://lwn.net/Articles/891058/ aaronmdjones <div class="FormattedComment"> <font class="QuotedText">&gt; As it happens, repeated attempts to look up a nonexistent file are also common; an example would be the shell&#x27;s process of working through the search path every time a user types &quot;vi&quot; (Emacs users start the editor once and never leave its cozy confines thereafter, so they don&#x27;t benefit in the same way).</font><br> <p> It could also be argued that vi users don&#x27;t benefit either, because they can&#x27;t figure out how to close it.<br> </div> Tue, 12 Apr 2022 12:04:32 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891057/ https://lwn.net/Articles/891057/ grawity <blockquote><p>some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like &lt;https://pypi.org/project/simple-httpfs/&gt;</p></blockquote> <p>I think you would have the same problem with ordinary network filesystems like NFS/SMB when accessing a directory where you have +x but not +r. The server knows the nearest two files in that directory, but the client doesn't have the permissions to know that.</p> Tue, 12 Apr 2022 11:37:58 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891014/ https://lwn.net/Articles/891014/ pomac <div class="FormattedComment"> You don&#x27;t really want to make things exhaustive... Imagine having users creating single directories with millions of files.<br> <p> It&#x27;s already a DoS in terms of memory due to how Linux works but enumerating them all with caches... Nah.. 
;)<br> </div> Tue, 12 Apr 2022 09:16:27 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891011/ https://lwn.net/Articles/891011/ josh <div class="FormattedComment"> <font class="QuotedText">&gt; Is there a way to track the amount of time it took to generate the dentry, and then trash the cheaper ones and keep the more expensive ones?</font><br> <p> That seems like a useful heuristic to weight entries, if you also take into account the number of lookups.<br> </div> Tue, 12 Apr 2022 07:45:54 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891012/ https://lwn.net/Articles/891012/ geert <div class="FormattedComment"> Why not a power of two? ;-)<br> </div> Tue, 12 Apr 2022 07:44:51 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891010/ https://lwn.net/Articles/891010/ dgc <div class="FormattedComment"> Hi Jon,<br> <p> More discussion of the periodic reclaim concept for controlling cache explosions can be found under this thread:<br> <p> <a href="https://lwn.net/ml/linux-mm/20220402072103.5140-1-hdanton%40sina.com/">https://lwn.net/ml/linux-mm/20220402072103.5140-1-hdanton...</a><br> <p> -Dave.<br> </div> Tue, 12 Apr 2022 07:26:19 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891005/ https://lwn.net/Articles/891005/ benjamir <div class="FormattedComment"> Isn&#x27;t the number of open files (sometimes) user defined?<br> Why not depend on that (in fractions or multiples)?<br> </div> Tue, 12 Apr 2022 04:39:02 +0000 Negative dentries, 20 years later https://lwn.net/Articles/891000/ https://lwn.net/Articles/891000/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; (where N is the number of buckets, which is proportional to the actual number of dentries)</font><br> <p> Correction: Where N is proportional to *the largest number of dentries that the directory has ever had, or the largest number since the last time the filesystem shrank the hash table, if the filesystem does that at all.* This makes things significantly worse, because it means you really can get N - 1 for large N in a non-pathological case (i.e. removing most or all of the contents of a directory, but not rmdir&#x27;ing the directory itself).<br> </div> Tue, 12 Apr 2022 00:24:35 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890999/ https://lwn.net/Articles/890999/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.</font><br> <p> That is probably acceptable.<br> <p> <font class="QuotedText">&gt; For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There&#x27;s an extra complication in that the range lookup would have to use the filesystem&#x27;s native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).</font><br> <p> This might require more complicated tree traversal logic, in the case where one of the neighbor leaves is in a different subtree to the other neighbor leaf. 
At worst, that probably adds an extra O(log n) root-to-leaf traversal, which is likely acceptable in practice.<br> <p> <font class="QuotedText">&gt; I don&#x27;t have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).</font><br> <p> Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 &lt; k &lt; 1); if the hash table grows larger than that, it gets rebuilt with more buckets. This periodic rebuilding is necessary if you want a hash table to have O(1) performance on its standard operations - else it will eventually degrade to linear scans (regardless of whether you use open hashing or closed hashing). But this also means that it&#x27;s possible for the hash table to be much, much sparser than k, if it has just been rebuilt (or if you&#x27;ve only added a small number of elements in the first place). As a result, in the worst case you could potentially scan N-1 consecutive buckets to find the boundaries (when there&#x27;s only one occupied bucket). In cases where the hash table was just rebuilt, you could observe up to N - kn consecutive empty buckets, where n is the previous size (n depends on your growth algorithm, but it&#x27;s probably some constant fraction of N, e.g. n = N/2). In both cases, this means our worst case performance has degraded from O(1) to O(N) (where N is the number of buckets, which is proportional to the actual number of dentries). That is likely to be unacceptable, for systems where we decided to use a hash table in the first place, because it defeats the main selling point of a hash table.<br> <p> <font class="QuotedText">&gt; The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.</font><br> <p> That also applies to individual negative dentries. The only way negative dentry caching can possibly work is if it takes place on the server side, and not the client side. Which is rather unfortunate, seeing as the network roundtrip is by far the more expensive operation than the disk lookup (at least in high-latency deployments, anyway).<br> </div> Tue, 12 Apr 2022 00:17:07 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890997/ https://lwn.net/Articles/890997/ vgoyal <div class="FormattedComment"> Controlling the size of dentry cache without memory pressure can help us in one of the problems we face with virtiofs. As dentry cache is growing, it keeps inodes around, that means it keeps corresponding data structures around in virtiofs daemon. And it keeps O_PATH fd per inode on server. And soon we run out of number of open fd a process can have. If guest has good amount of RAM, it can cache lot of dentries/inodes without memory pressure and create problems for virtiofs daemon. 
Some notion/knob of controlling the size of dentry cache without memory pressure can possibly help.<br> </div> Mon, 11 Apr 2022 23:54:43 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890994/ https://lwn.net/Articles/890994/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; By all means try to implement this, but I think it&#x27;ll be vastly inferior to the current dcache.</font><br> <p> I don&#x27;t doubt that it would be inferior; still, it&#x27;s an interesting thought experiment.<br> <p> <font class="QuotedText">&gt; Lookup is currently hash(directory pointer, component) and search the global hash table. What you&#x27;re proposing would be a per-directory data structure to search, so a massive change.</font><br> <p> Actually, it could be a per-filesystem data structure, if the concept of range is extended to include the parent directory; but it would have to be something like hash(directory pointer) || key(component), with the definition of key() depending on the filesystem&#x27;s native on-disk order (if it has one; naïve filesystems with simple unsorted directories could use hash(component) for the key). I agree that it would be a massive change, with potential gains only on an unlikely case (a large amount of lookups to non-existing entries, which could be hits on a negative range cache but would be misses on a negative entry cache).<br> <p> <font class="QuotedText">&gt; Also, you&#x27;d need to be able to query the filesystem for the &quot;previous&quot; and &quot;next&quot; entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).</font><br> <p> Yes, the sort would have to be filesystem-specific; I went into more detail on another comment above.<br> </div> Mon, 11 Apr 2022 23:30:51 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890993/ https://lwn.net/Articles/890993/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; I&#x27;m not clear on how you would go about creating negative dentry ranges in a non-exhaustive cache.</font><br> <p> It would need help from each filesystem. I can think of three main ways of storing directory entries on a filesystem: they can be an unsorted list of filenames (as you mentioned, FAT and ext2 are good examples), or they can be a more complicated sorted structure (either a linear ordered list or something like a b-tree), or they can be an unsorted hash table.<br> <p> For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.<br> <p> For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. 
There&#x27;s an extra complication in that the range lookup would have to use the filesystem&#x27;s native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).<br> <p> I don&#x27;t have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).<br> <p> The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.<br> </div> Mon, 11 Apr 2022 23:10:59 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890992/ https://lwn.net/Articles/890992/ MattBBaker <div class="FormattedComment"> It seems like there is also a question of how valuable each dentry, negative or not, really is. A lookup on a local SSD is going to be a lot easier on the system if you get the policy wrong than will a bad decision on a NFS filesystem. Is there a way to track the amount of time it took to generate the dentry, and then trash the cheaper ones and keep the more expensive ones?<br> </div> Mon, 11 Apr 2022 22:26:36 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890991/ https://lwn.net/Articles/890991/ willy <div class="FormattedComment"> By all means try to implement this, but I think it&#x27;ll be vastly inferior to the current dcache.<br> <p> Lookup is currently hash(directory pointer, component) and search the global hash table. What you&#x27;re proposing would be a per-directory data structure to search, so a massive change.<br> <p> Also, you&#x27;d need to be able to query the filesystem for the &quot;previous&quot; and &quot;next&quot; entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).<br> </div> Mon, 11 Apr 2022 22:13:58 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890986/ https://lwn.net/Articles/890986/ NYKevin <div class="FormattedComment"> It is my understanding that the dentry cache is currently non-exhaustive, that is, it does not contain every dentry in a given directory, but only those that have been recently accessed. I&#x27;m not clear on how you would go about creating negative dentry ranges in a non-exhaustive cache. I can think of a few ways to do it, but they all seem obviously terrible to me:<br> <p> 1. Make the dentry cache exhaustive; scan and pre-cache the entire directory when you want to cache any individual dentry. On ext4 and other modern FS&#x27;s, this will result in hilariously large caches in some cases, because directories are allowed to have a very large number of entries. Also, the scan will take a long time to complete.<br> 2. Scan the directory once every time a failed lookup occurs, and make a negative dentry for the range surrounding that failed lookup. This slows down failed lookups by potentially quite a lot, especially on filesystems where dentries are not sorted in any convenient way (e.g. most if not all versions of FAT, ext2).<br> 3. Scan the directory once every time a successful lookup occurs, and make negative dentries for the two ranges adjacent to the successful lookup. This is subject to most of the same problems as (2), except that it slows down successful lookups instead of failed lookups.<br> 4. 
Periodically identify directories that have a lot of cached negative dentries, scan them, and coalesce their negative dentries into ranges. But it would be far simpler to just expire the negative dentries which have not been recently used; a negative dentry does not correspond to anything on disk, and so negative dentries are always clean and can be unceremoniously discarded without disk access (i.e. discarding is very fast compared to scanning).<br> </div> Mon, 11 Apr 2022 21:49:10 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890985/ https://lwn.net/Articles/890985/ nybble41 <div class="FormattedComment"> <font class="QuotedText">&gt; a solution could be to, instead of storing the name of the nonexistent file, storing the name of the existent files preceding and following the existent file</font><br> <p> I like this. Automatically bounded, and you can potentially answer many queries with a single negative range. You could even reuse the positive dentries for the preceding and following files, so long as you can ensure they won&#x27;t be freed prematurely, further reducing RAM requirements.<br> <p> The main drawback seems to me to be that the current filesystem APIs don&#x27;t provide the preceding and following files in response to failed lookups—which could be changed—and, more importantly, that some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like &lt;<a href="https://pypi.org/project/simple-httpfs/">https://pypi.org/project/simple-httpfs/</a>&gt; which only responds to requests for specific paths and can&#x27;t be enumerated to find the preceding and following positive entries. (And even if it could, that information could be obsolete before the query finishes; this can even affect simple non-range negative dentries but storing ranges, including names not directly queried, would make it worse.)<br> <p> One can imagine using a mix of ranges where filesystem support exists and simple negative entries where it doesn&#x27;t, but then we still need to solve the original problem. Or we could just not cache negative lookups from filesystems that can&#x27;t provide ranges, but that has its own drawbacks.<br> </div> Mon, 11 Apr 2022 21:48:32 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890983/ https://lwn.net/Articles/890983/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; While normal dentries are limited by the number of files that actually exist, there are few limits to the number of nonexistent files.</font><br> <p> But there is a limit to the number of _ranges_ of nonexistent files: they are limited by the number of files that actually exist plus one. So a solution could be to, instead of storing the name of the nonexistent file, storing the name of the existent files preceding and following the nonexistent file. 
There is precedent in the NSEC family of DNS entry types (NSEC and NSEC3), which prove the absence of a record on a DNS zone by signing a record containing both the preceding and following names.<br> <p> (You could also make the ranges circular to remove the &quot;plus one&quot; and avoid a couple of special cases.)<br> </div> Mon, 11 Apr 2022 20:35:38 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890981/ https://lwn.net/Articles/890981/ willy <div class="FormattedComment"> The idea I&#x27;m currently advocating (a little busy with other projects to work on myself) is that after allocating 100 entries, we scan 110 entries at the tail of the list and either return them to the head of the list with their &#x27;referenced&#x27; flag cleared, or free them. That allows the cache to both shrink (if &lt;100 have been referenced) and grow (if &gt;100 have been referenced)<br> <p> Obviously those numbers are a bit magic. Why not 150? Or 200? Needs some experimentation.<br> </div> Mon, 11 Apr 2022 19:31:34 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890977/ https://lwn.net/Articles/890977/ dskoll <p>Right. I guess limiting the size of the dentry cache so we start evicting before we have general memory pressure is the main idea, but allowing it to expand if by some miracle all entries have been hit so that it scales up if memory is available. Mon, 11 Apr 2022 19:03:48 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890975/ https://lwn.net/Articles/890975/ willy <div class="FormattedComment"> You&#x27;re absolutely correct that never-accessed negative dentries have no value. And we even have a single bit that says whether the dentry has been accessed since the last time it got to the end of the LRU list. The problem is, as Dave says, that we don&#x27;t try to turn the list over to find which ones are and aren&#x27;t used until we&#x27;re under general memory pressure.<br> </div> Mon, 11 Apr 2022 18:56:21 +0000 Negative dentries, 20 years later https://lwn.net/Articles/890973/ https://lwn.net/Articles/890973/ dskoll <p>It seems to me that a negative dentry is useful only to the extent that it's hit. So what if the total number of negative dentries was limited to some small number and if that number is exceeded, then either (1) we evict an entry that has never been hit yet (or hit fewer than N times), or (2) we increase the negative dentry cache size by some amount if all negative dentries have been hit at least once (or at least N times). <p>This means keeping track of how many times a negative dentry is hit. <p>Caveat: I'm not a kernel programmer and have no idea how this would work in practice with the existing data structures. Mon, 11 Apr 2022 18:19:30 +0000
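<p>As a rough illustration of the linear-hashing bucket selection Wol describes above: the sketch below is plain user-space C, not kernel or Pick code, and the helper name and bucket counts are invented for the example. It only shows the addressing rule (try hash mod 2^n, and fall back to hash mod 2^(n-1) when that bucket does not exist yet); bucket splitting and growth are left out.
<pre>
#include <stdio.h>

/*
 * Illustrative sketch of the linear-hashing addressing rule described in
 * Wol's comment: with nbuckets buckets, where 2^(n-1) < nbuckets <= 2^n,
 * use hash mod 2^n if that names an existing bucket, otherwise fall back
 * to hash mod 2^(n-1).  (Hypothetical helper; not kernel code.)
 */
static unsigned long lh_bucket(unsigned long hash, unsigned long nbuckets)
{
	unsigned long mask = 1;

	while (mask < nbuckets)		/* smallest power of two >= nbuckets */
		mask <<= 1;

	if ((hash & (mask - 1)) < nbuckets)	/* hash mod 2^n is a valid bucket */
		return hash & (mask - 1);

	return hash & ((mask >> 1) - 1);	/* fall back to hash mod 2^(n-1) */
}

int main(void)
{
	/* with 5 buckets, every hash lands in buckets 0..4 */
	for (unsigned long h = 0; h < 8; h++)
		printf("hash %lu -> bucket %lu\n", h, lh_bucket(h, 5));
	return 0;
}
</pre>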
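<p>cesarb's negative-range idea can likewise be sketched in a few lines, assuming a directory whose native order is plain strcmp order (real filesystems differ, as noted above). The struct and function names below are invented purely for illustration and are not a proposal for actual dcache data structures; the point is that one cached pair of existing neighbours answers "definitely absent" for every name that falls between them.
<pre>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative sketch of a "negative range": instead of caching "name X
 * does not exist", cache the two existing names around the gap (here in
 * strcmp order).  Any later lookup that falls strictly inside the gap can
 * be answered negatively without consulting the filesystem again.
 */
struct neg_range {
	const char *prev;	/* existing name just before the gap */
	const char *next;	/* existing name just after the gap */
};

/* Is 'name' strictly inside the known-empty range? */
static int range_covers(const struct neg_range *r, const char *name)
{
	return strcmp(r->prev, name) < 0 && strcmp(name, r->next) < 0;
}

int main(void)
{
	/* a range learned from one failed lookup between "Makefile" and "src" */
	struct neg_range r = { "Makefile", "src" };
	const char *probes[] = { "README", "config.h", "zzz" };

	for (unsigned i = 0; i < sizeof(probes) / sizeof(probes[0]); i++)
		printf("%-8s -> %s\n", probes[i],
		       range_covers(&r, probes[i]) ?
		       "definitely absent (negative range hit)" :
		       "unknown, must consult the filesystem");
	return 0;
}
</pre>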
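<p>Finally, the periodic-aging scheme willy outlines above (after a fixed number of allocations, scan a batch at the LRU tail, rotating referenced entries back to the head with the bit cleared and freeing the rest) can be modelled in user space. The list handling, constants, and names below are invented for the sketch; the kernel's actual dentry LRU machinery is considerably more involved.
<pre>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* The "a bit magic" numbers from the comment: scan the tail after every
 * 100 allocations, looking at up to 110 entries each time. */
#define SCAN_TRIGGER	100
#define SCAN_BATCH	110

struct entry {
	struct entry *prev, *next;
	bool referenced;		/* set when the entry is used, cleared when aged */
};

/* circular list with a sentinel: lru.next is the head (MRU), lru.prev the tail (LRU) */
static struct entry lru = { &lru, &lru, false };
static unsigned long nr_entries, nr_since_scan;

static void unlink_entry(struct entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void add_to_head(struct entry *e)
{
	e->next = lru.next;
	e->prev = &lru;
	lru.next->prev = e;
	lru.next = e;
}

/*
 * Scan up to SCAN_BATCH entries at the tail: referenced entries rotate back
 * to the head with the bit cleared, unreferenced entries are freed.  The
 * cache can therefore shrink or grow depending on how many were touched.
 */
static void age_tail(void)
{
	unsigned long n = nr_entries < SCAN_BATCH ? nr_entries : SCAN_BATCH;

	while (n--) {
		struct entry *e = lru.prev;

		unlink_entry(e);
		if (e->referenced) {
			e->referenced = false;
			add_to_head(e);
		} else {
			free(e);
			nr_entries--;
		}
	}
}

static void alloc_entry(bool referenced)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		abort();
	e->referenced = referenced;
	add_to_head(e);
	nr_entries++;
	if (++nr_since_scan >= SCAN_TRIGGER) {
		nr_since_scan = 0;
		age_tail();
	}
}

int main(void)
{
	/* allocate 1000 entries, pretending every third one gets used again */
	for (int i = 0; i < 1000; i++)
		alloc_entry(i % 3 == 0);
	printf("%lu of 1000 entries survived the periodic aging\n", nr_entries);
	return 0;
}
</pre>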