Negative dentries, 20 years later
The kernel's dentry cache saves the results of looking up a file in a filesystem. Should the need arise to look up the same file again, the cached result can be used, avoiding a trip through the underlying filesystem and accesses to the storage device. Repeated file-name lookups are common — consider /usr/bin/bash or ~/.nethackrc — so this is an important optimization to make.
The importance of remembering failed lookups in negative dentries may be less obvious at the outset. As it happens, repeated attempts to look up a nonexistent file are also common; an example would be the shell's process of working through the search path every time a user types "vi" (Emacs users start the editor once and never leave its cozy confines thereafter, so they don't benefit in the same way). Even more common are failed lookups created by the program loader searching for shared libraries or a compiler looking for include files. One is often advised to "fail fast" in this society; when it comes to lookups of files that don't exist, that can indeed be good advice.
So negative dentries are a good thing but, as we all know, it is possible to have too much of a good thing. While normal dentries are limited by the number of files that actually exist, there are few limits to the number of nonexistent files. As a result, it is easy for a malicious (or simply unaware) application to create negative dentries in huge numbers. If memory is tight, the memory-management subsystem will eventually work to push some of these negative dentries out. In the absence of memory pressure, though, negative dentries can accumulate indefinitely, leaving a large mess to clean up when memory does inevitably run out.
Some kernel problems are resolved quickly; others take a little longer. LWN briefly reported on a complaint about the memory consumption of negative dentries back in 2002, nearly exactly 20 years ago. A more recent attempt to solve the problem was covered here in early 2020. While numerous developers have taken a stab at the negative-dentry problem over time, the core problem remains. Those dentries still take up valuable memory, and they can create other problems (such as soft lockups) as well.
A new discussion
In mid-March, Matthew Wilcox suggested that the negative-dentry problem might make a good LSFMM topic: "maybe some focused brainstorming on the problem would lead to something that actually works". Often, simply proposing a topic like this can elicit the sort of brainstorming needed to work toward a solution. That didn't happen this time, but it did lead to the posting of a patch set by Stephen Brennan showing a new approach to the problem.
One of the difficulties posed by the negative-dentry problem is that it can be hard to know when the time has come to start throwing them away. The sizes of systems and workloads vary hugely, so any sort of simple limit is likely to cause performance regressions somewhere. Providing a knob for the system administrator to tune the limit can be tempting, but that just pushes the problem onto the users, and it is generally felt that the kernel should be able to figure things out by itself. But, as Brennan noted, that is not easy:
It's hard to look at a hash bucket or LRU list and design a heuristic for an acceptable amount of negative dentries: it won't scale from small to large systems well. But setting up heuristics on a per-directory basis will scale better, and it's easier to reason about.
The specific heuristic proposed by the patch is that the negative dentries for any given directory should not outnumber the positive dentries by more than a factor of five. If there are 20 positive dentries in the cache for a directory, there can be no more than 100 negative dentries. It is a nice idea, with only one small problem: the kernel doesn't keep counts of the number of dentries (or their types) associated with each directory.
To get around that, Brennan added code that maintains a "cursor" in the list of dentries associated with each directory. Whenever a dentry operation (creation or deletion) happens, that code will advance the cursor through the next six dentries in the list; if it does not encounter at least one positive dentry, it assumes that the limit has been exceeded and cleans up some negative dentries. Attaching this work to the dentry operations themselves means that the penalty will be paid by processes that are responsible for the creation of a lot of dentries, which seems correct.
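The description above maps onto a small amount of logic. What follows is a rough userspace sketch of that cursor walk; the structure layout, the six-entry window, and the pruning step are modeled on the description here, not taken from Brennan's actual patch, which operates on the kernel's dentry structures.

    /*
     * Userspace sketch of the per-directory cursor heuristic described
     * above; all names and details are illustrative only.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct child {
        struct child *next;     /* circular list of a directory's entries */
        bool positive;          /* false means this is a negative entry */
    };

    struct directory {
        struct child *children; /* any node on the circular list */
        struct child *cursor;   /* where the previous walk stopped */
    };

    /* Called after each dentry creation or deletion in the directory. */
    static void check_negative_buildup(struct directory *dir)
    {
        struct child *c = dir->cursor ? dir->cursor : dir->children;
        bool saw_positive = false;

        if (!c)
            return;

        /* Walk the next six entries from wherever the cursor stopped. */
        for (int i = 0; i < 6; i++) {
            if (c->positive)
                saw_positive = true;
            c = c->next;
        }
        dir->cursor = c;

        /*
         * No positive entry seen: assume negative dentries outnumber
         * positive ones beyond the allowed ratio and prune some.
         */
        if (!saw_positive)
            printf("pruning some negative entries\n");
    }

    int main(void)
    {
        /* Six negative entries in a circular list should trigger a prune. */
        struct child c[6];
        struct directory dir = { .children = &c[0], .cursor = NULL };

        for (int i = 0; i < 6; i++) {
            c[i].next = &c[(i + 1) % 6];
            c[i].positive = false;
        }
        check_negative_buildup(&dir);
        return 0;
    }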
The problem with this approach is, of course, that there is nothing that forces dentries to be added to a directory's list in any particular order. Depending on the order in which dentries are created, this algorithm could come to an incorrect conclusion regarding the real ratio of positive to negative dentries and do the wrong thing. Brennan acknowledged this problem ("This workload-dependence is bad, full stop"), but has not yet come up with a better idea. As things stand, this algorithm seems certain to lead to pathological cases; that may prevent the acceptance of this patch set even in the absence of other concerns.
The bigger problem
Back in the general discussion, though, Dave Chinner argued that the focus on negative dentries was addressing a symptom of the problem and missing the bigger issue. The real problem, he said, is that memory pressure is the only mechanism the kernel has for controlling the size of the many caches it maintains:
Yup, the underlying issue here is that memory reclaim does nothing to manage long term build-up of single use cached objects when *there is no memory pressure*. There's [plenty] of idle time and spare resources to manage caches sanely, but we don't. e.g. there is no periodic rotation of caches that could lead to detection and reclaim of single use objects (say over a period of minutes) and hence prevent them from filling up all of memory unnecessarily and creating transient memory reclaim and allocation latency spikes when memory finally fills up.
Rather than worry about the dentry cache, he said, developers should come up with a mechanism that can manage the size of all in-kernel caches. Wilcox agreed in principle, but cautioned against making the problem so broad that it becomes intractable. Chinner doubled down, though, saying that multiple kernel caches have the same problem, and that a solution for one, based on some sort of periodic scanning to age items out of the cache, would be instantly applicable to all of them.
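No such mechanism exists in the mainline kernel today, so the following is purely an illustration of the kind of periodic aging Chinner describes, with invented names: lookups mark an object as referenced, and a pass run every few minutes clears that mark and reclaims anything that has gone a full period without being touched.

    /*
     * Illustrative sketch only: clock-style aging of cached objects.
     * Objects are marked on every cache hit; a periodic pass clears the
     * mark and frees anything still unmarked from the previous pass, so
     * single-use objects are reclaimed even without memory pressure.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct cached_object {
        struct cached_object *next;
        bool referenced;            /* set on every lookup hit */
        int id;                     /* just for the demo output */
    };

    /* Called from the lookup fast path on a cache hit. */
    static void cache_touch(struct cached_object *obj)
    {
        obj->referenced = true;
    }

    /* Called periodically, independent of memory pressure. */
    static void cache_age(struct cached_object **head)
    {
        struct cached_object **pp = head;

        while (*pp) {
            struct cached_object *obj = *pp;

            if (obj->referenced) {
                obj->referenced = false;    /* survives one more period */
                pp = &obj->next;
            } else {
                *pp = obj->next;            /* unused since last pass: reclaim */
                printf("reclaiming object %d\n", obj->id);
                free(obj);
            }
        }
    }

    int main(void)
    {
        struct cached_object *head = NULL;

        for (int i = 0; i < 3; i++) {
            struct cached_object *obj = calloc(1, sizeof(*obj));
            obj->id = i;
            obj->next = head;
            head = obj;
        }

        cache_touch(head);  /* only the most recent object gets used */
        cache_age(&head);   /* first pass: reclaims the two untouched objects */
        cache_age(&head);   /* second pass: reclaims the remaining one too */
        return 0;
    }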
This discussion has mostly wound down without any suggestion that anybody is setting out to create the more general cache-aging mechanism that Chinner would like to see. The problem remains, though, and seems unlikely to go away by itself. So the chances of this discussion showing up in an LSFMM slot seem fairly high. Perhaps an in-person discussion — the first in the memory-management and filesystem communities in three years — will lead to some sort of consensus on a solution, preferably one that will be implemented before another 20 years pass.
Index entries for this article
Kernel: Dentry cache
Posted Apr 11, 2022 18:19 UTC (Mon)
by dskoll (subscriber, #1630)
[Link] (4 responses)
It seems to me that a negative dentry is useful only to the extent that it's hit. So what if the total number of negative dentries was limited to some small number and if that number is exceeded, then either (1) we evict an entry that has never been hit yet (or hit fewer than N times), or (2) we increase the negative dentry cache size by some amount if all negative dentries have been hit at least once (or at least N times).
This means keeping track of how many times a negative dentry is hit.
Caveat: I'm not a kernel programmer and have no idea how this would work in practice with the existing data structures.
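For illustration only, a toy model of that scheme might look like the following; every name and threshold here is invented, and a real implementation would live inside the dcache rather than in a flat array.

    /*
     * Toy model of the scheme described above; not kernel code.  Negative
     * entries carry a hit counter.  When the cap is reached, an entry with
     * fewer than MIN_HITS hits is evicted; if every entry has been hit at
     * least that often, the cap is raised instead.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAX_ENTRIES 1024
    #define MIN_HITS    1
    #define GROW_STEP   64

    struct neg_entry {
        char name[64];
        unsigned int hits;
    };

    static struct neg_entry cache[MAX_ENTRIES];
    static unsigned int nr_entries;
    static unsigned int limit = 128;    /* starting cap, arbitrary */

    /* Record a failed lookup for name; returns 0 on success, -1 if full. */
    static int neg_cache_add(const char *name)
    {
        if (nr_entries >= limit) {
            /* Look for a victim that has rarely been useful. */
            for (unsigned int i = 0; i < nr_entries; i++) {
                if (cache[i].hits < MIN_HITS) {
                    cache[i] = cache[--nr_entries];
                    goto insert;
                }
            }
            /* Everything has been hit: allow the cache to grow a bit. */
            if (limit + GROW_STEP > MAX_ENTRIES)
                return -1;
            limit += GROW_STEP;
        }
    insert:
        snprintf(cache[nr_entries].name, sizeof(cache[nr_entries].name), "%s", name);
        cache[nr_entries].hits = 0;
        nr_entries++;
        return 0;
    }

    /* Returns 1 if name is known not to exist, bumping its hit count. */
    static int neg_cache_lookup(const char *name)
    {
        for (unsigned int i = 0; i < nr_entries; i++) {
            if (strcmp(cache[i].name, name) == 0) {
                cache[i].hits++;
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        neg_cache_add("/usr/local/bin/vi");
        printf("negative hit: %d\n", neg_cache_lookup("/usr/local/bin/vi"));
        printf("negative hit: %d\n", neg_cache_lookup("/usr/bin/vi"));
        return 0;
    }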
Posted Apr 11, 2022 18:56 UTC (Mon)
by willy (subscriber, #9762)
[Link] (3 responses)
Posted Apr 11, 2022 19:03 UTC (Mon)
by dskoll (subscriber, #1630)
[Link] (2 responses)
Right. I guess limiting the size of the dentry cache so we start evicting before we have general memory pressure is the main idea, but allowing it to expand if by some miracle all entries have been hit so that it scales up if memory is available.
Posted Apr 11, 2022 19:31 UTC (Mon)
by willy (subscriber, #9762)
[Link] (1 responses)
Obviously those numbers are a bit magic. Why not 150? Or 200? Needs some experimentation.
Posted Apr 12, 2022 7:44 UTC (Tue)
by geert (subscriber, #98403)
[Link]
Posted Apr 11, 2022 20:35 UTC (Mon)
by cesarb (subscriber, #6266)
[Link] (16 responses)
But there is a limit to the number of _ranges_ of nonexistent files: they are limited by the number of files that actually exist plus one. So a solution could be, instead of storing the name of the nonexistent file, to store the names of the existing files preceding and following the nonexistent one. There is precedent in the NSEC family of DNS record types (NSEC and NSEC3), which prove the absence of a name in a DNS zone by signing a record containing both the preceding and following names.
(You could also make the ranges circular to remove the "plus one" and avoid a couple of special cases.)
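The lookup side of that thought experiment is easy to sketch. The code below is an illustration only: names are invented, and it sidesteps the ordering problem discussed further down the thread by simply using strcmp().

    /*
     * Illustration of the "negative range" idea above, not a real dcache
     * change: a failed lookup caches the existing names bracketing the
     * missing one; later lookups falling strictly inside a cached range
     * are known misses.  Ordering here is plain strcmp(), which a real
     * filesystem might not use.
     */
    #include <stdio.h>
    #include <string.h>

    struct neg_range {
        char before[64];    /* existing name preceding the gap */
        char after[64];     /* existing name following the gap */
    };

    /* Does name fall strictly inside the cached gap? */
    static int range_covers(const struct neg_range *r, const char *name)
    {
        return strcmp(r->before, name) < 0 && strcmp(name, r->after) < 0;
    }

    int main(void)
    {
        /* "Makefile" and "config.h" exist; a lookup of "asdf" missed between them. */
        struct neg_range r;
        snprintf(r.before, sizeof(r.before), "Makefile");
        snprintf(r.after, sizeof(r.after), "config.h");

        /* Inside the gap: known not to exist, no filesystem call needed. */
        printf("asdf: %d\n", range_covers(&r, "asdf"));
        /* Outside the gap: the cache says nothing, so ask the filesystem. */
        printf("zebra.c: %d\n", range_covers(&r, "zebra.c"));
        return 0;
    }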
Posted Apr 11, 2022 21:48 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
I like this. Automatically bounded, and you can potentially answer many queries with a single negative range. You could even reuse the positive dentries for the preceding and following files, so long as you can ensure they won't be freed prematurely, further reducing RAM requirements.
The main drawback seems to me to be that the current filesystem APIs don't provide the preceding and following files in response to failed lookups—which could be changed—and, more importantly, that some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like <https://pypi.org/project/simple-httpfs/> which only responds to requests for specific paths and can't be enumerated to find the preceding and following positive entries. (And even if it could, that information could be obsolete before the query finishes; this can even affect simple non-range negative dentries but storing ranges, including names not directly queried, would make it worse.)
One can imagine using a mix of ranges where filesystem support exists and simple negative entries where it doesn't, but then we still need to solve the original problem. Or we could just not cache negative lookups from filesystems that can't provide ranges, but that has its own drawbacks.
Posted Apr 12, 2022 11:37 UTC (Tue)
by grawity (subscriber, #80596)
[Link]
> some filesystems may not be able to provide this information within acceptable performance parameters. For example, consider a FUSE filesystem like <https://pypi.org/project/simple-httpfs/>

I think you would have the same problem with ordinary network filesystems like NFS/SMB when accessing a directory where you have +x but not +r. The server knows the nearest two files in that directory, but the client doesn't have the permissions to know that.
Posted Apr 11, 2022 21:49 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
1. Make the dentry cache exhaustive; scan and pre-cache the entire directory when you want to cache any individual dentry. On ext4 and other modern FS's, this will result in hilariously large caches in some cases, because directories are allowed to have a very large number of entries. Also, the scan will take a long time to complete.
2. Scan the directory once every time a failed lookup occurs, and make a negative dentry for the range surrounding that failed lookup. This slows down failed lookups by potentially quite a lot, especially on filesystems where dentries are not sorted in any convenient way (e.g. most if not all versions of FAT, ext2).
3. Scan the directory once every time a successful lookup occurs, and make negative dentries for the two ranges adjacent to the successful lookup. This is subject to most of the same problems as (2), except that it slows down successful lookups instead of failed lookups.
4. Periodically identify directories that have a lot of cached negative dentries, scan them, and coalesce their negative dentries into ranges. But it would be far simpler to just expire the negative dentries which have not been recently used; a negative dentry does not correspond to anything on disk, and so negative dentries are always clean and can be unceremoniously discarded without disk access (i.e. discarding is very fast compared to scanning).
Posted Apr 11, 2022 23:10 UTC (Mon)
by cesarb (subscriber, #6266)
[Link] (8 responses)
It would need help from each filesystem. I can think of three main ways of storing directory entries on a filesystem: they can be an unsorted list of filenames (as you mentioned, FAT and ext2 are good examples), or they can be a more complicated sorted structure (either a linear ordered list or something like a b-tree), or they can be an unsorted hash table.
For unsorted directories, you already have to scan the whole directory linearly on a cache miss; the only change would be to keep track of the nearest preceding and following names.
For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There's an extra complication in that the range lookup would have to use the filesystem's native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).
I don't have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).
The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.
Posted Apr 12, 2022 0:17 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
That is probably acceptable.
> For a sorted structure, the lookup code would have to be modified to return the neighbor entries when the entry is not found. There's an extra complication in that the range lookup would have to use the filesystem's native concept of order, which could differ from filesystem to filesystem (for instance, the order might be based on a hash of the filename).
This might require more complicated tree traversal logic, in the case where one of the neighbor leaves is in a different subtree to the other neighbor leaf. At worst, that probably adds an extra O(log n) root-to-leaf traversal, which is likely acceptable in practice.
> I don't have a good solution for filesystems which use a hash table directly, though an approach using directly the low-level details of the hash table might be possible (that is, a negative entry would map to a range of hash buckets).
Typically, when we characterize the performance of a hash table, we assume that it has some maximum fullness. That is, if there are N buckets, then we assume that there are at most kN elements in the hash table (for some k with, in most cases, 0 < k < 1); if the hash table grows larger than that, it gets rebuilt with more buckets. This periodic rebuilding is necessary if you want a hash table to have O(1) performance on its standard operations - else it will eventually degrade to linear scans (regardless of whether you use open hashing or closed hashing). But this also means that it's possible for the hash table to be much, much sparser than k, if it has just been rebuilt (or if you've only added a small number of elements in the first place). As a result, in the worst case you could potentially scan N-1 consecutive buckets to find the boundaries (when there's only one occupied bucket). In cases where the hash table was just rebuilt, you could observe up to N - kn consecutive empty buckets, where n is the previous size (n depends on your growth algorithm, but it's probably some constant fraction of N, e.g. n = N/2). In both cases, this means our worst case performance has degraded from O(1) to O(N) (where N is the number of buckets, which is proportional to the actual number of dentries). That is likely to be unacceptable, for systems where we decided to use a hash table in the first place, because it defeats the main selling point of a hash table.
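To make the degradation concrete, here is a small standalone sketch (invented names, not any filesystem's code) of what "find the empty range of buckets around a missed key" has to do; in a freshly rebuilt or mostly emptied table, the two scans end up visiting nearly every bucket.

    /*
     * Sketch of finding the empty bucket range around a missed key, as
     * discussed above.  With one occupied bucket out of N, the scan
     * visits nearly all N buckets: the O(1) promise of the table is gone.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Returns the size of the run of empty buckets containing idx. */
    static size_t empty_run_around(const bool *occupied, size_t nbuckets, size_t idx)
    {
        size_t lo = idx, hi = idx;

        while (lo > 0 && !occupied[lo - 1])
            lo--;                   /* scan left for the nearest occupied bucket */
        while (hi + 1 < nbuckets && !occupied[hi + 1])
            hi++;                   /* scan right likewise */
        return hi - lo + 1;
    }

    int main(void)
    {
        bool occupied[16] = { false };
        occupied[3] = true;         /* a nearly empty, freshly grown table */

        printf("empty run around bucket 10: %zu buckets\n",
               empty_run_around(occupied, 16, 10));
        return 0;
    }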
> The worst case would be remote filesystems (which also includes things like FUSE), since not only their API would have no concept of a negative dentry range, but also a range could become invalid without warning.
That also applies to individual negative dentries. The only way negative dentry caching can possibly work is if it takes place on the server side, and not the client side. Which is rather unfortunate, seeing as the network roundtrip is by far the more expensive operation than the disk lookup (at least in high-latency deployments, anyway).
Posted Apr 12, 2022 0:24 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
Correction: Where N is proportional to *the largest number of dentries that the directory has ever had, or the largest number since the last time the filesystem shrank the hash table, if the filesystem does that at all.* This makes things significantly worse, because it means you really can get N - 1 for large N in a non-pathological case (i.e. removing most or all of the contents of a directory, but not rmdir'ing the directory itself).
Posted Apr 12, 2022 12:42 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (5 responses)
Or you use a modern hash algorithm like the one, dare I say it, Pick has used for the last 40 years ... I believe it's the same one as used by Python and Sleepycat/Berkeley amongst others.
https://en.wikipedia.org/wiki/Linear_hashing
tl;dr: Given 2^(n-1) < N <= 2^n, if mod(2^n) gives a valid bucket use that, otherwise use mod(2^(n-1)). When creating a new bucket N+1, all the records that belong in the new bucket will currently be in bucket (N+1) mod 2^(n-1), so the cost of growing the hash table by one bucket at a time is constant. So the hash table can be kept at optimal fullness with minimal cost.
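In code form, the bucket choice described above looks roughly like this; it is a standalone illustration rather than Pick's (or anyone's) actual implementation, and it uses zero-based bucket numbers.

    /*
     * Illustration of the linear-hashing bucket choice: with nbuckets
     * buckets and 2^(n-1) < nbuckets <= 2^n, use hash mod 2^n if that
     * names an existing bucket, otherwise hash mod 2^(n-1).
     */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t bucket_for(uint64_t hash, uint64_t nbuckets)
    {
        uint64_t pow2 = 1;

        while (pow2 < nbuckets)     /* smallest 2^n >= nbuckets */
            pow2 <<= 1;

        uint64_t b = hash % pow2;
        if (b >= nbuckets)          /* not a valid bucket yet: fall back */
            b = hash % (pow2 >> 1);
        return b;
    }

    int main(void)
    {
        /* With 6 buckets (4 < 6 <= 8): hashes land in buckets 0..5 only. */
        for (uint64_t h = 0; h < 8; h++)
            printf("hash %llu -> bucket %llu\n",
                   (unsigned long long)h,
                   (unsigned long long)bucket_for(h, 6));
        return 0;
    }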
And being a hash table, you can return a definitive yes/no just by scanning the one bucket.
Cheers,
Wol
Posted Apr 12, 2022 19:26 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
You have entirely missed the point. We're not trying to figure out whether the bucket is occupied, we're trying to find its nearest occupied neighbors so that we can find an empty range of buckets. That requires a linear scan over neighboring buckets.
Furthermore, if you allow deletions, then a linearly-growing hash table does not help, because you can grow the hash table to be huge, and then depopulate it (e.g. by making lots of entries in a directory, then removing them).
Posted Apr 12, 2022 20:54 UTC (Tue)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Actually, now that I thought about it better, not even that might be necessary. A hash bucket itself is already a "range" of directory entries, since it being empty means that none of the entries hashing to that bucket are present. Since the number of hash buckets is limited, that could be enough to limit the number of negative "ranges".
Posted Apr 12, 2022 21:09 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
* Under open addressing (linear/quadratic probing, cuckoo hashing, etc.), the number of buckets must be greater than the number of dentries, and for performance reasons, it should be significantly greater.
* Under closed addressing (separate chaining), the number of dentries cannot be much more than a small (constant) multiple of the number of buckets, or performance suffers.
In both cases: Assuming the hash function is a quasi-random oracle, the probability of a hash collision should be quite low, and should become lower as the hash table grows (so that the average number of collisions that each dentry suffers remains bounded above). A cached negative dentry range consisting of a single hash bucket will only help you for other dentries that would have collided with that bucket, which should not be very many in practice, or else the hash function is bad.
Posted Apr 12, 2022 21:42 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Except you completely missed the point I was responding to. cesarb was talking about a hashed directory, and not a cache.
And in that case, there are NO empty buckets (if you want to use that particular variant of hashing). The variant I'm used to sticks multiple entries per bucket, but there's a variant that has exactly as many buckets as entries, and one entry per bucket.
And just as it's very quick and easy to grow the table, it's just as easy to shrink it - the entire bucket you're getting rid of just tags on the end of the bucket the (now invalid) old hash hashes to.
Cheers,
Wol
Posted Apr 13, 2022 3:00 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
Posted Apr 12, 2022 9:16 UTC (Tue)
by pomac (subscriber, #94901)
[Link] (1 responses)
It's already a DoS in terms of memory due to how Linux works but enumerating them all with caches... Nah.. ;)
Posted Apr 14, 2022 11:44 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Maildirs do this routinely. It's amazing how much trouble it causes (to backup programs in particular, since they have to enumerate every inode in the directory to figure out which ones changed). (INN news tradspools do it too, but they're not per-user and they're relatively rarely backed up at all.)
Posted Apr 11, 2022 22:13 UTC (Mon)
by willy (subscriber, #9762)
[Link] (1 responses)
Lookup is currently hash(directory pointer, component) and search the global hash table. What you're proposing would be a per-directory data structure to search, so a massive change.
Also, you'd need to be able to query the filesystem for the "previous" and "next" entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).
Posted Apr 11, 2022 23:30 UTC (Mon)
by cesarb (subscriber, #6266)
[Link]
I don't doubt that it would be inferior; still, it's an interesting thought experiment.
> Lookup is currently hash(directory pointer, component) and search the global hash table. What you're proposing would be a per-directory data structure to search, so a massive change.
Actually, it could be a per-filesystem data structure, if the concept of range is extended to include the parent directory; but it would have to be something like hash(directory pointer) || key(component), with the definition of key() depending on the filesystem's native on-disk order (if it has one; naïve filesystems with simple unsorted directories could use hash(component) for the key). I agree that it would be a massive change, with potential gains only on an unlikely case (a large amount of lookups to non-existing entries, which could be hits on a negative range cache but would be misses on a negative entry cache).
> Also, you'd need to be able to query the filesystem for the "previous" and "next" entries ... for whatever sort you think needs to be used (filesystem directories are not necessarily sorted in any way you think they are).
Yes, the sort would have to be filesystem-specific; I went into more detail on another comment above.
Posted Apr 11, 2022 22:26 UTC (Mon)
by MattBBaker (subscriber, #28651)
[Link] (1 responses)
Posted Apr 12, 2022 7:45 UTC (Tue)
by josh (subscriber, #17465)
[Link]
That seems like a useful heuristic to weight entries, if you also take into account the number of lookups.
Posted Apr 11, 2022 23:54 UTC (Mon)
by vgoyal (guest, #49279)
[Link]
Posted Apr 12, 2022 4:39 UTC (Tue)
by benjamir (subscriber, #133607)
[Link] (1 responses)
Why not depend on that (in fractions or multiples)?
Posted Apr 12, 2022 16:06 UTC (Tue)
by vgoyal (guest, #49279)
[Link]
Posted Apr 12, 2022 7:26 UTC (Tue)
by dgc (subscriber, #6611)
[Link]
More discussion of the periodic reclaim concept for controlling cache explosions can be found under this thread:
https://lwn.net/ml/linux-mm/20220402072103.5140-1-hdanton...
-Dave.
Posted Apr 12, 2022 12:04 UTC (Tue)
by aaronmdjones (subscriber, #119973)
[Link] (5 responses)
It could also be argued that vi users don't benefit either, because they can't figure out how to close it.
Posted Apr 12, 2022 15:50 UTC (Tue)
by dskoll (subscriber, #1630)
[Link] (4 responses)
It can be hard to get out of emacs too, sometimes. The emergency cheat-sheet:
vi: Esc Esc Esc :q!
emacs: Ctrl-G Ctrl-G Ctrl-G Ctrl-X Ctrl-C
Posted Apr 13, 2022 23:50 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
This actually doesn't work for me ..... because I deliberately disabled it as I was somehow typing it accidentally.
Now I use : alt-x k i l l - e m a c s enter
I haven't yet typed that by mistake.
Posted Apr 14, 2022 7:40 UTC (Thu)
by SiB (subscriber, #4048)
[Link]
That has the same effect on how to kill the emacs process. Ctrl-X Ctrl-C just closes the frame.
Posted Apr 17, 2022 22:59 UTC (Sun)
by Spack (subscriber, #77556)
[Link] (1 responses)
$ killall emacs vi
Posted Apr 18, 2022 20:41 UTC (Mon)
by Jandar (subscriber, #85683)
[Link]
> $ killall emacs vi

Neither vi in insert-mode nor emacs in x11 mode stops with CTRL-Z ;-)
vi inserts the CTRL-Z and emacs executes suspend-frame.