|
|
Log in / Subscribe / Register

Dealing with negative dentries

By Jake Edge
May 9, 2022

LSFMM

The problem of negative dentries accumulating in the dentry cache in an unbounded manner, as we looked at back in April, came up at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). Negative dentries reflect failed file-name lookups, which are then cached, saving an expensive operation if the file name in question is looked up again. There is no mechanism to proactively prune back those cache entries, however, so the cache keeps growing until memory pressure finally causes the system to forcibly evict some of them, which can make the system unresponsive for a long time or even cause a soft lockup.

The problem

[Stephen Brennan]

Stephen Brennan led the session; he had posted a patch set, with a new approach to the problem, during the discussion in March. The problem that he is seeing is on big servers with lots of memory, where part of the workload is looking up unique IDs in some directory a few times per second—which goes on for months or years. Each lookup creates a negative dentry in the cache, resulting in a cache full of these entries that have never been used after they were created.

Since there is no memory pressure, because the system has lots more memory than is needed by the workload, there is no clean up. That can lead to soft lockups when iterating through the children of a dentry because of the amount of time it takes to do so. It also leads to slab fragmentation; if the system has 500 million negative dentries and then a directory containing some of them is deleted, there will be an enormous number of partially filled slab pages.

His goal is to have some kind of generic system for managing all of the various least-recently-used (LRU) lists in the page-cache and filesystem code. Currently, negative dentries are not moved to the head of the LRU when they are referenced; instead, they are simply marked and left in place until the shrinker runs. Shrinkers are the mechanism that the memory-management subsystem uses to request cache entries be freed. They only run when there is memory pressure, but at that point a dentry might have been marked as referenced a year ago, so that dentry is not useful anymore—if it ever was. When that happens, the shrinkers have to do a lot of work just to move entries to the end of the list before they can even be reclaimed.

In the mailing list discussion, Dave Chinner wanted to see a generic solution for various caches that all use the list_lru mechanism. Brennan (and others) have been thinking about that. For example, every time something gets added to these caches, the list could be rotated to move those with fewer references to the back of the list. At that point, perhaps some aging rules could be applied to negative dentries. Another thing that could be done is to track the number of entries in these caches, and their types, so that decisions could be made based on the numbers of dentries, negative dentries, or some calculation using those numbers.

James Bottomley said that the problem is that the number of negative dentries is completely unbounded, so he wondered if the focus of the work should be on dealing with that. The number of positive dentries is bounded by the number of files in a directory, but there is a nearly infinite number of file names that are not present. Brennan agreed, noting that on the systems he has looked at, 99% of the cached dentries are of the negative variety, so that is the source of the underlying problem. But he thinks the fix can be made using the existing lru_list and shrinker frameworks.

While Oracle's main concern is with negative dentries, Matthew Wilcox said, there are other caches that have similar problems. For example, under certain workloads, the inode cache can similarly end up with many entries that will never be used again. That cache is bounded by the number of files in the filesystem, but it is still a "used once" problem as with negative dentries in the problem cases.

Not really LRU

This is a "classic LRU problem", Kent Overstreet said. Brennan agreed with that, noting that these LRUs are not really being treated as "least recently used". There are, instead, used-once entries scattered throughout the list. Under memory pressure, those entries are the ones that should be cleaned up; if they were organized better, so that all of those that were not recently used were together, the cost of scanning the whole list for them could be reduced. But there is still the problem of workloads that create "stupid amounts" of negative dentries that will never be used again. There is no good reason to cache those at all.

Josef Bacik said that he had solved a similar problem around five years ago by not marking entries as referenced until they are used for the second time. Prior to that, the list was being scanned to clear the referenced bits for entries that had been referenced once, but that scan took a long time. He chose to wait for the second use, instead of changing the LRU to be a real LRU, because he found that constantly shuffling things to the back of the list was "not excellent" for the workload he was looking at.

Brennan said that it is not excellent for a lot of workloads because it leads to a lot of contention on the spinlocks. He loves the idea of waiting until the entry is used again, but it leads to another problem: "how much is too much?" Memory pressure provides a good signal to indicate that entries need to be pruned, but the soft lockups he mentioned can occur starting around 100 million negative dentries in a single directory. He would rather not use some kind of "magic threshold", but there needs to be some mechanism to start shrinking the one-use items, at least.

Other possibilities

Bottomley suggested that the bounded positive dentry cache size for a directory be the limit on how large the negative-dentry cache could grow. Brennan thought that was an interesting idea. Ted Ts'o pointed out that negative dentries have no references to other data structures in the system, so they are easy to get rid of, or simply to move to another page. He wondered if simply getting rid of the negative dentries blocking the freeing of a page might be a reasonable tradeoff, even if those entries might actually be used again. There may be good reasons to give negative dentries special treatment, he said.

Bacik said that the negative-dentry problem can be solved in a fairly straightforward way, but that if Brennan wanted to solve the more general problem, there was a lot of work that would be needed. There was some discussion of using a mechanism like the page cache "refault" tracking, that was added by Johannes Weiner quite a ways back; it would allow the system to know that something it had evicted returned to the cache, so it should probably stick around longer.

Overstreet wondered if there was a way to detect workloads that are scanning and creating lots of negative dentries, then stopping the creation of those entries. That could be tracked per process ID and limits could be placed on how many negative dentries could be created. Michal Hocko said it sounded like a similar problem to that of throttling the creation of dirty pages; when the rate of their creation gets too high, a process is throttled to slow down their creation.

Brennan said that the problem is not necessarily that the entries are created at a high rate of hundreds or thousands per second; it could be a slow trickle of them over a long period of time. It can add up to hundreds of millions over, say, a year's time. There are some workloads that simply do lookups, which create a negative dentry when the file is not found, but others may be creating lots of temporary files and then deleting them, which also leaves negative dentries behind.

John Hubbard said that part of the problem is that Linux wants to use all of the available memory if possible, because it generally leads to better performance, but there is a cost to getting that memory back when it is needed elsewhere. Making any kind of improvement on the reclaim side would really help, Brennan said.

As time for the session wound down, Bacik asked Brennan what it is he would like the filesystems and memory-management developers to do to help. Brennan said that there were a lot of good ideas raised in the session and he hopes that those who care about the specific problem for negative dentries and the more general problem of finding better ways to reclaim memory from these caches would provide more eyeballs on the patches he would be posting.

Weiner asked if a way to sidestep the problem would be to put the negative dentries on a separate shrinker list. There are some shrinkers that are called more aggressively and there is no need to protect specific entries; a more wholesale freeing would probably work just fine. Brennan agreed that might be a way forward. The possibility of making changes that were specific to negative dentries was his favorite part of the discussion, he said.


Index entries for this article
KernelDentry cache
ConferenceStorage, Filesystem, Memory-Management and BPF Summit/2022


to post comments

Dealing with negative dentries

Posted May 10, 2022 6:20 UTC (Tue) by nickodell (subscriber, #125165) [Link] (3 responses)

In one of the threads discussing this patch, there's an example of how dentry bloat can happen on non-crazy systems: https://bugzilla.redhat.com/show_bug.cgi?id=1720479 I thought it was interesting, so I dug through the bug tracker and associated source code to understand the rationale behind *why* it was making so many dentries.

Here's a short summary of why it happened:

1. In the linked bug, curl is being used as a Docker healthcheck for Elasticsearch. It's getting run every second and creating ~20,000 negative dentries each time.
2. NSS is a library for loading and validating SSL certificates. It's used in curl and Firefox. It can load a set of certificate authorities from the filesystem.
3. But loading these is slow, so it builds an indexed SQLite database of SSL certs so that cert lookup is fast.
4. The location that the database is being built on could be a network filesystem, so if that happens, the database should be built up in memory, *then* written to disk in one go, for speed reasons.
5. But how can it detect if it's a slow network FS or a fast disk FS? It measures it using stat(). But if it stats a file which exists, the file could be in the dentry cache already, which would throw off the measurement. So it measures the stat() time of a nonexistant file with a random number in the filename.
6. It repeats that measurement 10,000 files, or for 33ms, whichever comes first. Also, it measures both /tmp and the directory it loads the certificates from.

There are a couple of contributing factors to the problem:

1. The developers of NSS, Mozilla, mostly care about performance for Firefox. In Firefox, NSS is only loaded once per process. curl also loads NSS once per process, but each process much shorter lived.
2. Firefox is used on desktop systems, which get rebooted more often.
3. AFAIK, there's no good cross-platform way to determine if a path is on a network filesystem.
4. Elasticsearch gets run on systems with gobs and gobs of memory, and it's not rebooted for a really long time.

The story does have a happy ending, though. These days, NSS will make a temporary dir, look up its 10000 non-existant files in that directory, and deletes the temporary dir. As I understand it, that cleans up all but one negative dentry.
https://github.com/nss-dev/nss/blob/18668d2e34500a6f14b68...

But it's easy to see why the NSS developers started looking up nonexistent files - it solved a pressing performance problem in a cross-platform way that looked to be nearly free. It seems like negative dentries are something of an attractive nuisance - easy to misuse without knowing the performance costs.

Dealing with negative dentries

Posted May 10, 2022 17:26 UTC (Tue) by brenns10 (subscriber, #112114) [Link] (1 responses)

I remember reading this bug report and thinking that this was one of the crazier systems. I don't love the idea of timing operations to determine whether something is local or remote: ideally we should provide an API to provide the required info and then try to make every operation fast enough to appear as if it's local :P . But I suppose craziness is in the eye of the beholder :)

> But it's easy to see why the NSS developers started looking up nonexistent files - it solved a pressing performance problem in a cross-platform way that looked to be nearly free. It seems like negative dentries are something of an attractive nuisance - easy to misuse without knowing the performance costs.

You're 100% right though. Userspace shouldn't need to know or care what a dentry is, or worry about the cache pollution they cause. The kernel needs to be smart enough to weed out these useless negative dentries.

Dealing with negative dentries

Posted May 10, 2022 18:57 UTC (Tue) by nickodell (subscriber, #125165) [Link]

>I don't love the idea of timing operations to determine whether something is local or remote: ideally we should provide an API to provide the required info and then try to make every operation fast enough to appear as if it's local :P

They don't care about whether the directory is local or remote, exactly - they care about whether it's fast or slow. In principle, you could have a really fast network filesystem, and then NSS wouldn't bother with the workaround.

>But I suppose craziness is in the eye of the beholder :)

Haha, exactly.

-----

By the way, thanks for hosting this session, and posting the initial patch. I thought it was a really smart and fast heuristic.

Dealing with negative dentries

Posted May 11, 2022 7:37 UTC (Wed) by jengelh (subscriber, #33263) [Link]

>3. AFAIK, there's no good cross-platform way to determine if a path is on a network filesystem.

At what point does certain storage become "network"? Media conversion to fiber? Cabling distance to the array >1000m? What would you do if the "network"ed mount were still to be the faster choice over a local disk?
Such checking attempts can only end in vain, even before having to consider the difference in OS APIs.

Dealing with negative dentries

Posted May 10, 2022 8:33 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (4 responses)

This is probably a stupid question but... why can't you just put a (per-directory) hard cap on the number of dentries at some arbitrary number (say, 1000)? Are there really applications out there that need (for performance reasons) more than 1000 negative dentries in a single directory? Why aren't those applications using something like SQLite instead of raw filesystem lookups?

Yes, I'm sure there's some weird application that builds huge directory hierarchies and then queries for lots of files in lots of different places, so maybe you also need a more generous system-wide limit. But I'm not sure why so much complexity is being expended on serving the "sad path," as it were...

Dealing with negative dentries

Posted May 10, 2022 14:22 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)

Yes, there really are. Consider PATH=$HOME/bin:/bin:/usr/bin

Most commands you execute don't exist in your private bin, so you need a negative dentry for each of them.

Similar issues for header files; each invocation of gcc needs to search every directory in the -I directories for the header files.

Dealing with negative dentries

Posted May 10, 2022 17:06 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> Most commands you execute don't exist in your private bin, so you need a negative dentry for each of them.

bash hashes those automatically. It does one lookup and then remembers where the file lives. Try running the command hash to see your lookup table. Ironically, this is a perfect example of negative dentry cache leakage, because the kernel is remembering information which userspace already tracks anyway.

(zsh also does this, and I imagine most other modern shells do as well.)

> Similar issues for header files; each invocation of gcc needs to search every directory in the -I directories for the header files.

I'm not saying that we should get rid of negative dentries altogether, just that you don't need a huge number of them. Just how many header files are you searching for?

Dealing with negative dentries

Posted May 21, 2022 8:57 UTC (Sat) by sammythesnake (guest, #17693) [Link]

Perhaps the fix for this (and similar cases) could be to have an API that allows a file lookup with a set of locations, rather than a single file path.

E.g. a call to look for "ls" in any of ["/home/bob/bin","/usr/bin","/usr/sbin"] - there could then be a single dentry indicating that it's at "/usr/bin/ls", or none of the above for some other binary that isn't installed.

The same logic would work for include files or whatever, too...

Obvious downsides that occur to me:

1. You need to define a neat way to specify the $PATH equivalent in a way that could be canonically hashed as the key in the cache - that API would need careful design.

For positive entries, you could potentially only include path elements up to the one where the file was found, meaning the dentry created by the "ls" example above would also be hit if the $PATH was extended or even if the "sbin" element was omitted.

All of this would be much more complex and locating the correct negative dentries to invalidate when a file is created could be a faff. Given that this code is supposed to be faster than heading to the filesystem, there's a limit to how much cleverness is worth paying for.

2. You also might have lots of different $PATH / $INCLUDE specifications, leading to potentially lots of similar entries (especially the negative ones which wouldn't benefit from the path element pruning idea above) which would reduce the potential wins.

Dealing with negative dentries

Posted May 10, 2022 14:29 UTC (Tue) by k3ninho (subscriber, #50375) [Link]

>some arbitrary number (say, 1000)
You've get up to 255 bytes for the name in a path that's got max length 4096 bytes, and 65535 files per directory. Maybe the arbitrary number should be in the 10^9 range.

The example's in a comment adjacent to this: https://lwn.net/Articles/894432/

K3n.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds