Monitoring mount operations

By Jake Edge
May 24, 2023

Amir Goldstein kicked off a session on monitoring mounts at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. In particular, there are problems when trying to efficiently monitor "a very large number of mounts in a mount namespace"; some user-space programs need an accurate view of the mount tree without having to constantly parse /proc/mounts or the like. There are a number of questions to be answered, including what the API should look like and what entity should be watched in order to get notifications of new mount operations.

It is trivial, he said, to add a notification for unmount events, but the corresponding event for a new mount is trickier, since it is not clear where, exactly, the watch for that event should be placed. It could be placed on the user or mount namespace of interest; another idea would be to choose a directory and monitor all of the mounts that happen on it and any of its subdirectories recursively. David Howells said that he has implemented something for getting mount notifications; the watch is placed on the mount namespace. Miklos Szeredi said that each namespace has its own mount tree and each mount has a 32-bit ID that gets assigned to it, but those cannot reliably be used to uniquely identify a particular mount because they can be reused during a given boot of the system. Howells said that he added a 64-bit counter that could be used for that purpose, though it will "eventually get reused" as well.
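The reuse problem Szeredi described can be seen in a toy model. The Python sketch below contrasts an allocator that recycles freed IDs with a simple monotonic counter; both classes (RecyclingIds and MonotonicIds) are hypothetical and are not meant to mirror how the kernel actually allocates mount IDs. The lowest-free-ID policy was chosen only because it makes reuse easy to observe.

```python
import itertools


class RecyclingIds:
    """Toy allocator: always hands out the lowest ID not currently in use."""

    def __init__(self):
        self.in_use = set()

    def alloc(self):
        candidate = 0
        while candidate in self.in_use:
            candidate += 1
        self.in_use.add(candidate)
        return candidate

    def free(self, ident):
        self.in_use.discard(ident)


class MonotonicIds:
    """Toy allocator: a counter that only ever goes up, so IDs never repeat."""

    def __init__(self):
        self._counter = itertools.count(1)

    def alloc(self):
        return next(self._counter)

    def free(self, ident):
        pass  # nothing to recycle


recycling = RecyclingIds()
first = recycling.alloc()
recycling.free(first)            # "unmount"
second = recycling.alloc()       # a different mount gets the same ID
print("recycled ID reused:", first == second)

monotonic = MonotonicIds()
old = monotonic.alloc()
monotonic.free(old)
new = monotonic.alloc()
print("monotonic ID reused:", old == new)
```

A watcher that remembers only the recycled ID cannot tell the old mount from the new one, which is the ambiguity a larger, non-recycled identifier is meant to remove.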

Howells was asked about patches, which he said he had posted a while back. Szeredi pointed out that those patches were not for fanotify support, but were instead for the watch queue; it is the same general concept, however, he said. Christian Brauner thought that the notification piece should be separated from the fsinfo() effort.

The problem, Howells said, is that the notification queue can overflow, which means that events, such as mount and unmount operations, would get lost. Howells mentioned that currently tools have to parse (and poll) /proc/mounts in order to find out the status of mounts and unmounts, which is not particularly efficient. Brauner noted that he had invited Lennart Poettering to the talk, since systemd would be one of the eventual users of any new feature of this sort, so he asked Poettering about systemd's needs in this area.
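To give a sense of why the parse-and-poll approach scales poorly, here is a small Python sketch of what such a tool has to do after every change: re-read the whole table and compare it with the previous snapshot. The function names (snapshot() and diff()) and the sample data are invented for illustration; the input is assumed to follow the fstab-like layout of /proc/mounts (device, mount point, filesystem type, and so on).

```python
from collections import Counter


def snapshot(text):
    """Count (device, mount point, fs type) triples in /proc/mounts-style text.

    A Counter is used because stacked mounts can produce identical lines,
    and this format carries no ID that would tell them apart.
    """
    entries = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries[(fields[0], fields[1], fields[2])] += 1
    return entries


def diff(old, new):
    """Return (added, removed) entries between two snapshots."""
    added = sorted((new - old).elements())
    removed = sorted((old - new).elements())
    return added, removed


# Made-up sample data; a real tool would re-read /proc/mounts each time.
BEFORE = "proc /proc proc rw 0 0\n/dev/sda1 / ext4 rw 0 0\n"
AFTER = "/dev/sda1 / ext4 rw 0 0\ntmpfs /run tmpfs rw 0 0\n"

added, removed = diff(snapshot(BEFORE), snapshot(AFTER))
print("added:", added)
print("removed:", removed)
```

Every pass costs time proportional to the size of the whole mount table, even when only one mount changed, which is the inefficiency that becomes painful with a very large number of mounts.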

Right now, systemd parses /proc/self/mountinfo, "which, of course, is terrible", Poettering said. He is not particularly concerned if events get dropped, as long as there is a way to figure out what has happened; some kind of unambiguous indication that events have been dropped coupled with an API for systemd to get the current status when it needs to do so would be ideal. He would like a facility that provides the immediate child mounts for a given mount along with mount-related events for those children. Howells said that Ian Kent had created a patch set implementing mount watching for systemd using fsinfo() and the watch queue notifications.
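As a rough illustration of what parsing that file involves, the Python sketch below pulls the mount ID, parent ID, mount point, and filesystem type out of lines in the mountinfo format documented in proc(5), then groups mounts by parent to answer the "immediate children" question. The sample text and all of the names (Mount, parse_mountinfo_line(), children_by_parent()) are made up for this example; it is not a description of how systemd does it.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    mount_point: str
    fs_type: str


def _unescape(field: str) -> str:
    # The kernel writes space, tab, newline and backslash as \ooo (octal).
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mountinfo_line(line: str) -> Mount:
    fields = line.split()
    # Zero or more optional fields follow the mount options (index 5);
    # a lone "-" marks their end, and the filesystem type comes right after.
    separator = fields.index("-", 6)
    return Mount(int(fields[0]), int(fields[1]),
                 _unescape(fields[4]), fields[separator + 1])


def children_by_parent(text: str) -> dict:
    """Map each parent mount ID to the list of its immediate child mounts."""
    children = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            mount = parse_mountinfo_line(line)
            children[mount.parent_id].append(mount)
    return children


# Made-up sample data; real code would read /proc/self/mountinfo.
SAMPLE = (
    "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw\n"
    "23 22 0:21 / /proc rw,nosuid shared:2 - proc proc rw\n"
    "24 22 0:22 / /mnt/my\\040disk rw - tmpfs tmpfs rw\n"
)

for child in children_by_parent(SAMPLE)[22]:
    print(child.mount_id, child.mount_point, child.fs_type)
```

Note that the whole file has to be read and parsed to find the children of a single mount, and the mount IDs it reports are the recyclable ones discussed earlier in the session.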

Brauner asked if the feature needed to be added to fanotify for systemd's use, but Poettering said that he did not care. His main concern is in getting notified when events are lost, so that systemd can take some action to update its state; it would be great if the lost-event notification narrowed down where in the mount tree the lost event(s) came from. For systemd's use case, it would be better to get events for a particular subtree, rather than the whole system, because it normally is only concerned with a subset of the full mount tree.
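The recovery pattern Poettering asked for (an unambiguous lost-event indication plus a way to fetch the current state) can be modeled in a few lines of Python. Everything here is a hypothetical, in-process simulation: BoundedEventQueue stands in for whatever kernel queue is eventually used, and read_full_state stands in for a query interface such as the proposed fsinfo(). The overflow flag is similar in spirit to the queue-overflow events that inotify and fanotify already deliver.

```python
from collections import deque


class BoundedEventQueue:
    """Toy event queue that drops events when full but records the loss."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.overflowed = False

    def push(self, event):
        if len(self.events) >= self.capacity:
            self.overflowed = True  # the event is lost, the fact is not
        else:
            self.events.append(event)

    def drain(self):
        events, lost = list(self.events), self.overflowed
        self.events.clear()
        self.overflowed = False
        return events, lost


class MountTracker:
    """Keeps a set of mount points, resyncing whenever events were lost."""

    def __init__(self, read_full_state):
        self.read_full_state = read_full_state
        self.mounts = set(read_full_state())

    def process(self, queue):
        # Drain first, then resync: anything queued after the drain is
        # applied on the next call, and add/discard are idempotent.
        events, lost = queue.drain()
        if lost:
            self.mounts = set(self.read_full_state())
            return
        for kind, mount_point in events:
            if kind == "mount":
                self.mounts.add(mount_point)
            elif kind == "umount":
                self.mounts.discard(mount_point)


kernel_state = {"/", "/proc"}            # stand-in for the real mount table
tracker = MountTracker(lambda: kernel_state)
queue = BoundedEventQueue(capacity=2)

for path in ("/a", "/b", "/c"):          # three mounts, room for two events
    kernel_state.add(path)
    queue.push(("mount", path))

tracker.process(queue)
print(sorted(tracker.mounts))
```

A full resync is the expensive path, which is why Poettering would like the lost-event notification to narrow down the affected part of the mount tree.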

Jeff Layton asked about the systemd use case for this information. Poettering responded that there are many systemd services that need to wait for mount activity of some form (e.g. at boot time, MySQL needs to wait for the filesystem where its files reside). Much of systemd's dependency processing for services depends on an accurate picture of the state of the system, including mounts.

Goldstein said that he was unsure how to report the occurrence of a tucked mount, which is a mechanism aimed at cleanly replacing an overlay mount. Brauner said that he was no longer "allowed to call it that"; there is another interpretation of that term, which he was unaware of until "friendly people on social media" pointed it out to him. They suggested using "beneath" to describe the type of mount. There is also, of course, the danger of mistyping the previous term, he said.

There was some discussion of a way to retrieve the immediate child mounts, as Poettering requested, but that will require a unique mount ID, Brauner said. After some roundabout discussion about mount-related APIs and the concerns that would need to be kept in mind, worries about a mount-ID overflow were raised. Layton pointed out that a 64-bit counter that gets incremented every nanosecond will take more than 500 years to overflow, so "we're never going to overflow at 64 bits".
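Layton's estimate is easy to check. The short Python calculation below assumes one increment per nanosecond, as he did, and a 365.25-day year; the helper names are invented for this example.

```python
NS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def seconds_to_overflow(bits, increments_per_second=NS_PER_SECOND):
    """Seconds until a counter of the given width wraps around."""
    return 2**bits / increments_per_second


def years_to_overflow(bits, increments_per_second=NS_PER_SECOND):
    """The same interval expressed in years."""
    return seconds_to_overflow(bits, increments_per_second) / SECONDS_PER_YEAR


print(f"64-bit counter: about {years_to_overflow(64):.1f} years")
print(f"32-bit counter: about {seconds_to_overflow(32):.2f} seconds")
```

At that rate a 64-bit counter lasts roughly 584 years, while a 32-bit counter would wrap in a little over four seconds, which is why the width of the ID matters so much for uniqueness.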

There may be problems with exposing those 64-bit values to user-space programs that expect only a 32-bit mount-ID, however. In fact, Poettering checked the systemd code and it "knows" that the mount-IDs are 32 bits in size. Howells said that the existing mount-ID is "recycled, too small, people assume it is too small", so something new that is defined to be 64 bits is needed. Poettering suggested using UUIDs "and the problem goes away", to chuckles around the room. As time expired, things kind of trailed off; it is clear that there is more work needed before anything is likely to go upstream.

Posted May 24, 2023 15:29 UTC (Wed) by intelfx (subscriber, #130118)

> Brauner said that he was no longer "allowed to call it that"; there is another interpretation of that term, which he was unaware of until "friendly people on social media" pointed it out to him.

What's the "other" interpretation?

// here we go again...

Posted May 24, 2023 16:20 UTC (Wed) by excors (subscriber, #95769)

Depending on what kind of communities you hang out in, you might associate "tucking" primarily with drag queens or trans women hiding their genitals (https://en.wikipedia.org/wiki/Tucking). As far as I'm aware, there's no problem using that term in that context or in a Linux context - there's nothing negative or even confusing about it, since it's still just a kind of hiding something under something else. I presume the interaction on social media was https://mastodon.social/@brauner/110122901027222225 in which it sounds like the commenter just thought it was a funny connotation.

Posted May 25, 2023 6:04 UTC (Thu) by smurf (subscriber, #17840)

"Oh damn this word can mean something vaguely sexual in a completely unrelated context, change it to SAVE THE CHILDREN" (and the terminally puritan).

Given the absurd level of synonymal richness in the English language (check out the game "Synonymy") renaming this is stupid – esp. since we all know what "mount" can mean in the same context. Nobody's going to rename THAT.

Posted May 25, 2023 12:23 UTC (Thu) by atnot (subscriber, #124910)

Yeah, I personally think feeling like you are no longer allowed to call it that is a pretty strong overreaction too. I'm all for being careful in the language you choose and how it affects people, but there aren't really any connotations or context that make the alternate meaning distasteful. I do understand that this sort of thing can be hard to judge for outsiders and they'd prefer to play it safe, but the worst case here is really some trans people having a snicker at unintentional comedy. Which, let's be real, will be happening anyway with all of the binders and eggs and make_trans() and thousands of other things overloaded in queer slang.

That all said, given the difficulty of unlearning things I would understand if the author voluntarily preferred to name it something else now to avoid having that association in his head while working on the feature ;)

Posted May 24, 2023 18:47 UTC (Wed) by jkingweb (subscriber, #113039)

Something to do with automobiles, according to what DuckDuckGo is telling me. 🤷

Posted May 24, 2023 21:09 UTC (Wed) by bluca (subscriber, #118303)

I really hope we can finally get fsinfo(); there are some really nasty race conditions when you combine lots of mounts and low-resource devices, due to how racy probing /proc/self/mountinfo is.

Posted May 25, 2023 15:27 UTC (Thu) by brauner (subscriber, #109349)

fsinfo() != mount table notifications. But yes, we need both.

Posted May 24, 2023 21:25 UTC (Wed) by shemminger (subscriber, #5739)

Netlink has a similar problem with notifications overflowing the queue. The problem is that notifications are asynchronous (and multicast); therefore blocking until the application has consumed them is not a workable solution. Blocking the kernel while it waits for a broken, slow userspace application will bring even more pain.

If the queue overflows, then the userspace application will see an error on the netlink listening socket.
Handling overflow in a safe manner is really hard. The application can recover by rescanning the space it is listening to (i.e. all routes), but then synchronizing incoming changes with the current state gets messy.

Posted May 24, 2023 23:58 UTC (Wed) by Fowl (subscriber, #65667)

Copy/account the memory of each message to each listener and let the OOM killer deal with any stuck ones?

Posted May 27, 2023 10:38 UTC (Sat) by kaesaecracker (subscriber, #126447)

I think a potential problem with this might be that you would still have to allocate memory for the copied message. Even if the memory is shared between all processes that have not read the message yet, you would have to allocate new memory for the next message. So there would still be a way for the queue to be "full", but the messages would have to be dropped for all listeners.
I think a user-space application could have a thread dedicated to listening for new messages and immediately copying them to another in-process queue for actual processing.