watch_mount(), watch_sb(), and fsinfo() (again)

By Jonathan Corbet
February 24, 2020
Filesystems, by design, hide a lot of complexity from users. At times, though, those users need to be able to look inside the black box and extract information about what is going on within a filesystem. Answering this need is David Howells, the creator of a number of filesystem-oriented system calls; in this patch set he tries to add three more, one of which we have seen before and two of which are new.

The new system calls, watch_mount() and watch_sb(), provide ways for a process to request notifications whenever something changes at either a mount point (watch_mount()) or within a specific mounted filesystem (watch_sb(), the "sb" standing for "superblock"). For a mount point, events of interest can include the mounting or unmounting of filesystems anywhere below the mount point, the change of an attribute like read-only, movement of mount points, and more. Filesystem-specific events can also include attribute changes, along with filesystem errors, quota problems, or network issues for remote filesystems.

These system calls are built on a newer version of the event-notification mechanism that Howells has been working on for some time. In the past, getting notifications has involved opening a new device (/dev/watch_queue), but that interface has changed in the meantime. In the current version, a process calls pipe2() with the new O_NOTIFICATION_PIPE flag to create a special type of pipe meant for notification use. The writable side of this pipe is not used by the application; the file descriptor for the readable end can be passed to either of the new system calls:

    int watch_mount(int dirfd, const char *path, unsigned int flags,
                    int watch_fd, int watch_id);
    int watch_sb(int dirfd, const char *path, unsigned int flags,
                 int watch_fd, int watch_id);

In both cases, dirfd, path, and flags identify the directory of interest in the usual openat() style. The notification pipe is passed in as watch_fd, and watch_id is an integer value that will be returned in any generated events. There is a special case, though; if watch_id is -1, any existing watch using the given watch_fd will be removed.

The application receives events by reading from the pipe. By default all events affecting the given watch point will be returned. The application can, though, create a filter that is attached to the notification pipe with an ioctl() call. There's another ioctl() call to set the size of the buffer used to hold notifications sent to user space. Curious readers can see these system calls used in this sample program.

Unlike the system calls described above, fsinfo() has been seen before. Its prototype remains the same:

    int fsinfo(int dirfd, const char *path, const struct fsinfo_params *params,
               void *buffer, size_t buf_size);

As before, dirfd and path describe the filesystem for which information is requested. There is no flags argument in the prototype; the flags are instead tucked into the params structure, which looks like this:

    struct fsinfo_params {
        __u32   at_flags;
        __u32   flags;
        __u32   request;
        __u32   Nth;
        __u32   Mth;
        __u64   __reserved[3];
    };

The at_flags field contains the same flags that one would ordinarily expect to see in an openat()-style system call. The request field describes the information that is being asked for; a number of possible values can be found in this patch from the series. Potentially available information ("potentially" because filesystems are not required to implement every possibility) includes filesystem limits, timestamp resolution information, the volume UUID, the servers behind a remote filesystem, and more. For attributes that can have multiple values, the Nth and Mth fields can be used to select one in particular.

The format of the returned value is ... complex. Values are stored into the provided buffer in any of a number of formats, depending on what was requested. For some, a structure is returned; others return a string or a type called simply "opaque". There is some documentation in this patch, but it seems clear that potential users of this system call will have to do some digging to figure out the information that will be returned to them.

This patch set is now in its 17th revision, having evolved quite a bit over the years. The one comment on this version, so far, comes from James Bottomley, who suggested that there may not be a need for fsinfo() at all. Instead, with some changes to how fsconfig() (which is used to configure filesystem attributes) is implemented, it could be turned into an interface that could both set and read attributes. So far, Howells has not responded to that suggestion.

Overall, the fact that these patches have been through 17 revisions (so far) says a lot. Nobody doubts that getting this information out of the kernel would be useful, but the API remains complex and hard for potential users to understand. Whether that can be fixed while retaining the features provided by these system calls is not clear, though.

Posted Feb 24, 2020 23:09 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> struct fsinfo_params {
> __u32 at_flags;
Sigh.

You know what's missing here? Yes, it's the explicit structure length.

Posted Feb 25, 2020 8:00 UTC (Tue) by geert (subscriber, #98403) [Link] (1 responses)

And the explicit padding between an odd number of 32-bit fields, and the 64-bit fields.
Fortunately the latter is present in the real declaration:

__u32 __reserved32[1]; /* Reserved params; all must be 0 */

Perhaps that can be traded for an explicit structure length?

Posted Feb 26, 2020 2:19 UTC (Wed) by cyphar (subscriber, #110703) [Link]

Indeed, that was the entire point of copy_struct_from_user() -- so that new syscalls can be extensible and avoid the need to predict how many future fields you might want (I gave a talk about this at LCA this year). I've sent a suggestion to the ML, though there is some discussion on the latest thread to remove the need for fsinfo(2) entirely.

Posted Feb 25, 2020 4:32 UTC (Tue) by anguslees (subscriber, #7131) [Link] (9 responses)

Is netlink an acceptable alternative to the proposed notification pipe, or are the required semantics different?

Posted Feb 25, 2020 7:00 UTC (Tue) by smurf (subscriber, #17840) [Link] (8 responses)

Regardless of whether netlink would be a good way to do it, this is yet another way to get a string of notifications from the kernel. I do wonder why nobody is complaining loudly that this should use, or maybe improve upon, one of the existing mechanisms.

Posted Feb 25, 2020 10:11 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Yes, I actually wanted to ask the same. Why not extend fanotify?

Posted Feb 25, 2020 14:38 UTC (Tue) by alison (subscriber, #63752) [Link]

>Why not extend fanotify?

The author of "man fanotify" appears to have the same question:

The fanotify API provides notification and interception of filesystem events.
Use cases include virus scanning and hierarchical storage management.
**Currently, only a limited set of events is supported.**

Posted Feb 25, 2020 22:36 UTC (Tue) by neilbrown (subscriber, #359) [Link] (3 responses)

This *is* an existing mechanism. autofs has been using a pipe for notification for decades.

Posted Feb 26, 2020 7:24 UTC (Wed) by maxfragg (guest, #122266) [Link] (2 responses)

Not sure if the fact that this has existed for so long without a lot of people noticing makes it better or worse.
But I have to agree: having one encouraged mechanism (probably netlink, or maybe even epoll here?) for any new interface of this type would be a good idea.

Posted Feb 26, 2020 14:25 UTC (Wed) by Paf (subscriber, #91811) [Link]

This doesn’t really seem like a big deal - novel and different mechanisms are a problem when they’re, well, novel and different. This is just a pipe, and unless I’ve missed something, it’s not requiring implementing anything.

I mean maybe netlink would be superior, but this seems nicely trivial. It’s not like they’re knocking up a netlink competitor, just using a trivial interface where they can.

Posted Feb 28, 2020 14:43 UTC (Fri) by dhowells (guest, #55933) [Link]

I'm trying to make a generic notification system, with such properties as:
- messages can be longer than 8 bytes (say up to 128 bytes);
- message queueing requires no on-the-spot allocation and can be done from softirq/irq context or inside spinlocks;
- messages can come from a variety of sources (e.g. mount(2), add_key(2), keyctl(2), sb errors, usb notifications, ...);
- messages from different sources can be in the same queue;
- messages may need copying into multiple queues;
- filters can be easily employed;
- message loss reporting.

As has been mentioned, there are existing solutions:
- Netlink:
  - would make the core VFS dependent on the networking code.
  - would require GFP_ATOMIC message allocation at the point of generation (at least sometimes).
  - doesn't seem to make it easy to do message loss reporting.
- epoll is for dealing with file descriptors - but we're not dealing with fd events.
- Not all sources I need notifications for can be addressed with fanotify.
- eventfd() has 8-byte messages.

Pipes have disadvantages too: no message loss reporting (I'm having to add it); pipe ring buffer metadata elements generally larger than common messages (might be able to optimise in future to store small data in the ring). Note that I didn't want to use pipes either: I was originally using a mappable chardev ring buffer but Linus said I had to use pipes instead.

Posted Feb 27, 2020 14:47 UTC (Thu) by kugel (subscriber, #70540) [Link] (1 responses)

There is also eventfd(2)

A pipe seems really awkward for one-way event notifications.

Posted Feb 28, 2020 14:15 UTC (Fri) by dhowells (guest, #55933) [Link]

eventfd is limited to 8-byte messages. That is not sufficient.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds