watch_mount(), watch_sb(), and fsinfo() (again)
The new system calls, watch_mount() and watch_sb(), provide ways for a process to request notifications whenever something changes at either a mount point (watch_mount()) or within a specific mounted filesystem (watch_sb(), the "sb" standing for "superblock"). For a mount point, events of interest can include the mounting or unmounting of filesystems anywhere below the mount point, the change of an attribute like read-only, movement of mount points, and more. Filesystem-specific events can also include attribute changes, along with filesystem errors, quota problems, or network issues for remote filesystems.
These system calls are built on a newer version of the event-notification mechanism that Howells has been working on for some time. In the past, getting notifications has involved opening a new device (/dev/watch_queue), but that interface has changed in the meantime. In the current version, a process calls pipe2() with the new O_NOTIFICATION_PIPE flag to create a special type of pipe meant for notification use. The writable side of this pipe is not used by the application; the file descriptor for the readable end can be passed to either of the new system calls:
int watch_mount(int dirfd, const char *path, unsigned int flags, int watch_fd, int watch_id);
int watch_sb(int dirfd, const char *path, unsigned int flags, int watch_fd, int watch_id);
In both cases, dirfd, path, and flags identify the directory of interest in the usual openat() style. The notification pipe is passed in as watch_fd, and watch_id is an integer value that will be returned in any generated events. There is a special case, though; if watch_id is -1, any existing watch using the given watch_fd will be removed.
The application receives events by reading from the pipe. By default all events affecting the given watch point will be returned. The application can, though, create a filter that is attached to the notification pipe with an ioctl() call. There's another ioctl() call to set the size of the buffer used to hold notifications sent to user space. Curious readers can see these system calls used in this sample program.
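As a rough illustration of the sequence described above, the following C sketch creates a notification pipe and asks for mount events on a path. It rests on several assumptions. watch_mount() has no C-library wrapper, so it is invoked through syscall() with a placeholder number; the real number depends on the patched kernel, and on any other kernel the call simply fails with ENOSYS. The fallback value for O_NOTIFICATION_PIPE may not match the patch set under discussion. The helper names start_mount_watch() and stop_mount_watch() are invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: fallback value used when the installed headers lack the flag. */
#ifndef O_NOTIFICATION_PIPE
#define O_NOTIFICATION_PIPE O_EXCL
#endif

/* Assumption: no stable syscall number is known here, so an invalid
 * placeholder is used and the call fails with ENOSYS. */
#ifdef __NR_watch_mount
#define WATCH_MOUNT_NR __NR_watch_mount
#else
#define WATCH_MOUNT_NR (-1L)
#endif

/* Thin wrapper matching the prototype given in the article. */
static int watch_mount(int dirfd, const char *path, unsigned int flags,
                       int watch_fd, int watch_id)
{
    return (int)syscall(WATCH_MOUNT_NR, dirfd, path, flags, watch_fd, watch_id);
}

/* Hypothetical helper: create the notification pipe and start watching
 * `path`. Returns 0 on success or a negative errno value on failure. */
int start_mount_watch(const char *path, int watch_id, int pipefd[2])
{
    assert(pipefd != NULL);

    if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
        return -errno;

    /* The readable end of the pipe is what the kernel is told about. */
    if (watch_mount(AT_FDCWD, path, 0, pipefd[0], watch_id) == -1) {
        int err = errno;
        close(pipefd[0]);
        close(pipefd[1]);
        return -err;
    }
    return 0;
}

/* Hypothetical helper: a watch_id of -1 removes the existing watch
 * that uses the given watch_fd. */
int stop_mount_watch(const char *path, int watch_fd)
{
    if (watch_mount(AT_FDCWD, path, 0, watch_fd, -1) == -1)
        return -errno;
    return 0;
}
```

After a successful start_mount_watch(), the application would read() events from pipefd[0]; each event carries the watch_id that was supplied when the watch was created.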
Unlike the system calls described above, fsinfo() has been seen before. Its prototype remains the same:
int fsinfo(int dirfd, const char *path, const struct fsinfo_params *params, void *buffer, size_t buf_size);
As before, dirfd and path describe the filesystem for which information is requested. There is no separate flags argument; the flags are instead hidden within the params structure, which looks like this:
struct fsinfo_params {
    __u32 at_flags;
    __u32 flags;
    __u32 request;
    __u32 Nth;
    __u32 Mth;
    __u64 __reserved[3];
};
The at_flags field contains the same flags that one would ordinarily expect to see in an openat()-style system call. The request field describes the information that is being asked for; a number of possible values can be found in this patch from the series. Potentially available information ("potentially" because filesystems are not required to implement every possibility) includes filesystem limits, timestamp resolution information, the volume UUID, the servers behind a remote filesystem, and more. For attributes that can have multiple values, the Nth and Mth fields can be used to select one in particular.
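To make the role of each field more concrete, here is a minimal sketch of how a caller might fill in the structure before making the call. It is built on assumptions: the structure is mirrored locally with fixed-width types (the reserved array is renamed "reserved" to stay out of the implementation's namespace), the helper names fsinfo_params_init() and fsinfo_query() are invented, and fsinfo() is reached through syscall() with a placeholder number that fails with ENOSYS unless the headers of a patched kernel supply the real one. Request values are passed as plain integers because the real constants live in the patch series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the structure shown in the article (assumption:
 * the layout matches the patch set being discussed). */
struct fsinfo_params {
    uint32_t at_flags;
    uint32_t flags;
    uint32_t request;
    uint32_t Nth;
    uint32_t Mth;
    uint64_t reserved[3];
};

/* Assumption: placeholder number; the call fails with ENOSYS without it. */
#ifdef __NR_fsinfo
#define FSINFO_NR __NR_fsinfo
#else
#define FSINFO_NR (-1L)
#endif

/* Hypothetical helper: zero everything (reserved fields included),
 * then set the fields that select what is being asked for. */
void fsinfo_params_init(struct fsinfo_params *p, uint32_t request,
                        uint32_t nth, uint32_t mth)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->at_flags = AT_SYMLINK_NOFOLLOW;  /* openat()-style lookup flag */
    p->request = request;               /* which attribute to fetch */
    p->Nth = nth;                       /* selectors for multi-valued */
    p->Mth = mth;                       /* attributes */
}

/* Hypothetical wrapper: returns the syscall result, or a negative
 * errno value on failure. */
int fsinfo_query(int dirfd, const char *path,
                 const struct fsinfo_params *params,
                 void *buffer, size_t buf_size)
{
    long ret = syscall(FSINFO_NR, dirfd, path, params, buffer, buf_size);
    return ret == -1 ? -errno : (int)ret;
}
```

Zeroing the whole structure first is the design choice that matters here: it keeps the reserved fields at zero, which is what allows them to be given a meaning later without breaking existing callers.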
The format of the returned value is ... complex. Values are stored into the provided buffer in any of a number of formats, depending on what was requested. For some, a structure is returned; others return a string or a type called simply "opaque". There is some documentation in this patch, but it seems clear that potential users of this system call will have to do some digging to figure out the information that will be returned to them.
This patch set is now in its 17th revision, having evolved quite a bit over the years. The one comment on this version, so far, comes from James Bottomley, who suggested that there may not be a need for fsinfo() at all. Instead, with some changes to how fsconfig() (which is used to configure filesystem attributes) is implemented, it could be turned into an interface that could both set and read attributes. So far, Howells has not responded to that suggestion.
Overall, the fact that these patches have been through 17 revisions (so far) says a lot. Nobody doubts that getting this information out of the kernel would be useful, but the API remains complex and hard for potential users to understand. Whether that can be fixed while retaining the features provided by these system calls is not clear, though.
Posted Feb 24, 2020 23:09 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
You know what's missing here? Yes, it's the explicit structure length.
Posted Feb 25, 2020 8:00 UTC (Tue)
by geert (subscriber, #98403)
[Link] (1 responses)
__u32 __reserved32[1]; /* Reserved params; all must be 0 */
Perhaps that can be traded for an explicit structure length?
Posted Feb 26, 2020 2:19 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
Posted Feb 25, 2020 4:32 UTC (Tue)
by anguslees (subscriber, #7131)
[Link] (9 responses)
Posted Feb 25, 2020 7:00 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (8 responses)
Posted Feb 25, 2020 10:11 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Feb 25, 2020 14:38 UTC (Tue)
by alison (subscriber, #63752)
[Link]
The author of "man fanotify" appears to have the same question:
The fanotify API provides notification and interception of filesystem events.
Posted Feb 25, 2020 22:36 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (3 responses)
Posted Feb 26, 2020 7:24 UTC (Wed)
by maxfragg (guest, #122266)
[Link] (2 responses)
Posted Feb 26, 2020 14:25 UTC (Wed)
by Paf (subscriber, #91811)
[Link]
I mean maybe netlink would be superior, but this seems nicely trivial. It’s not like they’re knocking up a netlink competitor, just using a trivial interface where they can.
Posted Feb 28, 2020 14:43 UTC (Fri)
by dhowells (guest, #55933)
[Link]
As has been mentioned, there are existing solutions:
Pipes have disadvantages too: no message loss reporting (I'm having to add it); pipe ring buffer metadata elements generally larger than common messages (might be able to optimise in future to store small data in the ring). Note that I didn't want to use pipes either: I was originally using a mappable chardev ring buffer but Linus said I had to use pipes instead.
Posted Feb 27, 2020 14:47 UTC (Thu)
by kugel (subscriber, #70540)
[Link] (1 responses)
A pipe seems really awkward for one-way event notifications.
Posted Feb 28, 2020 14:15 UTC (Fri)
by dhowells (guest, #55933)
[Link]
> __u32 at_flags;
Sigh.
Fortunately the latter is present in the real declaration:
Indeed, that was the entire point of copy_struct_from_user() -- so that new syscalls can be extensible and avoid the need to predict how many future fields you might want (I gave a talk about this at LCA this year). I've sent a suggestion to the ML, though there is some discussion on the latest thread to remove the need for fsinfo(2) entirely.
Use cases include virus scanning and hierarchical storage management. **Currently, only a limited set of events is supported.**
But I have to agree, having one encouraged interface (probably netlink, or maybe even epoll here?) to be used for any new interface of this type would be a good idea
- messages can be longer than 8 bytes (say up to 128 bytes);
- message queueing requires no on-the-spot allocation and can be done from softirq/irq context or inside spinlocks;
- messages can come from a variety of sources (e.g. mount(2), add_key(2), keyctl(2), sb errors, usb notifications, ...);
- messages from different sources can be in the same queue;
- messages may need copying into multiple queues;
- filters can be easily employed;
- message loss reporting.
- Netlink:
  - would make the core VFS dependent on the networking code.
  - would require GFP_ATOMIC message allocation at the point of generation (at least sometimes).
  - doesn't seem to make it easy to do message loss reporting.
- epoll is for dealing with file descriptors - but we're not dealing with fd events.
- Not all sources I need notifications for can be addressed with fanotify.
- eventfd() has 8-byte messages.