Filesystem notification, part 1: An overview of dnotify and inotify
Filesystem notification APIs provide a mechanism by which applications can be informed when events happen within a filesystem—for example, when a file is opened, modified, deleted, or renamed. Over time, Linux has acquired three different filesystem notification APIs, and it is instructive to look at them to understand what the differences between the APIs are. It's also worthwhile to consider what lessons have been learned during the design of the APIs—and what lessons remain to be learned.
This article is thus the first in a series that looks at the Linux filesystem notification APIs: dnotify, inotify, and fanotify. To begin with, we briefly describe the original API, dnotify, and look at its limitations. We'll then look at the inotify API, and consider the ways in which it improves on dnotify. In a subsequent article, we'll take a look at the fanotify API.
Filesystem notification use cases
In order to compare filesystem notification APIs, it's useful to consider some of the use cases for those APIs. Some of the common use cases are the following:
- Caching a model of filesystem objects: The application wants to maintain an internal representation that accurately reflects the current set of objects in a filesystem, or some subtree of that filesystem. An example of such an application is a file manager, which presents the user with a graphical representation of the objects in a filesystem.
- Logging filesystem activity: The application wants to record all of the events (or some subset of event types) that occur for the monitored filesystem objects.
- Gatekeeping filesystem operations: The application wants to intervene when a filesystem event occurs. The classic example of such an application is an antivirus system: when another program tries to (for example) execute a file, the antivirus system first checks the contents of the file for malware, and then either allows the execution to proceed if the file contents are benign, or prevents execution if a virus is detected.
In the beginning: dnotify
Without a kernel-supported filesystem notification API, an application must resort to techniques such as polling the state of directories and files using repeated invocations of system calls such as stat() and the readdir() library function. Such polling is, of course, slow and inefficient. Furthermore, this approach allows only a limited range of events to be detected, for example, creation of a file, deletion of a file, and changes of file metadata such as permissions and file size. By contrast, operations such as file renames are difficult to identify.
Those problems led to the creation of the first in-kernel implementation of a filesystem notification API, dnotify, which was implemented by Stephen Rothwell (these days, the maintainer of the linux-next tree) and which first appeared in Linux 2.4.0 (in 2001).
Because it was the first attempt at implementing a filesystem notification API, done at a time when the problem was less well understood and when some of the pitfalls of API design were less easily recognized, the dnotify API has a number of peculiarities. To begin with, the interface is multiplexed on the existing fcntl() system call. (By contrast, the later inotify and fanotify APIs were each implemented using new system calls.) To enable monitoring, one makes a call of the form:
fcntl(fd, F_NOTIFY, mask);
Here, fd is a file descriptor that specifies a directory to be monitored, and this brings us to the second oddity of the API: dnotify can be used to monitor only whole directories; monitoring individual files is not possible. The mask specifies the set of events to be monitored in the directory. These include events for file access, modification, creation, deletion, and attribute changes (e.g., permission and ownership changes) that are fully listed in the fcntl(2) man page.
A further dnotify oddity is its method of notification. When an event occurs, the monitoring application is sent a signal (SIGIO by default, but this can be changed). The signal on its own does not identify which directory had the event, but if we use sigaction() to establish the handler using the SA_SIGINFO flag, then the handler receives a siginfo_t argument whose si_fd field contains the file descriptor associated with the directory. At that point, the application then needs to rescan the directory to determine which file has changed. (In typical usage, the application would maintain a data structure that caches a mapping of file descriptors to directory names, so that it can map si_fd back to a directory name.)
A simple example of the use of dnotify can be found here.
Problems with dnotify
As is probably clear, the dnotify API is cumbersome, and has a number of limitations. As already noted, we can monitor only entire directories, not individual files. Furthermore, dnotify provides notification for a rather modest range of events. Most notably, by comparison to inotify, dnotify can't tell us when a file was opened or closed. However, there are also some other serious limitations of the API.
The use of signals as a notification method causes a number of difficulties. The first of these is that signals are delivered asynchronously: catching signals with a handler can be racy and error-prone. One way around that particular difficulty is to instead accept signals synchronously using sigwaitinfo(). The use of SIGIO as the default notification signal is also undesirable, because it is one of the traditional signals that does not queue. This means that if events are generated more quickly than the application can process the signals, then some notifications will be lost. (This difficulty can be circumvented by changing the notification signal to one of the so-called realtime signals, which can be queued.)
Signals are also problematic because they convey little information: at most, we get a signal number (it is possible to arrange for different directories to notify using different signals) and a file descriptor number. We get no information about which particular file in a directory triggered an event, or indeed what kind of event occurred. (One can play tricks such as opening multiple file descriptors for the same directory, each of which notifies a different set of events, but this adds complexity to the application.) One further reason that using signals as a notification method can be a problem is that an application that uses dnotify might also make use of a library that employs signals: the use of a particular signal by dnotify in the main program may conflict with the library's use of the same signal (or vice versa).
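The signal-based workflow described above, combined with the two workarounds mentioned (a realtime signal chosen with F_SETSIG, and synchronous acceptance of the signal), can be sketched as follows. The helper names block_signal(), dnotify_watch(), and dnotify_wait() are hypothetical, and sigtimedwait() is used here in place of sigwaitinfo() only so that the wait can time out. The sketch assumes a kernel built with dnotify support.

```c
#define _GNU_SOURCE     /* exposes F_NOTIFY, F_SETSIG, and the DN_* flags */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Block 'sig' so that it can be accepted synchronously later. */
int block_signal(int sig)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Start monitoring 'dir'; returns the directory fd, or -1 on error. */
int dnotify_watch(const char *dir, int sig)
{
    int fd;

    assert(dir != NULL);
    fd = open(dir, O_RDONLY);
    if (fd == -1)
        return -1;

    /* Ask for 'sig' instead of SIGIO, then enable notification.
       DN_MULTISHOT keeps the watch active after the first event. */
    if (fcntl(fd, F_SETSIG, sig) == -1 ||
        fcntl(fd, F_NOTIFY,
              DN_CREATE | DN_DELETE | DN_MODIFY | DN_MULTISHOT) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Wait up to 'timeout_sec' seconds for a notification. Returns the file
   descriptor of the directory that had an event, or -1 on timeout/error. */
int dnotify_wait(int sig, int timeout_sec)
{
    sigset_t set;
    siginfo_t si;
    struct timespec ts = { timeout_sec, 0 };

    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigtimedwait(&set, &si, &ts) == -1)
        return -1;
    return si.si_fd;    /* all we learn: which directory, not which file */
}
```

A caller would invoke block_signal(SIGRTMIN) once, call dnotify_watch() for each directory, and then loop on dnotify_wait(), rescanning whichever directory is reported. Even in this tidier form, the application still has to work out for itself what changed.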
A final significant limitation of the dnotify API is the need to open a file descriptor for each directory that is monitored. This is problematic for two reasons. First, an application that monitors a large number of directories may quickly run out of file descriptors. However, a more serious problem is that holding file descriptors open on a filesystem prevents that filesystem from being unmounted.
Notwithstanding these API problems, dnotify did provide an efficiency improvement over simply polling a filesystem, and dnotify came to be employed in some widely used tools such as the Beagle desktop search tool. However, it soon became clear that a better API would make life easier for user-space applications.
Enter inotify
The inotify API was developed by John McCutchan with support from Robert Love. First released in Linux 2.6.13 (in 2005), inotify aimed to address all of the obvious problems with dnotify.
The API employs three dedicated system calls—inotify_init(), inotify_add_watch(), and inotify_rm_watch()—and makes use of the traditional read() system call as well.
inotify_init() creates an inotify instance—a kernel data structure that records which filesystem objects should be monitored and maintains a list of events that have been generated for those objects. The call returns a file descriptor that is employed by the rest of the API to refer to this inotify instance. The diagram at right summarizes the operation of an inotify instance.
inotify_add_watch() allows us to modify the set of filesystem objects monitored by an inotify instance. We can add new objects (files and directories) to the monitoring list, specifying which events are to be notified, and change the set of events that are notified for an object that is already in the monitoring list. Unsurprisingly, inotify_rm_watch() is the converse of inotify_add_watch(): it removes an object from the monitoring list.
The three arguments to inotify_add_watch() are an inotify file descriptor, a filesystem pathname, and a bit mask:
int inotify_add_watch(int fd, const char *pathname, uint32_t mask);
The mask argument specifies the set of events to be notified for the filesystem object referred to by pathname and can include some additional bits that affect the behavior of the call. As an example, the following code allows us to monitor file creation and deletion events inside the directory mydir, as well as monitor for deletion of the directory itself:
int fd, wd;

fd = inotify_init();
wd = inotify_add_watch(fd, "mydir",
                       IN_CREATE | IN_DELETE | IN_DELETE_SELF);
A full list of the bits that can be included in the mask argument is given in the inotify(7) man page. The set of events notified by inotify is a superset of that provided by dnotify. Most notably, inotify provides notifications when filesystem objects are opened and closed, and provides much more information for file rename events, as we outline below.
The return value of inotify_add_watch() is a "watch descriptor", which is an integer value that uniquely identifies the specified filesystem object within the inotify monitoring list. An inotify_add_watch() call that specifies a filesystem object that is already being monitored (possibly via a different pathname) will return the same watch descriptor number as was returned by the inotify_add_watch() that first added the object to the monitoring list.
When events occur for objects in the monitoring list, they can be read from the inotify file descriptor using read(). (The inotify file descriptor can also be monitored for readability using select(), poll(), and epoll().) Each read() returns one or more structures of the following form to describe an event:
struct inotify_event {
    int      wd;        /* Watch descriptor */
    uint32_t mask;      /* Bit mask describing event */
    uint32_t cookie;    /* Unique cookie associating related events */
    uint32_t len;       /* Size of name field */
    char     name[];    /* Optional null-terminated name */
};
The wd field is a watch descriptor that was previously returned by inotify_add_watch(). By maintaining a data structure that maps watch descriptors to pathnames, the application can determine the filesystem object for which this event occurred. The mask field is a bit mask that describes the event that occurred. In most cases, this field will include one of the bits from the mask that was supplied when the watch was established. For example, given the inotify_add_watch() call that we showed earlier, if the directory mydir was deleted, read() would return an event whose mask field has the IN_DELETE_SELF bit set. (By contrast, dnotify does not generate an event when a monitored directory is deleted.)
In addition to the various events for which an application may request notification, there are certain events for which inotify always generates automatic notifications. The most notable of these is IN_IGNORED, which is generated whenever inotify ceases to monitor an object. This can occur, for example, because the object was deleted or the filesystem on which it resides was unmounted. The IN_IGNORED event can be used by the application to adjust its internal model of what is currently being monitored. (Again, dnotify has no analog of this event.)
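The bookkeeping described above (a mapping from watch descriptors to pathnames that is pruned when IN_IGNORED arrives) might look something like the following sketch. The watch_table structure and its functions are hypothetical names, and the fixed-size arrays are an arbitrary simplification; a real application would more likely use a hash table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>

#define MAX_WATCHES 64          /* arbitrary limit for this sketch */
#define MAX_PATH_LEN 256        /* longer paths are truncated */

/* Hypothetical application-side map from watch descriptors to paths. */
struct watch_table {
    int  wd[MAX_WATCHES];
    char path[MAX_WATCHES][MAX_PATH_LEN];
    int  count;
};

/* Record that 'wd' refers to 'path'. An existing entry for the same wd is
   overwritten, since inotify returns the same wd for the same object.
   Returns 0 on success, or -1 if the table is full. */
int table_add(struct watch_table *t, int wd, const char *path)
{
    int i;

    assert(t != NULL && path != NULL);
    for (i = 0; i < t->count; i++)
        if (t->wd[i] == wd)
            break;
    if (i == MAX_WATCHES)
        return -1;
    t->wd[i] = wd;
    snprintf(t->path[i], MAX_PATH_LEN, "%s", path);
    if (i == t->count)
        t->count++;
    return 0;
}

/* Return the path recorded for 'wd', or NULL if it is not in the table. */
const char *table_lookup(const struct watch_table *t, int wd)
{
    for (int i = 0; i < t->count; i++)
        if (t->wd[i] == wd)
            return t->path[i];
    return NULL;
}

/* Apply one event to the table: on IN_IGNORED, forget the watch.
   Returns 1 if an entry was removed, 0 otherwise. */
int table_handle_event(struct watch_table *t, const struct inotify_event *ev)
{
    if (!(ev->mask & IN_IGNORED))
        return 0;
    for (int i = 0; i < t->count; i++) {
        if (t->wd[i] == ev->wd) {
            /* Fill the freed slot with the last entry. */
            t->count--;
            t->wd[i] = t->wd[t->count];
            memmove(t->path[i], t->path[t->count], MAX_PATH_LEN);
            return 1;
        }
    }
    return 0;
}
```

The application would call table_add() after each successful inotify_add_watch() and pass every event it reads through table_handle_event() before acting on it.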
The name field is used (only) when an event occurs for a file inside a monitored directory: it contains the null-terminated name of the file that triggered this event. The len field indicates the total size of the name field, which may be terminated by multiple null bytes in order to pad out the inotify_event structure to a size that allows successive structures in the read buffer to be aligned at architecture-appropriate byte boundaries (typically, multiples of 16 bytes).
The cookie field exists to help applications interpret rename events. When a file is renamed inside (or between) monitored directories, two events are generated: an IN_MOVED_FROM event for the directory from which the file is moved, and an IN_MOVED_TO event for the directory to which the file is moved. The first event contains the old name of the file, and the second event contains the new name. Both events have the same unique cookie value, allowing the application to connect the two events, and thus work out the old and new name of the file (a task that is rather difficult with dnotify). We'll say rather more about rename events in the next article in this series.
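As a rough illustration of how the cookie can be used, the following sketch remembers the most recent IN_MOVED_FROM event and reports a rename when an IN_MOVED_TO event with the same cookie arrives. The pending_move structure and match_rename() function are hypothetical names; the sketch assumes that at most one rename is outstanding at a time, which a robust application should not rely on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>

/* State remembered from the most recent unmatched IN_MOVED_FROM event. */
struct pending_move {
    uint32_t cookie;        /* 0 means "nothing pending" */
    int      wd;            /* watch descriptor of the source directory */
    char     name[256];     /* old name of the file */
};

/* Feed one event to the matcher. Returns 1 when 'ev' is the IN_MOVED_TO
   half of a pending rename, in which case the old name is copied to
   'old_name' (the new name is ev->name). Returns 0 otherwise. */
int match_rename(struct pending_move *pm, const struct inotify_event *ev,
                 char *old_name, size_t old_len)
{
    assert(pm != NULL && ev != NULL && old_name != NULL && old_len > 0);

    if (ev->mask & IN_MOVED_FROM) {
        /* First half of a rename: remember it until its partner shows up. */
        pm->cookie = ev->cookie;
        pm->wd = ev->wd;
        snprintf(pm->name, sizeof(pm->name), "%s",
                 ev->len > 0 ? ev->name : "");
        return 0;
    }

    if ((ev->mask & IN_MOVED_TO) && pm->cookie != 0 &&
            ev->cookie == pm->cookie) {
        /* Second half: same cookie, so the two events belong together. */
        snprintf(old_name, old_len, "%s", pm->name);
        pm->cookie = 0;
        return 1;
    }

    return 0;
}
```

An IN_MOVED_FROM event that never finds a partner corresponds to a file that was moved out of the monitored set, so the application has to decide how long it is prepared to wait before treating the event as a deletion.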
Inotify does not provide recursive monitoring. In other words, if we are monitoring the directory mydir, then we will receive notifications for that directory as well as all of its immediate descendants, including subdirectories. However, we will not receive notifications for events inside the subdirectories. But, with some effort, it is possible to perform recursive monitoring by creating watches for each of the subdirectories in a directory tree. To assist with this task, when a subdirectory is created inside a monitored directory (or indeed, when any event is generated for a subdirectory), inotify generates an event that has the IN_ISDIR bit set. This provides the application with the opportunity to add watches for new subdirectories.
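One way to set up the per-directory watches is to walk the tree with nftw(), as in the sketch below. The function names watch_tree() and is_new_subdir() and the choice of event mask are assumptions made for this example. Because nftw() provides no way to pass user data to its callback, the sketch keeps the inotify file descriptor in a file-scope variable, which makes it unsuitable for multithreaded use.

```c
#define _XOPEN_SOURCE 700       /* for nftw() */
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>

/* State shared with the nftw() callback. */
static int g_inotify_fd = -1;
static int g_watch_count;

/* Called by nftw() for each object in the tree; watches directories only. */
static int add_watch_cb(const char *path, const struct stat *sb,
                        int type, struct FTW *ftwbuf)
{
    (void) sb;
    (void) ftwbuf;

    if (type != FTW_D)
        return 0;                       /* not a (readable) directory */

    if (inotify_add_watch(g_inotify_fd, path,
                          IN_CREATE | IN_DELETE | IN_MOVED_FROM |
                          IN_MOVED_TO | IN_DELETE_SELF) == -1) {
        perror(path);                   /* report, then keep walking */
        return 0;
    }
    g_watch_count++;
    return 0;
}

/* Add a watch for 'root' and every directory below it. Returns the number
   of watches added, or -1 if the tree could not be traversed. */
int watch_tree(int inotify_fd, const char *root)
{
    assert(inotify_fd >= 0 && root != NULL);

    g_inotify_fd = inotify_fd;
    g_watch_count = 0;
    /* FTW_PHYS: do not follow symbolic links out of the tree. */
    if (nftw(root, add_watch_cb, 16, FTW_PHYS) == -1)
        return -1;
    return g_watch_count;
}

/* True if 'ev' reports creation of a subdirectory, which then needs a
   watch of its own. */
int is_new_subdir(const struct inotify_event *ev)
{
    return (ev->mask & IN_CREATE) && (ev->mask & IN_ISDIR);
}
```

When is_new_subdir() returns true, the application would build the full pathname of the new subdirectory and call watch_tree() on it. Note that files created inside the new subdirectory before the watch is in place generate no events, so the application should also scan the subdirectory after adding the watch.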
Example program
The code below demonstrates the basic steps in using the inotify API. The program first creates an inotify instance and adds watches for all possible events for each of the pathnames specified in its command line. It then sits in a loop reading events from the inotify file descriptor and displaying information from those events (using our displayInotifyEvent(), shown in the full version of the code here).
int
main(int argc, char *argv[])
{
    struct inotify_event *event;
    ...

    inotifyFd = inotify_init();         /* Create inotify instance */

    for (j = 1; j < argc; j++) {
        wd = inotify_add_watch(inotifyFd, argv[j], IN_ALL_EVENTS);
        printf("Watching %s using wd %d\n", argv[j], wd);
    }

    for (;;) {                          /* Read events forever */
        numRead = read(inotifyFd, buf, BUF_LEN);
        ...

        /* Process all of the events in buffer returned by read() */
        for (p = buf; p < buf + numRead; ) {
            event = (struct inotify_event *) p;
            displayInotifyEvent(event);

            /* Each record is the fixed header plus 'len' bytes of name */
            p += sizeof(struct inotify_event) + event->len;
        }
    }
}
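The displayInotifyEvent() function is not reproduced in the excerpt above. A minimal stand-in might look like the following sketch; the names mask_to_string() and display_event() are hypothetical, and the output layout is only an approximation of what the full program prints.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>

/* Write the space-separated names of the bits set in 'mask' to 'out'
   (truncating if 'out' is too small) and return 'out'. */
char *mask_to_string(uint32_t mask, char *out, size_t outlen)
{
    static const struct { uint32_t bit; const char *name; } names[] = {
        { IN_ACCESS,        "IN_ACCESS" },
        { IN_ATTRIB,        "IN_ATTRIB" },
        { IN_CLOSE_NOWRITE, "IN_CLOSE_NOWRITE" },
        { IN_CLOSE_WRITE,   "IN_CLOSE_WRITE" },
        { IN_CREATE,        "IN_CREATE" },
        { IN_DELETE,        "IN_DELETE" },
        { IN_DELETE_SELF,   "IN_DELETE_SELF" },
        { IN_IGNORED,       "IN_IGNORED" },
        { IN_ISDIR,         "IN_ISDIR" },
        { IN_MODIFY,        "IN_MODIFY" },
        { IN_MOVE_SELF,     "IN_MOVE_SELF" },
        { IN_MOVED_FROM,    "IN_MOVED_FROM" },
        { IN_MOVED_TO,      "IN_MOVED_TO" },
        { IN_OPEN,          "IN_OPEN" },
        { IN_Q_OVERFLOW,    "IN_Q_OVERFLOW" },
        { IN_UNMOUNT,       "IN_UNMOUNT" },
    };
    size_t used = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (mask & names[i].bit) {
            int n = snprintf(out + used, outlen - used, "%s%s",
                             used > 0 ? " " : "", names[i].name);
            if (n < 0 || (size_t) n >= outlen - used)
                break;                  /* no room left: stop here */
            used += (size_t) n;
        }
    }
    return out;
}

/* Print one event in a layout loosely modeled on the demo output. */
void display_event(const struct inotify_event *ev)
{
    char mask[256];

    printf("    wd = %d; ", ev->wd);
    if (ev->cookie > 0)
        printf("cookie = %u; ", (unsigned) ev->cookie);
    printf("mask = %s\n", mask_to_string(ev->mask, mask, sizeof(mask)));
    if (ev->len > 0)                    /* name is present only if len > 0 */
        printf("        name = %s\n", ev->name);
}
```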
Suppose that we use this program to monitor two subdirectories, xxx and yyy:
$ ./inotify_demo xxx yyy
Watching xxx using wd 1
Watching yyy using wd 2
If we now execute the following command:
$ mv xxx/aaa yyy/bbb
we see the following output from our program:
Read 64 bytes from inotify fd
wd = 1; cookie =140040; mask = IN_MOVED_FROM
name = aaa
wd = 2; cookie =140040; mask = IN_MOVED_TO
name = bbb
The mv command generated an IN_MOVED_FROM event for the xxx directory (watch descriptor 1) and an IN_MOVED_TO event for the yyy directory (watch descriptor 2). The two events contained, respectively, the old and new name of the file. The events also had the same cookie value, thus allowing an application to connect them.
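The demo program simply blocks in read(). As noted earlier, the inotify file descriptor can also be monitored for readability, which lets an application wait with a timeout or multiplex inotify with other descriptors. A minimal sketch using poll() follows; wait_for_events() and read_events_timeout() are hypothetical helper names.

```c
#include <assert.h>
#include <poll.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to 'timeout_ms' milliseconds for events to become available.
   Returns 1 if the descriptor is readable, 0 on timeout, -1 on error. */
int wait_for_events(int inotify_fd, int timeout_ms)
{
    struct pollfd pfd;
    int ready;

    assert(inotify_fd >= 0);
    pfd.fd = inotify_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready == -1)
        return -1;
    return (ready == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Like read(), but gives up after 'timeout_ms' milliseconds. Returns the
   number of bytes read, 0 on timeout, or -1 on error. */
ssize_t read_events_timeout(int inotify_fd, char *buf, size_t len,
                            int timeout_ms)
{
    int ready = wait_for_events(inotify_fd, timeout_ms);

    if (ready <= 0)
        return ready;
    return read(inotify_fd, buf, len);
}
```

The buffer filled by read_events_timeout() can be walked in exactly the same way as in the demo program's inner loop.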
How inotify improves on dnotify
Inotify improves on dnotify in a number of respects. Among the more notable improvements are the following:
- Both directories and individual files can be monitored.
- Instead of signals, applications are notified of filesystem events by reading structured data from a file descriptor created using the API. This approach allows an application to deal with notifications synchronously, and also allows for richer information to be provided with notifications.
- Inotify does not require an application to open file descriptors for each monitored object. Instead, it uses an API-specific handle (the watch descriptor). This avoids the problems of file-descriptor exhaustion and open file descriptors preventing filesystems from being unmounted.
- Inotify provides more information when notifying events. First, it can be used to detect a wider range of events. Second, when the subject of an event is a file inside a monitored directory, inotify provides the name of that file as part of the event notification.
- Inotify provides richer information in its notification of rename events, allowing an application to easily determine the old and new name of the renamed object.
- IN_IGNORED events make it (relatively) easy for an inotify application to maintain an internal model of the currently monitored set of filesystem objects.
Concluding remarks
We've briefly seen how inotify improves on dnotify. In the next article in this series, we look in more detail at inotify, considering how it can be used in a robust application that monitors a filesystem tree. This will allow us to see the full capabilities of inotify, while at the same time discovering some of its limitations.
| Index entries for this article | |
|---|---|
| Kernel | Development model/User-space ABI |
| Kernel | Dnotify |
| Kernel | Inotify |
| GuestArticles | Kerrisk, Michael |
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 9, 2014 21:06 UTC (Wed) by wahern (subscriber, #37304) [Link]
Rare is the application that actually handles IN_Q_OVERFLOW. The ability to get the file name with a change notification without having an open descriptor is an _optimization_, but for reliable code you still need to be prepared to scan the directory regardless. So what's the big deal about having a descriptor to a directory open, anyhow?
fanotify has FAN_UNLIMITED_QUEUE, but that just seems like a DoS attack waiting to happen.
inotify is more convenient than EVFILT_VNODE in many respects, but in terms of a sane API that equitably divides concerns between user-space and kernel-space, EVFILT_VNODE wins hands-down. How many file managers are there, versus applications which only need basic event notification for a small set of directories or files? That's a lot of convoluted kernel code to handle one or two applications.
OS X Finder does just fine with EVFILT_VNODE. And OS X has the O_EVTONLY descriptor flag, which permits the kernel to unmount volumes with open descriptors.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 9, 2014 22:15 UTC (Wed) by neilbrown (subscriber, #359) [Link]
So using lots of file descriptors doesn't just push against a limit, it wastes memory.
http://lxr.free-electrons.com/source/Documentation/filesy... (Line 219)
I think it would be great to put 'struct file' on 'a diet' much like what was done to 'struct inode', so it only holds the bare minimum (including an _operations pointer) and various users embed it in a larger structure as required (the kernel approach to OO).
That, combined with O_EVTONLY (great idea!), would allow fds to be used instead of wds. Then you just pass all your O_EVTONLY fds to epoll and it "just works" :-)
Unfortunately there is no room in Linux for another file notification API, so I'll never get to try this out.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 23:40 UTC (Thu) by zblaxell (subscriber, #26385) [Link]
Indeed. I gave up on trying to convince kernel devs that not having recursive watches was a huge API hole in inotify (try monitoring a hierarchy of 30 million files, I dare you ;).
I implemented my own notification schemes in FUSE filesystems instead, and never looked back. Among other things, I can prevent race conditions by blocking whatever's accessing the files until whatever's watching the accesses catches up.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 11, 2014 8:49 UTC (Fri) by dgm (subscriber, #49227) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 11, 2014 14:02 UTC (Fri) by zblaxell (subscriber, #26385) [Link]
Depends on how you define the "S" in DoS.
If you want to trigger a backup process whenever files change, it might be reasonable to stop files changing until that process catches up, or at least wait until the notification has been stored in a persistent queue somewhere so the backup process can get to it later.
The other common example is blocking access to files by content, e.g. in a virus scanner.
Whether this is DoS or just a heavy implementation cost depends on whether it's more important to have up-to-date backups or to be able to rapidly modify files without blocking.
Obviously this wouldn't be something you just let random users do with their file browsers.
> How can you prove that the watcher is making progress?
By implementing a synchronous notification mechanism, e.g. make the watcher be a function call inside the FUSE daemon, or implement a lock on each notification object that blocks access to the watched object until the watcher clears the lock.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 24, 2014 3:05 UTC (Thu) by eMBee (guest, #70889) [Link]
we tried to use inotify to find new files for backup, so that rsync or csync2 would not have to scan the whole tree of a few TB worth of files, but we found that setting up the watches alone took more than an hour. longer even than it takes for csync2 to scan the whole tree.
would you be willing to publish your FUSE notification scheme?
greetings, eMBee.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 12, 2014 0:59 UTC (Sat) by foom (subscriber, #14868) [Link]
Are you sure? I also thought that nobody would want another half baked file notifications api, but then fanotify was added...
So who knows...Apparently I was wrong.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 12, 2014 7:51 UTC (Sat) by alison (subscriber, #63752) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 16, 2014 7:05 UTC (Wed) by kleptog (subscriber, #1183) [Link]
That said, one advantage of not using real file descriptors is that you won't block unmount, which can be nice for something that's trying to run in the background.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 16, 2014 23:08 UTC (Wed) by cortana (subscriber, #24596) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 4:23 UTC (Thu) by neilbrown (subscriber, #359) [Link]
I wonder what other, possibly less common, use cases there are.
One of my favourites is as a publish/subscribe (aka 'multicast') IPC mechanism. It is a bit clunky, but it can be made to work.
Who needs dbus when you can muck about with files and dnotify :-)
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 4:41 UTC (Thu) by k8to (subscriber, #15413) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 5:22 UTC (Thu) by ncm (subscriber, #165) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 8:37 UTC (Thu) by jbdenis (subscriber, #56289) [Link]
I concur. The fact that you can miss events is quite annoying. From my point of view, something like /dev/fsevents from OS X (http://en.wikipedia.org/wiki/FSEvents) looks like a more robust way to handle the filesystem notification problem.
There was a nice discussion of this problem on the bup Google group:
https://groups.google.com/forum/#!topic/bup-list/CXRI7MS3LwM
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 14:41 UTC (Thu) by cortana (subscriber, #24596) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 11, 2014 10:00 UTC (Fri) by HelloWorld (guest, #56129) [Link]
Does that need to be done in userspace? That might be the next application for the new BPF :-)
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 16, 2014 18:22 UTC (Wed) by nix (subscriber, #2304) [Link]
It's OK for its original use case of virus scanning, but who the hell cares about that?
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 21, 2014 7:20 UTC (Mon) by timmyp (guest, #97966) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 23, 2014 16:37 UTC (Wed) by nix (subscriber, #2304) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 11, 2014 0:06 UTC (Fri) by zlynx (guest, #2285) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 11, 2014 14:45 UTC (Fri) by smurf (subscriber, #17840) [Link]
It's a lot of fun (which I for one will happily leave to somebody else) to (try to) arrive at a coherent view of a directory which changes while you scan it.
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 10, 2014 15:48 UTC (Thu) by fandingo (guest, #67019) [Link]
Filesystem notification, part 1: An overview of dnotify and inotify
Posted Jul 12, 2014 13:08 UTC (Sat) by tomgj (guest, #50537) [Link]
> By contrast, the later inotify and fanotify APIs were each implemented using new system calls.
I'm afraid the use of the word "implemented" in the above is problematic. If we live in a world in which the implementation of an API is when someone writes the working code that shows the behaviour required by the design / specification of that API, it does seem problematic to also use "implemented" to mean something else. In the above sentence, saying the APIs were *designed* or *specified* using new system calls would carry much less risk of causing confusion. This avoids overloading "implemented" with closely related but critically distinct meanings.
Although it's clear to someone with a solid grasp of these concepts what the author actually meant in the sentence quoted above, adhering to a stricter nomenclature around interfaces and implementations, and specifically use of the word "implementation", could help these concepts develop better in the section of the readership where the concepts are muddled together in unhelpful ways. And I would speculate that the majority of the readership falls into the latter category.
API implementation or design
Posted Jul 12, 2014 19:44 UTC (Sat) by giraffedata (guest, #1954) [Link]
> saying the APIs were *designed* or *specified* using new system calls would carry much less risk of causing confusion
I agree with that. I know that the implementation the author had in mind was where inotify is an implementation of the filesystem notification API concept, but it's sloppy.
There is surprisingly much sloppy terminology, requiring readers to disambiguate using context and common sense, in computer engineering. In this article, for example, inotify_init() is called a system call. It's really a whole class of system calls - over all computers and all time, there have probably been millions of inotify_init() system calls.
I've often said that confusing the object with the class is the biggest impediment to understanding in technical documents I read, but confusing design and implementation, forcing the reader to figure out which level of the process the author is writing about, might be the second biggest.
API implementation or design
Posted Jul 16, 2014 7:22 UTC (Wed) by wahern (subscriber, #37304) [Link]
Probably not. 'cause it's referring to the characteristic sound you make to summon a duck, and not the particular times you made that sound.
API implementation or design
Posted Jul 16, 2014 16:20 UTC (Wed) by giraffedata (guest, #1954) [Link]
Right, that's another good example. "Duck call" can refer to an object - a single event - or to a class of those events - the ones that have a certain characteristic sound. The reader has to determine from context which one the writer meant. Often, the reader can disambiguate instantly and subconsciously. Other times, the reader has to look ahead a few sentences to resolve the ambiguity. In the worst case, the reader subconsciously chooses the wrong interpretation and later gets stuck and has to back up and look for places the writer might have meant something else.
"Bob's duck call worked" is an example of where the disambiguation is not trivial. Are we telling of a particular duck shooting or are we talking about Bob's now-over duck hunting career?
Regardless of whether it's easy or hard to disambiguate, the writer's goal should be to save the reader from having to do any figuring out. If the writer can be explicit about object versus class, that helps.
