
Kernel development

Brief items

Kernel release status

The current development kernel is 3.16-rc4, which was released on July 6.

Stable kernel status: On July 3, Greg Kroah-Hartman announced that 3.14 would be the next "longterm stable" kernel; he will be maintaining it until August 2016 or thereabouts.

The 3.15.4, 3.14.11, 3.10.47, and 3.4.97 stable kernels were released on July 6, followed by 3.15.5, 3.14.12, 3.10.48, and 3.4.98 on July 9.


Quotes of the week

IOW, what would an end-user's bug report look like?

It's important to think this way because a year from now some person we've never heard of may be looking at a user's bug report and wondering whether backporting this patch will fix it. Amongst other reasons.

Andrew Morton

Hey, I figure that if you weren't desperately in need of entertainment, you would not have asked me to hack a perl script!
Paul McKenney

"Magic barrier sprinkles" is a bad path to start down, IMHO.
Rusty Russell

We do not do defensive programming, we try to do logical things, and only logical things.
Eric Dumazet (Thanks to Dan Carpenter.)


The future of realtime Linux in doubt

In a message about the release of the 3.14.10-rt7 realtime Linux kernel, Thomas Gleixner reiterated that the funding problems that have plagued realtime Linux (which he raised, again, at last year's Real Time Linux Workshop) have only gotten worse. Efforts were made to find funding for the project, but "nothing has materialized". Assuming that doesn't change, Gleixner plans to cut back on development and on plans to get the code upstream. "After my last talk about the state of preempt-RT at LinuxCon Japan, Linus told me: 'That was far more depressing than I feared'. The mainline kernel has seen a lot of benefit from the preempt-RT efforts in the past 10 years and there is a lot more stuff which needs to be done upstream in order to get preempt-RT fully integrated, which certainly would improve the general state of the Linux kernel again."


Kernel development news

Anatomy of a system call, part 1

July 9, 2014

This article was contributed by David Drysdale

System calls are the primary mechanism by which user-space programs interact with the Linux kernel. Given their importance, it's not surprising to discover that the kernel includes a wide variety of mechanisms to ensure that system calls can be implemented generically across architectures, and can be made available to user space in an efficient and consistent way.

I've been working on getting FreeBSD's Capsicum security framework onto Linux and, as this involves the addition of several new system calls (including the slightly unusual execveat() system call), I found myself investigating the details of their implementation. As a result, this is the first of a pair of articles that explore the details of the kernel's implementation of system calls (or syscalls). In this article we'll focus on the mainstream case: the mechanics of a normal syscall (read()), together with the machinery that allows x86_64 user programs to invoke it. The second article will move off the mainstream case to cover more unusual syscalls, and other syscall invocation mechanisms.

System calls differ from regular function calls because the code being called is in the kernel. Special instructions are needed to make the processor perform a transition to ring 0 (privileged mode). In addition, the kernel code being invoked is identified by a syscall number, rather than by a function address.
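
As a concrete user-space illustration (a minimal sketch of my own, not kernel code), the glibc syscall() wrapper lets a program invoke read() by its number rather than through the usual library function:

    /* Minimal sketch: invoke read() by syscall number via glibc's generic
     * syscall() wrapper instead of the ordinary read() library function. */
    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        char c;
        long n = syscall(SYS_read, 0, &c, 1);   /* read one byte from stdin */
        return (n == 1) ? 0 : 1;
    }

On x86_64, SYS_read expands to 0, a number we will meet again below.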

Defining a syscall with SYSCALL_DEFINEn()

The read() system call provides a good initial example to explore the kernel's syscall machinery. It's implemented in fs/read_write.c, as a short function that passes most of the work to vfs_read(). From an invocation standpoint the most interesting aspect of this code is the way the function is defined using the SYSCALL_DEFINE3() macro. Indeed, from the code, it's not even immediately clear what the function is called.

    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

These SYSCALL_DEFINEn() macros are the standard way for kernel code to define a system call, where the n suffix indicates the argument count. The definition of these macros (in include/linux/syscalls.h) gives two distinct outputs for each system call.

    SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count)
    __SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

The first of these, SYSCALL_METADATA(), builds a collection of metadata about the system call for tracing purposes. It's only expanded when CONFIG_FTRACE_SYSCALLS is defined for the kernel build, and its expansion gives boilerplate definitions of data that describes the syscall and its parameters. (A separate page describes these definitions in more detail.)

The __SYSCALL_DEFINEx() part is more interesting, as it holds the system call implementation. Once the various layers of macros and GCC type extensions are expanded, the resulting code includes some interesting features:

    asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
    	__attribute__((alias(__stringify(SyS_read))));

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);
    asmlinkage long SyS_read(long int fd, long int buf, long int count);

    asmlinkage long SyS_read(long int fd, long int buf, long int count)
    {
    	long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
    	asmlinkage_protect(3, ret, fd, buf, count);
    	return ret;
    }

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

First, we notice that the system call implementation actually has the name SYSC_read(), but is static and so is inaccessible outside this module. Instead, a wrapper function, called SyS_read() and aliased as sys_read(), is visible externally. Looking closely at those aliases, we notice a difference in their parameter types — sys_read() expects the explicitly declared types (e.g. char __user * for the second argument), whereas SyS_read() just expects a bunch of (long) integers. Digging into the history of this, it turns out that the long version ensures that 32-bit values are correctly sign-extended for some 64-bit kernel platforms, preventing a historical vulnerability.

The last things we notice with the SyS_read() wrapper are the asmlinkage directive and asmlinkage_protect() call. The Kernel Newbies FAQ helpfully explains that asmlinkage means the function should expect its arguments on the stack rather than in registers, and the generic definition of asmlinkage_protect() explains that it's used to prevent the compiler from assuming that it can safely reuse those areas of the stack.

To accompany the definition of sys_read() (the variant with accurate types), there's also a declaration in include/linux/syscalls.h, and this allows other kernel code to call into the system call implementation directly (which happens in half a dozen places). Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.
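
That declaration is essentially the following (paraphrasing include/linux/syscalls.h):

    asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);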

Syscall table entries

Hunting for callers of sys_read() also points the way toward how user space reaches this function. For "generic" architectures that don't provide an override of their own, the include/uapi/asm-generic/unistd.h file includes an entry referencing sys_read:

    #define __NR_read 63
    __SYSCALL(__NR_read, sys_read)

This defines the generic syscall number __NR_read (63) for read(), and uses the __SYSCALL() macro to associate that number with sys_read(), in an architecture-specific way. For example, arm64 uses the asm-generic/unistd.h header file to fill out a table that maps syscall numbers to implementation function pointers.

However, we're going to concentrate on the x86_64 architecture, which does not use this generic table. Instead, x86_64 defines its own mappings in arch/x86/syscalls/syscall_64.tbl, which has an entry for sys_read():

    0	common	read			sys_read

This indicates that read() on x86_64 has syscall number 0 (not 63), and has a common implementation for both of the ABIs for x86_64, namely sys_read(). (The different ABIs will be discussed in the second part of this series.) The syscalltbl.sh script generates arch/x86/include/generated/asm/syscalls_64.h from the syscall_64.tbl table, specifically generating an invocation of the __SYSCALL_COMMON() macro for sys_read(). This header file is used, in turn, to populate the syscall table, sys_call_table, which is the key data structure that maps syscall numbers to sys_name() functions.
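
The generated header is simply a list of macro invocations, one per table entry; its shape is roughly the following (a sketch, not a verbatim copy of the generated file):

    __SYSCALL_COMMON(0, sys_read, sys_read)
    __SYSCALL_COMMON(1, sys_write, sys_write)
    /* ... */

The x86_64 table-building code then defines __SYSCALL_COMMON() so that each invocation becomes a designated array initializer, along these general lines (again, a paraphrase):

    #define __SYSCALL_COMMON(nr, sym, compat)  __SYSCALL_64(nr, sym, compat)
    #define __SYSCALL_64(nr, sym, compat)      [nr] = sym,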

x86_64 syscall invocation

Now we will look at how user-space programs invoke the system call. This is inherently architecture-specific, so for the rest of this article we'll concentrate on the x86_64 architecture (other x86 architectures will be examined in the second article of the series). The invocation process also involves a few steps, so the diagram below may help with the navigation.

[System call diagram]

In the previous section, we discovered a table of system call function pointers; the table for x86_64 looks something like the following (using a GCC extension for array initialization that ensures any missing entries point to sys_ni_syscall()):

    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    	[0 ... __NR_syscall_max] = &sys_ni_syscall,
    	[0] = sys_read,
    	[1] = sys_write,
    	/*... */
    };

For 64-bit code, this table is accessed from arch/x86/kernel/entry_64.S, from the system_call assembly entry point; it uses the RAX register to pick the relevant entry in the array and then calls it. Earlier in the function, the SAVE_ARGS macro pushes various registers onto the stack, to match the asmlinkage directive we saw earlier.

Moving outwards, the system_call entry point is itself referenced in syscall_init(), a function that is called early in the kernel's startup sequence:

    void syscall_init(void)
    {
    	/*
    	 * LSTAR and STAR live in a bit strange symbiosis.
    	 * They both write to the same internal register. STAR allows to
    	 * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
    	 */
    	wrmsrl(MSR_STAR,  ((u64)__USER32_CS)<<48  | ((u64)__KERNEL_CS)<<32);
    	wrmsrl(MSR_LSTAR, system_call);
    	wrmsrl(MSR_CSTAR, ignore_sysret);
    	/* ... */

The wrmsrl() function (a thin wrapper around the WRMSR instruction) writes a value to a model-specific register; in this case, the address of the general system_call syscall-handling function is written to the register MSR_LSTAR (0xc0000082), which is the x86_64 model-specific register for handling the SYSCALL instruction.

And this gives us all we need to join the dots from user space to the kernel code. The standard ABI for how x86_64 user programs invoke a system call is to put the system call number (0 for read) into the RAX register, and the other parameters into specific registers (RDI, RSI, RDX for the first 3 parameters), then issue the SYSCALL instruction. This instruction causes the processor to transition to ring 0 and invoke the code referenced by the MSR_LSTAR model-specific register — namely system_call. The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in the sys_call_table table — namely sys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read().
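
To make the user-space end of this concrete, here is a minimal sketch (my own illustration, not code from the kernel or glibc) that issues the SYSCALL instruction directly from C using GCC inline assembly; in practice, the glibc read() wrapper performs the equivalent sequence:

    /* Minimal sketch: perform read(0, &c, 1) by placing the syscall number
     * in RAX and the arguments in RDI, RSI, and RDX, then executing SYSCALL.
     * The SYSCALL instruction clobbers RCX and R11. */
    #include <stddef.h>

    static long raw_read(unsigned int fd, char *buf, size_t count)
    {
        long ret;

        asm volatile ("syscall"
                      : "=a" (ret)              /* return value comes back in RAX */
                      : "0" (0L),               /* syscall number: __NR_read is 0 */
                        "D" ((long) fd),        /* first argument in RDI */
                        "S" (buf),              /* second argument in RSI */
                        "d" (count)             /* third argument in RDX */
                      : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        char c;

        return (raw_read(0, &c, 1) == 1) ? 0 : 1;
    }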

Now that we've seen the standard implementation of system calls on the most common platform, we're in a better position to understand what's going on with other architectures, and with less-common cases. That will be the subject of the second article in the series.


Control groups, part 2: On the different sorts of hierarchies

July 9, 2014

This article was contributed by Neil Brown


Control groups

Hierarchies are everywhere. Whether this is a deep property of the universe or simply the result of the human thought process, we see hierarchies wherever we look, from the URL bar that your browser displays (or maybe doesn't) to the pecking order in the farm yard. There is a fun fact that if you click on the first link in the main text of a Wikipedia article, and then repeat that on each following article, you eventually get to Philosophy, though this is apparently only true 94.52% of the time. Nonetheless it suggests that all knowledge can be arranged hierarchically underneath the general heading of "Philosophy".

Control groups (cgroups) allow processes to be grouped hierarchically and the specific details of this hierarchy are one area where cgroups have both undergone change and received criticism. In our ongoing effort to understand cgroups enough to enjoy the debates that regularly spring up, it is essential to have an appreciation of the different ways a hierarchy can be used, so we can have some background against which to measure the hierarchy in cgroups.

I find that an example from my past raises some relevant issues that we can then see play out in some more generally familiar filesystem hierarchies and that we can be prepared to look for in cgroup hierarchies.

Hierarchies in computer account privileges

In a previous role as a system administrator for a modest-sized computing department at a major Australian university, we had a need for a scheme to impose various access controls on, and provide resource allocations to, a wide variety of users: students, both undergraduate and post-graduate, and staff, both academic and professional. Already it is clear that a hierarchy is presenting itself, with room for further subdivisions between research and course-work students, and between technical and clerical professional staff.

Largely orthogonal to this hierarchy were divisions of the school into research groups and support groups (I worked in the Computing Support Group) together with a multitude of courses that were delivered, each loosely associated with a particular program (Computer Engineering, Software Engineering, etc.) at a particular year level. Within each of the different divisions and courses there could be staff in different roles as well as students. Some privileges best aligned with the role performed by the owner of the account, so staff received a higher printing allowance than students. Others aligned with the affiliation of the account owner — a particular printer might be reserved for employees in the School Office who had physical access and used it for printing confidential transcripts. Similarly, students in some particular course had a genuine need for a much higher budget of color printouts.

To manage all of this we ended up with two separate hierarchies that were named "Reason" (which included the various roles, since they were the reason a computer account was given) and "Organisation" (identifying that part of the school in which the role was active). From these two we formed a cross product such that for each role and for each affiliation there was, at least potentially, a group of user accounts. Each account could exist in several of these groups, as both staff and students could be involved in multiple courses, and some senior students might be tutors for junior courses. Various privileges and resources could be allocated to individual roles and affiliations or intersections thereof, and they would be inherited by any account in the combined hierarchy.

Manageable complexity

Having a pair of interconnected hierarchies was certainly more complex than the single hierarchy that I was hoping for, but it had one particular advantage: it worked. It was an arrangement that proved to be very flexible and we never had any trouble deciding where to attach any particular computer account. The complexity was a small price to pay for the utility.

Further, the price was really quite small. While creating the cross product of two hierarchies by hand would have been error prone, we didn't have to do that. A fairly straightforward tool managed all the complexity behind the scenes, creating and linking all the intermediate tree nodes as required. While working with the tree, whether assigning permissions or resources or attaching people to various roles or affiliations, we rarely needed to think about the internal details and never risked getting them wrong.

This exercise left me with a deep suspicion of simple hierarchies. They are often tempting, but just as often they are an over-simplification. So the first lesson from this tale is that a little complexity can be well worth the cost, particularly if it is well-chosen and can be supported by simple tools.

Two types of hierarchy

The second lesson from this exercise is that the two hierarchies weren't just different in detail; they had quite different characters.

The "Reason" hierarchy is what might be called a "classification" hierarchy. Every individual had their own unique role but it is useful to group similar roles into classes and related classes into super-classes. A widely known hierarchy that has this same property is the Linnaean taxonomy of Biological classification, which is a hierarchy of life forms with seven main ranks of Kingdom, Phylum, Class, Order, Family, Genus, and Species.

With this sort of hierarchy all the members belong in the leaves. In the biological example, all life forms are members of some species. We may not know (or be able to agree) which species a particular individual belongs to, but to suggest that some individual is a member of some Family, but not of any Genus or Species doesn't make sense. It would be at best an interim step leading to a final classification.

The "Organisation" hierarchy has quite a different character. The different research groups did not really represent a classification of research interests, but were a way to organize people into conveniently sized groups to distribute management. Certainly the groups aligned with people's interests where possible, but it was not unheard of for someone to be assigned to a group not because they naturally belonged, but because it was most convenient. To some extent the grouping exists for some separate purpose and members are placed in groups to meet that purpose. This contrasts with a "classification" where each "class" exists only to contain its members.

An organizational hierarchy has another important property: it is perfectly valid for internal nodes to contain individuals. The Head of School was the head of the whole school, and so belonged at the top of the hierarchy. Similarly, a program director could reasonably be associated with the program as a whole without being specifically associated with each of the courses in the program. In many organizations, the leader or head of each group is a member of the group one step up in the organizational hierarchy, which affirms this pattern.

These two different types of hierarchy are quite common and often get mingled together. Two places that we can find them that will be familiar to many readers are the "sysfs" filesystem in Linux, and the source code tree for the Linux kernel.

Devices in /sys

The "sysfs" filesystem (normally mounted at /sys) is certainly a hierarchy — as that is how filesystems work. While sysfs currently contains a range of different objects including modules, firmware information, and filesystem details, it was originally created for devices and it is only the devices that will concern us here.

There are, in fact, three separate hierarchical arrangements of devices that all fit inside sysfs, suggesting that each device should have three parents. As devices are represented as directories, this is clearly not possible, since Unix directories may have only one parent. This conundrum is resolved through the use of symbolic links (or "symlinks") with implicit, rather than explicit, links to parents. We will start exploring with the hierarchies that are held together with symlinks.

The hierarchy rooted at /sys/dev could be referred to as the "legacy hierarchy". From the early days of Unix there have been two sorts of devices: block devices and character devices. These are represented by the various device-special-files that can normally be found in /dev. Each such file identifies as either a block device or a character device and has a major device number indicating the general class of device (e.g. serial port, parallel port, disk or tape drive) and a minor number that indicates which particular device of that class is the target.

This three-level hierarchy is exactly what we find under /sys/dev, though a colon is used, rather than a slash, to separate the last two levels. So /sys/dev/block/8:0 (block device with major number 8 and minor number 0) is a symbolic link to a directory representing the device also known as "sda". If we start in that directory and want to find the path from /sys/dev, we can find the last two components ("8:0") by reading the "dev" file. Determining that it is a block device is less straightforward, though the presence of a "bdi" (block device info) directory is a strong hint.

This hierarchy is particularly useful if all you have is the name of a device file in /dev, or an open file descriptor on such a device. The stat() or fstat() system calls will report the device type and the major and minor numbers, and these can trivially be converted to a path name in /sys/dev, which can lead to other useful information about the device.
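
A small user-space sketch of that conversion (my illustration; the function name is made up) might look like this:

    /* Sketch: derive the /sys/dev/... path for an open device file descriptor. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>      /* major(), minor() */

    static int sysfs_dev_path(int fd, char *path, size_t len)
    {
        struct stat st;

        if (fstat(fd, &st) < 0)
            return -1;
        if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
            return -1;              /* not a device file */
        snprintf(path, len, "/sys/dev/%s/%u:%u",
                 S_ISBLK(st.st_mode) ? "block" : "char",
                 major(st.st_rdev), minor(st.st_rdev));
        return 0;
    }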

The second symlink-based hierarchy is probably the most generally useful. It is rooted at /sys/class and /sys/bus, suggesting that there really should be another level in there to hold both of these. There are plans to combine both of these into a new /sys/subsystem tree, though as those plans are at least seven years old, I'm not holding my breath. One valuable aspect of these plans that is already in place is that each device directory has a subsystem symlink that points back to either the class or bus tree, so you can easily find the parent of any device within this hierarchy.

The /sys/class hierarchy is quite simple, containing a number of device classes each of which contains a number of specific devices with links to the real device directory. As such, it is conceptually quite similar to the legacy hierarchy, just with names instead of numbers. The /sys/bus hierarchy is similar, though the devices are collected into a separate devices subdirectory allowing each bus directory to also contain drivers and other details.

The third hierarchy for organizing devices is a true directory-based hierarchy that doesn't depend on symlinks. It is found in /sys/devices and has a structure that, in all honesty, is rather hard to describe.

The overriding theme to the organization is that it follows the physical connectedness of devices, so if a hard drive is accessed via a USB port with the USB controller attached to a PCI bus, then the path through the hierarchy to that hard drive will first find the PCI bus, and then the USB port. After the hard drive will be the "block" device that provides access to the data on the drive, and then possibly subordinate devices for partitions.

This is an arrangement that seems like a good idea until you realize that some devices get control signals from one place (or maybe two if there is a separate reset line) and power supply from another place, so a simple hierarchy cannot really describe all the interconnectedness. This is an issue that was widely discussed in preparation for this year's Kernel Summit.

When examining these hierarchies from the perspective of "classification" versus "organization", some fairly clear patterns emerge. The /sys/dev hierarchy is a simple classification hierarchy, though possibly overly simple as many devices (e.g. network interfaces) don't appear there. The /sys/class part of the subsystem hierarchy is similarly a simple classification, though it is more complete.

The /sys/bus part of the subsystem hierarchy is also a simple two-level classification, though the presence of extra information for each bus type, such as the drivers directory, confuses this a little. Devices in the class hierarchy are classified by what functionality they provide (net, sound, watchdog, etc.). Devices in the bus hierarchy are classified by how they are accessed; they represent different addressable units rather than different functional units. The extra entries in the /sys/bus subtree allow some control over what functionality (represented by a driver and realized as a class device) is requested of each addressable unit. With this understood, it too is hierarchically a simple two-level classification.

The /sys/devices hierarchy is indisputably an organizational hierarchy. It contains all the class devices and all the bus devices in a rough analog of the physical organization of devices. When there is no physical device, or it is not currently represented on any sort of bus, devices are organized into /sys/devices/virtual.

Here again we see that both a classification hierarchy and an organization hierarchy for the same objects can be quite useful, each in its own way. There can be some complexity to working with both, but if you follow the rules, it isn't too bad.

The Linux kernel source tree

For a significantly different perspective on hierarchies, we can look at the Linux kernel source code tree, though many evolving source code trees could provide similar examples. This hierarchy is more about organization than classification, though, as with the research groups discussed earlier, there is generally an attempt to keep related things together when convenient.

There are two aspects of the hierarchy that are worth highlighting, as they illustrate choices that must be made — consciously or unconsciously.

At the top level, there are directories for various major subsystems, such as fs for filesystems (and also file servers like nfsd), mm for memory management, sound, block, crypto, etc. These all seem like reasonable classifications. And then there is kernel. Given that all of Linux is an operating system kernel, maybe this bit is the kernel of the kernel?

In reality, it is various distinct bits and pieces that don't really belong to any particular subsystem, or they are subsystems that are small enough to only need one or two files. In some cases, like the time and sched directories, they are subsystems which were once small enough to belong in kernel and have grown large enough to need their own directory, but not bold enough to escape from the kernel umbrella.

The fs subtree contains a similar mix of files. Most of fs consists of the different filesystems, plus a few support modules that get their own subdirectory, such as exportfs, which helps various file servers, and dlm, which supports locking for cluster filesystems. However, fs also contains an ad hoc collection of C files that provide services to filesystems or implement the higher-level system call interfaces. These are exactly like the code that appears in kernel (and possibly lib) at the top level. Unlike kernel, though, fs has no subdirectory for miscellaneous things; it all just stays in the top level of fs.

There is not necessarily a right answer as to whether everything should be classified into its own leaf directory (following the kernel model), or whether it is acceptable to store source code in internal directories (as is done in fs). However, it is a choice that must be made, and is certainly something to hold an opinion on when debating hierarchies in cgroups.

The kernel source tree also contains a different sort of classification: scripts live in the scripts directory, firmware lives in the firmware directory, and header files live in the include directory — except when they don't. There has been a tendency in recent years to move some header files out of the include directory tree and closer to the C source code files that they are related to. To make this more concrete, let's consider the example of the NFS and the ext3 filesystems.

Each of these filesystems consists of some C language files, some C header files, and assorted other files. The question is: should the header files for NFS live with the header files for ext3 (header files together), or should the header files for NFS live with the C language files for NFS (NFS files together)? To put this another way, do we need to use the hierarchy to classify the header files as different from the other files, or are the different names sufficient?

There was a time when most, if not all, header files were in the include tree. Today, it is very common to find include files mixed with the C files. For ext3, a big change happened in Linux 3.4, when all four header files were moved from include/linux/ into a single file with the rest of the ext3 code: fs/ext3/ext3.h.

The point here is that classification is quite possible without using a hierarchy. Sometimes hierarchical classification is perfect for the task. Sometimes it is just a cumbersome inconvenience. Being willing to use hierarchy when, but only when, it is needed, makes a lot of sense.

Hierarchies for processes

Understanding cgroups, which is the real goal of this series of articles, will require some understanding of how to manage groups of processes and what role hierarchy can play in that management. None of the above is specifically about processes, but it does raise some useful questions or issues that we can consider when we start looking at the details of cgroups:

  • Does the simplicity of a single hierarchy outweigh the expressiveness of multiple hierarchies, whether they are separate (as in sysfs) or interconnected (as in the account management example)?

  • Is the overriding goal to classify processes, or simply to organize them? Or are both needs relevant, and, if so, how can we combine them?

  • Could we allow non-hierarchical mechanisms, such as symbolic links or file name suffixes, to provide some elements of classification or organization?

  • Does it ever make sense for processes to be attached to internal nodes in the hierarchy, or should they be forced into leaves, even if that leaf is simply a miscellaneous leaf?

In the hierarchy of process groups we looked at last time, we saw a single simple hierarchy that classified processes, first by login session, and then by job group. All processes that were in the hierarchy at all were in the leaves, but many processes, typically system daemons that never opened a tty at all, were completely absent from the hierarchy.

To begin to find answers to these questions in a more modern setting, we need to understand what cgroups actually does with processes and what the groups are used for. In the next article we will start answering that question by taking a close look at some of the cgroups "subsystems", which include resource controllers and various other operations that need to treat a set of processes as a group.


Filesystem notification, part 1: An overview of dnotify and inotify

July 9, 2014

This article was contributed by Michael Kerrisk.


Filesystem notification

Filesystem notification APIs provide a mechanism by which applications can be informed when events happen within a filesystem—for example, when a file is opened, modified, deleted, or renamed. Over time, Linux has acquired three different filesystem notification APIs, and it is instructive to look at them to understand what the differences between the APIs are. It's also worthwhile to consider what lessons have been learned during the design of the APIs—and what lessons remain to be learned.

This article is thus the first in a series that looks at the Linux filesystem notification APIs: dnotify, inotify, and fanotify. To begin with, we briefly describe the original API, dnotify, and look at its limitations. We'll then look at the inotify API, and consider the ways in which it improves on dnotify. In a subsequent article, we'll take a look at the fanotify API.

Filesystem notification use cases

In order to compare filesystem notification APIs, it's useful to consider some of the use cases for those APIs. Some of the common use cases are the following:

  • Caching a model of filesystem objects: The application wants to maintain an internal representation that accurately reflects the current set of objects in a filesystem, or some subtree of that filesystem. An example of such an application is a file manager, which presents the user with a graphical representation of the objects in a filesystem.
  • Logging filesystem activity: The application wants to record all of the events (or some subset of event types) that occur for the monitored filesystem objects.
  • Gatekeeping filesystem operations: The application wants to intervene when a filesystem event occurs. The classic example of such an application is an antivirus system: when another program tries to (for example) execute a file, the antivirus system first checks the contents of the file for malware, and then either allows the execution to proceed if the file contents are benign, or prevents execution if a virus is detected.

In the beginning: dnotify

Without a kernel-supported filesystem notification API, an application must resort to techniques such as polling the state of directories and files using repeated invocations of system calls such as stat() and the readdir() library function. Such polling is, of course, slow and inefficient. Furthermore, this approach allows only a limited range of events to be detected, for example, creation of a file, deletion of a file, and changes of file metadata such as permissions and file size. By contrast, operations such as file renames are difficult to identify.

Those problems led to the creation of the first in-kernel implementation of a filesystem notification API, dnotify, which was implemented by Stephen Rothwell (these days, the maintainer of the linux-next tree) and which first appeared in Linux 2.4.0 (in 2001).

Because it was the first attempt at implementing a filesystem notification API, done at a time when the problem was less well understood and when some of the pitfalls of API design were less easily recognized, the dnotify API has a number of peculiarities. To begin with, the interface is multiplexed on the existing fcntl() system call. (By contrast, the later inotify and fanotify APIs were each implemented using new system calls.) To enable monitoring, one makes a call of the form:

    fcntl(fd, F_NOTIFY, mask);

Here, fd is a file descriptor that specifies a directory to be monitored, and this brings us to the second oddity of the API: dnotify can be used to monitor only whole directories; monitoring individual files is not possible. The mask specifies the set of events to be monitored in the directory. These include events for file access, modification, creation, deletion, and attribute changes (e.g., permission and ownership changes) that are fully listed in the fcntl(2) man page.

A further dnotify oddity is its method of notification. When an event occurs, the monitoring application is sent a signal (SIGIO by default, but this can be changed). The signal on its own does not identify which directory had the event, but if we use sigaction() to establish the handler using the SA_SIGINFO flag, then the handler receives a siginfo_t argument whose si_fd field contains the file descriptor associated with the directory. At that point, the application then needs to rescan the directory to determine which file has changed. (In typical usage, the application would maintain a data structure that caches a mapping of file descriptors to directory names, so that it can map si_fd back to a directory name.)
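
Putting those pieces together, a minimal dnotify sketch (illustrative only, and assuming a directory named mydir) might look like this:

    /* Minimal dnotify sketch: watch "mydir" for file creation and deletion,
     * receiving SIGIO with SA_SIGINFO so that the handler can see si_fd. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t changed_fd = -1;

    static void handler(int sig, siginfo_t *si, void *ucontext)
    {
        changed_fd = si->si_fd;     /* the directory fd that saw the event */
    }

    int main(void)
    {
        struct sigaction sa;
        int fd;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGIO, &sa, NULL);

        fd = open("mydir", O_RDONLY);
        /* DN_MULTISHOT keeps the notification armed after the first event;
         * fcntl(fd, F_SETSIG, SIGRTMIN) could switch delivery to a queued
         * realtime signal, as discussed below. */
        fcntl(fd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);

        for (;;) {
            pause();                /* wait for a signal */
            if (changed_fd != -1)
                printf("directory fd %d changed; rescan it\n", (int) changed_fd);
            changed_fd = -1;
        }
    }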

A simple example of the use of dnotify can be found here.

Problems with dnotify

As is probably clear, the dnotify API is cumbersome, and has a number of limitations. As already noted, we can monitor only entire directories, not individual files. Furthermore, dnotify provides notification for a rather modest range of events. Most notably, by comparison to inotify, dnotify can't tell us when a file was opened or closed. However, there are also some other serious limitations of the API.

The use of signals as a notification method causes a number of difficulties. The first of these is that signals are delivered asynchronously: catching signals with a handler can be racy and error-prone. One way around that particular difficulty is to instead accept signals synchronously using sigwaitinfo(). The use of SIGIO as the default notification signal is also undesirable, because it is one of the traditional signals that does not queue. This means that if events are generated more quickly than the application can process the signals, then some notifications will be lost. (This difficulty can be circumvented by changing the notification signal to one of the so-called realtime signals, which can be queued.)

Signals are also problematic because they convey little information: at most, we get a signal number (it is possible to arrange for different directories to notify using different signals) and a file descriptor number. We get no information about which particular file in a directory triggered an event, or indeed what kind of event occurred. (One can play tricks such as opening multiple file descriptors for the same directory, each of which notifies a different set of events, but this adds complexity to the application.) One further reason that using signals as a notification method can be a problem is that an application that uses dnotify might also make use of a library that employs signals: the use of a particular signal by dnotify in the main program may conflict with the library's use of the same signal (or vice versa).

A final significant limitation of the dnotify API is the need to open a file descriptor for each directory that is monitored. This is problematic for two reasons. First, an application that monitors a large number of directories may quickly run out of file descriptors. However, a more serious problem is that holding file descriptors open on a filesystem prevents that filesystem from being unmounted.

Notwithstanding these API problems, dnotify did provide an efficiency improvement over simply polling a filesystem, and dnotify came to be employed in some widely used tools such as the Beagle desktop search tool. However, it soon became clear that a better API would make life easier for user-space applications.

Enter inotify

The inotify API was developed by John McCutchan with support from Robert Love. First released in Linux 2.6.13 (in 2005), inotify aimed to address all of the obvious problems with dnotify.

The API employs three dedicated system calls—inotify_init(), inotify_add_watch(), and inotify_rm_watch()—and makes use of the traditional read() system call as well.

[Inotify diagram]

inotify_init() creates an inotify instance—a kernel data structure that records which filesystem objects should be monitored and maintains a list of events that have been generated for those objects. The call returns a file descriptor that is employed by the rest of the API to refer to this inotify instance. The diagram above summarizes the operation of an inotify instance.

inotify_add_watch() allows us to modify the set of filesystem objects monitored by an inotify instance. We can add new objects (files and directories) to the monitoring list, specifying which events are to be notified, and change the set of events that are notified for an object that is already in the monitoring list. Unsurprisingly, inotify_rm_watch() is the converse of inotify_add_watch(): it removes an object from the monitoring list.

The three arguments to inotify_add_watch() are an inotify file descriptor, a filesystem pathname, and a bit mask:

    int inotify_add_watch(int fd, const char *pathname, uint32_t mask);

The mask argument specifies the set of events to be notified for the filesystem object referred to by pathname and can include some additional bits that affect the behavior of the call. As an example, the following code allows us to monitor file creation and deletion events inside the directory mydir, as well as monitor for deletion of the directory itself:

    int fd, wd;

    fd = inotify_init();

    wd = inotify_add_watch(fd, "mydir",
                           IN_CREATE | IN_DELETE | IN_DELETE_SELF);

A full list of the bits that can be included in the mask argument is given in the inotify(7) man page. The set of events notified by inotify is a superset of that provided by dnotify. Most notably, inotify provides notifications when filesystem objects are opened and closed, and provides much more information for file rename events, as we outline below.

The return value of inotify_add_watch() is a "watch descriptor", which is an integer value that uniquely identifies the specified filesystem object within the inotify monitoring list. An inotify_add_watch() call that specifies a filesystem object that is already being monitored (possibly via a different pathname) will return the same watch descriptor number as was returned by the inotify_add_watch() that first added the object to the monitoring list.

When events occur for objects in the monitoring list, they can be read from the inotify file descriptor using read(). (The inotify file descriptor can also be monitored for readability using select(), poll(), and epoll().) Each read() returns one or more structures of the following form to describe an event:

    struct inotify_event {
        int      wd;      /* Watch descriptor */
        uint32_t mask;    /* Bit mask describing event */
        uint32_t cookie;  /* Unique cookie associating related events */
        uint32_t len;     /* Size of name field */
        char     name[];  /* Optional null-terminated name */
    };

The wd field is a watch descriptor that was previously returned by inotify_add_watch(). By maintaining a data structure that maps watch descriptors to pathnames, the application can determine the filesystem object for which this event occurred. mask is a bit mask that describes the event that occurred. In most cases, this field will include one of the bits that was specified in the mask when the watch was established. For example, given the inotify_add_watch() call that we showed earlier, if the directory mydir was deleted, read() would return an event whose mask field has the IN_DELETE_SELF bit set. (By contrast, dnotify does not generate an event when a monitored directory is deleted.)

In addition to the various events for which an application may request notification, there are certain events for which inotify always generates automatic notifications. The most notable of these is IN_IGNORED, which is generated whenever inotify ceases to monitor an object. This can occur, for example, because the object was deleted or the filesystem on which it resides was unmounted. The IN_IGNORED event can be used by the application to adjust its internal model of what is currently being monitored. (Again, dnotify has no analog of this event.)

The name field is used (only) when an event occurs for a file inside a monitored directory: it contains the null-terminated name of the file that triggered this event. The len field indicates the total size of the name field, which may be terminated by multiple null bytes in order to pad out the inotify_event structure to a size that allows successive structures in the read buffer to be aligned at architecture-appropriate byte boundaries (typically, multiples of 16 bytes).

The cookie field exists to help applications interpret rename events. When a file is renamed inside (or between) monitored directories, two events are generated: an IN_MOVED_FROM event for the directory from which the file is moved, and an IN_MOVED_TO event for the directory to which the file is moved. The first event contains the old name of the file, and the second event contains the new name. Both events have the same unique cookie value, allowing the application to connect the two events, and thus work out the old and new name of the file (a task that is rather difficult with dnotify). We'll say rather more about rename events in the next article in this series.
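
As a hedged sketch of that pairing logic (the static variables here are purely illustrative application state, not part of the API):

    /* Illustrative sketch: match an IN_MOVED_TO event to the preceding
     * IN_MOVED_FROM event with the same cookie in order to report a rename. */
    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/inotify.h>

    static uint32_t pending_cookie;
    static char     pending_name[NAME_MAX + 1];

    static void handle_move(const struct inotify_event *event)
    {
        if (event->mask & IN_MOVED_FROM) {
            pending_cookie = event->cookie;
            strncpy(pending_name, event->name, NAME_MAX);
        } else if ((event->mask & IN_MOVED_TO) &&
                   event->cookie == pending_cookie) {
            printf("renamed: %s -> %s\n", pending_name, event->name);
            pending_cookie = 0;
        }
        /* A robust application also needs a timeout: an IN_MOVED_FROM with no
         * matching IN_MOVED_TO means the file left the monitored directories. */
    }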

Inotify does not provide recursive monitoring. In other words, if we are monitoring the directory mydir, then we will receive notifications for that directory as well as all of its immediate descendants, including subdirectories. However, we will not receive notifications for events inside the subdirectories. But, with some effort, it is possible to perform recursive monitoring by creating watches for each of the subdirectories in a directory tree. To assist with this task, when a subdirectory is created inside a monitored directory (or indeed, when any event is generated for a subdirectory), inotify generates an event that has the IN_ISDIR bit set. This provides the application with the opportunity to add watches for new subdirectories.
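
A sketch of that approach, where wd_to_path() is a hypothetical helper standing in for the application's own mapping from watch descriptors to pathnames:

    /* Sketch: when a subdirectory is created inside a watched directory,
     * add a watch for the new subdirectory as well. */
    #include <limits.h>
    #include <stdio.h>
    #include <sys/inotify.h>

    extern const char *wd_to_path(int wd);      /* hypothetical application helper */

    static void maybe_watch_new_subdir(int inotifyFd,
                                       const struct inotify_event *event)
    {
        if ((event->mask & IN_CREATE) && (event->mask & IN_ISDIR)) {
            char path[PATH_MAX];

            snprintf(path, sizeof(path), "%s/%s",
                     wd_to_path(event->wd), event->name);
            inotify_add_watch(inotifyFd, path, IN_ALL_EVENTS);
            /* The caller should record the new watch descriptor; note the
             * race: files created in the subdirectory before the watch takes
             * effect generate no events. */
        }
    }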

Example program

The code below demonstrates the basic steps in using the inotify API. The program first creates an inotify instance and adds watches for all possible events for each of the pathnames specified in its command line. It then sits in a loop reading events from the inotify file descriptor and displaying information from those events (using our displayInotifyEvent(), shown in the full version of the code here).

    int
    main(int argc, char *argv[])
    {
        struct inotify_event *event;
        ...

        inotifyFd = inotify_init();         /* Create inotify instance */

        for (j = 1; j < argc; j++) {
            wd = inotify_add_watch(inotifyFd, argv[j], IN_ALL_EVENTS);

            printf("Watching %s using wd %d\n", argv[j], wd);
        }

        for (;;) {                          /* Read events forever */
            numRead = read(inotifyFd, buf, BUF_LEN);
            ...

            /* Process all of the events in buffer returned by read() */

            for (p = buf; p < buf + numRead; ) {
                event = (struct inotify_event *) p;
                displayInotifyEvent(event);

                p += sizeof(struct inotify_event) + event->len;
            }
        }
    }

Suppose that we use this program to monitor two subdirectories, xxx and yyy:

    $ ./inotify_demo xxx yyy
    Watching xxx using wd 1
    Watching yyy using wd 2

If we now execute the following command:

    $ mv xxx/aaa yyy/bbb

we see the following output from our program:

    Read 64 bytes from inotify fd
        wd = 1; cookie = 140040; mask = IN_MOVED_FROM
            name = aaa
        wd = 2; cookie = 140040; mask = IN_MOVED_TO
            name = bbb

The mv command generated an IN_MOVED_FROM event for the xxx directory (watch descriptor 1) and an IN_MOVED_TO event for the yyy directory (watch descriptor 2). The two events contained, respectively, the old and new name of the file. The events also had the same cookie value, thus allowing an application to connect them.

How inotify improves on dnotify

Inotify improves on dnotify in a number of respects. Among the more notable improvements are the following:

  • Both directories and individual files can be monitored.
  • Instead of signals, applications are notified of filesystem events by reading structured data from a file descriptor created using the API. This approach allows an application to deal with notifications synchronously, and also allows for richer information to be provided with notifications.
  • Inotify does not require an application to open file descriptors for each monitored object. Instead, it uses an API-specific handle (the watch descriptor). This avoids the problems of file-descriptor exhaustion and open file descriptors preventing filesystems from being unmounted.
  • Inotify provides more information when notifying events. First, it can be used to detect a wider range of events. Second, when the subject of an event is a file inside a monitored directory, inotify provides the name of that file as part of the event notification.
  • Inotify provides richer information in its notification of rename events, allowing an application to easily determine the old and new name of the renamed object.
  • IN_IGNORED events make it (relatively) easy for an inotify application to maintain an internal model of the currently monitored set of filesystem objects.

Concluding remarks

We've briefly seen how inotify improves on dnotify. In the next article in this series, we look in more detail at inotify, considering how it can be used in a robust application that monitors a filesystem tree. This will allow us to see the full capabilities of inotify, while at the same time discovering some of its limitations.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.16-rc4
Greg KH: Linux 3.15.4
Greg KH: Linux 3.14.11
Thomas Gleixner: 3.14.10-rt7
Jiri Slaby: Linux 3.12.24
Steven Rostedt: 3.12.22-rt35
Luis Henriques: Linux 3.11.10.13
Greg KH: Linux 3.10.47
Steven Rostedt: 3.10.44-rt46
Greg KH: Linux 3.4.97
Steven Rostedt: 3.4.94-rt117
Steven Rostedt: 3.2.60-rt88

Architecture-specific

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Security-related

Miscellaneous

Page editor: Jake Edge


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds