Kernel development
Brief items
Kernel release status
The current development kernel is 4.7-rc3, released on June 12. Linus said: "The diffstat looks fairly normal and innocuous. There's more of a filesystem component to it than usual, but that's mostly some added new btrfs tests, and if you ignore that part it's all the normal stuff: drivers dominate (gpu and networking drivers are the bulk, but there's i2c, rdma, ...) with some arch updates, and general networking code. And the usual random stuff all over."
Thorsten Leemhuis has, for now, taken up the long-neglected task of tracking regressions in the -rc kernels; he currently has 21 regressions listed for 4.7-rc3.
Stable updates: None have been released in the last week.
Kernel development news
Time to move to C11 atomics?
A typical program written in C looks like a deterministic set of steps laid out in a specific order. Out of the programmer's sight, though, both the compiler and the CPU are free to change the ordering of operations with the goal of speeding the program's execution. When one is dealing with a single thread of execution, reordering operations without breaking the program is a relatively straightforward task; that is no longer true when multiple threads are working with the same memory. In the multi-threaded case, developers must often make the ordering requirements explicit.
To that end, the kernel has defined a whole set of memory barriers and atomic operations designed to preserve memory-access ordering in places where it matters while preserving performance. The C11 version of the C language tries to solve the same problems with a different set of barrier operations. Once again, the question has been asked: should the kernel drop its own operations in favor of those defined by the C standard?
This question last came up in 2014; see LWN's coverage of that discussion for a great deal of background on how C11 atomic operations work and how concurrent memory access can go wrong when reordering of operations is not sufficiently controlled. This time around, compiler support for C11 atomic operations has improved, and David Howells has come forward with a full implementation of the (x86) kernel's atomic operations built on C11 atomics. The implementation itself is fairly straightforward; for example, the atomic_read() functions look like this:
static __always_inline int __atomic_read(const atomic_t *v, int memorder)
{
    return __atomic_load_n(&v->counter, memorder);
}

#define atomic_read(v)          (__atomic_read((v), __ATOMIC_RELAXED))
#define atomic_read_acquire(v)  (__atomic_read((v), __ATOMIC_ACQUIRE))
David's patches show that this conversion can be done; the real question is whether it should be done. As one might expect, there are a number of arguments each way.
Switching to C11 atomic operations would, in theory, allow the kernel to dump a bunch of tricky architecture-specific barrier code and take advantage of the same code, built into the compiler, that concurrent user-space programs will be using. C11 atomics give the compiler better visibility into what the code is actually doing, opening up more optimization possibilities and enabling the use of instructions that are tricky to invoke from inline assembly code. The compiler can also pick the instruction that is appropriate for the size of the operand; that can eliminate the big compile-time switch statements currently found in the kernel's header files.
The optimization possibilities are not fully realized with current compilers, but the potential exists for the compiler to, eventually, do better than even the most highly tweaked inline assembly code. As Paul McKenney put it:
There is also a benefit from the compiler being able to move specific barriers away from the actual atomic operation if that gives better performance; such moves are not possible with operations implemented in inline assembly.
Of course, there are some disadvantages to making this switch as well. One of those is that C11 atomics are not implemented well in anything but the newest compilers. Indeed, David says that "there will be some seriously suboptimal code production before gcc-7.1" — a release that is not due for the better part of a year. As might be expected, numerous bugs involving C11 atomics have been turned up as part of this project; they are being duly reported and fixed, but there are probably more to come. In the long term, use of C11 atomics in the kernel would certainly lead to a more robust compiler implementation, but getting there might be painful.
If a kernel built for multiprocessor operation (as almost all are) finds itself running on a uniprocessor system, it will patch the unneeded synchronization instructions out of its own code. If C11 atomics are used, this patching is not possible; it is no longer possible to know where those instructions are, and even small compiler changes could lead to massive confusion. Uniprocessor systems are increasingly rare and, arguably, custom kernels are already built for many of them, but it would still be better not to slow down such systems unnecessarily.
Perhaps the biggest potential problem, though, is that the memory model implemented by C11 atomics does not exactly match the model used by the kernel. The C11 model is based on acquire/release semantics — one-way barriers that are described in the 2014 article and this article. Much of the kernel, instead, makes use of load/store barriers, which are stricter, two-way barriers. A memory write with release semantics will only complete after any previous reads or writes are visible throughout the system, but it allows operations that logically follow the write to be reordered ahead of it. A write behind a two-way barrier, instead, is strictly ordered with respect to the operations on both sides of that barrier.
One option would be to weaken the kernel's memory model so that architectures that have acquire/release semantics can gain the associated performance advantages. But, as one might imagine, such a change would be fraught with the potential for subtle, difficult-to-find bugs; it would have to be approached carefully. That said, David notes that the PowerPC seems to already be working with a weaker model, so there may not be many problems lurking in the core kernel.
As Will Deacon pointed out, C11 atomics lack a good implementation of consume load operations, which are an important part of read-copy-update (RCU), among other things. A consume load can always be replaced with an acquire operation, but the performance will be worse. In general, Will worries that the C11 model is a poor fit for the ARM architecture, and that the result of a switch might be an unwieldy combination of C11 and kernel-specific operations. He did agree, though, that a generic implementation based on C11 atomics would be a useful tool for developers bringing up the kernel on a new architecture.
There has, thus far, been far less discussion of this idea than happened last time around; perhaps developers are resigning themselves to the idea that this change will happen eventually, even if it seems premature now. There would certainly be advantages in such a switch, for both the kernel and the compiler communities. Whether those advantages justify the costs has not yet been worked out, though.
Mount namespaces, mount propagation, and unbindable mounts
In the previous installment of our article series on namespaces, we looked at the key concepts underlying mount namespaces and shared subtrees, including the notions of mount propagation types and peer groups. In this article, we provide some practical demonstrations of the operation of the various propagation types: MS_SHARED, MS_PRIVATE, MS_SLAVE, and MS_UNBINDABLE.
MS_SHARED and MS_PRIVATE example
As we saw in the previous article, the MS_SHARED and MS_PRIVATE propagation types are roughly opposites. A shared mount point is a member of a peer group. Each of the member mount points in a peer group propagates mount and unmount events to the other members of the group. By contrast, a private mount point is not a member of a peer group; it neither propagates events to peers, nor receives events propagated from peers. In the following shell session, we demonstrate the different semantics of these two propagation types.
Suppose that, in the initial mount namespace, we have two existing mount points, /mntS and /mntP. From a shell in the namespace, we then mark /mntS as shared and /mntP as private, and view the mounts in /proc/self/mountinfo:
sh1# mount --make-shared /mntS
sh1# mount --make-private /mntP
sh1# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
77 61 8:17 / /mntS rw,relatime shared:1
83 61 8:15 / /mntP rw,relatime
From the output, we see that /mntS is a shared mount in peer group 1, and that /mntP has no optional tags, indicating that it is a private mount. (As noted in the previous article, most mount and unmount operations require that the user is privileged, as indicated by the '#' prompt.)
On a second terminal, we create a new mount namespace where we run a second shell and inspect the mounts:
sh2# unshare -m --propagation unchanged sh
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
222 145 8:17 / /mntS rw,relatime shared:1
225 145 8:15 / /mntP rw,relatime
The new mount namespace received a copy of the initial mount namespace's mount points. These new mount points maintain the same propagation types, but have unique mount IDs (first field in the records).
In the second terminal, we then create mounts under each of /mntS and /mntP and inspect the outcome:
sh2# mkdir /mntS/a
sh2# mount /dev/sdb6 /mntS/a
sh2# mkdir /mntP/b
sh2# mount /dev/sdb7 /mntP/b
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
222 145 8:17 / /mntS rw,relatime shared:1
225 145 8:15 / /mntP rw,relatime
178 222 8:22 / /mntS/a rw,relatime shared:2
230 225 8:23 / /mntP/b rw,relatime
From the above, it can be seen that /mntS/a was created as shared (inheriting this setting from its parent mount) and /mntP/b was created as a private mount.
Returning to the first terminal and inspecting the set-up, we see that the new mount created under the shared mount point /mntS propagated to its peer mount (in the initial mount namespace), but the new mount created under the private mount point /mntP did not propagate:
sh1# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
77 61 8:17 / /mntS rw,relatime shared:1
83 61 8:15 / /mntP rw,relatime
179 77 8:22 / /mntS/a rw,relatime shared:2
MS_SLAVE example
Making a mount point a slave allows it to receive propagated mount and unmount events from a master peer group, while preventing it from propagating events to that master. This is useful if we want to (say) receive a mount event when an optical disk is mounted in the master peer group (in another mount namespace), but we want to prevent mount and unmount events under the slave mount from having side effects in other namespaces.
We can demonstrate the effect of slaving by first marking two (existing) mount points in the initial mount namespace as shared:
sh1# mount --make-shared /mntX
sh1# mount --make-shared /mntY
sh1# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
132 83 8:23 / /mntX rw,relatime shared:1
133 83 8:22 / /mntY rw,relatime shared:2
On a second terminal, we create a new mount namespace and inspect the replicated mount points:
sh2# unshare -m --propagation unchanged sh
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
168 167 8:23 / /mntX rw,relatime shared:1
169 167 8:22 / /mntY rw,relatime shared:2
In the new mount namespace, we then mark one of the mount points as a slave. The effect of changing a shared mount to a slave mount is to make it a slave of the peer group of which it was formerly a member.
sh2# mount --make-slave /mntY
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
168 167 8:23 / /mntX rw,relatime shared:1
169 167 8:22 / /mntY rw,relatime master:2
In the above output, the /mntY mount point is marked with the tag master:2. The tag name is perhaps counterintuitive: it indicates that the mount point is a slave mount that is receiving propagation events from the master peer group with the ID 2. In the case where a mount is both a slave of another peer group, and shares events with a peer group of its own, then the optional fields in the /proc/PID/mountinfo record will show both a master:M tag and a shared:N tag.
Continuing in the new namespace, we create mounts under each of /mntX and /mntY:
sh2# mkdir /mntX/a
sh2# mount /dev/sda3 /mntX/a
sh2# mkdir /mntY/b
sh2# mount /dev/sda5 /mntY/b
When we inspect the state of the mount points in the new mount namespace, we see that /mntX/a was created as a new shared mount (inheriting the "shared" setting from its parent mount) and /mntY/b was created as a private mount (i.e., no tags shown in the optional fields):
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
168 167 8:23 / /mntX rw,relatime shared:1
169 167 8:22 / /mntY rw,relatime master:2
173 168 8:3 / /mntX/a rw,relatime shared:3
175 169 8:5 / /mntY/b rw,relatime
Returning to the first terminal, we see that the mount /mntX/a propagated to the /mntX peer in the initial namespace, but the mount /mntY/b did not propagate:
sh1# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
132 83 8:23 / /mntX rw,relatime shared:1
133 83 8:22 / /mntY rw,relatime shared:2
174 132 8:3 / /mntX/a rw,relatime shared:3
Next, we create a new mount point under /mntY in the initial mount namespace:
sh1# mkdir /mntY/c
sh1# mount /dev/sda1 /mntY/c
sh1# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
132 83 8:23 / /mntX rw,relatime shared:1
133 83 8:22 / /mntY rw,relatime shared:2
174 132 8:3 / /mntX/a rw,relatime shared:3
178 133 8:1 / /mntY/c rw,relatime shared:4
When we examine the mount points in the second mount namespace, we see that in this case the new mount has been propagated to the slave mount point, and that the new mount is itself a slave mount (to peer group 4):
sh2# cat /proc/self/mountinfo | grep '/mnt' | sed 's/ - .*//'
168 167 8:23 / /mntX rw,relatime shared:1
169 167 8:22 / /mntY rw,relatime master:2
173 168 8:3 / /mntX/a rw,relatime shared:3
175 169 8:5 / /mntY/b rw,relatime
179 169 8:1 / /mntY/c rw,relatime master:4
An aside: bind mounts
In a moment, we'll look at the use of the MS_UNBINDABLE propagation type. However, beforehand, it's useful to briefly describe the concept of a bind mount, a feature that first appeared in Linux 2.4.
A bind mount can be used to make a file or directory subtree visible at another location in the single directory hierarchy. In some ways, a bind mount is like a hard link, but it differs in some important respects:
- It is not possible to create a hard link to a directory, but it is possible to bind mount a directory.
- Hard links can be made only to files on the same filesystem, while a bind mount can cross filesystem boundaries (and even reach out of a chroot() jail).
- Hard links entail a modification to the filesystem. By contrast, a bind mount is a record in the mount list of a mount namespace—in other words, a property of the live system.
A bind mount can be created programmatically using the mount() MS_BIND flag or on the command line using mount --bind. In the following example, we first create a directory containing a file and then bind mount that directory at a new location:
# mkdir dir1 # Create source directory
# touch dir1/x # Populate the directory
# mkdir dir2 # Create target for bind mount
# mount --bind dir1 dir2 # Create bind mount
# ls dir2 # Bind mount has same content
x
Then we create a file under the new mount point and observe that the new file is visible under the original directory as well, indicating that the bind mount refers to the same directory object:
# touch dir2/y
# ls dir1
x y
By default, when creating a bind mount of a directory, only that directory is mounted at the new location; if there are any mounts under that directory tree, they are not replicated under the mount target. It is also possible to perform a recursive bind mount, by calling mount() with the flags MS_BIND and MS_REC, or from the command line using the mount --rbind option. In this case, each mount under the source tree is replicated at the corresponding location in the target tree.
MS_UNBINDABLE example
The shared, private, and slave propagation types are about managing propagation of mount events between peer mounts (which are typically in different namespaces). Unbindable mounts exist to solve a different problem, one that preceded the existence of mount namespaces. That problem is the so-called "mount point explosion" that occurs when repeatedly performing recursive bind mounts of a higher-level subtree at a lower-level mount point. We'll now walk through a shell session that demonstrates the problem, and then see how unbindable mounts provide a solution.
To begin with, suppose we have a system with the two mount points, as follows:
# mount | awk '{print $1, $2, $3}'
/dev/sda1 on /
/dev/sdb6 on /mntX
Now suppose that we want to recursively bind mount the root directory under several users' home directories. We'll do this for the first user and inspect the mount points. However, we first create a new namespace in which we recursively mark all mount points as slaves, to prevent the steps that we perform from having any side effects in other mount namespaces:
# unshare -m sh
# mount --make-rslave /
# mount --rbind / /home/cecilia
# mount | awk '{print $1, $2, $3}'
/dev/sda1 on /
/dev/sdb6 on /mntX
/dev/sda1 on /home/cecilia
/dev/sdb6 on /home/cecilia/mntX
When we repeat the recursive bind operation for the second user, we start to see the explosion problem:
# mount --rbind / /home/henry
# mount | awk '{print $1, $2, $3}'
/dev/sda1 on /
/dev/sdb6 on /mntX
/dev/sda1 on /home/cecilia
/dev/sdb6 on /home/cecilia/mntX
/dev/sda1 on /home/henry
/dev/sdb6 on /home/henry/mntX
/dev/sda1 on /home/henry/home/cecilia
/dev/sdb6 on /home/henry/home/cecilia/mntX
Under /home/henry, we have not only recursively added the /mntX mount, but also the recursive mount of that directory under /home/cecilia that was created in the previous step. Upon repeating the step for a third user and simply counting the resulting mounts, it becomes obvious that the explosion is exponential in nature:
# mount --rbind / /home/otto
# mount | awk '{print $1, $2, $3}' | wc -l
16
We can avoid this mount explosion problem by making each of the new mounts unbindable. The effect of doing this is that recursive bind mounts of the root directory will not replicate the unbindable mounts. Returning to the original scenario, we make an unbindable mount for the first user and examine the mount via /proc/self/mountinfo:
# mount --rbind --make-unbindable / /home/cecilia
# cat /proc/self/mountinfo | grep /home/cecilia | sed 's/ - .*//'
108 83 8:2 / /home/cecilia rw,relatime unbindable
...
An unbindable mount is shown with the tag unbindable in the optional fields of the /proc/self/mountinfo record.
Now we create unbindable recursive bind mounts for the other two users:
# mount --rbind --make-unbindable / /home/henry
# mount --rbind --make-unbindable / /home/otto
Upon examining the list of mount points, we see that there has been no explosion of mount points, because the unbindable mounts were not replicated under each user's directory:
# mount | awk '{print $1, $2, $3}'
/dev/sda1 on /
/dev/sdb6 on /mntX
/dev/sda1 on /home/cecilia
/dev/sdb6 on /home/cecilia/mntX
/dev/sda1 on /home/henry
/dev/sdb6 on /home/henry/mntX
/dev/sda1 on /home/otto
/dev/sdb6 on /home/otto/mntX
Concluding remarks
Mount namespaces, in conjunction with the shared subtrees feature, are a powerful and flexible tool for creating per-user and per-container filesystem trees. They are also a surprisingly complex feature, and we have tried to unravel some of that complexity in this article. However, there are actually several more topics that we haven't considered. For example, there are detailed rules that describe the propagation type that results when performing bind mounts and move (mount --move) operations, as well as rules that describe the result when changing the propagation type of a mount. Many of those details can be found in the kernel source file Documentation/filesystems/sharedsubtree.txt.
Kernel building with GCC plugins
It has long been understood that static-analysis tools can be useful in finding (and defending against) bugs and security problems in code. One of the best places to implement such tools is in the compiler itself, since much of the work required to analyze a program is already done in the compilation process. Despite the fact that GCC has had the ability to support security-oriented plugins for some years, the mainline kernel has never adopted any such plugins. That situation looks likely to change with the 4.8 kernel release, though.

For many years, GCC famously did not support plugins out of a fear that proprietary plugins would undermine the free compiler. That roadblock ended in 2009, when the GCC runtime library exemption was rewritten. This library, which is needed by almost every program built with GCC, can be linked with proprietary code — but only if no non-GPLv3 plugins were used in the compilation process. The addition of that rule gave the powers that be at the Free Software Foundation the confidence that they could safely add a plugin mechanism to GCC.
Relatively few plugins have materialized in any setting, perhaps because writing one requires a fairly deep understanding of how GCC works and the documentation available is not entirely helpful. (LWN ran an introduction to creating GCC plugins back in 2011). One group that did jump onto the plugin bandwagon, though, is grsecurity, where the ability to analyze — and transform — kernel code was quickly recognized as having a lot of potential. There were four plugins in the grsecurity patch set when LWN took a look in 2011. The current testing patch set from grsecurity shows twelve of them, performing a variety of functions:
- Checker incorporates some address-space checks normally performed separately with the sparse tool.
- Colorize simply adds color to some diagnostic output.
- Constify makes structures containing only function pointers const.
- Initify moves string constants that are only referenced in __init or __exit functions to the appropriate ELF sections.
- Kallocstat generates information on sizes passed to kmalloc().
- Kernexec is there "to make KERNEXEC/amd64 almost as good as it is on i386"; it ensures that, for example, user-space pages are not executable by the kernel.
- Latent_entropy tries to generate entropy (randomness) from the kernel's execution; more on this one below.
- Randomize_layout reorganizes structure layout randomly.
- Rap implements grsecurity's "return address protection" mechanism, described in this presentation [PDF].
- Size_overflow (described on this page) detects some integer overflows.
- Stackleak tracks kernel-stack usage so that the stack can be cleared on return to user space.
- Structleak forcibly clears structure fields if they might be copied to user space.
These plugins have clear value to developers wishing to harden the kernel, and they are all free software (though many of them are GPLv2-only, meaning that they cannot be used to compile code needing the GCC runtime library; fortunately, the kernel does not use that library). So far, though, they remain unavailable to kernel developers and distributors, living only in the grsecurity patch set. There are no serious technical or legal obstacles keeping them out of the mainline, but nobody has made the effort to move them over — until now.
Plugins go mainline
Recently, interest in hardening the mainline kernel has increased — or, perhaps more accurately, resistance to doing so has decreased. One obvious way of doing so is to try to bring some of the ideas found in grsecurity into the mainline kernel; that includes the plugin mechanism. To that end, the Linux Foundation's Core Infrastructure Initiative has funded Emese Révfy, the developer of some of the above-listed plugins, to bring this functionality into the mainline kernel. The resulting patch set has been through several rounds of review and is currently staged in linux-next for a probable 4.8 merge.
Emese's patch set does not include all of the plugins listed above; indeed, it includes none of them. Instead, there are two relatively simple plugins provided as a sort of demonstration of how things can be done. One of them, called "sancov," inserts a tracing call at the beginning of each basic block of code. This feature is useful for anything requiring coverage tracking; it is aimed at the syzkaller fuzz tester in particular.
The other included plugin calculates the "cyclomatic complexity" of each function in the kernel. This metric is a simple count of the number of possible paths through the function; a higher complexity count indicates more twisted code that, perhaps, is a more likely hiding place for bugs. Emese has suggested that it could be incorporated into the build-testing systems, where it could emit warnings when somebody adds a new function above a given complexity threshold.
Your editor built an allyesconfig kernel with this plugin enabled; the result was nearly 620,000 complexity values printed to the output. According to this metric, the most complex function in the kernel, with a score of 817, is cache_alloc() in drivers/md/bcache/super.c — a demonstration of just how much complexity can be hidden in macros. Perhaps a more convincing demonstration is rt2800_init_registers(), a 450-line function weighing in at 586. The most complex core-kernel function is alloc_large_system_hash(), with a score of 278.
The latent_entropy plugin from grsecurity has been posted as a separate patch set. This plugin tries to address the problem that systems often have very little entropy available immediately after boot. It adds code to initialization-time functions; each of those functions will generate a pseudo-random value when called and mix it into the entropy pool. That did not seem particularly random to a number of observers; the key, according to "PaX Team", is that the timing and sequencing of this mixing varies according to the interrupts raised during system boot. Ted Ts'o commented that this "entropy" might merely duplicate that obtained from the interrupt-timing measurements that are already done. He noted that mixing it in twice won't hurt, but it may not help much either.
See this 2012 message from PaX Team for more information on how the latent_entropy plugin works.
As noted above, the plugin infrastructure and two simple plugins are currently poised to be merged for 4.8. The latent_entropy plugin is not in linux-next as of this writing, so it is likely to arrive later, if at all. But there is a whole set of existing plugins waiting for somebody to make the effort to bring them over and, even better, the potential for many other plugins to be written in the future. A pluggable compiler can be a potent tool for the checking and hardening of kernel code; the kernel community may have a lot to gain from making use of it.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet