Kernel development
Brief items
Kernel release status
The current development kernel is 4.7-rc1, released on May 29. Linus said: "this time around we have a fairly big change to the vfs layer that allows filesystems (if they buy into it) to do readdir() and path component lookup in parallel within the same directory. That's probably the biggest conceptual vfs change we've had since we started doing cached pathname lookups using RCU." The code name has been changed to "Psychotic Stoned Sheep."
Stable updates: 4.6.1, 4.5.6, 4.4.12, and 3.14.71 were released on June 1.
Kernel development news
The end of the 4.7 merge window
By the time that Linus released the 4.7-rc1 prepatch and closed the merge window for this development cycle, 10,707 non-merge changesets had been pulled into the mainline repository. As expected, that falls rather short of the 12,172 pulled for 4.6-rc1, but it still adds up to a busy development cycle with a number of interesting changes and new features.
Some of the changes pulled since last week's summary are:
- The NFS client now implements the copy_file_range() system call,
making use of the NFS 4.2 COPY command to optimize the
operation on the remote server.
- The direct-access code for persistent memory (DAX) can now work with
memory arrays containing media errors.
- If the new TRIM_UNUSED_KSYMS configuration option is
selected, any exported symbols that are not actually used by the built
kernel will be removed from the exports list. That might open up more
optimization opportunities, and making unused symbols inaccessible
seems like a worthwhile change from a security point of view.
- A number of longstanding issues with the kernel's string-hashing code,
described in this article, have been
addressed through the introduction of a new hashing library. See <linux/stringhash.h> for
the new interface.
- New hardware support includes: Sigma Designs "Tango" temperature sensors, thermal sensors attached to analog-to-digital converters, Intel Core SoC power management controllers, Chelsio iSCSI target offload controllers, Texas Instruments TAS5720 mono audio amplifiers, and Maxim MAX98371 codecs.
If the normal schedule is followed, the 4.7 release can be expected to happen on July 17. There are no guarantees, of course; that date can be shifted by regressions, unexpected API issues, or irresistible diving opportunities. But the release cycle is predictable enough these days that we can expect that date to not slip by much, if at all. Between now and then, it's just a matter of testing the new kernel and getting the inevitable bugs fixed.
System calls for memory protection keys
"Memory protection keys" are an Intel processor feature that is making its first appearance in Skylake server CPUs. They are a user-controllable, coarse-grained protection mechanism, allowing a program to deny certain types of access to ranges of memory. LWN last looked at kernel support for memory protection keys (or "pkeys") at the end of 2015. The system-call interface is now deemed to be in its final form, and there is a push to stage it for merging during the 4.8 development cycle. So the time seems right for a look at how this feature will be used on Linux systems.
A pkey is a four-bit value (in the current Intel implementation) that can be stored in the page-table entry for each page in a process's address space. Pages can thus be arbitrarily assigned to one of sixteen key values; each address space has its own set of keys. For each of those keys, the process can configure the CPU to deny either write operations or all access entirely. Pkeys will override the regular protections assigned to each page but, since they can only deny operations, their effect will always be to restrict access more strictly than the page protections do. There are a number of intended use cases, including the implementation of execute-only memory or the protection of sensitive data (cryptographic keys, for example) when it is not in active use.
Most pkey operations are unprivileged and thus could be left to user space to handle without kernel involvement; the one exception is storing the key values in the page-table entries. There is value in having the kernel take an overall role in coordinating the use of pkeys, though, so that library code can use them without interfering with the rest of the application. The kernel can also make good use of pkeys if it knows it has exclusive access to them. To make all this possible, five system calls have been defined for working with pkeys in applications.
The proposed pkey API
To avoid conflicts over the use of any specific key, pkeys should be allocated prior to use. The allocation system calls are:
int pkey_alloc(unsigned long flags, unsigned long initial_rights);
int pkey_free(int key);
A new protection key may be obtained with pkey_alloc(). In the current implementation, the flags argument must be zero, while initial_rights is a bitmask setting the key's initial access restrictions. The available access bits are PKEY_DISABLE_WRITE (disabling write access) or PKEY_DISABLE_ACCESS (which disables all access). It is worth noting that these flags refer to data accesses; memory with a PKEY_DISABLE_ACCESS pkey can still be read by the processor for execute access.
The return value from pkey_alloc() is an integer index indicating which key was allocated, or ENOSPC if no keys are available. Keys which are no longer in use may be freed with pkey_free(). Freeing a key does not, however, remove that key value from page-table entries or remove any restrictions that had been applied to that key. So surprising things could happen if an application frees a key that is still applied to pages within its address space and the key is later reallocated to another use.
The assigning of keys to pages is done with a new variant of the mprotect() system call:
int pkey_mprotect(void *start, size_t len, int prot, int pkey);
This call behaves like mprotect() in that it will set the (regular) protection bits described by prot on the pages containing len bytes beginning at start. It will also assign the given pkey (which must have been allocated with pkey_alloc()) to those pages. A call to pkey_mprotect() will succeed on systems that do not support pkeys, but only if pkey is passed as zero.
If an application wants to ensure that a given memory range will never be accessible without the desired pkey restrictions, it can create that range by passing PROT_NONE to mmap(), making the memory initially inaccessible. A subsequent pkey_mprotect() call will then atomically change the protections and assign the pkey, ensuring that there is never a window where the restrictions are not as desired.
An application can query the current restrictions associated with a pkey using the RDPKRU instruction, and change them with WRPKRU, so there is not strictly a need for the kernel to support these operations. The kernel provides a couple of system calls for manipulating pkey restrictions anyway:
unsigned long pkey_get(int pkey);
int pkey_set(int pkey, unsigned long access_rights);
These functions eliminate the need to use special assembly instructions in application code; they can also verify that the given pkey has been allocated.
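To give a sense of what these calls manipulate, here is a small model of the per-key restriction bits. It assumes the PKRU layout documented by Intel (two bits per key, access-disable in the lower bit and write-disable in the upper one), which is not spelled out in this article. The function names are hypothetical, and the code works on a plain integer rather than executing RDPKRU or WRPKRU.

```c
#include <assert.h>
#include <stdint.h>

/* Restriction bits, with fallbacks in case no system header defined them. */
#ifndef PKEY_DISABLE_ACCESS
#define PKEY_DISABLE_ACCESS 0x1
#endif
#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE 0x2
#endif

#define PKRU_BITS_PER_KEY 2     /* assumed layout: two bits per key */
#define PKRU_NUM_KEYS     16    /* four-bit key, so sixteen keys */

/* Extract the restriction bits for `pkey` from a PKRU-style value. */
unsigned pkru_get_rights(uint32_t pkru, int pkey)
{
    assert(pkey >= 0 && pkey < PKRU_NUM_KEYS);
    return (pkru >> (pkey * PKRU_BITS_PER_KEY)) & 0x3u;
}

/* Return a copy of `pkru` with the restriction bits for `pkey` replaced. */
uint32_t pkru_set_rights(uint32_t pkru, int pkey, unsigned rights)
{
    assert(pkey >= 0 && pkey < PKRU_NUM_KEYS);
    assert((rights & ~0x3u) == 0);

    unsigned shift = (unsigned)pkey * PKRU_BITS_PER_KEY;
    pkru &= ~(UINT32_C(3) << shift);    /* clear the old bits for this key */
    pkru |= (uint32_t)rights << shift;  /* install the new ones */
    return pkru;
}
```

A system-call version can additionally refuse to touch keys that were never allocated, which a bare register write cannot do.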
Execute-only interactions
There can be some security benefits from designating memory that contains code as execute-only, so that its contents cannot be read for other purposes. As it happens, though, setting the page protections to PROT_EXEC does not have that effect — the affected pages are still readable. So, on current processors, true execute-only protections are not easily achievable. But, as mentioned above, the PKEY_DISABLE_ACCESS restriction does not block execute access. It can thus be used, in conjunction with PROT_EXEC, to create execute-only memory ranges.
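The interaction can be made concrete with a deliberately simplified model of the access checks. The sketch below assumes x86 behavior in which any mapped page is readable regardless of whether PROT_READ was requested, and it assumes that pkey restrictions apply only to data accesses; struct data_access and effective_access() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Fallback definitions for older headers (assumed to match the kernel's). */
#ifndef PKEY_DISABLE_ACCESS
#define PKEY_DISABLE_ACCESS 0x1
#endif
#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE 0x2
#endif

struct data_access {
    bool read;
    bool write;
    bool exec;
};

/* Combine page protections with pkey restrictions (simplified x86 model). */
struct data_access effective_access(int prot, unsigned rights)
{
    assert((rights & ~(unsigned)(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) == 0);

    /* Assumed: the page tables have no separate "readable" permission. */
    bool page_readable = (prot & (PROT_READ | PROT_WRITE | PROT_EXEC)) != 0;
    bool page_writable = (prot & PROT_WRITE) != 0;
    bool no_access = (rights & PKEY_DISABLE_ACCESS) != 0;
    bool no_write = (rights & PKEY_DISABLE_WRITE) != 0;

    struct data_access a;
    a.read = page_readable && !no_access;
    a.write = page_writable && !no_access && !no_write;
    a.exec = (prot & PROT_EXEC) != 0;   /* pkeys do not block instruction fetch */
    return a;
}
```

In this model, effective_access(PROT_EXEC, 0) is still readable, while effective_access(PROT_EXEC, PKEY_DISABLE_ACCESS) can only be executed.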
While the system-call API is still out-of-tree, the core support for pkeys has been in the mainline kernel since the 4.6 release. If the kernel sees an mprotect() call setting PROT_EXEC permissions on a range of memory, it will automatically use a pkey to create true execute-only permissions. This is one of the reasons why it is useful to have the kernel in control of key allocation.
There is an interesting question that comes up, though: what if a process sets a pkey of its own with pkey_mprotect(), then uses a regular mprotect() call to set the page permissions to PROT_EXEC? In this case, the kernel could either change the restrictions for the assigned pkey, or it could change the affected pages to use its own reserved pkey. Either approach could lead to results that the application developer finds surprising.
To avoid such surprises, one pkey (number zero) has been set aside as the default key for all pages. This key will never be allocated with pkey_alloc(), and its restrictions cannot be changed with pkey_set(). As of 4.8 (assuming these patches are merged), the kernel will only assign the execute-only pkey to pages that are currently controlled by the default key.
The memory protection keys patches have been circulating for some time, and have evolved considerably in response to reviewer comments. At this point, they would appear to have reached a stable point where the developers who are paying attention are happy with them. So the chances are good that the 4.8 kernel will include these system calls making the full functionality available to applications. How soon the requisite hardware will be widely available is yet to be seen, though.
Containers, pseudo TTYs, and backward compatibility
There is no doubt that the addition of container technologies to Linux has created a lot of value, allowing workloads to be effectively and efficiently isolated from each other. Implementing these technologies presents a number of challenges, particularly as much of Linux and Unix was designed to use singletons: objects of which there could never ever be more than one, such as host names, network routing tables, or process-ID namespaces. Containers require this design approach to be revised as they need multiple instances of these objects. A singleton that has been causing problems recently is the set of pseudo terminals (TTYs).
A pseudo TTY (or "PTY") is a pair of devices — a slave and a master — that provide a special sort of communication channel. The slave device behaves much like the device representing the VT100 or ADM-3A "dumb terminal" that we all have on our desks ... or that we might have had a few decades ago. It can read and write text as though it were a physical terminal, it can enable or disable echo of typed characters, etc. The master acts more like the person sitting in front of that dumb terminal. Writing to the master is exactly like typing on a terminal. If echo is enabled, then everything written can immediately be read back, and writing a backspace effectively causes the previous character typed to be forgotten. Modern computers typically have very few, if any, physical terminals, but potentially lots of PTYs to support text-based interfaces as provided by terminal emulators (such as xterm or gnome-terminal) and remote access interfaces like SSH.
Opening a pseudo TTY
The history of pseudo TTYs contains the sort of mix of clever ideas and unfortunate choices that we've come to expect in fast-moving technology. The original implementation provided a fixed number of master/slave pairs which, like all other devices, had permanent device nodes in /dev. /dev/ptyp9 would be a master device, for example, and /dev/ttyp9 would be the matching slave device. An application or service that needed a PTY would try to open each master device in turn until it succeeded with one; it would then have exclusive access to that PTY. The application would change the ownership of the slave to match the user it was providing access to and hand the slave to whatever command shell or similar program was appropriate. If a non-privileged application needed to allocate a PTY it would need a setuid helper program to update the ownership of the slave device node in /dev.
While this worked, it was far from elegant, particularly as the number of PTYs configured on systems headed into the hundreds and the search for a free master device became a greater waste of time. So, with the "Unix98" Single Unix Specification (SUS), a new approach was adopted. An abstract interface was defined to allow the writing of portable code without imposing a single mechanism on all implementations, as the committee was not able to agree on any one universal mechanism. In February 1998, Linux 2.1.87 brought support for the /dev/ptmx multiplexing master device. Opening this device provides access to an otherwise unused pseudo TTY master and allows the matching slave to be identified using an ioctl(). This makes implementations of posix_openpt() and ptsname() quite straightforward.
In April of that year, Linux 2.1.93 added a new virtual filesystem called devpts that is normally mounted at /dev/pts. Whenever a new master/slave pair is created, a device node for the slave is created in that virtual filesystem. This device node is given an owner and group matching the owner and group of the process that opened /dev/ptmx, though either can be overridden by mount options. With this there is no need for a setuid helper program. At least there shouldn't be.
There is just one little problem: SUS requires that the group ID of the slave device should not be that of the creating process, but rather some definite, though unspecified, value. The GNU C Library (glibc) takes responsibility for implementing this requirement; it quite reasonably chooses the ID of the group named "tty" (often 5) to fill this role. If the devpts filesystem is mounted with options gid=5,mode=620, this group ID and the required access mode will be used and glibc will be happy. If not, glibc will (if so configured) run a setuid helper found at /usr/libexec/pt_chown.
As Eric Biederman discovered, xen-create-image mounts devpts in a chroot while creating a new root filesystem, and does so without these options. Just why this is interesting will become clear a little later.
Seeing the singletons
This design for PTYs created two related singletons: the master multiplexer /dev/ptmx and the slave virtual filesystem /dev/pts. Abstracting a singleton object to be different in different containers has been done multiple times and the process is well understood. When there are two distinct but related singletons as we have here, there is more complexity that must be carefully managed. These details were thought to have been addressed back in 2009 when container support was added to the pseudo TTY subsystem.
With this change, it became possible to mount distinct instances of the devpts filesystem, each with its own set of pseudo TTYs. A new "ptmx" file was created inside the mounted devpts filesystem instance; opening this pts/ptmx file would always create slave nodes in the same filesystem instance. It was expected that /dev/ptmx would be changed to be a symbolic link to pts/ptmx; containers could then just mount their own devpts filesystem and, as it was now a self-contained entity, everything would be happy. Unfortunately not everyone got the memo. While some container libraries configure /dev/ptmx like this, the practice isn't universal.
The last piece of the puzzle is that a device node for ptmx that is created explicitly with mknod, rather than created implicitly in a devpts instance, is still a singleton, so there must be a unique, global devpts filesystem where slave nodes are created when the singleton ptmx node is opened. To ensure backward compatibility, an attempt to mount a devpts filesystem will normally mount this single instance unless the newinstance mount option is provided. This way, old installations get what they expect, while new code has control and can get what it needs. It seems like a reasonably clean, if slightly inelegant, solution.
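For illustration, a container runtime that wants its own instance might issue a mount along the following lines. This is only a sketch: the helper names are hypothetical, the group ID of 5 is the commonly used "tty" group mentioned earlier rather than a guaranteed value, and the call needs sufficient privilege to mount filesystems.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>

/*
 * Format the devpts mount options discussed in the text into `buf`.
 * Returns 0 on success or -1 if the buffer is too small.
 */
int devpts_options(char *buf, size_t size, gid_t tty_gid, unsigned mode,
                   int newinstance)
{
    int n = snprintf(buf, size, "%sgid=%u,mode=%o",
                     newinstance ? "newinstance," : "",
                     (unsigned)tty_gid, mode);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: mount a private devpts instance at `target`. */
int mount_private_devpts(const char *target)
{
    char opts[64];

    assert(target != NULL && strlen(target) > 0);
    if (devpts_options(opts, sizeof opts, 5 /* assumed "tty" gid */, 0620, 1) != 0)
        return -1;
    /* Returns 0 on success, or -1 with errno set, like mount(2) itself. */
    return mount("devpts", target, "devpts", MS_NOSUID | MS_NOEXEC, opts);
}
```

Leaving out the newinstance flag in this sketch would select the shared single instance instead, with its options replaced by whatever the latest mount supplied.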
Unfortunately there is a problem, and here at last we find out why that setuid helper program is relevant. Setuid programs are always a little bit risky; it is important that they cannot be tricked into doing the wrong thing, so they must be provided with complete information in ways that cannot easily be forged. The setuid pt_chown tool is given the master side of the new PTY as an open file descriptor, and it takes the user ID that should own the slave from the process's real UID. It then needs to find the slave node, which can be done using ptsname(). In a system with multiple devpts instances mounted, the information pt_chown gives to ptsname() is no longer complete as it does not identify which devpts instance to use; that can lead to unfortunate consequences.
What is your real name?
In Linux 3.9, it became possible for an unprivileged user to mount a new devpts instance in a private user namespace. If that user ensured this mount was still visible in the global namespace, a program running there would be able to open the new ptmx device and get a file descriptor of a PTY master with any arbitrary index number. This would be quite distinct from any PTY in the global /dev/pts/ but, crucially, ptsname() doesn't know that. ptsname() simply calls the ioctl() to find the index number for the PTY, and constructs a path name in /dev/pts/. So if the master file descriptor could be passed to a setuid pt_chown, it would change the ownership of a PTY that was, in all probability, owned by someone else. The ability to take ownership of a PTY connected to a root shell, for example, has obvious value to somebody wishing to compromise the system.
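The lookup described here can be sketched in a few lines, which makes the weakness easier to see. The code below is not glibc's implementation; naive_ptsname() and slave_path_from_index() are hypothetical names for a minimal version of the same idea, using the TIOCGPTN ioctl() to obtain the index.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Build a slave path from nothing but the PTY index. */
int slave_path_from_index(unsigned int index, char *buf, size_t size)
{
    int n = snprintf(buf, size, "/dev/pts/%u", index);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Ask the master for its index and assume the slave lives in /dev/pts.
 * Nothing here records which devpts instance the master belongs to.
 * Returns 0 on success, -1 if `master_fd` is not a PTY master.
 */
int naive_ptsname(int master_fd, char *buf, size_t size)
{
    unsigned int index;

    if (ioctl(master_fd, TIOCGPTN, &index) != 0)
        return -1;
    return slave_path_from_index(index, buf, size);
}
```

Two masters opened in different devpts instances can both report index 0, and this code would map both of them to /dev/pts/0.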
The obvious response to this would be to get rid of pt_chown, simply because it is a setuid program that isn't needed — providing devpts was mounted with the proper options. Unfortunately this isn't obvious to everyone. Pulling other threads of this story together, if you run xen-create-image, it will mount the devpts filesystem within its chroot environment with no options. This mounts the singleton instance, and imposes the default options on it (which are not the preferred options). If, as is normally the case, the singleton instance is already mounted outside the chroot at /dev/pts with options like gid=5,mode=620, the default options will be implicitly imposed there too — overwriting the previous options. Without pt_chown this will result in slave PTYs getting the wrong group owner, which breaks a number of applications.
The right solution would be to fix xen-create-image. Biederman reported that, rather than taking that approach, "some distributions have been working around this problem by continuing to install a setuid root pt_chown binary that will be called by glibc to fix the permissions." In the absence of multiple-instance devpts, this may be clumsy but it works and has no obvious security problems. With the introduction of multiple-instance devpts and allowing unprivileged mounts in user namespaces, a potential security issue has arisen. This problem came up as a result of changes in the kernel, so it is up to the kernel developers to address the issue, even though it is only there because of questionable practices in user-space code.
Just to be clear, this problem only affects installations with a setuid pt_chown, with a v3.9 or later Linux kernel, and with the non-default CONFIG_USER_NS configuration option enabled.
Search for a solution
The best solution would address the problem without any change to user space. This would require a kernel change so that the current pt_chown program either failed to recognize the given file descriptor as representing a PTY master, or failed when it tried to change ownership. Neither of these is possible with any sort of elegance. pt_chown identifies the master by using the TIOCGPTN ioctl, which just returns a number used to identify the slave. Any change to this would be likely to break some other program too.
As no kernel change could be sufficient, some glibc change is required. It would probably be possible to change ptsname() to perform more checks before performing the ownership change, such as ensuring that the inode and device numbers of the passed file descriptor match those reported by lstat("/dev/ptmx"). It may even be possible to make this work reliably on all kernels and all Linux distributions. But this is far from certain and the approach only serves to perpetuate an undesirable setuid program: nobody really wants that.
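A check of that general shape could be written as follows. This is a sketch of the idea only, with a hypothetical function name; whether such a comparison would be sufficient on every system is exactly the uncertainty noted above.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Compare the device and inode numbers behind `fd` with those that
 * lstat() reports for `path`. A symbolic link at `path` is not followed.
 * Returns 1 on a match, 0 on a mismatch, -1 if either lookup fails.
 */
int fd_is_same_node(int fd, const char *path)
{
    struct stat fd_st, path_st;

    assert(path != NULL);
    if (fstat(fd, &fd_st) != 0 || lstat(path, &path_st) != 0)
        return -1;
    return fd_st.st_dev == path_st.st_dev && fd_st.st_ino == path_st.st_ino;
}
```

A caller would pass the master file descriptor and "/dev/ptmx", then refuse to change any ownership unless the result is 1. Because lstat() does not follow links, a system where /dev/ptmx is a symbolic link to pts/ptmx would never match.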
So the preferred approach is a glibc configuration change rather than a code change: deprecate pt_chown and convince all distributions to remove it. This suggests that a way needs to be found to support the somewhat strange usage in xen-create-image without the need for pt_chown.
Biederman's plan for this is to discard the "singleton" instance of devpts completely. Once implemented, every mount of devpts will be a new instance, so when xen-create-image performs the mount with default options, that won't affect the mount options on the system /dev/pts. If all distributions had already changed /dev/ptmx to be a symlink to pts/ptmx in all cases, this would be a trivial change and nobody would notice. But that change has not been universally made.
Since the separate ptmx device node (created via mknod() in /dev) is still widely in use, it must be changed so that, instead of using the singleton devpts (which will no longer exist), it uses the right one for the particular context in which it is accessed. When Biederman's new version of ptmx is opened, it will look for the name "pts" in the same directory in which ptmx was found and see if that is a mount point of the devpts filesystem. If it is, a new PTY will be allocated in that filesystem. If it isn't, an error will be returned.
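The lookup rule can be approximated from user space to show what "nearby" means. The sketch below is an analogy rather than the kernel code: the function names are hypothetical, and it only checks that the sibling "pts" directory sits on a devpts filesystem (magic number 0x1cd1, DEVPTS_SUPER_MAGIC in <linux/magic.h>), not that it is strictly a mount point.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>

#define SKETCH_DEVPTS_MAGIC 0x1cd1  /* DEVPTS_SUPER_MAGIC in <linux/magic.h> */

/* Given the path of a ptmx node, write the path of its sibling "pts". */
int sibling_pts_path(const char *ptmx_path, char *buf, size_t size)
{
    const char *slash = strrchr(ptmx_path, '/');
    /* Keep the directory part, including its trailing slash. */
    int dir_len = slash ? (int)(slash - ptmx_path) + 1 : 0;
    int n = snprintf(buf, size, "%.*spts", dir_len, ptmx_path);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Returns 1 if `path` is on a devpts filesystem, 0 if not, -1 on error. */
int is_on_devpts(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
    return sfs.f_type == SKETCH_DEVPTS_MAGIC;
}

/* Would a ptmx node at `ptmx_path` find a devpts instance next to it? */
int ptmx_would_find_devpts(const char *ptmx_path)
{
    char pts[4096];

    if (sibling_pts_path(ptmx_path, pts, sizeof pts) != 0)
        return -1;
    return is_on_devpts(pts);
}
```

With this rule, a ptmx node at /chroot/dev/ptmx is tied to /chroot/dev/pts rather than to the system's /dev/pts.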
This is undoubtedly an odd behavior for a character-special device to have. We are used to symbolic links behaving differently when found in different directories, but not character devices. There was a suggestion that the ptmx device could make itself look like a symbolic link, but this turned out to be much more easily said than done, and it is not clear that this would be a more elegant solution, just a different one.
To summarize the intent behind these changes: enabling a regular ptmx device to find and use a nearby mount of devpts makes it possible to get rid of the singleton version of the devpts filesystem and have every mount create a new instance. Once this is done, the unusual usage in xen-create-image (or any other unusual usage that might be out there) will only have local effect and cannot impose non-standard options on the "system" /dev/pts. Then there will no longer be any excuse to install pt_chown, so we can strongly encourage distributions and users to remove it. At that point, the security problems that arise when enabling both CONFIG_USER_NS and pt_chown will no longer be an issue.
Might this break something else?
This change, first proposed by Biederman in December, has not had an easy path to the kernel. Once it became clear that some sort of semantic change was needed, the question arose as to which changes might be safe and which changes might break things. Linus Torvalds's dictum that we mustn't break user space does not mean that we cannot change the behavior of the kernel, only that we cannot change a behavior that is reasonably being depended on. There was some disagreement as to exactly what could or could not be changed in this case.
To his great credit, Biederman has assembled a considerable array of different distributions to test his changes on. The most recent patch makes the claim that:
which provides quite a high level of confidence that existing behaviors aren't broken.
This last patch is considerably smaller than some earlier attempts, in part because Torvalds committed some clean-up patches himself to make the code more approachable. It was a little too late for Greg Kroah-Hartman — as TTY maintainer — to accept it for the 4.7 cycle, but it seems likely that this new approach to devpts where every mount is a new instance will land for Linux 4.8. The next step will, presumably, be to actively encourage those distributions that currently ship a setuid pt_chown to stop doing so.
Page editor: Jonathan Corbet
