Kernel development [LWN.net]

Kernel release status

The current development kernel is 3.9-rc3, released on March 17. Linus says: "Not as small as -rc2, but that one really was unusually calm. So there was clearly some pending stuff that came in for -rc3, with network drivers and USB leading the charge. But there's other misc drivers, arch updates, btrfs fixes, etc etc too."

Stable updates: 3.8.3, 3.4.36, and 3.0.69 were released on March 14, and 3.8.4, 3.4.37, 3.2.41, and 3.0.70 came out on March 20.

A trinity fuzz-tester mailing list

Dave Jones has announced the creation of a mailing list for development of the "Trinity" fuzz testing tool. It is hosted on vger, so the usual majordomo subscription routine applies.

The trouble with DMA masks

By Jonathan Corbet
March 20, 2013

Almost any I/O device worth its electrons will support direct memory access (DMA) transactions; to do otherwise is to be relegated to the world of low-bandwidth, high-overhead I/O. But "DMA-capable" devices are not all equally so; many of them have limitations restricting the range of memory that can be directly accessed. The 24-bit limitation that afflicted ISA devices in the early days of the personal computer is a classic example, but contemporary hardware also has its limits. The kernel has long had a mechanism for working around these limitations, but it turns out that this subsystem has some interesting problems of its own.

DMA limitations are usually a result of a device having fewer address lines than would be truly useful. The 24 lines described by the ISA specification are an obvious example; there is simply no way for an ISA-compliant device to address more than 16MB of physical memory. PCI devices are normally limited to a 32-bit address space, but a number of devices are limited to a smaller space as a result of dubious hardware design; as is so often the case, hardware designers have shown a great deal of creativity in this area. But users are not concerned with these issues; they just want their peripherals to work. So the kernel has to find a way to respect any given device's special limits while still using DMA to the greatest extent possible.

The kernel's DMA API (described in Documentation/DMA-API.txt) abstracts and hides most of the details of making DMA actually work with any specific device. This API will, for example, endeavor to allocate memory that falls within the physical range supported by the target device. It will also transparently implement "bounce buffering" — copying data between a device-inaccessible buffer and an accessible buffer — if necessary. To do so, however, the DMA API must be informed of a device's addressing limits. That is done through the provision of a "DMA mask," a bitmask describing the memory range reachable by the device. The documentation describes the mask this way:

The dma_mask represents a bit mask of the addressable region for the device. I.e., if the physical address of the memory anded with the dma_mask is still equal to the physical address, then the device can perform DMA to the memory.

The problem, as recently pointed out by Russell King, is that the DMA mask is not always interpreted that way. He points to code like the following, found in block/blk-settings.c:

    void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
    {
        unsigned long b_pfn = dma_mask >> PAGE_SHIFT;

What is happening here is that the code is right-shifting the DMA mask to turn it into a "page frame number" (PFN). If one envisions a system's memory as a linear array of pages, the PFN of a given page is simply its index into that array (though memory is not always organized so simply). By treating a DMA mask as, for all practical purposes, another way of expressing the PFN of the highest addressable page, the block code is changing the semantics of how the mask is interpreted.

Russell explained how that can be problematic. On some ARM systems, memory does not start at a physical address of zero; the physical address of the first byte can be as high as 3GB (0xc0000000). If a system configured in this way has a device with a 26-bit address limitation (with the upper bits being filled in by the bus hardware), then its DMA mask should be set to 0xc3ffffff. Any physical address within the device's range will be unchanged by a logical AND operation with this mask, while any address outside of that range will not.
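The documented test is small enough to write out. The following is a minimal user-space sketch, not kernel code; the helper names dma_mask_allows() and dma_mask_for() are invented for illustration, and the mask construction assumes the bus hardware supplies the upper address bits as in Russell's example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The test from Documentation/DMA-API.txt: the device can reach an
 * address if ANDing it with the mask leaves the address unchanged.
 */
static bool dma_mask_allows(uint64_t phys_addr, uint64_t dma_mask)
{
    return (phys_addr & dma_mask) == phys_addr;
}

/*
 * Hypothetical helper: build a mask for a device that drives only the
 * low addr_bits address lines while the bus fills in the upper bits
 * from ram_base. Assumes 0 < addr_bits < 64 and that ram_base has no
 * bits set among the low addr_bits.
 */
static uint64_t dma_mask_for(uint64_t ram_base, unsigned int addr_bits)
{
    return ram_base | ((UINT64_C(1) << addr_bits) - 1);
}
```

With a RAM base of 0xc0000000 and 26 address bits, dma_mask_for() yields 0xc3ffffff; an address such as 0xc0001000 passes the test, while 0xc4000000 does not.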

But what then happens when the block code right-shifts that mask to get a PFN from the mask? The result (assuming 4096-byte pages) is 0xc3fff, which is a perfectly valid PFN on a system where the PFN of the first page will be 0xc0000. And that is fine until one looks at the interactions with a global memory management variable called max_low_pfn. Given that name, one might imagine that it is the maximum PFN contained within low memory — the PFN of the highest page that is directly addressable by the kernel without special mappings. Instead, max_low_pfn is a count of page frames in low memory. But not all code appears to treat it that way.

On an x86 system, where memory starts at a physical address of zero (and, thus, a PFN of zero), that difference does not matter; the count and the maximum are the same. But on more complicated systems the results can be interesting. Returning to the same function in blk-settings.c:

    blk_max_low_pfn = max_low_pfn - 1;  /* Done elsewhere at init time */

    if (b_pfn < blk_max_low_pfn)
        dma = 1;
    q->limits.bounce_pfn = b_pfn;

Here we have a real page frame number (calculated from the DMA mask) compared to a count of page frames, with decisions on how DMA must be done depending on the result. It would not be surprising to see erroneous results from such an operation; with regard to the discussion in question, it seems to have caused bounce buffering to be done when there was no need. One can easily see other kinds of trouble that could result from this type of confusion; inconsistent views of what a variable means will rarely lead to good things.
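The unit mismatch can be made concrete with a small user-space sketch. Everything here is illustrative: struct lowmem, EX_PAGE_SHIFT, and the function names are invented, and the 256MB memory size used in the note below is an assumed figure, not one taken from the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12  /* 4096-byte pages, as assumed in the article */

/* Hypothetical description of low memory, for this illustration only. */
struct lowmem {
    uint64_t start_pfn;  /* PFN of the first page of RAM */
    uint64_t nr_pages;   /* count of page frames in low memory (> 0) */
};

/* Turn a DMA mask into a PFN the way the block-layer excerpt does. */
static uint64_t mask_to_pfn(uint64_t dma_mask)
{
    return dma_mask >> EX_PAGE_SHIFT;
}

/* The comparison as written in the excerpt: a PFN against (count - 1). */
static bool below_limit_using_count(uint64_t dma_mask, const struct lowmem *m)
{
    return mask_to_pfn(dma_mask) < m->nr_pages - 1;
}

/* The same comparison against the PFN of the highest low-memory page. */
static bool below_limit_using_max_pfn(uint64_t dma_mask, const struct lowmem *m)
{
    return mask_to_pfn(dma_mask) < m->start_pfn + m->nr_pages - 1;
}
```

When RAM starts at PFN 0 the two functions always agree. With start_pfn set to 0xc0000, nr_pages set to 0x10000 (256MB), and a mask of 0xc3ffffff, the first returns false and the second returns true, so the decision depends entirely on which meaning the code assumes.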

Fixing this situation is not going to be straightforward; Russell had "no idea" of how to do it. Renaming max_low_pfn to something like low_pfn_count might be a first step as a way to avoid further confusion. Better defining the meaning of a DMA mask (or, at least, ensuring that the kernel's interpretation of a mask adheres to the existing definition) sounds like a good idea, but it could be hard to implement in a way that does not break obscure hardware — some of that code can be fragile indeed. One way or another, it seems that the DMA interface, which was designed by developers working with relatively straightforward hardware, is going to need some attention from the ARM community if it's going to meet that community's needs.

Anatomy of a user namespaces vulnerability

By Michael Kerrisk
March 20, 2013

An exploit posted on March 13 revealed a rather easily exploitable security vulnerability (CVE-2013-1858) in the implementation of user namespaces. That exploit enables an unprivileged user to escalate to full root privileges. Although a fix was quickly provided, it is nevertheless instructive to look in some detail at the vulnerability, both to better understand the nature of this kind of exploit and also to briefly consider how this vulnerability came to appear inside the user namespaces implementation. General background on user namespaces can be found in part 5 and part 6 of our recent series of articles on namespaces.

Overview

The vulnerability was discovered by Sebastian Krahmer, who posted proof-of-concept code demonstrating the exploit on the oss-security mailing list. The exploit is based on the fact that Linux 3.8 allows the following combination of flags when calling clone() (and also unshare() and setns()):

    clone(... CLONE_NEWUSER | CLONE_FS, ...);

CLONE_NEWUSER says that the new child should be in a new user namespace, and with the completion of the user namespaces implementation in Linux 3.8, that flag can now be employed by unprivileged processes. Within the new namespace, the child has a full set of capabilities, although it has no capabilities in the parent namespace.

The CLONE_FS flag says that the caller of clone() and the resulting child should share certain filesystem-related attributes—root directory, current working directory, and file mode creation mask (umask). The attribute of particular interest here is the root directory, which a privileged process can change using the chroot() system call.

It is the mismatch between the scope of these two flags that creates the window for the exploit. On the one hand, CLONE_FS causes the parent and child process to share the root directory attribute. On the other hand, CLONE_NEWUSER puts the two processes into separate user namespaces, and gives the child full capabilities in the new user namespace. Those capabilities include CAP_SYS_CHROOT, which gives a process the ability to call chroot(); the sharing provided by CLONE_FS means that the child can change the root directory of a process in another user namespace.

In broad strokes, the exploit achieves escalation to root privileges by executing any set-user-ID-root program that is present on the system in a chroot environment which is engineered to execute attacker-controlled code. That code runs with user ID 0 and allows the exploit to fire up a shell with root privileges. The exploit as demonstrated is accomplished by subverting the dynamic linking mechanism, although other lines of attack based on the same foundation are also possible.

The vulnerability scenario

The first part of understanding the exploit requires some understanding of the operation of the dynamic linker. Most executables (including most set-user-ID root programs) on a Linux system employ shared libraries and dynamic linking. At run time, the dynamic linker loads the required shared libraries in preparation for running the program. The pathname of the dynamic linker is embedded in the executable file's ELF headers, and is listed among the other dependencies of a dynamically linked executable when we use the ldd command (here executed on an x86-64 system):

    $ ldd /bin/ls | grep ld-linux
            /lib64/ld-linux-x86-64.so.2 (0x00000035b1800000)
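The pathname that ldd reports is stored in the executable's PT_INTERP program header, and it can be read without running anything. The sketch below assumes a 64-bit ELF file in the host's byte order; elf64_interp() is a name invented for this example.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the PT_INTERP string (the dynamic linker's pathname) of a 64-bit
 * ELF file into buf. Returns 0 on success, or -1 if the file cannot be
 * read, is not 64-bit ELF, or has no PT_INTERP segment (as is the case
 * for statically linked binaries).
 */
static int elf64_interp(const char *path, char *buf, size_t buflen)
{
    FILE *f = fopen(path, "rb");
    Elf64_Ehdr eh;
    int ret = -1;

    if (f == NULL)
        return -1;
    if (buflen == 0 ||
        fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto out;

    /* Walk the program header table looking for PT_INTERP. */
    for (unsigned int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long ph_off = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);
        size_t n;

        if (fseek(f, ph_off, SEEK_SET) != 0 ||
            fread(&ph, sizeof(ph), 1, f) != 1)
            goto out;
        if (ph.p_type != PT_INTERP)
            continue;

        /* The segment holds a NUL-terminated path; truncate if needed. */
        n = ph.p_filesz < buflen ? (size_t)ph.p_filesz : buflen - 1;
        if (fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
            fread(buf, 1, n, f) != n)
            goto out;
        buf[n] = '\0';
        ret = 0;
        break;
    }
out:
    fclose(f);
    return ret;
}
```

Calling elf64_interp("/bin/ls", buf, sizeof(buf)) on a system like the one shown above would be expected to fill buf with the same /lib64 path that ldd printed.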

There are a few important points to note about the dynamic linker. First, it is run before the application program. Second, it is run under whatever credentials would be accorded to the application program; thus, for example, if a set-user-ID-root program is being executed, the dynamic linker will run with an effective user ID of root.

Executable files are normally protected so that they can't be modified by users other than the file owner; this prevents, for example, unprivileged users from modifying the dynamic linker path embedded inside a set-user-ID-root binary. For similar reasons, an unprivileged user can't change the contents of the dynamic linker binary.

However, suppose for a moment that an unprivileged user could construct a chroot tree containing (via a hard link) the set-user-ID-root binary and an executable of the user's own choosing at /lib64/ld-linux-x86-64.so.2. Running the set-user-ID-root binary would then cause control first to be passed to the user's own code, which would be running as root. The aim of the exploit is to bring about the situation shown in the following diagram, where pathnames are shown linked to various binary files:

The key point in the above diagram is that two pathnames link to the fusermount binary (a set-user-ID-root program used for mounting and unmounting FUSE filesystems). If a process outside the chroot environment executes the /bin/fusermount binary, then the real dynamic linker will be invoked to load the binary's shared libraries. On the other hand, if a process inside the chroot environment executes the other link to the binary (/suid-root), then the kernel will load the ELF interpreter pointed to by the link /lib64/ld-linux-x86-64.so.2 inside the chroot environment. That link points to code supplied by an attacker, and will be run with root privileges.

How does the Linux 3.8 user namespaces implementation help with this attack? First, an unprivileged user can create a new user namespace in which they gain full privileges, including the ability to create a chroot environment using chroot(). Second, the differing scope of CLONE_NEWUSER and CLONE_FS described above means that the privileged process inside a new user namespace can construct a chroot environment that applies to a process outside the user namespace. If that process can in turn then be made to execute a set-user-ID binary inside the chroot environment, then the attacker code will be run as root.

A three-phase attack

Although Sebastian's program is quite short, there are many details involved that make the exploit somewhat challenging to understand; furthermore, the program is written with the goal of accomplishing the exploit, rather than educating the reader on how the exploit is carried out. Therefore, we'll provide an equivalent program, userns_exploit.c, that performs the same attack; this program is structured in a more understandable way and is instrumented with output statements that enable the user to see what is going on. We won't walk through the code of the program, but it is well commented and should be easy to follow using the explanations in this article.

The attack code involves the creation of three processes, which we'll label "parent", "child", and "grandchild". The attack is conducted in three phases; in each phase, a separate instance of the attacker code is executed. This concept can at first be difficult to grasp when reading the code. It's easiest to think of the userns_exploit program as simply offering itself in three flavors, with the choice being determined by command-line arguments and the effective user ID of the process.

The following diagram shows the exploit in overview:

In the above diagram, the vertical dashed lines indicate points where a process is blocked waiting for another process to complete some action.

In the first phase of the exploit, the program starts by discovering its own pathname. This is done by reading the contents of the /proc/self/exe symbolic link. The program needs to know its own pathname for two reasons: so it can create a link to itself inside the chroot tree and so it can re-execute itself later.
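Discovering one's own pathname in this way is a general technique. A minimal sketch follows; own_exe_path() is an invented name, and this is not code from userns_exploit.c.

```c
#define _GNU_SOURCE
#include <unistd.h>

/*
 * Resolve the /proc/self/exe symbolic link into buf as a NUL-terminated
 * string. Returns the length of the path, or -1 on error or if the path
 * may have been truncated to fit the buffer.
 */
static ssize_t own_exe_path(char *buf, size_t buflen)
{
    ssize_t n;

    if (buflen == 0)
        return -1;
    /* readlink() does not append a terminator, so leave room for one. */
    n = readlink("/proc/self/exe", buf, buflen - 1);
    if (n < 0 || (size_t)n == buflen - 1)
        return -1;
    buf[n] = '\0';
    return n;
}
```

A caller would typically pass a buffer of PATH_MAX bytes and treat a return value of -1 as fatal.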

The program then creates two processes, labeled "parent" and "child" in the above diagram. The parent's task is simple. It will loop, using the stat() system call to check whether the program pathname discovered in the previous step is owned by root and has the set-user-ID permission bit enabled. This causes the parent to wait until the other processes have finished their tasks.
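The condition that the parent polls for can be expressed as a single stat() check. This is a sketch of that test only, under the assumption that "owned by root" means a user ID of 0; is_setuid_root() is an invented name.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Report whether the file at path is owned by user ID 0 and has the
 * set-user-ID permission bit enabled. A file that cannot be examined
 * is reported as not matching.
 */
static bool is_setuid_root(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return false;
    return sb.st_uid == 0 && (sb.st_mode & S_ISUID) != 0;
}
```

The parent described above would call a check like this repeatedly, pausing between attempts, until it returns true.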

In the meantime, the "child" populates the directory tree that will be used as the chroot environment. The goal is to create the set-up shown in the following diagram:

The difference from the first diagram is that we now see that it is the userns_exploit program that will be used as the fake dynamic loader inside the chroot environment. Furthermore, that binary is also linked outside the chroot environment, and the exploit design takes advantage of that fact.

Having created the chroot tree shown above, the child then employs clone(CLONE_NEWUSER|CLONE_FS) to create a new process—the grandchild. The grandchild has a full set of capabilities, which allows it to call chroot() to place itself into the chroot tree. Because the grandchild and the child share the root directory attribute, the child is now also placed in the chroot environment.

Its small task complete, the grandchild now terminates. At that point, the child, which has been waiting on the grandchild, now resumes. As its next step, the child executes the program at the path /suid-root. This is in fact a link to the fusermount binary. Because the child is in the initial user namespace and the fusermount binary is set-user-ID-root, the child gains root privileges.

However, before the fusermount binary is loaded, the kernel first loads its ELF interpreter, the file at the path /lib64/ld-linux-x86-64.so.2. That, as it happens, is actually the userns_exploit program. Thus, the userns_exploit program is now executed for a second time (and the fusermount program is never executed).

The second phase of the exploit has now begun. This instance of the userns_exploit program recognizes that it has an effective user ID of 0. However, the only files it can access are those inside the chroot environment. But that is sufficient. The child can now change the ownership of the file /lib64/ld-linux-x86-64.so.2 and turn on the file's set-user-ID permission bit. That pathname is, of course, a link to the userns_exploit binary. At this point, the child's work is now complete, and it terminates.

All of this time, the parent process has been sitting in the background waiting for the userns_exploit binary to become a set-user-ID-root program. That, of course, is what the child has just accomplished. So, at this point, the parent now executes the userns_exploit program outside the chroot environment. On this execution, the program is supplied with a command-line argument.

The third and final phase of the exploit has now started. The userns_exploit program determines that it has an effective user ID of 0 and notes that it has a command-line argument. That latter fact distinguishes this case from the second execution of the userns_exploit and is a signal that this time the program is being executed outside the chroot environment. All that the program now needs to do is execute a shell; that shell will provide the user with full root privileges on the system.

Further requirements for a successful exploit

There are a few other steps that are necessary to successfully accomplish the exploit. The userns_exploit program must be statically linked. This is necessary so that, when executed as the dynamic linker inside the chroot environment, the userns_exploit program does not itself require a dynamic linker.

In addition, the value in the /proc/sys/fs/protected_hardlinks file must be zero. The protected_hardlinks setting was added in Linux 3.6 specifically to prevent the types of exploit discussed in this article. If this file has the value one, then only the owner of a file can create hard links to it. On a vanilla kernel, protected_hardlinks unfortunately has the default value zero, although some distributions provide kernels that change this default.
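An administrator who wants to check this setting from a program can simply read the file. The sketch below takes the pathname as a parameter so that it can be pointed at any similar one-integer /proc file; read_int_setting() is an invented name.

```c
#include <stdio.h>

/*
 * Read a single integer from a /proc/sys style file, for example
 * /proc/sys/fs/protected_hardlinks. Returns the value read, or -1 if
 * the file cannot be opened or does not begin with an integer.
 */
static int read_int_setting(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A result of 1 for /proc/sys/fs/protected_hardlinks means the hard-link restriction is active; a result of -1 may simply mean the kernel predates Linux 3.6.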

In the process of exploring this vulnerability, your editor discovered that set-user-ID binaries built as hardened, position-independent executables (PIE) cannot be used for this particular attack. (Many of the set-user-ID-root binaries on his Fedora system were hardened in this manner.) While PIE hardening thwarts this particular line of attack, the chroot() technique described here can still be used to exploit a set-user-ID-root binary in other ways. For example, the binary can be placed in a suitably constructed chroot environment that contains the genuine dynamic linker but a compromised libc.

Finally, user namespaces must of course be enabled on the system where this exploit is to be tested, and the kernel version needs to be precisely 3.8. Earlier kernel versions did not allow unprivileged users to create user namespaces, and later kernels will fix this bug, as described below. The exploit is unlikely to be possible with distributor kernels: because the Linux 3.8 kernel does not support the use of user namespaces with various filesystems, including NFS and XFS, distributors are unlikely to enable user namespaces in the kernels that they ship.

The fix

Once the problem was reported, Eric Biederman considered two possible solutions. The more complex solution is to create an association from a process's fs_struct, the kernel data structure that records the process's root directory, to a user namespace, and use that association to set limitations around the use of chroot() in scenarios such as the one described in this article. The alternative is the simple and obviously safe solution: disallow the combination of CLONE_NEWUSER and CLONE_FS in the clone() system call, make CLONE_NEWUSER automatically imply CLONE_FS in the unshare() system call, and disallow the use of setns() to change a process's user namespace if the process is sharing CLONE_FS-related attributes with another process.

Subsequently, Eric concluded that the complex solution seemed to be unnecessary and would add a small overhead to every call to fork(). He thus opted for the simple solution: the Linux 3.9 kernel (and the 3.8.3 stable kernel) will disallow the combination of CLONE_NEWUSER and CLONE_FS.
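The rules of the simple solution can be summarized as two small predicates. This is a user-space illustration of the behavior described above, not the kernel's actual code; both function names are invented, and the setns() rule is omitted because it depends on kernel-internal state.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* clone(): reject any request that combines CLONE_NEWUSER and CLONE_FS. */
static bool clone_flags_permitted(unsigned long flags)
{
    const unsigned long both = CLONE_NEWUSER | CLONE_FS;

    return (flags & both) != both;
}

/* unshare(): asking for CLONE_NEWUSER automatically implies CLONE_FS. */
static unsigned long unshare_effective_flags(unsigned long flags)
{
    if (flags & CLONE_NEWUSER)
        flags |= CLONE_FS;
    return flags;
}
```

Note that CLONE_FS has opposite senses in the two calls: in clone() it requests sharing of the filesystem attributes, while in unshare() it requests that sharing be ended. Both rules therefore ensure that a process entering a new user namespace does not share its root directory with a process outside it.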

User namespaces and security

As we noted in an earlier article, Eric Biederman has put a lot of work into trying to ensure that unprivileged users can create user namespaces without causing security vulnerabilities. Nevertheless, a significant exploit was found soon after the release of the first kernel version that allowed unprivileged processes to create user namespaces. Another user namespace vulnerability that potentially allowed unprivileged users to load arbitrary kernel modules was also reported and fixed earlier this month. In addition, during the discussion of the CLONE_NEWUSER|CLONE_FS issue, Andy Lutomirski has hinted that there may be another user namespaces vulnerability to be fixed.

Why is it that several security vulnerabilities have sprung from the user namespaces implementation? The fundamental problem seems to be that user namespaces and their interactions with other parts of the kernel are rather complex—probably too complex for the few kernel developers with a close interest to consider all of the possible security implications. In addition, by making new functionality available to unprivileged users, user namespaces expand the attack surface of the kernel. Thus, it seems that as user namespaces come to be more widely deployed, other security bugs such as these are likely to be found. One hopes that they'll be found and fixed by the kernel developers and white hat security experts, rather than found and exploited by black hat attackers.

Updated on 22 February 2013 to clarify and correct some minor details of the "simple and safe" solution under the heading, "The fix".

Patches and updates

Linus Torvalds: Linux 3.9-rc3
Greg KH: Linux 3.8.4
Greg KH: Linux 3.8.3
Greg KH: Linux 3.4.37
Greg KH: Linux 3.4.36
Ben Hutchings: Linux 3.2.41
Greg KH: Linux 3.0.70
Greg KH: Linux 3.0.69
Borislav Petkov: x86, cpu: Expand ->x86_capability flags with bugs bitvector, v2
Daniel Lezcano: cpuidle : ARM driver to rule them all
Arnd Bergmann: SIRF multiplatform support
Tejun Heo: workqueue: break up workqueue_lock into multiple locks
Tejun Heo: workqueue: NUMA affinity for unbound workqueues
Michel Lespinasse: rwsem fast-path write lock stealing
Eric Wong: epoll: avoid spinlock contention with wfcqueue
Paul E. McKenney: rcu: Remove restrictions on no-CBs CPUs
Rik van Riel: ipc,sem: sysv semaphore scalability
Steven Rostedt: tracing: function triggers, stack tracer fixes, clocks and documenation
Borislav Petkov: Perf persistent events
HATAYAMA Daisuke: kdump, vmcore: support mmap() on /proc/vmcore
Jon Arne Jørgensen: Add a driver for somagic smi2021
Andreas Larsson: gpio: Add device driver for GRGPIO cores and support custom accessors with gpio-generic
Maxime Ripard: net: Add davicom wemac ethernet driver found on Allwinner A10 SoC's
Guenter Roeck: hwmon: Add devres support
Luis R. Rodriguez: compat-drivers based on v3.8.3
Vipul Pandya: Add support for Chelsio T5 adapter
Adrian Chadd: Announcement: open source AR9380 and later HAL
Laurent Pinchart: R-Car Display Unit DRM driver
Sebastian Hesselbarth: clk: add si5351 i2c common clock driver
Kishon Vijay Abraham I: Generic PHY Framework
Philipp Zabel: Add generic driver for on-chip SRAM
Peter Hurley: lockless n_tty receive path
Michael Kerrisk (man-pages): man-pages-3.50 is released
Michael Kerrisk (man-pages): open(2): document O_PATH
Michael Kerrisk (man-pages): For review: user_namespaces(7) man page
Paul E. McKenney: nohz1: Documentation
Aaron Lu: block layer runtime pm
Eric W. Biederman: [PATCH 00/14] xfs: Support for interacting with multiple user namespaces
Philipp Reisner: RFC: Non blocking submit for activity log misses
Kirill A. Shutemov: Transparent huge page cache
Tang Chen: Introduce movablemem_map boot option.
Mel Gorman: Reduce system disruption due to kswapd
Matthew Garrett: Security: Add CAP_COMPROMISE_KERNEL
Wanlong Gao: virtio-scsi multiqueue
Mauro Carvalho Chehab: rasdaemon userspace tool v.0.1
