Brief items
The current development kernel is 3.9-rc3,
released on March 17. Linus says:
"Not as small as -rc2, but that one really was unusually calm. So
there was clearly some pending stuff that came in for -rc3, with network
drivers and USB leading the charge. But there's other misc drivers, arch
updates, btrfs fixes, etc etc too."
Stable updates: 3.8.3, 3.4.36, and 3.0.69 were released on March 14, and
3.8.4, 3.4.37, 3.2.41, and 3.0.70 came out on March 20.
Dave Jones has announced the creation of a mailing list for development of
the "Trinity" fuzz testing tool. It is hosted on vger, so the usual
majordomo subscription routine applies.
Kernel development news
By Jonathan Corbet
March 20, 2013
Almost any I/O device worth its electrons will support direct memory access
(DMA) transactions; to do otherwise is to be relegated to the world of
low-bandwidth, high-overhead I/O. But "DMA-capable" devices are not all
equally so; many of them have limitations restricting the range of memory
that can be directly accessed. The 24-bit limitation that afflicted ISA
devices in the early days of the personal computer is a classic example,
but contemporary hardware also has its limits. The kernel has long had a
mechanism for working around these limitations, but it turns out that this
subsystem has some interesting problems of its own.
DMA limitations are usually a result of a device having fewer address lines
than would be truly useful. The 24 lines described by the ISA
specification are an obvious example; there is simply no way for an
ISA-compliant device to address more than 16MB of physical memory. PCI
devices are normally limited to a 32-bit address space, but a number of
devices are limited to a smaller space as a result of dubious hardware
design; as is so often the case,
hardware designers have shown a great deal of creativity in this area. But
users are not concerned with these issues; they just want their peripherals
to work. So the kernel has to find a way to respect any given device's
special limits while still using DMA to the greatest extent possible.
The kernel's DMA API (described in Documentation/DMA-API.txt) abstracts and hides
most of the details of making DMA actually work with any specific device.
This API will, for example, endeavor to allocate memory that falls within
the physical range supported by the target device. It will also
transparently implement "bounce buffering" — copying data between a
device-inaccessible buffer and an accessible buffer — if necessary. To do
so, however, the DMA API must be informed of a device's addressing limits.
That is done through the provision of a "DMA mask," a bitmask describing
the memory range reachable by the device. The documentation describes the
mask this way:
The dma_mask represents a bit mask of the addressable region for
the device. I.e., if the physical address of the memory anded with
the dma_mask is still equal to the physical address, then the
device can perform DMA to the memory.
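The documented rule can be written down directly. What follows is a minimal user-space sketch, not kernel code; the helper names dma_mask_covers() and dma_bit_mask() are made up for illustration, and the 24-bit case mirrors the ISA example mentioned earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the documented test holds, i.e. ANDing the
 * physical address with the mask leaves the address unchanged.
 */
static bool dma_mask_covers(uint64_t phys, uint64_t mask)
{
    return (phys & mask) == phys;
}

/*
 * Hypothetical helper: build the mask for a device with the given number
 * of address lines, assuming its window starts at physical address zero.
 */
static uint64_t dma_bit_mask(unsigned int bits)
{
    return bits >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << bits) - 1;
}
```

With 24 address lines the mask is 0xffffff, so an address just below 16MB passes the test while 0x1000000 (exactly 16MB) does not.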
The problem, as recently pointed out by
Russell King, is that the DMA mask is not always interpreted that way. He
points to code like the following, found in block/blk-settings.c:
    void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
    {
        unsigned long b_pfn = dma_mask >> PAGE_SHIFT;
What is happening here is that the code is right-shifting the DMA mask to
turn it into a "page frame number" (PFN). If one envisions a system's
memory as a linear array of pages, the PFN of a given page is simply its
index into that array (though memory is not always organized so simply).
By treating a DMA mask as, for all practical purposes, another way of
expressing the PFN of the highest addressable page, the block code is
changing the semantics of how the mask is interpreted.
Russell explained how that can be problematic. On some ARM systems,
memory does not start at a physical address of zero; the physical
address of the first byte can be as high as 3GB (0xc0000000). If a
system configured in this way has a device with a 26-bit address limitation
(with the upper bits
being filled in by the bus hardware), then its DMA mask should be set to
0xc3ffffff. Any physical address within the device's range will be
unchanged by a logical AND operation with this mask, while any address
outside of that range will not.
But what then happens when the block code right-shifts that mask to get a
PFN from the mask? The result (assuming 4096-byte pages) is 0xc3fff, which
is a perfectly valid PFN on a system where the PFN of the first page will
be 0xc0000. And that is fine until one looks at the interactions with a
global memory management variable called max_low_pfn. Given that
name, one might imagine that it is the maximum PFN contained within low
memory — the PFN of the highest page that is directly addressable by the
kernel without special mappings. Instead, max_low_pfn is a
count of page frames in low memory. But not all code appears to
treat it that way.
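The arithmetic in this example can be checked with a few lines of C. This is only a sketch: EXAMPLE_PAGE_SHIFT, phys_to_pfn(), and highest_pfn() are illustrative names rather than kernel symbols, and the 0x10000-page (256MB) region used below is an assumed size.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12   /* 4096-byte pages, as assumed in the text */

/* Convert a physical address (or a DMA mask treated as one) to a PFN. */
static uint64_t phys_to_pfn(uint64_t phys)
{
    return phys >> EXAMPLE_PAGE_SHIFT;
}

/*
 * PFN of the last page in a contiguous region that starts at first_pfn
 * and holds page_count pages. Only when first_pfn is zero does this
 * differ from the page count by just one.
 */
static uint64_t highest_pfn(uint64_t first_pfn, uint64_t page_count)
{
    return first_pfn + page_count - 1;
}
```

For a region of 0x10000 pages, the highest PFN is 0xffff when memory starts at PFN zero, but 0xcffff when it starts at PFN 0xc0000; the page count is 0x10000 in both cases.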
On an x86 system, where memory starts at a physical address of zero (and,
thus, a PFN of zero), that difference does not matter; the count and the
maximum are the same. But on more
complicated systems the results can be interesting. Returning to the same
function in blk-settings.c:
    blk_max_low_pfn = max_low_pfn - 1;  /* Done elsewhere at init time */
    if (b_pfn < blk_max_low_pfn)
        dma = 1;
    q->limits.bounce_pfn = b_pfn;
Here we have a real page frame number (calculated from the DMA mask)
compared to a count of page frames, with decisions on how DMA must be done
depending on the result. It would not be surprising to see erroneous
results from such an operation; with regard to the discussion in question,
it seems to have caused bounce buffering to be done when there was no need.
One can easily see other kinds of trouble that could result from this type
of confusion; inconsistent views of what a variable means will rarely lead
to good things.
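To see how the two interpretations can disagree, consider the following sketch. It is an illustration under assumed numbers (low memory of 0x10000 pages starting at PFN 0xc0000), not a reproduction of the specific failure discussed on the list, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The comparison as in the excerpt: a PFN measured against a page count. */
static bool below_limit_by_count(uint64_t b_pfn, uint64_t low_page_count)
{
    return b_pfn < low_page_count - 1;
}

/*
 * An alternative reading: compare against the PFN of the last low-memory
 * page, which requires knowing where memory starts.
 */
static bool below_limit_by_pfn(uint64_t b_pfn, uint64_t first_pfn,
                               uint64_t low_page_count)
{
    return b_pfn < first_pfn + low_page_count - 1;
}
```

With b_pfn = 0xc3fff (from the 0xc3ffffff mask), the count-based test compares against 0xffff and reports false, while the PFN-based test compares against 0xcffff and reports true. When memory starts at PFN zero, the two tests give the same answer.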
Fixing this situation is not going to be straightforward; Russell had "no
idea" of how to do it. Renaming max_low_pfn to something like
low_pfn_count might be a first step as a way to avoid further
confusion. Better defining the meaning of a DMA mask (or, at least,
ensuring that the kernel's interpretation of a mask adheres to the existing
definition) sounds like a good idea, but it could be hard to implement in a
way that does not break obscure hardware — some of that code can be fragile
indeed. One way or another, it seems that the DMA interface, which was
designed by developers working with relatively straightforward hardware, is
going to need some attention from the ARM community if it's going to meet
that community's needs.
By Michael Kerrisk
March 20, 2013
An exploit posted on March 13
revealed a rather easily exploitable security vulnerability (CVE-2013-1858)
in the implementation of user namespaces. That exploit enables an
unprivileged user to escalate to full root privileges. Although a fix was
quickly provided, it is nevertheless instructive to look in some detail at
the vulnerability, both to better understand the nature of this kind of
exploit and also to briefly consider how this vulnerability came to appear
inside the user namespaces implementation. General background on user
namespaces can be found in part 5 and part 6 of our recent series of
articles on namespaces.
Overview
The vulnerability was discovered by Sebastian Krahmer, who posted
proof-of-concept code
demonstrating the exploit on the oss-security mailing list.
The exploit is based on the fact that
Linux 3.8 allows the following combination of flags when calling
clone() (and also unshare() and setns()):
    clone(... CLONE_NEWUSER | CLONE_FS, ...);
CLONE_NEWUSER says that the new child should be in
a new user namespace, and with the completion of the user namespaces
implementation in Linux 3.8, that flag can now be employed by unprivileged
processes. Within the new namespace, the child has a full set of capabilities,
although it has no capabilities in the parent namespace.
The CLONE_FS flag says that the caller of clone() and
the resulting child should share certain filesystem-related attributes—root
directory, current working directory, and file mode creation mask
(umask). The attribute of particular interest here is the root directory,
which a privileged process can change using the chroot() system
call.
It is the mismatch between the scope of these two flags that creates
the window for the exploit. On the one hand, CLONE_FS causes the
parent and child process to share the root directory attribute. On the
other hand, CLONE_NEWUSER puts the two processes into separate
user namespaces, and gives the child full capabilities in the new user
namespace. Those capabilities include CAP_SYS_CHROOT, which gives a
process the ability to call chroot(); the sharing provided by
CLONE_FS means that the child can change the root directory of a
process in another user namespace.
In broad strokes, the exploit achieves escalation to root privileges by
executing any set-user-ID-root program that is present on the system in a
chroot environment which
is engineered to execute attacker-controlled code. That code runs with user
ID 0 and allows the exploit to fire up a shell with root privileges. The
exploit as demonstrated is accomplished by subverting the dynamic linking
mechanism, although other lines of attack based on the same foundation are
also possible.
The vulnerability scenario
The first part of understanding the exploit requires some understanding
of the operation of the dynamic linker. Most executables (including most
set-user-ID root programs) on a Linux system employ shared libraries and
dynamic linking.
At run time, the dynamic linker loads the required shared libraries in
preparation for running the program. The pathname of the dynamic linker is
embedded in the executable file's ELF headers, and is listed among the
other dependencies of a dynamically linked executable when we use the
ldd command (here executed on an x86-64 system):
    $ ldd /bin/ls | grep ld-linux
        /lib64/ld-linux-x86-64.so.2 (0x00000035b1800000)
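The embedded pathname lives in the executable's PT_INTERP program header. As a rough sketch of how it can be located, the function below scans an in-memory copy of a 64-bit ELF file using the definitions from <elf.h>. The name elf64_interp() is invented for this example, and the code assumes a file in the host's byte order and skips most of the validation a robust tool would need.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the interpreter pathname inside
 * a 64-bit ELF image held in memory, or NULL if there is none or the image
 * looks malformed.
 */
static const char *elf64_interp(const unsigned char *image, size_t len)
{
    Elf64_Ehdr eh;

    if (len < sizeof eh)
        return NULL;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(Elf64_Phdr) ||
        eh.e_phoff > len)
        return NULL;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        size_t off = eh.e_phoff + i * sizeof ph;

        if (off > len || len - off < sizeof ph)
            return NULL;
        memcpy(&ph, image + off, sizeof ph);
        if (ph.p_type != PT_INTERP)
            continue;
        /* The pathname must lie inside the image and be NUL-terminated. */
        if (ph.p_filesz == 0 || ph.p_offset > len ||
            len - ph.p_offset < ph.p_filesz ||
            image[ph.p_offset + ph.p_filesz - 1] != '\0')
            return NULL;
        return (const char *)image + ph.p_offset;
    }
    return NULL;    /* statically linked: no PT_INTERP entry */
}
```

Applied to the contents of a dynamically linked x86-64 binary, this should yield the same pathname that ldd reports; for a statically linked binary it returns NULL.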
There are a few important points to note about the dynamic linker. First, it
is run before the application program. Second, it is run under whatever
credentials would be accorded to the application program; thus, for
example, if a set-user-ID-root program is being executed, the dynamic
linker will run with an effective user ID of root.
Executable files are normally protected so that they can't be modified
by users other than the file owner; this prevents, for example,
unprivileged users from modifying the dynamic linker path embedded inside a
set-user-ID-root binary. For similar reasons, an unprivileged user can't
change the contents of the dynamic linker binary.
However, suppose for a moment that an unprivileged user could construct a
chroot tree containing (via a hard link) the set-user-ID-root binary and
an executable of the user's own choosing at
/lib64/ld-linux-x86-64.so.2. Running the set-user-ID-root binary
would then cause control first to be passed to the user's own code, which
would be running as root. The aim of the exploit is to bring about the
situation shown in the following diagram, where pathnames are shown linked
to various binary files:
The key point in the above diagram is that two pathnames link to the
fusermount binary (a set-user-ID-root program used for mounting
and unmounting FUSE
filesystems). If a process outside the chroot environment executes the
/bin/fusermount binary, then the real dynamic linker will be
invoked to load the binary's shared libraries. On the other hand, if a
process inside the chroot environment executes the other link to the binary
(/suid-root), then the kernel will load the ELF interpreter
pointed to by the link /lib64/ld-linux-x86-64.so.2 inside the
chroot environment. That link points to code supplied by an attacker, and
will be run with root privileges.
How does the Linux 3.8 user namespaces implementation help with this
attack? First, an unprivileged user can create a new user namespace in which
they gain full privileges, including the ability to create a chroot
environment using chroot(). Second, the differing scope of
CLONE_NEWUSER and CLONE_FS described above means that
the privileged process inside a new user namespace can construct a chroot
environment that applies to a process outside the user namespace. If that
process can in turn then be made to execute a set-user-ID binary inside
the chroot environment, then the attacker code will be run as root.
A three-phase attack
Although Sebastian's program is quite short, there are many details
involved that make the exploit somewhat challenging to understand;
furthermore, the program is written with the goal of accomplishing the
exploit, rather than educating the reader on how the exploit is carried
out. Therefore, we'll provide an equivalent program, userns_exploit.c, that performs the
same attack—this program is structured in a more understandable way
and is instrumented with output statements that enable the user to see what
is going on. We won't walk through the code of the program, but it is well
commented and should be easy to follow using the explanations in this article.
The attack code involves the creation of three processes, which we'll
label "parent", "child", and "grandchild". The attack is conducted in
three phases; in each phase, a separate instance of the attacker code is
executed. This concept can at first be difficult to grasp when reading the
code. It's easiest to think of the userns_exploit program as
simply offering itself in three flavors, with the choice being determined
by command-line arguments and the effective user ID of the process.
The following diagram shows the exploit in overview:
In the above diagram, the vertical dashed lines indicate points where a
process is blocked waiting for another process to complete some action.
In the first phase of the exploit, the program starts by discovering its
own pathname. This is done by reading the contents of the
/proc/self/exe symbolic link.
The program needs to know its own pathname for two
reasons: so it can create a link to itself inside the chroot tree and so it
can re-execute itself later.
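Reading a symbolic link such as /proc/self/exe is done with readlink(), which does not NUL-terminate its result. A small sketch of a wrapper follows; read_link_target() is an illustrative name and is not taken from userns_exploit.c.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the target of a symbolic link into buf as a
 * NUL-terminated string. Returns 0 on success, or -1 on error or when the
 * result may have been truncated.
 */
static int read_link_target(const char *link, char *buf, size_t size)
{
    ssize_t n;

    if (size == 0)
        return -1;
    n = readlink(link, buf, size - 1);
    if (n < 0 || (size_t)n >= size - 1)
        return -1;      /* failure, or the target filled the buffer */
    buf[n] = '\0';
    return 0;
}
```

A process would call read_link_target("/proc/self/exe", buf, sizeof buf) to learn the pathname of the binary it is running.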
The program then creates two processes, labeled "parent" and "child"
in the above diagram. The parent's task is simple. It will loop, using the
stat() system call to check whether the program pathname
discovered in the previous step is owned by root and has the
set-user-ID permission bit enabled. This causes the parent to wait until
the other processes have finished their tasks.
In the meantime, the "child" populates the directory tree that will be used
as the chroot environment. The goal is to create the set-up shown in the
following diagram:
The difference from the first diagram is that we now see that it is the
userns_exploit program that will be used as the fake dynamic
loader inside the chroot environment. Furthermore, that binary is also
linked outside the chroot environment, and the exploit design takes advantage of
that fact.
Having created the chroot tree shown above, the child then employs
clone(CLONE_NEWUSER|CLONE_FS) to create a new process—the
grandchild. The grandchild has a full set of capabilities, which allows it
to call chroot() to place itself into the chroot tree. Because the
grandchild and the child share the root directory attribute, the child is
now also placed in the chroot environment.
Its small task complete, the grandchild terminates. At that point,
the child, which has been waiting on the grandchild,
resumes. As its next step, the child executes the program at the path
/suid-root. This is in fact a link to the fusermount
binary. Because the child is in the initial user namespace and the
fusermount binary is set-user-ID-root, the child gains root
privileges.
However, before the fusermount binary is loaded, the kernel
first loads its ELF interpreter, the file at the path
/lib64/ld-linux-x86-64.so.2. That, as it happens, is actually the
userns_exploit program. Thus, the userns_exploit program
is now executed for a second time (and the fusermount program is
never executed).
The second phase of the exploit has now begun. This instance of the
userns_exploit program recognizes that it has an effective user ID
of 0. However, the only files it can access are those inside the chroot
environment. But that is sufficient. The child can now change the ownership
of the file /lib64/ld-linux-x86-64.so.2 and turn on the file's
set-user-ID permission bit. That pathname is, of course, a link to the
userns_exploit binary. At this point, the child's work is
complete, and it terminates.
All of this time, the parent process has been sitting in the background
waiting for the userns_exploit binary to become a set-user-ID-root
program. That, of course, is what the child has just accomplished. So, at
this point, the parent now executes the userns_exploit program
outside the chroot environment. On this execution, the program is
supplied with a command-line argument.
The third and final phase of the exploit has now started. The
userns_exploit program determines that it has an effective user ID
of 0 and notes that it has a command-line argument. That latter fact
distinguishes this case from the second execution of the
userns_exploit and is a signal that this time the program is being
executed outside the chroot environment. All that the program now
needs to do is execute a shell; that shell will provide the user with full
root privileges on the system.
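The way one binary takes on three roles comes down to a two-input decision. The sketch below shows only that decision, following the description in the text; the enum and the select_phase() function are hypothetical and are not claimed to match the structure of userns_exploit.c.

```c
#include <sys/types.h>

/* Illustrative labels for the three roles described in the text. */
enum phase {
    PHASE_SETUP = 1,            /* unprivileged: first execution */
    PHASE_INSIDE_CHROOT = 2,    /* effective UID 0, no command-line argument */
    PHASE_OUTSIDE_CHROOT = 3    /* effective UID 0, with a command-line argument */
};

/* Hypothetical dispatcher keyed on effective user ID and argument count. */
static enum phase select_phase(uid_t euid, int argc)
{
    if (euid != 0)
        return PHASE_SETUP;
    return argc > 1 ? PHASE_OUTSIDE_CHROOT : PHASE_INSIDE_CHROOT;
}
```

Here argc > 1 stands in for "has a command-line argument"; the real program may make the distinction differently.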
Further requirements for a successful exploit
There are a few other steps that are necessary to successfully
accomplish the exploit. The userns_exploit program must be
statically linked. This is necessary so that, when executed as the dynamic linker
inside the chroot environment, the userns_exploit program does not
itself require a dynamic linker.
In addition, the value in the /proc/sys/fs/protected_hardlinks
file must be zero. The protected_hardlinks file was a feature that
was added in Linux 3.6 specifically to prevent
the types of exploit discussed in this article. If this file has the
value one, then only the owner of a file can create hard links to it. On a
vanilla kernel, protected_hardlinks unfortunately has the default
value zero, although some distributions provide kernels that change this
default.
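Checking the setting on a given system only requires reading one integer from that file. A minimal sketch follows; the function names are made up for illustration, and a result of -1 simply means that the value could not be read (for example, on a kernel that predates the feature).

```c
#include <stdio.h>

/* Hypothetical helper: parse one integer from an open stream; -1 on failure. */
static int read_sysctl_int(FILE *fp)
{
    int value;

    if (fp == NULL || fscanf(fp, "%d", &value) != 1)
        return -1;
    return value;
}

/* Returns the protected_hardlinks setting (0 or 1), or -1 if unreadable. */
static int protected_hardlinks_setting(void)
{
    FILE *fp = fopen("/proc/sys/fs/protected_hardlinks", "r");
    int value = read_sysctl_int(fp);

    if (fp != NULL)
        fclose(fp);
    return value;
}
```

A return value of one indicates that the hard-link restriction described above is in force.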
In the process of exploring this vulnerability, your editor
discovered that set-user-ID binaries built as hardened,
position-independent executables (PIE) cannot be used for this particular
attack. (Many of the set-user-ID-root binaries on his Fedora system were
hardened in this manner.) While PIE hardening thwarts this particular line of
attack, the chroot() technique described here can still be used to
exploit a set-user-ID-root binary in other ways. For example, the
binary can be placed in a suitably constructed chroot environment
that contains the genuine dynamic linker but a compromised libc.
Finally, user namespaces must of course be enabled on the system where
this exploit is to be tested, and the kernel version needs to be precisely
3.8. Earlier kernel versions did not allow unprivileged users to create
user namespaces, and later kernels will fix this bug, as described
below. The exploit is unlikely to be possible with distributor kernels:
because the Linux 3.8 kernel does not support the use of user namespaces
with various filesystems, including NFS and XFS, distributors are
unlikely to enable user namespaces in the kernels that they ship.
The fix
Once the problem was reported, Eric
Biederman considered two possible
solutions. The more complex solution is to create an association from a
process's fs_struct, the kernel data structure that records the
process's root directory, to a user namespace, and use that association to
set limitations around the use of chroot() in scenarios such as
the one described in this article. The alternative is the simple and
obviously safe solution: disallow the combination of CLONE_NEWUSER
and CLONE_FS in the clone() system call, make
CLONE_NEWUSER automatically imply CLONE_FS in the
unshare() system call, and disallow the use of setns() to
change a process's user namespace if the process is sharing
CLONE_FS-related attributes with another process.
Subsequently, Eric concluded
that the complex solution seemed to be unnecessary and would add a small
overhead to every call to fork(). He thus opted for the simple
solution: the Linux 3.9 kernel (and the 3.8.3 stable kernel) will disallow
the combination of CLONE_NEWUSER and CLONE_FS.
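The clone() part of that rule amounts to a single bit test. The following user-space sketch models it; it is not the kernel's code, clone_flags_allowed() is an invented name, and the fallback flag values are the ones used by the Linux headers. The unshare() and setns() parts of the fix are not modeled.

```c
#include <stdbool.h>

/* Fallback definitions in case <sched.h> has not provided these flags. */
#ifndef CLONE_FS
#define CLONE_FS      0x00000200UL
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000UL
#endif

/*
 * Hypothetical model of the rule: a flag set is rejected only when it
 * asks for both a new user namespace and shared filesystem attributes.
 */
static bool clone_flags_allowed(unsigned long flags)
{
    const unsigned long both = CLONE_NEWUSER | CLONE_FS;

    return (flags & both) != both;
}
```

Either flag on its own passes the check; only the combination used by the exploit is refused.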
User namespaces and security
As we noted in an earlier
article, Eric Biederman has put a lot of work into trying to ensure
that unprivileged users can create user namespaces without causing security
vulnerabilities. Nevertheless, a significant exploit was found soon after
the release of the first kernel version that allowed unprivileged processes
to create user namespaces. Another user namespace vulnerability that
potentially allowed unprivileged users to load arbitrary kernel modules was
also reported and fixed earlier this month. In addition, during
the discussion of the CLONE_NEWUSER|CLONE_FS issue,
Andy Lutomirski has hinted that there may
be another user namespaces vulnerability to be fixed.
Why is it that several security vulnerabilities have sprung from the
user namespaces implementation? The fundamental problem seems to be that
user namespaces and their interactions with other parts of the kernel are
rather complex—probably too complex for the few kernel developers
with a close interest to consider all of the possible security
implications. In addition, by making new functionality available to
unprivileged users, user namespaces expand the attack surface of the
kernel. Thus, it seems that as user namespaces come to be more widely
deployed, other security bugs such as these are likely to be
found. One hopes that they'll be found and fixed by the kernel developers
and white hat security experts, rather than found and exploited by black
hat attackers.
Updated on 22 February 2013 to clarify and correct some minor details of the
"simple and safe" solution under the heading, "The fix".
Page editor: Jonathan Corbet