Brief items
The current development kernel is 3.10-rc2,
released on May 20. Linus says:

"For being an -rc2, it's not unreasonably sized, but I did take a few
pulls that I wouldn't have taken later in the rc series. So it's not
exactly small either. We've got arch updates (PPC, MIPS, PA-RISC), we've
got driver fixes (net, gpu, target, xen), and we've got filesystem updates
(btrfs, ext4 and ceph - rbd)."
Stable updates: 3.9.3, 3.4.46, and
3.0.79 were released on May 19; 3.6.11.4 came out on May 20.
A new kernel tracing tool called "ktap" has
made its first release. "KTAP have
different design principles from Linux mainstream dynamic tracing language
in that it's based on bytecode, so it doesn't depend upon GCC, doesn't
require compiling a kernel module, safe to use in production environment,
fulfilling the embedded ecosystem's tracing needs." It's in an
early state; the project is looking for testers and contributors.
By Jonathan Corbet
May 22, 2013
As reported in our
Linux Storage,
Filesystem, and Memory Management Summit coverage, the decision was
made to merge the
zswap compressed swap
cache subsystem while holding off on the rather more complex "zcache"
subsystem. But conference
decisions can often run into difficulties during the implementation process; that has proved to be the case here.
Zswap developer Seth Jennings duly submitted
the code for consideration for the 3.11 development cycle. He quickly
ran into opposition from zcache developer Dan Magenheimer; Dan had agreed
with the merging of zswap in principle, but he expressed concerns that zswap may perform
poorly in some situations. According to Dan, it would be better to fix
these problems before merging the code:
I think the real challenge of zswap (or zcache) and the value to
distros and end users requires us to get this right BEFORE users
start filing bugs about performance weirdness. After which most
users and distros will simply default to 0% (i.e. turn zswap off)
because zswap unpredictably sometimes sucks.
The discussion went around in circles the way that in-kernel compression
discussions often do. In the end, though, the consensus among memory
management developers (but not Dan) was probably best summarized by Mel Gorman:
I think there is a lot of ugly in there and potential for weird
performance bugs. I ran out of beans complaining about different
parts during the review but fixing it out of tree or in staging
like it's been happening to date has clearly not worked out at all.
So the end result is likely to be that zswap will be merged for 3.11, but
with a number of warnings attached to it. Then, with luck, the increased
visibility of the code will motivate developers to prepare patches and
improve the code to a point where it is production-ready.
Kernel development news
By Jonathan Corbet
May 22, 2013
Once upon a time, usable tracing tools for Linux were few and far between.
Now, instead, there is a wealth of choices, including the in-kernel ftrace
facility, SystemTap, and the LTTng suite; Oracle also has
a
port of DTrace for its distribution, available to its paying customers.
On May 21, another alternative showed up in the form of the
ktap 0.1 release. Ktap does not offer any
major features that are not available from the other tracing tools, but
there may still be a place for it in the tracing ecosystem.
Ktap appears to be strongly oriented toward the needs of embedded users;
that has affected a number of the design decisions that have been made. At
the top of the list was the decision to embed a byte-code interpreter into
the kernel and compile tracing scripts for that interpreter. That is a big
difference from SystemTap, which, in its current implementation, compiles a
tracing script into a separate
module that must be loaded into the kernel. This difference matters
because an embedded target often will not have a full compiler toolchain
installed on it; even if the tools are available, compiling and linking a
module can be a slow process. Compiling a ktap script, instead, requires a
simple utility to produce byte code for the ktap kernel module.
That compiler implements a language that is based on Lua. It is C-like, but it is dynamically
typed, has a dictionary-like "table" type, and lacks arrays and pointers.
There is a simple function definition mechanism which can be used like
this:
function eventfun (e) {
	printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
}
The resulting function will, when called, output the current CPU number,
process ID, executing program name, and
the string representation of the passed-in event e. There is a
probe-placement function, so ktap could arrange to call the above function
on system call entry with:
kdebug.probe("tp:syscalls", eventfun)
A quick run on your editor's system produced a bunch of output like:
3 2745 Xorg sys_setitimer(which: 0, value: 7fff05967ec0, ovalue: 0)
3 2745 Xorg sys_setitimer -> 0x0
2 27467 as sys_mmap(addr: 0, len: 81000, prot: 3, flags: 22, fd: ffffffff, off: 0)
2 27467 as sys_mmap -> 0x2aaaab67c000
2 3402 gnome-shell sys_mmap(addr: 0, len: 97b, prot: 1, flags: 2, fd: 21, off: 0)
2 3402 gnome-shell sys_mmap -> 0x7f4ec4bfb000
There are various utility functions for generating timer requests, creating
histograms, and so on. So, for example, this script:
hist = {}

function eventfun (e) {
	if (e.sc_is_enter) {
		inplace_inc(hist, e.name)
	}
}

kdebug.probe("tp:syscalls", eventfun)
kdebug.probe_end(function () {
	histogram(hist)
})
is sufficient to generate a histogram of system calls over the period of
time from when it starts until when the user interrupts it. Your editor
ran it with a kernel build running and got output looking like this:
value ------------- Distribution ------------- count
sys_enter_open |@@@@@@@@ 587779
sys_enter_close |@@@@ 343728
sys_enter_newfstat |@@@@ 331459
sys_enter_read |@@@ 283217
sys_enter_mmap |@@@ 243458
sys_enter_ioctl |@@ 219364
sys_enter_munmap |@@ 165006
sys_enter_write |@ 128003
sys_enter_poll |@ 77311
sys_enter_recvfrom | 52898
The syntax for setting probe points closely matches that used by perf; probes
can be set on specific functions or tracepoints, for example. It is
possible to hook into the perf events mechanism to get other types of
hardware or software events, and memory breakpoints are supported. The
(sparse) documentation packaged with the code also suggests that ktap is
able to set user-space
probes, but none of the example scripts packaged with the tool demonstrate
that capability.
Ktap scripts can manipulate the return value of probed functions within the
kernel. There does not currently appear to be a way to manipulate
kernel-space data directly, but that could presumably be added (along with
lots of other features) in the future. What's there now is a proof of
concept as much as anything; it is a quick way to get some data out of the
kernel but does not offer a whole lot that is not available using the
existing ftrace interface.
For those who want to play with it, the first step is a simple:
git clone https://github.com/ktap/ktap.git
From there, building the code and running the sample scripts is a matter of
a few minutes of relatively painless work. There is the ktapvm
module, which must, naturally, be loaded into the kernel. That module
creates a special virtual file (ktap/ktapvm under the debugfs
root) that is used by the ktap binary to load and run compiled
scripts.
Ktap in its current form is limited, without a lot of exciting new
functionality. Even so, it seems to have generated a certain amount of
interest in the development community. Getting started with most tracing
tools usually seems to involve a fair amount of up-front learning; ktap,
perhaps, is a more approachable solution for a number of users. The whole
thing is about 10,000 lines of code; it shouldn't be hard for others to run
with and extend. If developers start to take the bait, interesting things
could happen with this project.
By Jonathan Corbet
May 21, 2013
Linux is generally considered to have one of the most fully featured and
fast networking stacks available. But there are always users who are not
happy with what's available and who want to replace it with something more
closely tuned for their specific needs. One such group consists of people
with extreme low latency requirements, where each incoming packet must be
responded to as quickly as possible. High-frequency trading systems fall
into this category, but there are others as well. This class of user is
sometimes tempted to short out the kernel's networking stack altogether in
favor of a purely user-space (or purely hardware-based) implementation, but
that has problems of its own. A relatively small patch to the networking
subsystem might just be able to remove that temptation for at least some of
these users.
Network interfaces, like most reasonable peripheral devices, are capable of
interrupting the CPU whenever a packet arrives. But even a moderately busy
interface can handle hundreds or thousands of packets per second;
per-packet interrupts would quickly overwhelm the processor with
interrupt-handling work, leaving little time for getting useful tasks
done. So most interface drivers will disable the per-packet interrupt when
the traffic level is high enough and,
with cooperation from the core networking stack, occasionally poll the
device for new packets. There are a number of advantages to doing things
this way: vast numbers of interrupts can be avoided, incoming packets can
be more efficiently processed in batches, and, if packets must be dropped
in response to load, they can be discarded in the interface before they
ever hit the network stack. Polling is thus a win for almost all
situations where there is any significant amount of traffic at all.
Extreme low-latency users see things differently, though. The time between
a packet's arrival and the next poll is just the sort of latency that they
are trying to avoid. Re-enabling interrupts is not a workable solution,
though; interrupts, too, are a source of latency. Thus the drive for
user-space solutions where an application can simply poll the interface for
new packets whenever it is prepared to handle new messages.
Eliezer Tamir has posted an alternative solution in the form of the low-latency Ethernet device polling patch
set. With this patch, an application can enable polling for new
packets directly in the device driver, with the result that those packets
will quickly find their way into the network stack.
The patch adds a new member to the net_device_ops structure:
int (*ndo_ll_poll)(struct napi_struct *dev);
This function should cause the driver to check the interface for new
packets and flush them into the network stack if they exist; it should not
block. The
return value is the number of packets it pushed into the stack, or zero if no
packets were available. Other return values include
LL_FLUSH_BUSY, indicating that ongoing activity prevented the
processing of packets (the inability to take a lock would be an example) or
LL_FLUSH_FAILED, indicating some sort of error. The latter value
will cause polling to stop; LL_FLUSH_BUSY, instead, appears to be
entirely ignored.
Within the networking stack, the ndo_ll_poll() function will be
called whenever polling the interface seems like the right thing to do.
One obvious case is in response to the poll() system call.
Sockets marked as non-blocking will only poll once; otherwise polling will
continue until some packets destined for the relevant socket find their way
into the networking stack, up
until the maximum time controlled by the ip_low_latency_poll
sysctl knob. The default value for that knob is zero (meaning that
the interface will only be polled once), but the "recommended
value" is 50µs. The end result is that, if unprocessed packets exist when
poll() is called (or arrive shortly thereafter), they will be
flushed into the stack and made
available immediately, with no need to wait for the stack itself to get
around to polling the interface.
Another patch in the series adds another call site in the TCP code. If a
read() is issued on an established TCP connection and no data is
ready for return to user space, the driver will be polled to see if some
data can be pushed into the system. So there is no need for a separate
poll() call to get polling on a TCP socket.
This patch set makes polling easy to use by applications; once it is
configured into the kernel, no application changes are needed at all. On
the other hand, the lack of application control means that every
poll() or TCP read() will go into the polling code and,
potentially, busy-wait for as long as the ip_low_latency_poll knob
allows. It is not hard to imagine that, on many latency-sensitive systems,
the hard response-time requirements really only apply to some connections,
while others have no such requirements. Polling on those less-stringent
sockets could, conceivably, create new latency problems on the sockets that
the user really cares about. So, while no reviewer has called for it yet,
it would not be surprising to see the addition of a setsockopt()
operation to enable or disable polling for specific sockets before this
code is merged.
It almost certainly will be merged at some point; networking maintainer
Dave Miller responded to an earlier posting
with "I just wanted to say that I like this work a lot."
There are still details to be worked out and, presumably, a few more rounds
of review to be done, so low-latency sockets may not be ready for the 3.11
merge window. But it would be surprising if this work took much longer
than that to get into the mainline kernel.
By Jake Edge
May 21, 2013
Local privilege escalations seem to be regularly found in the Linux kernel
these days,
but they usually aren't quite so old—more than two years since the release
of 2.6.37—or backported into
even earlier kernels. But CVE-2013-2094
is just that kind of bug, with a now-public exploit that apparently dates
back to 2010.
It (ab)uses the perf_event_open() system call, and the bug was
backported
to the 2.6.32 kernel used by Red Hat Enterprise Linux (and its clones:
CentOS, Oracle, and Scientific Linux). While local privilege escalations
are generally considered less worrisome on systems without untrusted users,
it is easy to forget that UIDs used by network-exposed services should also
qualify as untrusted—compromising a service, then using a local
privilege escalation, leads directly to root.
The bug was found by Tommi Rantala when running the Trinity fuzz tester and was fixed
in mid-April. At that time, it was not
recognized as a security problem; the release of an exploit in mid-May
certainly changed that. The exploit is dated 2010 and contains some
possibly "not
safe for
work" strings. Its author expressed
surprise
that it wasn't seen as a security problem when it was fixed. That alone is
an indication (if
one was needed) that people in various colored hats are scrutinizing kernel
commits—often in ways that the kernel developers are not.
The bug itself was introduced
in 2010, and made its first appearance in the 2.6.37 kernel in January
2011. It treated the 64-bit perf event ID differently in an
initialization routine (perf_swevent_init() where the ID was
sanity checked) and in the cleanup routine
(sw_perf_event_destroy()). In the former, it was treated as a
signed 32-bit integer, while in the latter as an unsigned 64-bit integer.
The difference may not seem hugely significant, but, as it turns out, it
can be used to effect a full compromise of the system by privilege
escalation to root.
The key piece of the puzzle is that the event ID is used as an array
index in the kernel. It is a value that is controlled by user space, as it is
passed in via the struct perf_event_attr argument to perf_event_open().
Because it is sanity checked as an int, the upper 32 bits of
event_id can be anything the attacker wants, so long as the lower
32 bits are considered valid. Because
event_id is used as a signed value, the test:
	if (event_id >= PERF_COUNT_SW_MAX)
		return -ENOENT;
doesn't exclude negative IDs, so anything with bit 31 set (i.e. 0x80000000) will be
considered valid.
The exploit code itself is rather terse, obfuscated, and hard to follow,
but Brad
Spengler has provided a detailed description
of the exploit on Reddit. Essentially, it uses a negative value for
the event ID to cause the kernel to change user-space memory. The exploit
uses mmap() to map an area of user-space memory that will be
targeted when the negative event ID is passed. It sets the mapped area to
zeroes, then calls
perf_event_open(), immediately followed by a close() on
the returned file descriptor. That triggers:
static_key_slow_dec(&perf_swevent_enabled[event_id]);
in the
sw_perf_event_destroy() function.
The code then looks for non-zero values in the mapped area, which can be
used (along with the event ID value and the size of the array elements) to
calculate the base address of the
perf_swevent_enabled array.
But that value is just a steppingstone toward the real goal. The exploit
gets the base address of the interrupt descriptor table (IDT) by using the
sidt assembly language instruction. From that, it targets the
overflow interrupt vector (0x4), using the increment in
perf_swevent_init():
static_key_slow_inc(&perf_swevent_enabled[event_id]);
By setting
event_id appropriately, it can turn the address of the
overflow interrupt handler into a user-space address.
The exploit arranges to mmap() the range of memory where the
clobbered interrupt
handler will point and fills it with a NOP sled followed by shellcode that
accomplishes its real task: finding the UID/GIDs and capabilities in
the credentials of the current process so that it can modify them to be UID
and GID 0 with full
capabilities. At that point, in what almost feels like an afterthought, it
spawns a shell—a root shell.
Depending on a number of architecture- or kernel-build-specific features
(not least x86 assembly) makes the exploit itself rather fragile. It also
contains bugs, according to Spengler. It doesn't work on 32-bit x86 systems
because it uses a hard-coded system call number (298) passed to
syscall(), which is different (336) for 32-bit x86 kernels. It
also won't work on Ubuntu systems because the size
of the perf_swevent_enabled array elements is different. The
following will thwart the existing exploit:
echo 2 > /proc/sys/kernel/perf_event_paranoid
But a minor change to the flags passed to
perf_event_open()
will still allow the privilege escalation. None of these is a real defense
of any sort
against the
vulnerability, though they do defend against this
specific exploit. Spengler's analysis has more details, both of the
existing exploit as well as ways to change it to work around its fragility.
The code uses syscall(), presumably because
perf_event_open() is
not (yet?)
available in the GNU C library, but calling it that way also evades any
argument checking done in the library. Any sanity checking done by
the library must also be done in the kernel, because using
syscall() bypasses the library's wrappers entirely. Kernels
configured without support for perf events
(i.e. CONFIG_PERF_EVENTS not set) are unaffected by the bug as
they lack the
system call entirely.
There are several kernel hardening techniques that would help to avoid this
kind of bug leading to system compromise. The grsecurity UDEREF mechanism would
prevent
the kernel from dereferencing the user-space addresses so that the
perf_swevent_enabled base address could not be calculated.
The PaX/grsecurity KERNEXEC
technique would prevent the user-space shellcode from executing. While these techniques can inhibit this kind of
bug from allowing privilege escalation, they impose costs
(e.g. performance) that have made them
unattractive to the mainline developers. Suitably configured kernels on
hardware that supports it would be protected by supervisor mode access prevention (SMAP) and
supervisor
mode execution protection (SMEP): the former would prevent access to
the user-space addresses much like UDEREF, while the latter would prevent
execution of user-space code as KERNEXEC does.
This is a fairly nasty hole in the kernel, in part because it has existed
for so long (and apparently been known by some, at least, for most of that
time). Local privilege escalations tend to be somewhat downplayed because
they require an untrusted local user, but web applications (in particular)
can often provide just such a user. Dave Jones's Trinity has clearly
shown its worth over the last few years, though he was not terribly
pleased
how long it took for fuzzing to find this bug.
Jones suspects there may be "more fruit on that branch
somewhere", so more and better fuzzing of the perf system calls (and
kernel as a whole) is
indicated. In addition, the exploit author at least suggests that he has
more exploits waiting in the wings (not necessarily in the perf
subsystem); it is quite likely that others do as well. Finding and fixing
these security holes is an important task; auditing the commit stream to
help ensure that these
kinds of problems aren't introduced in the first place would be quite useful.
One hopes that companies using Linux find a way to fund more work in this
area.
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.8.13-rt9.
(May 21, 2013)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Page editor: Jonathan Corbet