Kernel development

Brief items

Kernel release status

The current development kernel is 3.10-rc2, released on May 20. Linus says: "For being an -rc2, it's not unreasonably sized, but I did take a few pulls that I wouldn't have taken later in the rc series. So it's not exactly small either. We've got arch updates (PPC, MIPS, PA-RISC), we've got driver fixes (net, gpu, target, xen), and we've got filesystem updates (btrfs, ext4 and cepth - rbd)."

Stable updates: 3.9.3, 3.4.46, and 3.0.79 were released on May 19; 3.6.11.4 came out on May 20.

Ktap 0.1 released

A new kernel tracing tool called "ktap" has made its first release. "KTAP have different design principles from Linux mainstream dynamic tracing language in that it's based on bytecode, so it doesn't depend upon GCC, doesn't require compiling a kernel module, safe to use in production environment, fulfilling the embedded ecosystem's tracing needs." It's in an early state; the project is looking for testers and contributors.

Merging zswap

By Jonathan Corbet
May 22, 2013
As reported in our Linux Storage, Filesystem, and Memory Management Summit coverage, the decision was made to merge the zswap compressed swap cache subsystem while holding off on the rather more complex "zcache" subsystem. But conference decisions can often run into difficulties during the implementation process; that has proved to be the case here.

Zswap developer Seth Jennings duly submitted the code for consideration for the 3.11 development cycle. He quickly ran into opposition from zcache developer Dan Magenheimer; Dan had agreed with the merging of zswap in principle, but he expressed concerns that zswap may perform poorly in some situations. According to Dan, it would be better to fix these problems before merging the code:

I think the real challenge of zswap (or zcache) and the value to distros and end users requires us to get this right BEFORE users start filing bugs about performance weirdness. After which most users and distros will simply default to 0% (i.e. turn zswap off) because zswap unpredictably sometimes sucks.

The discussion went around in circles the way that in-kernel compression discussions often do. In the end, though, the consensus among memory management developers (but not Dan) was probably best summarized by Mel Gorman:

I think there is a lot of ugly in there and potential for weird performance bugs. I ran out of beans complaining about different parts during the review but fixing it out of tree or in staging like it's been happening to date has clearly not worked out at all.

So the end result is likely to be that zswap will be merged for 3.11, but with a number of warnings attached to it. Then, with luck, the increased visibility of the code will motivate developers to prepare patches and improve the code to a point where it is production-ready.

Kernel development news

Ktap — yet another kernel tracer

By Jonathan Corbet
May 22, 2013
Once upon a time, usable tracing tools for Linux were few and far between. Now, instead, there is a wealth of choices, including the in-kernel ftrace facility, SystemTap, and the LTTng suite; Oracle also has a port of DTrace for its distribution, available to its paying customers. On May 21, another alternative showed up in the form of the ktap 0.1 release. Ktap does not offer any major features that are not available from the other tracing tools, but there may still be a place for it in the tracing ecosystem.

Ktap appears to be strongly oriented toward the needs of embedded users; that has affected a number of the design decisions that have been made. At the top of the list was the decision to embed a byte-code interpreter into the kernel and compile tracing scripts for that interpreter. That is a big difference from SystemTap, which, in its current implementation, compiles a tracing script into a separate module that must be loaded into the kernel. This difference matters because an embedded target often will not have a full compiler toolchain installed on it; even if the tools are available, compiling and linking a module can be a slow process. Compiling a ktap script, instead, requires a simple utility to produce byte code for the ktap kernel module.

That compiler implements a language that is based on Lua. It is C-like, but it is dynamically typed, has a dictionary-like "table" type, and lacks arrays and pointers. There is a simple function definition mechanism which can be used like this:

    function eventfun (e) {
        printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
    }

The resulting function will, when called, output the current CPU number, process ID, executing program name, and the string representation of the passed-in event e. There is a probe-placement function, so ktap could arrange to call the above function on system call entry with:

    kdebug.probe("tp:syscalls", eventfun)

A quick run on your editor's system produced a bunch of output like:

    3 2745	Xorg	sys_setitimer(which: 0, value: 7fff05967ec0, ovalue: 0)
    3 2745	Xorg	sys_setitimer -> 0x0
    2 27467	as	sys_mmap(addr: 0, len: 81000, prot: 3, flags: 22, fd: ffffffff, off: 0)
    2 27467	as	sys_mmap -> 0x2aaaab67c000
    2 3402	gnome-shell	sys_mmap(addr: 0, len: 97b, prot: 1, flags: 2, fd: 21, off: 0)
    2 3402	gnome-shell	sys_mmap -> 0x7f4ec4bfb000

There are various utility functions for generating timer requests, creating histograms, and so on. So, for example, this script:

    hist = {}

    function eventfun (e) {
        if (e.sc_is_enter) {
            inplace_inc(hist, e.name)
        }
    }

    kdebug.probe("tp:syscalls", eventfun)

    kdebug.probe_end(function () {
        histogram(hist)
    })

is sufficient to generate a histogram of system calls over the period of time from when it starts until when the user interrupts it. Your editor ran it with a kernel build running and got output looking like this:

                value ------------- Distribution ------------- count
        sys_enter_open |@@@@@@@@                               587779    
       sys_enter_close |@@@@                                   343728    
    sys_enter_newfstat |@@@@                                   331459    
        sys_enter_read |@@@                                    283217    
        sys_enter_mmap |@@@                                    243458    
       sys_enter_ioctl |@@                                     219364    
      sys_enter_munmap |@@                                     165006    
       sys_enter_write |@                                      128003    
        sys_enter_poll |@                                      77311     
    sys_enter_recvfrom |                                       52898     

The syntax for setting probe points closely matches that used by perf; probes can be set on specific functions or tracepoints, for example. It is possible to hook into the perf events mechanism to get other types of hardware or software events, and memory breakpoints are supported. The (sparse) documentation packaged with the code also suggests that ktap is able to set user-space probes, but none of the example scripts packaged with the tool demonstrate that capability.

Ktap scripts can manipulate the return value of probed functions within the kernel. There does not currently appear to be a way to manipulate kernel-space data directly, but that could presumably be added (along with lots of other features) in the future. What's there now is a proof of concept as much as anything; it is a quick way to get some data out of the kernel but does not offer a whole lot that is not available using the existing ftrace interface.

For those who want to play with it, the first step is a simple:

    git clone https://github.com/ktap/ktap.git

From there, building the code and running the sample scripts is a matter of a few minutes of relatively painless work. There is the ktapvm module, which must, naturally, be loaded into the kernel. That module creates a special virtual file (ktap/ktapvm under the debugfs root) that is used by the ktap binary to load and run compiled scripts.

Ktap in its current form is limited, without a lot of exciting new functionality. Even so, it seems to have generated a certain amount of interest in the development community. Getting started with most tracing tools usually seems to involve a fair amount of up-front learning; ktap, perhaps, is a more approachable solution for a number of users. The whole thing is about 10,000 lines of code; it shouldn't be hard for others to run with and extend. If developers start to take the bait, interesting things could happen with this project.

Low-latency Ethernet device polling

By Jonathan Corbet
May 21, 2013
Linux is generally considered to have one of the most fully featured and fast networking stacks available. But there are always users who are not happy with what's available and who want to replace it with something more closely tuned for their specific needs. One such group consists of people with extreme low latency requirements, where each incoming packet must be responded to as quickly as possible. High-frequency trading systems fall into this category, but there are others as well. This class of user is sometimes tempted to short out the kernel's networking stack altogether in favor of a purely user-space (or purely hardware-based) implementation, but that has problems of its own. A relatively small patch to the networking subsystem might just be able to remove that temptation for at least some of these users.

Network interfaces, like most reasonable peripheral devices, are capable of interrupting the CPU whenever a packet arrives. But even a moderately busy interface can handle hundreds or thousands of packets per second; per-packet interrupts would quickly overwhelm the processor with interrupt-handling work, leaving little time for getting useful tasks done. So most interface drivers will disable the per-packet interrupt when the traffic level is high enough and, with cooperation from the core networking stack, occasionally poll the device for new packets. There are a number of advantages to doing things this way: vast numbers of interrupts can be avoided, incoming packets can be more efficiently processed in batches, and, if packets must be dropped in response to load, they can be discarded in the interface before they ever hit the network stack. Polling is thus a win for almost all situations where there is any significant amount of traffic at all.

Extreme low-latency users see things differently, though. The time between a packet's arrival and the next poll is just the sort of latency that they are trying to avoid. Re-enabling interrupts is not a workable solution either, since interrupts, too, are a source of latency. Thus the drive for user-space solutions where an application can simply poll the interface for new packets whenever it is prepared to handle new messages.

Eliezer Tamir has posted an alternative solution in the form of the low-latency Ethernet device polling patch set. With this patch, an application can enable polling for new packets directly in the device driver, with the result that those packets will quickly find their way into the network stack.

The patch adds a new member to the net_device_ops structure:

    int (*ndo_ll_poll)(struct napi_struct *dev);

This function should cause the driver to check the interface for new packets and flush them into the network stack if they exist; it should not block. The return value is the number of packets it pushed into the stack, or zero if no packets were available. Other return values include LL_FLUSH_BUSY, indicating that ongoing activity prevented the processing of packets (the inability to take a lock would be an example) or LL_FLUSH_FAILED, indicating some sort of error. The latter value will cause polling to stop; LL_FLUSH_BUSY, instead, appears to be entirely ignored.

Within the networking stack, the ndo_ll_poll() function will be called whenever polling the interface seems like the right thing to do. One obvious case is in response to the poll() system call. Sockets marked as non-blocking will only poll once; otherwise polling will continue until some packets destined for the relevant socket find their way into the networking stack, up until the maximum time controlled by the ip_low_latency_poll sysctl knob. The default value for that knob is zero (meaning that the interface will only be polled once), but the "recommended value" is 50µs. The end result is that, if unprocessed packets exist when poll() is called (or arrive shortly thereafter), they will be flushed into the stack and made available immediately, with no need to wait for the stack itself to get around to polling the interface.
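The polling behavior described above can be approximated in user space. What follows is a minimal sketch, not code from the patch set: the DEMO_ constants, the callback types, the fake device, and the fake clock are all hypothetical names invented for illustration, and the loop stops when any packet arrives rather than checking which socket it is destined for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for LL_FLUSH_FAILED and LL_FLUSH_BUSY; the
 * patch defines its own values. */
#define DEMO_FLUSH_FAILED (-1)
#define DEMO_FLUSH_BUSY   (-2)

/* Hypothetical driver callback: packets pushed, zero, or a code above. */
typedef int (*demo_poll_fn)(void *dev);
/* Hypothetical clock callback returning microseconds. */
typedef uint64_t (*demo_clock_fn)(void);

/* Poll until packets arrive, the driver fails, or the time budget runs
 * out. A non-blocking caller or a zero budget polls exactly once. */
static bool demo_busy_poll(demo_poll_fn poll_fn, void *dev,
                           demo_clock_fn now_us, uint64_t budget_us,
                           bool nonblocking)
{
    uint64_t deadline = now_us() + budget_us;

    for (;;) {
        int rc = poll_fn(dev);

        if (rc > 0)
            return true;            /* packets reached the stack */
        if (rc == DEMO_FLUSH_FAILED)
            return false;           /* an error stops polling */
        /* Zero and DEMO_FLUSH_BUSY are treated alike: try again. */
        if (nonblocking || now_us() >= deadline)
            return false;
    }
}

/* Fake device that delivers a packet on its Nth poll. */
struct demo_dev {
    int polls_until_packet;
    int polls_done;
};

static int demo_dev_poll(void *p)
{
    struct demo_dev *d = p;

    d->polls_done++;
    return d->polls_done >= d->polls_until_packet ? 1 : 0;
}

/* Fake clock that advances by one microsecond each time it is read. */
static uint64_t demo_fake_time;

static uint64_t demo_fake_clock(void)
{
    return demo_fake_time++;
}
```

With a 50µs budget, a device that produces a packet on its third poll is polled three times; with a zero budget or a non-blocking caller it is polled only once and the caller falls back to the normal path.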

Another patch in the series adds another call site in the TCP code. If a read() is issued on an established TCP connection and no data is ready for return to user space, the driver will be polled to see if some data can be pushed into the system. So there is no need for a separate poll() call to get polling on a TCP socket.

This patch set makes polling easy to use by applications; once it is configured into the kernel, no application changes are needed at all. On the other hand, the lack of application control means that every poll() or TCP read() will go into the polling code and, potentially, busy-wait for as long as the ip_low_latency_poll knob allows. It is not hard to imagine that, on many latency-sensitive systems, the hard response-time requirements really only apply to some connections, while others have no such requirements. Polling on those less-stringent sockets could, conceivably, create new latency problems on the sockets that the user really cares about. So, while no reviewer has called for it yet, it would not be surprising to see the addition of a setsockopt() operation to enable or disable polling for specific sockets before this code is merged.

It almost certainly will be merged at some point; networking maintainer Dave Miller responded to an earlier posting with "I just wanted to say that I like this work a lot." There are still details to be worked out and, presumably, a few more rounds of review to be done, so low-latency sockets may not be ready for the 3.11 merge window. But it would be surprising if this work took much longer than that to get into the mainline kernel.

An unexpected perf feature

By Jake Edge
May 21, 2013

Local privilege escalations seem to be regularly found in the Linux kernel these days, but they usually aren't quite so old—more than two years since the release of 2.6.37—or backported into even earlier kernels. But CVE-2013-2094 is just that kind of bug, with a now-public exploit that apparently dates back to 2010. It (ab)uses the perf_event_open() system call, and the bug was backported to the 2.6.32 kernel used by Red Hat Enterprise Linux (and its clones: CentOS, Oracle, and Scientific Linux). While local privilege escalations are generally considered less worrisome on systems without untrusted users, it is easy to forget that UIDs used by network-exposed services should also qualify as untrusted—compromising a service, then using a local privilege escalation, leads directly to root.

The bug was found by Tommi Rantala when running the Trinity fuzz tester and was fixed in mid-April. At that time, it was not recognized as a security problem; the release of an exploit in mid-May certainly changed that. The exploit is dated 2010 and contains some possibly "not safe for work" strings. Its author expressed surprise that it wasn't seen as a security problem when it was fixed. That alone is an indication (if one was needed) that people in various colored hats are scrutinizing kernel commits—often in ways that the kernel developers are not.

The bug itself was introduced in 2010, and made its first appearance in the 2.6.37 kernel in January 2011. It treated the 64-bit perf event ID differently in an initialization routine (perf_swevent_init() where the ID was sanity checked) and in the cleanup routine (sw_perf_event_destroy()). In the former, it was treated as a signed 32-bit integer, while in the latter as an unsigned 64-bit integer. The difference may not seem hugely significant, but, as it turns out, it can be used to effect a full compromise of the system by privilege escalation to root.

The key piece of the puzzle is that the event ID is used as an array index in the kernel. It is a value that is controlled by user space, as it is passed in via the struct perf_event_attr argument to perf_event_open(). Because it is sanity checked as an int, the upper 32 bits of event_id can be anything the attacker wants, so long as the lower 32 bits are considered valid. Because event_id is used as a signed value, the test:

    if (event_id >= PERF_COUNT_SW_MAX)
            return -ENOENT;

doesn't exclude negative IDs, so anything with bit 31 set (i.e. 0x80000000) will be considered valid.
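The effect of that signed comparison is easy to reproduce outside the kernel. The following is a minimal user-space sketch, not the kernel's code: DEMO_SW_MAX is a hypothetical stand-in for PERF_COUNT_SW_MAX, and both function names are invented for illustration. The second function shows one plausible way to write the check against the full unsigned value; it is not presented as the actual upstream fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PERF_COUNT_SW_MAX; the real value depends
 * on the kernel version. */
#define DEMO_SW_MAX 9

/* Flawed check: the 64-bit ID is narrowed to a signed 32-bit int and
 * only the upper bound is tested, so negative values slip through.
 * The narrowing is implementation-defined in C; GCC and Clang keep the
 * low 32 bits. */
static bool buggy_id_is_valid(uint64_t config)
{
    int event_id = (int)config;

    return !(event_id >= DEMO_SW_MAX);
}

/* Stricter check: compare the full unsigned 64-bit value, which leaves
 * no room for negative IDs or stray upper bits. */
static bool strict_id_is_valid(uint64_t config)
{
    return config < DEMO_SW_MAX;
}
```

An ID of 0x80000000 passes the first check because it becomes a large negative number once narrowed, and 0x100000001 passes because only its lower 32 bits are examined; the second check rejects both.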

The exploit code itself is rather terse, obfuscated, and hard to follow, but Brad Spengler has provided a detailed description of the exploit on Reddit. Essentially, it uses a negative value for the event ID to cause the kernel to change user-space memory. The exploit uses mmap() to map an area of user-space memory that will be targeted when the negative event ID is passed. It sets the mapped area to zeroes, then calls perf_event_open(), immediately followed by a close() on the returned file descriptor. That triggers:

    static_key_slow_dec(&perf_swevent_enabled[event_id]);

in the sw_perf_event_destroy() function. The code then looks for non-zero values in the mapped area, which can be used (along with the event ID value and the size of the array elements) to calculate the base address of the perf_swevent_enabled array.

But that value is just a steppingstone toward the real goal. The exploit gets the base address of the interrupt descriptor table (IDT) by using the sidt assembly language instruction. From that, it targets the overflow interrupt vector (0x4), using the increment in perf_swevent_init():

    static_key_slow_inc(&perf_swevent_enabled[event_id]);

By setting event_id appropriately, it can turn the address of the overflow interrupt handler into a user-space address.

The exploit arranges to mmap() the range of memory where the clobbered interrupt handler will point and fills it with a NOP sled followed by shellcode that accomplishes its real task: finding the UID/GIDs and capabilities in the credentials of the current process so that it can modify them to be UID and GID 0 with full capabilities. At that point, in what almost feels like an afterthought, it spawns a shell—a root shell.

Depending on a number of architecture- or kernel-build-specific features (not least x86 assembly) makes the exploit itself rather fragile. It also contains bugs, according to Spengler. It doesn't work on 32-bit x86 systems because it uses a hard-coded system call number (298) passed to syscall(), which is different (336) for 32-bit x86 kernels. It also won't work on Ubuntu systems because the size of the perf_swevent_enabled array elements is different. The following will thwart the existing exploit:

    echo 2 > /proc/sys/kernel/perf_event_paranoid

But a minor change to the flags passed to perf_event_open() will still allow the privilege escalation. None of these is a real defense of any sort against the vulnerability, though they do defend against this specific exploit. Spengler's analysis has more details, both of the existing exploit as well as ways to change it to work around its fragility.

The code uses syscall(), presumably because perf_event_open() is not (yet?) available in the GNU C library, but it could also be done to evade any argument checks done in the library. Any sanity checking done by the library must also be done in the kernel, because using syscall() can avoid the usual system call path. Kernels configured without support for perf events (i.e. CONFIG_PERF_EVENTS not set) are unaffected by the bug as they lack the system call entirely.
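For readers who have not used it, syscall() takes a system call number followed by that call's arguments and traps directly into the kernel. The sketch below illustrates the mechanism with the harmless getpid call rather than perf_event_open(); the raw_getpid() name is invented for this example, and the code assumes a Linux system with the GNU C library.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for the process ID through the generic syscall()
 * entry point instead of the getpid() library wrapper. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both paths end up in the same kernel code, so raw_getpid() and getpid() return the same value; the only thing bypassed is whatever the library wrapper does before making the call.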

There are several kernel hardening techniques that would help to avoid this kind of bug leading to system compromise. The grsecurity UDEREF mechanism would prevent the kernel from dereferencing the user-space addresses so that the perf_swevent_enabled base address could not be calculated. The PaX/grsecurity KERNEXEC technique would prevent the user-space shellcode from executing. While these techniques can inhibit this kind of bug from allowing privilege escalation, they impose costs (e.g. performance) that have made them unattractive to the mainline developers. Suitably configured kernels on hardware that supports it would be protected by supervisor mode access prevention (SMAP) and supervisor mode execution protection (SMEP); the former would prevent access to the user-space addresses much like UDEREF, while the latter would prevent execution of user-space code as does KERNEXEC.

This is a fairly nasty hole in the kernel, in part because it has existed for so long (and apparently been known by some, at least, for most of that time). Local privilege escalations tend to be somewhat downplayed because they require an untrusted local user, but web applications (in particular) can often provide just such a user. Dave Jones's Trinity has clearly shown its worth over the last few years, though he was not terribly pleased with how long it took for fuzzing to find this bug.

Jones suspects there may be "more fruit on that branch somewhere", so more and better fuzzing of the perf system calls (and kernel as a whole) is indicated. In addition, the exploit author at least suggests that he has more exploits waiting in the wings (not necessarily in the perf subsystem); it is quite likely that others do as well. Finding and fixing these security holes is an important task; auditing the commit stream to help ensure that these kinds of problems aren't introduced in the first place would be quite useful. One hopes that companies using Linux find a way to fund more work in this area.

Patches and updates

Kernel trees

  • Sebastian Andrzej Siewior: 3.8.13-rt9 (May 21, 2013)

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds