Kernel development

Brief items

Kernel release status

The current development kernel is 3.15-rc7, released on May 25. "It's just a few days after -rc6, but as expected, there were some pending stuff for when I got back home, so you should think of this as being the 'normal' release, and rc6 just having been oddly delayed by my travel."

Stable updates: no stable updates have been released in the last week. The 3.14.5 and 3.10.41 updates are in the review process as of this writing; they can be expected on or after May 31. 3.12.21 is also in review, with an expected release on or after June 2.

Expanding the kernel stack

By Jonathan Corbet
May 29, 2014
Every process in the system occupies a certain amount of memory just by existing. Though it may seem small, one of the more important pieces of memory required for each process is a place to put the kernel stack. Since every process could conceivably be running in the kernel at the same time, each must have its own kernel stack area. If there are a lot of processes in the system, the space taken for kernel stacks can add up; the fact that the stack must be physically contiguous can stress the memory management subsystem as well. These concerns have always provided a strong motivation to keep the size of the kernel stack small.

For most of the history of Linux, on most architectures, the kernel stack has been put into an 8KB allocation — two physical pages. As recently as 2008 some developers were trying to shrink the stack to 4KB, but that effort eventually proved to be unrealistic. Modern kernels can end up creating surprisingly deep call chains that just do not fit into a 4KB stack.

Increasingly, it seems, those call chains don't even fit into an 8KB stack on x86-64 systems. Recently, Minchan Kim tracked down a crash that turned out to be a stack overflow; he responded by proposing that it was time to double the stack size on x86-64 to 16KB. Such proposals have seen resistance before, and that happened this time around as well; Alan Cox argued that the solution is to be found elsewhere. But he seems to be nearly alone in that point of view.

Dave Chinner often has to deal with stack overflow problems, since they often occur with the XFS filesystem, which happens to be a bit more stack-hungry than others. He was quite supportive of this change:

8k stacks were never large enough to fit the linux IO architecture on x86-64, but nobody outside filesystem and IO developers has been willing to accept that argument as valid, despite regular stack overruns and filesystem having to add workaround after workaround to prevent stack overruns.

Linus was unconvinced at the outset, and he made it clear that work on reducing the kernel's stack footprint needs to continue. But Linus, too, seems to have come around to the idea that playing "whack-a-stack" is not going to be enough to solve the problem in a reliable way:

[S]o while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over. The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case.

Linus has also, unsurprisingly, made it clear that he is not interested in changing the stack size in the 3.15 kernel. But the 3.16 merge window can be expected to open in the near future; at that point, we may well see this patch go in as one of the first changes.

Kernel development news

Who audits the audit code?

By Jonathan Corbet
May 29, 2014
The Linux audit subsystem is not one of the best-loved parts of the kernel. It allows the creation of a log stream documenting specific system events — system calls, modifications to specific files, actions by processes with certain user IDs, etc. For some, it is an ideal way to get a handle on what is being done on the system and, in particular, to satisfy various requirements for security certifications (Common Criteria, for example). For others, it is an ugly and invasive addition to the kernel that adds maintenance and runtime overhead without adding useful functionality. More recently, though, it seems that audit adds some security holes of its own. But the real problem, perhaps, is that almost nobody actually looks at this code, so bugs can lurk for a long time.

The system call auditing mechanism creates audit log entries in response to system calls; the system administrator can load rules specifying which system calls are to be logged. These rules can include various tests on system call parameters, but there is also a simple bitmask, indexed by system call number, specifying which calls might be of interest. One of the first things done by the audit code is to check the appropriate bit for the current system call to see if it is set; if it is not, there is no auditing work to be done.

Philipp Kern recently noticed a little problem with how that code works with the x32 ABI. When code running under that ABI invokes a system call, it does not use the normal system call numbers defined by the x86 architecture; instead, x32 system calls (which require compatibility handling for some parameters) are marked by setting an additional bit (0x40000000) in that number. The audit code fails to remove that bit before checking the system call number in its bitmask; as one might imagine, the results are not as one might wish. Philipp included a patch to strip out the x32 bit, but it turns out that the problem is a bit bigger than that.

Andy Lutomirski, in looking at Philipp's patch, realized that the code wasn't just failing to strip out one bit; there are, in fact, no bounds checks on the system call number at all. User space can pass in any system call number it wants, and the kernel will use that number to index into its bitmask array; the result for a sufficiently large system call number is a predictable kernel oops. Andy also suggested that this failure could be used to determine the value of specific bits in kernel space, leading to an information-disclosure vulnerability.
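The shape of the check, and of the two fixes discussed above, can be illustrated with a small user-space sketch. This is not the kernel's audit code: the structure, the function names, and the table size (`SKETCH_NR_SYSCALLS`) are all invented for illustration. Only the 0x40000000 marker bit comes from the description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and sizes for illustration; not the kernel's own. */
#define SKETCH_NR_SYSCALLS 512u
#define X32_SYSCALL_BIT    0x40000000u  /* x32 marker bit named in the text */
#define BITS_PER_WORD      (sizeof(uint32_t) * CHAR_BIT)
#define MASK_WORDS         (SKETCH_NR_SYSCALLS / BITS_PER_WORD)

/* One bit per system call number: set means "rules may care about this". */
struct syscall_mask {
    uint32_t bits[MASK_WORDS];
};

/* Mark a system call as interesting (trusted caller, so just assert). */
static void mask_set(struct syscall_mask *m, unsigned int nr)
{
    assert(nr < SKETCH_NR_SYSCALLS);
    m->bits[nr / BITS_PER_WORD] |= UINT32_C(1) << (nr % BITS_PER_WORD);
}

/* Test the bit for a number that arrives from user space, untrusted. */
static bool mask_wants(const struct syscall_mask *m, unsigned int nr)
{
    nr &= ~X32_SYSCALL_BIT;         /* fix 1: strip the x32 marker bit */
    if (nr >= SKETCH_NR_SYSCALLS)   /* fix 2: refuse out-of-range numbers */
        return false;
    return (m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1u;
}
```

Without the range test, a large `nr` would index far past the end of `bits`, which is the out-of-bounds read described above.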

Andy submitted a patch to fix this particular problem, but he didn't stop there. He has come to the conclusion that the audit subsystem is beyond repair, so his patch marks the whole thing as being broken, making it generally inaccessible. He cited a number of problems beyond this security issue: it hurts performance even when it is not being used, it is not (in his mind) reliable, it has problems with various architectures, and "its approach to freeing memory is terrifying". All told, Andy said, we're better off without it:

In summary, the code is a giant mess. The way it works is nearly incomprehensible. It contains at least one severe bug. I'd love to see it fixed, but for now, distributions seem to think that enabling CONFIG_AUDITSYSCALL is a reasonable thing to do, and I'd argue that it's actually a terrible choice for anyone who doesn't actually need syscall audit rules. And I don't know who needs these things.

It is unsurprising that Eric Paris, who maintains the audit code, disagrees with this assessment. His point of view is that this is just another bug in need of fixing; it does not indicate any systemic problem with the audit code.

It is telling, though, that this particular vulnerability has existed in the audit subsystem almost since its inception. The audit code receives little in the way of review; most kernel developers simply turn it off for their own kernels and look the other way. But this subsystem is just the sort of thing that distributors are almost required to enable in their kernels; some users will want it, so they have to turn it on for everybody. As a result, almost all systems out there have audit enabled (look for a running kauditd thread), even though few of them are using it. These systems take a performance penalty just for having audit enabled, and they are vulnerable to any issues that may be found in the audit code.

If audit were to be implemented today, the developer involved would have to give some serious thought, at least, to using the tracing mechanism. It already has hooks applied in all of the right places, but those hooks have (almost) zero overhead when they are not enabled. Tracing has its own filtering mechanism built in; the addition of BPF-based filters will make that feature more capable and faster as well. In a sense, the audit subsystem contains yet another kernel-based virtual machine that makes decisions about which events to log; using the tracing infrastructure would allow the removal of that code and a consolidation to a single virtual machine that is more widely maintained and reviewed.

The audit system we have, though, predates the tracing subsystem, so it could not have been based on tracing. Replacing it without breaking users would not be a trivial job, even in the absence of snags that have been glossed over in the above paragraph (and such snags certainly exist). So we are likely stuck with the current audit subsystem (which will certainly not be marked "broken" in the mainline kernel) for the foreseeable future. Hopefully it will receive some auditing of its own just in case there are more old surprises lurking therein.

Seccomp filters for multi-threaded programs

By Jonathan Corbet
May 29, 2014
The secure computing ("seccomp") mechanism helps in the sandboxing of processes by restricting access to system calls. Seccomp works by attaching one or more programs to a process; those programs, written in Berkeley packet filter (BPF) byte code, are invoked for every system call made by the affected process. The BPF filter programs have access to the system call number and arguments; each filter has the option of denying the system call. Seccomp filters can thus restrict access to specific system calls, or, for example, only allow write() to be called on specific file descriptors. This mechanism works well as far as it goes, but it was not designed for use with multi-threaded programs. A set of proposed changes should close that particular functionality gap in the near future, though.

In current kernels, a process can apply a filter program to itself with the prctl() system call:

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, filter);

Where filter is a pointer to a sock_fprog structure containing the BPF program to be applied. Multiple programs can be added with multiple prctl() calls; each will be executed in sequence and any can reject a system call. There is no mechanism for removing filters once they have been applied to a process. Adding filters is normally a privileged operation; otherwise there is a real risk of privilege escalation via setuid programs that do not expect some operations to be denied (see this old sendmail vulnerability for an example). But any process may set filters on itself if it has first called:

    prctl(PR_SET_NO_NEW_PRIVS, 1);

to disable the addition of any privileges to the process. In particular, a process marked as "no new privileges" cannot gain capabilities or access to a different user ID by running setuid or setgid programs.
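As a rough illustration of how the two calls fit together, the following sketch installs a filter that makes getppid() fail with EPERM while allowing everything else. The function names (`deny_getppid()`, `try_getppid()`) and the choice of getppid() are arbitrary. A production filter would also verify the `arch` field of `struct seccomp_data` before trusting the system call number; that check is left out here for brevity.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/*
 * Sketch: make getppid() fail with EPERM for the calling thread.
 * Returns 0 on success, -1 (with errno set) on failure.
 * NOTE: the architecture check a real filter needs is omitted.
 */
static int deny_getppid(void)
{
    struct sock_filter insns[] = {
        /* Load the system call number into the accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* getppid: fall through to the EPERM return; else skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Unprivileged processes must give up new privileges first. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Issue the raw system call so the filter's verdict is visible. */
static long try_getppid(void)
{
    return syscall(SYS_getppid);
}
```

Once `deny_getppid()` has returned zero, `try_getppid()` should return -1 with errno set to EPERM. The filter cannot be removed again, so this is best tried in a throwaway process.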

What is lacking in the current interface is any way for a process to apply filters to a different process (or thread). There does not seem to be a use case for the ability to add filters to arbitrary processes; among other things, trying to contain a program that is already off and running would be a recipe for unpleasant race conditions. But it seems that there is value in allowing a thread to apply filters to its sibling threads. In the absence of this ability, it can be hard to ensure that a seccomp filter applies to all threads running as part of a process. Threads inherit their parent's filters when they are created, but any threads created before the filters are applied will remain uncontained. It may not always be practical to set up the filters before any threads are created, so the ability to attach them to threads after creation is a useful way to ensure that no part of a program escapes filtering.

Adding that ability is the object of this patch set from Kees Cook. All Kees really needed to do was to add an "apply this filter to all threads" flag to the PR_SET_SECCOMP operation, but, as so often seems to be the case, that operation was defined without the ability to pass in additional flags to modify its behavior. So, instead, Kees has added a new operation:

    prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags, filter);

If flags is zero, this operation behaves just like the PR_SET_SECCOMP example above; it attaches filter to the calling process. But if the SECCOMP_FILTER_TSYNC flag is set, the given filter (along with any other filters already applied to the calling process) will be applied to all threads in the process's thread group, thus ensuring that all threads are running with the same set of filters.

There is one other new operation:

    prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0);

This one will apply the calling process's filters to all other threads without making any changes to the filters themselves.

In either case, other threads will only have their filtering changed if whatever filter they currently have applied is an "ancestor" of the filters running on the calling process. Essentially, any filters applied to the target thread must also have been applied to the calling thread; any thread that has a totally unrelated filter will not have its filtering changed. If a thread is not running with a filter at all, it will be put into the seccomp mode and the filters will be applied. Also, if the calling thread has the "no new privileges" mode set, that mode will be set on all other threads as well.
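The "ancestor" rule can be modeled in a few lines. The sketch below assumes that each filter records the filter that was in place before it (the `prev` pointer), so that a thread's filters form a chain. The structure and the `can_sync()` helper are hypothetical and are not taken from the patch set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each filter points at the one installed before it. */
struct filter {
    const struct filter *prev;
};

/*
 * May a thread whose newest filter is "target" be synchronized to the
 * caller's chain? Yes if it has no filter at all, or if its newest
 * filter appears somewhere in the caller's chain.
 */
static bool can_sync(const struct filter *target, const struct filter *caller)
{
    if (target == NULL)
        return true;            /* unfiltered threads always qualify */
    for (; caller != NULL; caller = caller->prev)
        if (caller == target)
            return true;        /* target is an ancestor (or the same) */
    return false;               /* unrelated filter: leave it alone */
}
```

Under this model, a thread that installed its own filter after diverging from the caller is left untouched, since its newest filter never appears in the caller's chain.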

This is the fifth version of this patch set; the previous attempts needed work in response to locking and other issues. Unless another problem turns up, this code should be about ready for merging. There does not appear to be any opposition to the concept, so this feature could find its way into the mainline as early as 3.16.

Debugging ARM kernels using fast interrupts

May 29, 2014

This article was contributed by Daniel Thompson

Recently things have been pretty quiet for the interactive kernel debugging tools, with kgdb and kdb combined receiving only four patches in the last year. However, activity has started to pick up as new work inspired by Android's out-of-tree fiq_debugger has been posted for consideration. One of the key features proposed increases the robustness of kdb and kgdb by making it much harder for bugs in the system under test to prevent the user from invoking the debugger.

Both kgdb and kdb have been included in the kernel for a long time. Kgdb, which is a debug stub that allows another machine to connect a source-level debugger over a serial link, was merged in 2.6.26, while kdb, after a significant rewrite, was merged into 2.6.37. The rewrite allowed kdb to reuse kgdb's breakpoint and polled I/O infrastructure in order to implement a machine-level kernel debugger that runs entirely on the machine being debugged.

On a PC, the main distinction between kdb and kgdb is that kdb can be operated from the PC's own keyboard and display. This difference is less obvious on embedded systems, which seldom have their own keyboard. However, the property is retained: kdb is self-hosting, requiring only a terminal emulator, while kgdb requires a machine loaded up with the developer tools and the corresponding vmlinux file.

Both of these debug tools share common infrastructure and they also share a limitation: there are circumstances where other parts of the kernel can mask interrupts, including the one from the serial port, making it impossible for the user to manually stop the machine to debug it. When that happens, the request to stop the system never makes it from the serial port to the processor. A good example of this occurs if spin_lock_irq() is used incorrectly by a faulty driver causing a deadlock that cannot be studied with the debugger.

ARM's fast interrupt (FIQ) support

The ARM architecture includes two ways to interrupt the processor, the normal interrupt (IRQ) and the fast interrupt (FIQ). The two forms of interrupt have separate mask bits within the ARM processor status register and Linux code seldom, if ever, sets the FIQ mask bit. The processor also implements special features to reduce the overhead of FIQ handling. For example, it has a separate bank of seven registers, five of which can be used by the FIQ handler without interfering with any normal CPU registers. In addition, the FIQ vector is carefully placed within the exception vector table so that its handler can be directly executed (all other ARM exceptions must jump due to lack of space in the vector table). This means an FIQ handler, if specially crafted to use only a few registers, need not save or restore any state. The combination of seldom being masked, reduced demultiplexing overhead (because few drivers employ FIQ), and additional hardware features gives fast interrupts their name.

At the CPU level, the ARM FIQ signal is technically very similar to the x86 non-maskable interrupt (NMI), but its role within the system architecture has different historical roots. ARM FIQs were, as the name suggests, designed to rapidly service demanding peripherals or even to allow software to replace hardware (for example in synchronous serial communication). This contrasts strongly with the PC world, where the NMI has long been associated with diagnostics and other troubleshooting techniques. NMI was originally used in the IBM PC to report hardware faults such as memory parity errors. Today, watchdogs built into PC chipsets signal failure using NMI; server systems may even include a physical NMI button that can be used to invoke diagnostic features.

Most ARM systems have interrupt controllers that allow any interrupt source to be routed as either an IRQ or an FIQ. Occasionally, in embedded ARM/Linux systems, this facility is used for its original purpose of supporting a single peripheral with aggressive latency requirements. For example, the Raspberry Pi kernel uses fast interrupts to improve USB performance. However, it is much more common for the FIQ signal to never be used at all. This makes it possible to route the UART (serial port) interrupt to the FIQ signal, improving the robustness of communication between the UART and the debugger. Since the FIQ is never masked, a faulty driver would no longer be able to prevent the debugger operating normally simply by disabling interrupts.
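The independence of the two mask bits is the heart of the matter, and a toy model makes it concrete. In the ARM program status register, the I bit (bit 7) masks IRQs and the F bit (bit 6) masks FIQs. The helper functions below are invented for this sketch and only manipulate a plain integer; they do not touch real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_I_BIT 0x00000080u  /* set: normal interrupts (IRQ) masked */
#define PSR_F_BIT 0x00000040u  /* set: fast interrupts (FIQ) masked */

/* Model of what a driver does when it disables interrupts. */
static uint32_t sketch_irq_disable(uint32_t psr)
{
    return psr | PSR_I_BIT;    /* the F bit is left untouched */
}

static bool irq_deliverable(uint32_t psr)
{
    return (psr & PSR_I_BIT) == 0;
}

static bool fiq_deliverable(uint32_t psr)
{
    return (psr & PSR_F_BIT) == 0;
}
```

In this model, a driver that disables interrupts and then deadlocks leaves `irq_deliverable()` false but `fiq_deliverable()` true, so a UART routed to FIQ can still reach the debugger.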

Android's fiq_debugger

Google's Android team have already implemented an interactive debugger that can, optionally, take advantage of FIQ interrupts. Fiq_debugger has a long history that dates back several years before kdb was merged into the kernel. Recently it was used in the development of many of Google's Nexus phones and tablets. On these devices, the UART is connected either to the USB or headphone sockets. These UARTs are disabled during normal use but become active when presence-detect resistors indicate that something is listening to the UART.

On devices whose application processors can support it, fiq_debugger receives and processes all user input and executes the majority of commands from the FIQ handler. This makes it extremely robust against driver bugs that leave the system unresponsive, although there are some drawbacks. In particular, an FIQ can interrupt the kernel at more or less any point during kernel execution, including during critical sections. That means that certain debug commands cannot execute safely from the FIQ handler because they might conflict with the interrupted activity. When running in the FIQ handler, even taking a spin lock can lead to a lock up if the spin lock is held by an interrupted critical section.

To solve this, fiq_debugger can drop into normal interrupt handling using ARM's software interrupt feature. This allows robust basic commands (such as a single-CPU stack trace) to run from the FIQ handler, while richer, but slightly less robust, status-reporting features are provided by the same debugger.

Some ARM systems do not permit routing of the UART interrupt from IRQ to FIQ. On these systems, the Android debugger remains useful to study a variety of system failures, but it does not retain the robustness of FIQ-based systems.

In addition to FIQ support, fiq_debugger contains some other unusual features that distinguish it from the existing in-kernel debug technologies. These features are motivated by the relatively hostile environment the debugger might be deployed in.

For example:

  • The UART (and the associated presence-detect circuit) might be presented with significant noise due to the serial port being multiplexed with other activity. Noise must not cause the device to spuriously stop the world.

  • The debugger may be deployed on devices with one or more external hardware watchdogs standing by ready to reset the system should it become stalled for any reason.

  • The debugger may be deployed on production devices and cannot be used as a means to compromise user privacy. For example a hostile charging station or airline headphone service must not be able to access private user data.

Fiq_debugger has two features to counter these issues.

First, fiq_debugger's command interpreter is asynchronous. All CPUs continue to run while commands are received from the user and, on SMP systems, all of the other CPUs continue to run as usual even during command execution. This contrasts with kdb, which is a stop-the-world debugger. As soon as kdb is invoked, all CPUs in the system are brought to a halt and the system will not resume normal processing until the user issues a "go" command.

Stop-the-world has many advantages; in particular, the system cannot change state while the user is reasoning about it. However, if the world were stopped accidentally due to noise (for example, when inserting headphones), it would look to the user as though the phone had crashed. In this situation, the watchdog will come to the rescue of the normal user, but at a terrible cost: a developer who actually wants to stop the world will find that the device resets ten seconds after they started debugging it because the watchdog fired. An asynchronous implementation keeps both users happy.

Second, fiq_debugger supports only a fairly limited set of built-in commands. There are no general memory inspection commands and, apart from magic-SysRq and reboot, there is no means to divert the device from normal processing. The idea is that the passive inspection commands that do exist (stack trace, process list, irq status, dmesg, and register dump) give a reasonable chance of performing successful post-mortem analysis without much risk of leaking the user's private data.

Fiq_debugger, like kdb, offers a command that switches to kgdb mode and enables both arbitrary memory access and traditional stop-the-world step/breakpoint debugging. This command is disabled by default and can only be enabled by the root user. Despite its interesting and unique features, it seems unlikely that fiq_debugger will be merged into the kernel because its functionality overlaps so significantly with that of kdb.

Improving kgdb and kdb

Inspired by the Android team's work on fiq_debugger, Anton Vorontsov of Linaro developed a series of patches to implement some of the best ideas from that debugger in kdb. This includes the NMI/FIQ patchset and the reduced capability series.

The NMI/FIQ patchset introduced a generic framework to support NMI-based debuggers together with a concrete implementation for ARM that is based on FIQ. The framework allows both kgdb and kdb to be triggered from non-maskable interrupts, which brings the robustness of non-maskable debuggers to all in-kernel debug technologies.

The generic framework provides a means for the NMI handler to deliver characters to a special TTY driver that interacts with a real serial port driver using its polled I/O interfaces. The TTY driver allows the user to invoke kdb (or kgdb), but takes steps to avoid spuriously stopping the world due to noise by requiring a special "knock" to stop the system.
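The "knock" idea can be sketched as a small state machine that is fed one received character at a time and only reports success once the whole sequence has arrived in order. The knock string used here is simply an example value for the sketch, and the structure and function names are hypothetical rather than taken from the patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Example knock sequence for this sketch; treat the value as an assumption. */
static const char knock[] = "$3#33";

struct knock_state {
    size_t matched;            /* characters of the knock seen so far */
};

/* Feed one received character; true means "enter the debugger now". */
static bool knock_feed(struct knock_state *s, char c)
{
    const size_t len = sizeof(knock) - 1;

    if (c == knock[s->matched]) {
        if (++s->matched == len) {
            s->matched = 0;
            return true;
        }
        return false;
    }
    /*
     * Mismatch: start over. The simple restart is exact only because the
     * first character of this knock does not recur later in the sequence.
     */
    s->matched = (c == knock[0]) ? 1 : 0;
    return false;
}

/* Convenience wrapper: count complete knocks in a burst of input. */
static int knock_feed_all(struct knock_state *s, const char *input)
{
    int hits = 0;

    for (; *input != '\0'; input++)
        if (knock_feed(s, *input))
            hits++;
    return hits;
}
```

Random line noise is unlikely to reproduce the full sequence, so the world is only stopped when somebody really is knocking.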

The framework was merged into 3.7 but, unfortunately, the ARM-specific patches to take advantage of it never were able to get reviewer or maintainer attention despite multiple submissions. Anton moved on to other things and it falls to me to update them with support for multi-platform kernels and to fix bitrot since 3.7.

My work brings NMI/FIQ support to STMicroelectronics STiH415 and STiH416 devices, as well as support for the ARM Versatile platform. It works well with multi-platform kernels, although Russell King has identified some potential issues that require spin locks to be avoided within one of the interrupt controller callbacks. In the device tree portion of the patch, Srinivas Kandagatla asked for better documentation of the new device tree bindings and King has serious concerns about how FIQ-capable interrupt signals are described within device tree interrupt maps. Finally Colin Cross, one of the developers of Android's fiq_debugger, has previously noted the need for additional changes to kgdb to fully benefit from fast interrupts on SMP platforms. In particular, the current code to stop the world uses an inter-processor IRQ to stop the other processors. That should be made to use fast interrupts to fully benefit from the robustness improvements.

Supplementing the NMI/FIQ patchset is the reduced capability patchset, which is a means to restrict which classes of kdb commands can be used during a debug session. The permitted commands are set at boot and can be modified by the root user while the system is running. This allows kdb to be set up with a similar range of commands as fiq_debugger has, although other combinations are also possible and can be used to target different use-cases.
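One way to picture such a restriction is as a bitmask of permitted command classes that is consulted before each command runs. The classes, the command table, and the rule that every class bit of a command must be enabled are all assumptions made for this sketch; the patch set may well divide things differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command classes; not the patch set's actual categories. */
enum cmd_class {
    CLASS_INSPECT = 1u << 0,   /* passive: stack trace, process list, dmesg */
    CLASS_MEMORY  = 1u << 1,   /* arbitrary memory display or modification */
    CLASS_FLOW    = 1u << 2,   /* breakpoints and single-stepping */
    CLASS_REBOOT  = 1u << 3,   /* reboot the machine */
};

struct command {
    const char *name;
    unsigned int classes;      /* classes that must all be enabled */
};

/* Illustrative table only. */
static const struct command commands[] = {
    { "bt",     CLASS_INSPECT },
    { "ps",     CLASS_INSPECT },
    { "dmesg",  CLASS_INSPECT },
    { "md",     CLASS_MEMORY  },
    { "bp",     CLASS_FLOW    },
    { "reboot", CLASS_REBOOT  },
};

/* A command runs only if it is known and all of its classes are enabled. */
static bool command_allowed(const char *name, unsigned int enabled)
{
    for (size_t i = 0; i < sizeof(commands) / sizeof(commands[0]); i++)
        if (strcmp(commands[i].name, name) == 0)
            return (commands[i].classes & enabled) == commands[i].classes;
    return false;              /* unknown commands are refused */
}
```

Enabling only `CLASS_INSPECT | CLASS_REBOOT` would approximate the fiq_debugger command set described earlier: passive inspection plus reboot, with no memory access.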

The impact of TrustZone on FIQ

So far, we have assumed that the interrupt controller provides a means to route the UART interrupt to the processor's FIQ signal and that this signal is observable by the kernel. Unfortunately, for modern ARM systems that implement TrustZone, this is not always the case.

TrustZone is a security technology for ARM. It works by dividing the processor into two virtual processors, each of which is considered to occupy a different "world". Peripherals, including processor-intimate peripherals such as the interrupt controller, can determine which world a memory access originates from. This is used to implement hardware-based controls that prevent the "normal world" virtual processor from interfering with the operation of the "secure world".

ARM systems with TrustZone do not introduce new interrupt signals between the interrupt controller and the processor. Instead, the processor will switch automatically from normal world to trusted world in response to the FIQ signal, meaning only the IRQ signal can be used by an operating system running in the normal world. The interrupt controller supports this division by ignoring writes to the FIQ routing registers from the normal world and returning zero for all reads.

In typical systems that employ TrustZone, the secure monitor is booted from a tamper-resistant bootloader and the kernel is later booted in normal world. The kernel can interact with the secure monitor by using a special Secure Monitor Call (SMC) instruction that operates in a similar manner to a system call.

This means that a developer working on a kernel running in normal world must rely on cooperation from the secure monitor to help pass FIQ signals to the kernel. Unfortunately, features to support this are not yet standardized and implemented in currently available secure monitors. Thankfully, there are projects like ARM trusted firmware from ARM itself and the work on trusted execution environments by Linaro, STMicroelectronics, and NVIDIA to provide us with open-source infrastructure to prototype and develop interfaces between the kernel and the secure monitor. This should eventually open up the opportunity for developers to employ NMI-like debug techniques on almost all modern ARM systems.

What's next?

The NMI/FIQ patchset is relatively small, but conceals within it some fairly significant behavioral changes. Not content with proposing big changes to the default configuration of one of ARM's most common interrupt controllers, it also causes all of the debugger code to run from a non-maskable interrupt handler, thus imposing new restrictions on the use of spin locks within the debugger implementation. This means a good bit of testing will be required in order to gain sufficient confidence for the patches to be merged.

Many types of testing are needed, from simple does-it-still-boot regression testing right through to deliberately breaking the kernel and checking that the debugger can still be invoked. For example, running these tests on SMP systems will reveal the limitations of the kgdb code to stop the world, which will allow it to get fixed.

To encourage wider testing, a port to the BeagleBone Black is planned, although progress here has been frustrated slightly by the worldwide shortage of boards. In the meantime, be aware that porting to a new board can be as little as 31 lines of code if the interrupt controller is already supported. That makes porting to other development boards a terrific way for an interested developer to get involved with this work.

It is too late for anything to happen in the 3.16 merge window, but we can hope to see at least some of these patches making their way into 3.17.

Patches and updates

Kernel trees

Linus Torvalds: Linux 3.15-rc7

Documentation

Michael Kerrisk (man-pages): man-pages-3.67 is released

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds