
Kernel development

Brief items

Kernel release status

The current development kernel is 4.3-rc5, released on October 11. "It's the usual 'lots of small fixes to drivers and architecture code, with some filesystem updates thrown in for variety'." This prepatch also features a change to the kernel codename, which is now "Blurry Fish Butt".

Stable updates: none have been released in the last week.

Comments (1 posted)

Quote of the week

Personally, I'd be happy to see RT-PREEMPT go in - there are certain use cases for it. I'd also like to see the NOHZ stuff go in as well. It should solve a set of problems for isolating RT tasks without sacrificing performance on the non-RT CPUs. And finally, I wouldn't mind putting a non-Linux RT kernel like Cobalt (oh the horrors) into mainline as well. Each of these approaches has their strengths and weaknesses. Linux is, after all, whatever we say it is - and it could easily have a small RT micro-kernel as part of the source base as well.

AFAIK the patent issues that plagued the dual-kernel approach are now behind us, so this might be a good area of investigation.

I've long felt that RT-PREEMPT was sucking the oxygen out of getting other, technically valid, approaches to RT with Linux into mainline.

Tim Bird

Comments (5 posted)

Bottomley: Respect and the Linux Kernel Mailing Lists

SCSI subsystem maintainer James Bottomley has posted a different view on the issue of civility on the kernel's mailing lists. "So, by and large, I’m proud of the achievements we’ve made in civility and the way we have improved over the years. Are we perfect? by no means (but then perfection in such a large community isn’t a realistic goal). However, we have passed our stress test: that an individual with bad patches to several mailing lists was met with courtesy and helpful advice, in spite of serially repeating the behaviour."

Comments (56 posted)

Kernel development news

Unprivileged bpf()

By Jonathan Corbet
October 12, 2015
Over the last couple of years, the Berkeley packet filter (BPF) in-kernel virtual machine has gained capabilities and moved beyond its origins in the networking subsystem. Among other things, it has gained its own system call — bpf() — to enable the loading of BPF programs into the kernel and various ancillary functions. In current kernels, bpf() is a root-only system call, and truly root-only at that: one must be root in the initial system namespace to use it. But the plan always included making bpf() available to unprivileged users; now patches are circulating to take that last step.

Kernel developers are understandably nervous about allowing unprivileged users to load code into the kernel for execution in kernel context. It does not take long to think of a number of ways in which things could go wrong. Getting past the initial reflex that says to simply disallow such access requires a high degree of certainty that there is no way for a rogue BPF program to compromise the kernel in any way.

The job of providing this assurance belongs to the BPF verifier module. It checks that any program presented for loading does not exceed the maximum number of instructions (4096 by default) and that it does not contain any loops, thus ensuring that it will not take excessive time to run. All jumps are checked to be sure that they land within the program (and don't create loops). Accesses to memory are not allowed to go outside the memory area provided by the kernel. The type of data stored in each accessible memory location is tracked by simulating the program's execution; instructions are not allowed to operate on inappropriate data types, and uninitialized memory cannot be accessed.

According to BPF developer Alexei Starovoitov, there is just one thing that is missing: ensuring that a BPF program cannot leak information about the kernel — and kernel pointers in particular — to user space. This information can be highly useful to an attacker trying to exploit a vulnerability, so quite a bit of effort has gone into plugging such leaks in recent years. BPF programs do not have access to a great deal of kernel information, but a hostile program still might succeed in exfiltrating something that an attacker can use. That makes it dangerous to allow unprivileged users to load and run BPF programs.

Avoiding this problem requires extending the capabilities of the BPF verifier; that is the intent of this patch set from Alexei. Since the verifier already knows the data types of the values stored in each memory location, this is a matter of restricting what can be done with locations containing pointer values. So simply returning a pointer to user space is clearly to be disallowed, as is storing a pointer into a BPF map. Perhaps more subtly, comparison of pointers is also disallowed; otherwise a BPF program could arrive at a pointer's value indirectly.

Another interesting possibility has to do with pointers stored in the stack. A clever program could try to overwrite parts of a pointer with numeric values in a way that makes it possible to recover the original pointer value, then return the resulting numeric value to user space. The verifier clearly has to catch this case and block it. The list goes on, but the basic idea should be clear by now: there are a lot of ways to sneak pointer values out to user space; the verifier must anticipate and catch them all. Alexei's patch set tries to get the verifier to the point that it can meet that challenge.

Even with these checks in place, there are still some limits on the use of BPF programs by unprivileged users. In particular, with the current patch set, only socket-filter programs can be loaded by those users. BPF programs used in other settings (tracing, for example) inherently have to deal with kernel data, so they will remain restricted to root. BPF programs also pin down some kernel memory; to keep users from occupying too much memory, the space used is charged against their RLIMIT_MEMLOCK resource limit. On many systems, the default value of that limit may prove to be too small to load useful programs, so it may need to be increased by the system administrator.
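
That last point is worth illustrating. The limit can be inspected with ulimit; raising it for a shell session is shown below as a rough sketch (the default value, and the preferred way to change it permanently — usually /etc/security/limits.conf — vary by distribution):

    ulimit -l            # maximum locked memory for this shell, in kilobytes
    ulimit -l unlimited  # raise it (requires root or a sufficiently high hard limit)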

Finally, there is a sysctl knob (kernel.unprivileged_bpf_disabled) that can be used to disable unprivileged access to bpf() entirely. It defaults to "false" (in the patch; distributors may choose differently); if it is set to "true" it cannot be reset without a reboot. There was some talk of adding more fine-grained control over what BPF programs from unprivileged users can do, but Alexei dismissed that idea as being driven only by "fear." If the verifier is working as intended, he said, there is no reason for such fear and no use for finer-grained controls.
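
Assuming the knob lands as described in the patch, an administrator would drive it through the usual sysctl interface; a minimal sketch:

    sysctl kernel.unprivileged_bpf_disabled      # query the current setting
    sysctl -w kernel.unprivileged_bpf_disabled=1 # disable unprivileged bpf();
                                                 # only a reboot turns it back on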

That does leave open the question of whether the verifier has indeed earned the trust that is being placed in it, or whether perhaps some fear might be justified. It is 2000 lines or so of moderately complex code that has been reviewed by a relatively small number of (highly capable) people. It is, in a real sense, an implementation of a blacklist of prohibited behaviors; for it to work as advertised, all possible attacks must have been thought of and effectively blocked. That is a relatively high bar. There is a reason why David Miller described the patch set as "scary stuff".

But David applied the patch set for the 4.4 merge window despite his misgivings, and for a reason: as with user namespaces, unprivileged BPF access has the potential to increase system security by reducing the number of places where code must run with elevated privileges. But first it has to get to the point where all of the exploitable loose ends have been found and tied up — and the introduction of new loose ends must be prevented. That point may well be reached in the relatively near future. But it would not be surprising if this feature were disabled by distributors for a while after it hits the mainline.

Comments (13 posted)

Speeding up kernel development with QEMU

October 14, 2015

This article was contributed by Joël Porquet

When developing the Linux kernel (be it core components or drivers), it is the tasks of deploying, testing, and debugging that represent a large portion of the work—if not the most time-consuming part. At Kernel Recipes 2015, Stefan Hajnoczi gave a talk about the QEMU hardware emulator and explained how it can help speed up the process. For example, QEMU's GDB stub provides an efficient way to debug the kernel, while QEMU's device emulation facilitates the development and testing of device drivers.

Why QEMU?

Hajnoczi started his presentation by comparing ways to deploy and test Linux kernels.

[Stefan Hajnoczi]

He first presented the "in situ" debugging approach, where the kernel is deployed and tested directly on the development machine. An example of this would be loading a module under development directly into the kernel that the computer is running. There are two problems with such an approach. As one can imagine, an unstable or broken module can crash the machine and result in lost work. Furthermore, such a crash seriously disrupts the development workflow, because the computer has to be rebooted and the session restored. The second problem concerns debugging the kernel; as Hajnoczi pointed out: "KGDB and Kdump cannot be 100% reliable since they share the environment" (i.e. because they are part of the potentially broken system).

The "ex situ" debugging approach requires an additional machine. In a way, he argued, it is better than the previous approach because the test machine can crash without bringing down the development environment with it. However, deploying the kernel and running tests is more cumbersome. In the most common setup, the test machine boots over the network via PXE (Pre-boot eXecution Environment). Right after being powered on, the test machine obtains an IP address via DHCP from a PXE server and downloads the binary image it is supposed to boot using TFTP (the Trivial File Transfer Protocol). This requires a special PXE environment to be installed and set up, most likely on the development machine, which is not necessarily easy.

In terms of debugging, it is possible to connect the development machine to the KGDB infrastructure of the kernel running on the test machine via a serial connection. Collecting crash dumps from the test machine and examining them later on the development machine is also possible. However, these software tools cannot inspect and debug everything, especially when it comes to the interaction between the kernel and the various hardware components (or to debug the early boot code that runs before KGDB is up and running). In such cases, debugging may require complex on-chip instrumentation, such as JTAG. It is also worth mentioning that a test machine is another computer to take care of and probably not one that travels well, making this type of debugging even less practical.

As Hajnoczi noted, "virtual machines are the best of both worlds." Since a virtual machine is a user-space application running on the development machine, it is easy to start and stop, and if ever the kernel running in it crashes, the development machine will not be affected. A virtual machine gives full access to the memory and processor state, and it is much more versatile than a test machine since it can emulate different processor architectures. Furthermore, the hardware in the virtual machine is fully programmable, which provides ways to easily add new or custom devices and to extensively test the corresponding software (e.g. device drivers) by using error-injection techniques.

Overview of QEMU

QEMU is a machine emulator and virtualizer. It can emulate obscure processor architectures as well as more mainstream ones (a total of 17 processor architectures are currently supported). When used as a machine emulator, typically in a cross-architecture setup (e.g. emulating an ARM-based guest system on an x86 host computer), QEMU dynamically translates the guest code into native code for fast execution on the host. When the guest and the host share the same architecture, QEMU is often able to provide near-native performance by using KVM, which allows the guest code to run directly and safely on the host processor without being translated first.
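
As a minimal sketch (assuming a bootable guest disk image named disk.img), KVM acceleration is requested explicitly on the command line:

    qemu-system-x86_64 -enable-kvm -m 1G disk.img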

In order to give the guest access to hardware devices, QEMU usually catches the accesses from the guest and performs them on the host on the guest's behalf. For example, guest access to a hard disk can typically be emulated by reading or writing a file on the host computer that contains a full filesystem. For better performance, it is also possible to pass a PCI or VGA device through to the guest, using the kernel's VFIO support. The guest is then able to directly access the real hardware device without emulation. QEMU provides the ability to trace the I/O accesses in order to debug the communications between the guest and the hardware.

Booting and testing

The command line parameters for QEMU resemble those of a bootloader, making them easy to understand. For example, the following command starts a development kernel on an x86-64 guest, specifying the paths to the kernel and the initramfs image. It also shows how to append an extra kernel parameter:

    qemu-system-x86_64 \
        -kernel vmlinuz \
        -initrd initramfs.img \
        -append param1=value1

When testing a development kernel, there is often no need for a full-featured root filesystem and, as Hajnoczi explained, initramfs is perfect for containing small test applications and loadable kernel modules. An initramfs image is built using gen_init_cpio, a tool already included in the Linux source tree. The content of such an image is specified by a simple description file that defines which files, directories, device nodes, and symbolic links should be part of the filesystem image. The following example shows how to build a basic initramfs image:

    $ cat initramfs_desc
    file    /init           my-init.sh    0755 0 0
    dir     /bin                          0755 0 0
    dir     /sbin                         0755 0 0
    dir     /dev                          0755 0 0
    nod     /dev/zero                     0666 0 0 c 1 5
    file    /sbin/busybox   /sbin/busybox 0755 0 0
    slink   /bin/sh         /sbin/busybox 0755 0 0

    $ gen_init_cpio initramfs_desc | gzip > initramfs.img

For this kind of initramfs image, the tests would be kicked off directly from the /init executable, which is the first user process launched by the kernel after booting. Here, the image uses BusyBox, which is a simple toolbox application that provides a large set of basic commands (such as a shell) and thus lowers the complexity of building this initial root filesystem image. Finally, Hajnoczi added, the initramfs can even be inserted into the kernel image, making the deployment even easier.
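
One way to do that (a sketch, assuming the description file above is saved as initramfs_desc in the kernel source tree) is to let the kernel build generate and embed the archive itself:

    CONFIG_BLK_DEV_INITRD=y
    CONFIG_INITRAMFS_SOURCE="initramfs_desc"

With such a configuration, the -initrd option can simply be dropped from the QEMU command line.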

[Stefan Hajnoczi caricature]

For efficient testing, he mentioned the -nographic option of QEMU. Using that option and instructing the kernel to use the serial port (console=ttyS0) disables QEMU's graphical display and makes the serial connection the main channel for displaying text and reading input. The output then appears in the host terminal from which QEMU was launched, which makes it possible to run tests from automated scripts, since the output can be processed with tools like grep.
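
Combining that with the earlier example gives a command line along these lines:

    qemu-system-x86_64 \
        -nographic \
        -kernel vmlinuz \
        -initrd initramfs.img \
        -append console=ttyS0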

The limitations of manually building an initramfs can sometimes make a persistent root filesystem necessary instead. When the virtual machine has the same architecture as the host, the initramfs image can be populated directly with applications and libraries taken from the host operating system. That convenient approach gets more difficult when many of the applications added to the image are dynamically linked, because all of their shared-library dependencies must be found with ldd and added manually, which can end up being a lot of work.

Hajnoczi listed two options for providing persistent root filesystems. The guest can either share a directory with the host, using virtfs or NFS, or it can use a disk image file containing a partition table and one or more filesystems. While a shared directory is easy to manipulate and inspect from the host, a disk image file makes it easy to install a full Linux distribution.
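
As a sketch of the disk-image option (assuming disk.img holds a root filesystem in its first partition and that the guest kernel has the necessary disk and filesystem drivers built in), the guest could be started with:

    qemu-system-x86_64 \
        -nographic \
        -kernel vmlinuz \
        -drive file=disk.img,format=raw \
        -append "root=/dev/sda1 console=ttyS0"

The shared-directory alternative uses the -virtfs option instead, with the guest mounting the exported directory over the 9p protocol.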

Debugging

Through the activation of a GDB stub in QEMU (with the -s option), a virtual machine can be connected to a GDB client on the host and debugged in an efficient and, most importantly, non-intrusive way. The stub gives the client access to the processor registers and the memory, and allows breakpoints to be set on the kernel code executing in the virtual machine.
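
A minimal session might look like the following sketch. The -s option makes QEMU listen for a GDB connection on TCP port 1234; the additional -S option freezes the virtual machine at startup so that even early boot code can be debugged; vmlinux is the uncompressed kernel image containing the debugging symbols:

    qemu-system-x86_64 -s -S -nographic \
        -kernel vmlinuz -initrd initramfs.img -append console=ttyS0

    # from another terminal on the host
    gdb vmlinux
    (gdb) target remote :1234
    (gdb) hbreak start_kernel
    (gdb) continue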

Hajnoczi stressed the fact that this kind of remote debugging is not the same as debugging QEMU itself; the two are often confused. While QEMU's GDB stub helps with debugging what the virtual machine sees, running GDB directly against QEMU is only useful for debugging the hardware device emulation or QEMU internals.

He also pointed out a few things to remember regarding remote debugging. First, it is necessary to tell GDB which specific architecture or sub-architecture is being used in the virtual machine, since it has no way of detecting that on its own. More importantly, since the GDB stub is integrated into the processor model, it always has the same view of memory as the processor. While the processor is running in the physical address space (e.g. when Linux is booting, before switching to virtual memory), GDB can see and access the entire memory. But as soon as virtual memory is enabled, GDB can only access what the processor can access, which is what is mapped through the current page tables. Another effect is that GDB cannot properly differentiate between the kernel and user applications, because they all appear as a single execution flow. GDB is thus not able to interpret higher-level abstractions and does not know much about the current user-space process or any swapped-out pages.

Device bring-up

There are several challenges for driver developers, Hajnoczi said. Sometimes the real hardware is not available yet, or is expensive to obtain. Both cases make it difficult to develop the corresponding software support. Other times, hardware and software are supposed to be co-developed in parallel in order to minimize the time to market.

These challenges can be overcome by implementing device emulation in QEMU and developing the corresponding driver against the emulated device. Such software support can later be verified using the real hardware when the device is finally available. As an audience member noted, though, it is nonetheless advised to be cautious because QEMU is not a cycle-accurate emulator. Since it does not respect the timing of the real hardware, a device driver may work fine with the emulated version of the device but not with the real one.

Hajnoczi gave a few examples in which device bring-up in QEMU had successfully been used: the Rocker OpenFlow network switch, the NVMe PCI flash storage controller, and NVDIMM persistent memory devices.

A large base of existing code greatly helps the development of new devices. QEMU supports common buses (e.g. PCI, USB, I2C) and, in a way that is comparable to Linux, provides the notion of device classes (e.g. device, pci-device) from which devices can inherit. Memory-mapped (and port-mapped) devices are implemented through a memory API, which registers a device's memory segments and binds them to functions that describe the behavior of the device—thus emulating it. QEMU also provides bus-specific methods to raise interrupts, in addition to emulating interrupt controllers.

As Hajnoczi mentioned, the documentation for implementing device emulation in QEMU is unfortunately sparse, but there are plenty of examples in the QEMU source available to learn from.

Error injection

When developing a driver for a device, it is often difficult to exercise rare code paths. For example, how can the behavior of the driver be verified when hot unplugging the corresponding device while it is in use? Such events are actually difficult to produce with real devices, without having to reach into a box to pull out cables and potentially cause damage. With device emulation, QEMU is able to easily simulate those error conditions in a safe manner.
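
For example (a sketch, assuming the emulated device was given id=nic0 on the QEMU command line), a surprise removal can be triggered from the QEMU monitor (reachable with Ctrl-A c when running with -nographic) while the guest driver is busy using the device:

    (qemu) device_del nic0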

For some devices, an advanced error-injection engine is even available. Hajnoczi gave the example of the block I/O layer, which can accept error-injection scripts based on state machines. As shown in the snippet below, such a script can, for example, make all disk reads fail after the first write:

    [set-state]
    state = "1"
    event = "write_aio"
    new_state = "2"

    [inject-error]
    state = "2"
    event = "read_aio"
    errno = "5"

The script starts in state "1", where the [set-state] rule waits for the "write_aio" event to occur. That event corresponds to a disk write and makes the state machine jump to state "2". From there, every event corresponding to a disk read ("read_aio") will fail with EIO (which is 5), as requested by the [inject-error] rule. More information can be found in a document that gives an overview of the features available in the block I/O error-injection engine.
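
To put such a script to work (a sketch, assuming it is saved as blkdebug.conf and that disk.img is the image under test), the blkdebug driver is layered in front of the image on the QEMU command line:

    qemu-system-x86_64 \
        -drive file=blkdebug:blkdebug.conf:disk.img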

Interested readers may want to consult Hajnoczi's slides [PDF] for additional details.

About Kernel Recipes

This was the fourth edition of Kernel Recipes, which is definitely becoming a popular annual rendezvous in France for Linux enthusiasts. It had many well-known speakers from the community; Greg Kroah-Hartman, for example, released the 3.14.54 longterm kernel on stage. This year's edition also had an artist drawing caricatures of the speakers; the one for Hajnoczi can be seen above. The slides (and soon the videos) of all the talks given at the conference are available on the web site.

Comments (12 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.3-rc5
Jiri Slaby Linux 3.12.49
Ben Hutchings Linux 3.2.72

Architecture-specific

Zubair Lutfullah Kakakhel MIPS: Add Xilfpga platform
Andrey Ryabinin KASAN for arm64

Core kernel code

Alexei Starovoitov bpf: unprivileged

Device drivers

Device driver infrastructure

Shashank Sharma Color Management for DRM
Antti Palosaari SDR transmitter API

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous
