
Kernel development

Brief items

Kernel release status

The current development kernel is 4.3-rc3, released on September 27. Linus said: "So as usual, rc3 is actually bigger than rc2 (fixes are starting to trickle in), but nothing particularly alarming stands out. Everything looks normal: the bulk is drivers (all over, but gpu and networking are the biggest parts) and architecture updates. There's also networking and filesystem updates, along with documentation."

Stable updates: 4.2.2 and 4.1.9 were released on September 29. The 3.14.54 and 3.10.90 updates are in the review process as of this writing; they can be expected on or after October 1.

Comments (none posted)

Kernel development news

Compile-time stack validation

By Jonathan Corbet
September 30, 2015
An occasionally heard horror story about the kernel development community concerns developers who are told that, in order to get their code upstream, they must first invest considerable effort into fixing a related subsystem. As with many such stories, this is not an experience many kernel developers have had, but there is also a grain of truth behind it. The ongoing live-patching effort, and the extra work that has been required to push that work forward, is a case in point.

Live patching's rough patch

In one sense, the live-patching work has been quiet for much of this year; when LWN last looked at this work in February, the core code had been merged, but the "consistency model" code remained out-of-tree. This code's job is to ensure that a patch is only applied to a live kernel if it is safe to do so; that job includes checking to be sure that the affected functions are not executing at the time the patch is applied. Without this assurance, only relatively trivial patches can be applied with any degree of safety. This is important: the appeal of live patching is the ability to avoid rebooting, so a patch application that crashes the kernel (or, worse, results in data corruption) defeats the whole purpose.

One way of ensuring that a given function is not executing is to freeze all processes on the system, then examine the call stack of each to see which functions are active at the time. This is the approach that was taken when the kpatch and kGraft consistency models were unified in the February patch set. That work ran into strong opposition for a simple reason: the information in the kernel's call stack is often not reliable. The biggest culprit is assembly-language code, which can easily dispense with the call-stack discipline observed by code compiled from C. Kernel developers see the consequences regularly: stack traces from kernel crashes are frequently unreliable, making it hard to determine the sequence of calls that led to the problem.

It's one thing for an unreliable stack trace to make kernel developers scratch their heads more; it's another if that information can fool a live-patching utility into applying a patch at an inopportune time. The risk of that happening was deemed high enough to block the merging of the proposed consistency code. This code, it was said, could only be used if kernel stack traces were known to be 100% reliable.

At the time, 100% reliable stack traces were not widely seen as an attainable goal. It is certainly possible to fix up all of the assembly code that does not set up proper stack frames (assuming it could all be found), but, since nothing in the kernel's normal operation depends on good call-stack information, there was nothing preventing things from breaking again at any time. In the absence of some sort of ongoing assurance that the kernel's call stack will always remain valid, it is hard to be confident that a live-patching system won't do the wrong thing.

Validating the call stack

Some developers might have given up at this point. Josh Poimboeuf, instead, set out to find a way to make the call stack valid at all times and keep it that way; the result is the "compile-time stack metadata validation" patch set, in its 13th revision as of this writing. This work adds a new tool (called stacktool) that checks the entire kernel as part of the build process to be sure that all code obeys the rules for maintaining the call stack.

The rules are, for the most part, relatively straightforward. For example, every function in assembly code must be marked as a callable function (using the ELF function type). There are some handy macros (ENTRY and ENDPROC) that do this annotation now, but not all assembly functions use them. A clear sign that the rules are not being followed is a ret instruction outside of a function block, so stacktool will complain about those.

The primary source of call-stack problems is assembly code that calls another function (possibly a C function) without setting up a new stack frame first. Such calls work, but they will trip up code that is trying to make sense out of the call stack. The validation tool checks to make sure that function calls are surrounded by the appropriate frame-maintenance code. There are currently assembly macros to do this work, but they are unused; Josh's patch renames them to FRAME_BEGIN and FRAME_END and puts them into use. Versions of these macros for inline assembly in C code have also been added; they can be found in <asm/frame.h>.
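In rough outline, the discipline stacktool enforces looks like this. This is a sketch using the macro names from the patch set; my_asm_helper and a_c_function are invented for illustration, not code taken from the kernel:

```
ENTRY(my_asm_helper)		/* annotated as an ELF function */
	FRAME_BEGIN		/* push %rbp; mov %rsp, %rbp */
	call	a_c_function	/* the frame makes this call unwindable */
	FRAME_END		/* pop %rbp */
	ret			/* a ret inside a function block */
ENDPROC(my_asm_helper)
```

Any call not bracketed this way, or any ret outside an ENTRY/ENDPROC pair, is the kind of pattern the tool flags.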

There are also some rules about dynamic jumps; for the most part, they are only allowed as part of a C switch statement. The one exception is "sibling calls," where the end of one function jumps to the beginning of another and the frame pointer hasn't changed. These rules make it possible for stacktool to follow the control flow in all cases and ensure that the call stack is always maintained.

If the STACK_VALIDATION configuration option is set, stacktool will be run on the kernel's object files as part of the build process. This pass, Josh says, causes a kernel build to take about three seconds longer (he doesn't say whether that's a kernel with a focused configuration or a distribution kitchen-sink configuration). Three seconds is probably an acceptable delay, even for impatient kernel developers, but Josh suggests that some optimization work could probably reduce that figure anyway.

What might be harder for developers to get used to are the complaints emitted by stacktool when it finds a problem. They go out as warnings in the current patch set, but the intent is to turn them into hard errors once most of the existing problems have been fixed. Even if a given developer doesn't enable stack validation, others will, so changes that break the call stack will be returned for repairs in short order. The documentation file that comes with the patch set gives examples of the kinds of errors that may be reported and how to respond to them.

The current version of the patch set only supports the x86_64 architecture; evidently provisions have been made for adding other architectures, but the nature of the task ensures that a lot of the work will have to be done over again to support something else. Even with a single supported architecture, though, the stack validation work should help to bring an end to the long era where stack traces could not really be trusted. That is good for live patching, but any developer trying to figure out an oops will also benefit from this work. The live-patching developers may not have wanted to take this digression, but the kernel as a whole will be better off as a result of it.

Comments (8 posted)

Random number scalability

By Jonathan Corbet
September 28, 2015
In an era of ongoing attacks and surveillance, proper generation of random numbers is essential to the security of our systems and communications, so the quality of our random numbers is often a topic of discussion. The performance with which the kernel comes up with random numbers is not normally as much of an issue. It turns out, though, that, on large NUMA systems with heavy demand for random numbers, lock contention within the random-number generator (RNG) can severely limit the performance of the system as a whole. A patch addressing that problem is relatively straightforward, but it provides an opportunity to look at how this subsystem works in general.

Most readers will be familiar with the fact that the kernel's RNG subsystem collects entropy (randomness) and provides it via two interfaces. One, exposed as /dev/random, is strictly limited so that it cannot provide more entropy than has been collected by the system; the other (/dev/urandom) functions as a pseudo-random-number generator to be able to continue to provide random data when the supply of incoming entropy is not sufficient to meet the demand. For most applications, even cryptographic applications, the latter interface is more than sufficient, but /dev/random is there for those who need truly random data and are able to wait for it if need be.

These interfaces are supported by three "entropy pools" within the kernel; an entropy pool is an array of bytes of random data, along with some supporting metadata. Whenever randomness is collected by the kernel (be it from interrupt timing, a hardware RNG, or some other source), it is added to the input pool, which contains 4096 bits of data. The pool is not a simple FIFO of random bytes; instead, randomness is "mixed" into the pool with an algorithm that resembles a CRC calculation. The mixing is meant to be fast (so it can be done from interrupt handlers) and to spread the available entropy through the entire pool. It is also intended to keep the state of the pool from being known, even if an attacker is able to write a large amount of known data into it.

The kernel maintains an estimate of the amount of entropy stored in the pool at any given time. That estimate increases when randomness is added to the pool (by an amount that depends on an estimate of how random the input data truly is), and it is decreased when entropy is removed from the pool.

Since the pool is not a FIFO, one does not simply read random bytes out of it. Instead, entropy is extracted by calculating an SHA-1 hash of the pool. The hashed value is returned as the requested random data, but it is also mixed back into the pool. Using the hash will, once again, help to keep the state of the pool from being known.

Users of random data do not simply read it from the input pool, though; instead, the kernel maintains a simple hierarchy of three pools:

[RNG scalability diagram]

Reads from /dev/random will extract data from the blocking pool, while reads from /dev/urandom use the non-blocking pool. The output pools are smaller, each holding a maximum of 1024 bits of entropy in the 4.4 kernel. Entropy spills from the input pool into the two output pools in two ways:

  • Whenever incoming entropy causes the input pool to look full (the estimate of the entropy stored there approaches the pool size), entropy will be extracted from the input pool and mixed into both of the output pools. The output pools can be filled to 75% of their maximum entropy in this manner.

  • If an attempt is made to read more entropy from an output pool than is contained there, the needed entropy will be extracted from the input pool and mixed into the appropriate output pool. This is the point where the two random interfaces differ in behavior: /dev/random will block if the input pool is also depleted, while /dev/urandom will generate random numbers regardless.

Many years ago, data was read from the output pools without locking; perhaps the potential for corruption of random data was not seen as being particularly worrisome. But it turned out that, on occasion, it was possible for two processes to read the same random bytes, a vulnerability that could make it possible for one process to know which random numbers were being used by another. So a spinlock was added to each pool to ensure that access to the pools is properly serialized. It is that locking that turns out to be a bottleneck if too many processes (on a large number of CPUs) are trying to read random data at the same time.

After running into the problem, Andi Kleen put together a patch set designed to alleviate this lock contention. It uses the classic approach of avoiding inter-CPU contention by giving each CPU (or, properly in this case, each NUMA node) its own data. To get there, Andi's patch modifies the pool structure to look like this:

[RNG scalability diagram]

In short: each NUMA node gets its own non-blocking pool to read from, so reading random data no longer requires cross-node locking. Each (along with the blocking pool) receives overflow from the input pool in a round-robin fashion, and each can draw from that pool in response to a request if entropy is available. There is, Andi says, no need for per-CPU pools at this time, though things could be split further in the future if that need were to arise. There is also no plan to make a per-node version of the blocking pool; code that is willing to wait for sufficient entropy is unlikely to have trouble with locking scalability.

The patch does indeed result in increased scalability for non-blocking random-number generation on large systems. It also has the effect of distributing the entropy pool across nodes, making it that much harder to guess the state of the pool as a whole. One potential disadvantage is that it is no longer possible to read out the state of all output pools at system shutdown time, meaning that some entropy may be lost over a reboot. That could be fixed with the addition of a new save/restore interface, but it is not clear that anybody is concerned enough to do that work.

This patch has been through a set of revisions in response to comments and seems likely to be ready for merging into the 4.5 kernel.

Comments (1 posted)

Using the KVM API

September 29, 2015

This article was contributed by Josh Triplett

Many developers, users, and entire industries rely on virtualization, as provided by software like Xen, QEMU/KVM, or kvmtool. While QEMU can run a software-based virtual machine, and Xen can run cooperating paravirtualized OSes without hardware support, most current uses and deployments of virtualization rely on hardware-accelerated virtualization, as provided on many modern hardware platforms. Linux supports hardware virtualization via the Kernel Virtual Machine (KVM) API. In this article, we'll take a closer look at the KVM API, using it to directly set up a virtual machine without using any existing virtual machine implementation.

A virtual machine using KVM need not run a complete operating system or emulate a full suite of hardware devices. Using the KVM API, a program can run code inside a sandbox and provide arbitrary virtual hardware interfaces to that sandbox. If you want to emulate anything other than standard hardware, or run anything other than a standard operating system, you'll need to work with the KVM API used by virtual machine implementations. As a demonstration that KVM can run more (or less) than just a complete operating system, we'll instead run a small handful of instructions that simply compute 2+2 and print the result to an emulated serial port.

The KVM API provides an abstraction over the hardware-virtualization features of various platforms. However, any software making use of the KVM API still needs to handle certain machine-specific details, such as processor registers and expected hardware devices. For the purposes of this article, we'll set up an x86 virtual machine using Intel VT. For another platform, you'd need to handle different registers, different virtual hardware, and different expectations about memory layout and initial state.

The Linux kernel includes documentation of the KVM API in Documentation/virtual/kvm/api.txt and other files in the Documentation/virtual/kvm/ directory.

This article includes snippets of sample code from a fully functional sample program (MIT licensed). The program makes extensive use of the err() and errx() functions for error handling; however, the snippets quoted in the article only include non-trivial error handling.

Definition of the sample virtual machine

A full virtual machine using KVM typically emulates a variety of virtual hardware devices and firmware functionality, as well as a potentially complex initial state and initial memory contents. For our sample virtual machine, we'll run the following 16-bit x86 code:

    mov $0x3f8, %dx
    add %bl, %al
    add $'0', %al
    out %al, (%dx)
    mov $'\n', %al
    out %al, (%dx)
    hlt

These instructions will add the initial contents of the al and bl registers (which we will pre-initialize to 2), convert the resulting sum (4) to ASCII by adding '0', output it to a serial port at 0x3f8 followed by a newline, and then halt.

Rather than reading code from an object file or executable, we'll pre-assemble these instructions (via gcc and objdump) into machine code stored in a static array:

    const uint8_t code[] = {
	0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
	0x00, 0xd8,       /* add %bl, %al */
	0x04, '0',        /* add $'0', %al */
	0xee,             /* out %al, (%dx) */
	0xb0, '\n',       /* mov $'\n', %al */
	0xee,             /* out %al, (%dx) */
	0xf4,             /* hlt */
    };

For our initial state, we will preload this code into the second page of guest "physical" memory (to avoid conflicting with a non-existent real-mode interrupt descriptor table at address 0). al and bl will contain 2, the code segment (cs) will have a base of 0, and the instruction pointer (ip) will point to the start of the second page at 0x1000.

Rather than the extensive set of virtual hardware typically provided by a virtual machine, we'll emulate only a trivial serial port on port 0x3f8.

Finally, note that running 16-bit real-mode code with hardware VT support requires a processor with "unrestricted guest" support. The original VT implementations only supported protected mode with paging enabled; emulators like QEMU thus had to handle virtualization in software until reaching a paged protected mode (typically after OS boot), then feed the virtual system state into KVM to start doing hardware emulation. However, processors from the "Westmere" generation and newer support "unrestricted guest" mode, which adds hardware support for emulating 16-bit real mode, "big real mode", and protected mode without paging. The Linux KVM subsystem has supported the "unrestricted guest" feature since Linux 2.6.32, released in December 2009.

Building a virtual machine

First, we'll need to open /dev/kvm:

    kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);

We need read-write access to the device to set up a virtual machine, and all opens not explicitly intended for inheritance across exec should use O_CLOEXEC.

Depending on your system, you likely have access to /dev/kvm either via a group named "kvm" or via an access control list (ACL) granting access to users logged in at the console.

Before you use the KVM API, you should make sure you have a version you can work with. Early versions of KVM had an unstable API with an increasing version number, but the KVM_API_VERSION last changed to 12 with Linux 2.6.22 in July 2007, and was locked in as a stable interface in 2.6.24; since then, KVM API changes occur only via backward-compatible extensions (like all other kernel APIs). So, your application should first confirm that it has version 12, via the KVM_GET_API_VERSION ioctl():

    ret = ioctl(kvm, KVM_GET_API_VERSION, NULL);
    if (ret == -1)
	err(1, "KVM_GET_API_VERSION");
    if (ret != 12)
	errx(1, "KVM_GET_API_VERSION %d, expected 12", ret);

After checking the version, you may want to check for any extensions you use, using the KVM_CHECK_EXTENSION ioctl(). However, for extensions that add new ioctl() calls, you can generally just call the ioctl(), which will fail with an error (ENOTTY) if it does not exist.

If we wanted to check for the one extension we use in this sample program, KVM_CAP_USER_MEMORY (required to set up guest memory via the KVM_SET_USER_MEMORY_REGION ioctl()), that check would look like this:

    ret = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY);
    if (ret == -1)
	err(1, "KVM_CHECK_EXTENSION");
    if (!ret)
	errx(1, "Required extension KVM_CAP_USER_MEMORY not available");

Next, we need to create a virtual machine (VM), which represents everything associated with one emulated system, including memory and one or more CPUs. KVM gives us a handle to this VM in the form of a file descriptor:

    vmfd = ioctl(kvm, KVM_CREATE_VM, (unsigned long)0);

The VM will need some memory, which we provide in pages. This corresponds to the "physical" address space as seen by the VM. For performance, we wouldn't want to trap every memory access and emulate it by returning the corresponding data; instead, when a virtual CPU attempts to access memory, the hardware virtualization for that CPU will first try to satisfy that access via the memory pages we've configured. If that fails (due to the VM accessing a "physical" address without memory mapped to it), the kernel will then let the user of the KVM API handle the access, such as by emulating a memory-mapped I/O device or generating a fault.

For our simple example, we'll allocate a single page of memory to hold our code, using mmap() directly to obtain page-aligned zero-initialized memory:

    mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

We then need to copy our machine code into it:

    memcpy(mem, code, sizeof(code));

And finally tell the KVM virtual machine about its spacious new 4096-byte memory:

    struct kvm_userspace_memory_region region = {
	.slot = 0,
	.guest_phys_addr = 0x1000,
	.memory_size = 0x1000,
	.userspace_addr = (uint64_t)mem,
    };
    ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

The slot field provides an integer index identifying each region of memory we hand to KVM; calling KVM_SET_USER_MEMORY_REGION again with the same slot will replace this mapping, while calling it with a new slot will create a separate mapping. guest_phys_addr specifies the base "physical" address as seen from the guest, and userspace_addr points to the backing memory in our process that we allocated with mmap(); note that these always use 64-bit values, even on 32-bit platforms. memory_size specifies how much memory to map: one page, 0x1000 bytes.

Now that we have a VM, with memory containing code to run, we need to create a virtual CPU to run that code. A KVM virtual CPU represents the state of one emulated CPU, including processor registers and other execution state. Again, KVM gives us a handle to this VCPU in the form of a file descriptor:

    vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long)0);

The 0 here represents a sequential virtual CPU index. A VM with multiple CPUs would assign a series of small identifiers here, from 0 to a system-specific limit (obtainable by checking the KVM_CAP_MAX_VCPUS capability with KVM_CHECK_EXTENSION).

Each virtual CPU has an associated struct kvm_run data structure, used to communicate information about the CPU between the kernel and user space. In particular, whenever hardware virtualization stops (called a "vmexit"), such as to emulate some virtual hardware, the kvm_run structure will contain information about why it stopped. We map this structure into user space using mmap(), but first, we need to know how much memory to map, which KVM tells us with the KVM_GET_VCPU_MMAP_SIZE ioctl():

    mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);

Note that the mmap size typically exceeds that of the kvm_run structure, as the kernel will also use that space to store other transient structures that kvm_run may point to.

Now that we have the size, we can mmap() the kvm_run structure:

    run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);

The VCPU also includes the processor's register state, broken into two sets of registers: standard registers and "special" registers. These correspond to two architecture-specific data structures: struct kvm_regs and struct kvm_sregs, respectively. On x86, the standard registers include general-purpose registers, as well as the instruction pointer and flags; the "special" registers primarily include segment registers and control registers.

Before we can run code, we need to set up the initial states of these sets of registers. Of the "special" registers, we only need to change the code segment (cs); its default state (along with the initial instruction pointer) points to the reset vector at 16 bytes below the top of memory, but we want cs to point to 0 instead. Each segment in kvm_sregs includes a full segment descriptor; we don't need to change the various flags or the limit, but we zero the base and selector fields which together determine what address in memory the segment points to. To avoid changing any of the other initial "special" register states, we read them out, change cs, and write them back:

    ioctl(vcpufd, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpufd, KVM_SET_SREGS, &sregs);

For the standard registers, we set most of them to 0, other than our initial instruction pointer (pointing to our code at 0x1000, relative to cs at 0), our addends (2 and 2), and the initial state of the flags (specified as 0x2 by the x86 architecture; starting the VM will fail with this not set):

    struct kvm_regs regs = {
	.rip = 0x1000,
	.rax = 2,
	.rbx = 2,
	.rflags = 0x2,
    };
    ioctl(vcpufd, KVM_SET_REGS, &regs);

With our VM and VCPU created, our memory mapped and initialized, and our initial register states set, we can now start running instructions with the VCPU, using the KVM_RUN ioctl(). That will return successfully each time virtualization stops, such as for us to emulate hardware, so we'll run it in a loop:

    while (1) {
	ioctl(vcpufd, KVM_RUN, NULL);
	switch (run->exit_reason) {
	/* Handle exit */
	}
    }

Note that KVM_RUN runs the VM in the context of the current thread and doesn't return until emulation stops. To run a multi-CPU VM, the user-space process must spawn multiple threads, and call KVM_RUN for different virtual CPUs in different threads.

To handle the exit, we check run->exit_reason to see why we exited. This can contain any of several dozen exit reasons, which correspond to different branches of the union in kvm_run. For this simple VM, we'll just handle a few of them, and treat any other exit_reason as an error.

We treat a hlt instruction as a sign that we're done, since we have nothing to ever wake us back up:

	case KVM_EXIT_HLT:
	    puts("KVM_EXIT_HLT");
	    return 0;

To let the virtualized code output its result, we emulate a serial port on I/O port 0x3f8. Fields in run->io indicate the direction (input or output), the size (1, 2, or 4), the port, and the number of values. To pass the actual data, the kernel uses a buffer mapped after the kvm_run structure, and run->io.data_offset provides the offset from the start of that mapping.

	case KVM_EXIT_IO:
	    if (run->io.direction == KVM_EXIT_IO_OUT &&
		    run->io.size == 1 &&
		    run->io.port == 0x3f8 &&
		    run->io.count == 1)
		putchar(*(((char *)run) + run->io.data_offset));
	    else
		errx(1, "unhandled KVM_EXIT_IO");
	    break;

To make it easier to debug the process of setting up and running the VM, we handle a few common kinds of errors. KVM_EXIT_FAIL_ENTRY, in particular, shows up often when changing the initial conditions of the VM; it indicates that the underlying hardware virtualization mechanism (VT in this case) can't start the VM because the initial conditions don't match its requirements. (Among other reasons, this error will occur if the flags register does not have bit 0x2 set, or if the initial values of the segment or task-switching registers fail various setup criteria.) The hardware_entry_failure_reason does not actually distinguish many of those cases, so an error of this type typically requires a careful read through the hardware documentation.

	case KVM_EXIT_FAIL_ENTRY:
	    errx(1, "KVM_EXIT_FAIL_ENTRY: hardware_entry_failure_reason = 0x%llx",
		 (unsigned long long)run->fail_entry.hardware_entry_failure_reason);

KVM_EXIT_INTERNAL_ERROR indicates an error from the Linux KVM subsystem rather than from the hardware. In particular, under various circumstances, the KVM subsystem will emulate one or more instructions in the kernel rather than via hardware, such as for performance reasons (to coalesce a series of vmexits for I/O). The run->internal.suberror value KVM_INTERNAL_ERROR_EMULATION indicates that the VM encountered an instruction it doesn't know how to emulate, which most commonly indicates an invalid instruction.

	case KVM_EXIT_INTERNAL_ERROR:
	    errx(1, "KVM_EXIT_INTERNAL_ERROR: suberror = 0x%x",
	         run->internal.suberror);

When we put all of this together into the sample code, build it, and run it, we get the following:

    $ ./kvmtest
    4
    KVM_EXIT_HLT

Success! We ran our machine code, which added 2+2, turned it into an ASCII 4, and wrote it to port 0x3f8. This caused the KVM_RUN ioctl() to stop with KVM_EXIT_IO, which we emulated by printing the 4. We then looped and re-entered KVM_RUN, which stops with KVM_EXIT_IO again for the \n. On the third and final loop, KVM_RUN stops with KVM_EXIT_HLT, so we print a message and quit.

Additional KVM API features

This sample virtual machine demonstrates the core of the KVM API, but ignores several other major areas that many non-trivial virtual machines will care about.

Prospective implementers of memory-mapped I/O devices will want to look at the exit_reason KVM_EXIT_MMIO, as well as the KVM_CAP_COALESCED_MMIO extension to reduce vmexits, and the ioeventfd mechanism to process I/O asynchronously without a vmexit.

For hardware interrupts, see the irqfd mechanism, using the KVM_CAP_IRQFD extension capability. This provides a file descriptor that can inject a hardware interrupt into the KVM virtual machine without stopping it first. A virtual machine may thus write to this from a separate event loop or device-handling thread, and threads running KVM_RUN for a virtual CPU will process that interrupt at the next available opportunity.

x86 virtual machines will likely want to support CPUID and model-specific registers (MSRs), both of which have architecture-specific ioctl()s that minimize vmexits.

Applications of the KVM API

Other than learning, debugging a virtual machine implementation, or as a party trick, why use /dev/kvm directly?

Virtual machines like qemu-kvm or kvmtool typically emulate the standard hardware of the target architecture; for instance, a standard x86 PC. While they can support other devices and virtio hardware, if you want to emulate a completely different type of system that shares little more than the instruction set architecture, you might want to implement a new VM instead. And even within an existing virtual machine implementation, authors of a new class of virtio hardware device will want a clear understanding of the KVM API.

Efforts like novm and kvmtool use the KVM API to construct a lightweight VM, dedicated to running Linux rather than an arbitrary OS. More recently, the Clear Containers project uses kvmtool to run containers using hardware virtualization.

Alternatively, a VM need not run an OS at all. A KVM-based VM could instead implement a hardware-assisted sandbox with no virtual hardware devices and no OS, providing arbitrary virtual "hardware" devices as the API between the sandbox and the sandboxing VM.

While running a full virtual machine remains the primary use case for hardware virtualization, we've seen many innovative uses of the KVM API recently, and we can certainly expect more in the future.

Comments (46 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.3-rc3
Greg KH Linux 4.2.2
Greg KH Linux 4.1.9
Kamal Mostafa Linux 3.19.8-ckt7
Kamal Mostafa Linux 3.13.11-ckt27

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Filesystems and block I/O

Andreas Gruenbacher Richacls
Omar Sandoval Btrfs: free space B-tree

Memory management

Jesper Dangaard Brouer Further optimizing SLAB/SLUB bulking

Networking

David Ahern net: L3 master device

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jake Edge


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds