
Kernel development

Brief items

Kernel release status

The 3.10 kernel was released on June 30. Linus's announcement said: "In the bigger picture (ie since 3.9) this release has been pretty typical and not particularly prone to problems, despite my waffling about the exact release date. As usual, the bulk patch-wise is all drivers (pretty much exactly two thirds), while the rest is evenly split between arch updates and 'misc'. No major new subsystems this time around, although there are individual new features." Some of those new features include a number of Ftrace enhancements, the memory pressure notification mechanism, tickless operation, ARM multi-cluster power management support (part of the big.LITTLE solution), the bcache block caching layer, and much more. See the KernelNewbies 3.10 page for lots of details.

The 3.11 merge window is open as of this writing; see the separate article below for a summary of what has been merged so far.

Stable updates: 3.9.8, 3.4.51, and 3.0.84 were released on June 27, 3.2.48 came out on June 30, and 3.9.9, 3.4.52, and 3.0.85 were released on July 3.


Quotes of the week

Hmm, I bet lockdep and the branch tracer probably don't play well together. They both are bullies, and want to beat up the same kid. The problem is, they want sole access to beat up that kid, and don't want help.
Steven Rostedt

In my defence, it didn't actually say the patch did this. Just that we "can".
Rusty Russell

At this point in the process, I want testers who choose to test. Hapless victim testers come later. Well, other than randconfig testers, but I consider them to be voluntary hapless victims.
Paul McKenney


A flash filesystem tuning guide

Tim Bird has announced the availability of an extensive guide to tuning Linux for flash-based storage devices [PDF]. "This is the culmination of several months of effort, to determine the results of using different tuning options in the Linux kernel, with different filesystems running on flash-based block devices. The document was prepared by Cogent Embedded, and funded by the CE Workgroup of the Linux Foundation. In addition to describing different tuning options available, the document also gives methodologies for measuring performance on the filesystems and has extensive graphs showing the results of the different tuning options."


2013 Kernel Summit call for topics/proposals

The planning process for the 2013 Kernel Summit (October 23-25, Edinburgh) has begun; as in previous years, the program committee is looking for proposals for interesting topics in need of discussion. "The best topics for the kernel summit tend to focus on topics which are not appropriate for any of the subsystem-specific workshops or minisummits, and which can not be easily resolved using the normal e-mail and IRC channels. These include issues about our overall development process, and topics which span multiple subsystems." The deadline for proposals is July 19.


Per-CPU reference counts

By Jonathan Corbet
July 3, 2013
Reference counting is used by the kernel to know when a data structure is unused and can be disposed of. Most of the time, reference counts are represented by an atomic_t variable, perhaps wrapped by a structure like a kref. If references are added and removed frequently over an object's lifetime, though, that atomic_t variable can become a performance bottleneck. The 3.11 kernel will include a new per-CPU reference count mechanism designed to improve scalability in such situations.

This mechanism, created by Kent Overstreet, is defined in <linux/percpu-refcount.h>. Typical usage will involve embedding a percpu_ref structure within the data structure being tracked. The counter must be initialized with:

    int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release);

Where release() is the function to be called when the reference count drops to zero:

    typedef void (percpu_ref_release)(struct percpu_ref *);

The call to percpu_ref_init() will initialize the reference count to one. References are added and removed with:

    void percpu_ref_get(struct percpu_ref *ref);
    void percpu_ref_put(struct percpu_ref *ref);

These functions operate on a per-CPU array of reference counters, so they will not cause cache-line bouncing across the system. There is one potential problem, though: percpu_ref_put() must determine whether the reference count has dropped to zero and call the release() function if so. Summing an array of per-CPU counters would be expensive, to the point that it would defeat the whole purpose. This problem is avoided with a simple observation: as long as the initial reference is held, the count cannot be zero, so percpu_ref_put() does not bother to check.

The implication is that the thread which calls percpu_ref_init() must indicate when it is dropping its reference; that is done with a call to:

    void percpu_ref_kill(struct percpu_ref *ref);

After this call, the reference count degrades to the usual model with a single shared atomic_t counter; that counter will be decremented and checked whenever a reference is released.
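
To make the API concrete, here is a minimal sketch of how a data structure might use these functions; struct foo, foo_release(), and the other names are hypothetical and used only for illustration:

    #include <linux/kernel.h>
    #include <linux/percpu-refcount.h>
    #include <linux/slab.h>

    struct foo {
            struct percpu_ref ref;
            /* ... other fields ... */
    };

    /* Called automatically when the last reference is dropped. */
    static void foo_release(struct percpu_ref *ref)
    {
            struct foo *foo = container_of(ref, struct foo, ref);

            kfree(foo);
    }

    static struct foo *foo_create(void)
    {
            struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);

            if (!foo)
                    return NULL;
            /* The count starts at one: the creator holds the initial reference. */
            if (percpu_ref_init(&foo->ref, foo_release)) {
                    kfree(foo);
                    return NULL;
            }
            return foo;
    }

    /*
     * While the object is in use, other threads take and drop references
     * cheaply with percpu_ref_get(&foo->ref) and percpu_ref_put(&foo->ref).
     */

    static void foo_destroy(struct foo *foo)
    {
            /*
             * Drop the initial reference taken by percpu_ref_init(); once all
             * outstanding references have been put, foo_release() will run.
             */
            percpu_ref_kill(&foo->ref);
    }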

The performance benefits of a per-CPU reference count will clearly only be realized if most of the references to an object are added or removed while the initial reference is held. In practice that is often the case. This mechanism has found an initial use in the control group code; the comments in the header file claim that it is used by the asynchronous I/O code as well, but that is not the case in the current mainline.


Kernel development news

The 3.11 merge window opens

By Jonathan Corbet
July 3, 2013
Once upon a time, Linus tried to limit merge window activity to roughly 1,000 commits in any given day. On July 2, the day he began pulling changes for 3.11, over 3,000 commits made their way into the mainline. Clearly, a lethargic 1,000 commits/day pace won't cut it in the 3.x era. Expect this to be another busy development cycle. That said, the number of new features merged for 3.11 so far is relatively small. Much of the work pulled to date consists of code cleanups (in the staging tree, for example), reworking of ARM architecture code to use common abstractions, and the removal of board-file support for some ARM subarchitectures.

The user-visible changes that have been pulled so far include:

  • The f2fs filesystem now supports security labels, enabling it to be used with security modules.

  • The Lustre distributed filesystem has been merged into the staging tree. It is disabled in the build system, though, since it has build problems on a number of architectures.

  • The ARM architecture (both 32- and 64-bit) has gained better huge page support, in the form of both the hugetlbfs filesystem and transparent huge pages.

  • The ARM64 architecture now supports virtualization with both KVM and Xen.

  • The new O_TMPFILE option to the open() and openat() system calls allows filesystems to optimize the creation of temporary files — files which need not be visible in the filesystem. When O_TMPFILE is present, the provided pathname is only used to locate the containing directory (and thus the filesystem where the temporary file should be). So, among other things, programs using O_TMPFILE should have fewer concerns about vulnerabilities resulting from symbolic link attacks. A hedged usage sketch appears after this list.

  • New hardware support includes:

    • Systems and processors: Freescale i.MX6 SoloLite processors, Freescale Vybrid VF610 processors, Samsung EXYNOS5420 processors, Rockchip RK2928 and RK3xxx processors, TI Nspire processors, and STMicroelectronics STiH41x and STiH416 processors.

    • Miscellaneous: Marvell EBU device bus controllers, Marvell EBU PCIe controllers, ARM cache-coherent interconnect controllers, Microchip Technology MCP3204/08 analog-to-digital converters, Analog Devices AD7303 digital-to-analog converters, STMicroelectronics LPS331AP pressure sensors, and Samsung S3C24XX SoC pin controllers.

    • Networking: MTK USB Bluetooth interfaces.

    • USB: Faraday FUSBH200 host controllers and Cavium Networks Octeon host controllers.
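
As a concrete illustration of the O_TMPFILE change noted above, here is a hedged user-space sketch; it assumes a C library new enough to define the flag, and the path is purely illustrative:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /*
             * The pathname names only the directory (and thus the filesystem)
             * that will hold the unnamed temporary file.
             */
            int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);

            if (fd < 0) {
                    perror("open(O_TMPFILE)");
                    return 1;
            }

            /*
             * The file has no name, so it cannot be attacked through symbolic
             * links, and it disappears when the last descriptor is closed.
             */
            write(fd, "scratch data\n", 13);
            close(fd);
            return 0;
    }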

Changes visible to kernel developers include:

  • There is a new struct file_operations method:

        int (*iterate) (struct file *, struct dir_context *);
    

    Its job is to iterate through the contents of a directory. This method is meant to serve as a replacement for the readdir() method that eliminates persistent race conditions associated with updating the current read position. All internal users have been converted, and the readdir() method has been removed.

  • There are a couple of new functions for working with atomic types:

        int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode);
        void wake_up_atomic_t(atomic_t *p);
    

    A call to wait_on_atomic_t() will block the calling thread until the given val goes to zero. Simply decrementing an atomic_t variable will not be sufficient to wake anybody waiting, though; an explicit call to wake_up_atomic_t() is required to do that. A brief usage sketch appears after this list.

  • The CONFIG_HOTPLUG configuration option has been removed; all kernels are hotplug enabled these days.

  • The wait/wound mutex locking primitive has been merged.

  • As part of the read-copy-update simplification effort, the "tiny-preempt" version of RCU has been removed from the kernel. From the commit message: "People currently using TINY_PREEMPT_RCU can get much better memory footprint with TINY_RCU, or, if they really need preemptible RCU, they can use TREE_PREEMPT_RCU with a relatively minor degradation in memory footprint."

  • The kernel now has the concept of power-efficient workqueues; these are simply marked as "unbound," so that jobs queued to them can run on any CPU in the system. Per-CPU workqueues may perform better in some situations, but they can also cause sleeping CPUs to wake up; that wakeup can be avoided if work items can be run on CPUs that are not sleeping. If the CONFIG_WQ_POWER_EFFICIENT_DEFAULT configuration option is set, a number of workqueues observed to impact power performance will be switched to the unbound mode.

    Kernel code can explicitly request power-efficient behavior by creating workqueues with the WQ_POWER_EFFICIENT flag or by using a couple of new systemwide workqueues: system_power_efficient_wq or system_freezable_power_efficient_wq.

  • The d_hash() and d_compare() methods in struct dentry_operations have lost their inode argument.

  • A new per-CPU reference count mechanism has been added; see this article for details.
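
As promised above, here is a brief, hedged sketch of how the new atomic wait/wake helpers might be used; the object layout and function names are hypothetical:

    #include <linux/atomic.h>
    #include <linux/sched.h>
    #include <linux/wait.h>

    /* The action callback decides how to sleep; here it simply schedules. */
    static int my_wait_action(atomic_t *p)
    {
            schedule();
            return 0;
    }

    /* Waiter: block until the count reaches zero. */
    static void wait_for_users(atomic_t *users)
    {
            wait_on_atomic_t(users, my_wait_action, TASK_UNINTERRUPTIBLE);
    }

    /*
     * Releaser: decrementing alone will not wake anybody; an explicit
     * wake_up_atomic_t() call is needed once the count hits zero.
     */
    static void put_user_ref(atomic_t *users)
    {
            if (atomic_dec_and_test(users))
                    wake_up_atomic_t(users);
    }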

A normal two-week merge window could be expected to close on July 16, but Linus has occasionally shortened the merge window in recent development cycles. If the development cycle as a whole lasts for the usual 70 days, then the 3.11 kernel can be expected around September 10.


Supporting KVM on the ARM architecture

July 3, 2013

This article was contributed by Christoffer Dall and Jason Nieh

One of the new features in the 3.9 kernel is KVM/ARM: KVM support for the ARM architecture. While KVM is already supported on x86 (both 32- and 64-bit), PowerPC, and s390, ARM support required more than just reimplementing the features and styles of the other architectures. The reason is that the ARM virtualization extensions are quite different from those of other architectures.

Historically, the ARM architecture has not been virtualizable, because a number of sensitive instructions do not trap when they are executed in an unprivileged mode. However, the most recent 32-bit ARM processors, like the Cortex-A15, include hardware support for virtualization as an ARMv7 architectural extension. A number of research projects have attempted to support virtualization on ARM processors without hardware virtualization support, but they require various levels of paravirtualization and have not been stabilized. KVM/ARM is designed specifically to use the virtualization extensions to run unmodified guest operating systems on ARM processors that provide them.

The ARM hardware extensions differ quite a bit from their x86 counterparts. A simplified view of the ARM CPU modes is that the kernel runs in SVC mode and user space runs in USR mode. ARM introduced a new CPU mode for running hypervisors called HYP mode, which is a more privileged mode than SVC mode. An important characteristic of HYP mode, which is central to the design of KVM/ARM, is that HYP mode is not an extension of SVC mode, but a distinct mode with a separate feature set and a separate virtual memory translation mechanism. For example, if a page fault is taken in HYP mode, the faulting virtual address is stored in a different register in HYP mode than in SVC mode. As another example, for the SVC and USR modes, the hardware has two separate page table base registers, which are used to provide the familiar address space split between user space and kernel. HYP mode only uses a single page table base register and therefore does not allow the address space split between user mode and kernel.

The design of HYP mode is a good fit with a classic bare-metal hypervisor design because such a hypervisor does not reuse any existing kernel code written to work in SVC mode. KVM, however, was designed specifically to reuse existing kernel components and integrate these with the hypervisor. In comparison, the x86 hardware support for virtualization does not provide a new CPU mode, but provides an orthogonal concept known as "root" and "non-root". When running as non-root on x86, the feature set is completely equivalent to a CPU without virtualization support. When running as root on x86, the feature set is extended to add additional features for controlling virtual machines (VMs), but all existing kernel code can run unmodified as both root and non-root. On x86, when a VM traps to the hypervisor, the CPU changes from non-root to root. On ARM, when a VM traps to the hypervisor, the CPU traps to HYP mode.

HYP mode controls virtualization features by configuring sensitive operations to trap to HYP mode when executed in SVC and USR mode; it also allows hypervisors to configure a number of shadow register values used to hide information about the physical hardware from VMs. HYP mode also controls Stage-2 translation, a feature similar to Intel's "extended page table" used to control VM memory access. Normally when an ARM processor issues a load/store instruction, the memory address used in the instruction is translated by the memory management unit (MMU) from a virtual address to a physical address using regular page tables, like this:

  • Virtual Address (VA) -> Physical Address (PA)

The virtualization extensions add an extra stage of translation known as Stage-2 translation, which can be enabled and disabled only from HYP mode. When Stage-2 translation is enabled, the MMU translates addresses in the following way:

  • Stage-1: Virtual Address (VA) -> Intermediate Physical Address (IPA)
  • Stage-2: Intermediate Physical Address (IPA) -> Physical Address (PA)

The guest operating system controls the Stage-1 translation independently of the hypervisor and can change mappings and page tables without trapping to the hypervisor. The Stage-2 translation is controlled by the hypervisor, and a separate Stage-2 page table base register is accessible only from HYP mode. The use of Stage-2 translations allows software running in HYP mode to control access to physical memory in a manner completely transparent to a VM running in SVC or USR mode, because the VM can only access pages that the hypervisor has mapped from an IPA to the page's PA in the Stage-2 page tables.

KVM/ARM design

KVM/ARM is tightly integrated with the kernel and effectively turns the kernel into a first class ARM hypervisor. For KVM/ARM to use the hardware features, the kernel must somehow be able to run code in HYP mode because HYP mode is used to configure the hardware for running a VM, and traps from the VM to the host (KVM/ARM) are taken to HYP mode.

Rewriting the entire kernel to run only in HYP mode is not an option, because it would break compatibility with hardware that doesn't have the virtualization extensions. A HYP-mode-only kernel also would not work when run inside a VM, because the HYP mode would not be available. Support for running both in HYP mode and SVC mode would be much too invasive to the source code, and would potentially slow down critical paths. Additionally, the hardware requirements for the page table layout in HYP mode are different from those in SVC mode in that they mandate the use of LPAE (ARM's Large Physical Address Extension) and require specific bits to be set on the page table entries, which are otherwise clear on the kernel page tables used in SVC mode. So KVM/ARM must manage a separate set of HYP mode page tables and explicitly map in code and data accessed from HYP mode.

We therefore came up with the idea to split execution across multiple CPU modes and run as little code as possible in HYP mode. The code run in HYP mode is limited to a few hundred instructions and isolated to two assembly files: arch/arm/kvm/interrupts.S and arch/arm/kvm/interrupts_head.S.

For readers not familiar with the general KVM architecture: KVM on all architectures works by exposing a simple interface to user space to provide virtualization of core components such as the CPU and memory. Device emulation, along with setup and configuration of VMs, is handled by a user-space process, typically QEMU. When such a process decides it is time to run the VM, it calls the KVM_RUN ioctl(), which executes VM code natively on the CPU. On ARM, the ioctl() handler in arch/arm/kvm/arm.c switches to HYP mode by issuing an HVC (hypercall) instruction; the code running in HYP mode then context switches all hardware state between the host and the guest and finally jumps to the VM's SVC or USR mode to natively execute guest code. When KVM/ARM runs guest code, it enables Stage-2 memory translation, which completely isolates the address space of VMs from the host and other VMs. The CPU will be executing guest code until the hardware traps to HYP mode because of a hardware interrupt, a Stage-2 page fault, or a sensitive operation. When such a trap occurs, KVM/ARM switches back to the host hardware state and returns to normal KVM/ARM host SVC code with the full kernel mappings available.

When returning from a VM, KVM/ARM examines the reason for the trap and performs the necessary emulation or resource allocation to allow the VM to resume. For example, if the guest performs a memory-mapped I/O (MMIO) operation to an emulated device, that will generate a Stage-2 page fault, because only physical RAM dedicated to the guest is mapped in the Stage-2 page tables. KVM/ARM will read special system registers, available only in HYP mode, which contain the address causing the fault, and report that address to QEMU through a memory-mapped structure shared between QEMU and the kernel. QEMU knows the memory map of the emulated system and can forward the operation to the appropriate device emulation code. As another example, if a hardware interrupt occurs while the VM is executing, it will trap to HYP mode, and KVM/ARM will switch back to the host state and re-enable interrupts, which will cause the hardware interrupt handlers to execute once again, but this time without trapping to HYP mode. While every hardware interrupt thus ends up interrupting the CPU twice, the actual trap cost on ARM hardware is negligible compared to the world-switch from the VM to the host.
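
For readers who want to see what the architecture-independent user-space side of this flow looks like, here is a heavily abbreviated, hedged sketch of a KVM run loop of the kind QEMU or kvmtool implements; guest memory setup, register initialization, and error handling are omitted, and handle_mmio() is a hypothetical stand-in for device emulation:

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Hypothetical stand-in for the device-emulation code in QEMU. */
    static void handle_mmio(struct kvm_run *run)
    {
            (void)run;
    }

    int main(void)
    {
            int kvm = open("/dev/kvm", O_RDWR);
            int vm = ioctl(kvm, KVM_CREATE_VM, 0);
            int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
            int map_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
            struct kvm_run *run = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, vcpu, 0);

            /*
             * Guest RAM would be registered with KVM_SET_USER_MEMORY_REGION
             * and a guest image loaded before running; omitted here.
             */
            for (;;) {
                    ioctl(vcpu, KVM_RUN, 0);        /* world-switch into the guest */

                    switch (run->exit_reason) {
                    case KVM_EXIT_MMIO:
                            /*
                             * A Stage-2 fault on an emulated device; the fault
                             * address and data are reported through *run.
                             */
                            handle_mmio(run);
                            break;
                    default:
                            printf("unhandled exit reason %d\n", run->exit_reason);
                            return 0;
                    }
            }
    }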

HYP mode

Providing access to HYP mode from KVM/ARM was a non-trivial challenge, since HYP mode is a more privileged mode than the standard ARM kernel modes and there is no architecturally defined ABI for entering HYP mode from less privileged modes. One option would be to expect bootloaders to either install secure monitor handlers or hypercall handlers that would allow the kernel to trap back into HYP mode, but this method is brittle and error-prone, and prior experience with establishing TrustZone APIs has shown that it is hard to create a standard across different implementations of the ARM architecture.

Instead, Will Deacon, Catalin Marinas, and Ian Jackson proposed that we rely on the kernel being booted in HYP mode if the kernel is going to support KVM/ARM. In version 3.6, a patch series developed by Dave Martin and Marc Zyngier was merged that detects whether the kernel was booted in HYP mode and, if so, installs a small stub handler that allows other subsystems like KVM/ARM to take control of HYP mode later on. As it turns out, it is reasonable to recommend that bootloaders always boot the kernel in HYP mode when it is available: even legacy kernels, which expect to be entered in SVC mode, make an explicit switch to SVC mode early in the boot process. Changing bootloaders to simply boot all kernels in HYP mode is therefore backward-compatible with legacy kernels.

Installing the hypervisor stub when the kernel is booted in HYP mode was an interesting implementation challenge. First, ARM kernels are often loaded as a compressed image, with a small uncompressed pre-boot environment known as the "decompressor" which decompresses the kernel image into memory. If the decompressor detects that it is booted in HYP mode, then a temporary stub must be installed at this stage allowing the CPU to fall back to SVC mode to run the decompressor code. The reason is that the decompressor must turn on the MMU to enable caches, but doing so in HYP mode requires support for the LPAE page table format used by HYP mode, which is an unwanted piece of complexity in the decompressor code. Therefore, the decompressor installs the temporary HYP stub, falls back to SVC mode, decompresses the kernel image, and finally, immediately before calling the uncompressed initialization code, switches back to HYP mode again. Then, the uncompressed initialization code will again detect that the CPU is in HYP mode and will install the main HYP stub to be used by kernel modules later in the boot process or after the kernel has finally booted. The HYP stub can be found in arch/arm/kernel/hyp-stub.S. Note that the uncompressed initialization code doesn't care whether the uncompressed code is started directly in HYP mode from a bootloader or from the decompressor.

Because HYP mode is a more privileged mode than SVC mode, the transition from SVC mode to HYP mode can occur only through a hardware trap. Such a trap can be generated by executing the hypercall (HVC) instruction, which traps into HYP mode and causes the CPU to execute code from a jump entry in the HYP exception vectors. This allows a subsystem to use the hypervisor stub to take full control of HYP mode, because the stub allows subsystems to change the location of those exception vectors. The HYP stub is called through the __hyp_set_vectors() function, which takes the physical address of the new HYP exception vector as its only parameter and stores that address in the HYP Vector Base Address Register (HVBAR). When KVM/ARM is initialized during normal kernel boot (after all main kernel initialization functions have run), it creates an identity mapping (a one-to-one mapping of virtual addresses to physical addresses) of the HYP-mode initialization code, which includes an exception vector, and sets the HVBAR to the physical address of that vector using the __hyp_set_vectors() function. The KVM/ARM initialization code then executes the HVC instruction to run the identity-mapped initialization code, which can safely enable the MMU because the code is identity mapped.

Finally, KVM/ARM initialization sets the HVBAR to point to the main KVM/ARM HYP exception-handling code, now using HYP-mode virtual addresses. Since HYP mode has its own address space, KVM/ARM must choose an appropriate virtual address for any code or data that is mapped into HYP mode. For convenience and clarity, the kernel virtual addresses are reused for pages mapped into HYP mode, making it possible to dereference structure members directly as long as all relevant data structures are mapped into HYP mode.

Both traps from sensitive operations in VMs and hypercalls from the host kernel enter HYP mode through an exception on the CPU. Instead of changing the HYP exception vector on every switch between the host and the guest, a single HYP exception vector is used to handle both HVC calls from the host kernel and traps from the VM. The HYP vector handling code checks the VMID field of the Stage-2 page table base register; VMID 0 is reserved for the host. This field is accessible only from HYP mode, so guests are prevented from escalating their privileges. We introduced the kvm_call_hyp() function, which can be used to execute code in HYP mode from KVM/ARM. For example, KVM/ARM code running in SVC mode can make the following call to invalidate TLB entries, which must be done from HYP mode:

    kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);

Virtual GIC and timers

ARMv7 processors with hardware virtualization support also include virtualization support for timers and the interrupt controller. Marc Zyngier implemented support for these features, which are called the "generic timers" (a.k.a. architected timers) and the Virtual Generic Interrupt Controller (VGIC).

Traditionally, timer operations on ARM systems have been MMIO operations to dedicated timer devices. Such MMIO operations performed by VMs would trap to QEMU, which would involve a world-switch from the VM to host kernel, and a switch from the host kernel to user space for every read of the time counter or every time a timer needed to be programmed. Of course, the timer functionality could be emulated inside the kernel, but this would require a trap from the VM to the host kernel, and would therefore add substantial overhead to VMs compared to running on native hardware. Reading the counter is a very frequent operation in Linux. For example, every time a task is enqueued or dequeued in the scheduler, the runqueue clock is updated, and in particular multi-process workloads like Apache benchmarks clearly show the overhead of trapping on each counter read.

ARMv7 allows for an optional extension to the architecture, the generic timers, which makes counter and timer operations part of the core architecture. Reading a counter or programming a timer is now done using coprocessor register accesses on the core itself, and the generic timers provide two sets of timers and counters: the physical and the virtual. The virtual counter and timer are always available, but access to the physical counter and timer can be limited through control registers accessible only in HYP mode. If the kernel is booted in HYP mode, it is configured to use the physical timers; otherwise the kernel uses the virtual timers. This arrangement lets an unmodified kernel running inside a VM program its timers without trapping to the host, while still providing the necessary isolation of the host from VMs.

If a VM programs a virtual timer, but is preempted before the virtual timer fires, KVM/ARM reads the timer settings to figure out the remaining time on the timer, and programs a corresponding soft timer in the kernel. When the soft timer expires, the timer handler routine injects the timer interrupt back into the VM. If the VM is scheduled before the soft timer expires, the virtual timer hardware is re-programmed to fire when the VM is running.
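
The following is a hedged sketch of that general pattern, not the actual KVM/ARM code; struct vcpu_timer, inject_virtual_timer_irq(), and the way the remaining time is obtained are all assumptions made for illustration:

    #include <linux/hrtimer.h>
    #include <linux/kernel.h>
    #include <linux/ktime.h>

    struct vcpu_timer {
            struct hrtimer soft_timer;  /* host-side stand-in for the virtual timer */
            u64 ns_remaining;           /* time left on the guest's virtual timer */
    };

    /* Hypothetical stand-in for injecting a virtual timer interrupt via the VGIC. */
    static void inject_virtual_timer_irq(struct vcpu_timer *t)
    {
            (void)t;
    }

    static enum hrtimer_restart soft_timer_expired(struct hrtimer *hrt)
    {
            struct vcpu_timer *t = container_of(hrt, struct vcpu_timer, soft_timer);

            inject_virtual_timer_irq(t);
            return HRTIMER_NORESTART;
    }

    /* Called when the VCPU is scheduled out with a pending virtual timer. */
    static void vcpu_timer_schedule(struct vcpu_timer *t)
    {
            hrtimer_init(&t->soft_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
            t->soft_timer.function = soft_timer_expired;
            hrtimer_start(&t->soft_timer, ns_to_ktime(t->ns_remaining),
                          HRTIMER_MODE_REL);
    }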

The role of an interrupt controller is to receive interrupts from devices and forward them to one or more CPUs. ARM's Generic Interrupt Controller (GIC) consists of a "distributor", which is the core logic of the GIC, and a number of per-CPU interfaces. The GIC allows CPUs to mask certain interrupts, assign interrupt priorities, and set the affinity of certain interrupts to certain CPUs. Finally, CPUs also use the GIC to send inter-processor interrupts (IPIs) to each other; IPIs are the underlying mechanism for SMP cross calls on ARM.

Typically, when the GIC raises an interrupt to a CPU, the CPU will acknowledge the interrupt to the GIC, interact with the interrupting device, signal end-of-interrupt (EOI) to the GIC, and resume normal operation. Both acknowledging and EOI-signaling interrupts are privileged operations that will trap when executed from within a VM, adding performance overhead to common operations. The hardware support for virtualization in the VGIC comes in the form of a virtual CPU interface that CPUs can query to acknowledge and EOI virtual interrupts without trapping to the host. The hardware support further provides a virtual control interface to the VGIC, which is accessed only by KVM/ARM, and is used to program virtual interrupts generated from virtual devices (typically emulated by QEMU) to the VGIC.

Since access to the distributor is not a common operation, the hardware does not provide a virtual distributor; instead, KVM/ARM provides in-kernel GIC distributor emulation code as part of its VGIC support. The result is that VMs can acknowledge and EOI virtual interrupts directly without trapping to the host. Actual hardware interrupts received during VM execution always trap to HYP mode, and KVM/ARM lets the kernel's standard ISRs handle them as usual, so the host remains in complete control of the physical hardware.

There is no mechanism in the VGIC or the generic timers to let the hardware directly inject physical interrupts from the virtual timers as virtual interrupts into VMs. Therefore, virtual timer interrupts trap like any other hardware interrupt; KVM/ARM registers a handler for the virtual timer interrupt and, when that handler is called from the ISR, injects a corresponding virtual timer interrupt into the VM in software.

Results

During the development of KVM/ARM, we continuously measured the virtualization overhead and ran long-running workloads to test stability and measure performance. We have used various kernel configurations and user space environments (both ARM and Thumb-2) for both the host and the guest, and validated our workloads with SMP and UP guests. Some workloads have run for several weeks at a time without crashing, and the system behaves as expected when exposed to extreme memory pressure or CPU over-subscription. We therefore feel that the implementation is stable and encourage users to try and use the system.

Our measurements using both micro and macro benchmarks show that the overhead of KVM/ARM is within 10% of native performance on multicore platforms for balanced workloads. Purely CPU-bound workloads perform almost at native speed. The relative overhead of KVM/ARM is comparable to KVM on x86. For some macro workloads, like Apache and MySQL, KVM/ARM even has less overhead than on x86 using the same configuration. A significant source of this improved performance can be attributed to the optimized path for IPIs and thereby process rescheduling caused by the VGIC and generic timers hardware support.

Status and future work

KVM/ARM started as a research project at Columbia University and was later supported by Virtual Open Systems. After the 3.9 merge, KVM/ARM continues to be maintained by the original author of the code, Christoffer Dall, and the ARMv8 (64-bit) port is maintained by Marc Zyngier. QEMU system support for ARMv7 has been merged upstream in QEMU, and kvmtool also has support for KVM/ARM on both ARMv7 and ARMv8. ARMv8 support is scheduled to be merged for the 3.11 kernel release.

Linaro is supporting a number of efforts to make KVM/ARM itself feature complete, including debugging support and full migration support, such as migration of the in-kernel VGIC and generic-timer state. Additionally, virtio has so far relied on a PCI backend in QEMU and the kernel, but a significant amount of work has already been merged upstream to refactor the QEMU virtio code to allow better support for MMIO-based virtio devices, which are used to accelerate virtual network and block devices. The remaining work is currently a priority for Linaro, as is support for the mach-virt ARM machine definition, a simple machine model designed for virtual machines and based only on virtio devices. Finally, Linaro is also working on ARMv8 support in QEMU, which will also take advantage of mach-virt and virtio support.

Conclusion

KVM/ARM is already used heavily in production by the SUSE Open Build Service on Arndale boards, and we can only speculate about its future uses: in green data centers, as the hypervisor of choice for ARM-based networking equipment, or even on ARM-based laptops and desktops.

For more information, help on how to run KVM/ARM on your specific board or SoC, or to participate in KVM/ARM development, the kvmarm mailing list is a good place to start.


When the kernel ABI has to change

By Jonathan Corbet
July 2, 2013
Maintaining user-space ABI compatibility is one of the key guiding principles of Linux kernel development; changes that break user space are likely to be reverted quickly, often after an incendiary message from Linus. But what is to be done in cases where an ABI is deemed to be unworkable and unmaintainable? Control group maintainer Tejun Heo is trying to solve that problem, but, in the process, he is running into opposition from one of Linux's highest-profile users.

Control groups ("cgroups") allow an administrator to divide the processes in a system into a hierarchy of groups; this hierarchy need not match the process tree. The grouping function alone is useful; systemd uses it to keep track of all of the processes involved with a given service, for example. But the real purpose of control groups is to allow resource control policies to be applied to the processes within each group; to that end, the kernel contains a range of "controllers" that enforce policies on CPU time, block I/O bandwidth, memory usage, and more. Control groups are managed with a virtual filesystem exported by the kernel; see Documentation/cgroups/cgroups.txt for a thorough (if slightly dated) description of how this subsystem works.
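
To make the filesystem interface concrete, here is a hedged sketch of the kind of low-level manipulation described above; the /sys/fs/cgroup/cpu mount point is a common convention rather than a kernel guarantee, and "mygroup" is purely illustrative:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            /* Create a new group as a directory within the hierarchy. */
            mkdir("/sys/fs/cgroup/cpu/mygroup", 0755);

            /* Give the group half the default CPU share (the default is 1024). */
            FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.shares", "w");
            if (f) {
                    fprintf(f, "512\n");
                    fclose(f);
            }

            /* Move the current process into the new group. */
            f = fopen("/sys/fs/cgroup/cpu/mygroup/cgroup.procs", "w");
            if (f) {
                    fprintf(f, "%d\n", getpid());
                    fclose(f);
            }
            return 0;
    }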

The trouble with control groups

There is no doubt that the functionality provided by control groups is both extensive and flexible. Indeed, part of the problem is that it is too flexible. Consider, for example, the support for multiple hierarchies in the control group subsystem. Cgroups allow the creation of a hierarchy of processes to be used in dividing up a limited resource, such as available CPU time. But they allow the creation of an entirely different hierarchy for the control of a different resource. Thus, for example, CPU time could be placed under a policy that favors certain users over others, while memory use could, instead, be regulated depending on what program a process is running. Processes can be grouped in entirely different ways in each hierarchy.

The problem here is that, while the design allowing each controller to have its own hierarchy seems nice and orthogonal, the implementation cannot be that way. The controllers for memory usage, I/O bandwidth, and writeback throttling all look independent on the surface, but those problems are all intertwined in the memory management system in the kernel. All three of those controllers will need to associate pages of memory with specific control groups; if a given process is in one cgroup from the memory controller's point of view, but a different cgroup for the I/O bandwidth controller, that tracking quickly becomes difficult or impossible. It is easy to set up policies that conflict or that simply cannot be properly implemented within the kernel.

Another perceived problem is that the virtual filesystem interface is too low-level, exposing too many details of how control groups are implemented in the kernel. As the number of users of control groups grows, it will become increasingly hard to make changes without breaking existing applications. It's not clear what the correct cgroup interface should be, but those who spend enough time looking at the current implementation tend to come away convinced that changes are needed.

This problem is aggravated by an increasing tendency to use file permissions to hand subtrees of a cgroup hierarchy over to unprivileged processes. There are legitimate reasons to want to delegate authority in that way; complex applications may want to use cgroups to implement their own internal policies, for example. There are also use cases associated with virtualization and containers. But that delegation greatly increases the number of programs with an intimate understanding of how cgroups work, complicating any future changes. There are also any number of security issues that come with unprivileged access to a cgroup hierarchy; it is trivially easy to run denial-of-service attacks against a system if one has write access to a cgroup hierarchy. In short, the interface was just never meant to be used in this way.

For these reasons and more, there is a strong desire to rework the cgroup interface into something that is more maintainable, more secure, and easier to use. Getting there, though, is likely to be a long and painful process, as can be seen by the early discussions around the subject.

The solution and its discontents

The plan for control groups can be described in relatively few words; the resulting discussion, instead, is rather more verbose. Multiple hierarchies are seen to be misconceived and unmaintainable on their face; the plan is to phase out that functionality so that, in the end, all controllers are attached to a single, unified hierarchy of processes. Unprivileged access to the cgroup hierarchy will be strongly discouraged; the hope is to have a single, privileged process handling all of the cgroup management tasks. That process will, in turn, provide some sort of higher-level interface to the rest of the system.

Tim Hockin is charged with making Google's massive cluster of machines work properly for a wide variety of internal users. Google uses cgroups extensively for internal resource management; more to the point, the company also makes extensive use of multiple hierarchies. So, needless to say, Tim is not at all pleased with the prospect of that functionality going away. As he put it:

So yeah, I'm in a bit of a panic. You're making a huge amount of work for us. You're breaking binary compatibility of the (probably) largest single installation of Linux in the world. And you're being kind of flip about the reality of it...

Part of the reason for Tim's panic is that he was under the impression that the existing functionality would be removed within a year or two. That is decidedly not the case; the kernel's ABI rules have not been suspended for control groups. The plan is to add a new control interface, and any new features will probably only work with that new interface, but the existing interface, including multiple hierarchies, will continue to be supported until it's clear that it is no longer being used.

Tim described, in general terms, how Google uses multiple hierarchies. Essentially, every job in the system has two attributes: whether it is a production or "batch" job, and whether it gets I/O bandwidth guarantees. The result is a 2x2 matrix describing resource-allocation policies (though one of the entries, batch jobs with I/O guarantees, makes little sense and is not used). Using two independent cgroup hierarchies makes this set of policies relatively easy to express; Tim asserts that a unified hierarchy would not be usable in the same way.

Tejun was unimpressed, responding that this case could be managed by setting up three cgroups at the same level of the hierarchy, each of which would implement one of the three useful policy combinations. The problem with this solution, according to Tim, is that the processes without I/O bandwidth guarantees would be split into two groups, whereas in the current solution they are in one group. If one of those two groups has far more members than the other, the members of that larger group will get far less of the available bandwidth than the members of the smaller group. Tejun still thinks that the problem should be solvable, perhaps with the use of a user-space management daemon that would adjust the relative bandwidth allocations depending on the workload. Tim has answered that the situation is actually a lot more complicated, but he has not yet shared the details, so it is hard to understand what the real difficulties with a single hierarchy are.

A single management process?

Tim also dislikes the plan to have a single process managing the control group hierarchy. That process could be made to provide the functionality that Google (along with others) needs, though there are performance concerns associated with adding a process in the middle. But Tim was not alone in being concerned by this message from Lennart Poettering on the nature of that single process:

This hierarchy becomes private property of systemd. systemd will set it up. Systemd will maintain it. Systemd will rearrange it. Other software that wants to make use of cgroups can do so only through systemd's APIs.

Google does not currently run systemd and is not thrilled by the prospect of having to switch to be able to make use of cgroup functionality. So Tim responded that "If systemd is the only upstream implementation of this single-agent idea, we will have to invent our own, and continue to diverge rather than converge." There is no particular judgment against systemd implied by that position; it is simply that making that switch would affect a whole lot of things beyond cgroups, and that is more than Google feels like it would want to take on at the moment. But, in general, it would not be surprising if, in the long term, some users remain opposed to the idea of systemd as the only interface to cgroups. That suggests that we will be seeing competing implementations of the cgroup management daemon concept.

One of those alternatives may be about to come into view; Serge Hallyn confessed that he is working on a cgroup management daemon of his own. In some situations, a separate daemon might meet a lot of needs, but Lennart was clear that he would never have systemd defer to such a daemon. His position — not an entirely unreasonable one — is that the init process, as the creator of all other processes in the system, should not be dependent on any other process for its normal operation. He also seems to feel that it would not be possible to put the cgroup management code into a library that could be used in multiple places. So we are likely to see multiple implementations of this functionality in use before this story is done. That, in turn, could create headaches for developers of applications that need to interface with the cgroup subsystem.

The discussion, thus far, seems to have changed few minds. But Tejun has made it clear that he doesn't intend to just ignore complaints from users:

While the bar to overcome is pretty high, I do want to learn about the problems you guys are foreseeing, so that I can at least evaluate the graveness properly and hopefully compromises which can mitigate the most sore ones can be made wherever necessary.

He also acknowledged the biggest problem faced by the development community: despite having accumulated some experience on wrong ways to solve the problem, nobody really knows what the right solution is. More mistakes are almost certain, so it's too soon to try to settle on final solutions.

In the early years of Linux, most of the ABIs implemented by the kernel were specified by groups like POSIX or by prior implementation in other kernels. That made the ABI design problem mostly go away; it was just a matter of doing what had already been done before. For current problems, though, there are rather fewer places to look for guidance, so we are having to figure out the best designs as we go. Mistakes are certain to happen in such a setting. So we are going to have to get better at learning from those mistakes, coming up with better designs, and moving to them without causing misery for our users. The control group transition is likely to set a lot of precedents regarding how these changes should (or should not) be handled in the future.


Patches and updates

Miscellaneous

  • Lucas De Marchi: kmod 14 (July 3, 2013)

