The 3.10 kernel was released on June 30. Linus's announcement
said: "In the bigger
picture (ie since 3.9) this release has been pretty typical and not
particularly prone to problems, despite my waffling about the exact release
date. As usual, the bulk patch-wise is all drivers (pretty much exactly two
thirds), while the rest is evenly split between arch updates and 'misc'. No
major new subsystems this time around, although there are individual new
[...]" Some of those new features include a number of Ftrace enhancements,
the memory pressure notification mechanism, tickless operation,
ARM multi-cluster power management (part of the big.LITTLE solution), the bcache
block caching layer, and much more. See the KernelNewbies 3.10 page for
lots of details.
The 3.11 merge window is open as of this writing; see the separate article
below for a summary of what has been merged so far.
3.4.51 and 3.0.84 were released on June 27,
3.2.48 came out on June 30,
and 3.0.85 followed.
Comments (none posted)
Hmm, I bet lockdep and the branch tracer probably don't play well
together. They both are bullies, and want to beat up the same
kid. The problem is, they want sole access to beat up that kid, and
don't want help.
— Steven Rostedt
In my defence, it didn't actually say the patch did this. Just
that we "can".
— Rusty Russell
At this point in the process, I want testers who choose to test.
Hapless victim testers come later. Well, other than randconfig
testers, but I consider them to be voluntary hapless victims.
— Paul McKenney
Comments (none posted)
Tim Bird has announced the availability of an
extensive guide to tuning Linux for flash-based storage devices [PDF].
"This is the culmination of several months of effort, to determine
the results of using different tuning options in the Linux kernel, with
different filesystems running on flash-based block devices. The document
was prepared by Cogent Embedded, and funded by the CE Workgroup of the
Linux Foundation. In addition to describing different tuning options
available, the document also gives methodologies for measuring performance
on the filesystems and has extensive graphs showing the results of the
different tuning options."
Full Story (comments: 12)
The planning process for the 2013 Kernel Summit (October 23-25, Edinburgh)
has begun; as in previous years, the program committee is looking for
proposals for interesting topics in need of discussion. "The best topics for the kernel summit tend to focus on topics which
are not appropriate for any of the subsystem-specific workshops or
minisummits, and which can not be easily resolved using the normal
e-mail and IRC channels. These include issues about our overall
development process, and topics which span multiple subsystems."
The deadline for proposals is July 19.
Full Story (comments: none)
Reference counting is used by the kernel to know when a data structure is
unused and can be disposed of. Most of the time, reference counts are
represented by an atomic_t
variable, perhaps wrapped by a
structure like a kref
. If references are added and removed
frequently over an object's lifetime, though, that atomic_t
variable can become a performance bottleneck. The 3.11 kernel will include
a new per-CPU reference count mechanism designed to improve scalability in
such situations.
This mechanism was created by Kent Overstreet. Typical usage will involve
embedding a percpu_ref structure within
the data structure being tracked. The counter must be initialized with:
int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release);
Where release() is the function to be called when the reference
count drops to zero:
typedef void (percpu_ref_release)(struct percpu_ref *);
The call to percpu_ref_init() will initialize the reference count
to one. References are added and removed with:
void percpu_ref_get(struct percpu_ref *ref);
void percpu_ref_put(struct percpu_ref *ref);
These functions operate on a per-CPU array of reference counters, so they
will not cause cache-line bouncing across the system. There is one
potential problem, though: percpu_ref_put() must determine whether
the reference count has dropped to zero and call the release()
function if so. Summing an array of per-CPU counters would be expensive,
to the point that it would defeat the whole purpose. This problem is
avoided with a simple observation: as long as the initial reference is
held, the count cannot be zero, so percpu_ref_put() does not
bother to check.
The implication is that the thread which calls percpu_ref_init()
must indicate when it is dropping its reference; that is done with a call to:
void percpu_ref_kill(struct percpu_ref *ref);
After this call, the reference count degrades to the usual model with a
single shared atomic_t counter; that counter will be decremented
and checked whenever a reference is released.
The performance benefits of a per-CPU reference count will clearly only be
realized if most of the references to an object are added or removed while
the initial reference is held. In practice that is often the case. This
mechanism has found an initial use in the control group code; the comments
in the header file claim that it is used by the asynchronous I/O code as
well, but that is not the case in the current mainline.
Comments (1 posted)
Kernel development news
Once upon a time, Linus tried to limit merge window activity to roughly
1,000 commits in any given day. On July 2, the day he began pulling
changes for 3.11, over 3,000 commits made their way into the mainline.
Clearly, a lethargic 1,000 commits/day pace won't cut it in the 3.x era.
Expect this to be another busy development cycle.
That said, the number of new features merged for 3.11 so far is relatively
small. Much of the work pulled to date consists of code cleanups (in the
staging tree, for example), reworking of ARM architecture code to use
common abstractions, and the removal of board-file support for some ARM platforms.
The user-visible changes that have been pulled so far include:
- The f2fs filesystem now supports security labels, enabling it to be
used with security modules.
- The Lustre
distributed filesystem has been merged into the staging tree. It is
disabled in the build system, though, since it has build problems on a
number of architectures.
- The ARM architecture (both 32- and 64-bit) has gained better huge page
support, in the form
of both the hugetlbfs filesystem and transparent huge pages.
- The ARM64 architecture now supports virtualization with both KVM and Xen.
- The new O_TMPFILE option to the open() and
openat() system calls allows filesystems to optimize the
creation of temporary files — files which need not be visible in the
filesystem. When O_TMPFILE is present, the provided pathname
is only used to locate the containing directory (and thus the
filesystem where the temporary file should be). So, among other
things, programs using O_TMPFILE should have fewer concerns
about vulnerabilities resulting from symbolic link attacks.
- New hardware support includes:
- Systems and processors:
Freescale i.MX6 SoloLite processors,
Freescale Vybrid VF610 processors,
Samsung EXYNOS5420 processors,
Rockchip RK2928 and RK3xxx processors,
TI Nspire processors, and
STMicroelectronics STiH41x and STiH416 processors.
- Miscellaneous:
Marvell EBU device bus controllers,
Marvell EBU PCIe controllers,
ARM cache-coherent interconnect controllers,
Microchip Technology MCP3204/08 analog-to-digital converters,
Analog Devices AD7303 digital-to-analog converters,
STMicroelectronics LPS331AP pressure sensors, and
Samsung S3C24XX SoC pin controllers.
- Networking:
MTK USB Bluetooth interfaces.
- USB:
Faraday FUSBH200 host controllers and
Cavium Networks Octeon host controllers.
Changes visible to kernel developers include:
- There is a new struct file_operations method:
int (*iterate) (struct file *, struct dir_context *);
Its job is to iterate through the contents of a directory. This
method is meant to serve as a replacement for the readdir() method that
eliminates persistent race conditions associated with updating the
current read position. All internal users have been converted, and
the readdir() method has been removed.
- There are a couple of new functions for working with atomic types:
int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode);
void wake_up_atomic_t(atomic_t *p);
A call to wait_on_atomic_t() will block the calling thread
until the given val goes to zero. Simply decrementing an
atomic_t variable will not be sufficient to wake anybody
waiting, though; an explicit call to wake_up_atomic_t() is
required to do that.
- The CONFIG_HOTPLUG configuration option has been removed; all
kernels are hotplug enabled these days.
- The wait/wound mutex locking primitive
has been merged.
- As part of the read-copy-update
simplification effort, the "tiny-preempt" version of RCU has been
removed from the kernel. From the
commit message: "People currently using TINY_PREEMPT_RCU can
get much better memory footprint with TINY_RCU, or, if they really
need preemptible RCU, they can use TREE_PREEMPT_RCU with a relatively
minor degradation in memory footprint."
- The kernel now has the concept of power-efficient workqueues; these
are simply marked as "unbound," so that jobs queued to them can run on
any CPU in the system. Per-CPU workqueues may perform better in some
situations, but they can also cause sleeping CPUs to wake up; that
wakeup can be avoided if work items can be run on CPUs that are not
sleeping. If the CONFIG_WQ_POWER_EFFICIENT_DEFAULT
configuration option is set, a number of workqueues observed to impact
power performance will be switched to the unbound mode.
Kernel code can explicitly request power-efficient behavior by
creating workqueues with the WQ_POWER_EFFICIENT flag or by
using a couple of new systemwide workqueues.
- The d_hash() and d_compare() methods in struct
dentry_operations have lost their inode argument.
- A new per-CPU reference count mechanism has been added; see this article for details.
A normal two-week merge window could be expected to close on July 16,
but Linus has occasionally shortened the merge window in recent development
cycles. If the development cycle as a whole lasts for the usual
70 days, then the 3.11 kernel can be expected around the beginning of September.
Comments (3 posted)
One of the new features in the 3.9 kernel is KVM/ARM: KVM support for the ARM
architecture. While KVM is already supported on i386, x86-64, PowerPC, and
s390, ARM support required more than just reimplementing the features and
styles of the other architectures. The reason is that the ARM virtualization
extensions are quite different from those of other architectures.
Historically, the ARM architecture has not been virtualizable, because there are a
number of sensitive instructions which do not trap when they are executed in
an unprivileged mode. However, the most recent 32-bit ARM processors, like
the Cortex-A15, include hardware support for virtualization as
an ARMv7 architectural extension. A number of research
projects have attempted to support virtualization on ARM processors without
hardware virtualization support, but they require various levels of paravirtualization
and have not been stabilized. KVM/ARM is designed specifically to
work on ARM processors with the virtualization extensions enabled to run
unmodified guest operating systems.
The ARM hardware extensions differ quite a bit from their x86 counterparts. A
simplified view of the ARM CPU modes is that the kernel runs in SVC mode and
user space runs in USR mode. ARM introduced a new CPU mode for running
hypervisors called HYP mode, which is a more privileged mode than SVC mode.
An important characteristic of HYP mode, which is central to the design of
KVM/ARM, is that HYP mode is not an extension of SVC mode, but a distinct mode
with a separate feature set and a separate virtual memory translation mechanism.
For example, if a page fault is taken in HYP mode, the faulting virtual address
is stored in a different register in HYP mode than in SVC mode. As another
example, for the SVC and USR modes, the hardware has two separate page table base
registers, which are used to provide the familiar address space split between
user space and kernel. HYP mode only uses a single page table base register and
therefore does not allow the address space split between user mode and kernel.
The design of HYP mode is a good fit with a classic bare-metal hypervisor
design because such a hypervisor does not reuse
any existing kernel code written to work in SVC mode.
KVM, however, was designed specifically to reuse existing kernel components and
integrate these with the hypervisor. In comparison, the x86 hardware support
for virtualization does not provide a new CPU mode, but provides an orthogonal
concept known as "root" and "non-root". When running as non-root on x86,
the feature set is completely equivalent to a CPU without virtualization
support. When running as root on x86, the feature set is extended to add
additional features for controlling virtual machines (VMs), but all
existing kernel code can run
unmodified as both root and non-root. On x86, when a VM traps to the
hypervisor, the CPU changes from non-root to root. On ARM, when a VM traps
to the hypervisor, the CPU traps to HYP mode.
HYP mode controls virtualization features by configuring sensitive
operations to trap to HYP mode when executed in SVC and USR mode; it also allows
hypervisors to configure a number of shadow register values used to hide
information about the physical hardware from VMs. HYP mode also controls
Stage-2 translation, a feature similar to Intel's "extended page table"
used to control VM memory
access. Normally when an ARM processor issues a
load/store instruction, the memory address used in the instruction is translated
by the memory management unit (MMU) from a virtual address to a physical address using regular page
tables, like this:
- Virtual Address (VA) -> Physical Address (PA)
The virtualization extensions add an extra stage of translation known
as Stage-2 translation which can be enabled and disabled only from HYP mode.
When Stage-2 translation is enabled, the MMU translates addresses in the following two stages:
- Stage-1: Virtual Address (VA) -> Intermediate Physical Address (IPA)
- Stage-2: Intermediate Physical Address (IPA) -> Physical Address (PA)
The guest operating system controls the Stage-1 translation independently of the
hypervisor and can change mappings and page tables without trapping to the
hypervisor. The Stage-2 translation is controlled by the hypervisor, and a
separate Stage-2 page table base register is accessible only from HYP mode. The use
of Stage-2 translations allows software running in HYP mode to control access
to physical memory in a manner completely transparent to a VM running in SVC or
USR mode, because the VM can only access pages that the hypervisor has mapped
from an IPA to the page's PA in the Stage-2 page tables.
KVM/ARM is tightly integrated with the kernel and effectively turns the
kernel into a first class ARM hypervisor. For KVM/ARM to use the hardware
features, the kernel must somehow be able to run code in HYP mode because
HYP mode is used to configure the hardware for running a VM, and traps from the
VM to the host (KVM/ARM) are taken to HYP mode.
Rewriting the entire kernel to run only in HYP mode is not an option, because
it would break compatibility with hardware that doesn't have the virtualization
extensions. A HYP-mode-only kernel also would not work when run inside a
VM, because the HYP
mode would not be available. Support for running both in HYP mode and
SVC mode would be much too invasive to the source code, and would potentially
slow down critical paths. Additionally, the hardware requirements for the page
table layout in HYP mode are different from those in SVC mode in that they
mandate the use of LPAE (ARM's Large Physical Address Extension) and require
specific bits to be set on the page table entries, which are otherwise clear on
the kernel page tables used in SVC mode. So KVM/ARM must manage a separate
set of HYP mode page tables and explicitly map in code and data accessed from
HYP mode.
We therefore came up with the idea to split execution across multiple CPU modes
and run as little code as possible in HYP mode. The code run in HYP mode is
limited to a few hundred instructions and isolated to two assembly files.
For readers not familiar with the general KVM architecture, KVM on all
architectures works by exposing a simple interface to
user space to provide virtualization of core components such as the CPU and
memory. Device emulation, along with setup and configuration of VMs, is handled by a
user space process, typically QEMU. When such a process decides it
is time to run the VM, it will call an ioctl() which executes VM code
natively on the CPU. On ARM, the ioctl() handler in
arch/arm/kvm/arm.c switches to HYP mode by issuing an HVC (hypercall)
instruction, which changes the CPU mode to HYP mode, context switches all hardware
state between the host and the guest, and finally jumps to the VM SVC or
USR mode to natively execute guest code.
When KVM/ARM runs guest code, it enables Stage-2 memory translation, which completely
isolates the address space of VMs from the host and other VMs. The CPU will be
executing guest code until the hardware traps to HYP mode, because of a hardware
interrupt, a stage-2 page fault, or a sensitive operation. When such a trap
occurs, KVM/ARM switches back to the host hardware state and returns to normal
KVM/ARM host SVC code with the full kernel mappings available.
When returning from a VM, KVM/ARM examines the reason for the trap, and performs
the necessary emulation or
resource allocation to allow the VM to resume. For example, if the guest
performs a memory-mapped I/O (MMIO) operation to an emulated device, that will generate a Stage-2
page fault, because only physical RAM dedicated to the guest will be mapped in
the Stage-2 page tables. KVM/ARM will
read special system registers, available only in HYP mode, which contain the
address causing the fault and report the address to QEMU through a shared
memory-mapped structure between QEMU and the kernel. QEMU knows the memory map
of the emulated system and can forward the operation to the appropriate device
emulation code. As another example, if a hardware interrupt occurs while the VM
is executing, this will trap to HYP mode, and KVM/ARM will switch back to the host
state and re-enable interrupts, which will cause the hardware interrupt handlers to
execute once again, but this time without trapping to HYP mode. While every hardware
interrupt ends up interrupting the CPU twice, the actual trap cost on ARM
hardware is negligible compared to the world-switch from the VM to the host.
Providing access to HYP mode from KVM/ARM was a non-trivial challenge, since HYP
mode is a more privileged mode than the standard ARM kernel modes and there is
no architecturally defined ABI for entering HYP mode from less privileged modes.
One option would be to expect bootloaders to either install secure monitor
handlers or hypercall handlers that would allow the kernel to trap back
into HYP mode, but this method is brittle and error-prone, and prior experience
with establishing TrustZone APIs has shown that it is hard to create a standard
across different implementations of the ARM architecture.
Will Deacon, Catalin Marinas, and Ian Jackson proposed
that we rely on
the kernel being booted in HYP mode if the kernel is going to support KVM/ARM. In
version 3.6, a patch series developed by Dave Martin and Marc Zyngier was
merged that detects if the kernel is booted in HYP mode and, if so, installs a
small stub handler that allows other subsystems like KVM/ARM to take
control of HYP mode later on. As it turns out, it is reasonable to recommend that
bootloaders always boot the kernel in HYP mode if it is available because even
legacy kernels always make an explicit switch to SVC mode at boot time, even
though they expect to boot into SVC mode already. Changing bootloaders to
simply boot all kernels in HYP mode is therefore backward-compatible with
existing kernels.
Installing the hypervisor stub when the kernel is booted in HYP mode was
an interesting implementation challenge. First, ARM kernels are often loaded as
a compressed image, with a small uncompressed pre-boot environment known as the
"decompressor" which decompresses the kernel image into memory. If the
decompressor detects that it is
booted in HYP mode, then a temporary stub must be installed at this stage
allowing the CPU to fall back to SVC mode to run the decompressor code. The
reason is that the decompressor must turn on the MMU to enable caches, but
doing so in HYP mode requires support for the LPAE page table format used by HYP
mode, which is an unwanted piece of complexity in the decompressor code.
Therefore, the decompressor installs the temporary HYP stub, falls back to SVC
mode, decompresses the kernel image, and finally, immediately before calling the
uncompressed initialization code, switches back to HYP mode again. Then, the uncompressed
initialization code will again detect that the CPU is in HYP mode and will
install the main HYP stub to be used by kernel modules later in the boot process
or after the kernel has finally booted.
Note that the uncompressed initialization code
doesn't care whether the uncompressed code is started directly in HYP mode from a
bootloader or from the decompressor.
Because HYP mode is a more privileged mode than SVC mode, the transition from
SVC mode to HYP mode occurs only through a hardware trap. Such a trap can be
generated by executing the hypercall (HVC) instruction, which will trap into HYP
mode and cause the CPU to execute code from a jump entry in the HYP exception
vectors. This allows a subsystem to use the hypervisor stub to fully take
over control of HYP mode, because the hypervisor stub allows subsystems to
change the location of the exception vectors. The HYP stub is called through
the __hyp_set_vectors() function, which takes the physical address of
the HYP exception vector as its only parameter, and replaces the HYP Vector Base Address Register
(HVBAR) with that address. When KVM/ARM is initialized during normal kernel boot
(after all main kernel initialization functions have run), it creates an identity
mapping (a one-to-one mapping of virtual addresses to physical addresses) of the
HYP mode initialization code, which includes an exception vector, and passes the
physical address of that vector to the __hyp_set_vectors() function. The KVM/ARM
initialization code then issues the HVC instruction to run the identity-mapped
initialization code, which can safely enable the MMU, because the code is
identity mapped.
Finally, KVM/ARM initialization sets up the HVBAR to point to the main KVM/ARM HYP exception
handling code, now using the virtual addresses for HYP mode. Since HYP mode
has its own address space, KVM/ARM must choose an appropriate virtual address
for any code or data that is mapped into HYP mode. For convenience
and clarity, the kernel virtual addresses are reused for pages mapped into HYP
mode, making it possible to dereference structure members directly as long as
all relevant data structures are mapped into HYP mode.
Both traps from sensitive operations in VMs and hypercalls from the host kernel
enter HYP mode through an exception on the CPU.
Instead of changing the HYP exception vector on every switch between
the host and the guest, a single HYP exception
vector is used to handle both HVC calls from the host kernel and to handle traps
from the VM. The HYP vector handling code checks the VMID field on the Stage-2
page table base register, and VMID 0 is reserved for the host. This field is
only accessible from HYP mode and guests are therefore prevented from
escalating privilege. We introduced the kvm_call_hyp() function, which can be
used to execute code in HYP mode from KVM/ARM.
For example, KVM/ARM code running in SVC mode can make the
following call to invalidate TLB entries, which must be done from HYP mode:
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
Virtual GIC and timers
ARMv7 architectures with hardware virtualization support also include
virtualization support for timers and the interrupt controller. Marc Zyngier
implemented support for these features, which are called "generic timers"
(a.k.a. architected timers) and the Virtual Generic Interrupt Controller (VGIC).
Traditionally, timer operations on ARM systems have been MMIO
operations to dedicated timer devices. Such MMIO
operations performed by VMs would trap to QEMU, which would involve a
world-switch from the VM to host kernel, and a switch from the host kernel to user space for
every read of the time counter or every time a timer needed to be programmed.
Of course, the timer functionality could be emulated inside the kernel, but this
would require a trap from the VM to the host kernel, and would therefore add
substantial overhead to VMs compared to running on native hardware. Reading the
counter is a very frequent operation in Linux. For example, every time a task
is enqueued or dequeued in the scheduler, the runqueue clock is updated,
and in particular multi-process workloads like Apache benchmarks clearly show
the overhead of trapping on each counter read.
ARMv7 allows for an optional extension to the architecture, the generic timers,
which makes counter and timer operations part of the core architecture. Now,
reading a counter or programming a timer is done using coprocessor register
accesses on the core itself, and the generic timers provide two sets of timers
and counters: the physical and the virtual. The virtual counter and timer are
always available, but access to the physical counter and timer can be limited
through control registers accessible only in HYP mode. If the kernel is booted
in HYP mode, it is configured to use the physical timers; otherwise
the kernel uses the virtual timers. This allows an unmodified kernel to
program timers when running inside a VM without trapping to the host, while providing
the necessary isolation of the host from VMs.
If a VM programs a virtual
timer, but is preempted before the virtual timer fires, KVM/ARM reads the timer
settings to figure out the remaining time on the timer, and programs a
corresponding soft timer in the kernel. When the soft timer expires, the timer
handler routine injects the timer interrupt back into the VM. If the VM is
scheduled before the soft timer expires, the virtual timer hardware is
re-programmed to fire when the VM is running.
The role of an interrupt controller is to receive interrupts from devices and
forward them to one or more CPUs. ARM's Generic Interrupt Controller
(GIC) provides a "distributor", which is the core logic of the GIC, and several
CPU interfaces. The GIC allows CPUs to mask certain interrupts, assign
priorities, or set the affinity of certain interrupts to certain CPUs. Finally,
a CPU also uses the GIC to send inter-processor interrupts (IPIs) from one CPU
core to another; this is the underlying mechanism for SMP cross calls on ARM.
Typically, when the GIC raises an interrupt to a
CPU, the CPU will acknowledge the interrupt to the GIC, interact with the interrupting
device, signal end-of-interrupt (EOI) to the GIC, and resume normal operation.
Both acknowledging and EOI-signaling interrupts are privileged operations that will trap
when executed from within a VM, adding performance overhead to common
operations. The hardware support for virtualization in the VGIC comes in the
form of a virtual CPU interface that CPUs can query to acknowledge and EOI virtual
interrupts without trapping to the host. The hardware support further provides a
virtual control interface to the VGIC, which is accessed only by KVM/ARM, and is
used to program virtual interrupts generated from virtual devices (typically
emulated by QEMU) to the VGIC.
Since access to the distributor is typically
not a common operation, the hardware does not provide a
virtual distributor, so KVM/ARM provides in-kernel GIC distributor emulation
code as part of the support for VGIC. The result is that VMs can acknowledge and EOI
virtual interrupts directly without trapping to the host. Actual hardware interrupts
received during VM execution
always trap to HYP mode, and KVM/ARM lets the kernel's standard ISRs
handle the interrupt as usual, so the host remains in complete control
of the physical hardware.
There is no mechanism in the VGIC or generic timers to
let the hardware directly inject physical interrupts from the virtual timers as
virtual interrupts to the VMs. Therefore, VM timer interrupts will trap as any
other hardware interrupt, and KVM/ARM registers a handler for the virtual timer
interrupt and injects a corresponding virtual timer interrupt using software
when the handler function is called from the ISR.
During the development of KVM/ARM, we continuously measured the virtualization overhead
and ran long-running workloads to test stability and measure performance. We
have used various kernel configurations and user space environments (both ARM
and Thumb-2) for both the host and the guest, and validated our workloads with
SMP and UP guests. Some workloads have run for several weeks at a time without
crashing, and the system behaves as expected when exposed to extreme memory
pressure or CPU over-subscription. We therefore feel that the code is
stable and encourage users to try and use the system.
Our measurements using both micro and macro benchmarks show that the overhead
of KVM/ARM is within 10% of native performance on multicore platforms for
balanced workloads. Purely CPU-bound workloads perform almost at native speed.
The relative overhead of KVM/ARM is comparable to KVM on x86. For some macro
workloads, like Apache and MySQL, KVM/ARM even has less overhead than on x86
using the same configuration. A significant source of this improved
performance can be attributed to the optimized path for IPIs and thereby
process rescheduling caused by the VGIC and generic timers hardware support.
Status and future work
KVM/ARM started as a research project at Columbia University and was later
supported by Virtual Open
Systems. After the 3.9 merge, KVM/ARM continues to be
maintained by the original author of the code, Christoffer Dall, and the ARMv8
(64-bit) port is maintained by Marc Zyngier. QEMU system support for ARMv7 has
been merged upstream in QEMU, and kvmtool also has support for KVM/ARM on both
ARMv7 and ARMv8. ARMv8 support is scheduled to be merged for the 3.11 kernel release.
Linaro is supporting a number of efforts to make KVM/ARM itself feature
complete, which involves debugging and full migration features including
migration of the in-kernel support for the VGIC and the generic timers.
Additionally, virtio has so far relied on a PCI backend in QEMU and the kernel,
but a significant amount of work has already been merged upstream to refactor
the QEMU source code concerning virtio to allow better support for MMIO-based
virtio devices to accelerate virtual network and block devices. The remaining
work is currently a priority for Linaro, as is support for the mach-virt
ARM machine definition, which is a simple machine model designed to be used for
virtual machines and is based only on virtio devices. Finally, Linaro is also
working on ARMv8 support in QEMU, which will also take advantage of mach-virt and virtio support.
KVM/ARM is already used heavily in production by the SUSE Open Build Service on
Arndale boards, and we can only speculate about its future uses in the green
data center of the future, as the hypervisor of choice for ARM-based networking
equipment, or even ARM-based laptops and desktops.
For more information, help on how to run KVM/ARM on your specific board or SoC,
or to participate in KVM/ARM development, the kvmarm
mailing list is a good place to start.
Comments (10 posted)
Maintaining user-space ABI compatibility is one of the key guiding
principles of Linux kernel development; changes that break user space are
likely to be reverted quickly, often after an incendiary message from
Linus. But what is to be done in cases where an ABI is deemed to be
unworkable and unmaintainable? Control group maintainer Tejun Heo is
trying to solve that problem, but, in the process, he is running into
opposition from one of Linux's highest-profile users.
Control groups ("cgroups") allow an administrator to divide the processes
in a system into a hierarchy of groups; this hierarchy need not match the
process tree. The grouping function alone is
useful; systemd uses it to keep track of all of the processes involved with
a given service, for example. But the real purpose of control groups is to allow
resource control policies to be applied to the processes within each group;
to that end, the kernel contains a range of "controllers" that enforce
policies on CPU time, block I/O bandwidth, memory usage, and more. Control
groups are managed with a virtual filesystem exported by the kernel; see Documentation/cgroups/cgroups.txt for a
thorough (if slightly dated) description of how this subsystem works.
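As a concrete illustration of that filesystem interface (the mount point and group name below are chosen for the example, and root privileges are required; on many systems a hierarchy will already be mounted):

```shell
# Mount a hierarchy with the memory controller attached.
mount -t cgroup -o memory none /sys/fs/cgroup/memory

# Creating a directory creates a new control group.
mkdir /sys/fs/cgroup/memory/mygroup

# Policy knobs are ordinary files; limit the group to 256 MB.
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

# Move the current shell into the group by writing its PID to "tasks".
echo $$ > /sys/fs/cgroup/memory/mygroup/tasks
```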
The trouble with control groups
There is no doubt that the functionality provided by control groups is both
extensive and flexible. Indeed, part of the problem is that it is
too flexible. Consider, for example, the support for multiple
hierarchies in the control group subsystem. Cgroups allow the creation of
a hierarchy of processes to be used in dividing up a limited resource, such
as available CPU time. But they allow the creation of an entirely
different hierarchy for the control of a different resource. Thus, for
example, CPU time could be placed under a policy that favors certain users
over others, while memory use could, instead, be regulated depending on
what program a process is running. Processes can be grouped in entirely
different ways in each hierarchy.
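In concrete terms, each hierarchy is a separate mount of the cgroup filesystem with a different set of controllers attached; the mount points and group names below are invented for illustration:

```shell
# Two independent hierarchies: one for CPU time, one for memory (root required).
mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
mount -t cgroup -o memory none /sys/fs/cgroup/memory

# Group processes by user for CPU scheduling...
mkdir -p /sys/fs/cgroup/cpu/users/alice

# ...but by application for memory limits.
mkdir -p /sys/fs/cgroup/memory/apps/browser

# The same process can sit in unrelated groups in each hierarchy.
echo $$ > /sys/fs/cgroup/cpu/users/alice/tasks
echo $$ > /sys/fs/cgroup/memory/apps/browser/tasks
```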
The problem here is that, while the design allowing each controller to have
its own hierarchy seems nice and orthogonal, the implementation cannot be
that way. The controllers for memory usage, I/O bandwidth, and writeback
throttling all look independent on the surface, but those problems are all
intertwined in the memory management system in the kernel. All three of
those controllers will need to associate pages of memory with specific
control groups; if a given process is in one cgroup from the memory
controller's point of view, but a different cgroup for the I/O bandwidth
controller, that tracking quickly becomes difficult or impossible. It is
easy to set up policies that conflict or that simply cannot be properly
implemented within the kernel.
Another perceived problem is that the virtual filesystem interface is too
low-level, exposing too many details of how control groups are implemented
in the kernel. As the number of users of control groups grows, it will
become increasingly hard to make changes without breaking existing
applications. It's not clear what the correct cgroup interface should be,
but those who spend enough time looking at the current implementation tend
to come away convinced that changes are needed.
This problem is aggravated by an increasing tendency to use file
permissions to hand subtrees of a cgroup hierarchy over to unprivileged
processes. There are legitimate reasons to want to delegate authority in
that way; complex applications may want to use cgroups to implement their own
internal policies, for example. There are also use cases associated with
and containers. But that delegation greatly increases the number of
programs with an intimate understanding of how cgroups work, complicating
any future changes. There are
also any number of security issues that come with unprivileged access to a
cgroup hierarchy; it is trivially easy to run denial-of-service attacks against a
system if one has write access to a cgroup hierarchy. In short, the interface
was just never meant to be used in this way.
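Delegation of this kind is done with nothing more than ordinary file ownership; the subtree and user below are hypothetical:

```shell
# Hand a subtree of the memory hierarchy to an unprivileged user (root required).
mkdir /sys/fs/cgroup/memory/alice
chown -R alice /sys/fs/cgroup/memory/alice

# alice can now create sub-groups and move her own processes around there,
# with no kernel-enforced policy on how she uses the interface.
```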
For these reasons and more, there is a strong desire to rework the cgroup
interface into something that is more maintainable, more secure, and easier to
use. Getting there, though, is likely to be a long and painful process, as
can be seen by the early discussions around the subject.
The solution and its discontents
The plan for control groups can be described in relatively few words; the
resulting discussion, instead, is rather more verbose. Multiple
hierarchies are seen to be misconceived and unmaintainable on their face;
the plan is to phase out that functionality so that, in the end, all
controllers are attached to a single, unified hierarchy of processes.
Unprivileged access to the cgroup hierarchy will be strongly discouraged;
the hope is to have a single, privileged process handling all of the cgroup
management tasks. That process will, in turn, provide some sort of
higher-level interface to the rest of the system.
Tim Hockin is charged with making Google's massive cluster of machines work
properly for a wide variety of internal users. Google uses cgroups
extensively for internal resource management; more to the point, the
company also makes extensive use of multiple hierarchies. So, needless to
say, Tim is not at all pleased with the prospect of that functionality
going away. As he put it:
So yeah, I'm in a bit of a panic. You're making a huge amount of
work for us. You're breaking binary compatibility of the
(probably) largest single installation of Linux in the world. And
you're being kind of flip about the reality of it...
Part of the reason for Tim's panic is that he was under the impression that
the existing functionality would be removed within a year or two. That is
decidedly not the case; the kernel's ABI rules have not been suspended for
control groups. The plan is to add a new control interface, and any new
features will probably only work with that new interface, but the existing
interface, including multiple hierarchies, will continue to be supported
until it's clear that it is no longer being used.
Tim described, in general terms, how Google
uses multiple hierarchies. Essentially, every job in the system has two
attributes: whether it's a production or "batch" job, and whether it gets
I/O bandwidth guarantees. The result is a 2x2 matrix describing resource
allocation policies (though one of the entries, batch processes with I/O
guarantees, makes little sense and is not used). Using two independent
hierarchies makes this set of policies relatively easy to express; Tim asserts that a
unified hierarchy would not be usable in the same way.
Tejun was unimpressed, responding that this
case could be managed by setting up three cgroups at the same level of the
hierarchy, each of which would implement one of the three useful policy
combinations. The problem with this solution, according to Tim, is that
the processes without I/O bandwidth guarantees would be split into two
groups, whereas in the current solution they are in one group. If one of
those two groups has far more members than the other, the members of that
larger group will get far less of the available bandwidth than the members
of the small group. Tejun still thinks that the problem should be
solvable, perhaps with the use of a user-space management daemon that would
adjust the relative bandwidth allocations depending on the workload. Tim
has answered that the situation is actually
a lot more complicated, but he has not yet shared the details of how, so it
is hard to understand what the real difficulties with a single hierarchy would be.
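Tejun's suggested arrangement can be sketched as three sibling groups in one hierarchy; the path and group names are invented for illustration, the commands assume a hierarchy with the blkio controller already mounted there, and the split of the I/O weight across two non-guaranteed groups is exactly the sticking point Tim raised:

```shell
# One hierarchy, three sibling groups covering the useful policy combinations
# (root required; the blkio weights shown are arbitrary examples).
for g in prod-io prod-noio batch; do
    mkdir -p /sys/fs/cgroup/blkio/jobs/$g
done

# Give the I/O-guaranteed production group the largest share.
echo 800 > /sys/fs/cgroup/blkio/jobs/prod-io/blkio.weight

# Processes without guarantees end up split between two groups, so each
# process's effective share depends on how the population is distributed.
echo 100 > /sys/fs/cgroup/blkio/jobs/prod-noio/blkio.weight
echo 100 > /sys/fs/cgroup/blkio/jobs/batch/blkio.weight
```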
A single management process?
Tim also dislikes the plan to have a single process managing the control
group hierarchy. That process could be made to provide the functionality
that Google (along with others) needs, though there are performance
concerns associated with adding a process in the middle. But Tim was not
alone in being concerned by this message from
Lennart Poettering on the nature of that single process:
This hierarchy becomes private property of systemd. systemd will
set it up. Systemd will maintain it. Systemd will rearrange
it. Other software that wants to make use of cgroups can do so only
through systemd's APIs.
Google does not currently run systemd and is not thrilled by the prospect
of having to switch to be able to make use of cgroup functionality. So Tim
responded that "If systemd is the
only upstream implementation of this single-agent idea, we will have to
invent our own, and continue to diverge rather than converge." There
is no particular judgment against systemd implied by that position; it is
simply that making that switch would affect a whole lot of things beyond
cgroups, and that is more than Google feels like it would want to take on
at the moment. But, in general, it would not be surprising if, in the long
term, some users remain opposed to the idea of systemd as the only
interface to cgroups. That suggests that we will be seeing competing
implementations of the cgroup management daemon concept.
One of those alternatives may be about to come into view; Serge Hallyn confessed that he is working on a cgroup
management daemon of his own. In some situations, a separate daemon might
meet a lot of needs, but Lennart was clear
that he would never have systemd defer to such a daemon. His position —
not an entirely unreasonable one — is that the init process, as the creator
of all other processes in the system, should not be dependent on any other
process for its normal operation. He also seems to feel that it would
not be possible to put the cgroup management code into a library that could
be used in multiple places. So we are likely to see multiple
implementations of this functionality in use before this story is done.
That, in turn, could create headaches for developers of applications that
need to interface with the cgroup subsystem.
The discussion, thus far, seems to have changed few minds. But Tejun has
made it clear that he doesn't intend to
just ignore complaints from users:
While the bar to overcome is pretty high, I do want to learn about
the problems you guys are foreseeing, so that I can at least
evaluate the graveness properly and hopefully compromises which can
mitigate the most sore ones can be made wherever necessary.
He also acknowledged the biggest problem
faced by the development community: despite having accumulated some
experience on wrong ways to solve the
problem, nobody really knows what the right solution is. More mistakes are
almost certain, so it's too soon to try to settle on final solutions.
In the early years of Linux, most of the ABIs implemented by the kernel
were specified by groups like POSIX or by prior implementation in other
kernels. That made the ABI design problem mostly go away; it was just a
matter of doing what had already been done before. For current problems,
though, there are rather fewer places to look for guidance, so we are
having to figure out the best designs as we go. Mistakes are certain to
happen in such a setting. So we are going to have to get better at
learning from those mistakes, coming up with better designs, and moving to
them without causing misery for our users. The control group transition is
likely to set a lot of precedents regarding how these changes should (or
should not) be handled in the future.
Comments (35 posted)
Patches and updates
- Sebastian Andrzej Siewior: 3.8.13-rt13. (June 30, 2013)
Core kernel code
Filesystems and block I/O
Virtualization and containers
- Lucas De Marchi: kmod 14. (July 3, 2013)
Page editor: Jonathan Corbet