Leading items
Welcome to the LWN.net Weekly Edition for November 21, 2019
This edition contains the following feature content:
- LSM stacking and the future: the long-running security-module stacking project is finally reaching fruition.
- A recap of KVM Forum 2019: many topics from the 2019 KVM Forum meeting.
- Enhancing KVM for guest protection and security: various approaches to securing virtualized guests from each other and from the host system.
- Some near-term arm64 hardening patches: several security improvements for the arm64 architecture that should land soon.
- Keeping memory contents secret: the problem of protecting memory contents from snooping.
- The Yocto Project 3.0 release: new features in this distribution release, including a much more efficient build system.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Note that November 28 is the Thanksgiving holiday in the United States. As is our tradition, there will be no LWN Weekly Edition that day; we will all be too busy eating to put one together. We will keep the lights on over the break and post the occasional article, and we'll be back to normal service with the December 5 Edition.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
LSM stacking and the future
The idea of stacking (or chaining) Linux security modules (LSMs) goes back 15 years (at least) at this point; progress has definitely been made along the way, especially in the last decade or so. It has been possible to stack "minor" LSMs with one major LSM (e.g. SELinux, Smack, or AppArmor) for some time, but mixing, say, SELinux and AppArmor in the same system has not been possible. Combining major security solutions may not seem like a truly important feature, but there is a use case where it is pretty clearly needed: containers. Longtime LSM stacker (and Smack maintainer) Casey Schaufler gave a presentation at the 2019 Linux Security Summit Europe to report on the status and plans for allowing arbitrary LSM stacking.
LSMs allow adding more restrictions to Linux than those afforded by the traditional security policies. For the most part, those policies reflect the existing mechanisms, such as permissions bits on files. But there are also other security concerns, such as binding to a network socket, that are outside of the usual permissions, so mechanisms to restrict access to them have been added to the LSM interface.
![Casey Schaufler](https://static.lwn.net/images/2019/lsseu-schaufler-sm.jpg)
Prior to the advent of the Yama LSM, only one security module could be active in a running kernel; Yama was originally manually stacked, which "didn't really sit very well". To support adding the Yama restrictions on top of other LSMs in a dynamic fashion, lists of modules were added to the kernel, which would allow multiple LSMs to be active. But there was a problem for the "bigger" LSMs that need security "blobs"—context data associated with various kernel objects—in which to store their state. There is only one pointer available to use, so only one blob-using LSM could be active, though multiple minor LSMs that did not need the blobs could be stacked with it.
At this point, a flag has been added to tag LSMs: LSM_FLAG_EXCLUSIVE. The "exclusive" flag is applied to the blob-using LSMs: SELinux, Smack, and AppArmor. The idea is to remove that flag from those LSMs over time.
There can only be one exclusive LSM active in a running kernel. "That's bad", Schaufler said, but for a long time that was not seen as a serious problem. That was before containers became so widespread, however. Now there are some people who run, for example, Ubuntu in their data centers (with AppArmor) and who want to run Android (SELinux) containers on top. So the goal of the work he and others have been doing is to get rid of the exclusive bit for "as many modules as we possibly can".
The 5.1 kernel added "infrastructure-managed blobs" for a number of different kernel objects: tasks, credentials, files, inodes, and the System V interprocess-communication mechanisms (semaphores, shared memory, and message queues). An LSM will tell the kernel how much space it needs to store its information and the kernel will take care of allocating, managing, and freeing the blob. So, any LSM that only uses blobs on those object types can be marked as non-exclusive at this point.
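To give a sense of what that looks like, here is a minimal sketch of a hypothetical module declaring its blob needs using the lsm_blob_sizes structure and DEFINE_LSM() macro that implement this scheme in the kernel; the hook registration that a real module would perform in its init function is omitted:

    #include <linux/lsm_hooks.h>

    struct example_task {                /* per-task state for this LSM */
        u32 label;
    };

    /* Tell the infrastructure how much blob space this module needs;
     * the kernel allocates and frees the blobs itself. */
    static struct lsm_blob_sizes example_blob_sizes __lsm_ro_after_init = {
        .lbs_task = sizeof(struct example_task),
    };

    static int __init example_init(void)
    {
        return 0;                        /* hook registration omitted */
    }

    DEFINE_LSM(example) = {
        .name  = "example",
        .blobs = &example_blob_sizes,
        .init  = example_init,
    };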
That means a variety of LSMs can be used alongside SELinux, so "the IT people are really happy" since SELinux does not have to be turned off to get the protections afforded by some other module that only uses those blobs. There are also a number of smaller LSMs that are headed toward the mainline that could benefit from this. Those, or some custom module, can be run with one of the exclusive LSMs, mostly without interference; so "everybody's happy", he said.
Next up
"But not everybody's happy", he continued, because there are still limitations, which leads to the plans for an upcoming kernel, possibly 5.5. The code to remove the exclusive flag for AppArmor is basically ready. AppArmor is different than Smack and SELinux, "in that it is path-name-based-ish", though it is less so now than it used to be. It has a different fundamental security model; Smack and SELinux are both based on subjects and objects, while AppArmor mostly focuses on path names. The use cases for AppArmor are also different than those of the others, so it makes sense to start with it.
In order to make non-exclusive stacking work for AppArmor, kernel socket-object security blobs have to join the other infrastructure-managed blobs so that multiple LSMs can have them. That is fairly easily done, since it already has been done for the other objects. There are also more difficult pieces; when you get to those, that's where people start to bikeshed, he said.
The first problem is sharing /proc/PID/attr/current, which is used by AppArmor and other major LSMs to report the security context for the process identified by PID. So SELinux and AppArmor would both want to put their contexts in that file, but that is impossible. Similarly, the SO_PEERSEC socket option to retrieve the security context of the other endpoint of a Unix socket also cannot be shared. The solution for both is to introduce a new interface, so that the existing interfaces stay backward compatible.
A number of different ideas for the format of /proc/PID/attr/context and SO_PEERCONTEXT (the new interfaces) were proposed along the way, but the developers "finally did the intelligent thing" and asked the user community, the D-Bus developers in particular. They suggested a simple string with pairs of null-terminated strings of the form "LSM-name\0value\0"; the full length of the string will be known, so pulling out the individual LSM contexts will be straightforward. There is something of a lesson there, Schaufler said: instead of debating something like this, ask the people who will be using the information.
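For illustration, a user-space consumer of one of the new interfaces might walk those pairs with something like the sketch below; the buffer contents are made up, but the walk follows the "LSM-name\0value\0" layout described in the talk:

    #include <stdio.h>
    #include <string.h>

    static void print_lsm_contexts(const char *buf, size_t len)
    {
        const char *p = buf;

        while (p < buf + len) {
            const char *name = p;
            p += strlen(p) + 1;          /* step over "name\0" */
            if (p >= buf + len)
                break;                   /* malformed: name with no value */
            printf("%s = %s\n", name, p);
            p += strlen(p) + 1;          /* step over "value\0" */
        }
    }

    int main(void)
    {
        static const char ctx[] =
            "selinux\0system_u:system_r:init_t:s0\0apparmor\0unconfined\0";
        print_lsm_contexts(ctx, sizeof(ctx) - 1);
        return 0;
    }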
But adding interfaces doesn't really solve the problem, since there are numerous system utilities that will use the existing interfaces—and for a long time to come. So there is a new /proc/self/attr/display setting that can be used to determine which LSM's context information is reported via the existing interfaces. An SELinux container could set its display to ensure that the container sees the SELinux information even if it is running on a kernel with AppArmor active as well; the rest of the system could set the display to AppArmor so that it would look like that was the only LSM active.
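As a sketch of how that might be used, a container manager that wants the legacy interfaces to report AppArmor information could write to the new file before starting the workload; the path comes from the patches described in the talk, while the helper below is otherwise illustrative:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Select which LSM's context the legacy interfaces will report. */
    static int set_lsm_display(const char *lsm)
    {
        int fd = open("/proc/self/attr/display", O_WRONLY);
        ssize_t n;

        if (fd < 0)
            return -1;
        n = write(fd, lsm, strlen(lsm));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    /* For example: set_lsm_display("apparmor"); */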
The permissions required to change the display attribute also needed to be worked out. He thought there should be no checks on switching the value, but SELinux developer Stephen Smalley came up with some problems with that approach. So Schaufler suggested requiring the CAP_MAC_ADMIN capability, but it turns out that SELinux developers do not want to rely on capabilities, they want SELinux to be able to weigh in on the choice. So there is now a hook for display changes; SELinux and AppArmor have added ways to set a policy for changing display, while Smack just says "sure, go ahead".
It turns out that Android's binder security mechanism also uses the contexts, so the code needed to ensure that the processes at both ends of the binder transaction see the same context; it doesn't matter which LSM's context it is, he said, but it needs to be the same. There was also a need to add new audit fields to support subject contexts on a per-LSM basis, while still maintaining the "subj=" entries for backward compatibility. The same thing can be done with object contexts (i.e. "obj=") if that is needed down the road.
Before too long
The next major step is to remove the exclusive tag entirely, by getting rid of it for Smack and SELinux, so that an arbitrary set of LSMs can be used in the same running kernel. That is targeted for the 5.8 kernel or so. It is more challenging, in part because the two LSMs do a lot of the same things; in particular, both interact extensively with the networking subsystem, he said.
Two more kernel objects, for keys and superblocks, need to be added to the infrastructure-managed list. Part of the reason that superblocks need blobs for Smack and SELinux is that both process mount options, which is a bit messy to do. Instead of simply handing the options to a single LSM, they will need to be sent to a series of LSMs; each LSM needs to only deal with the options it knows about, ignoring those it doesn't, but then any options that are not handled by any LSM need special treatment.
"The networking stuff has a wonderful set of challenges", Schaufler said. The NetLabel interface is useful to allow an LSM to put CIPSO or CALIPSO labels on packets, but two LSMs cannot put different labels on the same packets. After much "gnashing of teeth", it was decided that unless all of the relevant LSMs could agree on the labels, packet sending would fail. It may be a bit harsh, but it makes sense: "If you can't get people to agree, you probably shouldn't send it".
The label is set when the socket is created, so that is the operation that should fail, even though it doesn't really matter until a packet is actually sent. Making that work requires some changes in NetLabel and SELinux, but more in Smack. NetLabel is used differently by Smack, which "makes things more complicated", he said.
The secmark facility allows associating a 32-bit number with a packet; it is added to the socket buffer (sk_buff or SKB) object by nftables. However, 32 bits is not enough to handle two, three, or even more LSMs that want to use secmarks, and it is not yet clear what to do about that. A hash-table mapping might work; only allowing a single LSM to use the facility is another option, though "it's kind of a cop-out". Another possibility is an SKB extension, but he is a bit leery of going that route because he anticipates some opposition from the networking developers.
Labeled NFSv4 presents an "interesting conundrum", he said. It was defined with a format for the label data that is passed back and forth, which "Linux very carefully ignores". The Linux implementation doesn't add the labels or read them; it just assumes that any data that is there is reasonable for whatever is actually going to use it. The NFS developers are looking into that at this point.
Schaufler wrapped up by reiterating that the first set of changes for AppArmor are targeting Linux 5.5. The second set needs more work and there are some solutions to be found, but it will hopefully make its way into the mainline in 5.8 or thereabouts. Interested readers can view his slides [PDF] and the YouTube video of the talk.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend the Linux Security Summit Europe in Lyon, France.]
A recap of KVM Forum 2019
The 13th KVM Forum virtualization conference took place in Lyon, France in October 2019. One might think that development would have wound down on the Kernel Virtual Machine (KVM) module, which was merged into Linux 2.6.20 in 2007, but this year's conference underscored the amount of work still being done, particularly on side-channel attack mitigation, I/O device assignment with VFIO and mdev, footprint reduction with micro virtual machines (VMs), and the ability to run VMs nested within VMs. Many talks also involved the virtual machine monitor (VMM) user-space programs that use the KVM kernel module—of which QEMU is the most widely used.
Side-channel attacks and memory isolation/encryption
Side-channel attacks leak sensitive information through mechanisms other than a program's intended input and output channels. They have become more problematic recently when coupled with speculative execution. The KVM project is looking at ways to mitigate these attacks.
Dario Faggioli presented "Core-Scheduling for Virtualization: Where are We?", a look at a scheduler-centric way of mitigating side-channel attacks.
![Group photo](https://static.lwn.net/images/2019/kvmf-group-sm.jpg)
With simultaneous multi-threading ("hyperthreading" or SMT), multiple hardware threads share a core's resources, including its caches (even L1); that sharing boosts performance with SMT-aware scheduling, but it also opens side channels between the threads. Core scheduling tags tasks that can safely be scheduled together on a core; the scheduler will only place tasks with the same tag onto a core at any given time. Tasks in different security domains can be given different tag values to ensure that they never share a core, thus thwarting some side-channel attacks. Core scheduling also helps achieve fairness of accounting in cloud environments, where a busy thread may consume more CPU resources than a less busy thread sharing the same core.
Core scheduling can also be used to align guest virtual CPUs (vCPUs) with the host SMT topology so that intelligent scheduling decisions can be made. L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS) are the two prominent side-channel attacks that core scheduling addresses.
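The rule that core scheduling enforces boils down to a comparison of per-task tags; the sketch below is a toy model of that check (the real patches attach a "cookie" to the kernel's task structure, and untagged tasks are handled specially):

    #include <stdbool.h>

    struct task {
        unsigned long core_cookie;   /* security-domain tag */
    };

    /* Two tasks may occupy SMT siblings of one core only if tagged alike. */
    static bool may_share_core(const struct task *a, const struct task *b)
    {
        return a->core_cookie == b->core_cookie;
    }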
The goal of core scheduling as a side-channel-attack mitigation is to achieve higher performance than by disabling SMT altogether. Hyper-V and Xen already support core scheduling. Patches for Linux are not yet upstream and a set of benchmark results shows that performance still varies wildly between workloads. More work is needed to handle pathological cases where workloads run much slower with core scheduling.
L1TF is a hardware vulnerability that allows unprivileged speculative access to data in the level-1 data (L1D) cache when the page-table entry (PTE) for a virtual address has the "present" bit cleared or other reserved bits set. If an instruction accesses a virtual address mapped by a non-present PTE, a page fault is raised only when the faulting instruction retires; until then, speculative execution proceeds using the stale value from the L1D cache. This can result in unauthorized access through the faulting instruction and can leak a secret key from another guest or from the host system.
This problem can be mitigated by flushing the L1D cache on entry to the VM and by using a shadow memory-management unit (MMU). Flushing the L1D cache is not sufficient to mitigate L1TF attacks when hyperthreading is enabled, though, because the L1D cache is shared between hyperthreads on the same core. Guest vCPUs can leak L1D-cache data populated either by sibling hyperthreads running vCPUs of another VM or by sibling hyperthreads running vCPU threads that are currently in the VM-exit handler (where the VM exits to give control to the kernel or user space to handle a request).
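The cache-flushing half of that mitigation is small; the sketch below shows the idea, with the MSR number and flush bit taken from Intel's documentation (in-kernel code would use the definitions from <asm/msr-index.h> and the wrmsrl() helper rather than redefining them):

    #define MSR_IA32_FLUSH_CMD  0x10b
    #define L1D_FLUSH           (1ULL << 0)

    /* Flush the L1 data cache just before re-entering the guest. */
    static inline void flush_l1d_before_vmentry(void)
    {
        wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
    }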
Liran Alon and Alexandre Chartre gave a talk on "KVM Address Space Isolation (ASI)", which is a feature that introduces a separate virtual address space for VM-exit handlers. Previous mitigations excluded data from the virtual address space, but it is difficult to identify all of the sensitive data. KVM ASI is a whitelist approach that builds a virtual address space with only the data actually required by KVM's VM-exit handlers.
Thomas Lendacky covered the status of the AMD Secure Encrypted Virtualization (SEV) CPU feature in "Improving and Expanding SEV Support". SEV protects VMs from an untrusted hypervisor such as when deploying a VM on a public cloud. Guest memory is encrypted so that the hypervisor cannot inspect or modify its contents. Lendacky discussed current developments with SEV such as eliminating memory pinning, live migration, and SEV with encrypted state (SEV-ES).
SEV live migration uses separate encryption keys for the source and destination VM and the keys are not migrated. The firmware requires a copy of encrypted memory and special transport keys are used for moving pages between hosts.
For SEV performance reasons, guest memory is currently pinned. Pinning memory is undesirable because it prevents swapping and overcommitting. Options to eliminate guest-memory pinning by preventing page movement (migration/swapping) and SEV firmware-related work to copy encrypted pages are being investigated. Future work also includes support for SEV-ES for encryption of VM registers and SEV-SNP (Secure Nested Paging).
Recently, s390 joined the list of architectures implementing encryption or protection of virtual machines. In "Protected Virtual Machines for s390x", Claudio Imbrenda gave an overview of this architecture. The hypervisor is considered untrusted, so the only trusted entity in the system is the "ultravisor" that is implemented in hardware and firmware. It decrypts and verifies the boot image for "protected guests", and proxies interactions between the hypervisor and guests. The hypervisor may not access any of the protected guest's memory other than what that guest shared with it for I/O. The s390 code was able to reuse a lot of the infrastructure that had already been introduced for AMD SEV.
VFIO and mdev
The VFIO driver in Linux exposes physical PCI adapters and other devices to applications. VFIO is used by user-space device drivers in software-defined networking applications and by VMMs such as QEMU to pass physical PCI adapters through to guests. Now an ecosystem of drivers built on top of VFIO is enabling additional use cases.
In "Toward a Virtualization World Built on Mediated Pass-Through", Kevin Tian presented an overview of this growing ecosystem that builds on the VFIO and mediated device framework (mdev) drivers in Linux. While VFIO exposes physical devices to user space, the mdev framework makes it possible to write pure software devices or to combine hardware and software functionality. This can be used to add software logic on top of hardware, for example, to work around hardware limitations or to present a subset of the functionality embodied in the hardware.
One of these new VFIO/mdev applications was presented in more detail by Xin Zeng and Yi Liu in their talk "Bring a Scalable IOV Capable Device into Linux World". Data-center PCI adapters sometimes support PCI single-root I/O virtualization (SR-IOV), a feature that splits the adapter into virtual function (VF) child devices that can be individually passed through to guests. This allows multiple virtual machines to safely share access to a PCI adapter, although only a subset of resources is available through each VF. The newer scalable I/O virtualization approach replaces the hardware SR-IOV feature with an mdev software driver that decides which hardware resources to pass through directly and which ones to emulate in software.
Another aspect of mediated devices was presented by Thanos Makatos and Swapnil Ingle in their "muser: Mediated User Space Device" presentation. Muser is an mdev kernel driver and associated libmuser user-space library for implementing device emulation programs in user space. A program using the libmuser API can implement a VFIO-capable PCI device in software. QEMU and other VMMs can then pass this device through to a guest. Both C and Python bindings were presented for writing software PCI devices in user space. The API revolves around registering a PCI device description and then handling device accesses either through callbacks or by polling mmap()-ed register space when performance is critical. The device emulation program can define the PCI Base Address Registers (BARs), configuration space, and other standard PCI functionality in order to emulate existing or custom PCI devices.
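The shape of such a device-emulation program might look like the sketch below; every type and function name here is an illustrative stand-in rather than the actual libmuser API, but it shows the register-a-description, service-accesses-via-callback model that was presented:

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    typedef ssize_t (*bar_access_cb)(char *buf, size_t count,
                                     uint64_t offset, int is_write);

    struct pci_dev_desc {                /* hypothetical device description */
        uint16_t      vendor_id, device_id;
        size_t        bar0_size;
        bar_access_cb bar0_cb;           /* invoked on each BAR0 access */
    };

    /* Emulate a trivial device whose BAR0 holds one scratch register. */
    static uint32_t scratch;

    static ssize_t bar0_access(char *buf, size_t count,
                               uint64_t offset, int is_write)
    {
        if (offset != 0 || count != sizeof(scratch))
            return -1;
        if (is_write)
            memcpy(&scratch, buf, count);
        else
            memcpy(buf, &scratch, count);
        return count;
    }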
Micro VMs and new VMMs
In the last few years, containers and micro-services have become more and more popular. Cloud users and providers are looking at virtual machines to increase the security and isolation of these workloads, and several sessions presented KVM-based solutions.
In "Firecracker: Lessons from the Trenches", Andreea Florescu and Alexandra Iordache presented a lightweight VMM written in Rust: Firecracker. It is designed for multi-tenant cloud workloads like containers and functions as a service (FaaS). Its key features are minimal boot time, low memory footprint, and over-subscription of CPUs and memory. Each Firecracker process handles a single micro-VM and uses chroot(), control groups, and seccomp to delimit security boundaries. Rust allowed them to write more reliable code, but Florescu and Iordache described a few bugs that crept through, including an integer overflow that caused Firecracker to potentially run afoul of undefined behavior.
Continuing on the Rust theme, Alibaba, AWS, CloudBase, Google, Intel, and Red Hat have collaborated on rust-vmm, which is a collection of Rust crates for virtualization software. Florescu and Samuel Ortiz talked about that in "Playing Lego with Virtualization Components". They explained that Rust's key features (memory safety, safe concurrency, and great performance) fit well with the VMM requirements. Some examples of components provided by rust-vmm are API bindings (KVM, VirtIO, VFIO), a memory model, a kernel loader, and several utility libraries.
In "Bring QEMU to Micro Service World", Xiao Guangrong and Zhang Yulei described how they used QEMU to launch micro-services extremely quickly. Their solution, called QEMU basepoint, bypasses QEMU and guest initialization. Using the existing QEMU migration feature, it saves the state of the VM into a template file just after it starts. Then it uses this basepoint to launch new VMs. To save QEMU initialization time as well, the new VMs are handled by a fork of the base QEMU process. This solution allows them to reduce the boot time, but it still has limitations, such as the lack of kernel address-space layout randomization, that will be addressed in the future.
Nested virtualization
Nested virtualization is the idea of running virtual machines inside other virtual machines via the same or a different hypervisor. With the popularity of cloud infrastructure, which typically uses virtualization, the kernel is frequently already running on top of another hypervisor, so nested virtualization is needed to run KVM. Nested virtualization has therefore become important for development, continuous integration (CI), and production environments running in the cloud. Two talks at KVM Forum looked at testing these setups from different perspectives.
In "Managing Matryoshkas: Testing Nested Guests", Marc Hartmayer presented an approach for testing deeply nested setups (up to seven levels) using existing Python test infrastructure via the Avocado testing framework. He used the Mitogen library to run Python code remotely on the guest, reusing existing virtualization tests on nested guests. A short demo showed that this setup is easy to extend and information on test results is easy to obtain even for more deeply nested guests. Hartmayer used the Avocado-vt plugin for his implementation, but integration with the simpler Avocado-QEMU infrastructure (already in use in QEMU's acceptance test suite) should be doable once the needed Mitogen changes have hit upstream.
In "Nesting&testing" Vitaly Kuznetsov looked in some detail at how nested virtualization on x86 is tested today. Two testing frameworks complement each other: kvm-unit-tests, which makes use of QEMU to run tests, and kvm-selftests, which uses custom user-space tooling to allow testing corner cases at the price of making the tests more complicated to write. Furthermore, testing can use tests specifically written to test nested functionality, or it can run preexisting virtualization tests under a hypervisor; the latter is for example how KVM under Hyper-V is tested.
Looking at the tests currently available, VMX for Intel CPUs has a lot of tests in what Kuznetsov labeled the "correctness", "functional", and "regression" test categories in kvm-unit-tests; SVM for AMD CPUs, however, has very few tests, mostly in the "functional" category. For kvm-selftests, the situation for AMD CPUs looks even more dire, as there is currently no support for SVM at all; SVM could certainly do with some more love.
Beyond adding more tests, integrating the tests already available into some kind of continuous-integration system is obviously a good idea, especially since that enables running them in more nested environments, including Hyper-V/Azure, which not everybody may have easy access to.
Conclusion
This covers the main themes at KVM Forum 2019. A number of interesting talks explored other areas and are available for watching online (along with all the presentations). It is a safe bet that side-channel attack mitigations will remain a topic at KVM Forum 2020 in Dublin, Ireland next year, and that the VFIO/mdev ecosystem will grow as both hardware vendors and software developers take advantage of its capabilities. The presence of Amazon, Google, Microsoft, Tencent, and Huawei underscored how pervasive KVM has become in cloud hosting and that its use will continue to grow in the foreseeable future.
Enhancing KVM for guest protection and security
A key tenet in KVM is to reuse as much Linux infrastructure as possible and focus specifically on processor virtualization. Back in 2007, this meant a smaller code base and less friction with the other kernel subsystems, especially when compared with other virtualization technologies such as Xen. This led to KVM being merged into the mainline with relative ease.
But now, in the era of microarchitectural vulnerabilities, the priorities have shifted, and KVM's reliance on other kernel subsystems can be a liability. For one thing, the host kernel widens the trusted computing base (TCB) and makes for a larger attack surface. In addition, kernel data structures such as the direct memory map give Linux access to guest memory even when it is not strictly necessary, making it impossible to fully enforce the principle of least privilege. In his talk "Enhancing KVM for Guest Protection and Security" (slides [PDF]) presented at KVM Forum 2019, long-time KVM contributor Jun Nakajima explained this risk and suggested some strategies to mitigate it.
Removing the VMM from the TCB
Nakajima pointed to three main security issues in the current KVM hypervisor design: piggybacking on Linux subsystems, the user-space virtual machine monitor (VMM) having access to the data of the guest it is managing, and the kernel having access to all address spaces and data structures, including those from every guest running in the same host. Inspired by virtualization-oriented memory encryption technologies, like AMD SEV, he proposes a security strategy based on removing as many elements as possible from the TCB, as seen from the guest perspective.
The first step would be removing the user-space VMM from the TCB. To achieve this, the guest would use a new KVM facility to specify a virtual memory region to be shared, while the rest of its memory would be marked as private to the guest. Any attempt by the VMM to access a memory region that the guest has not shared would result in a page fault and, depending on the implementation, potentially a signal to the VMM.
This implies that the kernel running inside the guest needs to be modified to make use of the facility and to ensure that DMA operations work exclusively within the memory region shared with the VMM, which can be done using a software I/O translation buffer (an swiotlb-based bounce buffer). This strategy is already being used to support AMD SEV, which allows guests to run with memory encryption (making it inaccessible to either the host's kernel or the user-space VMM), and Nakajima pointed out that the intention is to rely on the existing code as much as possible.
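The bounce-buffering idea can be illustrated with a toy model; all names below are made up, but they show how outbound DMA data would be staged through the VMM-shared window rather than handing the device a pointer into private guest memory:

    #include <stddef.h>
    #include <string.h>

    #define SHARED_POOL_SIZE 65536
    static char shared_pool[SHARED_POOL_SIZE];   /* region shared with the VMM */

    /* Stage an outbound buffer into the shared window; return the address
     * that the (virtual) device should be programmed with. */
    static void *bounce_map_out(const void *private_buf, size_t len)
    {
        if (len > SHARED_POOL_SIZE)
            return NULL;
        memcpy(shared_pool, private_buf, len);
        return shared_pool;
    }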
Protecting guests from the host kernel
Removing the user-space VMM from the TCB is useful, but it is just the first step. Going further, and removing the host Linux kernel from the TCB, requires deeper changes so that the hypervisor can "absorb" the host kernel and deprive it of its normal full privileges on the machine.
This feat had already been presented (slides [PDF]) at KVM Forum 2016 and, in the meantime, got the name virtualization-based hardening (VBH). As that name suggests, the hypervisor protects itself from the host kernel through hardware virtualization. Once the hypervisor has initialized itself, Linux would run in guest mode (with special privileges) and the hypervisor could use extended and nested page tables (EPT/NPT) along with an IOMMU to control Linux's access to physical memory.
With this execution model, KVM would be able to provide virtual memory contexts to guests without relying on the Linux memory-management subsystem, which would still be in use for servicing both the user-space processes running on the host and the kernel itself. This way, KVM would be the only subsystem able to access address-space mappings and alter their properties (including their protection levels), so even if an attacker happened to take control of some other kernel subsystem, the guests' memory would not be compromised.
In this scenario, while still far from being an actual type-1 hypervisor, KVM would effectively hijack a significant part of the low-level virtual-memory functionality from Linux. Of course, this means that the hypervisor would need to provide its own mechanism for swapping pages and define new interfaces to access the actual memory backends (RAM, software-defined, NVDIMM, etc.).
This strategy shares some similarities with one also presented at KVM Forum 2019 by Liran Alon and Alexandre Chartre in their talk "KVM ASI (Address Space Isolation)" (slides [PDF]). In it, they suggested creating a virtual address space for each guest that would exclusively map per-VM data, KVM, and the needed kernel subsystems. This is less radical in the sense that it would still be using the Linux memory-management facilities, and thus probably be easier to get accepted upstream, but at the cost of keeping a larger TCB.
All in all, it seems like a consensus is being built around the idea that it's necessary to rethink the way in which KVM manages the guest's memory, and its relationship with the rest of the Linux subsystems.
Going further: removing KVM from the TCB
So far, the strategies presented were able to remove the VMM and the host kernel from the TCB. The last step would be removing KVM itself from the TCB. For this to be possible, KVM must remove guest memory from its address space, including from the Linux direct map; only regions explicitly shared by the guests would be accessible to KVM. This is problematic because, currently, some operations serviced during VM-exits assume that KVM has full access to all of guest memory. VM-exits happen when the VM returns control to the host's kernel to handle an operation it cannot perform itself.
To overcome this issue, Nakajima proposes two options. One is to adapt the kernel running inside the guest: operations that would trigger VM-exits, and for which KVM would need to access the guest's memory, would be replaced with explicit hypercalls. Alternatively, in order to run unmodified drivers, the operations could be reflected to a virtualization exception (#VE) handler inside the guest. The handler would emulate the memory-mapped I/O (MMIO) operations and translate them into hypercalls.
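A hedged sketch of the second option might look like this; every name below is an illustrative stand-in, but it shows how a #VE handler could forward a faulting MMIO access as a hypercall:

    #include <stdint.h>

    struct ve_info {
        uint64_t gpa;        /* guest-physical address that faulted */
        int      is_write;
        uint64_t value;      /* value written, or destination of a read */
    };

    /* Hypothetical hypercall wrappers provided by the guest kernel. */
    extern uint64_t hypercall_mmio_read(uint64_t gpa);
    extern void hypercall_mmio_write(uint64_t gpa, uint64_t value);

    static void handle_virtualization_exception(struct ve_info *ve)
    {
        if (ve->is_write)
            hypercall_mmio_write(ve->gpa, ve->value);
        else
            ve->value = hypercall_mmio_read(ve->gpa);
    }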
Either of these strategies can be used to enable isolation of guest memory from the hypervisor. This would be helpful to prevent accidental leaks and side-channel attacks, but an attacker gaining full control of KVM would still be able to alter mappings and, potentially, gain access to the private regions of every guest. A complete mitigation of this risk therefore requires hardware assistance, like the one provided by AMD SEV and its successors, SEV-ES (Encrypted State) [PDF] and SEV-SNP (Secure Nested Paging) [PDF slides]. This allows the guest memory to be transparently encrypted in such a way that not even the hypervisor is able to access it in the clear.
Proof of concept and performance
Lastly, Nakajima presented a proof of concept that implements the ability to remove the guest mappings from both the VMM and the host's kernel, and gave some initial numbers on the performance impact. For disk read operations on a virtio device, he measured a 1.2% increase in CPU time with one guest, and 1.3% with more than ten. For write operations, the impact was slightly lower: 1.1% for a single guest and 1.2% with more than ten. For network send operations, also on a virtio device, the increase in CPU time was 2.6% when running a single guest, and 3.8% with more than ten.
According to Nakajima, the next step is finishing the proof of concept, and then sharing the patches with the upstream community. This will probably spark an interesting discussion in the upstream mailing lists about this and, possibly, other address space isolation techniques for KVM to protect the guests' memory from both an attacker controlling the host and side-channel leaks.
Some near-term arm64 hardening patches
The arm64 architecture is found at the core of many, if not most, mobile devices; that means that arm64 devices are destined to be the target of attackers worldwide. That has led to a high level of interest in technologies that can harden these systems. There are currently several such technologies, based in both hardware and software, that are being readied for the arm64 kernel; read on for a survey on what is coming.
E0PD
The Meltdown vulnerability enables an attacker in user space to read kernel-space data by making use of a combination of speculative execution and cache-based side channels. The kernel's defense against Meltdown is kernel page-table isolation — removing the kernel's page tables from the user-space mapping entirely. That works, but it has a significant performance cost and it can interfere with the use of other processor features. Nonetheless, it is fairly widely accepted that address-space isolation will be increasingly necessary to protect systems for some time.
There is an alternative, though: fix the hardware instead. One initiative in this area is the E0PD feature, which was added as part of the Arm v8.5 extensions. Documentation on E0PD is scarce to the point of nonexistence; not even the patch set supporting it from Mark Brown describes how it works or what the acronym stands for. What the patch set does make clear is the relevant behavior: with E0PD enabled, user-space accesses to the kernel's half of the address space fail in constant time, regardless of whether the address in question is mapped, denying attackers the timing information that Meltdown attacks depend on.
E0PD, thus, doesn't prevent speculative execution from going off into memory that user space should not be able to access, but it does block the side channel normally used to extract the data exposed by incorrectly speculated operations. Systems that support E0PD do not need to enable kernel page-table isolation and should, thus, regain the performance that it took away; no benchmark results were included with the patch set, though. E0PD support for the kernel is apparently close to ready, but the availability of processors with E0PD support may take rather longer.
Return-address signing
Arm pointer authentication is a mechanism for applying cryptographic signatures to pointers used in running code. A special instruction creates a signature for a given pointer value using a secret key; the signature is stored in the unused bits at the upper end of the pointer itself. A separate instruction verifies that a given pointer was indeed signed using a specific key. This mechanism can be used to prevent attackers from fooling the kernel into using an ill-advised pointer value.
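A toy model may help illustrate the mechanism; the "MAC" below is a placeholder (the hardware uses the QARMA cipher with per-process keys), but the bit layout — a signature packed into the unused upper bits of the pointer — follows the description above:

    #include <stdint.h>

    #define VA_BITS  48                          /* virtual-address bits in use */
    #define PAC_MASK (~((1ULL << VA_BITS) - 1))  /* the unused upper bits */

    /* Placeholder for the hardware's keyed MAC; not the real cipher. */
    static uint64_t pac_compute(uint64_t ptr, uint64_t modifier, uint64_t key)
    {
        return ((ptr ^ modifier) * key) & PAC_MASK;
    }

    /* Sign a pointer (whose upper bits are assumed to be zero). */
    static uint64_t pac_sign(uint64_t ptr, uint64_t modifier, uint64_t key)
    {
        return ptr | pac_compute(ptr, modifier, key);
    }

    /* Verify that a signed pointer carries a valid signature. */
    static int pac_auth(uint64_t signed_ptr, uint64_t modifier, uint64_t key)
    {
        uint64_t plain = signed_ptr & ~PAC_MASK;
        return (signed_ptr & PAC_MASK) == pac_compute(plain, modifier, key);
    }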
The return-address signing patch set from Amit Daniel Kachhap uses this feature for a specific purpose: protecting the return addresses for function calls on the stack. In particular, it uses the ‑msign‑return‑address flag added to GCC 7 to build the kernel with this protection. On entry to a function, the return address is signed; when the time comes for the function to return, the signature is verified. Should the verification fail, a kernel oops will be generated and the running process will be killed.
The intent behind this work, of course, is to protect the kernel against buffer overflows or other attacks that overwrite the stack. An attacker may be able to corrupt the stack, but they should not be able to place return addresses there that will pass the verification step. That should protect the kernel against a wide range of potential attacks, since many common techniques depend on placing crafted return addresses on the stack.
Shadow call stacks
Another approach to protecting return addresses can be seen in the shadow call stack support patch set from Sami Tolvanen. Rather than signing return addresses, this patch set uses the Clang ‑fsanitize=shadow‑call‑stack option to cause return addresses to be placed on a separate "shadow" stack located somewhere in memory. Before a function returns, it restores the return address from the shadow stack.
The current call stack tends to be some of the easiest memory for an attacker to corrupt; any buffer overflow of an automatic variable will do. With the shadow call stack, though, this sort of corruption is rendered less harmful, since return addresses no longer live on the stack. The shadow stack will typically be much harder for an attacker to modify, or to even know where it might be located. The result should, once again, be a system that is more secure against buffer-overflow attacks.
Return-address signing and shadow call stacks are two different approaches to the same problem; one probably does not want to use both of them. Tolvanen addressed the question of which should be used in the patch cover letter: processors with pointer-authentication support should use that feature, while shadow call stacks are there for those without it. This patch set seems to be about ready; it is currently earmarked for the 5.6 merge window.
Branch target identification
The last of the arm64 features under consideration is branch-target identification (BTI), which is intended to trap wild jumps. The idea is simple enough: if BTI is enabled, the first instruction encountered after an indirect jump must be a special BTI instruction. That instruction is a no-op in itself (it lives in the hint space, so it is harmless on processors without BTI support); its only effect is to mark the location as a legitimate target for an indirect jump. Jumps to locations that do not feature a BTI instruction, instead, will lead to the quick death of the process involved.
BTI, thus, is a way of marking code that is meant to be the target of an indirect jump, thwarting attacks that somehow convince the kernel to jump to some random spot. That should block a range of attacks based on, for example, overwriting a structure full of function pointers called by the kernel. It is interesting to note that BTI does not check the target of a return from a function; the intent is that return-address signing should be used to protect returns. The GCC 9 release includes support for BTI.
Each of these technologies addresses one piece of the problem of protecting arm64 systems from attackers. Put together, they should have the effect of making these systems into significantly harder targets. The arms race will not end, and attackers will certainly find ways of getting around these techniques, at least some of the time. But, with luck, they will find themselves being frustrated more often in the future.
Keeping memory contents secret
One of the many responsibilities of the operating system is to help processes keep secrets from each other. Operating systems often fail in this regard, sometimes due to factors — such as hardware bugs and user-space vulnerabilities — that are beyond their direct control. It is thus unsurprising that there is an increasing level of interest in ways to improve the ability to keep data secret, perhaps even from the operating system itself. The MAP_EXCLUSIVE patch set from Mike Rapoport is one example of the work that is being done in this area; it also shows that the development community has not yet really begun to figure out how this type of feature should work.

MAP_EXCLUSIVE is a new flag for the mmap() system call; its purpose is to request a region of memory that is mapped only for the calling process and inaccessible to anybody else, including the kernel. It is a part of a larger address-space isolation effort underway in the memory-management subsystem, most of which is based on the idea that unmapped memory is much harder for an attacker to access.
Mapping a memory range with MAP_EXCLUSIVE has a number of effects. It automatically implies the MAP_LOCKED and MAP_POPULATE flags, meaning that the memory in question will be immediately faulted into RAM and locked there — it should never find its way to a swap area, for example. The MAP_PRIVATE and MAP_ANONYMOUS flags are required, and MAP_HUGETLB is not allowed. Pages that are mapped this way will not be copied if the process forks. They are also removed from the kernel's direct mapping — the linear mapping of all of physical memory — making them inaccessible to the kernel in most circumstances.
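In user space, requesting such a region would look something like the sketch below; MAP_EXCLUSIVE comes from an unmerged patch set, so the flag value given here is a stand-in and the call will simply fail on current kernels:

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_EXCLUSIVE
    #define MAP_EXCLUSIVE 0x40000        /* stand-in value; not in any ABI */
    #endif

    int main(void)
    {
        /* MAP_PRIVATE and MAP_ANONYMOUS are required by the patch set;
         * MAP_LOCKED and MAP_POPULATE are implied. */
        void *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXCLUSIVE,
                            -1, 0);
        if (secret == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* The page is now resident, never swapped, dropped from the
         * kernel's direct map, and not inherited across fork(). */
        return 0;
    }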
The goal behind MAP_EXCLUSIVE seems to have support within the community, but the actual implementation has raised a number of questions about how this functionality should work. One area of concern is the removal of the pages from the direct mapping. The kernel uses huge pages for that mapping, since that gives a significant performance improvement through decreased translation lookaside buffer (TLB) pressure. Carving specific pages out of that mapping requires splitting the huge pages into normal pages, slowing things down for every process in the system. The splitting of the direct mapping in another context caused a 2% performance regression at Facebook, according to Alexei Starovoitov in October; that is not a cost that everybody is willing to pay.
Elena Reshetova indicated that she has been working on similar functionality; rather than enhancing mmap(), her patch provides a new madvise() flag and requires that the secret areas be a multiple of the page size. Her version will eventually wipe any secret areas before returning the memory to general use in case the calling process doesn't do that.
Reshetova also raised the idea of mapping this memory uncached. The benefit of doing so would be to protect its contents from a whole range of speculative-execution attacks, known and unknown. On the other hand, the effect on application performance would be something between "painful" and "crippling", depending on how often the memory is accessed. Some users would likely welcome the extra protection; many others may well find that the performance penalty rules out this feature's use entirely. Andy Lutomirski said that uncached memory should only be provided if it is explicitly asked for, but Alan Cox responded that users generally do not know whether they want uncached memory or not.
More to the point, Cox continued, there may be any of a number of things that the system might do to protect the contents of secret memory; those things will vary from one system to the next, and users will not be in a position to know what any specific system should use. That makes it all the more important to nail down what the MAP_EXCLUSIVE flag really means.
James Bottomley took this argument even further, describing MAP_EXCLUSIVE as "a usability problem". Protecting secret data might, on some systems, involve hardware technologies like TME and SEV, for example, but developers cannot know that in a general way. Somehow, Bottomley suggested, the kernel should make the best choice it can for how to protect secret memory; one such choice could be to make the memory uncached only on systems where the speculative-execution mitigations are not active. Lutomirski worried that this approach would not work, though; there are too many variables and ways in which things could go wrong.
There is only one truly clear conclusion from this discussion: a desire for memory with higher levels of secrecy exists, but the development community lacks a clear idea of how that secrecy should be implemented and how it should be presented to the user. That suggests that this feature will not be showing up in a mainline kernel anytime soon. Getting memory secrecy wrong risks saddling the community with the maintenance of a misdesigned interface and, possibly, giving application developers a false sense of security. It is better to go slow in the hope of getting things right.
The Yocto Project 3.0 release
The Yocto Project recently announced its 3.0 release, maintaining the spring/fall cadence it has followed for the past nine years. As well as the expected updates, it contains new thinking on getting the best of two worlds: source builds and prebuilt binaries. This fits well into a landscape where reproducibility and software traceability, all the way through to device updates, are increasingly important to handle complex security issues.

This update contains the usual things people have come to expect from a Yocto Project release, such as upgrades to the latest versions of many of the software components, including GCC 9.2, glibc 2.30, and the 5.2 and 4.19 kernels. But there is more to it than that.
One major development in this release was the addition of the ability to run the toolchain test suites. The project is proud of its ability to run builds of complete Linux software stacks for multiple architectures from source, boot them under QEMU, and run extensive software tests on them, all in around five hours. In that time, we can now include tests for GCC, glibc, and binutils on each of the principal architectures. As a result, the test report for the release now contains around two million test results.
Build change equivalence
What is slightly less usual is a small line in the release notes starting with "build change equivalence". This innocuous-sounding line covers what could become one of the most useful enhancements to the project in recent years and may also be a first for large-scale distribution compilation in general. In basic terms, it allows detection of build-output equivalence and hence reuse of previously built binaries — but in a way never seen before — by building on technology already used by the project.
While the project has been able to reuse binaries resulting from identical input configurations for some time, 3.0 allows the reuse of previously built binaries when the output of the intermediate steps in the build process is the same. This avoids much rebuilding, leading to faster development times, more efficient builds, reduced binary artifact storage, and also a reduction in work like testing, allowing build and test resources to focus on "real" changes. In short, it addresses one of the complaints many Yocto Project users have about the system: its "continual building from source".
In some ways, focusing on this feature is unfair to 3.0, as there are many other features in the release, many of which are small, incremental improvements to things the project has already done well. One other feature of note is that the change-equivalence work led naturally into more efficient "multiconfig" builds, where multiple different configurations can be built in parallel; these are now optimized when the builds share artifacts. The Yocto Project is one of the few systems where you can build components for different architectures or operating systems (e.g. an RTOS) in parallel and combine them all in one build.
The Yocto Project/OpenEmbedded build process
To understand more about what build change equivalence means and how it works, it first makes sense to understand how prebuilt binaries were already being handled. There is a common misconception that the Yocto Project (or OpenEmbedded, the underlying build system) always builds everything from source. This may have been true ten years ago but, in modern times, the project maintains what it terms its shared-state cache (or "sstate"). This cache contains binaries from previous builds that the system can reuse under the right conditions.
When building software with OpenEmbedded, a series of steps is followed. The project's execution engine, BitBake, takes these steps ("tasks" in BitBake terms), builds a dependency tree, and executes them in the correct order. These tasks are usually written in Python or shell and, ultimately, are effectively represented by key/value pairs of data. These data pairs could be the topic of an article in their own right but, in short, they are how OpenEmbedded manages to be customizable and configurable. It does this through its data store and its ability to "override" values and stack configuration files from different sources, all of which can potentially manipulate the values. An example could be the following configuration fragment:
    PARALLELISM ?= ""
    MAKE = "make"
    MAKE_specialmake = "new-improved-make"
    OVERRIDES = "${@random.choice(['', 'specialmake'])}"
    do_compile () {
        ${MAKE} ${PARALLELISM}
    }
    addtask compile after configure before install
This fragment illustrates some of the capabilities of the syntax. Several keys are defined, including do_compile, which is promoted to a task with some ordering constraints on when it needs to run. The "do_" prefix is simply a convention to make it obvious which keys are tasks. A user could set PARALLELISM elsewhere to pass options like -j to make, speeding up compilation, or to turn it off if some source code doesn't support parallel building.
Also shown is a simple override where MAKE is changed to a different tool when specialmake is added to OVERRIDES. In this case, it is being triggered randomly just to show the ability to inject Python code to handle more complex situations. There is much more to the syntax, but the idea is you build up functions and tasks that are executed to build software, and these functions are highly customizable depending on many different kinds of input, including the target architecture, the specific target device, the policies being used for the software stack, and so on. The BitBake data store is the place where all the different inputs are combined together to build the system.
There is code in BitBake that knows how to parse these shell and Python configuration fragments — in the Python case using Python's own abstract-syntax-tree code — and from this figure out which keys and associated values were used to build the functions that represent the tasks. Once you know the values that are going into a given task, you can represent them as a hash. If the input data changes, the hash changes. Some values, such as the directory where the build is happening, can be filtered out of the hash but, in general, they're sensitive and accurate representations of the configuration being used as the input. In addition, hashes of files being added to the build are included.
The sstate cache is therefore a store of output from tasks, indexed by a hash that represents the configuration and other inputs used to create that output. If the configuration (hash) matches, the object can be pulled from the cache and used instead of rebuilding it from scratch. This, however, is old technology; the project has been using sstate since 2010. One potential issue with this approach is how sensitive it is to changes. If you add some white space to a patch being applied to some piece of software early in the build process, it will change the input hashes to almost everything in the system and cause a full rebuild. There has, therefore, been a long-held desire to be more intelligent about when to rebuild. Solving a case of white space changes may be possible through optimization, but there are many other cases that could be optimized too, and the question becomes how to do this in a more generic way.
Better optimization
This is where the new work in 3.0 comes into play. So far, we've only talked about configuration inputs, but these inputs result in some kind of output such as, in the case of the example above, some generated object files. The project would usually be most interested in the output from "make install", which would be generated by a do_install() function following do_compile(). For more efficient builds, it became clear that the project should start analyzing the output from the intermediate tasks, so it came up with a way of representing the output of a task as a hash too. The algorithm currently chosen to do this looks at the checksums of the output files, but it ignores their timestamps. There are lots of potential future features that could be added here, such as representing the ABI of a library instead of its checksum but, for now, even this simplistic model is proving effective.
Once you have this output hash, you can compare it to hashes of previous builds of this task. If the output hashes from two builds match, then the input hashes from those two builds are deemed to be "equivalent". This means that the current build matched the previous build up to this point; it follows that anything beyond this point should also match, assuming the same subsequent configuration, even though the input configurations before this point were different. At this point, the build can therefore stop building things from scratch and start reusing prebuilt objects from sstate.
To make this work, the system needs to store information about these "equivalences", so the project added a "hash equivalence server" — a server that stores which input hashes generate the same output data as other input hashes and are thus equivalent. The first input hash given to the server is deemed to be the "base" hash and used to resolve any other matching hashes to that value. This server is written as a network service so that multiple different builds can feed it data and benefit from its equivalence data to speed up their builds.
This is all good, but BitBake itself required major surgery to be able to use this data. Previously, it would look for prebuilt objects from sstate, install them, then proceed with anything it needed to build from source. Installing from sstate is a different operation, since the task ordering is reversed compared to normal task execution. To understand why, consider that, in some cases, if an sstate object is available for some task late in the build, all of the earlier tasks leading up to that object can be skipped, as their output isn't needed. That means you need to start with the tasks that would normally be last to run, then work backward up the dependency tree, installing the sstate objects in reverse order and stopping when no further dependencies are needed. In simpler terms, this means that mixing sstate tasks and normal tasks is hard.
To cope, BitBake in 3.0 now has two different task queues — sstate tasks and non-sstate tasks — and both queues need to be able to execute in parallel. When sstate equivalence is detected, tasks are "rehashed" with the new hashes and can migrate back to the sstate queue. The build can alternate between these two states many times as different equivalences are found and objects from sstate become valid. It makes for a fascinating scheduling problem.
Reproducibility and more
There is a further consideration here: reproducibility. This is a hot topic for many distributions, so many people have been quietly working away at making software builds generate consistent binary output. The Yocto Project is no exception and has been trying to do its part, including sending patches to upstream projects where it can (including the Linux kernel). This ties in well with hash equivalence, since the higher the reproducibility of the output, the larger the number of equivalent hashes that should be found. In 3.0, automated tests for reproducibility were added. The project has this working for building the minimal core image, including its toolchain, and will continue to improve this feature in the next release.
While this is a beta feature in 3.0 that is not enabled by default, the project believes it represents a significant optimization in working with source builds. Perhaps the Yocto Project can finally put the reputation of "always building from source" behind it.
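For those who want to experiment, enabling the feature amounts to a couple of settings in local.conf; the variable names below are the ones the project uses for this feature ("auto" starts a local hash equivalence server), though, as a beta feature, the details may change as it matures:

    BB_SIGNATURE_HANDLER = "OEEquivHash"
    BB_HASHSERVE = "auto"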
Finally, it's also worth mentioning a quick follow-up to a previous article that discussed how the project found a Linux kernel bug with its automated testing. As the Yocto Project was approaching the 3.0 release and switching to the 5.2 kernel, a similar situation occurred: developers noticed that the "ptest" tests for the strace recipe were hanging. "ptests" are where the project has packaged up the upstream tests that come with software; they are run in the target environment. This was discussed on LKML and ultimately found to be a bug that only appeared when preemption was enabled.
The takeaway from all this is that "from scratch" source builds for the Yocto Project should be something that happens less frequently in the future, particularly as build reproducibility continues to improve. Along with the obvious benefits faster builds bring to developers, they reduce storage and load requirements and reduce testing requirements too. This is particularly important when you consider security updates for end-user devices, where little of the system should change for a given update. Minimizing rebuilding allows more focused testing and, thus, reduced risk of unintended side effects, which in turn may encourage more updates to be made.
The project plans to make this functionality the default in the near future and is looking forward to further improvements in reproducibility as it finalizes its long-term support plans, within which these developments can play a key role.
[Richard Purdie is one of the founders of the Yocto Project and its technical lead.]
Page editor: Jonathan Corbet