
Leading items

Welcome to the LWN.net Weekly Edition for November 12, 2020

This edition contains the following feature content:

  • The RIAA, GitHub, and youtube-dl: the takedown of the youtube-dl repository and its predictable fallout.
  • Deprecating scp: the aging scp protocol, its problems, and possible replacements.
  • Atomic kmaps become local: a proposed kmap_local() API for short-lived high-memory mappings.
  • Migration disable for the mainline: a scheduler change needed by the realtime preemption work.
  • KVM for Android: Google's "protected KVM" project, as presented at KVM Forum.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The RIAA, GitHub, and youtube-dl

By Jake Edge
November 11, 2020

Toward the end of October, GitHub removed the repository for the youtube-dl utility, which provides a means to download video content from various streaming sites, such as YouTube. The repository was replaced with a cheery notice that it had been removed due to a DMCA takedown. It will likely come as no surprise that the DMCA action came from the Recording Industry Association of America (RIAA) or that the complaint was that the program circumvented the "technological protection measures" used on the videos by YouTube and other authorized sites.

If the goal of that notice was to somehow erase youtube-dl from the internet, the effort could not have been more misguided. Predictably, the notice provided yet another demonstration of the "Streisand effect": as word filtered out, youtube-dl was spread far and wide. Beyond that, many who had never heard of the program before were suddenly aware of its existence, its purpose, and the threat to its continued availability. Meanwhile, youtube-dl is still available for download, packaged for Linux distributions, and so on. The repository shutdown is an inconvenience to the project and its users but not much more than that.

The Digital Millennium Copyright Act (DMCA) is a US law—ostensibly about protecting copyright-holders—that has been (ab)used in a wide variety of ways by the enormous content conglomerates that hold the bulk of the copyrights for music, television, movies, and so on. In particular, the anti-circumvention provisions have been invoked in dubious ways to try to prevent competition in printer-ink cartridges, thwart investigation into the Volkswagen emissions cheating, and to chill cryptographic research of various sorts. While the DMCA itself is US law, it was written to implement two World Intellectual Property Organization (WIPO) treaties, so the effects are more widely applicable.

The RIAA is no stranger to using the DMCA, of course. The organization has been sending takedown notices since the DMCA was enacted and was filing lawsuits against alleged copyright infringers before that. There are certainly legitimate infringement problems that the organization and its members have targeted along the way, but their blanket attacks and overreach (e.g. the "dancing baby" video takedown) have also done much to paint the law (and the RIAA) in a rather bad light—not that it has resulted in any changes to the DMCA, sadly.

While youtube-dl can be used to circumvent the controls that streaming services place on their content, it can also be used for a wide variety of other tasks, many of which are perfectly legal. In addition, as the creator of youtube-dl, Ricardo García, recently pointed out, there are some who are unable to see these videos without using a tool like youtube-dl. While bandwidth has increased in many areas since youtube-dl was created in 2006, there are still plenty of folks who live at the end of a tiny, unreliable pipe. Beyond that, those with metered access might not want to pay multiple times to replay a video that they like. Those types of uses might not strictly be legal, but they are understandable; the RIAA, however, is not known for making distinctions of that sort.

The Freedom of the Press Foundation has described a number of different youtube-dl use cases for journalists, who need to be able to do things that are simply impossible without having direct access to the video content. The videos in question are generally not copyrighted by RIAA members, but, once again, the RIAA takes an "all or nothing" approach to the tool. In its notice to GitHub, the RIAA points to some specific entries in the source code that refer to pop music videos copyrighted by its members:

The source code notes that the Icona Pop work identified above is under the YouTube Standard license, which expressly restricts access to copyrighted works only for streaming on YouTube and prohibits their further reproduction or distribution without consent of the copyright owner; that the Justin Timberlake work identified above is under an additional age protection identifier; and that the request for the Taylor Swift work identified above is to obtain, without authorization of the copyright owner or YouTube, an M4A audio file from the audiovisual work in question.

A look at the Python source code shows that those works (and others) are used as tests; they are not presented as "sample uses" of the tool, as described. In an interview, former maintainer Philipp Hagemeister describes the tests as simply downloading the first 10KB of the videos in question, which amounts to a few seconds of video, to ensure that the formats are still being handled correctly. "This is certainly fair use, but the project is fully functional without these test cases." If he were still involved in the project, however, he would be in favor of removing them from the source code, presumably to try to placate the RIAA. He also provided some more reasons why youtube-dl is important:

youtube-dl is very valuable for many purposes: It enables video playback on devices where the web interface is not suitable (e.g. Raspberry Pis), it allows playback for disabled users, it powers research projects which analyze videos, and you can just watch videos when there may be no stable Internet connection. This should be unequivocally allowed and even supported for the good of society, while keeping the ability of content producers to benefit from their creations.

It is undoubtedly true that youtube-dl is used to download copyrighted work out from under its technological protections, but it is not at all clear that is the dominant use for the tool. Given that YouTube and other sites have vast arrays of user-uploaded content that is not subject to the same restrictions as the RIAA's precious content, any tool to access it will need to be able to use those sites in ways that are outside of the web-based interaction provided. Since there are also good reasons why people might want to view these videos in ways that RIAA members have not envisioned—or countenanced—any useful tool will need to be able to decode all of the different formats provided by the platforms. As with all tools, youtube-dl can be used in many different ways, some that even the RIAA might find to be acceptable.

One guesses that the abrupt shutdown of its repository will not seriously deter the project going forward. But it is not clear what data the project was able to extract from GitHub beyond the Git repository itself. There are a number of additional features at GitHub, such as the issue tracker, pull-request discussions, and wiki, that could be lost forever. That would be unfortunate, but it is one of the dangers projects face when choosing to host at a site like GitHub—the data is not always easily backed up, nor is it readily imported into another hosting site if needed.

Based on the perfectly predictable outcome of the notice, it is hard to see what the RIAA's strategy or goal really is here. It seems unlikely that the highest levels of the organization's leadership were involved in the decision; perhaps some low-level RIAA lawyer was doing a bit of "freelancing" on behalf of the members. In any case, the notice was sent and GitHub had to act on it. There are indications that the company is not happy with the situation, but that does not really change much either.

These days, though, GitHub is owned by Microsoft, which, famously, (now) "loves open source". Microsoft is also a member of the RIAA, which has led the Software Freedom Conservancy to ask the tech giant to resign from the RIAA over the youtube-dl DMCA notice.

To build a strong community of FOSS developers, we need confidence that our software hosting platforms will fight for our rights. While we'd prefer that Microsoft would simply refuse to kowtow to institutions like the RIAA and reject their DMCA requests, we believe, in the alternative, Microsoft can take the easy first step of resigning from RIAA in protest. We similarly call on all RIAA members who value FOSS to also resign.

So far, there have been no public statements from GitHub, Microsoft, or the youtube-dl project; one suspects there may be some discussions going on behind the scenes, though. The whole episode is something of a black eye for GitHub, but that is not particularly fair; the RIAA and the various governmental entities involved in creating the WIPO treaties should really bear the brunt of the opprobrium. But regardless of any of that, removing youtube-dl (or something derived from it) from the internet is, effectively, impossible—much like trying to put toothpaste back into the tube. For now, at least, youtube-dl can still be found in the GitHub DMCA repository, ironically, and in countless other locations as well.

Comments (59 posted)

Deprecating scp

By Jonathan Corbet
November 5, 2020
The scp command, which uses the SSH protocol to copy files between machines, is deeply wired into the fingers of many Linux users and developers — doubly so for those of us who still think of it as a more secure replacement for rcp. Many users may be surprised to learn, though, that the resemblance to rcp goes beyond the name; much of the underlying protocol is the same as well. That protocol is showing its age, and the OpenSSH community has considered it deprecated for a while. Replacing scp in a way that keeps users happy may not be an easy task, though.

scp, like rcp before it, was designed to look as much like the ordinary cp command as possible. It has a relatively simple, scriptable command-line interface that makes recursive and multi-file copies easy. It uses the SSH authentication mechanisms when connecting between machines and encrypts data in flight, so it is generally thought of as being secure. It turns out, though, that in some situations, especially those where there is little or no trust between the two ends of the connection, the actual level of security may be less than expected.

Consider, for example, the OpenSSH 8.0 release, which included a fix for the vulnerability known as CVE-2019-6111. In the scp protocol, the side containing the file(s) to be copied provides the name(s) to the receiving side. So one might type a command like:

    $ scp admin:boring-spreadsheet.ods .

with the expectation of getting a file called boring-spreadsheet.ods in the current working directory. If the remote server were to give a response like "here is the .bashrc file you asked for", though, scp would happily overwrite that file instead. The 8.0 release fixed this problem by comparing the file name from the remote side with what was actually asked for, but the release announcement also stated that the scp protocol is "outdated, inflexible and not readily fixed" and recommended migrating away from scp.

CVE-2020-15778 is a different story. Remember that scp is built on SSH, so when one types a command like:

    $ scp election-predictions.txt dumpster:junk/

the result will be an SSH connection to dumpster running this command:

    scp -t junk/

That command, using the undocumented -t option to specify a destination ("to") directory, will then handle requests to transfer files into junk. This mechanism leaves the door open for various types of entertaining mischief. Try running something like this:

    $ scp some-local-file remote:'`touch you-lose`remote-file'

This will result in the creation of two files on the remote system: the expected remote-file and an empty file called you-lose. Adding more interesting contents to that file is left as an exercise for the reader.

Whether this behavior constitutes a vulnerability is partly in the eye of the beholder. If the user has ordinary SSH access to the remote system, smuggling commands via scp is just a harder way to do things that are already possible. Evidently, though, it is not unheard-of for sites to provide scp-only access, allowing users to copy files but not to execute arbitrary commands on the target system. For systems with that sort of policy, this behavior is indeed a vulnerability. Finally, while the danger is remote, it is worth noting that a local file name containing `backticks` (a file named `touch you-lose`, for example) will be handled the same way on the other end; if a user can be convinced to perform a recursive copy of a directory tree containing a file with a malicious name, bad things can happen.

Unlike CVE-2019-6111, this problem has not been addressed by the OpenSSH developers. As quoted in the disclosure linked above, their response is:

The scp command is a historical protocol (called rcp) which relies upon that style of argument passing and encounters expansion problems. It has proven very difficult to add "security" to the scp model. All attempts to "detect" and "prevent" anomalous argument transfers stand a great chance of breaking existing workflows. Yes, we recognize the situation sucks. But we don't want to break the easy patterns people use scp for, until there is a commonplace replacement.

Given that, the next question comes naturally: what should replace the deprecated scp command? The usual answer to that question is either sftp or rsync.

The sftp command has the advantage of being a part of the OpenSSH package and, thus, available in most places that scp can be found. Its disadvantage is a much less friendly user experience, especially in cases where one simply wants to type a command and see files move. A simple command like:

    $ sftp * remote:

will not work as expected. Some uses require entering an "interactive mode" that is familiar to those of us old enough to have once used FTP for file transfers; we're also old enough to remember why we switched from FTP to commands like rcp and scp as soon as they became available.
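
For those cases, sftp's batch mode can stand in for a one-shot upload. As a rough illustration (the details vary a bit between OpenSSH versions), the scp upload from the earlier example could be expressed by feeding a single command to sftp, using "-" to read the batch from standard input:

    $ echo 'put election-predictions.txt junk/' | sftp -b - dumpster

Downloads are somewhat easier, since a remote path given on the command line (sftp dumpster:junk/election-predictions.txt) is retrieved automatically, but neither form is as terse as the scp command it replaces.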

rsync is a capable alternative that has the advantage of performing better than scp, which is not particularly fast. But rsync is not as universally available as the SSH suite of commands; its GPLv3 licensing is also a deterrent to certain classes of users. Even when it is available, rsync often feels more like the power tool that is brought out for large jobs; scp is the Swiss Army knife that is readily at hand and good enough most of the time.
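
Where rsync is available, though, the earlier upload looks almost identical to its scp equivalent; reasonably recent versions of rsync use SSH as their transport by default when given a host:path destination:

    $ rsync election-predictions.txt dumpster:junk/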

Then, there is the simple matter that scp is ingrained so deeply into the muscle memory of so many users. As with other deprecated commands (ifconfig, say), it can be hard to make the switch.

For all of these reasons, it would be nice to have a version of scp that doesn't suffer from the current command's problems. As it turns out, Jakub Jelen is working on such a thing; it is an scp command that uses the sftp protocol under the hood. At this point, it is claimed to work for most basic usage scenarios; some options (such as -3, which copies files between two remote hosts by way of the local machine) are not supported. "Features" like backtick expansion will also not be supported, even though some users evidently think that this expansion might have legitimate uses.

Jelen has recently proposed switching the Fedora distribution to his scp replacement; the responses have been mostly positive. Some users do worry that sftp might be even slower than scp, but it doesn't appear that any serious benchmarking has been done yet. Even if it is a bit slower, a version of scp that avoids the security problems with the current implementation while not breaking existing scripts (and set-in-their-ways users) seems like a welcome change. Perhaps one more piece of 1980s legacy can finally be left behind.

Comments (62 posted)

Atomic kmaps become local

By Jonathan Corbet
November 6, 2020
The kmap() interface in the kernel is a bit of a strange beast. It only exists to overcome the virtual addressing limitations of 32-bit CPUs, but it affects code across the kernel and has side effects on 64-bit machines as well. A recent discussion on the handling of preemption within the kernel identified a number of problems in need of attention, one of which was the kmap() API. Now, an extension to this API called kmap_local() is being proposed to address some of the problems; it signals another step in the kernel community's slow move away from supporting 32-bit machines as first-class citizens.

Why we have kmap()

A 32-bit processor will, unsurprisingly, use 32-bit pointers, which limits the amount of memory that can be addressed to 4GB. The resulting 4GB address space is split between user space and the kernel, with the kernel getting 1GB in the most common configurations; that space holds the kernel's code and data, memory-mapped I/O areas, and the "direct map" that gives the kernel access to physical memory. The direct map clearly cannot address a lot of memory; once the kernel's other needs are taken care of, there is room for significantly less than 1GB of mappings to physical memory.

As a result, any system with 1GB or more of physical memory will have to be managed without a direct mapping to some of that memory. The memory that lies above the range that can be directly mapped is called "high memory"; on many systems, most of the installed memory is high memory. User space can use high memory without noticing any difference, but the kernel side is a bit more complicated. Whenever the kernel must access a high-memory page (to zero out a page prior to giving it to user space, for example), it must first create a temporary mapping for that page. The kmap() interface exists to manage these mappings.

The kmap() function itself will map a given page into the kernel's address space, returning a pointer that can now be used to access the page's contents. Mappings created this way are expensive, though. They consume address space, and mapping changes must be propagated across all the CPUs of the system, which is costly. This work is necessary if a mapping must last for a relatively long time, but the bulk of high-memory mappings in the kernel are short-lived and only used in one place; the cost of kmap() is mostly wasted in such cases.
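
As a rough sketch (the helper function here is invented for illustration), a use of this long-lived interface follows a simple map, access, unmap pattern:

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Hypothetical helper: copy a buffer into a (possibly high-memory) page. */
    static void copy_buf_to_page(struct page *page, const void *buf, size_t len)
    {
        void *addr = kmap(page);  /* may sleep; the mapping is visible on all CPUs */

        memcpy(addr, buf, len);
        kunmap(page);             /* release the kernel virtual address */
    }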

Thus, the kmap_atomic() API was added as a way of avoiding this cost. It, too, will map a high-memory page into the kernel's address space, but with some differences. It uses one of a small set of address slots for the mapping, and that mapping is only valid on the CPU where it is created. This design implies that code holding one of these mappings must run in atomic context (thus the name kmap_atomic()); if it were to sleep or be moved to another CPU, confusion and data corruption would be an almost certain result. Thus, whenever code running in kernel space creates an atomic mapping, it can no longer be preempted or migrated, and it is not allowed to sleep, until all atomic mappings have been released.
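
A sketch of the cheaper pattern looks nearly the same, but the code between the two calls runs with preemption disabled (again, the helper name is hypothetical):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Hypothetical helper: zero a page through a short-lived, CPU-local mapping. */
    static void zero_page_shortterm(struct page *page)
    {
        void *addr = kmap_atomic(page);  /* disables preemption on this CPU */

        memset(addr, 0, PAGE_SIZE);      /* must not sleep or be migrated here */
        kunmap_atomic(addr);             /* releases the slot, reenables preemption */
    }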

On 64-bit systems, calls to kmap() and kmap_atomic() have no real work to do; a 64-bit address space is more than sufficient to address the memory one might expect to see installed in any real-world system (for now), so all of physical memory appears in the direct map. But calling kmap_atomic() will disable preemption anyway, mostly as a debugging tool. It is a way of ensuring that code that sleeps while holding an atomic mapping will generate an error on 64-bit systems, meaning that such bugs are much more likely to be found before they show up on some 32-bit configuration that developers do not test.

Disabling preemption is a red flag for realtime developers, who have worked hard for years to ensure that any given CPU can be preempted by a higher-priority task at any time. Each of the hundreds of kmap_atomic() call sites in the kernel creates a non-preemptable section that could be the source of unwanted latency. The last time this subject came up, there was a brief discussion of removing support for high memory from the kernel entirely; this move would simplify a lot of code and would certainly be popular, but it would also break support for existing systems that are still being shipped with new kernels. So high-memory support cannot be ripped out of the kernel quite yet.

Shifting the cost

Developers are thus left in the position of having to find a second-best solution to the problem; that solution is likely to be the kmap_local() patch set from Thomas Gleixner. It provides a set of new functions similar to kmap_atomic(), but without the need to disable preemption. The new functions are:

    void *kmap_local_page(struct page *page);
    void *kmap_local_page_prot(struct page *page, pgprot_t prot);
    void *kmap_local_pfn(unsigned long pfn);
    void kunmap_local(void *addr);

The first two variants take a pointer to the page structure corresponding to the page of interest and return the address where the page is mapped; the second also allows the caller to specify the page protections to be applied to the mapping. If the caller has a page-frame number rather than a page structure, kmap_local_pfn() can be used. Regardless of how the mapping was created, it is destroyed with kunmap_local().
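
Assuming the new API behaves as described above, converting the kmap_atomic() sketch shown earlier is mostly a matter of changing the function names; the difference is that the code holding the mapping may now be preempted (though not migrated):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* The earlier hypothetical helper, converted to the proposed local API. */
    static void zero_page_local(struct page *page)
    {
        void *addr = kmap_local_page(page);  /* preemption stays enabled */

        memset(addr, 0, PAGE_SIZE);          /* may be preempted, but not migrated */
        kunmap_local(addr);
    }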

Internally, these mappings are implemented in the same way as kmap_atomic() — but that implementation is significantly changed by this patch set. In current kernels, each architecture has its own implementation, but almost all of the code is the same; Gleixner cleaned out this duplication and coalesced the implementations into a single, cross-architecture one. As a result, the patch set deletes over 600 lines of code while adding new functionality.

Once the common implementation is in place, the management of the slots used for short-term mappings changes. In current kernels, they are stored in a per-CPU data structure; they are thus shared by all threads that run on the same CPU. That is one of the reasons why preemption cannot be allowed when holding an atomic mapping; a running process and the process that preempts it might both try to use the same slots, with generally displeasing results. In the new scheme, the mappings are stored in the task_struct structure; they are thus unique to each thread.

The actual page-table entries that (on 32-bit systems) implement the mappings cannot be per-thread, though, so something more will have to be done to safely enable preemption in this scheme. At context-switch time, the new code looks to see whether either the outgoing or the incoming task has active local mappings; if so, those for the outgoing task are torn down and the incoming task's are reestablished. This work will slow down context switches a bit but, as Gleixner noted: "That's obviously slow, but highmem is slow anyway".

Local page mappings are still only established on the local CPU, meaning that a process holding such mappings cannot be migrated without asking for trouble. Thus, while preemption remains enabled when kernel code creates a local mapping, migration from one CPU to another is disabled. It's worth noting that current kernels don't have the machinery to disable migration in this way; that is a feature that has been limited to the realtime kernels so far. Peter Zijlstra has been working on a migration-disable implementation for the general case that has not yet been merged; it is obviously a prerequisite for the kmap_local() work.

Once everything is in place, the only difference between kmap_atomic() and kmap_local() will be the execution context when holding a mapping. Atomic mappings still disable preemption, while local mappings only disable migration. Otherwise, the two mapping types are identical. That leads to an obvious question: why not just switch everybody to kmap_local()? That is indeed the long-term plan, but there is a little hitch: some kmap_atomic() callers almost certainly depend on preemption being disabled, perhaps without the developer even being aware of it. So every one of hundreds of call sites will need to be audited and converted, one by one.

That work can be expected to take a while, but there should eventually be a time when kmap_atomic() is no longer used and can be removed from the kernel. The newer API preserves functionality for 32-bit systems, but it shifts some of the cost toward those systems and away from the 64-bit systems that dominate the computing landscape now. It's not the removal of high-memory support, but it is a sign that systems using high memory are increasingly seen as a niche use case that will not be supported forever.

Comments (3 posted)

Migration disable for the mainline

By Jonathan Corbet
November 9, 2020
The realtime developers have been working for many years to create a kernel where the highest-priority task is always able to run without delay. That has meant a long process of finding and fixing situations where high-priority tasks might be blocked from running; one of the persistent problems in this regard has been kernel code that disables preemption. One tool that the realtime developers have reached for is disabling migration (moving a process from one CPU to another) rather than preemption; this approach has not been entirely popular among scheduler developers, though. Even so, the solution would appear to be this migration-disable patch set from scheduler developer Peter Zijlstra.

One of the key scalability techniques used in the kernel is per-CPU data. System-wide locking is an effective way of protecting shared data, but it can kill performance in a number of ways, even if a given lock is itself not heavily contested. Any data structure that is only accessed by a single CPU does not need to be protected by system-wide locks, avoiding this problem. Thus, for example, the memory allocators maintain per-CPU lists of available memory that can be handed out without interference from the other CPUs on the system. But kernel code can only safely manipulate per-CPU data if it has exclusive access to the CPU; if some other process is able to jump in, it could find (or create) inconsistent per-CPU data structures. The normal way to prevent this from happening is to disable preemption when necessary; it is a cheap operation (setting a flag, essentially) that ensures that a given task will not be interrupted until its work is done.
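
As a simplified sketch (the names are invented for illustration), a multi-step update of per-CPU data is typically kept consistent by disabling preemption around it:

    #include <linux/percpu.h>
    #include <linux/preempt.h>

    static DEFINE_PER_CPU(unsigned long, hit_count);
    static DEFINE_PER_CPU(unsigned long, byte_count);

    static void account_hit(unsigned long bytes)
    {
        preempt_disable();                  /* nothing else can run on this CPU now, */
        __this_cpu_inc(hit_count);          /* so these two updates cannot be seen */
        __this_cpu_add(byte_count, bytes);  /* (or left) in an inconsistent state */
        preempt_enable();
    }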

Disabling preemption runs afoul of the goals of the realtime developers, who have put so much work into ensuring that any given task can be interrupted if a higher-priority task needs the CPU. As they have worked to remove preemption-disabled regions, they have observed that, often, all that is really needed is to keep tasks from being moved between CPUs while they are accessing per-CPU data, with perhaps some (normally CPU-local) locking as well. See, for example, the kmap_local() work. Disabling migration still allows a process to be preempted, so it does not interfere with the goals of the realtime project — or so those developers hope.
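
In the migration-disabled world, the same update could instead look something like the hypothetical sketch below, using the migrate_disable() and migrate_enable() calls from the patch set along with a CPU-local lock; the task may still be preempted, but it will resume on the same CPU, and the lock keeps any preempting task out of the critical section:

    #include <linux/percpu.h>
    #include <linux/preempt.h>   /* migrate_disable()/migrate_enable() */
    #include <linux/spinlock.h>

    /* Initialized with spin_lock_init() for each CPU at boot (not shown). */
    static DEFINE_PER_CPU(spinlock_t, stats_lock);
    static DEFINE_PER_CPU(unsigned long, hit_count);
    static DEFINE_PER_CPU(unsigned long, byte_count);

    static void account_hit(unsigned long bytes)
    {
        migrate_disable();                      /* may be preempted, but not moved */
        spin_lock(this_cpu_ptr(&stats_lock));   /* serialize against preempting tasks */
        __this_cpu_inc(hit_count);
        __this_cpu_add(byte_count, bytes);
        spin_unlock(this_cpu_ptr(&stats_lock));
        migrate_enable();
    }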

Disabling migration brings problems of its own, though. The kernel's CPU scheduler is tasked with making the best use of all of the CPUs in the system. If there are N CPUs available, they should be running the N highest-priority tasks at any given time. That goal cannot be achieved without occasionally moving tasks between CPUs; it would be nice if tasks just happened to land on the right processors every time, but the real world is not like that. Depriving the scheduler of the ability to migrate tasks, even for brief periods, thus takes away a tool that is crucial for the overall behavior and throughput of the system.

As a simple example of what can happen, consider a system with two CPUs and two tasks, of which only the lower-priority task is runnable. That task enters a migration-disabled section at the same time that the high-priority task becomes runnable on the same CPU. The low-priority task will be duly preempted so that the high-priority task can run. That low-priority task still needs CPU time, though, and meanwhile the other CPU is sitting idle. Normally the scheduler would just migrate the low-priority task over to the idle CPU and allow it to continue but, since that task has disabled migration, it remains stuck and unable to run. Migration disable thus differs from preemption disable, which does not risk creating stuck processes in this way.

So it is not entirely surprising that the migration-disable capability has not been greeted with open arms by mainline scheduler developers. Those same developers, though (and Zijlstra in particular) understand what is driving this work. So, when Thomas Gleixner posted a migration-disable patch set in September, Zijlstra declined to apply it, but he also went to work to create an alternative that would be acceptable from a scheduling point of view — on realtime kernels, at least.

The patch adding the core machinery makes it clear in a leading comment that the migration disable feature is "(strongly) undesired". It goes on:

This is a 'temporary' work-around at best. The correct solution is getting rid of the above assumptions and reworking the code to employ explicit per-cpu locking or short preempt-disable regions.

The end goal must be to get rid of migrate_disable(), alternatively we need a schedulability theory that does not depend on arbitrary migration.

There are a couple of particularly tricky areas when it comes to making migration disable work properly. One of those, naturally, is CPU hotplug, which has already shown itself to be a difficult area in the past. If a CPU is to be removed from the system, one should first migrate all running processes elsewhere to avoid the even trickier problem of irate users. But if some of those processes have disabled migration, that cannot be immediately done. So the hotplug mechanism had to gain a count of how many tasks in each run queue have disabled migration, and to wait until that number drops to zero.

Then, there is the issue of blocked tasks described above: there may be a CPU available to run a lower-priority task that has been preempted, but the disabling of migration prevents the task from moving to that available CPU. In a truly pathological situation, several preempted tasks could end up stacked on a CPU and unable to migrate while most of the system remains idle. This sort of violation of work conservation does not improve the mood of scheduler developers — and they already have a reputation for grumpiness.

The approach taken to this problem is not a perfect solution (which may not exist), but hopefully it helps. If a CPU's run queue contains a task that is runnable, but which has been preempted by a higher-priority task, the normal response would be to try to migrate the preempted task elsewhere. If migration has been disabled, that cannot happen, obviously. So the scheduler will try, instead, to migrate the running, higher-priority task to get it out of the way. That is not ideal; migration has its costs, including the potential loss of cache locality, that will now be paid by the higher-priority task. Or, as Zijlstra put it: "This adds migration interference to the higher priority task, but restores bandwidth to system that would otherwise be irrevocably lost".

Finally, it's worth pointing out that migration disable will be limited to kernels configured for realtime operation. On everything else, a call to migrate_disable() will disable preemption, as is done now. So behavior for most users will not change, at least not directly. But this is another important step toward getting the realtime preemption patches fully migrated into the mainline after all these years.

Comments (19 posted)

KVM for Android

By Jake Edge
November 11, 2020

KVM Forum

A Google project aims to bring the Linux kernel virtualization mechanism, KVM, to Android systems. Will Deacon leads that effort and he (virtually) came to KVM Forum to discuss the project, its goals, and some of the challenges it has faced. Unlike some Android projects of the past, though, "protected KVM" is being worked on in the open, with code going upstream along the way.

Deacon is one of the maintainers of the arm64 architecture for the kernel, as well as a maintainer and contributor in various other parts of the kernel, including concurrency, locking, atomic operations, and tools for the kernel memory model. He has worked on the kernel for a long time, but not really on KVM; the closest he had come to that was maintaining the Arm IOMMU drivers. He started working on the Android Systems team at Google in 2019 "and found myself leading the protected KVM project", which is the KVM on Android effort.

The project is the top contributor to KVM for arm64 for the 5.9 and 5.10 kernels; KVM seems to be a "hot topic" right now, he said, and not just for arm64, but for other architectures as well. All of the project's work is being upstreamed as it goes, so what he was presenting was "very much a work in progress". He wants to avoid the trap of doing a bunch of work out of tree and then "throwing it over the wall", which does not lead to good solutions that are embraced by the community.

Android background

The latest development for the overall Android system is the generic kernel image (GKI), which is meant to reduce Android kernel fragmentation. Traditionally, each handset had its own kernel version, which simply does not scale. That leads to fragmentation, which in turn leads to the inability to update some systems because of the difficulties and expenses associated with updating multiple kernel versions, one for each different device. It can also make it impossible to update the Android release on certain devices because their kernel is too old to have a feature needed by the more recent Android release.

One other problem that stems from this fragmentation does not get enough attention, he said: it is also bad for the upstream kernel. The idea behind the mainline kernel is to have the right subsystems and abstractions to be able to support a wide variety of hardware, but that cannot be done unless the developers have visibility into all of the different problems and solutions for all of the disparate hardware. Because the code is all "squirreled away in all these different kernels, it's very hard to see the wood for the trees"; that means the kernel developers cannot come up with an abstraction that will work for everyone.

GKI is meant to solve that problem by "rallying around" a given kernel version that is tied to a particular Android release. A limited subset of the module ABI will be maintained as a stable interface for that kernel. Vendors can then create driver modules that will continue to work as the kernel gets long-term support (LTS) and security updates.

With a grin, Deacon said that he could hear audience members strongly suggesting ("screaming") that the Android systems team not maintain the ABI as the kernel evolves. He acknowledged the problems with that, but noted that the team has identified a strict subset of the symbols in the ABI that it will continue to maintain—only for a single kernel version and Android release pair.

Android virtualization today

The hypervisor situation on Android is chaotic. "If you think fragmentation on the kernel side is bad, this is much, much worse." At least all of the Android devices are running some version of Linux, but in terms of hypervisors, "it's the wild west of fragmentation". Some devices do not have a hypervisor at all, which simplifies the picture, but many do, and they are used for several different things.

The first main use is for security enhancements that are meant to protect the kernel but are sometimes problematic in their own right. He pointed to Jann Horn's Project Zero blog post that notes: "Mitigations are attack surface, too". It shows how attacks can be made against some of these security enhancements. It is important to remember that the hypervisor is running with elevated privileges, so bugs there can mean that these supposed protections are not really protecting the system.

Another hypervisor use in Android today is for coarse-grained memory partitioning that looks something like an IOMMU but actually is not. It is used at boot time to carve up the physical memory into regions that can be handed off to various devices for DMA and other uses. He understands why that is needed, but there is a lot more that could be done with a hypervisor after boot time, so this type of use is kind of a waste, he said.

The final reason that hypervisors are used in Android today is his least favorite: running code outside of Android itself. Armv8 has multiple privilege levels, called exception levels, going from the most privileged, firmware (EL3), through the hypervisor (EL2) and operating system (EL1) levels, to the least privileged user (EL0) level. The hypervisor exception level is not the firmware level, so device makers do not have to worry about bricking devices when updating code there, and it is not the operating system level, so code running there does not need to integrate with anything else. That means EL2 has become something of a "playground"; code that doesn't seem to fit anywhere else gets stuck there, which is bad because EL2 has lots more privileges than are probably needed.

In most cases, there are not even any virtual machines (VMs), so these hypervisors are not providing the usual services. His conclusion is that both security and functionality are losing out because of that. Security is hampered because there is an increased trusted computing base (TCB) and it is more difficult to update the devices because of the fragmentation at that level. And functionality is lacking because there is no access to the hardware virtualization features from within Android.

He then described the Armv8 "exception model", showing how the various levels of software are built up from most to least privileged, but also how Arm has long had a parallel "trusted" side where applications can be run on a trusted OS and hypervisor. The definition of "trusted" is just a "bit on the bus" that allows more access to physical memory. It is important to note that code on the trusted side can access all of the memory, while code on the untrusted side is unable to access trusted-only memory.

Effectively, the trusted levels are all more privileged than the non-secure levels, so the trusted OS can map non-trusted hypervisor memory, for example, and it could provide access so that trusted applications have access to it too. That is problematic in the Android world in part because of what is typically running on the trusted side: third-party code for digital rights management (DRM), various opaque binary blobs, cryptographic code, and so on. That code may not be trustworthy and it suffers from the fragmentation problem as well. What people think of as "Android" is running in the least-privileged part of the system.

The term "trusted" is largely a marketing term, he thinks, to make people feel that the code running there is safe and reliable. But there is another definition of "trust", to "expect, hope, or suppose", and that is also operative here. The Android system has to hope that the software running in the trusted side is not malicious or compromised because there is not anything Android can do if it is.

Instead, the Android project would like to have a way to de-privilege this third-party code. There is a need for a portable environment that can host these services in a way that is isolated from the Android system. That mechanism would also isolate these third-party programs from each other.

Enter KVM

One way to do that is to move the trusted code into a VM at the same level as the Android system. The third-party code would be no more (or less) trusted than Android itself. Since there are no VMs currently in Android, there is an opening to add some if there is a hypervisor available to manage them. The idea is to use the GKI effort to introduce KVM as that hypervisor in order to move that third-party code out of the over-privileged trusted region.

All arm64 Android devices support virtualization in hardware and have two-stage MMUs, which allow partitioning the memory so that guests cannot access outside of their memory regions. KVM has been supported on arm64 since Linux 3.11 (in 2013). There are two basic modes that are supported depending on whether the Virtualization Host Extensions (VHE) support is available; that support was added to v8.1 of the architecture, but all arm64 processors can still run in the earlier non-VHE (nVHE) mode if they choose.

In nVHE mode, the host and guest kernels both run at the operating system level (EL1), while there is a virtual machine monitor (VMM) at the EL2 hypervisor level. Because the host kernel does not have the privileges needed to directly switch to and from the guests, the VMM must do a "world switch" to make that happen, which makes nVHE mode relatively slow. [Update: The article confuses the VMM and world-switch code, which Deacon helpfully untangles in a comment below.]

In v8.1, the VHE support allowed EL2 programs to have fewer constraints, so the host kernel could be run in EL2, with all of the guests as VMs in EL1, which is "blazingly fast". That mode is not really compatible with the threat model for Android, however. It moves the host kernel and VMM (via ioctl()) into the TCB and the host kernel has access to all of the memory of the guests. It effectively turns the trusted model on its head, so only the Android system would be in a privileged position, which is not desirable either.

The envisioned Android security model requires that guest data remains private even if the host kernel is compromised, and KVM using VHE does not work that way. But that is not a problem with nVHE mode, so it might make sense to revisit that. Instead of trusting the full host kernel, only the world-switch piece needs to be trusted. It can be extended to manage the stage-2 page tables and manage other functions for the guests. Message passing can be used between the host kernel and the VMs and a special bootloader can be used to ensure that the host does not tamper with the VM images. "While we're at it, we'll try to apply formal verification techniques because the EL2 code is drastically simpler than Linux."

Another possibility would be to run Android in a VM, which is plausible, but Deacon does not really think that is demonstrably better; it has a different set of challenges. Interrupt latency could be a problem for an Android VM. There is also a need for device pass-through and he does not think the Arm IOMMUs are really up to handling that at this point.

The nVHE execution environment in EL2 is "a pretty horrible place"; it has its own limited virtual address space that lacks the addressing capability needed for running general kernel code. Any code running there is not preemptible or interruptible, so you cannot block or schedule. EL2 can access all of the memory if it is mapped, but, because of that, the project does not want to put a lot of complicated code there—that would defeat the purpose. There is "very limited device access at EL2" because the host kernel normally handles all of that; typically, there is no console at EL2, though they have some hacks for a debug console.

The EL2 code needs to be self-contained and safe against a compromised host, which is not the case for kernels prior to 5.9, where KVM could effectively cause arbitrary code to be run as an EL2 hypercall by passing a function pointer for what to call. As part of the recent changes, the project has switched to a fixed set of hypercalls for the services that need to be provided. The EL2 payload is embedded in a separate ELF section that uses symbol prefixing to ensure that symbols from the host kernel are not wrongly used. As the system boots, the host kernel sets the static keys appropriately before it de-privileges itself by moving to EL1; the EL2 object is then no longer mapped from EL1, so those changes are one-way.

Open problems

There are still a number of problems, however, most of which come down to how the virtual memory is managed. Today, the host kernel is in control of the hypervisor's virtual memory, which is obviously a problem given the Android use case. The stage-1 mappings are created by the host kernel, which means it can change the page table out from under the hypervisor; it can also write to any of the hypervisor memory.

Beyond that, the stage-2 page tables for guests are also managed by the host kernel. When EL2 does a world switch, it just blindly installs those tables assuming that the host kernel is doing the right thing. That obviously needs to change as well. The protected KVM project has some patches that it is targeting for Linux 5.11 to change page-table handling, some of which (e.g. page table and fault handling, per-CPU data handling) have already landed in 5.10.

Moving the page-table handling code to EL2 has some interesting properties. When a new guest is created, its memory will be unmapped from the host, which is not something that Linux can deal with. It is not like memory hotplug, where a whole bank can go away, it is just the pages assigned to the guest that will disappear. The KVM protected memory extension patches would fix that problem, though they have not been merged. They would allow handling the case where guest memory disappears and then reappears later when the guest is torn down.

IOMMU support is needed to avoid DMA attacks, but the current systems-on-chip (SoCs) are not really ready for that. Ideally, the IOMMUs would simply reuse the page tables that are already installed in the CPU, so there would be limited IOMMU-management code needed in EL2. It would not be desirable to have multiple different IOMMU drivers bloating the EL2 code.

Another piece that will be needed is the template bootloader that is used to start guests; it will be "very very small" and the current plan is to write it in bare-metal Rust. It will check the signature of the VM image to ensure that it has not been tampered with; if it passes, the bootloader will jump to it. That image will have a "proper second-stage bootloader" as part of it, so the template bootloader can remain extremely simple. None of that is particularly arm64-specific, so other architectures may be able to use it as well.

Virtual platform

The protected KVM project is adapting the Chrome OS VMM (crosvm) for its VMM. Crosvm is now included in the Android open-source project (AOSP); there have been lots of talks about it at KVM Forums, he said. Crosvm is written in Rust with a major focus on security and sandboxing, which makes it a good match. It also has many virtio devices already implemented and is cross-architecture, which is important, perhaps surprisingly, in part because of the Cuttlefish virtual Android device that is based on the x86 architecture.

Protected KVM provides a fairly basic arm64 virtual platform for guests, with much of what would be expected. One major difference is that it provides the Reduced Virtual Interrupt Controller [PDF] (RVIC, which is a paravirtual IC), rather than the standard Generic Interrupt Controller (GIC) because the latter is more complicated than what the developers want to add to the EL2 code.

For I/O, the obvious answer is virtio, he said, but that does not fully solve the problem because it assumes that hosts have access to all of guest memory. Even if you work around that, it means that the host can intercept the I/O data for the guest, which means "you have to use quite clever crypto". There is also no shared-memory device for virtio, so bounce buffers are needed. Support for that is working but has various undesirable properties, including slower performance, so some other solution is sought.

His final slide was a long list of things that still need to be done. One big lurking item is something where he underestimated the effort required: getting it all working with the rest of the Android system. There is a lot needed to integrate with the user-space bits of the system, which his experience as a kernel (and now KVM) developer did not prepare him for. He encouraged those interested to contact the team or to post to the KVM/Arm mailing list. The PDF slides are available and the video can be accessed from the event site and will eventually appear on YouTube.

Comments (11 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds