
Leading items

Welcome to the LWN.net Weekly Edition for May 2, 2019

This edition contains the following feature content (including the beginning of our coverage from the Linux Storage, Filesystem, and Memory-Management Summit):

  • The state of system observability with BPF: Brendan Gregg on the growing power of BPF-based tracing tools.
  • Containers and address space separation: using address spaces to reduce the kernel's attack surface on multi-tenant systems.
  • ClearlyDefined: Putting license information in one place: a project to crowdsource curated licensing data.
  • Some 5.1 development statistics: where the code merged for 5.1 came from.
  • Bounce buffers for untrusted devices: defending against hostile hotpluggable hardware.
  • Toward a reverse splice(): a nascent idea for a zero-copy read operation.
  • Memory encryption issues: where does the extra security actually come from?
  • Android memory management: keeping interactive processes responsive on limited hardware.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The state of system observability with BPF

By Jonathan Corbet
May 1, 2019

LSFMM
The 2019 version of the Linux Storage, Filesystem, and Memory-Management Summit opened with a plenary talk by Brendan Gregg on observing the state of Linux systems using BPF. It is, he said, an exciting time; the BPF-based "superpowers" being added to the kernel are growing in capability and maturity. It is now possible to ask many questions about what is happening in a production Linux system without the need for kernel modifications or even basic debugging information.

Gregg started with a demonstration tool that he had just written: its immediate manifestation was the creation of a high-pitched tone that varied in frequency as he walked around the lectern. It was, it turns out, a BPF-based tool that extracts the signal strength of the laptop's WiFi connection from the kernel and creates a noise in response. As he interfered with that signal with his body, the strength (and thus the pitch of the tone) varied. By tethering the laptop to his phone, he used the tool to measure how close he was to the laptop. It may not be the most practical tool, but it did demonstrate how BPF can be used to do unexpected things.

Gregg works at Netflix, a company that typically operates about 150,000 server instances. Naturally, Netflix cares a lot about performance; that leads to a desire for observability tools that can help to pinpoint the source of performance problems. But the value of good tools goes beyond just performance tuning.

Better bug reports

For example, the company is currently trying to move its server images forward to the 4.15 kernel, but a snag has come up. There is a log-rotation service that waits for a specific process to exit by repeatedly running ps. It seems like a simple task, but it started failing on the newer kernel; ps would fail to report the process when, in fact, that process was still running. It was, Gregg said, time to investigate.

He began with the bpftrace tool which, he said, is soon to be packaged by a number of distributors. It could be interesting, he thought, to see which system calls that worked under the old kernel were now failing under the new one; that can be done with a command like:

    bpftrace -e 't:syscalls:sys_exit_* /comm == "ps"/ \
        { @[probe, args->ret > 0 ? 0 : -args->ret] = count(); }'

This has to attach to all 316 system calls, so it takes a while to get going; there is talk of a "multi-attach" functionality to speed things up in the future. In the meantime, one can get around that problem by changing the command to attach to the raw system-call tracepoint instead:

    bpftrace -e 't:raw_syscalls:sys_exit /comm == "ps"/ \
        { @[args->id, args->ret > 0 ? 0 : -args->ret] = count(); }'

Gregg posted a lot of code in this session, most of which has not been reproduced here. See his slides for the full details.
That works more quickly, at the cost of reporting system-call numbers rather than names. This command did, indeed, turn up one extra failure that was happening under the new kernel, but it turned out to have nothing to do with the problem — a dead end. So it was time to shift tactics: perhaps some system call was "successfully failing". A couple more bpftrace commands later, he had an answer: occasionally the getdents() system call will return a partial result on /proc, causing the entry for the target process to be left out.
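For illustration (these are not the exact commands from the talk), a one-liner in the same style could histogram the byte counts returned by getdents() when ps is the caller, making truncated /proc reads visible:

    bpftrace -e 't:syscalls:sys_exit_getdents /comm == "ps"/ { @bytes = hist(args->ret); }'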

So, perhaps, the problem lives in the /proc filesystem. Switching to an ftrace-based tool for a moment, he looked at which functions were being called in the failed runs. Then back to bpftrace to chase a few wrong leads before determining that find_ge_pid(), which is supposed to continue a read by finding the next highest process ID, was stopping early. That, in turn, appears to have been caused by a change in the implementation of the process-ID table. And at that point he had to stop to travel to the conference.

He thus didn't yet have a solution to the problem (though, since the guilty parties were in the room, a solution seems likely to come soon). But, he said, even at this point BPF has helped to significantly narrow down the problem. One of the biggest benefits to the current crop of BPF-based observability tools, he said, is that they enable the creation of much better bug reports.

One-liners

Stepping back briefly, Gregg talked about the architecture of the BPF system. At the lowest level is the BPF virtual machine itself. One can write raw BPF code for this machine, but it's "super hard" and nobody should do it. He asked the people in the audience who had done this to raise their hands and was surprised by how many hands went up; the room contained, he said, the entire world population of raw BPF experts. For those wanting a higher-level experience, though, an LLVM backend allows BPF programs to be written in a restricted version of C. Then, the BCC system allows those programs to be loaded and driven from C or Python code. And now, above that, is bpftrace, which allows the writing of useful one-line tools.

So, for example, he asked about the effectiveness of readahead in the page cache: is it actually accelerating I/O, or is it just polluting the cache? One need not wonder about such things; they can now be observed. So he wrote a simple tool in bpftrace to record timestamps for all pages brought in via readahead; it then compares those timestamps against the times when the pages are actually referenced. The result, for his workload, was that almost all pages were accessed within 32ms of being read into the cache; the number of pages that were never referenced was quite small. So readahead is indeed helping here; if anything, the application in question could benefit from even more readahead.
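A sketch of such a tool might look like the following (the kernel function names here are 4.x page-cache internals, not a stable interface, so this is illustrative rather than definitive):

    kprobe:__do_page_cache_readahead   { @in_readahead[tid] = 1; }
    kretprobe:__do_page_cache_readahead { @in_readahead[tid] = 0; }

    kretprobe:__page_cache_alloc /@in_readahead[tid]/
    {
        @birth[retval] = nsecs;    // retval is the newly allocated page
        @rapages++;
    }

    kprobe:mark_page_accessed /@birth[arg0]/
    {
        // time from readahead allocation to first real access
        @age_ms = hist((nsecs - @birth[arg0]) / 1000000);
        delete(@birth[arg0]);
        @rapages--;
    }

    END { printf("readahead pages never used: %d\n", @rapages); }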

One can ask similar questions about eviction from the page cache: by looking at the age of pages as they are pushed out, it is possible to evaluate whether the page cache is appropriately sized. If those pages have not been referenced in a long time, perhaps the cache is too large. This tool does not yet exist but, he suggested, perhaps the assembled group could put an implementation together during the conference and send it to him.
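Nobody has written that tool yet, but a rough bpftrace sketch of the idea might look like this (hypothetical probe points on 4.x internals; a real tool would need to bound the size of the @touched map):

    kprobe:mark_page_accessed { @touched[arg0] = nsecs; }

    kprobe:__delete_from_page_cache /@touched[arg0]/
    {
        // age of the evicted page, measured from its last recorded access
        @evicted_age_ms = hist((nsecs - @touched[arg0]) / 1000000);
        delete(@touched[arg0]);
    }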

The tool to answer the readahead question was not quite a one-liner, but it fit nicely onto a single slide. bpftrace, he said, is a great platform for the creation of short tools. In its simplest form, one need only provide a probe (a kernel tracepoint, for example), an optional filter expression, and an action to take when the probe is hit. The language overall, he said, looks a lot like awk or DTrace.
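A generic example of that structure (not from the talk): this one-liner counts block-layer requests larger than 64KB by process name, using a tracepoint as the probe and a size test as the filter:

    bpftrace -e 't:block:block_rq_issue /args->bytes > 65536/ { @[comm] = count(); }'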

Inside, it has a parser that builds an abstract syntax tree describing the task, which is then used to create the BPF program(s) to attach to the events of interest. The perf ring buffer is used to get high-bandwidth data out of the kernel but, to the greatest extent possible, data is boiled down in kernel space first and exported via BPF maps. Version 0.90 was released in March; the next release, called 0.10, is due soon (though that release number may be changed after some complaints were raised about the apparent backward number jump).

BPF everywhere

As Gregg works on a book about performance analysis with BPF, he is trying to fill in various "observability gaps" in the system. To that end, he brought back his old superping utility, which hooked into packet send and receive events in an attempt to obtain more accurate ping times by eliminating any scheduling latency. But it ended up reporting longer times than ordinary ping does — not the expected result. It turns out that this problem was solved in the networking stack years ago, with timestamps being stored in the packets themselves, so this tool proved unnecessary. The superping tool demonstrates, though, how BPF programs can work from kernel headers to walk through chains of structures.

A tool that is still useful is tcplife, which collects information about TCP connections to and from the local system. This kind of data is normally generated through packet capture, which is an expensive thing to do; tcplife, instead, just hooks into the networking stack to get events when connections begin and end. Initially it used a kprobe set on tcp_set_state(); that information proved useful enough that a proper tracepoint was added in 4.15. That was an improvement; tracepoints are more stable and less prone to breaking than kprobes — but only to a point.

In 4.16, that tracepoint was moved to inet_sock_set_state() and some more information was added to it; that broke all of Gregg's tools that were using the older tracepoint. And, he said, he is "fine with that". Even if a tracepoint changes occasionally, it's far better than using raw kprobes. He did have one request, though: tracepoints should include a direct pointer to the structure(s) in use at that location in the code. He realizes that any code using that pointer will be unstable, but it's still useful to have if there is a need for data that is not exported directly by the tracepoint.
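For example (an illustration rather than tcplife itself), a one-liner on the 4.16+ tracepoint can count sockets entering the established state by process, where 1 is the value of TCP_ESTABLISHED:

    bpftrace -e 't:sock:inet_sock_set_state /args->newstate == 1/ { @[comm] = count(); }'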

This case illustrates a more general point, he said: kprobes can serve as a useful way to prototype tracepoints. A tool built around a kprobe shows what the use case for the tracepoint would be and which data should be included; that helps to justify the tracepoint's addition and to get it right when that happens.

Gregg concluded with a "reality check": even though the various BPF tools provide all kinds of useful information, most performance wins still come from poring over flame graphs instead. But, naturally, BPF is moving into that area as well; it is now possible to create trace summaries in the kernel, reducing the overhead of collecting that sort of performance data. That, he said, shows the direction the kernel community should continue to take in the future: "BPF all the things".

Comments (6 posted)

Containers and address space separation

By Jake Edge
May 1, 2019

LSFMM

James Bottomley began his talk at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) by noting that the main opposition to his ideas was not present at the summit, which likely meant the ideas got a much easier reception than they would have otherwise. In particular, Peter Zijlstra and Ingo Molnar have expressed some strong reservations about the work that Bottomley's colleague Mike Rapoport posted recently; none of those three were in attendance at LSFMM. The idea is to use address spaces to reduce the attack surface available to virtual machines (VMs) and containers such that kernel bugs of various sorts have less reach on multi-tenant systems.

Bottomley has been working with Rapoport on the idea for the container use case, but there are others, from Google and Oracle, who are trying to solve the same problems for VMs. Address spaces are the oldest and most secure mechanism for keeping tenants separate from one another, he said. Separating processes into their own address spaces is what was used to support multi-user systems, so there is around 50 years of history there. Part of the reason to extend the idea for VMs and containers is that address spaces have proven to work well as a security measure.

[James Bottomley]

For a KVM system, the security of the guest OS depends on the fact that the hypercalls make up a pretty tiny footprint within the host kernel. That interface is all that the guest OS has access to in the host. Prior to Spectre and Meltdown, that boundary was enough for security, but now programs can use those flaws to observe what others are doing. So KVM developers would like to find ways to ensure that the guest cannot access anything else in the host kernel (even speculatively) by unmapping its code and by not providing access to the address space of its data.

Moving on to containers, he said that one of the big problems in container security is the kernel footprint that containers use. In the simplest case, containers have access to the entire system-call interface of the kernel, which means they are exposed to a "broad swath of the kernel". If you listen to VM people, he said, they will claim that the access to all of the 360 or so system calls is what makes containers inherently far less secure than VMs. But if you measure what is actually being used by containers, "that is bollocks"; generic containers are something like two to five times less secure than hypervisors (in terms of the amount of kernel code traversed), not the hundreds of times that VM proponents like to suggest.

There is a "semantic gap" between the system calls that a guest VM can make and the code that actually gets executed in the host kernel. The security of VMs is partly based on the fact that it is actually difficult to create an exploit at the system-call level that will translate across the hypercalls into the host kernel. So there is lots of exposed code in the guest OS, but a great deal less in the host OS because of the hypervisor interface. Container developers would like to get back the properties that the semantic gap provides, without reverting to having the gap itself, since it is the lack of this gap that gives containers much of their power.

But in order to reduce the kernel footprint, projects like gVisor and Nabla Containers create sandboxes to handle some of the kernel work in user space. For gVisor, the kernel calls have been rewritten in Go, while Nabla Containers took the approach of reducing the projection of the containers into the kernel by restricting which system calls can be made and emulating the rest in user space.

The price for those approaches is a "massive semantic gap", he said. It has high security properties, but it causes problems for memory management and elsewhere. He is looking to thin out the sandbox; ideally via separate address spaces for each container or VM. If every kernel namespace could have objects that were private to that namespace, that would go a long way toward solving this problem, he said. It is primarily a protection for multi-tenant systems, so he is not suggesting that the feature would be turned on everywhere.

Trond Myklebust asked how these ideas differed from a microkernel architecture. In a microkernel, the filesystem driver and the block device driver would each be in their own address spaces, Bottomley said, so sharing filesystems between containers would be problematic because the two containers would be accessing the same filesystem address space. What he is looking for is to separate address spaces on a per-tenant basis, not per subsystem.

Another attendee noted that in a microkernel, a fault in the filesystem driver, for example, only affects that component, not the rest of the system. The architecture will make it difficult to separate a shared filesystem and all of its data structures; it is a question of which resources you are protecting.

Bottomley believes that two unrelated containers sharing a filesystem is a niche use case in the cloud. Since most don't do it, we can afford to make them pay a huge penalty when they do, he said. In terms of fault isolation, a microkernel will simply restart a failing driver, but what is needed in the multi-tenant case is to kill the tenant that caused the fault. The idea is to punish a malicious tenant. The difference is that a microkernel provides an address space per subsystem, while he is looking for an address space per tenant; the design goals and security properties are different in the two cases.

The question of performance was raised; can the performance that will be lost by doing this address-space separation ever be recovered? Bottomley believes that it is a hardware problem. He noted that virtualization was slow initially, so the hardware vendors stepped up to make it faster; the same could happen here. Beyond that, this feature is for security; anything is acceptable in the name of security, he (half) joked.

Matthew Wilcox said that he did not buy the semantic gap argument. He noted that in a previous life as a Java programmer he had a colleague who was able to accidentally corrupt the BIOS in his machine with a Java program. Bottomley agreed; the semantic gap argument is a form of security through obscurity, he said. There are two main approaches to providing security for containers: either guarding against malicious system calls, which sometimes works and sometimes doesn't, or emulating system calls in user space.

Returning to the shared-filesystem question, Ted Ts'o noted that often the base images for containers (such as RHEL) are shared. Bottomley said that typically those are shared in a read-only fashion, which is a solved problem. Read-only pages in the page cache can be shared, but sharing read-write pages is not a common cloud-container pattern.

Comments (19 posted)

ClearlyDefined: Putting license information in one place

By Jake Edge
April 30, 2019

Determining the license that any given package uses can be difficult, but it is essential in order to properly comply with that license and, thus, the developer's wishes. There is an enormous amount of "open source" software available these days that is not clearly licensed, which is where the ClearlyDefined project comes in. The project is collecting a curated list of packages, source location, and license information; some of that collection can be automated, but ClearlyDefined is targeting the community to provide curation in the form of cleanups and additions.

Licensing information is notoriously complex to get right. Packages are often made up of source files that come with their own licenses, based on where the code originally came from—or the aims of the original developers. For example, even though the Linux kernel is licensed under GPLv2, it has many different licenses throughout the tree. The effort over the past few years to add Software Package Data Exchange (SPDX) headers to the kernel's source files is still ongoing. What seems like it should be a simple, straightforward process turns out to be quite a bit less so.
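An SPDX header is a single machine-readable comment at the top of a source file; in the kernel it takes this form:

    // SPDX-License-Identifier: GPL-2.0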

The kernel effort is just for one "package", however. There are untold thousands (tens, hundreds, ...) of other packages that today's software relies upon; the licensing information for many of those is even harder to work out. But there are a number of reasons that it is important to have that information available.

Without proper license information, some will be unwilling to use the code in question, or at least to distribute it. Others may need to spend some time tracking that information down before they can use the software. That effort may well satisfy their compliance needs, but the end result does not (necessarily) help others. If, for example, an organization sets out to create a list of the packages and licenses from a container image, the work of determining the licenses may need to be repeated by others, potentially including other parts of the same organization. There is no easy way to share that information with others that use the image or the packages contained within it.

Eliminating that duplication of effort is one of the ways that ClearlyDefined is trying to help fix the problem. Its home page presents an interface that allows community members to add information they have discovered for packages. It also provides a REST API that can be used to retrieve various kinds of data from its repository.
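As a sketch of what such a query looks like (the coordinates follow the project's type/provider/namespace/name/revision scheme; the package chosen here is just an example):

    curl https://api.clearlydefined.io/definitions/npm/npmjs/-/lodash/4.17.11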

There is more to it than that, however. The real underlying problem lives in the upstream repository and/or source files. Making it easy for upstream projects to update their source files to have SPDX headers, as well as providing an overall license for the whole, could, eventually, make ClearlyDefined obsolete—though that may be just a tad overly optimistic.

Even projects that are simply providing a library or component for others to consume can encounter problems that ClearlyDefined can help solve. It is the rare project indeed that has no outside dependencies, so having licensing and other information available for those dependencies will make it easier for others to pick up and use the code. The idea is that anyone in the community can help curate the license information; ClearlyDefined is meant to be a portal where that work can be done.

The starting point is a description of the project, which includes the location of its source code, the bug-tracker location, and the project web site. Beyond that, there will generally be multiple entries for a project, one for each release; each entry has its release date associated with it. But projects are often made up of different pieces, some of which may not matter from a licensing standpoint because they are not distributed. So things like tests, build utilities, examples, and so on can be defined as "facets" of the project; the "core" facet consists of the files that are actually part of the distributed code.

Each facet then gets assigned licensing information that has either been automatically determined (by code scanners like ScanCode and FOSSology) or has been contributed to the project, as the Eclipse Foundation did when ClearlyDefined became an incubator project of the Open Source Initiative (OSI) a little over a year ago. The "declared" license, which comes from the license choice stored in the source repository (if any), is recorded, along with the number of files in the facet. Any SPDX headers discovered in the source files along with the copyright attribution information are recorded as well.

Others can help out by contributing data through the curation process on the web site. It is handled in the same way that contributions to GitHub-hosted projects normally are, with pull requests (PRs), in this case to the curated-data repository. The process is a bit clunky right now, as the project admits, but it is actively seeking ways to make the curation process work more smoothly.

Another area where ClearlyDefined is not yet clearly defined is its security component. The overarching idea is to track security vulnerabilities and fixes so that users can understand the security status of the components they use. How that will be done is still under discussion; for now, the project is mostly focused on the licensing piece.

The project's charter gives a nice overview of the project and its goals. As might be expected of an OSI project, ClearlyDefined is committed to using open-source tools and to releasing its code as open-source software. For example, "harvesting" data will not be done using proprietary tools:

Harvesting is the act [of] getting data from upstream projects. This may be as simple as reading prescribed data from canonical locations to full-on analysis of the source code using a variety of open tools. The discovered data is stored in its entirety in its native form in ClearlyDefined infrastructure and made available to the community on demand. The harvesting tools themselves are always fully open and accessible to the community for vetting and inspection. The project is open to including new tools subject to a vote, as described below.

Harvesting may be run by the ClearlyDefined project itself or by designated parties, typically curators. In all cases, only output from agreed to tools and configurations will be admitted to the system. Harvesting operators are free to focus on a given domain of projects that best suit their expertise and interests.

As the stats page shows, there are nearly five million definitions currently in the database (as of this writing, anyway). Multiple repositories are being harvested, including npm for Node.js, PyPI for Python, Maven for Java, Crate for Rust, GitHub, and others. ClearlyDefined was the subject of a lively workshop at the recent FSFE Legal and Licensing Workshop (LLW), led by project lead Jeff McAffer of GitHub. The project has lots of partners, such as Google, Microsoft, Amazon Web Services, Qualcomm, Software Heritage, and Codescoop.

The data ClearlyDefined gathers is clearly helpful, but it needs a lot of attention from the community in order to get it into a fully useful state. Once done, that data has a lot of value, however, especially in not having to redo the work over and over. Hopefully that value will lure more companies into the fold to help curate, quite possibly using data they have already gathered as part of their compliance efforts. Crowdsourcing the data seems ... clearly ... like the right way to go.

Comments (4 posted)

Some 5.1 development statistics

By Jonathan Corbet
April 25, 2019
The release of the 5.1-rc6 kernel prepatch on April 21 indicates that the 5.1 development cycle is getting close to its conclusion. So naturally the time has come to put together some statistics describing where the changes merged for 5.1 came from. It is, for the most part, a fairly typical development cycle.

As of this writing, 12,749 non-merge changesets have been pulled into the mainline repository for the 5.1 release. That is slightly more than seen in 5.0, but still a bit lower than the other kernels released in the last few years. There were nearly 545,000 lines of code added by those changesets and 289,000 lines removed, for a net growth of 256,000 lines; this is not one of those rare development cycles where the kernel gets smaller. That work was contributed by 1,707 developers, 245 of whom made their first contribution in the 5.1 cycle.

The most active developers this time around were:

Most active 5.1 developers

By changesets
    Gustavo A. R. Silva         192    1.5%
    Yue Haibing                 150    1.2%
    Christoph Hellwig           147    1.2%
    Chris Wilson                136    1.1%
    Colin Ian King              104    0.8%
    Arnd Bergmann               102    0.8%
    Masahiro Yamada              96    0.8%
    Takashi Iwai                 94    0.7%
    Heiner Kallweit              94    0.7%
    Axel Lin                     89    0.7%
    Greg Kroah-Hartman           88    0.7%
    Sean Christopherson          83    0.7%
    Jakub Kicinski               79    0.6%
    Bartosz Golaszewski          77    0.6%
    Eric Biggers                 75    0.6%
    Bart Van Assche              74    0.6%
    Christophe Leroy             72    0.6%
    Trond Myklebust              71    0.6%
    Arnaldo Carvalho de Melo     70    0.5%
    Hans Verkuil                 69    0.5%

By changed lines
    Oded Gabbay               48737    7.2%
    Jakub Kicinski            19466    2.9%
    Eric Biggers              17489    2.6%
    Christoph Hellwig         15556    2.3%
    Greg Kroah-Hartman        14997    2.2%
    Chris Wilson              12242    1.8%
    Shunli Wang               11046    1.6%
    Hans Verkuil              10509    1.6%
    Kaike Wan                  9788    1.5%
    Srinivas Kandagatla        8160    1.2%
    Alex Deucher               7827    1.2%
    James Smart                7421    1.1%
    Larry Finger               7184    1.1%
    David Francis              7127    1.1%
    Felix Fietkau              6854    1.0%
    Mark Rutland               5958    0.9%
    Jens Axboe                 5366    0.8%
    Claudiu Manoil             4974    0.7%
    Johannes Berg              4665    0.7%
    Neil Brown                 4595    0.7%

The top contributor of changesets in 5.1 was Gustavo A. R. Silva, who continues to make general cleanups (such as marking switch fall-through cases) throughout the kernel tree. Yue Haibing was also a contributor of widespread cleanup work. Christoph Hellwig has been reworking the DMA-mapping code, improving the XFS filesystem, and more. Chris Wilson contributed a lot of work to the i915 graphics driver and Colin Ian King fixed a number of typos and coding-style issues.

Oded Gabbay reached the top of the "lines changed" column by adding the Habana AI processor driver. Jakub Kicinski reworked the BPF self tests, Eric Biggers added a lot of testing code to the crypto subsystem, and Greg Kroah-Hartman deleted the xgifb driver from the staging tree.

The kernel development community relies heavily on its testers and reviewers. The testing and review picture for 5.1 looks like this:

Test and review credits in 5.1

Tested-by
    Arnaldo Carvalho de Melo     49    7.6%
    Andrew Bowers                47    7.2%
    Christian Zigotzky           21    3.2%
    Alexander Steffen            17    2.6%
    Stefan Berger                16    2.5%
    Gustavo Pimentel             15    2.3%
    Aaron Brown                  13    2.0%
    Stan Johnson                 13    2.0%
    Marek Szyprowski             11    1.7%
    Nathan Chancellor             9    1.4%
    Jarkko Sakkinen               8    1.2%
    Matthias Kaehlcke             8    1.2%
    Keerthy                       8    1.2%
    Linus Walleij                 6    0.9%
    Stefan Wahren                 6    0.9%
    Sven Auhagen                  6    0.9%
    Geert Uytterhoeven            5    0.8%
    Guenter Roeck                 5    0.8%
    Jonathan Hunter               5    0.8%
    Randy Dunlap                  5    0.8%
    Tyler Baicar                  5    0.8%
    David Safford                 5    0.8%

Reviewed-by
    Rob Herring                 208    4.4%
    Christoph Hellwig            86    1.8%
    Simon Horman                 76    1.6%
    Tvrtko Ursulin               76    1.6%
    Andrew Lunn                  75    1.6%
    Geert Uytterhoeven           74    1.6%
    Hannes Reinecke              65    1.4%
    Alex Deucher                 63    1.3%
    Andrew Morton                62    1.3%
    David Sterba                 60    1.3%
    Daniel Vetter                59    1.3%
    Chao Yu                      55    1.2%
    Florian Fainelli             49    1.0%
    Jaroslav Kysela              49    1.0%
    Jakub Kicinski               48    1.0%
    Ville Syrjälä                47    1.0%
    Mika Kuoppala                45    1.0%
    Chris Wilson                 44    0.9%
    Guenter Roeck                41    0.9%
    Laurent Pinchart             41    0.9%
    Darrick J. Wong              40    0.9%
    Mike Marciniszyn             39    0.8%

There have been times when these statistics have shown some questionable behavior — large numbers of reviews from an author's coworkers that never saw a public list, for example. This time, about the only thing that jumps out is Rob Herring's activity: he reviewed a large number of device-tree bindings from many different developers, just as one might expect a device-tree maintainer to do. Overall, the community benefits hugely from the efforts of our many testers and reviewers.

Companies

A total of 230 companies (that could be identified) supported work on 5.1 — a typical number. The most active employers this time around were:

Most active 5.1 employers

By changesets
    Intel                      1508   11.8%
    (None)                      897    7.0%
    Red Hat                     857    6.7%
    (Unknown)                   812    6.4%
    Google                      671    5.3%
    Linaro                      504    4.0%
    Mellanox                    493    3.9%
    Huawei Technologies         487    3.8%
    IBM                         404    3.2%
    SUSE                        350    2.7%
    AMD                         340    2.7%
    Linux Foundation            298    2.3%
    Renesas Electronics         280    2.2%
    (Consultant)                266    2.1%
    NXP Semiconductors          230    1.8%
    ARM                         205    1.6%
    Oracle                      202    1.6%
    Facebook                    180    1.4%
    Bootlin                     176    1.4%
    Code Aurora Forum           159    1.2%

By lines changed
    Intel                     76848   11.4%
    Habana Labs               52429    7.8%
    (None)                    36930    5.5%
    Google                    36916    5.5%
    (Unknown)                 32249    4.8%
    Red Hat                   31598    4.7%
    Linaro                    29175    4.3%
    AMD                       26705    4.0%
    Mellanox                  24222    3.6%
    (Consultant)              24089    3.6%
    Netronome Systems         23691    3.5%
    Facebook                  18639    2.8%
    IBM                       18529    2.7%
    NXP Semiconductors        17957    2.7%
    Linux Foundation          16283    2.4%
    ARM                       15369    2.3%
    MediaTek                  14508    2.1%
    SUSE                      13871    2.1%
    Broadcom                  11564    1.7%
    Renesas Electronics        8718    1.3%

One obvious newcomer here is Habana Labs, which contributed a driver for its AI coprocessor to the kernel; otherwise there are not a lot of surprises in this table.

Paths

One of the interesting things that can be determined, with some effort, from the kernel's Git repository is the path each commit took into the mainline — which other Git trees did it go through first? The information is not perfect; in particular, fast-forward merges will cause the provenance of a commit to be lost. But such merges are relatively rare in the kernel community (which lacks the fear of merges seen in many other projects), so an interesting picture can be created.

The entire picture is rather large to embed in an article; it can be seen on this page. The portion corresponding to the networking tree (the biggest single source of commits flowing into the mainline) is shown below:

[Treeplot output]

If a link between two git trees uses signed tags, it is shown in black; otherwise it appears in red. As can be seen, a number of significant trees are still not using signed tags in pull requests to the mainline; these include networking and, ironically, the security and crypto trees. The rule applied by Linus Torvalds is to require signed tags on any pull request that is not hosted on kernel.org; the diagram shows that he is adhering to that. Many of the other maintainers feeding into the mainline, though, do not enforce the same rule, so commits that originate on sites like GitHub are still being pulled in without signatures.

The overall picture shows that, while there are more subsystems using multiple levels of maintainers, an awful lot of code still goes directly to Torvalds. The system appears to work, though; Torvalds has shown few signs of stress in recent years. The same could be said of the development community in general. While some maintainers are clearly stressed, the system as a whole continues to function smoothly, producing kernels with thousands of changes on a predictable nine or ten-week schedule.

Comments (5 posted)

Bounce buffers for untrusted devices

April 26, 2019

This article was contributed by Marta Rybczyńska

The recently discovered vulnerability in Thunderbolt has restarted discussions about protecting the kernel against untrusted, hotpluggable hardware. That vulnerability, known as Thunderclap, allows a hostile external device to exploit Input-Output Memory Management Unit (IOMMU) mapping limitations and access system memory it was not intended to. Thunderclap can be exploited by USB-C-connected devices; while we have seen USB attacks in the past, this vulnerability is different in that PCI devices, often considered as trusted, can be a source of attacks too. One way of stopping those attacks would be to make sure that the IOMMU is used correctly and restricts the device to accessing the memory that was allocated for it. Lu Baolu has posted an implementation of that approach in the form of bounce buffers for untrusted devices.

PCI and untrusted devices

Since PCI devices are usually built into the system, there was not much concern about them going rogue (though a reader expressed concerns in the comments on an LWN article about peer-to-peer PCI accesses). The PCI bus does support hotplugging, but its use is limited. It is, however, possible to attach external PCI devices to a bus like Thunderbolt. That opens the door to the Thunderclap vulnerability; a rogue device can benefit from the fact that the PCI bus is, in practice, more trusted than externally accessible buses.

The PCI bus does not have uncontrolled access to the system, though, on systems where an IOMMU exists and is in use. It allows (or denies) access by devices to specific memory regions and maps bus addresses to physical memory addresses. The IOMMU works at the page level, and the remapped regions must be set explicitly before use; each device has different regions it can access. However, not all systems have an IOMMU enabled (or even installed) because of performance concerns or functionality that does not work correctly with the IOMMU.

One step toward improving the situation is to keep track of which devices are expected to behave well and which might not. The marking of trusted and untrusted PCI devices was added in December 2018. It is done with an untrusted flag added to struct pci_dev to control special handling of such devices, including full IOMMU mapping and functions like the bounce buffers. A PCI device is marked untrusted if the firmware marks its root port as external (currently only if the ExternalFacingPort ACPI property is set); that should be the case for Thunderbolt devices.
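A minimal sketch of how kernel code can act on that flag (device_needs_bounce() is a hypothetical helper, not an existing kernel function; the pci_dev field itself is real as of the 2018 patches):

    #include <linux/pci.h>

    /* Decide whether a device should get strict IOMMU handling. */
    static bool device_needs_bounce(struct device *dev)
    {
            struct pci_dev *pdev;

            if (!dev_is_pci(dev))
                    return false;

            pdev = to_pci_dev(dev);
            /* Set when firmware marks the root port as external-facing. */
            return pdev->untrusted;
    }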

IOMMU constraints

Trusted PCI devices are expected to perform their DMA operations to and from the buffers they have been given to use; they do not run out of bounds or access other memory zones. With such devices, the IOMMU configuration code can take some shortcuts and, for example, map slightly bigger zones to fit hardware limitations and optimize IOMMU usage. For untrusted devices, we cannot make the same assumptions; the correct and strict configuration of the IOMMU becomes more important. Unfortunately, the minimum granularity of the (Intel) IOMMU is 4KB. Mapping a buffer with the IOMMU means allowing access to the whole 4KB page, even if the desired zone is smaller.

One result of this limitation is that an unaware driver that allocates a small buffer for device DMA and maps it through the IOMMU exposes the whole page with all of the other data it may contain, even if it belongs to other drivers or to the kernel itself. The fact that this situation does not cause any runtime error could be considered a weak point of the DMA API. Just activating the IOMMU doesn't solve the problem — the system must also take care to not place any unrelated data in the memory mapped by the IOMMU.

Bounce buffers

This is where the proposed patch set comes into play. It implements bounce buffers for the untrusted devices; a bounce buffer is simply a separate memory area that is used for DMA operations. Data is copied ("bounced") between the original buffer and the bounce buffer, which is located in isolated memory that can be mapped by the IOMMU in such a way that there is no access to the data outside the buffer in question.

If the original buffer covers a full page (or multiple full pages), nothing needs to change as this buffer can be directly mapped without exposing any unrelated data. If, instead, the buffer is inside a page that may contain other data, bounce buffers will be used. During the mapping, unmapping, and sync operations, the code will copy the data from the original buffer to the bounce buffer and back, depending on the direction of the transfer. Then the IOMMU uses the bounce-buffer addresses for the device instead of the original one.

When an I/O operation is set up, the original I/O buffer is split into three parts: "low", "middle", and "high". The low and high parts are the first and last pages of the buffer, which might lie on pages that also contain other data. The middle pages contain only the device buffer, so they do not use the bounce buffer; only the low and high pages do. This operation may thus split a single contiguous buffer into three pieces; those pieces will be reunited (from the device's point of view) in the IOMMU mapping.
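In rough terms, the split could be computed like this (a sketch, not the actual patch code):

    unsigned long start = buf;              /* DMA buffer start (hypothetical) */
    unsigned long end = buf + len;
    /* End of the partial first page, if any. */
    unsigned long low_end = min(end, ALIGN(start, PAGE_SIZE));
    /* End of the run of full pages, if any. */
    unsigned long mid_end = max(low_end, end & PAGE_MASK);

    /* [start, low_end):   bounced - the page may hold unrelated data */
    /* [low_end, mid_end): mapped directly - full pages only          */
    /* [mid_end, end):     bounced - the page may hold unrelated data */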

The bounce-buffer patch implements another change: the IOMMU mapping is invalidated immediately after the unmap operation. If that mapping stays cached in the IOMMU, the device might still use it after the mapped page has been reallocated for some other purpose. The patch set also provides an option to deactivate the bounce buffers if the system administrator trusts the attached devices.

Similarity to swiotlb

In the discussion following the first version of the patch set, Christoph Hellwig noted that the code has similarities to the swiotlb (software input output translation lookaside buffer) subsystem. The swiotlb is a bounce-buffering mechanism used with devices that cannot access all of a system's memory. In response, Lu tried to make use of the swiotlb code, but that effort failed because the approach is somewhat different: the offsets given by the swiotlb differ from the original ones for the low pages. This is because swiotlb copies the whole buffer, rather than just the low and high segments, during the mapping operation.

Robin Murphy suggested that the implementation should be made generic for the whole IOMMU subsystem and not limited to Intel VT-d only. The discussion continued after the second version submission and Lu proposed an extension to swiotlb. A new version of the patch set was posted on April 21. It includes a refactoring of the swiotlb and moves some of the driver-specific code to the generic IOMMU layer.

Next steps

The use of bounce buffers can protect a system against a class of attacks. It remains an open question if there are more similar issues in the kernel and if there will be a need to harden other in-kernel interfaces. This is likely, as the threat model has completely changed — the attacker now controls the devices that were previously thought of as trusted. It seems certain that we are going to see more attacks from rogue devices using unexpected protocols. The kernel interfaces that were considered internal in the past may need to be reviewed and hardened.

The implementation of the IOMMU bounce buffers is complete; one remaining question is what the performance penalty is. The measurements of the impact have not yet been submitted with the patch set. According to the description, the impact is expected to be small. One may expect that it should be lower than swiotlb since less data copying takes place. Large transfers should not be affected as they are usually page-aligned already. The overhead will be more visible for small transfers, where the setup will dominate the cost of a small copy.

The missing performance information, along with some other comments on the latest posting of the patch set, suggest that there is still some work to be done before this code is ready to go upstream. With luck, though, it shouldn't be too long before Linux systems have a higher level of protection against untrustworthy devices.

Comments (19 posted)

Toward a reverse splice()

By Jonathan Corbet
May 1, 2019

LSFMM
The splice() system call is, at its core, a write operation; it attempts to implement zero-copy I/O by moving pages from a pipe to a file. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Miklos Szeredi described a nascent idea for rsplice() — a "reverse splice" system call. There were not a lot of definitive outcomes from this discussion, but one thing was clear: rsplice() needs a much better description (and some code posted) before the development community can begin to form an opinion on it.

A key aspect of splice() is that it works with up-to-date buffers of data, meaning that it moves pages already containing data obtained from some source. The reverse-splice operation would, instead, operate on empty buffers that need to be filled from somewhere. It would, in other words, be a read operation rather than a write. It could be used to fill buffers from a file and feed the result into a pipe, for example. One possible use case, he said, is user-space filesystems, which could use it to feed pages of file data into the kernel (and beyond) without copying the data. He thinks that the idea is "implementable", but was curious to hear what the other developers in the room thought about the idea.

Rik van Riel worried about page-lifecycle problems. Moving a page of data into the page cache (as rsplice() might do) is easy if there are no other users of the page, but what if other processes already have that page in their page tables? Szeredi responded that rsplice() can only work if the pages involved have not yet been activated, so no other references to them can exist.

Matthew Wilcox said that he knows of a use case from his time at Microsoft. It has to do with fetching pages of data from a remote server; there would be value in having an efficient way to place that data into the page cache. Doing this would require adding a recvpage() complement to the kernel's internal sendpage() method. He hoped that somebody within Microsoft would be able to speak more about this use case in the future.

Hugh Dickins, instead, recalled that Linus Torvalds has expressed regret about having introduced splice() in the first place. Torvalds thought that it was a great idea at the time, but few users have materialized since it was implemented. Adding new system calls that lack users only serves to increase the attack surface of the kernel, he said. There are few people who truly understand splice(), which has needed to be "corrected" numerous times over the years. An rsplice() call, he said, would likely have to go through the same process before it could be trusted.

From there the discussion wandered in various directions. There was some questioning of the value of zero-copy interfaces in general, but it does seem to offer benefits on systems with high-bandwidth adapters and huge pages. There was a fair amount of confusion about how rsplice() differs from splice(), perhaps driving home the point that not many people fully understand splice() in the first place. What is needed, it was agreed, was a well-defined use case for this new system call that would help to nail down what it actually does. Then, if an implementation appears shortly thereafter, it will be possible to have a more informed discussion on whether the whole thing makes sense.

Comments (2 posted)

Memory encryption issues

By Jonathan Corbet
May 1, 2019

LSFMM
"People think that memory encryption sounds really cool; it will make my system more secure so I want it". At least, that is how Dave Hansen characterized the situation at the beginning of a session on the topic during the memory-management track at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. This session, also led by Kirill Shutemov, covered a number of aspects of the memory-encryption problem on Intel processors and beyond. One clear outcome of the discussion was also raised by Hansen at the beginning: users of memory encryption need to think hard about where that extra security is actually coming from.

Memory encryption takes a lot of forms, Hansen said; he was prepared to talk mostly about Intel's offerings, but he didn't want the discussion to be limited to that. The feature is driven by a desire for the protection of data while it is at rest. Data stored in ordinary RAM is normally thought of as disappearing when the power goes away, but even that data can be recovered with a sufficiently sophisticated offline attack. Persistent memory, of course, makes such attacks easier. Those devices can have hardware locks, but they apply to the whole device, while users would like to have more fine-grained protection.

In particular, there is a desire to create separate encryption domains within a system — one key per container, essentially. Users feel that it makes them secure, he said. This type of encryption might be able to protect users from a compromised operating system, but that is not something we can do now.

On the Intel side, effort is being put into the "multi-key total-memory encryption" feature (MKTME). The older TME functionality only supports a single key for the entire system; it provides protection for data at rest, but does not allow for any sort of separation of domains within the system. MKTME changes that by allowing different processes to run with different keys. There are patches implementing MKTME support (for anonymous memory only, initially) out for consideration.

Rik van Riel asked at this point whether the encryption is managed within the CPU; in particular, is data stored in the CPU caches encrypted? The answer was "no"; encryption is implemented in the memory controller. Thus, Van Riel continued, it provides no protection against attacks exploiting vulnerabilities like L1TF. Hansen agreed, noting that technologies like memory encryption should always be evaluated against prior vulnerabilities to see what they would have protected against. Memory encryption does not help against speculative-execution attacks.

Hansen raised another interesting problem that doesn't seem to have an easy solution. There are a number of attacks on encryption keys that are helped by having large amounts of encrypted data available. When a mechanism like MKTME is in use, the memory controller essentially becomes a high-bandwidth encryption engine; that could facilitate such attacks. Such things need to be taken into account when considering memory encryption to be sure that it provides a real security benefit.

Somebody asked whether memory encryption works with DMA I/O; the answer is that yes, it can be done. But supporting DMA requires programming the encryption key into the IOMMU. The AMD memory-encryption implementation does bounce-buffering instead for now.

Future work, Hansen said, includes implementing support for the protection of file-backed memory and persistent memory. It should eventually be possible to set keys at a device level, or on individual files (or directory trees, more likely). This functionality may be built on the fs-crypt feature supported by some filesystems now.

Van Riel asked how many keys can be supported by current systems; the answer was that each memory controller can handle up to 64 different keys. In the current patches, all controllers must be configured with the same keys. There is talk of breaking that link, so that different controllers could have different keys, thus increasing the total number of key slots available but, Hansen said, "that sounds like a nightmare" to implement. Memory can normally be allocated arbitrarily across controllers; if only some controllers could handle a given process's encryption key, though, the situation would be complicated immensely. It might make sense when data is tied to a specific controller — data on a specific persistent-memory device, for example.

Shutemov talked briefly about the additional challenges posed by CPU hotplugging. A new CPU likely brings with it a new memory controller, which must then be programmed with the same keys used by the existing controllers. To be able to do that, though, the kernel must keep a copy of the keys in its own memory, where an attacker may try to steal them. It is not possible (or not intended to be possible, anyway) to extract the keys from the memory controllers themselves, so if the kernel could delete its copy of the keys after setting them, key-stealing attacks would be nearly impossible. Storing the keys in the kernel thus can only reduce the security of the system as a whole.

As a result, there is still discussion over whether the MKTME patches should allow users to set their own keys for anonymous memory at all. There does not appear to be any security benefit from doing so; indeed, the opposite seems to be true. Users would be more secure without that feature. One potential benefit to user-supplied keys for anonymous memory, though, could be to provide a sort of secure-erase feature. When a given user's processes shut down, the associated encryption key can be overwritten in the memory controllers and deleted from the kernel, after which that user's data will be inaccessible.

Matthew Wilcox noted that encryption doesn't come for free; he wondered about what the power cost of memory encryption might be. Nobody had a good answer to that question. Hansen did note that there is a "single-digit" percentage increase in memory latency when encryption is used; total memory bandwidth is unaffected, but latency does increase. Another cost comes when keys are updated; that requires flushing all of the CPU caches, which is expensive. Setting keys is a privileged operation in the current patches; key slots are a limited resource, so ordinary users should not be allowed to consume them at will.

As the session ran out of time, it began to wander a bit. A final question had to do with where the key used for TME (which must already be present at boot) comes from — how do users know that it is secure? There was not a clear answer to that question either, but Hansen did note that the key originates in the CPU. If the CPU itself cannot be trusted, then the question of how the encryption key is generated does not matter much.

Comments (11 posted)

Android memory management

By Jonathan Corbet
May 1, 2019

LSFMM
The Android system is designed to provide a responsive user experience on systems that, in a relative sense at least, have limited amounts of CPU and memory. Doing so requires a number of techniques, including regular use of a low-memory process killer, that are not seen elsewhere. In a memory-management-track session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Suren Baghdasaryan covered a number of issues related to how Android ensures that interactive processes have enough memory to get their jobs done.

Baghdasaryan started by noting that the recently added pressure-stall information feature, which was not originally developed for Android at all, has proved to be quite useful. It gives the Android runtime more accurate information about memory pressure, which can be used to better manage the set of running processes. Overall, he said, the goal of Android memory management is to ensure that interactive processes work as well as possible while minimizing the number of out-of-memory kills needed to do that.

The Android low-memory killer daemon (LMKD) is charged with making all of this happen. Beyond pressure-stall information, a number of recent developments are helping to make it more effective at that task. File descriptors that represent processes (discussed here) are helpful, and the upcoming ability to poll those descriptors for process death will also be useful. But there are still issues with reclaiming memory when the need arises. The use of control groups, while helpful in many ways, does split the kernel's least-recently-used (LRU) list into a large number of smaller lists, which makes reclaim harder in general.

The core issue discussed in this session, though, was quick reclaim of memory from processes that have been killed by LMKD. Depending on what else is happening in the system, that reclaim can take a long and unpredictable time, which makes LMKD's problem harder and forces it to kill processes sooner than it otherwise would. Baghdasaryan's opportunistic reclaim patches are an attempt to improve this situation; this feature tries to immediately strip memory from a process that has been explicitly killed. That avoids situations where the target process itself may be slow to free those resources, making reclaim faster and more predictable.

The first implementation was based on the OOM reaper code which, he said, is probably not how a final implementation should look. But getting to that final version requires answering a few questions, the first of which was where the work of reaping of memory from a killed process should be done. One option is to have the process sending the SIGKILL signal take responsibility (in kernel space) for this reclaim work. There are a number of advantages to doing things that way: it is relatively simple to implement, the CPU time required will be charged to the killing process, priority inheritance for the reclaim work will happen automatically, and it provides for better user-space control over when this work is done. On the other hand, it may require a new user-space API to control opportunistic reclaim, and the scalability of reclaim from large processes could be a problem.

The alternative would be to do this work in one or more kernel threads. That simplifies the API issues and might make things more scalable. That, however, takes away any sort of user control over when expedited reclaim might happen.

Rik van Riel observed that this is, at its core, a hardware problem. At some point, mobile devices will get fast enough, with enough memory, that the reclaim problem will go away by itself. In that case, adding a new API to speed up reclaim might well be the wrong thing to do. Michal Hocko, though, said that the real problem is processes that are blocked (and thus cannot do their own cleanup) rather than hardware limitations.

Johannes Weiner said that, to the extent that hardware is the problem, things could be improved by automatically moving exiting processes to the fastest CPU in the system. Other resource limits are waived on exiting processes so they can get out of the way quickly, he said, so it might make sense to do the same with regard to CPU placement. Others worried that this could create power-consumption issues, but since this situation tends to come about when an interactive process wants to run, a fast CPU should be running and available anyway.

Hocko replied that power use is not a big concern, but that process isolation might be. If a process has been confined to a slow CPU, moving it to a fast one, even if it's just to die, may break the isolation between groups and affect the running of interactive processes. If the desire is to have a killed process do its cleanup on a fast CPU, the solution is for user space to explicitly move the process to a different control group prior to killing it.

Mel Gorman said that there are a couple of problems that need to be addressed here. One is that there is not enough CPU time to do the cleanup work quickly. But the other aspect is that address spaces have gotten so large that even the fastest CPU is limited in how quickly it can get the job done. But the best solution is simple, he said, at least on the kernel side: an exiting process should be migrated to a fast CPU if its CPU mask allows that. The rest of the problem can be solved in user space, which can move the process prior to killing it if reclaim time is an issue. Doing anything else in the kernel would, he said, break isolation for somebody eventually.

Matthew Wilcox returned to the issue of doing the reclaim in the context of the killing process, which skirts around a number of these issues. Gorman replied that implementing reclaim that way is guaranteed to increase the number of inter-processor interrupts, since the killed process's memory will have been touched on two separate CPUs; that would not be good for performance. There are also concerns that doing reclaim this way would turn the kill() system call into a blocking operation that could run for an arbitrary time, which could surprise callers.

One final option that was discussed was a remote version of the madvise(MADV_DONTNEED) operation; that would allow one process to force reclaim of (some of) another process's memory. Gorman worried, though, that this operation would have a large "potential for shenanigans"; this concern could be addressed by applying the usual "can the calling process use ptrace() on the target?" test. This call would have the advantage that it could be done in multiple threads, each of which would release a part of the address space; that would be a simple way of parallelizing the task. At the close of the session, it was also suggested that either fadvise() or truncate(), when called on a process file descriptor, could be given the effect of reclaiming that process's memory, but the idea was not developed further.

Comments (4 posted)

Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds