
Leading items

Welcome to the LWN.net Weekly Edition for June 11, 2020

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Seccomp and deep argument inspection

By Jake Edge
June 10, 2020

Kees Cook has been doing some thinking about plans for new seccomp features to work on soon. There were four separate areas that he was interested in, which he detailed in a lengthy mid-May message on the linux-kernel mailing list. One of those features, deep argument inspection, has been covered here before, but it would seem that we are getting closer to a resolution on how that all will work.

Deep arguments

Seccomp filtering (or "seccomp mode 2") allows a process to filter which system calls can be made by it or its threads—it can be used to "sandbox" a program such that it cannot make calls that it shouldn't. Those filters use the "classic" BPF (cBPF) language to specify which system calls and argument values to allow or disallow. The seccomp() system call is used to enable filtering mode or to load a cBPF filtering program. Those programs only have access to the values of the arguments passed to the system call; if those arguments are pointers, they cannot be dereferenced by seccomp, which means that accepting or rejecting the system call cannot depend on, for example, values in structures that are passed to system calls via pointers—or even string values.

The reason that seccomp cannot dereference the pointers is to avoid the time-of-check-to-time-of-use (TOCTTOU) race condition, where user space can change the value of what is being pointed to between the time that the kernel checks it and the time that the value gets used. But certain system calls, especially newer ones like clone3() and openat2(), have some important arguments passed in structures via pointers. These new system calls are designed with an eye toward easily adding new arguments and flags by redefining the structure that gets passed; in his email, Cook called these "extensible argument" (or EA) system calls.

It does not make sense for seccomp to provide a mechanism to inspect the pointer arguments of every system call, he said: "[...] the grudging consensus was reached that having seccomp do this for ALL syscalls was likely going to be extremely disruptive for very little gain". But for the EA system calls (or perhaps only a subset of those), seccomp could copy the structure pointed to and make it available to the BPF program via its struct seccomp_data. That would mean that seccomp would need to change to perform that copy, which would require a copy_from_user() call, and affected system calls would need to be seccomp-aware so that they can use the cached copy if seccomp creates one.

There are some other wrinkles to the problem, of course. The size of the structure passed to the EA system calls may grow over time in order to add new features. If the size is larger than expected on either side (user space or kernel), finding or filling zeroes in the "extra" space is specifically designed to mean that those new features are unused (the openat2() man page linked above has some good information on how this is meant to work). Since user space and the kernel do not have to be in lockstep, that will allow newer user-space programs to call into an older kernel and vice versa. But that also means that seccomp needs to be prepared to handle argument sizes larger (or smaller) than "expected" and ensure that the zero-filling is done correctly.

It gets even more complicated because different threads might have different ideas of what the EA structure size is, Cook said:

Since there is not really any caller-based "seccomp state" associated across seccomp(2) calls, I don't think we can add a new command to tell the kernel "I'm expecting the EA struct size to be $foo bytes", since the kernel doesn't track who "I" is besides just being "current", which doesn't take into account the thread lifetime -- if a process launcher knows about one size and the child knows about another, things will get confused.

He had suggestions of a few different possibilities to solve the problem, but seemed to prefer the zero-fill option:

leverage the EA design and just accept anything <= PAGE_SIZE, record the "max offset" value seen during filter verification, and zero-fill the EA struct with zeros to that size when constructing the seccomp_data + EA struct that the filter will examine. Then the seccomp filter doesn't care what any of the sizes are, and userspace doesn't care what any of the sizes are. (I like this as it makes the problems to solve contained entirely by the seccomp infrastructure and does not touch user API, but I worry I'm missing some gotcha I haven't considered.)

Others commenting also seemed to prefer that option, though Jann Horn noted that there is no need to zero-fill beyond the size that the kernel knows about:

We don't need to actually zero-fill memory for this beyond what the kernel supports - AFAIK the existing APIs already say that passing a short length is equivalent to passing zeroes, so we can just replace all out-of-bounds loads with zeroing registers in the filter. The tricky case is what should happen if the userspace program passes in fields that the filter doesn't know about. The filter can see the length field passed in by userspace, and then just reject anything where the length field is bigger than the structure size the filter knows about. But maybe it'd be slightly nicer if there was an operation for "tell me whether everything starting at offset X is zeroes", so that if someone compiles with newer kernel headers where the struct is bigger, and leaves the new fields zeroed, the syscalls still go through an outdated filter properly.

Implementing that new operation would require changes to cBPF, however, which is not going to happen, according to BPF maintainer Alexei Starovoitov: "cbpf is frozen." An alternative would be for seccomp to switch to extended BPF (eBPF) for its filters. Using eBPF would allow the filters to perform that operation themselves without adding any new opcodes, but switching to eBPF is something that Cook hopes to avoid. As he explained in a message back in 2018, eBPF is something of a fast-moving target, which worries him from a security standpoint: "[...] I want absolutely zero surprises when it comes to seccomp". Beyond that, eBPF would add a lot more code for the seccomp filter to interact with in potentially dangerous ways.

Aleksa Sarai, who is the developer behind the EA scheme, generally agreed with Cook's plan for handling those structures, but he raised another point. The structures may contain pointers—those cannot be dereferenced by seccomp either, of course. Should something be done so that the filters can access that data as well? When these "nested pointers" came up in another discussion, Linus Torvalds made it abundantly clear that he thinks that is not a problem that the kernel should deal with at all.

Less-deep arguments

A few days after his original post, Cook posted an item on the ksummit-discuss mailing list to suggest that there be a session at the (virtual) Kernel Summit in August to discuss these seccomp issues. Torvalds acknowledged that this kind of system call exists, but did not think there was much to discuss with regard to seccomp:

So I am not in the least interested in some kind of general discussion about system calls with "pointers to pointers". They exist. Deal with it. It's not in the least an interesting issue, and no, we shouldn't make seccomp and friends incredibly more complicated for it.

[...] And if you have some actual and imminent real security issue, you mention _that_ and explain _that_, and accept that maybe you need to do that expensive emulation (because the kernel people just don't care about your private hang-ups) or you need to explain why it's a real issue and why the kernel should help with your odd special case.

Cook seemed somewhat relieved in his response:

Perhaps the question is "how deeply does seccomp need to inspect?" and maybe it does not get to see anything beyond just the "top level" struct (i.e. struct clone_args) and all pointers within THAT become opaque? That certainly simplifies the design.

Christian Brauner, who has also been doing a lot of development in these areas, agreed that the filters could likely live without the ability to chase pointers any further than the top level. Sarai would like to see there at least be a path forward if requirements of that sort do arise, but seemed willing to keep things simple for now—perhaps forever.

io_uring

In his message on linux-kernel, Horn raised an interesting point for seccomp developers: handling io_uring. Since its introduction in early 2019, io_uring has rapidly added features that effectively allow routing around the normal system-call entry path, while still performing the actions that a seccomp filter might be trying to prevent.

io_uring is growing support for more and more syscalls, including things like openat2, connect, sendmsg, splice and so on, and that list is probably just going to grow in the future. If people start wanting to use io_uring in software with seccomp filters, it might be necessary to come up with some mechanism to prevent io_uring from permitting access to almost everything else...

Obviously, the filters could simply disallow the io_uring system calls entirely, but that may be problematic down the road. Sarai agreed that it is something that may need some attention. Cook said that he needed to look more closely at io_uring: "I thought this was strictly for I/O ... like it's named". Trying to filter based on the arguments to the io_uring system calls will be a difficult problem to solve, since the actual commands and their arguments are buried inside a ring buffer that lives in an mmap() region shared between the kernel and user space. Chasing pointers in that environment seems likely to require eBPF—or even stronger medicine.

It would seem that a reasonable path for inspecting the first level of structure "arguments" to some system calls has been identified. clone3() and openat2() are obvious candidates, since their flag arguments, which will help seccomp filters determine if the call is "reasonable" under the rules of the sandbox, live in such structures. On the other hand, complex, multiplexing system calls like ioctl() and bpf() were specifically mentioned as system calls that would not make sense to try to add the pointer-chasing feature. Though Cook did not put any timetable on his plans, one might think we will see this feature sometime before the end of the year.


5.8 Merge window, part 1

By Jonathan Corbet
June 5, 2020
Just over 7,500 non-merge changesets have been pulled into the mainline repository since the opening of the 5.8 merge window — not a small amount of work for just four days. The early pulls are dominated by the networking and graphics trees, but there is a lot of other material in there as well. Read on for a summary of what entered the kernel in the first part of this development cycle.

Architecture-specific

  • Branch-target identification and shadow call stacks (both described in this article) have been added to the Arm64 architecture. Both are hardening technologies that, with luck, will make Arm64 systems more resistant to attack. The shadow call stack support is likely to spread to other architectures in the near future.

Core kernel

  • The new faccessat2() system call adds the flags argument that POSIX has always said should be there. The current support for faccessat() on Linux systems depends on emulation of the flags argument by the C library; faccessat2() will allow a better implementation in the kernel.
  • Memory control groups have a new knob, memory.swap.high, which can be used to slow down tasks that are using large amounts of swap space; see this commit for a bit more information.
  • The io_uring subsystem now supports the tee() system call.
  • It is now possible to pass a pidfd to the setns() system call; in that case, it is possible to specify multiple namespace types. The calling process will be moved to all of the applicable namespaces in an atomic manner.
  • The "BPF iterator" mechanism, which facilitates the dumping of kernel data structures to user space, has been merged; this feature was covered in this article in April.
  • There is a new ring buffer for communicating data from BPF programs. It is intended to resemble the perf ring buffer while allowing sharing of the buffer across multiple CPUs. See this documentation commit for more information.
  • The padata mechanism now supports multi-threaded jobs with load balancing; see this documentation commit for details.
  • The kernel's swappiness tuning knob, which sets the balance between reclaiming file-backed and anonymous pages, has traditionally been used to bias the system away from swapping anonymous pages. With fast I/O devices, though, swapping may be faster than filesystem access, so it may be useful to bias the system toward swapping. Now swappiness can take values up to 200 to push things in that direction; see this commit for details.

Filesystems and block I/O

  • Low-level support for inline encryption has been added to the block layer. Inline encryption is a hardware feature that encrypts (and decrypts) data moving between a block storage device and the CPU using a key provided by the CPU. Some more information can be found in this commit.
  • There is a new statx() flag (STATX_ATTR_DAX) that indicates that the file in question is being accessed directly via the DAX mechanism. There is also a documentation patch that attempts to specify just how filesystems will behave when DAX is in use. More DAX-related changes can be expected during this merge window.

Hardware support

  • Graphics: Leadtek LTK050H3146W panels, Northwest Logic MIPI DSI host controllers, Chrontel CH7033 video encoders, Visionox RM69299 panels, and ASUS Z00T TM5P5 NT35596 panels.
  • Hardware monitoring: Maxim MAX16601 voltage regulators, AMD RAPL MSR-based energy sensors, Gateworks System Controller analog-to-digital converters, and Baikal-T1 process, voltage, and temperature sensors.
  • Interrupt control: Loongson3 HyperTransport interrupt vector controllers, Loongson PCH programmable interrupt controllers, and Loongson PCH MSI controllers.
  • Media: Rockchip video decoders and OmniVision OV2740 sensors. The "atomisp" driver has also been resurrected in the staging tree and seen vast amounts of cleanup work.
  • Miscellaneous: AMD SPI controllers, Maxim 77826 regulators, Arm CryptoCell true random number generators, Amlogic Meson SDHC host controllers, Freescale eSDHC ColdFire controllers, and Loongson PCI controllers.
  • Networking: Broadcom BCM54140 PHYs, Qualcomm IPQ4019 MDIO interfaces, MediaTek STAR Ethernet MACs, Realtek 8723DE PCI wireless network adapters, and MediaTek MT7915E wireless interfaces.

Miscellaneous

  • The new initrdmem= boot-time option specifies an initial disk image found in RAM; see this commit for more information.

Networking

  • The bridge code now supports the media redundancy protocol, where a ring of Ethernet switches can be used to survive single-unit failures. See this commit for more information.
  • The new "gate" action for the traffic-control subsystem allows specific packets to be passed into the system during specified time slots. This action is naturally undocumented, but some information can be found in this commit.
  • Some network devices can perform testing of attached network cables; the kernel and ethtool utility now support that functionality when it is available.
  • The multiprotocol label switching routing algorithm is now available for IPv6 as well as IPv4.
  • RFC 8229, which describes encapsulation of key-exchange and IPsec packets, is now supported.

Security-related

  • The CAP_PERFMON capability has been added; a process with this capability can do performance monitoring with the perf events subsystem.
  • The new CAP_BPF capability covers some BPF operations that previously required CAP_SYS_ADMIN. In general, most BPF operations will also require either CAP_PERFMON (for tracing and such) or CAP_NET_ADMIN; this commit gives a terse overview of which operations require which capabilities.

Internal kernel changes

  • The "pstore" mechanism, which stashes away system-state information in case of a panic, has gained a new back-end that stores data to a block device. See this commit for documentation.
  • There is a new read-copy-update (RCU) variant called "RCU rude"; it delineates grace periods only at context switches. Those wondering about the name might see the comment in this commit, which reads: "It forces IPIs and context switches on all online CPUs, including idle ones, so use with caution".
  • The RCU-tasks subsystem has a new "RCU tasks trace" variant suited to the needs of tracing and BPF programs; see this commit for details.
  • "Local locks" have been brought over from the realtime preemption tree. These locks are intended to replace code that disables preemption and/or interrupts on a single processor. Advantages include a better realtime implementation and the ability to properly instrument locking; see this commit for more information.
  • The API for managing file readahead has changed significantly; see this patch series for details.
  • The kgdb kernel debugger is now able to work with the boot console, enabling debugging much earlier in the boot process; see this commit and this documentation patch for more information.
  • There is a new buffer-allocation API intended to make the writing of XDP network drivers easier. Documentation is too much to hope for, but the API can be seen in this commit.

The 5.8 merge window can be expected to remain open until June 14; after that, the actual 5.8 release should happen in early August. Stay tuned; LWN will provide an update on the rest of this merge window after it closes.


A crop of new capabilities

By Jonathan Corbet
June 8, 2020
Linux capabilities empower the holder to perform a set of specific privileged operations while withholding the full power of root access; see the capabilities man page for a list of current capabilities and what they control. There have been no capabilities added to the kernel since CAP_AUDIT_READ was merged for 3.16 in 2014. That's about to change with the 5.8 release, though, which is set to contain two new capabilities; yet another is currently under development.

New capabilities in 5.8

The first of the new capabilities is CAP_PERFMON, which was covered in detail here last February. With this capability, a user can perform performance monitoring, attach BPF programs to tracepoints, and other related actions. In current kernels, the catch-all CAP_SYS_ADMIN capability is required for this sort of performance monitoring; going forward, users can be given more restricted access. Of course, a process with CAP_SYS_ADMIN will still be able to do performance monitoring as well; it would be nice to remove that power from CAP_SYS_ADMIN, but doing so would likely break existing systems.

The other new capability, CAP_BPF, controls many of the actions that can be carried out with the bpf() system call. This capability has been the subject of a number of long and intense conversations over the last year; see this thread or this one for examples. The original idea was to provide a special device called /dev/bpf that would control access to BPF functionality, but that proposal did not get far. What was being provided was, in essence, a new capability, so capabilities seemed like a better solution.

The current CAP_BPF controls a number of BPF-specific operations, including the creation of BPF maps, use of a number of advanced BPF program features (bounded loops, cross-program function calls, etc.), access to BPF type format (BTF) data, and more. While the original plan was to not retain backward compatibility for processes holding CAP_SYS_ADMIN in an attempt to avoid what Alexei Starovoitov described as the "deprecated mess", the code that was actually merged does still recognize CAP_SYS_ADMIN.

One interesting aspect of CAP_BPF is that, on its own, it does not confer the power to do much that is useful. Crucially, it is still not possible to load most types of BPF programs with just CAP_BPF; to do that, a process must hold other capabilities relevant to the subsystem of interest. For example, programs for tracepoints, kprobes, or perf events can only be loaded if the process also holds CAP_PERFMON. Most program types related to networking (packet classifiers, XDP programs, etc.) require CAP_NET_ADMIN. If a user wants to load a program for a networking function that calls bpf_trace_printk(), then both CAP_NET_ADMIN and CAP_PERFMON are required. It is thus the combination of CAP_BPF with other capabilities that grants the ability to use BPF in specific ways.

Additionally, some BPF operations still require CAP_SYS_ADMIN. Offloading BPF programs into hardware is one example. Another one is iterating through BPF objects — programs, maps, etc. — to see what is loaded in the system. The ability to look up a map, for example, would give a process the ability to change maps belonging to other users and with it, the potential for all sorts of mayhem. Thus the bar for such activities is higher.

The end result of this work is that it will be possible to do quite a bit of network administration, performance monitoring, and tracing work without full root (or even full CAP_SYS_ADMIN) access.

CAP_RESTORE

The CAP_RESTORE proposal was posted in late May; its purpose is to allow the checkpointing and restoring of processes by (otherwise) unprivileged processes. Patch author Adrian Reber wrote that this is nearly possible today using the checkpoint/restore in user space (CRIU) feature that has been under development for many years. There are a few remaining obstacles, though, one of which is process IDs. Ideally, a process could be checkpointed and restored, perhaps on a different system, without even noticing that anything had happened. If the process's ID changes, though, that could be surprising and could lead to bad results. So the CRIU developers would like the ability to restore a process using the same ID (or IDs for a multi-threaded process) it had when it was checkpointed, assuming that the desired IDs are available, of course.

Setting the ID of a new process is possible with clone3(), but this feature is not available to unprivileged processes. The ability to create processes with a chosen ID would make a number of attacks easier, so ID setting is restricted to processes with, of course, CAP_SYS_ADMIN. Administrators tend to balk at handing out that capability, so CRIU users have been resorting to a number of workarounds; Reber listed a few that vary from the reasonable to the appalling:

  • Containers that can be put into user namespaces can, of course, control process IDs within their namespaces without any particular difficulty. But that is evidently not a solution for everybody.
  • Some high-performance computing users run CRIU by way of a setuid wrapper to gain the needed capabilities.
  • Some users run the equivalent of a fork bomb, quickly creating (and killing) processes to cycle through the process-ID space up to the desired value.
  • Java virtual-machine developers would evidently like to use CRIU to short out their long startup times; they have been simply patching out the CAP_SYS_ADMIN checks in their kernel (a workaround that led Casey Schaufler to exclaim: "That's not a workaround, it's a policy violation. Bad JVM! No biscuit!").

Reber reasonably suggested that it should be possible to find a better solution than those listed above, and said that CAP_RESTORE would be a good fit.

Discussion of this patch focused on a couple of issues, starting with whether it was needed at all. Schaufler, in particular, wanted to know what the new capability would buy, and whether it would truly be sufficient to carry out the checkpoint and restore operations without still needing CAP_SYS_ADMIN. Just splitting something out of CAP_SYS_ADMIN, he said, is not useful by itself:

If we broke out CAP_SYS_ADMIN properly we'd have hundreds of capabilities, and no one would be able to manage the capability sets on anything. Just breaking out of CAP_SYS_ADMIN, especially if the process is going to need other capabilities anyway, gains you nothing.

It does seem that CAP_RESTORE may, in the end, be sufficient for this task, though, so Schaufler's objections seemed to fade over time.

The other question that came up was: what other actions would eventually be made possible with this new capability? The patch hinted at others, but they were not implemented. The main one appears to be the ability to read the entries in /proc/pid/map_files in order to be able to properly dump out various mappings during the checkpoint procedure. The next version of the patch will have an implementation of this behavior as well. Some developers wondered whether there should be two new capabilities, with the second being CAP_CHECKPOINT, to cover the actions specific to each procedure; that change may not happen without further pressure, though.

The final form of this patch remains to be seen; security-related changes can require a lot of discussion and modification before they find their way in. But this capability seems useful enough that it will probably end up merged in some form at some point.


DMA-BUF cache handling: Off the DMA API map (part 1)

June 4, 2020

This article was contributed by John Stultz

Recently, the DMA-BUF heaps interface was added to the 5.6 kernel. This interface is similar to ION, which has been used for years by Android vendors. However, in trying to move vendors to use DMA-BUF heaps, we have begun to see how the DMA API model doesn't fit well for modern mobile devices. Additionally, the lack of clear guidance on how to handle cache operations efficiently results in vendors using custom device-specific optimizations that aren't generic enough for an upstream solution. This article will describe the nature of the problem; the upcoming second installment will look at the path toward a solution.

The kernel's DMA APIs are all provided for the sharing of memory between the CPU and devices. The traditional DMA API has, in recent years, been joined by additional interfaces such as ION, DMA-BUF, and DMA-BUF heaps. But, as we will see, the problem of efficiently supporting memory sharing is not yet fully solved.

As an interface, ION was poorly specified, allowing applications to pass custom, opaque flags and arguments to vendor-specific, out-of-tree heap implementations. Additionally, since the users of these interfaces only ran on the vendors' devices with their custom kernel implementations, little attention was paid to trying to create useful generic interfaces. So multiple vendors might use the same heap ID for different purposes, or they might implement the same heap functionality but using different heap IDs and flag options. Even worse, many vendors drastically changed the ION interface and implementation itself, so that there was little in common between vendor ION implementations other than their name and basic functionality. ION essentially became a playground for out-of-tree and device-specific vendor hacks.

Meanwhile, the general dislike of the interface upstream meant that objections to the API often obfuscated the deeper problems that vendors were using ION to solve. But now that the DMA-BUF heaps interface is upstream, some vendors are trying to migrate from their ION heap implementations (and, hopefully, submit the result upstream). In doing so, they are starting to wonder how they will implement some of the functionality and optimizations they were able to obtain with ION while using the more constrained DMA-BUF heaps interface.

A side effect of trying to cajole vendors into pushing their heap functionality upstream is learning more about the details and complexities of how vendors use DMA-BUFs. Since performance is important to mobile vendors, they spend lots of time and effort optimizing how data moves through the device. Specifically, they use buffer sharing not just for moving data between the CPU and a device, but for sharing data between different devices in a pipeline. Often, data is generated by one device, then processed by multiple other devices without the CPU ever accessing it.

For example, a camera sensor may capture raw data to a buffer; that buffer is then passed to an image signal processor (ISP), which applies a set of corrections and adjustments. The ISP will generate one buffer that is passed directly to the display compositor and rendered directly to the screen. The ISP also produces a second buffer that is converted by an encoder to produce yet another buffer that can then be passed to a neural-network engine for face detection (which is then used for focus correction on future frames).

This model of multi-device buffer sharing is common in mobile systems, but isn't as common upstream, and it exposes some limitations of the existing DMA API — particularly when it comes to cache handling. Note that while both the CPU and devices can have their own caches, in this article I'm specifically focusing on the CPU cache; device caches are left to be handled by their respective device drivers.

The DMA API

When we look at the existing DMA API, we see that it implements a clear model that handles memory sharing between the CPU and a single device. The DMA API is particularly careful about how "ownership" — with respect to the CPU cache — of a buffer is handled, in order to avoid data corruption. By default, memory is considered part of the CPU's virtual memory space and the CPU is the de-facto owner of it. It is assumed that the CPU may read and write the memory freely; it is only when allowing a device to do a DMA transaction on the memory that the ownership of the memory is passed to the device.

The DMA API describes two types of memory architecture, called "consistent" and "non-consistent" (or sometimes "coherent" and "non-coherent"). With consistent-memory architectures, changes to memory contents (even when done by a device) will cause any cached data to be updated or invalidated. As a result, a device or CPU can read memory immediately after a device or CPU writes to it without having to worry about caching effects (though the DMA API notes that the CPU cache may need to be flushed before devices can read). Much of the x86 world deals with consistent memory (with some exceptions, usually dealing with GPUs); in the Arm world, however, we see many devices that are not coherent with the CPU and are thus non-consistent-memory architectures. That said, as Arm64 devices gain functionality like PCIe, there can often be a mix of coherent and non-coherent devices on a system.

With non-consistent memory, additional care has to be taken to properly handle the cache state of the CPU to avoid corrupting data. If the DMA API's ownership rules are not followed, the device could write to memory without the CPU's knowledge; that could cause the CPU to use stale data in its cache. Similarly, the CPU could flush stale data from its cache to overwrite the newly device-written memory. Data corruption is likely to result either way.

If you're interested in learning more, Laurent Pinchart's ELC 2014 presentation on the DMA API is great; the slides [PDF] are also available.

Thus, the DMA API rules help establish proper cache handling in a generic fashion, ensuring that the CPU cache is invalidated if the device is writing to the memory and flushed before the device reads the memory. Normally, these cache operations are done when the buffer ownership is transferred between the CPU and the device, such as when the memory is mapped and then unmapped from the DMA device (via functions like dma_map_single()).

From the DMA API perspective, sharing buffers with multiple devices is the same as sharing with a single device, except that the sharing is done in a series of discrete operations. The CPU allocates a buffer, then passes ownership of that buffer to the first device (potentially flushing the CPU cache). The CPU then allows the device to do the DMA and unmaps the buffer (potentially invalidating the CPU cache) when the operation is complete, bringing ownership back to the CPU. The process is then repeated for the next device, and for each device after that.
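
That per-device hand-off can be sketched in kernel-style C pseudocode; the device pointers, buffer, and direction here are placeholders for illustration, with error handling omitted:

    /* CPU owns the buffer; fill it, then hand it to the first device.
     * dma_map_single() may flush the CPU cache as part of the hand-off. */
    dma_addr_t addr = dma_map_single(dev1, buf, size, DMA_TO_DEVICE);
    /* ... device 1 performs its DMA transaction ... */
    dma_unmap_single(dev1, addr, size, DMA_TO_DEVICE);  /* ownership back to CPU */

    /* Repeat for the next device, paying the cache-maintenance cost
     * again even though the CPU never touched the buffer in between. */
    addr = dma_map_single(dev2, buf, size, DMA_TO_DEVICE);
    /* ... device 2 performs its DMA transaction ... */
    dma_unmap_single(dev2, addr, size, DMA_TO_DEVICE);

Each map/unmap pair is a full ownership transfer, which is where the redundant cache operations described below come from.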

The problem here is that those cache operations add up, especially when the CPU isn't actually touching the buffer in between. Ideally, if we were sharing the buffer with a series of cache-incoherent devices, the CPU cache would be initially flushed, then the buffer could be used by the devices in series without additional cache operations. The DMA API does allow for some flexibility here, so there are ways to have mapping operations skip CPU syncing; there are also the dma_sync_*_for_cpu/device() calls, which allow explicit cache operations to be done while there is an existing mapping. But these are "expert-only" tools, provided without much guidance, that rely on drivers taking special care when using these optimizations.
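
As a rough sketch of what those explicit-sync tools allow (a hypothetical flow, not taken from any real driver; names are placeholders and error handling is omitted):

    /* Keep one long-lived mapping and sync explicitly, only when
     * ownership really changes, rather than on every map/unmap. */
    dma_addr_t addr = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);

    /* CPU writes are made visible to the devices once, up front. */
    dma_sync_single_for_device(dev, addr, size, DMA_TO_DEVICE);
    /* ... a series of devices process the buffer; no CPU access,
     *     so no further cache maintenance is needed ... */

    /* Only when the CPU actually needs to read the results: */
    dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);
    /* ... CPU reads the buffer ... */

    dma_unmap_single(dev, addr, size, DMA_BIDIRECTIONAL);

The danger, of course, is that nothing enforces that the CPU really stays away from the buffer between the sync calls; get that wrong and the corruption scenarios described above reappear.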

DMA-BUFs

DMA-BUFs were introduced to provide a generic way for applications and drivers to share a handle to a memory buffer. The DMA-BUFs themselves are created by a DMA-BUF exporter, which is a driver that can allocate a specific type of memory and that also provides hooks to handle mapping and unmapping the buffer in various ways for the kernel, user space, or devices.

The general usage flow of DMA-BUFs for a device is as follows (see the dma_buf_ops structure for more details):

dma_buf_attach()
Attaches the buffer to a device (that will use the buffer in the future). The exporter can try to move the buffer if needed to make it accessible to the new device or return an error. The buffer can be attached to multiple devices.

dma_buf_map_attachment()
Maps the buffer into an attached device's address space. The buffer can be mapped by multiple attachments.

dma_buf_unmap_attachment()
Unmaps the buffer from the attached device's address space.

dma_buf_detach()
Signals that the device is finished with the buffer; the exporter can do whatever cleanup it needs.
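
Put together, an importing driver's use of these hooks might look roughly like the following sketch (error handling omitted; dev is assumed to be the importer's device and fd a file descriptor received from user space):

    /* Look up the DMA-BUF from a file descriptor passed in by user space. */
    struct dma_buf *dmabuf = dma_buf_get(fd);

    /* Attach our device; the exporter may relocate the buffer if needed. */
    struct dma_buf_attachment *att = dma_buf_attach(dmabuf, dev);

    /* Map the buffer into the device's address space. */
    struct sg_table *sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
    /* ... program the device with the addresses in sgt and run the DMA ... */
    dma_buf_unmap_attachment(att, sgt, DMA_BIDIRECTIONAL);

    /* Done with the buffer; let the exporter clean up. */
    dma_buf_detach(dmabuf, att);
    dma_buf_put(dmabuf);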

If we were looking at this from the classic DMA API perspective, we would consider a DMA-BUF to be normally owned by the CPU. Only when dma_buf_map_attachment() was called would the buffer ownership transfer to the device (with the associated cache flushing). Then on dma_buf_unmap_attachment(), the buffer would be unmapped and ownership would return to the CPU (again with the proper cache invalidation required). This in effect would make the DMA-BUF exporter the entity responsible for complying with the DMA API rules of ownership.

The trouble with this scheme arises with a buffer pipeline consisting of a number of devices, where the CPU doesn't actually touch the buffer. Following the DMA API and calling dma_map_sg() and dma_unmap_sg() on each dma_buf_map_attachment() and dma_buf_unmap_attachment() call results in lots of cache-maintenance operations, which dramatically impacts performance. This was viscerally felt by ION users after a cleanup series that landed in 4.12 and made ION use the DMA API properly. Previously, ION had lots of hacks and was not compliant with the DMA API, resulting in buffer corruption in some cases; see the slides from Laura Abbott's presentation for more details. This compliance cleanup caused a dramatic performance drop for ION users, which resulted in some vendors reverting to the 4.9 ION code in their 4.14-based products, and others creating their own hacks to improve performance.

So how can we have DMA-BUF exporters that better align with the DMA API, but do so with the performance needed for modern devices when using buffer pipelines with multiple devices? In the second part of this article, we will continue discussing some of the unique semantics and flexibility in DMA-BUF that allows drivers to potentially avoid this performance impact (by going somewhat "off-road" from the DMA API usage guidelines), as well as the downsides of what that flexibility allows. Finally, we'll share some thoughts as to how these downsides might be avoided.

Comments (8 posted)

Home Assistant, the Python IoT hub

By John Coggeshall
June 10, 2020

The Internet of Things (IoT) push continues to expand as tens of thousands of different internet-enabled devices from light bulbs to dishwashers reach consumers' homes. Home Assistant is an open-source project to make the most of all of those devices, potentially with no data being shared with third parties.

Generally speaking, IoT devices are most useful when operating in coordination with each other. While decentralized systems are possible, keeping all of these devices coordinated in the effort to make a "smart house" generally uses a centralized server (or "smart hub") — a reality not lost on Apple, Amazon, and Google among others, who all provide various solutions to the problem.

For the privacy and security minded however, those solutions are problematic because they send private data to centralized servers for processing simply to turn on your lights. That these solutions are also closed-source black boxes does not help matters. Home Assistant is an Apache-licensed project to address this problem, started by Paulus Schoutsen, to provide a private centralized hub for automation and communication between a home's various IoT devices. Schoutsen is also founder of Nabu Casa, Inc., which provides commercial backing for the project. The platform is popular and active, approaching its 600th release from nearly 2,000 contributors to its core on GitHub. Integrations between various IoT platforms and Home Assistant are provided by its 1,600 available components written by the various contributors.

Meet Home Assistant

Home Assistant strives to be a user-friendly tool, with recent releases investing significantly in ease-of-use and front-end features. This includes technologies like auto-discovery of a wide range of IoT devices, built-in configuration editors, and an interface editing tool called Lovelace. Under the hood, Home Assistant is a Python 3 code base with configurations managed through YAML files for those who prefer to work from their own editor instead.

From a developer's perspective, Home Assistant has a well-documented API and architecture. It provides a web-based interface to manage the home, and mobile applications for both Android and iOS. Architecturally, it is an event-driven platform and state machine where various "entities" are managed by the system. Entities are arbitrary "things" and take many forms. They can be concrete devices, such as a particular light bulb, internet-based information like a weather report, or even a boolean indicating the presence of a specific phone's MAC address on the local network.

In Home Assistant, components expose entities and define services. Not to be confused with API web services, Home Assistant services enable actions related to the nature of the component, which may or may not use an underlying API. These components, also written in Python, are really the heart of the Home Assistant ecosystem. Components are responsible for integrating with third-party technologies (such as a smart bulb provider's API) and for making the necessary API calls to keep the device in sync with the desired state stored in a Home Assistant entity (and vice versa). Components are not limited to remote APIs and devices, either; there are components that expose services for shell commands, standalone Python scripts, and other interactions with the local machine as well.

Components are downloaded and installed by Home Assistant on demand from the library managed by the Home Assistant project. If someone hasn't already written a component for a particular API integration or task, a custom component can be implemented in Python for your specific installation.

Tying together entities and services are automations, which are triggered based on events related to the state of one or more entities. Automations in turn perform actions by calling Home Assistant services to accomplish things such as turning on a light bulb. They also support conditional execution, one example being the creation of an automation that only executes on a weekend.

Getting started

For those who would like to try Home Assistant, the recommended hardware is a Raspberry Pi 4 Model B with one of the provided SD card images available here. For those who would like to install it on an existing, more traditional Linux host, the options available are a Docker image or installation using pip.

As previously stated, configuration values are managed in YAML. Editing the configuration can be done via the provided web interface or by directly editing the instance's configuration.yaml. A separate secrets.yaml holds credentials and other sensitive information. Since the configuration is a collection of YAML files, it is convenient (and recommended) to track changes in a Git repository.

As an example of how configuration works, we are going to use the ecobee component. This component enables the integration of data, as well as API calls, from an Ecobee thermostat into Home Assistant entities and services. It starts with defining the component in configuration.yaml:

    ecobee:
        api_key: !secret ecobee_api_key

Where ecobee_api_key is defined in secrets.yaml. Adding the ecobee component and restarting Home Assistant will automatically download and install the necessary packages for the component to function.

With the component enabled, entities for sensors can be created to extract specific pieces of information. For example, below is the definition of a sensor representing the HVAC fan that the Ecobee controls:

    sensor:
      - platform: template
        sensors:
          my_hvac_fan:
            value_template: "{{ states.climate.home.attributes.fan }}"

Above we define a new sensor using the template platform, indicating the value will come from a template. Platforms in Home Assistant categorize the types of configuration options and behaviors expected. For the template platform, it tells Home Assistant to extract the sensor's value from value_template.

It is worth noting that value_template uses the Jinja2 template engine under the hood, enabling sensors to normalize or implement logic for sensor values. In this example, we extract the value from an attribute of the state provided by the ecobee component. Templates can also be more complicated as needed (taken from my own setup of Home Assistant).
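
As an illustration of that flexibility, a hypothetical sensor (not from the linked configuration; the entity and attribute names follow the earlier example) could use Jinja2 logic to map the raw fan attribute onto friendlier values:

    sensor:
      - platform: template
        sensors:
          my_hvac_fan_status:
            value_template: >-
              {% if state_attr('climate.home', 'fan') == 'on' %}
                running
              {% else %}
                idle
              {% endif %}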

Once defined, this sensor's value can be referenced by the entity identifier sensor.my_hvac_fan. Entities can be rendered in the web and mobile interfaces, their changes can be stored and graphed over time, and they can be used as triggers for automations. Home Assistant provides a reasonable default rendering for an entity value in most cases, and that rendering can be customized extensively as desired.

Automations provide the rules and actions that tie entities together. Each automation has three main segments: the trigger(s) for the automation, the condition(s) of the automation, and the action(s) to take if the conditions are met.

A typical automation looks like this:

    automation:
      - alias: 'Light on in the evening'
        trigger:
          - platform: sun
            event: sunset
            offset: '-01:00:00'
          - platform: state
            entity_id: person.john
            to: 'home'
        condition:
          - condition: time
            after: '16:00:00'
            before: '23:00:00'
        action:
          service: homeassistant.turn_on
          entity_id: group.living_room

In this automation, we are using the built-in sun entity to trigger the automation one hour before sunset (based on the time zone and location information configured in Home Assistant), along with a person entity (me). In Home Assistant, a person entity is, among other things, a collection of device trackers used in presence tracking. Like some of the core entities in Home Assistant, the person entity type is an abstraction of values taken from other entity sources. For example, the location of a person entity can be sourced from the devices connected to the LAN, the mobile app on a phone, or other similar sources assigned to that person.

Returning to the example, the automation has two triggers: one hour before sunset, and the person arriving home. When either trigger fires, a check against the conditions is performed. If the conditions are met (in this case, the current time is within a range), then the action(s) are executed. The action for this automation is the homeassistant.turn_on service, which is given an elsewhere-defined target entity group.living_room to turn on. Note also that automations are themselves entities, and can be manipulated by other automations, which adds to their versatility. One common technique is to enable or disable one automation based on the actions of another automation.
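
As a hedged illustration of that last technique (the input_boolean entity and the automation's entity identifier here are hypothetical), one automation can switch another off via the automation.turn_off service:

    automation:
      - alias: 'Disable evening lights while on vacation'
        trigger:
          - platform: state
            entity_id: input_boolean.vacation_mode
            to: 'on'
        action:
          service: automation.turn_off
          entity_id: automation.light_on_in_the_evening

A companion automation triggered on the same boolean returning to 'off' could call automation.turn_on to restore the original behavior.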

Privacy and security

A key benefit of Home Assistant is not that it assures privacy for a smart home, but rather that, unlike the commercial alternatives, it gives the user a choice in the matter. That said, while Home Assistant provides the framework to implement a smart-home hub completely free of the cloud, note that many IoT devices provide no way to control them other than through the vendor's cloud. Some products are better than others in this regard, so it is an important consideration to keep in mind when choosing the various devices to use with the platform.

The other side of the coin is security: there can be no privacy without security. The project provides various recommendations to ensure that an instance is secure, including keeping pace with releases and security fixes of the code base itself. Since Home Assistant is a web application (using APIs to communicate with the mobile applications over HTTP), all of the standard security practices normally encouraged apply. Accessing Home Assistant remotely requires remote access to the host machine; for the mobile applications to function, Home Assistant must either be exposed on a public IP address or otherwise be reachable through a secure tunnel.

Home Assistant does make efforts to ensure that the platform is secure. Security fixes appear regularly (but not alarmingly often) in releases, and the project documents how to report any newly discovered security vulnerability.

That being said, Home Assistant is typically not the only technology powering the smart-home stack. Message brokers such as an MQTT server and database packages are commonly needed, and due consideration to their security is important as well. Recent reporting of nearly 50,000 MQTT servers being exposed to the public indicates that improper configuration of secondary services is a major concern for IoT deployments in general.

Learning more

If Home Assistant looks like a promising solution in the smart-home space, there are plenty of resources to get started with. Home Assistant has a large community of users to learn from and maintains a brisk release cycle of new features. The user documentation is generally acceptable and serves as a useful reference as well. For those with the Python skills to get involved, the developer portal has everything needed to get started writing components, and there is a popular Discord chat room for questions.

Home Assistant appears to be a lively project and worth keeping an eye on as a front-runner in the open-source IoT space. At least for the moment, it does require a meaningful amount of technical know-how to set up, though that seems likely to change considering the investments the project is making to create a more user-friendly experience. Readers who have avoided the home-automation bandwagon due to privacy concerns will find the project worth a look, as will anyone interested in controlling the technology behind their own smart hub.

Comments (16 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds