
Leading items

Welcome to the LWN.net Weekly Edition for October 4, 2018

This edition contains the following feature content:

  • Freedesktop.org: its past and its future: a report from XDC on where fd.o has been and where it may be going.
  • Revenge of the modems: Android phones that expose AT commands over their USB ports.
  • XFS, LSM, and low-level management APIs: should security modules mediate low-level filesystem ioctl() operations?
  • Device-to-device memory-transfer offload with P2PDMA: a kernel interface for PCI peer-to-peer DMA transfers.
  • OpenBSD's unveil(): a simple mechanism for restricting a process's view of the filesystem.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Freedesktop.org: its past and its future

By Jake Edge
October 3, 2018

X.Org Developers Conference

At the 2018 X.Org Developers Conference (XDC) in A Coruña, Spain, Daniel Stone gave an update on the status of freedesktop.org, which serves multiple projects as a hosting site for code, mailing lists, specifications, and more. As its name would imply, it started out with a focus on free desktops and cross-desktop interoperability, but it lost that focus—along with its focus in general—along the way. He recapped the journey of fd.o (as it is often known) and unveiled some idea of where it may be headed in the future.

The talk was billed with Keith Packard as co-presenter, but Packard could not make it to XDC; Stone said that he sent Packard a copy of the slides and heard no complaints, so he left Packard on the slide deck [PDF]. Stone wanted to start with the history of fd.o, because there are lots of new contributors these days—"which is great"—who may not know about it.

Yesteryear

Fd.o was founded in 2000 by Havoc Pennington, who is a GNOME developer; he was joined by a few more developers from other desktops. The initial goal was to provide a forum for discussing common desktop standards. If you were running KDE applications under GNOME and using Sawfish as a window manager, "you wanted it to be coherent". Early on, the group came up with agreements on various behaviors, including drag-and-drop and copy-and-paste.

[Daniel Stone]

It was a "good time for desktop collaboration", Stone said. Further discussions led to agreements on MIME-type handling, which gave us ways to specify what applications would be opened for different file types. In addition, theming agreements allowed customizing the look of users' desktops in a cross-desktop-friendly way. Eventually, fd.o grew support for hosting source trees managed by the CVS revision control system, so it was not simply a "place to dump some HTML specifications".

In parallel to this, a short-lived site called xwin.org was started in 2003. At that time, the X Consortium "essentially did not exist"; all X development was done internally by companies like Sun and HP or by the XFree86 project. When Stone tried to package XFree86 for Debian, he could not get access to the source repository; the development mailing list was closed to outsiders as well.

Eventually, XFree86 changed its license to be hostile to other projects. At that point, Packard and some others started xwin.org to discuss what to do. "Everyone jumped on board" the xwin.org train at that point. From his perspective as a packager, it was great to be able to talk directly with the developers, he said.

At that point, X was governed by The Open Group (TOG), which offered to host the discussions about the license change and any subsequent plans. The name of that organization is rather ironic, Stone said, since it is a closed, "pay for play" group. In any case, the xwin.org developers merged the TOG X11R6 tree and the last release of XFree86 with a reasonable license; that became X11R6.7. Along the way, the X.Org Foundation was formed and got its independence from The Open Group; he doesn't know if the TOG was just being nice or whether it was "happy to see the back of us", he said with a grin.

At that point, it made sense for what was xwin.org to merge with fd.o due to the "shared interests" of the two; fd.o absorbed all of the xwin.org projects and communities. The X code had forked away from XFree86 and fd.o was hosting all of the X.Org projects, which is still the case today. Meanwhile, the fd.o standardization efforts continued with the "cross-desktop group" (XDG) family of specifications. Out of those efforts came things like .desktop and Autostart files. It was a steady evolution of the kinds of things fd.o had been working on since it was formed.

"It started getting really out of hand" in 2004. CVS hosting was a painful thing for projects to handle on their own, so they were looking for places to put their code; fd.o would accept "basically anyone" onto its servers. It became something like SourceForge. At the same time, though, there was an idea to turn all of the specifications and projects that were being hosted into an "LSB-like platform for desktops". That effort failed; it was 12 or 13 years too early, Stone said, as Flatpak is mostly what was envisioned back then.

In 2006, fd.o was "a bit bereft of direction". It was also a busy time in X development, so Stone and Packard were busy hacking on X and did not have time to look at much else. There was little coherence between the projects that fd.o was hosting. In addition, fd.o had taken on communities that had little or no contact with fd.o as an organization. By 2009, fd.o got "very lost". Anyone who had vaguely expressed interest in helping administer the site was given root privileges. There was no way to discuss project-specific problems; in some cases there was no person to even contact about a particular project. The fd.o project was being run as an anarchistic cooperative; "member" projects were off doing their own thing.

The project started to degenerate; its infrastructure became quite unreliable. There were disk and machine failures that goaded him and others into action. In 2012, Tollef Fog Heen was sponsored part-time for system administration work. Stone started to work on some system administration and Packard began working with the community of projects more actively. Others also helped to administer the fd.o infrastructure.

A few years down the road, the project was a bit "dazed and confused". GitHub was well established by 2015 and it was unclear what fd.o offered over GitHub, GitLab, and other similar hosting sites beyond that fd.o was "something vaguely about the desktop". Projects hosted at fd.o were left alone with little or no communication from the fd.o umbrella project. All of the time that fd.o had was spent "treading water and paying down technical debt". Various projects had moved away from fd.o years before, but had never told fd.o about it; it was "fair enough, we had not talked to them in that time" either, Stone said. But it was also a time where fd.o started experimenting with new services such as Jenkins and Phabricator.

2017 was an inflection point for fd.o, he said. The infrastructure was stable and the administrators were "aggressively making things work". There was finally time and bandwidth to start thinking more about the fd.o community. There were worries about legal liability, so work started on adopting a code of conduct. Clear copyright violations, "things that would obviously get my door kicked in", were removed from the site. At the time, GNOME was looking around for hosting software for its code; some long discussions with GNOME developers about GitLab took place. Eventually, GNOME decided to use GitLab and fd.o followed suit.

Today

So, now in 2018, fd.o is an open, neutral collaboration space; there is no danger that fd.o will ever be acquired, Stone said. It has a loose coalition of projects and is made up of free services of various sorts. Fd.o has "consistently had good intentions" over the years, but it has a mixed track record in terms of execution.

Depending on how you count, fd.o has around 42 active projects at this point; all are "fairly active and healthy". There are 28 dormant projects, meaning that there has been no activity in their Git repositories for five years or more. There are also 30 extinct projects that never moved from CVS to Git, as well as 31 departed projects; 24 of the latter moved to GitHub.

These days, fd.o can offer "all the usual services you expect" for project hosting. That includes code hosting, issue tracking, web pages, and mailing-list hosting. It does not, yet, have continuous integration (CI) features or a modern code-review tool. In addition, fd.o offers apologetic responses to requests it can't fulfill at this point; the fd.o administrators will be "quite nice" when they have to say no, Stone said.

Some modernization is in order, however. No one really seems to like Bugzilla and some projects do not want to do patch review on a mailing list. Beyond that, CI is not really optional anymore. Bugzilla and the lack of CI were the main reasons that projects moved from fd.o to GitHub, he said. In addition, adding an account to fd.o services required filing a bug and waiting for an administrator to make it happen; that will be changing as well.

As part of the long discussion with GNOME, Phabricator was considered, but it had some user-interface issues and looked like a project that could become more closed over time. GitLab is an open project that is run using an open bug tracker and code repositories. The GitLab developers are willing to hear suggestions from GNOME and fd.o and genuinely care about open source and open projects, he said. It also "runs really well, which helps".

The existing fd.o machines are not up to running GitLab, however. The company behind GitLab, GitLab Inc., has "quite generously offered to sponsor" fd.o for a "year-ish" to get it up and running in the Google cloud. In that time frame, fd.o will be looking for other sponsors to continue running there. "Now I know what the cloud is", Stone said with a chuckle. Kubernetes is "very complex" but he got things working by trial and error. Over the years, Portland State University (PSU) has been quite generous in hosting fd.o, but it is time to move on.

At this point, almost all of the projects have been migrated to GitLab. There are still some areas that need work, including the GStreamer specifications that need to be moved over and a couple of small projects that will be migrated soon. Beyond that, the Direct Rendering Manager (DRM) and kernel mode setting (KMS) projects must wait until the end of the year. The status of the migration is being tracked in an fd.o GitLab issue.

The disk space for fd.o is around 35GB of Git repositories, plus 7GB of Git large-file storage (Git-LFS); CI artifacts weigh in at around 7GB as well. There is 800MB of file uploads. The Docker registry image for CI is 157GB, however. There is less network traffic than he thought there would be, but once the kernel pieces migrate, that will change. For now, the EMEA region uses the most bandwidth at 200GB/month; the US is second at 180GB. That's another reason that moving the servers away from the west coast of the US (PSU) makes sense. He was happy to see that his native country, Australia, was represented in the traffic, using about 5GB/month.

Running the GitLab servers costs around $500/month, which is being covered by GitLab Inc. for now. And fd.o has a few thousand dollars in the bank. Those figures are about as organized as fd.o gets about financial transparency to the community, which is "not great" and needs to be fixed, he said.

Tomorrow

The project needs to clearly define what it is and does—and why. The obligations of fd.o to its member projects, and vice versa, need to be clearly laid out. Perhaps it should not just be open to any project that has stuck it out over the years. There should be an electable board, for example, which is something that the X.Org Foundation has had for a long time. X.Org has existed since 2004 and it is also a member of Software in the Public Interest (SPI). X.Org does great work. It also has great conferences, Stone said; the audience enthusiastically followed the suggestion of "applause" for that in his slides.

So he put up a strawman proposal for fd.o to join forces with X.Org. That leads to some open questions, though. X.Org has a narrowly defined mission (it is good at saying what it does, he said) that does not include all of what fd.o does. There are fd.o projects like ModemManager, Poppler, and (formerly) LibreOffice that are pretty far outside the scope of what X.Org does. Should the X.Org mission be widened or should some fd.o projects be excluded? As long as it is clear what projects can expect from fd.o (and, perhaps, X.Org), there are alternatives for most of the services that are provided.

There is also a question of how much independence fd.o would maintain. If a project is interested in being hosted there, who decides whether to allow it (or not)? Otherwise, moving fd.o under X.Org seems to make quite a bit of sense, he said. It would solve all of the community issues that fd.o has had without duplicating all of what X.Org has done.

No matter how that plays out, there is a division of responsibilities needed. A board cannot do it all, but volunteers don't always work out either. Two specific areas need to have people with delegated power and responsibility: infrastructure administration and handling code-of-conduct reports. Currently, the fd.o code-of-conduct committee consists of Stone, Packard, and Fog Heen, but fd.o wants projects to be enforcing their own codes; the committee is simply meant to be the final place for escalating problems that cannot be resolved by the projects.

There is also the question about transparency in handling code-of-conduct reports. The recent uproar surrounding the kernel code of conduct highlights the need to determine what should be published, how often, and where. As a starting point, he gave attendees a transparency report for 2018. The fd.o committee has handled three "non-troll" abuse reports so far this year. No action was taken on two of those, but there was a private discussion with the person reported in the other. In that discussion, it was suggested that the person might want to change how they deal with other people. No further action has been taken in that case.

As with most projects, fd.o can use some help. With a grin, Stone asked: "Have you seen our web site?" The cloud is still something of a mystery and more help would be welcome there. The listed specifications and their current status do not really reflect reality at this point and could use a cleanup. Helping the member projects get up to speed on best practices would be valuable. There are fd.o issues in GitLab that need addressing as well.

The Q&A session was replete with various "thank-yous" for all of what fd.o has done over the years. When asked what he thought fd.o would be doing beyond just GitLab hosting in 3-5 years, Stone pointed to mailing-list hosting. Having a high-volume mailing-list server in 2018 is bad: "the internet hates you". That is something that fd.o will probably still have to handle itself.

The X.Org Foundation is looking at making some fairly small by-law changes to allow it to take on fd.o, foundation secretary Daniel Vetter reported later in the conference. The wording of that is being worked on; once that is complete, it will go out to members for a vote. That vote might take place before the end of the year, Stone told me after his talk, so we may know soon if "freedeXtop.Org", as he jokingly called the combination in his slides, is going to come about or whether fd.o will need to find another path to better organization, more transparency, and a thriving community. Whatever the result of the merge proposal, it is good to see freedesktop.org emerging from years in the wilderness.

[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to A Coruña for XDC.]

Comments (6 posted)

Revenge of the modems

By Jake Edge
October 3, 2018

Back in the halcyon days of the previous century, those with a technical inclination often became overly acquainted with modems—not just the strange sounds they made when connecting, but the AT commands that were used to control them. While the AT command set is still in use (notably for GSM networks), it is generally hidden these days. But some security researchers have found that Android phones often make AT commands available via their USB ports, which is something that can potentially be exploited by rogue USB devices of various sorts.

A paper [PDF] that was written by a long list of researchers (Dave (Jing) Tian, Grant Hernandez, Joseph I. Choi, Vanessa Frost, Christie Ruales, Patrick Traynor, Hayawardh Vijayakumar, Lee Harrison, Amir Rahmati, Michael Grace, and Kevin R. B. Butler) and presented at the 27th USENIX Security Symposium described the findings. A rather large number of Android firmware builds were scanned for the presence of AT commands and many were found to have them. That's not entirely surprising since the baseband processors used to communicate with the mobile network often use AT commands for configuration. But it turns out that Android vendors have also added their own custom AT commands that can have a variety of potentially harmful effects—making those available over USB is even more problematic.

They started by searching through 2018 separate Android binary images (it is not clear how that number came about; perhaps it is simply coincidental) from 11 different vendors. They extracted and decompressed the various pieces inside the images and then searched those files for AT command strings. That process led to a database of 3500 AT commands, which can be seen at the web site for ATtention Spanned—the name given to the vulnerabilities.

In order to further test the reach of these commands, the researchers then ran tests sending the commands to actual devices: 13 Android phones and one Android tablet. Of those, they found that five would enable AT commands on the USB by default; three more had non-default USB configurations with AT-command support, which could be switched on for rooted phones. The others were immune to these particular AT-command-based attacks.

The results were eye-opening—at least for the affected devices. The kinds of operations that can be performed using AT commands are extensive—and worrisome. The set of operations supported varies widely, from firmware flashing to factory reset to making calls (even when the lock screen is up) to extracting information about the device and its configuration. These are, in short, a surefire way to end up with a compromised device, but only if it is plugged into the "wrong" USB device.

It turns out that some of these devices simply present a serial port when the USB cable is plugged in. The other side can just start sending AT commands via that serial port, which are handed off to a user-space daemon on the Android device. That daemon processes the AT commands and sends back any response (e.g. "OK"). Wiring up a user-space process to the external world via the connection used for charging may not have been the wisest choice these device makers could have made.
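
To get a sense of how low the barrier is, below is a minimal sketch of what the other end of the cable might do, assuming a vulnerable phone that enumerates as a USB serial device at /dev/ttyACM0 (the device node and line speed are assumptions for illustration): open the port, send a bare "AT", and print whatever the phone's daemon answers.

    /*
     * Minimal sketch: probe a phone that exposes an AT interface over
     * USB serial. The device node and baud rate are assumptions.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <termios.h>
    #include <unistd.h>

    int main(void)
    {
        const char *port = "/dev/ttyACM0";   /* assumed device node */
        struct termios tio;
        char buf[256];
        ssize_t n;

        int fd = open(port, O_RDWR | O_NOCTTY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Raw mode at an assumed 115200 baud. */
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);
        cfsetspeed(&tio, B115200);
        tcsetattr(fd, TCSANOW, &tio);

        /* AT commands are terminated with a carriage return. */
        write(fd, "AT\r", 3);

        /* An affected device's daemon will typically answer "OK". */
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("response: %s\n", buf);
        }

        close(fd);
        return 0;
    }

On an unaffected device, the port either never appears or the daemon simply ignores the command.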

While there are rogue USB devices out there, and one can imagine ways to engineer someone into plugging one in, there is a more common case where phones are routinely plugged in, sometimes to devices that are not necessarily secure or well-policed: battery charging. There are multiple situations where one might take advantage of a charge from a potentially dodgy USB power source, including at power kiosks (e.g. at airports) or by borrowing someone else's charger. Getting your device charged is not generally seen as a risky move but, for some devices, perhaps it should be. It is probably worth considering a USB Condom or similar device for charging safety.

Some of these vulnerabilities have been reported before, but those "analyses have been ad-hoc and narrowly focused on specific smartphone vendors". What the ATtention Spanned researchers have done, then, is to take a more systematic look across a wider slice of the Android marketplace to see what kinds of problems can be found—the results were not encouraging. One phone model can be temporarily bricked with a reset AT command (i.e. normal recovery modes would not bring it back but, ironically, using two AT commands would reboot the phone), for example. In addition, several of the phones would make calls using the "ATD" command, even when the screen is locked. There is a whole raft of information that can be leaked from some devices via AT commands; these include things like SIM card details, /proc and /sys data, the IMEI number, software versions, and more.

While it undoubtedly makes for an excellent way to help debug and troubleshoot phones and other devices, it is a little hard to understand why device makers would leave that path enabled on production hardware. Perhaps in the same way that users don't think of charging as a path to system compromise, device makers were also thinking past the problem. USB is a useful way to recharge these devices, but that shouldn't make folks lose track of what else it can do.

One might guess that device makers will be disabling USB access to their AT commands before long, but the researchers also note some other, potentially disturbing possibilities. In the FAQ on the web site, they note: "We did not investigate remote AT attack surface, but the first places we would look would be the BlueTooth interface and the baseband." One hopes that this might alert device makers so that they investigate and lock down these possibilities if needed. More cynical observers might be forgiven for guessing that it may take another research paper or two before they actually get around to doing so, however.

Comments (8 posted)

XFS, LSM, and low-level management APIs

By Jonathan Corbet
October 2, 2018
The Linux Security Module (LSM) subsystem allows security modules to hook into many low-level operations within the kernel; modules can use those hooks to examine each requested operation and decide whether it should be allowed to proceed or not. In theory, just about every low-level operation is covered by an LSM hook; in practice, there are some gaps. A discussion regarding one of those gaps — low-level ioctl() operations on XFS filesystems — has revealed a thorny problem and a significant difference of opinion on what the correct solution is.

In late September Tong Zhang pointed out that xfs_file_ioctl(), the 300-line function that dispatches the various ioctl() operations that can be performed on an XFS filesystem, was making a call to vfs_readlink() without first consulting the security_inode_readlink() LSM hook. As a result, a user with the privilege to invoke that operation (CAP_SYS_ADMIN) could read the value of a symbolic link within the filesystem, even if the security policy in place would otherwise forbid it. Zhang suggested that a call to the LSM hook should be added to address this problem.
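
As a purely illustrative sketch (the helper name and placement are assumptions, not the patch that was actually posted), the kind of check Zhang had in mind would look something like this:

    /*
     * Sketch only: consult the LSM before performing the link read on
     * behalf of the ioctl() caller; this is not the posted patch.
     */
    static int xfs_ioc_readlink_checked(struct dentry *dentry,
                                        char __user *buf, int buflen)
    {
        int error;

        /* Give any loaded security module a chance to deny the read. */
        error = security_inode_readlink(dentry);
        if (error)
            return error;

        /* Only then hand the link target back to user space. */
        return vfs_readlink(dentry, buf, buflen);
    }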

XFS developer Dave Chinner disagreed, saying that such operations are below the level that security modules should be operating at:

I really don't think these interfaces are something the LSMs should be trying to intercept or audit, because they are essentially internal filesystem interfaces used by trusted code and not general user application facing APIs.

Many of the operations carried out by XFS ioctl() commands, such as deduplication of file contents, fast backups, and defragmentation, require bypassing all protections that apply at higher levels; an LSM just isn't relevant at this level, Chinner argued.

Unsurprisingly, others were not entirely on board with this point of view. Stephen Smalley, the maintainer of the SELinux LSM, said that "if they are interfaces exposed to userspace, then they should be mediated via LSM". Alan Cox added that "in a secure environment low level complete unrestricted access to the file system is most definitely something that should be mediated". Neither seemed to think that there was anything particularly special about ioctl() operations on XFS filesystems. And there could be security benefits to proper LSM coverage; Cox described a scenario where these operations are fully mediated:

With a proper set of LSM checks you can lock the filesystem management and enforcement to a particular set of objects. You can build that model where for example only an administrative login from a trusted console may launch processes to do that management.

There are, according to Chinner, a few significant problems with this particular vision of how the system should work. One of those is that the kernel is full of ioctl() operations that carry out privileged tasks. In theory, the security_file_ioctl() hook mediates access to those operations, but there are vast numbers of them and no security module can be expected to recognize and properly reason about even a small fraction of them. Outside of the context where a given ioctl() command is implemented, it is difficult to make any sense out of what that command will do or what security policies should apply to it. ioctl() is, by its nature, a black box that can do just about anything.

Since there are so many of these operations, just adding LSM checks to the XFS operations, even if it could be done correctly, would not solve the problem. Chinner pointed out that the device-mapper ioctl() operations are also only protected by a CAP_SYS_ADMIN check, with no LSM involvement. As a result, an attacker with root privileges could simply remap blocks underneath the filesystem. Many operations at the block-device level also only check for CAP_SYS_ADMIN. As a result, he said:

The storage stack is completely dependent on a simplistic layered trust model and that root (CAP_SYS_ADMIN) is god. The storage trust model falls completely apart if we don't have a trusted root user to administer all layers of the storage stack.

This trust model has created problems at other times, he added. The CAP_SYS_ADMIN checks have had to be tightened up to only allow operations when the user is privileged in the initial namespace; otherwise user namespaces create problems. The issues around unprivileged mount operations also trace back to this trust model. Fixing this problem now is far from straightforward.

There are also some practical issues; adding an LSM check now risks breaking scripts in the wild, creating regressions that would have to be reverted. Some regressions could be severe:

As such, there are very few trusted applications have "massive data loss" as a potential failure mode if an inappropriately configured LSM is loaded into the kernel. Breaking a HSM application's access to the filesystem unexpectedly because someone didn't set up a new security policy correctly brings a whole new level of risk to administrating sites that mix non-trivial storage solutions with LSM-based security.

Ted Ts'o suggested that anybody who wants to control low-level filesystem operations with security modules should sit down and specify how the whole thing would work: "a formal security model, and detail *everything* that would need to be changed in order to accomplish it". The resulting changes, he predicted, would have to be made in "a really huge number of places". Chances are, this request will bring the conversation to a close in the near future. While there may be numerous developers who would like to see the system's behavior changed in this regard, most of them are likely to shy away once they realize how much work would be required to do the job in a way that would actually increase security without causing regressions. It may well be many years too late to try to add that level of security to Linux.

Comments (5 posted)

Device-to-device memory-transfer offload with P2PDMA

October 2, 2018

This article was contributed by Marta Rybczyńska

One of the most common tasks carried out by device drivers is setting up DMA operations for data transfers between main memory and the device. Often, data read into memory from one device will be immediately written, unchanged, to another device. Common examples include carrying an image between the camera and the screen on a mobile phone, or downloading files to be saved on a disk. Those transfers have an impact on the CPU even if it does not use the data directly, due to higher memory use and effects like cache thrashing. There are cases where it is possible to avoid using system memory completely, though. A patch set that addresses this case for PCI devices using peer-to-peer (P2P) transfers, posted by Logan Gunthorpe with contributions from Christoph Hellwig and Steve Wise, has been in the works for some time; it focuses on offering an offload option for the NVMe fabrics target subsystem.

PCI peer-to-peer memory concepts

PCI devices expose memory to the host system in the form of memory regions defined by base address registers (BARs). Those regions are mapped into the host's physical memory space; all of them share the same address space, and PCI DMA operations can use those addresses directly. It is thus possible for a driver to configure a PCI DMA operation to perform transfers between the memory zones of two devices while bypassing system memory completely. The memory region might be on a third device, in which case two transfers are still required, but even then the advantages remain: lower load on the system CPU, decreased memory usage, and possibly lower PCI bandwidth usage. In the specific case of the NVMe fabrics target [PDF], the data is transferred from a remote direct memory access (RDMA) network interface to a special memory region, then from there directly to the NVMe drive.

The difficulty is in obtaining the addresses and communicating them to the devices. This has been solved by introducing a new interface, called "p2pmem", that allows drivers to register suitable memory zones, discover zones that are available, allocate from them, and map them to the devices. Conceptually, drivers using P2P memory can play one or more of three roles: provider, client, and orchestrator:

  • Providers publish P2P resources (memory regions) to other drivers. In the NVMe fabrics implementation, this is done by the NVMe PCI driver, which exports memory zones of the NVMe devices.
  • Clients make use of the resources, setting up DMA transfers from and to them. In the NVMe fabrics implementation there are two clients: the NVMe PCI driver accepts buffers in P2P memory, and the RDMA driver uses it for DMA operations.
  • Finally, orchestrators manage flows between providers and clients; in particular, they collect the list of available memory regions and choose the one to use. In this implementation there are also two orchestrators: NVMe PCI again, and the NVMe target that sets up the connection between the RDMA driver and the NVMe PCI device.

Other scenarios are possible with the proposed interface; in particular, the memory region may be exposed by a third device. In this case two transfers will still be required, but without the use of the system memory.

Driver interfaces

For the provider role, registering device memory as being available for P2P transfers takes place using:

    int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
				u64 offset);

The driver specifies the parameters of the memory region (or parts of it). The zone will be represented by ZONE_DEVICE page structures associated with the device. When all resources are registered, the driver may publish them to make them available to orchestrators with:

    void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
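
As a hedged illustration of the provider role, a driver's probe() routine might register and publish part of a BAR roughly as follows; the driver name, BAR number, and size are assumptions chosen for the example:

    /*
     * Sketch of the provider role (illustrative driver and BAR choice):
     * register 1MB of BAR 4 as P2P memory, then publish it so that
     * orchestrators can find it with pci_p2pmem_find().
     */
    static int example_provider_probe(struct pci_dev *pdev,
                                      const struct pci_device_id *id)
    {
        int rc;

        /* Expose the first 1MB of BAR 4, starting at offset 0. */
        rc = pci_p2pdma_add_resource(pdev, 4, SZ_1M, 0);
        if (rc)
            return rc;

        /* Make the registered memory visible to orchestrators. */
        pci_p2pmem_publish(pdev, true);
        return 0;
    }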

In the orchestrator role, the driver must create a list of all clients participating in a specific transaction so that a suitable range of P2P memory can be found. To that end, it should build that list with:

    int pci_p2pdma_add_client(struct list_head *head, struct device *dev);

The orchestrator can also remove clients with pci_p2pdma_remove_client() and free the list completely with pci_p2pdma_client_list_free():

    void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
    void pci_p2pdma_client_list_free(struct list_head *head);

When the list is finished, the orchestrator can locate a suitable memory region available for all client devices with:

    struct pci_dev *pci_p2pmem_find(struct list_head *clients);

The choice of provider is determined by its "distance", defined as the number of hops in the PCI tree between two devices. It is zero if the two devices are the same, four if they are behind the same switch (up to the downstream port of the switch, up to the common upstream, then down to the other downstream port and the final hop to the device). The closest (to all clients) suitable provider will be chosen; if there is more than one at the same distance, one will be chosen at random (to avoid using the same one for all devices). Adding new clients to the list after locating the provider is possible if they are compatible; adding incompatible clients will fail.

There is a different path for orchestrators that know which provider to use, or that want to use different criteria for the choice. In such a case, the driver should verify that the provider has available P2P memory with:

    bool pci_has_p2pmem(struct pci_dev *pdev);

Then it can calculate the cumulative distance from its clients to the memory with:

    int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients,
			    bool verbose);

When the orchestrator has found the desired provider, it can assign that provider to the client list using:

    bool pci_p2pdma_assign_provider(struct pci_dev *provider,
    				    struct list_head *clients);

This call returns false if any of the clients are unsupported. After the provider has been selected, the driver can allocate and free memory for DMA transactions from that device using:

    void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
    void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);

Additional helpers exist for allocating scatter-gather lists with P2P memory:

    pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
    struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev, unsigned int *nents,
 					     u32 length);
    void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);

When P2P memory is passed for DMA, the addresses must be PCI bus addresses. The users of the memory (clients) need to change their DMA mapping routine to:

    int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
			  enum dma_data_direction dir);

A driver using P2P memory will call pci_p2pdma_map_sg() instead of dma_map_sg(). This routine is lighter; it just adjusts the bus offset, since P2P transfers use bus addresses. To determine which mapping function to use, drivers can rely on this helper:

    bool is_pci_p2pdma_page(const struct page *page);
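
Putting the pieces together, an orchestrator following the default path described above might look roughly like the sketch below; the device pointers, transfer size, and error handling are illustrative assumptions rather than code taken from the patch set:

    /*
     * Sketch of the orchestrator flow: both the RDMA NIC and the NVMe
     * device will touch the P2P memory, so both are added as clients.
     */
    static int example_setup_p2p(struct pci_dev *nic, struct pci_dev *nvme,
                                 struct scatterlist **sgl, unsigned int *nents)
    {
        LIST_HEAD(clients);
        struct pci_dev *provider;
        int rc;

        rc = pci_p2pdma_add_client(&clients, &nic->dev);
        if (rc)
            goto err;
        rc = pci_p2pdma_add_client(&clients, &nvme->dev);
        if (rc)
            goto err;

        /* Find the closest published P2P memory usable by all clients. */
        provider = pci_p2pmem_find(&clients);
        if (!provider) {
            rc = -ENODEV;
            goto err;
        }

        /* Allocate a 64KB scatter-gather list backed by that memory. */
        *sgl = pci_p2pmem_alloc_sgl(provider, nents, SZ_64K);
        if (!*sgl) {
            rc = -ENOMEM;
            goto err;
        }

        /*
         * When requests are later built for either device, this list is
         * mapped with pci_p2pdma_map_sg() rather than dma_map_sg(); a
         * real driver would keep the client list around until it is
         * done with the memory.
         */
        return 0;

    err:
        pci_p2pdma_client_list_free(&clients);
        return rc;
    }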

Special properties

One of the most important tradeoffs the authors faced was deciding which hardware configurations can be expected to work for P2P DMA operations. In PCI, each root complex defines its own hierarchy; some complexes do not support peer-to-peer transfers between different hierarchies and there is no reliable way to find out whether they do (see the PCI Express specification r4.0, section 1.3.1). The authors decided to allow the P2P functionality only if all devices involved are behind the same PCI host bridge; otherwise the user would be required to understand their PCI topology and all of the devices in their system. This restriction may be lifted with time.

Even so, the configuration requires user intervention, as it is necessary to pass the kernel parameter disable_acs_redir, which was introduced in 4.19. This parameter disables certain parts of the PCI access control services functionality that might redirect P2P requests (the low-level details were discussed in depth earlier in the development of this patch set).

P2P memories have special properties: they are I/O memories without side effects (they are not device-control registers) and they are not cache coherent. Code that handles this memory should be prepared for those properties and must avoid passing the memory to code that is not. The iowrite*() and ioread*() helpers are not necessary, as there are no side effects, but if the driver needs a spinlock to protect its accesses, it should use mmiowb() before unlocking. There are currently no checks in the kernel to ensure the correct usage of this memory.

Other subsystem changes

Using P2P transfers with the NVMe subsystem required some changes in other subsystems, too. The block layer gained an additional flag, QUEUE_FLAG_PCI_P2P, to indicate that a specific queue can target P2P memory. A driver that submits a request using P2P memory should make sure that this flag is set on the target queue. There was some discussion about whether an additional check should be added, but the developers decided against it.

The NVMe driver was modified to use the new infrastructure; it also serves as an example of the implementation. The NVMe controller memory buffer (CMB) functionality, which is memory in the NVMe device that can be used to store commands or data, has been changed to use P2P memory. This means that, if P2P memory is not supported, the NVMe CMB functionality won't be available. The authors find that reasonable, since CMB is designed for P2P operations in the first place. Another change is that the request queues can benefit from P2P memory too.

RDMA, which is used for the NVMe fabrics, now uses flags to indicate whether it should use P2P or regular allocations. The NVMe fabrics target itself allows the system administrator to choose to use P2P memory and to specify the memory device using a configuration attribute that can be a boolean or a PCI device name. In the first case, any suitable P2P memory will be used; in the second, memory will be allocated only from the specified device.

Current state

The patch set has been under review for months now (see this presentation [PDF]), and the authors provide a long list of hardware it has been tested with. The pace of this patch set (up to version 8 as of this writing) is fast; it seems that it might be merged in the near future.

The patch set allows use cases that were not possible with the mainline kernel before and opens a window for other use cases (P2P can be used with graphics cards, for example). At this stage, the support is basic and there are numerous modifications and extensions to be added in the future; one direction will be to extend the range of supported configurations. Others would be to hide the API behind the generic DMA operations and use the optimization with other types of devices.

Comments (14 posted)

OpenBSD's unveil()

By Jonathan Corbet
September 28, 2018
One of the key aspects of hardening the user-space side of an operating system is to provide mechanisms for restricting which parts of the filesystem hierarchy a given process can access. Linux has a number of mechanisms of varying capability and complexity for this purpose, but other kernels have taken a different approach. Over the last few months, OpenBSD has inaugurated a new system call named unveil() for this type of hardening that differs significantly from the mechanisms found in Linux.

The value of restricting access to the filesystem, from a security point of view, is fairly obvious. A compromised process cannot exfiltrate data that it cannot read, and it cannot corrupt files that it cannot write. Preventing unwanted access is, of course, the purpose of the permissions bits attached to every file, but permissions fall short in an important way: just because a particular user has access to a given file does not necessarily imply that every program run by that user should also have access to that file. There is no reason why your PDF viewer should be able to read your SSH keys, for example. Relying on just the permission bits makes it easy for a compromised process to access files that have nothing to do with that process's actual job.

In a Linux system, there are many ways of trying to restrict that access; that is one of the purposes behind the Linux security module (LSM) architecture, for example. The SELinux LSM uses a complex matrix of labels and roles to make access-control decisions. The AppArmor LSM, instead, uses a relatively simple table of permissible pathnames associated with each application; that approach was highly controversial when AppArmor was first merged, and is still looked down upon by some security developers. Mount namespaces can be used to create a special view of the filesystem hierarchy for a set of processes, rendering much of that hierarchy invisible and, thus, inaccessible. The seccomp mechanism can be used to make decisions on attempts by a process to access files, but that approach is complex and error-prone. Yet another approach can be seen in the Qubes OS distribution, which runs applications in virtual machines to strictly control what they can access.

Compared to many of the options found in Linux, unveil() is an exercise in simplicity. This system call, introduced in July, has this prototype:

    int unveil(const char *path, const char *permissions);

A process that has never called unveil() has full access to the filesystem hierarchy, modulo the usual file permissions and any restrictions that may have been applied by calling pledge(). Calling unveil() for the first time will "drop a veil" across the entire filesystem, rendering the whole thing invisible to the process, with one exception: the file or directory hierarchy starting at path will be accessible with the given permissions. The permissions string can contain any of "r" for read access, "w" for write, "x" for execute, and "c" for the ability to create or remove the path.

Subsequent calls to unveil() will make other parts of the filesystem hierarchy accessible; the unveil() system call itself still has access to the entire hierarchy, so there is no problem with unveiling distinct subtrees that are, until the call is made, invisible to the process. If one unveil() call applies to a subtree of a hierarchy unveiled by another call, the permissions associated with the more specific call apply.

Calling unveil() with both arguments as null will block any further calls, setting the current view of the filesystem in stone. Calls to unveil() can also be blocked using pledge(). Either way, once the view of the filesystem has been set up appropriately, it is possible to lock it so that the process cannot expand its access in the future should it be taken over and turn hostile.
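
As a concrete illustration, a sketch of how a program (say, a document viewer) might use the call is shown below; the specific paths and permissions are assumptions chosen for the example:

    /*
     * Sketch of typical unveil() usage on OpenBSD: expose only the
     * paths the program needs, then lock the filesystem view.
     */
    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Read-only access to the documents this program will open. */
        if (unveil("/home/user/docs", "r") == -1)
            err(1, "unveil");

        /* Read, write, and create access to its own cache directory. */
        if (unveil("/home/user/.cache/viewer", "rwc") == -1)
            err(1, "unveil");

        /* No further unveil() calls are allowed after this point. */
        if (unveil(NULL, NULL) == -1)
            err(1, "unveil");

        /* Everything outside the unveiled paths is now invisible. */
        if (fopen("/etc/passwd", "r") == NULL)
            printf("as expected, /etc/passwd cannot be opened\n");

        return 0;
    }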

unveil() thus looks a bit like AppArmor, in that it is a path-based mechanism for restricting access to files. In either case, one must first study the program in question to gain a solid understanding of which files it needs to access before closing things down, or the program is likely to break. One significant difference (beyond the other sorts of behavior that AppArmor can control) is that AppArmor's permissions are stored in an external policy file, while unveil() calls are made by the application itself. That approach keeps the access rules tightly tied to the application and easy for the developers to modify, but it also makes it harder for system administrators to change them without having to rebuild the application from source.

One can certainly aim a number of criticisms at unveil() — all of the complaints that have been leveled at path-based access control and more. But the simplicity of unveil() brings a certain kind of utility, as can be seen in the large number of OpenBSD applications that are being modified to use it. OpenBSD is gaining a base level of protection against unintended program behavior; while it is arguably possible to protect a Linux system to a much greater extent, the complexity of the mechanisms involved keeps that from happening in a lot of real-world deployments. There is a certain kind of virtue to simplicity in security mechanisms.

Comments (67 posted)

Page editor: Jonathan Corbet


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds