
Leading items

Welcome to the LWN.net Weekly Edition for August 11, 2022

This edition contains the following feature content:

  • Kolibri and GNOME: Endless OS moves from its home-grown knowledge apps to the Kolibri learning platform.
  • A security-module hook for user-namespace creation: a new LSM hook runs into a maintainer's objections.
  • 6.0 Merge window, part 1: the first set of changes merged for the next major kernel release.
  • An io_uring-based user-space block driver: a look at the new ublk driver.
  • Adding auditing to pip: should the pip-audit tool become a pip subcommand?

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Kolibri and GNOME

By Jake Edge
August 10, 2022

GUADEC

Offline computing and learning was something of a theme at GUADEC 2022 as there were multiple talks by people from the Endless OS Foundation, which targets that use case. Dylan McCall and Manuel Quiñones had a talk on day two about a switch that Endless has made over the last few years away from its home-rolled "knowledge apps" to apps based on the Kolibri learning platform. While Endless has its roots in GNOME, and Kolibri runs well in that environment, the switch will allow Endless to reach users who are not running a GNOME desktop.

The talk would be a project update on some of the work Endless has done to bring the value of the internet to those beyond its reach, McCall said. Some of that also came up in two other talks at GUADEC: one on digital autonomy the previous day and another on an Endless OS project in Oaxaca, Mexico the next day, both given by foundation CEO Rob McQueen. McCall said that they wanted to present what Endless is working on now, especially as it relates to how GNOME makes a "really great platform to develop this type of software". He began by introducing himself a bit: he is from Vancouver, Canada, and has been with Endless since 2019; Quiñones said that he is from the Litoral region of Argentina and has been with Endless since 2017.

The internet provides lots of benefits, Quiñones said, but there are large numbers of people who cannot access it for a variety of reasons. The Endless solution to that problem is to use storage, which is inexpensive these days, to bridge the gap. A small, cheap USB storage device can hold most of Wikipedia plus additional educational content—and even some entertainment options.

Endless has learned that integration with the operating system is important; "people understand apps". So making apps that work as people expect is needed. "What if they can search in their GNOME desktop in the same way that they 'Google'?" Endless OS has various "knowledge apps" that are well integrated with the desktop. The operating system comes with content from a variety of sources and "many apps", including general-purpose tools like Encyclopedia and more specialized apps depending on the needs of the target users.

But, he asked, "if we have something working, why change it?" The answer is "scale"; in order to reach more people and scale out its efforts, Endless needed to shift gears. The existing apps are great for Flatpak-enabled systems, but in order to reach more people, other types of platforms need to be supported as well. In addition, the development pipeline for creating content apps for Flatpak was expensive to maintain; it made creating the apps easy, "but we want to go farther".

Enter Kolibri

It turns out that Endless is not the only organization trying to solve these problems. Learning Equality, which created the Kolibri open-source, offline-first learning platform, has an overlapping mission. Kolibri comes with a "massive library" of freely licensed content and has tools to add more. It can be used in various kinds of deployments and runs on multiple platforms, including in the web browser.

[Manuel Quiñones]

One of the selling points of switching to a Kolibri-based system is the huge content library. The content is organized into channels, which can contain different types of media, such as books or videos; each channel is organized into different topic areas. Some of the channels are enormous, so all or part of them can be downloaded. There are tools for creating new channels and remixing content from existing channels into new ones as well.

The library has video and audio lectures with subtitles in multiple languages, books in EPUB and PDF format, as well as interactive content. For example, educational games and simulations can be included in channels. Content types are implemented as plugins, so new types can be added easily.

Quiñones then passed the microphone to McCall to describe how Endless would be using this rich content library and tools for developing instructional materials. The server is the other half of Kolibri, he said; it can store various channels that can be accessed from a web browser. The channels can be searched in different ways and all of the content types can then be displayed in the browser.

There are also some e-learning features in the server. Classes can be created and learners can be assigned to them. A teacher can monitor the progress of the learners, and so on. Each instance of the server can connect with other instances to share content, class information, and more. There are some public Kolibri servers out there, including one from Endless and another from Learning Equality.

But nobody really wants to run a web server, McCall said; luckily, there is a friendlier solution. The Kolibri server itself is a complicated beast, however. It is a "back-end-heavy application", in part because it is designed to be cross-platform; it supports a lot of operating systems and even Python 2, for some reason, he said. As part of the collaboration on Kolibri, Endless has been working to make the server more flexible and more straightforward to use on a single-user desktop system.

For an end user who does not want to think about running a web server, there is the Kolibri app for GNOME, which handles all of that stuff in the background. The app is based on WebKitGTK and is distributed as a Flatpak. That means it is available for many different Linux distributions, including, of course, Endless OS.

There are still some problems, however. Most of them stem from the fact that Kolibri is a "really big application". It takes a long time to start and stop it, and it stores a lot of data on the system. The latter is of particular importance to Endless because many of its users share their computers with others. If each user installs Kolibri and a bunch of the same content, there is a lot of wasted space. Since it needs to be a Flatpak, the permissions model does not allow sharing the duplicated content between the users.

In order to address that, Endless created a kolibri-daemon to decouple the server from the Flatpak. Kolibri instances talk to the server using the D-Bus interprocess communication (IPC) mechanism. The daemon can run as part of the user session, but that means data may be replicated among different users on the system. For Endless OS, there is a new package that gets installed as part of the immutable OSTree-based system partition to provide a system-wide version of the kolibri-daemon D-Bus service. That means the data on the system can be shared between multiple users without replicating it for each. The Endless OS daemon is small and specific to the OS, but the rest of the code is the same on each installation. This architecture is meant to help isolate platform-specific code.

Endless wants the individual Kolibri channels to look like individual applications, with their own icons; clicking the icon will bring up a window for that channel, Quiñones said. The work was done for Endless OS, but it can also be used on any GNOME desktop; behind the scenes, it is all a single kolibri-gnome instance, but it looks like each is its own app. The channels have also been integrated with the desktop search, so that a search query will bring up results from each of the installed channels. Normally, Kolibri users have to register and log into the system, but that does not make sense for desktop users, who have already logged into their user account. So Kolibri for GNOME effectively logs in automatically to Kolibri.

Using kolibri-daemon as a system service allows Endless to install a big chunk of content in a single place that can be shared between users, McCall said. That allows creating custom versions of Endless OS that have the content that will be useful to a targeted group of offline users, such as those in Oaxaca as was described in the talk on the following day. It makes it easy for Endless and its partners to pick out a collection of content from the library and/or create their own, and then get it into the hands of the communities they are trying to reach.

Lessons

Endless has had similar talks at GUADEC before, McCall said; the Endless developers have built much of this infrastructure without using Kolibri "and it actually worked pretty well". The older knowledge apps were "technically very sound", lightweight, and still work, but the problem is that "they're insular" because they are not available for all of the different platforms. Because Endless has widened its goals, it has had to accept some limitations; it is putting its focus more on the ends, rather than the means.

[Dylan McCall]

That has meant integrating with Kolibri, which has been a somewhat rocky process; "it isn't really built for GNOME desktops", McCall said. The offline-content world is a small one, though, so it makes a lot of sense for the players to team up. "There's a lot to gain by accepting that little bit of weird if it means we can fit together better and reach more people."

Another lesson was that the lightweight system component added to Endless OS was a good choice for working around some problems that currently cannot be solved with Flatpak portals. Most of the code in Kolibri remains the same, but adding that component unblocked a lot of features. McCall said that he was sometimes surprised at how well the Linux desktop plumbing made certain seemingly difficult features fairly straightforward to solve.

Offline learning is as important as ever, he said. Another Kolibri-using project is Endless Key, which is targeting the surprising number of students in the US who do not have internet access at home. That problem was particularly acute during the COVID lockdowns. Endless Key provides Kolibri-based content on USB media that can be used from any operating system. Even those who have good internet access can benefit from a pile of curated content. It gives a certain amount of peace of mind that there is a trove of information "and none of it is trying to trick you", McCall said.

Endless developers are already working with the Kolibri upstream to bring some of the features added for Kolibri on GNOME to other platforms, Quiñones said. There are also plans to port to GTK4 and libadwaita. Work is being done on the overall architecture to generally improve the performance, including reducing the startup time. Kolibri has a simple concept of content collections that they would like to enhance to support removable storage and to better serve self-guided learners, he said.

The work that Learning Equality, Endless OS, and others are doing is definitely interesting and seems likely to be useful to many students—of all ages—for a long time to come. These kinds of projects are highly visible reflections of some of the ideals that the free-software movement holds dear. It is great to see them being put into practice.

A YouTube video of the talk is available.

[I would like to thank LWN subscribers for supporting my trip to Guadalajara, Mexico for GUADEC.]

Comments (1 posted)

A security-module hook for user-namespace creation

By Jonathan Corbet
August 4, 2022
The Linux Security Module (LSM) subsystem works by way of an extensive set of hooks placed strategically throughout the kernel. Any specific security module can attach to the hooks for the behavior it intends to govern and be consulted whenever a decision needs to be made. The placement of LSM hooks often comes with a bit of controversy; developers have been known to object to the performance cost of hooks in hot code paths, and sometimes there are misunderstandings over how integration with LSMs should be handled. The disagreement over a security hook for the creation of user namespaces, though, is based on a different sort of concern.

User namespaces, which can be created by unprivileged processes, give the creator complete control over user and group IDs. Within the namespace, the creator can run as root, but all interactions with the system are mapped back to the creator's user and group ID. They are a fundamental building block for unprivileged containers. In theory, user namespaces are entirely safe; in practice, they have long been accompanied by worries about the increased attack surface that comes from making formerly root-only actions available within the namespace. There have indeed been vulnerabilities resulting from interactions with user namespaces; see this report for a recent example. Whether user namespaces are truly more prone to vulnerabilities than the rest of the kernel is not clear, though.

See this article for more information on user namespaces.

As Frederick Lawler notes in this patch set, security modules currently have a degree of control over user-namespace creation, but that control relies on a hook that was not actually intended for access-control decisions. Among other things, that prevents any error code from being propagated back to the (presumably frustrated) user. The solution is to create a new hook (security_create_user_ns()) that is called prior to the creation of a new namespace, and which can cause the action to fail if it is not consistent with the current security policies.
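
The veto semantics of such a hook can be illustrated with a small, self-contained model. This is conceptual Python, not kernel code; the registry, the callback signature, and the credential dictionary are all illustrative:

```python
# Toy model of an LSM-style hook: each security module registers a
# callback for "create_user_ns"; the first one that denies aborts the
# operation and its error code reaches the caller. All names here are
# illustrative, not kernel APIs.

hooks = {"create_user_ns": []}

def register_hook(name, callback):
    hooks[name].append(callback)

def call_hooks(name, *args):
    """Return 0 if every module allows the action, else the first error."""
    for cb in hooks[name]:
        err = cb(*args)
        if err != 0:
            return err
    return 0

def create_user_ns(cred):
    # The new security_create_user_ns() hook runs *before* the namespace
    # is created, so policy can veto it and the error is propagated back
    # to the caller -- unlike the hook security modules abuse today.
    err = call_hooks("create_user_ns", cred)
    if err != 0:
        return err          # e.g. -EPERM reaches the user
    return 0                # proceed with namespace creation

# One example policy module: deny unprivileged creation.
EPERM = 1
register_hook("create_user_ns",
              lambda cred: -EPERM if not cred["privileged"] else 0)

print(create_user_ns({"privileged": True}))   # 0: allowed
print(create_user_ns({"privileged": False}))  # -1: denied with -EPERM
```

The point of the model is only the control flow: the hook sits in front of the action, and a denial carries a meaningful error code back to user space.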

It is a relatively straightforward patch set, even after adding a self-test and a hook implementation for SELinux. Over four revisions, it has seen a number of tweaks and appears to be at a point where developers in the security community are happy with it. There is, of course, an exception; Eric Biederman raised some objections in response to the posting of the third revision in late July. One was that blocking access to user namespaces as a way of reducing attack surface would only be effective if the bulk of the exploitable bugs in the kernel would be blocked; otherwise, he said, attackers would simply find a different bug to exploit.

His larger complaint, though, was essentially an objection to applying any sort of access control to user namespaces at all. Over time, he said, numerous new kernel features have been restricted to the root account mostly because otherwise they could be used to confuse setuid programs. That has resulted in more code running as root, which is not good for security overall. User namespaces were meant to bring an end to that trend:

One of the goals of the user namespace is to avoid more and more code migrating to running as root. To achieve that goal ordinary application developers need to be able to assume that typically user namespaces will be available on linux.

An assumption that ordinary applications like chromium make today.

Your intentions seem to be to place a capability check so that only root can use user namespaces or something of the sort. Thus breaking the general availability of user namespaces for ordinary applications on your systems.

Biederman concluded with a "Nacked-by" tag indicating rejection of the patch.

He was alone in that view, though. SELinux maintainer Paul Moore answered that LSM hooks provide for more than just access control; auditing and observability are also important use cases for them. He asserted that "integrating the LSM into the kernel's namespaces is a natural fit, and one that is long overdue", and asked Biederman to suggest alternatives if the proposed hook is truly not acceptable. Ignat Korchagin (who, like Lawler, posted from a Cloudflare email address) said that the real goal was to increase the use of user namespaces by providing more control over them:

So in a way, I think this hook allows better adoption of user namespaces in the first place and gives distros and other system maintainers a reasonable alternative than just providing a global "kill" sysctl (which is de-facto is used by many, thus actually limiting userspace applications accessing the user namespace functionality).

Biederman did not respond further, and the conversation wound down. Moore suggested that it was time to get the work merged:

There is the issue of Eric's NACK, but I believe the responses that followed his comment sufficiently addressed those concerns and it has now been a week with no further comment from Eric; we should continue to move forward with this.

After the posting of version 4 on August 1, though, Biederman made it clear that he did not believe his concerns had been dealt with. "Nack Nack Nack".

What will come next is not entirely clear. Biederman still has not offered any sort of alternative approach to the problem, so developers are left with a somewhat unpleasant choice: either restart from the beginning in the hopes of finding a solution that Biederman will not try to block, or simply ignore Biederman and merge the hook anyway. Since Biederman is evidently opposed to any sort of access control for user-namespace creation, the space of mutually acceptable solutions may be small. Pushing code over a maintainer's objections is not done lightly in the kernel community, but it does occasionally happen. Moore has already indicated that this case may play out that way.

Comments (27 posted)

6.0 Merge window, part 1

By Jonathan Corbet
August 5, 2022
The merge window for the kernel that will probably be called "6.0" has gotten off to a strong start, with 6,820 non-merge changesets pulled into the mainline repository in the first few days. The work pulled so far makes changes all over the kernel tree; read on for a summary of what has happened in the first half of this merge window.

The most significant changes accepted as of this writing include:

Architecture-specific

  • The arm64 architecture can now swap transparent huge pages without the need to split them to base pages first. This feature is incompatible with the memory tagging extension, though.

Core kernel

  • The energy-margin heuristic that limited process migration across CPUs has been removed, resulting in better energy utilization overall.
  • A number of other tweaks have been made to task placement on larger systems, resulting in better performance overall, but the pull description warns that behavioral changes might be seen in some workloads.
  • Support for epoll_ctl() operations in io_uring has been deprecated and will be removed from a future release if there are no complaints.
  • The new IORING_RECV_MULTISHOT flag enables multi-shot operation with recv() calls, significantly improving performance in applications that do a lot of receives from the same socket(s).
  • Support for buffered writes (to XFS filesystems only for now) in io_uring has been considerably improved, increasing performance by a factor of two or so.
  • Zero-copy network transmission is also now supported in io_uring.
  • BPF programs attached to uprobes are now allowed to be sleepable.
  • There is a new BPF iterator for working through kernel symbols; no documentation is included, but there is a self-test with an example of how it works.

Filesystems and block I/O

  • There are two seemingly unused distributed lock-manager features (DLM_LSFL_TIMEWARN and DLM_LKF_TIMEOUT) that have been marked as deprecated. The current plan is to remove them entirely in the 6.2 development cycle.
  • The fsnotify subsystem has a new flag, FAN_MARK_IGNORE, which provides more control over which specific events are ignored; this commit changelog has a little more information.
  • The kernel can now properly implement POSIX access control lists on overlayfs filesystems that are, in turn, layered on top of ID-mapped lower-level filesystems. The curious can find a lot of details on the problem being solved in this pull request.
  • There is a new user-space block driver that is driven by io_uring. It is thoroughly undocumented, but some information can be found in this commit changelog and the ubdsrv GitHub page.
  • A new version (version 2) of the Btrfs "send" protocol has been added. It supports sending data in larger chunks, sending raw compressed extents, and including more metadata. Naturally, the version 1 protocol is still supported on both ends.

Hardware support

  • Graphics: LogiCVC display controllers, Freescale i.MX8QM/QXP pixel combiners, Freescale i.MX8QM/QXP display pixel links, Freescale i.MX8QXP pixel link to display pixel interfaces, Freescale i.MX8QM and i.MX8QXP LVDS display bridges, Freescale i.MX LCDIFv3 LCD controllers, and EBBG FT8719 panels. Of course, the kernel has also gained several hundred-thousand more lines of amdgpu register headers.
  • Hardware monitoring: Analog Devices ADM1021, ADM1021A, ADM1023, ADM1020, ADT7481, ADT7482, and ADT7483 temperature sensors, Maxim MAX1617 and MAX6642 temperature sensors, National Semiconductor LM84 temperature sensors, ON Semiconductor NCT210, NCT214, NCT218, and NCT72 digital thermometers, Philips NE1618 temperature sensors, Analog Devices LT7182S step-down switchers, and Aquacomputer Quadro fan controllers.
  • Media: Allwinner A31 MIPI CSI-2 controllers, Allwinner A83T MIPI CSI-2 controllers, and ON Semiconductor AR0521 sensors.
  • Miscellaneous: Hisilicon HNS3 performance monitoring units, NVIDIA Tegra186 timers, Renesas RZ/G2L interrupt controllers, Loongson PCH LPC controllers, Loongson3 Extend I/O interrupt vector controllers, Arm SCMI system power controllers, MediaTek Smart Voltage Scaling engines, Qualcomm interconnect bandwidth monitors, Sunplus SP7021 reset controllers, Sunplus SP7021 interrupt controllers, Microchip FPGA I2C controllers, and Renesas RZ/V2M interfaces.
  • Networking: Renesas RZ/N1 A5PSW Ethernet switches, Renesas RZ/N1 MII converters, Wangxun 10GbE PCI Express adapters, and Microchip LAN937x switches. There is also a new module that can repurpose ELM327-based OBD-II adapters as hobbyist-level CAN network interfaces.
  • Regulator: Richtek RT5120 PMIC voltage regulators, MediaTek MT6370 SubPMIC regulators, and Maxim 597x power switches.

Miscellaneous

  • The "efivars" interface in sysfs has been deprecated since 2012; in 6.0 it will be removed entirely. It is believed that all users have long since moved to the efivarfs interface for EFI data.

Networking

  • There are new BPF helpers for the generation and checking of SYN cookies. Documentation is absent, but there is a self-test to look at for an example.
  • There is also a new set of BPF kfuncs for accessing and modifying connection-tracking state.
  • The in-kernel TLS implementation has seen a number of performance improvements; see this blog post for details.

Security-related

  • The x86 kernel can now obtain a random-number seed from the setup data passed in by the bootloader. A similar feature has been added to the m68k kernel using that platform's bootinfo protocol.
  • The SafeSetID security module can now control changes made with setgroups().
  • The kernel has gained support for the ARIA block cipher algorithm.
  • The BPF security module now implements hooks attached to a control group as well as to a single target process.

Internal kernel changes

  • Running the KUnit unit tests will now taint the kernel, on the theory that some of those tests could leave the system in a bad state.

There are still over 6,000 changesets sitting in linux-next, so the 6.0 merge window is far from done. Assuming the usual schedule holds, the window will remain open through August 14; LWN will, of course, post a summary of the changes in the second half once it closes.

Comments (5 posted)

An io_uring-based user-space block driver

By Jonathan Corbet
August 8, 2022
The addition of the ublk driver during the 6.0 merge window would have been easy to miss; it was buried deeply within an io_uring pull request and is entirely devoid of any sort of documentation that might indicate why it merits a closer look. Ublk is intended to facilitate the implementation of high-performance block drivers in user space; to that end, it uses io_uring for its communication with the kernel. This driver is considered experimental for now; if it is successful, it might just be a harbinger of more significant changes to come to the kernel in the future.

Your editor has spent a fair amount of time beating his head against the source for the ublk driver, as well as the ubdsrv server that comprises the user-space component. The picture that has emerged from this exploration of that uncommented and vowel-deficient realm is doubtless incorrect in some details, though the overall shape should be close enough to reality.

How ublk works

The ublk driver starts by creating a special device called /dev/ublk-control. The user-space server (or servers, there can be more than one) starts by opening that device and setting up an io_uring ring to communicate with it. Operations at this level are essentially ioctl() commands, but /dev/ublk-control has no ioctl() handler; all operations are, instead, sent as commands through io_uring. Since the purpose is to implement a device behind io_uring, the reasoning seems to be, there is no reason to not use it from the beginning.

A server will typically start with a UBLK_CMD_ADD_DEV command; as one might expect, it adds a new ublk device to the system. The server can describe various aspects of this device, including the number of hardware queues it claims to implement, its block size, the maximum transfer size, and the number of blocks the device can hold. Once this command succeeds, the device exists as far as the ublk driver is concerned and is visible as /dev/ublkcN, where N is the device ID returned when the device is created. The device has not yet been added to the block layer, though.

The server should open the new /dev/ublkcN device for the following steps, the first of which is to map a region from the device into the server's address space with an mmap() call. This region is an array of ublksrv_io_desc structures describing I/O requests:

    struct ublksrv_io_desc {
	/* op: bit 0-7, flags: bit 8-31 */
	__u32		op_flags;
	__u32		nr_sectors;
	__u64		start_sector;
	__u64		addr;
    };
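
As a sketch of what the server sees, a descriptor in this layout can be decoded with Python's struct module. This assumes a little-endian layout with no padding (two __u32 fields followed by two naturally aligned __u64 fields); the op and flag values used in the example are invented for illustration:

```python
import struct

# op_flags, nr_sectors, start_sector, addr -- 24 bytes, little-endian
IO_DESC = struct.Struct("<IIQQ")

def decode_io_desc(raw: bytes):
    op_flags, nr_sectors, start_sector, addr = IO_DESC.unpack(raw)
    return {
        "op": op_flags & 0xff,     # bits 0-7 of op_flags
        "flags": op_flags >> 8,    # bits 8-31 of op_flags
        "nr_sectors": nr_sectors,
        "start_sector": start_sector,
        "addr": addr,
    }

# A hypothetical request: op 1 with flags 0x2, 8 sectors at sector 64.
raw = IO_DESC.pack((0x2 << 8) | 1, 8, 64, 0x7f0000001000)
desc = decode_io_desc(raw)
print(desc["op"], desc["flags"], desc["nr_sectors"], desc["start_sector"])
```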

Notification of new I/O requests will be received via io_uring. To get to that point, the server must enqueue a set of UBLK_IO_FETCH_REQ requests on the newly created device; normally there will be one for each "hardware queue" declared for the device, which may also correspond to each thread running within the server. Among other things, this request must provide a memory buffer that can hold the maximum request size declared when the device was created.

Once this setup is complete, a separate UBLK_CMD_START_DEV operation will cause the ublk driver to actually create a block device visible to the rest of the system. When the block subsystem sends a request to this device, one of the queued UBLK_IO_FETCH_REQ operations will complete. The completion data returned to the user-space server will include the index of the ublkserv_io_desc structure describing the request, which the server should now execute. For a write request, the data to be written will be in the buffer that was provided by the server; for a read, the data should be placed in that same buffer.

When the operation is complete, the server must inform the kernel of that fact; this is done by placing a UBLK_IO_COMMIT_AND_FETCH_REQ operation into the ring. It will give the result of the operation back to the block subsystem, but will also enqueue the buffer to receive the next request, thus avoiding the need to do that separately.
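
The commit-and-fetch cycle can be modeled as a simple loop. The following toy simulation of a null-target server is purely illustrative — no io_uring or kernel interaction is involved, and the request representation is invented for the example:

```python
from collections import deque

# Toy stand-ins for the kernel side: a queue of pending block requests
# and a list of results handed back to the "block subsystem".
pending = deque()        # requests arriving from the block layer
completed = []           # (request_id, result) pairs committed back

def serve_null_target():
    """Process every pending request. A null target discards writes and
    returns zeroes for reads; committing the result and re-arming the
    fetch happen in one step (the COMMIT_AND_FETCH pattern)."""
    while pending:
        req = pending.popleft()            # a queued fetch completes
        if req["op"] == "read":
            req["buf"][:] = b"\0" * len(req["buf"])
        # writes are simply discarded by a null target
        completed.append((req["id"], 0))   # commit result, fetch next

buf = bytearray(b"xxxx")
pending.append({"id": 1, "op": "read", "buf": buf})
pending.append({"id": 2, "op": "write", "buf": bytearray(b"data")})
serve_null_target()
print(completed)   # [(1, 0), (2, 0)]
print(buf)         # bytearray(b'\x00\x00\x00\x00')
```

The real server does the same thing through io_uring completions rather than an in-process queue, but the shape of the loop — take a request, act on the shared buffer, commit the result while re-arming the next fetch — is the same.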

There are the expected UBLK_CMD_STOP_DEV and UBLK_CMD_DEL_DEV operations to make existing devices go away, and a couple of other operations to query information about existing devices. There are also a number of details that have not been covered here, mostly aimed at increased performance. Among other things, the ublk protocol is set up to enable zero-copy I/O, but that is not implemented in the current code.

The server code implements two targets: null and loop. The null target is, as one might expect, an overly complicated, block-oriented version of /dev/null; it is useless but makes it possible to see how things work with a minimum of unrelated details. The loop target uses an existing file as the backing store for a virtual block device. According to author Ming Lei, with this loop implementation, "the performance is is even better than kernel loop with same setting".

Implications

One might wonder why this work has been done (and evidently supported by Red Hat); if the world has been clamoring for an io_uring-based, user-space, faster loop block device, it has done so quietly. One advantage cited in the patch cover letter is that development of block-driver code is more easily done in user space; another is high-performance qcow2 support. The patch cover letter also cites interest expressed by other developers in having a fast user-space block-device mechanism available.

An interesting question, though, is whether this mechanism might ultimately facilitate the movement of a number of device drivers out of the kernel — perhaps not just block drivers. Putting device drivers into user-space code is a fundamental concept in a number of secure-system designs, including microkernel systems. But one of the problems with those designs has always been the communication overhead between the two components once they are no longer running within the same address space. Io_uring might just be a convincing answer to that problem.

Should that scenario play out, kernels of the future could look significantly different from what we have today; they could be smaller, with much of the complicated logic running in separate, user-space components. Whether this is part of Lei's vision for ublk is unknown, and things may never get anywhere near that point. But ublk is clearly an interesting experiment that could lead to big changes down the line. Something will need to be done about that complete absence of documentation, though, on the way toward world domination.

Comments (35 posted)

Adding auditing to pip

By Jake Edge
August 9, 2022

A tool to discover known security vulnerabilities in the Python packages installed on a system or required by a project, called pip-audit, was recently discussed on the Python discussion forum. The developers of pip-audit raised the idea of adding the functionality directly into the pip package installer, rather than keeping it as a separately installable tool. While the functionality provided by pip-audit was seen as a clear benefit to the ecosystem, moving it inside the pip "tent" was not as overwhelmingly popular. It is not obvious that auditing is part of the role that the package installer should play.

Background

In late July, Dustin Ingram proposed adding pip-audit to pip, as a subcommand (i.e. pip audit). As part of Ingram's work on the Google open-source security team, he and a group from Trail of Bits developed pip-audit, which was created with the hope it could eventually be merged into pip itself. Doing so would help meet the goals they set out with:

Our ultimate goal is to make a) useful vulnerability tooling that is b) free to use, c) community owned and operated and and d) canonical and easily available to every Python user. We've already achieved a) and b) and to some extent c) (the project is open-source, and we plan to request a transfer to the PyPA [Python Packaging Authority] org) but we think the most effective way to achieve d) is by making pip-audit a subcommand of pip itself, due to pip's wide user base.

The post was seeking feedback from the pip developers and Python community on the tool, the idea of adding it to pip, and rough roadmap he had created for integrating pip-audit into pip. The tool was carefully designed with that integration in mind, so there should be minimal disruption if it were to happen, Ingram said. The command-line interface is compatible with pip and the pip-audit code could simply be "vendored" (copied) into pip repository; the latter would allow pip-audit development and maintenance to continue in its existing repository if that was deemed desirable.

The purpose of pip-audit is to look at the packages that have been installed from the Python Package Index (PyPI) and determine which of them have known security vulnerabilities. The PyPA maintains an advisory database that stores vulnerabilities affecting PyPI packages in YAML format. For example, pip-audit reported that the version of Babel on my system is vulnerable to PYSEC-2021-421, which is a local code-execution flaw. That PYSEC advisory refers to CVE-2021-42771, which is how the flaw is known to the wider world.

As it turns out, my system is actually not vulnerable to CVE-2021-42771, as the Ubuntu security entry shows. The pip-audit tool looks at the version numbers of the installed PyPI packages to decide which are vulnerable, but Linux distributions regularly backport fixes into earlier versions, so the PyPI package version number does not tell the whole story—at least for those who get their PyPI packages from distributions rather than via pip.
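That version-based matching can be illustrated with a simplified sketch. This is not pip-audit's actual code, and the advisory record below is a hypothetical stand-in loosely modeled on the PYSEC/OSV shape: an installed version is flagged whenever it is at or above the version where the flaw was introduced and below the version where it was fixed.

```python
# Simplified illustration of version-range advisory matching; real
# tools use the "packaging" library, which also handles pre-releases,
# epochs, and other PEP 440 details that this naive parser ignores.

def parse(version):
    # Naive dotted-integer parser: "2.9.0" -> (2, 9, 0)
    return tuple(int(part) for part in version.split("."))

def is_vulnerable(installed, introduced, fixed):
    # Purely numeric comparison: a distribution that backported the
    # fix into an older version number would still be flagged here.
    return parse(introduced) <= parse(installed) < parse(fixed)

# Hypothetical advisory record modeled on the PYSEC/OSV layout:
advisory = {"id": "PYSEC-XXXX-NNN", "introduced": "0", "fixed": "2.9.1"}

print(is_vulnerable("2.9.0", advisory["introduced"], advisory["fixed"]))  # True
print(is_vulnerable("2.9.1", advisory["introduced"], advisory["fixed"]))  # False
```

This is exactly why a distribution-patched package produces a false positive: the comparison sees only the version string, not the backported fix.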

But pip-audit is clearly geared for installations where Python packages do come from PyPI, perhaps installed in multiple versions in separate virtual environments to support the needs of different applications, libraries, and frameworks. It can also be used to examine a project's "requirements" file, which lists the packages and versions the project depends on, to see if any of those packages (or those they depend on) have known vulnerabilities, which entails retrieving and unpacking the packages in question and, recursively, their dependencies. That check could be added as a step in the continuous integration (CI) pipeline, for example, or run as a test before a change is committed to a repository. When the tool finds vulnerable packages, it can even auto-upgrade them to fixed versions with pip-audit --fix.

The tool is obviously useful, but, of course, it is not magic, as the security model section of the documentation points out. "TL;DR: If you wouldn't pip install it, you should not pip audit it." It is important to note that pip-audit cannot defend against malicious PyPI packages or their requirements files; using the -r file option to check the dependencies of a package may seem harmless, but:

In particular, it is incorrect to treat pip-audit -r INPUT as a "more secure" variant of pip-audit. For all intents and purposes, pip-audit -r INPUT is functionally equivalent to pip install -r INPUT, with a small amount of non-security isolation to avoid conflicts with any of your local environments.

Response

After Ingram's post, the response to the tool was quite favorable, but some did not see it as something that belonged in pip. Bernát Gábor said: "I personally feel that pip does one thing: discover, download and install packages. IMHO the audit feature sounds very useful but orthogonal to its goal." He noted that auditing should not be dependent on the installation mechanism, so the tool should not really be tied to pip.

Both Pradyun Gedam and Donald Stufft were in favor of turning pip-audit into a pip subcommand; neither thought that maintaining it in its own repository was the right choice, however. Stufft said that it might make sense to look at creating a library for pip that other tools can use; right now, pip-audit uses pip-api which wraps the pip command line to work around the lack of an importable pip library. If pip-audit moves into pip, it can use the internal pip API, but there may be other tools that would benefit from such a library, he said.
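The wrapping technique can be sketched like this (a minimal illustration in the spirit of pip-api, not its actual code): since pip offers no supported importable library, a tool shells out to the pip command line and parses its machine-readable output.

```python
# Sketch of wrapping the pip command line: there is no supported
# importable pip API, so tools invoke the CLI and parse its JSON
# output (`pip list --format=json`) instead.
import json
import subprocess
import sys

def installed_packages():
    """Return {name: version} via `pip list --format=json`."""
    proc = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return {pkg["name"]: pkg["version"] for pkg in json.loads(proc.stdout)}

print(installed_packages())
```

An internal pip library, as Stufft suggested, would let such tools drop the subprocess round-trip and the fragility of parsing CLI output.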

User "fungi" guessed that the end goal is to do the auditing as part of the installation process, but for the functionality to also be available for audits after installation in order to catch vulnerabilities that are found later. If so, having the code in two different places would be cumbersome. But Gábor was strongly against tying installation and auditing together:

If I ask an installer to install a package, I'd not like to get a security audit, just install it. I'll handle security auditing in parallel when I need it. I'd rather not make pip even slower if that's possible.

Paul Moore wondered where the idea of running an audit at install time came from; he was opposed to that idea for the same reasons as Gábor. Beyond that, though, he was not really in favor of moving pip-audit into pip; he did not think the proposal described the advantages for pip all that well. He said: "pip is large and complicated enough already, it's up to you, if you're suggesting we add this to pip, to explain why we should be willing to consider it." If audit does get added to pip, however, he had another good reason to move the code into pip as well:

[...] trying to make it work as both standalone and part of pip will be suboptimal for both. Let's choose one and stick with it. Apart from anything else, I don't think we want to be in a position where pip-audit gets fixes added and released and they take up to 3 months to get into pip (which is what vendoring could result in) - that just gives a bad message about using the pip subcommand rather than the standalone version.

Ingram replied to multiple pieces of those messages. He noted that "running an audit at install time is explicitly not a goal for pip audit"; the npm package manager for JavaScript has a similar feature that has not worked out well. Meanwhile, the main reason for wanting the functionality in pip was mentioned in the goals (quoted above), he said; putting it in pip would make it readily available to every Python user right out of the box.

But Moore was not convinced that was a good argument for adding it to pip; "other pip maintainers may disagree, but I think pip is already overloaded with functionality and we should be streamlining, not adding more". Stufft noted, however, that a lot of existing pip subcommands, such as pip list and pip show, already start down the path toward pip audit; given that those are part of pip, he said, it is hard to see "an objective criterion" that would admit them but reject pip audit.

Stéphane Bidoul was concerned about further burdening the pip team with a large feature of this sort, even though he saw pip-audit as a "very useful and important feature" that is worth making ubiquitously available. He wondered if there were other ways to make that happen, perhaps by automatically installing pip-audit with pip or by adding a plugin system to pip similar to the one Git has. There are dangers to adding it directly into pip, however:

My (very personal) feeling is that, at this point in time, the pip team is so small, and there is so much to do just to make the existing pip features consistent with each other, to support new standards and new python versions, to deprecate legacy behaviours, etc… that I tend to cringe at the idea of adding new large features that could live outside, since just the review would divert us.

Moore agreed with that assessment; he said that more help in maintaining pip is needed, so if the pip-audit maintainers wanted to pitch in, that might change the equation. The people working on a project obviously help shape its direction, but he would want new maintainers to "feel responsible for pip as a whole" and not just for the audit piece. Ingram said that he could potentially help out and, beyond that, he definitely had money available in his budget to pay for development and maintenance, which might be a path forward. He stressed that the pip-audit team intended to continue maintaining the code wherever it ended up landing and that the intent had always been to eliminate "any increased work on any existing pip maintainer (aside from what would be unavoidable to land the feature)".

There are plans for a Python Enhancement Proposal (PEP); the discussion and decision on that, once it appears, should help clarify things. As Sumana Harihareswara put it, the right location for the tool comes down to user expectations: "what do users want/expect when they're working with pip, and how much additional friction would it cause for them if they have to invoke the audit command one way versus another way?" Other package managers do have audit capabilities, as pip-audit co-maintainer William Woodruff pointed out, but there are additional arguments to bolster the case for pip audit:

I think of pip as a "package management system." It already has functionality that's strictly outside of package installation, but all current functionality (AFAIK) has something to do with querying, managing, or checking the state of Python packages and package distributions.

From there, I think pip-audit would qualify for subcommand inclusion on the basis that (1) it operates entirely on the same objects and state as pip ordinarily does (i.e., it does not broaden the scope of things pip concerns itself with, even though it adds new code), and (2) it exposes functionality that exists in other package management ecosystems, like npm and cargo.

I recognize, however, that "other package tools do it!" is not necessarily a precedent that the pip maintainers wish to establish. But I think the combination of prior art and the limited domain of interest make pip-audit a reasonable candidate in particular for inclusion.

We'll have to wait to see how things play out from here, but at a minimum, there is a new tool available to help developers and administrators find any known vulnerabilities in their PyPI packages (and the dependencies thereof). It needs to be installed separately for now, but that does not limit its usefulness, really, just its visibility. Meanwhile, there are quite a few discussions of proposals for new Python packaging functionality floating around on discuss.python.org these days, so it is an active area of development. Like every other project, though, PyPA, pip, and other tools all suffer from a lack of contributors, so some hard decisions on which to pursue are going to have to be made.

Comments (19 posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds