A survey of the Python packaging landscape
Over the past several months, there have been wide-ranging discussions in the Python community about difficulties users have with installing packages for the language. There is a bewildering array of options for package-installation tools and Python distributions focused on particular use cases (e.g. scientific computing); many of those options do not interoperate well—or at all—so they step on each other's toes. The discussions have focused on where solutions might be found to make it easier on users, but lots of history and entrenched use cases need to be overcome in order to get there—or even to make progress in that direction.
In order to follow along on these lengthy discussions, though, an overview of Python's packaging situation and the challenges it presents may be helpful. Linux users typically start by installing whichever Python version is supplied by their distribution, then installing various other Python packages and applications that come from their distribution's repositories. That works fine so long as the versions of all of those pieces are sufficient for the needs of the user. Eventually, though, users may encounter some package they want to use that is not provided by their distribution, so they need to install it from somewhere else.
PyPI and pip
The Python Package Index (PyPI) contains a huge number of useful packages that can be installed in a system running Python. That is typically done using the pip package installer, which will either install the package in a site-wide location or somewhere user-specific depending on whether it was invoked with privileges. pip will also download any needed dependencies, but it only looks for those dependencies at PyPI, since it has no knowledge of the distribution package manager. That can lead to pip installing a dependency that actually is available in the distribution's repository, which is just one of the ways pip and the distribution package manager (e.g. DNF, Apt, etc.) can get crosswise.
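As a concrete illustration, the standard-library site and sysconfig modules can show the two install locations that pip chooses between; the exact paths vary by platform and Python build, so this is only indicative.

```python
# Show the two install locations pip chooses between: the site-wide
# site-packages directory and the per-user one (the target of
# "pip install --user", or the fallback when pip lacks privileges).
import site
import sysconfig

# Site-wide location (what an installation with privileges targets):
print(sysconfig.get_path("purelib"))

# Per-user location (no privileges needed):
print(site.getusersitepackages())
```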
Beyond that, there can be conflicting dependency needs between different packages or applications. If application A needs version 1 of a dependency, but application B needs version 2, only one can be satisfied because only a single version of a package can be active for a particular Python instance. It is not possible to specify that the import statement in A picks up a different version than the one that B picks up. Linux distributions solve those conflicting-version problems in various ways, which sometimes results in applications not being available because another, more important package required something that conflicted. The Linux-distribution path is not a panacea, especially for those who want bleeding-edge Python applications and modules. For those not following that path, this is where the Python virtual environment (venv) comes into play.
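The single-active-version limitation can be seen with a small sketch: two copies of a hypothetical dep module (standing in for two versions of a dependency) are placed on sys.path, and a plain import silently picks up whichever comes first, shadowing the other.

```python
# Sketch: only one copy of a module can be active per interpreter.
# Whichever directory appears first on sys.path "wins"; any other
# version of the same module is silently shadowed.
import os
import sys
import tempfile

top = tempfile.mkdtemp()
for version in ("1", "2"):
    d = os.path.join(top, f"v{version}")
    os.mkdir(d)
    with open(os.path.join(d, "dep.py"), "w") as f:
        f.write(f"VERSION = {version!r}\n")

# Put "version 1" ahead of "version 2" on the module search path.
sys.path[:0] = [os.path.join(top, "v1"), os.path.join(top, "v2")]
import dep
print(dep.VERSION)  # "1" -- version 2 is unreachable via a plain import
```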
Virtual environments
A virtual environment is a lightweight way to create a Python instance with its own set of packages that are precisely tuned to the needs of the application or user. They were added to Python itself with PEP 405 ("Python Virtual Environments"), which was accepted for Python 3.3 in 2012, but they had already become popular via the virtualenv module on PyPI. At this point, it has become almost an expectation that developers are using virtual environments to house and manage their dependencies; it has reached a point where there is talk of forcing pip and other tools to only install into them.
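A virtual environment can also be created programmatically with the standard-library venv module (the equivalent of running "python -m venv"); this sketch skips bootstrapping pip to keep it fast.

```python
# Minimal sketch: create a virtual environment with the stdlib venv
# module, equivalent to "python -m venv demo-env" on the command line.
import os
import tempfile
import venv

env_dir = os.path.join(tempfile.mkdtemp(), "demo-env")
venv.create(env_dir, with_pip=False)  # with_pip=True also bootstraps pip

# The pyvenv.cfg marker file is what makes the directory a venv.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))  # True
```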
When the module to be installed is pure Python, installation with pip is fairly straightforward, but Python modules can also have pieces written against the C API, which need to be built for the target system from source code in C, C++, Rust, or other languages. That requires the proper toolchain to be available on that system, which is typically easy to ensure on Linux, but less so on other operating systems. So projects can provide pre-built binary "wheels" in addition to source distributions on PyPI.
But wheels are highly specialized for the operating system, architecture, C library, and other characteristics of the environment, which leads to a huge matrix of possibilities. PyPI relies on all of the individual projects to build "all" of the wheels that users might need, which distributes the burden, but also means that there are gaps for projects that do not have the resources of a large build farm to create wheels. Beyond that, some Python applications and libraries, especially in the scientific-computing world, depend on external libraries of various sorts, which are also needed on target systems.
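That matrix of possibilities is encoded directly in wheel filenames, which follow the name-version-python_tag-abi_tag-platform_tag.whl pattern from PEP 427. The filename below is just a plausible example, and this toy parser ignores the optional build tag for simplicity.

```python
# Sketch: a wheel's compatibility is encoded in its filename as
# name-version-python_tag-abi_tag-platform_tag.whl (PEP 427).
# This ignores the optional build-tag field for simplicity.
def parse_wheel_name(filename):
    """Split a wheel filename into its components."""
    stem = filename[:-len(".whl")]
    name, version, py_tag, abi_tag, platform_tag = stem.split("-")
    return {"name": name, "version": version, "python": py_tag,
            "abi": abi_tag, "platform": platform_tag}

tags = parse_wheel_name(
    "numpy-1.24.1-cp311-cp311-manylinux_2_17_x86_64.whl")
print(tags["platform"])  # manylinux_2_17_x86_64
```

A wheel tagged with a specific interpreter, ABI, and platform only installs on matching systems, which is why projects end up building one wheel per supported combination.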
Distributions
This is where Python distributions pick up the slack. For Linux users, their regular distribution may well provide what is needed for, say, NumPy. But if the version that the distribution provides is insufficient for some reason, or if the user is running some other operating system that lacks a system-wide package manager, it probably makes sense to seek out Anaconda, or its underlying conda package manager.
The NumPy installation page demonstrates some of the complexities with Python packaging. It has various recommendations for ways to install NumPy; for beginners on any operating system, it suggests Anaconda. For more advanced users, Miniforge, which is a version of conda that defaults to using the conda-forge package repository, seems to be the preferred solution, but pip and PyPI are mentioned as an alternate path.
There are a number of differences between pip and conda that are described in the "Python package management" section of the NumPy installation page. The biggest difference is that conda manages external, non-Python dependencies, compilers, GPU compute libraries, languages, and so on, including Python itself. On the other hand, pip only works with some version of Python that has already been installed from, say, python.org or as part of a Linux distribution. Beyond that, conda is an integrated solution that handles packages, dependencies, and virtual environments, "while with pip you may need another tool (there are many!) for dealing with environments or complex dependencies".
In fact, the "pip" recommendation for NumPy is not to actually use that tool, but to use Poetry instead, because it "provides a dependency resolver and environment management capabilities in a similar fashion as conda does". So a conda-like approach is what NumPy suggests; the difference is that Poetry/pip use PyPI, while conda normally uses conda-forge. The split is bigger than that, though, because conda does not use binary wheels, but instead uses its own format that is different from (and, in some cases, predates) the packaging standards that pip and much of the rest of the Python packaging world use.
PyPA
The Python Packaging Authority (PyPA) is a working group in the community that maintains pip, PyPI, and other tools; it also approves packaging-related Python Enhancement Proposals (PEPs) as a sort of permanent PEP-delegate for the steering council (a role inherited from former benevolent dictator Guido van Rossum). How the PEP process works is described on its "PyPA Specifications" page. Despite its name, though, the PyPA has no real authority in the community; it leads by example and its recommendations (even in the form of PEPs) are simply that—tool authors can and do ignore or skirt them as desired.
The PyPA maintains multiple tools, the "Python Packaging User Guide", and more. The organization's goals are specified on its site, but they are necessarily rather conservative because the Python software-distribution ecosystem "has a foundation that is almost 15 years old, which poses a variety of challenges to successful evolution".
In a lengthy (and opinionated) mid-January blog post, Chris Warrick looked at the proliferation of tools, noting that there are 14 that he found, most of which are actually maintained by the PyPA, but it is not at all clear from that organization's documentation which of those tools should be preferred. Meanwhile, the tools that check most of the boxes in Warrick's comparison chart, Poetry and PDM, are not maintained by the working group, but instead by others who are not participating in the PyPA, he said.
The situation is, obviously, messy; the PyPA is well aware of that and has been trying to wrangle various solutions for quite some time. The discussions of the problems have seemingly become more widespread—or more visible—over the past few months, in part because of an off-hand comment in Brett Cannon's (successful) self-re-nomination to the steering council for 2023. He surely did not know how much discussion would be spawned from a note tucked into the bottom of that message: "(I'm also still working towards lock/pinned dependencies files on the packaging side and doing stuff with the Python Launcher for Unix, but that's outside of Python core)."
Several commented in that thread on their hopes that the council (or someone) could come up with some kind of unifying vision for Python packaging. Those responses were split off into a separate "Wanting a singular packaging tool/vision" thread, which grew from there. That discussion led to other threads, several of which are still quite active as this is being written. Digging into those discussions is a subject for next week—and likely beyond.
Readers who want to get a jump-start on the discussions will want to read Warrick's analysis and consult the pypackaging-native site that was announced by Ralf Gommers in late December. Also of interest are the results of the Python packaging survey, which further set the stage for much of the recent discussion and work. Packaging woes have been a long-simmering (and seriously multi-faceted) problem for Python, so it is nice to see some efforts toward fixing, or at least improving, the situation in the (relatively) near term. But there is still a long way to go. Stay tuned ...
