
Reinventing the Python wheel

By Jake Edge
July 9, 2025

PyCon US

It is no secret that the Python packaging world is at something of a crossroads; there have been debates and discussions about the packaging landscape that started long before our 2023 series describing some of the difficulties. There has been progress since then—and incremental improvements all along, in truth—but a new initiative is looking to overhaul packaging for the language. At PyCon US 2025, Barry Warsaw and Jonathan Dekhtiar gave a presentation on the WheelNext project, which is a community effort that aims to improve the experience for users and providers of Python packages while also working with toolmakers and other parts of the ecosystem to "reinvent the wheel". While the project's name refers to Python's wheel binary distribution format, its goals stretch much further than simply the format.

Warsaw started things off by noting that, while he and Dekhtiar both work for NVIDIA, WheelNext is a "community-driven initiative that spans all of the entire Python community". He put up profile pictures from around 30 different people who had already been contributing to the WheelNext GitHub repository; "it's really open to anybody", Warsaw said.

Before getting into the meat of WheelNext, it is important to "celebrate the wins for the Python packaging community", he said, showing some screen shots that had been taken a few weeks before the talk in May. The numbers of projects (600K+), releases (nearly 7 million), files (14.1 million), and users (920K+) listed on the Python Package Index (PyPI) home page are eye-opening, for example. The PyPI Stats page showed numbers that "certainly blew me away; 1.6 billion downloads a day [on that day], 20 billion downloads a month". He showed some other graphs that illustrated the "prodigious amount of data and packages that are being vended by PyPI".

It is clear that Python packages, and wheels in particular, have been extremely successful. They are used "every day, in many many different ecosystems, in corners of the Python world that we're not even aware of" and that is the result of a lot of hard work over decades by many people, organizations, and companies. "Wheels pretty much serve the needs of most users most of the time ... so that's awesome." Over time, though, as Python reaches more development communities and additional use cases arise, "cracks are beginning to show".

WheelNext

[Barry Warsaw]

His elevator pitch for WheelNext: "An incubator for thinking about the problems and solutions of the packaging community for the next, let's say, ten years, five years, 30 years". WheelNext goes well beyond just the wheel format; "we're really talking about evolving the Python-packaging ecosystem". Among the companies, organizations, and individuals involved are all of the different stakeholders for packaging, beyond just users: "tool makers, environment managers, installers, package managers", many tools that consume wheels, people building packages for PyPI, and so on. All of those are affected by what is working well—and not so well—for packaging.

He showed a slide of the logos of a dozen or so communities and projects that are part of the effort, noting that it was just a sampling of them. He said that they were mostly in the scientific-computing part of the Python world; "that's just because I think the packaging ecosystem doesn't serve their needs quite as well as many other communities".

A lot of the inspiration for WheelNext came out of the pypackaging-native web site; he recommended visiting that site for "really excellent, detailed information about the problem space that we're trying to solve". WheelNext is the other side of the coin, trying to find solutions for the problems outlined there.

For example, wheels and tools like pip are not set up to handle GPUs, CPU micro-architectures, and specialized libraries (e.g. for linear algebra); in those cases, users want a wheel built for the specific hardware and library versions they actually have. Limits on the size of wheels that are imposed by PyPI also need to be addressed as part of WheelNext. In addition, there are native dependencies: libraries written in C, C++, Rust, or other languages that are needed by Python modules, along with their own dependencies. It is difficult for users to specify exactly what they need in those cases.

The WheelNext project has a set of axioms that it has adopted as part of its philosophy. "If it works for you now, great, it'll continue to work for you later"; the project is not looking to change things for parts of the community where the status quo already works. Beyond that, the project is prioritizing the user experience of package installation and trying to push any complexity into the tools. WheelNext does not want to create another silo or its own ecosystem; it wants to meet users where they are today. Users already have tools that they like, so it does not make sense to force them to learn another.

The idea is for WheelNext to come up with ecosystem-wide solutions, not ones that only work for a single tool or service. For example, there are "lots of third-party indexes that exist" beyond just PyPI, Warsaw said; the intent is to think about "how are we going to standardize what we need to standardize for interoperability". Backward compatibility will be prioritized, but if there comes a need to break that to improve things, the plan is "to do so intentionally and explicitly with a very defined migration path".

Problems

At that point, Dekhtiar took the stage to further discuss the problems that WheelNext is trying to address. First up were the PyPI file-size limitations: by default, files on PyPI can be up to 100MB in size; that can be raised to 1GB by PyPI staff, but it is a manual, and thus somewhat painful, process. Meanwhile, a project is limited to 10GB in total size for all of its files; "you are kind of boxed in by multiple limits at the same time". This problem is particularly acute for scientific projects and those shipping large AI models.

[Jonathan Dekhtiar]

One possible solution to that is for those who need to distribute larger objects to have their own indexes. "That does not work as well as we would wish", in part because pip's interface is painful to use with extra indexes. In addition, there can be security problems because multiple indexes can provide packages with the same names, some of which could be buggy or malicious. The dependency resolver will try to choose the best version to install based on version numbers, which can be set by attackers; "in terms of security, it is difficult to manage". So, improving the mechanism for resolving and prioritizing indexes is an important target for WheelNext.
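
To make that concrete (this is our illustration rather than something from the talk), an extra index is bolted onto pip with a flag or a configuration entry, and every configured index is then treated as an equally authoritative source for every package name; the PyTorch index URL below is just an example:

    # Pull CUDA-enabled PyTorch builds from the project's own index while
    # everything else comes from PyPI:
    pip install torch --extra-index-url https://download.pytorch.org/whl/cu126

    # The same thing as persistent configuration in pip.conf:
    #   [global]
    #   extra-index-url = https://download.pytorch.org/whl/cu126
    #
    # pip has no notion of per-package index priority: any name that exists
    # on both indexes can be satisfied from either one, which is the
    # dependency-confusion risk described above.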

As Warsaw had noted, backward compatibility is a major goal for the initiative; the intent is to fix the problems "without reinventing the workflow", Dekhtiar said. It is difficult to do so using the existing wheel format because it cannot be extended in a backward-compatible way; there is no way to express that older versions of pip should see one thing, while newer versions should see a different format. That lack is holding up the ability to add support for symbolic links, Zstandard compression (which would help somewhat with the PyPI file-size limitations), and METADATA.json files (which would help with implementing Python dependency lock files and package resolution).

Many Python scientific-computing packages rely on binary libraries of various sorts. A given application might use multiple packages, each of which relies on (and ships) its own version of the OpenBLAS linear-algebra library. Being able to share common libraries between packages in an efficient manner would reduce the number of redundant libraries that are shipped in packages and loaded at run time.

Some packages support multiple options for backends that provide visualization tools, web servers, or GUI toolkits, for example. They often require that one of the options is chosen when the package is installed, but users may not have an opinion about which they want. Like Python does with default arguments to functions, he said, it would be nice to have a way to specify a default "extra" that will get installed if no other choice is made.
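
For readers unfamiliar with extras, here is a rough illustration (a hypothetical "plotlib" package, not one from the talk) of how optional backends are declared in pyproject.toml today; PEP 771 would let the project nominate one of these extras as the default when the user expresses no preference (its exact syntax is still under discussion, so none is shown here):

    # pyproject.toml for a hypothetical "plotlib" package
    [project]
    name = "plotlib"
    version = "1.0"

    [project.optional-dependencies]
    qt = ["PyQt6"]
    gtk = ["PyGObject"]

    # Today the user must pick a backend explicitly:
    #     pip install "plotlib[qt]"
    # A bare "pip install plotlib" installs no backend at all; under PEP 771
    # the project could declare one of these extras as the default instead.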

Right now, wheels are identified by a set of platform identifiers that do not include all of the different possibilities. In particular, packages can be built and optimized for specialized hardware, such as GPUs, FPGAs, CPU micro-architectures, specialized instruction sets (e.g. AVX512), and so on, but there is no mechanism to select wheels based on those criteria. Without fine-grained selection, "what you end up having is the lowest common denominator, which is optimized for nobody", Dekhtiar said.

The problem has been "solved" for some projects by having a web-site selector that allows users to choose the right package, but it forces them to read the documentation and set up a different index for getting their packages. "This is awesome, because it allows us to do what we need", but "we are a little bit sad that this is the best answer we have today and we wish we could do better".

Since the talk was 30 minutes in length, he said, they could not cover the entirety of WheelNext, but he wanted to quickly go through some of the PEPs that have been discussed along the way. He started with the (withdrawn) PEP 759 ("External Wheel Hosting"); it is a proposed solution for the problems with multiple repositories (indexes) for Python artifacts. PEP 771 ("Default Extras for Python Software") is meant to address the need for specifying default backends as he had described earlier. He said that PEP 777 ("How to Re-invent the Wheel") was meant to help ensure that the wheel format evolves in ways that would not require backward-incompatible changes in the future. PEP 778 ("Supporting Symlinks in Wheels"), which was recently deferred, is challenging because not all filesystems support symbolic links, but there is a need to share libraries as he mentioned earlier. The build isolation passthrough PEP does not have a number, but it is meant to help in building Python extensions based on experimental or development versions of packages on the local machine.

Governance

With that, Warsaw stepped up to talk about packaging governance and PEP 772 ("Packaging Council governance process") in particular. Over the years, in the Python community, "there's been as little bureaucracy as we could possibly get away with and more of a grassroots movement for handling things". As it becomes clear "that we need some more formalism, we figure out how to do that"; the creation of the Python steering council is a good example of how that works.

The community has recognized a need for some more formalism in packaging governance recently. There are essentially two developers who each "have a vertical slice of the packaging ecosystem and they have standing delegations from the Python steering council" to decide on packaging PEPs, Warsaw, who is a member of the steering council, said. There are concerns about the "bus factor", but having people in that situation "also means that there is a lot of burden on that one person to do everything and make sure that they get it right" for their slice.

So the PEP is an effort to bring the steering-council model ("which is mostly successful") to the packaging community. The idea is that the steering council can delegate packaging decisions to a council that is elected by the large community of packaging stakeholders. Those who are familiar with the workings of the steering council will find the election of the packaging council and its operation to be similar. There are some differences, due to the nature of the packaging community; currently the main effort is to define the voting community that will vote on the five members of the packaging council. His hope was that the PEP could go to the steering council for a vote soon and that the packaging council could get started sometime this year; an updated version of the PEP was announced shortly after PyCon.

More PEPs

Dekhtiar returned to the podium to talk about more WheelNext initiatives; PEP 766 ("Explicit Priority Choices Among Multiple Indexes") was up first. As he had said, the pip interface for using multiple indexes is cumbersome; it would be better if users could specify "PyTorch comes from here, NumPy comes from there, and the CUDA wheels come from there". The interface needs work, but there is also a need to protect against security problems when the installers are choosing packages based on their origin and version numbers. PEP 766 is more meant "to define the vocabulary, the wording, than actually behavior"; the intent is to have common language for the installers to use when describing their resolution behavior with respect to multiple indexes.

Sharing binary files, like OpenBLAS, between wheels or with system-installed libraries (if they are present) is difficult; "there is no safe way to do that that is common and standardized across the ecosystem". The WheelNext participants want to find a solution for a native library loader that is a kind of "best practice" approach to the problem, which can be shared throughout the community. He likened it to importlib, but one "that's specific around loading binaries". There is a saying in the WheelNext community, he said, "'good enough is better than nothing at all' and right now we have nothing at all" for handling shared libraries.
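
No such loader exists yet; as a rough sketch of the idea (ours, Linux-flavored, with a hypothetical "openblas_libs" wheel providing the shared library), consuming packages might locate and load one common copy of OpenBLAS along these lines:

    import ctypes
    import importlib.resources

    def load_shared_openblas():
        """Load a single shared OpenBLAS shipped by a (hypothetical)
        'openblas_libs' wheel, instead of each consumer vendoring its own."""
        # importlib.resources finds files installed inside a package,
        # wherever the current environment happens to live.
        libdir = importlib.resources.files("openblas_libs") / "lib"
        # RTLD_GLOBAL exposes the BLAS symbols to extension modules that are
        # loaded afterwards and expect to resolve them at run time.
        return ctypes.CDLL(str(libdir / "libopenblas.so"),
                           mode=ctypes.RTLD_GLOBAL)

    blas = load_shared_openblas()

A real loader would also have to cope with platform-specific library names, ABI and version constraints, and falling back to a system-provided copy, which is why the project wants one standardized, shared implementation rather than every package hand-rolling something like this.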

Wheel variants is another potential PEP. Dekhtiar said that he wanted to share the dream that WheelNext participants have about the next iteration of the wheel format. Today, the platform that is associated with a wheel includes the Python ABI version (e.g. 3.12, 3.13, or 3.13 free-threaded), the operating system, the C library (e.g. manylinux for the GNU C library), and the CPU architecture. Those are encoded into the name of the wheel file. Those tags are not sufficient to describe all of the platforms in use, but "constantly adding tags to better describe your platform is not a scalable practice". There are different GPUs, application-specific integrated circuits (ASICs), and, some day, quantum-computing devices; even if the community wanted to fully describe today's systems, "we don't have the language to be able to do that".
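
As a reminder of what those tags look like, here is an illustrative file name, parsed with the PyPA packaging library that pip itself uses:

    from packaging.utils import parse_wheel_filename

    # The tags after the version are the only hook an installer has for
    # choosing among builds of the same release.
    name, version, build, tags = parse_wheel_filename(
        "numpy-2.2.0-cp313-cp313-manylinux_2_28_x86_64.whl")

    for tag in tags:
        # e.g. interpreter='cp313', abi='cp313',
        #      platform='manylinux_2_28_x86_64'
        print(tag.interpreter, tag.abi, tag.platform)

    # Nothing in that triple can say "needs CUDA 12" or "built for AVX-512";
    # that is the gap the wheel-variant idea is meant to fill.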

Instead, the idea of wheel variants is to have the installer determine what the local system has installed, then use that information to choose the right wheel. For example, for JAX and PyTorch, the installer could determine which version of CUDA is installed, what kind of Tensor Processing Unit (TPU) there is, and which instructions are supported by the CPU, then it can "pick the best". He went through some scenarios using a prototype pip that would download vendor plugins to detect various aspects of the environment (CPU micro-architecture or CUDA version, for example). From a combination of the package metadata and the running system, it would determine which wheels to request for installation. At the time of the talk, the prototype worked with a subset of packages and just pip as an installer, but the hope is to get it working with others in order to collect more feedback.
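
The plugin interface is still prototype-quality, but the shape of the idea looks roughly like the following sketch (ours, with invented names; the real prototype's API will differ): a small vendor-supplied detector whose answer feeds the choice among published variants.

    import subprocess

    def detect_nvidia_driver():
        """Hypothetical detection plugin: report the NVIDIA driver version,
        or None if no GPU/driver is present.  A real plugin would map the
        driver to the CUDA level it supports."""
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, check=True)
        except (FileNotFoundError, subprocess.CalledProcessError):
            return None
        return result.stdout.strip() or None

    def choose_variant(available):
        """Hypothetical selection step: prefer the most specific variant
        the detectors say this machine can use."""
        has_gpu = detect_nvidia_driver() is not None
        for variant in available:        # e.g. ["cu126", "cu118", "cpu"]
            if variant.startswith("cu") and has_gpu:
                return variant           # best matching GPU build
            if variant == "cpu":
                return variant           # lowest-common-denominator fallback
        return None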

Conclusion

Warsaw finished the presentation with a "call to action", inviting people to get involved with WheelNext and to bring their use cases. The project has various ways to participate and is actively seeking feedback and contributions. For interested readers, the YouTube video of the talk is also available.

[Thanks to the Linux Foundation for its travel sponsorship that allowed me to travel to Pittsburgh for PyCon US.]





Wheel we get multiple file compression?

Posted Jul 9, 2025 15:38 UTC (Wed) by nickodell (subscriber, #125165) [Link] (1 responses)

I hope that one of the future changes to the wheel format includes a compressor that can compress repeated content in multiple different files. At present, if one has repeated content in multiple files, the .zip compression format cannot take advantage of it. The current compression format compresses each file separately, and includes all of them in an archive. However, it would be more efficient if the files were bundled together into an archive, then compressed, in the manner that .tar.gz does. This comes at a cost of having poor random access. In other words, it becomes slower to access one file out of the package. However, I believe this is a relatively rare operation.
This may seem like a weird and unnecessary feature. If you have repeated content, why not just remove the duplication? This would be the ideal solution, but it is often tricky. Recently, a project I work on that distributes many Cython files saved two megabytes in the distributed version by enabling Cython's new shared library. (Matus Valo's work in this area has been wonderful.) But this change to Cython took thousands of lines of code changes to accomplish, and it would have been much less necessary if wheel formats could compress inter-file duplication better.
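
A minimal sketch of the effect (two identical, individually incompressible members; exact sizes will vary): the zip archive compresses each member separately, while the gzipped tar compresses the concatenated stream, so the second copy shrinks to back-references into the first.

    import gzip, io, random, tarfile, zipfile

    payload = random.Random(0).randbytes(8192)      # 8 KiB of pseudo-random data
    files = {"a.bin": payload, "b.bin": payload}    # two identical members

    # .zip: every member is deflated on its own, so the duplication is
    # invisible to the compressor -- expect roughly 2 x 8 KiB.
    zbuf = io.BytesIO()
    with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)

    # .tar.gz: one compression pass over the whole stream, so the second copy
    # is mostly back-references -- expect a little over 8 KiB.  (Deflate's
    # 32 KiB window means this only helps when duplicates sit close together;
    # zstd or xz, with larger windows, do much better on real wheels.)
    tbuf = io.BytesIO()
    with tarfile.open(fileobj=tbuf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))

    print(len(zbuf.getvalue()), len(gzip.compress(tbuf.getvalue())))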

Wheel we get multiple file compression?

Posted Jul 18, 2025 21:55 UTC (Fri) by zahlman (guest, #175387) [Link]

I have seen ideas thrown around for that in the community. There have even been suggestions about support for putting the actual code in a second internal archive while leaving metadata arranged as usual. Certainly it's at least desired to support other compression formats - after all, lzma support exists in the standard library (and there's a fairly popular third-party package for zstd). For cases with *identical* files, PEP 778 (mentioned in the article) aims at support for symlinks.

Of course, there are multiple steps to implementing any of these kinds of changes across the ecosystem. Ideally, changes to metadata standards should ensure that older installers (i.e. older versions of pip) automatically ignore packages they won't know how to handle. And newer ones of course have to actually implement the new specifications. That's especially an issue for symlinks — the packaging formats need to be able to describe them even for archive formats that don't provide native support, but more importantly, Windows installers need to be able to work without admin rights. Presumably this means describing a more abstract form of link, and then Python code would have to be rewritten to actually use those links. .pth files work for Python source, but for everything else (data files, compiled C .so files etc.) an entirely new mechanism would be needed. And right now, it seems that the PEP 778 author/sponsor/delegate aren't even sure if they want to tackle that wide of a scope.

On the flip side, I worry that there isn't actually enough demand for these features to get them prioritized. It seems like lots of developers out there are perfectly happy to e.g. specify "numpy" without a version as a dependency for today's new one-off hundred-line script, and download 15-20MB for it again because the latest point version isn't in cache.

Nomdeterministic installations

Posted Jul 10, 2025 4:18 UTC (Thu) by donald.buczek (subscriber, #112892) [Link] (12 responses)

> He went through some scenarios using a prototype pip that would download vendor plugins to detect various aspects of the environment (CPU micro-architecture or CUDA version, for example)

terrible

Nomdeterministic installations

Posted Jul 10, 2025 7:20 UTC (Thu) by gpth (subscriber, #142055) [Link] (11 responses)

Care to elaborate on why this would be "terrible"?

Thanks

Nomdeterministic installations

Posted Jul 10, 2025 7:54 UTC (Thu) by cpitrat (subscriber, #116459) [Link] (4 responses)

IIUC the idea is to download some code (a small python script probably) which runs on your machine and returns some information about your system.

I guess the "terrible" comes from the concern of transparently executing remote code.

Yet you're already downloading code that you'll execute anyway, coming, in theory, from the same source. And I think pip is already executing some python code at installation time. So I'm not sure if this idea should bring additional new concerns ...

Nomdeterministic installations

Posted Jul 10, 2025 8:01 UTC (Thu) by cpitrat (subscriber, #116459) [Link] (1 responses)

I just thought of a potential additional concern: supposing the script runs and returns which version of the package to use, one idea that comes to mind is some targeted attack: if hostname == target: return "compromised_package"

This supposes, of course, that the source cannot be trusted or has been compromised. The script would have to be obfuscated enough that the targeting is not obvious (e.g. relying on some very rare piece of hardware that the target is known to use).

A similar thing can already be done in the code of the package itself, but it would be in plain view for any user. With the exotic package, it could be hidden (unless there's a way to identify that the package is not built from published source) in a package that nobody is likely to review.

Nomdeterministic installations

Posted Jul 18, 2025 22:12 UTC (Fri) by zahlman (guest, #175387) [Link]

> A similar thing can already be done in the code of the package itself, but it would be in plain view for any user. With the exotic package, it could be hidden (unless there's a way to identify that the package is not built from published source) in a package that nobody is likely to review.

My understanding is that NVidia has already been doing this sort of thing with setup.py for years. Not involving malware, I presume, but my understanding is that they explicitly re-direct pip to download the real package from their own index, after running code to determine which one.

In principle, setup.py can be audited before running, but in practice you have to go quite some distance out of your way. `pip download` is not usable for this task (see https://zahlman.github.io/posts/2025/02/28/python-packagi...) so you need to arrange your own separate download explicitly and then convince pip to use that file. And then multiply that by all your transitive dependencies, of course.

Such code isn't included in, and doesn't run from wheels (i.e. specifying `--only-binary=:all:`), but then you don't get the potential benefits from trusted code, either. Assuming a wheel for your platform is available in the first place.

It seems that people want to be able to install code and its dependencies in a completely streamlined, fast way; but they also want to be able to use packages that interface to C code, and not have to worry about niche platform details (I saw https://faultlore.com/blah/c-isnt-a-language/ the other day and it seems relevant here), and avoid redundancy, and also have everything be secure. It really seems like something's gotta give.

Nomdeterministic installations

Posted Jul 10, 2025 16:45 UTC (Thu) by SLi (subscriber, #53131) [Link] (1 responses)

Tools like poetry now have problems, I believe, resolving dependencies because packages can dynamically compute some metadata relevant to the resolving (I forget the exact details). Obviously made worse by not being able to get any of that without downloading the package.

So this proposal sounds like something that would, perhaps, solve the needing-to-download part. But how would it play otherwise with the problem? Or is computable dependency resolution a lost cause?

Nomdeterministic installations

Posted Jul 18, 2025 22:33 UTC (Fri) by zahlman (guest, #175387) [Link]

> packages can dynamically compute some metadata relevant to the resolving (I forget the exact details). Obviously made worse by not being able to get any of that without downloading the package.

Packages adhering to recent metadata standards get their metadata files extracted automatically and made separately available on PyPI (the relevant standards are described in https://peps.python.org/pep-0658/). But the metadata for a modern source distribution (a `PKG-INFO` file — not pyproject.toml, which you may think of as "source" for the "built" PKG-INFO metadata) is still allowed to declare everything except the metadata version, package name and package version as dynamic. And in older versions of the standard, anything you omit is implicitly dynamic.

There is, now, a hook defined for getting this metadata (https://peps.python.org/pep-0517/#prepare-metadata-for-bu...), but there's nothing to force packages (more realistically, the build systems they depend on) to implement it. By default, the flow is: your installer program builds the entire "wheel" for the package (by setting up any build-system dependencies in an isolated environment, and then running the source package's included build-orchestration code), then checks what metadata is in *that* resulting archive file. (I sort-of touched on this in https://lwn.net/Articles/1020576/ , but without specifically talking about metadata.)

It isn't really supposed to be this way. In principle, https://peps.python.org/pep-0508/ describes a system for making a project's dependencies conditional on the target Python version, OS etc. But apparently this system still can't provide enough information for some packages — and many others are just packaged lazily, or haven't been updated in many years and are packaged according to very outdated standards. And this only helps for dependencies, not for anything else that might be platform specific. (Apparently, *licenses* are platform-specific for some projects out there, if I understood past discussion correctly.)

This is arguably just what you get when you want to support decades of legacy while having people ship projects that mix Python and C (and Fortran, and now Rust, and probably some other things in rare cases).

Nondeterministic installations

Posted Jul 10, 2025 10:08 UTC (Thu) by donald.buczek (subscriber, #112892) [Link] (5 responses)

I apologize for the lazy single-word negative comment and the typo in the subject line.

I'm not a fan of this for multiple reasons. Let me elaborate on just one of these reasons, which I wanted to express with the term "nondeterministic installation".

We run our own in-house distribution. You can think of me as a packager. The systems on which we install software are different from those where the software eventually runs. So every installer which wrongly assumes that the environment it sees at installation time is the same as at runtime needs extra work from us. We operate in a scientific environment, and want reproducibility. Of course, if at all possible, we prefer building from source. Regardless of whether we build from source, whenever an installer attempts to download something, we need to reverse-engineer the process. We then download the data in advance and modify the installer to use the local copy. We don't want the installation to fail or produce different results the next time because something external has changed or is no longer available.

Additionally, I wouldn't trust "vendor plugins" at all to correctly detect "various aspects of the environment". In my experience they only work for standard installations on one or two big distributions and nothing else.

Nondeterministic installations

Posted Jul 10, 2025 12:53 UTC (Thu) by gpth (subscriber, #142055) [Link] (1 responses)

I don't have all the details of your environment and you might have already considered this, in which case I apologize in advance :).

To me Python packaging, even with frozen, specific versions seems not enough for complete reproducibility, exactly for the reasons you highlighted. Without knowing all the details, multi-architecture container images (ie. https://developers.redhat.com/articles/2023/11/03/how-bui...) sound like a solution that would provide better reproducibility here.

Nondeterministic installations

Posted Jul 11, 2025 14:36 UTC (Fri) by donald.buczek (subscriber, #112892) [Link]

Thanks. I didn't want to go into every detail, but we already have our own solutions utilizing wrappers, environment variables, and symbolic links. These provide the programs with the runtime environment they need and enable us to install several versions and variations of the same software stack in parallel, which users can select from at runtime. These tools can also hide hardware complexity from the applications by providing hardware-specific variants of libraries etc.

If software just follows standards, everything is fine. For example, use execlp() to activate external programs, as it respects the PATH environment variable. Activate shared libraries with the dynamic linker which honors LD_LIBRARY_PATH. Open files by their canonical names, which allows redirection by symbolic links.

If, on the other hand, an installer 'locates' all the files at installation time and configures them with resolved paths, the setup may fail if the environment changes. This can occur, for example, if you switch to a machine with a different GPU, necessitating a different driver version and library variants.

But even outside our environment, the prototypical user with just a personal notebook would be annoyed when updating one thing calls for re-installation of another because the vendor-supplied oracle needs to look at the environment a second time. Backup and restore to a replacement system would no longer work.

Nondeterministic installations

Posted Jul 11, 2025 22:54 UTC (Fri) by dbnichol (subscriber, #39622) [Link]

That was basically my first thought, too. You definitely want to have an explicit path so the installer is deterministic. There are lots of cases where you need to know that you're going to consistently resolve a specific set of packages. However, having heuristics that are more likely to just DTRT is a win for casual developers. I didn't read the actual proposal, but I'd be surprised if this wasn't being considered already.

Nondeterministic installations

Posted Jul 14, 2025 19:07 UTC (Mon) by rjones (subscriber, #159862) [Link] (1 responses)

I don't see much difference between this proposal and the sort of "./configure && make && make install" dance that is very common in any sort of compiled language where it sets various flags and attributes based on the environment it finds itself in. Unless that stuff is well documented and thought out, there has always been an element of 'reverse engineering' when dealing with that sort of thing.

Of course, besides the downloading part.

Provided they "do the right thing" and support isolated or offline installs intelligently and in a standardized way, you shouldn't have to do any reverse engineering. Whenever dealing with Python or most any other language, you have to figure out a way to cache things on your network or locally if you want to avoid depending on pulling whatever is hosted on the internet.

It could range from "easier than dealing with rpm or deb files" to "being nearly impossible to use" depending on the details of their implementation.

Nondeterministic installations

Posted Jul 15, 2025 9:12 UTC (Tue) by donald.buczek (subscriber, #112892) [Link]

> I don't see much difference between this proposal and the sort of "./configure && make && make install" dance that is very common in any sort of compiled language where it sets various flags and attributes based on the environment it finds itself in.
> Unless that stuff is well documented and thought out, there has always been an element of 'reverse engineering' when dealing with that sort of thing.

Well, while in theory configure-scripts could also do all kinds of unwanted stuff, most of the time they are generated by GNU Autotools. And while GNU Autotools is archaic and unnecessarily complex, it has set and conforms to standards which are very much welcome. Autotools-generated scripts usually respect environment variables like $DESTDIR. They won't attempt to "edit" files into your /etc or replace some library in your /usr/lib with patched versions.

Typically, a "configure" checks for the "availability" of a feature, for example a library with a specific minimum API version. If it's not available, the script will either abort the configuration process or adjust the configuration to exclude features that require the missing library. However, once you have such a required component in your system, there is no reason it would go away. If the dependencies are well-designed, upgrading them typically does not cause problems. A typical configure script will not check for hardware attributes like the amount of memory or the number of cores or the specific version of your graphics card.

Yes, sometimes we can't just rebuild very old stuff, because of changes in the C-compiler or when the required software was not as well designed. But generally, packages with Autotools work well for us.

> It could range from "easier than dealing with rpm or deb files" to "being nearly impossible to use" depending on the details of their implementation.

Right. Maybe the problem is as follows: When I think of "vendor plugins to detect various aspects of the environment", I think of the NVIDIA driver installer and that is a major pain point.


Every language reinvents the wheel

Posted Jul 14, 2025 12:54 UTC (Mon) by vivo (subscriber, #48315) [Link] (3 responses)

managing dependencies is much more difficult than language architects realize; that's why Linux distribution package managers are usually nontrivial programs.

I do really hope that the most used languages in open source converge on one and only one package manager, be it based on rpm, dpkg or portage.
This would make the life of the distributions much easier and the entry barrier for developers lower; as a bonus, it would improve the quality of dependency management for everybody.

Every language reinvents the wheel

Posted Jul 14, 2025 13:05 UTC (Mon) by Wol (subscriber, #4433) [Link]

> rpm, dpkg or portage.

And here lies the crux of the problem. Rpm and dpkg would need massive re-engineering to cope with what portage does, I suspect, while portage is massive overkill for what rpm or dpkg do.

And then one only has to look at Red Hat and SuSE - two rpm-based distros that (less now than previously) were pretty much incompatible despite sharing the same package manager.

Everyone thinks that rpm distros were mostly cross-compatible, but that was for the same reason that dpkg distros are compatible - dpkg distros are all (afaik) Debian derivatives. MOST rpm distros are descended from Red Hat, but SUSE is descended from Yggdrasil/Slackware (and the rest... I can't remember the complications).

Cheers,
Wol

Every language reinvents the wheel

Posted Jul 15, 2025 7:17 UTC (Tue) by taladar (subscriber, #68407) [Link]

That is never going to happen, mostly because languages differ quite significantly in how they handle compile-time options and most of the distro package managers other than portage don't handle them at all. Portage on the other hand has a lot of special casing for situations where the simple on/off USE flags don't work.

Take e.g. Rust features: they are also on/off flags like USE flags in portage, but they work entirely differently again, with cargo assuming that any version of a crate with a specific feature enabled (regardless of the values of other features) will satisfy a dependency that asks for that crate with that feature.

If you wanted a grand unification of all package managers you should probably first start trying to define what a package actually is (including details like optional compile time features and other compile time options such as alternative dependencies (think openssl/gnutls/...)). You would quickly realize that this is somewhere between hard and impossible.

Every language reinvents the wheel

Posted Jul 18, 2025 22:52 UTC (Fri) by zahlman (guest, #175387) [Link]

> managing dependencies is much more difficult than language architects realize; that's why Linux distribution package managers are usually nontrivial programs.

My understanding is that Guido van Rossum understood this quite well and made it very explicitly not his problem, which is why pip and Setuptools are technically third party and distutils got deprecated and eventually removed from the standard library.

For pure Python projects that can rely on the basic already-compiled bits of C code in the standard library (for basic filesystem interaction etc.), dependency management is generally not a big issue in Python IMX. Python's design makes it impractical to have multiple versions of a package in the same environment, which occasionally causes problems. But usually, everything is smooth even with compiled C code as long as it can be pre-compiled and the system is sufficiently standard to select a pre-compiled version. When I switched over to Linux for home use it never even occurred to me to worry about whether I'd lose (or complicate) access to popular Python packages, and indeed I didn't.

> I do really hope that the most used languages in open source converge on one and only one package manager, be it based on rpm, dpkg or portage.

Perhaps it's gauche to point it out on LWN, but this will not satisfy the needs of the very large percentage of Python programmers who are on Windows.

But also, to my understanding, none of these tools (or their corresponding package formats) are oriented towards installing things in a custom location, which is essential for Python development and even for a lot of end users. I'm not sure even chroot would help here — currently, pip needs to consult the `sysconfig` standard library (https://docs.python.org/3/library/sysconfig.html) to determine install paths, and it also supports installing for the system (recent Python versions may require a security override), in a separate user-level directory or in a virtual environment. (And you really do need virtual environments.)

This does not feel radical enough

Posted Jul 15, 2025 21:47 UTC (Tue) by gray_-_wolf (subscriber, #131074) [Link] (10 responses)

Since (at least partially) the motivation here is scientific computation and reproducibility, I wonder whether it would not be better to just adopt GNU Guix (or, I guess, Nix). You would be able to express dependencies on native libraries and fine-tune for specific architectures as much as you want. If binary substitutes are available, the package is fetched; otherwise it is compiled from source.

Well, I just wonder whether "wheel" is really the best model to base reproducible (scientific) computing on. Was a more radical approach considered?

This does not feel radical enough

Posted Jul 16, 2025 1:45 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (8 responses)

By “reproducible”, do you mean that the results are bit-for-bit identical, or that the results are scientifically equivalent? The former is often impossible. The latter is much more likely to be feasible.

This does not feel radical enough

Posted Jul 18, 2025 18:37 UTC (Fri) by dvdeug (guest, #10998) [Link] (6 responses)

Why would bit-for-bit identical be impossible? Don't use randomness, and be careful about using multiple threads. You might need to worry about different ISAs; best to run the same binary on all. There's a lot of work on Debian to build arbitrary packages to be bit-for-bit identical; it's definitely possible. Note that old console video games are bit-for-bit identical; TASbot gives the same series of inputs, and the game gives the same outputs every time.

As for results being scientifically equivalent? Early weather simulations established that simulations which are not bit-for-bit identical will diverge. Even in less chaotic cases, how do you know they're scientifically equivalent? Bit-for-bit is easy to check; scientifically equivalent is hard, if not impossible, to check.

This does not feel radical enough

Posted Jul 18, 2025 22:54 UTC (Fri) by zahlman (guest, #175387) [Link]

> Note that old console video games are bit-for-bit identical; TASbot gives the same series of inputs, and the game gives the same outputs every time.

The latter does not prove the former.

This does not feel radical enough

Posted Jul 19, 2025 13:25 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (4 responses)

Two results are scientifically equivalent if the difference between them is within the margin of error. The divergent behavior you mentioned is real, but what it indicates is that neither result is to be trusted sufficiently far in the future, because even tiny differences in the inputs to the simulation would also cause divergence.

This does not feel radical enough

Posted Jul 19, 2025 15:12 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

> because even tiny differences in the inputs to the simulation would also cause divergence.

But do they? Depends on the chaos!

Some chaotic structures diverge rapidly with small differences in the inputs. Others (it's called a "strange attractor" iirc) find it very hard to break away from a stable pattern. Your high-pressure summer weather system that resolutely refuses to move is one of these.

Cheers,
Wol

This does not feel radical enough

Posted Jul 19, 2025 21:09 UTC (Sat) by kleptog (subscriber, #1183) [Link]

Sure, but it's usually fairly straight-forward to determine whether a model is stable or not. If your model isn't stable then the results are going to be suspect no matter what you do.

And there's a whole branch of mathematics about how to fix algorithms to improve numerical stability. Floating-point numbers on computers have properties that mean you sometimes won't get the answer you hope for with a naive implementation.

This does not feel radical enough

Posted Jul 31, 2025 6:03 UTC (Thu) by dvdeug (guest, #10998) [Link] (1 responses)

If you're checking to see if an asteroid could hit Earth, it's not good enough to say that a simulation that was scientifically equivalent showed it won't. Yes, with current measurements there's a scientifically equivalent simulation that says it won't, but we need to know if it could, consistent with the current measurements, so we know if we need to measure it better. Likewise with weather, we know that with the limits of our current knowledge, it's consistent there won't be a tornado three days from now. The question is, could there be? Moreover, you don't know beforehand whether chaos will pop up; being bit for bit lets you check that, whereas scientifically equivalent is unrepeatable.

This does not feel radical enough

Posted Jul 31, 2025 8:33 UTC (Thu) by farnz (subscriber, #17727) [Link]

A lot of that is about what your simulation is meant to tell you; a simulation that says "will happen" or "won't happen" is only really of use as a component of a larger simulation that turns "measurements with error bars" into "Probability Density Function of event happening".

With a "yes"/"no", you're never going to know whether it's because you hit a lucky outcome; with a full PDF, you can see that "most likely is asteroid miss by 100 km, but there's a good chance of an asteroid hit", or "outcome is chaotic - all possible sizes of storm are equally likely from the measurements".

This does not feel radical enough

Posted Jul 20, 2025 6:58 UTC (Sun) by donald.buczek (subscriber, #112892) [Link]

> By “reproducible”, do you mean that the results are bit-for-bit identical, or that the results are scientifically equivalent?

To answer that question: In reality we (IT) are happy when we can provide an environment where you can run and recompile applications which were developed a decade ago. The actual scientific applications are developed, for example, by Bioinformaticians, who should think about the aspects of reproducibility on this level. Some do, most don't. If the output depends on races or other sources of randomness, we can't help. Most of the time, it doesn't matter for the scientific conclusions, though. Still it would be good for review if you could reproduce the output exactly and not just statistically.

This does not feel radical enough

Posted Jul 16, 2025 13:36 UTC (Wed) by aragilar (subscriber, #122569) [Link]

The elephant in the room is Windows. Once you're willing to drop Windows there are lots of options, but practically once you include Windows you only have conda (and various reimplementations of it). For some reason people want to push the PyPI/PyPA ecosystem to be a clone of the conda ecosystem due to some perceived issues with conda (when it's likely the issues people have with conda are due the constraints needed to support such a setup).

Shared library packages

Posted Jul 17, 2025 6:51 UTC (Thu) by callegar (guest, #16148) [Link] (12 responses)

What would be very useful at this point is the ability to be able to package binary libraries to support other packages and to be shared among them. An obvious example is blas. As of today, every PyPI wheel seems to vendor its blas, which is horrible:

1. it hinders the possibility of picking the best blas for your system or of testing different options;

2. most importantly, it breaks the possibility to do multithreading right. Every individual blas ends up with its own view of how many cores the system has and how many of them it should parallelize over. This means that if multiple packages end up doing things concurrently and each uses blas, you end up with a very suboptimal case where more threads than cores are employed. For this reason, it looks like many packages that vendor some blas in their wheel build it to be single-core. And again this is a performance loss.

3. it makes packages bigger than they should be and memory usage larger than it should be, with an obvious loss in cache performance.

Shared library packages

Posted Jul 17, 2025 6:54 UTC (Thu) by callegar (guest, #16148) [Link]

I mean being able to solve at least this issue would already be a huge win. I don't know if this can be done *in the short term* without having to thoroughly change the way in which things are packaged.

Shared library packages

Posted Jul 17, 2025 7:38 UTC (Thu) by intelfx (subscriber, #130118) [Link] (10 responses)

> What would be very useful at this point is the ability to be able to package binary libraries to support other packages and to be shared among them

I think this is the final undeniable proof that every language-specific package silo which is created to bypass "those useless distros" eventually becomes just a worse ad-hoc distro as soon as people start using it to solve actually hard packaging problems.

Re-inventing distro mechanisms

Posted Jul 17, 2025 8:59 UTC (Thu) by callegar (guest, #16148) [Link] (9 responses)

Reinventing distribution-like mechanisms is sometimes not just the consequence of an initial sense of superiority with respect to the "useless" distros, but unfortunately a necessity to work around some inherent aspects of certain modern programming languages and environments.

A notable aspect with Python (and other languages) is that while packages can in principle be shared among many applications (ultimately looking a bit like shared libraries), it is impossible to make different versions of the same package co-exist, which makes the traditional way of packaging things adopted in distros a nightmare.

If you need applications A, B and C and they all rely on package X, then either you vendor X in A, B and C or the distro needs to find a single version of X that can satisfy A, B and C at the same time. If this is impossible, then the traditional distro approach will fail or force the developers to patch downstream A, B, C or even X to solve the issue. For languages that enable having different versions of the same package coexist, it becomes a matter of providing both a package of X-1 and of X-2, so that, for example A can depend on X-1 and pull it in when installed and B can depend on X-2.

The reason why you need tools like conda or uv, capable of managing a huge cache of pre-built packages, of creating virtual environments, and of creating in there a forest of links to the packages in the cache, is not just "providing some isolation", but also (and I would say in great portion) a workaround for not having the possibility of dropping all packages in a single place in multiple versions as needed and having the projects go seek themselves the versions they can work with.

In some sense the "keep it simple here" (no package versioning) ends up "making it complex somewhere else" (no practical possibility to rely on common distro packaging tools and the *real* need to reinvent them).

Re-inventing distro mechanisms

Posted Jul 17, 2025 20:27 UTC (Thu) by raven667 (subscriber, #5198) [Link] (8 responses)

> A notable aspect with Python (and other languages) is that while packages can in principle be shared among many applications (ultimately looking a bit like shared libraries), it is impossible to make different versions of the same package co-exist

Is this a solvable problem, creating a new mechanism for loading modules or declaring dependencies to get a soname-like experience for Python that can be retrofitted in in a way that affects new code which is updated to take advantage of it but not existing code which doesn't know about it? Maybe some special attribute of __init__ or something which can provide version range info, and a new directory structure or naming convention for module_name@version or something, with a constraint that the same python interpreter maybe cannot load two different versions of the same module_name at the same time and will have an import error exception instead if its attempted. This could allow the python interpreter to have the same behavior as if you used a virtualenv but integrated with the system-wide directory structure and far more tractable for a package manager to update by not having overlapping files.

Re-inventing distro mechanisms

Posted Jul 18, 2025 23:26 UTC (Fri) by zahlman (guest, #175387) [Link] (7 responses)

> Is this a solvable problem, creating a new mechanism for loading modules or declaring dependencies to get a soname-like experience for Python that can be retrofitted in in a way that affects new code which is updated to take advantage of it but not existing code which doesn't know about it?

Many people have this idea (there was a DPO thread recently, even: https://discuss.python.org/t/_/97416) but it really isn't feasible, even without considering "retrofitting".

When you import a module in Python by the default means, it's cached process-wide (in a dictionary exposed as `sys.modules`). This allows for module objects to function as singletons — doing setup work just once, allowing for "lazy loading" of a module imported within a function, customizing an import with runtime logic (commonly used to implement a fallback) etc. etc. And of course it avoids some kinds of circular import problems (though there is still a problem if you import specific names `from` a module - https://stackoverflow.com/questions/9252543), and saves time if an import statement is reached multiple times.
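
A minimal sketch of that caching contract:

    import sys
    import json
    import json as second_look

    print(json is second_look)          # True: only one module object exists
    print(sys.modules["json"] is json)  # True: cached process-wide by name

    # Any later "import json", in any module in this process, gets the same
    # object back -- the sharing that per-version isolation would break.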

But this means that, even if you come up with a syntax to specify a package version, and a scheme for locating the right source code to load, you break a contract that huge amounts of existing code rely upon for correctness. In particular, you will still have diamond dependency problems.

Suppose that A.py does `import B` and `import C`, and calls `B.foo()` and `C.bar()`; those modules both `import X`, and try to implement their functions using X functionality. Suppose further that they're written for different versions of X. Now suppose we add a syntax that allows each of them to find and use a separate X.py (such that each one implements the API they expect), and revamp the import system so that the separate X module-objects can coexist in `sys.modules` (so that B and C can keep using them).

*Now the code in A can break*. It may, implicitly, expect the `B.foo()` call to have impacted the state of X in a way that is relevant to the `C.bar()` call, but in reality C has been completely isolated from that state change. And there is no general solution to that, because in general the internal state for the different versions of X can be mutually incomprehensible. They are effectively separate libraries that happen to look similar and have the same name.

In the real world, you *can* vendor B with its own vendored X1, and vendor C with its own vendored X2, and patch the import statements so that the vendored B and C access their own vendored Xs directly. But you can only do this with the foresight that B and C both need X, and then you have to write the A code with awareness of the X state-sharing problems. And none of what you vendor will be practically usable by *other* code that happens to have the same dependencies. In practice, vendoring is pretty rare in the Python ecosystem. (Pip does it, but that's because of bootstrapping issues.)

Re-inventing distro mechanisms

Posted Jul 19, 2025 4:22 UTC (Sat) by raven667 (subscriber, #5198) [Link] (6 responses)

> Suppose that A.py does `import B` and `import C`, and calls `B.foo()` and `C.bar()`; those modules both `import X`, and try to implement their functions using X functionality. Suppose further that they're written for different versions of X. Now suppose we add a syntax that allows each of them to find and use a separate X.py (such that each one implements the API they expect), and revamp the import system so that the separate X module-objects can coexist in `sys.modules` (so that B and C can keep using them).

No, not that. I'm nowhere near qualified to be a language designer, but I was not suggesting that Python could be evolved to support two modules of different versions loaded in one interpreter process at the same time. Once they are loaded in the interpreter, there should be only one instance of X, and if the second import specifies that it is not compatible with the version of X which is loaded, then it should fail (which is actually better than today, which only has version checks on import if they are explicitly coded, not automatically, I don't think).

Solving the whole problem, like Rust does where different parts can load different versions of libraries, which can be implemented in different versions of the language standard, is amazing, but defining a more easily tractable subset of the problem then solving that is often good enough.

Re-inventing distro mechanisms

Posted Jul 24, 2025 6:26 UTC (Thu) by callegar (guest, #16148) [Link] (5 responses)

What is practically happening is that for some packages what would be the "major" version number, i.e., the number indicating non-backward compatible API changes in semantic versioning becomes a part of the package name.

So you get `foo1`, `foo2` and `foo3` rather than `foo` and you can have `foo1`, `foo2` and `foo3` coexist. This clearly does not let you *share state* between `foo1` and `foo2`. However, if the APIs are different the very idea of *sharing state* becomes scary.

Obviously, this requires discipline in package naming. But I think it would help a lot the work of distros if more packages followed this approach.

Re-inventing distro mechanisms

Posted Jul 24, 2025 10:50 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

There's two problems with this approach (it's been tried before, and doesn't help distros):
  1. Distros don't want to package foo1, foo2 and foo3. They want to just package a single maintained version, ideally, but can compromise if there are multiple maintained versions. However, in practice what seems to happen is that foo1 gets abandoned by its upstream, foo3 is the only maintained version, and the distro is now on the hook for pushing all the packages that depend on foo to stop using foo1 or foo2. We've seen this with, for example, GTK+, where GTK+ major versions can coexist, and the work of telling projects to move to a supported version of GTK+ has entirely fallen on the distros.
  2. The same discipline required from foo's developers to ensure that you cannot have different major versions sharing state is also the discipline needed to do things like glibc's versioning linker script. Otherwise, you get things like "we read /etc/foo/config" in both foo1 and foo3, but foo's developers forget that a particular key had a meaning in foo1, and reuse it for a different meaning in foo3 - after all, no-one maintains foo1 any more, so no-one remembers it that well.

This pushes towards the glibc solution (one version, but with backwards compatibility), rather than parallel installability - not least because parallel installability leads towards an NP-complete problem for the packaging team trying to minimise the number of versions of foo that they maintain in the distro.

Re-inventing distro mechanisms

Posted Jul 25, 2025 4:45 UTC (Fri) by donald.buczek (subscriber, #112892) [Link] (3 responses)

> Otherwise, you get things like "we read /etc/foo/config" in both foo1 and foo3, but foo's developers forget that a particular key had a meaning in foo1, and reuse it for a different meaning in foo3

Exactly, the system that has long been chosen for Unix-like operating systems sorts files according to function (/usr/bin, /etc, ...).

You can get quite far with $PREFIX installations and wrappers, but it's ugly and opaque. I sometimes think that a system that always bundles software in name-version-variant directories and supports the dynamic networking of these components as a core principle would be better from today's perspective.

Re-inventing distro mechanisms

Posted Jul 25, 2025 7:22 UTC (Fri) by taladar (subscriber, #68407) [Link] (2 responses)

> I sometimes think that a system that always bundles software in name-version-variant directories and supports the dynamic networking of these components as a core principle would be better from today's perspective.

To me that feels just like pushing the problem onto the user. The complexity is all still there, the distro just does not have to care so much about it but the user who wants to use the components together still does and has to constantly make choices related to that.

Re-inventing distro mechanisms

Posted Jul 26, 2025 6:14 UTC (Sat) by donald.buczek (subscriber, #112892) [Link] (1 responses)

But the user would still be free to run a simple command or use ./configure without selecting a specific version or variant of existing software, and the system would assume the single recommended version or variant.

The user, acting as an admin, would still be able to install new versions and the distribution-provided package manager would analyze the dependencies. It would just not try to resolve this to a result in which each package can only exist once. It might keep other/older variants around which are required by other packages. It might support some kind of diamond dependencies, too.

The basic difference would be, that packages go into their own file system tree so that they don't conflict with each other if multiple variants of the same package are wanted or needed.

Re-inventing distro mechanisms

Posted Jul 26, 2025 10:59 UTC (Sat) by farnz (subscriber, #17727) [Link]

This begins to sound like a reinvention of NixOS and similar distributions, which puts every package into its own prefix, and can support just about any dependency setup you care about as a result.

