Namespaces for the Python Package Index

By Jake Edge
May 3, 2023

PyCon

The Python packaging picture is generally a bit murky; there are lots of different stakeholders, with disparate wishes and needs, which all adds up to a fairly large set of multi-faceted problems. Back in the first three months of the year, we looked at various discussions around packaging, some of which are still ongoing. A packaging summit was held at PyCon 2023 to bring some of the participants of those discussions together in one room. One of its sessions was on adding a namespaces feature to the Python Package Index (PyPI). It provides a look into some of the difficulties that can arise, especially when trying to accommodate a long legacy of existing practices, which is often a millstone around the neck of those trying to make packaging improvements.

PyPI namespaces

PyPI has long been the go-to source for various kinds of Python modules, libraries, and applications. One of the PyPI maintainers, Dustin Ingram, wanted to discuss some ideas for providing namespaces for PyPI packages. The basic problem is that everyone is sharing a single global namespace on PyPI, he said. There are around 250,000 packages on PyPI currently, but many of them are old packages that may not really be in use any longer; overall, there is a lot of contention for package names.

[Dustin Ingram]

The PEP 541 ("Package Index Name Retention") process for handling name conflicts "does not really scale", he said. It provides a means to acquire the package name from an abandoned or unmaintained project, but it is an involved process that requires a fair amount of manual review. The PyPI administrators have not prioritized that work, so there is a big backlog. Meanwhile, the task also requires "extremely high trust permissions" in order to transfer the ownership of a package, so it cannot just be handed off to new volunteers.

Typosquatting on package names is another problem that PyPI faces. There is code in place to prevent people from registering names that are confusingly similar to existing package names, but it is restrictive, of necessity. So it stops people who are trying to do legitimate things, which is not desirable.
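To see why a similarity check is necessarily restrictive, consider a minimal sketch. It is not PyPI's actual implementation; it simply normalizes names as PEP 503 describes and rejects any candidate within one edit of an existing name. The function names and the threshold are assumptions made for illustration.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def too_similar(candidate: str, existing: set[str], max_distance: int = 1) -> list[str]:
    """Return the existing names within max_distance edits of the candidate."""
    wanted = normalize(candidate)
    return sorted(name for name in existing
                  if edit_distance(wanted, normalize(name)) <= max_distance)


existing_names = {"requests", "flake8", "numpy"}
print(too_similar("requets", existing_names))  # a likely typosquat
print(too_similar("flake9", existing_names))   # a possibly legitimate name, also blocked
```

Under this sketch, a typo such as "requets" is caught, but so is a hypothetical, perfectly legitimate "flake9"; any rule tight enough to stop the former will tend to stop some of the latter.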

Meanwhile, there are companies and others that want to be able to restrict the prefixes of package names so that packages coming from an "official source" can be distinguished from those coming from elsewhere. An easy way to handle that would be to hand out a namespace that only the organization can publish packages to. During PyCon, the PyPI project announced the addition of organizations for the index. The idea is that companies, projects, and, well, organizations can register with PyPI in order to be able to maintain a collection of projects as part of a single PyPI page, such as the one for the Python Cryptographic Authority (PyCA). But support for namespaces is not part of the new feature.

When Ingram mentioned the new organizations feature, it was met with a round of applause from the two dozen or so summit attendees. Organizations had been the number one feature request for PyPI, but now that it has been implemented, PyPI namespaces moves up into that slot, he said. It is not a surprise that organizations see namespaces as a means to ensure that it is clear when code is coming from them.

"This has been done before", he said. One way would be to have GitHub-style namespaces, so that users and organizations could have the same project names that do not collide. Also, the npm ecosystem for JavaScript implemented a namespace feature in 2015 when it was at a similar size to what PyPI is today. There may be some lessons to be learned there. He is not close enough to that community to say whether it has been a success, but it does seem to be a widely used feature at this point.

One important requirement for the PyPI namespaces feature is that it not cause any breaking changes. Current installers should continue to function as they do today even if the target package has moved into a namespace. While he was speaking for himself, he believes that the other PyPI administrators are generally of the same mind. He wondered what attendees thought about the feature, whether it should be pursued, and, if so, how to make it happen.

Discussion

Former PyPI project manager (PM) Sumana Harihareswara (who wrote about PyPI for LWN back in 2018) wondered if Shamika Mohanan, the current PM, had gathered information on use cases and user-experience expectations for the namespaces feature as Mohanan had done for the organizations feature. Mohanan was not at the summit, but Ingram said that all of the work done so far has been targeting the organizations feature, not namespaces, at least so far.

Harihareswara said that, during the pip overhaul in 2020, the team did a lot of user-experience research, including user testing and interviews. She suggested doing that kind of research for the namespaces feature because understanding the mental models of the users, developers, and companies will result in a better feature. Ingram agreed and noted that he did not expect that the feature would be completed quickly; like organizations, it is a complicated feature that will require quite a bit of work. He expects that it will require funding and thinks that the funding should include money for research of that sort.

There are already namespaces, of a sort, in use; for example, pytest has a convention that the names of its plugins start with pytest-, as pytest developer Phebe Polk pointed out. Having namespace support in PyPI would be nice, she said, but it is unclear how the feature would work with existing conventions like that; there may also be conflicts with names that are being used in private PyPI mirrors used internally. One attendee suggested that registering empty namespaces, in order to reserve them for internal use, should be supported.

Bernát Gábor said that he works at Bloomberg, which has a lot of internal packages, but he also maintains public packages. He wondered about adding existing packages into a new namespace once it gets created. Companies will probably want to own everything under their namespaces, but open-source projects may want to allow other projects to publish packages under their namespaces. Ingram thought that the owner of the namespace would be able to make those kinds of decisions; he said that following the GitHub model makes sense.

The large companies already have problems with package names that refer to their names or products, but are not official packages, such as tools for NVIDIA or Amazon Web Services (AWS), Peter Wang said. "Names are a hard problem, we all know that". Organizations are going to expect that they completely control their namespace, but he wondered about packages from others that add some functionality on top of the official package.

Many companies and other organizations have their own internal repositories and mirrors that they use, so there is a question of how that interacts with the PyPI namespaces, he continued; for example, how will the resolution order be determined when searching for a package? To his eye it seems like a federated name system of some sort, perhaps modeled on the Domain Name System (DNS) or public-key infrastructure (PKI), may be needed. That may also mesh well with future plans for package signing in order to address supply-chain-security concerns.
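The resolution-order question can be made concrete with a small sketch. Nothing here reflects an existing pip or PyPI mechanism; the Index class, the resolve() function, and the "acme-" prefix are all hypothetical. The idea is that names under a pinned prefix are only ever looked up in one designated index, while everything else falls back to first-match order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Index:
    """A package index, reduced to its name and the package names it serves."""
    name: str
    packages: frozenset


def resolve(package: str, indexes: list, pinned_prefixes: dict) -> Optional[str]:
    """Return the name of the index that should serve the package, if any.

    pinned_prefixes maps a name prefix (hypothetical, e.g. "acme-") to the
    only index that is allowed to serve names starting with that prefix.
    """
    for prefix, index_name in pinned_prefixes.items():
        if package.startswith(prefix):
            for index in indexes:
                if index.name == index_name and package in index.packages:
                    return index.name
            return None  # never fall through to another index
    for index in indexes:  # unpinned names: first index that has it wins
        if package in index.packages:
            return index.name
    return None


public = Index("public", frozenset({"requests", "acme-billing", "acme-tools"}))
internal = Index("internal", frozenset({"acme-billing"}))
search_order = [public, internal]

print(resolve("acme-billing", search_order, {}))                     # public
print(resolve("acme-billing", search_order, {"acme-": "internal"}))  # internal
print(resolve("acme-tools", search_order, {"acme-": "internal"}))    # None
```

Without the pin, a same-named package on the public index shadows the internal one purely because of search order; with it, a name that exists only on the public index is refused rather than silently substituted.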

Ingram agreed that package names that refer to another organization (e.g. a google-* package from a FOSS developer) are going to be problematic. He thinks it will be important to make a clear distinction between packages that are in a specific namespace versus those that have a prefix or other part of the name that refers to an organization. Currently, there are packages that have prefixes that do accurately identify the organization behind them, others that are unaffiliated with the prefixed organization, and still others that have no real naming convention at all. It is likely that new syntax will be needed to make the distinction clear and that a mapping layer will be required to map between names outside and inside of namespaces.
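As a rough illustration of what such a mapping layer might look like, the sketch below borrows npm's "@scope/name" spelling purely as a stand-in; no syntax has been proposed for PyPI, and the PackageId type, the LEGACY_ALIASES table, and the "acme" names are all hypothetical.

```python
from typing import NamedTuple, Optional


class PackageId(NamedTuple):
    namespace: Optional[str]  # None means the existing flat namespace
    name: str


# Hypothetical table of flat names that have moved into a namespace.
LEGACY_ALIASES = {
    "acme-client": PackageId("acme", "client"),
}


def parse(requested: str) -> PackageId:
    """Map a requested name to a (namespace, name) pair."""
    if requested.startswith("@"):
        namespace, separator, name = requested[1:].partition("/")
        if not (namespace and separator and name):
            raise ValueError(f"malformed namespaced name: {requested!r}")
        return PackageId(namespace, name)
    # A flat name is namespaced only if it has been explicitly mapped;
    # merely sharing a prefix with an organization means nothing.
    return LEGACY_ALIASES.get(requested, PackageId(None, requested))


print(parse("@acme/client"))     # explicit namespace syntax
print(parse("acme-client"))      # legacy name mapped into the namespace
print(parse("acme-unofficial"))  # prefix only; stays in the flat namespace
```

The second lookup is what would let current installers keep working after a package moves, while the third shows the distinction Ingram described: a prefix alone does not confer namespace membership.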

An attendee asked how many of those present were familiar with how npm had added its namespaces. Only about five people raised their hands, so it probably makes sense to put together a report on how that was done, he said. That will help inform any decisions based on the npm experience.

Clashes do not only exist at the PyPI package-name level, Toshio Kuratomi said; there is also the question of what name will be used in the Python import statement. The problem already exists today, but he thinks it will get worse with namespaces. For example, Fedora had a mock package at one time, but it was not on PyPI, so a mock package that was added to the index effectively pushed Fedora's package aside. Ingram agreed that mismatches between the package name and the imported module name are a real problem, but that problem already exists, so it is a bit out of scope for a discussion of the namespaces feature.

Organizations may wish to have aliases so that they can, essentially, typosquat themselves, an attendee said. For example, Meta may want to have aliases for its name, Instagram, Facebook, and other variants of those names. Ingram asked if being able to reserve multiple empty repositories would suffice. That would take care of the security problem, the attendee said, but not the usability problem; users may want to get the same package under the canonical name and its variants. Ingram agreed that some kind of aliasing would need to be incorporated into the feature.

Another attendee asked about whether there are efforts to target heavy users to get them to help finance PyPI and its infrastructure. Ingram said that while there are some heavy users, they are not really causing problems for PyPI right now. There are some users, for example those providing large GPU binaries for TensorFlow models, who may be heading in that direction, but the PyPI administrators are working with those projects to ensure that the problem does not get out of control. There are also already some protections in place to prevent overconsumption of PyPI's resources.

Harihareswara suggested that the financial tradeoffs in various potential implementations for namespaces be clearly aired in the upcoming discussions. If there are, for example, somewhat suboptimal designs that would end up saving an enormous amount of time (and money), that should be factored into the decision-making. Knowing what budget is available for the project will also help guide the community in making the appropriate choices. Ingram agreed, noting that the person who fills the newly announced position for a PyPI Safety and Security Engineer will likely have a role to play in the design of the feature as well. PyPI namespaces are pretty clearly part of the safety and security story for the package index and the language as a whole.

[I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]


Index entries for this article
Conference: PyCon/2023
Python: Packaging


Namespaces for the Python Package Index

Posted May 3, 2023 19:21 UTC (Wed) by SLi (subscriber, #53131) [Link] (12 responses)

It feels like there's some context missing here. From the text, I get the idea that what is planned is some kind of DNS-style namespace system with local lookup (why? Why not the Java reverse domain name system?), but none of this is really explained.

Namespaces for the Python Package Index

Posted May 4, 2023 6:35 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (11 responses)

> Flat is better than nested.

(The Zen of Python, PEP 20)

In more explicit terms:

* Nobody likes typing out com.example.foo a hundred million times in their import statements.
* Who is to say whether example.com will still be owned by the same people in a year or ten?
* There needs to be a process for PyPI to reclaim an abandoned, compromised, or maliciously-used name, and that process should not involve IANA and/or UDRP.
* I shouldn't need to buy a whole domain for a hobby project (and I am far more likely to do a hobby project in Python than in Java).

Regardless, this surely is not just a local-only thing, because namespace packages have been around for a very long time, and should totally satisfy that need for people doing their own custom deployments. See [1] for details. If this article is to make any sense at all, it must be talking about some sort of *centralized* standard.

[1]: https://packaging.python.org/en/latest/guides/packaging-n...
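For readers who have not used them, here is a small self-contained sketch of the import-level namespace packages (PEP 420) mentioned above. It builds two throwaway "distributions" in temporary directories; the package name examplens and its subpackages are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_distribution(root: Path, subpackage: str, value: str) -> None:
    """Create root/examplens/<subpackage>/__init__.py.

    There is deliberately no examplens/__init__.py, which is what makes
    examplens an implicit namespace package.
    """
    package_dir = root / "examplens" / subpackage
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(f"VALUE = {value!r}\n")


with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    build_distribution(Path(first), "alpha", "from the first distribution")
    build_distribution(Path(second), "beta", "from the second distribution")
    sys.path[:0] = [first, second]
    importlib.invalidate_caches()
    try:
        alpha = importlib.import_module("examplens.alpha")
        beta = importlib.import_module("examplens.beta")
        alpha_value, beta_value = alpha.VALUE, beta.VALUE
        # The namespace's __path__ spans both directories.
        namespace_portions = len(list(sys.modules["examplens"].__path__))
    finally:
        sys.path.remove(first)
        sys.path.remove(second)

print(alpha_value, "|", beta_value, "|", namespace_portions)
```

Both subpackages import under the shared examplens name even though they live in separate directories. Note that this only governs import names on one machine; it says nothing about who may upload a given name to an index, which is the centralized question the article is about.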

Namespaces for the Python Package Index

Posted May 4, 2023 7:43 UTC (Thu) by lunaryorn (subscriber, #111088) [Link] (7 responses)

I once heard the theory that typosquatting is largely not a problem in the Java ecosystem because these reverse DNS group names make artifact names so long that no one actually types them out; people copy them from elsewhere (e.g. search.maven.com, Github, etc) which makes it much harder to trick people into copying wrong artifact names. And there are probably way too many potential typos to attempt typosquatting on Maven Central.

I actually really like the Java naming system; I think it's one of the few things Java really got right from the start. I believe it helps to manage trust because you can selectively delegate trust along the namespace hierarchy. It also supports routing in proxy repositories: You can summarily accept certain root namespaces into your proxy repository, require manual verification for others (e.g. the whole io.github hierarchy), and make sure that your own "com.example.your-company.internal" packages never get resolved from a public repository.

It also moves load off maintainers in a central repository: Sonatype can afford a comparatively strict and cumbersome application process, because they only check group names once, whereas moderation on PyPI or crates.io has no choice but to check each and every package.

All this is somewhat impossible to do with flat packaging as in Rust or Python. Sure these names are harder to type out, but that's what tooling exists for, in my opinion.

Namespaces for the Python Package Index

Posted May 4, 2023 7:52 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (4 responses)

The whole point of Python is to move away from the Java style of doing things (i.e. big enterprisey hierarchical structures). If you want Java, then use Java.

Namespaces for the Python Package Index

Posted May 4, 2023 9:56 UTC (Thu) by SLi (subscriber, #53131) [Link]

That sounds a bit like the purpose is to do it differently only because Java does it this way. As I said, I don't understand the upsides and downsides of the Java system, but I'm sure "Java does it this way, so this can't be good" is not a good justification.

Namespaces for the Python Package Index

Posted May 4, 2023 13:53 UTC (Thu) by lunaryorn (subscriber, #111088) [Link]

I have a déjà vu. Didn't they say the same (or, for the more "elite" part of the community, rather "If you want Haskell, then use Haskell") about PEP 484 and the whole typing thing? And look where we are today… ;)

Namespaces for the Python Package Index

Posted May 4, 2023 17:26 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

Didn't Python have UCS-2 or UTF-16 as its internal encoding for the longest time? That's a lesson Java should have taught anyone to avoid…

Namespaces for the Python Package Index

Posted May 5, 2023 2:25 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

No, it was even dumber than that. Python had "narrow" and "wide" builds, which used UCS-2 and UCS-4 respectively (i.e. the language was not aware of surrogate pairs and treated them as two characters). To a first approximation, the Windows builds were narrow and the Linux builds were (mostly) wide (and I have no idea what they did for macOS).

This was all cleaned up in Python 3. Now, strings are sequences of abstract code points, and the encoding is an internal implementation detail. If you want to use "bytes encoded in UTF-8" instead, you can easily do that, but it's just not what the language does by default.
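A short illustration of the current behavior, using a character outside the Basic Multilingual Plane (the internal representation changed with PEP 393 in Python 3.3):

```python
s = "\U0001F600"  # one code point outside the Basic Multilingual Plane

code_points = len(s)                            # length in code points
utf8_bytes = len(s.encode("utf-8"))             # size when encoded as UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2   # 16-bit units in UTF-16

print(code_points, utf8_bytes, utf16_units)  # 1 4 2
```

On an old narrow build, len() of the same string would have been 2, because the surrogate pair was counted as two characters.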

Namespaces for the Python Package Index

Posted May 4, 2023 8:45 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Domain names can and do contain the - character. How do you include that in a Java namespace?

Namespaces for the Python Package Index

Posted May 4, 2023 8:58 UTC (Thu) by lunaryorn (subscriber, #111088) [Link]

The established convention would be to use an underscore instead.
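As a sketch of that convention, the helper below (its name is made up) reverses a domain's labels and swaps hyphens for underscores. It ignores the other usual adjustments, such as labels that begin with a digit or collide with a Java keyword.

```python
def reverse_dns_package(domain: str) -> str:
    """Turn a domain name into a Java-style reverse-DNS package prefix."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(label.replace("-", "_") for label in reversed(labels))


print(reverse_dns_package("my-company.example.com"))  # com.example.my_company
```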

Namespaces for the Python Package Index

Posted May 4, 2023 10:05 UTC (Thu) by Karellen (subscriber, #67644) [Link]

> Flat is better than nested.

Don't namespaces inherently introduce at least one level of nesting? Given that a DNS-based naming system will only give 2 levels of nesting in the common case (e.g. instead of "mynamespace.mypackage" you'll often end up with "tld.mydomain.mypackage"), is it that much of a big deal?

> Who is to say whether example.com will still be owned by the same people in a year or ten?

Won't you need a process for retiring/reusing namespaces anyway, if the original owner dies? How is this different?

> There needs to be a process for PyPI to reclaim an abandoned, compromised, or maliciously-used name,

Couldn't they just decide to block an arbitrary set of domains within PyPI? So they could reject packages for the "ru.h4x0r" namespace without involving IANA?

> I shouldn't need to buy a whole domain for a hobby project

Then you can use an un-namespaced package name, as you do now. It'll still have the existing problems of collisions and squatting, but no worse than the situation currently is. Or PyPI could designate "local." as a private use namespace, mirroring the ".local" special use tld.

https://en.wikipedia.org/wiki/.local

Namespaces for the Python Package Index

Posted May 4, 2023 11:15 UTC (Thu) by ebassi (subscriber, #54855) [Link] (1 responses)

> Flat is better than nested.

30 years of Linux distributions should have thoroughly debunked this mantra; and that's a limited, heavily gated pool of components. Once you open it to a whole ecosystem and minimise the friction towards publishing, you have the perfect recipe for a disaster.

Very, very few people sign up for curating the wild west, and you can only get so far with minimal volunteer resources.

The idea of a flat namespace in which everyone plays nice and nobody squats on a name they just reserved for a rainy day project that never comes is naive at best; it's like thinking that the Internet is still the same as it was in the '90s.

> I shouldn't need to buy a whole domain for a hobby project (and I am far more likely to do a hobby project in Python than in Java)

That would have been a problem before GitHub/GitLab/source hosting services; you can use a reverse domain like com.github.yourusername.yourproject, and satisfy the requirements for a domain. If the verification process is automated, and based on something like a token held in a file accessible via a well-known URL, then you can publish it using something like GitHub or GitLab pages, which both reflect your user name as the namespace.
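A minimal sketch of that kind of automated check might look like the following. The well-known path, the token scheme, and the function names are all assumptions made for illustration; no such PyPI endpoint or convention exists as far as I know.

```python
from typing import Callable
from urllib.request import urlopen

# Hypothetical location where the index would expect to find the token.
WELL_KNOWN_PATH = "/.well-known/package-index-namespace.txt"


def fetch_https(url: str) -> str:
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def namespace_for(domain: str) -> str:
    """Reverse the labels of a domain: a.b.c becomes c.b.a."""
    return ".".join(reversed(domain.lower().split(".")))


def verify_domain(domain: str, expected_token: str,
                  fetch: Callable[[str], str] = fetch_https) -> bool:
    """Check that the domain serves the token the index handed out."""
    url = f"https://{domain}{WELL_KNOWN_PATH}"
    try:
        body = fetch(url)
    except (OSError, ValueError):  # network failure or undecodable body
        return False
    return body.strip() == expected_token


# Usage with a stand-in fetcher, so the example runs without network access.
def fake_fetch(url: str) -> str:
    return "token-1234\n"


print(namespace_for("yourusername.github.io"))
print(verify_domain("yourusername.github.io", "token-1234", fetch=fake_fetch))
```

A check like this only proves that someone can place a file on the host, which is a weaker claim than owning the domain; hosts that serve user-supplied files from a shared domain would need special handling.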

Namespaces for the Python Package Index

Posted May 4, 2023 16:48 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

Slight tangent, but even that sort of system is subject to abuse... a user on the Bluesky social media network has 'registered' s3.amazonaws.com as their domain, since they were able to place a file there and have it pass the verification check :-)


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds