|
|

PyTorch and the PyPI supply chain

By Jake Edge
January 11, 2023

The PyTorch compromise that happened right at the end of 2022 was rather ugly, but its impact was not widespread—seemingly, at least. The incident does highlight some of the perils of relying on an external "supply chain" for the components that are used to build one's software. It also would appear to be another case of "security researchers" run amok, though perhaps that part of the story is only meant to cover the tracks—or ass—of the perpetrator.

Beyond that, the incident shows that the Python Package Index (PyPI) and the pip package installer act in ways that arguably assisted the compromise. That clearly comes as a surprise to many, though those behaviors are well-known and well-established in the Python Packaging Authority (PyPA) community. There is, at minimum, a need for education on that topic.

Compromise

People (or continuous-integration bots) who installed the nightly build of the PyTorch machine-learning framework using pip between December 25 and 30 got an unwelcome surprise. A malicious dependency was installed along with it; that dependency contained a binary program that was triggered when the module was imported into a PyTorch-using code base. The binary gathers system information (e.g. name servers, host names) and the contents of various interesting files (e.g. /etc/passwd, $HOME/.ssh/*, the first 1,000 files in $HOME), then uploads that information to an attacker-controlled server using encrypted DNS queries.

In order to build PyTorch, multiple dependencies of various sorts are required. Some are regular PyPI packages that should be downloaded from that repository, while others are PyTorch-specific packages that should come from the PyTorch nightly repository. A single pip command is used to install from both PyPI and the PyTorch nightly repository, which is given on the command line; pip does not distinguish between the two repositories, treating both as equal candidates for fulfilling any given package requirement.
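For illustration, the nightly install instructions at the time used a single command roughly of this form (the exact index path varies by platform and CUDA version):

    pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu117

With --extra-index-url, PyPI is still consulted as well; the extra index is an additional candidate source, not a replacement, so a dependency name that exists in both places can be satisfied from either one.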

If there is a dependency on, say, torchtriton from some other part of PyTorch and there is a package by that name available on PyPI, pip can choose to install it instead of the same-named package in the PyTorch repository. That is exactly what happened, of course; an attacker registered the torchtriton PyPI package and uploaded a version of that code that functioned exactly the same as the original—except that it added the malicious payload that is executed when it is imported. It is unknown exactly how many sites were actually affected, but the malicious torchtriton package was downloaded from PyPI around 2,800 times, according to a lengthy analysis of the compromise by Tzachi Zorn.

Once the PyTorch project was alerted to the malware at PyPI on December 30, it took immediate steps to fix the problem. The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton; a placeholder project called pytorch-triton was registered at PyPI so that the problem could not recur. In addition, PyTorch nightly builds that referred to torchtriton as a dependency were removed from the repositories so that any cached versions of the malicious package would not be picked up inadvertently. The PyPI administrators were also alerted and they promptly removed the malicious package. On December 31, the project put out the advisory linked above.

The analysis by Zorn (and another by Ax Sharma at BleepingComputer) describes efforts by the perpetrator of the attack to explain their actions. At first, the domain used for the exfiltration DNS lookups put up a short text message [archive link] claiming that the information was gathered simply to identify the companies affected so that they could be alerted. Another, longer message was apparently sent to various outlets with similar claims, including that all of the data gathered by the malicious payload had been deleted; it can be seen in those articles. It is pretty much impossible to verify one way or the other; it could be truthful and heartfelt—or it could simply be damage control.

Dependency confusion

The type of problem being exploited here is called "dependency confusion"; the technique was popularized by Alex Birsan in 2021, but the pip bug report linked above makes it clear that the problem was known in that community back in 2020. When the --extra-index-url option for pip is used, it consults that index and adds all of the packages it provides to its internal list. When it comes time to install a package, pip chooses the one with the highest version (or highest version that satisfies any version constraints that were specified) regardless of which repository it comes from.

PEP 440 ("Version Identification and Dependency Specification") governs how pip chooses which version to install. One might think pinning a dependency to a specific version would be sufficient, but, as Dustin Ingram pointed out in a recent discussion, that is not true. pip and other installers "prefer wheels with more specific tags over less specific tags". That makes it relatively easy for an attacker to shadow even a version-pinned dependency.

As Ingram noted in another message, the way to truly pin a dependency is by specifying the hash values of the binary artifacts to be installed as described in the pip documentation. That thread is interesting in other ways, however.
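A minimal sketch of that approach (the version and hash value here are illustrative, not real):

    # requirements.txt
    torchtriton==2.0.0 \
        --hash=sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

    pip install --require-hashes -r requirements.txt

In hash-checking mode, pip refuses to install any artifact whose digest is not listed, so a same-named, same-versioned package served from another index fails to install rather than silently winning.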

It starts with a request for help in convincing the security administrators at a company to unblock PyPI. Kirk Graham ran into a problem at his company, which had wholesale blocked the repository "because there were '29 malwared malicious modules' at the site". Those modules had long been removed from PyPI but the reputation for unreliability lingered on. Brett Cannon pointed out that there are lots of other places where malicious code can sometimes be obtained:

My first question would be whether they block every project index out there (e.g., npm, crates.io, etc.), as they all have the same problem? Or what about GitHub? I mean where does the line get drawn for protecting you from potentially malicious code?

My follow-up is how do they expect you to use any open source Python code? If so, how are you supposed to get that code? Straight from the repositories? I mean I know lots of large companies that ban pulling directly from code indexes like PyPI, but then these are large companies with dedicated teams to get the source, store it internally, do their own builds of wheels, etc. If you block access to using what the projects provide you have to be up for doing all the work they provide in getting you those files.

Several in the thread pointed to various services and tools for managing dependencies of open-source components, which might help solve the problem at the company. Graham was clearly frustrated with the situation and his company, but once he found out about PyTorch, he changed his tune to a certain extent:

Over the holidays there was malicious code added to PyTorch module on PyPi. That makes me think our Security Director is right. If there isn't better security from PyPi and GitHub those sites will be blocked by more and more companies. Open Source needs to be more secure. /sigh

That is not an entirely accurate picture of what happened, which was pointed out in the thread, but the larger point still stands. To outsiders it looks like PyTorch itself was compromised on PyPI, when what actually happened is more nuanced than that.

The pip bug report came up in the thread as well. Reading through that report makes it clear that the problem does not lend itself to a simple or straightforward fix. The root of the problem is that people do not understand that using the PyPI repository is not without risks and they fail to fully evaluate what those risks are—and what they mean for their software supply chain. As Paul Moore put it when the bug was resurrected after the Birsan posting in 2021: "But I do think that we're trying to apply a technology solution to a people problem here, and that never goes well :-("

Much of what Moore and other PyPA developers have to say in the report is worth reading for those interested in the problem. So far, the most straightforward "solution" is to remove the --extra-index-url option entirely, but that has its own set of problems, as Moore noted:

There really is no "good" way of securing --extra-index-url if you look at it that way. Allow pip to look in 2 locations and you have to accept that all of your packages are now being served as securely as the least secure index. And the evidence of the "dependency confusion" article is that people simply aren't aware of that reality. So what the pip developers need to decide is whether our responsibility ends with having documented how multiple indexes work, or whether we should view the ability to have multiple indexes as an "attractive nuisance" and remove it to ensure that people aren't tempted to use it in an insecure manner.

The clamour of voices arguing "this is a security flaw", plus the sheer stress on the maintainers that would be involved in arguing that this isn't our problem, suggests that we should remove the feature. But there's no doubt that it would penalise people who use the ability correctly - and it feels wrong to be penalising those people for the sake of the group who didn't properly assess the risks.

The bug report thread was brought to life again after the PyTorch mess, naturally. Moore describes some concrete steps that could be taken to address the problem, but it still requires someone (or some organization) willing to take on that work, make a proposal, and push it through to completion. So far there has been a lot of talk about the problem, but little in the way of action to fix it.

It really should come as no surprise that grabbing random code from the internet sometimes results in less than ideal outcomes. The flipside of that is that, usually, "sometimes" is extremely rare, which in some ways leads directly to the "attractive nuisance" argument. These kinds of problems are not new and are seemingly not going away anytime soon either. Each time we have an event like this PyTorch compromise, it gives open-source software another black eye, which is perhaps not entirely fair, but also not entirely surprising.


Index entries for this article
Security/Python
Security/Supply chain
Python/Packaging
Python/Security



PyTorch and the PyPI supply chain

Posted Jan 12, 2023 1:13 UTC (Thu) by koh (subscriber, #101482) [Link] (24 responses)

To me the solution is pretty simple and nothing that hasn't been implemented elsewhere: identify the repository in the dependency specification.

Granted, I've been using Gentoo for quite a while, so to me it feels natural to say, e.g., '>=sys-libs/glibc-2.32::gentoo' in order to give the constraints
- package 'sys-libs/glibc'
- version larger or equal to 2.32
- repository called 'gentoo' (locally)

In a non-centralized setting with "--extra-index-url" there is no 'local' name/reference to a repository, but that shouldn't be a problem - at least on the technical level. The URLs are still managed centrally (for most of the internet for most of the time - that's another can of worms, though).
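The closest thing pip offers today is a PEP 508 "direct reference", which ties a requirement to an explicit location rather than to a bare name that any configured index may satisfy (a sketch; the wheel filename below is hypothetical):

    # requirements.txt
    torchtriton @ https://download.pytorch.org/whl/nightly/torchtriton-2.0.0-cp310-cp310-linux_x86_64.whl

That pins the source, but at the cost of also pinning the exact file, which is a much blunter instrument than a per-package repository qualifier.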

I keep coming back to the question why every language needs their own package manager with the usual set of problems to (a) discover and (b) solve in incompatible manners...

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 2:47 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (8 responses)

> I keep coming back to the question why every language needs their own package manager with the usual set of problems to (a) discover and (b) solve in incompatible manners...

Because the alternative is waiting for distros to repackage a hundred thousand random project repos? Add in Nix, vcpkg, chocolatey, HomeBrew, etc.

Let's say I'm working on a project. I discover that I can split a new library out of it. What do I do? I make a repo (or directory; many language package managers don't care that much) and publish it. Users can upgrade to this version just fine today. If I need to wait for…something to happen elsewhere, my tool is stuck in out-of-date versions until someone picks up the ball and adds this new package.

Sure, you could say "just use what is in your distro", but that ignores reality. People want new compilers, new development tools, etc. These end up pulling in the same things the distro wants to provide, but in newer, incompatible versions. What are you to do? Uproot your distro when Debian turns out to be too slow?

I'm all for splitting out dependencies and using system copies when possible, but I can't link my development processes with Debian (or Arch, Fedora, etc.) release cycles. I've got work to do, you know? Far better to let developers pick their own distro sandbox they like playing in and letting them do development on top of it in a convenient manner.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 3:09 UTC (Thu) by bferrell (subscriber, #624) [Link] (2 responses)

No, devs want NIH shiny toys and don't want to take five minutes to discover "we been doing that 'this' way for years"

This kewl new app will ONLY use the cutting edge version of the language... But the dev has no clue to document this.

It's beginning to make RPM dependency hell look simple

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 11:44 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

The problem is not new features, it's the bugs. Sure, there are big commonly used projects like Django which are well tested, but there's a long tail of smaller packages for things that only a much smaller group of developers use. If you're a developer using one of these packages in a slightly unusual setting you find bugs.

No problem, create a patch, send it to the developer, they merge it, push a new minor release to PyPI and you can get on with it. In my experience, a month or two is the usual turnaround time. This fits in the release cycle, we just pause the ticket till the upstream release. Telling us to "use the version shipped by the distribution" is equivalent to saying "work around this bug for the next year or two". And it's not just one bug, it's several over several different packages. Eventually, tracking which workarounds are waiting on an upstream release becomes a significant amount of work.

Besides, workarounds are annoying, this is open source, we should be fixing the upstream packages, not working around the issues elsewhere.

I know projects with the strict rule that all packages must be installed from Debian. And it *almost* works. If the packages are missing features you can simply tell the product owner it's not possible yet. But there are always a few packages where the Debian release is simply buggy, in such a corner case that it basically only affects you, so it's not going to be updated there (because upstream has fixed it in a new version, and Debian isn't going to bump the version). So you end up making an exception for just this handful of packages (basically using py2deb). And hope it doesn't grow to too many.

The step from there to "just pull everything from PyPI with version/hash pinning" is very, very small.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 13:53 UTC (Thu) by farnz (subscriber, #17727) [Link]

And don't forget that predicting what a distro will have when you release your new version is hard. If you're going to release in April 2023, and something you depend upon has a necessary update released in December 2022, is that update going to be in the current stable Debian release when you make your release?

If you guess wrong, you end up in one of two sub-optimal situations:

  1. If you assume the update won't be submitted by the Debian Developers in time to make it into Debian 12 (Bookworm), or that the release will slip into May, and then the update makes it in anyway, you're carrying sub-optimal code to handle the pre-update version of your dependency.
  2. If you assume it will make it in, and then something causes Debian to have to delay the release, or if the update can't be put into Bookworm before the freeze kicks in without breaking something critical, your project now doesn't run on Debian stable on the day of release, because you're waiting for an update.

Bundling from a vendor source neatly sidesteps this - if Bookworm has the dependency version you need, then unbundling can be done then, while if it doesn't, no problem, you've got the vendored code. And then language repositories like PyPI make it simpler, because they're already working in terms of a dependency tree, not copied code, so you can look and go "aha, when I build the Debian package, I can unbundle libfoo, since Debian has the right version of libfoo already".

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 15:37 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (4 responses)

I don't know if waiting for the distributions to package everything is the right solution. I'm inclined to believe developers who say it's unlikely to work for them, if only because of the problem you highlight of there being too many libraries to package. I also believe, though, that the current solution of grabbing whatever is out there, trusting it's fine, and then acting surprised by supply chain attacks is also not working. It's just resulting in occasional spectacular failures rather than regular, boring unavailability of bug fixes and product enhancements.

What is needed is a system that provides some kind of real quality control, so developers can have confidence that the libraries they're using are what it says on the tin. This has the unfortunate side effect of slowing everything down for the QC step, but the alternative is occasionally getting pwned when attackers finally decide your system is worth attacking. Pretending everything is fine in an attempt to go as fast as possible is demonstrably not working.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 17:22 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

We need to be able to have quality control somewhere; the question is whether PyPI is the organisation to do it. What I'm thinking of is a kind of meta-index where different organisations can give a "trust rating" or something, which I could then filter on in pip.

For example, it would be possible for someone to write a bot that checked if the contents of the wheels distributed by PyPI match the source in the indicated repository. The problem is there is nowhere to place this information in a way that is of any use. Or, it would be nice to straight up reject any package that has existed for less than 3 months. Or being able to namespace dependencies to ensure they come from the right repository.
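The "reject anything too new" idea is easy to prototype against PyPI's JSON API; a rough sketch (the 90-day threshold is arbitrary, and a real tool would need caching and error handling):

    # Sketch: flag PyPI projects whose very first upload is younger than N days.
    import datetime
    import json
    import urllib.request

    def first_upload(project):
        """Return the earliest upload time across all releases, or None."""
        url = f"https://pypi.org/pypi/{project}/json"
        with urllib.request.urlopen(url) as resp:
            releases = json.load(resp)["releases"]
        times = [
            datetime.datetime.fromisoformat(
                f["upload_time_iso_8601"].replace("Z", "+00:00"))
            for files in releases.values()
            for f in files
        ]
        return min(times, default=None)

    def too_new(project, min_age_days=90):
        first = first_upload(project)
        if first is None:
            return True  # nothing ever uploaded: treat as suspect
        age = datetime.datetime.now(datetime.timezone.utc) - first
        return age.days < min_age_days

Something along these lines could run as a pre-install gate in CI, but the harder problem, as noted above, is having a shared, machine-readable place to publish such judgments.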

Python is paying here for the early decision not to set a packaging/repository standard and to let the community create one organically. It's biting back hard now. More recent languages did not repeat that mistake.

PS. Don't talk to me about solutions like Nexus which try to solve the problem on the client-side but don't really have any extra information to work with and so just end up adding an extra layer of frustration. Until the necessary information is available in machine readable form no client-side tooling can help.

PyTorch and the PyPI supply chain

Posted Jan 13, 2023 4:01 UTC (Fri) by pabs (subscriber, #43278) [Link]

I feel like distributed code review ala crev, along with reproducible and bootstrappable builds is the right model here.

https://github.com/crev-dev/
https://reproducible-builds.org/
https://bootstrappable.org/

PyTorch and the PyPI supply chain

Posted Jan 18, 2023 1:15 UTC (Wed) by hazmat (subscriber, #668) [Link] (1 responses)

re distros, too much plurality and too much complexity to get across the swathe when viewed in aggregate for a developer/publisher of a python dependency, ie language tools exist for a reason. i'd be willing to settle for simple signature upload and verification (sans gpg, say cosign) on the language repos.. but there has been a slow moving effort to try and make things better albeit glacial speeds.. https://blogs.vmware.com/opensource/2020/09/17/tuf-pypi-i... updates are scattered across internet/github issues / tulip chat logs. i do wonder if some of the ossf projects will help on the broader ecosystem. its unclear if we're not just creating another checkbox for enterprises sometimes vs actually moving the needle.

PyTorch and the PyPI supply chain

Posted Jan 18, 2023 15:34 UTC (Wed) by hazmat (subscriber, #668) [Link]

found the best link for current state at
https://github.com/pypi/warehouse/issues/10672

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 3:00 UTC (Thu) by mpr22 (subscriber, #60784) [Link] (5 responses)

> I keep coming back to the question why every language needs their own package manager with the usual set of problems to (a) discover and (b) solve in incompatible manners...

I am less than half joking when I say "blame Perl".

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 3:14 UTC (Thu) by bferrell (subscriber, #624) [Link] (1 responses)

Until VERY recently if you wrote PERL code that did this, the dev community would fetch piles of wood and burn you to the ground.

A few years back a VERY common module got re-written and made major changes to the behavior of the code... With no documentation. They just thought it was a "good idea (tm)".

Post hasty, that got changed and while the new behavior WAS a good idea and kept, it became a "turn it on with a variable if you want it" vs "here, let me shove this down your throat".

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 14:36 UTC (Thu) by smoogen (subscriber, #97) [Link]

I wonder if that is because the Perl community learned the lessons the hard way. I spent way too much time in the late 90's and early 00's fixing 1 am outages to undo some developer's 'grab the latest from CPAN' which 'fixed' whatever bug they had but added 200 new ones in a myriad of dependencies (or my favorite.. why is the perl on each web server or application different? Oh because each team of dev's did a CPAN update and compiled a new version as part of that..) Things became more stable after that... but I also was dealing with perl less and less because various web devs I worked with were finding it 'too stodgy' and moving to Ruby, Python and then Node because speed in module changes were there.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 15:03 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

In Perl's defense, the modern distribution system didn't exist yet when CPAN started. CPAN was built at more or less the same time as modern distribution packaging systems, so waiting for packages to go up on the distribution wasn't a serious option. Even if people had been willing to wait a few years for that system to develop, nobody knew that it was going to develop, or even that Linux was going to win the Unix wars, so some kind of homebrew packaging system was necessary.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 17:01 UTC (Thu) by Sesse (subscriber, #53779) [Link] (1 responses)

Which is, ironically, one of the language package managers that is the least painful to integrate into distributions, and one of the languages with the best track record of backwards compatibility in modules.

PyTorch and the PyPI supply chain

Posted Jan 22, 2023 9:42 UTC (Sun) by oldtomas (guest, #72579) [Link]

Perhaps ironical, but not surprising.

I think Perl's growth happened at a time where "fitting in an environment" was the obvious thing to do. One data point? POD has as one of its main targets man pages.

Python (re- [1]) started a trend which I'll call "language monotheism", where each language had (or thought it had) to fight for absolute dominance. I think this might be something for computer sociologists to study some day.

[1] Not the first round, mind you. Older people might remember C vs Pascal, quiche eaters and that. Of course, nowadays, in the era of overabundance, survival and money are more at stake than back then.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 3:05 UTC (Thu) by flussence (guest, #85566) [Link]

Gentoo has also recently started adding the ability to verify upstream tarballs against upstream public keys and signatures, which is better than nothing, but is pointedly next to nothing. If the package in question gets its code from a git repo with signed commits, there's nothing to check that. If the package *itself* lives in a signed git repo, you can't reuse the download-checking key management (it actively fights you if you do) and have to figure out through trial and error how to manually fetch and manage GPG keys for a non-root system account.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 8:45 UTC (Thu) by ms (subscriber, #41272) [Link] (4 responses)

> To me the solution is pretty simple and nothing that hasn't been implemented elsewhere: identify the repository in the dependency specification.

I think that certainly helps. But there have also been lots of examples of submitting PRs that get malicious code into repos; along with social engineering to take over code repos; and e.g. established Chrome extensions being sold to a new owner who then injects malicious code. In these cases, the name of the repository hasn't changed.

Another thing that helps is getting away from this mantra of "always fetch the latest version that satisfies your semver constraints". If you take the Go approach of _minimal_ version rather than maximal, then the blast radius is much reduced: it is no longer sufficient to release a new compromised version - that on its own will not get picked up. You would also have to modify the deps of a repo that imports that, and of that, and so on, all the way up to the top.
https://research.swtch.com/vgo-mvs (the section on "Upgrade Timing" is most relevant here).
I'm certainly not claiming Go is the only language to do this; it is simply the one with which I'm most familiar.

What I absolutely detest is the attitude that "this is a people problem, we shouldn't try to solve it with technical means". Correct - you won't be able to _solve_ it. But that's not the point. The point is to reduce the probability of these farcical messes from occurring. And there is plenty of prior art out there that helps. Refusing to learn from that is just sticking your head in the sand.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 12:35 UTC (Thu) by khim (subscriber, #9252) [Link]

> You would also have to modify the deps of a repo that imports that, and of that, and so on, all the way up to the top.

Sounds like cultivation of Log4Shells instead of “dependency confusion”.

But yeah, that definitely fits well into the "simple non-solutions" scheme Go practices.

PyTorch and the PyPI supply chain

Posted Jan 14, 2023 19:42 UTC (Sat) by KJ7RRV (subscriber, #153595) [Link] (2 responses)

Doesn't using the minimum version instead of maximum result in not receiving security updates for dependencies until the depending package is also updated? That seems like a much worse outcome for security, especially considering dependencies of dependencies, etc. Or does Go have a way of specifying that an update is a security update?

PyTorch and the PyPI supply chain

Posted Jan 15, 2023 11:14 UTC (Sun) by farnz (subscriber, #17727) [Link] (1 responses)

There cannot be a way to specify that an update is a security update without losing any gains from the "minimal version" route; there is no way to distinguish "malicious actor flags version with back door as security update" from "good actor flags version removing back door as security update".

As with so many things, it all comes down to trust. If you trust upstream to release good updates, you want to take their latest code. If you don't trust upstream, you should be locking exact versions, and reviewing every new release upstream manually before you bring it in (which, in turn, has to be your top priority in case the fixes are security relevant to your code).

PyTorch and the PyPI supply chain

Posted Jan 15, 2023 12:18 UTC (Sun) by ms (subscriber, #41272) [Link]

Exactly. There is tooling to help find vulnerabilities, but yes, the basic premise is that you the developer must explicitly give permission for some dependency (even transitively) to be updated. Doing anything less gated than this is really just giving other developers permission to execute arbitrary code on your machine.

Both of these are relevant:
https://go.dev/blog/vuln
https://go.dev/blog/supply-chain

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 12:19 UTC (Thu) by khim (subscriber, #9252) [Link] (2 responses)

> I keep coming back to the question why every language needs their own package manager with the usual set of problems to (a) discover and (b) solve in incompatible manners...

Because Gentoo is not macOS or Windows, basically.

Newbies to programming will, inevitably, use one of these two. And if your language doesn't support them well, then its chances of being used in place of a more popular alternative are almost nil.

And if you have something that works for beginners… people continue to use it for other things, because why not?

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 20:14 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

I started using gentoo precisely because I needed features that were not available in the latest version of my then distro (SUSE). I could have used the project's own install setup, but I would rather use the distro, so I changed to one that made it easy.

However, I wouldn't recommend gentoo to newbies ... (unless, of course, they want to do things the hard way :-)

Cheers,
Wol

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 21:19 UTC (Thu) by khim (subscriber, #9252) [Link]

The problem of distros is not, ultimately, technical; it's social.

Most distro makers drank too much FSF kool aid and now believe others want to create sources for others to use.

Nothing could be further from the truth!

Neither users nor developers are interested in the software for the software's sake.

Their goal is to produce a binary and to either give it away or use it.

That's why disconnect is so deep.

In a world where creation of software source is the goal, you have to support various versions of dependencies (because this increases the usability of your sources); then, on top of that, you can afford "curated repos"; then, on top of that, you can provide "long term support" and all these other things.

In a world where software source only exists because it's not very convenient to write in machine code directly, the situation is radically different: developers assume that they will decide what dependencies they use and what targets they support, and users decide what version of the application they will download and use.

Given the insane disconnect between expectations, it's no wonder no one is happy.

Gentoo, NixOS and other such distros support that mode, but they make the assumption that this desire to control everything goes all the way down to the core… but most developers and users don't go that far: they are happy to use (or too scared to replace) the OS that the hardware maker gives them; they only want to control things on top of that.

Maybe if Gentoo or Nix supported macOS and Windows this would have been an acceptable compromise, but alas, they don't do that (at least they don't make it easy enough for a newbie to use), thus we have no alternative to per-language package managers.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 6:55 UTC (Thu) by bof (subscriber, #110741) [Link] (10 responses)

What I do not understand, from a sysadmin perspective used to the distro model, is this:

All these language package repo things run wide open, with everyone uploading stuff. With the obvious downsides. So why aren't there trusted language "distros", with trusted groups of maintainers curating that into trustable, separate repos meant for the "consumers" out there? And why the frell does everybody consuming the packaging accept that as God-given (adding in a snide remark about the Dino distros)?

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 7:12 UTC (Thu) by maniax (subscriber, #4509) [Link] (3 responses)

I have pretty much the same question... In my case, anything that's not maintained by the Linux distribution (which I tend to trust) is installed from external sources only if really really really needed, and mostly set to a very specific version with the idea that as soon as it reaches the distro, it'll be updated. OR, a copy is maintained internally in separate, internal repositories.

And this is not only a security question. Stuff "out there" is usually too bleeding edge to be reliable enough, and just fetching "the latest and greatest" is bound to break stuff.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 8:28 UTC (Thu) by taladar (subscriber, #68407) [Link] (2 responses)

But stuff in the distros is just as bleeding edge: bleeding-edge backports, developed by a person who is comparatively much less familiar with the code base and tested by virtually nobody, usually with a lying version number on top of it.

Stability in a changing world is an illusion or in many cases even a deception sold to the gullible companies who desire it but don't understand how fast the world really moves in terms of software compatibility with the rest of the world (both in terms of protocols, data formats,... and in terms of legal and regulatory frameworks,...) and security issues.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 8:58 UTC (Thu) by ms (subscriber, #41272) [Link]

Exactly this. I choose to use NixOS, both at home, at work, and on some of my servers. I have zero belief anyone in that project is reviewing upstream code changes. The prevailing attitude is very much "if it compiles, that'll do". Tbqh, I wouldn't be at all surprised if that's pretty much the same right across most distros, with the exception of some of the bigger commercial distros. And even then, I fully expect focus would be on the most critical packages - the kernel, libc, security libraries, xorg/wayland, mutt... - for obvious economic reasons.

I think everything really does just boil down to "you just have to trust other people". Yep, checksums, and version numbers, and all that goodness is great for verifying things don't change that you don't want to. I wouldn't want to be without that. But when I'm looking for a library to solve a particular problem, I look at the number of stars and forks, the rate of commits and who they're from, and the issue tracker, and that's my starting point for establishing trust. And I think it's a good thing: a society where the default behaviour is not to trust, not to give the benefit of the doubt, not to assume good, is not worth having.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 10:12 UTC (Thu) by ballombe (subscriber, #9523) [Link]

> But stuff in the distros is just as bleeding edge, just bleeding edge backports that a person who is comparatively much less familiar with the code base developed and virtually nobody tested, usually with a lying version number on top of it.

The code does not run in a vacuum. Distributions are much more familiar with the environment where the code will run, and most distribution developers are also part of upstream. They also tend to have a more user-aligned view than upstream; user-hostile upstreams do exist.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 8:22 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

Well I guess because nobody is doing this job, and because developers always want to use today's new thing, so they think that waiting for vetting is a waste of time.

And since it generally doesn't end up in malware being downloaded, the current system is good enough… until the next malware happens.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 12:45 UTC (Thu) by khim (subscriber, #9252) [Link]

> So why isn't there trusted language "distros" with trusted groups of maintainers curating that into trustable, separate repos meant for the "consumers" out there?

Because there are no “consumers”?

Developers want two things which obviously cannot be satisfied simultaneously:

  1. They want to be able to quickly get updates and bugfixes.
  2. They want to be able to be sure there is no malicious code.

Distributions solve problem #2 well but entirely fail to handle #1. Language repos and AppStores solve #1 well, but suck at #2.

Since half a loaf is better than no loaf developers stick to what solves one problem and can half-ass the 2nd one rather than use something that fails entirely to solve half of the problem.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 13:24 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

> So why isn't there trusted language "distros" with trusted groups of maintainers curating that into trustable, separate repos meant for the "consumers" out there?

Haskell has Stackage.

https://www.stackage.org/

https://www.stackage.org/package/stackage

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 15:05 UTC (Thu) by kpfleming (subscriber, #23250) [Link] (1 responses)

Similar things exist... for example the Conda repositories are a curated set of Python (and non-Python) packages which are tested together. It is extremely time-consuming (and thus expensive) to do this sort of thing, and when subjective decisions have to be made about which version of a package to include it gets even more difficult.

Said 'consumers' will need to be willing to compensate the people who do this work; it's definitely not an effort which can be funded with volunteer time (and we can already see how that works in other areas).

PyTorch and the PyPI supply chain

Posted Jan 13, 2023 6:02 UTC (Fri) by bof (subscriber, #110741) [Link]

> Said 'consumers' will need to be willing to compensate the people who do this work; it's definitely not an effort which can be funded with volunteer time (and we can already see how that works in other areas).

Absolutely. Where you have distributions now with significant parts of the important packages somewhat current in their latest and/or rolling releases, they are backed by enough manpower to have dedicated paid people take care of a certain subject area. And they have built their base of enterprise customer subscriptions to fund that in a sustainable fashion.

Seeing Python at the top of the yearly language popularity lists, I feel that something like that should work in the dynamic languages field, too.

So, Conda, right? Is it the only "player" right now doing something like that?

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 22:06 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> So why isn't there trusted language "distros"

Because that's a lot of work. There are companies that are selling this as a service, but they all kinda suck.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 11:12 UTC (Thu) by summentier (guest, #100638) [Link] (1 responses)

Package management and build automation is hard. I get that, I do.

But trying to coerce setuptools to do what you want it to do is not fun. Its code is an undocumented mess, its abstractions are leaky and incoherent, and its architecture is like a Jenga tower resting on top of a pile of Mikado sticks. Look at the nontrivial setup scripts bundled with projects such as numpy or tensorflow: they resemble ancient incantations much more than actual code.

So I do not envy pip's job. But much of what ails setuptools also seems to have infected pip: its documentation is ... terse, to say the very least, its code isn't great either, and it does like to act and fail in ever-surprising ways. Moreover, coming from Rust or Julia, it is very hard to be satisfied with the hodgepodge of virtualenvs one has to set up in case of dependency conflicts. So, respectfully, it seems in character that pip does something sub-optimal and I think a doc fix is not going to fix those deep structural issues. (Anaconda, while certainly well-intentioned, tends to make everything worse, particularly on supercomputers.)

I understand that pip is in a tight spot now with respect to backwards compatibility.
Hopefully new projects (such as poetry) will improve this, I have to say, rather sorry state.

PyTorch and the PyPI supply chain

Posted Jan 12, 2023 17:10 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

> But trying to coerce setuptools to do what you want it to do is not fun. Its code is an undocumented mess, its abstractions are leaky and incoherent, and its architecture like a Jenga tower resting on top of a pile of Mikado sticks. Look at nontrivial setup scripts bundled (e.g., for project such as numpy or tensorflow): they resemble ancient incantations much more than actual code.

I think at one point, NumPy's additions to setuptools were on the same order of size as setuptools itself. SciPy probably didn't make things any easier.

I will agree about the undocumented mess 100% though. Figuring out what could go into some fields (globs, symlink traversal, etc.) involved tracing the value(s) through the code to where they hit some active API that actually used them. The duck typing helps with being able to get things done by abusing things like `../` traversal to grab things, but really hinders with making anyone aware of what is possible (and what of that is actually intended).

PyTorch and the PyPI supply chain

Posted Jan 14, 2023 9:29 UTC (Sat) by cyperpunks (subscriber, #39406) [Link] (5 responses)

The amount of work and review needed for a module to be part of the Python Standard Library seems to be very high; however, to publish something to PyPI the requirements are more or less trivial.

Maybe a path forward is to split PyPI into a curated/blessed part and a free-for-all section? The blessed part would move somewhat slower, but still much faster than the standard library.

PyTorch and the PyPI supply chain

Posted Jan 14, 2023 11:11 UTC (Sat) by amacater (subscriber, #790) [Link]

Split PyPI into two - a curated part and a non-curated part - and you might as well have distributions doing a competent job of curation on a subset of packages. There's a reason that some of us have used distributions for >25 years and look on non-curated software with amusement and horror.

And yes - I'm frankly amazed how many language / package distribution mechanisms for various operating systems have effectively reimplemented apt poorly.

PyTorch and the PyPI supply chain

Posted Jan 16, 2023 0:24 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

Maybe this is unfair, but the impression I've gotten from the discussions on here is that anything less than full speed ahead will upset a lot of developers. At the very least, each developer has their own idea about how much delay for quality control is acceptable, and any delay at all will upset some people. Whatever choice you make will leave some people unhappy that there's too much delay and others unhappy that there isn't enough QC.

PyTorch and the PyPI supply chain

Posted Jan 16, 2023 0:58 UTC (Mon) by Wol (subscriber, #4433) [Link]

This is where you NEED your Benevolent Dictator For Life.

Sod full speed ahead. Sod quality control. Just take a step back. Think about what you're doing. INVEST TIME IN DESIGN. Then you *won't* *need* so much quality control. Then "full speed ahead" will feel like a tortoise (and won't do a Torrey Canyon). Then you'll end up with twice the quality in half the time.

The problem is that, without someone who has the power to knock heads together, having a sensible design discussion can be incredibly difficult. It just takes a couple of people who think their needs are the greatest, and are determined to make their voices heard over everyone else, and things will implode.

Cheers,
Wol

PyTorch and the PyPI supply chain

Posted Jan 16, 2023 10:05 UTC (Mon) by kleptog (subscriber, #1183) [Link] (1 responses)

> Maybe this is unfair, but the impression I've gotten from the discussions on here is that anything less than full speed ahead will upset a lot of developers.

I read that here a lot too, but I've yet to meet such a developer in real life. Sure, you have junior developers that wonder what the point is. When they've spent a week trying to untangle dependencies to get the buildbot to pass again they suddenly appreciate the virtue of pinning versions.

Untangling package dependencies to find a working combination is one of the least interesting jobs there is.

PyTorch and the PyPI supply chain

Posted Jan 16, 2023 11:37 UTC (Mon) by farnz (subscriber, #17727) [Link]

The thing that appears as "full speed ahead" is not that all developers want to be on the latest version of everything, but that the combined effect of all developers wanting their pet dependency to be on the latest version (which adds a feature they need, or a bugfix that affects their product's security) is "full speed ahead".

Basically, anything other than "we only accept dependencies in the oldest distribution release in extended support" (RHEL6, for example) ends up looking like "full speed ahead" in discussions, because no matter how carefully you consider your update plans, there will be someone who perceives your decision to update a minimum supported dependency version as "moving too fast".

PyTorch and the PyPI supply chain

Posted Jan 26, 2023 17:27 UTC (Thu) by irvingleonard (guest, #156786) [Link] (1 responses)

Why is everyone blaming all kinds of ancillary stuff? The blame lies with the PyTorch team. You're using pip, which uses PyPI by default, hence you have three options:
1. You could use your private index by disabling PyPI altogether and providing every possible dependency. It would work for your package but break every other one out there.
2. You could embrace PyPI and not use a private index. This might not be feasible for political (or technical?) reasons.
3. You could use them both at the same time, which is what they ended up doing.

Now, the problem is that if you use PyPI you have to play by its rules. Package names are an asset on PyPI: the first one to claim a name owns it. They obviously didn't read that memo and got bitten by it. The "solution" was as simple as publishing a dummy package on PyPI, with a very low version number, for every "private" package that only lived in their private index. That dummy package could be a simple readme with instructions on how to reach the private index, and with that they would have prevented the hijacking.

Am I wrong in this analysis?

PyTorch and the PyPI supply chain

Posted Sep 11, 2023 19:49 UTC (Mon) by snnn (guest, #155862) [Link]

I know why PyTorch is in a separate index:

1. Their packages are huge. One file can be 1-2 GB. But PyPI is free, and it cannot be so generous as to provide that much free storage for every PyPI project.
2. You may build PyTorch with different build configs, for example different CUDA versions. The PyTorch community wants to keep all of them under the same name: pytorch. Otherwise it would be harder for other packages to set up a dependency on PyTorch. Therefore, PyTorch adopted two different approaches: 1. put all of them in the same index and use local versions to distinguish them; 2. put each of them into a different index. However, neither approach is supported by PyPI.

This problem is very general. Almost all machine-learning packages with GPU-acceleration capabilities need to deal with it. I believe every non-casual user should set up their own private PyPI index. Even if the original problem is fixed, as long as you have multiple indexes you are still at risk. You can think about the problem in a different way: how much can I trust Facebook's PyPI index servers? What if someone puts a fake "wheel" package in PyTorch's index? Don't assume that no Facebook employee's account can be hacked, if you remember that last year Nvidia lost their GPG key.
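For context, the "local version" mechanism referred to here is PEP 440's "+local" suffix, which PyPI does not accept on uploaded files; a hypothetical install line against PyTorch's own index looks something like:

    pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

which is why such variant builds have to live on a separate index rather than on PyPI itself.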


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds