PyPI was subpoenaed

[Posted May 24, 2023 by jake]

It is, it seems, a week of Python Package Index (PyPI) news. On the PyPI blog, Director of Infrastructure at the Python Software Foundation (PSF), Ee Durbin, has posted an admirably detailed description of the organization's response to three subpoenas it received for PyPI user information in March and April. The requests for information were quite broad and the PSF did produce the requested material (to the extent possible), which involved five PyPI user accounts, under the advice of counsel.

> PyPI and the PSF are committed to the freedom, security, and privacy of our users. This process has offered time to revisit our current data and privacy standards, which are minimal, to ensure they take into account the varied interests of the Python community. Though we collect very little personal data from PyPI users, any unnecessarily held data are still subject to these kinds of requests in addition to the baseline risk of data compromise via malice or operator error.
>
> As a result we are currently developing new data retention and disclosure policies. These policies will relate to our procedures for future government data requests, how and for what duration we store personally identifiable information such as user access records, and policies that make these explicit for our users and community.

The post goes on to detail exactly which fields in the database tables were used to fulfill the request (without identifying the targets, naturally). Meanwhile, another statement in the post leaves open the possibility that further subpoenas have been received since that time:

> We have waited for the string of subpoenas to subside, though we were committed from the beginning to write and publish this post as a matter of transparency, and as allowed by the lack of a non-disclosure order associated with the subpoenas received in March and April 2023.

PyPI was subpoenaed

Posted May 24, 2023 21:28 UTC (Wed) by bredelings (subscriber, #53082) [Link] (5 responses)

Could be there were other subpoenas which they were not allowed to disclose.

PyPI was subpoenaed

Posted May 24, 2023 21:32 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link] (2 responses)

> Could be there were other subpoenas which they were not allowed to disclose.

Nobody who can answer the question will be allowed to answer the question in that case.

PyPI was subpoenaed

Posted May 25, 2023 11:14 UTC (Thu) by hkario (subscriber, #94864) [Link] (1 responses)

But they are able to say that they didn't when they haven't.

PyPI was subpoenaed

Posted May 25, 2023 11:49 UTC (Thu) by HenrikH (subscriber, #31152) [Link]

Yes, but not once they have; at that point that is no longer an option.

PyPI was subpoenaed

Posted May 24, 2023 22:53 UTC (Wed) by carenas (guest, #46541) [Link]

Considering the wording, timing and the fact that it was likely reviewed by legal counsel, I would say that the answer to that is obvious, but IANAL.

PyPI was subpoenaed

Posted May 25, 2023 1:53 UTC (Thu) by njs (subscriber, #40338) [Link]

That's true of literally everyone at all times, though.

Everything PyPI has should be public

Posted May 25, 2023 2:09 UTC (Thu) by geofft (subscriber, #59789) [Link] (16 responses)

Most of this information - like what projects a user was uploading - really should be public anyway, and I appreciate that they detailed in the post that most of this information was indeed public. (So, I suppose, this subpoena is almost certainly "good" - the data they got is much more useful for going after someone uploading malware to PyPI than for violating someone's civil liberties.)

The IP addresses are the hard part, but I would argue that even PyPI should not have access to them, and it's actually really weird that we designed our internet around a system where you get an identifier that's long-term stable, shared across services, and pretty closely linked with your geographic location, as a matter of course. For all that my browser does to block third-party cookies and for all that I do to separate my activity (thanks, Firefox container tabs!), I live alone and my modem doesn't reboot that often, so it is pretty trivial for (say) LWN, PyPI, Google, and the IRS to cross-reference their records of my web browsing activity, if they wanted to. Sure, I could use a VPN, but in practice most people don't.

For certain types of traffic (e.g., some types of gaming and maybe video chat) it's actually important to have direct connections. For PyPI, not so much. We should really have shared infrastructure where PyPI can send you signed content - and you can send PyPI signed uploads, too - but all anyone knows is the next-hop address, akin to how email handles it. When I'm doing uploads, PyPI needs to know who I am, but not where I'm coming from. And when I'm doing a pip install, PyPI really doesn't need to know that I even exist. (When I'm doing a pip install at work, they in fact don't know I exist, because we have an internal mirror. I'm indistinguishable from my coworkers, and they only get one hit, no matter how many pip installs happen on our internal network.) I need to know that the content is signed by PyPI, but that's it.

This is one of the nice side effects of the public mirror architecture, which most older Linux distros still retain. You can usually find a mirror that's in your jurisdiction, so your IP address is at least protected by the strength of your own laws. But most mirrors rely on uncool technology (like PGP signatures) and I think there's a lot less enthusiasm for running them than there was a couple of decades ago. Should we try to make mirrors cool again?

Everything PyPI has should be public

Posted May 25, 2023 6:52 UTC (Thu) by paulbarker (subscriber, #95785) [Link]

I think the MicroMirror project is definitely making mirrors cool: https://blog.thelifeofkenneth.com/2023/05/building-micro-...

Everything PyPI has should be public

Posted May 25, 2023 8:16 UTC (Thu) by excors (subscriber, #95769) [Link]

> For certain types of traffic (e.g., some types of gaming and maybe video chat) it's actually important to have direct connections.

I don't think that's true for gaming: players will launch DoS attacks against any other player or server whose IP address they know, to gain a competitive advantage or to harass somebody or simply to be annoying in general, so it's much better if you can hide the IP addresses. There are services like the Steam Datagram Relay (https://partner.steamgames.com/doc/features/multiplayer/s...), where every peer connects to a nearby Steam relay server and can then send traffic to other peers via some persistent user identifier (the SteamID), which gets routed over Steam's private network.

That network can be optimised for latency, rather than bandwidth or transit cost, so it often gives better performance than a direct IP connection between peers. I presume there's some built-in DoS protection - every peer is an authenticated Steam user so it's much easier to rate-limit traffic and to detect and block abuse, and the worst you can do is DoS the relay server itself over IP, which doesn't really matter since peers can just switch to another relay server. It also avoids the pain of port forwarding / NAT hole punching.

There are similar services from Microsoft, Unity, etc. It would be nice if this didn't depend on the small number of gaming companies that are large enough to have a global presence for hosting relay servers, but it seems this is the best we can do on the current internet. If you were designing a new internet from scratch, I hope you'd have better support for this.

Everything PyPI has should be public

Posted May 25, 2023 10:27 UTC (Thu) by Karellen (subscriber, #67644) [Link] (4 responses)

> it's actually really weird that we designed our internet around a system where you get an identifier that's long-term stable, shared across services, and pretty closely linked with your geographic location, as a matter of course.

How do you expect to connect to another system if its identifiers aren't stable?

The thing is, the internet wasn't designed with "Client/Server" as its majority use-case; it was designed to be "peer-to-peer", with all users/devices being equivalent at the network level. Sure, some devices might act as servers in some situations, with clients connecting to them, but the same two devices might switch roles in other situations.

If you want to connect to your friend's computer to send them a message, how are you going to do that if their computer doesn't have a stable identifier? And how will they connect to yours if yours doesn't?

I think it's more disappointing that we've ended up with software architectures that force us through hugely powerful intermediaries in a way that gives them these insights, just to connect to each other, instead of using the network architecture to connect to each other directly.

Everything PyPI has should be public

Posted May 25, 2023 11:21 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

Ideally, every device would have multiple identifiers - a long-term stable one for incoming connections, and one or more short-term stable ones for outgoing connections. The device should know about all of its identifiers, and software can choose which ones to share with remote sites - thus a connection to a remote web server might use a short term address that'll be replaced in 15 minutes, while a protocol like a VoIP protocol would have a way for me to change addresses during a long session, so that I can move from short-term identifier to short-term identifier. If I want to, I can give you my long-term stable identifier, and you can then find me again whenever you so desire.

Everything PyPI has should be public

Posted May 25, 2023 12:53 UTC (Thu) by atnot (guest, #124910) [Link] (1 responses)

I don't know whether this was intentional, but this is in fact almost exactly how the standard setup of IPv6 with privacy extensions works.

Everything PyPI has should be public

Posted May 25, 2023 13:33 UTC (Thu) by farnz (subscriber, #17727) [Link]

It was intentional, although IPv6 privacy extensions can't go far enough, because it'd be ideal to rotate not just the lower 64 bits of my identifier, but instead the whole thing, so that knowing that you had a connection from 2001:db8:1:2:3:4:5:6 doesn't tell you anything about who I am, not even roughly where on the planet I am.

This is, of course, impractical, because if you don't know where I am, how do you route back to me via the most efficient route? That said, if you care that much about privacy, you'd be using something like Tor to obfuscate your location completely.
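
A minimal sketch of the limitation described above, assuming a host that simply picks a random 64-bit interface identifier inside a fixed /64. The function name `temporary_address` is hypothetical, and real temporary-address generation has further rules (reserved identifiers, lifetimes) that are ignored here. The point is only that the upper 64 bits never change, so the prefix still identifies the network:

```python
import ipaddress
import secrets


def temporary_address(prefix: str) -> ipaddress.IPv6Address:
    """Return an address in `prefix` (a /64) with a random interface ID.

    Hypothetical helper for illustration; not how any real stack is coded.
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    interface_id = secrets.randbits(64)  # the lower 64 bits rotate
    return network[interface_id]         # the upper 64 bits stay fixed


first = temporary_address("2001:db8:1:2::/64")
second = temporary_address("2001:db8:1:2::/64")

# Both addresses share the same routing prefix, which is what still
# reveals roughly where the host is.
same_prefix = (int(first) >> 64) == (int(second) >> 64)
print(first, second, same_prefix)
```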

Everything PyPI has should be public

Posted May 25, 2023 19:04 UTC (Thu) by NightMonkey (subscriber, #23051) [Link]

The context within which TCP/IP and the first routers were created is completely different than the environment today. It was designed for the use of academic and military scientists to exchange data among themselves and U.S. Federal Government grantmakers.

It wasn't designed to stop sharing of information; it was designed to increase sharing of information among its users, and even provide algorithmic evasions around impediments to that sharing. Opening it to the public, private institutions and business, and corporate stewardship came much later... and the kludge pile to prevent what the Internet intended has spawned several different industries and disciplines.

Not sure it was the right move. ;)

Everything PyPI has should be public

Posted May 25, 2023 15:59 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

> We should really have shared infrastructure where PyPI can send you signed content - and you can send PyPI signed uploads, too - but all anyone knows is the next-hop address, akin to how email handles it. When I'm doing uploads, PyPI needs to know who I am, but not where I'm coming from. And when I'm doing a pip install, PyPI really doesn't need to know that I even exist.

Ironically, this is almost precisely what ISP proxies did. Although its purpose was to save on bandwidth, as a side effect it made everybody from the same ISP indistinguishable at that level. Of course, once you add in user agents it works much less well.

It probably was phased out because it has all the same issues as CGNAT but with more resources.

Everything PyPI has should be public

Posted May 25, 2023 16:14 UTC (Thu) by farnz (subscriber, #17727) [Link]

It also didn't interact well with the design of HTTPS. The ISP proxy was a deliberate MitM, and there's no good way in HTTPS for a proxy to do anything more sophisticated than pass the stream through. Once you're simply relaying HTTPS to the origins, the proxy becomes of low value - no better than a SOCKS5 proxy, for those who remember using those to escape firewalls.

Everything PyPI has should be public

Posted May 25, 2023 22:43 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (5 responses)

> The IP addresses are the hard part, but I would argue that even PyPI should not have access to them, and it's actually really weird that we designed our internet around a system where you get an identifier that's long-term stable, shared across services, and pretty closely linked with your geographic location, as a matter of course.

This is what happens when you keep extending a system far beyond what it was initially intended to do. IPv4 was designed 40+ years ago for a network that was fundamentally different in goal and scope from what we have today, so some of the assumptions behind its design don't match the reality of today's network. It was mostly intended to let a bunch of researchers talk to each other, often by bouncing unencrypted traffic through third parties. Privacy and security weren't even a consideration. If we actually wanted a network that was designed to protect users from snooping by third parties, we'd have to completely redesign it from the ground up. Of course there are powerful entities, from privacy-invading corporations to governments, who very much want to continue spying on network traffic and will fight any attempt to upgrade privacy and security.

Everything PyPI has should be public

Posted May 26, 2023 13:12 UTC (Fri) by kleptog (subscriber, #1183) [Link] (4 responses)

> Privacy and security weren't even a consideration. If we actually wanted a network that was designed to protect users from snooping by third parties, we'd have to completely redesign it from the ground up. Of course there are powerful entities, from privacy-invading corporations to governments, who very much want to continue spying on network traffic and will fight any attempt to upgrade privacy and security.

Is that really it though? We did redesign it from the ground up (IPv6). The primary constraint here, I think, is physical reality. You could make a perfectly secure system with infinite resources (everyone picks a random identifier, every packet is presented to every node) but that's clearly infeasible.

In reality, the choice to declare there are *no* trusted nodes leads to the conclusion routing identifiers must be global, because there is no trust boundary where you could perform a translation. Like, for example, how the Linux kernel mangles pointers in printk's. In theory China could use their Great Firewall to make all IP ranges inside China completely opaque. It'd be confusing, but it could probably be made to work with enough pain.

The more interesting question is: why do we store complete IP addresses in HTTP logs? And why is it unreasonably difficult to obfuscate them? Why is such obfuscation not the default (at least for non-RFC1918 addresses)? How can I tell an AWS ELB to obfuscate IPs in the logs? For nginx with some trickery you can get some obfuscation in the access.log, but for the error.log it's not possible. Why not?

I don't think it's corporations or governments preventing this. I think it's technical constraints, and an element of allowing perfect to be the enemy of good.

Everything PyPI has should be public

Posted May 26, 2023 14:41 UTC (Fri) by Wol (subscriber, #4433) [Link]

> I don't think it's corporations or governments preventing this. I think it's technical constraints, and an element of allowing perfect to be the enemy of good.

What about corner cutting, and the router manufacturers not having a clue what a router is supposed to do? Bufferbloat, cough cough, port numbers, cough cough, who bothers to read the spec, cough cough ...

Cheers,
Wol

Everything PyPI has should be public

Posted May 26, 2023 15:30 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

Part of the problem with obfuscating IP addresses is knowing what use you're going to make of them in the future. If you just need a host identifier, a simple run of a cryptographic hash (with salt) is enough to reduce IP addresses to something that's usable for equality checks, but nothing else.
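
As a rough sketch of the "hash with salt" idea, here is one way it could look, using HMAC-SHA256 as the keyed hash (a choice made for this example, not something the comment specifies). The names `SALT` and `pseudonymize` are hypothetical. Note that the IPv4 space is small enough to brute-force if the salt ever leaks, so the salt has to be treated as a secret:

```python
import hashlib
import hmac
import ipaddress
import secrets

# Hypothetical secret salt; in practice it would be stored securely
# and perhaps rotated on a schedule.
SALT = secrets.token_bytes(32)


def pseudonymize(ip: str, salt: bytes = SALT) -> str:
    """Map an IP address to an opaque token usable only for equality checks."""
    # Use the packed form so different spellings of one address match.
    packed = ipaddress.ip_address(ip).packed
    return hmac.new(salt, packed, hashlib.sha256).hexdigest()


# The same host always yields the same token; nothing else is recoverable
# without the salt.
print(pseudonymize("192.0.2.1") == pseudonymize("192.0.2.1"))
print(pseudonymize("192.0.2.1") == pseudonymize("192.0.2.2"))
```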

But often, you want to make use of information from outside to group IP addresses, and now the fun comes in - someone's home network might be 192.0.2.1/32, or 2001:db8:8001::/48. Their ISP might be using 192.0.2.0/24 and 2001:db8:8000::/36 for all home customers, and being able to look and say "ah, yes, the problem is with customers in 2001:db8:8000::/36, but not 2001:db8:c000::/36, so it's $ISP's home customers having issues" is useful; if you've obfuscated the IP address without knowing how $ISP structures them, though, it's too late to make that determination, as you could have obfuscated away the information you need to group by (e.g. because you decided to swap IPv6 addresses for bottom 64 bits and the ASN, losing the distinction between AS64500's home and business customers).
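
The kind of grouping described above can be sketched with Python's `ipaddress` module. The prefixes are the documentation ranges used in this comment; the labels and the `classify` helper are made up for illustration. The lookup needs the full address (or at least enough leading bits), which is exactly what an early hash or truncation may have thrown away:

```python
import ipaddress

# Prefixes from the example above; the labels are hypothetical.
GROUPS = {
    "home customers (IPv4)": ipaddress.ip_network("192.0.2.0/24"),
    "home customers (IPv6)": ipaddress.ip_network("2001:db8:8000::/36"),
    "not home customers (IPv6)": ipaddress.ip_network("2001:db8:c000::/36"),
}


def classify(ip: str):
    """Return the label of the first group containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for label, network in GROUPS.items():
        if addr.version == network.version and addr in network:
            return label
    return None


print(classify("2001:db8:8001::1"))
print(classify("2001:db8:c123::1"))
print(classify("198.51.100.7"))
```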

Everything PyPI has should be public

Posted May 26, 2023 15:51 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)

That sounds like an incredibly niche use-case. If you're a network engineer at a major internet exchange, sure. But for the vast majority of websites the IP address is never going to be used for anything other than country determination for statistics, so simply dropping the last octet loses nothing.

You can always think of situations where the full IP gives relevant information. I just don't see the argument, other than inertia, why it should be the default in HTTP logs. The few people for whom it is relevant can turn it off.
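
For what "dropping the last octet" could look like before a log line is written, here is a small sketch. The `truncate` helper is hypothetical, and the IPv6 prefix length of 48 is purely an assumption for the example, since the comment only talks about IPv4 octets:

```python
import ipaddress


def truncate(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of `ip`, keeping only the leading prefix.

    The default prefix lengths are assumptions for this sketch.
    """
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets ip_network() accept an address with host bits set.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate("192.0.2.77"))             # 192.0.2.0
print(truncate("2001:db8:8001:1234::5"))  # 2001:db8:8001::
```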

Everything PyPI has should be public

Posted May 26, 2023 16:04 UTC (Fri) by farnz (subscriber, #17727) [Link]

Sure, but how do you tell a website owner "you're never going to make it big - you might as well drop the last octet" (or /96 suffix in IPv6)?

It's my experience that the people who would lose least from assuming that they're going to stay smallish are the ones who assume that they're going to grow to at least the scale of Amazon.nl - and we then get into a social problem, where Amazon.nl are big enough that they benefit from knowing the full IP and correlating possible problems by close co-operation with Dutch ISPs, and the site owner does not want to know that they're never going to be that big, so they choose products that treat them as Amazon.nl scale, rather than ones that obfuscate part of the IP in storage by default.

Everything PyPI has should be public

Posted May 27, 2023 17:42 UTC (Sat) by geofft (subscriber, #59789) [Link]

A relevant update from PyPI from yesterday: https://blog.pypi.org/posts/2023-05-26-reducing-stored-ip...