
Everything PyPI has should be public

Posted May 26, 2023 13:12 UTC (Fri) by kleptog (subscriber, #1183)
In reply to: Everything PyPI has should be public by rgmoore
Parent article: PyPI was subpoenaed

> Privacy and security weren't even a consideration. If we actually wanted a network that was designed to protect users from snooping by third parties, we'd have to completely redesign it from the ground up. Of course there are powerful entities, from privacy-invading corporations to governments, who very much want to continue spying on network traffic and will fight any attempt to upgrade privacy and security.

Is that really it though? We did redesign it from the ground up (IPv6). The primary constraint here, I think, is physical reality. You could make a perfectly secure system with infinite resources (everyone picks a random identifier, every packet is presented to every node) but that's clearly infeasible.

In reality, the choice to declare there are *no* trusted nodes leads to the conclusion routing identifiers must be global, because there is no trust boundary where you could perform a translation. Like, for example, how the Linux kernel mangles pointers in printk's. In theory China could use their Great Firewall to make all IP ranges inside China completely opaque. It'd be confusing, but it could probably be made to work with enough pain.

The more interesting question is: why do we store complete IP addresses in HTTP logs? And why is it unreasonably difficult to obfuscate them? Why is such obfuscation not the default (at least for non-RFC1918 addresses)? How can I tell an AWS ELB to obfuscate IPs in the logs? For nginx with some trickery you can get some obfuscation in the access.log, but for the error.log it's not possible. Why not?

I don't think it's corporations or governments preventing this. I think it's technical constraints, and an element of allowing perfect to be the enemy of good.



Everything PyPI has should be public

Posted May 26, 2023 14:41 UTC (Fri) by Wol (subscriber, #4433)

> I don't think it's corporations or governments preventing this. I think it's technical constraints, and an element of allowing perfect to be the enemy of good.

What about corner cutting, and the router manufacturers not having a clue what a router is supposed to do? Bufferbloat (cough cough), port numbers (cough cough), who bothers to read the spec (cough cough) ...

Cheers,
Wol

Everything PyPI has should be public

Posted May 26, 2023 15:30 UTC (Fri) by farnz (subscriber, #17727)

Part of the problem with obfuscating IP addresses is knowing what use you're going to make of them in the future. If you just need a host identifier, a simple run of a cryptographic hash (with salt) is enough to reduce IP addresses to something that's usable for equality checks, but nothing else.
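To make the "equality checks, but nothing else" property concrete, here is a minimal sketch in Python. It uses HMAC-SHA256 as the keyed hash, which is one way to realise "a cryptographic hash (with salt)" rather than the only one. The function name `pseudonymize_ip`, the 16-character truncation and the way the salt is generated are all assumptions made for illustration.

```python
import hashlib
import hmac
import ipaddress
import secrets

# Hypothetical per-deployment secret. If it leaks, the small IPv4 space
# can be brute-forced, so it has to be stored as carefully as a key.
SALT = secrets.token_bytes(16)


def pseudonymize_ip(addr: str, salt: bytes = SALT) -> str:
    """Return an opaque token that is stable for a given address and salt."""
    # Parse first, so different spellings of the same address
    # (e.g. "2001:db8::1" and "2001:0db8::1") hash identically.
    packed = ipaddress.ip_address(addr).packed
    digest = hmac.new(salt, packed, hashlib.sha256).hexdigest()
    # Truncated only to keep log lines short (an arbitrary choice).
    return digest[:16]
```

Two log entries from the same address get the same token, so they can still be counted or joined, but the token says nothing about which network the address belonged to.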

But often, you want to make use of information from outside to group IP addresses, and now the fun comes in: someone's home network might be 192.0.2.1/32, or 2001:db8:8001::/48. Their ISP might be using 192.0.2.0/24 and 2001:db8:8000::/36 for all home customers, and being able to look and say "ah, yes, the problem is with customers in 2001:db8:8000::/36, but not 2001:db8:c000::/36, so it's $ISP's home customers having issues" is useful. If you've obfuscated the IP address without knowing how $ISP structures them, though, it's too late to make that determination, as you could have obfuscated away the information you need to group by (e.g. because you decided to replace IPv6 addresses with the bottom 64 bits plus the ASN, losing the distinction between AS64500's home and business customers).
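As a rough sketch of the grouping described above, Python's standard `ipaddress` module can test whether a logged address falls inside a prefix you only learn about later. The prefixes are the documentation ranges used in this comment; the group labels and the `classify` helper are made up for the example.

```python
import ipaddress

# Hypothetical labels for the example prefixes discussed above.
GROUPS = {
    "isp-home-v4": ipaddress.ip_network("192.0.2.0/24"),
    "isp-home-v6": ipaddress.ip_network("2001:db8:8000::/36"),
    "isp-other-v6": ipaddress.ip_network("2001:db8:c000::/36"),
}


def classify(addr: str):
    """Return the label of the first matching prefix, or None."""
    ip = ipaddress.ip_address(addr)
    for label, net in GROUPS.items():
        # Compare only within the same address family.
        if ip.version == net.version and ip in net:
            return label
    return None
```

This only works if the stored address still contains the bits that distinguish the prefixes. Once the log holds a hash, or an address cut down at the wrong boundary, there is nothing left to feed into `classify`.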

Everything PyPI has should be public

Posted May 26, 2023 15:51 UTC (Fri) by kleptog (subscriber, #1183)

That sounds like an incredibly niche use-case. If you're a network engineer at a major internet exchange, sure. But for the vast majority of websites the IP address is never going to be used for anything other than country determination for statistics, so simply dropping the last octet loses nothing.
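For illustration, dropping the last octet could look something like this in Python. The /24 default matches "dropping the last octet"; the /48 default for IPv6 is purely an assumption for the sketch, and `truncate_ip` is a hypothetical helper, not something any particular web server provides.

```python
import ipaddress


def truncate_ip(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an address before it is written to a log."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets ip_network() accept an address with host bits set
    # and mask them off instead of raising ValueError.
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)
```

The result still looks like an IP address, so anything that does country lookups on the log should keep working on the truncated value.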

You can always think of situations where the full IP gives relevant information. I just don't see the argument, other than inertia, why it should be the default in HTTP logs. The few people for whom it is relevant can turn the obfuscation off.

Everything PyPI has should be public

Posted May 26, 2023 16:04 UTC (Fri) by farnz (subscriber, #17727)

Sure, but how do you tell a website owner "you're never going to make it big - you might as well drop the last octet" (or /96 suffix in IPv6)?

It's my experience that the people who would lose least from assuming that they're going to stay smallish are the ones who assume that they're going to grow to at least the scale of Amazon.nl. We then get into a social problem: Amazon.nl are big enough that they benefit from knowing the full IP and correlating possible problems through close co-operation with Dutch ISPs, while the site owner does not want to hear that they're never going to be that big, so they choose products that treat them as Amazon.nl scale rather than ones that obfuscate part of the IP in storage by default.

