Everything PyPI has should be public
Posted May 26, 2023 13:12 UTC (Fri) by kleptog (subscriber, #1183) In reply to: Everything PyPI has should be public by rgmoore
Parent article: PyPI was subpoenaed
Is that really it though? We did redesign it from the ground up (IPv6). The primary constraint here, I think, is physical reality. You could make a perfectly secure system with infinite resources (everyone picks a random identifier, every packet is presented to every node) but that's clearly infeasible.
In reality, the choice to declare that there are *no* trusted nodes leads to the conclusion that routing identifiers must be global, because there is no trust boundary where you could perform a translation (the way, for example, the Linux kernel mangles pointers in printk output). In theory China could use their Great Firewall to make all IP ranges inside China completely opaque. It'd be confusing, but it could probably be made to work with enough pain.
The more interesting question is: why do we store complete IP addresses in HTTP logs? And why is it unreasonably difficult to obfuscate them? Why is such obfuscation not the default (at least for non-RFC1918 addresses)? How can I tell an AWS ELB to obfuscate IPs in the logs? For nginx with some trickery you can get some obfuscation in the access.log, but for the error.log it's not possible. Why not?
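To make the idea concrete, here is a minimal sketch of a log post-processing filter, assuming you can pipe log lines through a script before they are stored. It zeroes the last octet of IPv4 addresses and leaves RFC1918 ones alone. The names `mask_ipv4` and `scrub_line` are made up for illustration, IPv6 is not handled, and the loose regex will also hit things that merely look like addresses (four-part version strings, say).

```python
import ipaddress
import re

# RFC 1918 private ranges; addresses inside these are left untouched.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Loose match for dotted-quad candidates; real validation happens in mask_ipv4.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def mask_ipv4(text):
    """Return text with its last octet zeroed, unless it is RFC1918 or invalid."""
    try:
        addr = ipaddress.IPv4Address(text)
    except ValueError:
        return text  # not a valid address (e.g. 999.1.1.1); leave as-is
    if any(addr in net for net in RFC1918):
        return text
    # Keep only the /24 the address belongs to.
    return str(ipaddress.ip_network(f"{addr}/24", strict=False).network_address)


def scrub_line(line):
    """Mask every public IPv4 address found in one log line."""
    return IPV4_CANDIDATE.sub(lambda m: mask_ipv4(m.group(0)), line)
```

Something like `scrub_line('203.0.113.77 - - "GET / HTTP/1.1" 200')` would then store `203.0.113.0` in place of the client address, while a line starting with `192.168.1.20` passes through unchanged.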
I don't think it's corporations or governments preventing this. I think it's technical constraints, and an element of allowing perfect to be the enemy of good.
Posted May 26, 2023 14:41 UTC (Fri) by Wol (subscriber, #4433)
What about corner cutting, and the router manufacturers not having a clue what a router is supposed to do? Bufferbloat (cough cough), port numbers (cough cough), who bothers to read the spec (cough cough) ...
Cheers,
Wol
Posted May 26, 2023 15:30 UTC (Fri) by farnz (subscriber, #17727)
Part of the problem with obfuscating IP addresses is knowing what use you're going to make of them in the future. If you just need a host identifier, a simple run of a cryptographic hash (with salt) is enough to reduce IP addresses to something that's usable for equality checks, but nothing else.
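As a rough sketch of the host-identifier case, assuming Python and a salt that is kept secret and never written next to the logs: this uses HMAC-SHA256 as the keyed hash, which is one common way to do "hash with salt", not the only one. `make_pseudonymizer` is a name invented for this example.

```python
import hashlib
import hmac
import ipaddress
import secrets


def make_pseudonymizer(salt):
    """Build a function mapping an IP address string to an opaque token."""

    def pseudonymize(ip_text):
        # Parse first so different spellings of one address hash identically.
        packed = ipaddress.ip_address(ip_text).packed
        # Truncated hex digest: good for equality checks, nothing else.
        return hmac.new(salt, packed, hashlib.sha256).hexdigest()[:16]

    return pseudonymize


# The salt must stay secret: the IPv4 space is small enough to brute-force
# an unsalted (or known-salt) hash back to the original address.
salt = secrets.token_bytes(32)
pseudonymize = make_pseudonymizer(salt)
```

Two log entries from the same address get the same token, so you can still count distinct hosts, but neighbouring addresses get unrelated tokens, so any grouping by prefix is gone.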
But often, you want to make use of information from outside to group IP addresses, and now the fun comes in. Someone's home network might be 192.0.2.1/32, or 2001:db8:8001::/48. Their ISP might be using 192.0.2.0/24 and 2001:db8:8000::/36 for all home customers, and being able to look and say "ah, yes, the problem is with customers in 2001:db8:8000::/36, but not 2001:db8:c000::/36, so it's $ISP's home customers having issues" is useful. If you've obfuscated the IP address without knowing how $ISP structures them, though, it's too late to make that determination, as you could have obfuscated away the information you need to group by (e.g. because you decided to replace IPv6 addresses with the bottom 64 bits and the ASN, losing the distinction between AS64500's home and business customers).
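A small sketch of that failure mode, using the example prefixes above. The "home" and "other" labels and the `classify` and `truncate` helpers are invented for illustration, and truncating to a /32 is just one possible obfuscation choice, picked here because it happens to be too coarse.

```python
import ipaddress

# Prefixes from the example; the labels are hypothetical.
GROUPS = {
    ipaddress.ip_network("2001:db8:8000::/36"): "home",
    ipaddress.ip_network("2001:db8:c000::/36"): "other",
}


def classify(ip_text):
    """Return the label of the first known prefix containing the address."""
    addr = ipaddress.ip_address(ip_text)
    for net, label in GROUPS.items():
        if addr.version == net.version and addr in net:
            return label
    return "unknown"


def truncate(ip_text, prefix_len):
    """Obfuscate by keeping only the first prefix_len bits of the address."""
    net = ipaddress.ip_network(f"{ip_text}/{prefix_len}", strict=False)
    return str(net.network_address)


home = "2001:db8:8001::1"
other = "2001:db8:c123::1"

# With full addresses the two groups are distinguishable.
full = (classify(home), classify(other))

# Truncated to /48 at log time, the /36 grouping still works.
kept = (classify(truncate(home, 48)), classify(truncate(other, 48)))

# Truncated to /32, both collapse to the same value and the split is lost.
lost = (classify(truncate(home, 32)), classify(truncate(other, 32)))
```

The catch is that you have to pick the truncation length when the log is written, before you know which prefix boundary will matter later.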
Posted May 26, 2023 15:51 UTC (Fri) by kleptog (subscriber, #1183)
You can always think of situations where the full IP gives relevant information. I just don't see the argument, other than inertia, for why full addresses should be the default in HTTP logs. The few people for whom they are relevant can turn the obfuscation off.
Posted May 26, 2023 16:04 UTC (Fri) by farnz (subscriber, #17727)
Sure, but how do you tell a website owner "you're never going to make it big - you might as well drop the last octet" (or /96 suffix in IPv6)?
It's my experience that the people who would lose least from assuming that they're going to stay smallish are the ones who assume that they're going to grow to at least the scale of Amazon.nl. We then get into a social problem: Amazon.nl are big enough that they benefit from knowing the full IP and correlating possible problems through close co-operation with Dutch ISPs, and the site owner does not want to hear that they're never going to be that big, so they choose products that treat them as Amazon.nl scale, rather than ones that obfuscate part of the IP in storage by default.
