Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:56 UTC (Thu) by agateau (subscriber, #57569)
Parent article: Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:50 UTC (Thu) by sionescu (subscriber, #59410) [Link]

Posted Feb 2, 2023 19:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (2 responses)
* Your cache uses an ever-growing amount of storage, which you would otherwise be using to host repositories, so now repository hosting gets more expensive.
* After you have been doing this for a few years or so, the vast majority of your cache is holding data that nobody is ever going to look at again, so now you need to implement a hierarchical cache (i.e. push all the low-traffic files out to tape to cut down on costs).
* But retrieving a file from tape probably takes *longer* than just generating a fresh archive, so your cache isn't a cache anymore, it's a bottleneck.
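A minimal sketch in Python of the tradeoff that list describes, assuming an LRU policy and a fixed byte budget (both illustrative, as is the class name; no forge necessarily works this way):

```python
# A size-capped LRU cache for generated archives: a sketch of the
# storage/eviction tradeoff, not how any real forge is implemented.
from collections import OrderedDict

class ArchiveCache:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes      # total storage budget (assumption)
        self.used = 0
        self.entries = OrderedDict()    # archive key -> size, in LRU order

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # hit: mark as recently used
            return True
        return False                        # miss: caller must regenerate

    def put(self, key, size):
        # Evict least-recently-used archives until the new one fits.
        while self.used + size > self.max_bytes and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[key] = size
        self.used += size
```

Capping the budget keeps hosting costs flat, but a crawler whose working set exceeds max_bytes turns every request into an eviction plus a regeneration; spilling evictions to tape instead merely trades that for the retrieval latency in the last bullet.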

Posted Feb 3, 2023 14:14 UTC (Fri) by agateau (subscriber, #57569) [Link] (1 response)
Depending on the on-the-fly archives never changing was a bad idea.

Assuming that the archives one can find in a GitHub release will never change, on the other hand, sounds reasonable. Those should be generated once. GitHub already lets you attach arbitrary files to a release, so an archive of the sources should not be a problem (he says without having any numbers). They could limit this to creating archives only for releases, not tags, to reduce the number of generated archives.
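That distinction is easy to demonstrate. A sketch, assuming a repository with a hypothetical v1.0.0 tag: the tar stream `git archive` emits for a given commit is stable (at least with a fixed git version), but the compressed bytes move whenever the compressor or its settings change, which is exactly how the checksums in the article broke:

```python
import gzip
import hashlib
import subprocess

# The tar stream for a given tag hashes the same on every run
# (with a fixed git version), so that hash could be pinned...
tar_bytes = subprocess.run(
    ["git", "archive", "--format=tar", "v1.0.0"],   # illustrative tag
    check=True, capture_output=True).stdout
print("tar sha256:    ", hashlib.sha256(tar_bytes).hexdigest())

# ...but the .tar.gz hash depends on the compressor and its settings,
# even with the gzip header timestamp pinned to zero.
for level in (6, 9):
    gz = gzip.compress(tar_bytes, compresslevel=level, mtime=0)
    print(f"gzip -{level} sha256:", hashlib.sha256(gz).hexdigest())
```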

Posted Feb 3, 2023 14:38 UTC (Fri) by paulj (subscriber, #341) [Link]
The people doing this are utterly clueless, and it's insanity to coddle them.

Posted Feb 2, 2023 19:30 UTC (Thu) by bnewbold (subscriber, #72587) [Link] (3 responses)
It occurs to me that I've been assuming the original issue was with "release" archives (i.e., tarballs generated from git-tagged commits). If the issue is with pulling archives of arbitrary git commits, I'm less sympathetic to the assumption of stability, as it seems reasonable to generate those on the fly and not persist the result.
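Generate-and-forget is cheap to express. A sketch, where the function name and chunk size are made up and the archive is written to stdout rather than an HTTP response:

```python
import subprocess
import sys

def stream_archive(commit, out=sys.stdout.buffer, chunk_size=64 * 1024):
    # Pipe `git archive` straight to the client; nothing is written to
    # disk, so there is no cache to grow and no stored artifact whose
    # checksum anyone can come to depend on.
    proc = subprocess.Popen(
        ["git", "archive", "--format=tar.gz", commit],
        stdout=subprocess.PIPE)
    while chunk := proc.stdout.read(chunk_size):
        out.write(chunk)
    if proc.wait() != 0:
        raise RuntimeError(f"git archive failed for {commit}")

stream_archive("HEAD")
```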

Posted Feb 3, 2023 7:38 UTC (Fri) by mb (subscriber, #50428) [Link] (2 responses)
That doesn't work. It only takes one search engine bot to hit a large number of these generated links for your cache to explode.
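Back-of-envelope, with every number an illustrative assumption:

```python
# Rough cost of a single crawler walking every per-commit archive link,
# if each generated archive were kept. All figures are assumptions.
commits = 1_000_000        # roughly a kernel-sized repository history
avg_archive_mb = 200       # one generated source tarball
total_tb = commits * avg_archive_mb / 1_000_000
print(f"storage after one full crawl: ~{total_tb:,.0f} TB")  # ~200 TB
```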

Posted Feb 3, 2023 19:02 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 response)
`archive.org` might do it, but I suspect they wield far less DDoS power than a Google crawler.

Posted Feb 3, 2023 19:27 UTC (Fri) by pizza (subscriber, #46) [Link]
And then there are the distributed bots that spoof their identifier and really don't GaF about what robots.txt has in it.