
Git archive generation meets Hyrum's law


Posted Feb 2, 2023 17:56 UTC (Thu) by agateau (subscriber, #57569)
Parent article: Git archive generation meets Hyrum's law

It seems to me the idea of generating the archives on the fly is wrong. GitHub (and other git forges) should generate the archive once, store the result and then always serve the same file.



Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:50 UTC (Thu) by sionescu (subscriber, #59410) [Link]

This is the right answer. Command-line utilities generally carry no expectation of producing stable output, but the web pretty much does. GitHub should store an archive the first time a release tarball is fetched, and serve that forever.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (2 responses)

I imagine they do have some degree of caching already - it would be very expensive to generate an archive every single time anyone in the world requests it. You are effectively proposing to keep things in the cache for eternity, but what is the benefit of doing that, compared to a more conventional cache invalidation strategy? It has multiple drawbacks:

* Your cache uses an ever-growing amount of storage, which you would otherwise be using to host repositories, so now repository hosting gets more expensive.
* After you have been doing this for a few years or so, the vast majority of your cache is holding data that nobody is ever going to look at again, so now you need to implement a hierarchical cache (i.e. push all the low-traffic files out to tape to cut down on costs).
* But retrieving a file from tape probably takes *longer* than just generating a fresh archive, so your cache isn't a cache anymore, it's a bottleneck.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 14:14 UTC (Fri) by agateau (subscriber, #57569) [Link] (1 response)

Mmm, I just realized on-the-fly archives are available for *all* commits. I agree caching archives for those would be impractical.

Depending on them not ever changing was a bad idea.

Assuming that the archives one can find in a GitHub release will never change, on the other hand, sounds reasonable. Those should be generated once. GitHub already lets you attach arbitrary files to a release, so storing an archive of the sources should not be a problem (he says without having any numbers). They could limit this to creating archives only for releases, not tags, to reduce the number of generated archives.
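For context, the instability the article describes sits in the compression layer, not in git's tar stream: for a fixed git version, `git archive` output for a given tag is reproducible byte for byte, though that stability is not a documented guarantee across git versions. A minimal local sketch (the throwaway repo, file, and tag names are made up for illustration):

```shell
# Throwaway repo with one tagged commit, standing in for a released project.
git init -q repo
echo hello > repo/file.txt
git -C repo add file.txt
git -C repo -c user.email=a@b -c user.name=a commit -q -m "release"
git -C repo tag v1.0

# Archive the same tag twice; the uncompressed tar stream is deterministic
# for a given git version (entry order follows the tree, mtimes come from
# the commit), so the two runs produce identical bytes.
git -C repo archive --format=tar v1.0 > a.tar
git -C repo archive --format=tar v1.0 > b.tar
cmp a.tar b.tar && echo "tar streams identical"
```

Pipe that tar through a compressor, though, and the bytes depend on the compressor's version and settings, which is exactly where the breakage in the article came from.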

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 14:38 UTC (Fri) by paulj (subscriber, #341) [Link]

Right, the issue is that random developers are configuring their build systems to download on-the-fly git archives of arbitrary commits of projects, rather than just doing a shallow clone of the git commit ID - which *IS* guaranteed to be stable, with cryptographic-strength guarantees! (And many build systems, including CMake, have modules that make it easy to specify build dependencies as git commits to check out.)

The people doing this are utterly clueless, and it's insanity to coddle them.
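The shallow-clone approach described above can be sketched entirely locally; the repository names below are made up, and the `uploadpack.allowAnySHA1InWant` config line stands in for what real forges enable so that arbitrary commit IDs can be fetched directly:

```shell
# Throwaway "upstream" repo standing in for a forge-hosted project.
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "some release"
git -C upstream config uploadpack.allowAnySHA1InWant true  # forges allow fetch-by-ID
sha=$(git -C upstream rev-parse HEAD)

# Fetch exactly one commit, by ID. The ID is a cryptographic hash of the
# commit's entire content and history, so what we check out is stable by
# construction - no archive-generation pipeline involved.
git init -q consumer
git -C consumer fetch -q --depth 1 ../upstream "$sha"
git -C consumer checkout -q FETCH_HEAD
```

The `--depth 1` keeps the transfer to a single commit, roughly the same amount of data as a tarball download.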

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:30 UTC (Thu) by bnewbold (subscriber, #72587) [Link] (3 responses)

This was my response as well. Or at least, once an archive has been requested (downloaded), store that.

It occurs to me that I've been assuming the original issue was with "release" archives (i.e., git-tagged commits resulting in tarballs). If the issue has been with pulling archives of arbitrary git commits, I'm less sympathetic to the assumption of stability, as it does seem reasonable to generate those on the fly and not persist the result.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 7:38 UTC (Fri) by mb (subscriber, #50428) [Link] (2 responses)

>Or at least, once an archive has been requested (downloaded), store that.

That doesn't work.
It only takes one search engine bot to hit a large number of these generated links for your cache to explode.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 19:02 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

These links are `nofollow`, right? Right? And ban bots trawling those with extreme prejudice.

`archive.org` might do it, but I suspect it wields far less DDoS power than a Google crawler.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 19:27 UTC (Fri) by pizza (subscriber, #46) [Link]

It's not really any one bot; it's that everyone and their cousin now has their own (nominally legit) crawler.

And then there are the distributed bots that spoof their identifier and really don't GaF about what robots.txt has in it.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds