
Git archive generation meets Hyrum's law

By Jonathan Corbet
February 2, 2023
On January 30, the GitHub blog carried a brief notice that the checksums of archives (such as tarballs) generated by the site had just changed. GitHub's engineers were seemingly unaware of the consequences of such a change — consequences that were immediately evident to anybody familiar with either packaging systems or Hyrum's law. Those checksums were widely depended on by build systems, which immediately broke when the change went live; the resulting impact of jawbones hitting the floor was heard worldwide. The change has been reverted for now, but it is worth looking at how GitHub managed to casually break vast numbers of build systems — and why this sort of change will almost certainly happen again.

One widely used GitHub feature is the ability to download an archive file of the state of the repository at an arbitrary commit; it is often used by build systems to obtain a specific release of a package of interest. Internally, this archive is created at request time by the git archive subcommand. Most build systems will compare the resulting archive against a separately stored checksum to be sure that the archive is as expected and has not been corrupted; if the checksum fails to match, the build will be aborted. So when the checksums of GitHub-generated tarballs abruptly changed, builds started failing.
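The check in question is simple; here is a minimal sketch of what such a build step might look like (the URL and the pinned hash are hypothetical placeholders, not a real release). If the bytes served for that URL change for any reason, including a change in how they were compressed, the build aborts.

```python
# Minimal sketch of a build-system download-and-verify step; the URL and the
# pinned hash below are placeholders, not a real project or digest.
import hashlib
import urllib.request

ARCHIVE_URL = "https://github.com/example/project/archive/refs/tags/v1.2.3.tar.gz"
EXPECTED_SHA256 = "0123456789abcdef..."  # recorded when the dependency was added

def fetch_and_verify(url: str, expected: str) -> bytes:
    """Download an archive and abort unless its checksum matches the pinned value."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise RuntimeError(f"checksum mismatch: expected {expected}, got {actual}")
    return data
```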

Unsurprisingly, people started to complain. The initial response from GitHub employee (and major Git contributor) brian m. carlson was less than fully understanding:

I'm saying that policy has never been correct and we've never guaranteed stable checksums for archives, just like Git has never guaranteed that. I apologize that things are broken here and that there hasn't been clearer communication in the past on this, but our policy hasn't changed in over 4 years.

This answer, it might be said, was not received well. Wyatt Anderson, for example, said:

The collective amount of human effort it will take to break glass, recover broken build systems that are impacted by this change, and republish artifacts across entire software ecosystems could probably cure cancer. Please consider reverting this change as soon as possible.

The outcry grew louder, and it took about two hours for Matt Cooper (another GitHub employee) to announce that the change was being reverted — for now: "we're reverting the change, and we'll communicate better about such changes in the future (including timelines)". Builds resumed working, and peace reigned once again.

The source of the problem

The developers at GitHub did not wake up one morning and hatch a scheme to break large numbers of build systems; instead, all they did was upgrade the version of Git used internally. In June 2022, René Scharfe changed git archive to use an internal implementation of the gzip compression algorithm rather than invoking the gzip program separately. This change, which found its way into the Git 2.38 release, allowed Git to drop the gzip dependency, more easily support compression across operating systems, and compress the data with less CPU time.

It also caused git archive to compress files differently. While the uncompressed data is identical, the compressed form differs, so the checksum of the compressed data differs as well. Once this change landed on GitHub's production systems, the checksums for tarballs generated on the fly abruptly changed. GitHub backed out the change, either by reverting to an older Git or by explicitly configuring the use of the standalone gzip program, and the immediate problem went away.
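The effect is easy to reproduce locally. This is not GitHub's code, just a small Python illustration that the same data admits many valid gzip encodings; here only the embedded header timestamp differs, but a different compressor implementation or compression level has the same result.

```python
# Demonstration that identical data can produce different compressed bytes,
# and therefore different checksums, while decompressing to the same content.
import gzip
import hashlib

payload = b"identical archive contents\n" * 1000

a = gzip.compress(payload, mtime=0)           # one run/implementation
b = gzip.compress(payload, mtime=1675296000)  # another run/implementation

assert gzip.decompress(a) == gzip.decompress(b) == payload      # same underlying data
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()  # different checksums
print("decompressed data identical; compressed checksums differ")
```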

The resulting discussion on the Git mailing list has been relatively muted so far. Eli Schwartz started things off with a suggestion that Git should change its default back to using the external gzip program for now, then implement a "v2 archive format" using the internal compressor. Using a heuristic, git archive would always default to the older format for commits before some sort of cutoff date. That would ensure ongoing compatibility for older archives, but the idea of wiring that sort of heuristic into Git was not generally popular.

Ævar Arnfjörð Bjarmason, instead, suggested that the default could be changed to use the external gzip, retaining the internal implementation as an option or as a fallback should the external program not be found. The responsibility for output compatibility could then be shifted to the compression program: anybody who wants to ensure that their generated archive files do not change will have to ensure that their gzip does not change. Since the Git developers do not control that program, they cannot guarantee its forward compatibility in any case.

Carlson, though, argued for avoiding stability guarantees — especially implicit guarantees — if possible:

I made a change some years back to the archive format to fix the permissions on pax headers when extracted as files, and kernel.org was relying on that and broke. Linus yelled at me because of that.

Since then, I've been very opposed to us guaranteeing output format consistency without explicitly doing so. I had sent some patches before that I don't think ever got picked up that documented this explicitly. I very much don't want people to come to rely on our behaviour unless we explicitly guarantee it.

He went on to suggest that Git could guarantee the stability of the archive format in uncompressed form. That format would have to be versioned, though, since the SHA-256 transition, if and when it happens, will force changes in that format anyway (a claim that Bjarmason questioned). In general, carlson concluded, it may well become necessary for anybody who wants consistent results to decompress archive files before checking checksums. He later reiterated that, in his opinion, implementing a stable tar format is feasible, but adding compression is not: "I personally feel that's too hard to get right and am not planning on working on it".

Konstantin Ryabitsev said that, while he understands carlson's desire to avoid committing to an output format, "I also think it's one of those things that happen despite your best efforts to prevent it". He suggested adding a --stable option to git archive that was guaranteed to not change.

What next?

As of this writing, the Git community has not decided whether to make any changes as the result of this episode. Bjarmason argued that the Git community should accommodate the needs of its users, even if they came to depend on a feature that was never advertised as being stable:

That's unfortunate, and those people probably shouldn't have done that, but that's water under the bridge. I think it would be irresponsible to change the output willy-nilly at this point, especially when it seems rather easy to find some compromise everyone will be happy with.

He has since posted a patch set restoring the old behavior, but also documenting that this behavior could change in the future.

Committing to stability of this type is never a thing to be done lightly, though; such stability can be hard to maintain (especially when dealing with file formats defined by others) and can block other types of progress. For example, replacing gzip can yield better compression that can be performed more efficiently; an inability to move beyond that algorithm would prevent Git from obtaining those benefits. Even if Git restores the use of an external gzip program by default, that program might, itself, change, or downstream users like GitHub may decide that they no longer want to support that format.

It would thus be unsurprising if this problem were to refuse to go away. The Git project is reluctant to add a stability guarantee to its maintenance load, and the same is true of its downstream users; GitHub has said that it would give some warning before a checksum change returns, but has not said that such a change would not happen. The developers and users of build systems may want to be rethinking their reliance on the specific compression format used by one proprietary service on the Internet. The next time problems turn up, they will not be able to say they haven't been warned.



Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:18 UTC (Thu) by dskoll (subscriber, #1630) [Link] (5 responses)

Couldn't the checksum be based on the uncompressed data? Kernel release signatures are based on the uncompressed data, for example.

Sure, it would be a little annoying to have to uncompress to verify the archive integrity, but that would free up developers to tweak the compression to their hearts' content. It also means build processes that rely on checksums would need to be adjusted, but that would be a one-time adjustment.

I suppose there's the risk of a compression bomb that could DoS build systems, but those would be relatively easy to detect... along with the checksum, store the size of the uncompressed data you expect and abort if it starts uncompressing to a larger size.
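A rough sketch of that scheme, assuming the expected digest and size are recorded alongside the release (both values below are placeholders): hash the uncompressed stream while enforcing the recorded size limit.

```python
# Sketch: hash the *uncompressed* stream and abort if it grows past a recorded
# expected size, guarding against decompression bombs. Values are placeholders.
import gzip
import hashlib

EXPECTED_UNCOMPRESSED_SHA256 = "0123456789abcdef..."  # recorded with the release
EXPECTED_UNCOMPRESSED_SIZE = 10 * 1024 * 1024          # recorded with the release

def verify_uncompressed(path: str) -> None:
    digest = hashlib.sha256()
    total = 0
    with gzip.open(path, "rb") as stream:
        while chunk := stream.read(64 * 1024):
            total += len(chunk)
            if total > EXPECTED_UNCOMPRESSED_SIZE:
                raise RuntimeError("archive expands beyond the expected size")
            digest.update(chunk)
    if digest.hexdigest() != EXPECTED_UNCOMPRESSED_SHA256:
        raise RuntimeError("uncompressed checksum mismatch")
```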

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:19 UTC (Thu) by dskoll (subscriber, #1630) [Link]

Ah, just noticed this was mentioned in the article. Missed it first time around.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 23:44 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

> Couldn't the checksum be based on the uncompressed data?

Given that we're talking about git, here's an interesting anecdote: originally, the identity of a git blob object was the hash of its _compressed_ data. Very early in the git history (you can find the commits if you look near the beginning), it was noticed that this was going to be a mistake, and the identity of blob objects was changed to be the hash of their _uncompressed_ data. This was a breaking change, since every commit (and tree and blob) would change its hash, but since there were only a couple of git repositories in the whole world (IIRC, mostly git itself, Linux, and sparse), and only a few people were using git back then, that change was acceptable.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 0:33 UTC (Fri) by edgewood (subscriber, #1123) [Link]

Ah, perhaps they learned from the Makefile tab mistake.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 6:35 UTC (Fri) by pabs (subscriber, #43278) [Link] (1 responses)

There are probably other attacks on decompression code than just DoS, depending on how badly written it is.

Git archive generation meets Hyrum's law

Posted Feb 23, 2023 17:10 UTC (Thu) by kijiki (subscriber, #34691) [Link]

The good news is that compression/decompression is a pure function from input to output, so it can easily be very tightly sandboxed. SECCOMP_SET_MODE_STRICT level strict.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:25 UTC (Thu) by alonz (subscriber, #815) [Link] (21 responses)

Why not do the opposite - instead of stabilizing the checksum, make it so that every single download will have a different checksum (by including some hidden field that is always randomized)?

This way the "checksum is not a guarantee" becomes a real thing, and Hyrum gets a rest.
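One way such "greasing" could be implemented, purely as an illustration of the idea and not anything Git or GitHub does, is to inject a random global pax header into every generated archive so that no two downloads are byte-identical:

```python
# Illustrative sketch of "greasing": add a random global pax header so every
# generated archive has a different checksum, discouraging anyone from pinning
# the checksum of the compressed output.
import os
import tarfile

def write_greased_archive(source_dir: str, output_path: str) -> None:
    grease = os.urandom(16).hex()   # changes on every run
    with tarfile.open(output_path, "w:gz",
                      format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": grease}) as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
```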

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:44 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (18 responses)

While interesting as a mental exercise, what would be the point of that? Checksums are used for specific purposes (e.g. checking for potential corruption, or making sure you get exactly what you expect rather than something with the same filename but different content). Those purposes do not disappear when you make checksums useless.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:52 UTC (Thu) by geert (subscriber, #98403) [Link] (8 responses)

Relying on the checksums of the compressed archives sounds like a bad idea, as those archives don't exist on the server, but are generated on the fly, using an external program. Imagine you want to download the same archive in 10 years, what do you do when checksum verification fails? Download again? Go to a different server?

What uniquely specifies the contents is the git commit ID (which is still sha1). And perhaps the checksum of the uncompressed archive (assuming git can keep that stable, and doesn't have to change it due to e.g. a security bug).

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:48 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

`export-subst` and `export-ignore` can pull metadata from the commit in question and bake it into the archive (or exclude the file completely).

What should have happened is that the creation of a release has an option to permalink the "Source code" links Github provides. Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:33 UTC (Fri) by himi (subscriber, #340) [Link] (4 responses)

That wouldn't help all the use cases which are targeting a specific commit rather than a release - which is going to be the case for the build bots and test infrastructure using this particular Github feature . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:46 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

I'm not sure what buildbots and infrastructure are downloading random commit snapshots instead of tagged releases (or they follow `main`…but then you have no stable hash anyways). CI should be working from a clone and if it gets a tarball, it can't know the hash a priori anyways.

Even when we patch stuff, I try to grab a tag and cherry-pick fixes we need back to it instead of following "random" development commits. But then we also try to rehost everything because customers can have pretty draconian firewall policies and they allow our hosting through to avoid downloading things from $anywhere.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 2:43 UTC (Fri) by himi (subscriber, #340) [Link] (2 responses)

Yeah, there are clearly better ways to implement this kind of thing, and it's not how you or I or any reasonably experienced dev would set things up, but there are a lot of people out there who are effectively stitching their development processes together from whatever odds and ends they can find. Maybe because they don't know any better, or because they don't have the resources to do it "properly", or because things just kind of grew that way through a process of accretion. And/or they may have set things up that way before Github integrated all those nice convenient free-ish services to support this kind of thing, and just never bothered fixing things that didn't seem broken. Probably a lot of them are now re-evaluating those decisions, having learned /why/ the reasonably experienced devs avoid doing things that way.

Github obviously isn't intending to support that kind of usage - if they were this change wouldn't have happened, or they'd have implemented the archive download in a reliably verifiable way from the start. But the service that Github /does/ provide is very easy to use/abuse for things Github isn't explicitly intending to support, and that's what's bitten people here. Github did nothing wrong, and neither did the people using their service in a way it wasn't really intended to support, but that's kind of the point of Hyrum's law, isn't it . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:49 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

That’s because developing and releasing are two different trades, but git(hub|lab) and language-specific packaging systems in general made their success by targeting developer workflows. They suck loads for release processes; every release manager worth his salt knows it, but they have the developer mindshare, so dealing with their suckage is a fact of life.

A release manager knows intimately that converging on a stable and secure state is hard and will try to pin everything on official stable releases (except for the bugfix/security fixup releases that need to be fast-tracked and propagated as fast as possible).

A developer will use whatever random git commit he checked out last and will try to pin everything to that commit to avoid the hassle of testing anything other than his last workstation state (including freezing out security fixes). The more obscure the bit of code he depends on, the less he will want to update it (even though, because it is obscure, he has no idea what dangers lurk in there).

One consequence of github prominently promoting the second workflow is that, instead of serving a *small* number of trusted releases that can be cached once generated (compression included), it needs to generate on-the-fly archives for all the random, not-shared code states it induced developers to depend on.

No one sane will depend on github release links. When you need to release something that depends on hundreds of artifacts, you use the system that is the same for those hundreds of artifacts (dynamically generated archives), not the one-of-a-kind release links which are not available for all projects, do not work the same when they are available, and may not even keep being available for a given artifact (as soon as a dev decides to pin a random git hash, all bets are off).

Another consequence of the dev-oriented nature of github is that any workflow that depends on archives is an afterthought. Developers use git repositories, not the archived subset that goes into releases.

Git archive generation meets Hyrum's law

Posted Feb 16, 2023 14:31 UTC (Thu) by mrugiero (guest, #153040) [Link]

Projects like buildroot sometimes rely on packages of projects that don't do proper releases, such as downstream kernels for out-of-tree boards. I'm not 100% sure if they clone the repo rather than asking for the archive, but that alone is proof you sometimes need to rely on specific development commits, just because it's the only way to fix a version.

I like the idea of randomizing file ordering inside the tar to avoid people relying on checksumming the compressed archive. Relying on that makes the producer overly restricted: imagine what would happen if I needed to reduce my bandwidth consumption and was unable to switch compression levels at will. Or if the heuristic used to find long matches changed.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:44 UTC (Fri) by jengelh (guest, #33263) [Link]

>Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).

Speaking of it…
Using the file upload feature would sidestep the issues of verification of an on-the-fly (re-)generated archive (at the cost of diskspace at the hoster).

Git archive generation meets Hyrum's law

Posted Feb 5, 2023 1:31 UTC (Sun) by poki (guest, #73360) [Link]

Yes, `export-subst` was already causing a similar problem in principle (non-reproducibility in terms of a digest, like here, and, even more alarming, changes to the actual unpacked content) as a fallout of short commit hashes growing longer as repositories scaled up, some five years ago. It was typically caused by another hex digit suddenly appearing in place of the `%h` specifier in projects that had grown, using what is in fact an apparently semi-dynamic feature of git.

And also back then, `walters` suggested the same solution as now in the other thread; for posterity:
https://lists.fedoraproject.org/archives/list/devel@lists...

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:58 UTC (Thu) by farnz (subscriber, #17727) [Link] (3 responses)

The point is that the output of git archive, when compressed, is not stable. By making it unstable, so that each run of git archive requesting compressed output of a given git commit produces a different checksum, you stop people assuming that they can run git archive and get a fixed result - they know, instead, that each time you run git archive, you get a different answer.

Once you've built the archive, you can checksum it and keep it as a stable artefact long into the future. You just don't have a guarantee that you can regenerate the archive from the source and get the same checksum - if you need to validate that a given artefact matches the source, you need to do a deeper dive of some form.

Git archive generation meets Hyrum's law

Posted Feb 4, 2023 3:56 UTC (Sat) by yhw (subscriber, #163199) [Link] (2 responses)

How about making the checksum versioned? Every current checksum would by default be $itself+gzip9. When a new compression method is introduced, make the compressor's NVR part of the checksum. The scheme could be designed so that a compression upgrade does not break existing checksum-based systems.

Git archive generation meets Hyrum's law

Posted Feb 4, 2023 14:15 UTC (Sat) by farnz (subscriber, #17727) [Link] (1 responses)

That doesn't work because the same compressor with the same input data is not obliged to produce the same output. Even with exactly the same binary, you can get different results because the compression algorithm is not fully deterministic (e.g. multi-threaded compressors like pigz, which produces gzip-compatible output, and zstd can depend on thread timing, which in turn depends on the workload on the system and on the CPUs in the system).

To be compatible with today's compressors, you need to record not just the compressor, but also all scheduling decisions made during compression relative to the input data. This ends up being a huge amount of data to record, and eliminates the benefit of compressing.

Git archive generation meets Hyrum's law

Posted Feb 5, 2023 0:22 UTC (Sun) by himi (subscriber, #340) [Link]

> This ends up being a huge amount of data to record, and eliminates the benefit of compressing.

It also ignores the fact that what matters in the context of a git archive is the /contents/, not the exact shape of a processed version of those contents. And taking a step further back, what you care about most is the contents of the /repo/ at the point in its history that you're interested in - the archive is just a snapshot of that, and one that isn't even necessarily representative. There's a lot of ways you can change the actual meaningful contents of a git archive with command line options and filters without any changes to the contents of the repo, and any changes to the defaults for those would potentially have a similar effect to the issue discussed in the article (though in that case the git devs would be the ones getting the opprobrium).

All of which means that if you want to have a reliably verifiable and repeatable archive of the state of a git repo at a point in its history, you either need the repo itself (or a pruned sub-set with only the objects accessible from the commit you're interested in), or you need to explicitly build an archival format from the ground up with that goal in mind.

I'm sure there's some kind of saying somewhere in the crypto/data security world that you could paraphrase as "verify /all/ of what matters, and /only/ what matters" - if not, there should be. The issue here is a good example of why - generating the archive automatically with an ill-defined and unreliably repeatable process added a whole lot of gunk on top of the data they actually care about, and things are breaking because people are trying to do cryptographic verification of the gunk as well as the actual data.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:47 UTC (Thu) by flussence (guest, #85566) [Link]

Then the party publishing the stable checksum in this situation (i.e. not GitHub, it has never done so for these download links) can simply provide a corresponding stable tarball.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 11:59 UTC (Fri) by Lennie (subscriber, #49641) [Link] (3 responses)

The new QUIC protocol has adopted a similar strategy, quote from some one who wrote about it: "To prevent ossification, QUIC tries to encrypt as much data as possible, including signaling information [10], to hide it from network equipment and prevent vendors of said equipment from making assumptions that will interfere or prevent future changes to the protocol."

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 12:58 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

And for the bit in the unencrypted header that is effectively fixed to 1, there is even an RFC to allow QUIC end-points to negotiate (in the somewhat encrypted handshake) deliberately twiddling that bit - see RFC 9287, https://www.rfc-editor.org/rfc/rfc9287.html .

Which really means that bit shouldn't have been there in the header, probably.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 15:02 UTC (Fri) by wsy (subscriber, #121706) [Link] (1 responses)

For people living in dystopian countries, such a bit is frustrating. We need a protocol that's widely used by legit websites while indistinguishable from anti-censorship tools.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 15:27 UTC (Fri) by paulj (subscriber, #341) [Link]

Well, the unencrypted QUIC header is pretty much inscrutable to middle-boxes. There is very little information in it, besides a few bits and a "Connection Identifier" (CID), but the end-points rotate the CID regularly.

Even the CID can be left out (and, to be honest, I think the unencrypted CID is a wart - the rotation of it adds a /lot/ of complications to QUIC, including hard to fix races).

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 17:40 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (1 responses)

> Why not do the opposite - instead of stabilizing the checksum, make it so that every single download will have a different checksum

That's what is often called "greasing" in the world of protocols, and it is meant to prevent ossification. And I agree. Too bad it wasn't done before; it would have helped a lot here. In fact, what users don't understand is that there's no way to guarantee that the external gzip utility will provide the same bitstream forever either. Just fixing a vulnerability that requires occasionally changing a maximum length or avoiding a certain sequence of codes would be enough to break virtually every archive. Plus, if some improvements are brought to gzip, it will be condemned to keep them disabled forever under the current principle.

Indeed they've been bad at communicating but users need to adjust their workflow to rely on decompressed archives' checksums only, even if that's more painful.

Git archive generation meets Hyrum's law

Posted Feb 7, 2023 4:53 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> Indeed they've been bad at communicating but users need to adjust their workflow to rely on decompressed archives' checksums only, even if that's more painful.

That still doesn't help in general. How do you hash a decompressed zip? What if your decompressor is vulnerable and can be given bogus content before you verify what you're working with? How about `export-subst` changing content because you have a larger repo and now your `%h` expansion has an extra character?

Github needs to have an option to "freeze" the auto-generated tarballs as part of a release object instead of offering `/archive/` links at all. Random snapshots and whatnot are still a problem, but this solves the vast majority of the problems and avoids further confusion by offering `/archive/` URLs in places folks would assume they can get some level of stability.

Was it github's fault?

Posted Feb 2, 2023 16:35 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

And can we get the git archive request to add a flag to identify the chosen compression algorithm?

It seems that it was actually a change to git. Yes it could have been a github employee who made that change (was it?), but it wasn't github's systems per se that caused the grief.

Adding an (optional) flag means that we don't break existing systems, but by adding the flag I guess downstreams would get the benefit of faster downloads etc.

And then, as I think someone suggested, if you start tagging your repository, if the default depends on the date of the tag being downloaded then github, gitlab, whoever can move to upgraded compression algorithms without breaking pre-existing stored checksums. In fact, could you store the compression algorithm of choice with the tag?

Cheers,
Wol

Was it github's fault?

Posted Feb 2, 2023 17:56 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

> In fact, could you store the compression algorithm of choice with the tag?

This isn't about compression algorithm choice, but implementation. How would I record what `gzip` I used to make a source archive as part of the tag? What do I do for bz2, xz, zip, or any other compression format that I don't make on that day? Would I not be allowed to use a hypothetical SuperSqueeze algorithm on a tag of last year because it didn't exist then?

Was it github's fault?

Posted Feb 3, 2023 10:09 UTC (Fri) by farnz (subscriber, #17727) [Link]

And note that just knowing which implementation of gzip (or other compressor) you used is not guaranteed to be enough: while the decompression algorithm's output is fully determined by its input, the compressor's output is not, and for situations where the decision doesn't affect compression ratio significantly, I could well imagine that the decision is non-deterministic (e.g. racing two threads against each other, first one to finish determines the decision). Thus, you'd have to store not just the implementation you used, but also all sources of non-determinism that affected it (e.g. that thread 1 completed after thread 2) to be able to reproduce the original archive.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:16 UTC (Thu) by ballombe (subscriber, #9523) [Link] (10 responses)

There is no reason the internal gzip implementation cannot produce the same output as the external one, really.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:06 UTC (Thu) by epa (subscriber, #39769) [Link] (4 responses)

I think we need a 'canonical compressed form' for zlib / gzip / zip compressed data. It would correspond roughly to gzip -9. I mean the compression heuristics like how far to look in the sliding window for a match, and possibly fixing some choices in the Huffman coding (like if two sequences are equally probable, which codes to assign). With today's processing power, the tradeoff between compression speed and compressed size doesn't really matter. Nor does squeezing out the last few bytes. You just pick a fixed set of parameters that's easy to implement. For cryptographic applications you could, on decompressing, do an additional check that the data was indeed in canonical compressed format (just re-compress it and check). That way you have a one-to-one mapping between input data and compressed output, not one-to-many as now.
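The verification step could look something like the sketch below, which is only "canonical" relative to a pinned zlib build and one fixed parameter set (here level 9 of Python's bundled zlib), not to any existing standard:

```python
# Sketch of a "canonical compressed form" check for a raw zlib stream:
# decompress, recompress with one agreed-upon parameter set, and require a
# byte-for-byte match. Canonical only relative to a pinned zlib version.
import zlib

def is_canonical(compressed: bytes) -> bool:
    data = zlib.decompress(compressed)
    recompressed = zlib.compress(data, 9)   # the fixed, agreed parameter set
    return recompressed == compressed
```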

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:31 UTC (Thu) by kilobyte (subscriber, #108024) [Link] (3 responses)

For example, the gzip behaviour precludes any parallelization. Computers hardly get any faster single-threaded, there are MASSIVE improvements in core counts, vectorization, etc. Thus even if we stick with the ancient gzip format, we should go with pigz instead. But even that would break the holy github [tar]balls.

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 7:54 UTC (Mon) by epa (subscriber, #39769) [Link] (2 responses)

I think that's fine. Tarballs and reproducible build artefacts can use the slower, reproducible compression. It will still be more than fast enough on modern hardware, and in any case the time to compress the tarball is dwarfed by the time to create it. And it decompresses just as quickly. For cases when getting the exact same bytes doesn't matter, you can use a different implementation of gzip, or more likely you'd use a different compression scheme like zstd.

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 10:35 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Or we could go one better; while making the compressor deterministic is hard, making the uncompressed form deterministic is not (when uncompressed, it's "just" a case of ensuring that everything is done in deterministic order). We then checksum the uncompressed form, and ship a compressed artefact without checksums.

Note in this context that HTTP supports compression as a content/transfer coding ("Content-Encoding"/"Transfer-Encoding"): so we can compress for transfer, while still transferring and checksumming uncompressed data. And you can save the compressed form, so that you don't waste disk space - or even recompress to a higher compression if suitable.

Git archive generation meets Hyrum's law

Posted Mar 25, 2023 12:47 UTC (Sat) by sammythesnake (guest, #17693) [Link]

If the archive contains, say, a 100TB file of zeros, then you'd end up filling your hard drive before any opportunity to checksum it that way. If the archive is compressed before downloading, there's at least the option of doing something along the lines of zcat blah.tgz | sha...

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (3 responses)

Sure. But what does that mean for Git reimplementations (JGit, gitoxide, etc.)? But if everything is going to pin things to GNU gzip behavior…that should be documented (and probably ported to the BSD utils and other such implementations which may exist). And that doesn't help bzip2, xz, or zstd at all.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:26 UTC (Thu) by ballombe (subscriber, #9523) [Link] (2 responses)

All I am saying is that you can use an internal gzip without breaking checksums, so this is a false dichotomy.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:31 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (1 responses)

Not unless you just import the gzip source code directly.

There's near-infinite compressed gzip/deflate/etc bitstreams that decode to the same output.

That's the very nature of compression. They only define how to decompress, and the compressor can use whatever techniques it wants to build a valid bitstream.

Defining compression based on the compressor is, frankly, lunacy.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 22:45 UTC (Fri) by ballombe (subscriber, #9523) [Link]

> Not unless you just import the gzip source code directly.
So what? The gzip source code is readily available. This is not an obstacle.

Git archive generation meets Hyrum's law

Posted Feb 7, 2023 9:39 UTC (Tue) by JanC_ (guest, #34940) [Link]

But there is no guarantee that gzip will always produce the same output either. If upstream gzip ever decide to change the default compression level from -6 to -7, or if they ever decide to change the exact parameters associated with -6, this would have the same effect of breaking all those people’s systems that currently depend on the output of gzip not changing.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:56 UTC (Thu) by agateau (subscriber, #57569) [Link] (8 responses)

It seems to me the idea of generating the archives on the fly is wrong. GitHub (and other git forges) should generate the archive once, store the result and then always serve the same file.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:50 UTC (Thu) by sionescu (subscriber, #59410) [Link]

This is the right answer. Command line utilities generally don't have the expectation of producing a stable output, but the web pretty much has that expectation. Github should store an archive the first time a release tarball is fetched, and serve that forever.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (2 responses)

I imagine they do have some degree of caching already - it would be very expensive to generate an archive every single time anyone in the world requests it. You are effectively proposing to keep things in the cache for eternity, but what is the benefit of doing that, compared to a more conventional cache invalidation strategy? It has multiple drawbacks:

* Your cache uses an ever-growing amount of storage, which you would otherwise be using to host repositories, so now repository hosting gets more expensive.
* After you have been doing this for a few years or so, the vast majority of your cache is holding data that nobody is ever going to look at again, so now you need to implement a hierarchical cache (i.e. push all the low-traffic files out to tape to cut down on costs).
* But retrieving a file from tape probably takes *longer* than just generating a fresh archive, so your cache isn't a cache anymore, it's a bottleneck.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 14:14 UTC (Fri) by agateau (subscriber, #57569) [Link] (1 responses)

Mmm, I just realized on-the-fly archives are available for *all* commits. I agree caching archives for those would be impractical.

Depending on them not ever changing was a bad idea.

Assuming the archives one can find in a GitHub releases would never change, on the other hand, sounds like a reasonable assumption. Those should be generated once. GitHub already lets you attach arbitrary files to a release, so an archive of the sources should not be a problem (he says without having any numbers). They could limit this to only creating archives for releases, not tags, to reduce the number of generated archives.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 14:38 UTC (Fri) by paulj (subscriber, #341) [Link]

Right, the issue is that random developers are configuring their build systems to download on-the-fly git-archives of arbitrary commits of projects. Rather than just doing a shallow clone of the git commit ID - which *IS* guaranteed to be stable, with cryptographic strength guarantees! (And many build systems, inc. CMake, etc., have modules to make it easy to specify build dependencies as git commits to checkout).

The people doing this are utterly clueless, and it's insanity to coddle them.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:30 UTC (Thu) by bnewbold (subscriber, #72587) [Link] (3 responses)

This was my response as well. Or at least, once an archive has been requested (downloaded), store that.

It occurs to me that I've been assuming that the original issue was with "release" archives (aka, git tag'd commits resulting in tarballs). If the issue has been with pulling archives of arbitrary git commits, i'm less sympathetic to the assumption of stability, as it does seem reasonable to generate those on-the-fly and not persist the result.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 7:38 UTC (Fri) by mb (subscriber, #50428) [Link] (2 responses)

>Or at least, once an archive has been requested (downloaded), store that.

That doesn't work.
It only takes one search engine bot to hit a large number of these generated links for your cache to explode.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 19:02 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

These links are `nofollow`, right? Right? And ban bots trawling those with extreme prejudice.

`archive.org` might do it, but I suspect they wield far less DDoS power than a Google crawler.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 19:27 UTC (Fri) by pizza (subscriber, #46) [Link]

It's not really any one bot; it's that everyone and their cousin now has their own (nominally legit) crawler.

And then there are the distributed bots that spoof their identifier and really don't GaF about what robots.txt has in it.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:01 UTC (Thu) by walters (subscriber, #7396) [Link]

Personally I still think something along the lines of https://github.com/cgwalters/git-evtag/ is the right solution.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:16 UTC (Thu) by flussence (guest, #85566) [Link] (6 responses)

One thing I've not seen mention of anywhere: anything depending on these archives to be stable was *already* broken! They *already* change silently without rhyme or reason! It's enough of a problem that Gentoo has standing QA rules for _years_ now forbidding use of those tarballs.

It's only getting attention now because someone, somewhere, saw an opportunity to go outrage-farming for clicks over it.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 22:05 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

The tarballs in question do *not* change silently without rhyme or reason - this would have been noticed well before now. Github had previously asserted that these would remain stable. Are you confusing this situation with tarballs pointing at specific commits rather than tags?

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 22:32 UTC (Thu) by vivo (subscriber, #48315) [Link] (2 responses)

This!
Github has releases, which do exactly that: provide an unchanging archive.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 22:36 UTC (Thu) by vivo (subscriber, #48315) [Link] (1 responses)

mjg59 already fixed my reasoning - they really broke releases

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:06 UTC (Fri) by smcv (subscriber, #53363) [Link]

Yes and no, unfortunately...

Promoting a tag to a "release" lets you attach arbitrary binary artifacts, such as your official release tarballs. These are stored as binary blobs and don't change. There's no guarantee that they bear any relationship to what's in git, so a malicious project maintainer could insert bad things into the official release tarball in a less visible way than committing them to git (as usual, you have to either trust the maintainer, or audit the code).

However, whether you attach official release tarballs or not, Github provides prominent "Source code" links which point to the output from git archive, and it doesn't seem to be possible to turn those off. It is these "Source code" tarballs that changed recently. Even if git archive doesn't change its output, they are annoying in projects that use submodules or Autotools, because every so often a well-intentioned user will download them, try to build them, find that the required git submodules are missing, and open a bug "your release tarballs are broken".

Flatpak makes a good example to look at for this. flatpak-1.x.tar.xz is the official release tarball generated by Autotools, which is what you would expect for an Autotools project: the source from git (including submodules), minus some files only needed during development, plus Autotools-generated cruft like the configure script. You can build directly from a git clone (after running ./autogen.sh), or you can build from flatpak-1.x.tar.xz (with no special preparation), but you can't easily build from the "Source code" tarballs (which are more or less useless, and I'd turn off display of that link if it was possible).

Outrage farming

Posted Feb 2, 2023 22:36 UTC (Thu) by corbet (editor, #1) [Link] (1 responses)

I'm curious as to where this "outrage farming" happened? Hopefully you're not referring to this article?

The discussions I found were mostly in issue trackers and such - projects reacting to their builds failing. Not the best venue if one is hoping to accomplish some "outrage farming".

Outrage farming

Posted Feb 5, 2023 20:51 UTC (Sun) by flussence (guest, #85566) [Link]

Not this article, but people around the internet have definitely been hyping it up in that way.

Would CHECKSUMS files help?

Posted Feb 2, 2023 20:28 UTC (Thu) by akkornel (subscriber, #75292) [Link] (13 responses)

I wonder if one partial solution might be to produce CHECKSUMS files.

For tags and releases, in addition to the existing downloads, have a CHECKSUMS file. Maybe call it CHECKSUMS.sha256 to say which algorithm was being used. The file would contain the checksums for the release artifacts (like installers), and also the auto-generated .zip and .tar.gz files.

Instead of having to cache .zip and .tar.gz files, GitHub would only have to cache the (presumably smaller) CHECKSUMS file. GitHub could make the convention that, instead of storing a static checksum in your CI, you store the URL to the CHECKSUMS file.

When a back-end change is made that could affect the checksums, GitHub would delete the CHECKSUMS file from their local cache. When an un-cached CHECKSUMS file is requested, GitHub would regenerate it, returning an HTTP 503 Service Unavailable error if needed, possibly with a Retry-After header.

This solution would not work for all downloads. For example, if you go to a repo's main page, you can download the repo as a .zip file. That kind of download would not be covered by this.
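Such a file could simply follow the familiar sha256sum format. A minimal sketch of how it might be generated (the artifact names are placeholders):

```python
# Sketch: write a CHECKSUMS.sha256 file in "sha256sum" format, one line per
# release artifact. The artifact names in the usage example are placeholders.
import hashlib
from pathlib import Path

def write_checksums(artifacts: list[str], output: str = "CHECKSUMS.sha256") -> None:
    lines = []
    for name in artifacts:
        digest = hashlib.sha256(Path(name).read_bytes()).hexdigest()
        lines.append(f"{digest}  {name}")
    Path(output).write_text("\n".join(lines) + "\n")

# write_checksums(["project-1.2.3.tar.gz", "project-1.2.3.zip", "installer.exe"])
```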

Would CHECKSUMS files help?

Posted Feb 2, 2023 20:35 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (12 responses)

How does this solve the "need to be able to redo a build from 3 years ago that baked in hashes" problem? You can't go and rewrite all software that does this verification in the past.

What should have been provided all this time is an "add source archives as artifacts" button in releases to pin them at that point. Still can. They can even go and hit it via an internal script on every (public?) release in the system just to ensure that there's a better URL than the `/archive/` endpoint from today forward.

Would CHECKSUMS files help?

Posted Feb 2, 2023 21:05 UTC (Thu) by ceplm (subscriber, #41334) [Link] (11 responses)

If you are not caching those tarballs, your system is already broken.

To quote my former colleague, GCC developer: “We are sorry that our compiler processed this turd which pretends to be a syntactically correct C program and generated assembly from it. It will never happen again.”

Would CHECKSUMS files help?

Posted Feb 2, 2023 21:51 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

Sure, but the other approach is like the Linux kernel's approach to stability: whoops, but we're stuck with it, please revert. Given how much stuff broke with this…

FWIW, I agree that these links *should not* have been relied upon. Alas, they have been… It also seems to break every 3-5 years (2010 and 2017 at least). It's just that the amount of validation of these things today is…way more.

Would CHECKSUMS files help?

Posted Feb 3, 2023 2:55 UTC (Fri) by himi (subscriber, #340) [Link]

> It's just that the amount of validation of these things today is…way more.

That's actually the most surprising thing, to me - as I posted in a reply to you up thread, I expect most of the build/ci/cd/whatever systems that are being hit by this have been stitched together from odds and ends that people/projects happened to find lying around. That's the kind of system you /wouldn't/ expect to be hit by checksum verification problems - random heisenbugs, silent data corruption, or worse, but not erroring out due to invalid checksums.

We should definitely be thankful that all those warnings about the importance of verifying the integrity of your inputs have sunk in enough that we /can/ hit this problem . . .

Would CHECKSUMS files help?

Posted Feb 3, 2023 13:30 UTC (Fri) by ceplm (subscriber, #41334) [Link] (5 responses)

I still hold that if your production systems depend on URLs outside of your control not changing for years, you are an idiot. And if you are a developer who insists that these resources should be binary-identical for years, you should be fired on the spot. Just saying.

I was working for Red Hat, now I am working for SUSE, and of course all our tarballs were and are cached and stored inside our storage. Anything else is just loony.

Would CHECKSUMS files help?

Posted Feb 3, 2023 13:52 UTC (Fri) by gioele (subscriber, #61675) [Link] (4 responses)

> I was working for Red Hat, now I am working for SUSE, and of course all our tarballs were and are cached and stored inside our storage. Anything else is just loony.

Debian does the same. When a package is uploaded an "orig tarball" is uploaded alongside it. From that point on that becomes the trusted source.

For many Debian packages the original tarballs, the URLs, and even the domains are long gone. These packages can currently be rebuilt from scratch only because of these "cached" orig tarballs.

Would CHECKSUMS files help?

Posted Feb 3, 2023 15:02 UTC (Fri) by nim-nim (subscriber, #34454) [Link] (3 responses)

Yet Red Hat, Debian, SUSE, etc. will have processes that periodically re-download the archive from the original URL and check that the checksums match (it helps detect compromises on both sides, plus some upstreams like to perform stealth releases that replace files in place with new versions).

Would CHECKSUMS files help?

Posted Feb 3, 2023 20:51 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

At Google, we go one step further: If we want to depend on third-party code, we have to have our own internal fork of the entire codebase, and everything gets built from that fork (using our build system, on our hardware, etc.). At least 98% of the time, we don't even touch tarballs at all (the other 2% is basically as a last resort - I've never actually heard of binaries etc. getting checked into source control, but allegedly it does happen).[1]

[1]: https://opensource.google/documentation/reference/thirdparty

Would CHECKSUMS files help?

Posted Feb 7, 2023 9:52 UTC (Tue) by paulj (subscriber, #341) [Link]

Facebook (Meta, whatever) has a dedicated "third-party" repository. Third-party sources, pristine, are checked into that. You had to write a minimal build file to specify the details for it, so that the build system for the other internal Facebook repositories could use it. Pretty sure there was tooling to just let you specify an external git, hg, whatever repo and tag or commit ID and handle a lot of what was necessary - I don't remember the details.

So it's pretty clear when internal code depends on external code, cause there'll be a "//third-party2/blah/..." dependency in the build spec for it.

It was pretty clean and neat I have to say.

That said, it is possible to have modified versions of external code checked into fbcode, but this is discouraged and obviously requires additional approval (beyond what's needed for the third-party repo).

Would CHECKSUMS files help?

Posted Feb 4, 2023 10:03 UTC (Sat) by pabs (subscriber, #43278) [Link]

Debian definitely doesn't do that, it only checks for new upstream releases.

Would CHECKSUMS files help?

Posted Feb 3, 2023 9:14 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (2 responses)

Check the PyPI stats: https://pypistats.org/packages/__all__

Yesterday there were 773,835,075 downloads. Nobody uses caching, it seems.

Would CHECKSUMS files help?

Posted Feb 3, 2023 10:28 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)

I think this is caused by the increased use of build containers. If you don't explicitly take steps to cache the packages between builds, they'll get downloaded each and every time. On the developers machines they are cached by default.

Would CHECKSUMS files help?

Posted Feb 9, 2023 15:56 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

And the ubiquitous use of free CI systems which launch a new VM (or container) for every job, resulting in the inability to cache anything.

This makes jobs take longer and also puts massive pressure on the package repositories. For my own projects I build a custom image with all of the dependencies and tools pre-installed so that CI runs don't exhibit this behavior, but this takes time and effort to do.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 20:36 UTC (Thu) by cesarb (subscriber, #6266) [Link] (4 responses)

Part of the reason why this could happen is that the standalone gzip program (as well as its underlying zlib library) has essentially fossilized: AFAIK, other than security fixes, its compression code hasn't changed much in decades.

Contrast with zstd, which often changes its compression code for better compression or performance (or both); a new version of zstd will usually have a different output than an older version, at the same compression level. This can be felt when using delta RPMs for instance on Fedora: since Fedora currently uses zstd for its RPM packages, whenever they update the zstd library, the reconstruction of the full RPM from the delta RPM starts to fail (and it has to go back and download the full RPM) for a while, until both sides are again using the same zstd release.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 21:15 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

Why would you delta your compressed files? Wouldn't it be more efficient (and robust) to delta the uncompressed files, and then compress the deltas?

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 23:26 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

As far as I understand, delta RPMs do apply the deltas to the uncompressed files, but after that they have to compress the result, since the RPM it's trying to generate is compressed. It's that final compression step (after applying the deltas to the files found on your filesystem) which can generate a different result if the zstd library is not at the same version. And since the resulting generated RPM is not identical to what would have been downloaded without delta RPMs, its cryptographic checksum (which is over the full compressed RPM) will not match, and dnf will be forced to throw away the generated RPM and download the full RPM.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 0:13 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Ugh, that sounds pretty painful. You could checksum the uncompressed data, too, but at that point you're spending CPU cycles on fixing an edge case, so I can see why they don't want to do that.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:50 UTC (Fri) by himi (subscriber, #340) [Link]

> Why would you delta your compressed files?

Presumably for the same reason that the `--rsyncable` flag was added to gzip - making it easier/more efficient to synchronise collections of binary files. Which is what's happening when you pull down distro updates - it's a very different problem to the one this article is about, which is all about the state of a git repo at a given point in time.

xkcd 1172 (Workflow)

Posted Feb 2, 2023 20:43 UTC (Thu) by david.a.wheeler (subscriber, #72896) [Link]

Obligatory reference: https://xkcd.com/1172/

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 6:38 UTC (Fri) by pabs (subscriber, #43278) [Link]

I'm reminded of the pristine-tar tool by Joey Hess, which lets you store a minimal bit of data in git from which you can reproduce the tarball you want, based on that data plus a git commit.

http://joeyh.name/code/pristine-tar/ https://joeyh.name/blog/entry/generating_pristine_tarball...

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 7:47 UTC (Fri) by mb (subscriber, #50428) [Link] (9 responses)

I do understand the desire to check checksums after archive download before build.
But it's just plain wrong to do that for on-the-fly generated archives.

These build systems should never have depended on on-the-fly generated archives.
They should have used released archives (maybe plus patches).
They should have cached the archives on a server they control. What if github goes down?

Having everything on somebody else's server is a very very bad trend.
This trend will hit us in the face in the foreseeable future. (Anybody who was not affected by the worldwide downtime of the MS cloud a couple of days ago?)

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 11:09 UTC (Fri) by cortana (subscriber, #24596) [Link] (8 responses)

These build systems should never have depended on on-the-fly generated archives.

You're not wrong, but to pick the latest release of a random GitHub project: NetBox 3.4.4's only artefacts are these on-the-fly generated archives. So consumers of this content have no choice but to use them...

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 11:25 UTC (Fri) by bof (subscriber, #110741) [Link] (7 responses)

Isn't one available other choice to have a plain local git clone / mirror, and check out from that as needed, down to the commit ID?

I get that for the largest projects, that becomes a bit undesirable, but for other stuff, where's the problem?

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 11:33 UTC (Fri) by cortana (subscriber, #24596) [Link] (2 responses)

In my case, the practical reason is that I'm building a container image and I don't want to download the whole history of the project only to throw it away.

(Admittedly I believe there's an option to git-clone which does this, only last time I used it I don't think it worked.)

The philosophical reason is a strongly held belief that a software release is a thing with certain other things attached to it (release notes, a source archive, maybe some binary archives). Once created, those artefacts are immutable.

If a software project isn't doing that then it's not a mature project doing release management, it's a developer chucking whatever works for them over the wall. Which is fine, most projects start that way; but we've all been taking advantage of the convenience of GitHub's generated-on-the-fly source archives, instead of automating the creation of these source archives as part of our release processes and attaching them to GitHub releases.

As another poster said, for projects which _do_ do that they then have the problem that the GitHub 'source archive' links can't be removed, so now users have to learn "don't click on the obvious link to get the source code, instead you have to download this particularly-named archive attached to the release and ignore those other ones". GitHub really needs a setting that a project can set to get rid of those links!

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 11:52 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> (Admittedly I believe there's an option to git-clone which does this, only last time I used it I don't think it worked.)

There is. I think it's called a shallow clone. Something like --depth=1.

But last I heard, for a lot of projects, the size of the shallow clone is actually a *large* percentage of the full archive.

Cheers,
Wol

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 17:02 UTC (Fri) by cortana (subscriber, #24596) [Link]

Yeah I think that was it. I gave up on it because it didn't actually save any time/disk space. Whereas downloading a source archive saved significant amounts of both!

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 15:10 UTC (Fri) by nim-nim (subscriber, #34454) [Link] (3 responses)

Working from a clone is undesirable. It does not scale well for big projects (and you do not want different processes depending on the size of projects) and unless upstreams are extra careful all the historical relicensings are mixed in the repo.

When releasing, you *like* working with dead dumb archives at the end of a curl-able URL, with a *single* version of all the files you are releasing, and a *single* license attached to this archive (the multi-vendored repos devs so love are a legal soup which is hell to release without legal risks).

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 15:29 UTC (Fri) by farnz (subscriber, #17727) [Link] (1 responses)

And the reason we're seeing pain here is that people are not actually working with "dead dumb archives" - the people being hurt are working with archives that are produced on-demand by git archive, and have been assuming that they are dumb archives, not the result of a computation on a git repo.

Basically, they've got something that's isomorphic to a git shallow clone at depth 1, but they thought they had a dumb archive. Oops.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 16:32 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

That’s a consequence of github making dumb release archives a second-class choice, not the choice of the people doing releases.

The lingua franca of release management is dumb archives, because devs like to move from svn to hg to git to whatever, or even to keep bulky binary files (images, fonts, music, whatever) out of source control, so anything semi-generic will curl archives from a list of URLs. And if github only provides reliably (for dubious values of reliably) generated archives, that’s what people will use.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 15:56 UTC (Fri) by paulj (subscriber, #341) [Link]

Working from a shallow git clone of a git repo at a given commit ID / branch is better than working from a git archive of a given commit ID / branch, but then doing a checksum on the git archive file.

Various build system generators support specifying a git commit as dependency and doing a git shallow clone to obtain it.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 9:46 UTC (Fri) by Flameeyes (guest, #51238) [Link]

I guess it took 14 years for everyone else to catch up with what I was lamenting (as a Gentoo developer) early on.

https://flameeyes.blog/2009/05/09/i-still-dislike-github/

Git archive generation meets Hyrum's law

Posted Feb 4, 2023 19:09 UTC (Sat) by robert_s (subscriber, #42402) [Link]

NixOS' nixpkgs is wise to this and all archives fetched from github are done so through `fetchFromGitHub`, which extracts and normalizes archive contents before hashes are calculated. Similarly patches are fetched using `fetchpatch` which performs some amount of normalization to accommodate the source dynamically generating the patch in possibly-not-stable ways.
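This is not the Nix implementation, but the same idea can be sketched in a few lines of Python: hash the members of the tarball (names and contents, in a fixed order) rather than the compressed bytes, so that compression details, timestamps, and member ordering no longer matter.

```python
# Sketch of content-based hashing: hash the files inside a tar archive in
# sorted order, ignoring compression, timestamps, and on-disk ordering.
import hashlib
import tarfile

def content_hash(archive_path: str) -> str:
    digest = hashlib.sha256()
    with tarfile.open(archive_path, "r:*") as tar:
        members = sorted((m for m in tar.getmembers() if m.isfile()),
                         key=lambda m: m.name)
        for member in members:
            digest.update(member.name.encode())
            digest.update(tar.extractfile(member).read())
    return digest.hexdigest()
```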

Git archive generation meets Hyrum's law

Posted Feb 14, 2023 11:17 UTC (Tue) by nolange (guest, #156796) [Link]

There is a feature used in Debian's gbp tooling called "pristine tar". It allows an upstream tar archive to be re-created, storing the options used and applying a binary diff if necessary.

I still don't think it goes far enough; it would need to add support for multiple "historically important" iterations of gzip (and other algorithms).

If required, the functionality could be added there: check in a small configuration file for "pristine tar" specifying the gzip implementation and compression settings used. At that point github (or other websites) could invoke this tool instead to create a reproducible artifact.

If the file is missing, archive generation could be randomized (if the commit date is newer than the date the feature was introduced) to violently remind users to either not depend on fixed hashes or add configuration to freeze the archive generation.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds