LWN: Comments on "Git archive generation meets Hyrum's law" https://lwn.net/Articles/921787/ This is a special feed containing comments posted to the individual LWN article titled "Git archive generation meets Hyrum's law". en-us Tue, 14 Oct 2025 23:19:20 +0000 Tue, 14 Oct 2025 23:19:20 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Git archive generation meets Hyrum's law https://lwn.net/Articles/927296/ https://lwn.net/Articles/927296/ sammythesnake <div class="FormattedComment"> If the archive contains, say, a100TB file of zeros, then you'd end up filling your hard drive before any opportunity to checksum it that way. If the archive is compressed before downloading, there's at least the option of doing something along the lines of zcat blah.tgz | sha...<br> </div> Sat, 25 Mar 2023 12:47:57 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/924269/ https://lwn.net/Articles/924269/ kijiki <div class="FormattedComment"> The good news is that compression/decompression is a pure function from input to output, so it can be easily be very tightly sandboxed. SECCOMP_SET_MODE_STRICT level strict.<br> </div> Thu, 23 Feb 2023 17:10:38 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/923464/ https://lwn.net/Articles/923464/ mrugiero <div class="FormattedComment"> Projects like buildroot sometimes rely on packages of projects that don't do proper releases, such as downstream kernels for out-of-tree boards. I'm not 100% sure if they clone the repo rather than asking for the archive, but that alone is proof you sometimes need to rely on specific development commits, just because it's the only way to fix a version.<br> <p> I like the idea of randomizing file ordering inside the tar to avoid people relying on checksumming the compressed archive. Relying on that makes the producer overly restricted: imagine what would happen if I needed to reduce my bandwidth consumption and I was unable to switch compression levels at will. Or the heuristic used to find long matches change.<br> </div> Thu, 16 Feb 2023 14:31:57 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/923249/ https://lwn.net/Articles/923249/ nolange <div class="FormattedComment"> <p> There is a feature used in debians gbp tooling called "pristine tar". What this does is allowing an upstream tar archive to be re-created, storing the options and applying a binary-diff if necessary.<br> <p> I dont think it still goes far enough, it would need to add support for multiple "historical important" iterations of gzip (and other algorithms). <br> <p> If required the functionality should be added there, check-in a small configuration file for "pristine tar" specifying used gzip implementation/compression. At that point github (or other websites) could invoke this tool instead to create a reproducible artifact.<br> <p> If the file is missing, archive generation could be randomized (if commit date is newer than the date the feature got introduced) to violently remind users to either not depend on fixed hashes or add configuration to freeze the archive generation.<br> </div> Tue, 14 Feb 2023 11:17:20 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922931/ https://lwn.net/Articles/922931/ paulj <div class="FormattedComment"> Well, the unencrypted QUIC header is pretty much inscrutable to middle-boxes. 
There is very little information in it, besides a few bits and a "Connection Identifier" (CID), but the end-points rotate the CID regularly.<br> <p> Even the CID can be left out (and, to be honest, I think the unencrypted CID is a wart - the rotation of it adds a /lot/ of complications to QUIC, including hard-to-fix races).<br> </div> Fri, 10 Feb 2023 15:27:19 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922930/ https://lwn.net/Articles/922930/ wsy <div class="FormattedComment"> For people living in dystopian countries, such a bit is frustrating. We need a protocol that's widely used by legit websites while being indistinguishable from anti-censorship tools.<br> </div> Fri, 10 Feb 2023 15:02:13 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922892/ https://lwn.net/Articles/922892/ paulj <div class="FormattedComment"> And for the bit in the unencrypted header that is effectively fixed to 1, there is even an RFC to allow QUIC end-points to negotiate (in the somewhat encrypted handshake) deliberately twiddling that bit - see RFC9287, <a href="https://www.rfc-editor.org/rfc/rfc9287.html">https://www.rfc-editor.org/rfc/rfc9287.html</a> .<br> <p> Which really means that bit shouldn't have been there in the header, probably.<br> <p> </div> Fri, 10 Feb 2023 12:58:14 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922889/ https://lwn.net/Articles/922889/ Lennie <div class="FormattedComment"> The new QUIC protocol has adopted a similar strategy; a quote from someone who wrote about it: "To prevent ossification, QUIC tries to encrypt as much data as possible, including signaling information [10], to hide it from network equipment and prevent vendors of said equipment from making assumptions that will interfere or prevent future changes to the protocol."<br> </div> Fri, 10 Feb 2023 11:59:00 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922771/ https://lwn.net/Articles/922771/ kpfleming <div class="FormattedComment"> And the ubiquitous use of free CI systems which launch a new VM (or container) for every job, resulting in the inability to cache anything.<br> <p> This makes jobs take longer and also puts massive pressure on the package repositories. For my own projects I build a custom image with all of the dependencies and tools pre-installed so that CI runs don't exhibit this behavior, but this takes time and effort to do.<br> </div> Thu, 09 Feb 2023 15:56:35 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922436/ https://lwn.net/Articles/922436/ paulj <div class="FormattedComment"> Facebook (Meta, whatever) has a dedicated "third-party" repository. Third-party sources, pristine, are checked into that. You had to write a minimal build file to specify the details for it, so that the build system for the other internal Facebook repositories could use it. Pretty sure there was tooling to just let you specify an external git, hg, whatever repo and tag or commit ID and handle a lot of what was necessary - I don't remember the details. <br> <p> So it's pretty clear when internal code depends on external code, because there'll be a "//third-party2/blah/..."
dependency in the build spec for it.<br> <p> It was pretty clean and neat, I have to say.<br> <p> That said, it is possible to have modified versions of external code checked into fbcode, but this is discouraged and obviously requires additional approval (beyond what's needed for the third-party repo).<br> <p> <p> </div> Tue, 07 Feb 2023 09:52:54 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922433/ https://lwn.net/Articles/922433/ JanC_ <div class="FormattedComment"> But there is no guarantee that gzip will always produce the same output either. If the upstream gzip maintainers ever decide to change the default compression level from -6 to -7, or if they ever decide to change the exact parameters associated with -6, this would have the same effect of breaking all those people's systems that currently depend on the output of gzip not changing.<br> </div> Tue, 07 Feb 2023 09:39:25 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922420/ https://lwn.net/Articles/922420/ mathstuf <div class="FormattedComment"> <span class="QuotedText">&gt; Indeed they've been bad at communicating but users need to adjust their workflow to rely on decompressed archives' checksums only, even if that's more painful.</span><br> <p> That still doesn't help in general. How do you hash a decompressed zip? What if your decompressor is vulnerable and can be given bogus content before you verify what you're working with? How about `export-subst` changing content because you have a larger repo and now your `%h` expansion has an extra character?<br> <p> GitHub needs to have an option to "freeze" the auto-generated tarballs as part of a release object instead of offering `/archive/` links at all. Random snapshots and whatnot are still a problem, but this solves the vast majority of the problems and avoids the further confusion caused by offering `/archive/` URLs in places where folks would assume they can get some level of stability.<br> </div> Tue, 07 Feb 2023 04:53:38 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922382/ https://lwn.net/Articles/922382/ wtarreau <div class="FormattedComment"> <span class="QuotedText">&gt; Why not do the opposite - instead of stabilizing the checksum, make it so that every single download will have a different checksum</span><br> <p> That's what is often called "greasing" in the world of protocols, and is meant to prevent ossification. And I agree. Too bad it wasn't done before, but it would have helped a lot here. In fact, what users don't understand is that there isn't even a way to guarantee that the external gzip utility will provide the same bitstream forever. Just fixing a vulnerability that requires occasionally changing a maximum length, or avoiding a certain sequence of codes, would be sufficient to virtually break every archive. Plus, if some improvements are brought to gzip, it will be condemned to keep them disabled forever under the current principle.<br> <p> Indeed they've been bad at communicating but users need to adjust their workflow to rely on decompressed archives' checksums only, even if that's more painful.<br> </div> Mon, 06 Feb 2023 17:40:50 +0000
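wtarreau's point above - pin the checksum of the decompressed stream rather than of the compressed archive - and sammythesnake's decompression-bomb caveat at the top of the thread can be combined by hashing the decompressed data as it streams past, with a hard cap on how much output is accepted. What follows is only an illustrative Python sketch of that idea, not anyone's production tooling; the URL, expected digest, and size cap are placeholders, and note that it verifies the decompressed tar stream, not the individual files inside it.
<pre>
import gzip
import hashlib
import urllib.request

# Placeholder inputs -- substitute a real archive URL and a digest you already trust.
ARCHIVE_URL = "https://example.com/project-1.2.3.tar.gz"
EXPECTED_SHA256 = "0" * 64
MAX_BYTES = 1 << 30  # refuse anything that inflates past 1 GiB (decompression-bomb guard)

def sha256_of_decompressed(url: str, limit: int) -> str:
    """Hash the decompressed tar stream without ever writing it to disk."""
    digest = hashlib.sha256()
    total = 0
    with urllib.request.urlopen(url) as resp:
        with gzip.GzipFile(fileobj=resp) as stream:
            while True:
                chunk = stream.read(1 << 20)  # decompress at most 1 MiB per iteration
                if not chunk:
                    break
                total += len(chunk)
                if total > limit:
                    raise RuntimeError("decompressed size exceeds limit; possible bomb")
                digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = sha256_of_decompressed(ARCHIVE_URL, MAX_BYTES)
    if actual != EXPECTED_SHA256:
        raise SystemExit(f"checksum mismatch: got {actual}")
    print("decompressed stream matches the pinned checksum")
</pre>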
Git archive generation meets Hyrum's law https://lwn.net/Articles/922299/ https://lwn.net/Articles/922299/ farnz <p>Or we could go one better; while making the compressor deterministic is hard, making the uncompressed form deterministic is not (when uncompressed, it's "just" a case of ensuring that everything is done in deterministic order). We then checksum the uncompressed form, and <em>ship</em> a compressed artefact without checksums. <p>Note in this context that HTTP supports compressed content/transfer encodings: so we can compress for transfer, while still transferring and checksumming uncompressed data. And you can save the compressed form, so that you don't waste disk space - or even recompress to a higher compression level if suitable. Mon, 06 Feb 2023 10:35:45 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922291/ https://lwn.net/Articles/922291/ epa <div class="FormattedComment"> I think that's fine. Tarballs and reproducible build artefacts can use the slower, reproducible compression. It will still be more than fast enough on modern hardware, and in any case the time to compress the tarball is dwarfed by the time to create it. And it decompresses just as quickly. For cases when getting the exact same bytes doesn't matter, you can use a different implementation of gzip, or more likely you'd use a different compression scheme like zstd.<br> </div> Mon, 06 Feb 2023 07:54:04 +0000 Outrage farming https://lwn.net/Articles/922273/ https://lwn.net/Articles/922273/ flussence <div class="FormattedComment"> Not this article, but people around the internet have definitely been hyping it up in that way.<br> </div> Sun, 05 Feb 2023 20:51:52 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922230/ https://lwn.net/Articles/922230/ poki <div class="FormattedComment"> Yes, `export-subst` was already causing a similar problem in principle (non-reproducibility in terms of a digest [like here] and, even more alarming, a change of the actual unpacked content) as a fallout of short commit hashes being lengthened for scalability some 5 years ago. It typically showed up as another hex digit suddenly appearing in place of the `%h` specifier in projects that had grown and were using that apparently semi-dynamic feature of git.<br> <p> And also back then, `walters` suggested the same solution as now in the other thread; for posterity:<br> <a rel="nofollow" href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/UDZ2WKMTOE6J2M4K7PF5OWSSC4BAX2SH/#FRM5UWVG6HR75Y6SGEOVSPGAPZGEDNDA">https://lists.fedoraproject.org/archives/list/devel@lists...</a><br> </div> Sun, 05 Feb 2023 01:31:35 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922229/ https://lwn.net/Articles/922229/ himi <div class="FormattedComment"> <span class="QuotedText">&gt; This ends up being a huge amount of data to record, and eliminates the benefit of compressing.</span><br> <p> It also ignores the fact that what matters in the context of a git archive is the /contents/, not the exact shape of a processed version of those contents. And taking a step further back, what you care about most is the contents of the /repo/ at the point in its history that you're interested in - the archive is just a snapshot of that, and one that isn't even necessarily representative.
There are a lot of ways you can change the actual meaningful contents of a git archive with command line options and filters without any changes to the contents of the repo, and any changes to the defaults for those would potentially have a similar effect to the issue discussed in the article (though in that case the git devs would be the ones getting the opprobrium).<br> <p> All of which means that if you want to have a reliably verifiable and repeatable archive of the state of a git repo at a point in its history, you either need the repo itself (or a pruned sub-set with only the objects accessible from the commit you're interested in), or you need to explicitly build an archival format from the ground up with that goal in mind.<br> <p> I'm sure there's some kind of saying somewhere in the crypto/data security world that you could paraphrase as "verify /all/ of what matters, and /only/ what matters" - if not, there should be. The issue here is a good example of why - generating the archive automatically with an ill-defined and unreliably repeatable process added a whole lot of gunk on top of the data people actually care about, and things are breaking because people are trying to do cryptographic verification of the gunk as well as the actual data. <br> </div> Sun, 05 Feb 2023 00:22:53 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922220/ https://lwn.net/Articles/922220/ robert_s <div class="FormattedComment"> NixOS' nixpkgs is wise to this, and all archives fetched from GitHub are fetched through `fetchFromGitHub`, which extracts and normalizes archive contents before hashes are calculated. Similarly, patches are fetched using `fetchpatch`, which performs some amount of normalization to accommodate the source dynamically generating the patch in possibly-not-stable ways.<br> </div> Sat, 04 Feb 2023 19:09:15 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922214/ https://lwn.net/Articles/922214/ farnz <p>That doesn't work because the same compressor with the same input data is not obliged to produce the same output. Even with exactly the same binary, you can get different results because the compression algorithm is not fully deterministic: multi-threaded compressors (e.g. pigz, which produces gzip-compatible output, and zstd) can depend on thread timing, which in turn depends on the workload on the system and on the CPUs in the system. <p>To be compatible with today's compressors, you need to record not just the compressor, but also all scheduling decisions made during compression relative to the input data. This ends up being a huge amount of data to record, and eliminates the benefit of compressing. Sat, 04 Feb 2023 14:15:33 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922204/ https://lwn.net/Articles/922204/ pabs <div class="FormattedComment"> Debian definitely doesn't do that, it only checks for new upstream releases.<br> </div> Sat, 04 Feb 2023 10:03:03 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922196/ https://lwn.net/Articles/922196/ yhw <div class="FormattedComment"> How about making the checksum versioned? All current checksums would by default be $itself+gzip9. When a new compression method is introduced, make the zip's NVR part of the checksum.
The algorithm can be designed so that a compression upgrade does not break existing/old checksum-based systems.<br> <p> <p> </div> Sat, 04 Feb 2023 03:56:42 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922176/ https://lwn.net/Articles/922176/ ballombe <div class="FormattedComment"> <span class="QuotedText">&gt; Not unless you just import the gzip source code directly.</span><br> So what? The gzip source code is readily available. This is not an obstacle.<br> </div> Fri, 03 Feb 2023 22:45:07 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922173/ https://lwn.net/Articles/922173/ NYKevin <div class="FormattedComment"> At Google, we go one step further: If we want to depend on third-party code, we have to have our own internal fork of the entire codebase, and everything gets built from that fork (using our build system, on our hardware, etc.). At least 98% of the time, we don't even touch tarballs at all (the other 2% is basically a last resort - I've never actually heard of binaries etc. getting checked into source control, but allegedly it does happen).[1]<br> <p> [1]: <a href="https://opensource.google/documentation/reference/thirdparty">https://opensource.google/documentation/reference/thirdparty</a><br> </div> Fri, 03 Feb 2023 20:51:07 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922165/ https://lwn.net/Articles/922165/ pizza <div class="FormattedComment"> It's not really any one bot; it's that everyone and their cousin now has their own (nominally legit) crawler.<br> <p> And then there are the distributed bots that spoof their identifier and really don't GaF about what robots.txt has in it.<br> </div> Fri, 03 Feb 2023 19:27:12 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922164/ https://lwn.net/Articles/922164/ mathstuf <div class="FormattedComment"> These links are `nofollow`, right? Right? And ban bots trawling those with extreme prejudice.<br> <p> `archive.org` might do it, but I suspect they wield far less DDoS power than a Google crawler.<br> </div> Fri, 03 Feb 2023 19:02:27 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922144/ https://lwn.net/Articles/922144/ cortana <div class="FormattedComment"> Yeah, I think that was it. I gave up on it because it didn't actually save any time/disk space. Whereas downloading a source archive saved significant amounts of both!<br> </div> Fri, 03 Feb 2023 17:02:13 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922143/ https://lwn.net/Articles/922143/ nim-nim <div class="FormattedComment"> That's a consequence of GitHub making dumb release archives a second-class choice, not the choice of the people doing releases.<br> <p> The lingua franca of release management is dumb archives, because devs like to move from svn to hg to git to whatever, or even not to source-control bulky binary files (images, fonts, music, whatever), so anything semi-generic will curl archives from a list of URLs.
And if GitHub only provides reliably (for dubious values of "reliably") generated archives, that's what people will use.<br> </div> Fri, 03 Feb 2023 16:32:31 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922134/ https://lwn.net/Articles/922134/ paulj <div class="FormattedComment"> Working from a shallow git clone of a git repo at a given commit ID / branch is better than working from a git archive of a given commit ID / branch, but then doing a checksum on the git archive file.<br> <p> Various build system generators support specifying a git commit as a dependency and doing a git shallow clone to obtain it.<br> </div> Fri, 03 Feb 2023 15:56:05 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922133/ https://lwn.net/Articles/922133/ farnz <p>And the reason we're seeing pain here is that people are not actually working with "dead dumb archives" - the people being hurt are working with archives that are produced on-demand by <tt>git archive</tt>, and have been assuming that they are dumb archives, not the result of a computation on a git repo. <p>Basically, they've got something that's isomorphic to a git shallow clone at depth 1, but they thought they had a dumb archive. Oops. Fri, 03 Feb 2023 15:29:04 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922120/ https://lwn.net/Articles/922120/ nim-nim <div class="FormattedComment"> Working from a clone is undesirable. It does not scale well for big projects (and you do not want different processes depending on the size of projects), and unless upstreams are extra careful all the historical relicensings are mixed in the repo.<br> <p> When releasing, you *like* working with dead dumb archives at the end of a curl-able URL, with a *single* version of all the files you are releasing, and a *single* license attached to this archive (the multi-vendored repos devs so love are a legal soup which is hell to release without legal risks).<br> </div> Fri, 03 Feb 2023 15:10:30 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922115/ https://lwn.net/Articles/922115/ nim-nim <div class="FormattedComment"> Yet Red Hat, Debian, SUSE, etc. will have processes that periodically re-download the archive from the original URL and check that the checksums match (it helps detect compromises on both sides, plus some upstreams like to perform stealth releases that replace files in place with new versions).<br> </div> Fri, 03 Feb 2023 15:02:42 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922113/ https://lwn.net/Articles/922113/ paulj <div class="FormattedComment"> Right, the issue is that random developers are configuring their build systems to download on-the-fly git archives of arbitrary commits of projects, rather than just doing a shallow clone of the git commit ID - which *IS* guaranteed to be stable, with cryptographic-strength guarantees! (And many build systems, incl. CMake, etc., have modules to make it easy to specify build dependencies as git commits to check out.)<br> <p> The people doing this are utterly clueless, and it's insanity to coddle them.<br> </div> Fri, 03 Feb 2023 14:38:21 +0000
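paulj's comment above points to the alternative to checksumming on-the-fly archives: pin an exact commit ID and obtain it with a shallow fetch, since the commit hash already covers the tree contents with cryptographic strength. Below is a rough Python sketch of that workflow, not a recommendation of specific tooling; the repository URL and commit ID are placeholders, and it assumes the hosting service allows fetching an unadvertised commit by its SHA (GitHub generally permits this for reachable commits).
<pre>
import subprocess
from pathlib import Path

# Placeholder values -- substitute the real repository and the commit you mean to pin.
REPO_URL = "https://github.com/example/project.git"
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"

def fetch_pinned(repo_url: str, commit: str, dest: Path) -> None:
    """Shallow-fetch exactly one commit and verify it is the object that was pinned."""
    dest.mkdir(parents=True, exist_ok=True)

    def git(*args: str) -> None:
        subprocess.run(("git", *args), cwd=dest, check=True)

    git("init", "--quiet")
    # Depth-1 fetch of the pinned commit; needs a server that serves commits by SHA.
    git("fetch", "--quiet", "--depth", "1", repo_url, commit)
    git("checkout", "--quiet", "FETCH_HEAD")
    head = subprocess.run(
        ("git", "rev-parse", "HEAD"),
        cwd=dest, check=True, capture_output=True, text=True,
    ).stdout.strip()
    if head != commit:
        raise RuntimeError(f"expected {commit}, got {head}")

if __name__ == "__main__":
    fetch_pinned(REPO_URL, PINNED_COMMIT, Path("project-src"))
    print("checked out the pinned commit; its hash covers the tree contents")
</pre>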
Git archive generation meets Hyrum's law https://lwn.net/Articles/922105/ https://lwn.net/Articles/922105/ agateau <div class="FormattedComment"> Mmm, I just realized on-the-fly archives are available for *all* commits. I agree caching archives for those would be impractical.<br> <p> Depending on them not ever changing was a bad idea.<br> <p> Assuming that the archives one can find in a GitHub release would never change, on the other hand, sounds like a reasonable assumption. Those should be generated once. GitHub already lets you attach arbitrary files to a release, so an archive of the sources should not be a problem (he says, without having any numbers). They could limit this to only creating archives for releases, not tags, to reduce the number of generated archives.<br> </div> Fri, 03 Feb 2023 14:14:16 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922104/ https://lwn.net/Articles/922104/ gioele <div class="FormattedComment"> <span class="QuotedText">&gt; I was working for Red Hat, now I am working for SUSE, and of course all our tarballs were and are cached and stored inside our storage. Anything else is just loony.</span><br> <p> Debian does the same. When a package is uploaded, an "orig tarball" is uploaded alongside it. From that point on, that becomes the trusted source.<br> <p> For many Debian packages the original tarballs, the URLs, and even the domains are long gone. These packages can currently be rebuilt from scratch only because of these "cached" orig tarballs.<br> </div> Fri, 03 Feb 2023 13:52:56 +0000 Would CHECKSUMS files help? https://lwn.net/Articles/922103/ https://lwn.net/Articles/922103/ ceplm <div class="FormattedComment"> I still hold that if your production systems depend on URLs outside of your control not changing for years, you are an idiot. And if you are a developer who insists that these resources should be binary-identical for years, you should be fired on the spot. Just saying.<br> <p> I was working for Red Hat, now I am working for SUSE, and of course all our tarballs were and are cached and stored inside our storage. Anything else is just loony.<br> </div> Fri, 03 Feb 2023 13:30:35 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922102/ https://lwn.net/Articles/922102/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; (Admittedly I believe there's an option to git-clone which does this, only last time I used it I don't think it worked.)</span><br> <p> There is. I think it's called a shallow clone. Something like --depth=1.<br> <p> But last I heard, for a lot of projects, the size of the shallow clone is actually a *large* percentage of the full archive.<br> <p> Cheers,<br> Wol<br> </div> Fri, 03 Feb 2023 11:52:03 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922100/ https://lwn.net/Articles/922100/ cortana <div class="FormattedComment"> In my case, the practical reason is that I'm building a container image and I don't want to download the whole history of the project only to throw it away.<br> <p> (Admittedly I believe there's an option to git-clone which does this, only last time I used it I don't think it worked.)<br> <p> The philosophical reason is a strongly held belief that a software release is a thing with certain other things attached to it (release notes, a source archive, maybe some binary archives). Once created, those artefacts are immutable.<br> <p> If a software project isn't doing that then it's not a mature project doing release management, it's a developer chucking whatever works for them over the wall.
Which is fine, most projects start that way; but we've all been taking advantage of the convenience of GitHub's generated-on-the-fly source archives, instead of automating the creation of these source archives as part of our release processes and attaching them to GitHub releases.<br> <p> As another poster said, projects which _do_ do that then have the problem that the GitHub 'source archive' links can't be removed, so now users have to learn "don't click on the obvious link to get the source code; instead you have to download this particularly-named archive attached to the release and ignore those other ones". GitHub really needs a setting that a project can set to get rid of those links!<br> </div> Fri, 03 Feb 2023 11:33:28 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922099/ https://lwn.net/Articles/922099/ bof <div class="FormattedComment"> Isn't another available choice to have a plain local git clone / mirror, and check out from that as needed, down to the commit ID?<br> <p> I get that for the largest projects, that becomes a bit undesirable, but for other stuff, where's the problem?<br> </div> Fri, 03 Feb 2023 11:25:10 +0000 Git archive generation meets Hyrum's law https://lwn.net/Articles/922098/ https://lwn.net/Articles/922098/ cortana <blockquote><p>These build systems should never have depended on on-the-fly generated archives.</blockquote> <p>You're not wrong, but to pick the latest release of a random GitHub project: <a href="https://github.com/netbox-community/netbox/releases/tag/v3.4.4">NetBox 3.4.4</a>'s only artefacts are these on-the-fly generated archives. So consumers of this content have no choice but to use them... Fri, 03 Feb 2023 11:09:56 +0000
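A recurring theme in the thread - himi's "verify /all/ of what matters, and /only/ what matters" and robert_s's note that nixpkgs normalizes archive contents before hashing - is to verify the extracted contents rather than the compressed byte stream. The sketch below is purely illustrative of such content-based verification and is not the algorithm `fetchFromGitHub` actually uses; the directory name is a placeholder.
<pre>
import hashlib
import os
from pathlib import Path

def tree_digest(root: str) -> str:
    """Hash an extracted source tree by path and content, ignoring timestamps,
    ownership, and whatever ordering or compression the archive used."""
    digest = hashlib.sha256()
    base = Path(root)
    files = sorted(p for p in base.rglob("*") if p.is_file() and not p.is_symlink())
    for path in files:
        rel = path.relative_to(base).as_posix()
        mode = "x" if os.access(path, os.X_OK) else "-"  # keep only the executable bit
        digest.update(f"{rel}\0{mode}\0".encode())
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()

if __name__ == "__main__":
    # Hypothetical usage: hash whichever directory the archive was unpacked into.
    print(tree_digest("project-1.2.3"))
</pre>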