
Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:44 UTC (Thu) by zdzichu (subscriber, #17118)
In reply to: Git archive generation meets Hyrum's law by alonz
Parent article: Git archive generation meets Hyrum's law

While interesting as a mental exercise, what would be the point of that? Checksums are used for specific purposes (e.g. checking for potential corruption, or making sure you get exactly what you expect rather than something with the same filename but different content). Those purposes do not disappear when you make checksums useless.



Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:52 UTC (Thu) by geert (subscriber, #98403) [Link] (8 responses)

Relying on the checksums of the compressed archives sounds like a bad idea, as those archives don't exist on the server but are generated on the fly using an external program. Imagine you want to download the same archive in 10 years: what do you do when checksum verification fails? Download again? Go to a different server?

What uniquely specifies the contents is the git commit ID (which is still sha1). And perhaps the checksum of the uncompressed archive (assuming git can keep that stable and doesn't have to change it due to e.g. a security bug).
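For instance, a minimal sketch of that (the tag name here is made up):

    # Checksum the uncompressed tar stream, and record the commit ID alongside it.
    git archive --format=tar v6.1 | sha256sum
    git rev-parse "v6.1^{commit}"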

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:48 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

`export-subst` and `export-ignore` can pull metadata from the commit in question and bake it into the archive (or exclude the file completely).
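A minimal sketch of what that looks like in `.gitattributes` (the paths are made up; a file marked `export-subst` would contain placeholders like `$Format:%H$`):

    # .gitattributes
    VERSION   export-subst     # $Format:...$ placeholders in this file are expanded by git archive
    tests     export-ignore    # this path is left out of generated archives entirely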

What should have happened is that creating a release had an option to turn the "Source code" links Github provides into permalinks. Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:33 UTC (Fri) by himi (subscriber, #340) [Link] (4 responses)

That wouldn't help all the use cases that target a specific commit rather than a release - which is going to be the case for the build bots and test infrastructure using this particular Github feature . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:46 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

I'm not sure which buildbots and infrastructure are downloading random commit snapshots instead of tagged releases (or they follow `main`…but then you have no stable hash anyway). CI should be working from a clone, and if it gets a tarball, it can't know the hash a priori anyway.

Even when we patch stuff, I try to grab a tag and cherry-pick fixes we need back to it instead of following "random" development commits. But then we also try to rehost everything because customers can have pretty draconian firewall policies and they allow our hosting through to avoid downloading things from $anywhere.
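Something like this, roughly (the tag and commit hash here are made up):

    git clone https://example.com/upstream.git
    cd upstream
    git checkout -b patched v2.4.1      # start from the tagged release
    git cherry-pick 1234abcd            # backport just the fix we need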

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 2:43 UTC (Fri) by himi (subscriber, #340) [Link] (2 responses)

Yeah, there are clearly better ways to implement this kind of thing, and it's not how you or I or any reasonably experienced dev would set things up, but there are a lot of people out there who are effectively stitching their development processes together from whatever odds and ends they can find. Maybe because they don't know any better, or because they don't have the resources to do it "properly", or because things just kind of grew that way through a process of accretion. And/or they may have set things up that way before Github integrated all those nice convenient free-ish services to support this kind of thing, and just never bothered fixing things that didn't seem broken. Probably a lot of them are now re-evaluating those decisions, having learned /why/ the reasonably experienced devs avoid doing things that way.

Github obviously isn't intending to support that kind of usage - if they were this change wouldn't have happened, or they'd have implemented the archive download in a reliably verifiable way from the start. But the service that Github /does/ provide is very easy to use/abuse for things Github isn't explicitly intending to support, and that's what's bitten people here. Github did nothing wrong, and neither did the people using their service in a way it wasn't really intended to support, but that's kind of the point of Hyrum's law, isn't it . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:49 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

That's because developing and releasing are two different trades, but git(hub|lab) and language-specific packaging systems in general made their success by targeting developer workflows. They suck for release processes; every release manager worth his salt knows it, but they have the developer mindshare, so dealing with their suckage is a fact of life.

A release manager knows intimately that converging on a stable and secure state is hard, and will try to pin everything to official stable releases (except for the bugfix/security fixup releases that need to be fast-tracked and propagated as fast as possible).

A developer will use whatever random git commit he checked out last and will try to pin everything to that commit to avoid the hassle of testing anything other than his last workstation state (including freezing out security fixes). The more obscure a bit of code he depends on, the less he will want to update it (even though, because it is obscure, he has no idea what dangers lurk in there).

One consequence of github prominently promoting the second workflow is that, instead of serving a *small* number of trusted releases that can be cached once generated (compression included), it needs to generate archives on the fly for all the random, not-shared code states it has induced developers to depend on.

No one sane will depend on github release links. When you need to release something that depends on hundreds of artifacts, you use the mechanism that is the same for all those hundreds of artifacts (dynamically generated archives), not the one-of-a-kind release links, which are not available for all projects, do not work the same way when they are available, and may not even keep being available for a given artifact (as soon as a dev decides to pin a random git hash, all bets are off).

Another consequence of the dev-oriented nature of github is that any workflow that depends on archives is an afterthought. Developers use git repositories, not the archived subset that goes into releases.

Git archive generation meets Hyrum's law

Posted Feb 16, 2023 14:31 UTC (Thu) by mrugiero (guest, #153040) [Link]

Projects like buildroot sometimes rely on packages of projects that don't do proper releases, such as downstream kernels for out-of-tree boards. I'm not 100% sure whether they clone the repo rather than asking for the archive, but that alone is proof you sometimes need to rely on specific development commits, just because it's the only way to pin a version.

I like the idea of randomizing file ordering inside the tar to keep people from relying on checksumming the compressed archive. Relying on that leaves the producer overly constrained: imagine what would happen if I needed to reduce my bandwidth consumption and was unable to switch compression levels at will. Or if the heuristic used to find long matches changed.
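To illustrate: the same tree compressed at different levels yields different archive checksums even though the contents are identical, while the uncompressed stream does not change between runs (on the same git version):

    git archive --format=tar HEAD | gzip -9 | sha256sum
    git archive --format=tar HEAD | gzip -1 | sha256sum
    # ...whereas the uncompressed stream is what actually identifies the contents:
    git archive --format=tar HEAD | sha256sum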

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:44 UTC (Fri) by jengelh (guest, #33263) [Link]

>Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).

Speaking of which…
Using the file upload feature would sidestep the verification issues of an on-the-fly (re-)generated archive (at the cost of disk space at the hoster).
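A rough sketch of that, assuming the GitHub CLI and a made-up v1.2.3 tag:

    git archive --format=tar.gz -o project-1.2.3.tar.gz v1.2.3
    sha256sum project-1.2.3.tar.gz > project-1.2.3.tar.gz.sha256
    # upload the frozen tarball (and its checksum) as release assets
    gh release create v1.2.3 project-1.2.3.tar.gz project-1.2.3.tar.gz.sha256 --notes "stable source tarball"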

Git archive generation meets Hyrum's law

Posted Feb 5, 2023 1:31 UTC (Sun) by poki (guest, #73360) [Link]

Yes, `export-subst` was already causing a similar problem in principle (non-reproducibility in terms of a digest, like here, and, even more alarming, a change of the actual unpacked content) as a fallout of git starting to scale the default short-hash length with repository size some 5 years ago. The symptom was typically another hex digit suddenly appearing in place of the `%h` specifier as projects grew, since that provision of git is in fact only semi-stable.

And also back then, `walters` suggested the same solution as now in the other thread; for posterity:
https://lists.fedoraproject.org/archives/list/devel@lists...

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 16:58 UTC (Thu) by farnz (subscriber, #17727) [Link] (3 responses)

The point is that the output of git archive, when compressed, is not stable. By making it unstable, so that each run of git archive requesting compressed output of a given git commit produces output with a different checksum, you stop people assuming that they can run git archive and get a fixed result - they know, instead, that each time you run git archive, you get a different answer.

Once you've built the archive, you can checksum it and keep it as a stable artefact long into the future. You just don't have a guarantee that you can regenerate the archive from the source and get the same checksum - if you need to validate that a given artefact matches the source, you need to do a deeper dive of some form.
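A sketch of that workflow (the tag is made up): build the artefact once, keep it, and later verify the stored file rather than regenerating it:

    git archive --format=tar.gz -o release-1.0.tar.gz v1.0
    sha256sum release-1.0.tar.gz > release-1.0.tar.gz.sha256
    # years later, against the stored artefact:
    sha256sum -c release-1.0.tar.gz.sha256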

Git archive generation meets Hyrum's law

Posted Feb 4, 2023 3:56 UTC (Sat) by yhw (subscriber, #163199) [Link] (2 responses)

How about making the checksums versioned? All current checksums would by default be $itself+gzip9. When a new compression method is introduced, make the compressor's NVR part of the checksum. The scheme could be designed so that a compression upgrade does not break existing/old checksum-based systems.
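Purely as an illustration of the idea (not an existing format), the recorded value might look something like:

    sha256:gzip-1.12-9:4f06c1e…      # only comparable to digests made with gzip 1.12 at level 9
    sha256:zstd-1.5.4-19:9a3bc07…    # a new compressor gets a new prefix instead of invalidating old ones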

Git archive generation meets Hyrum's law

Posted Feb 4, 2023 14:15 UTC (Sat) by farnz (subscriber, #17727) [Link] (1 responses)

That doesn't work, because the same compressor with the same input data is not obliged to produce the same output. Even with exactly the same binary, you can get different results because the compression algorithm is not fully deterministic (e.g. the output of multi-threaded compressors like pigz, which produces gzip-compatible output, and zstd can depend on thread timing, which in turn depends on the workload on the system and on the CPUs in the system).

To be compatible with today's compressors, you need to record not just the compressor, but also all scheduling decisions made during compression relative to the input data. This ends up being a huge amount of data to record, and eliminates the benefit of compressing.

Git archive generation meets Hyrum's law

Posted Feb 5, 2023 0:22 UTC (Sun) by himi (subscriber, #340) [Link]

> This ends up being a huge amount of data to record, and eliminates the benefit of compressing.

It also ignores the fact that what matters in the context of a git archive is the /contents/, not the exact shape of a processed version of those contents. And taking a step further back, what you care about most is the contents of the /repo/ at the point in its history that you're interested in - the archive is just a snapshot of that, and one that isn't even necessarily representative. There are a lot of ways you can change the actual meaningful contents of a git archive with command line options and filters without any changes to the contents of the repo, and any changes to the defaults for those would potentially have a similar effect to the issue discussed in the article (though in that case the git devs would be the ones getting the opprobrium).

All of which means that if you want to have a reliably verifiable and repeatable archive of the state of a git repo at a point in its history, you either need the repo itself (or a pruned sub-set with only the objects accessible from the commit you're interested in), or you need to explicitly build an archival format from the ground up with that goal in mind.
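For the first option, a minimal sketch (the tag is made up): a bundle carries the actual git objects, so the commit ID itself verifies the contents:

    git bundle create snapshot.bundle v1.0
    git bundle verify snapshot.bundle
    git clone snapshot.bundle restored-repo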

I'm sure there's some kind of saying somewhere in the crypto/data security world that you could paraphrase as "verify /all/ of what matters, and /only/ what matters" - if not, there should be. The issue here is a good example of why - generating the archive automatically with an ill-defined and unreliably repeatable process added a whole lot of gunk on top of the data they actually care about, and things are breaking because people are trying to do cryptographic verification of the gunk as well as the actual data.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:47 UTC (Thu) by flussence (guest, #85566) [Link]

Then the party publishing the stable checksum in this situation (i.e. not GitHub, it has never done so for these download links) can simply provide a corresponding stable tarball.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 11:59 UTC (Fri) by Lennie (subscriber, #49641) [Link] (3 responses)

The new QUIC protocol has adopted a similar strategy; to quote someone who wrote about it: "To prevent ossification, QUIC tries to encrypt as much data as possible, including signaling information [10], to hide it from network equipment and prevent vendors of said equipment from making assumptions that will interfere or prevent future changes to the protocol."

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 12:58 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

And for the one bit in the unencrypted header that is effectively fixed to 1, there is even an RFC to allow QUIC end-points to negotiate (in the somewhat encrypted handshake) deliberately twiddling that bit - see RFC 9287, https://www.rfc-editor.org/rfc/rfc9287.html .

Which really means that bit shouldn't have been there in the header, probably.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 15:02 UTC (Fri) by wsy (subscriber, #121706) [Link] (1 responses)

For people living in dystopian countries, such a bit is frustrating. We need a protocol that's widely used by legit websites while being indistinguishable from anti-censorship tools.

Git archive generation meets Hyrum's law

Posted Feb 10, 2023 15:27 UTC (Fri) by paulj (subscriber, #341) [Link]

Well, the unencrypted QUIC header is pretty much inscrutable to middle-boxes. There is very little information in it, besides a few bits and a "Connection Identifier" (CID), but the end-points rotate the CID regularly.

Even the CID can be left out (and, to be honest, I think the unencrypted CID is a wart - the rotation of it adds a /lot/ of complications to QUIC, including hard to fix races).

