Git archive generation meets Hyrum's law
Posted Feb 2, 2023 16:44 UTC (Thu)
by zdzichu (subscriber, #17118)
In reply to: Git archive generation meets Hyrum's law by alonz
Parent article: Git archive generation meets Hyrum's law
Posted Feb 2, 2023 16:52 UTC (Thu)
by geert (subscriber, #98403)
[Link] (8 responses)
What uniquely specifies the contents is the git commit ID (which is still SHA-1). And perhaps the checksum of the uncompressed archive (assuming git can keep that stable and doesn't have to change it due to e.g. a security bug).
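As a sketch (with a made-up tag name), that combination can be recorded without touching the compressed bytes at all:

```
# What identifies the contents: the commit (or tree) object ID.
git rev-parse 'v1.2.3^{commit}'    # commit ID (still SHA-1 in practice)
git rev-parse 'v1.2.3^{tree}'      # tree ID: the contents themselves

# Checksum of the *uncompressed* tar stream - unaffected by whatever
# gzip implementation or settings sit in front of it.
git archive --format=tar v1.2.3 | sha256sum
```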
Posted Feb 2, 2023 17:48 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
What should have happened is that the creation of a release has an option to permalink the "Source code" links Github provides. Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).
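For the submodule case, a rough sketch of what a "complete" tarball has to do by hand (prefix and file names here are illustrative, and nested submodules would need extra handling):

```
# `git archive` only records the superproject; submodule contents are simply
# absent from the tarball, so they have to be archived separately.
git archive --prefix=project/ --format=tar -o project.tar HEAD
git submodule foreach --quiet \
    'git archive --prefix="project/$sm_path/" --format=tar "$sha1" > "$toplevel/submodule-$sha1.tar"'
# ...after which the per-submodule archives still need to be appended
# (GNU tar: --concatenate) or unpacked into one tree before the final
# tarball is produced.
```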
Posted Feb 3, 2023 1:33 UTC (Fri)
by himi (subscriber, #340)
[Link] (4 responses)
Posted Feb 3, 2023 1:46 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Even when we patch stuff, I try to grab a tag and cherry-pick fixes we need back to it instead of following "random" development commits. But then we also try to rehost everything because customers can have pretty draconian firewall policies and they allow our hosting through to avoid downloading things from $anywhere.
Posted Feb 3, 2023 2:43 UTC (Fri)
by himi (subscriber, #340)
[Link] (2 responses)
Github obviously isn't intending to support that kind of usage - if they were, this change wouldn't have happened, or they'd have implemented the archive download in a reliably verifiable way from the start. But the service that Github /does/ provide is very easy to use/abuse for things Github isn't explicitly intending to support, and that's what's bitten people here. Github did nothing wrong, and neither did the people using their service in a way it wasn't really intended to support, but that's kind of the point of Hyrum's law, isn't it...
Posted Feb 3, 2023 10:49 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
A release manager knows intimately that converging on a stable and secure state is hard and will try to pin everything on official stable releases (except for the bugfix/security fixup releases that need to be fast-tracked and propagated as fast as possible).
A developer will use whatever random git commit he last checked out and will try to pin everything to that commit to avoid the hassle of testing anything other than his last workstation state (including freezing out security fixes). The more obscure the bit of code he depends on, the less he will want to update it (even though, because it is obscure, he has no idea what dangers lurk in there).
One consequence of github prominently promoting the second workflow is that instead of serving a *small* number of trusted releases that can be cached once generated (compression included), it needs to generate archives on the fly for all the random, unshared code states it has induced developers to depend on.
No one sane will depend on github release links. When you need to release something that depends on hundreds of artifacts, you use the system which is the same for those hundreds of artifacts (dynamically generated archives), not the one-of-a-kind release links which are not available for all projects, do not work the same when they are available, and may not even keep being available for a given artifact (as soon as a dev decides to pin a random git hash, all bets are off).
Another consequence of the dev-oriented nature of github is that any workflow that depends on archives is an afterthought. Developers use git repositories, not the archived subset that goes into releases.
Posted Feb 16, 2023 14:31 UTC (Thu)
by mrugiero (guest, #153040)
[Link]
I like the idea of randomizing file ordering inside the tar to avoid people relying on checksumming the compressed archive. Relying on that leaves the producer overly constrained: imagine what would happen if I needed to reduce my bandwidth consumption but was unable to switch compression levels at will, or if the heuristic used to find long matches changed.
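That restriction is easy to demonstrate locally (any repository will do):

```
# The compressed bytes (and their hash) change as soon as the compression
# level - or the compressor's heuristics - change:
git archive --format=tar HEAD | gzip -1 | sha256sum
git archive --format=tar HEAD | gzip -9 | sha256sum   # different hash, same contents
# The uncompressed stream stays the same for a given commit and git version:
git archive --format=tar HEAD | sha256sum
```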
Posted Feb 3, 2023 10:44 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Speaking of it…
Posted Feb 5, 2023 1:31 UTC (Sun)
by poki (guest, #73360)
[Link]
And also back then, `walters` suggested the same solution as now in the other thread; for posterity:
Posted Feb 2, 2023 16:58 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (3 responses)
The point is that the output of git archive, when compressed, is not stable. By making it unstable, so that each run of git archive requesting compressed output of a given git commit has a different checksum, you stop people assuming that they can run git archive and get a fixed result - they know, instead, that each time they run git archive, they get a different answer.
Once you've built the archive, you can checksum it and keep it as a stable artefact long into the future. You just don't have a guarantee that you can regenerate the archive from the source and get the same checksum - if you need to validate that a given artefact matches the source, you need to do a deeper dive of some form.
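In other words, something like this (names made up):

```
# Generate the artefact once and treat those exact bytes as the release:
git archive --format=tar.gz --prefix=project-1.0/ -o project-1.0.tar.gz v1.0
sha256sum project-1.0.tar.gz > project-1.0.tar.gz.sha256

# Later, verify the stored artefact itself - not a freshly regenerated one:
sha256sum -c project-1.0.tar.gz.sha256
```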
Posted Feb 4, 2023 3:56 UTC (Sat)
by yhw (subscriber, #163199)
[Link] (2 responses)
Posted Feb 4, 2023 14:15 UTC (Sat)
by farnz (subscriber, #17727)
[Link] (1 responses)
That doesn't work because the same compressor with the same input data is not obliged to produce the same output. Even with exactly the same binary, you can get different results because the compression algorithm is not fully deterministic (e.g. multi-threaded compressors like pigz, which produces gzip-compatible output, and zstd can depend on thread timing, which in turn depends on the workload on the system and on the CPUs in the system).
To be compatible with today's compressors, you need to record not just the compressor, but also all scheduling decisions made during compression relative to the input data. This ends up being a huge amount of data to record, and eliminates the benefit of compressing.
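A smaller illustration of the same point (assuming gzip and pigz are installed): two gzip-compatible compressors fed identical input generally produce different bytes, even though the decompressed data is identical.

```
# Same input, two gzip-compatible compressors: the compressed bytes usually
# differ, while only the decompressed data is guaranteed to match.
git archive --format=tar HEAD | gzip -9 > a.tar.gz
git archive --format=tar HEAD | pigz -9 > b.tar.gz
cmp a.tar.gz b.tar.gz                                   # typically differs
zcat a.tar.gz | sha256sum ; zcat b.tar.gz | sha256sum   # identical
```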
Posted Feb 5, 2023 0:22 UTC (Sun)
by himi (subscriber, #340)
[Link]
It also ignores the fact that what matters in the context of a git archive is the /contents/, not the exact shape of a processed version of those contents. And taking a step further back, what you care about most is the contents of the /repo/ at the point in its history that you're interested in - the archive is just a snapshot of that, and one that isn't even necessarily representative. There are a lot of ways you can change the actual meaningful contents of a git archive with command line options and filters without any changes to the contents of the repo, and any changes to the defaults for those would potentially have a similar effect to the issue discussed in the article (though in that case the git devs would be the ones getting the opprobrium).
All of which means that if you want to have a reliably verifiable and repeatable archive of the state of a git repo at a point in its history, you either need the repo itself (or a pruned sub-set with only the objects accessible from the commit you're interested in), or you need to explicitly build an archival format from the ground up with that goal in mind.
I'm sure there's some kind of saying somewhere in the crypto/data security world that you could paraphrase as "verify /all/ of what matters, and /only/ what matters" - if not, there should be. The issue here is a good example of why - generating the archive automatically with an ill-defined and unreliably repeatable process added a whole lot of gunk on top of the data they actually care about, and things are breaking because people are trying to do cryptographic verification of the gunk as well as the actual data.
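One way to come close to that with stock git (a sketch, assuming a plain tarball with no export-ignore/export-subst rewriting and made-up file names) is to verify the unpacked contents against the repo's tree object and ignore the archive wrapper entirely:

```
# Recompute a git tree ID from the unpacked archive; the tarball and
# compression metadata drop out of the comparison entirely.
mkdir check && cd check && git init -q .
tar -xf ../project-1.0.tar.gz --strip-components=1
git add -A && git write-tree
# ...and compare the printed ID against: git rev-parse 'v1.0^{tree}'
```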
Posted Feb 2, 2023 19:47 UTC (Thu)
by flussence (guest, #85566)
[Link]
Posted Feb 10, 2023 11:59 UTC (Fri)
by Lennie (subscriber, #49641)
[Link] (3 responses)
Posted Feb 10, 2023 12:58 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Which really means that bit shouldn't have been there in the header, probably.
Posted Feb 10, 2023 15:02 UTC (Fri)
by wsy (subscriber, #121706)
[Link] (1 responses)
Posted Feb 10, 2023 15:27 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Even the CID can be left out (and, to be honest, I think the unencrypted CID is a wart - its rotation adds a /lot/ of complications to QUIC, including hard-to-fix races).
Using the file upload feature would sidestep the issues of verification of an on-the-fly (re-)generated archive (at the cost of diskspace at the hoster).
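For example, with the GitHub CLI (tag and file names are illustrative), the fixed artefact can be attached to the release rather than regenerated on demand:

```
# Build the artefact once and record its checksum...
git archive --format=tar.gz --prefix=project-1.0/ -o project-1.0.tar.gz v1.0
sha256sum project-1.0.tar.gz > SHA256SUMS
# ...then upload both as release assets; their bytes never depend on later
# on-the-fly archive generation.
gh release create v1.0 project-1.0.tar.gz SHA256SUMS --title "project 1.0"
```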
https://lists.fedoraproject.org/archives/list/devel@lists...