
Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:48 UTC (Thu) by mathstuf (subscriber, #69389)
In reply to: Git archive generation meets Hyrum's law by geert
Parent article: Git archive generation meets Hyrum's law

`export-subst` and `export-ignore` can pull metadata from the commit in question and bake it into the archive (or exclude the file completely).

What should have happened is that the creation of a release has an option to permalink the "Source code" links Github provides. Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).
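A minimal sketch of how those two attributes behave, in a throwaway repository (the file names `VERSION` and `notes.txt` are invented for illustration):

```shell
# Hedged sketch: demonstrate export-subst and export-ignore in a throwaway repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
# $Format:...$ placeholders are expanded by `git archive` in files marked
# export-subst; export-ignore drops the file from the archive entirely.
printf 'commit: $Format:%%H$\n' > VERSION
printf 'VERSION export-subst\nnotes.txt export-ignore\n' > .gitattributes
echo 'not for release' > notes.txt
git add -A
git commit -qm init
git archive -o out.tar HEAD
mkdir unpacked && tar -xf out.tar -C unpacked
cat unpacked/VERSION          # the full commit hash, baked into the archive
test ! -e unpacked/notes.txt  # excluded from the archive
```

Note that the working-tree copy of `VERSION` keeps the literal `$Format:%H$` placeholder; only the archive gets the substituted value.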



Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:33 UTC (Fri) by himi (subscriber, #340) [Link] (4 responses)

That wouldn't help all the use cases which are targeting a specific commit rather than a release - which is going to be the case for the build bots and test infrastructure using this particular Github feature . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:46 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

I'm not sure which build bots and infrastructure are downloading random commit snapshots instead of tagged releases (or they follow `main`… but then you have no stable hash anyway). CI should be working from a clone, and if it gets a tarball, it can't know the hash a priori anyway.

Even when we patch stuff, I try to grab a tag and cherry-pick fixes we need back to it instead of following "random" development commits. But then we also try to rehost everything because customers can have pretty draconian firewall policies and they allow our hosting through to avoid downloading things from $anywhere.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 2:43 UTC (Fri) by himi (subscriber, #340) [Link] (2 responses)

Yeah, there are clearly better ways to implement this kind of thing, and it's not how you or I or any reasonably experienced dev would set things up, but there are a lot of people out there who are effectively stitching their development processes together from whatever odds and ends they can find. Maybe because they don't know any better, or because they don't have the resources to do it "properly", or because things just kind of grew that way through a process of accretion. And/or they may have set things up that way before Github integrated all those nice convenient free-ish services to support this kind of thing, and just never bothered fixing things that didn't seem broken. Probably a lot of them are now re-evaluating those decisions, having learned /why/ the reasonably experienced devs avoid doing things that way.

Github obviously isn't intending to support that kind of usage - if they were this change wouldn't have happened, or they'd have implemented the archive download in a reliably verifiable way from the start. But the service that Github /does/ provide is very easy to use/abuse for things Github isn't explicitly intending to support, and that's what's bitten people here. Github did nothing wrong, and neither did the people using their service in a way it wasn't really intended to support, but that's kind of the point of Hyrum's law, isn't it . . .

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:49 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

That’s because developing and releasing are two different trades, but git(hub|lab) and language-specific packaging systems in general made their success by targeting developer workflows. They suck badly for release processes; every release manager worth his salt knows it, but they have the developer mindshare, so dealing with their suckage is a fact of life.

A release manager knows intimately that converging on a stable and secure state is hard and will try to pin everything to official stable releases (except for the bugfix/security fixup releases that need to be fast-tracked and propagated as fast as possible).

A developer will use whatever random git commit he last checked out and will try to pin everything to that commit to avoid the hassle of testing anything other than his last workstation state (including freezing out security fixes). The more obscure the bit of code he depends on, the less he will want to update it (even though, because it is obscure, he has no idea what dangers lurk in there).

One consequence of github prominently promoting the second workflow is that instead of serving a *small* number of trusted releases that can be cached once generated (compression included), it needs to generate archives on the fly for all the random, unshared code states it induced developers to depend on.

No one sane will depend on github release links. When you need to release something that depends on hundreds of artifacts, you use the system that is the same for those hundreds of artifacts (dynamically generated archives), not the one-of-a-kind release links, which are not available for all projects, do not work the same way when they are available, and may not even stay available for a given artifact (as soon as a dev decides to pin a random git hash, all bets are off).

Another consequence of the dev-oriented nature of github is that any workflow that depends on archives is an afterthought. Developers use git repositories, not the archived subset that goes into releases.

Git archive generation meets Hyrum's law

Posted Feb 16, 2023 14:31 UTC (Thu) by mrugiero (guest, #153040) [Link]

Projects like buildroot sometimes rely on packages from projects that don't do proper releases, such as downstream kernels for out-of-tree boards. I'm not 100% sure whether they clone the repo rather than fetching the archive, but that alone is proof that you sometimes need to rely on specific development commits, simply because it's the only way to pin a version.

I like the idea of randomizing file ordering inside the tar to keep people from relying on a checksum of the compressed archive. Relying on that over-constrains the producer: imagine what would happen if I needed to reduce my bandwidth consumption and was unable to switch compression levels at will. Or if the heuristic used to find long matches changed.
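A small sketch of why checksumming the *compressed* archive is fragile (assumes GNU gzip, `sha256sum`, and `cmp` are available; the `seq` output is just a stand-in for a tar payload):

```shell
# Hedged sketch: the same payload compressed at different levels yields
# different .gz bytes, so a checksum of the compressed file is fragile;
# the decompressed payload is what stays stable.
set -e
payload=$(mktemp)
seq 1 20000 > "$payload"            # stand-in for a tar archive
sum1=$(gzip -1 -n -c < "$payload" | sha256sum | cut -d' ' -f1)
sum9=$(gzip -9 -n -c < "$payload" | sha256sum | cut -d' ' -f1)
[ "$sum1" != "$sum9" ]              # compressed checksums differ by level
# Round-tripping recovers identical bytes regardless of level:
gzip -9 -n -c < "$payload" | gunzip -c | cmp -s - "$payload"
```

This is why checksumming the decompressed tar stream (or the individual file contents) is the more robust consumer-side choice.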

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 10:44 UTC (Fri) by jengelh (guest, #33263) [Link]

>Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).

Speaking of which…
Using the file upload feature would sidestep the issues of verifying an on-the-fly (re-)generated archive (at the cost of disk space at the hoster).

Git archive generation meets Hyrum's law

Posted Feb 5, 2023 1:31 UTC (Sun) by poki (guest, #73360) [Link]

Yes, `export-subst` was already causing a similar problem in principle (non-reproducibility in terms of a digest, as here, and, even more alarming, changes to the actual unpacked content) as fallout of short commit hashes being lengthened as repositories scaled up, some five years ago. In projects that had grown, an extra hex digit would suddenly appear in place of the `%h` specifier, in what is apparently a semi-dynamic facility of git.
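The instability comes from git's abbreviation length (`core.abbrev`) scaling with repository size, so `$Format:%h$` output can grow over time. A hedged sketch of the mechanism, forcing two different lengths by hand in a throwaway repo:

```shell
# Hedged sketch: `%h` / --short output length follows core.abbrev, which
# git auto-scales as a repository grows, so $Format:%h$ output can change.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m init
short7=$(git -c core.abbrev=7 rev-parse --short HEAD)
short12=$(git -c core.abbrev=12 rev-parse --short HEAD)
echo "$short7"    # 7 hex digits
echo "$short12"   # 12 hex digits: same commit, longer abbreviation
```

In a real repository nobody sets these by hand; the default length simply creeps upward as the object count grows, which is exactly what changed archive contents back then.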

And also back then, `walters` suggested the same solution as now in the other thread; for posterity:
https://lists.fedoraproject.org/archives/list/devel@lists...


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds