Git archive generation meets Hyrum's law
One widely used GitHub feature is the ability to download an archive file of the state of the repository at an arbitrary commit; it is often used by build systems to obtain a specific release of a package of interest. Internally, this archive is created at request time by the git archive subcommand. Most build systems will compare the resulting archive against a separately stored checksum to be sure that the archive is as expected and has not been corrupted; if the checksum fails to match, the build will be aborted. So when the checksums of GitHub-generated tarballs abruptly changed, builds started failing.
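As a rough sketch of the pattern that broke (with a hypothetical project, URL, and checksum standing in for real ones), a build recipe typically does something like this:

    # Fetch the auto-generated tarball for a tag and compare it against a
    # checksum recorded when the recipe was written.
    url="https://github.com/example/project/archive/refs/tags/v1.2.3.tar.gz"
    expected="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

    curl -sLo project-v1.2.3.tar.gz "$url"
    echo "$expected  project-v1.2.3.tar.gz" | sha256sum -c - || exit 1
    # If the server regenerates the tarball with different compression, the
    # bytes (and thus the checksum) change, and the build aborts here.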
Unsurprisingly, people started to complain. The initial response from GitHub employee (and major Git contributor) brian m. carlson was less than fully understanding:
I'm saying that policy has never been correct and we've never guaranteed stable checksums for archives, just like Git has never guaranteed that. I apologize that things are broken here and that there hasn't been clearer communication in the past on this, but our policy hasn't changed in over 4 years.
This answer, it might be said, was not received well. Wyatt Anderson, for example, said:
The collective amount of human effort it will take to break glass, recover broken build systems that are impacted by this change, and republish artifacts across entire software ecosystems could probably cure cancer. Please consider reverting this change as soon as possible.
The outcry grew louder, and it took about two hours for Matt Cooper (another GitHub employee) to announce that the change was being reverted — for now: "we're reverting the change, and we'll communicate better about such changes in the future (including timelines)". Builds resumed working, and peace reigned once again.
The source of the problem
The developers at GitHub did not wake up one morning and hatch a scheme to break large numbers of build systems; instead, all they did was upgrade the version of Git used internally. In June 2022, René Scharfe changed git archive to use an internal implementation of the gzip compression algorithm rather than invoking the gzip program separately. This change, which found its way into the Git 2.38 release, allowed Git to drop the gzip dependency, more easily support compression across operating systems, and compress the data with less CPU time.
It also caused git archive to compress files differently. While the uncompressed data is identical, the compressed form differs, so the checksum of the compressed data differs as well. Once this change landed on GitHub's production systems, the checksums for tarballs generated on the fly abruptly changed. GitHub backed out the change, either by reverting to an older Git or by explicitly configuring the use of the standalone gzip program, and the immediate problem went away.
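For a self-hosted setup, one way to pin the old behavior (a sketch; GitHub has not said exactly which approach it took) is Git's tar.<format>.command configuration, which pipes the tar stream produced by git archive through an external command:

    # Route .tgz/.tar.gz archive output through the standalone gzip again,
    # rather than the internal compressor introduced in Git 2.38.
    git config --global tar.tgz.command "gzip -cn"
    git config --global tar.tar.gz.command "gzip -cn"

    # Archives produced this way now depend on whatever gzip(1) is installed.
    git archive --format=tar.gz --prefix=project/ -o project.tar.gz v1.2.3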
The resulting discussion on the Git mailing list has been relatively muted so far. Eli Schwartz started things off with a suggestion that Git should change its default back to using the external gzip program for now, then implement a "v2 archive format" using the internal compressor. Using a heuristic, git archive would always default to the older format for commits before some sort of cutoff date. That would ensure ongoing compatibility for older archives, but the idea of wiring that sort of heuristic into Git was not generally popular.
Ævar Arnfjörð Bjarmason, instead, suggested that the default could be changed to use the external gzip, retaining the internal implementation as an option or as a fallback should the external program not be found. The responsibility for output compatibility could then be shifted to the compression program: anybody who wants to ensure that their generated archive files do not change will have to ensure that their gzip does not change. Since the Git developers do not control that program, they cannot guarantee its forward compatibility in any case.
Carlson, though, argued for avoiding stability guarantees — especially implicit guarantees — if possible:
I made a change some years back to the archive format to fix the permissions on pax headers when extracted as files, and kernel.org was relying on that and broke. Linus yelled at me because of that. Since then, I've been very opposed to us guaranteeing output format consistency without explicitly doing so. I had sent some patches before that I don't think ever got picked up that documented this explicitly. I very much don't want people to come to rely on our behaviour unless we explicitly guarantee it.
He went on to suggest that Git could guarantee the stability of the archive format in uncompressed form. That format would have to be versioned, though, since the SHA-256 transition, if and when it happens, will force changes in that format anyway (a claim that Bjarmason questioned). In general, carlson concluded, it may well become necessary for anybody who wants consistent results to decompress archive files before checking checksums. He later reiterated that, in his opinion, implementing a stable tar format is feasible, but adding compression is not: "I personally feel that's too hard to get right and am not planning on working on it".
Konstantin Ryabitsev said that, while he understands carlson's desire to avoid committing to an output format, "I also think it's one of those things that happen despite your best efforts to prevent it". He suggested adding a --stable option to git archive that was guaranteed to not change.
What next?
As of this writing, the Git community has not decided whether to make any changes as the result of this episode. Bjarmason argued that the Git community should accommodate the needs of its users, even if they came to depend on a feature that was never advertised as being stable:
That's unfortunate, and those people probably shouldn't have done that, but that's water under the bridge. I think it would be irresponsible to change the output willy-nilly at this point, especially when it seems rather easy to find some compromise everyone will be happy with.
He has since posted a patch set restoring the old behavior, but also documenting that this behavior could change in the future.
Committing to stability of this type is never a thing to be done lightly, though; such stability can be hard to maintain (especially when dealing with file formats defined by others) and can block other types of progress. For example, replacing gzip can yield better compression that can be performed more efficiently; an inability to move beyond that algorithm would prevent Git from obtaining those benefits. Even if Git restores the use of an external gzip program by default, that program might, itself, change, or downstream users like GitHub may decide that they no longer want to support that format.
It would thus be unsurprising if this problem were to refuse to go away. The Git project is reluctant to add a stability guarantee to its maintenance load, and the same is true of its downstream users; GitHub has said that it would give some warning before a checksum change returns, but has not said that such a change would not happen. The developers and users of build systems may want to rethink their reliance on the specific compression format used by one proprietary service on the Internet. The next time problems turn up, they will not be able to say they haven't been warned.
Posted Feb 2, 2023 16:18 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (5 responses)
Couldn't the checksum be based on the uncompressed data? Kernel release signatures are based on the uncompressed data, for example.
Sure, it would be a little annoying to have to uncompress to verify the archive integrity, but that would free up developers to tweak the compression to their hearts' content. It also means build processes that rely on checksums would need to be adjusted, but that would be a one-time adjustment.
I suppose there's the risk of a compression bomb that could DoS build systems, but those would be relatively easy to detect... along with the checksum, store the size of the uncompressed data you expect and abort if it starts uncompressing to a larger size.
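A minimal sketch of that idea in shell (the checksum, file names, and size limit are invented): hash the decompressed stream rather than the tarball, and refuse to go past an expected size:

    expected_sha="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
    max_bytes=$((200 * 1024 * 1024))   # generous upper bound on the raw tar

    # Decompress at most max_bytes+1 bytes; hitting the cap is treated as a
    # potential decompression bomb.
    gunzip -c project.tar.gz | head -c $((max_bytes + 1)) > payload.tar
    [ "$(wc -c < payload.tar)" -le "$max_bytes" ] || { echo "archive too large"; exit 1; }

    # The checksum covers the uncompressed tar, so changes in server-side
    # compression no longer matter.
    echo "$expected_sha  payload.tar" | sha256sum -c - || exit 1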
Posted Feb 2, 2023 16:19 UTC (Thu)
by dskoll (subscriber, #1630)
[Link]
Ah, just noticed this was mentioned in the article. Missed it first time around.
Posted Feb 2, 2023 23:44 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Given that we're talking about git, here's an interesting anecdote: originally, the identity of a git blob object was the hash of its _compressed_ data. Very early in the git history (you can find the commits if you look near the beginning), it was noticed that this was going to be a mistake, and the identity of blob objects was changed to be the hash of its _uncompressed_ data. This was a breaking change, since every commit (and tree and blob) would change its hash, but since there were only a couple of git repositories in the whole world (IIRC, mostly git itself, Linux, and sparse), and only a few people were using git back then, that change was acceptable.
Posted Feb 3, 2023 0:33 UTC (Fri)
by edgewood (subscriber, #1123)
[Link]
Ah, perhaps they learned from the Makefile tab mistake.
Posted Feb 3, 2023 6:35 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Feb 23, 2023 17:10 UTC (Thu)
by kijiki (subscriber, #34691)
[Link]
Posted Feb 2, 2023 16:25 UTC (Thu)
by alonz (subscriber, #815)
[Link] (21 responses)
This way the "checksum is not a guarantee" becomes a real thing, and Hyrum gets a rest.
Posted Feb 2, 2023 16:44 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (18 responses)
Posted Feb 2, 2023 16:52 UTC (Thu)
by geert (subscriber, #98403)
[Link] (8 responses)
What uniquely specifies the contents are the git commit ID (which is still sha1). And perhaps the checksum of the uncompressed archive (assumed git can keep that stable, and doesn't have to change it due to e.g. a security bug).
Posted Feb 2, 2023 17:48 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
What should have happened is that the creation of a release has an option to permalink the "Source code" links Github provides. Note that this is not suitable for projects using things like autoconf (as it would lack `configure`) or having submodules (as `git archive` just ignores them).
Posted Feb 3, 2023 1:33 UTC (Fri)
by himi (subscriber, #340)
[Link] (4 responses)
Posted Feb 3, 2023 1:46 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Even when we patch stuff, I try to grab a tag and cherry-pick fixes we need back to it instead of following "random" development commits. But then we also try to rehost everything because customers can have pretty draconian firewall policies and they allow our hosting through to avoid downloading things from $anywhere.
Posted Feb 3, 2023 2:43 UTC (Fri)
by himi (subscriber, #340)
[Link] (2 responses)
Github obviously isn't intending to support that kind of usage - if they were this change wouldn't have happened, or they'd have implemented the archive download in a reliably verifiable way from the start. But the service that Github /does/ provide is very easy to use/abuse for things Github isn't explicitly intending to support, and that's what's bitten people here. Github did nothing wrong, and neither did the people using their service in a way it wasn't really intended to support, but that's kind of the point of Hyrum's law, isn't it . . .
Posted Feb 3, 2023 10:49 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
A release manager knows intimately that converging on a stable and secure state is hard and will try to pin everything on official stable releases (except for the bugfix/security fixup releases that need to be fast-tracked and propagated as fast as possible).
A developer will use whatever random git commit he checked out last and will try to pin everything to this commit to avoid the hassle of testing something else than his last workstation state (including freezing out security fixes). The more obscure a bit of code he depends on, the less he will want to update it (even though, because it is obscure, he has no idea what dangers lurk in there).
One consequence of github prominently promoting the second workflow is that instead of serving a *small* number of trusted releases that can be cached once generated (compression included), it needs to generate on-the-fly archives for all the random not-shared code states it induced developers to depend on.
No one sane will depend on github release links. When you need to release something that depends on hundreds of artifacts you use the system which is the same for those hundreds of artifacts (dynamically generated archives), not the one-of-a-kind release links which are not available for all projects, do not work the same when they are available, and may not even keep being available for a given artifact (as soon as a dev decides to pin a random git hash all bets are off).
Another consequence of the dev-oriented nature of github is that any workflow that depends on archives is an afterthought. Developers use git repositories, not the archived subset that goes into releases.
Posted Feb 16, 2023 14:31 UTC (Thu)
by mrugiero (guest, #153040)
[Link]
I like the idea of randomizing file ordering inside the tar to avoid people relying on checksumming the compressed archive. Relying on that makes the producer overly restricted: imagine what would happen if I needed to reduce my bandwidth consumption and I was unable to switch compression levels at will. Or the heuristic used to find long matches changed.
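The fragility is easy to demonstrate (a sketch; the tag name is invented and the exact sums depend on the repository): the same tree compressed with different settings yields different bytes, while the raw tar stream for a given commit and Git version stays put:

    # Same commit, same tar stream, different gzip settings:
    git archive --format=tar v1.2.3 | gzip -n -1 | sha256sum
    git archive --format=tar v1.2.3 | gzip -n -9 | sha256sum   # different sum

    # Hashing before compression sidesteps the problem:
    git archive --format=tar v1.2.3 | sha256sum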
Posted Feb 3, 2023 10:44 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Speaking of it…
Posted Feb 5, 2023 1:31 UTC (Sun)
by poki (guest, #73360)
[Link]
Using the file upload feature would sidestep the issues of verification of an on-the-fly (re-)generated archive (at the cost of diskspace at the hoster).
And also back then, `walters` suggested the same solution as now in the other thread; for posterity:
https://lists.fedoraproject.org/archives/list/devel@lists...
Posted Feb 2, 2023 16:58 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (3 responses)
The point is that the output of git archive when compressed is not stable. By making it unstable, so that each run of git archive requesting compressed output of a given git commit has a different checksum, you stop people assuming that they can run git archive and get a fixed result - they know, instead, that each time you run git archive, you get a different answer.
Once you've built the archive, you can checksum it and keep it as a stable artefact long into the future. You just don't have a guarantee that you can regenerate the archive from the source and get the same checksum - if you need to validate that a given artefact matches the source, you need to do a deeper dive of some form.
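In other words (a sketch, with made-up names): generate the archive once at release time, record its checksum, and treat that stored file, not the ability to regenerate it, as the release artefact:

    # Produce the archive once, at release time...
    git archive --format=tar.gz --prefix=project-1.2.3/ -o project-1.2.3.tar.gz v1.2.3

    # ...record its checksum...
    sha256sum project-1.2.3.tar.gz > project-1.2.3.tar.gz.sha256

    # ...and from then on verify the stored file, never a regenerated one.
    sha256sum -c project-1.2.3.tar.gz.sha256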
Posted Feb 4, 2023 3:56 UTC (Sat)
by yhw (subscriber, #163199)
[Link] (2 responses)
Posted Feb 4, 2023 14:15 UTC (Sat)
by farnz (subscriber, #17727)
[Link] (1 responses)
That doesn't work because the same compressor with the same input data is not obliged to produce the same output. Even with exactly the same binary, you can get different results because the compression algorithm is not fully deterministic (e.g. multi-threaded compressors like pigz, which produces gzip-compatible output, and zstd can depend on thread timing, which in turn depends on the workload on the system and on the CPUs in the system).
To be compatible with today's compressors, you need to record not just the compressor, but also all scheduling decisions made during compression relative to the input data. This ends up being a huge amount of data to record, and eliminates the benefit of compressing.
Posted Feb 5, 2023 0:22 UTC (Sun)
by himi (subscriber, #340)
[Link]
It also ignores the fact that what matters in the context of a git archive is the /contents/, not the exact shape of a processed version of those contents. And taking a step further back, what you care about most is the contents of the /repo/ at the point in its history that you're interested in - the archive is just a snapshot of that, and one that isn't even necessarily representative. There's a lot of ways you can change the actual meaningful contents of a git archive with command line options and filters without any changes to contents of the repo, and any changes to the defaults for those would potentially have a similar effect to the issue discussed in the article (though in that case the git devs would be the ones getting the opprobrium).
All of which means that if you want to have a reliably verifiable and repeatable archive of the state of a git repo at a point in its history, you either need the repo itself (or a pruned sub-set with only the objects accessible from the commit you're interested in), or you need to explicitly build an archival format from the ground up with that goal in mind.
I'm sure there's some kind of saying somewhere in the crypto/data security world that you could paraphrase as "verify /all/ of what matters, and /only/ what matters" - if not, there should be. The issue here is a good example of why - generating the archive automatically with an ill-defined and unreliably repeatable process added a whole lot of gunk on top of the data they actually care about, and things are breaking because people are trying to do cryptographic verification of the gunk as well as the actual data.
Posted Feb 2, 2023 19:47 UTC (Thu)
by flussence (guest, #85566)
[Link]
Posted Feb 10, 2023 11:59 UTC (Fri)
by Lennie (subscriber, #49641)
[Link] (3 responses)
Posted Feb 10, 2023 12:58 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Which really means that bit shouldn't have been there in the header, probably.
Posted Feb 10, 2023 15:02 UTC (Fri)
by wsy (subscriber, #121706)
[Link] (1 responses)
Posted Feb 10, 2023 15:27 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Even the CID can be left out (and, to be honest, I think the unencrypted CID is a wart - the rotation of it adds a /lot/ of complications to QUIC, including hard to fix races).
Posted Feb 6, 2023 17:40 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (1 responses)
That's what is often called "greasing" in the world of protocols, and is meant to prevent ossification. And I agree. Too bad it wasn't done before, but it would have helped a lot here. In fact what users don't understand is that there's even no way to guarantee that the external gzip utility will provide the same bitstream forever either. Just fixing a vulnerability that would require to occasionally change a maximum length or to avoid a certain sequence of codes will be sufficient to virtually break every archive. Plus if some improvements are brought to gzip, it will be condemned to keep them disabled forever with the current principle.
Indeed they've been bad at communicating but users need to adjust their workflow to rely on decompressed archives' checksums only, even if that's more painful.
Posted Feb 7, 2023 4:53 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
That still doesn't help in general. How do you hash a decompressed zip? What if your decompressor is vulnerable and can be given bogus content before you verify what you're working with? How about `export-subst` changing content because you have a larger repo and now your `%h` expansion has an extra character.
Github needs to have an option to "freeze" the auto-generated tarballs as part of a release object instead of offering `/archive/` links at all. Random snapshots and whatnot are still a problem, but this solves the vast majority of the problems and avoids further confusion by offering `/archive/` URLs in places folks would assume they can get some level of stability.
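For reference, the `export-subst` behavior mentioned above comes from gitattributes: a placeholder in a committed file is expanded by git archive, and the abbreviated-hash placeholder can grow longer as the repository grows. A sketch, with a hypothetical VERSION file:

    # .gitattributes
    VERSION export-subst

    # VERSION, as committed
    commit $Format:%h$

    # git archive expands $Format:%h$ to the abbreviated commit hash; the
    # default abbreviation length scales with the number of objects in the
    # repository, so even the uncompressed archive contents can change.
    git archive --format=tar HEAD | sha256sum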
Posted Feb 2, 2023 16:35 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
It seems that it was actually a change to git. Yes it could have been a github employee who made that change (was it?), but it wasn't github's systems per se that caused the grief.
Adding an (optional) flag means that we don't break existing systems, but by adding the flag I guess downstreams would get the benefit of faster downloads etc.
And then, as I think someone suggested, if you start tagging your repository, if the default depends on the date of the tag being downloaded then github, gitlab, whoever can move to upgraded compression algorithms without breaking pre-existing stored checksums. In fact, could you store the compression algorithm of choice with the tag?
Cheers,
Wol
Posted Feb 2, 2023 17:56 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
This isn't about compression algorithm choice, but implementation. How would I record what `gzip` I used to make a source archive as part of the tag? What do I do for bz2, xz, zip, or any other compression format that I don't make on that day? Would I not be allowed to use a hypothetical SuperSqueeze algorithm on a tag of last year because it didn't exist then?
Posted Feb 3, 2023 10:09 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
And note that just knowing which implementation of gzip (or other compressor) you used is not guaranteed to be enough: while the decompression algorithm's output is fully determined by its input, the compressor's output is not, and for situations where the decision doesn't affect compression ratio significantly, I could well imagine that the decision is non-deterministic (e.g. racing two threads against each other, first one to finish determines the decision). Thus, you'd have to store not just the implementation you used, but also all sources of non-determinism that affected it (e.g. that thread 1 completed after thread 2) to be able to reproduce the original archive.
Posted Feb 2, 2023 17:16 UTC (Thu)
by ballombe (subscriber, #9523)
[Link] (10 responses)
Posted Feb 2, 2023 18:06 UTC (Thu)
by epa (subscriber, #39769)
[Link] (4 responses)
Posted Feb 2, 2023 19:31 UTC (Thu)
by kilobyte (subscriber, #108024)
[Link] (3 responses)
Posted Feb 6, 2023 7:54 UTC (Mon)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Feb 6, 2023 10:35 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
Or we could go one better; while making the compressor deterministic is hard, making the uncompressed form deterministic is not (when uncompressed, it's "just" a case of ensuring that everything is done in deterministic order). We then checksum the uncompressed form, and ship a compressed artefact without checksums.
Note in this context that HTTP supports "Content-Transfer" encodings: so we can compress for transfer, while still transferring and checksumming uncompressed data. And you can save the compressed form, so that you don't waste disk space - or even recompress to a higher compression if suitable.
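A sketch of that split, assuming a server that publishes the checksum of the raw tar and compresses only on the wire: the transfer is compressed, but what gets hashed (and stored, if you like) is the uncompressed representation:

    # curl asks for a compressed response and transparently decompresses it,
    # so the bytes that reach sha256sum are the canonical, uncompressed tar.
    curl -sL --compressed "https://example.org/project-1.2.3.tar" | sha256sum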
Posted Mar 25, 2023 12:47 UTC (Sat)
by sammythesnake (guest, #17693)
[Link]
Posted Feb 2, 2023 18:33 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Feb 2, 2023 19:26 UTC (Thu)
by ballombe (subscriber, #9523)
[Link] (2 responses)
Posted Feb 3, 2023 1:31 UTC (Fri)
by WolfWings (subscriber, #56790)
[Link] (1 responses)
There are near-infinite compressed gzip/deflate/etc. bitstreams that decode to the same output.
That's the very nature of compression. They only define how to decompress, and the compressor can use whatever techniques it wants to build a valid bitstream.
Defining compression based on the compressor is, frankly, lunacy.
Posted Feb 3, 2023 22:45 UTC (Fri)
by ballombe (subscriber, #9523)
[Link]
So what? The gzip source code is readily available. This is not an obstacle.
Posted Feb 7, 2023 9:39 UTC (Tue)
by JanC_ (guest, #34940)
[Link]
Posted Feb 2, 2023 17:56 UTC (Thu)
by agateau (subscriber, #57569)
[Link] (8 responses)
Posted Feb 2, 2023 18:50 UTC (Thu)
by sionescu (subscriber, #59410)
[Link]
Posted Feb 2, 2023 19:03 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
* Your cache uses an ever-growing amount of storage, which you would otherwise be using to host repositories, so now repository hosting gets more expensive.
* After you have been doing this for a few years or so, the vast majority of your cache is holding data that nobody is ever going to look at again, so now you need to implement a hierarchical cache (i.e. push all the low-traffic files out to tape to cut down on costs).
* But retrieving a file from tape probably takes *longer* than just generating a fresh archive, so your cache isn't a cache anymore, it's a bottleneck.
Posted Feb 3, 2023 14:14 UTC (Fri)
by agateau (subscriber, #57569)
[Link] (1 responses)
Depending on them not ever changing was a bad idea.
Assuming the archives one can find in GitHub releases would never change, on the other hand, sounds like a reasonable assumption. Those should be generated once. GitHub already lets you attach arbitrary files to a release, so an archive of the sources should not be a problem (he says without having any numbers). They could limit this to only creating archives for releases, not tags, to reduce the number of generated archives.
Posted Feb 3, 2023 14:38 UTC (Fri)
by paulj (subscriber, #341)
[Link]
The people doing this are utterly clueless, and it's insanity to coddle them.
Posted Feb 2, 2023 19:30 UTC (Thu)
by bnewbold (subscriber, #72587)
[Link] (3 responses)
It occurs to me that I've been assuming that the original issue was with "release" archives (aka, git tag'd commits resulting in tarballs). If the issue has been with pulling archives of arbitrary git commits, i'm less sympathetic to the assumption of stability, as it does seem reasonable to generate those on-the-fly and not persist the result.
Posted Feb 3, 2023 7:38 UTC (Fri)
by mb (subscriber, #50428)
[Link] (2 responses)
That doesn't work.
It only takes one search engine bot to hit a large number of these generated links for your cache to explode.
Posted Feb 3, 2023 19:02 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
`archive.org` might do it, but I suspect they wield far less DDoS power than a Google crawler.
Posted Feb 3, 2023 19:27 UTC (Fri)
by pizza (subscriber, #46)
[Link]
And then there are the distributed bots that spoof their identifier and really don't GaF about what robots.txt has in it.
Posted Feb 2, 2023 18:01 UTC (Thu)
by walters (subscriber, #7396)
[Link]
Posted Feb 2, 2023 19:16 UTC (Thu)
by flussence (guest, #85566)
[Link] (6 responses)
It's only getting attention now because someone, somewhere, saw an opportunity to go outrage-farming for clicks over it.
Posted Feb 2, 2023 22:05 UTC (Thu)
by mjg59 (subscriber, #23239)
[Link]
Posted Feb 2, 2023 22:32 UTC (Thu)
by vivo (subscriber, #48315)
[Link] (2 responses)
Github has releases which does exactly that, provide an unchanging archive.
Posted Feb 2, 2023 22:36 UTC (Thu)
by vivo (subscriber, #48315)
[Link] (1 responses)
Posted Feb 3, 2023 10:06 UTC (Fri)
by smcv (subscriber, #53363)
[Link]
Yes and no, unfortunately... Promoting a tag to a "release" lets you attach arbitrary binary artifacts, such as your official release tarballs. These are stored as binary blobs and don't change. There's no guarantee that they bear any relationship to what's in git, so a malicious project maintainer could insert bad things into the official release tarball in a less visible way than committing them to git (as usual, you have to either trust the maintainer, or audit the code).
However, whether you attach official release tarballs or not, Github provides prominent "Source code" links which point to the output from git archive, and it doesn't seem to be possible to turn those off. It is these "Source code" tarballs that changed recently. Even if git archive doesn't change its output, they are annoying in projects that use submodules or Autotools, because every so often a well-intentioned user will download them, try to build them, find that the required git submodules are missing, and open a bug "your release tarballs are broken".
Flatpak makes a good example to look at for this. flatpak-1.x.tar.xz is the official release tarball generated by Autotools, which is what you would expect for an Autotools project: the source from git (including submodules), minus some files only needed during development, plus Autotools-generated cruft like the configure script. You can build directly from a git clone (after running ./autogen.sh), or you can build from flatpak-1.x.tar.xz (with no special preparation), but you can't easily build from the "Source code" tarballs (which are more or less useless, and I'd turn off display of that link if it was possible).
Posted Feb 2, 2023 22:36 UTC (Thu)
by corbet (editor, #1)
[Link] (1 responses)
I'm curious as to where this "outrage farming" happened? Hopefully you're not referring to this article?
The discussions I found were mostly in issue trackers and such - projects reacting to their builds failing. Not the best venue if one is hoping to accomplish some "outrage farming".
Posted Feb 5, 2023 20:51 UTC (Sun)
by flussence (guest, #85566)
[Link]
Posted Feb 2, 2023 20:28 UTC (Thu)
by akkornel (subscriber, #75292)
[Link] (13 responses)
For tags and releases, in addition to the existing downloads, have a CHECKSUMS file. Maybe call it CHECKSUMS.sha256 to say which algorithm was being used. The file would contain the checksums for the release artifacts (like installers), and also the auto-generated .zip and .tar.gz files.
Instead of having to cache .zip and .tar.gz files, GitHub would only have to cache the (presumably smaller) CHECKSUMS file. GitHub could make the convention that, instead of storing a static checksum in your CI, you store the URL to the CHECKSUMS file.
When a back-end change is made that could affect the checksums, GitHub would delete the CHECKSUMS file from their local cache. When an un-cached CHECKSUMS file is requested, GitHub would regenerate it, returning an HTTP 503 Service Unavailable error if needed, possibly with a Retry-After header.
This solution would not work for all downloads. For example, if you go a repo's main page, you can download the repo as a .zip file. That kind of download would not be covered by this.
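A sketch of what that could look like (names and hashes invented; the file would use the usual sha256sum format of one "<hash>  <filename>" line per artifact):

    # CI stores the URL of the CHECKSUMS file rather than hard-coded hashes.
    release_url="https://github.com/example/project/releases/download/v1.2.3"
    curl -sLo CHECKSUMS.sha256 "$release_url/CHECKSUMS.sha256"
    curl -sLO "$release_url/project-1.2.3.tar.gz"

    # Verify whichever listed artifacts were actually downloaded.
    sha256sum --ignore-missing -c CHECKSUMS.sha256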
Posted Feb 2, 2023 20:35 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (12 responses)
What should have been provided all this time is an "add source archives as artifacts" button in releases to pin them at that point. Still can. They can even go and hit it via an internal script on every (public?) release in the system just to ensure that there's a better URL than the `/archive/` endpoint from today forward.
Posted Feb 2, 2023 21:05 UTC (Thu)
by ceplm (subscriber, #41334)
[Link] (11 responses)
To quote my former colleague, GCC developer: “We are sorry that our compiler processed this turd which pretends to be a syntactically correct C program and generated assembly from it. It will never happen again.”
Posted Feb 2, 2023 21:51 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
FWIW, I agree that these links *should not* have been relied upon. Alas, they have been… It also seems to break every 3-5 years (2010 and 2017 at least). It's just that the amount of validation of these things today is…way more.
Posted Feb 3, 2023 2:55 UTC (Fri)
by himi (subscriber, #340)
[Link]
That's actually the most surprising thing, to me - as I posted in a reply to you up thread, I expect most of the build/ci/cd/whatever systems that are being hit by this have been stitched together from odds and ends that people/projects happened to find lying around. That's the kind of system you /wouldn't/ expect to be hit by checksum verification problems - random heisenbugs, silent data corruption, or worse, but not erroring out due to invalid checksums.
We should definitely be thankful that all those warnings about the importance of verifying the integrity of your inputs have sunk in enough that we /can/ hit this problem . . .
Posted Feb 3, 2023 13:30 UTC (Fri)
by ceplm (subscriber, #41334)
[Link] (5 responses)
I was working for Red Hat, now I am working for SUSE, and of course all our tarballs were and are cached and stored inside our storage. Anything else is just loony.
Posted Feb 3, 2023 13:52 UTC (Fri)
by gioele (subscriber, #61675)
[Link] (4 responses)
Debian does the same. When a package is uploaded an "orig tarball" is uploaded alongside it. From that point on that becomes the trusted source.
For many Debian packages the original tarballs, the URLs, and even the domains are long gone. These packages can currently be rebuilt from scratch only because of these "cached" orig tarballs.
Posted Feb 3, 2023 15:02 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
Posted Feb 3, 2023 20:51 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
[1]: https://opensource.google/documentation/reference/thirdparty
Posted Feb 7, 2023 9:52 UTC (Tue)
by paulj (subscriber, #341)
[Link]
So it's pretty clear when internal code depends on external code, cause there'll be a "//third-party2/blah/..." dependency in the build spec for it.
It was pretty clean and neat I have to say.
That said, it is possible to have modified versions of external code checked into fbcode, but this is discouraged and obviously requires additional approval (beyond what's needed for the third-party repo).
Posted Feb 4, 2023 10:03 UTC (Sat)
by pabs (subscriber, #43278)
[Link]
Posted Feb 3, 2023 9:14 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link] (2 responses)
Yesterday there were 773,835,075. Nobody uses caching it seems.
Posted Feb 3, 2023 10:28 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (1 responses)
Posted Feb 9, 2023 15:56 UTC (Thu)
by kpfleming (subscriber, #23250)
[Link]
This makes jobs take longer and also puts massive pressure on the package repositories. For my own projects I build a custom image with all of the dependencies and tools pre-installed so that CI runs don't exhibit this behavior, but this takes time and effort to do.
Posted Feb 2, 2023 20:36 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (4 responses)
Contrast with zstd, which often changes its compression code for better compression or performance (or both); a new version of zstd will usually have a different output than an older version, at the same compression level. This can be felt when using delta RPMs for instance on Fedora: since Fedora currently uses zstd for its RPM packages, whenever they update the zstd library, the reconstruction of the full RPM from the delta RPM starts to fail (and it has to go back and download the full RPM) for a while, until both sides are again using the same zstd release.
Posted Feb 2, 2023 21:15 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Posted Feb 2, 2023 23:26 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Posted Feb 3, 2023 0:13 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
Posted Feb 3, 2023 1:50 UTC (Fri)
by himi (subscriber, #340)
[Link]
Presumably for the same reason that the `--rsyncable` flag was added to gzip - making it easier/more efficient to synchronise collections of binary files. Which is what's happening when you pull down distro updates - it's a very different problem to the one this article is about, which is all about the state of a git repo at a given point in time.
Posted Feb 2, 2023 20:43 UTC (Thu)
by david.a.wheeler (subscriber, #72896)
[Link]
xkcd 1172 (Workflow)
Posted Feb 3, 2023 6:38 UTC (Fri)
by pabs (subscriber, #43278)
[Link]
http://joeyh.name/code/pristine-tar/ https://joeyh.name/blog/entry/generating_pristine_tarball...
Posted Feb 3, 2023 7:47 UTC (Fri)
by mb (subscriber, #50428)
[Link] (9 responses)
But it's just plain wrong to do that for on-the-fly generated archives.
These build systems should never have depended on on-the-fly generated archives. They should have used released archives (maybe plus patches). They should have cached the archives on a server they control. What if github goes down?
Having everything on somebody else's server is a very very bad trend. This trend will hit us in the face, in the foreseeable future. (Anybody who was not affected by the worldwide downtime of the MS cloud a couple of days ago?)
Posted Feb 3, 2023 11:09 UTC (Fri)
by cortana (subscriber, #24596)
[Link] (8 responses)
"These build systems should never have depended on on-the-fly generated archives." You're not wrong, but to pick the latest release of a random GitHub project: NetBox 3.4.4's only artefacts are these on-the-fly generated archives. So consumers of this content have no choice but to use them...
Posted Feb 3, 2023 11:25 UTC (Fri)
by bof (subscriber, #110741)
[Link] (7 responses)
I get that for the largest projects, that becomes a bit undesirable, but for other stuff, where's the problem?
Posted Feb 3, 2023 11:33 UTC (Fri)
by cortana (subscriber, #24596)
[Link] (2 responses)
(Admittedly I believe there's an option to git-clone which does this, only last time I used it I don't think it worked.)
The philosophical reason is a strongly held belief that a software release is a thing with certain other things attached to it (release notes, a source archive, maybe some binary archives). Once created, those artefacts are immutable.
If a software project isn't doing that then it's not a mature project doing release management, it's a developer chucking whatever works for them over the wall. Which is fine, most project start that way; but we've all been taking advantage of the convenience of GitHub's generated-on-the-fly source archives, instead of automating the creation of these source archives as part of our release processes and attaching them to GitHub releases.
As another poster said, for projects which _do_ do that they then have the problem that the GitHub 'source archive' links can't be removed, so now users have to learn "don't click on the obvious link to get the source code, instead you have to download this particularly-named archive attached to the release and ignore those other ones". GitHub really needs a setting that a project can set to get rid of those links!
Posted Feb 3, 2023 11:52 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
There is. I think it's called a shallow clone. Something like --depth=1.
But last I heard, for a lot of projects, the size of the shallow clone is actually a *large* percentage of the full archive.
Cheers,
Wol
Posted Feb 3, 2023 17:02 UTC (Fri)
by cortana (subscriber, #24596)
[Link]
Posted Feb 3, 2023 15:10 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
When releasing, you *like* working with dead dumb archives at the end of a curl-able URL, with a *single* version of all the files you are releasing, and a *single* license attached to this archive (the multi-vendored repos devs so love are a legal soup which is hell to release without legal risks).
Posted Feb 3, 2023 15:29 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (1 responses)
And the reason we're seeing pain here is that people are not actually working with "dead dumb archives" - the people being hurt are working with archives that are produced on-demand by git archive, and have been assuming that they are dumb archives, not the result of a computation on a git repo.
Basically, they've got something that's isomorphic to a git shallow clone at depth 1, but they thought they had a dumb archive. Oops.
Posted Feb 3, 2023 16:32 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
The lingua franca of release management is dumb archives, because devs like to move from svn to hg to git to whatever, or even not source-control bulky binary files (images, fonts, music, whatever), so anything semi-generic will curl archives from a list of URLs. And if github only provides reliably (for dubious values of reliably) generated archives, that's what people will use.
Posted Feb 3, 2023 15:56 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Various build system generators support specifying a git commit as dependency and doing a git shallow clone to obtain it.
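A sketch of that approach (tag and URL invented): fetch just the commit of interest and verify the commit ID itself, rather than an archive checksum:

    # Fetch only the tagged commit, with no history.
    git clone --depth=1 --branch v1.2.3 https://github.com/example/project.git
    cd project

    # The commit hash pins the exact tree contents, independent of any
    # archive or compression format.
    git rev-parse HEAD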
Posted Feb 3, 2023 9:46 UTC (Fri)
by Flameeyes (guest, #51238)
[Link]
Posted Feb 4, 2023 19:09 UTC (Sat)
by robert_s (subscriber, #42402)
[Link]
Posted Feb 14, 2023 11:17 UTC (Tue)
by nolange (guest, #156796)
[Link]
There is a feature used in Debian's gbp tooling called "pristine tar". What this does is allow an upstream tar archive to be re-created, storing the options and applying a binary diff if necessary.
I don't think it goes far enough yet; it would need to add support for multiple "historically important" iterations of gzip (and other algorithms).
If required, the functionality should be added there: check in a small configuration file for "pristine tar" specifying the gzip implementation/compression used. At that point github (or other websites) could invoke this tool instead to create a reproducible artifact.
If the file is missing, archive generation could be randomized (if commit date is newer than the date the feature got introduced) to violently remind users to either not depend on fixed hashes or add configuration to freeze the archive generation.
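For reference, a sketch of how pristine-tar is typically used (paths invented): it stores a small delta against the contents already in the repository, from which the exact original tarball can be regenerated later:

    # Record enough information to recreate the exact upstream tarball.
    pristine-tar commit ../project_1.2.3.orig.tar.gz upstream/1.2.3

    # Later, regenerate a byte-identical copy from the repository alone.
    pristine-tar checkout ../project_1.2.3.orig.tar.gz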