Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:31 UTC (Thu) by kilobyte (subscriber, #108024)
In reply to: Git archive generation meets Hyrum's law by epa
Parent article: Git archive generation meets Hyrum's law

For example, the gzip behaviour precludes any parallelization. Computers hardly get any faster single-threaded; the MASSIVE improvements are in core counts, vectorization, etc. Thus even if we stick with the ancient gzip format, we should go with pigz instead. But even that would break the holy github [tar]balls.
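As a rough sketch of what that swap looks like (my own illustration, not from the thread; the directory name is hypothetical, and GNU tar plus pigz are assumed to be installed):

    # classic single-threaded gzip
    tar -cf - linux-6.2/ | gzip -9 > linux-6.2.tar.gz

    # parallel gzip via pigz: same .gz format, but generally not bit-identical output
    tar -cf - linux-6.2/ | pigz -9 > linux-6.2.tar.gz

The second pipeline produces a file any gzip can decompress, but its exact bytes differ from single-threaded gzip's output, which is precisely the compatibility problem under discussion.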



Git archive generation meets Hyrum's law

Posted Feb 6, 2023 7:54 UTC (Mon) by epa (subscriber, #39769) (2 responses)

I think that's fine. Tarballs and reproducible build artefacts can use the slower, reproducible compression. It will still be more than fast enough on modern hardware, and in any case the time to compress the tarball is dwarfed by the time to create it. And it decompresses just as quickly. For cases when getting the exact same bytes doesn't matter, you can use a different implementation of gzip, or more likely you'd use a different compression scheme like zstd.
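A hedged aside: one possible spelling of that last option, assuming zstd is installed and using a hypothetical project/ tree:

    # multi-threaded zstd, for cases where byte-for-byte reproducibility isn't required
    tar -cf - project/ | zstd -T0 -19 > project.tar.zst

Here -T0 lets zstd use all available cores, and -19 picks a high compression level.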

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 10:35 UTC (Mon) by farnz (subscriber, #17727) (1 response)

Or we could go one better; while making the compressor deterministic is hard, making the uncompressed form deterministic is not (when uncompressed, it's "just" a case of ensuring that everything is done in deterministic order). We then publish a checksum of the uncompressed form, and ship a compressed artefact whose exact bytes don't need to be checksummed at all.
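A minimal sketch of that split, using git archive and a hypothetical tag name:

    # checksum the deterministic, uncompressed tar stream
    git archive --format=tar v1.0 | sha256sum

    # compress the same stream for distribution; these bytes need not stay stable
    git archive --format=tar v1.0 | gzip > project-v1.0.tar.gz

The published checksum covers the first pipeline's output, so the compressor (and its version) is free to change.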

Note in this context that HTTP supports compression on the wire (Content-Encoding, or hop-by-hop Transfer-Encoding): the transferred bytes can be compressed while the resource being served, and checksummed, remains the uncompressed data. And you can save the compressed form, so that you don't waste disk space - or even recompress at a higher compression level if suitable.
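From the client side that could look something like this (a sketch only; the URL is hypothetical, and the server is assumed to offer gzip content coding):

    # request gzip on the wire, let curl undo it, checksum the uncompressed bytes
    curl -sS --compressed https://example.org/project-v1.0.tar | sha256sum

curl's --compressed asks for a compressed response and transparently decodes it, so the checksum is computed over the same uncompressed bytes the publisher checksummed.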

Git archive generation meets Hyrum's law

Posted Mar 25, 2023 12:47 UTC (Sat) by sammythesnake (guest, #17693)

If the archive contains, say, a 100TB file of zeros, then you'd end up filling your hard drive before any opportunity to checksum it that way. If the archive is compressed before downloading, there's at least the option of doing something along the lines of zcat blah.tgz | sha...
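Spelling that pipeline out (the choice of sha256sum is my assumption; the comment only says "sha..."):

    # stream-decompress and checksum without ever writing the uncompressed data to disk
    zcat blah.tgz | sha256sum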

