
Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:06 UTC (Thu) by epa (subscriber, #39769)
In reply to: Git archive generation meets Hyrum's law by ballombe
Parent article: Git archive generation meets Hyrum's law

I think we need a 'canonical compressed form' for zlib / gzip / zip compressed data, corresponding roughly to gzip -9. By that I mean fixing the compression heuristics, such as how far to look back in the sliding window for a match, and possibly some choices in the Huffman coding (for example, which codes to assign when two sequences are equally probable). With today's processing power, the tradeoff between compression speed and compressed size doesn't really matter, nor does squeezing out the last few bytes; you just pick a fixed set of parameters that's easy to implement. For cryptographic applications you could, on decompressing, do an additional check that the data was indeed in canonical compressed form (just re-compress it and check). That way you have a one-to-one mapping between input data and compressed output, rather than the one-to-many mapping we have now.
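A minimal sketch of the idea, using Python's zlib, with one arbitrary parameter set standing in for the hypothetical canonical profile (the names are mine; a real standard would also have to pin down the deflate implementation itself, since match-finding heuristics vary between zlib versions):

    import zlib

    # Hypothetical "canonical" parameters: level 9, gzip framing,
    # default strategy. A real standard would pin these precisely.
    CANON_LEVEL = 9
    CANON_WBITS = 31   # 16 + 15: gzip container, maximum window

    def canonical_compress(data: bytes) -> bytes:
        # zlib writes a fixed gzip header (mtime 0), so for a given
        # zlib build this is a deterministic function of the input.
        co = zlib.compressobj(CANON_LEVEL, zlib.DEFLATED, CANON_WBITS)
        return co.compress(data) + co.flush()

    def verify_canonical(compressed: bytes) -> bytes:
        # Decompress, then re-compress and demand byte identity:
        # the "additional check" for cryptographic applications.
        data = zlib.decompress(compressed, CANON_WBITS)
        if canonical_compress(data) != compressed:
            raise ValueError("not in canonical compressed form")
        return data

With that check in place there is exactly one acceptable compressed representation per input, restoring the one-to-one mapping.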



Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:31 UTC (Thu) by kilobyte (subscriber, #108024)

For example, the gzip behaviour precludes any parallelization. Computers hardly get any faster single-threaded; the MASSIVE improvements are in core counts, vectorization, etc. Thus even if we stick with the ancient gzip format, we should go with pigz instead. But even that would break the holy github [tar]balls.
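To illustrate the tension (a sketch, not pigz itself): compressing in independently flushed blocks, which is roughly what any parallel compressor must do, produces different bytes than one-shot compression, even though both decompress to identical data:

    import zlib

    data = bytes(range(256)) * 4096   # 1 MiB of sample input

    # One-shot compression, as classic single-threaded gzip does.
    single = zlib.compress(data, 9)

    # Block-wise compression with full flushes between blocks, which
    # is roughly the per-chunk structure a parallel compressor needs.
    co = zlib.compressobj(9)
    parts = []
    for i in range(0, len(data), 128 * 1024):
        parts.append(co.compress(data[i:i + 128 * 1024]))
        parts.append(co.flush(zlib.Z_FULL_FLUSH))
    parts.append(co.flush())
    chunked = b"".join(parts)

    assert zlib.decompress(single) == zlib.decompress(chunked) == data
    assert single != chunked   # same input, different compressed bytes

Anything that checksums the compressed bytes, rather than the contents, breaks the moment that block structure changes.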

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 7:54 UTC (Mon) by epa (subscriber, #39769)

I think that's fine. Tarballs and reproducible build artefacts can use the slower, reproducible compression. It will still be more than fast enough on modern hardware, and in any case the time to compress the tarball is dwarfed by the time to create it. And it decompresses just as quickly. For cases where getting the exact same bytes doesn't matter, you can use a different implementation of gzip, or more likely a different compression scheme such as zstd.

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 10:35 UTC (Mon) by farnz (subscriber, #17727)

Or we could go one better: while making the compressor deterministic is hard, making the uncompressed form deterministic is not (uncompressed, it's "just" a case of ensuring that everything is emitted in a deterministic order). We then checksum the uncompressed form, and ship a compressed artefact whose bytes don't themselves need to be checksummed.

Note in this context that HTTP supports compression at the transfer layer (the Content-Encoding and Transfer-Encoding mechanisms): so we can compress for transfer, while still transferring and checksumming uncompressed data. And you can save the compressed form so that you don't waste disk space, or even recompress to a higher compression level if suitable.
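A rough sketch of that split (the helper names and the choice of gzip for the stored form are mine, purely illustrative): the checksum covers the uncompressed stream, and verification decompresses on the fly, so the compressed bytes are free to vary or be regenerated:

    import gzip
    import hashlib

    def pack(src_path: str, dst_path: str) -> str:
        # Compress src into dst; return the checksum of the
        # *uncompressed* bytes, which is what gets published.
        h = hashlib.sha256()
        with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
            while chunk := src.read(1 << 20):
                h.update(chunk)    # hash the raw data
                dst.write(chunk)   # compression is a storage detail
        return h.hexdigest()

    def verify(archive_path: str, expected: str) -> None:
        # Re-derive the checksum by streaming decompression; the
        # on-disk bytes may differ between compressors or levels.
        h = hashlib.sha256()
        with gzip.open(archive_path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        if h.hexdigest() != expected:
            raise ValueError("contents do not match checksum")

Recompressing the archive with a different tool or level leaves verify() happy, because only the contents are pinned.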

Git archive generation meets Hyrum's law

Posted Mar 25, 2023 12:47 UTC (Sat) by sammythesnake (guest, #17693)

If the archive contains, say, a 100 TB file of zeros, then you'd fill your hard drive before you had any opportunity to checksum it that way. If the archive is compressed before downloading, there's at least the option of doing something along the lines of zcat blah.tgz | sha...
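The same streaming idea programmatically (a sketch assuming a single-member gzip stream): decompress in bounded bursts and hash as you go, so even a pathological archive never touches the disk or exhausts memory:

    import hashlib
    import zlib

    def sha256_of_gzip_stream(fileobj, chunk_size=1 << 20) -> str:
        # Hash the decompressed contents without materializing them.
        d = zlib.decompressobj(31)   # 31: accept the gzip container
        h = hashlib.sha256()
        while chunk := fileobj.read(chunk_size):
            data = chunk
            while data:
                # Cap each burst of output so a highly compressible
                # run (100 TB of zeros) can't balloon a single call.
                h.update(d.decompress(data, chunk_size))
                data = d.unconsumed_tail
        h.update(d.flush())
        return h.hexdigest()

Calling sha256_of_gzip_stream(open("blah.tgz", "rb")) is then the programmatic twin of the zcat | sha pipe.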

