
Git archive generation meets Hyrum's law

Posted Feb 2, 2023 17:16 UTC (Thu) by ballombe (subscriber, #9523)
Parent article: Git archive generation meets Hyrum's law

There is no reason the internal gzip implementation cannot produce the same output as the external one, really.



Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:06 UTC (Thu) by epa (subscriber, #39769) [Link] (4 responses)

I think we need a 'canonical compressed form' for zlib / gzip / zip compressed data. It would correspond roughly to gzip -9: fixing the compression heuristics, such as how far to look back in the sliding window for a match, and possibly some of the choices in the Huffman coding (for example, which codes to assign when two sequences are equally probable). With today's processing power, the trade-off between compression speed and compressed size doesn't really matter, nor does squeezing out the last few bytes; you just pick a fixed set of parameters that's easy to implement.

For cryptographic applications you could, on decompressing, additionally check that the data really was in canonical compressed form (just re-compress it and compare). That way you have a one-to-one mapping between input data and compressed output, not one-to-many as now.
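
A minimal sketch of that re-compress-and-compare check, assuming stock gzip with pinned flags (-9 for the fixed parameters, -n to strip the embedded file name and timestamp) stands in for the hypothetical canonical compressor:

    # Verify that archive.tar.gz is in "canonical" form: decompress,
    # re-compress with the pinned parameters, compare byte-for-byte.
    gzip -dc archive.tar.gz | gzip -9n | cmp -s - archive.tar.gz \
        && echo canonical || echo 'not canonical'

The check only passes when the archive was produced by the same implementation with the same settings, which is exactly the property a canonical form would pin down.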

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:31 UTC (Thu) by kilobyte (subscriber, #108024) [Link] (3 responses)

For example, the gzip behaviour precludes any parallelization. Computers hardly get any faster single-threaded; the MASSIVE improvements are in core counts, vectorization, etc. Thus even if we stick with the ancient gzip format, we should go with pigz instead. But even that would break the holy github [tar]balls.
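
To see why that breaks things (big.tar is a hypothetical input): pigz deflates independent blocks in parallel, so its output bytes differ from single-threaded gzip's even though both decompress to the identical stream:

    # Same input, same nominal format, different compressed bytes:
    gzip -9nc big.tar > single.tar.gz
    pigz -9nc big.tar > parallel.tar.gz
    cmp single.tar.gz parallel.tar.gz     # the .gz files differ
    # ...yet both decompress back to the original bytes:
    zcat single.tar.gz | sha256sum
    zcat parallel.tar.gz | sha256sum      # identical digests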

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 7:54 UTC (Mon) by epa (subscriber, #39769) [Link] (2 responses)

I think that's fine. Tarballs and reproducible build artefacts can use the slower, reproducible compression. It will still be more than fast enough on modern hardware, and in any case the time to compress the tarball is dwarfed by the time to create it. And it decompresses just as quickly. For cases when getting the exact same bytes doesn't matter, you can use a different implementation of gzip, or more likely you'd use a different compression scheme like zstd.

Git archive generation meets Hyrum's law

Posted Feb 6, 2023 10:35 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Or we could go one better; while making the compressor deterministic is hard, making the uncompressed form deterministic is not (when uncompressed, it's "just" a case of ensuring that everything is done in deterministic order). We then checksum the uncompressed form, and ship a compressed artefact without checksums.
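
A sketch of that scheme using git itself (the tag name v1.0 is hypothetical): git archive's tar output is stable for a given commit, so the checksum covers the tar bytes and the gzip step becomes a mere transport detail:

    # Checksum the deterministic uncompressed stream...
    git archive --format=tar v1.0 | sha256sum
    # ...and ship a compressed artefact alongside; its exact bytes no
    # longer matter, because nothing checksums the .gz itself.
    git archive --format=tar v1.0 | gzip -9n > release-1.0.tar.gz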

Note in this context that HTTP supports compressed transfer codings (Transfer-Encoding, or in practice Content-Encoding): so we can compress on the wire, while still logically transferring and checksumming uncompressed data. And you can save the compressed form, so that you don't waste disk space - or even recompress to a higher compression level if suitable.
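
For instance (URL hypothetical), curl negotiates the compression and strips it off again before the bytes reach the checksum:

    # The server may gzip the response on the wire; --compressed makes
    # curl advertise and then undo that, so the digest covers the
    # uncompressed representation:
    curl -sS --compressed https://example.org/release-1.0.tar | sha256sum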

Git archive generation meets Hyrum's law

Posted Mar 25, 2023 12:47 UTC (Sat) by sammythesnake (guest, #17693) [Link]

If the archive contains, say, a 100TB file of zeros, then you'd end up filling your hard drive before any opportunity to checksum it that way. If the archive is compressed before downloading, there's at least the option of doing something along the lines of zcat blah.tgz | sha...
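
Spelled out (file name hypothetical, sha256sum as an example digest), the streaming check never touches the disk:

    # Decompress and checksum in one pipe; the 100TB of zeros flow
    # through memory without ever being written anywhere:
    zcat blah.tgz | sha256sum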

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 18:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (3 responses)

Sure. But what does that mean for Git reimplementations (JGit, gitoxide, etc.)? If everything is going to pin things to GNU gzip behavior, that should be documented (and probably ported to the BSD utilities and whatever other implementations exist). And that doesn't help bzip2, xz, or zstd at all.

Git archive generation meets Hyrum's law

Posted Feb 2, 2023 19:26 UTC (Thu) by ballombe (subscriber, #9523) [Link] (2 responses)

All I am saying is that you can use an internal gzip without breaking checksums, so this is a false dichotomy.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 1:31 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (1 responses)

Not unless you just import the gzip source code directly.

There are a near-infinite number of compressed gzip/deflate/etc. bitstreams that decode to the same output.

That's the very nature of these compression formats: they only define how to decompress, and the compressor can use whatever techniques it wants to build a valid bitstream.
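
A quick demonstration of that many-to-one mapping, using nothing but stock gzip at two different levels:

    # Two different compressed bitstreams for the same input...
    seq 1 100000 | gzip -1n | sha256sum
    seq 1 100000 | gzip -9n | sha256sum   # different digest
    # ...that decode back to identical data:
    seq 1 100000 | gzip -9n | zcat | sha256sum
    seq 1 100000 | sha256sum              # same digest as the line above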

Defining compression based on the compressor is, frankly, lunacy.

Git archive generation meets Hyrum's law

Posted Feb 3, 2023 22:45 UTC (Fri) by ballombe (subscriber, #9523) [Link]

> Not unless you just import the gzip source code directly.
So what? The gzip source code is readily available. This is not an obstacle.

Git archive generation meets Hyrum's law

Posted Feb 7, 2023 9:39 UTC (Tue) by JanC_ (guest, #34940) [Link]

But there is no guarantee that gzip will always produce the same output either. If upstream gzip ever decides to change the default compression level from -6 to -7, or to change the exact parameters behind -6, that would have the same effect of breaking all those systems that currently depend on the output of gzip not changing.
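
Pinning the flags explicitly at least removes the default-level risk (file names hypothetical), though the heuristics behind a given level remain an unstable interface:

    # Don't rely on the default (-6 today): spell out the level and
    # strip the embedded name/timestamp so repeated runs match...
    gzip -6nc release.tar > release.tar.gz
    # ...but nothing in the format freezes what -6 does internally, so
    # a future gzip could still emit different bytes.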

