OCI zstd

Posted May 14, 2025 23:02 UTC (Wed) by tianon (subscriber, #98676)
Parent article: The future of Flatpak

> the OCI container standard has added zstd:chunked support

Just to be clear, the OCI has standardized support for zstd in general, but the clever zstd:chunked tricks are a podman-ecosystem-specific format (one that any zstd implementation should be able to read, thanks to the way it hides the extra data in the chunking).



OCI zstd

Posted May 14, 2025 23:35 UTC (Wed) by vasi (subscriber, #83946)

Wow, reading the code, this format is fascinating and weird. Tar headers converted to JSON, compressed, and stored in a skippable frame! Interesting that this is done at the file level, rather than just using zstd's rsyncable mode.
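
To make the skippable-frame trick concrete, here's a minimal Go sketch. The magic-number range (0x184D2A50 through 0x184D2A5F) and the frame layout come from the zstd format spec (RFC 8878); the JSON payload is just a stand-in, not zstd:chunked's actual manifest format:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // writeSkippableFrame appends a zstd skippable frame wrapping payload.
    // Any conforming zstd decoder will skip right over it.
    func writeSkippableFrame(buf *bytes.Buffer, payload []byte) {
        binary.Write(buf, binary.LittleEndian, uint32(0x184D2A50)) // skippable magic
        binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
        buf.Write(payload)
    }

    func main() {
        var out bytes.Buffer
        // In zstd:chunked this would be the compressed JSON of tar headers.
        writeSkippableFrame(&out, []byte(`{"entries": "..."}`))
        fmt.Printf("% x\n", out.Bytes())
    }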

OCI zstd

Posted May 15, 2025 15:39 UTC (Thu) by nliadm (subscriber, #94000)

"Rsyncable" output (as I understand it) is mostly about making sure rsync's blocking and the compressor's output line up as much as possible. For example, if I have a tar and append to it, it's valid for the compressor to encode "old" parts of the tar in a different way. "Rsyncable" output makes sure the "old" parts encode to the same rsync blocks.

The "zstd:chunked" and "estargz" schemes don't want stably-blocked output, they want random access to individual tar members. This means each member needs to be a complete output, which plays nicely with zstd and gzip's ability to be concatenated.

OCI zstd

Posted May 15, 2025 16:36 UTC (Thu) by vasi (subscriber, #83946)

Yeah, it seems there are a few different goals coming together here.

If it were just fast updates to container images, rsyncable (plus something like xdelta) would be sufficient.

If it were just partial fetches (i.e., fast access to individual files), we wouldn't really need to make each member independently compressed, losing much of our compression ratio on small files. You just need framed compression, so you can jump to the beginning of a _block_, and a file index, so you know which blocks hold which files. This is basically what I built in pixz. It's generally fast enough to just grab the whole block containing a small file, without losing the compression advantages of reasonable block sizes.
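
A sketch of that block-index idea in Go; the types and names here are hypothetical, not pixz's actual on-disk format:

    package main

    import "fmt"

    // blockEntry locates one independently decompressible block.
    type blockEntry struct {
        compressedOff   int64 // where the block starts in the archive
        uncompressedOff int64 // where its output starts in the logical tar stream
        uncompressedLen int64
    }

    // fileEntry records where a tar member lives in the uncompressed stream.
    type fileEntry struct {
        name string
        off  int64
        size int64
    }

    // blocksFor returns the blocks that must be decompressed to read f.
    func blocksFor(blocks []blockEntry, f fileEntry) []blockEntry {
        var need []blockEntry
        for _, b := range blocks {
            if b.uncompressedOff < f.off+f.size &&
                b.uncompressedOff+b.uncompressedLen > f.off {
                need = append(need, b)
            }
        }
        return need
    }

    func main() {
        blocks := []blockEntry{
            {compressedOff: 0, uncompressedOff: 0, uncompressedLen: 1 << 20},
            {compressedOff: 400000, uncompressedOff: 1 << 20, uncompressedLen: 1 << 20},
        }
        f := fileEntry{name: "etc/hosts", off: 1048000, size: 2000} // straddles both blocks
        fmt.Println(len(blocksFor(blocks, f)), "block(s) to decompress")
    }

Reading a small file then costs at most a block or two of decompression, instead of decompressing from the start of the stream.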

But if we also specifically need deduplication, even across entirely unrelated images, then I guess we really do need to have independent compression of files, like zstd:chunked does.

It just feels a bit unfortunate to have invented a bespoke ZIP-like archive format, whose only implementation is within `containers/storage`. I think 7zip has zip + zstd working nowadays, which would feel cleaner to me.

OCI zstd

Posted May 15, 2025 17:36 UTC (Thu) by excors (subscriber, #95769)

Tangentially related to this, I recently looked into 'gzip --rsyncable', which periodically flushes the Deflate compression state: it aligns its output to a byte boundary (normally it's bit-based) and starts a new Deflate block (which resets all the Huffman trees etc.). That means the compressed output bytes only depend on the uncompressed input from that point onward, plus (I think) the preceding 32KB input window, but not all the older input.

I believe "periodically" means "if (sum of the last 4096 bytes) % 4096 == 0" (rounded up to the end of a string match), which incidentally is a very poor checksum that makes it pretty inefficient at compressing long sequences of a single byte (e.g. 1MB of /dev/zero compresses to 30KB, whereas 1MB of a repeated two-byte pattern compresses to 1KB). Anyway, it means that changing one byte in the middle of the uncompressed input should only affect the next <36KB of compressed output, so rsync's blocks should get back in sync soon afterwards.

Unfortunately, since (I think) the flushing *doesn't* prevent new Deflate blocks from referring to old data in the 32KB window, and a decompressor can only reconstruct that window by decompressing old Deflate blocks (which recursively depend on all data back to the start of the file), you can't use this to start decoding from the middle of a gzip --rsyncable file. You can (even without --rsyncable) construct a separate index file containing a subset of the block-boundary positions and a copy of the 32KB window at each boundary, and use that to support reasonably efficient seeking to arbitrary positions within the compressed file. I've written some code to do that, but it's a bit awkward compared to a compressed file format with native support for random access.

(I'm not sure of the details of 'zstd --rsyncable' but it does look a bit more sensible than gzip's implementation - at least it's got a proper checksum function.)

OCI zstd

Posted May 15, 2025 18:50 UTC (Thu) by vasi (subscriber, #83946)

Yeah, that's correct. My lzopfs project can do random access into gzip files, but it needs to search for blocks that are truly independent. Sometimes it doesn't find enough and needs to store the window data in an index.

You said you've written code to deal with this before; I'm curious where! Would love to see how others have dealt with these issues.

Zstd unfortunately works similarly to gzip here: even with rsyncable, each block depends on the previous window. But it at least has a multi-frame format specification, with multiple independent implementations: zstd's contrib dir, zstd-seekable-format-go, t2sz, and maybe more.
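
The multi-frame idea boils down to a seek table mapping uncompressed offsets to frames. A Go sketch, loosely modeled on the seekable format in zstd's contrib dir (the names here are made up):

    package main

    import "fmt"

    // frameInfo is one seek-table entry: the sizes of a single zstd frame.
    type frameInfo struct {
        compressedSize   uint64
        decompressedSize uint64
    }

    // frameFor maps an uncompressed offset to the index and compressed
    // offset of the frame containing it, so a reader can seek straight
    // there and decompress only that one frame.
    func frameFor(table []frameInfo, off uint64) (int, uint64) {
        var cOff, dOff uint64
        for i, f := range table {
            if off < dOff+f.decompressedSize {
                return i, cOff
            }
            cOff += f.compressedSize
            dOff += f.decompressedSize
        }
        return -1, 0
    }

    func main() {
        table := []frameInfo{{1000, 4096}, {1200, 4096}, {900, 4096}}
        idx, at := frameFor(table, 9000)
        fmt.Println("frame", idx, "at compressed offset", at) // frame 2 at 2200
    }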

Xz is really my favorite here, since in multi-threaded mode (which is on by default nowadays) it creates completely independent blocks. Yes, it gives up a tiny bit of compression ratio, but it enables both random access and parallel DEcompression.

OCI zstd

Posted May 15, 2025 19:54 UTC (Thu) by excors (subscriber, #95769)

Mine is at https://crates.io/crates/indexed_deflate . It doesn't do anything particularly exciting: it uses miniz_oxide for decompression, which I patched to stop at block boundaries and expose some internal state, so all it needs to store in the index is the current bit offset and the window content. (I wanted this for a specific set of third-party non-rsyncable gzip files, and was okay with ~1MB granularity for seeking, so I didn't try to optimise it for the case of independent blocks.)
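
So each entry in such an index boils down to something like this (sketched in Go to match the other examples in this thread; indexed_deflate itself is Rust, and the uncompressedOff field is my assumption about what a seek lookup needs):

    package main

    import "fmt"

    // seekPoint is roughly what each index entry must hold: where a block
    // boundary sits in the bit stream (Deflate isn't byte-aligned) plus
    // the 32KB of history a decoder needs in order to resume there.
    type seekPoint struct {
        bitOffset       uint64      // block boundary, in bits from the start
        uncompressedOff uint64      // assumed: used to pick the right entry
        window          [32768]byte // LZ77 window content at that boundary
    }

    func main() {
        var index []seekPoint
        // A real reader would binary-search index by uncompressedOff, seek
        // to bitOffset, preload window, and resume decompression there.
        fmt.Println(len(index), "seek points")
    }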

OCI zstd

Posted May 17, 2025 6:20 UTC (Sat) by tianon (subscriber, #98676)

> Would love to see how others have dealt with these issues.

A friend of mine wrote https://github.com/jonjohnsonjr/targz, which is essentially extracted from the code that powers the layer browsing functionality of https://oci.dag.dev/ (https://github.com/jonjohnsonjr/dagdotdev). 👀

My understanding of oci.dag.dev is that he creates an index of the tar inside the stream (without modifying the original compression in any way). Then he gets clever and stores that in a tar.gz so that if the *index* gets too big, he can make a map of the index too and just recurse.

(However, my own understanding of the details is very surface level, so if I've got the details wrong maybe he'll finally make an account just to correct me! ❤️)

