
Integration into file formats.

Posted Jan 14, 2026 22:45 UTC (Wed) by jepsis (subscriber, #130218)
In reply to: Integration into file formats. by himi
Parent article: Format-specific compression with OpenZL

OpenZL is not a file format. It is a universal, self-describing compression layer, so a file format does not need to define its own compression scheme.
File format design is still important. OpenZL works on top of existing formats and removes the need for custom compression codecs while still exploiting the structure of the data.
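A toy sketch of what "exploiting the structure of the data" buys, using only the Python standard library (this is illustrative and is not OpenZL's actual API): byte-transposing an array of small 32-bit integers before handing it to a generic codec groups the mostly-zero high bytes into long runs, which compresses far better than the interleaved native layout.

```python
# Illustrative only: structure-aware preprocessing before a generic
# codec, in the spirit of what OpenZL automates. Not OpenZL's API.
import random
import struct
import zlib

random.seed(0)
values = [random.randrange(256) for _ in range(8192)]  # small 32-bit ints

# Native layout: little-endian 32-bit ints, so each low byte is
# followed by three zero bytes, interleaved throughout the buffer.
interleaved = struct.pack("<%dI" % len(values), *values)

# Structure-aware layout: all byte-0s, then all byte-1s, and so on,
# so the three all-zero byte planes become long uniform runs.
transposed = b"".join(interleaved[i::4] for i in range(4))

plain = len(zlib.compress(interleaved, 9))
shuffled = len(zlib.compress(transposed, 9))
print(plain, shuffled)  # the transposed layout compresses much smaller
```

The compressor is identical in both cases; only knowledge of the data's structure changed the layout, and with it the result.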



Integration into file formats.

Posted Jan 15, 2026 2:19 UTC (Thu) by himi (subscriber, #340)

Yes, I was suggesting that the lessons learned from creating data-format-specific compression logic could feed into designing file formats that incorporate that logic from the start, with the various OpenZL components being used in the actual implementation of the tooling for the new file formats. So rather than taking an existing '.foo' file, compressing it with a foo-specific profile, and storing the compressed stream in a '.foo.zl' file, you'd incorporate the foo-specific compression logic into the libfoo library (with your implementation making use of OpenZL components), and create a new '.fooz' file format that directly integrates the compressed stream(s) of data. After all, since libfoo is obviously specialised in handling this particular type of data, it seems like a good place to put specialised knowledge about how best to handle compressing that data - at least, in a world where there's tooling that can make it relatively easy to do that.

You can do something similar as it stands with existing compression libraries, but it's a lot of work for not much gain over using a general tool for whole-file compression. What the OpenZL project brings is a body of knowledge about data compression in general that can be used to inform the way you set up your data streams to allow the best possible results, and a bunch of code that makes it easy to create a highly specialised compression pipeline - if that gets you something two-thirds the size of the old '.foo.gz' files that can be compressed and decompressed in half the time, it may well be worth the effort.

The body of knowledge could also feed more broadly into file format design choices - if laying your data out one way versus another costs you (say) 10% in zstd-compressed file size, that's useful to know even if you're not going to try to build a super-specialised compression tool. As far as I know, that sort of knowledge base doesn't exist at present.

Integration into file formats.

Posted Jan 15, 2026 2:48 UTC (Thu) by jepsis (subscriber, #130218)

Automatic decompression for such a file format is easy. Compression is the hard part. To choose an efficient representation you need clear intent, i.e. how the data is expected to be used (streaming, random access, read-heavy, write-heavy), what the lifecycle looks like (archival or live data, and whether recompression is expected), and how the data is structured internally (schema, value distributions, chunking, ordering). Without this information, any attempt to choose compression automatically is mostly guesswork and likely ends with a suboptimal result.
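The point about intent can be sketched with standard-library codecs (the intent vocabulary and policy below are invented for illustration, not from OpenZL): the declared intent, which the raw bytes alone don't carry, is what selects the codec.

```python
# Hypothetical sketch: mapping a declared usage intent to a codec
# choice. The intent names and the policy are made up; the point is
# that this decision needs information the bytes themselves lack.
import lzma
import zlib

def pick_codec(intent):
    if intent == "archival":
        # Written once, read rarely: spend CPU for the best ratio.
        return lzma.compress, lzma.decompress
    if intent == "write-heavy":
        # Live, frequently rewritten data: cheap and fast.
        return (lambda d: zlib.compress(d, 1)), zlib.decompress
    # No stated intent: a middle-of-the-road default.
    return zlib.compress, zlib.decompress

payload = b"timestamp=1736900000;status=ok;" * 4000
for intent in ("archival", "write-heavy", "unknown"):
    compress, decompress = pick_codec(intent)
    blob = compress(payload)
    assert decompress(blob) == payload  # round-trips under every policy
```

Two files with byte-identical content can still want different codecs here, which is exactly why guessing from content alone tends to produce a suboptimal result.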

Integration into file formats.

Posted Jan 15, 2026 14:46 UTC (Thu) by willy (subscriber, #9762)

The two of you may be talking past each other a little. Whether building compression into the file format is a good idea depends on whether it holds archival data or a working set. There's value in "today's data is stored in foo, last year's data is stored in foo.gz". But sometimes the data always needs to be compressed, and then it's worth building compression into the file format.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds