
Integration into file formats.


Posted Jan 14, 2026 22:21 UTC (Wed) by himi (subscriber, #340)
Parent article: Format-specific compression with OpenZL

Given the concerns about applying a format-specific compression algorithm to generalised data (both in terms of performance and safety), perhaps it would make sense to apply this kind of approach to the file formats themselves? i.e. integrate data compression into the file format itself rather than handing it off to a generalised compression tool. Obviously that would mean designing new file formats, or new versions of existing ones, as well as tooling to transition from old to new formats (where that makes sense), but the payoffs could be pretty significant - there's a *lot* of big data sets being shuffled around the world these days; better compression of that data would save on both transfer time/network utilisation and storage requirements, and better compression/decompression performance would be a nice cherry on top.

This seems like it would be entirely compatible with the OpenZL approach, though I think you'd need additional tooling to support this kind of use case. You'd also want to make sure there was lots of information about how to design file formats to suit this model, particularly the trade-offs between different data layouts; probably also consideration of archival versus live data formats (with archival being designed for maximum compression efficiency, versus the live format optimising for whatever IO patterns your active use case requires), and streaming versus random-access, and probably a bunch of other considerations I haven't thought of . . . In fact, the world in general could benefit quite a bit from having a readily available knowledge-base about designing good file formats, particularly if that was supported by high quality tooling and libraries.

Of course you'd still need to support the generalised use cases, and the current OpenZL model of special-case with fallback to general also makes lots of sense (there's a lot of uncompressed data already out there, after all), but building good support for compression into the file formats themselves seems like a reasonable next step, and supporting the development of better file formats in general would be a pretty good end goal.



Integration into file formats.

Posted Jan 14, 2026 22:45 UTC (Wed) by jepsis (subscriber, #130218) (3 responses)

OpenZL is not a file format. It is a universal, self-describing compression layer, so it does not need a format-specific container format.
File format design is still important. OpenZL works on top of existing formats and removes the need for custom compression codecs, while still exploiting the structure of the data.

Integration into file formats.

Posted Jan 15, 2026 2:19 UTC (Thu) by himi (subscriber, #340)

Yes, I was suggesting that the lessons learned from creating data-format-specific compression logic could feed into designing file formats that incorporate that logic from the start, with the various OpenZL components being used in the actual implementation of the tooling for the new file formats. So rather than taking an existing '.foo' file, compressing it with a foo-specific profile, and storing the compressed stream in a '.foo.zl' file, you'd incorporate the foo-specific compression logic into the libfoo library (with your implementation making use of OpenZL components), and create a new '.fooz' file format that directly integrated the compressed stream(s) of data. After all, since libfoo is obviously specialised in handling this particular type of data, it seems like a good place to put specialised knowledge about how best to handle compressing that data - at least, in a world where there's tooling which can make it relatively easy to do that.

You can do something similar as it stands with existing compression libraries, but it's a lot of work for not much gain over using a general tool for whole-file compression. What the OpenZL project brings is a body of knowledge about data compression in general that can be used to inform the way that you set up your data streams to allow the best possible results, and a bunch of code that makes it easy to create a highly specialised compression pipeline - if that gets you something two-thirds the size of the old '.foo.gz' files that can be compressed and decompressed in half the time, it may well be worth the effort.
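[A toy version of the pipeline described above, with the standard-library zlib standing in for an OpenZL back-end, and an invented "foo" field - a sorted ID column. Delta-encoding the column before compressing it is exactly the sort of format-specific knowledge a libfoo could carry:]

```python
import struct
import zlib

# Hypothetical "foo" payload: a sorted list of record IDs, the kind of
# field a foo-specific pipeline could delta-encode before entropy coding.
ids = [i * 7 + (i % 3) for i in range(10000)]  # increasing, small gaps

raw = struct.pack("<10000I", *ids)

# Generic path: compress the raw layout directly.
generic = zlib.compress(raw, 9)

# Format-aware path: delta-encode first, so most values become tiny
# and highly repetitive.
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
delta_raw = struct.pack("<10000i", *deltas)
specialised = zlib.compress(delta_raw, 9)

print(len(raw), len(generic), len(specialised))
```

[The delta stream compresses far better than the raw layout, despite both containing exactly the same information.]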

The body of knowledge could also feed more broadly into file format design choices - if laying your data out one way versus another costs you (say) 10% in terms of zstd-compressed file size, that's kind of useful to know even if you're not going to try to make a super-specialised compression tool. As far as I know, that sort of knowledge base doesn't exist at present.
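[The layout effect mentioned above is easy to demonstrate with stock tools. A small sketch with an invented record shape, and zlib standing in for zstd: interleaving a regular field with a noisy one hides the regularity from the compressor, while a columnar layout keeps it visible:]

```python
import random
import struct
import zlib

random.seed(0)

# 10,000 records, each a regular 32-bit sequence number plus 8 noisy
# payload bytes - stand-ins for two fields with very different statistics.
seqs = list(range(10000))
payloads = [bytes(random.randrange(256) for _ in range(8)) for _ in range(10000)]

# Layout 1: array-of-structs (fields interleaved, record by record).
interleaved = b"".join(struct.pack("<I", s) + p for s, p in zip(seqs, payloads))

# Layout 2: struct-of-arrays (each field stored contiguously).
columnar = struct.pack("<10000I", *seqs) + b"".join(payloads)

print(len(zlib.compress(interleaved, 9)), len(zlib.compress(columnar, 9)))
```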

Integration into file formats.

Posted Jan 15, 2026 2:48 UTC (Thu) by jepsis (subscriber, #130218) (1 response)

Automatic decompression for such a file format is easy; compression is the hard part. To write an efficient representation you need clear intent, i.e. how the data is expected to be used (streaming, random access, read-heavy, write-heavy), what the lifecycle looks like (archival or live data, and whether recompression is expected), and how the data is structured internally (schema, value distributions, chunking, and ordering). Without this information, any attempt to choose compression automatically is mostly guesswork and likely ends up with a suboptimal result.
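[As a rough sketch of what "clear intent" could look like when fed into a tool - every name here is invented for illustration, none of it is OpenZL API:]

```python
from dataclasses import dataclass

# Hypothetical intent descriptor a format author might supply up front.
@dataclass
class Intent:
    access: str      # "streaming" or "random"
    lifecycle: str   # "archival" or "live"

def choose_params(intent: Intent) -> dict:
    # Archival data can afford slow, high-ratio settings; live data wants
    # cheap recompression. Random access forces small independent chunks
    # so a reader can decompress one chunk without the rest of the file.
    level = 9 if intent.lifecycle == "archival" else 1
    chunk = 64 * 1024 if intent.access == "random" else None
    return {"level": level, "chunk_size": chunk}

print(choose_params(Intent("random", "archival")))
```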

Integration into file formats.

Posted Jan 15, 2026 14:46 UTC (Thu) by willy (subscriber, #9762)

The two of you may be talking past each other a little. Whether building compression into the file format is a good idea depends on whether this is archival data or the working set. There's value in "today's data is stored in foo, last year's data is stored in foo.gz". But sometimes we're dealing with data that always needs to be compressed, and then it's worth building it into the file format.

Integration into file formats.

Posted Jan 15, 2026 6:55 UTC (Thu) by martinfick (subscriber, #4455) (4 responses)

As enticing as this may sound, it has a major drawback: it will not work very well with object stores. If the file/object data is already compressed when it is inserted, it becomes much harder to perform any sort of cross-file or cross-version deltafication, such as what git can do. With many compression formats, altering a single byte in the raw data may drastically change the compressed output. When this happens, deltafication across file versions becomes almost impossible, or not very useful. It is much better to perform deltafication on the raw data first, and then compress the deltas.
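[The single-byte effect described above can be seen with a few lines of Python (zlib as an example codec): a one-byte change in the raw data makes the compressed streams diverge almost immediately, so a byte-wise delta of the compressed files is nearly useless, while the raw delta is a single byte:]

```python
import zlib

# Two versions of a file differing in a single early byte.
v1 = b"header" + bytes(range(256)) * 40
v2 = bytearray(v1)
v2[3] = 0xFF
v2 = bytes(v2)
assert sum(a != b for a, b in zip(v1, v2)) == 1  # raw delta: one byte

c1, c2 = zlib.compress(v1, 6), zlib.compress(v2, 6)

# How far do the two compressed streams stay byte-identical?
prefix = next(
    (i for i, (a, b) in enumerate(zip(c1, c2)) if a != b),
    min(len(c1), len(c2)),
)
print(len(c1), prefix)
```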

Another problem you will encounter, perhaps even worse, is with content-addressable object stores; here, once again, git comes to mind. Inserting already-compressed data makes it almost impossible to improve upon the original compression, and thus freezes/ossifies the compression, since any hashes of the content would be performed on the compressed content instead of the raw data. This leaves the storage at the whim of the original compression algorithm and speed settings, without ever being able to change things if better algorithms are developed. If the compression were to be changed, the hash of the compressed data would change, and the object store would not see it as the same object even though the raw data would be the same! Instead, if the compression is left up to the storage, the storage will be able to take advantage of new compression techniques as they are developed, or even just the availability of more CPU cycles.
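[A minimal demonstration of the ossification point, using zlib and SHA-256 as stand-ins: two different compression settings produce different bytes for the same logical content, so an ID derived from the compressed bytes changes, while an ID derived from the raw bytes does not:]

```python
import hashlib
import zlib

data = b"the same raw object, stored twice " * 100

# Different settings produce different on-disk bytes for identical input
# (even the two-byte zlib header differs between levels 1 and 9)...
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)

# ...so hashing the compressed bytes yields two distinct object IDs for
# the same logical content, while hashing the raw bytes yields one.
print(hashlib.sha256(fast).hexdigest() == hashlib.sha256(best).hexdigest())
print(hashlib.sha256(data).hexdigest())
```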

Integration into file formats.

Posted Jan 15, 2026 9:08 UTC (Thu) by himi (subscriber, #340) (2 responses)

That is indeed an issue . . . and one that probably doesn't have any resolution - if you want to have smarts in the storage layer (underneath the filesystem abstraction), you really need to make sure those smarts can see the raw data rather than any kind of processed format.

But there are definitely scenarios where that approach doesn't work for some reason. The use case that I was thinking of is one that we deal with where I work: we're continuously pulling down large amounts of satellite data (we run a regional hub for the ESA's Copernicus program) - basically a big collection of files, each one unique and unchanging; new data means new files, old files never get touched; if the underlying raw data gets reprocessed (e.g. reprocessing data from older satellites to be consistent with the processing done with current satellites, which happens occasionally) that results in a set of new files *alongside* the old ones. By its very nature the raw data pretty much *has* to have little to no commonality between files - it's sensor data, essentially long strings of numbers with a sizeable random noise component alongside the signal; if your storage layer can do any kind of meaningful deduplication or similar something's probably gone seriously wrong with the satellites. The only thing that's worth doing is compression - improved compression in this use case, both at rest and in flight, would be a major win.

That's what immediately came to my mind, but there's a whole lot of other scientific data sets that will have similar properties, and ideally we'd hang onto those raw data sets essentially indefinitely - there's always potential for extracting new information from data that's already been collected. One nice example is research extracting historical climate data from Royal Navy log books going back more than two hundred years; there's also lots of astronomical research being done that's mostly reprocessing old raw data, and programs like JWST build that into their foundations - every bit of observational data from JWST will eventually be available for anyone to access and use for their own research.

Which all kind of agrees with your basic argument, I guess - the raw data is critical, you want to process it as little and as late as possible, at the point where you can gain as much value out of it as you can . . . but that means different things for different types of data.

All that said, one of the standard complaints from the data storage team where I work is researchers who keep ten copies of identical data because they can't keep track of where they put things (and then complain about hitting their quota . . . ) - magic in the storage layer to handle that kind of deduplication would definitely be nice.

Integration into file formats.

Posted Jan 15, 2026 11:50 UTC (Thu) by Wol (subscriber, #4433) (1 response)

> magic in the storage layer to handle that kind of deduplication would definitely be nice.

Isn't this inherent in one of the file-systems? ZFS springs to mind?

Some filesystems, I believe, keep a hash of disk blocks, and if two blocks have the same contents, the overlying files will be changed to point to the same block. They can either "check on write" and so dedupe on the fly, or do a post-hoc dedupe pass. Either way, I'm sure this functionality is available in at least one regular Linux file system.
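[The mechanism described above can be sketched in a few lines of Python - a toy in-memory block store, not how ZFS actually implements it. Each file becomes a list of block hashes, and a block is only stored once no matter how many files reference it:]

```python
import hashlib

BLOCK = 4096

class BlockStore:
    """Toy content-hashed block store sketching dedup-on-write."""
    def __init__(self):
        self.blocks = {}  # sha256 digest -> block bytes

    def write_file(self, data: bytes) -> list:
        refs = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # dedupe on write
            refs.append(digest)
        return refs  # the "file" is just a list of block references

store = BlockStore()
data = b"same dataset" * 10000
f1 = store.write_file(data)
f2 = store.write_file(data)  # a researcher's second identical copy
print(len(f1), len(store.blocks))  # refs per file vs unique blocks stored
```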

Cheers,
Wol

Integration into file formats.

Posted Jan 15, 2026 14:51 UTC (Thu) by willy (subscriber, #9762)

This is the kind of thing that sounds seductively attractive and then you actually try to do it and the metadata needed to keep track of everything blows up exponentially (literally, not in the modern meaning of "a lot"). And fragmentation increases massively, which turns out to matter even on NVMe drives.

There's specialist cases where this makes sense, but it's no free meal. Or maybe it is a free meal, in the sense that the drinks now cost 50% more.

Integration into file formats.

Posted Jan 17, 2026 22:45 UTC (Sat) by cesarb (subscriber, #6266)

> If the file/object data is already compressed when it is inserted, then it makes it much harder to perform any sort of cross file or version deltafication, such as what git can do. [...] Another problem you will encounter, perhaps even worse, is with content addressable object stores, here once again git comes to mind. Inserting already compressed data makes it almost impossible to improve upon the original compression, and thus freezes/osifies the compression since any hashes of the content would be performed on the compressed content instead of the raw data.

Funny you mention git. Very early in the git history, it worked exactly like that: the object identifier was the hash of the *compressed* data. See https://github.com/git/git/commit/d98b46f8d9a3daf965a39f8... ("Do SHA1 hash _before_ compression.") and https://github.com/git/git/commit/f18ca731663191477613645... ("The recent hash/compression switch-over missed the blob creation."), where it was changed to the current behavior of using the hash of the *uncompressed* data.
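[Git's current behaviour can be checked in a few lines of Python: the object ID is a hash of the uncompressed object - a "blob <size>\0" header followed by the content - so the zlib-compressed loose object on disk can be recompressed at any level without changing the ID. (SHA-1 here matches git's traditional default; the loose-object framing is standard git, the rest is a sketch.)]

```python
import hashlib
import zlib

content = b"hello\n"

# The object ID hashes the *uncompressed* object: header plus content.
header = b"blob %d\x00" % len(content)
oid = hashlib.sha1(header + content).hexdigest()
print(oid)

# The loose object on disk is zlib-compressed; recompressing it at a
# different level changes those bytes without touching the object ID.
loose_fast = zlib.compress(header + content, 1)
loose_best = zlib.compress(header + content, 9)
print(loose_fast == loose_best)
```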


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds