
Integration into file formats.


Posted Jan 15, 2026 6:55 UTC (Thu) by martinfick (subscriber, #4455)
In reply to: Integration into file formats. by himi
Parent article: Format-specific compression with OpenZL

As enticing as this may sound, it has a major drawback: it will not work very well with object stores. If the file/object data is already compressed when it is inserted, it becomes much harder to perform any sort of cross-file or cross-version deltafication, such as what git can do. With many compression formats, altering a single byte in the raw data may drastically change the compressed output. When this happens, deltafication across file versions becomes almost impossible, or not very useful. It is much better to perform deltafication on the raw data first, and then compress the deltas.
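A toy sketch in Python (using zlib purely as a stand-in for whatever compressor a file format might embed, and a trivial XOR as a stand-in for a real delta algorithm) of why delta-then-compress beats compress-then-delta:

```python
import zlib

raw_v1 = bytes(range(256)) * 64           # 16 KiB of sample "raw" data
raw_v2 = bytearray(raw_v1)
raw_v2[100] ^= 0xFF                       # flip a single byte in version 2
raw_v2 = bytes(raw_v2)

comp_v1 = zlib.compress(raw_v1)
comp_v2 = zlib.compress(raw_v2)

# The raw versions differ in exactly one byte...
raw_diff = sum(a != b for a, b in zip(raw_v1, raw_v2))

# ...but the compressed streams typically diverge widely from the change
# onward, so deltafication of the compressed forms gains little.
comp_diff = sum(a != b for a, b in zip(comp_v1, comp_v2))

# Deltifying the raw data first keeps the delta tiny, and the
# (mostly-zero) delta itself compresses extremely well.
delta = bytes(a ^ b for a, b in zip(raw_v1, raw_v2))
comp_delta = zlib.compress(delta)
```

The point is the asymmetry: one changed raw byte yields a delta that compresses to a few dozen bytes, while the already-compressed streams share little byte-for-byte structure to delta against.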

Another problem you will encounter, perhaps even worse, is with content-addressable object stores; here, once again, git comes to mind. Inserting already-compressed data makes it almost impossible to improve upon the original compression, and thus freezes/ossifies the compression, since any hashes of the content would be performed on the compressed content instead of the raw data. This leaves the storage at the whim of the original compression algorithm and speed settings, without ever being able to change things if better algorithms are developed. If the compression were changed, the hash of the compressed data would change, and the object store would not see it as the same object even though the raw data would be the same! Instead, if compression is left up to the storage, the storage will be able to take advantage of new compression techniques as they are developed, or even just the availability of more CPU cycles.
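The ossification is easy to demonstrate; a minimal illustration, with zlib compression levels standing in for "old" and "new" codec settings:

```python
import hashlib
import zlib

data = b"the same raw object contents" * 100

# Hash the *compressed* bytes: the object ID now depends on the
# codec settings, so recompressing changes the object's identity.
id_level1 = hashlib.sha1(zlib.compress(data, 1)).hexdigest()
id_level9 = hashlib.sha1(zlib.compress(data, 9)).hexdigest()

# Hash the *raw* bytes: the ID is stable no matter how the store
# chooses to compress the object on disk, today or in ten years.
id_raw = hashlib.sha1(data).hexdigest()
```

With the first scheme the same logical object gets two different identities just because the store tried a better compression setting; with the second, the store is free to recompress at will.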



Integration into file formats.

Posted Jan 15, 2026 9:08 UTC (Thu) by himi (subscriber, #340) [Link] (2 responses)

That is indeed an issue . . . and one that probably doesn't have any resolution - if you want to have smarts in the storage layer (underneath the filesystem abstraction), you really need to make sure those smarts can see the raw data rather than any kind of processed format.

But there are definitely scenarios where that approach doesn't work for some reason. The use case that I was thinking of is one that we deal with where I work: we're continuously pulling down large amounts of satellite data (we run a regional hub for the ESA's Copernicus program) - basically a big collection of files, each one unique and unchanging; new data means new files, old files never get touched; if the underlying raw data gets reprocessed (e.g. reprocessing data from older satellites to be consistent with the processing done with current satellites, which happens occasionally) that results in a set of new files *alongside* the old ones. By its very nature the raw data pretty much *has* to have little to no commonality between files - it's sensor data, essentially long strings of numbers with a sizeable random noise component alongside the signal; if your storage layer can do any kind of meaningful deduplication or similar something's probably gone seriously wrong with the satellites. The only thing that's worth doing is compression - improved compression in this use case, both at rest and in flight, would be a major win.

That's what immediately came to my mind, but there's a whole lot of other scientific data sets that will have similar properties, and ideally we'd hang onto those raw data sets essentially indefinitely - there's always potential for extracting new information from data that's already been collected. One nice example is research extracting historical climate data from Royal Navy log books going back more than two hundred years; there's also lots of astronomical research being done that's mostly reprocessing old raw data, and programs like JWST build that into their foundations - every bit of observational data from JWST will eventually be available for anyone to access and use for their own research.

Which all kind of agrees with your basic argument, I guess - the raw data is critical, you want to process it as little and as late as possible, at the point where you can gain as much value out of it as you can . . . but that means different things for different types of data.

All that said, one of the standard complaints from the data storage team where I work is researchers who keep ten copies of identical data because they can't keep track of where they put things (and then complain about hitting their quota . . . ) - magic in the storage layer to handle that kind of deduplication would definitely be nice.

Integration into file formats.

Posted Jan 15, 2026 11:50 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> magic in the storage layer to handle that kind of deduplication would definitely be nice.

Isn't this inherent in one of the file-systems? ZFS springs to mind?

Some filesystems, I believe, keep a hash of disk blocks, and if two blocks have the same contents, the overlying files will be changed to point to the same block. Within this, they can either "check on write" and dedup on the fly, or do a post-hoc dedup pass. Either way, I'm sure this functionality is available in at least one regular Linux file system.
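That is roughly how block-level dedup works; a toy sketch of the hash-keyed block table (nothing like ZFS's actual on-disk format, just the idea):

```python
import hashlib

BLOCK = 4096  # logical block size for this sketch

def dedup(blocks):
    """Map each logical block to a single physical copy via its hash,
    the way a dedup table keys physical blocks by content hash."""
    store = {}    # content hash -> the one physical copy of the block
    layout = []   # per-block "pointers" (hashes) making up the files
    for b in blocks:
        h = hashlib.sha256(b).hexdigest()
        store.setdefault(h, b)   # only the first copy is actually kept
        layout.append(h)
    return store, layout

# Two "files" that happen to share a block:
file_a = [b"A" * BLOCK, b"B" * BLOCK]
file_b = [b"B" * BLOCK, b"C" * BLOCK]
store, layout = dedup(file_a + file_b)
# Four logical blocks, but only three physical copies survive.
```

The "check on write" vs. post-hoc distinction is just when `dedup()` runs: inline on every write, or as a later scrub over existing blocks.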

Cheers,
Wol

Integration into file formats.

Posted Jan 15, 2026 14:51 UTC (Thu) by willy (subscriber, #9762) [Link]

This is the kind of thing that sounds seductively attractive and then you actually try to do it and the metadata needed to keep track of everything blows up exponentially (literally, not in the modern meaning of "a lot"). And fragmentation increases massively, which turns out to matter even on NVMe drives.

There are specialist cases where this makes sense, but it's no free meal. Or maybe it is a free meal, in the sense that the drinks now cost 50% more.

Integration into file formats.

Posted Jan 17, 2026 22:45 UTC (Sat) by cesarb (subscriber, #6266) [Link]

> If the file/object data is already compressed when it is inserted, then it makes it much harder to perform any sort of cross file or version deltafication, such as what git can do. [...] Another problem you will encounter, perhaps even worse, is with content addressable object stores, here once again git comes to mind. Inserting already compressed data makes it almost impossible to improve upon the original compression, and thus freezes/osifies the compression since any hashes of the content would be performed on the compressed content instead of the raw data.

Funny you mention git. Very early in the git history, it worked exactly like that: the object identifier was the hash of the *compressed* data. See https://github.com/git/git/commit/d98b46f8d9a3daf965a39f8... ("Do SHA1 hash _before_ compression.") and https://github.com/git/git/commit/f18ca731663191477613645... ("The recent hash/compression switch-over missed the blob creation."), where it was changed to the current behavior of using the hash of the *uncompressed* data.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds