
Integration into file formats.


Posted Jan 15, 2026 9:08 UTC (Thu) by himi (subscriber, #340)
In reply to: Integration into file formats. by martinfick
Parent article: Format-specific compression with OpenZL

That is indeed an issue . . . and one that probably doesn't have any resolution - if you want to have smarts in the storage layer (underneath the filesystem abstraction), you really need to make sure those smarts can see the raw data rather than any kind of processed format.

But there are definitely scenarios where that approach doesn't work for some reason. The use case that I was thinking of is one that we deal with where I work: we're continuously pulling down large amounts of satellite data (we run a regional hub for the ESA's Copernicus program) - basically a big collection of files, each one unique and unchanging; new data means new files, old files never get touched; if the underlying raw data gets reprocessed (e.g. reprocessing data from older satellites to be consistent with the processing done with current satellites, which happens occasionally) that results in a set of new files *alongside* the old ones. By its very nature the raw data pretty much *has* to have little to no commonality between files - it's sensor data, essentially long strings of numbers with a sizeable random noise component alongside the signal; if your storage layer can do any kind of meaningful deduplication or similar something's probably gone seriously wrong with the satellites. The only thing that's worth doing is compression - improved compression in this use case, both at rest and in flight, would be a major win.

That's what immediately came to my mind, but there's a whole lot of other scientific data sets that will have similar properties, and ideally we'd hang onto those raw data sets essentially indefinitely - there's always potential for extracting new information from data that's already been collected. One nice example is research extracting historical climate data from Royal Navy log books going back more than two hundred years; there's also lots of astronomical research being done that's mostly reprocessing old raw data, and programs like JWST build that into their foundations - every bit of observational data from JWST will eventually be available for anyone to access and use for their own research.

Which all kind of agrees with your basic argument, I guess - the raw data is critical, you want to process it as little and as late as possible, at the point where you can gain as much value out of it as you can . . . but that means different things for different types of data.

All that said, one of the standard complaints from the data storage team where I work is researchers who keep ten copies of identical data because they can't keep track of where they put things (and then complain about hitting their quota . . . ) - magic in the storage layer to handle that kind of deduplication would definitely be nice.



Integration into file formats.

Posted Jan 15, 2026 11:50 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> magic in the storage layer to handle that kind of deduplication would definitely be nice.

Isn't this inherent in one of the filesystems? ZFS springs to mind.

Some filesystems, I believe, keep a hash of disk blocks, and if two blocks have the same contents, the overlying files are changed to point to the same block. They can either check on write and so deduplicate on the fly, or run a post-hoc dedupe pass. Either way, I'm sure this functionality is available in at least one regular Linux filesystem.
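[Ed: as a rough illustration of the block-hashing scheme described above - not how ZFS or any real filesystem implements it - here is a sketch that hashes fixed-size blocks across a set of files and reports how much space sharing duplicate blocks would reclaim. The function names and the 4 KiB block size are illustrative choices, not anything from a real tool.]

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; real filesystems pick their own block size


def dedup_scan(paths):
    """Map each block's hash to the (file, offset) locations holding it.

    Any hash with more than one location is a dedup candidate: a
    filesystem could point all of those file ranges at one physical block.
    """
    seen = {}
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                seen.setdefault(digest, []).append((path, offset))
                offset += len(block)
    return seen


def reclaimable_bytes(seen):
    """Bytes saved if every duplicate block were stored only once."""
    return sum((len(locs) - 1) * BLOCK_SIZE
               for locs in seen.values() if len(locs) > 1)
```

A post-hoc dedupe pass corresponds to running a scan like this over existing data; "check on write" would instead consult the hash table as each new block arrives. The sketch also hints at willy's point below: the `seen` table is the metadata cost, since it must track every block ever written.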

Cheers,
Wol

Integration into file formats.

Posted Jan 15, 2026 14:51 UTC (Thu) by willy (subscriber, #9762) [Link]

This is the kind of thing that sounds seductively attractive and then you actually try to do it and the metadata needed to keep track of everything blows up exponentially (literally, not in the modern meaning of "a lot"). And fragmentation increases massively, which turns out to matter even on NVMe drives.

There are specialist cases where this makes sense, but it's no free meal. Or maybe it is a free meal, in the sense that the drinks now cost 50% more.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds