Integration into file formats.
Posted Jan 15, 2026 9:08 UTC (Thu) by himi (subscriber, #340)
In reply to: Integration into file formats. by martinfick
Parent article: Format-specific compression with OpenZL
But there are definitely scenarios where that approach doesn't work for some reason. The use case I was thinking of is one we deal with where I work: we're continuously pulling down large amounts of satellite data (we run a regional hub for ESA's Copernicus program). It's basically a big collection of files, each one unique and unchanging; new data means new files, and old files never get touched. If the underlying raw data gets reprocessed (e.g. reprocessing data from older satellites to be consistent with the processing done for current satellites, which happens occasionally), that results in a set of new files *alongside* the old ones.

By its very nature the raw data pretty much *has* to have little to no commonality between files - it's sensor data, essentially long strings of numbers with a sizeable random noise component alongside the signal. If your storage layer can do any kind of meaningful deduplication or similar, something's probably gone seriously wrong with the satellites. The only thing that's worth doing is compression - improved compression in this use case, both at rest and in flight, would be a major win.
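To make the dedup-vs-compression point concrete, here's a minimal sketch (my own illustration, not anything from the Copernicus pipeline): files of simulated sensor data - a repeating signal plus Gaussian noise - share essentially no identical blocks, so block-level dedup finds nothing, while ordinary compression within a single file still buys something.

```python
# Sketch: why noisy sensor data defeats block-level dedup
# but still benefits from per-file compression.
import hashlib
import random
import struct
import zlib

random.seed(42)

def sensor_file(n=4096):
    """Simulate one raw data file: a smooth signal plus random noise."""
    samples = [int(1000 * (i % 100) / 100 + random.gauss(0, 50))
               for i in range(n)]
    return b"".join(struct.pack("<i", s) for s in samples)

files = [sensor_file() for _ in range(4)]

def blocks(data, size=4096):
    """Hashes of fixed-size blocks, as a dedup layer might see them."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

# Block-level dedup: blocks common to every file (noise => essentially none).
shared = set.intersection(*(blocks(f) for f in files))
print("blocks shared across all files:", len(shared))

# Compression still helps within one file: the signal part is redundant.
ratio = len(zlib.compress(files[0], 9)) / len(files[0])
print(f"compressed to {ratio:.0%} of original size")
```

A format-aware compressor in the OpenZL style could do better still than generic zlib here, since it knows the samples are fixed-width integers - which is exactly the appeal for this kind of data.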
That's what immediately came to my mind, but there's a whole lot of other scientific data sets that will have similar properties, and ideally we'd hang onto those raw data sets essentially indefinitely - there's always potential for extracting new information from data that's already been collected. One nice example is research extracting historical climate data from Royal Navy log books going back more than two hundred years; there's also lots of astronomical research being done that's mostly reprocessing old raw data, and programs like JWST build that into their foundations - every bit of observational data from JWST will eventually be available for anyone to access and use for their own research.
Which all kind of agrees with your basic argument, I guess - the raw data is critical, you want to process it as little and as late as possible, at the point where you can gain as much value out of it as you can . . . but that means different things for different types of data.
All that said, one of the standard complaints from the data storage team where I work is researchers who keep ten copies of identical data because they can't keep track of where they put things (and then complain about hitting their quota . . . ) - magic in the storage layer to handle that kind of deduplication would definitely be nice.
