Format-specific compression with OpenZL
Lossless data compression is an important tool for reducing the storage requirements of the world's ever-growing data sets. Yann Collet developed the LZ4 algorithm and designed the Zstandard (or Zstd) algorithm; he came to the 2025 Open Source Summit Japan in Tokyo to talk about where data compression goes from here. It turns out that we have reached a point where general-purpose algorithms are only going to provide limited improvement; significant increases in compression, while keeping computation costs within reason for data-center use, will require turning to format-specific techniques.
Zstandard was introduced ten years ago and "it offered really much better performance tradeoffs than what existed before", Collet began. The alternatives were zlib, which was a "very good middle ground for decent speed and decent compression ratio", but was not fast enough, and LZ4, which provided much better compression speed but did not compress the data enough. Zstandard quickly supplanted the others because it was fundamentally better for size and speed. In the years since, Zstandard has improved, especially in its decompression speed, but those advances are still fairly modest. "We are reaching the limits of that technology."
In looking at what can be done to improve things, there are other problems beyond just the diminishing returns. The Zstandard format is limiting; with a new format, gains of 2-3% for compression ratio and 10-20% for speed are possible. "Is it worth it?", he asked. It is not really about the time needed to develop the new format, but that there is a huge ecosystem of Zstandard users that would need to change, which is extremely costly. He does not think there would be a serious shift to a new format unless it offered overwhelming advantages. "If we introduce a new compressor, it has to be vastly better."
There are other options, such as copy-based algorithms (e.g. LZ78), which copy repeated data from the compression dictionary to reconstruct the original; they can meet the needs for data-center compression, but they converge toward the same limits as Zstandard. That convergence was surprising, Collet said, because the techniques are quite different, but it stems from the fact that all of them make no assumptions about the data and simply treat it as a stream of undifferentiated bytes. There are high-compression algorithms that can achieve better results (e.g. PPM) but they run too slowly for data-center applications.
Format specific
Compressors that are only concerned with a specific format can do much better. For a trivial example, a simple array of consecutive integer values cannot be compressed by algorithms like LZ because there are no repetitions. A simple delta transformation turns that into something that can be heavily compressed, however. "If we know what we are compressing, it's not just a bunch of bytes, [...] it opens more options and, because we have more options, we should be able to compress better."
A more realistic example is a compressor for the Smithsonian Astrophysical Observatory star catalog format, known as "SAO". It is part of the Silesia compression corpus, which consists of data sets that are used to compare compression algorithms. "It's very well defined", with a header followed by an array of 28-byte structures with fixed fields and types. Turning the array of structures into a structure of arrays is a "trivial transformation"; each array is homogeneous and can be analyzed separately. For example, the first two fields in the structure are 64-bit X and Y positions. The X values are mostly sorted, so delta compression gives good results; the Y values are bounded and have a limited number of values compared to the range, so a transpose transformation can focus on compressing the high (largely unchanging) bytes, while other techniques can be applied to the subset of all the possible values for the low bytes. Other fields have properties that can be exploited as well.
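The structure-of-arrays idea can be sketched as follows; the record layout and values are invented for illustration and are not the real SAO format. Two 64-bit fields (a mostly sorted `x` and a bounded `y`) are split into separate homogeneous streams, `x` is delta-encoded, and `y` is byte-transposed so its nearly constant high bytes group together:

```python
import struct
import zlib

# Hypothetical records of two 64-bit fields: x mostly sorted, y bounded.
records = [(i * 1000 + i % 7, 500 + i % 50) for i in range(10_000)]
aos = b"".join(struct.pack("<qq", x, y) for x, y in records)  # array of structures

# Structure of arrays: one homogeneous stream per field.
xs = [x for x, _ in records]
ys = [y for _, y in records]

# The mostly-sorted x field delta-encodes to small, repetitive values.
dx = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]
x_stream = struct.pack(f"<{len(dx)}q", *dx)

# The bounded y field is byte-transposed: seven of its eight byte planes
# are nearly constant and compress almost for free.
y_bytes = struct.pack(f"<{len(ys)}q", *ys)
y_stream = b"".join(y_bytes[i::8] for i in range(8))

aos_size = len(zlib.compress(aos))
soa_size = len(zlib.compress(x_stream)) + len(zlib.compress(y_stream))
print(aos_size, soa_size)  # the transformed streams compress much smaller
```

Each transform is trivially reversible, so nothing is lost; the gain comes purely from giving the generic back-end data it is good at.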
He compared the results of a few different compressors on the SAO file. Zstandard using its default level (i.e. zstd -3) reduced the 7.2MB file to 5.5MB for a 1.31 compression factor, which is not great. The data is numeric, which Zstandard is not particularly good at compressing, and the SAO file is "packed with information", lacking zeroes and repeating sequences; "it is difficult to compress". But the speed of Zstandard is good, Collet said, compressing at 100MB per second and decompressing at 750MB/s; "if you want to deploy something in a data center, you want this kind of speed".
He compared the "best of the best" widely available compression (lzma -9), which got much better compression (4.4MB or a 1.64 compression factor), but the speed was not adequate for deployment (2.9MB/s compression, 45MB/s decompression). For another data point, he used cmix, which is an experimental compressor by Byron Knoll; you would not deploy it, he said, but "it's recognized as the best compressor out there". It reduced the SAO file to 3.7MB, which is almost a factor of two, but compressing and decompressing can only be done at 0.001MB/s.
Those results set the goals for the SAO-specific compressor: a factor of around two and speed like that of Zstandard. It achieves those goals easily, with a compressed size of 3.5MB (2.06 compression factor) and speeds faster than those of Zstandard (215MB/s compression, 800MB/s decompression). "Here we have enough gains to justify deploying something new in our data centers; this is the next step we were looking for." It turns out that knowing anything about the data gives a major advantage in compression; it is "an insane advantage, a way too large advantage to ignore".
Drawbacks
There are some problems in switching to format-specific compression, starting with the need to design algorithms for the formats. It will take engineers, hopefully with data-compression experience, some time to understand a format and devise an algorithm for it. That typically takes around 18 months, he said, "and you don't know in advance what you will get"; it is not just time and money, but there is uncertainty as well.
Once a good algorithm has been found, there will be a need to optimize it and to safeguard it against attacks. "Every codec [compressor/decompressor] is an injection point." Since there are lots of formats, and there is a need to be cost-effective in developing these compressors, developers may rely on only handling "safe" data instead of spending the effort on fuzzing and other hardening techniques. After a while, however, the codec may slowly start being used on less-safe data, resulting in vulnerabilities and attacks.
Once a codec is ready for deployment, there are still hurdles to overcome. Decompressors must be deployed everywhere the data may need to be accessed, which is not necessarily as easy as it sounds. That may include thousands (or hundreds of thousands) of servers all over the world, clients of various sorts, and so on; it is not uncommon that it takes longer to deploy a new compression algorithm than it did to develop it, Collet said.
There is also a large maintenance cost associated with format-specific compression. In addition, if the format needs to change, the compressor will also need to, and all of the deployment woes arise again. The original developers may well have moved on to other things, so finding people to work on it may be hard and take time. This becomes a "silent velocity obstacle": no one wants to consider changing the format, even if there would be large benefits to doing so, because it is so daunting.
Enter OpenZL
So there is a tension between the promise of format-specific compression and the problems that can come from using it. But the truth is that those problems already exist, Collet said, because in every large organization there are already groups using these compression techniques; "the gains are so huge" that they get adopted piecemeal. "OpenZL is our answer to this tension; we believe that this solution solves all the problems that were just mentioned."
OpenZL has a core library and tools that allow creating specialized compressors. He likened it to the OpenGL graphics API, which "is not a 3D app but is a set of primitives to do a 3D app"; similarly, the OpenZL library gives users primitives to build their own compressors. The idea is to define compressors as graphs of pre-validated codecs, so that these different pieces can be combined in a myriad of ways to produce compressors—"pretty much like Lego".
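The "Lego" composition idea can be sketched conceptually; this is not the OpenZL API, just an illustration of building compressors as pipelines of small, individually validated, reversible codecs (with zlib standing in for the generic entropy stage):

```python
import struct
import zlib

def delta(data):
    vals = struct.unpack(f"<{len(data)//8}q", data)
    out = [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]
    return struct.pack(f"<{len(out)}q", *out)

def undelta(data):
    vals, out, acc = struct.unpack(f"<{len(data)//8}q", data), [], 0
    for v in vals:
        acc += v
        out.append(acc)
    return struct.pack(f"<{len(out)}q", *out)

def transpose(data):
    return b"".join(data[i::8] for i in range(8))

def untranspose(data):
    n = len(data) // 8
    planes = [data[i * n:(i + 1) * n] for i in range(8)]
    return b"".join(bytes(p[j] for p in planes) for j in range(n))

# Each codec is a (forward, inverse) pair that can be freely recombined.
CODECS = {
    "delta": (delta, undelta),
    "transpose": (transpose, untranspose),
    "entropy": (zlib.compress, zlib.decompress),  # stand-in back-end
}

def run(pipeline, data):
    for name in pipeline:
        data = CODECS[name][0](data)
    return data

def unrun(pipeline, data):
    for name in reversed(pipeline):
        data = CODECS[name][1](data)
    return data

payload = struct.pack("<1000q", *range(1000))
for pipeline in (["entropy"], ["delta", "entropy"], ["delta", "transpose", "entropy"]):
    packed = run(pipeline, payload)
    assert unrun(pipeline, packed) == payload  # every arrangement round-trips
    print(pipeline, len(packed))
```

Because every stage has a known inverse, any arrangement of them decompresses with the same generic loop, which is what makes a single unified decompression engine possible.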
Using those codecs will allow creating new compressors in a matter of days, instead of months. The graphs provide an enormous search space by human standards, but that space is not particularly large for computers, so it can be systematically searched. "We can provide tools that will do this work of finding the best arrangement of codecs and will give you an answer in minutes." That is a game-changer, he said; users can know quickly whether it even makes sense to pursue a format-specific compressor.
Assuming that it does make sense, the "deployment bottleneck" will soon rear its head. OpenZL avoids that by having a unified decompression engine that can handle any graph, so there is only one program that needs to be deployed. Updates and changes to the compressor are simply new configurations; transitions can be handled by supporting multiple graphs for a format. In addition, graphs can even be changed dynamically during compression if desired. The maintenance headaches are reduced as well, since there is only a single code base that needs attention for bug fixes, performance improvements, and security upgrades.
It is natural to think of these graphs as being static, but that is not the reality. These compressors have a selector that chooses a graph by analyzing the data, so the graph for a format can change based on the input. The intent is to maintain performance, he said, but, more importantly, to handle exceptions. If an integer array is expected, but text is found, using a numeric compressor "is going to end badly"; that should be detected and a switch made to Zstandard, which is the fallback codec.
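A selector with a fallback path can be sketched like this; the heuristic and the one-byte tags are invented for illustration and do not reflect OpenZL's actual wire format. The input is inspected, a numeric graph is chosen when the data looks like an integer array, and anything else falls back to the generic compressor (zlib standing in for Zstandard):

```python
import struct
import zlib

def looks_like_int64_array(data: bytes) -> bool:
    # Toy heuristic: length is a multiple of 8 and the high bytes are
    # mostly zero, as they would be for small 64-bit integers.
    if not data or len(data) % 8:
        return False
    high = data[7::8]
    return high.count(0) > len(high) * 0.9

def compress(data: bytes) -> bytes:
    if looks_like_int64_array(data):
        vals = struct.unpack(f"<{len(data)//8}q", data)
        deltas = (vals[0],) + tuple(b - a for a, b in zip(vals, vals[1:]))
        body = zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas))
        return b"N" + body              # tag: numeric graph was chosen
    return b"G" + zlib.compress(data)   # tag: generic fallback

def decompress(blob: bytes) -> bytes:
    tag, body = blob[:1], zlib.decompress(blob[1:])
    if tag == b"N":
        deltas = struct.unpack(f"<{len(body)//8}q", body)
        vals, acc = [], 0
        for d in deltas:
            acc += d
            vals.append(acc)
        return struct.pack(f"<{len(vals)}q", *vals)
    return body

ints = struct.pack("<1000q", *range(1000))
text = b"not an integer array at all" * 100
assert compress(ints)[:1] == b"N"   # numeric graph selected
assert compress(text)[:1] == b"G"   # fallback selected
assert decompress(compress(ints)) == ints
assert decompress(compress(text)) == text
```

The single `decompress` function handles both paths, mirroring the unified decompression engine: the choice made at compression time is recorded in the output, so the decompressor never needs to guess.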
The first step to generating an OpenZL compressor is describing the data format. There are already around a dozen formats supported by OpenZL and dozens more will be added over the next few months, he said. Those will only cover common formats, however, so others will need to be described, either by providing a parser function or by using the Simple Data Description Language (SDDL) compiler.
SDDL can describe straightforward formats easily; it can also handle more complex formats, "but at some point, it is no longer the right tool". If creating the SDDL becomes too difficult, the work can be outsourced to an LLM "and it actually works", he said. There is one prompt that teaches the LLM about the SDDL syntax and then it can be asked to generate the SDDL. "If it's a good LLM, it should work well; like every LLM, you should read it." It is approaching the point where no programming at all will be needed to do this, Collet said.
OpenZL has tools that will use the description of the data and some sample files to create multiple compressors in a few minutes. Those different compressors allow users to choose the tradeoffs that matter to them: faster speed or more compression. In order to compress a file using one of them, the description of it, called a serialized compressor, is specified along with the file to compress. Decompression does not need to specify the compressor because the description is stored in the compressed data.
Any of the steps can be done manually, which might be somewhat painful, but means that everything about the compressor can be examined. "We can observe it, we can change it, we can see if we can find something better". That is important for debugging and research into compression techniques.
He showed some graphs comparing OpenZL compression to existing tools, but noted that "it's not a fair fight". The graphs show OpenZL doing much better than the competition. That's the whole point of OpenZL, he said: "if you know something about your data, why not use it to get better performance?"
OpenZL is already deployed widely at his employer, Meta. One of the main workloads at Meta is LLMs, so there is a lot of data to handle. The Meta system is set up to constantly monitor the data being generated, periodically retrain the compressors based on that, and then deploy the resulting compressed files immediately—the decompressor can always handle the result. He noted that compression is not only about saving storage, it is also about transmission time savings for moving data around—to and from GPUs, for example. That directly translates to higher compute utilization.
OpenZL is open source and available on GitHub (under the three-clause BSD license). The quick-start instructions are straightforward, Collet said; following those steps will introduce all of the new concepts and tools. "It's not Zstandard++, this thing is different", so there are more steps and users need to invest some time to come up to speed. If they do, however, they will get better compression and more speed; "the difference is stark".
It has not yet reached a 1.0 release, because the OpenZL developers believe the final wire protocol needs to be built with the community. Over the next few years, the idea is to engage with the community to ensure that all of the different use cases are covered. In addition, there is work on getting OpenZL acceleration working directly in various types of hardware: CPUs, GPUs, and ASICs. That will take some time, "but we expect to see the result of that before the end of the decade", he concluded.
Interested readers may wish to view the YouTube video of the talk or look at Collet's slides.
[ I would like to thank the Linux Foundation, LWN's travel sponsor, for
assistance with traveling to Tokyo for Open Source Summit Japan. ]
