Compression formats for kernel.org

By Jonathan Corbet
February 17, 2010
The kernel.org repository depends heavily on compression to keep its storage and bandwidth expenses down. An uncompressed tarball for the 2.6.32 release weighs in at 365MB; if downloaders grabbed the data in this format, the resulting bandwidth usage would be huge. So kernel.org does not make uncompressed tarballs available; instead, one can choose between versions compressed with gzip (79MB) or bzip2 (62MB). Bzip2 is the newer choice; it took a while to catch on because the needed tools were not widely shipped. Now, though, the folks at kernel.org are considering making a change in the compression formats used there.

What's driving this discussion is the availability of the XZ tool, which is based on the LZMA compression algorithm. XZ offers better compression performance - 53MB on that 2.6.32 tarball - but it suffers from a familiar problem: the tools are not yet widely available in distributions, especially those of the "enterprise" variety. This has led to pushback against the idea of standardizing on XZ in the near future, as can be seen in this comment from Ted Ts'o:

Keep in mind that there are people who are still using RHEL 3, and some of them might want to download from ftp.kernel.org. So those people who are suggesting that we replace .gz files with .xz on kernel.org are *really* smoking something good.

In fact, there is little pressure to replace the gzip format anytime in the near future. Its compression performance may not be the best, but it does have the advantage of being far faster than any of the alternatives. From the discussion, it is fairly clear that some users care about decompression time. What is more likely is that XZ might eventually displace files in the bzip2 format. Then there would be a clear choice: speed and widespread availability or the best available compression. Even that change, though, is likely to be at least a year away; in the meantime, kernel.org will probably carry files in all three formats.

(This discussion also included a side thread on changing the 2.6.xx numbering scheme. Once again, though, the expected flame wars failed to materialize. There just does not seem to be much interest in or energy for this particular change.)


Compression formats for kernel.org

Posted Feb 18, 2010 10:07 UTC (Thu) by intgr (subscriber, #39733) [Link]

In fact, gzip may not necessarily be the speed leader for long. The XZ format was specifically designed to support parallelization, so on modern quad-core processors it has the potential to be even faster than gzip. Unfortunately, even though the file format can handle it, the current xz-utils does not support parallelization yet.

Also, maybe one shouldn't be measuring decompression time in isolation, but add in download time as well? If the user spent 5 fewer seconds downloading the tarball, then does it matter if it takes 5 seconds more to decompress it?
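
A back-of-the-envelope sketch of that trade-off, in Python. It uses the tarball sizes quoted in the article (79MB for gzip, 53MB for XZ) and the rough single-thread decompression timings posted further down this thread (2.3s and 4.7s). The function names are made up for illustration, and the model assumes the download and the decompression happen one after the other rather than being pipelined.

```python
def total_seconds(size_mb, decompress_s, bandwidth_mb_s):
    """Download time plus decompression time, done back to back."""
    return size_mb / bandwidth_mb_s + decompress_s


def break_even_bandwidth(small_mb, small_decompress_s, big_mb, big_decompress_s):
    """Bandwidth (MB/s) at which both formats take the same total time.

    Assumes the smaller file is also the slower one to decompress.
    """
    return (big_mb - small_mb) / (small_decompress_s - big_decompress_s)


# (size in MB, decompression seconds); figures taken from the article and thread
GZIP = (79, 2.3)
XZ = (53, 4.7)

if __name__ == "__main__":
    cutoff = break_even_bandwidth(XZ[0], XZ[1], GZIP[0], GZIP[1])
    print(f"XZ is faster overall below roughly {cutoff:.1f} MB/s")
    for bandwidth in (1, 10, 100):
        gz = total_seconds(GZIP[0], GZIP[1], bandwidth)
        xz = total_seconds(XZ[0], XZ[1], bandwidth)
        print(f"{bandwidth:>3} MB/s: gzip {gz:.1f}s, xz {xz:.1f}s")
```

With these particular numbers the crossover sits at a little under 11 MB/s; on any slower link the smaller XZ file would come out ahead despite the slower decompression.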

Compression formats for kernel.org

Posted Feb 18, 2010 11:04 UTC (Thu) by dlang (subscriber, #313) [Link]

don't limit your thinking to download speed.

I frequently compress my logfiles with gzip -9 even though I know that I will read them a few hours later. I do this because I have measured and found that it's faster to read the compressed data from disk and uncompress it than to read the uncompressed data from disk (even on some fairly beefy disk systems)

with bzip2 this is very much not the case.

I have not had a chance to measure xz in similar conditions yet, but from the sounds of things there's a good possibility that it will be a similar win (and if the decompression can be multithreaded it may be even better)

Compression formats for kernel.org

Posted Feb 18, 2010 14:43 UTC (Thu) by pointwood (guest, #2814) [Link]

PBzip2 (Parallel Bzip2) exists: http://compression.ca/pbzip2/

Compression formats for kernel.org

Posted Feb 18, 2010 17:42 UTC (Thu) by intgr (subscriber, #39733) [Link]

This is a good point, but do note that the topic was decompression speed. bzip2 is pretty good in terms of compression speed and ratio, but performs very badly at decompression.

Just for some rough figures, I'm decompressing the Linux kernel 2.6.32 source tarball, on my quad-core Phenom II system:
pbzcat, four threads, takes 4.1 seconds of wall-clock time (15.6s CPU time).
xzcat, single thread, takes just 4.7 seconds.
zcat, single thread, takes 2.3 seconds.

So, parallel bzip2 decompression will probably beat gzip at 8 cores, whereas XZ would be on par with just 2 cores. While XZ is slow at compression, it will definitely beat gzip and bzip2 in parallel decompression.
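
The arithmetic behind that estimate can be sketched as follows. This is an idealized model only: it assumes CPU time divides perfectly across cores, which real decompressors will not quite achieve, and it treats parallel XZ decompression as hypothetical. The helper names are invented for this example; the timings are the ones listed above.

```python
import math

# CPU seconds needed to decompress the 2.6.32 tarball (figures from above)
CPU_SECONDS = {"bzip2": 15.6, "xz": 4.7}
GZIP_WALL_SECONDS = 2.3  # single-threaded zcat


def ideal_wall_seconds(cpu_seconds, cores):
    """Wall-clock time if the work split perfectly across `cores`."""
    return cpu_seconds / cores


def cores_to_match(cpu_seconds, target_seconds):
    """Smallest core count whose ideal wall time is at or below the target."""
    return math.ceil(cpu_seconds / target_seconds)


if __name__ == "__main__":
    for name, cpu in CPU_SECONDS.items():
        needed = cores_to_match(cpu, GZIP_WALL_SECONDS)
        print(f"{name}: needs {needed} ideal cores to match gzip")
        for cores in (2, 4, 8):
            print(f"  {cores} cores: {ideal_wall_seconds(cpu, cores):.2f}s")
```

Under this model two cores would put XZ at 2.35s, roughly on par with gzip's 2.3s, while bzip2 only drops below gzip at around seven or eight cores. The measured pbzcat result (4.1s wall for 15.6s of CPU on four threads) suggests the scaling is close to, but not quite, linear.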

Compression formats for kernel.org

Posted Feb 23, 2010 6:23 UTC (Tue) by SEJeff (subscriber, #51588) [Link]

And from someone who uses pbzip2 on gobs and gobs of large files every day... it uses a ton of RAM and is still slow.

Compression formats for kernel.org

Posted Feb 18, 2010 11:58 UTC (Thu) by zuki (subscriber, #41808) [Link]

I presume that if someone cares so much about download speed, they do it repeatedly. Wouldn't it be easier and _much_ faster to just use git to fetch the updates?

lrzip is often the winner

Posted Feb 18, 2010 12:01 UTC (Thu) by epa (subscriber, #39769) [Link]

Take a look at lrzip. It is an LZMA version of the older 'rzip', and works by first doing a simple compression using a very large window (say, 200 megabytes) before feeding the data to LZMA. This often allows it to get a better space-speed tradeoff than other compressors. Its strongest performance is when compressing archives with several almost-identical copies of the same data, for example a set of different kernel releases. For just one release, plain LZMA as implemented by XZ might be as good.
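
The effect of window size on near-identical data can be illustrated with a small Python sketch. This is not lrzip and does not reproduce its algorithm; it only compares the standard library's zlib (a 32KB window) with its lzma module (a dictionary of several megabytes at the default preset) on two identical copies of a block placed 200KB apart. The function name and block size are arbitrary choices for the example.

```python
import lzma
import os
import zlib


def compressed_sizes(block_size=200_000):
    """Return (raw, zlib, lzma) sizes for two identical random blocks."""
    block = os.urandom(block_size)  # random bytes: incompressible on their own
    data = block + block            # the second copy starts 200KB after the first
    return len(data), len(zlib.compress(data, 9)), len(lzma.compress(data))


if __name__ == "__main__":
    raw, deflate, xz = compressed_sizes()
    print(f"raw:  {raw} bytes")
    print(f"zlib: {deflate} bytes")  # cannot see back far enough to find the copy
    print(f"lzma: {xz} bytes")       # dictionary is large enough to reuse the first copy
```

The duplicate lies outside zlib's window, so it gains nothing, while lzma can encode the second copy as references to the first. lrzip pushes the same idea much further by searching over far larger distances before the final compression stage.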

An alternative would be to distribute git trees for each release, but without any of the version history; just put all the files into a fresh git repository and do 'git repack' with maximum settings. Then compress that.

Note that there are at least two LZMA compression programs with a gzip-style interface: XZ and lzip. I have no idea why they haven't merged or at least standardized on a common file format.

"...the tools are not yet widely available in distributions"

Posted Feb 18, 2010 15:20 UTC (Thu) by dunlapg (subscriber, #57764) [Link]

Ubuntu 9.04 doesn't seem to know anything about it.

Ubuntu 9.10 gives you a scary warning message:

You are about to do something potentially harmful
To continue type in the phrase ‘Yes, do as I say!’
?]

Not seeing this replacing bzip2 for at least another year or two.

"...the tools are not yet widely available in distributions"

Posted Feb 18, 2010 19:24 UTC (Thu) by magnus (subscriber, #34778) [Link]

Looks like it conflicts with the lzma package that dpkg depends on..

"...the tools are not yet widely available in distributions"

Posted Feb 23, 2010 12:58 UTC (Tue) by nye (guest, #51576) [Link]

That's because lzma has been renamed to xz. They're the same thing.

"...the tools are not yet widely available in distributions"

Posted Feb 23, 2010 15:44 UTC (Tue) by johill (subscriber, #25196) [Link]

Not exactly, AIUI the container format is different even if the compression algorithm is still the same.

"...the tools are not yet widely available in distributions"

Posted Feb 24, 2010 11:54 UTC (Wed) by nye (guest, #51576) [Link]

No, I mean the package called 'lzma' has been renamed to 'xz' in more recent versions. There are indeed a couple of different ways of using the LZMA algorithm to produce a compressed archive but this is a different issue.

Compression formats for kernel.org

Posted Feb 18, 2010 18:46 UTC (Thu) by clugstj (subscriber, #4020) [Link]

I don't understand Ted's comment. If you are not savvy enough to download and build a simple compression tool, what business do you have downloading and trying to make use of the kernel source?

Compression formats for kernel.org

Posted Feb 18, 2010 22:37 UTC (Thu) by proski (subscriber, #104) [Link]

Exactly. The same is true from the security standpoint. Installing a utility is easier than installing a kernel.

Compression formats for kernel.org

Posted Feb 19, 2010 0:01 UTC (Fri) by gdt (subscriber, #6284) [Link]

But you are asking users to do exactly that when reporting errors. The LKML doesn't like reports against distribution kernels, it prefers reports against a recent kernel.org kernel.

So if you want decent error reports then you've got to make it easy for users — even beginners who have no interest in Linux beyond this one bug that is making their life hell — to download, compile, install and run the kernel.org kernel on their otherwise stock operating system.

If you don't want decent error reports and real user testing of recent kernels, then by all means use tools that aren't packaged with recent distributions. In summary: move tar.bz2 to whatever but keep tar.gz.

Compression formats for kernel.org

Posted Feb 19, 2010 0:19 UTC (Fri) by jspaleta (subscriber, #50639) [Link]

Uhm... are upstream kernel developers really asking for beginners to do all that on their own? I'm not sure they are. I think there is an implied expectation that distributors are supposed to act in enlightened self-interest to help their less-technical userbase produce useful bug reports.

I don't really expect the vast majority of Linux beginners using Google Android or Palm WebOS to have the technical competence or desire to compile stock kernels on their own without the intervention of the Google or Palm employees who originally built and tested the patched kernel binaries those users are running.

-jef

Compression formats for kernel.org

Posted Feb 19, 2010 10:51 UTC (Fri) by nix (subscriber, #2304) [Link]

Quite so.

I mean, sure, if you're using some out-of-tree thingy you got yourself, then obviously you're expected to be able to patch/compile/build your own kernel... but if it came from the distro, then *they* are the ones who should be interacting with upstream to pass on bug reports (although things might be interesting if there are bugs that only the end user can reproduce: in that situation I'd expect a three-way, with upstream providing diagnostic patches, the distro building them for the poor damn user or providing a script to do so, and the user running them and reporting the results. Maybe this is too much wild-eyed dreaming, but the alternative is that bugs in the manifold out-of-tree patches that some distros include will never be fixed unless upstream happens to have just the right hardware to reproduce them.)

Compression formats for kernel.org

Posted Feb 23, 2010 21:28 UTC (Tue) by meyert (subscriber, #32097) [Link]

So why not save disk space and directory entries on kernel.org (and all its mirrors?) by providing only the gzip version (that everybody seems to agree on) and abandoning the bzip2 version? Wouldn't that be the easiest solution?

Compression formats for kernel.org

Posted Feb 26, 2010 7:24 UTC (Fri) by efexis (guest, #26355) [Link]

As then everybody who was downloading the bz2 versions would be downloading the gz versions instead, which means you're actually serving more data. It's cheaper to throw extra storage at your server just once than it is to continuously be serving more data.
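
A rough sketch of that argument in Python, using the tarball sizes from the article (79MB gzip, 62MB bzip2). It compares raw byte counts only; the relative price of storage and bandwidth is not something this thread provides, so treating a stored megabyte and a served megabyte as comparable is purely an assumption of the example, as are the function names.

```python
import math

GZIP_MB = 79   # size of the 2.6.32 tarball compressed with gzip
BZIP2_MB = 62  # size of the same tarball compressed with bzip2


def extra_transfer_mb(downloads):
    """Extra data served if `downloads` bzip2 users fetch the gzip file instead."""
    return downloads * (GZIP_MB - BZIP2_MB)


def downloads_until_saving_is_spent():
    """Downloads after which the extra transfer reaches the one-off storage saved."""
    return math.ceil(BZIP2_MB / (GZIP_MB - BZIP2_MB))


if __name__ == "__main__":
    print(f"storage saved once: {BZIP2_MB} MB")
    print(f"extra transfer after 1000 downloads: {extra_transfer_mb(1000)} MB")
    print(f"saving is used up after {downloads_until_saving_is_spent()} downloads")
```

The storage saving is a one-time 62MB per release, while the extra 17MB is paid again on every download that would otherwise have been a bzip2 file.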


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds