LZ4: vendoring in the kernel
The LZ4 compression algorithm is billed as "extremely fast", especially on the decompression side. The project publishes benchmark results showing LZ4 beating LZO decompression by a factor of four and zlib by nearly an order of magnitude. It is a lossless algorithm, so it is suitable for compressing data that must be recoverable in exactly its original form. Recent releases have added a "fast" mode that allows callers to control the trade-off between speed and the amount of compression applied.
One can imagine how this kind of fast compression would be useful to have in the kernel. And indeed, the kernel has had LZ4 capability since the 3.11 release in 2013. It was added by Chanho Min, who grabbed the r90 release from the LZ4 repository and stuffed it into the kernel under lib/lz4. A quick grep shows that it is currently used in the crypto layer, in the pstore subsystem, and in the squashfs filesystem. There are other places in the kernel that use compression, but they are not using LZ4 currently.
One of the advantages of copying the code into your own repository is that you no longer rely on an external project. Lefkowitz thought that independence was so valuable that he recommended copying for any dependency with less than about 35 million lines. In the kernel's case, there is an especially strong case against external dependencies: the kernel must be built as a standalone program using its complicated set of linker rules. It is probably possible to tweak the kernel's build system to allow it to link against externally supplied libraries, but one can imagine that there would be a fair amount of opposition to any such move. Kernel developers want to know exactly what is going into the end product.
The downside of vendoring, of course, is that you then lose out on all of the enhancements made in the original project. The LZ4 developers have made a number of releases since 2013; these have added numerous features, including the "LZ4 fast" mode. Some of the changes may have fixed bugs that, in the kernel, would constitute security vulnerabilities. None of those changes are in current kernels.
Toward the beginning of the year, Sven Schmidt posted a patch set updating LZ4 to the project's 1.7.2 release. The motivation was a desire to use the LZ4 fast mode in the Lustre filesystem, but he made the reasonable assumption that other parts of the kernel might want to take advantage of the fast mode as well. The patches are a wholesale replacement of the existing LZ4 code; the work initially done by Min to turn the LZ4 library into a kernel module has been replicated.
There do not appear to be any objections to upgrading the kernel's LZ4 implementation, but Greg Kroah-Hartman did note one potential problem and, in the process, highlighted one of the other hazards that go with vendoring. The existing in-kernel LZ4 implementation has not sat still since 2013; it has had a number of patches applied to it. Some of those were security fixes. When Schmidt replaced the LZ4 implementation, he overwrote those fixes as well, potentially reintroducing problems that had already been fixed once.
Once his attention was called to the issue, Schmidt agreed to look at the patches and make sure that his replacement does not bring the old bugs back. With luck, he will also get any relevant changes merged back upstream, though Willy Tarreau suggested that some of the fixes, at least, were specific to the kernel. If such changes exist, they are unlikely to make it upstream and will thus be something the kernel has to carry indefinitely.
Making sure that the new LZ4 maintains the fixes applied to the old one is not a huge job; the number of patches is small. Happily, they exist as separate patches, rather than having been quietly folded into the source when LZ4 was initially added to the kernel. But it is a job that has to be remembered every time that somebody decides to update the kernel's LZ4 implementation. In this case, Kroah-Hartman noticed the problem, but the project cannot always count on his attentiveness to avoid regressions with future upgrades.
Such upgrades will almost certainly happen sooner or later. The upstream LZ4 project is already up to 1.7.6 as of this writing; it has added a new high-compression mode and fixed some bugs since 1.7.2 was released. At some point, somebody working in the kernel space will want the enhancements being made upstream.
The kernel has other copied subsystems like LZ4; they are mostly low-level compression and cryptographic code. Each one of these represents a sort of disconnect from the upstream project (in cases where there is still a functioning upstream project, at least). One could regard the highly modified kernels shipped in the mobile and embedded areas as being another example of the same thing; rather than upstream their code, these vendors simply copy it from one kernel to the next.
There are solid reasons for vendoring, but also real costs associated with it. The prevalence of vendoring throughout our community suggests that we are still struggling to find the best ways to integrate software that is created by independent groups of developers, especially as the scale of our projects continues to increase. For now, we will just have to hope that, the next time somebody decides to update a library like LZ4 in the kernel, they will remember what the old fixes are and make sure they carry over to the new version.
| Index entries for this article | |
|---|---|
| Kernel | Development model |
Posted Feb 2, 2017 2:08 UTC (Thu)
by klindsay (subscriber, #7459)
[Link] (1 responses)
When I see "job that has to be remembered every time ... to avoid regressions ...", I can't help but think that a test suite for the in-kernel version of the LZ4 code would be an appropriate approach to deal with this.
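As a rough illustration of what such a suite could look like, here is a minimal sketch in Python. It assumes a simplified, bounds-checked decoder for the LZ4 block layout (token byte, literals, two-byte little-endian offset, minimum match length of four); it is not the kernel's code, and the names `decode_block`, `CorruptBlock`, `REGRESSIONS` and `failing_regressions` are made up for the example. The idea is that each locally applied fix contributes one malformed input that any replacement decoder must still reject.

```python
class CorruptBlock(ValueError):
    """Raised when a block cannot be decoded safely."""


def _read_length(data, pos, base):
    """Extend a 4-bit length field: a value of 15 means more length bytes follow."""
    length = base
    if base == 15:
        while True:
            if pos >= len(data):
                raise CorruptBlock("truncated length field")
            byte = data[pos]
            pos += 1
            length += byte
            if byte != 255:
                break
    return length, pos


def decode_block(data, max_out):
    """Decode one LZ4-style block, refusing to read or write out of bounds."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        token = data[pos]
        pos += 1
        lit_len, pos = _read_length(data, pos, token >> 4)
        if pos + lit_len > len(data) or len(out) + lit_len > max_out:
            raise CorruptBlock("literal run out of bounds")
        out += data[pos:pos + lit_len]
        pos += lit_len
        if pos == len(data):           # the last sequence carries literals only
            break
        if pos + 2 > len(data):
            raise CorruptBlock("truncated match offset")
        offset = data[pos] | (data[pos + 1] << 8)
        pos += 2
        match_len, pos = _read_length(data, pos, token & 0x0F)
        match_len += 4                 # minimum match length
        if offset == 0 or offset > len(out):
            raise CorruptBlock("match offset out of bounds")
        if len(out) + match_len > max_out:
            raise CorruptBlock("output limit exceeded")
        for _ in range(match_len):     # byte-wise copy handles overlapping matches
            out.append(out[-offset])
    return bytes(out)


# A small hand-built valid block: 3 literals, a 7-byte match, then 2 literals.
VALID_BLOCK = bytes([0x33]) + b"abc" + bytes([0x03, 0x00, 0x20]) + b"xy"

# Hypothetical regression cases: one malformed input per fix that must survive an upgrade.
REGRESSIONS = {
    "zero match offset": bytes([0x10]) + b"a" + bytes([0x00, 0x00]),
    "offset before start of output": bytes([0x10]) + b"a" + bytes([0x05, 0x00]),
    "literal run past end of input": bytes([0x50]) + b"ab",
    "unterminated length field": bytes([0xF0, 0xFF]),
}


def failing_regressions(decoder, max_out=64):
    """Return the names of malformed inputs that the decoder wrongly accepts."""
    accepted = []
    for name, block in REGRESSIONS.items():
        try:
            decoder(block, max_out)
        except CorruptBlock:
            continue
        accepted.append(name)
    return accepted
```

Running `failing_regressions(decode_block)` after every import would turn "remember the old fixes" into a check that fails on its own; an empty list means every malformed input was rejected.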
Posted Feb 2, 2017 8:13 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link]
Posted Feb 2, 2017 9:51 UTC (Thu)
by karkhaz (subscriber, #99844)
[Link] (2 responses)
To clarify, are these "separate patches" a bunch of separate commits, or do they exist as individual patch files somewhere in the tree? I had a quick look through the tree but didn't find any patch files.
If they are separate commits, then indeed GKH's concern is valid, one must remember to rebase those commits back onto tip-of-tree every time the vendored code is updated. But there's a better way: if all the patches applied by the kernel are kept as separate files, and the `patch' command is used _as part of the build process, during every build_, then nobody has to remember anything. You keep the vendored code vanilla, and each patch is kept separately, and every time a patch gets upstreamed you remove it from the tree and update the vendored code to match.
This is how linux distros typically do things. See e.g. Arch Linux's source package for the kernel itself [0]. When building the package from source, the PKGBUILD file contains instructions on how to download the (vanilla) kernel, and also contains invocations to `patch' to correctly apply those (Arch Linux specific) patch files that you see in the directory. If I want to compile the kernel with _my own_ patches in addition to the Arch ones (so that it still works nicely on my Arch box) then I simply add one more patch file and run the build again.
[0] https://git.archlinux.org/svntogit/packages.git/tree/trun...
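The "patch queue applied on every build" idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how `patch' or any distro tooling works internally: each local patch is modelled as a single search-and-replace hunk, and `LocalPatch`, `apply_queue`, `UPSTREAM_V1`, `UPSTREAM_V2` and the C snippets inside them are invented for illustration. What it tries to show is the property that matters here: if a new upstream drop changes the code a local fix depended on, the build stops instead of silently losing the fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalPatch:
    """One locally carried change, modelled as a single search-and-replace hunk."""
    name: str
    old: str
    new: str


class PatchDoesNotApply(RuntimeError):
    """Raised when a local patch no longer matches the vendored source."""


def apply_queue(pristine, queue):
    """Apply every local patch, in order, to an untouched upstream source text."""
    source = pristine
    for patch in queue:
        # Require exactly one match so a patch never lands in the wrong place.
        if source.count(patch.old) != 1:
            raise PatchDoesNotApply(f"{patch.name}: context not found exactly once")
        source = source.replace(patch.old, patch.new)
    return source


# Hypothetical vendored source as first imported.
UPSTREAM_V1 = (
    "int decode(const char *src, int len)\n"
    "{\n"
    "    return copy(src, len);\n"
    "}\n"
)

# Hypothetical later upstream release that rewrote the same function.
UPSTREAM_V2 = (
    "int decode(const char *src, size_t len)\n"
    "{\n"
    "    return copy_fast(src, len);\n"
    "}\n"
)

# The local patch queue, kept beside the pristine code rather than folded into it.
QUEUE = [
    LocalPatch(
        name="0001-reject-negative-length",
        old="    return copy(src, len);\n",
        new="    if (len < 0)\n        return -1;\n    return copy(src, len);\n",
    ),
]
```

Here `apply_queue(UPSTREAM_V1, QUEUE)` yields the patched source, while `apply_queue(UPSTREAM_V2, QUEUE)` raises `PatchDoesNotApply`, which is the moment somebody has to decide whether the fix went upstream, needs porting, or is obsolete.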
Posted Feb 2, 2017 10:07 UTC (Thu)
by gregkh (subscriber, #8)
[Link] (1 responses)
You can see them easily by running 'git log' on the specific files you are curious about.
Posted Feb 6, 2017 3:57 UTC (Mon)
by tterribe (guest, #66972)
[Link]
Primarily, it serves as visible documentation so someone doesn't have to remember to run 'git log' on a bunch of files before patching/replacing them. The fact that you have to update it when you patch the vendored code is also a good reminder that someone else (like a future you) will have to deal with that patch on the next code import, and gives good incentive to move patches upstream if possible.
Posted Feb 2, 2017 13:57 UTC (Thu)
by daurnimator (guest, #92358)
[Link] (3 responses)
Posted Feb 2, 2017 14:51 UTC (Thu)
by epa (subscriber, #39769)
[Link] (2 responses)
Make a branch at the point where LZ4 code was first copied to the kernel. Working on that branch, upgrade the LZ4 code to the latest release. Then when you merge the branch back in, conflict resolution will automatically notice the locally applied changes to LZ4 in the meantime, either patching them in or flagging them as conflicts. As a final check you can diff the resulting LZ4 code against the vanilla LZ4 latest version to make sure the local patches still make sense.
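The reason this works is the three-way comparison against the common ancestor. A minimal sketch of that decision rule in Python, at whole-file granularity only (git goes further and merges within files), might look like the following; the function name `three_way`, the file names and the version strings are placeholders for illustration rather than the real contents of lib/lz4.

```python
def three_way(base, ours, theirs):
    """File-level three-way merge of {path: content} snapshots.

    base   -- the tree as originally imported (the common ancestor)
    ours   -- the kernel's copy, including any local fixes
    theirs -- the new upstream release
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o                 # both sides agree (or both deleted the file)
        elif o == b:
            result = t                 # only upstream changed: take the upgrade
        elif t == b:
            result = o                 # only the kernel changed: keep the local fix
        else:
            conflicts.append(path)     # both changed differently: needs a human
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Placeholder snapshots; the strings stand in for file contents.
BASE = {"decompress.c": "r90", "compress_hc.c": "r90", "defs.h": "r90"}
OURS = {"decompress.c": "r90 + local fix", "compress_hc.c": "r90", "defs.h": "r90 + kernel tweak"}
THEIRS = {"decompress.c": "1.7.2", "compress_hc.c": "1.7.2", "defs.h": "r90"}

MERGED, CONFLICTS = three_way(BASE, OURS, THEIRS)
```

In this made-up case the untouched file is upgraded, the locally tweaked header is kept, and the file changed on both sides is flagged, so the local fix cannot vanish without someone looking at it.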
Posted Feb 2, 2017 14:56 UTC (Thu)
by daurnimator (guest, #92358)
[Link] (1 responses)
Posted Feb 2, 2017 16:43 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
To import, we take a subset of that repo (we usually do not care about docs and the like), put it in a tree, make a commit with its parent pointing to the previous commit (initial imports use a new root commit), and then merge into place using -Xsubtree. This allows us to keep the history as one would expect as well as not inflating our repo size with the full history of the import. Git checks ensure that the imported directory is only changed via this mechanism (and also protects against "evil merges").
Posted Feb 2, 2017 17:32 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Seems like it would be worthwhile for the upstream project to have a header file with all of the bits of functions a user like the kernel might need to change, and the kernel could compile with a kernel-specific header file instead.
(Along with another rule: "Any files that are vendored from external projects should include an explicit comment about its origin, version and a reference to the regression test rule for patches.")
I don't know if the kernel does it already: but if you vendor code from a git-based project into a git-based project, **please** use 'git subtree' to do it.
- you'd be able to extract the kernel commits that applied to the LZ4 library, and send a PR upstream (or allow them to cherry pick).
- when pulling down changes, you get to maintain original commit messages (possibly e.g. mentioning CVEs) and dates as well as author attribution
