Trusting upstream
When one is trying to determine if there are compliance problems in a body of source code—either code from a device maker or from someone in the supply chain for a device—the sheer number of files to consider can be a difficult hurdle. A simple technique can reduce the search space significantly, though it does require a bit of a "leap of faith", according to Armijn Hemel. He presented his technique, along with a case study and a war story or two at LinuxCon Japan.
Hemel was a longtime core contributor to the gpl-violations.org project before retiring to a volunteer role. He is currently using his compliance background in his own company, Tjaldur Software Governance Solutions, where he consults with clients on license compliance issues. Hemel and Shane Coughlan also created the Binary Analysis Tool (BAT) to look inside binary blobs for possible compliance problems.
Consumer electronics
There are numerous license problems in today's consumer electronics market, Hemel said. There are many products containing GPL code with no corresponding source code release. Beyond that, there are products with only a partial release of the source code, as well as products that release the wrong source code. He mentioned a MIPS-based device that provided kernel source with a configuration file that chose the ARM architecture. There is no way that code could have run on the device using that configuration, he said.
That has led to quite a few cases of license enforcement in various countries, particularly Germany, France, and the US. There have been many cases handled by gpl-violations.org in Germany, most of which were settled out of court. Some went to court and the copyright holders were always able to get a judgment upholding the GPL. In the US, it is the Free Software Foundation, Software Freedom Law Center, and Software Freedom Conservancy that have been handling the GPL enforcement.
The origin of the license issues in the consumer electronics space is the supply chain. This chain can be quite long, he said; one he was involved in was four or five layers deep and he may not have reached the end of it. Things can go wrong at each step in the supply chain as software gets added, removed, and changed. Original design manufacturers (ODMs) and chipset vendors are notoriously sloppy, though chipset makers are slowly getting better.
Because it is a "winner takes all" market, there is tremendous pressure to be faster than the competition in supplying parts for devices. If a vendor in the supply chain can deliver a few days earlier than its competitors at the same price point, it can dominate. That leads to companies cutting corners. Some do not know they are violating licenses, but others do not care that they are, he said. Their competition is doing the same thing and there is a low chance of getting caught, so there is little incentive to actually comply with the licenses of the software they distribute.
Amount of code
Device makers get lots of code from all the different levels of the supply chain and they need to be able to determine whether the licenses on that code are being followed. While business relationships should be based on trust, Hemel said, it is also important to verify the code that is released with an incorporated part. Unfortunately, the number of files being distributed can make that difficult. If a company receives a letter from a lawyer requesting a response or fix in two weeks, for example, the sheer number of files might make that impossible to do.
For example, BusyBox, which is often distributed with embedded systems, is made up of 1700 files. The kernel used by Android has grown from 30,000 files in Android 2.2 ("Froyo") to 36,000 in Android 4.1 ("Jelly Bean"), and the 3.8.4 kernel has 41,000 files. Qt 5 is 62,000 files. Those are just some of the components on a device; when you add it all up, an Android system consists of "millions of files in total", he said. The line counts for just the C source files are similarly eye-opening: 255,000 lines in BusyBox and 12 million in the 3.8.4 kernel.
At LinuxCon Europe in 2011, the long-term support initiative was announced. As part of that, the Yaminabe project to detect duplicate work in the kernel was also introduced. That project focused on the changes that various companies were making to the kernel, so it ignored all files that were unchanged from the upstream kernel sources as "uninteresting". It found that 95% of the source code going into Android handsets was unchanged. Hemel realized that the same technique could be applied to make compliance auditing easier.
Hemel's method starts with a simple assumption: everything that an upstream project has published is safe, at least from a compliance point of view. Compliance audits should focus on those files that aren't from an upstream distribution. This is not a mechanism to find code snippets that have been copied into the source (and might be dubious, license-wise), as there are clone detectors for that purpose. His method can be used as a first-level pre-filter, though.
Why trust upstream?
Trusting the upstream projects can be a little bit questionable from a license compliance perspective. Not all of them are diligent about the license on each and every file they distribute. But the project members (or the project itself) are the copyright holders and the project chose its license. That means that only the project or its contributors can sue for copyright infringement, which is something they are unlikely to do on files they distributed.
Most upstream code is used largely unmodified, so using upstream projects as a reference makes sense, but you have to choose which upstreams to trust. For example, the Linux kernel is a "high trust" upstream, Hemel said, because of its development methodology, including the developer's certificate of origin and the "Signed-off-by" lines that accompany patches. There is still some kernel code that is licensed as GPLv1-only, but there is "no chance" you will get sued by Linus Torvalds, Ted Ts'o, or other early kernel developers over its use, he said.
BusyBox is another high trust project as it has been the subject of various highly visible court cases over the years, so any license oddities have been shaken out. Any code from the GNU project is also code that he treats as safe.
On the other hand, the Maven central repository for Java is an example of a low-trust (or no-trust) upstream. It is an "absolute mess" that has become a dumping ground for Java code, with unclear copyrights, unclear code origins, and so on. Hemel "cannot even describe how bad" the Maven central repository is; it is a "copyright time bomb waiting to explode", he said.
For his own purposes, Hemel chooses to put a lot of trust in upstreams like Samba, GNOME, or KDE, while not putting much in projects that pull in a lot of upstream code, like OpenWRT, Fedora, or Debian. The latter two are quite diligent about the origin and licenses of the code they distribute, but he conservatively chooses to trust upstream projects directly, rather than projects that collect code from many different sources.
Approach
So, his approach is simple and straightforward: generate a database of source code file checksums (really, SHA256 hashes) from upstream projects. When faced with a large body of code with unknown origins, the SHA256 of the files is computed and compared to the database. Any that are in the database can be ignored, while those that don't match can be analyzed or further scanned.
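The core of the approach is easy to sketch in Python. The snippet below is an illustrative sketch only, not BAT's actual implementation; the function names and the one-column SQLite schema are invented for the example. It builds a hash database from trusted upstream source trees, then reports the files in an unknown tree whose hashes are not in the database:

```python
import hashlib
import os
import sqlite3

def sha256_of(path):
    """Return the SHA256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_database(db_path, upstream_dirs):
    """Record the hash of every file found in the trusted upstream trees."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hashes (sha256 TEXT PRIMARY KEY)")
    for top in upstream_dirs:
        for root, _, files in os.walk(top):
            for name in files:
                digest = sha256_of(os.path.join(root, name))
                conn.execute("INSERT OR IGNORE INTO hashes VALUES (?)",
                             (digest,))
    conn.commit()
    return conn

def unknown_files(conn, tree):
    """Yield the files in 'tree' whose hashes are not in the database;
    only these candidates need a closer audit."""
    for root, _, files in os.walk(tree):
        for name in files:
            path = os.path.join(root, name)
            found = conn.execute(
                "SELECT 1 FROM hashes WHERE sha256 = ?",
                (sha256_of(path),)).fetchone()
            if found is None:
                yield path
```

A production database would also record which upstream package and version each hash came from, so that a match can be reported together with its provenance rather than simply discarded.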
In terms of reducing the search space, the method is "extremely effective", Hemel said. It takes about ten minutes for a scan of a recent kernel, which includes running Ninka and FOSSology on source files that do not match the hashes in the database. Typically, he finds that only 5-10% of files are modified, so the search space is quickly reduced by 90% or more.
There are some caveats. Using the technique requires a "leap of faith" that the upstream is doing things well and not every upstream is worth trusting. A good database that contains multiple upstream versions is time consuming to create and to keep up to date. In addition, it cannot help with non-source-related compliance problems (e.g. configuration files). But it is a good tool to help prioritize auditing efforts, even if the upstreams are not treated as trusted. He has used the technique for Open Source Automation Development Lab (OSADL) audits and for other customers with great success.
Case study
Hemel presented something of a case study that looked at the code on a Linux-based router made by a "well-known Chinese router manufacturer". The wireless chip came from a well-known chipset vendor as well. He looked at three components of the router: the Linux kernel, BusyBox, and the U-Boot bootloader.
The kernel source had around 25,000 files, of which just over 900 (or 4%) were not found in any kernel.org kernel version. 600 of those turned out to be just changes made by the version control system (CVS/RCS/Perforce version numbers, IDs, and the like). Some of what was left were proprietary files from the chipset or device manufacturers. Overall, just 300 files (1.2%) were left to look at more closely.
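Many of those version-control-only differences can be weeded out automatically. One way to do it, shown here as a hypothetical sketch rather than how BAT actually works, is to collapse expanded RCS/CVS/Perforce keywords back to their bare form before files are compared:

```python
import re

# Expanded keywords look like "$Id: foo.c,v 1.5 2003/01/01 user Exp $"
# or "$Revision: #3 $"; collapsing them back to "$Id$", "$Revision$",
# etc. makes files that differ only in these strings compare equal.
KEYWORD_RE = re.compile(
    rb"\$(Id|Revision|Header|Date|Author|Change|File|DateTime):[^$\n]*\$")

def normalize_vcs_keywords(data):
    """Return file contents (bytes) with keyword expansions collapsed."""
    return KEYWORD_RE.sub(lambda m: b"$" + m.group(1) + b"$", data)
```

Hashing the normalized contents instead of the raw bytes would let such files match the upstream database directly, instead of turning up as false positives to be dismissed by hand.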
For BusyBox, there were 442 files and just 62 (14%) that were not in the database. The changed files were mostly just version control identifiers (17 files), device/chipset files, a modified copy of bridge-utils, and a few bug fixes.
The situation was much the same for U-Boot: 2989 files scanned, with 395 (13%) not in the database. Most of those files were either chipset vendor files or ones with Perforce changes, but there were several with licenses other than the GPL (which is what U-Boot uses). There was also a file with the text "Permission granted for non-commercial use", which is not something the router maker could claim. As it turned out, the file was simply present in the U-Boot directory and was not used in the binary built for the device.
Scripts to create the database are available in BAT version 14; a basic scanning script is coming in BAT 15, but is already available in the Subversion repository for the project. Fancier tools are available to Hemel's clients, he said. One obvious opportunity for collaboration, which did not come up in the talk, would be to collectively create and maintain a database of hash values for high-profile projects.
How to convince the legal department that this is a valid approach was the subject of some discussion at the end of the talk. It is a problem, Hemel said, because legal teams may not feel confident about the technique even though it is a "no brainer" for developers. Another audience member suggested that giving examples of others who have successfully used the technique is often the best way to make the lawyers comfortable with it. Also, legal calls, where lawyers can discuss the problem and possible solutions with other lawyers who have already been down that path, can be valuable.
Working with the upstream projects to clarify any licensing ambiguities is also useful. It can be tricky to get those projects to fix files with an unclear license, especially when the project's intent is clear. In many ways, "git pull" (and similar commands) have made it much easier to pull in code from third-party projects, but sometimes that adds complexity on the legal side. That is something that can be overcome with education and working with those third-party projects.
[I would like to thank the Linux Foundation for travel assistance to Tokyo
for LinuxCon Japan.]
| Index entries for this article | |
|---|---|
| Conference | LinuxCon Japan/2013 |
