
Trusting upstream

By Jake Edge
June 4, 2013
LinuxCon Japan 2013

When one is trying to determine if there are compliance problems in a body of source code—either code from a device maker or from someone in the supply chain for a device—the sheer number of files to consider can be a difficult hurdle. A simple technique can reduce the search space significantly, though it does require a bit of a "leap of faith", according to Armijn Hemel. He presented his technique, along with a case study and a war story or two at LinuxCon Japan.


Hemel was a longtime core contributor to the gpl-violations.org project before retiring to a volunteer role. He is currently using his compliance background in his own company, Tjaldur Software Governance Solutions, where he consults with clients on license compliance issues. Hemel and Shane Coughlan also created the Binary Analysis Tool (BAT) to look inside binary blobs for possible compliance problems.

Consumer electronics

There are numerous license problems in today's consumer electronics market, Hemel said. There are many products containing GPL code with no corresponding source code release. Beyond that, there are products with only a partial release of the source code, as well as products that release the wrong source code. He mentioned a MIPS-based device that provided kernel source with a configuration file that chose the ARM architecture. There is no way that code could have run on the device using that configuration, he said.

That has led to quite a few cases of license enforcement in various countries, particularly Germany, France, and the US. There have been many cases handled by gpl-violations.org in Germany, most of which were settled out of court. Some went to court and the copyright holders were always able to get a judgment upholding the GPL. In the US, it is the Free Software Foundation, Software Freedom Law Center, and Software Freedom Conservancy that have been handling the GPL enforcement.

The origin of the license issues in the consumer electronics space is the supply chain. This chain can be quite long, he said; one he was involved in was four or five layers deep and he may not have reached the end of it. Things can go wrong at each step in the supply chain as software gets added, removed, and changed. Original design manufacturers (ODMs) and chipset vendors are notoriously sloppy, though chipset makers are slowly getting better.

Because it is a "winner takes all" market, there is tremendous pressure to be faster than the competition in supplying parts for devices. If a vendor in the supply chain can deliver a few days earlier than its competitors at the same price point, it can dominate. That leads to companies cutting corners. Some do not know they are violating licenses, but others do not care that they are, he said. Their competition is doing the same thing and there is a low chance of getting caught, so there is little incentive to actually comply with the licenses of the software they distribute.

Amount of code

Device makers get lots of code from all the different levels of the supply chain and they need to be able to determine whether the licenses on that code are being followed. While business relationships should be based on trust, Hemel said, it is also important to verify the code that is released with an incorporated part. Unfortunately, the number of files being distributed can make that difficult. If a company receives a letter from a lawyer requesting a response or fix in two weeks, for example, the sheer number of files might make that impossible to do.

For example, BusyBox, which is often distributed with embedded systems, is made up of 1700 files. The kernel used by Android has grown from 30,000 files (Android 2.2 "Froyo") to 36,000 (Android 4.1 "Jelly Bean"), and the 3.8.4 kernel has 41,000 files. Qt 5 is 62,000 files. Those are just some of the components on a device; when you add it all up, an Android system consists of "millions of files in total", he said. The line counts for just the C source files are similarly eye-opening, with 255,000 lines in BusyBox and 12 million in the 3.8.4 kernel.

At LinuxCon Europe in 2011, the long-term support initiative was announced. As part of that, the Yaminabe project to detect duplicate work in the kernel was also introduced. That project focused on the changes that various companies were making to the kernel, so it ignored all files that were unchanged from the upstream kernel sources as "uninteresting". It found that 95% of the source code going into Android handsets was unchanged. Hemel realized that the same technique could be applied to make compliance auditing easier.

Hemel's method starts with a simple assumption: everything that an upstream project has published is safe, at least from a compliance point of view. Compliance audits should focus on those files that aren't from an upstream distribution. This is not a mechanism to find code snippets that have been copied into the source (and might be dubious, license-wise), as there are clone detectors for that purpose. His method can be used as a first-level pre-filter, though.

Why trust upstream?

Trusting the upstream projects can be a little bit questionable from a license compliance perspective. Not all of them are diligent about the license on each and every file they distribute. But the project members (or the project itself) are the copyright holders and the project chose its license. That means that only the project or its contributors can sue for copyright infringement, which is something they are unlikely to do on files they distributed.

Most upstream code is used largely unmodified, so using upstream projects as a reference makes sense, but you have to choose which upstreams to trust. For example, the Linux kernel is a "high trust" upstream, Hemel said, because of its development methodology, including the developer's certificate of origin and the "Signed-off-by" lines that accompany patches. There is still some kernel code that is licensed as GPLv1-only, but there is "no chance" you will get sued by Linus Torvalds, Ted Ts'o, or other early kernel developers over its use, he said.

BusyBox is another high trust project as it has been the subject of various highly visible court cases over the years, so any license oddities have been shaken out. Any code from the GNU project is also code that he treats as safe.

On the other hand, projects like the Maven central repository for Java are an example of a low or no trust upstream. Maven central is an "absolute mess" that has become a dumping ground for Java code, with unclear copyrights, unclear code origins, and so on. Hemel "cannot even describe how bad" the Maven central repository is; it is a "copyright time bomb waiting to explode", he said.

For his own purposes, Hemel chooses to put a lot of trust in upstreams like Samba, GNOME, or KDE, while not putting much in projects that pull in a lot of upstream code, like OpenWRT, Fedora, or Debian. The latter two are quite diligent about the origin and licenses of the code they distribute, but he conservatively chooses to trust upstream projects directly, rather than projects that collect code from many other different projects.

Approach

So, his approach is simple and straightforward: generate a database of source code file checksums (really, SHA256 hashes) from upstream projects. When faced with a large body of code with unknown origins, the SHA256 of the files is computed and compared to the database. Any that are in the database can be ignored, while those that don't match can be analyzed or further scanned.
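The approach described above can be sketched in a few lines of Python. This is a minimal illustration and not the BAT implementation: the function names (`sha256_of_file`, `build_database`, `files_to_audit`) are made up for this example, and the "database" is simply an in-memory set of hex digests built from unpacked upstream source trees.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_files(root):
    """Yield the path of every regular (non-symlink) file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def build_database(upstream_roots):
    """Collect the hashes of all files in the given unpacked upstream trees."""
    known = set()
    for root in upstream_roots:
        for path in iter_files(root):
            known.add(sha256_of_file(path))
    return known


def files_to_audit(vendor_root, known_hashes):
    """Return vendor files (relative paths) whose hash is not in the database."""
    unmatched = []
    for path in iter_files(vendor_root):
        if sha256_of_file(path) not in known_hashes:
            unmatched.append(os.path.relpath(path, vendor_root))
    return sorted(unmatched)
```

Everything returned by `files_to_audit()` is what would then be handed to a license scanner or a human reviewer; files that match a trusted upstream hash are skipped. Matching is on exact file contents, so a single changed byte puts a file back on the audit list.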

In terms of reducing the search space, the method is "extremely effective", Hemel said. It takes about ten minutes for a scan of a recent kernel, which includes running Ninka and FOSSology on source files that do not match the hashes in the database. Typically, he finds that only 5-10% of files are modified, so the search space is quickly reduced by 90% or more.

There are some caveats. Using the technique requires a "leap of faith" that the upstream is doing things well and not every upstream is worth trusting. A good database that contains multiple upstream versions is time consuming to create and to keep up to date. In addition, it cannot help with non-source-related compliance problems (e.g. configuration files). But it is a good tool to help prioritize auditing efforts, even if the upstreams are not treated as trusted. He has used the technique for Open Source Automation Development Lab (OSADL) audits and for other customers with great success.
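To give a feel for what a database covering multiple upstream versions might look like, here is a small sketch using SQLite from the Python standard library. The schema, the table name `upstream_files`, and the helper functions are assumptions made for illustration; the talk did not describe the layout of Hemel's database.

```python
import sqlite3


def open_database(path=":memory:"):
    """Open (or create) a hash database; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS upstream_files ("
        " sha256 TEXT NOT NULL,"
        " project TEXT NOT NULL,"
        " version TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " PRIMARY KEY (sha256, project, version, path))"
    )
    return conn


def add_release(conn, project, version, file_hashes):
    """Record one upstream release; file_hashes yields (path, sha256_hex)."""
    conn.executemany(
        "INSERT OR IGNORE INTO upstream_files VALUES (?, ?, ?, ?)",
        [(digest, project, version, path) for path, digest in file_hashes],
    )
    conn.commit()


def origins(conn, digest):
    """Return every (project, version, path) in which this hash was seen."""
    rows = conn.execute(
        "SELECT project, version, path FROM upstream_files"
        " WHERE sha256 = ? ORDER BY project, version, path",
        (digest,),
    )
    return rows.fetchall()
```

Recording the project and version alongside each hash means a match can be reported as "identical to this file in that release", and a new upstream release can be added without rebuilding the whole database. Note that the ordering is plain string ordering, not version-aware.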

Case study

Hemel presented something of a case study that looked at the code on a Linux-based router made by a "well-known Chinese router manufacturer". The wireless chip came from a well-known chipset vendor as well. He looked at three components of the router: the Linux kernel, BusyBox, and the U-Boot bootloader.

The kernel source had around 25,000 files, of which just over 900 (or 4%) were not found in any kernel.org kernel version. 600 of those turned out to be just changes made by the version control system (CVS/RCS/Perforce version numbers, IDs, and the like). Some of what was left were proprietary files from the chipset or device manufacturers. Overall, just 300 files (1.8%) were left to look at more closely.
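Files that differ from upstream only in expanded version control keywords could, in principle, be filtered out automatically. The sketch below is one possible way to do that and is not something described in the talk: it collapses expanded RCS/CVS/Perforce-style keywords such as `$Id: ... $` back to `$Id$` before hashing. The keyword list and the function names are illustrative assumptions.

```python
import hashlib
import re

# Illustrative keyword list (an assumption, not taken from the talk or BAT).
_KEYWORDS = (
    b"Author", b"Change", b"Date", b"DateTime", b"File", b"Header", b"Id",
    b"Locker", b"Log", b"Name", b"RCSfile", b"Revision", b"Source", b"State",
)

# Matches an expanded keyword on a single line, e.g. b"$Revision: 1.4 $".
_EXPANDED = re.compile(rb"\$(" + b"|".join(_KEYWORDS) + rb"):[^$\n]*\$")


def collapse_keywords(data):
    """Replace expanded keywords such as b'$Id: ... $' with b'$Id$'."""
    return _EXPANDED.sub(rb"$\1$", data)


def normalized_sha256(data):
    """Hash file contents after collapsing version control keywords."""
    return hashlib.sha256(collapse_keywords(data)).hexdigest()
```

For this to help, the upstream database would need to store the normalized hashes as well, in addition to the plain SHA256 values. It only handles single-line keyword expansions; history text that `$Log$` inserts into a file would still show up as a difference.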

For BusyBox, there were 442 files, of which just 62 (14%) were not in the database. The changed files were mostly just version control identifiers (17 files), device/chipset files, a modified copy of bridge-utils, and a few bug fixes.

The situation was much the same for U-Boot: 2989 files scanned with 395 (13%) not in the database. Most of those files were either chipset vendor files or ones with Perforce changes, but there were several with different licenses than the GPL (which is what U-Boot uses). There was also a file with the text: "Permission granted for non-commercial use", which is not something that the router could claim. As it turned out, the file was just present in the U-Boot directory and was not used in the binary built for the device.

Scripts to create the database are available in BAT version 14; a basic scanning script is coming in BAT 15 but is already available in the Subversion repository for the project. Fancier tools are available to Hemel's clients, he said. One obvious opportunity for collaboration, which did not come up in the talk, would be to collectively create and maintain a database of hash values for high-profile projects.

How to convince the legal department that this is a valid approach was the subject of some discussion at the end of the talk. It is a problem, Hemel said, because legal teams may not feel confident about the technique even though it is a "no brainer" for developers. Another audience member suggested that giving examples of others who have successfully used the technique is often the best way to make the lawyers comfortable with it. Also, legal calls, where lawyers can discuss the problem and possible solutions with other lawyers who have already been down that path, can be valuable.

Working with the upstream projects to clarify any licensing ambiguities is also useful. It can be tricky to get those projects to fix files with an unclear license, especially when the project's intent is clear. In many ways, "git pull" (and similar commands) have made it much easier to pull in code from third-party projects, but sometimes that adds complexity on the legal side. That is something that can be overcome with education and working with those third-party projects.

[I would like to thank the Linux Foundation for travel assistance to Tokyo for LinuxCon Japan.]



Trusting upstream

Posted Jun 5, 2013 10:44 UTC (Wed) by armijn (subscriber, #3653) [Link]

A short clarification: I have found the OpenWrt developers to be very conscious about licensing issues as well. In my method I prefer to go directly to upstream, if possible and err on the safe side.

As I said during my talk: YMMV. In some situations it might be perfectly reasonable to trust certain origins, in other situations you might only want to trust the original upstream.

Maven and trusting upstreams/downstreams

Posted Jun 5, 2013 15:04 UTC (Wed) by sochotnicky (guest, #65774) [Link]

disclaimer: IANAL

> On the other hand, projects like the Maven build tool for Java are an example of a low or no trust upstream. Maven is an "absolute mess" that has become a dumping ground for Java code, with unclear copyrights, unclear code origins, and so on. Hemel "cannot even describe how bad" the Maven code base is; it is a "copyright time bomb waiting to explode", he said.

Was he talking about Apache Maven the project or Maven Central the repository where almost all Java projects publish their binaries (similar to PyPi/CPAN)? If it's the former then I don't agree. Current codebase is pure ASL 2.0, no bundled code, dependencies are also mostly ASL/MIT/BSD.

As for the Maven Central...sure, like with any repository where anything goes the licensing is often unclear, confusing or even missing completely. But that has nothing to do with Maven as a tool. Maven metadata even includes license tag and most good upstreams make use of it.

So...either Mr. Hemel is confused about what Maven is/isn't or there was some misunderstanding elsewhere?

> For his own purposes, Hemel chooses to put a lot of trust in upstreams like Samba, GNOME, or KDE, while not putting much in projects that pull in a lot of upstream code, like OpenWRT, Fedora, or Debian. The latter two are quite diligent about the origin and licenses of the code they distribute, but he conservatively chooses to trust upstream projects directly, rather than projects that collect code from many other different projects.

Big projects like Apache and Eclipse are among the best from licensing POV. They rarely make licensing mistakes, even have formal licensing reviews in place and when a problem is found they are usually responsive (and responsible). There's one thing bugging distributions quite often though;
ASL 2.0 point 4 a):

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

A lot of upstreams do not include ASL 2.0 license in their tarballs so to comply with this point of ASL 2.0 license even source RPMs have to include this file manually. It's a small thing and most upstreams are ready to apply patches, but it takes time...

Contrary to Mr. Hemel, I would trust distributions such as Debian or Fedora more than upstream projects themselves. Reasons:

- Both projects have formal licensing reviews of each project before inclusion
- Both projects have lawyers at their disposal if any inconsistency/confusion is encountered
- Upstream projects as original authors do not usually have *that* high incentive to make sure licensing is 100% correct and clear. Distributions operate world-wide and they need clear and correct licensing information. Barring clear-cut violations when mixing GPL code with some incompatible licenses, upstream developers can do whatever they want. They cause problems downstream, but not for themselves

Maven and trusting upstreams/downstreams

Posted Jun 5, 2013 16:07 UTC (Wed) by armijn (subscriber, #3653) [Link]

In my talk I said the Maven central repository.

Maven and trusting upstreams/downstreams

Posted Jun 5, 2013 16:53 UTC (Wed) by jake (editor, #205) [Link]

> In my talk I said the Maven central repository.

Ah, my apologies for misunderstanding. Since you also talked about Maven being a build tool for Java, it might be pretty easy for others to misinterpret too.

jake

Trusting upstream

Posted Jun 6, 2013 10:56 UTC (Thu) by etienne (subscriber, #25256) [Link]

> But there is also a file with the text: "Permission granted for non-commercial use"

I have already seen that string when dumping the BIOS of a PC, probably less usual now (maybe because sections of FLASH are compressed)...

Trusting upstream

Posted Jun 7, 2013 19:28 UTC (Fri) by armijn (subscriber, #3653) [Link]

The code in question is actually quite old (from 1989), and is often found in embedded distributions (including EmDebian) in a file called lzari.c or similar.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds