When one is trying to determine if there are compliance problems in a body
of
source code—either code from a device maker or from someone in the supply chain
for a device—the sheer number of files to consider can be a difficult
hurdle. A simple technique can reduce the search space
significantly, though it does require a bit of a "leap of faith", according
to Armijn Hemel. He presented his technique, along with a
case study and a war story or two, at LinuxCon Japan.
Hemel was a longtime core contributor to the gpl-violations.org project before retiring
to a volunteer role. He is currently using his compliance background
in his own company, Tjaldur
Software Governance Solutions, where he consults with clients on
license compliance issues. Hemel and Shane Coughlan also created the Binary Analysis Tool (BAT)
to look inside binary blobs
for possible compliance problems.
Consumer electronics
There are numerous license problems in today's consumer electronics market,
Hemel said. There are many products containing GPL code with no
corresponding
source code release. Beyond that, there are products with only a partial
release of the source code, as well as products that release the wrong
source code. He mentioned a MIPS-based device that provided kernel source
with a configuration file that chose the ARM architecture. There is no way
that code could have run on the device using that configuration, he said.
That has led to quite a few cases of license enforcement in various
countries, particularly Germany, France, and the US. There have been many
cases handled by gpl-violations.org in Germany, most of which were settled
out of court. Some went to court and the copyright holders were always
able to get a judgment upholding the GPL. In the US, it is the Free
Software Foundation, Software
Freedom Law Center, and Software Freedom
Conservancy that have been
handling the GPL enforcement.
The origin of the license issues in the consumer electronics space is the
supply chain. This chain can be quite long, he said; one he was involved
in was four or five layers deep and he may not have reached the end of it.
Things can go wrong at each step in the supply chain as software gets
added, removed, and changed. Original design manufacturers (ODMs) and
chipset vendors are notoriously sloppy, though chipset makers are slowly
getting better.
Because it is a "winner takes all" market, there is tremendous pressure to
be faster than the competition in supplying parts for devices. If a vendor
in the supply chain can deliver a few days earlier than its competitors at
the same price point, it can dominate. That leads to companies cutting
corners. Some do not know they are violating licenses, but others do not
care that they are, he said. Their competition is doing the same thing and
there is a low chance of getting caught, so there is little incentive to
actually comply with the licenses of the software they distribute.
Amount of code
Device makers get lots of code from all the different levels of
the supply chain and they need to be able to determine whether the licenses
on that code are being followed.
While business relationships should be based on trust, Hemel said, it is
also important to verify the code that is released with an incorporated
part. Unfortunately, the number of files being distributed can make that
difficult.
If a company receives a letter from a lawyer requesting a
response or fix
in two weeks, for example, the sheer number of files might make that
impossible to do.
For example, BusyBox, which is often distributed with embedded systems, is
made up of 1700 files. The number of files in the kernel used by Android has
increased from 30,000 (Android 2.2 "Froyo") to 36,000 (Android 4.1 "Jelly
Bean")—and the 3.8.4 kernel has
41,000 files. Qt 5 is 62,000 files. Those are just some of the
components on a device; when you add it all up, an
Android system consists of "millions of files in total", he said. The
number of lines of code in just the C source files is similarly eye-opening,
with 255,000 lines in BusyBox and 12 million in the 3.8.4 kernel.
At LinuxCon Europe in 2011, the
long-term support initiative was
announced. As part of that, the Yaminabe
project to detect duplicate work in the kernel was also introduced.
That project focused on the changes that various companies were making to
the kernel, so it ignored all files that were unchanged from the upstream
kernel sources as "uninteresting". It found that 95% of the source code
going into Android handsets was unchanged. Hemel realized that the same
technique could be applied to make compliance auditing easier.
Hemel's method starts with a simple assumption: everything that an upstream
project has published is safe, at least from a compliance point of view.
Compliance audits should focus on those files that aren't from an
upstream distribution. This is not a mechanism to find code snippets that
have been copied into the source (and might be dubious, license-wise),
as there are clone detectors for that purpose. His method can be used as a
first-level pre-filter, though.
Why trust upstream?
Trusting the upstream projects can be a little bit questionable from a license
compliance perspective. Not all of them are diligent about the license on
each and every file they distribute. But the project members (or the
project itself) are the copyright holders and the project chose its
license. That means that only the project or its contributors can sue for
copyright infringement, which is something they are unlikely to do on files
they distributed.
Most upstream code is used largely unmodified, so using upstream projects
as a reference makes sense, but you have to choose which upstreams
to trust. For example, the Linux kernel is a "high trust" upstream, Hemel
said, because of its development methodology, including the Developer's
Certificate of Origin and the "Signed-off-by" lines that accompany
patches. There is still some kernel code that is licensed as GPLv1-only,
but there is "no chance" you will get sued by Linus Torvalds, Ted Ts'o, or
other early kernel developers
over its use, he said.
BusyBox is another high trust project as it has been the subject of various
highly visible court cases over the years, so any license oddities have
been shaken out. Any code from the GNU project is also code that he treats
as safe.
On the other hand, the Maven central repository for Java is an
example of a low- or no-trust upstream. It is an "absolute mess" that
has become a dumping ground for Java code, with unclear copyrights, unclear
code origins, and so on. Hemel "cannot even describe how bad" the Maven
repository is; it is a "copyright time bomb waiting to explode", he said.
For his own purposes, Hemel chooses to put a lot of trust in upstreams like
Samba, GNOME, or KDE, while not putting much in projects that pull in a
lot of upstream code, like OpenWRT, Fedora, or Debian. The latter two are
quite diligent about the origin and licenses of the code they distribute, but he
conservatively chooses to trust upstream projects directly, rather than
projects that collect code from many other different projects.
Approach
So, his approach is simple and straightforward: generate a database of
source code file checksums (really, SHA256 hashes) from upstream projects.
When faced with a large body of code of unknown origin, the SHA256 hashes of
its files are computed and compared against the database. Files that are in
the database can be ignored, while those that are not can be analyzed or
scanned further.
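The approach as described can be sketched in a few lines of Python. This is
an in-memory illustration only (BAT's actual scripts persist the hashes in a
real database), and the function names here are illustrative, not BAT's:

```python
import hashlib
import os

def sha256_of(path):
    """Compute the SHA256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_database(upstream_dirs):
    """Hash every file in the (already unpacked) upstream source trees."""
    known = set()
    for top in upstream_dirs:
        for dirpath, _, filenames in os.walk(top):
            for name in filenames:
                known.add(sha256_of(os.path.join(dirpath, name)))
    return known

def unknown_files(target_dir, known):
    """Return the files in target_dir whose hashes are not in the database;
    only these need further license analysis."""
    leftovers = []
    for dirpath, _, filenames in os.walk(target_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if sha256_of(path) not in known:
                leftovers.append(path)
    return leftovers
```

Building the database over multiple released versions of each upstream
project makes the filter more effective, since a device may ship any of them.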
In terms of reducing the search space, the method is "extremely effective",
Hemel said. It takes about ten minutes for a scan of a recent kernel, which
includes running Ninka and FOSSology on source
files that do not match the hashes in the database. Typically, he finds that
only 5-10% of files are modified, so the search space is quickly reduced by
90% or more.
There are some caveats.
Using the technique requires a "leap of faith" that the upstream is doing
things well, and not every upstream is worth trusting. A good database that
contains
multiple upstream versions is time consuming to create and to keep up to
date. In addition, it cannot help with non-source-related compliance
problems (e.g. configuration files). But it is a good tool to help prioritize
auditing efforts, even if the upstreams are not treated as trusted.
He has used the technique for Open
Source Automation Development Lab (OSADL) audits and for other
customers with great success.
Case study
Hemel presented something of a case study that looked at the code on a
Linux-based router made by a "well-known Chinese router manufacturer". The
wireless chip came from a well-known chipset vendor as well. He looked at
three components of the router: the Linux kernel, BusyBox, and the U-Boot
bootloader.
The kernel source had around 25,000 files, of which just over 900 (or 4%)
were not found in any kernel.org kernel version. 600 of those turned out
to be just changes made by the version control system (CVS/RCS/Perforce
version numbers, IDs, and the like). Some of what was left were
proprietary files from the chipset or device manufacturers. Overall, just
300 files (1.8%) were left to look at
more closely.
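Since so many of the differing files contained only version-control noise, a
plausible refinement is to collapse expanded CVS/RCS-style keywords back to
their bare form before hashing, so that files differing only in an expanded
"$Id$" line still match the database. The regex-based sketch below is my own
illustration of that idea, not code from BAT, and its keyword list is not
exhaustive:

```python
import hashlib
import re

# CVS/RCS-style expanded keywords such as "$Id: foo.c,v 1.4 ... $".
KEYWORD_RE = re.compile(
    rb"\$(Id|Revision|Date|Author|Header|Source|Log):[^$]*\$")

def normalized_sha256(path):
    """Hash a file with expanded VCS keywords collapsed to "$Keyword$",
    so files that differ only in keyword expansion hash identically."""
    with open(path, "rb") as f:
        data = f.read()
    data = KEYWORD_RE.sub(lambda m: b"$" + m.group(1) + b"$", data)
    return hashlib.sha256(data).hexdigest()
```

With this normalization applied to both the database and the scanned tree,
keyword-only changes drop out of the result automatically instead of having
to be weeded out by hand.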
For BusyBox, there were 442 files and just 62 (14%) that were not in the
database. The changed files were mostly just version control identifiers
(17 files), device/chipset files, a modified copy of bridge-utils, and a
few bug fixes.
The situation was much the same for U-Boot: 2989 files scanned with 395
(13%) not in the database. Most of those files were either chipset vendor
files or ones with Perforce changes, but there were several with different
licenses than the GPL (which is what U-Boot uses). There was also a
file with the text: "Permission
granted for non-commercial use"—not something that the router maker could
claim. As it turned out, the file was merely present in the U-Boot directory
and was not used in the binary built for the device.
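Whether a suspect source file actually ended up in a shipped binary can
sometimes be checked with a crude heuristic: look for the file's distinctive
string literals in the firmware image. BAT performs far more thorough binary
analysis; the sketch below is only an illustrative approximation (and the
absence of a string is weak evidence at best):

```python
import re

def source_strings(source_path, min_len=8):
    """Pull double-quoted string literals out of a C source file
    (a rough regex, not a real C parser)."""
    with open(source_path, "r", errors="replace") as f:
        text = f.read()
    lits = re.findall(r'"((?:[^"\\]|\\.){%d,})"' % min_len, text)
    # Undo the simplest escapes so a literal matches its compiled bytes.
    return [s.replace(r"\n", "\n").replace(r"\t", "\t") for s in lits]

def appears_in_binary(source_path, binary_path):
    """Heuristic: does any distinctive literal from the source file
    occur in the binary image?"""
    with open(binary_path, "rb") as f:
        blob = f.read()
    return any(s.encode() in blob for s in source_strings(source_path))
```

A hit strongly suggests the file was compiled in; a miss could simply mean
the compiler optimized the strings away, so it should only inform, not
decide, a compliance judgment.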
Scripts to create the database are available in BAT version
14; a basic scanning script is coming in BAT 15 but is already
available in the Subversion
repository for the project. Fancier tools are available to Hemel's
clients, he said. One obvious opportunity for collaboration, which did not
come up in the talk, would be to collectively create and maintain a
database of hash values for high-profile projects.
How to convince the legal department that this is a valid approach was the
subject of some discussion at the end of the talk. It is a problem, Hemel
said, because legal teams may not feel confident about the technique even
though it is a "no brainer" for developers. Another audience member suggested
that giving examples of others who have successfully used the technique is
often the
best way to make the lawyers comfortable with it. Also, legal calls, where
lawyers can discuss the problem and possible solutions with other lawyers
who have already been down that path, can be valuable.
Working with the upstream projects to clarify any licensing ambiguities is
also useful. It can be tricky to get those projects to fix files with an
unclear license, especially
when the project's intent is clear. In many ways, "git pull"
(and similar commands) have made it much easier to pull in code from
third-party projects, but sometimes that adds complexity on the legal side.
That is something that can be overcome with education and working with
those third-party projects.
[I would like to thank the Linux Foundation for travel assistance to Tokyo
for LinuxCon Japan.]