A binary analysis tool for GPL compliance investigations

By Jake Edge
May 5, 2010

There are thousands of embedded devices running Linux today, with more released hourly, it seems. Many of those are in full compliance with the licenses for the free software that they ship, but some, sadly, are not. In most cases, it is probably due to ignorance, but sometimes arrogance or even malfeasance plays a role. A new Apache-licensed Binary Analysis Tool from Armijn Hemel and Shane Coughlan is meant to help developers and others interested in GPL compliance determine whether Linux or BusyBox is present in a particular device.

There are multiple levels to GPL compliance investigations. If a device ships with neither source code nor an offer to provide it, the vendor is in effect asserting that it contains no GPL code. In that case, just detecting the presence of the Linux kernel or BusyBox is enough to identify a problem. For devices that do ship or offer source, there is another step: determining whether the source code and configuration that were provided correspond to the code on the device. That process was described by Hemel and Coughlan in a series of LWN articles (part 1, part 2, and part 3).

The first step is to extract any filesystems that exist in a firmware image, so that they can be investigated further. The Binary Analysis Tool provides the bruteforce.py script to detect various kinds of filesystems, including those that are compressed, and to extract them from the image. It then digs down inside the filesystem to find "interesting" files. Right now, the output is terse, but that is slated to change "in the near future", according to the README file.
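
As a rough illustration of how filesystem detection in a firmware image can work, the sketch below scans a blob for a handful of well-known magic numbers. It is not taken from bruteforce.py; the `SIGNATURES` table and the `find_signatures` helper are hypothetical names chosen for this example, and a real tool would cover many more formats and would try to unpack each candidate to rule out false positives.

```python
# Hypothetical signature table (an assumption for this sketch): the magic
# bytes of a few formats commonly found inside firmware images.
SIGNATURES = {
    "gzip": b"\x1f\x8b\x08",
    "bzip2": b"BZh",
    "squashfs (little-endian)": b"hsqs",
    "squashfs (big-endian)": b"sqsh",
    "cramfs (little-endian)": b"\x45\x3d\xcd\x28",
}


def find_signatures(data):
    """Return a sorted list of (offset, format name) for every magic hit."""
    hits = []
    for name, magic in SIGNATURES.items():
        start = 0
        while True:
            offset = data.find(magic, start)
            if offset == -1:
                break
            hits.append((offset, name))
            # Keep searching just past this hit so later copies are found too.
            start = offset + 1
    return sorted(hits)
```

Each reported offset is only a candidate: short magic numbers such as "BZh" can occur by chance in unrelated data, so a hit is a hint about where to attempt extraction, not proof that a filesystem starts there.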

Beyond that, there are scripts to look at BusyBox and kernel binaries to extract configuration information. Running:

    python busybox.py --binary=/path/to/busybox

on a BusyBox binary results in a list of configuration options that shows which of the applets were built into the binary:

    CONFIG_ADDGROUP=y
    CONFIG_ADDUSER=y
    CONFIG_ADJTIMEX=y
    ...

BusyBox configuration is important because it can be a clue as to whether or not the source corresponds to the binary. In fact the tool provides an automated way to compare the configuration found in a binary with one that is included in the source: busybox-compare-configs.py.
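
To make the idea of comparing configurations concrete, here is a minimal sketch that parses two sets of `CONFIG_*=value` lines and reports where they disagree. It does not reproduce busybox-compare-configs.py; `parse_config` and `compare_configs` are hypothetical helpers, and the only assumption about the input is the `CONFIG_NAME=value` line format shown above, with `#` lines treated as comments.

```python
def parse_config(text):
    """Map option name -> value for each CONFIG_*=value line; skip comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def compare_configs(binary_cfg, source_cfg):
    """Return (only in binary, only in source, same option but different value)."""
    binary_names = set(binary_cfg)
    source_names = set(source_cfg)
    only_binary = sorted(binary_names - source_names)
    only_source = sorted(source_names - binary_names)
    differing = sorted(
        name for name in binary_names & source_names
        if binary_cfg[name] != source_cfg[name]
    )
    return only_binary, only_source, differing
```

An option that shows up only on the binary side is the interesting case for a compliance check, since it suggests that the published configuration is not the one the shipped binary was built with.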

The tool uses a database of sorts for BusyBox configurations going back to the 0.52 release. The busybox-version.py command can be used to manually determine the version of a binary, or the other tools will do so automatically—though it can be overridden on the command line. In addition, the busybox.py script can check for applets in a binary for which there is no configuration option in the official BusyBox sources, which would indicate that additional code (for which source must be released) has been added.
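
The sketch below shows one way version detection and the check for unexpected applets could be approached. It is an illustration under stated assumptions, not the logic of busybox-version.py or busybox.py: it assumes the binary still carries a banner string of the form "BusyBox v<version>", which modified or stripped builds may lack, and `guess_busybox_version` and `unknown_applets` are hypothetical names.

```python
import re

# Assumption: the binary contains a banner such as "BusyBox v1.15.3 (...)".
BANNER = re.compile(rb"BusyBox v(\d+(?:\.\d+)+)")


def guess_busybox_version(data):
    """Return the version from the first banner found in data, or None."""
    match = BANNER.search(data)
    return match.group(1).decode("ascii") if match else None


def unknown_applets(found, known):
    """Return applet names seen in the binary that the known list lacks."""
    return sorted(set(found) - set(known))
```

Here `known` stands in for the list of applets in the official sources of the detected release; any name left over after the subtraction would be a candidate for vendor-added code whose source should accompany the device.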

There are also scripts to extract configuration and strings from a Linux kernel. extractkernelstrings.py is used on a provided kernel source tree and generates a database of strings that should be present in the kernel image. findkernelstrings.py then uses that database and the kernel image file to find matches, and, more importantly, things that do not match. Once again, this can lead to a determination that the source code and shipped binaries are either not the same, or not configured in the same way.
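
The matching step can be sketched roughly as follows. This is a simplification and not the code in findkernelstrings.py: a plain Python set stands in for the Lucene-backed database, matching is exact, and the image is assumed to be already decompressed. The names `printable_strings` and `match_strings` and the default `min_len` of 6 are choices made for this example.

```python
import re


def printable_strings(data, min_len=6):
    """Return the set of printable ASCII runs of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(data)}


def match_strings(image, source_strings, min_len=6):
    """Split the strings found in image into (matched, unmatched) lists."""
    found = printable_strings(image, min_len)
    matched = sorted(s for s in found if s in source_strings)
    unmatched = sorted(s for s in found if s not in source_strings)
    return matched, unmatched
```

The unmatched list is where an investigator would look first: strings in the shipped image that appear nowhere in the provided source tree hint at code or configuration that was not released.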

Due to various reverse engineering laws worldwide, the Binary Analysis Tool does not do any kind of decompilation or disassembly of the code that it finds. It strictly looks at the symbol tables and strings stored in the binaries to do its work. For much the same reason, it does not try to "crack" any encryption or DRM that might be protecting the firmware image or its contents.

The tool is still a bit rough around the edges, but does come with fairly extensive documentation, both as PDF Quick Start and User guides and as various documentation files in the source tree. It comes as a tarball or can be grabbed from an svn repository. The list of dependencies seems a bit large for a program of this type. For the kernel strings database, it uses the PyLucene Python library for accessing the Java-based Lucene text search and indexing engine, which necessitates installing OpenJDK and Ant. More obvious dependencies are required as well: python-magic for magic numbers, e2tools and the squashfs tools for accessing filesystems, and various compression utilities.

The development of the Binary Analysis Tool was supported by the NLnet Foundation and the Linux Foundation, and it was created by Hemel as part of his work at Loohuis Consulting and by Coughlan at OpenDawn. It is still being actively developed with releases scheduled for May and July. Contributions of bug reports, development time, or money to continue development are welcome.

While the scripts will be useful as a starting point for those who are investigating GPL compliance, there is still quite a bit of work to be done. The tool provides a framework for looking at two of the most common GPL-licensed components appearing in embedded devices, but there are others. It's no coincidence that the tool focuses on BusyBox and the Linux kernel, which have been the most successful at enforcing license compliance in the last several years. As other projects are used more widely in embedded devices, there will be a need to expand the coverage of tools like this.

There are uses for the tool beyond those of developers trying to ensure that their code is used properly. Embedded device manufacturers will also find it useful. There have been numerous cases of OEMs getting code from their suppliers without the proper source files, or even notice that it contains GPL code. Companies can also test their competitors' products for compliance to help level the playing field. Any tool that makes it easier to spot license compliance problems is a boon for developers, users, and device makers.



A binary analysis tool for GPL compliance investigations

Posted May 6, 2010 8:44 UTC (Thu) by Hanno (guest, #41730)

Is this related to Bincrowd?

Is this legal?

Posted May 6, 2010 8:48 UTC (Thu) by NAR (subscriber, #1313)

I've noted that the tool doesn't do any kind of decompilation or disassembly of the code, but the first EULA I've found says that

You may not reverse engineer, decompile, or disassemble the Software, except and only to the extent that such activity is expressly permitted by applicable law notwithstanding this limitation.

Isn't this reverse engineering? Or is it expressly permitted to check for copyright infringement?

Is this legal?

Posted May 6, 2010 10:56 UTC (Thu) by njh (subscriber, #4425)

I don't think that it is "reverse engineering". That term is normally used to mean studying code of the binary program to figure out how it works (what it does, what algorithms it uses, what its undocumented file formats or network protocols might look like), rather than simply looking for identifying marks which tip you off as to the heritage of the code-base.

Is this legal?

Posted May 6, 2010 22:10 UTC (Thu) by gerdesj (subscriber, #5446)

In /. tradition we probably need a car analogy and the compulsory IANAL (IAJ - in any jurisdiction - let alone the one the poster is from).

I push for: "It's like getting under the bonnet of a car, removing the spark plugs and determining what model they are"

So that you can replace the bloody things without having to get a garage to do it!

Binary related examination is not like using your eyes to see the part numbers. You need another program to do that for you. In this case looking for symbols is not decompilation, you are quite literally just getting the part numbers out of the binary in question and then looking to see if they belong to the manufacturer or were made by someone else who has a license that prohibits your usage of those parts without telling the world about changes that were made to those parts.

Is this legal?

Posted May 7, 2010 23:22 UTC (Fri) by giraffedata (subscriber, #1954)

Though I don't think it comes close to decompilation, I think it is reverse engineering, and so is looking at the part number on the outside of the spark plug.

But I think both cases are, in legal terms, "de minimis," i.e. too minor to have legal significance.

Engineering is going from requirements to design. Reverse engineering is going from implementation to design. The choice of model of spark plug is an essential part of the design of a car, as is the choice to use Busybox and to configure it with CONFIG_ADDGROUP.

I don't think the purpose of the reverse engineering (whether you're going to make competing cars or change your own spark plugs) really matters in a reverse engineering restriction. That comes up a lot when users reverse engineer their binary-only software.

A binary analysis tool for GPL compliance investigations

Posted May 6, 2010 12:16 UTC (Thu) by tmroy (guest, #56146)

It's funny that a tool to be used to check for GPL License compliance is itself not licensed under the GPL.

A binary analysis tool for GPL compliance investigations

Posted May 7, 2010 7:46 UTC (Fri) by Neil_Brown (guest, #65976)

It's funny that a tool to be used to check for GPL License compliance is itself not licensed under the GPL.

Not really, to my mind.

If the aim is to increase adoption / use of the tool, then picking a licence which is likely to appeal to as many companies as possible is a sensible step, and many companies - whether through accurate reasoning or not - avoid GPL'd code through fear of a "viral effect".

If the tool could check for violations of three different licences, would you expect it to be licensed under each of those licences?

A binary analysis tool for GPL compliance investigations

Posted May 18, 2010 20:38 UTC (Tue) by Duncan (guest, #6647)

Neil's reply implies a good point, which I'd like to make more explicit.

Software such as this would be just the sort of audit tool used to monitor continued compliance with a GPL violation settlement agreement. Yet the time to propose a GPL licensed tool for such auditing purposes is NOT in or shortly after a settlement negotiation where a company was forced to back down from a previous practice regarding use of previous GPL licensed software. Rather, let them get comfortable with the more neutrally licensed compliance monitoring tool, demonstrating that yes, it is possible to comply with the license without having it eat your children, etc, and given time, they may well choose to continue to work with the community of their own accord and on new projects.

But thanks for your question. It had been nagging at the back of my mind as well, and without your voicing it and Neil's reply, I'd have not realized how practical the choice of Apache license, as opposed to GPL, actually was, here. It took reading the question and his reply, then formulating my own, to crystallize my own understanding of the issue.

Duncan

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds