By Jake Edge
May 5, 2010
There are thousands of embedded devices running Linux today, with more
released hourly it seems. Many of those are in full compliance with the
licenses for the free software that they ship, but some, sadly, are not.
In most cases, it is probably due to ignorance, but sometimes arrogance or
even malfeasance plays a role. A new Apache-licensed Binary Analysis Tool from
Armijn Hemel and Shane Coughlan is meant to help developers and others
interested in GPL compliance determine whether Linux or BusyBox is
present in a particular device.
There are multiple levels to GPL compliance investigations. If the
device ships with neither source nor an offer to provide it, the vendor is
implicitly claiming that it contains no GPL code. In that case, just detecting the
presence of the Linux kernel or BusyBox is enough to identify a problem.
For devices that do ship or offer source, there is another step:
determining whether the source code and configuration that were provided
correspond to the code on the device. That process was described by Hemel
and Coughlan in a series of LWN articles (part 1, part 2, and part 3).
The first step is to extract any filesystems that exist in
a firmware image, so that they can be investigated further. The Binary
Analysis Tool provides the bruteforce.py script to detect various kinds of
filesystems, including those that are compressed, and to extract them from
the image.
It then digs down inside the filesystem to find "interesting" files. Right
now, the output is terse, but that is slated to change "in the near
future", according to the README file.
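To give a feel for what filesystem detection in a raw image involves, here is a minimal sketch that scans a blob for a handful of well-known magic numbers. It is not how bruteforce.py is implemented; the function name find_signatures and the signature table are assumptions made for illustration, and short signatures such as "BZh" will produce false positives that a real tool would need to weed out.

```python
# Illustrative signature table. The magic values are the well-known ones,
# but this is NOT the list that bruteforce.py actually uses.
SIGNATURES = {
    b"hsqs": "squashfs (little-endian)",
    b"sqsh": "squashfs (big-endian)",
    b"\x1f\x8b\x08": "gzip stream",
    b"BZh": "bzip2 stream",
    b"\x45\x3d\xcd\x28": "cramfs (little-endian)",
}


def find_signatures(data):
    """Return a sorted list of (offset, description) for every magic match.

    `data` is the firmware image as a bytes object. Hypothetical helper,
    written for this example only.
    """
    hits = []
    for magic, description in SIGNATURES.items():
        start = 0
        while True:
            offset = data.find(magic, start)
            if offset == -1:
                break
            hits.append((offset, description))
            start = offset + 1  # keep looking past this match
    return sorted(hits)
```

Each reported offset is only a candidate; the next step would be to carve the data out from that offset and hand it to the matching unpacker to see whether it really is a filesystem or compressed stream.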
Beyond that, there are scripts to look at BusyBox and kernel binaries to
extract configuration information. Running:
python busybox.py --binary=/path/to/busybox
on a BusyBox binary results in a list of configuration options that shows
which of the applets were built into the binary:
CONFIG_ADDGROUP=y
CONFIG_ADDUSER=y
CONFIG_ADJTIMEX=y
...
BusyBox configuration is important because it can be a clue as to whether or not
the source corresponds to the binary. In fact the tool provides an
automated way to compare the configuration found in a binary with one that
is included in the source:
busybox-compare-configs.py.
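The idea behind such a comparison can be sketched in a few lines. The following is an illustration only, not the code of busybox-compare-configs.py; the function names parse_config and compare_configs are hypothetical, and the input is assumed to be .config-style text like the output shown above.

```python
def parse_config(text):
    """Collect CONFIG_* options from .config-style text into a dict.

    Lines such as "# CONFIG_FOO is not set" are treated as absent.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def compare_configs(binary_text, source_text):
    """Return (only_in_binary, only_in_source, differing_values)."""
    binary = parse_config(binary_text)
    source = parse_config(source_text)
    only_binary = sorted(set(binary) - set(source))
    only_source = sorted(set(source) - set(binary))
    changed = sorted(
        name for name in set(binary) & set(source)
        if binary[name] != source[name]
    )
    return only_binary, only_source, changed
```

An option that is enabled in the binary but missing from the supplied configuration is the interesting case: it suggests that the source as shipped would not rebuild the binary found on the device.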
The tool uses a database of sorts for BusyBox configurations going back to
the 0.52 release. The busybox-version.py command can be used to
manually determine the version of a binary; the other tools do so
automatically, though the detected version can be overridden on the command line. In
addition, the busybox.py script can check for applets in a binary
for which there is no configuration option in the official BusyBox sources,
which would indicate that additional code (for which source must be
released) has been added.
There are also scripts to extract configuration and strings from a Linux
kernel. extractkernelstrings.py is used on a provided kernel
source tree and generates a database of strings that should be present in
the kernel image. findkernelstrings.py then uses that database and
the kernel image file to find matches, and, more importantly, things that do
not match. Once again, this can lead to a determination that the source
code and shipped binaries are either not the same, or not configured in the
same way.
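The string-matching approach can be illustrated with a simplified sketch. The real scripts store the source strings in a Lucene index; here a plain Python set stands in for that database, matching is exact, and the names extract_strings and unmatched_strings are assumptions made for this example.

```python
import re


def extract_strings(data, min_len=6):
    """Return the set of printable-ASCII runs in `data`, like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.group().decode("ascii") for match in pattern.finditer(data)}


def unmatched_strings(image_data, source_strings, min_len=6):
    """List strings found in the image that are absent from the source set.

    `source_strings` stands in for the database built from the kernel
    source tree; exact matching is a simplification.
    """
    found = extract_strings(image_data, min_len)
    return sorted(found - set(source_strings))
```

Like the tool itself, this only reads strings out of the binary; nothing is disassembled or decompiled. A long list of unmatched strings hints that the image was built from different source, or from the same source with a different configuration.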
Due to various reverse engineering laws worldwide, the Binary Analysis Tool does
not do any kind of decompilation or disassembly of the code that it finds.
It strictly looks at the symbol tables and strings stored in the binaries
to do its work. For much the same reason, it does not try to "crack" any
encryption or DRM that might be protecting the firmware image or its contents.
The tool is still a bit rough around the edges, but it does come with fairly
extensive documentation, in the form of PDF Quick Start and User guides
as well as various documentation files in the source tree.
It is available as a tarball or from an svn repository. The list of
dependencies seems a bit
large for a program of this type. For the kernel strings database, it
requires the PyLucene Python library, which provides access to the
Java-based Lucene text searching and indexing engine and in turn
necessitates installing OpenJDK and Ant. More obvious dependencies are
required as well: python-magic for magic numbers, e2tools and squashfs
tools for accessing filesystems, and various compression utilities.
The development of the Binary Analysis Tool was supported by the NLnet Foundation and the Linux Foundation, and it was
created by Hemel as part of his work at Loohuis Consulting and by
Coughlan at OpenDawn. It is still
being actively developed with releases scheduled for May and July.
Contributions
of bug reports, development time, or money to continue development are welcome.
While the scripts will be useful as a starting point for those who are
investigating GPL compliance, there is still quite a bit of work to be
done. The tool provides a framework for looking at two of the most common
GPL-licensed components appearing in embedded devices, but there are
others. It's no coincidence that the tool focuses on BusyBox and the
Linux kernel, which have been the most successful at enforcing license
compliance in the last several years. As other projects
are used more widely in embedded devices, there will be a need to
expand the coverage of tools like this.
There are uses for the tool beyond those of developers trying to ensure
that their code is used properly.
Embedded device manufacturers will also find it useful. There have been
numerous cases of OEMs getting code from their suppliers without the proper
source files, or even notice that it contains GPL code. Companies can
also test their competitors' products for compliance to help level the
playing field. Any tool that makes it easier to spot license compliance
problems is a boon for developers, users, and device makers.