[ Editor's note: This is part two of a series of three articles on
FOSS license compliance. Part one
introduces the topic and describes what developers can do to protect their
rights. Part three is coming soon and will look at what
companies can do to comply, as well as what to do in the case of a
violation. ]
This article examines a field called compliance engineering.
Compliance engineering was pioneered by technical experts who wanted to
address misuses of software, and was made famous by gpl-violations.org, the FSF, and similar organizations
correcting Free and Open Source Software (FOSS) license violations. The
field has grown into a commercial segment with companies like Black Duck Software and
consultancy firms like Loohuis
Consulting offering formal services to third parties.
Rather than attempting to examine compliance engineering in all market
segments and under all conditions, this article will focus on explaining
some of the tools and skills required to undertake due diligence
activities related to licensing and binary code in the embedded industry.
It is based on the GPL
Compliance Engineering Guide, which in turn is based on the experience
of engineers contributing to the gpl-violations.org project.
Some of the methods described in this article may not be permitted by
the DMCA
or similar legislation in certain jurisdictions. It is important to stress
that the goal of compliance engineering is not to reverse engineer a product
so it can be resold for monetary gain, but rather to apply digital
forensics to see if copyright was violated. You should consult a lawyer to
find out the legal status of the engineering methods described here.
Context and confusion
The first phase of compliance engineering is not engineering. It is
about understanding the license that applies to code and
understanding what that means with regard to obligations in a particular
market segment. This dry art is sometimes challenging because of the
culture of FOSS. FOSS has an innovative, fast moving, and diverse
ecosystem. Contributors tend to be passionate about their work and about
how it is released, shared, and further improved by the community as a
whole. This can be something of a double-edged sword, providing
exceptional engagement and occasionally an overabundance of enthusiasm in
areas like software licensing or compliance.
The gpl-violations.org
project enforces the copyright of Harald Welte and other
Linux kernel developers, and has a mechanism for third parties to report
suspected issues with use of Linux and related GPL code. One of the most
common false positives reported is that companies are violating the GNU GPL version 2 by
providing a binary firmware release for embedded devices without shipping
source code in the package or offering it on a website for download. This
highlights a misunderstanding regarding what the GPL requires. It
is true that the GPL comes into effect when distributing code and that
offering a binary firmware for download is distribution, but compliance with
the license terms is more subtle than it may appear to parties who have not
read the license carefully.
In the GPLv2 license there is no requirement for source code to be
provided in the product package or on a website to ensure
compliance. Instead, in sections 3a and 3b of the GPLv2 license there are
two options regarding source code available to people distributing binary
versions of licensed software. One is to accompany a product with the
source code and the other is to include a written offer to supply the
source code to any third party for three years. When someone gets a device
with GPLv2 code and wants to check compliance, they need to look for
accompanying source or a written offer in the manual, on the box, on a
separate leaflet, or in the web interface and any other interactive menus.
It gets a little more complex when you consider that the above
constitutes only the terms applying to source code. Finding source code or
a written offer for it does not constitute full GPLv2 compliance. Instead,
compliance depends on whether the offered source code is complete and
corresponds precisely to what is on the product, whether the product also
shipped with a copy of the license, and what else is shipped in what way
alongside the GPL code. The full text of the license spells out how the
parameters of this relationship work.
Compliance engineering is an activity that requires a mixture of
technical and legal skills. Practitioners have to identify false
positives and negatives, and to contextualize their analysis within
applicable jurisdictional constraints. This can appear daunting for
parties who have a casual approach to reading licenses. However, the
skills and tools applied are relatively simple as long as a balanced
approach is taken when understanding what is explicitly required in a
license and what is actually present in a product. Given these two skills
anyone can help make sure that people who use GPL or other FOSS licenses
are adhering to the terms the copyright holders selected.
The nuts and bolts
Compliance engineers in organizations like gpl-violations.org do not
have an extensive toolset. In the embedded market the product from a
software perspective is a firmware image, and this
is just a compilation of binary code. The contents may include everything
needed to power an embedded device (bootloader, plus operating system) or
just updates to certain parts of the embedded device software.
Checking whether firmware meets the terms of a license like the GPLv2
requires the application of knowledge and a sequence of tests such as
extracting visible strings from binary files and correlating them to source
code. One aspect is identifying GPL software components and making sure
they are included in source releases, and another requires opening the device to get
physical access to serial ports. The only essential tools required are a
Linux machine, a good editor, binutils, util-linux, and
the ability to mount file systems over loopback or tools like unsquashfs
to unpack file systems to disk.
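The string-extraction step can be approximated in a few lines of Python. The sketch below is only an illustration: the function names printable_strings and matches_in_source and the minimum length of four characters are choices made here, not part of any tool mentioned in this article, and real correlation work needs far more care than a substring check.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII characters in data."""
    # Printable ASCII runs from 0x20 (space) to 0x7e (~); tabs are kept too.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def matches_in_source(strings, source_text: str) -> list:
    """Return the extracted strings that occur verbatim in source_text."""
    return [s for s in strings if s in source_text]
```

Strings that turn up verbatim in a known source tree, such as version banners or error messages, are a hint about which components are present, not proof on their own.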
Opening firmware
The most common operating systems for embedded devices today are
Linux-kernel based or VxWorks. There are a
few specialized operating systems and variants of BSD available in the
market, but they are becoming less common. Linux-based firmware nearly
always contains the kernel itself, one or more file systems, and sometimes
a bootloader.
The quickest way to find file systems or kernels in a firmware is to
search for padding. Padding usually consists of filler bytes such as
zeroes which fill up space. This ensures that the individual components of
a firmware are at the right offsets. The bootloader uses these offsets to
quickly jump to the location of the kernel or a file system. Therefore if
you see padding there will either be something following it, or it marks
the end of the file. Once you have identified the components you will know
what type of firmware you are dealing with, what's in there on the
architecture level, and (with a little bit of experience) what's likely to
be problematic with regard to complete source code releases.
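Searching for padding by eye in a hex dump works, but it can also be sketched in Python. The function name find_padding and the threshold of 64 bytes are assumptions made for this illustration; by default only zero bytes are treated as padding, and other filler values can be passed in if a particular image uses them.

```python
import re


def find_padding(data: bytes, min_run: int = 64, pad_bytes: bytes = b"\x00"):
    """Return sorted (start, end) offsets of long runs of one padding byte."""
    runs = []
    for pad in pad_bytes:  # iterating over bytes yields integers
        # Match the same byte repeated at least min_run times in a row.
        pattern = re.compile(re.escape(bytes([pad])) + b"{%d,}" % min_run)
        runs.extend((m.start(), m.end()) for m in pattern.finditer(data))
    return sorted(runs)
```

The end offset of each run is where the next component, if any, is likely to begin.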
If you can't find any padding in the firmware then another method is to
look for strings like "done, booting the kernel", as these
indicate that something else will follow immediately afterwards. This
method is a little more tricky and involves things like searching for
markers that indicate compression (gzip header, bzip2 header, etc.), a file
system (squashfs header, cramfs header, etc.), and so on. The quickest way
to do this is to use hexdump -C and search for headers. Detailed
information about headers is already available on most Linux systems in
/usr/share/magic.
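The header search can also be automated. The following sketch scans a firmware image for a handful of well-known magic values; the gzip and little endian SquashFS values appear in the example later in this article, while the bzip2, big endian SquashFS, and cramfs values are taken from common magic databases and are listed here as assumptions. The names MAGIC and scan_for_magic are hypothetical.

```python
# Magic byte sequences to look for; this table is illustrative, not complete.
MAGIC = {
    "gzip": b"\x1f\x8b\x08",
    "bzip2": b"BZh",
    "squashfs (little endian)": b"hsqs",
    "squashfs (big endian)": b"sqsh",
    "cramfs (little endian)": b"\x45\x3d\xcd\x28",
}


def scan_for_magic(data: bytes):
    """Return a sorted list of (offset, name) for every magic value found."""
    hits = []
    for name, magic in MAGIC.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)
```

Short magic values such as the three bytes used for bzip2 will also match by chance inside compressed data, so every hit is only a candidate that still has to be checked by unpacking.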
Problems you can encounter
The techniques employed for compliance engineering are essentially the
same as those employed for debugging an embedded system. This means
the basic knowledge is easy to obtain, but it also means that issues can
arise when the tools you are attempting to apply are different from the tools
used for designing and building the system in the first place:
- Encryption: Some devices have a firmware image that is encrypted. The
bootloader decrypts it during boot time with a key that is stored in the
device. Unless you know the decryption key it is impossible to take these
devices apart by looking at the firmware only. Examples are ADSL
modem/routers which are based on the Broadcom bcm63xx chipset. There are
also companies that encrypt their firmware images using a simple XOR. These
are often quite easy to spot, because the same byte patterns repeat
themselves very often.
- Code changes: Sometimes slight changes were made to the file system
code in the kernel, which make it hard or even impossible to mount a file
system over loopback without adapting a kernel driver. Examples include
Broadcom bcm63xx-based devices and devices based on the Texas Instruments
AR7 chipset, which both use SquashFS implementations with
some modifications to either the LZMA
compression (AR7) or the file system code.
To explore what code is present in these cases you need network access
or even physical access to the device.
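The repeating patterns mentioned for XOR-obfuscated images can be made concrete. Assuming the plaintext contains zero padding and the key length is known or guessed (both assumptions, as are the function names), the most frequent aligned block in the image is a strong candidate for the key, because XORing zeroes with a key yields the key itself.

```python
from collections import Counter


def most_common_block(data: bytes, block_len: int):
    """Return (block, count) for the most frequent aligned block in data.

    Raises IndexError if data is shorter than block_len.
    """
    blocks = Counter(
        data[i:i + block_len]
        for i in range(0, len(data) - block_len + 1, block_len)
    )
    return blocks.most_common(1)[0]


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
```

If decoding with the candidate key produces recognizable headers or strings, the guess was probably right; if not, another key length can be tried. This does nothing for images protected with real encryption.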
Network scanning
With portscanners like nmap you can make
a fairly accurate guesstimate of what a certain device is running by using
fingerprinting: many network stacks respond slightly differently to different
network packets. While a fingerprint is not enough to use as evidence, scanning can
give you useful information, like which TCP ports are open and which
services are running. Surprisingly often you can still find a running
telnet daemon which will give you direct access to the device. Sometimes
exploiting bugs in the web interface also allows you to download or transfer
individual files or even the whole (decrypted) file system.
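Fingerprinting is best left to a tool like nmap, but the simpler question of which TCP ports accept connections can be sketched with the Python standard library. The function name open_tcp_ports and the address in the usage comment are hypothetical.

```python
import socket


def open_tcp_ports(host: str, ports, timeout: float = 1.0) -> list:
    """Return the ports in `ports` that accept a TCP connection on host."""
    found = []
    for port in ports:
        try:
            # A completed connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat the port as not open.
            pass
    return found


# Hypothetical usage against a device on the local network:
# open_tcp_ports("192.168.1.1", [21, 22, 23, 80, 443])
```

An open port 23 in the result would be the telnet daemon mentioned above.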
Physical access
Most embedded devices have a serial port, and this is sometimes the
only way to find violations. The port may not be visible from the outside
and sometimes is only
present as a series of solder pads on the internal board. After adding pin
headers you can connect a serial port to the device and – perhaps with the
addition of a voltage level shifter – attach the device to a PC. Projects
like OpenWrt have a lot of hardware information on their website and this
can be useful in working out how to start.
Once you have physical access things get easier. The bootloader is
usually configured to be accessible via the serial port for maintenance
work such as uploading a new firmware, and this often translates into a
shell starting via the serial port after device initialization. Many devices
are shipped with GPL licensed bootloaders, such as RedBoot, u-boot, and
others. The bootloader often comes preloaded on a device and is not
included in firmware updates because the firmware update only overwrites
parts of the flash and leaves the bootloader alone. More problematically,
the bootloader may not be included in the source packages released by the
vendor, because vendors overlook its status as GPL code.
Example: OpenWrt firmware
GPL compliance engineering is best demonstrated using a concrete
example. In this example we will take apart a firmware from the OpenWrt
project. OpenWrt is a project that makes a kit to build alternative
firmwares for routers and some storage devices. There are prebuilt
firmwares (as well as sources) available for download from the OpenWrt website. In
this example we have taken firmware 8.09.1 for a generic brcm47xx device
(openwrt-brcm47xx-squashfs.trx).
Running the strings command on the file seems to return random
bytes, but if you look a bit deeper there is structure. The hexdump
tool has a few options which come in really handy, such as -C
which displays the hexadecimal offset within the file, the bytes in
hexadecimal notation and the ASCII representation of those bytes, if
printable.
A trained eye will spot that at hex offset 0x001c there is the start of
a gzip header, starting with the hex values 0x1f 0x8b
0x08:
$ hexdump -C openwrt-brcm47xx-squashfs.trx
00000000 48 44 52 30 00 10 22 00 28 fa 8b 1c 00 00 01 00 |HDR0..".(.......|
00000010 1c 00 00 00 0c 09 00 00 00 d4 0b 00 1f 8b 08 00 |................|
00000020 00 00 00 00 02 03 8d 57 5d 68 1c d7 15 fe e6 ce |.......W]h......|
...
Extracting can be done using an editor or, more easily, with dd:
$ dd if=openwrt-brcm47xx-squashfs.trx of=tmpfile bs=4 skip=7
This command reads the file openwrt-brcm47xx-squashfs.trx and
outputs it to another file, skipping the first 28 bytes (7 blocks of 4
bytes, which is 0x1c).
$ file tmpfile
tmpfile: gzip compressed data, from Unix, max compression
With zcat this file can be uncompressed to standard output and
redirected to another file:
$ zcat tmpfile > foo
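The dd and zcat steps can be combined in a short Python sketch. It assumes the offset of the gzip header is already known; the function name gunzip_at is made up for this illustration. As a side effect it reports how many bytes the compressed stream occupies, which shows where the data after it begins.

```python
import zlib


def gunzip_at(data: bytes, offset: int):
    """Decompress the gzip stream at offset; return (payload, bytes_consumed)."""
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    payload = decomp.decompress(data[offset:]) + decomp.flush()
    # Anything after the end of the gzip stream is left in unused_data.
    consumed = len(data) - offset - len(decomp.unused_data)
    return payload, consumed
```

For the image in this example the call would be gunzip_at(image, 0x1c), with image holding the contents of the .trx file.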
The result in this particular case is not a Linux kernel image or a
file system, but the LZMA loader used to uncompress the LZMA compressed
kernel that is used by OpenWrt. LZMA does not always use the same headers
for compressed files, which makes it quite easy to miss. In this case the
LZMA compressed kernel can be found at offset 0x090c, which is 2316 in
decimal, or 579 blocks of 4 bytes.
$ dd if=openwrt-brcm47xx-squashfs.trx of=kernel.lzma bs=4 skip=579
Unpacking the kernel can be done using the lzma tool.
$ lzma -cd kernel.lzma > bar
Running the strings tool on the result quite clearly shows
strings from the Linux kernel.
In openwrt-brcm47xx-squashfs.trx you can see padding in action
around hex offset 0x0bd280, immediately followed by a header for a
little endian SquashFS file system.
$ hexdump -C openwrt-brcm47xx-squashfs.trx
...
000bd270 1d 09 36 96 85 67 df 8f 1b 25 ff c0 f8 ed 90 00 |..6..g...%......|
000bd280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000bd400 68 73 71 73 9b 02 00 00 00 c6 e1 e2 d1 2a 00 00 |hsqs.........*..|
...
$ dd if=openwrt-brcm47xx-squashfs.trx of=squashfs bs=16 skip=48448
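The three offsets used with dd in this example (0x1c, 0x090c and 0x0bd400) also appear as little endian values in the second line of the first hex dump. The sketch below reads them directly, assuming the commonly documented TRX header layout of a "HDR0" magic, a length, a CRC32, a flags/version word and three partition offsets. That layout is an assumption not stated in this article, and the function name parse_trx_header is hypothetical.

```python
import struct


def parse_trx_header(data: bytes) -> dict:
    """Parse the first 28 bytes of a TRX image (assumed field layout)."""
    # "<" selects little endian; 4s is the magic, each I a 32-bit unsigned int.
    magic, length, crc32, flag_version, off0, off1, off2 = struct.unpack_from(
        "<4sIIIIII", data, 0
    )
    if magic != b"HDR0":
        raise ValueError("not a TRX image")
    return {
        "length": length,
        "crc32": crc32,
        "flag_version": flag_version,
        "offsets": [off0, off1, off2],
    }
```

Dividing an offset by the dd block size gives the skip value: 0x0bd400 is 775168, and 775168 divided by 16 is the 48448 used above.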
From just the header of the file system it is not obvious which
compression method is used:
$ file squashfs
squashfs: Squashfs filesystem, little endian, version 3.0, 1322493 bytes,\
667 inodes, blocksize: 65536 bytes, created: Tue Jun 2 01:40:40 2009
The two most used compression techniques are zlib and LZMA, with the latter
quickly becoming more popular. Unpacking with the unsquashfs tool
will give an error:
zlib::uncompress failed, unknown error -3
This probably indicates that LZMA compression is used instead of
zlib. Unpacking requires a version of unsquashfs that can handle
LZMA. The OpenWrt source distribution contains all the configuration
and build scripts needed to build a version of unsquashfs with
LZMA support fairly easily.
The OpenWrt example is fairly typical for real cases that are handled
by gpl-violations.org, where unpacking the firmware is usually the step
that takes the least effort, often just taking a few minutes. Matching the
binary files to sources and correct configuration information and verifying
that the sources and binaries match is a process that takes a lot more
time.
In conclusion
Compliance engineering is a demanding and occasionally tedious aspect
of the software field. Emotion has little place in the analysis applied
and the rewards of volunteer work are not visible to most people. Yet
compliance engineering is also essential, providing as it does a clear
imperative for people to obey the terms of FOSS licenses. It contributes
part of the certainty and stability necessary for diverse stakeholders to
work together on common code, and it allows a clear mechanism for
discovering which parties are misunderstanding their obligations as part of
the broader ecosystem. Transactions between individuals, projects and
businesses cannot be sustained without such mechanisms.
It is important to remember that the skills involved in compliance
engineering are not necessarily limited to a small subset of consultants
and companies. Documents like the GPL
Compliance Engineering Guide describe how to dig through binary code
suspected of issues. Engineers from all areas of FOSS can contribute
assistance to a project or business when it comes to forensic analysis or
due diligence, and they can report any issues discovered to the copyright
holders or to entities like the FSF's
Free Software Licensing and Compliance Lab, gpl-violations.org, the FSFE's Freedom Task Force, and the Software Freedom Law Center.