At LinuxCon North America in New Orleans, Samsung's Young-taek Kim
described his company's experience rolling out support for the
Software Package Data Exchange (SPDX)
standard in its product development tools. SPDX, of course, is a
data format for tracking software components, licenses, and
copyrights. The company was able to
handle license compliance more efficiently, but that was not
the only benefit of the program. The implementation team also came
away from the experience with several suggestions for improving the
SPDX specification itself.
Why SPDX
Kim is an engineer in Samsung's Open Source Initiative (OSI) team.
Like the open-source groups inside many large corporations, the team
is charged (in addition to its development duties) with educating and
guiding other units in the company about open source principles. Kim
gave a quick overview of SPDX before describing the OSI team's task
and where SPDX fit into Samsung's workflow.
The SPDX specification is designed to produce a standardized "bill
of materials" for an open source software package, he said. It
communicates the licenses and copyrights that make up a
package—including, importantly, packages that are derived from
multiple sources. A constant problem in business scenarios is
making sure that one's company gets good information about these
factors from software suppliers and subcontractors. It is common, he
said, for a supplier to say simply "this is open source" and provide
no further information. The package could be MIT-licensed or under
the GPL, but if one does not know which of those licenses it
is, one does not know how to comply with it.
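To make the "bill of materials" idea concrete, here is a minimal sketch of reading package-level fields from SPDX's tag/value format. The package name and field values are invented for illustration, and the parser is deliberately simplistic: it assumes one "Tag: value" pair per line and does not handle multi-line text blocks.

```python
# Hypothetical SPDX tag/value excerpt; the package and its values are made up.
SAMPLE = """\
PackageName: example-lib
PackageLicenseDeclared: MIT
PackageLicenseConcluded: MIT
PackageCopyrightText: <text>Copyright (c) 2013 Example Author</text>
"""

def parse_tag_value(text):
    """Return a dict mapping each tag to the list of values seen for it."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line
        fields.setdefault(tag.strip(), []).append(value.strip())
    return fields

package = parse_tag_value(SAMPLE)
print(package["PackageName"][0], "is under", package["PackageLicenseConcluded"][0])
```

With a record like this in hand, a recipient no longer has to guess which license "this is open source" refers to.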
In practice, Kim said, he has often manually vetted a package by
looking through the source. He does not mind this process, but it
clearly results in duplication of effort when multiple project teams
in multiple divisions repeat the vetting for a package that is already
in use elsewhere. Standardizing the license and copyright information
with SPDX lets the company create a central database to unambiguously
keep track of the packages it has already vetted, and it helps resolve
complex compliance questions that arise from combining multiple
packages. Both benefits were of interest to Samsung.
Samsung's pilot program
Kim explained that Samsung wanted to reduce the overhead of license
compliance, so it charged the OSI team with deploying SPDX data
interchange in a pilot program inside the company. He then described
Samsung's existing open source compliance process. The company breaks
the process into four steps: discovering an open source package,
developing a product with the package, verifying the obligations
imposed by the open source license, and releasing the appropriate
material to satisfy those obligations.
The SPDX pilot program was charged with improving those final two
steps. Before the program, the verification stage meant confirming
the license on a package by having a human read through the
source, which is time-consuming, often redundant (such as when the
same package has already been verified by a different product team),
and prone to error. Human beings, he said, can reach different
conclusions when reading the same code. The obligation-satisfaction
stage was also largely manual (e.g., a person having to post source
code on a public Samsung web site, make it available to customers, or
insert a copyright statement onto a product screen) and could be
expensive, especially when it involved printing a source code offer in a
user manual, and even more so when the manual had to be reprinted.
The pilot program's first goal was to reduce the time lost to
re-verification. The OSI team developed a tool called AIRS to
identify software packages and verify their license and copyright in
SPDX format. AIRS started out with a command-line interface, but is
also usable as a Java library. It uses the Protex code verifier from
Black Duck to scan a package and pick out license and
copyright information. It then exports this information as SPDX
data, including the licenses and copyrights of all components and
(perhaps most importantly) the "concluded license" that applies to the
combined work as a whole. It identifies files by SHA1 checksum, which
helps catch duplicates—meaning that files which have already
been scanned and analyzed once do not need to be re-scanned even when
directory structures have been rearranged.
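The checksum-based duplicate detection can be sketched as follows. This is not AIRS code; the function names and the stand-in scanner are assumptions made for illustration, and the real license scan is done by Protex. The point is only that keying results on a SHA-1 digest of the file contents makes the lookup independent of where the file lives.

```python
import hashlib

def scan_files(files, scanner, cache):
    """Scan files, reusing earlier results for identical contents.

    files:   {path: bytes}
    scanner: callable(bytes) -> license string (hypothetical stand-in)
    cache:   {SHA-1 hex digest: license string}, updated in place
    """
    results = {}
    for path, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        if digest not in cache:
            cache[digest] = scanner(data)  # only scan unseen contents
        results[path] = cache[digest]
    return results

calls = []

def fake_scanner(data):
    """Stand-in for a real license scanner; records each invocation."""
    calls.append(data)
    return "MIT"

cache = {}
first = scan_files({"src/util.c": b"int x;\n"}, fake_scanner, cache)
# Same contents under a rearranged directory layout: no second scan.
second = scan_files({"lib/common/util.c": b"int x;\n"}, fake_scanner, cache)
```

Here the scanner runs once even though the file appears under two different paths.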
The eventual design is for AIRS to store this SPDX data in a
central, company-wide database, which can then be queried whenever a
new (or a duplicate) package is imported for testing. Right now,
teams within the company exchange SPDX information internally using
other tools. However, the chief benefit of AIRS is that it can
identify the correct license and copyright of a package automatically.
Even for a small development team, that demonstrably saves time.
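A central lookup of this kind might look roughly like the sketch below. The talk did not describe the database design, so the table layout, column names, and the idea of keying on a package verification code are all assumptions; an in-memory SQLite database stands in for whatever Samsung ends up deploying.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) a store of already-vetted packages. Hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS vetted ("
        " verification_code TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " concluded_license TEXT NOT NULL)"
    )
    return db

def record(db, code, name, concluded_license):
    """Store the result of vetting a package."""
    db.execute(
        "INSERT OR REPLACE INTO vetted VALUES (?, ?, ?)",
        (code, name, concluded_license),
    )
    db.commit()

def lookup(db, code):
    """Return (name, concluded_license), or None if the package is new."""
    return db.execute(
        "SELECT name, concluded_license FROM vetted WHERE verification_code = ?",
        (code,),
    ).fetchone()

db = open_db()
record(db, "d6a770ba38583ed4", "example-lib", "MIT")  # made-up code and package
```

A team importing a package would call lookup() first and fall back to a full scan only when it returns None.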
The second goal of the pilot program was to simplify the
obligation-satisfaction step, Kim said. For this, the OSI team
developed a web tool (tied in to AIRS) that can automatically publish
the appropriate license notice for a package on the company's web
site. It generates the page for each package based on the stored SPDX
data, and even generates a QR Code containing a link to the license
page URL. Samsung intends to start putting these URLs on physical
product packaging, perhaps as soon as October.
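As a rough illustration of generating a notice page from stored data, consider the sketch below. The base URL, the URL scheme, and the page layout are hypothetical; nothing here reflects Samsung's actual site. Producing the QR code itself would need a third-party library and is left out; the string returned by notice_url() is what such a code would encode.

```python
import html
import urllib.parse

BASE_URL = "https://opensource.example.com/notices/"  # hypothetical address

def notice_url(package_name, version):
    """Build the (assumed) public URL for a package's license notice."""
    slug = f"{package_name}-{version}".lower().replace(" ", "-")
    return BASE_URL + urllib.parse.quote(slug)

def render_notice(package_name, version, license_id, copyright_text):
    """Render a bare-bones HTML notice page from stored SPDX-style fields."""
    return (
        "<html><body>\n"
        f"<h1>{html.escape(package_name)} {html.escape(version)}</h1>\n"
        f"<p>License: {html.escape(license_id)}</p>\n"
        f"<p>{html.escape(copyright_text)}</p>\n"
        "</body></html>\n"
    )

url = notice_url("Example Lib", "1.0")
page = render_notice("Example Lib", "1.0", "MIT", "Copyright (c) 2013 A & B")
```

Escaping the stored fields matters because copyright strings routinely contain characters such as "&" and "<".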
SPDX in the future
Overall, the company was quite happy with the pilot program, Kim
said, so work is continuing. The AIRS centralized SPDX database is
the first order of business, but there are several other to-do list
items. One is support for verification engines other than Protex;
another is the ability to identify the same code snippet even when the
file checksum changes. The OSI team also wrote its own SPDX parser
when developing AIRS, which Kim said he hopes to release as an open
source project in its own right.
In reply to an audience question, Kim said that the company may
start requiring external software suppliers to provide SPDX data on the
packages that they supply. What makes that request tricky is that
Samsung is still responsible for verifying that the information is
correct, so it will probably have to use AIRS to process the
suppliers' code anyway.
Despite its general satisfaction, Samsung ran into several problems
with SPDX itself when running its pilot program. First was the "Artifact
of Project" property (defined in sections 6.8 to 6.10 of the SPDX
specification [PDF]), which is meant to indicate that the file in
question belongs to a specific project. In the specification, the
cardinality of this property is "one," so a given file can only be
associated with a single project. Samsung found that insufficient to
record projects that constitute combined works, and had to modify its
SPDX output to list every project that a file belongs to.
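The cardinality problem is easy to see in a small data model. The sketch below is not the SPDX schema and not Samsung's modified output; the class and field names are invented. It simply shows a file record that carries a list of projects instead of a single one.

```python
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    """One scanned file; artifact_of holds every project it belongs to."""
    name: str
    sha1: str
    artifact_of: list = field(default_factory=list)

def files_in_project(records, project):
    """Return the names of all files associated with the given project."""
    return [r.name for r in records if project in r.artifact_of]

records = [
    # A shared file used by two (made-up) projects in a combined work.
    FileRecord("common/util.c", "0" * 40, ["project-a", "project-b"]),
    FileRecord("a/main.c", "1" * 40, ["project-a"]),
]
```

With a cardinality of one, the shared file would have to be attributed to either project-a or project-b, and a query for the other project would miss it.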
The property also requires parent projects to be described in the
Description of a Project (DOAP) format, which duplicates the same RDF/XML
data for every file in a project; a simple database reference would save
space. In addition, Kim said, Samsung found it problematic that SPDX
does not account for sub-projects within a project, which is a common
situation when creating large products. It also ran into problems
caused by the fact that SPDX does not enforce a common rule for the
formatting of file paths; packages can reference files with relative
path names, which makes it difficult to match them up for the purpose
of determining the concluded license. Requiring that file paths be
normalized would simplify things.
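One possible normalization rule is sketched below. The SPDX specification does not prescribe this rule, and Kim did not say which one Samsung would prefer; the choice here (forward slashes, no "." or redundant ".." segments, no leading slash) is an assumption picked only to show how differently written paths can be made comparable.

```python
import posixpath

def normalize(path):
    """Reduce a file path to a canonical package-relative form (assumed rule)."""
    path = path.replace("\\", "/")    # unify separators
    path = posixpath.normpath(path)   # collapse "//", "./" and "dir/../"
    return path.lstrip("/")           # treat paths as relative to the package root

variants = ["./src/../src/main.c", "src//main.c", "src\\main.c", "/src/main.c"]
canonical = {normalize(p) for p in variants}
```

All four spellings collapse to the same key, so records from different suppliers can be matched with a plain dictionary lookup.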
SPDX is often touted
for its ability to ensure correctness in license-compliance efforts,
so it is interesting to see that it can enable other benefits, too,
such as reducing the amount of duplicated work undertaken by
developers. Samsung is an enormous company, so even saving a small amount of
time on a per-project basis can add up to a lot.
[The author would like to thank the Linux Foundation for
assistance with travel to New Orleans.]