Digging for license information with FOSSology

By Jake Edge
November 6, 2019

At Open Source Summit Europe 2019, Michael C. Jaeger and Maximilian Huber updated attendees on the FOSSology project, which is an open-source license-compliance tool. They introduced FOSSology and talked about how it can be used, but they also looked at the new features added in the last few releases. Beyond that, they presented some experiments the project has been doing with creating machine-learning models for license recognition.

FOSSology is a Linux Foundation (LF) project, Jaeger said, that started with code released by HP in 2008. It was initially a program for scanning Linux distributions for the licenses of the software they contained. The company had a lot of projects that used Linux and realized that it was scanning the same files over and over, so it came up with a server solution that would track the files that were scanned along with the licenses that were found.
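The idea of not rescanning the same file can be illustrated with a small cache keyed by content hash. This is only a sketch of the concept, not FOSSology's actual storage scheme; the ScanCache class, the toy_scanner() function, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, List


class ScanCache:
    """Remember scan results keyed by a hash of the file content (hypothetical)."""

    def __init__(self, scanner: Callable[[bytes], List[str]]) -> None:
        self._scanner = scanner                   # the expensive license scanner
        self._results: Dict[str, List[str]] = {}  # content hash -> licenses found
        self.scans_run = 0                        # counts real scanner invocations

    def licenses_for(self, content: bytes) -> List[str]:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._results:
            # Only content that has never been seen reaches the scanner.
            self.scans_run += 1
            self._results[key] = self._scanner(content)
        return self._results[key]


def toy_scanner(content: bytes) -> List[str]:
    # Stand-in for a real scanner: it recognizes a single phrase only.
    if b"GNU General Public License, version 2" in content:
        return ["GPL-2.0"]
    return []


cache = ScanCache(toy_scanner)
same_file = b"/* GNU General Public License, version 2 */\nint main(void) { return 0; }\n"
first = cache.licenses_for(same_file)    # first sighting: scanned
second = cache.licenses_for(same_file)   # same bytes in another package: cache hit
print(cache.scans_run, first, second)
```

Two packages that ship an identical file would thus cost one scan rather than two, which is the saving the server-based design was after.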

For a number of years, FOSSology was distributed and maintained by HP, until it became an LF project in 2015. It is easier for companies to collaborate on software in a project at an organization like the LF, he said; it makes for a safer harbor for competitors to work together (in Germany, at least). He works for Siemens AG, which is a rather large German company.

Breaking up archive files into their constituent files—some of which may need to be unpacked themselves—then scanning the individual source and other files for their licenses is the basic task of FOSSology. It has a powerful license scanner, he said. Its web-based interface can then give an overview of the contents—which licenses apply to various parts of the tree, for example—and allow users to drill down into the file hierarchy to the individual files to see their copyrights and license-relevant text. When looking at the file, FOSSology highlights that license-relevant text and shows a comparison with the reference text of the license it has determined for the file.
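The recursive unpacking step might look something like the following sketch. It handles only ZIP archives using Python's standard zipfile module, and the walk_archive() and make_zip() helpers are hypothetical names for this example; a real unpacker has to cope with many more formats.

```python
import io
import zipfile
from typing import Dict, Iterator, Tuple


def walk_archive(data: bytes, prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for every regular file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            content = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(content)):
                # A member that is itself an archive gets unpacked in turn.
                yield from walk_archive(content, path + "!/")
            else:
                yield path, content


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Build a ZIP archive in memory (test helper for this example)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"lib/util.c": b"/* MIT */ int util(void);"})
outer = make_zip({"README": b"hello", "vendor/lib.zip": inner})
paths = [path for path, _ in walk_archive(outer)]
print(paths)
```

Each file that comes out of the walk is then a candidate for the license scanner; the "!/" separator is just a convention chosen here to show where one archive ends and its contents begin.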

Determining the license that applies to a file is challenging, however. Files have a wide variety of license-relevant text in them, some of which is ambiguous. It depends on the kind of source code you are working with, but the scanner is unable to decide on a license for up to 30% of files it sees, so it is up to a human reviewer to tag the right license. It is then important to also record what reviewers decide about those files in the FOSSology database.

The Software Package Data Exchange (SPDX) format is used to describe various things in a package, including licensing information. FOSSology can both import and export SPDX information, which lets two FOSSology users exchange information and share analysis work. FOSSology is one of a few tools that can consume SPDX information; it can be used to review what another party has concluded about the licensing of a code base. In addition, when a package gets updated, the previous analysis can be used as a starting point; the new dependencies and other changes can be incorporated into that rather than starting from scratch.

Releases and new features

There have been two major releases so far this year: 3.5 in April and 3.6 in September. The 3.7 release is coming; the first release candidate came out at the end of October. It is important that users' large FOSSology databases are preserved in any upgrade, so the project is careful to ensure that works before a release is done. Some other license-scanning tools have an easier job preparing for a release, Jaeger said, since they do not have databases.

One of the new features this year is a REST API that will allow FOSSology to integrate with other tools. Over time, the plan is to add to that API, but on the basis of use cases. So anyone who has a use case for compliance automation that needs additional support from FOSSology should bring it to the project, he said.

Another new feature is the OJO agent for detecting SPDX license identifiers in scanned files. Its output can be considered with other findings on the source files; if none conflict with the license it found, that license can be determined to apply to the file.
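SPDX license identifiers are short tags of the form "SPDX-License-Identifier: <expression>" placed in a comment. The sketch below shows one way such tags could be found and combined with other findings; it is not OJO's implementation, and the function names and the conflict rule in conclude() are assumptions for illustration.

```python
import re
from typing import List, Optional

# The tag is followed by a license expression that runs to the end of the line.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(?P<expr>[^\n\r]+)")


def find_spdx_expressions(text: str) -> List[str]:
    """Return every SPDX license expression tagged in the text."""
    found = []
    for match in SPDX_TAG.finditer(text):
        expr = match.group("expr").strip()
        # Drop a trailing comment terminator such as "*/" or "-->".
        expr = re.sub(r"\s*(\*/|-->)\s*$", "", expr)
        found.append(expr)
    return found


def conclude(spdx_ids: List[str], other_findings: List[str]) -> Optional[str]:
    """Accept the tagged license only if no other finding disagrees with it."""
    if len(spdx_ids) == 1 and all(f == spdx_ids[0] for f in other_findings):
        return spdx_ids[0]
    return None  # conflicting or missing evidence: leave it to a reviewer


sample = (
    "// SPDX-License-Identifier: GPL-2.0-only\n"
    "/* SPDX-License-Identifier: (MIT OR Apache-2.0) */\n"
    "int x;\n"
)
print(find_spdx_expressions(sample))
print(conclude(["MIT"], ["MIT"]), conclude(["MIT"], ["GPL-2.0-only"]))
```

Returning None on a conflict mirrors the point above: the tag is one piece of evidence that is weighed against the other scanners' findings, not a verdict on its own.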

At that point, he turned things over to Huber, who started by digging in a bit more deeply on the capabilities provided by the REST API. FOSSology provides a service, he said; you upload source code to it and then scan that code. You want to be able to automate that process, so that you can script things. There is also a Python library that runs on the server, fossdriver, that can be used to do even more FOSSology operations than what is supported by the REST API. Beyond that, FOSSology is made up of individual command-line scanners and other tools that can be used standalone in various ways.

The REST API is able to handle all four steps of the typical FOSSology workflow: prepare, scan, observe, and download. For the prepare step, there are API elements for listing and creating folders as well as for uploading packages to FOSSology into a folder. Scanning can be controlled via the API; scheduling and setting options for the scan are both supported. To observe the process, there is a way to list the running jobs and retrieve their status. Lastly, there are interfaces for downloading the reports in order to view them or to integrate their output into reporting from other tools. More information can be found in the "getting started" document or the REST API documentation.
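The four steps can be sketched as a small script. To keep the example self-contained, it talks to an in-memory stand-in rather than a real server; the FakeFossology class, its method names, the status strings, and the report format are all hypothetical and do not reflect the actual REST endpoints, which are described in the API documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeFossology:
    """In-memory stand-in for a server; every name here is hypothetical."""
    folders: Dict[int, str] = field(default_factory=lambda: {1: "root"})
    uploads: Dict[int, bytes] = field(default_factory=dict)
    jobs: Dict[int, dict] = field(default_factory=dict)

    def create_folder(self, name: str) -> int:
        folder_id = max(self.folders) + 1
        self.folders[folder_id] = name
        return folder_id

    def upload(self, folder_id: int, package: bytes) -> int:
        if folder_id not in self.folders:
            raise KeyError(f"no such folder: {folder_id}")
        upload_id = len(self.uploads) + 1
        self.uploads[upload_id] = package
        return upload_id

    def schedule_scan(self, upload_id: int, agents: List[str]) -> int:
        job_id = len(self.jobs) + 1
        # Pretend the scan needs two status polls before it finishes.
        self.jobs[job_id] = {"upload": upload_id, "agents": agents, "polls_left": 2}
        return job_id

    def job_status(self, job_id: int) -> str:
        job = self.jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "Processing"
        return "Completed"

    def download_report(self, upload_id: int, fmt: str) -> str:
        size = len(self.uploads[upload_id])
        return f"{fmt} report for upload {upload_id} ({size} bytes)"


def run_workflow(server: FakeFossology, package: bytes, max_polls: int = 10) -> str:
    folder_id = server.create_folder("incoming")        # prepare: folder ...
    upload_id = server.upload(folder_id, package)       # ... and upload
    job_id = server.schedule_scan(upload_id, ["ojo"])   # scan: schedule with options
    for _ in range(max_polls):                          # observe: poll the job
        if server.job_status(job_id) == "Completed":    # (a real client would sleep)
            break
    else:
        raise TimeoutError("scan did not finish in time")
    return server.download_report(upload_id, "SPDX")    # download: fetch the report


report = run_workflow(FakeFossology(), b"hello world")
print(report)
```

A client for a real instance would replace each method with an authenticated HTTP request, but the control flow of prepare, scan, observe, and download would stay the same.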

Huber then gave a "short and probably boring" demo; "boring because it's so simple". He showed the web page of an instance of SW360, which is another open-source license-compliance tool, where he added a new entry for a package to its database. He hit the "magic button" that sent that information to FOSSology, which got the code and did the scan. He switched over to looking at the FOSSology web page to see that the scanning process had completed; when he went back to SW360, it had already downloaded the report and attached it to its database entry. All of that was done using the FOSSology REST API, Huber said.

Machine learning

In the last year, the project has been looking at how machine learning could be harnessed for license identification. Normally, license identification is done by way of regular expressions and rules that are created by hand. Instead, existing FOSSology databases could provide curated data for training a machine-learning model. That model could then be used to determine the licenses that applied to new code uploaded to the program.

The first step is to identify the features in the source files that will help lead to proper conclusions of which license applies. The source code is preprocessed to extract the comments, which are cleaned up and lemmatized. The training data consisted of around 1000 license texts and 4000 license statements.
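A rough sketch of that preprocessing is shown below. It is not the project's code: the comment regular expression is naive (it will misfire on comment markers inside string literals), and crude_lemma() is plain suffix stripping that merely stands in for a real lemmatizer.

```python
import re
from typing import List

# C-style block comments, C++-style line comments, and "#" line comments.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)


def extract_comments(source: str) -> str:
    """Join the text of all comments found in the source."""
    return " ".join(COMMENT.findall(source))


def clean(text: str) -> List[str]:
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_lemma(token: str) -> str:
    """Suffix stripping, NOT real lemmatization; a placeholder for one."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def features(source: str) -> List[str]:
    return [crude_lemma(token) for token in clean(extract_comments(source))]


sample = (
    "/* Permission is hereby granted, free of charge */\n"
    "int main(void) { return 0; } // see LICENSE for terms\n"
)
print(features(sample))
```

The code itself ("int main ...") contributes nothing to the feature list; only the comment text survives, normalized so that "granted" and "grant" or "terms" and "term" count as the same feature.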

The model that was generated still has a number of problems, he said. There are some licenses that are so similar they make it difficult for the model to distinguish between them. The next improvement that the project is working on is to have a multi-stage process; the first stage would simply determine if the file is a single-license file, multi-license file, or contains no license information at all. After that, there would be a stage that could determine the license family (e.g. GPL or MIT), then a stage to distinguish between the variants within a given family.
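The shape of such a multi-stage process can be sketched as follows. In the project's design each stage would be a trained model; here simple keyword rules stand in for them, and the keyword table, function names, and license labels are assumptions chosen only to make the example run.

```python
from typing import List, Optional, Tuple

# Toy stand-in for a learned family detector: family -> telltale phrase.
FAMILY_KEYWORDS = {
    "GPL": "general public license",
    "MIT": "mit license",
}


def families_in(text: str) -> List[str]:
    lowered = text.lower()
    return [family for family, phrase in FAMILY_KEYWORDS.items() if phrase in lowered]


def stage1_kind(text: str) -> str:
    """Stage 1: no license, a single license, or several?"""
    count = len(families_in(text))
    if count == 0:
        return "none"
    return "single" if count == 1 else "multi"


def stage2_family(text: str) -> str:
    """Stage 2: which license family (only called for single-license files)."""
    return families_in(text)[0]


def stage3_variant(family: str, text: str) -> str:
    """Stage 3: which variant within the family."""
    if family == "GPL":
        return "GPL-3.0" if "version 3" in text.lower() else "GPL-2.0"
    return family


def classify(text: str) -> Tuple[str, Optional[str]]:
    kind = stage1_kind(text)
    if kind != "single":
        return kind, None          # nothing to name, or left for a reviewer
    family = stage2_family(text)
    return kind, stage3_variant(family, text)


print(classify("GNU General Public License version 3"))
print(classify("int x;"))
print(classify("MIT License or GNU General Public License"))
```

Splitting the decision this way means that the final stage only has to tell apart near-identical texts within one family, rather than compete against every license in the data set at once.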

The data set being used for training is biased, however; most real FOSSology databases have roughly 28% of the files licensed under Apache 2.0, he said. Some licenses only appear a few times in the data set, which makes it difficult to train for them. That is a problem, but the multi-stage approach helps there too. The code to build a model is available on GitHub; it requires a FOSSology installation with a populated database. It is experimental at this point, so it is not distributed with FOSSology itself.

Concluding thoughts

Huber handed the microphone back to Jaeger to wrap up the presentation. He said that FOSSology participated in the Google Summer of Code (GSoC) for 2019; the project had three GSoC participants working on various projects. FOSSology has been working on integrating with three different open-source projects as well. Software Heritage is a repository of published software, while ClearlyDefined is a repository of metadata about published software. In both cases, FOSSology has plans to interact with them via their REST APIs. The third project is not as well known, he said. Atarashi takes a new approach in scanning for licenses. Instead of using regular expressions and rules, it uses text statistics and information-retrieval techniques.
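As an illustration of what an information-retrieval approach to license matching can look like, the sketch below ranks reference texts by TF-IDF cosine similarity to a query. It is a generic textbook technique written for this example, not Atarashi's code; the two reference strings are abbreviated excerpts and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: Dict[str, str]) -> Tuple[Dict[str, Vector], Dict[str, float]]:
    """Build a TF-IDF vector per document, plus the IDF table used."""
    tokenized = {name: tokens(text) for name, text in docs.items()}
    total = len(docs)
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + total) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for name, toks in tokenized.items()
    }
    return vectors, idf


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query: str, references: Dict[str, str]) -> str:
    vectors, idf = tfidf_vectors(references)
    # Terms unknown to the reference corpus get zero weight.
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens(query)).items()}
    return max(vectors, key=lambda name: cosine(query_vec, vectors[name]))


REFERENCES = {
    "MIT": "permission is hereby granted free of charge to any person "
           "obtaining a copy of this software",
    "GPL": "this program is free software you can redistribute it and or modify "
           "it under the terms of the gnu general public license",
}
print(best_match("You can redistribute it under the terms of the GNU General Public License", REFERENCES))
```

Unlike a hand-written rule, nothing here names a specific phrase to look for; the ranking falls out of word statistics, which is the contrast with regular expressions that the talk drew.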

Another initiative that the project has undertaken is FOSSology Slides, which is a site for gathering slides that can be used to talk and teach about FOSSology. They are all licensed under CC BY-SA 4.0 (as are the slides [PDF] from the OSS EU talk). They can be used as is, or adapted for other uses; he encouraged anyone to contribute their FOSSology slides as well. One nice outcome of that is that some Japanese FOSSology users translated slides from FOSSology Slides to that language and contributed them back, Jaeger said. Other translations would be welcome for those who want to contribute to the project but are not software developers.

A FOSSology user in the audience pointed out that the tool is only able to analyze the code it is given, so package dependencies have to be figured out separately. Jaeger agreed, noting that FOSSology is focused on understanding the licenses in the code it is given; there are other tools that can help figure out what the dependencies are and there are no plans to add that to FOSSology. He suggested the OSS Review Toolkit (ORT) as one possibility.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend Open Source Summit Europe in Lyon, France.]

Posted Nov 9, 2019 3:06 UTC (Sat) by pabs (subscriber, #43278)

A recent bachelor thesis comparing FOSSology and other license crawlers: https://osr.cs.fau.de/2019/08/07/final-thesis-a-compariso... https://osr.cs.fau.de/wp-content/uploads/2019/08/wolter_2...