Streamlining license compliance with FOSSology 3.0
License compliance in big free-software projects is not a simple task. Beyond the basic requirements (such as providing access to source code), compliance can consist of numerous details: figuring out how the licenses on individual components combine in an aggregate work, ensuring that required license texts are reproduced where needed, tracking the names of copyright holders to properly give credit, and so on. Little wonder, then, that compliance management has grown into a sizeable industry in recent years. Perhaps the best-known open-source compliance tool is FOSSology, which released version 3.0 in early November. The update adds user-interface features intended to make project workflow smoother, along with several functional enhancements.
![Examining a license decision [Examining a license detection]](https://static.lwn.net/images/2015/11-fossology-license-sm.png)
Broadly speaking, FOSSology follows the same design used by other license-compliance systems. Users upload source code from a project, at which point the program scans the contents of the uploaded files to look for licensing information. The goal is to identify what license applies to every individual file—a task that requires some heuristics when, say, license statements may appear in per-file headers and in directory-wide README files. The end result is an unambiguous understanding of what licenses and copyrights apply to the total codebase; the license requirements can then be met (and copyrights listed) when the source code is distributed. Determining which license applies is a problem that cannot be completely automated, so FOSSology (and similar tools) provide a workflow in which users can examine the hard-to-determine cases and apply a decision. It is also important, in the long run, that users don't have to repeat too much of the process whenever refreshing just one portion of a large codebase.
The FOSSology 3.0 release makes improvements to several facets of this workflow. The web-based user interface has been improved (both to be faster and to provide additional flexibility) and there are some new options for the critical step in the aforementioned process: automatically detecting license information by scanning the uploaded code.
Scanners
One distinguishing feature found in FOSSology is that it supports multiple, pluggable code-scanning engines. Earlier releases supported two scanners, Monk and Nomos. The new release adds support for a third, called Ninka. Monk is a basic full-text scanner that looks for matches against known license text, while Nomos is a regular-expression based scanner that picks out significant phrases that may come from variant wordings of a license.
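The difference between the two older approaches can be illustrated with a toy sketch. This is not the code of Monk or Nomos; the reference text, the pattern, and the function names below are all hypothetical and heavily abbreviated, chosen only to contrast full-text matching against a known license text with regular-expression matching of significant phrases.

```python
import re

# Hypothetical, abbreviated reference text; a real license text is far longer.
KNOWN_TEXTS = {
    "MIT": "permission is hereby granted, free of charge, "
           "to any person obtaining a copy",
}

# Hypothetical phrase pattern that tolerates variant wording between key phrases.
PHRASE_PATTERNS = {
    "GPL-2.0+": re.compile(
        r"general public license.{0,80}version 2.{0,60}or.{0,40}later"
    ),
}


def normalize(text):
    """Lowercase the text and collapse comment markers and whitespace."""
    cleaned = re.sub(r"[^a-z0-9.,]+", " ", text.lower())
    return " ".join(cleaned.split())


def full_text_scan(text):
    """Report licenses whose reference text appears verbatim in the file."""
    body = normalize(text)
    return [name for name, ref in KNOWN_TEXTS.items() if normalize(ref) in body]


def phrase_scan(text):
    """Report licenses whose characteristic phrases appear in the file."""
    body = normalize(text)
    return [name for name, pat in PHRASE_PATTERNS.items() if pat.search(body)]


mit_header = """/*
 * Permission is hereby granted, free of charge,
 * to any person obtaining a copy of this software
 */"""

gpl_header = """# This program is free software; you can redistribute it under
# the terms of the GNU General Public License as published by the FSF;
# either version 2 of the License, or (at your option) any later version."""

print(full_text_scan(mit_header))
print(phrase_scan(gpl_header))
```

In this sketch the full-text scan finds nothing in the GPL header because no complete reference text is present, while the phrase scan still recognizes it from a few key phrases.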
![Examining a copyright [Examining a copyright detection]](https://static.lwn.net/images/2015/11-fossology-copyright-sm.png)
Unlike the others, Ninka originates from outside the FOSSology project; it is based on ideas from a 2010 research paper (available at the Ninka site) and attempts to identify licenses based on sentence-level matching. All of the scanners included in FOSSology can be run as standalone utilities, although their main usage is intended to be through batch-scanning jobs that are scheduled and performed automatically, then later reviewed.
In addition to the scanning engines, FOSSology supports user-written filters and heuristics. On that front, the new release adds an option: whenever the Monk and Nomos scanners automatically detect the same license for a file, a rule can be enabled that automatically accepts the determination and saves it, sparing the human reviewer from manually inspecting that file. Presumably that equates to the user placing a high degree of confidence in Monk and Nomos, although, naturally, human reviewers can make errors just as heuristics can.
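The agreement rule can be pictured with a minimal sketch. It assumes each scanner's result for a file is available as a set of license names and that the rule only fires when both report exactly one identical license. The function name and the single-license restriction are assumptions for illustration, not FOSSology's actual logic.

```python
def auto_accept(monk_result, nomos_result):
    """Return the license to accept automatically, or None if a human must review.

    Both arguments are sets of license names reported by the (hypothetical)
    scanner wrappers for one file.
    """
    if len(monk_result) == 1 and monk_result == nomos_result:
        return next(iter(monk_result))
    return None


print(auto_accept({"MIT"}, {"MIT"}))
print(auto_accept({"MIT"}, {"BSD-3-Clause"}))
```

Any disagreement, or an empty or multi-license result, falls through to manual review in this sketch.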
![Examining export information [Examining an export detection]](https://static.lwn.net/images/2015/11-fossology-export-sm.png)
In any case, it does not appear that anyone expects FOSSology users to switch off manual review entirely; quite a bit of work went into revising the user interface. The release notes highlight a new UI for the license-review and copyright-statement–editing tasks, as well as a new jQuery-based "folder view" that supports sorting, filtering, and viewing extended file attributes. The additional attributes that FOSSology exposes include some rather important data; when files are scanned, there are modules that attempt to pick out other details of significance besides the license, such as authorship and copyright statements. In fact, the 3.0 release adds a new interface for reviewing and editing the copyright-detection results; in addition to updating copyright information (or fixing simple typos), users can now flag files for further review or discussion, adding notes where needed.
Copyright statements were detected in prior releases, too; it is just the editing interface that is new for 3.0. But the new release does add support for detecting an entirely new class of data: customs or export-control information. Many readers will be familiar with the export restrictions that have been imposed on encryption software over the years. For now, encryption seems to be the primary target of the export-control scanner—based on the keywords it looks for—although "avionics" is included as well, and it will flag all instances of "foreign trade" and other such general terms.
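A keyword-driven scan of this kind might look roughly like the following sketch. The keyword list contains only the three terms mentioned above and is not the scanner's real list; the function name and output format are likewise assumptions.

```python
import re

# Illustrative keywords only; the real scanner's list is not reproduced here.
EXPORT_KEYWORDS = ["encryption", "avionics", "foreign trade"]

_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in EXPORT_KEYWORDS),
    re.IGNORECASE,
)


def find_export_hits(text):
    """Return (line_number, keyword) pairs for every keyword occurrence."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


sample = (
    "Uses AES encryption for storage.\n"
    "Nothing of interest here.\n"
    "Subject to Foreign Trade regulations."
)
print(find_export_hits(sample))
```

A scan this simple flags every occurrence of a general term, which is why hits of this sort still need human review.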
Other new features
Among the other new features added in this release, FOSSology now supports the idea of "candidate licenses," which amount to a state in between the licenses currently tracked in the FOSSology instance's database and a completely unrecognized license. The reasoning is that users may want to tag files as having a license that is not yet in the database or perhaps even to create a filter that recognizes the new license. In prior versions of FOSSology, an admin user would have to add each new license to the database before this could happen. By supporting candidate licenses, users processing code uploads can tag files as needed without waiting for the admin, but if the candidate license later turns out to be unneeded, it has not been unnecessarily added to the database.
There are several other workflow additions in a similar vein. For instance, users can now save a license-conclusion decision for a particular file and have that decision tied to the hash of the file. As long as the file's hash remains the same on subsequent uploads, the file will not reappear in the list of files to review. Hopefully, such little additions speed up the process of reviewing uploads, but without running the risk of letting incomplete or inaccurate decisions creep in.
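The hash-keyed reuse of decisions can be sketched as follows. The class, the function, and the choice of SHA-256 are assumptions made for illustration; the article does not say which hash FOSSology uses or how decisions are stored.

```python
import hashlib


class DecisionStore:
    """Remembers license conclusions keyed by a hash of the file contents."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _digest(content):
        # SHA-256 is an assumption; any stable content hash would do here.
        return hashlib.sha256(content).hexdigest()

    def save(self, content, license_name):
        self._by_hash[self._digest(content)] = license_name

    def lookup(self, content):
        return self._by_hash.get(self._digest(content))


def files_needing_review(store, upload):
    """Return the paths in an upload whose contents have no saved decision."""
    return sorted(
        path for path, content in upload.items()
        if store.lookup(content) is None
    )


store = DecisionStore()
store.save(b"int main(void) { return 0; }\n", "MIT")

second_upload = {
    "src/main.c": b"int main(void) { return 0; }\n",   # unchanged
    "src/util.c": b"/* new file */\n",                  # not yet reviewed
}
print(files_needing_review(store, second_upload))
```

Because the key is the content hash rather than the path, an unchanged file stays out of the review list even if it is renamed, while any edit to its contents brings it back.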
Last but certainly not least, the FOSSology 3.0 release adds some new import and export options. Users can export Software Package Data Exchange (SPDX) 2.0 files that represent the licensing and copyright information for a project's entire codebase. FOSSology can now also import and export data in comma-separated value (CSV) form, which may make it easier to connect with other tools. And it can generate README or COPYING files based on the license that has been determined to apply to a directory or to a project.
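As an illustration of why CSV makes it easy to connect with other tools, here is a small round-trip sketch using Python's standard `csv` module. The column names are hypothetical and do not reflect FOSSology's actual CSV layout.

```python
import csv
import io

# Hypothetical column layout, not FOSSology's real export format.
FIELDS = ["path", "license", "copyright"]


def export_csv(rows):
    """Serialize a list of per-file records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def import_csv(text):
    """Parse CSV text back into a list of per-file records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


records = [
    {"path": "src/main.c", "license": "MIT",
     "copyright": "Copyright 2015, Example Corp"},
    {"path": "README", "license": "GPL-2.0+",
     "copyright": "Copyright 2014 A. Developer"},
]

text = export_csv(records)
print(text)
print(import_csv(text) == records)
```

The `csv` module quotes the copyright field that contains a comma, so the records survive the round trip unchanged.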
Given how intricate and complex license compliance can be, the obvious conclusion is that tracking and updating compliance information is likely to always include a significant time investment. But tools like FOSSology make the process as smooth as possible, and it is encouraging to see that the latest release has found so many areas for improvement. With more scanners implementing different approaches and with more flexibility in how licensing information is processed, perhaps keeping license compliance in order will someday be reduced to a job simple enough that it becomes routine. No doubt many free-software developers would welcome that.