
ClearlyDefined: Putting license information in one place

By Jake Edge
April 30, 2019

Determining the license that any given package uses can be difficult, but it is essential in order to properly comply with that license and, thus, the developer's wishes. There is an enormous amount of "open source" software available these days that is not clearly licensed, which is where the ClearlyDefined project comes in. The project is collecting a curated list of packages, source location, and license information; some of that collection can be automated, but ClearlyDefined is targeting the community to provide curation in the form of cleanups and additions.

Licensing information is notoriously complex to get right. Packages are often made up of source files that come with their own licenses, based on where the code originally came from—or the aims of the original developers. For example, even though the Linux kernel is licensed under GPLv2, it has many different licenses throughout the tree. The effort over the past few years to add Software Package Data Exchange (SPDX) headers to the kernel's source files is still ongoing. What seems like it should be a simple, straightforward process turns out to be quite a bit less so.

The kernel effort is just for one "package", however. There are untold thousands (tens, hundreds, ...) of other packages that today's software relies upon; the licensing information for many of those is even harder to work out. But there are a number of reasons that it is important to have that information available.

Without proper license information, some will be unwilling to use the code in question, or at least to distribute it. Others may need to spend some time tracking that information down before they can use the software. That effort may well satisfy their compliance needs, but the end result does not (necessarily) help others. If, for example, an organization sets out to create a list of the packages and licenses from a container image, the work of determining the license may need to be repeated by others, potentially including other parts of the same organization. There is no easy way to share that information with others that use the image or the packages contained within it.

Eliminating that duplication of effort is one of the ways that ClearlyDefined is trying to help fix the problem. Its home page presents an interface that allows community members to add information they have discovered for packages. It also provides a REST API that can be used to retrieve various kinds of data from its repository.
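
As a rough illustration of what a lookup through that API might involve, the sketch below builds the URL for a single package definition from its coordinates. The base URL and the type/provider/namespace/name/revision path layout are assumptions drawn from the project's public API documentation rather than from anything stated here, the helper name is hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

# Assumed base URL for the ClearlyDefined service; check the project's docs.
API_BASE = "https://api.clearlydefined.io"


def definition_url(pkg_type, provider, namespace, name, revision):
    """Build the URL for one package definition (hypothetical helper).

    A package with no namespace is written with "-" in that slot.
    """
    parts = [pkg_type, provider, namespace or "-", name, revision]
    # Percent-encode each segment so odd characters cannot break the path.
    path = "/".join(quote(part, safe="") for part in parts)
    return f"{API_BASE}/definitions/{path}"


print(definition_url("npm", "npmjs", None, "lodash", "4.17.11"))
```

Fetching the resulting URL with any HTTP client should, if those assumptions hold, return a JSON document describing that release.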

There is more to it than that, however. The real underlying problem lives in the upstream repository and/or source files. Making it easy for upstream projects to update their source files to have SPDX headers, as well as providing an overall license for the whole, could, eventually, make ClearlyDefined obsolete—though that may be just a tad overly optimistic.

Even projects that are simply providing a library or component for others to consume can encounter problems that ClearlyDefined can help solve. It is the rare project indeed that has no outside dependencies, so having licensing and other information available for those dependencies will make it easier for others to pick up and use the code. The idea is that anyone in the community can help curate the license information; ClearlyDefined is meant to be a portal where that work can be done.

The starting point is a description of the project, which includes the location of its source code, the bug-tracker location, and project web site. Beyond that, there will generally be multiple entries for a project, one for each release. Each of those entries has its release date associated with it. But projects are often made up of different pieces, some of which may not matter from a licensing standpoint because they are not distributed. So, things like tests, build utilities, examples, and so on can be defined as "facets" of the project; the "core" facet consists of the files that are actually part of the distributed code.
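
One way to picture facets is as a set of path patterns, with anything unmatched falling into "core". The following is only a minimal sketch of that idea: the facet names other than "core", the patterns, and the function names are all hypothetical and are not taken from ClearlyDefined's own code.

```python
from fnmatch import fnmatchcase

# Hypothetical facet patterns; a real project would choose its own.
FACET_PATTERNS = {
    "tests": ["test/*", "tests/*"],
    "dev": ["build/*", "scripts/*"],
    "examples": ["examples/*"],
}


def facet_for(path):
    """Return the facet a file belongs to, defaulting to "core"."""
    for facet, patterns in FACET_PATTERNS.items():
        # fnmatchcase's "*" also matches "/", so nested files are covered.
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return facet
    return "core"


def group_by_facet(paths):
    """Map each facet name to the list of files assigned to it."""
    groups = {}
    for path in paths:
        groups.setdefault(facet_for(path), []).append(path)
    return groups
```

Calling `group_by_facet(["src/lib.py", "tests/unit/test_lib.py"])` would place the first file in "core" and the second in "tests", so that license findings in test code can be reported separately from those in the distributed files.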

Each facet then gets assigned licensing information that has either been automatically determined (by code scanners like ScanCode and FOSSology) or has been contributed to the project, as the Eclipse Foundation did when ClearlyDefined became an incubator project of the Open Source Initiative (OSI) a little over a year ago. The "declared" license, which comes from the license choice stored in the source repository (if any), is recorded, along with the number of files in the facet. Any SPDX headers discovered in the source files along with the copyright attribution information are recorded as well.
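
To make the "discovered" side of that concrete, here is a small sketch of pulling SPDX tags out of file contents and counting files per license expression. It is a toy stand-in for what real scanners such as ScanCode do, not a description of their implementation; the function names are hypothetical, and labeling untagged files "NOASSERTION" is a choice made for this example.

```python
import re
from collections import Counter

# Matches a tag such as "SPDX-License-Identifier: GPL-2.0-only" in a comment.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def spdx_expression(source_text):
    """Return the first SPDX license expression in the text, or None."""
    match = SPDX_TAG.search(source_text)
    if match is None:
        return None
    expression = match.group(1).strip()
    # Drop a trailing comment terminator, as in C-style "/* ... */" headers.
    if expression.endswith("*/"):
        expression = expression[:-2].strip()
    return expression


def tally(files):
    """Count files per discovered expression; files maps path -> contents."""
    return Counter(
        spdx_expression(text) or "NOASSERTION" for text in files.values()
    )
```

Run over the files of one facet, `tally()` gives a per-expression file count that can be set against the facet's declared license and total file count.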

Others can help out by contributing data through the curation process on the web site. It is handled in the same way that contributions to GitHub-hosted projects normally are, with pull requests (PRs), in this case to the curated-data repository. The process is a bit clunky right now, as the project admits, but it is actively seeking ways to make the curation process work more smoothly.

Another area where ClearlyDefined is not yet clearly defined is its security component. The overarching idea is to track security vulnerabilities and fixes so that users can understand the security status of the components they use. How that will be done is still under discussion; for now, the project is mostly focused on the licensing piece.

The project's charter gives a nice overview of the project and its goals. As might be expected of an OSI project, ClearlyDefined is committed to using open-source tools and to releasing its code as open-source software. For example, "harvesting" data will not be done using proprietary tools:

Harvesting is the act [of] getting data from upstream projects. This may be as simple as reading prescribed data from canonical locations to full-on analysis of the source code using a variety of open tools. The discovered data is stored in its entirety in its native form in ClearlyDefined infrastructure and made available to the community on demand. The harvesting tools themselves are always fully open and accessible to the community for vetting and inspection. The project is open to including new tools subject to a vote, as described below.

Harvesting may be run by the ClearlyDefined project itself or by designated parties, typically curators. In all cases, only output from agreed to tools and configurations will be admitted to the system. Harvesting operators are free to focus on a given domain of projects that best suit their expertise and interests.

As the stats page shows, there are nearly five million definitions currently in the database (as of this writing, anyway). Multiple repositories are being harvested, including npm for Node.js, PyPI for Python, Maven for Java, crates.io for Rust, GitHub, and others. ClearlyDefined was the subject of a lively workshop at the recent FSFE Legal and Licensing Workshop (LLW), led by project lead Jeff McAffer of GitHub. The project has lots of partners, such as Google, Microsoft, Amazon Web Services, Qualcomm, Software Heritage, and Codescoop.

The data ClearlyDefined gathers is clearly helpful, but it needs a lot of attention from the community in order to get it into a fully useful state. Once done, that data has a lot of value, however, especially in not having to redo the work over and over. Hopefully that value will lure more companies into the fold to help curate, quite possibly using data they have already gathered as part of their compliance efforts. Crowdsourcing the data seems ... clearly ... like the right way to go.




ClearlyDefined: Putting license information in one place

Posted May 1, 2019 21:52 UTC (Wed) by ovitters (guest, #27950) [Link] (1 responses)

I tried reading the mentioned SPDX specification. Current latest version is at https://spdx.org/spdx-specification-21-web-version. It seems written by enterprise companies or something. The page is super long and is pretty unreadable. It seems to provide things pretty similar to DOAP (https://github.com/ewilderj/doap/wiki). The current DOAP site also doesn't have good documentation but at least in the past I remember things being pretty readable.

I work for a big company btw, the big company way of doing things is not always something that leads to success.

ClearlyDefined: Putting license information in one place

Posted May 4, 2019 11:54 UTC (Sat) by kpfleming (subscriber, #23250) [Link]

SPDX itself is large and complicated as you've noted, but thankfully ClearlyDefined isn't dealing with the entirety of SPDX. It's only addressing license statements, which are specified using SPDX license identifiers or expressions. You can read about those here:

https://spdx.org/ids-how

The identifier/expression language, along with the SPDX list of known licenses, is all you would need to understand the information which is put into source files in the process.

ClearlyDefined: Putting license information in one place

Posted May 4, 2019 22:46 UTC (Sat) by spwhitton (subscriber, #71678) [Link] (1 responses)

Debian cares a lot about knowing that everything we distribute in the main archive complies with DFSG, but achieving that is almost completely manual. Members of the FTP team look at every file in every new source package, and confirm that the Debian copyright file takes account of all licenses and copyright holders. There are tools to automatically generate these Debian copyright files, but TTBOMK none of them are comprehensive. There is just too much variation in the free software that gets uploaded to Debian.

It's nice to see that people are thinking about how we might save volunteer time in this area, though integrating Debian's processes with something like ClearlyDefined is a very long way off!

ClearlyDefined: Putting license information in one place

Posted May 5, 2019 2:53 UTC (Sun) by pabs (subscriber, #43278) [Link]

A wiki page about some of the available tools:

https://wiki.debian.org/CopyrightReviewTools

There is also one in development that uses machine learning:

https://salsa.debian.org/lumin/licensecheck-ng


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds