GitHub unveils its Licenses API
Since opening its doors in 2008, GitHub has grown to become the largest active project-hosting service for open-source software. But it has also attracted a fair share of criticism for some of its implementation choices—with one of the leading complaints being that it takes a lax approach to software licensing. That, in turn, leads to a glut of repositories bearing little or no licensing details. The company recently announced a new tool to help combat the license-confusion issue: a site-wide API for querying and reporting license information. Whether that API is up to the task, however, remains to be seen.
None of the above
By way of background, GitHub does not require users to choose a license when setting up a new project. An existing project can also be forked into a new repository with one click, and nothing subsequently prevents the new repository's owner from changing or removing the upstream license information (if it exists).
From a legal standpoint, of course, the fork inherits its license from upstream automatically (unless the upstream project is public domain or under some other less-common license). But from a practical standpoint, this provenance is difficult to trace. Throw in other GitHub users submitting pull requests for patches that have no license information, and one has a recipe for confusion.
The bigger problem, however, is that the majority of GitHub repositories carry no license information at all, because the users who own them have not chosen to add such information. In 2013, GitHub introduced its first tool designed to combat that issue, launching ChooseALicense.com, a web site that explains the features and differences of popular FOSS licenses.
ChooseALicense.com helps GitHub users select a license, and the GitHub new-project-configuration page has a license selector, but using it is not obligatory. In fact, the ChooseALicense.com home page includes a "no license" link as its last option.
That "no license" link, incidentally, attempts to explain the downside of selecting no license—most notably, it strongly discourages other developers (both FOSS and proprietary) from using or redistributing the code in any fashion, for fear of getting entangled in a copyright problem. But the page also points out that the GitHub terms of service dictate that other users have the right to view and fork any GitHub repository.
A new interface
One could probably quibble endlessly over the details of ChooseALicense.com and its wording. The upshot, though, is that it did not have a serious impact on the license-confusion problem. A March 9 post on the GitHub blog presented some startling statistics: that less than 20% of GitHub repositories have a license, and that the percentage is declining. The introduction of the license-selection tool in 2013 produced a spike in licensed repositories, followed by a downward trend that continues to the present. The post also included some statistics on license popularity; the three licenses featured most prominently on the license-chooser site (MIT, Apache, and GPLv2) are, unsurprisingly, the most often selected.
This data set, however, is far from complete; as the post explains, the team only logged licenses that were found in a file named LICENSE, and only matched that file's contents against a short set of known licenses. Nevertheless, GitHub did evidently determine that the problem was real enough to warrant a new attempt at a solution.
The team's answer is a new site-wide API called, fittingly, the Licenses API. It is currently in preview, which means that interested developers must supply a special HTTP header with any requests in order to access it.
But the API is, at least currently, a frustratingly limited one. It offers just three functions:
- GET /licenses returns a JSON-formatted list of all of the licenses tracked by the site.
- GET /licenses/licensename returns the license text and associated metadata for licensename.
- GET /repos/username/reponame returns any licensing information for username's reponame repository (along with other repository information).
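As a rough sketch of how a client might address these three endpoints, the Python below builds (but does not send) the corresponding requests. The base URL and the preview media type in the Accept header are assumptions that the text above does not state, and the helper names are hypothetical; GitHub's preview documentation is the authority on the exact header value.

```python
import urllib.request

# Assumed base URL for the GitHub API (not stated in the article).
API_ROOT = "https://api.github.com"
# Assumed preview media type; verify against GitHub's preview documentation.
PREVIEW_ACCEPT = "application/vnd.github.drax-preview+json"


def build_request(path):
    """Return an unsent GET request for an API path, carrying the preview header."""
    return urllib.request.Request(
        API_ROOT + path,
        headers={"Accept": PREVIEW_ACCEPT},
        method="GET",
    )


def list_licenses_request():
    """Request for the list of all licenses tracked by the site."""
    return build_request("/licenses")


def license_request(license_name):
    """Request for the text and metadata of a single license."""
    return build_request("/licenses/" + license_name)


def repo_request(username, reponame):
    """Request for repository information, which includes any license details."""
    return build_request("/repos/{}/{}".format(username, reponame))
```

Passing one of these objects to urllib.request.urlopen() and decoding the body with json.load() would perform the actual query. Building the request separately keeps the preview header in one place should it change when the API leaves preview.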
Arguably the biggest limitation is that, as was the case with the statistics gathered for the blog post, the license of a repository is determined only by examining the contents of a LICENSE file. On the plus side, the license information returned by the API conforms to the Software Package Data Exchange (SPDX) specification, which should make it easy to integrate with existing software.
To be sure, determining and counting licenses is not a simple matter—as many in the community know. In 2013, for example, a pair of presentations at the Free Software Legal and Licensing Workshop explored several strategies for tabulating statistics on FOSS license usage. Both presentations ended with caveats about the difficulty of the problem—whatever methodology is used to approach it.
Nevertheless, the GitHub Licenses API does appear to be strangely naive in its approach. For example, it is well established that a significant number of projects place their license in a file named COPYING, rather than LICENSE, because that has long been the convention used by the GNU project. Even scanning for that filename (or other obvious candidates, like GPL.txt) would significantly enhance the quality of the available data. Far better would be allowing the repository owner to designate which file contains the license.
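To illustrate how little effort a broader filename scan would take, here is a minimal sketch. It is not GitHub's method; the function name and the matching rule (a case-insensitive comparison of the name without its extension) are assumptions made for illustration, and only the three candidate names come from the discussion above.

```python
import os

# Candidate names taken from the text: LICENSE, COPYING, and GPL.txt.
# Matching is done on the lowercased name with its extension removed.
CANDIDATE_STEMS = ("license", "copying", "gpl")


def find_license_files(filenames):
    """Return the filenames whose extension-less name matches a candidate."""
    matches = []
    for name in filenames:
        stem, _ext = os.path.splitext(name)
        if stem.lower() in CANDIDATE_STEMS:
            matches.append(name)
    return matches
```

Given the top-level file listing of a repository, the function returns every plausible license file rather than insisting on a single fixed name. A real scanner would still need to inspect the contents of each match to identify the license.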
Furthermore, the Licenses API could be used to accumulate more meaningful statistics, such as which forks include different license information than their corresponding upstream repository, but there is no indication yet that GitHub intends to pursue such a survey. It may fall on volunteers in the community to undertake that sort of work. There are, after all, multiple source-code auditing tools that are compatible with SPDX and can be used to audit license information and compliance. Regrettably, the GitHub Licenses API does not look like it will lighten that workload significantly, since the information it returns is so restricted in scope.
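A survey of that kind could start from something as simple as the sketch below, which compares the license reported for a fork against the one reported for its upstream. It assumes that each repository response has already been decoded from JSON into a dictionary, and that license details appear under a "license" entry containing a "key" field; both field names and both function names are assumptions rather than details given in the article.

```python
def license_key(repo_info):
    """Return the license key from a decoded repository response, or None."""
    # A missing or null "license" entry is treated as "no license".
    license_info = repo_info.get("license") or {}
    return license_info.get("key")


def fork_license_differs(fork_info, upstream_info):
    """True when the fork and its upstream report different license keys.

    A repository with no license information is represented as None, so a
    fork that dropped its upstream license counts as different.
    """
    return license_key(fork_info) != license_key(upstream_info)
```

Because the comparison relies only on what the API reports, it inherits the API's limitations: a fork that moved its license into a file the API does not examine would look unlicensed.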
Power to choose
GitHub is right to be concerned about the paucity of license information in the repositories hosted at its site. But both the 2013 license chooser and the new Licenses API seem to stem from an assumption on GitHub's part that so many repositories lack licenses because license selection is either confusing or hard to research. Neither effort strikes at the heart of the problem: that GitHub makes license selection optional and, thus, makes licensing an afterthought.
SourceForge has long required new projects to select a license while performing the initial project setup. Later, when Google Code supplanted SourceForge as the hosting service of choice, it, too, required the user to select a license during the first step. So too do Launchpad.net, GNU Savannah, and BerliOS. FedoraHosted and Debian's Alioth both involve manually requesting access to create a new project, a process that, presumably, involves discussing whether or not the project will be released under a license compatible with that distribution.
It is hard to escape the fact that only GitHub and its direct competitors (like Gitorious and GitLab) fail to raise the licensing question during project setup, and equally hard to avoid the conclusion that this is why they are littered with so many non-licensed and mis-licensed repositories. An API for querying licenses may be a positive step, but it is not likely to resolve the problem, since it side-steps the underlying issue.
Hopefully, the current form of the Licenses API is merely the beginning, and GitHub will proceed to develop it into a truly useful tool. There is certainly a need for one, and being the most active project-hosting provider means that GitHub is best positioned to do something about it.
