Licenses for data

By Jake Edge
May 9, 2018

The amount of available data keeps growing, to the point that some data sets are far larger than any one company or organization can create and maintain. So companies and others want to share data in ways that are similar to how they share code. Some of those companies are members of the Linux Foundation (LF), which is part of why that organization got involved in creating licenses for this data. LF VP of Strategic Programs Mike Dolan came to the 2018 Legal and Licensing Workshop (LLW) to describe how the Community Data License Agreement (CDLA) came about.

The data in question is used for applications like machine learning, blockchains, AI, and open geolocation, he said. Governments, companies, and other organizations want to share their data, and the model they want to follow is the one they have learned from open-source software. So the idea behind the CDLA is to share data openly using what has been learned about licensing from decades of sharing source code.

Version 1.0 of the CDLA was announced by the LF in October 2017. There are two different CDLA agreements that were inspired by the difference between permissive and copyleft licensing for software, he said. The "sharing" agreement is like a copyleft license, such as the GPL, while the "permissive" agreement is more like the MIT or BSD licenses. The difference comes into play if a recipient publishes the data or an enhanced version of it; if the data was licensed under the sharing agreement, what they publish must be released under that agreement as well. If the data is just used internally, there is no requirement to release it.

Data is not the same as source code, Dolan said. Facts are not copyrightable in many jurisdictions; only the creative expression of the data can be protected. But some data providers are trying to lock down access to their data with a variety of often ambiguous usage terms. They try to make things complex by using broad language.

The current practices for those releasing open data vary. Some release their data into the public domain, while others use open-source software licenses or Creative Commons licenses. There are other open-data licenses as well, such as the Open Database License used by OpenStreetMap, and the Canadian government has its own Open Government License.

None of those have really gained traction, for various reasons. There is a consensus that software licenses are not appropriate for data, and the public domain and CC0 approaches concern some. For those reasons, LF members and others thought there was a need for new licenses for data. The intent is to prevent license proliferation and to keep valuable data from being released under licenses that do not allow the aggregation needed for the data to be fully utilized over time.

Data can be long-lasting or even perpetual. It may also be hard or impossible to recreate the conditions under which it was gathered. If you have a data set containing oceanic temperatures over time, there is no opportunity to regather the data at some later point. That means the license under which it is released may be critical to how it can be used decades or even centuries from now.

One of the areas that took a lot of time to work out was the copyleft obligations: where do they begin and end? In the end, any modifications or additions to the data that are published must be released under the CDLA sharing agreement. Any analysis of the data is explicitly excluded from that requirement, though those results may be included voluntarily. That exclusion includes any "computational or transformational activity", such as creating a TensorFlow model from a data set.
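The boundary Dolan described can be sketched as a small decision rule. This is only an illustration of the summary given in the talk, not a reading of the license text; the names Agreement, Derivative, and must_release_under_sharing are hypothetical and were chosen for this example.

```python
from enum import Enum


class Agreement(Enum):
    """The two CDLA 1.0 variants described in the talk."""
    SHARING = "sharing"        # copyleft-like
    PERMISSIVE = "permissive"  # MIT/BSD-like


class Derivative(Enum):
    """What the recipient produced from the data (hypothetical categories)."""
    MODIFIED_DATA = "modified_data"      # modifications or additions to the data
    ANALYSIS_RESULT = "analysis_result"  # e.g. a model trained on the data


def must_release_under_sharing(agreement: Agreement,
                               derivative: Derivative,
                               published: bool) -> bool:
    """Return True when the talk's summary says a release obligation applies.

    Assumption: this encodes only the three conditions mentioned in the
    article (sharing agreement, modified or added data, publication).
    The actual license text may contain conditions not modeled here.
    """
    return (
        agreement is Agreement.SHARING
        and derivative is Derivative.MODIFIED_DATA
        and published
    )


if __name__ == "__main__":
    # Published additions to sharing-licensed data carry the obligation.
    print(must_release_under_sharing(
        Agreement.SHARING, Derivative.MODIFIED_DATA, published=True))
    # A trained model is analysis, which is excluded from the requirement.
    print(must_release_under_sharing(
        Agreement.SHARING, Derivative.ANALYSIS_RESULT, published=True))
```

Under this sketch, internal-only use and anything under the permissive agreement never trigger the obligation, which matches the description above; results of analysis can still be contributed voluntarily.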

Dolan said that these agreements will be used by communities that are training AI and machine-learning systems, public-private infrastructure initiatives (such as for traffic data), and organizations with mutual interests that will be best served by pooling their data resources. The CDLA is already in use by Cisco on a data set of network anomalies that it has released on GitHub. In addition, data.world, which is positioning itself as the GitHub for data, recently added CDLA to its list of licenses.

Dolan concluded by answering a question that he always gets about the relationship between the CDLA and Europe's General Data Protection Regulation (GDPR). The CDLA is for data that can be shared, and thus does not come under the GDPR; releasing data under a CDLA license does not magically make data shareable that would normally not be because of the GDPR.

An audience member asked about how ocean temperature data could even have a copyright, but Dolan noted that CDLA is not creating any new rights. There are database rights in Europe and similar rights elsewhere that already create this situation. CDLA simply provides a clear set of terms so that companies understand their responsibilities if there are any rights embodied in the data.

[I would like to thank the LLW Platinum sponsors, Intel, the Linux Foundation, and Red Hat, for their travel assistance support to Barcelona for the conference.]

Index entries for this article
Conference	Free Software Legal & Licensing Workshop/2018

Licenses for data

Posted May 10, 2018 6:59 UTC (Thu) by epa (subscriber, #39769)

This new CDLA looks like a reasonable choice. It doesn't have the problematic contract terms of the Open Database Licence (ODbL) which attempt to add new restrictions over and above what's covered by copyright and database right.

Licenses for data

Posted May 11, 2018 10:14 UTC (Fri) by viiru (subscriber, #53129)

Creating a new license to try to prevent license proliferation sounds hilariously like https://xkcd.com/927/