
Machine learning for lawyers

By Jake Edge
May 3, 2017


Machine learning is a technique that has taken the computing world by storm over the last few years. As Luis Villa discussed in his 2017 Free Software Legal and Licensing Workshop (LLW) talk, there are legal implications that need to be considered, especially with regard to the data sets that are used by machine-learning systems. The talk, which was not under the Chatham House Rule default for the workshop, also provided a simplified introduction to machine learning geared toward a legal audience.

Villa has started his own solo practice in San Francisco that is concerned mostly with open-source issues, though he would like to get into some open-data work as well. His presentation started with an xkcd cartoon that highlights the difficulty of having computers recognize features in images. Someone asks for an app that can tell whether a photo was taken in a national park, which is easy, and whether the photo is of a bird, which will require "a research team and five years". But a few weeks later, Flickr released its PARK or BIRD application. Flickr was able to do that because it had been working on machine learning, he said.

In 2011, Google began experimenting with "what has now become known as machine learning"; by 2014, Yahoo! was marketing machine learning by way of the PARK or BIRD application. In 2017, interns are using machine learning. He asked: "what changed?" Some new algorithms came about in 2006 and GPUs got much faster, which provided the horsepower needed to do the intricate math required. But the biggest change is that the data needed to train machine-learning systems used to be fairly scarce and is now far more abundant.

Machine learning 101

He began by looking at what machine learning is not. It is not a general-purpose artificial intelligence like HAL 9000, nor is it continuously learning like a child. Machine-learning systems are quite specialized to do one thing "incredibly well".

[Luis Villa]

Going back to xkcd, Villa said that it is straightforward to use the location data stored in a photo and a database of national park boundaries to determine that part. There are "discrete steps that can be easily explained" to do so. But trying to explain what a bird is to a computer is as difficult as explaining it to a three-year-old.
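To make the contrast concrete, the "easy" half can be sketched in a few lines of Python. The coordinates and the rectangular "boundary" below are invented for illustration; real park boundaries are complex polygons, but the steps remain discrete and explainable:

    # A toy version of the "is this photo in a national park?" check.
    # Yellowstone, very roughly, as a (lat_min, lat_max, lon_min, lon_max)
    # bounding box; a real implementation would use full polygon boundaries.
    PARKS = {
        "Yellowstone": (44.1, 45.1, -111.2, -109.8),
    }

    def in_national_park(lat, lon):
        """Return the park name if (lat, lon) falls inside a boundary box."""
        for name, (lat_min, lat_max, lon_min, lon_max) in PARKS.items():
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                return name
        return None

    print(in_national_park(44.6, -110.5))   # Yellowstone
    print(in_national_park(37.8, -122.4))   # None (San Francisco)

There is no comparable sequence of explainable steps for "is this a bird?"; that is the gap machine learning fills.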

The best definition of machine learning that he has is adapted from An AI Pattern Language [PDF]: machine learning is a way to recognize a pattern in data and to improve that recognition by exposure to more data.

But before you can do that, you need to gather lots of data. These days, there are publicly available data sets that can be used. For example, the MNIST database contains 70,000 handwritten numerals that can be used to train a machine-learning system to recognize digits. The YouTube-8M data set has eight million videos with labels, while the Heterogeneity Activity Recognition Data Set contains 43 million records of smartphone and smartwatch sensor data; both of these can be used for training and are freely available.
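As a rough sketch of what using such a data set looks like in practice, here is a minimal MNIST example. The choice of scikit-learn is an assumption for illustration (the talk named no toolkit), and a simple logistic-regression classifier stands in for a neural network:

    # Train a digit recognizer on the 70,000-image MNIST data set.
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression

    # each image is a 28x28 grid of pixels, flattened to 784 values
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True,
                        as_frame=False)

    model = LogisticRegression(max_iter=100)
    model.fit(X[:60000] / 255.0, y[:60000])    # train on 60,000 images

    # accuracy on the 10,000 images the model never saw during training
    print(model.score(X[60000:] / 255.0, y[60000:]))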

Another is the YouTube-BoundingBoxes data set where humans annotated 380,000 YouTube video clips to identify various features (Villa showed a selection of cat bounding boxes, naturally, in his slides [PDF]). These data sets have been sampled, selected, and modified by various humans, which might make them collective works—and it may not be clear who owns the rights. That should be setting off "lawyer alarm bells" for those in the room, he said.
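For a sense of what the human annotators produced, a single record in such a data set ties a video frame to a label and box coordinates. The field names below are hypothetical, not the data set's actual schema:

    # One illustrative annotation: "there is a cat in this region of
    # this frame of this video". Field names are invented for clarity.
    annotation = {
        "video_id": "dQw4w9WgXcQ",       # which YouTube clip
        "timestamp_ms": 153000,          # which frame within the clip
        "label": "cat",                  # what the annotator saw
        "box": {                         # corners, as fractions of frame size
            "xmin": 0.21, "xmax": 0.58,
            "ymin": 0.33, "ymax": 0.79,
        },
    }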

Typically, a machine-learning system is not just handed the raw video (or other) data, but the data is preprocessed in some fashion to extract "features" that are relevant to what is being trained. That is generally done using more traditional techniques like filters of various sorts. But they all modify the original data set in some fashion.
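A minimal sketch of that kind of preprocessing, using a classic (non-learned) edge-detection filter as the feature extractor:

    # Extract "edge" features from a grayscale image before training.
    # The 3x3 Sobel kernel is a traditional filter, not something learned;
    # note that the output is a modified version of the original data.
    import numpy as np

    def sobel_x(image):
        """Horizontal-edge strength at each point of a 2-D image array."""
        kernel = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])
        h, w = image.shape
        out = np.zeros((h - 2, w - 2))
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
        return out

    img = np.random.rand(28, 28)    # stand-in for one digit image
    features = sobel_x(img)         # 26x26 array of edge responses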

Then the outcomes need to be defined. All of the 2s in the MNIST data set need to be defined to be a 2. There is a technique called unsupervised learning, but it is mostly of academic interest. For systems that are going to be used for real machine-learning tasks, the training data will be tagged with the outcome desired.
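In code, that tagging amounts to nothing more than pairing each example with the outcome a human decided on; the file names here are invented for illustration:

    # Supervised learning: every training example carries its label.
    training_data = [
        ("digit_00001.png", 2),    # a human tagged this image as a 2
        ("digit_00002.png", 7),    # ...and this one as a 7
        ("digit_00003.png", 2),
    ]
    for filename, label in training_data:
        print(filename, "->", label)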

Then things start to get messy, he said. In order to do the recognition, a neural network needs to be built. These networks are built up of individual (artificial) neurons that multiply each of their inputs by a weight and add the results together. If that sum passes a certain threshold defined for the neuron, it outputs a 1 (though, in truth, it is more complicated than that, he said). Neurons can have hundreds or thousands of inputs and get composed into large networks.
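A single neuron of that sort takes only a few lines to sketch (the input values, weights, and threshold below are arbitrary):

    # One artificial neuron: weight each input, sum, compare to a threshold.
    def neuron(inputs, weights, threshold):
        total = sum(x * w for x, w in zip(inputs, weights))
        return 1 if total > threshold else 0

    # 0.5*0.4 + 0.9*0.7 + 0.1*(-0.2) = 0.81, which exceeds 0.6
    print(neuron([0.5, 0.9, 0.1], [0.4, 0.7, -0.2], threshold=0.6))  # 1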

How the neurons are arranged in the network and how many neurons are in each "layer" of the network is something of an art. But, in the end, each neuron gets a threshold associated with it. This is done by "rolling a bunch of dice", then using the training data to refine those values. The result is a huge multidimensional array of numbers, called a model. So, to recognize a handwritten digit, an image of it (say, a 28x28 grid of pixels flattened into a 784-element array) would be fed into the network the model describes, and a useful result would pop out.
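The "roll dice, then refine" process can be sketched with the classic perceptron update, a much-simplified stand-in for how modern networks are actually trained:

    # Start from random weights, then nudge them on each wrong answer.
    import random

    def train(examples, n_inputs, epochs=100, lr=0.1):
        w = [random.uniform(-1, 1) for _ in range(n_inputs)]  # the dice roll
        b = random.uniform(-1, 1)
        for _ in range(epochs):
            for inputs, target in examples:
                out = 1 if sum(x * wi for x, wi in zip(inputs, w)) + b > 0 else 0
                err = target - out
                w = [wi + lr * err * x for wi, x in zip(w, inputs)]
                b += lr * err
        return w, b    # the "model": nothing but an array of numbers

    # learn the AND function from four labeled examples
    examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    w, b = train(examples, n_inputs=2)
    for inputs, target in examples:
        out = 1 if sum(x * wi for x, wi in zip(inputs, w)) + b > 0 else 0
        print(inputs, "->", out)   # matches the targets once trained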

This big array is all numbers; it is not text, images, or anything that "we can comprehend or understand in our puny human brains", Villa said. Thus, it may not be protectable under traditional copyright law.

Legal questions

In the training data sets, protected works (e.g. YouTube videos) are often used. Those protected works are then enhanced by others, who add things like bounding boxes and labels. It is unclear what rights exist in these data sets. Under US copyright law, this is likely to be considered fair use, he thinks. In the EU, though, the database directive is largely untested, so no one knows how these data sets might be treated. It would be hard to create a license for these works without knowing how the case law will play out; any license created now may need fixing later.

Some protected works are created in the process, at least arguably. The neural networks embody some creativity, even though much of it is machine generated. He asked: would they be considered databases in the EU?

There are some questions about the output from machine-learning systems. Clearly a human translator's work is protected by copyright, but what about output from Google Translate? "Who knows?"

Other facets to the legal questions surrounding machine learning exist as well. Discrimination may be a factor. He pointed to Google Translate output for "a lawyer" and "a nurse" in Spanish; the translation used the masculine form for the lawyer and the feminine form for the nurse, even though Spanish has both forms for each profession. In the training set used, evidently, most lawyers were male and most nurses female.

In addition, privacy questions rear their heads. These massive data sets make machine learning possible, but also give incentives for even more data collection. The models themselves may bring legal problems because they are not human-comprehensible. There are the concepts of a "right to explanation" in the EU and "due process" in the US that require ways to explain (and argue against) decisions made. When jail sentences are being suggested by machine-learning systems, as they are in some places in the US, it is unclear how the two can be reconciled.

Open data

A move toward open data is already happening in this area. Machine learning gravitates toward data sources with the least friction. The PARK or BIRD application was partially trained using Wikipedia, for example. That can lead to problems, however. If Wikipedians are more likely to be white and male (and/or from the US), those biases are going to be incorporated into the models.

The models themselves (which are basically a large collection of numbers) are being released under various open licenses, such as the Apache license; it is not clear if that makes sense or not at this point. There is also a lot of open source powering all of this work, including things like Google's Parsey McParseface.

There are multiple cross-cutting issues that will be "insanely hard to deal with", Villa said. Dealing with patents that cut across copyright was difficult, but this will be worse. Privacy, bias, cross-jurisdictional issues, the right to explanation, and so on are all intersecting here. The data itself is going to be around for a long time and the field is evolving rapidly. It is "hubris to think we can write a copyleft license" for these pieces at this time. Governments are going to legislate—badly—in this area and any licenses are going to need to be flexible or they will need to be changed.

[I would like to thank Intel, the Linux Foundation, and Red Hat for their travel assistance to Barcelona for LLW.]
