AI from a legal perspective
The AI boom is clearly upon us, but there are still plenty of questions swirling around this technology. Some of those questions are legal ones and there have been lawsuits filed to try to get clarification—and perhaps monetary damages. Van Lindberg is a lawyer who is well-known in the open-source world; he came to Open Source Summit Europe 2023 in Bilbao, Spain to try to put the current work in AI into its legal context.
Lindberg began by introducing himself; he has been involved in computer law for around 25 years at this point. Throughout that time, he has also worked in open source, notably as general counsel for the Python Software Foundation. He has been working on AI issues since 2008 as well, so he is well-positioned to assist in those matters now that "the entire world started going crazy about AI".
He asked the audience whether they primarily identify as technologists or members of the legal field; the response was "not quite half and half", he said, but noted that he would disappoint both groups, "just in different parts of the presentation". His technical description of the techniques being used in AI today would perhaps be somewhat boring for the technologists, while the cases he would talk about are likely already known to those with legal training. But, he said, the good news is that the intersection of AI and law is such a rapidly developing field that there should be something new in his talk for everyone.
In the talk, he would only be focusing on "generative machine learning"; he is aware of the older AI research, but "generative ML is the thing that has really started to drive this AI revolution". It is also much more interesting and challenging from a legal perspective. He would be looking at the advances in the field over just the last five years, though most of what he would present is from the last three or four years, he said.
His presentation is a shortened version of his paper that was published earlier in 2023. It looks at how models are trained and the inferences they perform, along with the copyright implications from a US perspective. The talk would be US-law-focused, as well, because that is "where a lot of the current action is". He pointed to a UK-law-focused article for those who are interested; so far, he has been unable to find a similar work that looks at the issue from an EU-law perspective.
Models
When it is applied to AI, the term "model" is misunderstood by a lot of people, Lindberg said. Many think of it as "the magic black box that does what I want it to do", but it is important to understand what models really are, how they work, and how they are trained, in order to "apply the correct legal analysis". He used an analogy to try to give the audience a "good mental picture" of what is going on with models.
Imagine someone is given the job of "art inspector" and is tasked with inspecting all of the art in the Louvre museum. When he is hired, he knows "absolutely nothing about art"; he does not know what makes it good or bad, what the different types of art are, and so on. He sets out to fix that lack of knowledge by measuring everything he can think of with regard to each work of art: size, weight, materials used, colors employed, creation date and location, etc. He also measures random things like the number of syllables in the artist's name and what corner they choose to sign their work in; he records all of this information in his notebook (i.e. database).
That work starts to get pretty boring, so he invents a game: before he measures something, he is going to use what he already knows to make a guess about the measurement. At first, his guesses are terrible, but after looking at thousands, or millions, of paintings, his guesses start to get better, then much better. He can make pretty accurate guesses about a work with just a little information about it as a starting point; he has effectively recognized hidden patterns that allow him to make these accurate guesses.
That analogy shows the process of model training, which consists of four steps that get repeated billions of times: measuring, predicting, checking, and updating. When people talk about model creation, they often say that the process is "reading" the data or that "it is sucking in all this content"; that is sort of true, but it is not exactly what is going on. The training process is extracting certain statistical measurements of the data; it is calculating probabilities associated with those measurements and the data set.
Those probabilities are then used to predict things about some other training data, which has known-correct answers; those predictions are checked against the answers. A model for something like ChatGPT, for example, would check its prediction against the next word in the existing text. For "it was a dark and stormy X", "night" would be a high-probability completion, while "elephant" would be an extremely low-probability one. Based on the check, the training process updates all of its probabilities to make the correct completion a little more likely the next time; it follows this process many millions or billions of times.
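To make that loop concrete, here is a minimal sketch (not something Lindberg presented) of the predict/check/update cycle, using a toy bigram model that just counts word pairs; real systems instead adjust billions of floating-point weights by gradient descent, but the loop has the same shape:

```python
from collections import defaultdict

# Toy predict/check/update loop: a bigram "model" that guesses the
# next word from pair counts, checks the guess against the known
# answer, then updates the counts.
counts = defaultdict(lambda: defaultdict(int))
corpus = "it was a dark and stormy night".split()

for epoch in range(2):  # real training repeats billions of times
    for prev, nxt in zip(corpus, corpus[1:]):
        seen = counts[prev]
        guess = max(seen, key=seen.get) if seen else None  # predict
        print(f"after {prev!r}: guessed {guess!r}, answer was {nxt!r}")
        counts[prev][nxt] += 1                             # update
```

On the first pass, every guess is wrong, much like the art inspector's early attempts; on the second pass, every guess is right, because the counts now encode the statistical patterns of the (tiny) training set.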
That describes the training process for "almost any type of ML", he said; the differences are in what kinds of data the model is trained on. The result is the model, which has a particular architecture consisting of the mechanisms used to break down the inputs, the methods used to analyze those inputs, and a way to represent the output. The architecture is not an implementation, but is simply a logical construct that lives in the heads of the model developers; the code that gets written using, say, PyTorch is an implementation of the model architecture.
The architecture has separate layers for input and output, plus some hidden layers that are critical to the inferences (guesses) the model is meant to make; they are arranged as an artificial neural network, as shown in a diagram from Lindberg's slides. The input layer turns the input into a number in some fashion; the input could be a pixel value, a word, or a value in a log file. "It doesn't really matter" what the input represents, just that it gets turned into a useful number. The output layer then turns the results from the hidden layers into the final prediction that is the result of the model.
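As an illustration of the difference between an architecture and its implementation, a PyTorch implementation of a small layered network along those lines might look like the following sketch; the layer sizes here are entirely invented:

```python
import torch
from torch import nn

# A small network with an input layer, two hidden layers, and an
# output layer; the sizes (16 inputs, 64 hidden units, 10 outputs)
# are made up for illustration.
model = nn.Sequential(
    nn.Linear(16, 64),   # input layer: turn 16 numbers into features
    nn.ReLU(),
    nn.Linear(64, 64),   # hidden layer: holds most of the weights
    nn.ReLU(),
    nn.Linear(64, 10),   # output layer: scores for 10 possible outputs
)

prediction = model(torch.rand(16))  # inference is one forward pass
```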
The hidden layers have probabilities, generally called "weights", associated with the inputs and outputs of the nodes in those layers. Unlike, say, a financial model, which is deterministic, an ML model is "a probabilistic mapping from a set of inputs to a set of outputs". The model uses a technique that is much like Bayes's theorem, which is used to do probabilistic calculations, he said; it is "essentially a multi-billion parameter Bayesian calculation". The weights in a disk file are just a huge matrix of floating-point values that correspond to the probabilities for each of the different parts of the model.
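That characterization is easy to check: continuing the invented sketch above, saving a tiny model to disk and reading the file back shows nothing but named matrices of floating-point values:

```python
import torch
from torch import nn

# Save a tiny model, then inspect what the file actually contains:
# just named matrices of floating-point numbers.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
torch.save(model.state_dict(), "weights.pt")

for name, tensor in torch.load("weights.pt").items():
    print(name, tuple(tensor.shape), tensor.dtype)
# prints, e.g.: 0.weight (64, 16) torch.float32
```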
The weights are "simply a pile of numbers; it is not creative, it is not expressive"; they are just the result of a mechanical process, he said. That is important to recognize because it directly impacts how the law is likely to treat these models. The mental model that people apply to AI will guide their beliefs about how an ML model should be treated; those who see it as the "magic black box" will impute things to it "that simply [aren't] true". That can lead people to believe that things like logic, emotion, and intent are somehow inside the model; they anthropomorphize it. The model is, instead, simply "a really complicated statistical equation—that's it".
Intellectual property
When looking at how intellectual property (IP) law applies to AI, there are several parts of the machine-learning process where it might be applied. It could be applied to the training, the model itself, the architecture and code, or the output; each of those needs to be analyzed independently to try to figure out the applicable law. "The inputs are not the outputs and neither one is actually that model in the middle", Lindberg said.
Much of the current activity around AI and IP is about the question of how much copyrighted material can be used in training a machine-learning model. Artists, in particular, but also programmers to a certain extent, are concerned that their works are getting incorporated into these models without recompense to the copyright holders. The argument that the creators make is that there would be no way to train the model if the works did not exist, so they deserve to be paid.
But copyright does not protect every use of a work, he said, and those protections are largely the same in Europe, Asia, and the US. There are a few specific verbs in the US copyright act: copy, create derivative works, and perform; those are the only acts that are protected by copyright. All other uses are either outside of copyright or they are a "fair use"; the latter means that they have been judged to fall outside of the copyright protections.
One of the classic fair uses is "doing analysis of a work"; analyses of this sort can be summaries, reviews, or criticisms of the work. So, reading a book and doing a review in, say, The New York Times, is a perfectly acceptable use of a copyrighted work. Similarly, textual analysis of copyrighted works to gather statistical or stylistic information is not something that copyright protects against. It is a fair use of the work.
The Google Books lawsuit is one that has a lot of relevance for the kinds of lawsuits being filed against AI efforts, he said. Google Books is an index of books that the search giant created by scanning physical books; it was the target of a copyright-infringement suit brought by the Authors Guild. It was ultimately determined that Google Books was a fair use, in part because the search index was not a replacement for the book itself; copyright is intended to protect creators from others using their work in a competitive manner in the marketplace.
Proponents of the current crop of lawsuits point out that generative AI is creating works that are competitive with the original work, at least potentially. But "copyright is about a work, a very specific work that can be infringed", he said; copyright does not protect "your perception in the marketplace, your ability to produce works in the marketplace in general".
The ongoing AI cases have not been decided yet, but what he has argued in his paper is that the models are effectively doing the same thing that Google Books does. The training of the models simply takes a bunch of measurements, and what the models produce is not competition for the work itself. He believes that the courts will find the models to be fair uses of the works, but "nobody knows".
There is one tricky piece, however: what happens when one of these models produces text or an image that is exactly like one of its inputs? "The answer is: that's infringing." It is clearly possible to create a copyright-infringing output from an AI model. He believes the model itself will be found to be fair use, "the outputs, maybe, maybe not". It completely depends on the specific output.
It is trivial to generate a copyright-infringing output from these models, Lindberg said; the easy way is to go to an image-generating model and give it something like "Iron Man" as a prompt. In the US and other places, a character that is sufficiently detailed can be copyrighted; that means the copyright covers more than a specific book, movie, or image. So using that prompt for an image-generation AI will create a "completely new picture of Iron Man that is absolutely copyright-infringing—100%".
Code, which does not have as many degrees of freedom as a natural language like English, will seemingly be reproduced more frequently. Because of that lack of freedom, the probabilities in the model will converge to make the output look like some of the input code more often than is seen with text. The model has not memorized the code, per se, but it has "memorized how to recreate it, which is a version of copying"; those outputs may then be copyright-infringing.
It turns out that larger amounts of training input lead to fewer infringing outputs. A study that tried to generate infringements was able to find 108 copied output images from a model that used 90,000 training images. But when they applied the same technique to a full-scale image model, the number of copied images found dropped to near zero.
Lawsuits
He put up a slide with five lawsuits that have been filed; the first four (v. GitHub, Stability AI, OpenAI, and Meta) were all filed as class actions by the same US law firm. The other, Getty Images v. Stability AI, has been filed in two places, Delaware and the UK; both are focused on Stable Diffusion, but are different from the other set.
The most prominent case, at least among OSSEU attendees, is probably Doe v. GitHub, which targets GitHub Copilot, an AI-based code-completion tool. The case is unusual because it is billed as a copyright suit, but it does not assert copyright infringement: "There is no 'you copied our stuff' in that entire lawsuit." Instead, there are accusations that GitHub removed copyright information, that the output is unfair competition, and that all outputs are necessarily derivative works of the inputs.
The lawsuit plays a bit loose with the idea of a "derivative work", he said; in a legal sense, that means that there is "specific expression from one work that has been copied into another". Instead, the lawsuit argues that everything is derived from the inputs, thus everything is a derivative work. GitHub has filed a motion to dismiss the case, in part because no copyright infringement is claimed; other arguments are made, but none based on direct copying. He believes that is because the plaintiffs cannot find something that is infringing.
The second case is Andersen v. Stability AI, which uses bad analogies in its reasoning, he believes. It calls Stable Diffusion, an AI-based image-creation tool, "essentially a 21st-century collage tool"; the argument is that the model is breaking everything up into pixels, then creating a collage with those pixels, thus the outputs are derivative works. That case is also facing a motion to dismiss; it sounds like pretty much all of it will be dismissed, he said, at least preliminarily. Two of the lead plaintiffs had not registered the copyrights on the works used, while a third had their works removed from the most recent version of the model because they did not meet the quality standards.
The last two in that first set are about the GPT-4 and LLaMA models, which are text-based; in both cases, one of the lead plaintiffs is author and comedian Sarah Silverman. The suits are making copyright-infringement claims, but doing so in an interesting way, he said. Instead of showing two texts, one copyrighted and one generated, then pointing to places where the latter was copied from the former, the suits take a different path.
The complaint shows that asking the AI tools for a summary of Silverman's work results in ... a summary of her work. That means, the suit argues, that "the work must be in there somewhere, we just don't know how to get it out". But, Lindberg noted, creating a summary is something that is protected as a fair use of the work. In his analysis, all four in that first set "are not good lawyering"; in fact, if you want to protect authors and artists, you should want those lawsuits to be dismissed quickly.
The Getty Images cases are more interesting, he thinks. They are also copyright-infringement cases, but these have found generated images that are "very reminiscent" of those that were part of the inputs. The Stable Diffusion model also learned that the Getty Images watermark was an important element, so it dutifully reproduces it, at least sort of: "It's creating these terrible-looking photos with a bad version of the Getty Images watermark." The strongest argument that Getty Images has, he thinks, is that Stable Diffusion is violating its trademark by using the watermark. "That argument may win, but notice that's not a copyright argument."
Copyrightable?
Another interesting question with regard to AI is whether its output can be copyrighted or not. While the UK copyright office says that those outputs can be copyrighted, the US copyright office is currently saying that they cannot be. That US ruling was made with regard to the Zarya of the Dawn AI-illustrated comic book; it was determined that the AI-created images in it were not subject to copyright.
Lindberg assisted the author, Kris Kashtanova, in creating a response for the copyright office, which had revoked the previously issued copyright once the AI nature of the work came to light. The copyright office said that the author did not have enough control over the AI-generated output to make it eligible for copyright; there needs to be substantial human control over the output in order for it to be eligible. Kashtanova decided not to appeal that judgment, but Lindberg is working on another, similar case.
That ruling also means that the output of, say, Copilot is not currently eligible for copyright protection in the US. He believes that the copyright office is in the midst of a "speed run" replaying the history of photography copyrights; originally photographs were not eligible, then they were eligible if there was sufficient work done in setting up the photograph (e.g. lighting, costumes). Eventually it was decided that all photographs are eligible for copyright protection and he believes that will happen with model output too; we are currently at the "sufficient work" stage, but the copyright office is seeking comment on the matter.
At that point, time was running out on the talk. It was clear that Lindberg had some other topics he wanted to present, but 40 minutes was simply not enough time to do so. The topics he was able to get to certainly provided some useful information, for both technologists and those in the legal field.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Bilbao for OSSEU.]
