
AI from a legal perspective

By Jake Edge
September 26, 2023

OSSEU

The AI boom is clearly upon us, but there are still plenty of questions swirling around this technology. Some of those questions are legal ones and there have been lawsuits filed to try to get clarification—and perhaps monetary damages. Van Lindberg is a lawyer who is well-known in the open-source world; he came to Open Source Summit Europe 2023 in Bilbao, Spain to try to put the current work in AI into its legal context.

Lindberg began by introducing himself; he has been involved in computer law for around 25 years at this point. Throughout that time he has also worked in open source (notably as General Counsel for the Python Software Foundation). He has also been working on AI issues since 2008, so he is well-positioned to assist in those matters now that "the entire world started going crazy about AI".

[Van Lindberg]

He asked the audience whether they primarily identify as technologists or members of the legal field; the response was "not quite half and half", he said, but noted that he would disappoint both groups, "just in different parts of the presentation". His technical description of the techniques being used in AI today would perhaps be somewhat boring for the technologists, while the cases he would talk about are likely already known to those with legal training. But, he said, the good news is that the intersection of AI and law is such a rapidly developing field that there should be something new in his talk for everyone.

In the talk, he would only be focusing on "generative machine learning"; he is aware of the older AI research, but "generative ML is the thing that has really started to drive this AI revolution". It is also much more interesting and challenging from a legal perspective. He would be looking at the advances in the field over just the last five years, though most of what he would present is from the last three or four years, he said.

His presentation is a shortened version of his paper that was published earlier in 2023. It looks at how models are trained and the inferences they perform, along with the copyright implications from a US perspective. The talk would be US-law-focused, as well, because that is "where a lot of the current action is". He pointed to a UK-law-focused article for those who are interested; so far, he has been unable to find a similar work that looks at the issue from an EU-law perspective.

Models

When it is applied to AI, the term "model" is misunderstood by a lot of people, Lindberg said. Many think of it as "the magic black box that does what I want it to do", but it is important to understand what models really are, how they work, and how they are trained, in order to "apply the correct legal analysis". He used an analogy to try to give the audience a "good mental picture" of what is going on with models.

Imagine someone is given the job of "art inspector" and is tasked with inspecting all of the art in the Louvre museum. When he is hired, he knows "absolutely nothing about art"; he does not know what makes it good or bad, what the different types of art are, and so on. He sets out to fix that lack of knowledge by measuring everything he can think of with regard to each work of art: size, weight, materials used, colors employed, creation date and location, etc. He also measures random things like the number of syllables in the artist's name and what corner they choose to sign their work in; he records all of this information in his notebook (i.e. database).

That work starts to get pretty boring, so he invents a game: before he measures something, he is going to use what he already knows to make a guess about the measurement. At first, his guesses are terrible, but after looking at thousands, or millions, of paintings, his guesses start to get better, then much better. He can make pretty accurate guesses about a work with just a little information about it as a starting point; he has effectively recognized hidden patterns that allow him to make these accurate guesses.

That analogy shows the process for model training. It consists of four steps, measuring, predicting, checking, and updating, that get repeated billions of times. When people talk about model creation, they often say that the process is "reading" the data or "it is sucking in all this content"; that is sort of true, but is not exactly what is going on. The training process is extracting certain statistical measurements of the data; it is calculating probabilities associated with those measurements and the data set.

Those probabilities are then used to predict things about some other training data, which has known-correct answers; those predictions are checked against the answers. For example, a model for something like ChatGPT would check its prediction against the actual next word in the existing text. For "it was a dark and stormy X", "night" would be a high-probability completion, while "elephant" would be extremely low. Based on the check, the training process updates all of its probabilities to make it a little bit more likely to produce the correct completion the next time. It follows this process many millions or billions of times.
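As a rough illustration (not from the talk), the four-step loop might look something like the toy next-word predictor below, where a simple table of counts stands in for the billions of weights a real model would adjust by gradient descent; the corpus and names here are invented for the example:

```python
# Toy illustration of the measure/predict/check/update training loop.
# A real model adjusts billions of floating-point weights; here the "weights"
# are just next-word counts, but the idea is the same: statistics extracted
# from the training text drive ever-better guesses.
from collections import defaultdict
import random

corpus = ("it was a dark and stormy night . "
          "it was a dark and quiet night .").split()

counts = defaultdict(lambda: defaultdict(int))   # the "model": a pile of numbers
hits = 0

for step in range(10_000):
    i = random.randrange(len(corpus) - 1)
    context, actual = corpus[i], corpus[i + 1]       # measure: look at the data
    guesses = counts[context]
    predicted = max(guesses, key=guesses.get) if guesses else None   # predict
    if predicted == actual:                          # check against the known answer
        hits += 1
    counts[context][actual] += 1                     # update the statistics

print(f"guess accuracy: {hits / 10_000:.0%}")
print(dict(counts["stormy"]))   # {'night': ...}; "elephant" never appears
```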

That describes the training process for "almost any type of ML", he said; the differences are in what kinds of data the model is trained on. The result is the model, which has a particular architecture that consists of the mechanisms used to break down the inputs, the methods used to analyze those inputs, and then a way to represent the output. The architecture is not an implementation, but is simply a logical construct that lives in the heads of the model developers; the code that gets written using, say, PyTorch is an implementation of the model architecture.

[Machine learning architecture]

The architecture has separate layers for input, output, and some hidden layers that are critical to the inferences (guesses) the model is meant to make. They are arranged as in an artificial neural network like the one above from Lindberg's slides. The input layer turns the input into a number in some fashion; the input could be a pixel value, a word, or a value in a log file. "It doesn't really matter" what the input represents, just that it gets turned into a useful number. The output layer then turns the results from the hidden layers into the final prediction that is the result of the model.
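To illustrate the architecture/implementation distinction, a network like the one in the slide might be implemented in PyTorch along the following lines; the layer sizes and names are invented for the sketch:

```python
# A minimal sketch of an input/hidden/output network in PyTorch; the sizes are
# arbitrary. A production language or image model is the same idea scaled up
# to billions of weights and many more hidden layers.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, n_inputs=16, n_hidden=32, n_outputs=4):
        super().__init__()
        self.input_to_hidden = nn.Linear(n_inputs, n_hidden)    # input layer
        self.hidden_to_output = nn.Linear(n_hidden, n_outputs)  # output layer

    def forward(self, x):
        h = torch.relu(self.input_to_hidden(x))                 # hidden activations
        return torch.softmax(self.hidden_to_output(h), dim=-1)  # probabilities out

model = TinyNet()
prediction = model(torch.rand(1, 16))   # numbers in, a probability distribution out
print(prediction)
```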

The hidden layers have probabilities, generally called "weights", associated with the inputs and outputs from the nodes in the hidden layer. Unlike a financial model, say, where the model is deterministic, an ML model is "a probabilistic mapping from a set of inputs to a set of outputs". The model uses a technique that is much like Bayes's Theorem, which is used to do probabilistic calculations, he said; it is "essentially a multi-billion parameter Bayesian calculation". The weights in a disk file are just a huge matrix of floating-point values that correspond to the probabilities for each of the different parts of the model.
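To see that the stored weights really are just floating-point numbers, one can serialize a small model and inspect the result; this is again only a sketch with invented sizes:

```python
# The weights saved to disk are nothing but named matrices of floats.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
torch.save(model.state_dict(), "weights.pt")      # serialize the weight matrices

for name, tensor in torch.load("weights.pt").items():
    print(name, tuple(tensor.shape), tensor.dtype)
# e.g. "0.weight" (32, 16) torch.float32; no text, images, or code inside
```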

The weights are "simply a pile of numbers; it is not creative, it is not expressive". They are just the result of a mechanical process, he said. That is important to recognize because it directly impacts how the law is likely to treat these models. The mental model that people apply to AI will guide their beliefs about how an ML model should be treated; those who see it as the "magic black box" will impute things to it "that simply isn't true". That can lead people to believe that things like logic, emotion, and intent are somehow inside the model; they anthropomorphize the model. The model is, instead, simply "a really complicated statistical equation—that's it".

Intellectual property

When looking at how intellectual property (IP) law applies to AI, there are several parts of the machine-learning process where it might be applied. It could be applied to the training, the model itself, the architecture and code, or the output; each of those needs to be analyzed independently to try to figure out the applicable law. "The inputs are not the outputs and neither one is actually that model in the middle", Lindberg said.

Much of the current activity around AI and IP is about the question of how much copyrighted material can be used in training a machine-learning model. Artists, in particular, but also programmers to a certain extent, are concerned that their works are getting incorporated into these models without recompense to the copyright holders. The argument that the creators make is that there would be no way to train the model if the works did not exist, so they deserve to be paid.

But copyright does not protect every use of a work, he said, and those protections are largely the same in Europe, Asia, and the US. There are a few specific verbs in the US copyright act: copy, create derivative works, and perform; those are the only acts that are protected by copyright. All other uses are either outside of copyright or they are a "fair use"; the latter means that they have been judged to fall outside of the copyright protections.

One of the classic fair uses is "doing analysis of a work"; analyses of this sort can be summaries, reviews, or criticisms of the work. So, reading a book and doing a review in, say, The New York Times, is a perfectly acceptable use of a copyrighted work. Similarly, textual analysis of copyrighted works to gather statistical or stylistic information is not something that copyright protects against. It is a fair use of the work.

The Google Books lawsuit is one that has a lot of relevance for the kinds of lawsuits that are being filed against AI efforts, he said. Google Books is an index of books that the search giant created by scanning physical books, which was the target of a copyright-infringement suit brought by the Authors Guild. It was ultimately determined that Google Books was a fair use, in part because the search index was not a replacement for the book itself; copyright is intended to protect creators from others using their work in a competitive manner in the marketplace.

Proponents of the current crop of lawsuits point out that generative AI is creating works that are competitive with the original work, at least potentially. But "copyright is about a work, a very specific work that can be infringed", he said; copyright does not protect "your perception in the marketplace, your ability to produce works in the marketplace in general".

The ongoing AI cases have not been decided yet, but what he has argued in his paper is that the models are effectively doing the same thing that Google Books did. The training of the models simply takes a bunch of measurements and what the models produce is not competition for the work itself. He believes that the courts will find the models to be fair uses of the works, but "nobody knows".

There is one tricky piece, however: what happens when one of these models produces text or an image that is exactly like one of its inputs? "The answer is: that's infringing." It is clearly possible to create a copyright-infringing output from an AI model. He believes the model itself will be found to be fair use, "the outputs, maybe, maybe not". It completely depends on the specific output.

It is trivial to generate a copyright-infringing output from these models, Lindberg said; the easy way is to go to an image-generating model and give it something like "Iron Man" as a prompt. In the US and other places, a character that is sufficiently detailed can be copyrighted; that means the copyright covers more than a specific book, movie, or image. So using that prompt for an image-generation AI will create a "completely new picture of Iron Man that is absolutely copyright-infringing—100%".

Code, which does not have as many degrees of freedom as a natural language like English, will be more frequently reproduced, at least seemingly. Because of that lack of freedom, the probabilities in the model will converge to make the output look like some of the input code more frequently than is seen with text. The model has not memorized the code, per se, but it "memorized how to recreate it, which is a version of copying". Those outputs may then be copyright-infringing.

It turns out that larger amounts of training input lead to fewer infringing outputs. A study that tried to generate infringements was able to find 108 copied output images from a model that used 90,000 training images. But when they applied the same technique to a full-scale image model, the number of copied images found dropped to near zero.

Lawsuits

He put up a slide with five lawsuits that have been filed; the first four (v. GitHub, Stability AI, OpenAI, and Meta) were all filed as class actions by the same US law firm. The other, Getty Images v. Stability AI, has been filed in two places, Delaware and the UK; both are focused on Stable Diffusion, but are different from the other set.

The most prominent case, at least among OSSEU attendees, is probably Doe v. GitHub; it targets the GitHub Copilot tool, which is an AI-based code-autocompletion tool. The case is unusual because it is billed as a copyright suit, but it does not assert copyright infringement. "There is no 'you copied our stuff' in that entire lawsuit." Instead, there are accusations that GitHub removed copyright information, that the output is unfair competition, and that all outputs are necessarily derivative works of the inputs.

The lawsuit plays a bit loose with the idea of a "derivative work", he said; in a legal sense that means that there is "specific expression from one work that has been copied into another". Instead, the lawsuit argues that everything is derived from the inputs, thus everything is a derivative work. GitHub has filed a motion to dismiss the case, in part because there is no copyright infringement claimed; other arguments are made, but not any direct copying. He believes that is because the plaintiffs cannot find something that is infringing.

The second case is Andersen v. Stability AI, which uses bad analogies in its reasoning, he believes. It calls Stable Diffusion, which is an AI-based image creation tool, "essentially a 21st-century collage tool"; the argument is that the model is breaking everything up into pixels, then creating a collage with those pixels, thus the outputs are derivative works. That case is also the subject of a motion to dismiss, and it sounds like pretty much all of it will be dismissed, he said, at least preliminarily. Two of the lead plaintiffs had not registered their copyrights in the works used, while a third had their works removed from the most recent version of the model because they did not meet the quality standards.

The last two in that first set are about the GPT-4 and LLaMA models, which are text-based. In both cases, one of the lead plaintiffs is author and comedian Sarah Silverman. The suits are making copyright-infringement claims, but doing so in an interesting way, he said. Instead of showing two texts, one copyrighted and one generated, then showing places where the latter was copied from the former, the suits are taking a different path.

The complaint shows that asking the AI tools for a summary of Silverman's work results in ... a summary of her work. That means, the suit argues, that "the work must be in there somewhere, we just don't know how to get it out". But, Lindberg noted, creating a summary is something that is protected as a fair use of the work. In his analysis, all four in that first set "are not good lawyering"; in fact, if you want to protect authors and artists, you should want those lawsuits to be dismissed quickly.

The Getty Images cases are more interesting, he thinks. They are also copyright-infringement cases, but these have found generated images that are "very reminiscent" of those that were part of the inputs. The Stable Diffusion model also learned that the Getty Images watermark was an important element, so it dutifully reproduces them, at least sort of. "It's creating these terrible-looking photos with a bad version of the Getty Images watermark." The strongest argument that Getty Images has, he thinks, is that Stable Diffusion is violating its trademark in using the watermark. "That argument may win, but notice that's not a copyright argument."

Copyrightable?

Another interesting question with regard to AI is whether its output can be copyrighted or not. While the UK copyright office says that those outputs can be copyrighted, the US copyright office is currently saying that they cannot be. That US ruling was made with regard to the Zarya of the Dawn AI-illustrated comic book; it was determined that the AI-created images in it were not subject to copyright.

Lindberg assisted the author, Kris Kashtanova, in creating a response for the copyright office, which had revoked the previously issued copyright once the AI nature of the work came to light. The copyright office said that the author did not have enough control over the AI-generated output to make it eligible for copyright; there needs to be substantial human control over the output in order for it to be eligible. Kashtanova decided not to appeal that judgment, but Lindberg is working on another, similar case.

That ruling also means that the output of, say, Copilot is not currently eligible for copyright protection in the US. He believes that the copyright office is in the midst of a "speed run" replaying the history of photography copyrights; originally photographs were not eligible, then they were eligible if there was sufficient work done in setting up the photograph (e.g. lighting, costumes). Eventually it was decided that all photographs are eligible for copyright protection and he believes that will happen with model output too; we are currently at the "sufficient work" stage, but the copyright office is seeking comment on the matter.

At that point, time was running out on the talk. It was clear that Lindberg had some other topics he wanted to present, but 40 minutes was simply not enough time to do so. The topics he was able to get to certainly provided some useful information, for both technologists and those in the legal field.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Bilbao for OSSEU.]


Index entries for this article
Conference: Open Source Summit Europe/2023



AI from a legal perspective

Posted Sep 26, 2023 21:01 UTC (Tue) by kleptog (subscriber, #1183) [Link] (4 responses)

What an interesting topic, and it sounds like a great talk. Also helpful to see a breakdown of all the pending cases; I happen to agree with his reasoning.

I think it's telling that the cases filed so far are all in common law jurisdictions. Trying to file such a case in most of Europe would simply lead to the court saying "the law doesn't say, please talk to the legislature". Even the discussions around the EU's AI Act aren't really answering the question. So far the only response is to suggest AI model output must be marked as such. I expect that unless someone can actually point out a concrete form of harm to a particular actor, nothing will really happen on the legislature side, though that might coax a court to give some sort of ruling.

I don't see any indication that the US legislature is interested in this topic? I guess the US is going for the "make law via court cases first, codify later" approach?

I expect that in a few years generative AI will be treated as just another tool like a spell-checker or magic image selector.

AI from a legal perspective

Posted Sep 27, 2023 11:27 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (1 responses)

Let’s assume that the models themselves are not infringing, but the output of the models can. And the users of those models are given no indication that would help them identify if the output is infringing or not.

What can a judge do? Give model sellers a free pass? And then have the force of the law fall on users, who assume that since sellers are given a free pass, that means their own use of model outputs is legal too?

If the judge is honest, he will state that the model sellers are contributing to infringement, unless they manage to give their users sufficient information to stay within the law. And if they cannot, too bad; they're the only ones with sufficient control over the model inputs to give this information.

Which won’t please model sellers at all since their own business model relies on the free pass illusion.

AI from a legal perspective

Posted Sep 28, 2023 10:53 UTC (Thu) by kleptog (subscriber, #1183) [Link]

> Let’s assume that the models themselves are not infringing, but the output of the models can. And the users of those models are given no indication that would help them identify if the output is infringing or not.

This isn't any different from the internet now. If you download a file from somewhere, it is up to you to "know" whether it's copyrighted or not. Much of the law around this is basically trying to nail people on the basis of "they should have known". For example, if you're downloading a file with the name of a well-known movie via a torrent, you don't get to claim "I didn't know". The "my grandson installed this app for me so I could watch movies but I didn't know" defense has surely been tried.

> What can a judge do? Give model sellers a free pass? And then have the force of the law fall on users, who assume that since sellers are given a free pass, that means their own use of model outputs is legal too

Well, I have not yet seen any evidence that ML models will spontaneously reproduce copyrighted works without explicit prompting, so I don't see this being a problem in practice. And even with prompting I expect them to do as well as any human: the titles of the chapters, names of characters and the general plot, but not the text verbatim. Otherwise we'll have invented the ultimate text compression algorithm, and that's worth more than any ML model. And I'm sure that the ML model can helpfully list any closely related works so you can check for yourself.

AI from a legal perspective

Posted Sep 27, 2023 17:41 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> I don't see any indication that the US legislature is interested in this topic?

The US Congress is not interested in any topic unless it enables them to shout at one another and score political points. As for the individual states, they are preempted from legislating here, because copyright is a federal concern.

AI from a legal perspective

Posted Sep 27, 2023 19:16 UTC (Wed) by rgmoore (✭ supporter ✭, #75) [Link]

The US Congress is not interested in any topic unless it enables them to shout at one another and score political points.

I realize this is getting a bit far afield, but this just isn't true. Congress manages to get a surprising amount done on uncontroversial topics; the shouting just gets all the media attention because it's better entertainment. Copyright is an area that doesn't attract that kind of heat, so Congress might actually be able to pass some legislation there. It might be better to discuss it internationally first, so the US rules wind up being similar to the rules elsewhere in the world. We've done that for other areas of copyright, and AI should be no exception.

AI from a legal perspective

Posted Sep 27, 2023 12:21 UTC (Wed) by karim (subscriber, #114) [Link] (1 responses)

If the output may or may not be copyright infringing then that's an actual win for copyright owners. Because this then would require users of the output to actually make a judgment call on whether they can use the resulting work or not as-is. IOW, generative AI tools may be useful to help you think, but using output as-is engenders an inherent unbounded liability.

AI from a legal perspective

Posted Sep 27, 2023 16:29 UTC (Wed) by Wol (subscriber, #4433) [Link]

And given all the stuff about "poisoning", it could be a win for AI, too. The quicker we can ban the spewing of UNREVIEWED AI, the better, or all the signs are the web will drown in a sea of garbage.

Cheers,
Wol

AI from a legal perspective

Posted Sep 27, 2023 12:23 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (20 responses)

The "it must be there" argument is more interesting than it seems on the surface, in my opinion. I tried asking ChatGPT 3.5 to reproduce a relatively short poem, and it decided it's not willing to commit copyright infringement.

However, asking for a comment one line at a time manages to extract five lines before it starts to hallucinate without possibility of redemption (interestingly, this happens after the author uses a somewhat tricky metrical device and the model is also confused about the meaning of the sentence; https://chat.openai.com/share/655795fe-3f35-434b-a980-2ac...).

The specific poem I used is not copyrighted, but it's probably possible to find something that is both new enough to be copyrighted, and short enough that the model knows it in its entirety. Would that have some chance of success in a lawsuit?

AI from a legal perspective

Posted Sep 27, 2023 12:38 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (9 responses)

So here it gets half the poem right and the author died in 1981: https://chat.openai.com/share/d21a8f19-86aa-40e5-a8ea-371...

AI from a legal perspective

Posted Sep 28, 2023 7:53 UTC (Thu) by gfernandes (subscriber, #119910) [Link] (8 responses)

The conversation shows commentary on specific lines, repeating those lines as reference for the commentary.

This is fair use.

A human might memorise, or refer to, a literary work like that poem and produce a commentary on it. This would not be "copying", or "distributing" the original work as is.

So I don't really see how the argument is interesting.

AI from a legal perspective

Posted Sep 28, 2023 11:28 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (7 responses)

The commentary (which is crap anyway) is only there to trick the AI into spitting them out because otherwise it complains that it is protected by copyright.

So here is the next step. I convinced the AI that (correctly) there was no copyright protection on a text. It then will have no problems reproducing that text and it becomes much better at not hallucinating.

And the copyright unlock will then extend to other texts as well! In the above linked chat, it will also reproduce the text of the poet who died in 1981. It includes a commentary that is so wrong that you can't even argue it is fair use (the poem talks about a dead woman, ChatGPT thinks the poet is walking along the sea with her), but you _can_ probably argue that the weights contain a copyrighted text and are a derivative of it.

Of course it's not always going to work. It hallucinated completely the first copyrighted poem I tried; it only got one and a half sentences right from Fredric Brown's short story "Sentry". Maybe poems are easier because they use "weird" language with fewer possible continuations. But still, the argument is not completely without value.

AI from a legal perspective

Posted Sep 28, 2023 21:11 UTC (Thu) by kleptog (subscriber, #1183) [Link] (6 responses)

What exactly is the argument here? That there exists an input to ChatGPT such that the output contains a copyrighted poem? There exists an input of 560 characters that, when sent to cat, reproduces the poem exactly. Only 363 bytes are needed if I use gunzip as the program. Does that mean cat and gunzip are copyright-infringing programs?

No, this merely means that these programs are tools and their status w.r.t. copyright is neutral. It's up to the user to use it responsibly. Unless you want to argue ChatGPT is special in some way?

Just like you could probably find many people who with a short prompt could reproduce that poem (and many others) perfectly. Does that mean they're infringing copyright just by reciting the poem when asked?

AI from a legal perspective

Posted Sep 29, 2023 5:25 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (4 responses)

The argument is that the weights are derivatives of the poem. It's not about the prompt and not about reproducing the poem, it's about the poem being stored in the weights so that distributing the weights infringes copyright.

The prompt is only needed to show that indeed the poem is stored in the weights. This is true even if you get ChatGPT to recite it one line at a time, but in the second case it produces it all at once.

AI from a legal perspective

Posted Oct 2, 2023 2:39 UTC (Mon) by Ninjasaid13 (guest, #167282) [Link] (3 responses)

By that same logic, a dictionary must be infringing on every book then. Since it contains the words in the dictionary that was used in the copyrighted works.

It only contains the statistical likelihood of which words are likely next in response to the prompt. That's why it hallucinates and can't reproduce an entire book.

AI from a legal perspective

Posted Oct 2, 2023 11:26 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (2 responses)

A dictionary doesn't provide the means to reproduce (even if only statistically) a sequence of ~100 words.

It would be interesting to know the probabilities. If the probability of the correct choice is orders of magnitude above the next, it's pretty hard to say it's just statistical completion as opposed to recollection of a complete text. The LLM wouldn't be able to recall the text if the poet hadn't written it in the first place, and therefore the weights are a derivative of the poem.

AI from a legal perspective

Posted Oct 2, 2023 12:31 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

A human's memory also provides the means to reproduce a sequence of ~100 words, complete with LLM-style "hallucinations" instead of perfectly accurate recall, and yet there's nothing about the human that makes their brain's weights a derivative of the texts it's read.

What can be a derivative, however, is what the human does with that memory - if I deliberately regurgitate the text of Seamus Heaney's "In the Attic" as part of an advertising campaign, I'm infringing that copyright. If I use it to inform my own, lesser, poetry, and to improve my style, I'm not. LLMs could easily be held to be in the same position - the LLM itself does not infringe (since it has no literal copy, just like my brain), but its output can infringe, and if you use the output of an LLM, you're in the same position as you would be if you asked me to give you a poem. In particular, you can get in serious trouble since I could change the words of Seamus Heaney's poetry slightly to make it hard to catch the plagiarism, but you'd still be infringing.

AI from a legal perspective

Posted Oct 2, 2023 13:02 UTC (Mon) by pbonzini (subscriber, #60935) [Link]

But you can't reproduce and distribute a brain. You can reproduce and distribute weights, and those weights *absolutely* wouldn't be able to produce the same output if it weren't for the inclusion of certain works in the training.

That's why IMO there is a case to be made for the weights being a derivative of at least a subset of the works included in the training, some of which are copyrighted.

AI from a legal perspective

Posted Sep 29, 2023 10:35 UTC (Fri) by farnz (subscriber, #17727) [Link]

Just like you could probably find many people who with a short prompt could reproduce that poem (and many others) perfectly. Does that mean they're infringing copyright just by reciting the poem when asked?

Quite possibly, yes. They'd be performing the poem without permission, which can be considered (depending on circumstances of the recital) a creation of a derivative work, and thus need a licence from the copyright holder. I would expect the same to apply to ChatGPT and similar; its output may or may not be "free and clear" of copyright infringement, just as would be true of a human.

AI from a legal perspective

Posted Oct 3, 2023 0:11 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (9 responses)

I think with music, the fragment of a song that needs to be duplicated is small enough to make this a possibility. Also, as he mentioned, code gets duplicated more easily.

Also, he's paraphrasing, and I would dare say he's doing it badly. I believe some of the legal arguments don't say "it must be in there"; they say "it's a compression algorithm". Autoencoders do, in fact, create compression functions. Now, they are _lossy_ compression algorithms, but they are indeed compression algorithms. Which isn't as hand-wavy as "it must be in there, somewhere". It is, very much, in there. The question is how much loss occurred, and does that matter to the courts.

Overall, I'm not sure he understood the technical basis of these suits very well, particularly based on the odd way he described the tech aspects. And obviously he failed to understand that a lossy representation of the input is encoded in the model, which is a rather large oversight.

AI from a legal perspective

Posted Oct 3, 2023 12:59 UTC (Tue) by kleptog (subscriber, #1183) [Link] (8 responses)

> Overall, I'm not sure he understood the technical basis of these suits very well, particularly based on the odd way he described the tech aspects. And obviously he failed to understand that a lossy representation of the input is encoded in the model, which is a rather large oversight.

Maybe because from a legal perspective the tech aspects are not that important. Whether something is a derivative work is not related to tech aspects either, but whether we as a society think it's something the original author should have a say about.

And in the end this is not going to be decided on tech aspects, but on whether we as a society agree that allowing authors to prevent this is good for society as a whole. The EU Copyright Directive has already explicitly answered this to a great extent. For research & training purposes anything publicly accessible is fair game. For all other purposes authors can opt-out.

The only real question is if the people doing the model training had legitimate access to the copyrighted works. If they download loads of e-books from the Pirate Bay, then that's a definite no-go. On the other hand, there is no way to prove if something was part of the training set.

AI from a legal perspective

Posted Oct 3, 2023 13:33 UTC (Tue) by james (subscriber, #1325) [Link] (7 responses)

Shouldn't that be "there is no way to prove something was not part of the training set"? We've seen cases where people can prove something was part of the training set because the model reproduced it (or a significant chunk thereof).

AI from a legal perspective

Posted Oct 3, 2023 16:09 UTC (Tue) by sfeam (subscriber, #2841) [Link] (6 responses)

The observation that a chatbot can be persuaded to emit text matching all or part of a previous work does not by itself prove that work was part of the training set.

AI from a legal perspective

Posted Oct 3, 2023 19:40 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (5 responses)

This is why I used extremely famous (*), short but nontrivial poems in my experiments with Italian literature. A poem's choice of words is polished and unconventional enough that there's no other explanation than "the text was present (many times, enough to reinforce the learning) in the training corpus".

(*) the non copyrighted one is probably *the* most known poem in Italian literature. The copyrighted one must be in the top 5 of 20th century poems, and probably the most famous from that author.

AI from a legal perspective

Posted Oct 3, 2023 22:26 UTC (Tue) by kleptog (subscriber, #1183) [Link] (4 responses)

I guess for a definitive answer you'd need to nail down what is meant by "being in a training set". Suppose you had documents which discussed the poem in depth and in doing so cited each line before discussing it. Does that count as the poem "being in the training set"?

When I meant that you could prove something was in the training set, I was considering being able to point to any particular document and prove it for that. That is definitely not possible. At a higher level, you could probably make a case for poems like this. But even then, that doesn't say anything about whether that's allowed or not. The EU Copyright Directive allows authors to opt out, but it's difficult to see how that could work in practice. You can put a robots.txt file on your site, but that doesn't stop some web crawler mirroring the site and then that getting included.

In your examples, those poems almost certainly were in the training set in some form, most likely via Wikipedia or some lyrics site. That doesn't make the model a derivative work though. In particular, a derivative work needs to be a copyrightable thing itself, and I don't think anyone is arguing that there's sufficient creativity in LLMs to warrant copyright protection.

AI from a legal perspective

Posted Oct 4, 2023 4:55 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (3 responses)

> In particular, a derivative work needs to be a copyrightable thing itself, and I don't think anyone is arguing that there's sufficient creativity in LLMs to warrant copyright protection.

If so, what IP law is being relied on to require a license to use the LLMs? Most of them use "fauxpen source" terms that forbid commercial use or allow it only until a certain number of users. If LLMs have no copyright protection, that would not be valid.

Now it's a different story if you do the training yourself, because in that case there's no distribution of the weights. There are many similarities with copyleft and the SaaS loopholes, and it's a bit disappointing that the copyright arguments were dismissed in the talk. They're extremely complex in reality (see also the "compression algorithms" argument mentioned elsewhere) and they alone can keep lawyers employed for a long time...

AI from a legal perspective

Posted Oct 5, 2023 8:10 UTC (Thu) by kleptog (subscriber, #1183) [Link] (2 responses)

> If so, what IP law is being relied on to require a license to use the LLMs? Most of them use "fauxpen source" terms that forbid commercial use or allow it only until a certain number of users. If LLMs have no copyright protection, that would not be valid.

Good question. It's not obvious to me that LLMs are copyrightable in the normal sense, at least not everywhere. In the late 90's there were lawsuits about whether the phone book could be copyrightable given it was just a collection of facts. The status of OSM was similarly murky, it was only the creation of database rights that clarified the situation here. For LLMs, people are taking a lot of data that isn't theirs and running it through an algorithm, it's not clear this meets any standard for copyrightability, not even under database rights. The whole "sweat of the brow" idea is something in America, but not elsewhere AFAICT. And even then it's the computers doing the sweating.

Just placing a copyright notice on something doesn't make it copyrightable. It's not even obvious to me we would want them to be copyrightable. It doesn't seem necessary for the "progress of the arts". And in any case, if we wanted something like that, I think something more akin to patents would be better: full-disclosure about how it was made in return for protection of the result for a limited period.

For businesses like OpenAI I think they handle it under standard contract law and treat it as a trade secret. Basically, they provide the raw models under various conditions and restrictions on usage and who they can be passed on to with various penalty clauses. Copyright protection is only needed if you want to publish something to the public while retaining some control. Since most businesses wanting the models are using them to provide services based on them, this is probably sufficient.

AI from a legal perspective

Posted Oct 5, 2023 9:04 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (1 responses)

> For LLMs, people are taking a lot of data that isn't theirs and running it through an algorithm, it's not clear this meets any standard for copyrightability, not even under database rights.

But if training a (lossy) compression algorithm, distributing a compressed representation of a copyrighted work is still considered distribution for the sake of copyright... Likewise for object files. Whether it's xz or gcc doing the translation, the output is still copyrighted and a derivative of the uncompressed work or the source code, respectively. What a mess.

Thanks for the discussion. I am even more aware now, of how many things I don't know in this area!

AI from a legal perspective

Posted Oct 5, 2023 13:47 UTC (Thu) by kleptog (subscriber, #1183) [Link]

> But if training a (lossy) compression algorithm, distributing a compressed representation of a copyrighted work is still considered distribution for the sake copyright... Likewise for object files.

AIUI the main test is if the "compression" still produces something that affects the market of the original work. Even a really highly compressed/minified video would be recognisable as the original work. Just changing the format obviously produces something that substitutes for the original work. Object files can substitute for the original work because you can build a working binary from them just like you could from the original source.

A parody of a work doesn't provide an alternative to the original work.

No-one is using LLMs as an alternative to buying a specific book. And given the model size is like 0.1% of the sample data size, you'd almost figure that any resemblance to a real world text is purely coincidental.

AI from a legal perspective

Posted Sep 27, 2023 18:17 UTC (Wed) by ms (subscriber, #41272) [Link] (19 responses)

> The weights are "simply a pile of numbers; it is not creative, it is not expressive". They are just the result of a mechanical process, he said.

I am more than a little sceptical that any animal brain is meaningfully different. More specialisation, and more checks and balances, sure. But I'm quite convinced it's no less a machine.

> That is important to recognize because it directly impacts the nature of how the law is likely to treat these models.

Not surprising that human law, written by humans, elevates the human brain to unfathomable sophistication that no machine can match.

AI from a legal perspective

Posted Sep 27, 2023 22:17 UTC (Wed) by Wol (subscriber, #4433) [Link]

> > The weights are "simply a pile of numbers; it is not creative, it is not expressive". They are just the result of a mechanical process, he said.

> I am more than a little sceptical that any animal brain is meaningfully different. More specialisation, and more checks and balances, sure. But I'm quite convinced it's no less a machine.

Except you are missing something VERY important. You're right that we humans tend to elevate our brains above anything else, but we also tend to demote other brains well below their ability. Which seriously skews our ability to be objective. And an AI is objectively very different from any high-functioning brain. An AI has no concept of truth, an AI has no concept of danger, etc etc.

I didn't know this til recently, but human brains have a dedicated face-recognition unit. That is dedicated not only to recognising that something is a face, but also whose face it is. Mine (like many others) is mildly defective - I have difficulty recognising people by their face. By their voice, on the other hand ...

Any MEDIC who studies the brain will tell you that MOST of it consists of Special Purpose Processing Units. The General Purpose part of our brain simply deals with signals like "that is a face. I recognise it". This is quite obvious when you look at how the ear or the eye function. You can quite clearly see the brain activity as eg the first line visual system decides what it's looking at and then diverts anything it recognises to those special purpose units.

Which is why I gather Tesla now have special purpose people recognition units in their guidance systems. The trouble is, they almost certainly don't know what other special purpose units they need, and they probably don't want to implement special purpose because they believe the propaganda that a general purpose unit can do it just as well. Given (as I say) the brain is mostly special purpose units, I'd be amazed if that's true.

Cheers,
Wol

AI from a legal perspective

Posted Sep 28, 2023 4:45 UTC (Thu) by alonz (subscriber, #815) [Link] (17 responses)

The law doesn't need to say anything about brains. Furthermore, the idea that a model's "learning" (i.e. the weights) is itself not derived from the copyrighted material it has observed is actually comparable to the law's approach to human learning: you are allowed to learn anything you wish, you're just forbidden from using this to create an infringing work.

AI from a legal perspective

Posted Sep 28, 2023 7:16 UTC (Thu) by ms (subscriber, #41272) [Link] (16 responses)

> ...is actually comparable to the law's approach to human learning...

Very much agree.

> The law doesn't need to say anything about brains.

My point, which perhaps I didn't make very well, was that if the law considers the concepts of "creativity", "expressivity", "emotion" etc as being incapable of being achieved by machine, then it *is* elevating human brains to something beyond the capability of a machine.

AI from a legal perspective

Posted Sep 28, 2023 7:56 UTC (Thu) by gfernandes (subscriber, #119910) [Link] (1 responses)

Was the monkey (or was it an ape) awarded copyright for taking a selfie?

If not, then why is AI any different?

AI from a legal perspective

Posted Sep 28, 2023 20:38 UTC (Thu) by opsec (subscriber, #119360) [Link]

No, the monkey did not receive copyright.

AI from a legal perspective

Posted Sep 28, 2023 8:03 UTC (Thu) by Wol (subscriber, #4433) [Link]

> My point, which perhaps I didn't make very well, was that if the law considers the concepts of "creativity", "expressivity", "emotion" etc as being incapable of being achieved by machine, then it *is* elevating human brains to something beyond the capability of a machine.

Well, based on current evidence, I would say it is clear that "creativity", "expressivity" and "emotion" *are* incapable of being achieved by machine with our current level of ability / understanding.

And without a paradigm shift in our understanding of AI, that's not going to change. At present we're so busy chasing down the wrong rabbit hole, that nobody with the necessary power to change it can see it's the wrong rabbit hole.

Cheers,
Wol

AI from a legal perspective

Posted Sep 28, 2023 8:07 UTC (Thu) by gfernandes (subscriber, #119910) [Link] (7 responses)

I also say that the 3-10W power use of the human brain, powered by sugars processed out of food, is way, WAY, above the capability of any AI.

So I'd say that superiority of the human brain is well deserved.

AI from a legal perspective

Posted Sep 28, 2023 9:41 UTC (Thu) by ms (subscriber, #41272) [Link] (2 responses)

Well, sure, human brains are pretty astonishing from an efficiency and also pure power pov. I believe we recently discovered plants make use of certain bits of quantum physics, which helps with efficiency of photosynthesis. I wouldn't be at all surprised if we eventually figure out that the evolution of brains also discovered similar possibilities and advantages and so forth. But fundamentally, can a brain solve problems that a Turing Machine cannot? Anyway, I enjoy these sorts of thought experiments, but I'm aware this is getting somewhat off-topic now.

AI from a legal perspective

Posted Sep 28, 2023 12:07 UTC (Thu) by amacater (subscriber, #790) [Link]

If we have a production line of Turing machines, each with infinite tape - at what point do they become obsolete such that the line can be switched off? How do we determine this? :)

AI from a legal perspective

Posted Feb 21, 2024 21:40 UTC (Wed) by nix (subscriber, #2304) [Link]

> I believe we recently discovered plants make use of certain bits of quantum physics, which helps with efficiency of photosynthesis.

If you mean thirty or forty years ago, then yes, recently. (For that matter, so do we: the electrons in the electron-transport chain in every mitochondrial membrane complex in every cell in our bodies quantum-tunnel along the chain. Without that endless quantum tunnelling dance constantly pumping protons to recharge our ATP, we'd all be dead in a minute or two. Of course the same is true of plants: they just also have chloroplasts doing similar things.)

AI from a legal perspective

Posted Sep 28, 2023 10:40 UTC (Thu) by fenncruz (subscriber, #81417) [Link] (2 responses)

On the other hand, my processing power is definitely < 1 FLOPS, and requires additional auxiliary storage units and I/O (pen & paper) when doing complicated maths.

It just depends on what task you are doing, as to which is better.

AI from a legal perspective

Posted Sep 28, 2023 11:18 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

Indeed, when it comes to processing video or audio, or walking/talking humans do considerably better. I imagine any eventual general AI will have multiple models doing different specialised tasks. It feels like the id/superego distinction: we've now built an id which can do all sorts of stuff automatically but doesn't really think ahead or consider its actions. The next step would be to couple this with some kind of "superego" which is much slower but monitors the id, trains for new situations and corrects the output when necessary.

I don't think the current model of LLM training is really suitable for training this "I have a model of the real world and if I say/do X then Y maybe the possible result and that is good/bad". This pretty much requires actual interaction with a (virtual) world with actual consequences for bad actions. Even in humans this is a 24/7 operation spanning decades with intensive training.

Though I guess bots built for interactive games must do something in this direction.

AI from a legal perspective

Posted Sep 28, 2023 12:07 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Indeed, when it comes to processing video or audio, or walking/talking humans do considerably better.

Not true at all. BUT.

Like I said, AI is charging down the wrong rabbit hole. I think this article dates from the 1980s, but somebody designed and built a robot crab that was quite happy scuttling about in the surf zone.

What he did NOT do was try and control everything from a big central CPU (given that all he had was something like a 6502 or Z80, that's no surprise!) I can't remember much about it (not surprising, that long ago), but I think it was like each leg had its own controller, another controller changed the angle the body presented to the water, and they all interacted together.

It's like self-driving cars. Everyone who has no experience says you should run it from a control room. Everyone with experience knows that that's a disaster waiting to happen. The problem is the politicians and accountants are the people who control what actually happens ...

Cheers,
Wol

AI from a legal perspective

Posted Oct 2, 2023 10:36 UTC (Mon) by ras (subscriber, #33059) [Link]

The figure I see around the 'net for brain power consumption is 12 watts.

From memory, Tesla's HW3 neural engine they use for "full self driving" is 35 watts, but each car has two for redundancy (not speed) so 70 watts in total.

35 watts isn't that far from 12, and it's processing the input from 8 cameras.

AI from a legal perspective

Posted Sep 28, 2023 10:37 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

The idea that "The weights are simply a pile of numbers; it is not creative, it is not expressive. They are just the result of a mechanical process." does not indicate that the law considers the machine's output as incapable of being creative, expressive, or emotional.

Instead, what they're saying is that the machine itself, being just a bunch of weights, is not protected by copyright law. In human terms, this is the law saying that copyright protection does not apply to your brain and body, even though the works you produce can be protected by copyright.

Personally, I think the ideal legal outcome is that an AI gets treated the same as a "human in a box"; if, instead of an AI, you had a human being that took in the inputs you give the AI, and gave you the outputs, what would the legal position be? This means that the AI itself is not infringing on the training set, or on the inputs, but that the output of the AI may well infringe copyright, and may well be a sufficiently creative and expressive work to meet the bar for copyright protection.

AI from a legal perspective

Posted Sep 28, 2023 11:32 UTC (Thu) by ms (subscriber, #41272) [Link]

Thank you; that makes an enormous amount of sense to me.

AI from a legal perspective

Posted Sep 28, 2023 12:35 UTC (Thu) by somlo (subscriber, #92421) [Link] (1 responses)

Agreed, with a small nit to pick:

> and may well be a sufficiently creative and expressive work to meet the bar for copyright protection

Who gets to own the copyright for such creative and expressive AI-generated output? The user who "prompted" it, the builder / trainer of the AI, someone/something else?

AI from a legal perspective

Posted Sep 28, 2023 13:03 UTC (Thu) by farnz (subscriber, #17727) [Link]

I would say that by default, it's the user who prompted it, but contract terms between the user, the builder/trainer, and the owner of the hardware the model runs on can change this. I would therefore also say that if the model is used to infringe copyright, the user is liable for that infringement if it reaches commercially significant levels.

AI from a legal perspective

Posted Oct 5, 2023 7:08 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

The .mkv file of the latest marvel film is just a number. But the police might have something to say if I decide to share this number on the internet for my fellow mathematics enthusiasts…


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds