Class action against GitHub Copilot

Posted Nov 11, 2022 10:09 UTC (Fri) by bluca (subscriber, #118303)
In reply to: Class action against GitHub Copilot by vegard
Parent article: Class action against GitHub Copilot

> I am surprised there is nothing about combining works of different licenses and redistributing that (under a permissive license), as that seems the main problem to me.

Because the original license is irrelevant, as the model is not built using sources distributed under their licenses, but under specific exceptions in copyright law that allow text and data mining - see the recent EU directive on copyright. In the US is murkier because there's no corresponding exception yet, so it's all done under fair use exception, which is a pain as it has to be defended in court every time. Hopefully the legislators over there will catch up soon and fix this.

Class action against GitHub Copilot

Posted Nov 11, 2022 10:17 UTC (Fri) by bluca (subscriber, #118303) [Link] (5 responses)

Think of it this way: imagine you distribute some software under an open source license. Someone approaches you and offers to pay you to use that work under different terms. Copyright law allows you to do that - regardless of how the work is distributed elsewhere, it doesn't affect the buyer, because you gave them permission to use it differently. This is a common use case and used widely, as I'm sure you are aware.
The text and data mining exceptions work in a similar way, but instead of getting permission from the author, permission is given by the law, which trumps any original license, and the author's wishes too. The only right you have is to ask commercial entities for an opt out. Non-profits don't even have to provide an opt-out. The only restrictions AI builders have is that the source must be publicly available (ie: if someone breaks in a computer system and steals sources or data, it is not allowed to be used for this or any other purposes).

Class action against GitHub Copilot

Posted Nov 11, 2022 10:49 UTC (Fri) by farnz (subscriber, #17727) [Link] (4 responses)

I think you're reading the exception in EU law over-broadly, and it doesn't say quite what you're claiming.

The exception for text and data mining says that you are not infringing copyright solely by virtue of feeding material into your machine learning system, and that the resulting model is not itself automatically a derived work of the inputs. It does not say that the output of your machine learning system cannot infringe copyright. As far as I can tell, the intent behind this exception is to allow your system to engage in the sorts of copying that we already say are outside the scope of copyright when a human does it - for example, having read about io_uring, a human might build their next system around a submission and completion queue pair, and this is not a copy in the sense of copyright law.

This means that a court could rule legitimately that a given output from the system is sufficient of a copy to be a copyright infringement by the system, and that the use of the system is thus contributory infringement whenever it produces a literal copy of a protected part of its input.

This, in turn, would bring use of systems like GitHub Copilot into line with employing a human to do the same job: if, as a result of a prompt, I write out precisely the code that a previous employer used (complete with comments - whether I copied it from a notebook, or kept a copy of a past employer's source code), then that is copyright infringement. If, on the other hand, I write code that's similar in structure simply because there are only a few ways to loop over all items in a container, that's not copyright infringement.

Assuming the US courts apply this sort of reasoning, then the question before them is whether a human writing the same code with the same prompting would be infringing copyright or not - if you substitute "a Microsoft employee wrote this for a Microsoft customer to use" for "GitHub Copilot wrote this for a Microsoft customer to use", do you still see infringement or not?

Class action against GitHub Copilot

Posted Nov 11, 2022 12:10 UTC (Fri) by Wol (subscriber, #4433) [Link]

I was thinking along the same lines. Simply put, the exception allows the use of online corpora as input. It does NOT allow distribution of the resulting output.

So the exception doesn't cover you for using the result ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 14:49 UTC (Fri) by bluca (subscriber, #118303) [Link] (2 responses)

> The exception for text and data mining says that you are not infringing copyright solely by virtue of feeding material into your machine learning system, and that the resulting model is not itself automatically a derived work of the inputs.

Exactly, and there are many commentators that completely miss this, and assume the opposite, hence the need to clarify it.

> It does not say that the output of your machine learning system cannot infringe copyright.

Sure, but it implies that it does not automatically does so either. Then the onus is on the complainers to show that, first of all, these artificially induced snippets are copyrightable in the first place, and after that to show that intentionally inducing the tool to reproduce them, which requires knowing in advance what they look like and what keywords and surrounding setting to prepare in order to achieve that result, means that it's the tool that is at fault rather than the user.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:25 UTC (Fri) by KJ7RRV (subscriber, #153595) [Link]

Would an AI that could auto fill the gaps in this be infringing the copyrights in your comment?

"> _____ _____ _____ _____ and _____ _____ _____ _____ you _____ _____ _____ _____ solely _____ _____ _____ _____ material _____ _____ _____ _____ system, _____ _____ _____ _____ model _____ _____ _____ _____ a _____ _____ _____ _____ inputs. _____ _____ _____ _____ many _____ _____ _____ _____ this, _____ _____ _____ _____ hence _____ _____ _____ _____ it. _____ _____ _____ _____ say _____ _____ _____ _____ your _____ _____ _____ _____ infringe _____ _____ _____ _____ implies _____ _____ _____ _____ automatically _____ _____ _____ _____ the _____ _____ _____ _____ complainers _____ _____ _____ _____ of _____ _____ _____ _____ snippets _____ _____ _____ _____ first _____ _____ _____ _____ to _____ _____ _____ _____ the _____ _____ _____ _____ which _____ _____ _____ _____ what _____ _____ _____ _____ what _____ _____ _____ _____ to _____ _____ _____ _____ achieve _____ _____ _____ _____ it's _____ _____ _____ _____ at _____ _____ _____ _____ user."

Class action against GitHub Copilot

Posted Nov 11, 2022 15:37 UTC (Fri) by farnz (subscriber, #17727) [Link]

You're not actually clarifying, unfortunately. The effect of the EU Copyright Directive is not to say that the model and its training process are guaranteed not to infringe copyright; rather, it's to say that the mere fact that copyrighted material was used as input to the training process does not imply that the training process or resulting model infringe, in and of itself.

And you're asking more of the complainers than EU law does. Under EU law, the complainers first have to show that there is a copyrightable interest in the output (which you do get right), and after that, they only have to show that the tool's output infringes that copyright. In particular, the tool is at fault if the material is infringing and it produces it from a non-infringing prompt - even if the prompt has to be carefully crafted to cause the tool to produce the infringing material.

As an example, let's use the prompt:


float rsqrt(float number) {
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;

This is not infringing in most jurisdictions - there's nothing in there that is copyrightable at this point, as all the names are either descriptive, or completely non-descript. If a machine learning model then goes on to output the Quake III Arena implementation of Q_rsqrt from this prompt, complete with the comments (including the commented out "this can be removed" line), then there's infringement by the tool, and if it can be demonstrated that the only place the tool got the code from was its training set, the tool provider is likely to be found to be a contributory infringer.

It doesn't matter that I've set the tool up with a troublesome prompt here (that's the first 5 lines of Q_rsqrt, just renamed to Q_rsqrt); I haven't infringed, and thus the infringement is a result of the tool's training data being copied verbatim into its output.

This is, FWIW, exactly the same test that would apply if I gave that prompt to a human who'd seen Quake III Arena's source code, and they infringed by copying the Quake III Arena implementation - I would not be able to prove infringement just because the human had seen the original source, but I would be able to do so if, given the prompt, they produced a literal copy Q_rsqrt.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:09 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

Because the original license is irrelevant, as the model is not built using sources distributed under their licenses, but under specific exceptions in copyright law that allow text and data mining - see the recent EU directive on copyright. In the US is murkier because there's no corresponding exception yet, so it's all done under fair use exception, which is a pain as it has to be defended in court every time.

What the EU says about data mining is irrelevant to this case because Github as an entity is not based in the EU.

What is more pertinent is that the terms and conditions of GitHub stipulate that if you upload stuff to Github, you license Github to use said stuff to “improve the service”, including indexing etc. Since Copilot is basically a fancy search engine for Github-hosted code, it would not be unreasonable for Microsoft's lawyers to argue that their use of code already on Github to train Copilot is covered by the site's existing T&C's, so they don't even need to make a fair-use argument. This would be completely independent of the licenses in the various projects on Github, which govern the use and disposition of Github-hosted code by third parties.

Having said that, the question of whether Copilot's output can infringe on the copyright of its inputs is a separate (and difficult) issue, which should probably be investigated in the wider context of ML applications that, e.g., paint pictures or generate prose. It is obvious that much of the code Copilot deals with is boilerplate which is not sufficiently original to qualify for copyright protection in the first place, but then again, the fact that Copilot can be coaxed into producing swathes of demonstrably copyrighted code without the correct attribution or license grant should not be overlooked, either. (Personally I think that there are more efficient methods to deal with boilerplate, and the amount of due diligence required on the part of Copilot users to ensure that any nontrivial stuff Copilot may regurgitate is not violating any copyrights, let alone fit for purpose, negates the advantage of using Copilot to begin with, but your mileage may vary.)

Class action against GitHub Copilot

Posted Nov 11, 2022 14:54 UTC (Fri) by bluca (subscriber, #118303) [Link]

> What the EU says about data mining is irrelevant to this case because Github as an entity is not based in the EU.

I live and use it in the EU, so it is very relevant to me ;-) Also the EU is pretty much the world's regulatory superpower, so what it says on this matters a lot.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:56 UTC (Fri) by kleptog (subscriber, #1183) [Link] (2 responses)

> In the US is murkier because there's no corresponding exception yet, so it's all done under fair use exception, which is a pain as it has to be defended in court every time.

Indeed, it's concerning for me that this is a case that potentially could set wide-reaching precedent. I know judges creating law is a feature of Common Law systems, but it feels like some of the issues here should be decided by the legislature, not the judiciary.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:14 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> I know judges creating law is a feature of Common Law systems, but it feels like some of the issues here should be decided by the legislature, not the judiciary.

Getting into politics I know, but it's why I feel the gutting of the House Of Lords about 20 years ago was a travesty of good government. The UK's effective Supreme Court was a (pretty ineffective) branch of the legislature. And as such it did a damn good job of making sure the law actually worked.

Career politicians as a whole seem pretty malevolent - by design or accident - and having a decent body of people who wielded power by accident and no design of their own seemed a very good counterbalance. For pretty much a century the House of Lords did a damn good job of making sure legislation was fair, just, and worked. (For some definition of "just", I know. The Lords are people of their time, with the prejudices of their time, but are far less likely to be swayed by populist demagogues.)

The problem the US suffers in particular at the moment, and we're going down the same route, is we seem to have "government by barrels of ink" ...

Cheers.
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 19:39 UTC (Fri) by bluca (subscriber, #118303) [Link]

The problem with that is I can bring many counter-examples of absolute bellends that ended up being peers when they shouldn't even be allowed anywhere near a residents association. But we are just a bit off-topic.