
Class action against GitHub Copilot

By Jonathan Corbet
November 10, 2022
The GitHub Copilot offering claims to assist software developers through the application of machine-learning techniques. Since its inception, Copilot has been dogged by controversy, mostly centered on its extensive use of free software to train the machine-learning engine. The announcement of a class-action lawsuit against Copilot was thus unsurprising. The lawsuit raises all of the expected licensing questions and more; while some in our community have welcomed this attack against Copilot, it is not clear that this action will lead to good results.

Readers outside of the US may not be entirely familiar with the concept of a class-action lawsuit as practiced here. It is a way to seek compensation for a wrong perpetrated against a large number of people without clogging the courts with separate suits from each. The plaintiffs are grouped into a "class", with a small number of "lead plaintiffs" and the inevitable lawyers to represent the class as a whole. Should such a suit prevail, it will typically result in some sort of compensation to be paid to anybody who can demonstrate that they are a member of the class.

Class-action lawsuits have been used to, for example, get compensation for victims of asbestos exposure; they can be used to address massive malfeasance involving a lot of people. In recent decades, though, the class-action lawsuit seems to have become mostly a vehicle for extorting money from a business for the enrichment of lawyers. It is not an uncommon experience in the US to receive a mailing stating that the recipient may be a member of a class in a suit they have never heard of and that, by documenting their status, they can receive a $5 coupon in compensation for the harm that was done to them.

Compensation for the lawyers involved, instead, tends to run into the millions of dollars. Not all class-action lawsuits are abusive in this way, but it happens often enough that it has become second nature to look at a new class-action with a jaundiced eye.

The complaint

The complaint was filed on behalf of two unnamed lead plaintiffs against GitHub, Microsoft, and a multitude of companies associated with OpenAI (which is partially owned by Microsoft and participated in the development of Copilot). It explains at great length how Copilot has been trained on free software, and that it can be made to emit clearly recognizable fragments of that software without any of the associated attribution or licensing information. A few examples are given, showing where the emitted software came from, with some asides on the (poor) quality of the resulting code.

Distribution of any software must, of course, be done in compliance with the licenses under which that software is released. Even the most permissive of free-software licenses do not normally allow the removal of copyright or attribution information. Thus, the complaint argues, the distribution of software by Copilot, which does not include this information, is in violation of that software's licenses and not, as GitHub seems to claim, a case of fair use. Whether fair use applies to Copilot may well be one of the key turning points in this case.

The members of the class of people who have allegedly been harmed by this activity are defined as:

All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time between January 1, 2015 and the present (the “Class Period”).

It is, as would be expected, a US-focused effort; if there is harm against copyright owners elsewhere in the world, it will have to be addressed in different courts. This wording would seem to exclude developers who have never themselves placed code on GitHub, but whose code has been put there by others — a frequent occurrence.

The list of charges against the defendants is impressive in its length and scope:

  • Violation of the Digital Millennium Copyright Act, brought about by the removal of copyright information from the code spit out by Copilot.
  • Breach of contract: the violation of the free-software licenses themselves. The failure to live up to the terms of a license is normally seen as a copyright violation rather than a contract issue, but they have thrown in the contract allegation as well.
  • Tortious interference in a contractual relationship; this is essentially a claim that GitHub is using free software to compete against its creators and has thus done them harm.
  • Fraud: GitHub users, it is claimed, were induced to put their software on GitHub by the promises made in GitHub's terms of service, which are said to be violated by the distribution of that software through Copilot.
  • False designation of origin — not saying where the software Copilot "creates" actually comes from.
  • Unjust enrichment: profiting by removing licensing information from free software.
  • Unfair competition: essentially a restatement of many of the other charges in a different light.
  • Breach of contract (again): the contracts in question this time are GitHub's terms of service and privacy policy.
  • Violation of the California Consumer Privacy Act: a claim that the plaintiffs' personal identifying information has been used and disclosed by GitHub. Exactly which information has been abused in this way is not entirely clear.
  • Negligent handling of personal data: another claim related to the disclosure of personal information.
  • Conspiracy: because there are multiple companies involved, their having worked together on Copilot is said to be a conspiracy.

So what is this lawsuit asking for in compensation for all of these wrongs? It starts with a request for an injunction to force Copilot to include the relevant licensing and attribution information with the code it emits. From there, the requests go straight to money, with attorneys' fees at the top of the list. After that, there are nine separate requests for both statutory and punitive damages. And just in case anybody thinks that the lawyers are thinking too small:

Plaintiffs estimate that statutory damages for Defendants’ direct violations of DMCA Section 1202 alone will exceed $9,000,000,000. That figure represents minimum statutory damages ($2,500) incurred three times for each of the 1.2 million Copilot users Microsoft reported in June 2022.
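
The arithmetic is straightforward: $2,500 × 3 violations × 1,200,000 users does indeed multiply out to $9,000,000,000 — and that figure is described as a minimum, for the DMCA claim alone, before the other damages requests are added in.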

It seems fair to say that a lot of damage is being alleged here.

Some thoughts

The vacuuming of a massive amount of free software into the proprietary Copilot system has created a fair amount of discomfort in the community. It does, in a way, seem like a violation of the spirit of what we are trying to do. Whether it is a violation of the licenses involved is not immediately obvious, though. Human programmers will be influenced by the code they have seen through their lives and may well re-create, unintentionally, something they have seen before. Perhaps an AI-based system should be forgiven for doing the same.

Additionally, an argument could be made that the code emitted by Copilot doesn't rise to the level of copyright violation. The complaint spends a lot of time on Copilot's ability to reproduce, given the right prompt, a JavaScript function called isEven() — which does exactly what one might expect — from a Creative-Commons-licensed textbook. It is not clear that a slow and buggy implementation of isEven() contains enough creative expression to merit copyright protection, though.
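
For readers who have not seen it, a minimal, hypothetical sketch of such a recursive isEven() — an illustration written for this article, not the textbook's actual code — might look like this:

    // Hypothetical recreation of the kind of recursive isEven() described
    // in the complaint; illustrative only.
    function isEven(n) {
        if (n == 0) return true;        // zero is even
        else if (n == 1) return false;  // one is odd
        else return isEven(n - 2);      // otherwise count down by two
    }

    console.log(isEven(50));   // -> true
    console.log(isEven(75));   // -> false
    console.log(isEven(-1));   // -> ?? (recurses forever; the "buggy" part)

A dozen lines of this sort are close to the obvious way to express the idea, which is why their copyrightability is open to question.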

That said, there are almost certainly ways to get more complex — and useful — output from Copilot that might be considered to be a copyright violation. There are a lot of interesting questions that need to be answered regarding the intersection of copyright and machine-learning systems that go far beyond free software. Systems that produce images or prose, for example, may be subject to many of the same concerns. It would be good for everybody involved if some sort of consensus could emerge on how copyright should apply to such systems.

A class-action lawsuit is probably not the place to build that consensus. Lawsuits are risky affairs at best, and the chances of nonsensical or actively harmful rulings from any given court are not small. Judges tend to be smart people, but that does not mean that they are equipped to understand the issues at hand here. This suit could end up doing harm to the cause of free software overall.

The request for massive damages raises its own red flags. As the Software Freedom Conservancy noted in its response to the lawsuit, a core component of the ethical enforcement of free-software licenses is to avoid the pursuit of financial gain. The purpose of an enforcement action should be to obtain compliance with the licenses, not to generate multi-billion-dollar payouts. But such a payout appears to be an explicit goal of this action. Should it succeed, there can be no doubt that many more lawyers will quickly jump into that fray. That, in turn, could scare many people (and companies) away from free software entirely.

Bear in mind that most of these suits end up being settled before going to trial. Often, that settlement involves a payment from the defendant without any admission of wrongdoing; the company is simply paying to make the suit go away. Should that happen here, the result will be a demonstration that money can be extracted from companies in this way without any sort of resolution of the underlying issues — perhaps a worst-case scenario.

Copilot does raise some interesting copyright-related questions, and it may well be, in the end, a violation of our licenses. Machine-learning systems do not appear to be going away anytime soon, so it will be necessary to come to some conclusions about how those systems interact with existing legal structures. Perhaps this class-action suit will be a step in that direction, but it is hard to be optimistic that it will produce a helpful outcome. Perhaps, at least, GitHub users will receive a coupon they can use to buy a new mouse or something.



Class action against GitHub Copilot

Posted Nov 10, 2022 15:56 UTC (Thu) by q_q_p_p (guest, #131113) [Link] (33 responses)

All code generated by Copilot should have at least a GPL license (if not AGPL).

Class action against GitHub Copilot

Posted Nov 10, 2022 17:28 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (26 responses)

I know that you're trolling, but… some of my code on Github is licensed ISC. Microsoft probably used my code to train Copilot. If my code influenced the output, I do not want the output to be copyrighted under a license like the GPL or anything connected with the FSF.

Class action against GitHub Copilot

Posted Nov 10, 2022 17:44 UTC (Thu) by q_q_p_p (guest, #131113) [Link]

Do you have any control over it now? Did setting copyright notice help you?

Class action against GitHub Copilot

Posted Nov 10, 2022 17:53 UTC (Thu) by fenncruz (subscriber, #81417) [Link] (24 responses)

Not that they'd want to, as it would cut down the amount of training data, but I could see making multiple models, one per license, as a way to try to sidestep license issues. That way, if you want GPL code, you only use a model trained on GPL code.

Class action against GitHub Copilot

Posted Nov 10, 2022 18:50 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (23 responses)

You can't do that unless you're prepared to comply with attribution requirements, or to claim that copyright is not infringed in the first place (in which case you don't need separate models).

Class action against GitHub Copilot

Posted Nov 12, 2022 23:02 UTC (Sat) by apoelstra (subscriber, #75205) [Link] (22 responses)

I'm curious if the attribution requirement would be satisfied if Copilot were to output a (perhaps 100s of MB) CONTRIBUTORS file listing every author listed in every attribution statement from the input corpus. Then Copilot-assisted code would just have to include this file. The result: a massively duplicated, never-read, list of github accounts from the time that any particular project was written.

This is clearly not in the spirit of the GPL but it's unclear to me whether it matches the letter.

Class action against GitHub Copilot

Posted Nov 12, 2022 23:29 UTC (Sat) by Wol (subscriber, #4433) [Link] (21 responses)

And what about all that BSD code? That CDDL code? That Mozilla Public Licence code? Those *licences* are NOT in the spirit of the GPL. (Some of them are also incompatible with the GPL, but that's beside the point ...)

Demanding that code - licenced under a non-GPL licence - be distributed "in the spirit of the GPL" is doing a major dis-service to the authors of that code!

What is this obsession with the GPL?! Who gave you the right to dictate to me what licence my code should be licenced under!

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 13, 2022 8:53 UTC (Sun) by milesrout (subscriber, #126894) [Link] (20 responses)

> Demanding that code - licenced under a non-GPL licence - be distributed "in the spirit of the GPL" is doing a major dis-service to the authors of that code!

But that's not the demand. The demand is that code derived from GPL-licensed code be distributed under the GPL. That's what the GPL requires.

> What is this obsession with the GPL?! Who gave you the right to dictate to me what licence my code should be licenced under!

Nobody is telling you what licence your code should be distributed under, unless your code is a derivative work of GPL-licensed code in which case YOU have told yourself to distribute it under the GPL by making it such.

Class action against GitHub Copilot

Posted Nov 13, 2022 12:59 UTC (Sun) by Wol (subscriber, #4433) [Link] (19 responses)

The original post said code output from Copilot should be GPL. Which, if a lot of the training code was not GPL, is a non sequitur.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 13, 2022 14:11 UTC (Sun) by amacater (subscriber, #790) [Link]

Code output from Copilot which reproduces other code should be under those licences yes - and that may mean that a large amount of code would need to be under GPL. Code which does not have an explicit licence to distribute shouldn't be reproduced as distributable under any licence at all, of course - the fact that you've given GitHub a right to host it doesn't give them the right to choose a copyright licence for ALL code that they then publish (And a lot of GitHub code has no ascertainable licence: fine for the originator of the code, not fine for anyone else to use at all).

Class action against GitHub Copilot

Posted Nov 13, 2022 17:54 UTC (Sun) by q_q_p_p (guest, #131113) [Link] (7 responses)

Now Copilot is a license laundering scheme under which the original licenses don't matter.
If Copilot instead produced only GPL code, this would make FOSS advocates happy and corporate (including open source) shills unhappy. That's a win-win situation for me.

You don't have to agree with me and instead make your own top comment - which license should Copilot use?

Class action against GitHub Copilot

Posted Nov 13, 2022 18:16 UTC (Sun) by Wol (subscriber, #4433) [Link] (6 responses)

> Now Copilot is license laundering scheme under which original licenses don't matter.

> If instead Copilot would produce only GPL code, this would make FOSS advocates happy and corporate (including open source) shills unhappy. That's win-win situation for me.

Is that GPL2? GPL3? GPL 2 or 3? GPL2+? GPL3+?

Oh - and BSD, MIT, licenses like that are FLOSS. I'm pretty sure their advocates would be LESS than happy with Copilot laundering their code into GPL! If Copilot produces only GPL code, that's a lose-lose situation for a LOT of people. People who are fans of FLOSS ...

Please. Just stop trolling. Just because you're a GPL fanatic doesn't mean other FLOSS people agree with you that the GPL is a "good thing (tm)". GPL3 is a disaster ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 13, 2022 19:16 UTC (Sun) by q_q_p_p (guest, #131113) [Link] (2 responses)

>Is that GPL2? GPL3? GPL 2 or 3? GPL2+? GPL3+?

AGPL-3.0-or-later ideally, GPL-3.0-or-later realistically.

> Oh - and BSD, MIT, licenses like that are FLOSS. I'm pretty sure their advocates would be LESS than happy with Copilot laundering their code into GPL!

"(including open source)" was referring to them.

>Just because you're a GPL fanatic doesn't mean other FLOSS people agree with you that the GPL is a "good thing (tm)". GPL3 is a disaster ...

Imagine, I'm also a "people" and I don't think GPL3 is a disaster.
Also again: you don't have to agree with me; instead, make your own top comment - which license should Copilot use? How about proprietary with a Commercial Use Only clause?

Class action against GitHub Copilot

Posted Nov 13, 2022 21:01 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> which license should Copilot use? How about proprietary with Commercial Use Only clause?

How about complying with the actual text of the GPL?

> If conditions are imposed on you (whether by court order, agreement or
> otherwise) that contradict the conditions of this License, they do not
> excuse you from the conditions of this License. If you cannot convey a
> covered work so as to satisfy simultaneously your obligations under this
> License and any other pertinent obligations, then as a consequence you may
> not convey it at all. For example, if you agree to terms that obligate you
> to collect a royalty for further conveying from those to whom you convey
> the Program, the only way you could satisfy both those terms and this
> License would be to refrain entirely from conveying the Program.

I express no opinion about whether the output of Copilot is actually a derivative work of its training materials. But *IF* it is, then it needs to comply with multiple incompatible copyleft licenses, so therefore it cannot be distributed at all. The GPL says this in black and white.

Class action against GitHub Copilot

Posted Nov 13, 2022 21:20 UTC (Sun) by q_q_p_p (guest, #131113) [Link]

> How about complying with the actual text of the GPL?

That's also an acceptable solution to me.

Class action against GitHub Copilot

Posted Nov 13, 2022 20:37 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (1 responses)

Anyone who chooses BSD or MIT licences is choosing to allow their code to be fed into the point attractor that is a GPLed project.

Class action against GitHub Copilot

Posted Nov 13, 2022 21:49 UTC (Sun) by Wol (subscriber, #4433) [Link]

There's a big difference between *allowing* and *being happy*.

It's well known that Linus likes the *practicality* of the GPL2, and is very *unhappy* with the *spirit* of the GPL, which is why Linux has never been licenced 2+, and which is why that move is unlikely ever to be considered.

People apparently use BSD because they want their code to spread. If that code gets incorporated into GPL projects, those projects are *hindering* the spread of the code by adding extra restrictions. Hopefully, the GPL code points back at the original BSD, but if the GPL version out-evolves the BSD one then the wants of the original developers have clearly been trampled on.

What's LEGAL is not always MORAL. The best approach is respect - for the code, for the authors, for people in general.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 14, 2022 9:41 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

> I'm pretty sure their advocates would be LESS than happy with Copilot laundering their code into GPL!

Then they shouldn't have picked a license that allows to do it?

Class action against GitHub Copilot

Posted Nov 14, 2022 9:24 UTC (Mon) by LtWorf (subscriber, #124958) [Link] (9 responses)

If I copy paste a function from a MIT licensed file and another from a GPL licensed file, which licenses can I apply to my resulting file?

AFAIK It's only GPL and AGPL.

The authors of the MIT licensed file might not like the GPL but they did not restrict this kind of use.

It is however true that care should be taken to not mix code with incompatible licenses, so it's not so easy as "just license everything under AGPL".

By the way people say AGPL because it's the most restrictive of the famous ones.

Class action against GitHub Copilot

Posted Nov 14, 2022 14:29 UTC (Mon) by Wol (subscriber, #4433) [Link] (8 responses)

Neither? It's not your code?

The file is only *distributable* under the GPL, but you cannot apply any licence to code you did not write. The licence(s) that apply to the file are the licences the authors/owners applied. Any recipient can (if they know the history) copy the BSD function from that file, and distribute it under BSD.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 14, 2022 21:07 UTC (Mon) by LtWorf (subscriber, #124958) [Link] (7 responses)

> Neither? It's not your code?

I know it's not my code but it seems I'm free to relicense MIT to GPL, and of course license the following mod

https://news.ycombinator.com/item?id=19489157

Class action against GitHub Copilot

Posted Nov 14, 2022 21:50 UTC (Mon) by sfeam (subscriber, #2841) [Link] (2 responses)

That very analysis you link to states the opposite: it is not possible to relicense code to GPL. It is possible to include MIT code with GPL code, but this does not have the effect of changing the MIT license to something else. The original copyright, and the original permission to distribute under MIT terms (not GPL) continues to exist. Whether any of that applies to the Copilot case is another matter. I am inclined to think that if the operation of Copilot is deemed to be permissible because it constitutes the training of an automated system, then its output is neither a compilation nor an original creative work and therefore is not subject to copyright at all. Thus the question of license becomes moot unless and until the output is re-worked as part of a larger whole that is copyrightable and could be licensed (or challenged) at that time.

Class action against GitHub Copilot

Posted Nov 15, 2022 9:44 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

I disagree with your analysis of Copilot, by analogy to a human.

Letting a human read text is not, in and of itself, an infringing activity. Nor is the resulting brain state in the human, even if it includes literal copies of the text or code they read. But the output of a human can itself be infringing, if instead of using my training to inform what I do, I regurgitate memorised chunks of text.

I expect the same principles to apply to Copilot and similar systems; training Copilot is not infringing. The resulting model is not infringing in and of itself. The output from the system, can, however, be infringing, and the degree to which it is a legal problem depends on the degree of infringement, and the extent to which the system disguises the origins of the code (in terms of contributory infringement, if I tell you that I'm showing you sample code from a given source, and you copy it, that's a different case to if I give you code that I do not attribute).

Class action against GitHub Copilot

Posted Nov 17, 2022 10:25 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Exactly what I was thinking. Also, whereas a human *might* accidentally regurgitate variable names as well as a learned code structure, an AI can be explicitly required to use generic variable names. The question then becomes a hard one. At what point is a chunk of code a derived work, rather than created from scratch? Haven't we been here before, with arguments about copyright of header files and APIs which cannot ever be written any other way?

for obj in object_list: obj.do_stuff() is surely fair, however the AI arrived at it.

for sd5obj in sd5_get_blue_meanies(): sd5obj.frobnicate_from_sd4( ) is surely a verbatim copy of somebody's identifiable code, and should at the very least be attributed.

Class action against GitHub Copilot

Posted Nov 14, 2022 22:17 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

" including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,"

You clearly didn't read the thread you pointed at.

The MIT licence allows SUBlicencing, ie applying other terms on top. According to your thread, there is no such thing as relicencing, certainly I've never seen that in any legal document.

And as I understand English, RElicencing means throwing out the old licence, and replacing it with a new one. No open source licence I've come across allows any such thing. As I understand the meaning of the word "relicencing", in practice this has precious little difference from handing over the copyright.

And that's also why/how you can extract the original code under the original licence. If it had been RElicenced, you wouldn't be able to get the original licence back.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 15, 2022 23:46 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

I'm not sure "relicensing" is even a thing in legal terms. Let's clarify a few ideas. Suppose we have some software that was written by person A. A gives a copy to B, and B gives a copy to C. Then:

* With both the GPL and the MIT license, A automatically extends a license to each person who obtains a copy of the software, by whatever means. In our example, this means that A gives a license to B and to C.
* With the MIT license, B may optionally give a different license to C (or, in principle, the same license, but that would be kind of pointless). This is called a "sublicense." It does not extinguish the original license, so C can decide which license to comply with. However, as a general rule in most jurisdictions, the sublicense cannot be broader than the rights that B already enjoys under the MIT license in the first place - you can't give what you don't have.
* In the context where the sublicense is the GPL, C is still required to comply with the MIT license's formalities. Obviously, B is also required to comply with those formalities. In practice, this is usually done simply by copying the copyright and license into a comment at the top of the file, but there are other ways of complying. If you're distributing binaries, you might need to take additional measures to ensure a copy of the license is still visible somewhere.
* Under the GPL, sublicensing is forbidden - C can only get the license directly from A. A and C never need to communicate in order to do this. The license, as mentioned above, springs into being automatically as soon as C has a copy of the software, regardless of how A or B may feel about that. I'm pretty sure the FSF did this to prevent B from placing "additional restrictions" on the sublicense.
* In principle, we could imagine a license which does not have this automatic property, where all rights have to be conveyed by explicit sublicensing (sort of the inverse of how the GPL works). This is inconvenient, so nobody does it in the FOSS world, but I imagine it's more common in environments where permissions are more closely guarded.
* We could also imagine a license which terminates as soon as you accept a sublicense from somebody else. This would make sublicensing have the effect of extinguishing the original license, and allow intermediaries to restrict downstream users' rights via clickwrap licenses. Again, this is unheard of in the FOSS world, but to my understanding it is not unlawful. I suppose you could call that operation "relicensing," but in practice, I don't think this is really a thing that anyone does.

Class action against GitHub Copilot

Posted Nov 16, 2022 5:50 UTC (Wed) by unilynx (guest, #114305) [Link] (1 responses)

I’m not sure anything of this works this way for a sub license from B to C if B is just providing a copy

The license is a promise not to sue for copyright infringement, provided you follow certain requirements.

C has the same rights as B in your case no matter what B said, as B has no standing to sue for copyright infringement as he does not have any copyright.

(Unless A specifically provided the product to B with the understanding that the MIT license would only be valid to B)

Class action against GitHub Copilot

Posted Nov 17, 2022 19:48 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

As I mentioned, C can decide which license to comply with. In the scenario you describe, C would (probably) choose to comply with the license provided by A rather than the sublicense provided by B, so B's sublicense has no legal effect and may as well not exist. Thus, this is a distinction without a difference.

Class action against GitHub Copilot

Posted Nov 10, 2022 21:10 UTC (Thu) by developer122 (guest, #152928) [Link] (4 responses)

Just try that with the ZFS that's under the CDDL. OpenZFS may not come after you but there's a good chance Oracle will.

There's a good reason that OpenZFS is distributed as an nvidia-style module despite being copyleft.

Class action against GitHub Copilot

Posted Nov 10, 2022 21:28 UTC (Thu) by q_q_p_p (guest, #131113) [Link] (3 responses)

That's a cool idea; if I were using Copilot I would try doing this: use Copilot to reproduce ZFS code, GPLv2 it, and let Oracle fight Microsoft over the code generated by Copilot :-) What could go wrong?

Class action against GitHub Copilot

Posted Nov 11, 2022 0:43 UTC (Fri) by pabs (subscriber, #43278) [Link] (2 responses)

Oracle would probably sue you first, not Microsoft?

Class action against GitHub Copilot

Posted Nov 17, 2022 10:34 UTC (Thu) by NRArnot (subscriber, #3033) [Link] (1 responses)

Write an AI to use Copilot to regenerate ZFS code? Then post the source code of that AI under a "no rights reserved, use at finder's risk" type license? (This isn't purely flippant. It's the sort of world we are heading into, and the law will get seriously left behind by the AIs).

Class action against GitHub Copilot

Posted Nov 27, 2022 16:19 UTC (Sun) by flussence (guest, #85566) [Link]

That's not much different than populating an open FTP server with a "windows11enterprisecracked.iso" containing an exact byte length of apparent random noise alongside a 1MB "teehee.xor".

Class action against GitHub Copilot

Posted Nov 17, 2022 13:04 UTC (Thu) by esemwy (guest, #83963) [Link]

Copilot should have as its input a desired license. There’s no legal way to blanket re-license someone else’s code.

Class action against GitHub Copilot

Posted Nov 10, 2022 16:26 UTC (Thu) by bluca (subscriber, #118303) [Link] (75 responses)

I am so glad to live in Europe, where the legislation is way ahead of the US on this matter and makes clear that such a lawsuit is absolutely bogus and nonsensical. There are explicit provisions for text and data mining for the purpose of AI being excepted from copyright laws, as it should be. In the US it has to rely on fair use - mind, this lawsuit is still bogus and nonsensical, but it means it will have to rely on the court to do the right thing, as fair use is a case-by-case affair.

The demand for money makes it even more obvious this is a malicious effort by some copyright trolls, pushing for a maximalist interpretation of the law, which would be bad news for everybody but their lawyers and wallets.

Class action against GitHub Copilot

Posted Nov 10, 2022 17:14 UTC (Thu) by Rigrig (subscriber, #105346) [Link] (31 responses)

Training an AI should be fine, but when it faithfully reproduces input it gets tricky:

Take the isEven() function from the complaint:
If anyone writes an isEven(), chances are it looks like a lot of other isEven() functions out there.
But this one is exactly the same as a textbook example, which uses recursion for every input except 0 or 1. It even includes the test code from the book, including the // -> ?? exercise comment.

Regardless of whether it was produced by an AI or a human, that sure smells like copyright violation to me.
Which is also what a lot of people are worried about: that this will be used to blatantly violate copyrights by claiming "It was written by an AI, which means it's the product of fair use."

Class action against GitHub Copilot

Posted Nov 10, 2022 19:11 UTC (Thu) by bluca (subscriber, #118303) [Link] (30 responses)

Nah, that is called concern trolling or sealioning. Those are not real world use cases, they are fabricated for clickbait effect. Absolutely nobody who's working on something real goes around trying to recreate the inverse square root algorithm or things like that that have been doing the rounds.

What this is really used for in the real world is to take care of boilerplate and such.

Class action against GitHub Copilot

Posted Nov 10, 2022 21:05 UTC (Thu) by ballombe (subscriber, #9523) [Link] (27 responses)

How do you know that ?

Class action against GitHub Copilot

Posted Nov 10, 2022 21:17 UTC (Thu) by bluca (subscriber, #118303) [Link] (26 responses)

I use it every day, and talk to other developers who use it every day. Do you?

Class action against GitHub Copilot

Posted Nov 11, 2022 9:51 UTC (Fri) by gspr (guest, #91542) [Link] (11 responses)

Someone has made a machine that outputs *potentially* copyright-infringing code. Whether or not it really is, is the core of the debate, and not something that's easy to answer. What I fail to understand is how "I and all my peers just use that machine to output non-copyrightable boilerplate" is any sort of excuse.

Class action against GitHub Copilot

Posted Nov 11, 2022 10:24 UTC (Fri) by bluca (subscriber, #118303) [Link] (10 responses)

Because it's not an excuse, it's explaining how this works in the real world, outside of clickbaity articles and copyright troll lawsuits fishing for money. Because this matters, a lot.
My smartphone camera can also *potentially* output copyright-infringing pictures. My mp3 player *potentially* plays copyrighted songs. And so on - these are tools, and their main intended and common use matters, and for alleged open source supporters to side with copyright maximalists and trolls looking for a quick payday is missing the point so much that it's not even fun anymore.

Copyright maximalism is bad for us. The only reason this gains any traction is because it's done by Microsoft, if Copilot had been built by Mozilla reactions would be quite different, and that's just sad and short-sighted.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:11 UTC (Fri) by gspr (guest, #91542) [Link] (9 responses)

I think these analogies are terrible. Your camera and your music players are indeed tools that you can use to infringe on copyright. They didn't come to you in a state infringing on copyright. The claim about Copilot, however, is that it "contains" (for some value of contains) and produces copyrighted material.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:47 UTC (Fri) by bluca (subscriber, #118303) [Link] (8 responses)

No, those analogies are perfectly adequate. Copilot does not contain any infringing material, it's not a repository. And you have to intentionally make it produce exact copies, with carefully selected inputs that you need to have pre-existing knowledge of, it does not happen randomly, it takes a lot of effort. So it's exactly like a camera or a music player, in that regard. It's the user intent that causes that behaviour, not the tool itself.

Class action against GitHub Copilot

Posted Nov 11, 2022 12:14 UTC (Fri) by gspr (guest, #91542) [Link] (7 responses)

> Copilot does not contain any infringing material

That's the question at the heart of the conundrum, and it's a very complex one. Sure, we can probably agree that Copilot does not contain bit-for-bit copies of copyrighted material. But that's not the bar. Distributing a lossily compressed copy of a copyrighted image without permission can still be infringement. On the other hand, distributing the average value of all the pixels in said image certainly is not. The spectrum in-between is where it gets hard, and this is (in my opinion) probably where Copilot and similar fall.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:32 UTC (Fri) by Wol (subscriber, #4433) [Link]

As a completely different example, copying material as part of a lawsuit does not infringe copyright. I can copy a work, present it to court, and copyright cannot touch me.

But if somebody else then takes my work and publishes it in a newspaper, that's not a legal document. Me putting it in a legal document did not strip copyright, it just gave ME immunity. The publisher can still get done for it.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 19:55 UTC (Fri) by bluca (subscriber, #118303) [Link] (5 responses)

Of course it is complicated, but the crux is that I don't see how this tool can be defined as something that distributes copies of anything. It is very obviously not built to pick existing snippets and shove them out of the door 1:1 to users. It's not how it works in the vast majority of cases, where it builds something adapted to the surrounding environment - which is what makes it so darn beautiful and useful to use. Then there are users who go out of their way to carefully construct a surrounding environment (=> input query) using their pre-existing knowledge to intentionally cause it to spit out pre-existing snippets in order to write clickbaity articles or, in this case, look for a quick payday.

Class action against GitHub Copilot

Posted Nov 12, 2022 15:24 UTC (Sat) by farnz (subscriber, #17727) [Link] (4 responses)

It can be defined as something that distributes copies of something in exactly the same way as a human engineer can be defined as someone who distributes copies of something.

If I have, in my notebooks, details of how to do something in kernel-dialect C, and I read a snippet of code from those details then adapt it to the codebase I'm working on, then I've distributed a copy of the snippet in my notebook. If the snippet in my notebook is not protected by copyright, then this is not an issue; if it is, then I've potentially infringed copyright by copying out that snippet and adapting it.

The same applies to Copilot - its model takes the place of the engineer's notebooks and knowledge of what they can find in their notebooks, and its output is potentially infringing in exactly the same way as a human engineer's output is potentially infringing, complete with fun questions around "non-literal copying".

Class action against GitHub Copilot

Posted Nov 17, 2022 13:16 UTC (Thu) by esemwy (guest, #83963) [Link] (3 responses)

A human engineer can be inspired by someone else’s code. An AI has no such ability. Computers memorize, and have rules. The fact that the AI is really complicated doesn’t change that. It’s more as if they obfuscated the source and are trying to pass it off as their own.

Class action against GitHub Copilot

Posted Nov 17, 2022 14:08 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

You're getting into the philosophy of what it means to be human, and missing my point at the same time.

My point is simply that if a human can infringe while doing the same thing that Copilot does, then it's absurd to say that Copilot cannot infringe because it's an AI - rather, it's reasonable to say that Copilot's ability to infringe copyright is bounded on the lower end by the degree to which a human doing the same thing can infringe copyright.

I've also provided a sketch of how a human can infringe copyright, which I can expand upon if it's not clear; unless you can demonstrate that Copilot is incapable of doing what the human does to infringe, however, you can't then claim that Copilot can't infringe where a human can.

Class action against GitHub Copilot

Posted Nov 17, 2022 15:57 UTC (Thu) by esemwy (guest, #83963) [Link] (1 responses)

The distinction does make a difference in this case. If a human takes a snippet of code and copies it, only changing the identifiers, my understanding is that is still a violation of copyright, and a company can’t take code, compile it, and claim it’s not an infringement. Copilot seems more like the latter, where the former example would be the human analog.

Class action against GitHub Copilot

Posted Nov 17, 2022 17:10 UTC (Thu) by farnz (subscriber, #17727) [Link]

I'm not following your reasoning here - in what way should Copilot be permitted to do something that would make Microsoft liable for the infringement if an employee did it?

Remember that I'm setting a lower bound - "if Copilot was just a communication interface to a human being at Microsoft who looked at the context sent to and then responded with a code snippet, would Microsoft be liable?". My claim is that if Microsoft would be liable in this variant on Copilot, then Microsoft are also liable if Copilot is, in fact, an "AI" based around machine learning, but that this is a one-way inference - if Microsoft would not be liable if Copilot was a comms channel, this doesn't tell you anything about whether Microsoft are liable if Copilot is actually an AI.

To summarize: my reasoning is that "an AI did it, not a human" should never be a get-out clause - it can increase your liability beyond that you'd face if a human did it, but it can never decrease it.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:06 UTC (Fri) by paulj (subscriber, #341) [Link] (10 responses)

For transparency, can you state if you have any relationships with either Microsoft or GitHub (I vaguely recall you have with MS, but ICBW)?

Class action against GitHub Copilot

Posted Nov 11, 2022 13:11 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (9 responses)

> For transparency, can you state if you have any relationships with either Microsoft or GitHub (I vaguely recall you have with MS, but ICBW)?

Bluca works for MS but is not involved with GitHub directly as I understand it. He has said before he doesn't think it is important to add such notes but I think the repeated level of participation in topics like these warrants one. It is a complex issue and it was clear from the beginning this is all going to end up in court(s). It may very well end with rulings that affect the future of such tools and even copyright in general. If you are going to come in strongly on one side (even if it happens to be coincidentally favorable to your employer which I can completely accept it is), other folks might want to take that into consideration when evaluating your opinion.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:30 UTC (Fri) by bluca (subscriber, #118303) [Link] (4 responses)

Absolute drivel. This is the comments section of a news article, not a court room. Are you going to post your tax returns to show you haven't invested in any of the companies involved? No? Thought so. Wind yer heid in.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:50 UTC (Fri) by paulj (subscriber, #341) [Link]

Aye right pal.

Class action against GitHub Copilot

Posted Nov 12, 2022 20:16 UTC (Sat) by k8to (guest, #15413) [Link]

Others seem to think it matters, myself included.

Class action against GitHub Copilot

Posted Nov 18, 2022 1:20 UTC (Fri) by jschrod (subscriber, #1646) [Link] (1 responses)

This is not drivel.

You are currently posting 20% of the comments on this article, without disclosing your affiliation.

I.e., you are a MS shill.

*plonk*

PS: I know a lot of folks who work at MS Research, and I'm grateful for the great work they are doing there. It is also obvious that there are some folks at MS who are doing very good work on the Linux kernel. But they are always open about their affiliation, even when commenting on articles. And lwn.net is not some obscure Web site with an obscure community -- we are here at the heart of the Linux community, which takes such issues seriously and discusses them openly.

PPS: For the record: I'm an owner of a company that is a MS partner, but my company has nothing to do with the Linux side of MS's business.

Class action against GitHub Copilot

Posted Dec 7, 2022 11:06 UTC (Wed) by nye (subscriber, #51576) [Link]

> I.e., you are a MS shill.
> *plonk*

This kind of ad hominem trolling has no place in LWN. Corbet, please for the love of god can we start seeing some temporary bans for repeat trolls like this?

Class action against GitHub Copilot

Posted Nov 11, 2022 17:52 UTC (Fri) by paulj (subscriber, #341) [Link] (3 responses)

Yes, bluca is very vocal on topics related to his (IIRC) employer MS. Hence why the association stuck in my mind.

It's almost impossible for such associations not to colour one's thinking at least a little. The enthusiasm shown for the debate by bluca is clear anyway.

Class action against GitHub Copilot

Posted Nov 11, 2022 19:47 UTC (Fri) by bluca (subscriber, #118303) [Link] (2 responses)

Except of course you say that as if you knew it for a fact, when it is very obviously factually incorrect, and one just has to go and see what my comments were for example on the whole secure-core-pc debacle to realise how nonsensical your proposition is. So you are misrepresenting reality either out of ignorance or malice. Either way, how about you stick to the facts of the matter and leave out the ad-hominems and doxxing?

Class action against GitHub Copilot

Posted Nov 14, 2022 10:29 UTC (Mon) by paulj (subscriber, #341) [Link]

Nothing in my comment was ad-hominem. It is widely recognised that people's associations - particularly any with tangible benefits - may at times colour their opinions of their associates, and even indirectly, friendly associates of their associates. It is a general human thing - not specific to you. It may or may not apply to you, but I - and others - would prefer to be aware of the association.

Nor did I "doxx" you. You acknowledged your employment with MS before here on LWN.

Class action against GitHub Copilot

Posted Nov 14, 2022 10:43 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

> Either way, how about you stick to the facts of the matter and leave out the ad-hominems and doxxing?

Ok, here are some facts: at the moment of writing

your comments amount to 22% of the comments

there are 40 usernames that commented, so you are clearly overrepresented

Class action against GitHub Copilot

Posted Nov 11, 2022 19:03 UTC (Fri) by ballombe (subscriber, #9523) [Link] (2 responses)

> I use it every day, and talk to other developers who use it every day.

Nice to know.

> Do you?
No, that is what I ask.

My issue is that there is no verifiable claim about the size of the AI model.
For all we know it could be petabytes in size. The model could just return a table of indices into a gigantic array of strings.

Github made everyone nervous by changing the TOS. They pay the price now.

Class action against GitHub Copilot

Posted Nov 11, 2022 19:34 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

If it was, it wouldn't work how it does. It very clearly learns and adapts to the surrounding context - which is what makes it truly amazing to use to deal with boilerplate and repeated local patterns. I am a very lazy person, and it saves me some keystrokes when for example adding function-level unit tests - it's an automated copy-paste-search-replace that adapts to the current content. You couldn't do that with an index that returns pre-existing snippets.

Besides, some folks are reimplementing their own server + model, using the same client interface, and yes it's an actual AI model, not an index: https://github.com/moyix/fauxpilot

Class action against GitHub Copilot

Posted Nov 13, 2022 11:49 UTC (Sun) by ballombe (subscriber, #9523) [Link]

> If it was, it wouldn't work how it does. It very clearly learns and adapts to the surrounding context - which is what makes it truly amazing to use to deal with boilerplate and repeated local patterns. I am a very lazy person, and it saves me some keystrokes when for example adding function-level unit tests - it's an automated copy-paste-search-replace that adapts to the current content. You couldn't do that with an index that returns pre-existing snippets.

There would be two stages: first locate the relevant code snippet in the database, and then the AI would post-process the snippet to adapt it to the context.

The second step is something that AIs are well suited to do, and nobody is claiming it violates copyright, except insofar as it obfuscates the first stage.

The whole concept of using function names to infer their implementations requires some kind of storage, from purely information-theoretic considerations, if only to conserve Kolmogorov complexity.

> Besides, some folks are reimplementing their own server + model, using the same client interface, and yes it's an actual AI model, not an index: https://github.com/moyix/fauxpilot

So even if Copilot is shut down, you can go about your work by using fauxpilot? Good!

Class action against GitHub Copilot

Posted Dec 16, 2022 7:45 UTC (Fri) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Using someone else's boilerplate code is still a copyright violation.

Class action against GitHub Copilot

Posted Dec 18, 2022 2:25 UTC (Sun) by anselm (subscriber, #2796) [Link]

Using someone else's boilerplate code is still a copyright violation.

That would depend on the exact circumstances. In many jurisdictions, code must exhibit a certain minimal degree of creativity to be eligible for copyright. If the boilerplate code in question is a very obvious or indeed the only sensible way of achieving a certain result in the programming language (and just a hassle to type out), then it may not fall under copyright because it is not sufficiently creative to warrant protection.

In such cases the main advantage of GitHub Copilot is probably that it is able to regurgitate the boilerplate with adapted variable names etc. But if all you're interested in is saving yourself some typing for boilerplate code that you use often, many programming editors have their own facilities to do this in a way that is a lot simpler and safer and doesn't involve referring to a ginormous proprietary search engine with a complete disregard of the legalities and etiquette of sharing code.

Class action against GitHub Copilot

Posted Nov 10, 2022 17:15 UTC (Thu) by MarcB (subscriber, #101804) [Link] (28 responses)

> There are explicit provisions for text and data mining for the purpose of AI being excepted from copyright laws, as it should be.

Why is this "as it should be"? Obviously, an AI system must not be allowed to perform any "copyright washing". Otherwise copyleft licenses would be completely undermined and any leaked, proprietary source code could be "freed" of its license.

The existing exemptions are for the purpose of the mining itself. The final result of this is then subject to a separate check. A research paper, or statistics or some abstract summary would obviously be allowed, but in this case, the output can be the literal input (minus copyright and license information). It is absolutely not clear if this is legal under any jurisdiction.

Class action against GitHub Copilot

Posted Nov 10, 2022 19:03 UTC (Thu) by bluca (subscriber, #118303) [Link] (27 responses)

> Why is this "as it should be"? Obviously, an AI system must not be allowed to perform any "copyright washing".

Because that's drivel. It is not how this works in the real world, it's completely fabricated clickbait.

Class action against GitHub Copilot

Posted Nov 10, 2022 20:41 UTC (Thu) by MarcB (subscriber, #101804) [Link] (12 responses)

> Because that's drivel. It is not how this works in the real world, it's completely fabricated clickbait.

There are examples of this happening, so it is obviously not fabricated. It might not be an issue for the users of Copilot, because most likely the risk of developers manually copying misattributed/unattributed code from the internet is much higher, but it certainly is an issue for Microsoft.

Even if the code generated by Copilot is not a verbatim copy of the input, it is clear that an automated transformation is not enough to free code from its original copyright. The questions then would be how it could be shown that the AI created the output "on its own", and who carries the burden of proof (the plaintiff would obviously be unable to do so, because they cannot access the model).

In any case, my main point was that the directives exemptions are insufficient to declare such a lawsuit nonsensical in the EU. The directive uses the following definition:
"(2) ‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations"

Does this cover the output of source code? Maybe, but not obviously.

Class action against GitHub Copilot

Posted Nov 10, 2022 21:02 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Even if the code generated by Copilot is not a verbatim copy of the input, it is clear, that an automated transformation is not enough to free code from its original copyright. The questions then would be, how it could be shown that the AI did create the output "on its own" and who carries the burden of this proof (the plaintiff would obviously unable to do so, because they cannot access the model).

I think it's clear - if the plaintiff can show that the Copilot code is identical to their own, and the defendant (Copilot) had access to their code, then it's up to Copilot to prove it's not a copy.

There's also the question of "who has access to the evidence" - if you possess evidence (or should possess evidence) and fail to produce it, you cannot challenge your opponents claims over it.

So yes it is a *major* headache for Microsoft.

Oh - and as for the guy who thought "everything should be licenced GPL" - there is ABSOLUTELY NO WAY Microsoft will do that. Just ask AT&T what happened when they stuck copyright notices on Unix ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 10, 2022 21:16 UTC (Thu) by bluca (subscriber, #118303) [Link] (9 responses)

> There are examples of to happening, so it is obviously not fabricated.

Of course it's fabricated, complainers go out of their way to get the tool to spit out what they were looking for and then go "ah-ha!", for clickbait effect, as if it meant something. Just like using one VHS with a copied movie does not mean that the VHS company is responsible for movie piracy. Or just like if google returns a search result with a torrent link for a music track it doesn't mean google is responsible for music piracy, and so on.

> In any case, my main point was that the directives exemptions are insufficient to declare such a lawsuit nonsensical in the EU. The directive uses the following definition:
"(2) ‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations"
> Does this cover the output of source code? Maybe, but not obviously.

Of course it covers it, that's exactly what copilot is used for: filling in patterns (boilerplate). Have you ever actually used it?

Class action against GitHub Copilot

Posted Nov 11, 2022 12:22 UTC (Fri) by gspr (guest, #91542) [Link] (8 responses)

> complainers go out of their way to get the tool to spit out what they were looking for and then go "ah-ha!"

does not imply

> it's fabricated

or

> for clickbait effect

> Just like using one VHS with a copied movie does not mean that the VHS company is responsible for movie piracy.

If playing back a new blank VHS tape in a particular way resulted in a blurry copy of said movie, then yeah, perhaps it would.

> Or just like if google returns a search result with a torrent link for a music track it doesn't mean google is responsible for music piracy, and so on.

I don't see how this is even comparable.

> Of course it covers it, that's exactly what copilot is used for: fills in patterns (boilerplate). Have you every actually used it?

I'm not sure it matters what it's used for by you and your peers, if it comes with an out-of-the-box ability to also do the other things. Again: this is *not* the same as "a disk drive can be used for piracy" – the difference is that Copilot already (possibly, that's the debate) contains within it the necessary information to produce the infringing material.

Class action against GitHub Copilot

Posted Nov 11, 2022 13:57 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

To choose an example at one extreme, A&M Records, Inc. v. Napster, Inc. established that while there were non-infringing uses of Napster, Napster's awareness that there were infringing uses of their technology product was enough to establish liability.

And it's worth noting in this context that Napster on its own was not infringing copyright - to infringe copyright, you needed two Napster users to actively make a decision to infringe: one to make the content available, and one to request a copy of infringing content. In other words, one user had to prompt Napster to spit out what they were looking for, and even then it wouldn't do that unless another user had unlawfully supplied that content to their local copy of Napster. In contrast, if Copilot's output infringes, it only needs the prompting user to make it infringe - which doesn't bode well for Microsoft if the court determines that Copilot's output is an infringement.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:39 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

Napster and its users did not have a right to ingest copyrighted materials. AI developers have a right, by law (see EU Copyright Directive), to take any source material and use it to build a model, as long as it is publicly available.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:07 UTC (Fri) by farnz (subscriber, #17727) [Link]

That's a misrepresentation both of the Napster case (where the court deemed that the user's right to ingest copyrighted materials into the system was irrelevant), and of the EU Copyright Directive, which merely says that ingesting publicly available material into your system is not copyright infringement on its own, and that the fact of such ingestion does not make the model infringing. This does not preclude a finding of infringement by the model or its output - it simply means that to prove infringement you can't rely on the training data including your copyrighted material, but instead have to show that the output is infringing.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:04 UTC (Fri) by Wol (subscriber, #4433) [Link] (4 responses)

> I'm not sure it matters what it's used for by you and your peers, if it comes with an out-of-the-box ability to also do the other things.

So you think that the sale of knives, hammers, screwdrivers etc should be banned? Because they come with an out-of-the-box ability to be used for murder. Come to that, maybe banning cars would be a very good idea, along with electricity, because they're big killers.

It's not the USE that matters. All tools have the *ability* to be mis-used, sometimes seriously. Ban cameras - they take porn pictures. But if the PRIMARY use is ABUSE, that's when the law should step in. Everything else has to rely on the courts and common sense.

In the UK, carrying offensive weapons in public is illegal. Yet many of my friends - quite legally - carry very sharp knives. Because they're "tools of the trade" for chef'ing.

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 14:19 UTC (Fri) by gspr (guest, #91542) [Link] (3 responses)

Sorry, my phrasing was bad. I did not mean to refer to what actions can be taken with the thing. What I'm trying to convey is that Copilot (perhaps!) contains (some representation of) the copyrighted material, and can *therefore* be used to reproduce the material.

A pen won't reproduce a copyrighted text without a human inputting missing data, even though it of course can be used to reproduce such a text with human assistance. Copilot, on the other hand, can (maybe!).

Class action against GitHub Copilot

Posted Nov 11, 2022 14:33 UTC (Fri) by bluca (subscriber, #118303) [Link] (2 responses)

It is *allowed* to ingest copyrighted materials for the models, by law. Hence it is not subject to the original license, among other things.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:38 UTC (Fri) by farnz (subscriber, #17727) [Link]

Your "hence" does not follow from your first statement.

The law says that the act of ingestion does not itself infringe copyright, nor does the fact of ingestion make the model infringe copyright automatically. It does not, however, say that the model is not subject to the original licence if it is found to be infringing copyright, nor does it say that the output of the model is not contributory infringement.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:42 UTC (Fri) by gspr (guest, #91542) [Link]

> It is *allowed* to ingest copyrighted materials for the models, by law. Hence it is not subject to the original license, among other things.

Yeah. But it's not allowed to *reproduce* that copyrighted material in a way incompatible with the original license. On one extreme, ingesting the material to produce, say, the parity of all the bits involved, is clearly not "reproduction" - and so is OK. On the other extreme, ingesting it and storing it perfectly in internal storage and spitting it back out on demand, clearly is "reproduction" - and surely not OK.

As I see it, the whole debate is about where between those extremes Copilot falls.

I'm not claiming to have the right answer. In fact, I don't even think I have _an_ answer. But I object to your sweeping statements about this seemingly being an easy and clear case.

Class action against GitHub Copilot

Posted Nov 14, 2022 9:28 UTC (Mon) by geert (subscriber, #98403) [Link]

> [...] aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations"

"patterns, trends, and correlations". For code, that would be reporting e.g. that 37% of all code that needs to sort something resort to quicksort, instead of reproducing a perfect copy of the source code of your newly-developed sorting algorithm released under the GPL.

Yeah, the "is not limited to" might be considered a loophole, but I guess anything that doesn't follow the spirit would be tossed out...

Class action against GitHub Copilot

Posted Nov 11, 2022 17:42 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (12 responses)

I eagerly await Microsoft's addition to Copilot's training set of the Windows and Office codebases if there's no such issue.

Class action against GitHub Copilot

Posted Nov 11, 2022 19:36 UTC (Fri) by bluca (subscriber, #118303) [Link] (11 responses)

Those are not hosted on Github (not even in the private section) but in a completely separate pre-existing git forge, so I'm afraid you'll be waiting for a long time

Class action against GitHub Copilot

Posted Nov 11, 2022 20:07 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (10 responses)

So? If there's no worry about contributory infringement, why not train on it? Why limit yourselves to public code and not any code Microsoft has access to?

Class action against GitHub Copilot

Posted Nov 11, 2022 20:44 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Danger of accidental leaks of secret credentials hard-coded in config files/code. Which should not happen, but often does in private code.

Please note that it's not a question of copyright.

Class action against GitHub Copilot

Posted Nov 11, 2022 21:39 UTC (Fri) by bluca (subscriber, #118303) [Link] (8 responses)

Because it's built by Github, on Github? It's also not scraping Gitlab instances and so on. Also, believe me, just trying to get access to those instances would be such a major PITA that anybody sane would just give up, leave and go fishing. There's nothing to gain anyway, so why bother?

Class action against GitHub Copilot

Posted Nov 12, 2022 3:12 UTC (Sat) by pabs (subscriber, #43278) [Link] (7 responses)

Software Heritage have managed to ingest lots of different software sources, I'm sure Microsoft could easily manage to do the same for training GitHub Copilot, or even just get a copy from SWH.

https://www.softwareheritage.org/

Class action against GitHub Copilot

Posted Nov 12, 2022 11:29 UTC (Sat) by bluca (subscriber, #118303) [Link] (6 responses)

Why bother? Accessing gigatons of data from your own infrastructure on-prem is cheap and easy. The same volume of data from third parties is going to cost an arm and a leg in bandwidth alone. Is there any evidence that spending all that money would significantly improve the quality of the models in any way?

Class action against GitHub Copilot

Posted Nov 12, 2022 15:39 UTC (Sat) by farnz (subscriber, #17727) [Link] (5 responses)

It would have been wise for Microsoft to train Copilot against their crown jewels (Office and Windows) for two reasons:

  1. It makes their assertion that Copilot does not infringe anyone's copyright easier to defend if they're saying that it's safe to train it against their crown jewel codebases. The fact that MS haven't done this means that there's room to argue that they won't do it ever because of the risk of accidentally publishing parts of Windows or Office source code, and not just because of the difficulty of moving data from one business unit to another.
  2. There's still a lot of people working on Windows API codebases - having Copilot trained on what are presumably the "best" codebases in the world (on average) would help those people out.

Class action against GitHub Copilot

Posted Nov 12, 2022 18:31 UTC (Sat) by bluca (subscriber, #118303) [Link] (3 responses)

1) Nah, naysayers will never, ever be happy, it would help in no way whatsoever while costing a boatload of money and effort
2) [citation needed]

Class action against GitHub Copilot

Posted Nov 12, 2022 20:33 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (1 responses)

Sure, 100% satisfaction is not feasible for anything, but I think it'd make a *lot* of the skepticism subside (including mine). Why would it cost any more than Copilot already cost? Or is ingesting new code not done anymore and Copilot "frozen"? If it isn't frozen, what's the marginal cost of a few hundred million lines on top of the billions already ingested?

How the hell do you think anyone would get a citation for that? Are you saying that Microsoft doesn't have useful Win32 API usage to train on for Windows developers? Or are you saying that even Microsoft doesn't use it well enough to bother training anything on it?

Class action against GitHub Copilot

Posted Nov 12, 2022 22:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Modern neural networks are often trained in stages, so just ingesting an additional corpus of code might indeed require retraining everything. But they'll have to do it eventually anyway.

Class action against GitHub Copilot

Posted Nov 13, 2022 16:56 UTC (Sun) by farnz (subscriber, #17727) [Link]

For 1, it's not about the naysayers, it's about what you can say in court to convince a judge (or jury in some US civil cases) that the naysayers are overreacting. The statement "we trained this against our crown jewels, the Windows and Office codebases, because we are completely certain that its output cannot contain enough of our original code to infringe copyright" is a very convincing statement to a judge or jury - and even if the court finds that Copilot engages in contributory infringement of people's copyright (having seen a demo of it doing so), the court is likely to be lenient on Microsoft as a result - the fact of having trained it against their core business codebases is helpful evidence that any infringement by Copilot's output is unintentional and something Microsoft would fix, because it puts their core business at risk.

And for 2, which part do you want a citation on? That Office and Windows are a big Win32 codebase written by good developers? That people still write code for Win32? That there's boilerplate in Win32 that would be simplified with an AI assistant helping you write the code?

Class action against GitHub Copilot

Posted Nov 12, 2022 22:28 UTC (Sat) by anselm (subscriber, #2796) [Link]

OTOH, it could be the case that the source code for Windows and Office is so atrociously horrible that they don't want to contaminate their ML model with it -- especially if there's a chance that recognisable bits of it could leak out for everyone to see.

Class action against GitHub Copilot

Posted Nov 14, 2022 10:44 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

Why didn't Microsoft train Copilot on the Windows 11 and Office code?

Class action against GitHub Copilot

Posted Nov 10, 2022 20:04 UTC (Thu) by lkundrak (subscriber, #43452) [Link] (8 responses)

One thing you forgot to say: "Company that's the defendant in this case pays my bills."

You're welcome.

Class action against GitHub Copilot

Posted Nov 11, 2022 10:25 UTC (Fri) by bluca (subscriber, #118303) [Link] (7 responses)

It wasn't forgotten; what value do you think that adds, precisely?

Class action against GitHub Copilot

Posted Nov 11, 2022 11:22 UTC (Fri) by gspr (guest, #91542) [Link] (6 responses)

Courtesy and transparency, to name two.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:48 UTC (Fri) by bluca (subscriber, #118303) [Link] (5 responses)

Neither are relevant. I see that you didn't dox yourself either, so where are yours?

Class action against GitHub Copilot

Posted Nov 11, 2022 12:16 UTC (Fri) by gspr (guest, #91542) [Link] (4 responses)

I have no relationships with either of the parties in this case.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:28 UTC (Fri) by bluca (subscriber, #118303) [Link] (3 responses)

So you say. Post your tax returns, including any buying/selling of shares, going back 10 years. You wouldn't want to appear not to be "transparent", would you now?

Class action against GitHub Copilot

Posted Nov 11, 2022 14:59 UTC (Fri) by lkundrak (subscriber, #43452) [Link] (2 responses)

I suggest you stop commenting for a bit and take some time to familiarize yourself with the level of civility that's usual in LWN comment sections. It could help you make your point without making a complete fool of yourself. You don't seem to have noticed you're doing that, and it's just painful to watch.

Class action against GitHub Copilot

Posted Nov 11, 2022 16:30 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

It takes a remarkable dose of creativity to start talking about "civility" when your only contributions so far have been ad-hominems and doxxing

Civility

Posted Nov 11, 2022 16:32 UTC (Fri) by corbet (editor, #1) [Link]

Speaking of civility, I think that this branch of the conversation has gone far enough - and beyond. Can we retire it here please?

Class action against GitHub Copilot

Posted Nov 12, 2022 21:52 UTC (Sat) by vetse (subscriber, #143022) [Link] (3 responses)

I don't get exactly why some folks take it as some axiomatic truth that companies should be able to just vacuum up any and all public data of any kind and use it for training their ML models. I'm generally for information being freely available, but most AI-related projects I've seen (Copilot included) have left a very sour taste in my mouth that makes me think the creators aren't at all considering any of the consequences of what they've made.

Class action against GitHub Copilot

Posted Nov 14, 2022 10:53 UTC (Mon) by kleptog (subscriber, #1183) [Link] (2 responses)

This gets to the heart of what the recent EU Copyright Directive is trying to achieve in this area. Large companies with lots of money are going to hoover up anything publicly available anyway. If the legal status isn't totally clear, they can just throw money at the problem. On the other hand, researchers and students were getting threatened with lawsuits when they were training models on data freely accessible on the internet. Additionally, institutions like the Internet Archive and public libraries trying to archive for the future were also being threatened.

So the likely end result was that big companies with lots of money get to make new models on lots of data, while start-ups, researchers and students who are working on the next generation of technologies in this area are stymied by possible lawsuits. This was deemed undesirable.

The chosen solution is to allow model training on any publicly available data for research and training purposes. And organisations that publish online can opt-out (in a machine readable fashion) from being used in machine learning. It doesn't say anything about the copyright status of the output of the models.
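
To make "machine readable" concrete: the directive itself does not mandate one single format for the opt-out, so the Python sketch below is purely hypothetical - the /tdm.txt file name and the "tdm: disallow" line are invented for illustration, not an actual standard.

import urllib.request
import urllib.error

def tdm_opt_out(origin: str) -> bool:
    """Return True if the site publishes an (invented) /tdm.txt policy file
    containing a "tdm: disallow" line, i.e. it has opted out of mining."""
    try:
        with urllib.request.urlopen(f"{origin}/tdm.txt", timeout=5) as resp:
            policy = resp.read().decode("utf-8", errors="ignore")
    except (urllib.error.URLError, OSError):
        return False  # no policy file published: no opt-out has been expressed
    return any(line.strip().lower() == "tdm: disallow" for line in policy.splitlines())

# Usage: a crawler would skip hosts that have opted out before ingesting anything.
if tdm_opt_out("https://example.org"):
    print("opted out of text and data mining; skipping")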

Of course, it's only a directive, so you're relying on the member states to properly implement this. But it's better than it was.

Ref: https://eur-lex.europa.eu/eli/dir/2019/790/oj

Class action against GitHub Copilot

Posted Nov 18, 2022 15:44 UTC (Fri) by nim-nim (subscriber, #34454) [Link] (1 responses)

Also, the EU does not follow common law. If the agreed upon consensus proves unworkable (for example, if some foreign mega-corporation used its quasi-monopoly in code hosting, to appropriate other people’s work using a model it was the only one in position to create) the law can always be changed.

It is *sooo* refreshing to live in a legal system where past mistakes are not set in stone.

Class action against GitHub Copilot

Posted Nov 18, 2022 19:50 UTC (Fri) by Wol (subscriber, #4433) [Link]

Well, Parliament can always overturn Common Law.

Common Law is - certainly in its origin - just people asking judges to settle disputes. It just solidifies to stone as in "this is what seems right".

And then if it seems appropriate Parliament can come along, pass Statute Law, and toss the whole Common Law structure into the bin.

Although if the Judges think it unfair they can gut the Statute - it does happen ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 14, 2022 10:23 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

As on other occasions, it would be nice if you disclosed that you work for Microsoft.

> I am so glad to live in Europe, where the legislation is way ahead of the US on this matter and makes clear that such a lawsuit is absolutely bogus and nonsensical.

I disagree.

What was ruled OK was not generating output, and certainly not verbatim copy-and-paste. This is different and requires a separate ruling.

> The demand for money makes it even more obvious this is a malicious effort

There has to be a demand for money, or they would be saying there was no harm done and microsoft should continue to do whatever it's doing. But the fact that there is a great number of people complaining about this online tells us that they are feeling wronged… and a court might decide just how wronged they all are.

Class action against GitHub Copilot

Posted Nov 10, 2022 16:50 UTC (Thu) by bfields (subscriber, #19510) [Link] (8 responses)

"In recent decades, though, the class-action lawsuit seems to have become mostly a vehicle for extorting money from a business for the enrichment of lawyers."

I don't love this framing.

It seems to me there are two potential goals here: compensating victims and making sure crime doesn't pay. In cases where the damage was a small amount multiplied by a large number of people, the latter may be more important.

I mean, if someone scams me out of $5, it's rarely going to be worth my while to pursue them for the $5, but if someone else wants to, I'd happily throw that $5 into a pot of money for them to use. And when I've gotten one of those mailings, that's been my attitude--"hey, doesn't sound like it's worth even figuring out whether this applies to me, but it's probably good someone went after them."

In a case where something costs a lot of people a little money each, what should we do? No doubt the system could be more efficient, but I don't see how you're going to completely avoid the expense of sorting out exactly what happened and who's at fault, all to reasonable degree of certainty--something that could inherently need a lot of legwork by a lot of people, some with pretty specialized skills. And sometimes, in the end, you'll fail to make the case, so you'll have to build that into your costs too.

We could let that kind of thing slide. I think that could have long-term consequences we wouldn't like.

We could make it a government function and pay for it with our taxes. Obviously, that's already what we do in some cases.

Or we can have a system like this that allows law firms to take on the risk and expense in exchange for taking a cut when they win.

I honestly don't know what's best, but I don't think the latter is obviously "abusive" just because the compensation to individual victims is sometimes trivial.

Anyway, sorry, that's all a bit of a derail, as I'm not really convinced of the merits of this particular case. (Maybe I haven't thought it through.)

Class action against GitHub Copilot

Posted Nov 10, 2022 17:55 UTC (Thu) by hkario (subscriber, #94864) [Link] (5 responses)

1. because the total amount is hardly large for a typical corporation that's a target of a class action (a $5 fine for doing something is not a fine, it's just the cost of doing something).
2. getting $5 in "compensation" is a passive-aggressive corporate version of a non-apology: "sorry if you feel offended"

Class action against GitHub Copilot

Posted Nov 10, 2022 20:04 UTC (Thu) by bfields (subscriber, #19510) [Link] (4 responses)

> 1. because the total amount is hardly large for a typical corporation that's a target of a class action (a $5 fine for doing something is not a fine, it's just the cost of doing something).

That's OK, it doesn't necessarily have to be a threat to the corporation's existence to be useful, it just has to raise the expected cost of the undesirable behavior sufficiently.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:20 UTC (Fri) by hkario (subscriber, #94864) [Link] (3 responses)

Yes, the point of a fine isn't to destroy a company. But when the fine for the activity is lower than the income from it (often by an order of magnitude), how exactly is that a discouragement for the activity?

Class action against GitHub Copilot

Posted Nov 11, 2022 12:16 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

Which is why we have the Company Secretary, who in law is the personal embodiment of the company. No Secretary, no company.

Unfortunately, we are completely useless at punishing the secretary for the misdeeds of the company. If the Secretary knows they can (and WILL) be fined or imprisoned personally for company misdeeds, any company that starts "playing dirty" will soon find itself unable to recruit a secretary, and subject to intense scrutiny or being broken up.

Cheers,
Wol

Company secretaries are personally liable?

Posted Nov 11, 2022 15:12 UTC (Fri) by KJ7RRV (subscriber, #153595) [Link] (1 responses)

> If the Secretary knows they can (and WILL) be fined or imprisoned personally for company misdeeds,

Aren't most types of companies limited liability entities?

Company secretaries are personally liable?

Posted Nov 11, 2022 15:27 UTC (Fri) by Wol (subscriber, #4433) [Link]

What does this have to do with the price of tea in China?

I think I know what you're getting at - the Board of Directors are (allegedly) protected against being responsible for things going wrong on their watch. But if nobody can be held legally responsible for the company breaking the law (say for example ignoring health and safety, and people getting killed), then things can turn nasty, and they regularly do. Usually on a smaller scale than that, admittedly.

But the post of Company Secretary (a *mandatory* post - which is why you'll find that most organisations technically are not allowed to function without a Secretary) is mandated by law to be the official record keeper and legal advisor. As such they can be held personally liable for any wrongdoing on their watch they should have known about, or were told about. And (for companies over a certain size, the last figure I knew was £3M) they have to be legally qualified in some way.

If we started to make company secretaries realise this was actually a serious role, with serious liabilities, the standard of corporate governance would probably rise pretty quickly!

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 10, 2022 18:57 UTC (Thu) by unBrice (subscriber, #72229) [Link]

> I honestly don't know what's best,

Neither do I, but I thought you may be interested in hearing about a third alternative to state-owned agencies and predatory law firms. In France, class actions are restricted to non-profit organizations that have to go through a vetting process (e.g. there are 15 such vetted non-profits defending customers). Additionally, they are only allowed to sue for specific offenses (including discrimination and infringements of consumer protection laws). I suspect the way damage compensation works is also very different, but I am not knowledgeable on that.

Class action against GitHub Copilot

Posted Nov 18, 2022 15:53 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

The point of class action is not to compensate victims but to incentivize private law firms into making sure the law is applied. Because it turns out states are very bad at prosecuting interests that make lavish campaign donations.

It would be more honest to scrap the compensation altogether but lawmakers are reluctant to admit they don’t trust themselves to prosecute big money fairly.

Class action against GitHub Copilot

Posted Nov 10, 2022 16:56 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> Plaintiffs estimate that statutory damages for Defendants’ direct violations of DMCA Section 1202 alone will exceed $9,000,000,000. That figure represents minimum statutory damages ($2,500) incurred three times for each of the 1.2 million Copilot users Microsoft reported in June 2022.

They're excluding all the foreign contributors to github ... they're excluding all the contributors who didn't upload their own code ... surely they should exclude all the alleged violators (aka Copilot users) who are not subject to US law?

Smiles :-)

Cheers,
Wol

Class action against GitHub Copilot

Posted Dec 10, 2022 23:53 UTC (Sat) by sammythesnake (guest, #17693) [Link]

I assume the argument they would make is that the users (within or without the US) aren't a material participant in the infringing actions of copilot et al, but rather the recipients of the resulting infringing materials.

(Based on my attempt to read the minds of the plaintiffs and with no statement of whether I'd agree)

Class action against GitHub Copilot

Posted Nov 10, 2022 17:47 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

> That said, there are almost certainly ways to get more complex — and useful — output from Copilot that might be considered to be a copyright violation. There are a lot of interesting questions that need to be answered regarding the intersection of copyright and machine-learning systems that go far beyond free software. Systems that produce images or prose, for example, may be subject to many of the same concerns. It would be good for everybody involved if some sort of consensus could emerge on how copyright should apply to such systems.

It must be emphasized that, at least for the purposes of this lawsuit, that doesn't matter. Courts will look at the specific examples you put before them. If you put a bad example before the court, you're (maybe) going to get a bad outcome. Nothing in the US copyright law really allows you to make this sort of "well, there's infringement in here somewhere" argument, except perhaps for contributory copyright infringement (i.e. "what they sued Napster over"). But you probably can't win that one either, because of the Betamax decision (Copilot has substantial non-infringing uses, since some of its output will not be similar to any of its training examples - and if you don't have substantial similarity, you don't have infringement under US law).

That said, the isEven example looks awfully concerning to me, and even if this particular case goes nowhere, Microsoft and GitHub need to get a lot better at eliminating such outputs.

Class action against GitHub Copilot

Posted Nov 10, 2022 18:09 UTC (Thu) by flussence (guest, #85566) [Link] (2 responses)

There's no possible bad outcome to this. Either Microsoft loses (open season on makers of copyright-laundering machine learning software) or the plaintiffs lose (open season on anyone whose copyrighted work can be laundered through machine learning software - hello RIAA, Disney)

Class action against GitHub Copilot

Posted Nov 10, 2022 21:35 UTC (Thu) by mpr22 (subscriber, #60784) [Link] (1 responses)

This seems to presuppose that the ultimate decision will be based on a broadly applicable point of law.

Class action against GitHub Copilot

Posted Nov 13, 2022 9:01 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

Indeed. By far the more likely outcome is some complicated application of the Abstraction-Filtration-Comparison test to the particular examples which were given in the litigation, and an explicit statement that there's no precedent either way on AI code in general.

In other words: The court is not going to make some sweeping statement about AI code. The court is going to ignore the AI, look at the examples given, and decide those examples are infringing. Then the court is going to write a clear warning that the ruling might not generalize to other AI code. We're going to end up with a massive gray area and years of legal uncertainty. Various tech companies will then add it to their long list of "parts of US copyright law that we want Congress to fix."

Class action against GitHub Copilot

Posted Nov 10, 2022 18:29 UTC (Thu) by magnus (subscriber, #34778) [Link] (2 responses)

It would be interesting to know roughly the compressed size of the input data compared to the compressed size of the resulting weights of the Copilot model. If the size of the weights is about the same then most of the input could be still in there unmodified in the model to be spat out given the right input, but if the model is way smaller then information has been "crunched" together.

Class action against GitHub Copilot

Posted Nov 10, 2022 18:45 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

Certainly the *image* models are way smaller than their training inputs, to the point where any "compression" explanation would be absurd (you would have to compress each image file down to a few bytes). We know this because you can download Stable Diffusion and run it on your own system, and we also know how many images are/were in the LAION set, so you can just do the math.
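
To make "just do the math" concrete, here is a rough back-of-the-envelope sketch in Python; every figure in it is a ballpark assumption rather than an exact count, but the orders of magnitude are what matter:

# Back-of-the-envelope check: could the model plausibly "store" its training
# images? All numbers below are rough assumptions, not exact published values.
laion_images = 2_000_000_000      # assume ~2e9 images in the training subset
avg_image_bytes = 100 * 1024      # assume ~100 KiB per already-compressed JPEG
model_params = 1_000_000_000      # assume ~1e9 parameters for the whole pipeline
bytes_per_param = 2               # fp16 weights

dataset_bytes = laion_images * avg_image_bytes
model_bytes = model_params * bytes_per_param

print(f"dataset: ~{dataset_bytes / 2**40:.0f} TiB")
print(f"model:   ~{model_bytes / 2**30:.1f} GiB")
print(f"storage budget: ~{model_bytes / laion_images:.1f} bytes per training image")

With those assumptions the weights leave roughly one byte per training image, which is why a straightforward "the model is a compressed copy of the dataset" explanation doesn't hold up for the image models.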

I don't know as much about the text models. You can't download any of those (yet), but there might be a blog post or something where one of these companies brags about the size of their model.

Class action against GitHub Copilot

Posted Nov 16, 2022 6:20 UTC (Wed) by jezuch (subscriber, #52988) [Link]

In fact, encoding tons of images in a neural network *is* a form of compression. There's massive amounts of redundancy you can remove this way.

> you would have to compress each image file down to a few bytes

"Mona Lisa" is just a few bytes and is enough to reproduce the original :)

> but there might be a blog post or something where one of these companies brags about the size of their model.

I don't know the details, but I hear it's in the range of billions of neurons.

A good class action?

Posted Nov 10, 2022 18:35 UTC (Thu) by nickodell (subscriber, #125165) [Link] (17 responses)

>Class-action lawsuits have been used to, for example, get compensation for victims of asbestos exposure; they can be used to address massive malfeasance involving a lot of people. In recent decades, though, the class-action lawsuit seems to have become mostly a vehicle for extorting money from a business for the enrichment of lawyers.
Have you ever read the blog Lowering the Bar? He writes about the really stupid ones. My favorite is the multiple class action suits alleging that Froot Loops have no fruit in them.

At the same time, I don't think this is a stupid lawsuit. I don't think it particularly has much to do with free software, either. Free software developers are just the first group ornery enough to file a lawsuit about it. The central issue is, "Is it fair use to train an AI on copyrighted data?" No US court has answered this question. This lack of clarity causes unequal pay.

For example, consider how large and small copyright holders get treated. DALLE-2 is trained on images from the internet, regardless of copyright status. OpenAI announced that they had a deal to license all of Shutterstock's catalogue of stock photos. Why did they do that, if OpenAI had the legal right to train an AI on those photos either way? Because Shutterstock has lawyers, and if OpenAI went to court and lost, it would set a terrible precedent for them. This is how it will go, in general. Big copyright holders, with a willingness to fund a lawsuit for years, will get compensated. Small copyright holders will get nothing. If there were clarity, these two groups would get treated equally.

In addition to creators getting shafted, this lack of clarity has a chilling effect on AI research too. No business would make GPT-3 essential to their business if it turned out, years down the line, that using GPT-3 obligated them to pay licensing to everybody who had ever written something on the Internet. Until this is solved, investment in AI will be slowed down.

Some commenters point out that the EU allows text/data mining for the purpose of AI. This is true, but that exception applies only for non-commercial use, which Copilot is clearly not.

A good class action?

Posted Nov 10, 2022 19:07 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

> Some commenters point out that the EU allows text/data mining for the purpose of AI. This is true, but that exception applies only for non-commercial use, which Copilot is clearly not.

Wrong. The exception applies to _everybody_. Non-commercial users have additional provisions, such as not being obliged to offer an explicit opt-out (a la robots.txt). But the copyright exception is exactly the same.

A good class action?

Posted Nov 11, 2022 15:15 UTC (Fri) by KJ7RRV (subscriber, #153595) [Link] (2 responses)

How is European copyright law relevant? This suit was filed by American lawyers representing American developers against American companies in an American court.

A good class action?

Posted Nov 11, 2022 19:59 UTC (Fri) by bluca (subscriber, #118303) [Link]

Sure, but I am European, so it is relevant to me. Also the EU is a world-wide rule-setter, so it is very important to note what it does, as it indirectly affects other jurisdictions too in many regulatory domains.

A good class action?

Posted Nov 17, 2022 19:52 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

This is a novel area of law. The lawsuit may be in the States now, but the issue as a whole is going to be everywhere within a year or two.

A good class action?

Posted Nov 11, 2022 8:30 UTC (Fri) by nilsmeyer (guest, #122604) [Link] (12 responses)

> Have you ever read the blog Lowering the Bar? He writes about the really stupid ones. My favorite is the multiple class action suits alleging that Froot Loops have no fruit in them.

Do they have "Froot" in them? In Germany Almond Milk can't be sold as milk because it doesn't come from a mammal - apparently this would confuse customers or so the judges thought. However there are no punitive damages and the result of the lawsuit is to change the name, not shake down the competition for money.

I'm not sure which system is better. For example, over here, if a company suffers a data breach due to negligence the government can fine them and keep the money, but the people affected won't see a cent of the money. If they sue they have to prove the monetary damages, and you don't get compensated for the time lost talking to lawyers or sitting in court.

A good class action?

Posted Nov 11, 2022 12:04 UTC (Fri) by Wol (subscriber, #4433) [Link]

> In Germany Almond Milk can't be sold as milk because it doesn't come from a mammal - apparently this would confuse customers or so the judges thought.

Fortunately, when they tried to ban Plum Duff, sanity prevailed. The word "plum" (in this context) does not refer to a fruit. I really hope they don't decide to rename mincemeat!

Cheers,
Wol

A good class action?

Posted Nov 11, 2022 15:07 UTC (Fri) by nickodell (subscriber, #125165) [Link]

In the US, there is also a non-court system for food labeling disputes, which decides the vast majority of these issues. The FDA establishes standards for how food can be labeled. For example, if you want to call a product "ice cream," it must fulfill a long list of requirements.

Incidentally, there is also a push by dairy farmers to get the FDA to define milk as cow's milk. https://www.wired.com/story/the-fda-may-nix-the-word-milk...

A good class action?

Posted Nov 14, 2022 12:06 UTC (Mon) by LtWorf (subscriber, #124958) [Link] (9 responses)

> In Germany, almond milk can't be sold as milk because it doesn't come from a mammal

That's because vegans keep killing their children and then say "well I gave milk"

https://www.bbc.com/news/world-europe-40274493

https://www.newsweek.com/parents-convicted-feeding-baby-v...

A good class action?

Posted Nov 14, 2022 14:50 UTC (Mon) by kleptog (subscriber, #1183) [Link] (2 responses)

It may have changed the packaging, but you can't change the language. Everyone still calls it soy/almond milk.

Makes no difference though. Feeding infants exclusively cow's milk is also bad. There's a reason baby formula exists.

You can't fix stupid though. We live in an age where almost anything you'd want to know is at your fingertips, and stupid things still happen.

A good class action?

Posted Nov 14, 2022 22:24 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

One man's stupid is another man's clever. And humans suffer from "confirmation bias". As far as the species goes, it's actually a very good survival trait.

Picking a medical rather than social example, why do people suffer from Sickle Cell? In most of the world, it's a pure handicap. But in much of the tropics, where malaria is endemic, people who suffer from mild Sickle Cell have a distinct survival advantage.

And in a world where "do something" usually trumps "do nothing", confirmation bias means you do that something a lot faster ...

Cheers,
Wol

A good class action?

Posted Nov 16, 2022 21:29 UTC (Wed) by LtWorf (subscriber, #124958) [Link]

> One man's stupid is another man's clever.

Killing children isn't clever. Stop it please.

A good class action?

Posted Nov 15, 2022 10:39 UTC (Tue) by paulj (subscriber, #341) [Link] (5 responses)

Wow. Just wow. The stupidity of some people - tragic for the parents, but out of their own idiocy. I might have been a little sceptical of regulatory bodies trying to stop vegetable liquids being sold as "<whatever> milk", in the "maybe there's more important things" sense, until your follow up comment.

Baby mammals need mammal milk, ideally from a healthy, well-fed mother of their own species.

A good class action?

Posted Nov 15, 2022 15:47 UTC (Tue) by Wol (subscriber, #4433) [Link] (4 responses)

Don't forget, though, that from reading the article it appears the mother was "dry" - milk from the mother was not an option.

They also self-diagnosed the baby as lactose-intolerant. Stupid thing to do, but ...

If you think someone else is being stupid, how can you be sure it's not your own ignorance or stupidity that leads you to that conclusion? The parents' actions make perfect sense inside their own world and belief system. Sounds to me like they somehow slipped through the ante-natal safety net ...

Cheers,
Wol

A good class action?

Posted Nov 16, 2022 10:44 UTC (Wed) by paulj (subscriber, #341) [Link] (3 responses)

Given the parents seem to have some kind of ideology against mammalian milk, I doubt there was any real attempt to have the mother produce milk at the beginning. It takes at least a few days to get going, and it's very easy for (new) mothers (or worried others around them) to tell them the baby must be hungry and therefore the baby must be given something else. And when the baby is given something else, that reduces or eliminates demand for the mother's milk, the suckling time needed to stimulate production particularly and... so.. no milk is produced.

A good class action?

Posted Nov 16, 2022 14:10 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

Hence my comment about lack of ante-natal care. Not that I know much about it (my daughters are step-), but yes I understood that when a new-born cries it's more for a cuddle (that leads to milk production), and not because they are hungry.

(I meant to say post-natal, but either way that is the sort of thing the mother-to-be should have been taught.)

Cheers,
Wol

A good class action?

Posted Nov 17, 2022 10:23 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

It's more than just cuddling. The baby needs to actually suckle, regularly and for good periods, during the initial few days where the mother will still not be producing milk to any significant degree (other than the initial build up of colostrum). It is the suckling action that stimulates production. But it can be hard for a new mother to watch a baby eager to suckle - and conclude the baby must be hungry, and therefore there is something wrong with her milk production. But... it just takes a day or 3, or 4 to really get going (and a baby brought to term is born with extra fat precisely to get through that time).

A good class action?

Posted Nov 17, 2022 14:16 UTC (Thu) by corbet (editor, #1) [Link]

...and all of this has been pretty far off-topic for some time now. Perhaps we could conclude this sub-thread here?

Class action against GitHub Copilot

Posted Nov 11, 2022 9:50 UTC (Fri) by vegard (subscriber, #52330) [Link] (12 responses)

I am surprised there is nothing about combining works of different licenses and redistributing that (under a permissive license), as that seems the main problem to me.

They should have trained different models corresponding to different levels of license compatibility; that way, you could ensure that each model only produces code that can be reasonably said to fall under a specific license.
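
A minimal sketch of what that partitioning might look like (the SPDX-identifier-to-bucket mapping is purely illustrative, certainly not legal advice):

# Illustrative sketch: partition a training corpus into buckets of mutually
# compatible licenses, so that each per-bucket model only ever sees inputs it
# could plausibly emit under that bucket's terms. The mapping below is made
# up for illustration.
LICENSE_BUCKETS = {
    "MIT": "permissive",
    "BSD-2-Clause": "permissive",
    "Apache-2.0": "permissive",
    "MPL-2.0": "weak-copyleft",
    "LGPL-2.1-or-later": "weak-copyleft",
    "GPL-2.0-or-later": "gpl2",
    "GPL-3.0-or-later": "gpl3",
}

def partition_corpus(repos):
    """Group (repo_name, spdx_id) pairs into per-bucket training sets."""
    buckets = {}
    for name, spdx in repos:
        bucket = LICENSE_BUCKETS.get(spdx)
        if bucket is None:
            continue  # unknown or missing license: exclude from training entirely
        buckets.setdefault(bucket, []).append(name)
    return buckets

print(partition_corpus([("tiny-utils", "MIT"), ("kernel-fork", "GPL-2.0-or-later")]))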

I would also argue that specific models are just a different representation of what is fundamentally the same code. If you encoded the whole Linux kernel source code in a PNG (losslessly) that does not fundamentally change its license or how it can be used.

(The above is all my personal views/opinion.)

Class action against GitHub Copilot

Posted Nov 11, 2022 10:09 UTC (Fri) by bluca (subscriber, #118303) [Link] (11 responses)

> I am surprised there is nothing about combining works of different licenses and redistributing that (under a permissive license), as that seems the main problem to me.

Because the original license is irrelevant, as the model is not built using sources distributed under their licenses, but under specific exceptions in copyright law that allow text and data mining - see the recent EU directive on copyright. In the US it is murkier because there's no corresponding exception yet, so it's all done under the fair use exception, which is a pain as it has to be defended in court every time. Hopefully the legislators over there will catch up soon and fix this.

Class action against GitHub Copilot

Posted Nov 11, 2022 10:17 UTC (Fri) by bluca (subscriber, #118303) [Link] (5 responses)

Think of it this way: imagine you distribute some software under an open source license. Someone approaches you and offers to pay you to use that work under different terms. Copyright law allows you to do that - regardless of how the work is distributed elsewhere, it doesn't affect the buyer, because you gave them permission to use it differently. This is a common use case and used widely, as I'm sure you are aware.
The text and data mining exceptions work in a similar way, but instead of getting permission from the author, permission is given by the law, which trumps any original license, and the author's wishes too. The only right you have is to ask commercial entities for an opt-out. Non-profits don't even have to provide an opt-out. The only restriction AI builders have is that the source must be publicly available (i.e.: if someone breaks into a computer system and steals sources or data, those are not allowed to be used for this or any other purpose).

Class action against GitHub Copilot

Posted Nov 11, 2022 10:49 UTC (Fri) by farnz (subscriber, #17727) [Link] (4 responses)

I think you're reading the exception in EU law over-broadly, and it doesn't say quite what you're claiming.

The exception for text and data mining says that you are not infringing copyright solely by virtue of feeding material into your machine learning system, and that the resulting model is not itself automatically a derived work of the inputs. It does not say that the output of your machine learning system cannot infringe copyright. As far as I can tell, the intent behind this exception is to allow your system to engage in the sorts of copying that we already say are outside the scope of copyright when a human does it - for example, having read about io_uring, a human might build their next system around a submission and completion queue pair, and this is not a copy in the sense of copyright law.

This means that a court could rule legitimately that a given output from the system is sufficient of a copy to be a copyright infringement by the system, and that the use of the system is thus contributory infringement whenever it produces a literal copy of a protected part of its input.

This, in turn, would bring use of systems like GitHub Copilot into line with employing a human to do the same job: if, as a result of a prompt, I write out precisely the code that a previous employer used (complete with comments - whether I copied it from a notebook, or kept a copy of a past employer's source code), then that is copyright infringement. If, on the other hand, I write code that's similar in structure simply because there are only a few ways to loop over all items in a container, that's not copyright infringement.

Assuming the US courts apply this sort of reasoning, then the question before them is whether a human writing the same code with the same prompting would be infringing copyright or not - if you substitute "a Microsoft employee wrote this for a Microsoft customer to use" for "GitHub Copilot wrote this for a Microsoft customer to use", do you still see infringement or not?

Class action against GitHub Copilot

Posted Nov 11, 2022 12:10 UTC (Fri) by Wol (subscriber, #4433) [Link]

I was thinking along the same lines. Simply put, the exception allows the use of online corpora as input. It does NOT allow distribution of the resulting output.

So the exception doesn't cover you for using the result ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 14:49 UTC (Fri) by bluca (subscriber, #118303) [Link] (2 responses)

> The exception for text and data mining says that you are not infringing copyright solely by virtue of feeding material into your machine learning system, and that the resulting model is not itself automatically a derived work of the inputs.

Exactly, and there are many commentators that completely miss this, and assume the opposite, hence the need to clarify it.

> It does not say that the output of your machine learning system cannot infringe copyright.

Sure, but it implies that it does not automatically do so either. Then the onus is on the complainers to show, first of all, that these artificially induced snippets are copyrightable in the first place, and after that to show that intentionally inducing the tool to reproduce them - which requires knowing in advance what they look like and what keywords and surrounding setting to prepare in order to achieve that result - means that it's the tool that is at fault rather than the user.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:25 UTC (Fri) by KJ7RRV (subscriber, #153595) [Link]

Would an AI that could auto fill the gaps in this be infringing the copyrights in your comment?

"> _____ _____ _____ _____ and _____ _____ _____ _____ you _____ _____ _____ _____ solely _____ _____ _____ _____ material _____ _____ _____ _____ system, _____ _____ _____ _____ model _____ _____ _____ _____ a _____ _____ _____ _____ inputs. _____ _____ _____ _____ many _____ _____ _____ _____ this, _____ _____ _____ _____ hence _____ _____ _____ _____ it. _____ _____ _____ _____ say _____ _____ _____ _____ your _____ _____ _____ _____ infringe _____ _____ _____ _____ implies _____ _____ _____ _____ automatically _____ _____ _____ _____ the _____ _____ _____ _____ complainers _____ _____ _____ _____ of _____ _____ _____ _____ snippets _____ _____ _____ _____ first _____ _____ _____ _____ to _____ _____ _____ _____ the _____ _____ _____ _____ which _____ _____ _____ _____ what _____ _____ _____ _____ what _____ _____ _____ _____ to _____ _____ _____ _____ achieve _____ _____ _____ _____ it's _____ _____ _____ _____ at _____ _____ _____ _____ user."

Class action against GitHub Copilot

Posted Nov 11, 2022 15:37 UTC (Fri) by farnz (subscriber, #17727) [Link]

You're not actually clarifying, unfortunately. The effect of the EU Copyright Directive is not to say that the model and its training process are guaranteed not to infringe copyright; rather, it's to say that the mere fact that copyrighted material was used as input to the training process does not imply that the training process or resulting model infringe, in and of itself.

And you're asking more of the complainers than EU law does. Under EU law, the complainers first have to show that there is a copyrightable interest in the output (which you do get right), and after that, they only have to show that the tool's output infringes that copyright. In particular, the tool is at fault if the material is infringing and it produces it from a non-infringing prompt - even if the prompt has to be carefully crafted to cause the tool to produce the infringing material.

As an example, let's use the prompt:


float rsqrt(float number) {
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;

This is not infringing in most jurisdictions - there's nothing in there that is copyrightable at this point, as all the names are either descriptive, or completely non-descript. If a machine learning model then goes on to output the Quake III Arena implementation of Q_rsqrt from this prompt, complete with the comments (including the commented out "this can be removed" line), then there's infringement by the tool, and if it can be demonstrated that the only place the tool got the code from was its training set, the tool provider is likely to be found to be a contributory infringer.

It doesn't matter that I've set the tool up with a troublesome prompt here (that's the first 5 lines of Q_rsqrt, just renamed to rsqrt); I haven't infringed, and thus the infringement is a result of the tool's training data being copied verbatim into its output.

This is, FWIW, exactly the same test that would apply if I gave that prompt to a human who'd seen Quake III Arena's source code, and they infringed by copying the Quake III Arena implementation - I would not be able to prove infringement just because the human had seen the original source, but I would be able to do so if, given the prompt, they produced a literal copy of Q_rsqrt.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:09 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

> Because the original license is irrelevant, as the model is not built using sources distributed under their licenses, but under specific exceptions in copyright law that allow text and data mining - see the recent EU directive on copyright. In the US it is murkier because there's no corresponding exception yet, so it's all done under the fair use exception, which is a pain as it has to be defended in court every time.

What the EU says about data mining is irrelevant to this case because Github as an entity is not based in the EU.

What is more pertinent is that the terms and conditions of GitHub stipulate that if you upload stuff to Github, you license Github to use said stuff to “improve the service”, including indexing etc. Since Copilot is basically a fancy search engine for Github-hosted code, it would not be unreasonable for Microsoft's lawyers to argue that their use of code already on Github to train Copilot is covered by the site's existing T&C's, so they don't even need to make a fair-use argument. This would be completely independent of the licenses in the various projects on Github, which govern the use and disposition of Github-hosted code by third parties.

Having said that, the question of whether Copilot's output can infringe on the copyright of its inputs is a separate (and difficult) issue, which should probably be investigated in the wider context of ML applications that, e.g., paint pictures or generate prose. It is obvious that much of the code Copilot deals with is boilerplate which is not sufficiently original to qualify for copyright protection in the first place, but then again, the fact that Copilot can be coaxed into producing swathes of demonstrably copyrighted code without the correct attribution or license grant should not be overlooked, either. (Personally I think that there are more efficient methods to deal with boilerplate, and the amount of due diligence required on the part of Copilot users to ensure that any nontrivial stuff Copilot may regurgitate is not violating any copyrights, let alone fit for purpose, negates the advantage of using Copilot to begin with, but your mileage may vary.)

Class action against GitHub Copilot

Posted Nov 11, 2022 14:54 UTC (Fri) by bluca (subscriber, #118303) [Link]

> What the EU says about data mining is irrelevant to this case because Github as an entity is not based in the EU.

I live and use it in the EU, so it is very relevant to me ;-) Also the EU is pretty much the world's regulatory superpower, so what it says on this matters a lot.

Class action against GitHub Copilot

Posted Nov 11, 2022 14:56 UTC (Fri) by kleptog (subscriber, #1183) [Link] (2 responses)

> In the US it is murkier because there's no corresponding exception yet, so it's all done under the fair use exception, which is a pain as it has to be defended in court every time.

Indeed, it's concerning for me that this is a case that potentially could set wide-reaching precedent. I know judges creating law is a feature of Common Law systems, but it feels like some of the issues here should be decided by the legislature, not the judiciary.

Class action against GitHub Copilot

Posted Nov 11, 2022 15:14 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> I know judges creating law is a feature of Common Law systems, but it feels like some of the issues here should be decided by the legislature, not the judiciary.

Getting into politics I know, but it's why I feel the gutting of the House of Lords about 20 years ago was a travesty of good government. The UK's effective Supreme Court was a (pretty ineffective) branch of the legislature. And as such it did a damn good job of making sure the law actually worked.

Career politicians as a whole seem pretty malevolent - by design or accident - and having a decent body of people who wielded power by accident and no design of their own seemed a very good counterbalance. For pretty much a century the House of Lords did a damn good job of making sure legislation was fair, just, and worked. (For some definition of "just", I know. The Lords are people of their time, with the prejudices of their time, but are far less likely to be swayed by populist demagogues.)

The problem the US suffers in particular at the moment, and we're going down the same route, is we seem to have "government by barrels of ink" ...

Cheers.
Wol

Class action against GitHub Copilot

Posted Nov 11, 2022 19:39 UTC (Fri) by bluca (subscriber, #118303) [Link]

The problem with that is I can bring many counter-examples of absolute bellends that ended up being peers when they shouldn't even be allowed anywhere near a residents association. But we are just a bit off-topic.

Class action against GitHub Copilot

Posted Nov 11, 2022 11:27 UTC (Fri) by ceplm (subscriber, #41334) [Link] (2 responses)

> In recent decades, though, the class-action lawsuit seems to have become mostly a vehicle for extorting money from a business for the enrichment of lawyers.

That is quite an uncool thing to say. Yes, of course, payments to the class members are usually just a pittance, but that is not the point of class-action lawsuits. The main point of class actions is on the other side of the balance sheet: what the defendant has to pay for the lost suit, and class-action lawsuits are a far more powerful tool for product liability and consumer protection than any government action (compared with, for example, Europe). Do you remember the Ford Pinto, the Chevrolet Corvair and other cars which resulted in class-action lawsuits (and a long list of other industries changed for the better by class-action lawsuits could follow)? Yeah, nobody would dare to make such cars any more, because of those.

Concerning the payments made to the lawyers involved in class actions: do you know how much these litigations usually cost? Do you know how many law firms have gone into bankruptcy because of class-action lawsuits?

I don’t want this thread to fall into political discussion (so I probably won’t follow up on any replies to this comment), and I readily admit that the system is often abused, but I would like it to stand here that class actions are one of the strongest tools “normal people” have against corporate interests, and you would be sorry to lose them. We, who don’t have them (I live in Europe), are now struggling to create something similar (https://ec.europa.eu/commission/presscorner/detail/en/sta...).

Class action against GitHub Copilot

Posted Nov 11, 2022 12:19 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> The main point of class actions is on the other side of the balance sheet: what the defendant has to pay for the lost suit. Class-action lawsuits are a far more powerful tool for product liability and consumer protection than any government action (compared to, for example, Europe).

Are you sure? In the UK we have Trading Standards, whose bite is pretty bad. The trouble is that, like so many such quangos, they are seen by the Treasury as a dead cost, and as such are badly hamstrung by inadequate budgets. If their budget bore any relationship to their value, it would rapidly rise ...

Cheers,
Wol

Class action against GitHub Copilot

Posted Nov 12, 2022 8:10 UTC (Sat) by ceplm (subscriber, #41334) [Link]

I am sorry, I did it myself: forgetting that the United Kingdom has nothing to do with Europe. I really don’t know enough about British law, and I really don’t know what the UK’s position on the collective redress initiative was.

Class action against GitHub Copilot

Posted Nov 14, 2022 9:20 UTC (Mon) by LtWorf (subscriber, #124958) [Link] (1 responses)

> a core component of the ethical enforcement of free-software licenses is to avoid the pursuit of financial gain

Since when? I thought it had always been perfectly fine to sell free software, since its inception.

Class action against GitHub Copilot

Posted Nov 14, 2022 9:59 UTC (Mon) by excors (subscriber, #95769) [Link]

> Since when? I thought it had always been perfectly fine to sell free software, since its inception.

I don't think it's saying you shouldn't try to make money from developing or distributing free software; it's saying you shouldn't try to make money from free-software licence enforcement. The goal of enforcement should be compliance, and financial penalties are merely a mechanism for encouraging compliance and for funding further enforcement.

Class action against GitHub Copilot

Posted Nov 17, 2022 10:41 UTC (Thu) by anton (subscriber, #25547) [Link]

> The vacuuming of a massive amount of free software into the proprietary Copilot system has created a fair amount of discomfort in the community. It does, in a way, seem like a violation of the spirit of what we are trying to do.
Actually, what Copilot does would be very much within (at least my view of) the spirit of free software, if it conformed with the license conditions (i.e., their model of GPLv3-compatible software would spit out code under GPLv3, their model of CDDL-compatible software would spit out code under CDDL, their model of permissive code would spit out code under a permissive license, and so on). I am publishing my software as free software because I want others to be able to exercise the four freedoms, including freedom 1: to study how the code works, and change it to make it do what they wish. Copilot could be a helpful tool for that.

Yes, it would be better if Copilot were free software itself, but that's a different issue.
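
As a purely illustrative sketch of that arrangement, something like the following captures the idea of license-partitioned suggestion, with each completion tagged with the license and attribution it derives from. The class names, the per-license backends, and the suggest() call are all invented for this example; this is not how Copilot works today.

    # Hypothetical sketch of license-partitioned code suggestion; none of the
    # names below correspond to a real Copilot or GitHub interface.
    from dataclasses import dataclass

    @dataclass
    class Suggestion:
        code: str
        license: str        # license family the suggestion is offered under
        attribution: str    # note about the training material it derives from

    class LicensePartitionedAssistant:
        # Route each completion request to a per-license backend and tag the output.
        def __init__(self, models: dict):
            # models maps a license family ("GPLv3", "CDDL", "permissive", ...)
            # to a completion backend; here the backends are plain callables.
            self.models = models

        def suggest(self, prompt: str, project_license: str) -> Suggestion:
            code = self.models[project_license](prompt)
            return Suggestion(
                code=code,
                license=project_license,
                attribution=f"derived from the {project_license} training corpus",
            )

    if __name__ == "__main__":
        assistant = LicensePartitionedAssistant({
            "GPLv3": lambda p: f"/* GPLv3-trained completion for: {p} */",
            "permissive": lambda p: f"/* MIT/BSD-trained completion for: {p} */",
        })
        s = assistant.suggest("parse a config file", project_license="GPLv3")
        print(s.license)
        print(s.code)
        print(s.attribution)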

Class action against GitHub Copilot

Posted Nov 17, 2022 13:25 UTC (Thu) by esemwy (guest, #83963) [Link]

Here’s a thought experiment. If the code generated by Copilot is free and clear, “because that’s the way AI works,” why were only open source projects used for training data? Why not everything on GitHub?

I mean, after all, it’s not copying anything…


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds