DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 1:13 UTC (Fri) by NYKevin (subscriber, #129325)
In reply to: DeVault: GitHub Copilot and open source laundering by Wol
Parent article: DeVault: GitHub Copilot and open source laundering

Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can. The intent of the person writing the code, or even of the person creating the model, has no bearing on any of this. If it so happens that the model is not a derivative work of the original in your jurisdiction, then you have to write your legislature and get them to change the law. You can't "fix" it by putting an extra term in the license; the license is not a law, and does not entitle you to impose arbitrary rights and restrictions on other people.

> If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous".

That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of. At best, the defendant might be able to raise the defense of "innocent infringement" in some jurisdictions. But under US law, that does not relieve the defendant of liability, it merely reduces the monetary amount of their damages, which can still be quite substantial if many copies were made. Also, a valid copyright notice often defeats or greatly weakens this defense (see e.g. 17 USC 401(d)), depending on jurisdiction.

Seriously, if anyone in this thread is contemplating acting on this suggestion, I would strongly urge that person to consult an attorney who specializes in copyright law. This is not how the law works at all. You cannot go before a judge and say "the law was ambiguous so I just did it anyway," and expect to automatically win.

> But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not).

You have it backwards. Either the defendant is arguing that no license is required (and so it doesn't matter whether the license is valid or invalid), or the defendant is arguing that the license is valid and its actions fall within the scope of the license. Arguing that the license is invalid is something the plaintiff might do in order to defeat the latter defense; it never makes sense for the defendant to raise such an argument.


DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 2:30 UTC (Fri) by Wol (subscriber, #4433) [Link] (25 responses)

> Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can.

And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

> That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of.

Who was talking about the *law*? I was talking about the *licence*.

If the licence makes it absolutely clear that the licensor considers ML to create derivative code, then the licensee cannot claim an innocent mistake. The licensee MUST claim that a licence is not required and copyright does not apply.

At the end of the day, it's down to the Judge to *apply* the law. And if it is clear to the Judge that the defendant "knew or should have known" that they were acting against the wishes of the licensor, then there is no defence of estoppel, or "innocent infringement", or "but I thought it was okay".

And, faced with the choice of siding with the plaintiff and saying to the defendant "you knew the plaintiff did not permit that", or CREATING NEW LAW by explicitly defining ML into the Public Domain or whatever, which do you think a Judge is going to choose?

At the end of the day, putting this stuff into your licence does not change the law. But it makes it a damn sight more likely that the Judge is going to side with your interpretation of the law.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 2:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (24 responses)

> And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

That is exactly my point. The judge will make this decision, based on the facts of the case and what the legislature wrote in the statute. Not based on the license. The license has zero to do with what is or is not a derivative work.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 8:04 UTC (Fri) by Wol (subscriber, #4433) [Link] (23 responses)

And you're completely missing my point.

If the Judge has to decide whether ML is a derivative work or not (and create new law in the process!!!), then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

If the license says "I consider this to be a derivative work", the Judge will not want to create new law by disagreeing - you're effectively twisting the Judge's arm. How far he lets you twist it is down to him :-)

Remember PJ - Judges try to upset the apple-cart as little as possible. If you give the Judge an out, he will take it ...

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 9:15 UTC (Fri) by kleptog (subscriber, #1183) [Link] (7 responses)

On the other hand, if the legislature by statute declares that ML models are not derivative works of the inputs, then you can write whatever you like in the license, it's irrelevant.

In particular, the EU Copyright Directive states that Text and Data Mining for the purpose of research and education is permitted. You can write whatever you like in your license, it has no effect. Now, GitHub is making a commercial product here so they don't get to claim a broad exemption. So it comes to the individual member states to regulate as they see fit.

Which basically makes the conclusion: It depends.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:01 UTC (Fri) by bluca (subscriber, #118303) [Link] (3 responses)

TDM is allowed for any purpose under EU law. Commercial entities have to let publishers of corpora opt out (W3C is working on a spec like robots.txt for this), while researchers and educators don't. The key detail is that you have to opt out of ALL data mining, not just from entities you don't like.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:03 UTC (Fri) by bluca (subscriber, #118303) [Link]

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:20 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> The key detail is that you have to opt out of ALL data mining, not just from entities you don't like.

That's easily got round - opt out by default and grant people you DO like a licence. In other words, the default is "mining is permitted", and the law says you have to explicitly change THE DEFAULT if you don't like it. Pretty sensible, imho.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:37 UTC (Fri) by bluca (subscriber, #118303) [Link]

Even if that holds (not a lawyer, can't say if 'workarounds' like that are likely to survive a court case or not), it means changing the license to explicitly go against the 'No Discrimination Against Persons or Groups' and/or 'No Discrimination Against Fields of Endeavor' principles, thus making it non-free.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 13:49 UTC (Fri) by edeloget (subscriber, #88392) [Link] (2 responses)

I am inclined to follow the same direction. The EU directive is quite precise on what it implies - and Copilot as a research program would qualify for the exemption but Copilot as a commercial program does not.

The same thing goes for books, photographs and so on. You can train a language model on copyrighted books for research and education. But if you want to do it for the purpose of offering a commercial service, you have to get the proper authorization (this can be costly but it's not out of reach for a company like Microsoft).

I haven't read all the comments below, so maybe I'll state a point that has already been made. The "model is a derivative work" question is interesting, yet I don't think it is the real issue. The problem I see (and I thought so the moment I read the "we can do it" rationale by GitHub) is that I don't believe them: contrary to what they say, the code written by the machine is a derivative work, no matter how hard they press the point. Not only can you easily obtain code that is a direct copy-paste of existing code (with comments, if needed :)), but even if you don't, you'll end up with code that 1) has a striking similarity to existing code (after using it for a while, I don't envision Copilot magically imagining new algorithms) and 2) is directly inspired by the input code (Copilot is unable to code a solution to a new problem; for example, it cannot propose that you use an API it does not already know).

So, in conclusion, I would not go with the "a model created using this code is a derivative work" clause. I would go with an "any code created by an ML model trained with this code is a derivative work" clause, which I find both more logical and more satisfactory. As a consequence, the tool itself can exist, but cannot be used to create anything but free software - as I see it, this would be a win-win situation (although it might be tough to market to software shops :))

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 15:57 UTC (Fri) by bluca (subscriber, #118303) [Link]

> I am inclined to follow the same direction. The EU directive is quite precise on what it implies - and Copilot as a research program would qualify for the exemption but Copilot as a commercial program does not.

The EU directive allows TDM for commercial programs too. It adds an opt-out provision for that case.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 16:36 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link]

I absolutely agree that any code produced by Copilot that is a verbatim copy of anything from its training corpus would be a copyright violation, unless that piece fell under one of the established limits on copyright, such as purely functional material that has a restricted number of ways it can be expressed or a snippet that's too small to be considered expressive. Material that's suspiciously similar to something in the training corpus would be at least deeply suspect. But that would be true whether it comes from Copilot or from a human programmer. If your code is a copy of someone else's, it's a copyright violation regardless of how it got that way unless it isn't eligible for copyright in the first place.

The problem with trying to restrict Copilot (and similar programs) is threefold:

  1. Existing copyright law allows authors to be exposed to and influenced by existing works. For example, if I read the source code to the Linux kernel, my brain does not become tainted by the ideas therein, making any program I write a derivative of the Linux kernel. The same thing would seem to apply to training an ML model on existing software.
  2. To the extent using your code as a training corpus would be considered copying, it would likely be considered to be a sufficiently transformative use that it would fall under fair use. Their ML model is a completely different thing from your original program and serves a completely different function. That is pretty much the definition of a transformative use.
  3. Specifically regarding Free Software under the FSF guidelines, one of the essential freedoms of free software (freedom 1) is the right to "study how the program works". Software that is licensed under an FSF license has accepted this as a fundamental principle, and it seems to me that it absolutely applies to using the program as part of a training corpus for an ML model.

Again, this applies only to training the model. The output of the model is a different thing and may be a copyright violation even if the model itself isn't.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 10:00 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (9 responses)

> if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

Why? Do you have examples of this occurring?

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:25 UTC (Fri) by Wol (subscriber, #4433) [Link] (8 responses)

No. But legal principles? From Groklaw?

PJ was very clear on this - Judges are very reluctant to rock the boat. If "is this a derivative work" is not clear then, given the choice of a NARROW interpretation of the licence that says "the licence denies permission, I'll side with the licence", or a BROAD interpretation that says "all licences like that are invalid", which one are they going to choose?

Especially when the defendant has "known or should have known" the plaintiff's express wish and ignored it.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 17:45 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

The point is that isn't how licenses work. The idea behind a copyright license is that the licensor grants the licensee some rights they would normally be denied by copyright law in exchange for a consideration. For example, if copyright law would normally deny me the right to use your program to train my ML model, you can write a license that would grant me that right.

But in practice a license can't prevent someone from doing something they would otherwise have the right to do under copyright law. It is possible to write a license that requires the licensee giving up some rights under copyright law as part of the consideration they get for receiving some other rights. But nobody is forced to agree to the license! If they simply refuse to accept the license, they can continue doing anything they normally had the right to do under copyright law. Refusing the licensing terms would deny them whatever rights the license would grant them, but if they weren't intending to do those things it's an empty threat.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:25 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> But in practice a license can't prevent someone from doing something they would otherwise have the right to do under copyright law.

But if copyright LAW is not clear on the matter?

That is what everybody is ignoring - it is down to the Judge to decide what the law IS. If the licence explicitly refuses permission, does the Judge make a NARROW ruling that says the licensor's explicit wishes rule, or a BROAD ruling that all such clauses are invalid?

PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time.

And I don't know which case it was, but there was a discussion about a pro-software-patent Judge some while back, who ruled "In THIS case, the software is clearly non-patentable. I can't conceive of a scenario where any software is patentable". Note he didn't even attempt to say software isn't patentable. He was pro-patents. But he stated, in a ruling, "I don't think it is possible for software to pass the patentability bar". He made a very narrow ruling, but accepted that the consequences would probably be wide.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:21 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

> PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time.

But that applies only if the broad and narrow ruling turn out the same way. In that case, the judge will usually rule on the narrowest possible grounds that results in the outcome they think is right for the case. If the broad and narrow grounds for the ruling have opposite results, the judge has to go based on which one seems to be a better reading of the law and situation, not just on narrow versus broad. More generally, narrow vs broad is something that's more true of low-level judges than of higher-level ones. Even if individual judges make narrow rulings, it's likely that different judges will rule differently. That will create uncertainty and force a higher court to rule on the matter, creating a broader ruling. That's the way these things usually go.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 17:05 UTC (Sat) by khim (subscriber, #9252) [Link]

> But that applies only if the broad and narrow ruling turn out the same way.

Where does that idea come from? A narrow ruling is used precisely to leave open the possibility of a broad ruling (made in a different case by a different judge later) that proclaims the opposite outcome!

> If the broad and narrow grounds for the ruling have opposite results, the judge has to go based on which one seems to be a better reading of the law and situation, not just on narrow versus broad.

Narrow is almost always better. Because, well, it's narrow. It describes the situation more precisely. The only way out is to argue that the narrow reading is so narrow it's not applicable to your case at all.

That often happens with patents (the judge is presented with half a dozen patents which could, theoretically, be treated as prior art and eliminate the patent completely, but 9 times out of 10 the judge doesn't do that and merely proclaims that yes, the patent is still valid, just not applicable to your case).

> That's the way these things usually go.

I would say it's the way these things usually don't go. 99% of the time the decision doesn't reach courts high enough to decide anything definitively. Usually it takes dozens of cases and decades of litigation for that to happen.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 19:30 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (3 responses)

Intent will matter in a bunch of cases, but in this case the defining factor is whether or not something is a derived work. In two otherwise identical scenarios, one involving a license that says "I consider this to be a derived work" and one without that claim, from the standpoint of copyright law we'd expect both outcomes to be the same - the license doesn't determine how far copyright law reaches, so anything else would be bizarre.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:43 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> Intent will matter in a bunch of cases, but in this case the defining factor is whether or not something is a derived work.

And that is exactly the argument in front of the Judge. *IS* it a derived work? And if the law is unclear, and the licensor is explicit that he considers it a derived work, then the only safe option for the Judge is to rule that it IS a derived work and let the legislators sort it out.

And this is why you can NOT "defer to the law" in this argument. The question at issue is not "is this a derivative work according to the law?", but "what is the law?". THAT is the argument in front of the Judge.

The legislators can choose to let the genie out the bottle. The Judge will not be happy about letting the genie out the bottle off his own bat.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:13 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (1 responses)

The license stating that something is considered a derived work doesn't make it a derived work. The copyright holder is still going to have to turn up in court and make that argument, and whoever created the allegedly derived work is going to defend against that. If the same judge heard the same arguments in a case where the license made no claims about whether or not something was a derived work, we'd expect to see the same decision.

DeVault: GitHub Copilot and open source laundering

Posted Jul 1, 2022 8:32 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

It would be very hard to argue in court that code produced by an AI trained on an existing code base is anything other than a derivation of this code base, since there is no human inventiveness involved to muddy the waters.

Despite years of commercial pretense to the contrary, professionals know “smart” systems are no smarter than the humans who coded them.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 18:23 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (4 responses)

> then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

Nope, that's not how the law works. Stating it over and over again does not make it true.

For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 18:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

(Self-nitpick: I probably should have written "the prevailing copyright law" rather than "the copyright statute" because, in common law countries, the former also includes prior legal precedents.)

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:50 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

And if that's not enough for him to make up his mind?

That *SHOULD* be all that's needed. But if that IS all that's needed, why can't we all make our own minds up? Surely it's obvious? Why do we need Judges? It can't be THAT hard ... ?

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:59 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Sure, you can write the government a letter saying that you're seceding from your country and making up your own laws. It's not going to accomplish anything, but you can do it anyway.

DeVault: GitHub Copilot and open source laundering

Posted Jun 27, 2022 14:18 UTC (Mon) by anselm (subscriber, #2796) [Link]

> For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

I think it would still be of some interest whether X resulted from Y through “cp Y X” or through a query to Copilot whose model was trained on Y. In the first case, X is pretty clearly a derived work of Y. In the second case, Microsoft, at least, would probably like to claim it isn't.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds