DeVault: GitHub Copilot and open source laundering [LWN.net]

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 16:26 UTC (Thu) by mpldr (guest, #154861) [Link] (24 responses)

I would personally love for an independent legal team to take a look at this issue… I doubt that either the FSF's or GitHub's legal teams can be trusted to be impartial on these topics.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 18:31 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (10 responses)

Emphasis on "legal" team, not engineering team. A lot of engineers have very strange ideas about how the GPL works. For example, some engineers (presumably not including DeVault) think that just reading GPL'd code causes it to magically "infect" any code you write after that point, which is not even close to being true.* The GPL attaches to derivative works. "Derivative work" is a legal term of art, and the license does not attempt to offer a definition for it, because it's defined by the underlying copyright law. That's why you need a team of lawyers, not a team of engineers, to evaluate something like this. In particular, I think DeVault's emphasis on the model itself being a derivative work may be an unnecessary distraction, because:

1. It's not clear to me whether this claim is actually correct. A model is ultimately "just" a big bag of statistical information, and I honestly don't know whether (US) copyright law attaches to such things in the first place, but I'm skeptical (see e.g. Feist v. Rural).
2. It's not relevant. What matters is whether the output of the model is a derivative work of the original, which is a completely different legal question. Derivative works are not subject to some sort of magical "transitive property" that requires the model to also be a derivative work; you can argue that the output is derivative while taking no position on the status of the model itself. Similarly, you could argue that the output is *not* derivative, again taking no position on the model. The status of the model is not relevant to the question, unless you're going to allege an AGPL** violation.

* The kernel of truth here is that, in practice, clean-room engineering is often a good idea for the avoidance of legal risk. But there's nothing in either the GPL or the copyright statute that says you have to do it. Because that would be stupid. Imagine if novelists couldn't read books without running into copyright issues.
** The AGPL is the only widely-used license whose obligations attach on creation of a derivative work, rather than on distribution of that work. As far as I know, GitHub has no intention of distributing the model itself to anyone, so if you want to sue GitHub just for creating the model, you'd have to claim an AGPL violation specifically.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 19:26 UTC (Thu) by ballombe (subscriber, #9523) [Link] (1 responses)

The legal issue will probably depend on the specific data you can extract from copilot, and obviously github will not help you to find out, and lawyers might not be sufficient.

Maybe there is some specific request that led copilot to return whole body of some GPL files. For example, by looking for certain patterns that occurs in a single software etc.
That would strengthen the case.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 19:52 UTC (Thu) by Gaelan (guest, #145108) [Link]

Someone got Copilot to generate the fast inverse square root function from Quake III (which is GPL'd), "what the fuck?" comment and all: https://twitter.com/mitsuhiko/status/1410886329924194309

Amusingly, it also autocompleted a BSD license onto that code.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 1:00 UTC (Sat) by gerdesj (subscriber, #5446) [Link] (2 responses)

"A lot of engineers have very strange ideas about how the GPL works."

Quite. Also, many putative authorities on the GPL seem to forget that there are many legal systems. If you are going to dive in and be authoritative on he GPL then you really should present an argument that works for all legal systems that the GPL attempts to work within. Quite a job!

Legal is as legal does: Some legal systems have a concept of "reasonable" or what a "reasonable" person would do and I think that is what the GPL is riffing off. There's also the concept of being able to "quietly enjoy [something]". I'm a Brit. so my local legal system informs my knowledge here. Not all legal systems work like that.

I think it is fair to say that we all have strange ideas about how the GPL works. There's no need to call out end users.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 11:47 UTC (Sat) by Wol (subscriber, #4433) [Link]

> Quite. Also, many putative authorities on the GPL seem to forget that there are many legal systems.

And far too many authorities read what they want to see, not what's actually there. I've sure been guilty of that. I think my knowledge of the GPL now is pretty good, precisely because I've had plenty of people call me out on my mistakes.

How many "experts" have NOT been through that learning experience? The majority of them?

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 3:22 UTC (Sun) by gdt (subscriber, #6284) [Link]

The deeper point about differing copyright laws is not so much interpretation of copyright licenses, but the distinction between "fair use" and "fair dealing".

Copilot claims its actions are fair use, and therefore the license is irrelevant. However in fair dealing jurisdictions Copilot's use of the program source must either meet the copyright license or one of the black-letter list of allowed uses in the fair dealing exceptions of that juridiction's copyright law.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 9:40 UTC (Sun) by gspr (subscriber, #91542) [Link] (4 responses)

> It's not clear to me whether this claim is actually correct. A model is ultimately "just" a big bag of statistical information, and I honestly don't know whether (US) copyright law attaches to such things in the first place, but I'm skeptical (see e.g. Feist v. Rural).

Surely it does apply in one extreme, namely that of a really good model! If I take a copyrighted picture and create a model that very accurately reproduces said picture, I don't think it's very unlikely that my model runs afoul of the original's copyright.

In the other extreme—that of a really terrible model—it probably doesn't, but we probably shouldn't write off the models in-between those extremes.

DeVault: GitHub Copilot and open source laundering

Posted Jun 27, 2022 16:51 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

Are you talking about a real-world model - like you made an aircraft model from a photo of an aircraft? In that case it's a physical object and patent rules should apply, or design rules, and NOT copyright rules.

If, however, you're referring to a model like most people here are - computer science, otherwise known as maths - then equally copyright should NOT apply, because it's maths. Or "sweat of the brow". Or a whole other bunch of doctrines that lawyers do their best to mis-understand but that state quite clearly it is not copyrightable material.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 28, 2022 1:26 UTC (Tue) by hummassa (subscriber, #307) [Link] (1 responses)

I will offer an informed alternative to your pov, Wol:

If a mathematical model produces as its output a perfect reproduction of an copyrightable and copyrighted work (something novel produced by the human mind), then said mathematical model is nothing but a copying apparatus. There is no difference between the neural model of Copilot and a big HD containing all the works it's seen, just well indexed. The output is just a copy of the copyrighted work, subject to the same protections under the laws and treaties.

DeVault: GitHub Copilot and open source laundering

Posted Jun 28, 2022 8:00 UTC (Tue) by Wol (subscriber, #4433) [Link]

(And to gspr)

Oh the joys of the ambiguity of English ...

As I read it "copyright attaches to the model" - in other words there is no copyright *in* the model. But if there is copyright in the *original*, then that applies *to* the model as well ...

I think we're talking at cross purposes ... :-)

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 28, 2022 6:37 UTC (Tue) by gspr (subscriber, #91542) [Link]

By this logic, unlicensed redistribution of copyrighted images, video and audio are fine as long as the material is recompressed. It's just math!

Clearly absurd.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 20:28 UTC (Thu) by mrugiero (guest, #153040) [Link] (2 responses)

And in the case of the FSF their role is to actively try to be partial in favor of free software. On intent, not by accident. If they can get a win for FLOSS then they accomplished their stated mission.
Impartiality is for the judge, not for the litigants.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 20:58 UTC (Thu) by mpldr (guest, #154861) [Link] (1 responses)

> [the FSF's] role is to actively try to be partial in favor of free software

There's nothing wrong with that – quite the opposite – but it's not helpful if what you want is a legal review. You may get some interesting points from them, sure; but it's not exactly helpful when trying to find out what is actually law (let alone that this is a court's job)

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 2:33 UTC (Fri) by scientes (guest, #83068) [Link]

In the common law system the courts' job is to write law.

/Almost not sarcastic

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:27 UTC (Fri) by flussence (guest, #85566) [Link] (9 responses)

The FSF isn't going to do anything useful at all — the purpose of a system is what it does.

The GPL2 didn't make nVidia *or* AMD play nice with Linux (key phrase: "preferred form for modification"), the GPL3 didn't stop TiVoization (they trivially routed around it; especially Apple), the AGPL3 didn't stop SaaS vendor lock-in (instead they weaponised it against each other), and no current or future iteration of it will stop Microsoft committing automated for-profit piracy at global scale like is happening here.

The only thing the GPL *is*, clearly, is weak DRM powered by magical thinking and a bunch of weird elitist old men clinging to a power fantasy dreamed up half a century ago, from which they refuse to grow up from. People who try to actually play by the stated rules get a worse experience, corporations engage in automated piracy with neural networks with little to no legal repercussions, or often just old-fashioned copyright infringement if they're peddling white-label ARM devices, and any attempts to resist this toothless status quo from within the system get you ostracised. Most users of GPLed software have never and will never know it exists, never mind read or understand it, and if they did wouldn't be able to meaningfully exercise rights granted under it. But it sure makes some people feel smug about themselves on their moral high horse.

I feel like at this point the only way to stop this cancer of trillionaires strip-mining the creative output of individuals is to stop giving away any legal rights to that work in the first place. Make the code utterly radioactive to anyone who takes license texts seriously, especially corporate lawyers: All Rights Reserved, free for personal use only, the software shall be used for good not evil, and with a written threat to DMCA anyone found uploading to github or any other platform of similar size and motive. Piracy is going to happen anyway, but we can still choose who feels safe and comfortable doing it.

Thought experiment: if you train a neural network on the text of the GPL itself and coax output from the machine that superficially resembles the input but with manually chosen tweaks that change its meaning, are you exempt from the copyright header in the original, as MSFT seems to think it is? If so, that's the final nail in the coffin for software copyright as a whole; the words of the legal document and the colour of the bits don't mean anything any more.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 0:07 UTC (Sun) by salimma (subscriber, #34460) [Link] (8 responses)

> The GPL2 didn't make nVidia *or* AMD play nice with Linux

Any specific example for AMD here? They seem to be much better citizens when it comes to the GPL, at least compared to nVidia (and even nVidia is finally open sourcing kernel drivers)

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 0:23 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

ATI had closed the driver for R300 video cards, even though the driver for R200 was open. The opened up only after being acquired by AMD.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 10:00 UTC (Sun) by flussence (guest, #85566) [Link] (6 responses)

I'm still a bit distrustful of them after what they did to undermine the xf86-video-radeonhd project.

Rumour has it that AMD middle management wanted the FOSS option they were due to announce to be kept slightly inferior to fglrx for Reasons, and having an independent effort that didn't depend on firmware blobs or the decrepit x86-only int10 VBIOS like that did was severely embarrassing them.

Not many people may remember this now, but the R200/(reverse-engineered)R300 driver also used to be blob-free. Strangely that stopped being the case after AMD took over, even though it was feature-complete.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 18:49 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> Not many people may remember this now, but the R200/(reverse-engineered)R300 driver also used to be blob-free. Strangely that stopped being the case after AMD took over, even though it was feature-complete.

Because it made no sense to reimplement the critical power management and link training code multiple times, instead of doing it once in AtomBIOS.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 19:40 UTC (Sun) by mjg59 (subscriber, #23239) [Link] (3 responses)

> the R200/(reverse-engineered)R300 driver also used to be blob-free

The DRM side of the R2/3/400 driver always required a firmware blob - but for a long time it was just embedded inside the kernel driver, so wasn't user visible. My recollection (which seems to be supported by the driver, but it's been a long time since I looked at this properly so I could be wrong) is that a bunch of the 2D acceleration in that driver depended on DRM, so effectively the 2D driver also had a blob dependency if you wanted it to work properly.

The difference between -radeonhd and -ati as far as reliance on firmware goes was that the defined interface to various pieces of card functionality was to execute interpreted scripts present in the card flash. These scripts didn't do anything that the driver couldn't, so you could absolutely reimplement that functionality in the driver - the problem is that card vendors used these scripts as a way to abstract hardware differences (eg, using RAM from different vendors with different timing constraints), and ignoring Atom would mean having to have card-specific data in the driver before that card would work correctly. A hybrid approach is to use Atom for data but not for code, but even then there are still risks due to the fact that the defined interface is the scripts and not the data tables. A card vendor could modify the way the script interpreted the tables (or even hardcode stuff directly into the script) and again you'd need card-specific knowledge to avoid that. -radeonhd spent a while trying to avoid executing any Atom code, but effectively relied on it anyway - it couldn't program the card from cold and so depended on the system firmware having executed the scripts before it ran. In any case, support for executing Atom code (including running the ASIC init function) was added to -radeonhd by September of 2007.

Looking at the initial commits to support r500 in the -ati driver, I think the only time it would ever call int10 is if the card was entirely uninitialised. -radeonhd would do exactly the same if it was configured without support for doing Atom-based init.

DeVault: GitHub Copilot and open source laundering

Posted Jun 29, 2022 10:12 UTC (Wed) by flussence (guest, #85566) [Link] (2 responses)

That first paragraph's a letdown, but...

I learned something today. Thanks for explaining all that to my dumb ass.

This is the kind of thing that makes me keep my subscription to the site going.

DeVault: GitHub Copilot and open source laundering

Posted Jun 29, 2022 10:31 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (1 responses)

The easiest way to think about Atom is like ACPI - it's a lump of bytecode that's interpreted in kernel context, and which talks to hardware the kernel could already talk to. We could rip out the ACPI interpreter in the kernel and hardcode knowledge of every singly ACPI board instead, and we'd have successfully replaced non-free code with free code. But we'd also have a kernel that didn't boot on a bunch of new systems until people had reverse engineered the relevant ACPI code and reimplemented it, and so nobody has seriously suggested doing that.

There's tradeoffs. I'd love to avoid having to rely on non-free code to make hardware work, and I'm not going to criticise people for writing drivers that do that. But the reality is that any such driver is going to work less well than a driver that uses the defined interface to call non-free firmware (in much the same way that we call into non-free UEFI code to set boot variables these days), and so there's value in the driver that calls non-free code existing, and also it's unsurprising that distros would pick the one that works with more hardware.

Luc's priorities on -radeonhd were probably based on his experience with the VIA chipsets that were extremely limited by what the BIOS permitted (and yeah it turns out that not being able to set any modes other than those that are hardcoded in the BIOS is not good!), but the outcome was also that his driver for those chipsets simply didn't work on all hardware - I had a VIA-based laptop that would just give a black screen with his driver, because the BIOS didn't match his expectations. To be completely fair, on Radeon I hit some similar constraints when I was researching reclocking the RAM for power management - the Atom scripts simply took too long, so I took out a bunch of the wait statements and hardcoded those into the kernel and it was great, and then after a couple of days of uptime the card would wedge during a reclock and also it didn't work on all hardware, so it turns out there was a reason that those were there in the first place. So I absolutely understand the desire to have native code for all of this, but also in the absence of vendors providing explicit contracts about hardware behaviour, a driver that doesn't use the defined interfaces is inherently going to break things.

DeVault: GitHub Copilot and open source laundering

Posted Jun 29, 2022 11:31 UTC (Wed) by farnz (subscriber, #17727) [Link]

It's also worth noting in this context that part of the reason to have vendor scripts of some form (be they AtomBIOS, ACPI or others) is that power delivery changes can't happen instantly - and different board manufacturers will have done different transient analysis to determine what their hardware can reliably support. The results of that analysis (whether it's a "rule of thumb" assessment or a proper calculation) need communicating to the driver somehow - and a small scripting language is as good as any other way to deal with it, especially since the edge cases get complex if you're doing a per-board calculation based on measuring the final system during post-manufacture testing of a board.

That said, given the quality of some vendor code, I understand Luc's reluctance to trust it - I've encountered one vendor who asserted that the CPU would detect an OUT 0xCF8, EAX instruction in userspace and then ensure that nothing else accessed PCI configuration space until the same userspace process executed either OUT 0xCFC, EAX or IN EAX, 0xCFC later, on the basis that if the CPU didn't do that, it would be possible for their userspace driver to crash. I'm not even sure how this could work under Linux…

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 20:16 UTC (Sun) by sjj (guest, #2020) [Link]

Isn't 20 years enough to carry a grudge? Those middle managers have moved on a long time ago already.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 18:21 UTC (Thu) by bluca (subscriber, #118303) [Link] (2 responses)

The EU Copyright Directive allows Text and Data Mining, even for commercial purposes, regardless of what the copyright status and/or license of the text body is. I don't see how any terms of the original license (if even there is one) apply. If you have legal access to a body of proprietary text, you can still legally do TDM on it and train a model.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 18:43 UTC (Thu) by bluss (guest, #47454) [Link] (1 responses)

It doesn't give you a blanket reason to publish parts of the corpus, though. Maybe it is not prescient in how the "analysis" (outcome) of the data mining might be a derivative of the data itself.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 19:38 UTC (Thu) by bluca (subscriber, #118303) [Link]

Where is that clause defined?

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 18:26 UTC (Thu) by Deleted user 129183 (guest, #129183) [Link] (31 responses)

> And, my advice to free software maintainers who are pissed that their licenses are being ignored. First, don’t use GitHub and your code will not make it into the model (for now).

This would be a good recommendation, but unfortunately…

The way software freedom is defined, it means that everyone is free to redistribute the source code as they wish, _including uploading it verbatim to G*tHub_ even if it doesn’t come from there. You cannot forbid it in a licence, because it would make the software non-free (and potentially introduce a licence incompatibility). Sure, you can just ask people not to, but not everybody listens, and when you notice that somebody uploaded your software to G*tHub against your wishes, it may be already too late and the code may be already stolen and incorporated into the machine learning dataset.

The only winning move would be not to play and not publish the source code anywhere nor show it to anyone. Which will obviously make your software proprietary, which may be not something that you would want.

> Instead, I would update your licenses to clarify that incorporating the code into a machine learning model is considered a form of derived work, and that your license terms apply to the model and any works produced with that model.

…which would be also just disregarded by G*tHub. Again, the only way to win is not to play.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 18:49 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (28 responses)

>> Instead, I would update your licenses to clarify that incorporating the code into a machine learning model is considered a form of derived work, and that your license terms apply to the model and any works produced with that model.

Aha, I missed that line in DeVault's post.

This is not a thing you can do. The judge decides what counts as a derivative work. You don't. The license cannot override the applicable copyright law. If copyright law does not say that a license is required in order to create the model in the first place, then no provision of the license can prohibit it. Not even "All rights reserved, do not redistribute." OTOH, if the applicable copyright law says that a license is required to create such a model, then GitHub has to comply with the terms of the license, and it doesn't matter whether the license explicitly calls this out or not.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 23:07 UTC (Thu) by Wol (subscriber, #4433) [Link] (27 responses)

The problem with your approach is that INTENT MATTERS.

If your licence makes it clear that you consider putting your source into a ML algorithm creates a derivative work, then the Judge is likely to agree with you.

If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous". Fair enough. But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not).

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 1:13 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (26 responses)

Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can. The intent of the person writing the code, or even of the person creating the model, has no bearing on any of this. If it so happens that the model is not a derivative work of the original in your jurisdiction, then you have to write your legislature and get them to change the law. You can't "fix" it by putting an extra term in the license; the license is not a law, and does not entitle you to impose arbitrary rights and restrictions on other people.

> If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous".

That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of. At best, the defendant might be able to raise the defense of "innocent infringement" in some jurisdictions. But under US law, that does not relieve the defendant of liability, it merely reduces the monetary amount of their damages, which can still be quite substantial if many copies were made. Also, a valid copyright notice often defeats or greatly weakens this defense (see e.g. 17 USC 401(d)), depending on jurisdiction.

Seriously, if anyone in this thread is contemplating acting on this suggestion, I would strongly urge that person to consult an attorney who specializes in copyright law. This is not how the law works at all. You cannot go before a judge and say "the law was ambiguous so I just did it anyway," and expect to automatically win.

> But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not).

You have it backwards. Either the defendant is arguing that no license is required (and so it doesn't matter whether the license is valid or invalid), or the defendant is arguing that the license is valid and its actions fall within the scope of the license. Arguing that the license is invalid is something the plaintiff might do in order to defeat the latter defense; it never makes sense for the defendant to raise such an argument.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 2:30 UTC (Fri) by Wol (subscriber, #4433) [Link] (25 responses)

> Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can.

And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

> That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of.

Who was talking about the *law*? I was talking about the *licence*.

If the licence makes it absolutely clear that the licensor considers ML to create derivative code, then the licensee cannot claim an innocent mistake. The licensee MUST claim that a licence is not required and copyright does not apply.

At the end of the day, it's down to the Judge to *apply* the law. And if it is clear to the Judge that the defendant "knew or should have known" that they were acting against the wishes of the licensor, then there is no defence of estoppel, or "innocent infringement", or "but I thought it was okay".

And, faced with the choice of siding with the plaintiff and saying to the defendent "you knew the defendent did not permit that", or CREATING NEW LAW by explicitly defining ML into the Public Domain or whatever, which do you think a Judge is going to choose?

At the end of the day, putting this stuff into your licence does not change the law. But it makes it a damn sight more likely that the Judge is going to side with your interpretation of the law.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 2:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (24 responses)

> And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

That is exactly my point. The judge will make this decision, based on the facts of the case and what the legislature wrote in the statute. Not based on the license. The license has zero to do with what is or is not a derivative work.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 8:04 UTC (Fri) by Wol (subscriber, #4433) [Link] (23 responses)

And you're completely missing my point.

If the Judge has to decide whether ML is a derivative work or not (and create new law in the process!!!), then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

If the license says "I consider this to be a derivative work", the Judge will not want to create new law by disagreeing - you're effectively twisting the Judge's arm. How far he lets you twist it is down to him :-)

Remember PJ - Judges try to upset the apple-cart as little as possible. If you give the Judge an out, he will take it ...

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 9:15 UTC (Fri) by kleptog (subscriber, #1183) [Link] (7 responses)

On the other hand, if the legislature by statue declares that ML models are not derivative works of the inputs, then you can write whatever you like in the license, it's irrelevant.

In particular, the EU Copyright Directive states that Text and Data Mining for the purpose of research and education is permitted. You can write whatever your like in your license, it has no effect. Now, GitHub is making a commercial product here so they don't get to claim an broad exemption. So it comes to the individual member states to regulate as they see fit.

Which basically makes the conclusion: It depends.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:01 UTC (Fri) by bluca (subscriber, #118303) [Link] (3 responses)

TDM is allowed for any purpose under EU law. Commercial entities have to let publishers of corpora opt out (W3C is working on a spec like robots.txt for this), while researchers and educators don't. The key detail is that you have to opt out of ALL data mining, not just from entities you don't like.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:03 UTC (Fri) by bluca (subscriber, #118303) [Link]

https://www.w3.org/2022/tdmrep/

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:20 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> The key detail is that you have to opt out of ALL data mining, not just from entities you don't like.

That's easily got round - opt out by default and grant people you DO like a licence. In other words, the default is "mining is permitted", and the law says you have to explicitly change THE DEFAULT if you don't like it. Pretty sensible, imho.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:37 UTC (Fri) by bluca (subscriber, #118303) [Link]

Even if that holds (not a lawyer, can't say if 'workarounds' like that are likely to survive a court case or not), it means changing the license to explicitly go against the 'No Discrimination Against Persons or Groups' and/or 'No Discrimination Against Fields of Endeavor' principles, thus making it non-free

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 13:49 UTC (Fri) by edeloget (subscriber, #88392) [Link] (2 responses)

I am inclined to follow the same direction. The EU directive is quite precise on what it implies - and Copilot as a research program would qualify for the exemption but Copilot as a commercial program does not.

The same things goes for books, photographies and so on. You can train a language model on copyrighted books for research and education. But if you want to do it for the purpose of offering a commercial service, you have to get the proper authorization (this can be costly but it's not out of reach for a company like Microsoft).

I haven't read all the comments below, so maybe I'll state a point that has already been proposed. The "model is a derivative work" question is interesting, yet I don't think this is a real issue. The problem I see (and I think it's a problem at the moment I read the "we can do it" rationale by Github) is that I don't believe them: contrary to what they say, the code written by the machine is a derivative work, no matter how hard they'll try to press on this. Not only you can easily obtain code that is a direct copy paste of existing code (with comments, if needed :)) but even if you don't, you'll end up with code that 1) has a striking similary with exiting code (after I have used it for a while, I don't envision Copilot to magically imagine new algorithms) and 2) is directly inspired by the input code (Copilot is unable to code a solution to a new problem ; for exemple, it cannot propose you to use an API it does not already know).

So, as a conclusion, I would not go by the "a model created using this code is a derivative work" clause. I would go by "any code created by a ML model trained with this code is a derivative work" clause which I find both more logical and more satisfactory. As a consequence, the tool itself can exist, but cannot be used to created anything but free software - as I see it, this would be a win-win situation (although it might be tough to market to software shops :))

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 15:57 UTC (Fri) by bluca (subscriber, #118303) [Link]

> I am inclined to follow the same direction. The EU directive is quite precise on what it implies - and Copilot as a research program would qualify for the exemption but Copilot as a commercial program does not.

The EU directive allows TDM for commercial programs too. It adds an opt-out provision for that case.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 16:36 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link]

I absolutely agree that any code produced by Copilot that is a verbatim copy of anything from its training corpus would be a copyright violation, unless that piece fell under one of the established limits on copyright, such as purely functional material that has a restricted number of ways it can be expressed or a snippet that's too small to be considered expressive. Material that's suspiciously similar to something in the training corpus would be at least deeply suspect. But that would be true whether it comes from Copilot or from a human programmer. If your code is a copy of someone else's, it's a copyright violation regardless of how it got that way unless it isn't eligible for copyright in the first place.

The problem with trying to restrict Copilot (and similar programs) is threefold:

Existing copyright law allows authors to be exposed to and influenced by existing works. For example, if I read the source code to the Linux kernel, my brain does not become tainted by the ideas therein, making any program I write a derivative of the Linux kernel. The same thing would seem to apply to teaching a ML model on existing software.
To the extent using your code as a training corpus would be considered copying, it would likely be considered to be a sufficiently transformative use that it would fall under fair use. Their ML model is a completely different thing from your original program and serves a completely different function. That is pretty much the definition of a transformative use.
Specifically regarding Free Software under the FSF guidelines, one of the essential freedoms of free software (freedom 1) is the right to "study how the program works". Software that is licensed under a FSF license has accepted this as a fundamental principle, and it seems to me that it absolutely applies to using the program as part of a teaching corpus for a ML model.

Again, this applies only to training the model. The output of the model is a different thing and may be a copyright violation even if the model itself isn't.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 10:00 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (9 responses)

> if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

Why? Do you have examples of this occurring?

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 12:25 UTC (Fri) by Wol (subscriber, #4433) [Link] (8 responses)

No. But legal principles? From Groklaw?

PJ was very clear on this - Judges are very reluctant to rock the boat. If "is this a derivative work" is not clear then, given the choice of a NARROW interpretation of the licence that says "the licence denies permission, I'll side with the licence", or a BROAD interpretation that says "all licences like that are invalid", which one are they going to choose?

Especially when the defendant has "known or should have known" the plaintiff's express wish and ignored it.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 17:45 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

The point is that isn't how licenses work. The idea behind a copyright license is that the licensor grants the licensee some rights they would normally be denied by copyright law in exchange for a consideration. For example, if copyright law would normally deny me the right to use your program to train my ML model, you can write a license that would grant me that right.

But in practice a license can't prevent someone from doing something they would otherwise have the right to do under copyright law. It is possible to write a license that requires the licensee giving up some rights under copyright law as part of the consideration they get for receiving some other rights. But nobody is forced to agree to the license! If they simply refuse to accept the license, they can continue doing anything they normally had the right to do under copyright law. Refusing the licensing terms would deny them whatever rights the license would grant them, but if they weren't intending to do those things it's an empty threat.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:25 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> But in practice a license can't prevent someone from doing something they would otherwise have the right to do under copyright law.

But if copyright LAW is not clear on the matter?

That is what everybody is ignoring - it is down to the Judge to decide what the law IS. If the licence explicitly refuses permission, does the Judge make a NARROW ruling that says the licensor's explicit wishes rule, or a BROAD ruling that all such clauses are invalid.

PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time.

And I don't know which case it was, but there was a discussion about a pro-software-patent Judge some while back, who ruled "In THIS case, the software is clearly non-patentable. I can't conceive of a scenario where any software is patentable". Note he didn't even attempt to say software isn't patentable. He was pro-patents. But he stated, in a ruling, "I don't think it is possible for software to pass the patentability bar". He made a very narrow ruling, but accepted that the consequences would probably be wide.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:21 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time.

But that applies only if the broad and narrow ruling turn out the same way. In that case, the judge will usually rule on the narrowest possible grounds that results in the outcome they think is right for the case. If the broad and narrow grounds for the ruling have opposite results, the judge has to go based on which one seems to be a better reading of the law and situation, not just on narrow versus broad. More generally, narrow vs broad is something that's more true of low-level judges than of higher-level ones. Even if individual judges make narrow rulings, it's likely that different judges will rule differently. That will create uncertainty and force a higher court to rule on the matter, creating a broader ruling. That's the way these things usually go.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 17:05 UTC (Sat) by khim (subscriber, #9252) [Link]

> But that applies only if the broad and narrow ruling turn out the same way.

Where does that idea comes from? Narrow ruling is used precisely to ensure the possibility of broad ruling (made in a different case by a different judge later) to proclaim the opposite outcome!

> If the broad and narrow grounds for the ruling have opposite results, the judge has to go based on which one seems to be a better reading of the law and situation, not just on narrow versus broad.

Narrow is almost always better. Because, well, it's narrow. It describes the situation more precisely. The only save you can have is to proclaim that narrow reading is so narrow it's not applicable to your case at all.

That often happens with patents (judge is presented with half-dozen of patents which can be, theoretically, be treated as prior art and eliminate the patent completely, but 9 times out of 10 judge doesn't do that, but only just proclaim that yes, patent is still valid, just not applicable for your case).

> That's the way these things usually go.

I would say it's the way these things usually don't go. 99% of time decision doesn't reach high enough courts to decide anything definitively. Usually it takes dozens of cases and decades of litigation for that to happen.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 19:30 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (3 responses)

Intent will matter in a bunch of cases, but in this case the defining factor is whether or not something is a derived work. In two otherwise identical scenarios, one involving a license that says "I consider this to be a derived work" and one without that claim, from the standpoint of copyright law we'd expect both outcomes to be the same - the license doesn't determine how far copyright law reaches, so anything else would be bizarre.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:43 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> Intent will matter in a bunch of cases, but in this case the defining factor is whether or not something is a derived work.

And that is exactly the argument in front of the Judge. *IS* it a derived work? And if the law is unclear, and the licensor is explicit that he considers it a derived work, then the only safe option for the Judge is to rule that it IS a derived work and let the legislators sort it out.

And this is why you can NOT "defer to the law" in this argument. The question at issue is not "is this a derivative work according to the law?", but "what is the law?". THAT is the argument in front of the Judge.

The legislators can choose to let the genie out the bottle. The Judge will not be happy about letting the genie out the bottle off his own bat.

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:13 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (1 responses)

The license stating that something is considered a derived work doesn't make it a derived work. The copyright holder is still going to have to turn up in court and make that argument, and whoever created the allegedly derived work is going to defend against that. If the same judge heard the same arguments in a case where the license made no claims about whether or not something was a derived work, we'd expect to see the same decision.

DeVault: GitHub Copilot and open source laundering

Posted Jul 1, 2022 8:32 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

It would be very hard to argue in court that code produced by an AI trained on an existing code base is anything else than a derivation of this code base, since there is no human inventivity involved to muddy up waters.

Despite years of commercial pretense to the reverse professionals know “smart” systems are no smarter than the human who coded them.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 18:23 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (4 responses)

> then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

Nope, that's not how the law works. Stating it over and over again does not make it true.

For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 18:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

(Self-nitpick: I probably should have written "the prevailing copyright law" rather than "the copyright statute" because, in common law countries, the former also includes prior legal precedents.)

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 20:50 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

And if that's not enough for him to make up his mind?

That *SHOULD* be all that's needed. But if that IS all that's needed, why can't we all make our own minds up? Surely it's obvious? Why do we need Judges? It can't be THAT hard ... ?

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:59 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Sure, you can write the government a letter saying that you're seceding from your country and making up your own laws. It's not going to accomplish anything, but you can do it anyway.

DeVault: GitHub Copilot and open source laundering

Posted Jun 27, 2022 14:18 UTC (Mon) by anselm (subscriber, #2796) [Link]

For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.

I think it would still be of some interest whether X resulted from Y through “cp Y X” or through a query to Copilot whose model was trained on Y. In the first case, X is pretty clearly a derived work of Y. In the second case, Microsoft, at least, would probably like to claim it isn't.

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 0:11 UTC (Sun) by salimma (subscriber, #34460) [Link] (1 responses)

I'm surprised Mercurial is not mentioned. Sourcehut supports it, and GitHub doesn't, so merely by using it you set up some barrier against people accidentally uploading a fork to GH.

It will probably also curtail your network effect even further though.

DeVault: GitHub Copilot and open source laundering

Posted Jun 27, 2022 9:22 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Eh. `git-remote-hg` exists and works well enough that it's the only way that I interact with Mercurial repos at least.

DeVault: GitHub Copilot and open source laundering

Posted Jun 23, 2022 22:26 UTC (Thu) by jmspeex (subscriber, #51639) [Link] (16 responses)

Actually, this is not just about the GPL and copyleft licenses. Even a license as permissive as the BSD still has a few requirements (including the copyright notice) that are easy to break with ML code generation. So if the code your ML model spits out is BSD-licensed, you could still get in trouble over it.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 1:07 UTC (Fri) by developer122 (guest, #152928) [Link] (15 responses)

This is an important point.

There were a raft of people arguing that all code from copilot should be GPL due to ingested GPL code, to which I reply: What gives the GPL priority?

There is code on github under lots of licences, eg the CDDL which is a copyleft free software licence but incompatible with the GPL. As there is significant CDDL code on github, under the same reasoning perhaps all the output should be CDDL?

An then there's all the code on github which specifies no licence at all. Just because it is publicly available does not mean any and all rights have been granted to you. Plenty of this code is proprietary.

The end result is an unlicencable pile of legal mush that no sane lawyer should be going anywhere near.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 8:16 UTC (Fri) by bluca (subscriber, #118303) [Link] (14 responses)

Training of the model is not done under the terms of any license. It's done under the copyright directive laws, which give an exception on copyright rules when doing text and data mining, for any purpose. I don't see how there's any 'legal mush'.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 9:19 UTC (Fri) by Vipketsh (guest, #134480) [Link] (9 responses)

I think the "legal mush" comes from what constitutes data/text mining and if the entirety of copilot wholly fits under it.

I can see it being difficult to dispel an argument that Copilot does something more than just mining when it is able to regurgitate numerous lines of code verbatim, including comments. Is it text/data mining if the representation of some code is changed to a set of statistical weights ? Is that any different to, say, compressing the code ? Is it different to encrypting the code requiring a password to gain back the original representation ?

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:09 UTC (Fri) by bluca (subscriber, #118303) [Link] (8 responses)

I'm not a lawyer, but me thinks you'll have a pretty hard time trying to prove getting text bodies to train an AI model does not constitute Text and Data Mining. I mean, it's pretty much what it is for, and AI/machine learning are mentioned plenty of times by the legislators as examples, as far as I can see.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 11:52 UTC (Fri) by Vipketsh (guest, #134480) [Link] (7 responses)

It's the same thing, just you changed the question to "what is an AI model?" and what constitutes "training". If it is allowed to produce a black box where you feed in some code and then get part of that code out the other end verbatim what limits are placed on the inside of that black box ? Can I call an HDD an "AI model" and "train" it by storing some code on it ?

The major considering factor in this case for me (IANAL) is that if "AI models" are allowed to reproduce code verbatim then there is a legally sound method to de-license any and all code: Take something similar to copilot train it with the code you would like to de-license and then get your device to reproduce it. That is clearly not acceptable, yet it is what copilot does in some cases.

In the end this is a new situation which touches on various concepts and courts and/or law makers need to figure out where this and similar situations stand.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 13:01 UTC (Fri) by bluca (subscriber, #118303) [Link]

Again, not a lawyer, but if you think you can convince a judge that an hard drive is an AI model, I wouldn't put my money on it, but knock yourself out.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 18:37 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

IMHO we don't need to figure this out, unless somebody wants to allege that GitHub is violating the AGPL by not providing the model's source code, parameters, etc. Such a lawsuit would be complicated, messy, and bring in all sorts of difficult legal issues.

For literally any other legal claim under the sun, you can just point directly at the output and say "Regardless of how it was created, that output is substantially similar to some portion of the XYZ codebase, and therefore it is a derivative work of XYZ and is in violation of [whatever license applies]." Judges have been dealing with the "we can't directly prove that the defendant copied the plaintiff's work" problem for a very long time, and are perfectly capable of finding for the plaintiff anyway.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 19:05 UTC (Fri) by samth (guest, #1290) [Link] (2 responses)

No, you can't "trick" copyright law. A judge would say "obviously those are different". The fact that from a computing perspective, there is a sense in which they're the same thing, or that there is not way to draw a bright line distinction between a hard drive and Copilot, doesn't matter at all. The law does not work the way computers do.

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:16 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> The law does not work the way computers do.

The problem is when the law says "black is white", or "true is false". What was that about the ?Pennsylvania legislature trying to define "pi = 3"?

Lawyers are supposed to understand logic. Maths is a subset of logic. Yet how many legal screw-ups do we have because lawyers want to legislate maths out of existence?

And then we get people here who think law is "black and white". PJ was quite clear - "law is squishy". BECAUSE lawyers don't understand black and white! Or because life mostly ISN'T black and white!

It would be lovely if things were that simple. But what answer do you get when you ask the question "what's 6 times 9?". When you ask the question "What is reality?" Because even a Physicist will turn round and say "the Universe doesn't know"!. YOUR reality is different from MINE, and we have no way of telling if there even IS a correct version of events.

For the most part, law tends to fix (or forget) its worst mistakes, but isn't that true of Computer Scientists too?

You need to try and twist reality to your view, or you'll find other people WILL twist it to theirs. Quite possibly without even trying or intending to ...

Cheers,
Wol

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 22:52 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

> What was that about the ?Pennsylvania legislature trying to define "pi = 3"?

Hey, PA has its problems. But this one came from Indiana. Luckily there was a school teacher that was there to help teach some math.

https://www.straightdope.com/21341975/did-a-state-legisla...

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 21:12 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

The major considering factor in this case for me (IANAL) is that if "AI models" are allowed to reproduce code verbatim then there is a legally sound method to de-license any and all code: Take something similar to copilot train it with the code you would like to de-license and then get your device to reproduce it. That is clearly not acceptable, yet it is what copilot does in some cases.

No, because copyright is fundamentally path independent. All that is needed is to show that an identifiable part of copyrighted work A shows up in work B. Once the author of work A has shown that, a copyright violation is assumed, and it's up to the author of B to provide a specific defense for why it isn't a copyright violation. There are valid defenses, but laundering the code through an AI is not one of them.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 6:00 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

Clarification: Copyright is path-independent, provided that a path actually exists. In theory, you do have to show that the defendant had access to the copyrighted work, and that this somehow affected the contents of the allegedly infringing work. But in practice, this is a very weak requirement. You don't have to make specific claims about exactly what the defendant did or the path they took from A to B. You just have to show that the path probably exists. OTOH, if the defendant can plausibly argue lack of access (e.g. "Your code was proprietary, and I couldn't read it even if I wanted to"), then that is a valid defense to copyright infringement (provided, of course, that the finder of fact believes this argument).

DeVault: GitHub Copilot and open source laundering

Posted Jun 24, 2022 16:56 UTC (Fri) by Lennie (subscriber, #49641) [Link]

The comment above had a very nice example how it's NOT that simple:

https://lwn.net/Articles/898818/

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 14:23 UTC (Sat) by eduperez (guest, #11232) [Link] (2 responses)

> Training of the model is not done under the terms of any license. It's done under the copyright directive laws, which give an exception on copyright rules when doing text and data mining, for any purpose. I don't see how there's any 'legal mush'.

I do not think anybody is arguing about the model training, but the output from the model once it has been trained.

If the model produces an output that is a verbatim copy of some GPL'd code (see https://twitter.com/mitsuhiko/status/1410886329924194309 for an example), is that code free now, just because it was produced by some AI? Or is it still protected by the GPL, because it is a derivative work? When can we consider that the output has been produced by the AI, and when can we consider it is still a derived work? This is the legal mush.

DeVault: GitHub Copilot and open source laundering

Posted Jun 25, 2022 21:27 UTC (Sat) by Vipketsh (guest, #134480) [Link] (1 responses)

I think rgmoore and NYKevin answered that question pretty convincingly in their replies to my comment above: if you believe that your code exists in something it shouldn't you argue that and courts don't care how it got to where it shouldn't have, only if it is there or not. Perhaps I can put it another way: what the AI model produces doesn't matter, only where that output ends up in.

Even so, I still think that the legal status of the "AI model" warrants further introspection. When talking about open source code the "AI model" isn't much of a consideration because at worst it stores code in some cryptic way that is available in a much more easier digested form. But that changes a whole lot if the training material is some propriety code stolen from somewhere. If your "AI model" is able to reproduce the stolen code verbatim (or sufficient parts for copyright to apply) and training of the "AI model" is "give an exception on copyright rules when doing text and data mining, for any purpose" (bluca's words) that should mean that this trained "AI model" is fully legal and thus a legal distribution mechanism for the stolen code. Surely there is a legal principal to prevent things to work out this way ? At what point does an "AI model" turn into a "distribution mechanism" ?

DeVault: GitHub Copilot and open source laundering

Posted Jun 26, 2022 9:46 UTC (Sun) by bluca (subscriber, #118303) [Link]

The TDM provisions in the copyright directive apply only for legally accessible corpora. A stolen body of code cannot be data mined legally just because it's available, it needs to be legally available.

Not trained on MS closed source

Posted Jun 24, 2022 22:09 UTC (Fri) by glenn (subscriber, #102223) [Link] (12 responses)

From the Copilot FAQ:

GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

It's not trained on Microsoft's internal closed source projects. Gee, I wonder why this is.

Not trained on MS closed source

Posted Jun 24, 2022 22:27 UTC (Fri) by bluca (subscriber, #118303) [Link] (10 responses)

Because those are not on Github, they are on a completely separate and older system which is a massive pain in the backside to even gain access to, let alone use. Not surprised they wouldn't touch that with a barge pole to be honest.

Not trained on MS closed source

Posted Jun 24, 2022 22:46 UTC (Fri) by glenn (subscriber, #102223) [Link] (2 responses)

Maybe we can take Copilot seriously once Microsoft pledges to indemnify their users from lawsuits related to the code Copilot generates.

Not trained on MS closed source

Posted Jun 25, 2022 9:18 UTC (Sat) by bluca (subscriber, #118303) [Link] (1 responses)

What lawsuits?

Not trained on MS closed source

Posted Nov 11, 2022 22:16 UTC (Fri) by glenn (subscriber, #102223) [Link]

It took some time, but there's this one: https://lwn.net/Articles/914150

It's not unthinkable that users of the tool who for commercial purposes could also be sued.

Not trained on MS closed source

Posted Jun 26, 2022 22:27 UTC (Sun) by LtWorf (subscriber, #124958) [Link] (6 responses)

There's plenty of proprietary stuff on github that they didn't dare to use.

I think this alone shows that they are not very sure about the legality of what they are doing, but trust that developers won't be able to do anything about it (unlike the paying customers).

Not trained on MS closed source

Posted Jun 27, 2022 0:31 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

Exactly which proprietary stuff is publicly available and wasn't used?

Not trained on MS closed source

Posted Jun 27, 2022 5:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

GitHub does not require uploaded code to be FOSS, as long as it allows the use of GitHub's "fork" button (loosely equivalent to git clone), and one or two other pieces of site functionality. In theory, they could have limited their training set to only include FOSS repositories, and not proprietary or no-license repositories (in most jurisdictions, no license means the same thing as "all rights reserved").

But I couldn't find any statement in their FAQ one way or the other - it just refers to "public repositories on GitHub," a category including both FOSS and proprietary code. It's entirely possible that they are using all of that code, and IMHO that seems like the most straightforward way to read the sentence (which doesn't mean that it is the intended meaning, of course).

Not trained on MS closed source

Posted Jun 27, 2022 12:45 UTC (Mon) by nim-nim (subscriber, #34454) [Link] (2 responses)

Or more accurately, they are *very* sure of the (il)legality of their approach and only dare to use it against projects that they feel won’t fight back.

Microsoft has access to plenty enough of proprietary code to train a model on, that they chose to use other people’s code instead says volumes.

Not trained on MS closed source

Posted Jun 27, 2022 18:31 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

Or more accurately, it is trained on publicly (legally) available corpora, as the law requires to be excepted from copyright restrictions

Not trained on MS closed source

Posted Jun 30, 2022 13:45 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

FOSS is not the same thing as public domain, it is all copyrighted

Not trained on MS closed source

Posted Jun 27, 2022 14:44 UTC (Mon) by excors (subscriber, #95769) [Link]

I think there are significant non-copyright reasons to exclude non-public code. E.g. The Copilot FAQ says:

> Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be rare that GitHub Copilot suggestions included personal data verbatim from the training set. [...] We have implemented a filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough. We will keep improving the filter system to be more intelligent to detect and remove more personal data from the suggestions.

"rare" != "never", and if someone stores sensitive personal data in a private GitHub repository then they absolutely don't want Copilot to reveal that information publicly to anyone who tries hard enough.

I expect the same applies to other confidential information, like yet-to-be-announced product names that companies might store in private repositories, or secret keys, or algorithms that they're protecting as trade secrets, etc.

Since the Copilot training data apparently includes public repositories even if they have a restrictive license, but excludes private repositories even if they have a very permissive license, it sounds like GitHub is confident that there are no copyright issues but is concerned about those other privacy issues.

Not trained on MS closed source

Posted Jun 27, 2022 14:26 UTC (Mon) by geert (subscriber, #98403) [Link]

"trained on A, including B", does not mean that it was trained on B only.
So it may have been trained on any source code that is publicly available; we don't know what exactly, the description in the FAQ is very vague (deliberately?).

FWIW, it might have been trained on whatever proprietary code that was ever leaked to the Internet, which might even include the sources of some version of Microsoft Windows ;-)

DeVault: GitHub Copilot and open source laundering

Posted Aug 30, 2022 13:27 UTC (Tue) by scientes (guest, #83068) [Link] (1 responses)

I think there is too much attention to this feature rather the freedom-stealing aspects of github.

What I noticed is that github has serious rate-limiting on their search functionality, but google's codesearch project (which was a nerdy project and the RE2 regex library was built for it) embraced these type of searches. It looks like Microsoft is hostile to technical proficiency and just wants control and classical Microsoft things.

DeVault: GitHub Copilot and open source laundering

Posted Aug 30, 2022 15:22 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

IIRC, search was hampered with rate limits and "must be logged in to search repo contents" since before the acquisition.