Why is Copilot so bad?

Posted Jul 2, 2022 11:49 UTC (Sat) by Wol (subscriber, #4433)
In reply to: Why is Copilot so bad? by SLi
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

> Ok, but is it seriously a good idea to prevent people from creating (including free and open) AI models from publicly available data? To me that's the kind of copyright maximalism I did not expect from the free software movement.

There's nothing wrong with the MODEL. But there's everything wrong with the USES THAT MAY BE MADE of the output.

The output is - must be - a derivative work of the inputs used to create it. That's what the word "derivative" means.

This then brings copyright into play. The output may be a conglomerate of multiple similar works, in which case the copyright status is probably "too trivial to be eligible". Or the output may be the sole match for a complex piece of code someone is trying to write, tempting them just to take Copilot's output verbatim as the solution to their problem. In that case the copyright status is "blatant piracy". And there are all the other points on the spectrum in between.

Mining publicly available code and using it for education is fine - why else would it be publicly available? It's generally accepted that stuff is out there for people to read and learn from.

But it's NOT generally acceptable that stuff is put out there for others to make a fast buck from. Using Copilot output for commercial purposes is NOT an acceptable default status - a lot of it has been shared on a "share and share alike" basis and people who don't "play the game" are Stealers of the Commons. Dunno about other countries, but "Finders Keepers" could land you in jail for theft over here (unlikely, but perfectly possible - you have a duty to try and find the rightful owner).

Cheers,
Wol



Why is Copilot so bad?

Posted Jul 2, 2022 12:40 UTC (Sat) by SLi (subscriber, #53131)

The model is pretty much useless if you cannot use it for anything without falling foul of copyright. I really think that would be a very harmful development.

Luckily, I also think that that understanding of what a derivative work means in the copyright context is pretty wild and likely incorrect. Well, whatever the law turns out to mean, I wish people would stop advocating for such harmful interpretations. It may be that the law ends up preventing any real AI code models, but it definitely should not.

Why is Copilot so bad?

Posted Jul 2, 2022 13:10 UTC (Sat) by bluca (subscriber, #118303)

Thank you for clearly expressing these points - the future some of the commentators wish for is very bleak. It seems to me mostly a knee-jerk 'Microsoft bad!' reaction, without realizing that if using public code repositories to train an AI model were not allowed, in practice only giant corporations with huge caches of proprietary internal code (like... Microsoft!) would be able to legally build and sell an AI product such as Copilot. That would be a really sad and bleak outcome. Fortunately the law, at least in Europe, doesn't seem to be going in that direction.

Why is Copilot so bad?

Posted Jul 4, 2022 9:23 UTC (Mon) by LtWorf (subscriber, #124958)

I understand that you work for Microsoft. However, most of us don't, so we can speak our minds more freely, since we are not afraid of getting fired :)

Copyleft only exists because copyright exists. I think we can agree on this point.

People use copyleft licenses because they want their work to remain free.

IF copyright didn't exist and all software source were public domain, we'd all be very glad that Copilot was there to help write code that would be free.

However copyright does exist, and Copilot is going to be used mostly to write copyrighted proprietary software, while drawing on copyleft software. This is clearly something that the authors of the copyleft software didn't want.

Not using GitHub is not a solution, because anyone (including Microsoft itself) has every right to just mirror whatever they like on GitHub.

Now you claim (and have an economic interest in claiming so) that Copilot does not infringe. However, you aren't a judge. And while I do agree that creating the model does not infringe, the output generated from the model is another thing entirely, and that might be infringing.

In any case, people who wrote GPL code know that their work is going to be used in proprietary code, which goes against the license and against the wishes they expressed when they chose that license.

You are just betting that a future lawsuit will say that you are right. But even if you are wrong, it will be the users of Copilot who are in violation, so Microsoft is betting that it will be very hard to find whom to sue and that no lawsuit will ever happen.

To respond to your comment, no, having your license terms respected is not "bleak". Microsoft would be very free to train Copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build Copilot on other people's works, which are indeed copyrighted.

The law that permits training an ML model doesn't say anything about using that model to generate new content.

Why is Copilot so bad?

Posted Jul 4, 2022 11:47 UTC (Mon) by bluca (subscriber, #118303)

> I understand that you work for Microsoft. However, most of us don't, so we can speak our minds more freely, since we are not afraid of getting fired :)

Ahah, if that were an actual issue I'd have been fired a long time ago, I can assure you.

> (and have an economic interest in claiming so)

I am not in GH and I am not a shareholder, so you can park this nonsensical tinfoil-hattery straight away - I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes, I might add.

> To respond to your comment, no, having your license terms respected is not "bleak".

It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software besides some boring indexing or suchlike software, as it would be de facto impossible to compile a legal training corpus unless you have a metric ton of private code available to you. That would be dreadful, and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.

> Microsoft would be very free to train Copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build Copilot on other people's works, which are indeed copyrighted.

It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code, that is to say, all of it that is publicly available on GitHub (there's loads), as the team has said multiple times in public, because that's where the training data comes from. If it's on different systems (external or internal), it wasn't used - it's as simple as that. I don't even know if the GH org can access other systems, but from my own experience, I'm pretty sure they cannot even if they wanted to.

> The law that permits training an ML model doesn't say anything about using that model to generate new content.

Lawmakers were clearly and openly talking about AI applications, and not just indexing applications or other such activities. A giant chunk of AI R&D is in the field of generating content, like GPT and so on. It seems a bold assumption to think that the lawmakers weren't aware of all that.

Why is Copilot so bad?

Posted Jul 4, 2022 13:03 UTC (Mon) by LtWorf (subscriber, #124958)

> Ahah, if that were an actual issue I'd have been fired a long time ago, I can assure you.

You claim that, but here you are with 26 comments defending Microsoft's actions.

> I am not in GH and I am not a shareholder

I'm sure you have vested or will vest stock. It's common practice. And you do get a salary, I hope?

> I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes, I might add

Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, handing over a credit card number). So it's not as if it's easy to test it and form an opinion.

> It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software

Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.

> That would be dreadful

It is dreadful indeed. I'm not sure why you are treating Microsoft as if it were some innocent little startup company.

> and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.

That's your personal opinion, which you keep repeating, but there is no agreement on it. And in this case it is not better for the authors, as you can see from the fact that the authors are indeed complaining.

> It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code

The open source one… not the proprietary one… Be intellectually honest, please. I talked about proprietary code and you replied with something entirely off-topic.

> If it's on different systems (external or internal), it wasn't used

And why is that? Why didn't Microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output, perhaps?

> Lawmakers were clearly and openly talking about AI applications

Generating code is not the only ML application that can exist. Classifiers are ML.

I'm sure the lawmakers were aware, and that's why they talked about "training data" but not about "spitting out the training data verbatim".

You are reading what you would like to be written rather than what is actually written.
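To make that distinction concrete, here is a deliberately extreme sketch - entirely my own toy construction, not how Copilot actually works - of a "model" whose only ability is to return the training sample closest to the prompt, verbatim. A law that permits the training step says nothing about the copyright status of output like this:

    # Toy "code model" that can only regurgitate its training data.
    # Purely illustrative: real language models interpolate, but they
    # can still memorise and emit training samples verbatim.
    from difflib import SequenceMatcher

    corpus = [
        "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
        "def is_even(n):\n    return n % 2 == 0",
    ]

    def generate(prompt: str) -> str:
        # "Generation" here is a nearest-neighbour lookup that returns
        # a training sample, character for character.
        return max(corpus, key=lambda s: SequenceMatcher(None, prompt, s).ratio())

    print(generate("def gcd("))  # emits a training sample verbatim

Training this thing may well be lawful; redistributing what it emits is a separate question about the samples themselves, not about the training.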

Why is Copilot so bad?

Posted Jul 4, 2022 18:16 UTC (Mon) by bluca (subscriber, #118303)

> You claim that, but here you are with 26 comments defending Microsoft's actions.

And...?

> Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, handing over a credit card number). So it's not as if it's easy to test it and form an opinion.

You forgot hand-carving new silicon behind a blast door in a hazmat suit. Also, TIL that Neovim is a proprietary editor. And there's no need for credit cards if you are an open source maintainer - you get it for free.

> Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.

You are both failing to see the point (major corporations would be fine if the law worked the way the maximalists want it to; it's the rest of us who would be worse off) and talking nonsense: there is no license violation anywhere. Feel free to point to the court cases if you disagree. Just because a few trolls and edgy teenagers shout "violation!" doesn't mean it's actually happening - you need to prove it. Can you?

> And in this case it is not better for the authors, as you can see from the fact that the authors are indeed complaining.

The fact that some are complaining doesn't mean that the alternative, if the law were different, would be better. There are plenty of anti-vaxxers complaining about vaccination programs worldwide; it doesn't mean we'd be better off without vaccines.

> And why is that? Why didn't Microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output, perhaps?

It's because of the aliens trapped in those repos, duh! Now, if you take off your tinfoil hat for a moment and go read the other replies, I've already given my uninformed guess as to why only public repos on GitHub are used.

> You are reading what you would like to be written rather than what is actually written.

I'm not the one claiming that training a model violates copyright when it's explicitly allowed by law.

Can we stop here?

Posted Jul 4, 2022 18:30 UTC (Mon) by corbet (editor, #1)

I'm thinking that perhaps this particular subthread has gone as far as it needs to; let's stop it here.

Thank you.

Can we stop here?

Posted Jul 5, 2022 14:35 UTC (Tue) by nye (subscriber, #51576)

It reflects badly on you that you post this as a reply to someone responding to repeated baseless personal attacks.

Can we stop here?

Posted Jul 5, 2022 14:52 UTC (Tue) by corbet (editor, #1)

Perhaps you have the time to watch an out-of-control comment thread - on a holiday - to find the perfect point at which to intervene. I apologize, but I lack that time.

Can we stop here?

Posted Jul 6, 2022 9:34 UTC (Wed) by sdalley (subscriber, #18550)

Well, you can equally argue that the thread stopping where it did allowed the correct person to have the last word.

But why argue at all? C'mon now, let's give Jon the respect he's entitled to as owner of this site...

Why is Copilot so bad?

Posted Jul 4, 2022 9:01 UTC (Mon) by LtWorf (subscriber, #124958)

The model can be used to estimate the probability of bugs, or whatever else. Not all models are used to generate content.

For example, in my ML course at university we trained a model to recognise handwriting. We didn't use it to generate a new font.
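For illustration, here is a minimal sketch of such a purely discriminative model (assuming scikit-learn and its bundled handwritten-digit dataset; both are my choice of example, not anything from the course mentioned above). It learns to classify handwriting and has no mechanism for generating any:

    # Discriminative use of ML: classify handwriting, don't generate it.
    # Assumes scikit-learn; the dataset choice is illustrative only.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # 8x8 images of handwritten digits, labelled 0-9
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, random_state=0)

    clf = LogisticRegression(max_iter=2000)  # a classifier, not a generator
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))

    # Nothing here can emit an image, let alone reproduce a training image
    # verbatim; the "verbatim output" concern only arises for generative models.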

