Why is Copilot so bad?
Posted Jul 2, 2022 1:14 UTC (Sat) by pabs (subscriber, #43278)
In reply to: Why is Copilot so bad? by SLi
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!
Posted Jul 2, 2022 9:09 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (23 responses)
In fact, this feels somewhat like a knee-jerk reaction based on a hated company being behind this. Among the masses, I think that's a large part of it. I'm not arrogant enough to think that genuinely knowledgeable and philosophical people like those at SFC have that attitude, though, so that's what leaves me confused.
To me, it seems that the argument is essentially that it should not be realistically legally possible to create good AI models (for which you need at least hundreds of gigabytes of source code) because of copyright reasons and the impossibility of vetting a copyright-safe set of such code. And I think this is a very counterproductive argument.
Training even one such model is expensive enough (and bad enough for the environment) that it really, really should not be done separately for each mutually incompatible free software license, let alone redone every time you discover that there are license unclarities with some small part of the input (I know, unheard of in the free software world...).
Posted Jul 2, 2022 11:49 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (11 responses)
There's nothing wrong with the MODEL. But there's everything wrong with the USES THAT MAY BE MADE of the output.
The output is - must be - a derivative work of the inputs used to create it. That's what the word "derivative" means.
This then brings copyright into play. The output may be a conglomerate of multiple similar works, in which case the copyright status is probably "too trivial to be eligible". Or the output may be the sole match for a complex piece of code someone is trying to write, tempting them just to take Copilot's output verbatim as the solution to their problem. In that case the copyright status is "blatant piracy". And there's all the other points on the spectrum between them.
Mining publicly available code and using it for education is fine - why else would it be publicly available? It's generally accepted that stuff is out there for people to read and learn from.
But it's NOT generally acceptable that stuff is put out there for others to make a fast buck from. Using Copilot output for commercial purposes is NOT an acceptable default status - a lot of it has been shared on a "share and share alike" basis and people who don't "play the game" are Stealers of the Commons. Dunno about other countries, but "Finders Keepers" could land you in jail for theft over here (unlikely, but perfectly possible - you have a duty to try and find the rightful owner).
Cheers,
Wol
Posted Jul 2, 2022 12:40 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (10 responses)
The model is pretty much useless if you cannot use it for anything without running afoul of copyright. I really think that would be a very harmful development.
Luckily, I also think that that understanding of what a derivative work means in the copyright context is pretty wild and likely incorrect. Well, whatever the law turns out to mean, I wish people would stop advocating for such harmful interpretations. It may be that the law ends up preventing any real AI code models, but it definitely should not.
Posted Jul 2, 2022 13:10 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (8 responses)
Posted Jul 4, 2022 9:23 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (7 responses)
Copyleft only exists because copyright exists. I think we can agree on this point.
People use copyleft licenses because they want their work to remain free.
IF copyright didn't exist and all software source was public domain, we'd all be very glad that copilot was there to help write code that would be free.
However copyright does exist, and copilot is going to be used mostly to write copyrighted proprietary software, using copyleft software. This is clearly something that the authors of copyleft software didn't want.
Not using github is not a solution, because anyone (including microsoft itself) has every right to just mirror whatever they want onto github.
Now you claim (and have an economic interest in claiming so) that copilot does not infringe. However you aren't a judge. And while I do agree that creating the model does not infringe, the output generated from the model is another thing entirely, and that might be infringing.
In any case people who wrote GPL code know that their work is going to be used in proprietary code, which goes against the license and against their wishes when they chose that license.
You are just betting that a future lawsuit will say that you are right. But even if you are wrong, it will be the users of copilot who are in violation, so microsoft is betting that it will be very hard to find who to sue and no lawsuit will ever happen.
To respond to your comment, no, having your license terms respected is not "bleak". Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
The law allowing the training of an ML model doesn't say anything about using that model to generate new content.
Posted Jul 4, 2022 11:47 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (6 responses)
Ahah if that were an actual issue, I'd have been fired a long time ago, I can assure you
> (and have an economic interest in claiming so)
I am not in GH and I am not a shareholder, so you can park this nonsensical tinfoil-hattery straight away - I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.
> To respond to your comment, no, having your license terms respected is not "bleak".
It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software besides some boring indexing or suchlike, as it would be de-facto impossible to compile a legal training corpus unless you have a metric ton of private code available to you. That would be dreadful, and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
> Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code, that is to say all of it that is publicly available on Github (there's loads), as the team has said multiple times in public, because that's where the training data comes from. If it's on different systems (external or internal), it wasn't used, it's as simple as that - I don't even know if the GH org can access other systems, but from my own experience, I'm pretty sure they cannot even if they wanted to.
> The law to train a ML model doesn't say anything about using that model to generate new content.
Lawmakers were clearly and openly talking about AI applications, and not just some indexing applications or other such activities. A giant chunk of AI R&D is in the field of generating content, like GPT and so on. It seems like a bold assumption to think that the lawmakers weren't aware of all that.
Posted Jul 4, 2022 13:03 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (5 responses)
You claim that, but here you are with 26 comments defending microsoft's actions.
> I am not in GH and I am not a shareholder
I'm sure you have vested or will vest stock. It's common practice. And you do get a salary, I hope?
> I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.
Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, giving up your credit card number). So it's not like it's easy to test and form an opinion.
> It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software
Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.
> That would be dreadful
It is dreadful indeed. I'm not sure why you are considering microsoft to be this little innocent startup company.
> and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
That's your personal opinion, which you keep repeating, but there is no agreement. And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
> It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code
The open source one… not the proprietary one… Be intellectually honest, please. I talked about proprietary code and you replied with something entirely off-topic.
> If it's on different systems (external or internal), it wasn't used
And why is that? Why didn't microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output perhaps?
> Lawmakers were clearly and openly talking about AI applications
Generating code is not the only ML application that can exist. Classifiers are ML.
I'm sure the lawmakers were aware, and that's why they talked about "training data" but not about "spitting out the training data verbatim".
You are reading what you would like to be written rather than what is actually written.
Posted Jul 4, 2022 18:16 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
And...?
> Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, giving up your credit card number). So it's not like it's easy to test and form an opinion.
You forgot hand-carving new silicon behind a blast door in a hazmat suit. Also TIL that Neovim is a proprietary editor. And there's no need for a credit card if you are an open source maintainer, you get it for free.
> Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.
You are both failing to see the point (major corporations would be fine if the law worked like the maximalists wanted it to; it's the rest that would be worse off) and also talking nonsense: there is no license violation anywhere. Feel free to point to the court cases if not. Just because a few trolls and edgy teenagers shout "violation!" it doesn't mean it's actually happening; you need to prove it. Can you?
> And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
The fact that some are complaining doesn't mean the alternative, if the law were different, would be better. There are plenty of anti-vaxxers complaining about vaccination programs worldwide; it doesn't mean we'd be better off without vaccines.
> And why is that? Why didn't microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output perhaps?
It's because of the aliens trapped in those repos, duh! Now if you take off your tin foil hat for a moment and go read other replies, I've already given my uninformed guess on why only public repos on Github are used.
> You are reading what you would like to be written rather than what is actually written.
I'm not the one claiming that training a model violates copyright when it's explicitly allowed by law.
Posted Jul 4, 2022 18:30 UTC (Mon)
by corbet (editor, #1)
[Link] (3 responses)
I'm thinking that perhaps this particular subthread has gone as far as it needs to; let's stop it here.
Thank you.
Posted Jul 5, 2022 14:35 UTC (Tue)
by nye (subscriber, #51576)
[Link] (2 responses)
Posted Jul 5, 2022 14:52 UTC (Tue)
by corbet (editor, #1)
[Link]
Perhaps you have the time to watch an out-of-control comment thread - on a holiday - to find the perfect point at which to intervene. I apologize, but I lack that time.
Posted Jul 6, 2022 9:34 UTC (Wed)
by sdalley (subscriber, #18550)
[Link]
But why argue at all? C'mon now, let's give Jon the respect he's entitled to as owner of this site...
Posted Jul 4, 2022 9:01 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link]
For example in my ML course at university we trained a thing to recognise handwriting. We didn't use it to generate a new font.
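(For illustration, a minimal sketch of the kind of model described above - assuming scikit-learn and its bundled digits dataset, which are not mentioned in the comment: a discriminative classifier that labels handwriting and has no machinery at all for producing new glyphs.)

    # Maps images of handwritten digits to labels; it cannot emit new
    # handwriting - it is discriminative, not generative.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # 8x8 grayscale digit images, labels 0-9
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # The only output is a label per image - no "new font" anywhere.
    print("accuracy:", clf.score(X_test, y_test))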
Posted Jul 3, 2022 4:28 UTC (Sun)
by pabs (subscriber, #43278)
[Link] (9 responses)
https://salsa.debian.org/deeplearning-team/ml-policy
Of course, the prohibitively large sizes of most of the training data sets and the prohibitively large costs of training make this scenario infeasible for various actually useful models, but maybe if there were a group working on and funding libre ML with training data storage, compute and reproducible training, then it would become feasible to have actually libre ML.
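(A narrow illustration of what "reproducible training" has to pin down - a sketch assuming NumPy and PyTorch, neither of which is named above: every random number generator the pipeline touches must be seeded, and even then identical data, ordering, software versions and hardware are needed before two runs yield the same model.)

    import random

    import numpy as np
    import torch

    def fix_seeds(seed: int = 42) -> None:
        random.seed(seed)        # Python's own RNG
        np.random.seed(seed)     # NumPy (shuffling, augmentation)
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Refuse any operation without a deterministic implementation.
        torch.use_deterministic_algorithms(True)

    fix_seeds()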
I don't believe that ToxicCandy models or proprietary models are a good idea. I also believe that many of the purposes ML models are put to are very unethical, and that ML researchers need to think carefully about what the model they are creating will enable.
I haven't thought about Copilot enough to comment on the rest of your post.
Posted Jul 3, 2022 6:27 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
We need to be careful with the usage of that phrase. "Data" can refer to any set of information under the sun. But many information sets are not subject to copyright protection in the US (see Feist v. Rural), and are subject to sui generis database rights (similar but not identical to copyright) in the EU. This is further complicated by the fact that US law allows you to copyright the "selection or arrangement" of data that would otherwise not be subject to copyright.
In the case of Copilot, the inputs are, of course, subject to copyright. But, if you'll excuse my use of US law (it's the legal system I know best), there are a whole bunch of unanswered questions.
Your legal system will probably have a different set of unanswered questions, which may in turn have different answers. Regardless, trying to make strong claims about what is or is not legal is a fool's errand at this point.
Posted Jul 3, 2022 6:32 UTC (Sun)
by pabs (subscriber, #43278)
[Link]
Posted Jul 4, 2022 13:15 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (3 responses)
If you were a company… would you buy copilot knowing that after that every single github user can hit you with an infringement lawsuit?
I'm sure there are patent trolls interested in acquiring the rights to some github projects and going around making claims :)
If I were a CTO in charge of a company, I'd just not buy into it, because the potential cost in legal fees and complete bankruptcy seems to greatly outweigh the time we could save.
Posted Jul 5, 2022 8:19 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
Posted Jul 5, 2022 8:49 UTC (Tue)
by geert (subscriber, #98403)
[Link] (1 response)
Posted Jul 12, 2022 8:27 UTC (Tue)
by cortana (subscriber, #24596)
[Link]
Posted Jul 3, 2022 23:21 UTC (Sun)
by SLi (subscriber, #53131)
[Link] (2 responses)
You observed correctly that training such a model costs millions. Training may become less expensive in the future due to algorithmic or hardware improvements. In practice, what you can expect to happen is that people (and companies) will train larger and more useful models than the current ones.
So, assume you have trained such a model on, say, all the code in Debian. Now it turns out there's a small piece of code there that is actually not free software, perhaps not even distributable (happens all the time, I think?). What are you going to do, retrain it from scratch?
Posted Jul 4, 2022 3:36 UTC (Mon)
by pabs (subscriber, #43278)
[Link] (1 responses)
I hadn't thought of license incompatibility, but presumably it would indeed be a concern.
I know approximately zero about ML, but AFAIK retraining is the only option when it comes to deficiencies in a model due to bad input data. For example, if a model is indirectly biased against certain groups of people, the procedure is presumably to analyse the bias in the input data, then discard some subset of that data or add more data, and then retrain the model from scratch. If an ML chatbot is racist because it was trained on internet comments from various sites, you either just delete all the ones from 4chan and hope there are no racist comments on Twitter etc :) or manually comb through all the millions of comments and delete the racist ones. Or just give up on the internet as a source of input data :) So yeah, retraining is the only option in the face of non-free or non-redistributable code input.
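(A minimal sketch of the "discard and retrain" procedure described above; is_unwanted and train_model are hypothetical stand-ins, not part of any real pipeline. The point is only that the remedy operates on the corpus, never on the finished model.)

    # The only remedy for bad training data is to fix the corpus and
    # retrain from scratch; nothing here can edit the existing model.

    def train_model(corpus):
        ...  # stands in for the (expensive) from-scratch training run

    def is_unwanted(doc: dict) -> bool:
        # Hypothetical filter: drop documents from a banned source, or
        # ones flagged by whatever audit identified the problem.
        return doc["source"] == "4chan" or doc.get("flagged", False)

    def retrain(corpus: list[dict]):
        cleaned = [doc for doc in corpus if not is_unwanted(doc)]
        return train_model(cleaned)  # pay the full training cost again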
Posted Jul 4, 2022 9:49 UTC (Mon)
by SLi (subscriber, #53131)
[Link]
I have a very hard time seeing that being repeated whenever someone discovers there were a few kilobytes of non-free code in the input.
Posted Jul 3, 2022 5:39 UTC (Sun)
by oldtomas (guest, #72579)
[Link]