Why is Copilot so bad?
Posted Jul 2, 2022 1:14 UTC (Sat) by pabs (subscriber, #43278)
In reply to: Why is Copilot so bad? by SLi
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!
Posted Jul 2, 2022 9:09 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (23 responses)
In fact, this feels somewhat like a knee-jerk reaction based on a hated company being behind this. Among the masses, I think that's a large part of it. I'm not arrogant enough to think that genuinely knowledgeable and philosophical people like those at SFC have that attitude, though, so that's what leaves me confused.
To me, it seems that the argument is essentially that it should not be realistically legally possible to create good AI models (for which you need at least hundreds of gigabytes of source code) because of copyright reasons and the impossibility of vetting a copyright-safe set of such code. And I think this is a very counterproductive argument.
Training even one such model is expensive enough (and bad enough for the environment) that it really, really should not be done separately for each mutually incompatible free software license, let alone redone every time you discover that there are license unclarities with some small part of the input (I know, unheard of in the free software world...).
Posted Jul 2, 2022 11:49 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (11 responses)
There's nothing wrong with the MODEL. But there's everything wrong with the USES THAT MAY BE MADE of the output.
The output is - must be - a derivative work of the inputs used to create it. That's what the word "derivative" means.
This then brings copyright into play. The output may be a conglomerate of multiple similar works, in which case the copyright status is probably "too trivial to be eligible". Or the output may be the sole match for a complex piece of code someone is trying to write, tempting them just to take Copilot's output verbatim as the solution to their problem. In that case the copyright status is "blatant piracy". And there's all the other points on the spectrum between them.
Mining publicly available code and using it for education is fine - why else would it be publicly available? It's generally accepted that stuff is out there for people to read and learn from.
But it's NOT generally acceptable that stuff is put out there for others to make a fast buck from. Using Copilot output for commercial purposes is NOT an acceptable default status - a lot of it has been shared on a "share and share alike" basis and people who don't "play the game" are Stealers of the Commons. Dunno about other countries, but "Finders Keepers" could land you in jail for theft over here (unlikely, but perfectly possible - you have a duty to try and find the rightful owner).
Cheers,
Wol
Posted Jul 2, 2022 12:40 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (10 responses)
The model is pretty much useless if you cannot use it for anything without running afoul of copyright. I really think that would be a very harmful development.
Luckily, I also think that that understanding of what a derivative work means in the copyright context is pretty wild and likely incorrect. Well, whatever the law turns out to mean, I wish people would stop advocating for such harmful interpretations. It may be that the law ends up preventing any real AI code models, but it definitely should not.
Posted Jul 2, 2022 13:10 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (8 responses)
Posted Jul 4, 2022 9:23 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (7 responses)
Copyleft only exists because copyright exists. I think we can agree on this point.
People use copyleft licenses because they want their work to remain free.
IF copyright didn't exist and all software source was public domain, we'd all be very glad that copilot was there to help write code that would be free.
However copyright does exist, and copilot is going to be used mostly to write copyrighted proprietary software, using copyleft software. This is clearly something that the authors of copyleft software didn't want.
Not using github is not a solution, because anyone (including microsoft itself) has every right to just mirror whatever they want onto github.
Now you claim (and have an economic interest in claiming so) that copilot does not infringe. However you aren't a judge. And while I do agree that creating the model does not infringe, the output generated from the model is another thing entirely, and that might be infringing.
In any case people who wrote GPL code know that their work is going to be used in proprietary code, which goes against the license and against their wishes when they chose that license.
You are just betting that a future lawsuit will say that you are right. But even if you are wrong, it will be the users of copilot who are in violation, so microsoft is betting that it will be very hard to find who to sue and no lawsuit will ever happen.
To respond to your comment, no, having your license terms respected is not "bleak". Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
The law allowing the training of an ML model doesn't say anything about using that model to generate new content.
Posted Jul 4, 2022 11:47 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (6 responses)
Ahah if that were an actual issue, I'd have been fired a long time ago, I can assure you
> (and have an economic interest in claiming so)
I am not in GH and I am not a shareholder, so you can park this nonsensical tinfoil-hattery straight away - I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.
> To respond to your comment, no, having your license terms respected is not "bleak".
It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software besides some boring indexing or suchlike, as it would be de-facto impossible to compile a legal training corpus unless you have a metric ton of private code available to you. That would be dreadful, and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
> Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code, that is to say all of it that is publicly available on Github (there's loads), as the team has said multiple times in public, because that's where the training data comes from. If it's on different systems (external or internal), it wasn't used, it's as simple as that - I don't even know if the GH org can access other systems, but from my own experience, I'm pretty sure they cannot even if they wanted to.
> The law to train a ML model doesn't say anything about using that model to generate new content.
Lawmakers were clearly and openly talking about AI applications, and not just some indexing applications or other such activities. A giant chunk of AI R&D is in the field of generating content, like GPT and so on. It seems like a bold assumption to think that the lawmakers weren't aware of all that.
Posted Jul 4, 2022 13:03 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (5 responses)
You claim that, but here you are with 26 comments defending microsoft's actions.
> I am not in GH and I am not a shareholder
I'm sure you have vested or will vest stock. It's common practice. And you do get a salary, I hope?
> I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.
Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, giving up your credit card number). So it's not like it's easy to test and form an opinion.
> It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software
Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.
> That would be dreadful
It is dreadful indeed. I'm not sure why you are considering microsoft to be this little innocent startup company.
> and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
That's your personal opinion, which you keep repeating, but there is no agreement. And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
> It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code
The open source one… not the proprietary one… Be intellectually honest, please. I talked about proprietary code and you replied with something entirely off-topic.
> If it's on different systems (external or internal), it wasn't used
And why is that? Why didn't microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output perhaps?
> Lawmakers were clearly and openly talking about AI applications
Generating code is not the only ML application that can exist. Classifiers are ML.
I'm sure the lawmakers were aware, and that's why they talked about "training data" but not about "spitting out the training data verbatim".
You are reading what you would like to be written rather than what is actually written.
Posted Jul 4, 2022 18:16 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
And...?
> Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, giving up your credit card number). So it's not like it's easy to test and form an opinion.
You forgot hand-carving new silicon behind a blast door in a hazmat suit. Also TIL that Neovim is a proprietary editor. And there's no need for a credit card if you are an open source maintainer, you get it for free.
> Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fish. It's happening now.
You are both failing to see the point (major corporations would be fine if the law worked like the maximalists wanted it to; it's the rest that would be worse off) and also talking nonsense: there is no license violation anywhere. Feel free to point to the court cases if not. Just because a few trolls and edgy teenagers shout "violation!" it doesn't mean it's actually happening; you need to prove it. Can you?
> And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
The fact that some are complaining doesn't mean the alternative, if the law were different, would be better. There are plenty of anti-vaxxers complaining about vaccination programs worldwide; it doesn't mean we'd be better off without vaccines.
> And why is that? Why didn't microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output perhaps?
It's because of the aliens trapped in those repos, duh! Now if you take off your tin foil hat for a moment and go read other replies, I've already given my uninformed guess on why only public repos on Github are used.
> You are reading what you would like to be written rather than what is actually written.
I'm not the one claiming that training a model violates copyright when it's explicitly allowed by law.
Posted Jul 4, 2022 18:30 UTC (Mon)
by corbet (editor, #1)
[Link] (3 responses)
I'm thinking that perhaps this particular subthread has gone as far as it needs to; let's stop it here.
Thank you.
Posted Jul 5, 2022 14:35 UTC (Tue)
by nye (subscriber, #51576)
[Link] (2 responses)
Posted Jul 5, 2022 14:52 UTC (Tue)
by corbet (editor, #1)
[Link]
Perhaps you have the time to watch an out-of-control comment thread - on a holiday - to find the perfect point at which to intervene. I apologize, but I lack that time.
Posted Jul 6, 2022 9:34 UTC (Wed)
by sdalley (subscriber, #18550)
[Link]
But why argue at all? C'mon now, let's give Jon the respect he's entitled to as owner of this site...
Posted Jul 4, 2022 9:01 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link]
For example in my ML course at university we trained a thing to recognise handwriting. We didn't use it to generate a new font.
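(For illustration, a minimal sketch of the kind of model described above - assuming scikit-learn and its bundled digits dataset, which are not mentioned in the comment: a discriminative classifier that labels handwriting and has no machinery at all for producing new glyphs.)

    # Maps images of handwritten digits to labels; it cannot emit new
    # handwriting - it is discriminative, not generative.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # 8x8 grayscale digit images, labels 0-9
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # The only output is a label per image - no "new font" anywhere.
    print("accuracy:", clf.score(X_test, y_test))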
Posted Jul 3, 2022 4:28 UTC (Sun)
by pabs (subscriber, #43278)
[Link] (9 responses)
https://salsa.debian.org/deeplearning-team/ml-policy
Of course, the prohibitively large sizes of most of the training data sets and the prohibitively large costs of training make this scenario infeasible for various actually useful models, but maybe if there were a group working on and funding libre ML with training data storage, compute and reproducible training, then it would become feasible to have actually libre ML.
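(A narrow illustration of what "reproducible training" has to pin down - a sketch assuming NumPy and PyTorch, neither of which is named above: every random number generator the pipeline touches must be seeded, and even then identical data, ordering, software versions and hardware are needed before two runs yield the same model.)

    import random

    import numpy as np
    import torch

    def fix_seeds(seed: int = 42) -> None:
        random.seed(seed)        # Python's own RNG
        np.random.seed(seed)     # NumPy (shuffling, augmentation)
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Refuse any operation without a deterministic implementation.
        torch.use_deterministic_algorithms(True)

    fix_seeds()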
I don't believe that ToxicCandy models or proprietary models are a good idea. I also believe that many of the purposes ML models are put to are very unethical, and that ML researchers need to think carefully about what the model they are creating will enable.
I haven't thought about Copilot enough to comment on the rest of your post.
Posted Jul 3, 2022 6:27 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
We need to be careful with the usage of that phrase. "Data" can refer to any set of information under the sun. But many information sets are not subject to copyright protection in the US (see Feist v. Rural), and are subject to sui generis database rights (similar but not identical to copyright) in the EU. This is further complicated by the fact that US law allows you to copyright the "selection or arrangement" of data that would otherwise not be subject to copyright.
In the case of Copilot, the inputs are, of course, subject to copyright. But, if you'll excuse my use of US law (it's the legal system I know best), there are a whole bunch of unanswered questions.
Your legal system will probably have a different set of unanswered questions, which may in turn have different answers. Regardless, trying to make strong claims about what is or is not legal is a fool's errand at this point.
Posted Jul 3, 2022 6:32 UTC (Sun)
by pabs (subscriber, #43278)
[Link]
Posted Jul 4, 2022 13:15 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (3 responses)
If you were a company… would you buy copilot knowing that after that every single github user can hit you with an infringement lawsuit?
I'm sure there are patent trolls interested in acquiring the rights to some github projects and going around making claims :)
If I were a CTO in charge of a company, I'd just not buy into it, because the potential cost in legal fees and complete bankruptcy seems to greatly outweigh the time we could save.
Posted Jul 5, 2022 8:19 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
Posted Jul 5, 2022 8:49 UTC (Tue)
by geert (subscriber, #98403)
[Link] (1 response)
Posted Jul 12, 2022 8:27 UTC (Tue)
by cortana (subscriber, #24596)
[Link]
Posted Jul 3, 2022 23:21 UTC (Sun)
by SLi (subscriber, #53131)
[Link] (2 responses)
You observed correctly that training such a model costs millions. Training may become less expensive in the future due to algorithmic or hardware improvements. In practice, what you can expect to happen is that people (and companies) will train larger and more useful models than the current ones.
So, assume you have trained such a model on, say, all the code in Debian. Now it turns out there's a small piece of code there that is actually not free software, perhaps not even distributable (happens all the time, I think?). What are you going to do, retrain it from scratch?
Posted Jul 4, 2022 3:36 UTC (Mon)
by pabs (subscriber, #43278)
[Link] (1 responses)
I hadn't thought of license incompatibility, but presumably it would indeed be a concern.
I know approximately zero about ML, but AFAIK retraining is the only option when it comes to deficiencies in a model due to bad input data. For example, if a model is indirectly biased against certain groups of people, the procedure is presumably to analyse the bias in the input data, then discard some subset of that data or add more data, and then retrain the model from scratch. If an ML chatbot is racist because it was trained on internet comments from various sites, you either just delete all the ones from 4chan and hope there are no racist comments on Twitter etc :) or manually comb through all the millions of comments and delete the racist ones. Or just give up on the internet as a source of input data :) So yeah, retraining is the only option in the face of non-free or non-redistributable code input.
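(A minimal sketch of the "discard and retrain" procedure described above; is_unwanted and train_model are hypothetical stand-ins, not part of any real pipeline. The point is only that the remedy operates on the corpus, never on the finished model.)

    # The only remedy for bad training data is to fix the corpus and
    # retrain from scratch; nothing here can edit the existing model.

    def train_model(corpus):
        ...  # stands in for the (expensive) from-scratch training run

    def is_unwanted(doc: dict) -> bool:
        # Hypothetical filter: drop documents from a banned source, or
        # ones flagged by whatever audit identified the problem.
        return doc["source"] == "4chan" or doc.get("flagged", False)

    def retrain(corpus: list[dict]):
        cleaned = [doc for doc in corpus if not is_unwanted(doc)]
        return train_model(cleaned)  # pay the full training cost again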
Posted Jul 4, 2022 9:49 UTC (Mon)
by SLi (subscriber, #53131)
[Link]
I have a very hard time seeing that being repeated whenever someone discovers there were a few kilobytes of non-free code in the input.
Posted Jul 3, 2022 5:39 UTC (Sun)
by oldtomas (guest, #72579)
[Link]