Why is Copilot so bad?
Posted Jul 4, 2022 16:17 UTC (Mon)
by SLi (subscriber, #53131)
In reply to: Why is Copilot so bad? by nim-nim
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!
As a practical matter, no large corpus of code exists in which there are no copyright violations left to be discovered. I suspect the large corporations come closest. For the free software world, this idea would kill the last hope of training such models.
I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but about being careful not to divulge trade secrets, which is obviously a non-issue with any code that is freely accessible.
Posted Jul 4, 2022 16:37 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (1 responses)
I'm not in GH so I don't know, but if I had to take a wild guess I'd say it's much simpler than that. The non-GH internal SCM systems are such a horrendous pain in the backside to use, and even to get access to, that I'm willing to bet the team working on the model, even if given permission to use those sources, would "nope" the heck out very, very fast and never look back.
Posted Jul 4, 2022 17:53 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jul 5, 2022 6:39 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (12 responses)
Not at all.
JUST APPLY THE ORIGINAL LICENSING
We spent decades streamlining FLOSS licensing to make sure the number of actual licenses in play is small and their effects are clearly understood. The kind of legal effort proprietary companies skimped on, as shown every single time some software giant tries to relicense its own (?) code and takes years clearing up the effects of not having done due diligence before.
THERE IS NO VALID EXCUSE TO IGNORE FLOSS LICENSES.
Easily 60% of GitHub content is governed by a dozen or so FLOSS licenses. That is more than enough to train a model on. Distinguishing between a dozen different sets of terms is not hard.
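A minimal sketch (in Python) of what that license gating could look like, assuming each repository record carries a detected SPDX identifier; the record layout and field names here are illustrative assumptions, not any real GitHub or Copilot interface:

    # Hypothetical sketch: train only on files from repositories whose
    # detected license is on a short allowlist of FLOSS licenses.
    ALLOWED_SPDX = {
        "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause",
        "GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only",
        "GPL-3.0-or-later", "LGPL-2.1-or-later", "MPL-2.0", "ISC",
    }

    def training_corpus(repos):
        """Yield (path, text, spdx_id) for files cleared for training."""
        for repo in repos:
            spdx = repo.get("spdx_id")   # e.g. detected from the LICENSE file
            if spdx not in ALLOWED_SPDX:
                continue                 # unknown or unusual license: skip
            for path, text in repo["files"]:
                yield path, text, spdx   # keep the license metadata attached

Keeping the license identifier attached to every sample also makes it possible to honor attribution and copyleft terms downstream.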
This is especially galling from an extra-wealthy company that has had the means for years to clear the legal status of its own code (but did not), spent years mocking people who wasted time arguing about the exact effect of FLOSS license terms, and then starts pillaging this very code without even trying to comply with its hard-won, simple licensing state.
This is especially galling from a division (GitHub) that has been asked for years to help committers navigate legalities, made half-assed efforts, and then proceeds to ignore the result of those efforts.
Stop finding ridiculous excuses. FLOSS is about the only software trove where ML can work legally *because* of its licensing simplicity (which took a lot of effort to achieve). ASSUMING YOU APPLY THIS LICENSING. Otherwise it is no better than proprietary software, and Microsoft has plenty of its own to play with; it is not welcome to play with other people’s software while not abiding by the legal conditions.
No better than the people who ignore Creative Commons terms because their own legal status is an utter mess and they expect others to be just as bad. Not an honest mistake once they’ve been told repeatedly that it’s not the case. They can stomp on their own licensing, not on other people’s.
Posted Jul 5, 2022 6:48 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 10:14 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (10 responses)
There is no such thing as a truly massive corpus of code with a known license and guaranteed freedom from copyright issues. There just isn't. It's not an excuse.
Posted Jul 5, 2022 10:28 UTC (Tue)
by amacater (subscriber, #790)
[Link]
This is not necessarily the case for other distributions, which may have other priorities, commercial pressures, or whatever, but that's their world. Disclaimer: I have been a Debian developer since about 1998 but don't currently package software, though I do keep track of the tools and processes that do.
Posted Jul 5, 2022 10:51 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (7 responses)
The law does not require perfection, it deals with the real world.
The law requires good-faith efforts, i.e. you do not get a free pass to appropriate material clearly labeled as being under someone else’s license, and you make efforts to fix things once you’re notified the labeling was in error.
Nothing more AND NOTHING LESS.
Posted Jul 5, 2022 10:58 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (5 responses)
Posted Jul 5, 2022 11:58 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (4 responses)
If you ignore those terms, it MAY be a copyright violation, depending on the extent and originality of the copying, and on how closely it is tied to the overall program structure (i.e. the more accurate the model, the more likely it is to infringe).
The instrument you use for this copying (Ctrl+C or fancy ML) is pretty much irrelevant in the eyes of the law. The law cares about effects (you killed your neighbor), not about the instrument used (a knife like your distant ancestors, a printed gun, a fancy sci-fi laser, or Harry Potter’s magic wand). But tech people keep thinking they will fool a judge just by using an instrument never encountered before.
Also, the law deals with the real world, not with absolutes, so infringing accidentally in good faith (the code was mislabeled) is not the same thing as deliberately ignoring the code license prominently displayed on the GitHub project landing page. In one case you are condemned to pay a symbolic dollar (provided you did due diligence to fix your mistake); in the other it can reach billions.
As for the “significant cost of retraining”, just try that one in front of a judge and the peanut gallery; we all know those models are periodically retrained for lots of different reasons, and mistakes in the data set, licensing mistakes included, are no less worthy of a fix than any other.
Notwithstanding the fact that Microsoft operates one of the world’s biggest clouds, which the judge will find hard to ignore.
Posted Jul 5, 2022 12:20 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (3 responses)
It may be possible, barely, for a large corporation like Google or Microsoft, whose internal code bases tend to be better curated (but it will still be hard).
You do realize that training a model on the scale of Copilot costs a few million every time you do it?
Good luck getting funding for retraining the free model every time Debian finds a copyright violation. I could see public or donated funding for a single training run, but not for that.
So, if the law is what you claim it is, we can possibly still have proprietary models, but it's quite unlikely that we'll have significant models trained on free software.
I think your rhetoric about tech people trying to fool judges is a bit misplaced and incendiary. I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough. And it's not like this is some device designed purely to try to circumvent law.
Posted Jul 5, 2022 12:50 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
First, computing power is dirt cheap, and what was prohibitively expensive yesterday is wasted on ad processing and crypto mining today.
Second, the law does not deal in absolutes; it deals with the real world and with proportionality.
It does not require instantaneous, systematic compliance. That would be pretty much impossible to achieve in the material world. It requires speedy, realistic compliance (as soon as you can, not as soon as it is convenient or cheap for you).
Periodic retraining would be fine, as long as you do not delay it unduly to dodge the consequences. And you *will* retrain periodically, if only because computer languages keep evolving and you will need to make the model aware of new variants.
In the meantime, it is computationally cheap to filter the output to drop suggestions found in code you’ve been informed is tainted.
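A rough sketch (Python again) of such an output filter, assuming a corpus of code already reported as tainted; the 20-token window and whitespace tokenization are illustrative choices, not how Copilot actually works:

    import hashlib

    WINDOW = 20  # tokens per fingerprint window; an illustrative choice

    def fingerprints(tokens):
        # Hash every overlapping WINDOW-token span of the text.
        for i in range(max(1, len(tokens) - WINDOW + 1)):
            chunk = " ".join(tokens[i:i + WINDOW])
            yield hashlib.sha256(chunk.encode()).hexdigest()

    def build_taint_index(tainted_sources):
        # One-time pass over all code reported as tainted.
        index = set()
        for source in tainted_sources:
            index.update(fingerprints(source.split()))
        return index

    def is_tainted(suggestion, index):
        # Cheap per-suggestion check before it is shown to the user.
        return any(h in index for h in fingerprints(suggestion.split()))

Building the index is a one-time cost, and the per-suggestion check is a handful of hash lookups, which is nothing next to running the model itself.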
And if you are convinced the amount of tainted code will largely exceed your capacity to filter, and you proceed with your ML project anyway, it will be hard to take it as anything but willful copyright infringement.
And it is all terribly inconvenient, I know. The law is not about your individual convenience.
> I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough.
”Even for copyrightable platforms and software packages, the determination whether infringement has occurred must take into account doctrines like fair use that protect the legitimate interests of follow-on users to innovate. But the promise of some threshold copyright protection for […] elements of computer software generally is a critically important driver of research and investment by companies like amici and rescinding that promise would have sweeping and harmful effects throughout the software industry”
Gregory G. Garre, Counsel for Microsoft Corporation, BRIEF FOR AMICI CURIAE MICROSOFT CORPORATION […] IN SUPPORT OF APPELLANT
That’s what Microsoft thinks when the code in question is not produced by Joe Nobody on GitHub.
Posted Jul 5, 2022 14:16 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (1 responses)
In the future, you may be able to train the models people train today for millions at a lower cost, but even that is a bit speculative (I think the biggest advances are likely to come from algorithmic development, though it's probably still possible to squeeze out some more computation per watt). You still won't be able to train, for dirt cheap, the better models they will be training by then.
Posted Jul 5, 2022 14:32 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 11:01 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 12:15 UTC (Tue)
by pabs (subscriber, #43278)
[Link]
https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=ftp.deb...
https://snapshot.debian.org/removal/