
Why is Copilot so bad?

Posted Jul 4, 2022 16:17 UTC (Mon) by SLi (subscriber, #53131)
In reply to: Why is Copilot so bad? by nim-nim
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

My point exactly. But would you have the world be such that every time you discover that there was some small piece of code in the ton of code you use to train the model on, you have to retrain the model for a cost of a few million?

As a practical matter, no such large corpus of code without any copyright violations to be discovered exists. I suspect the large corporations come closest. For the free software world, this idea would kill the last hope of training such models.

I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.



Why is Copilot so bad?

Posted Jul 4, 2022 16:37 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

> I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.

I'm not in GH so I don't know, but if I had to take a wild guess I'd say it's much simpler than that. The non-GH internal SCM systems are such a horrendous pain in the back to use, and even to get access to, that I'm willing to bet the team working on the model, even if given permission to use those sources, would "nope" the heck out very, very fast and never look back.

Why is Copilot so bad?

Posted Jul 4, 2022 17:53 UTC (Mon) by Wol (subscriber, #4433) [Link]

Given that I've worked with SourceSafe, I'm inclined to agree with you ... :-)

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 5, 2022 6:39 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (12 responses)

> For the free software world, this idea would kill the last hope of training such models.

Not at all.

JUST APPLY THE ORIGINAL LICENSING

We spent decades streamlining FLOSS licensing to make sure the number of actual licenses in play is small and their effects are clearly understood. That is the kind of legal effort proprietary companies skimped on, as shown every single time some software giant tries to relicense its own (?) code and takes years clearing up the effects of not having done due diligence before.

THERE IS NO VALID EXCUSE TO IGNORE FLOSS LICENSES.

Easily 60% of GitHub's content is governed by a dozen or so FLOSS licenses. That is more than enough to train a model on, and distinguishing between a dozen different sets of terms is not hard.
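A minimal sketch of what "distinguishing between a dozen licenses" could look like at corpus-assembly time, assuming each repository carries a recognised SPDX identifier (the repository names, field layout, and allowlist here are illustrative, not any actual pipeline):

```python
# Sketch: keep only training material from repositories whose declared
# licence is on a small, well-understood allowlist of SPDX identifiers.
# Anything without a recognised identifier is excluded outright.

PERMITTED = {
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause",
    "GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only",
    "GPL-3.0-or-later", "LGPL-2.1-or-later", "MPL-2.0",
}

def filter_corpus(repos):
    """Yield (name, files) pairs for repos with a permitted licence.

    `repos` is an iterable of (name, spdx_id, files) tuples.
    """
    for name, spdx_id, files in repos:
        if spdx_id in PERMITTED:
            yield name, files

# Hypothetical corpus: projB carries no recognised FLOSS identifier
# and so is dropped before training.
repos = [
    ("projA", "MIT", ["a.c"]),
    ("projB", "Proprietary", ["b.c"]),
    ("projC", "GPL-3.0-only", ["c.c"]),
]
kept = dict(filter_corpus(repos))
```

The point being that the filter itself is trivial; the decades of licensing consolidation are what make such a short allowlist possible at all.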

This is especially galling from an extra-wealthy company that had the means for years to clear the legal status of its own code (but did not), spent years mocking people who "wasted" time arguing about the exact effect of FLOSS license terms, and then starts pillaging this very code without even trying to comply with its hard-won, simple licensing state.

This is especially galling from a division (GitHub) that was asked for years to help committers navigate legalities, made half-assed efforts, and then proceeds to ignore the result of those efforts.

Stop finding ridiculous excuses. FLOSS is about the only software trove where ML can work legally *because* of its licensing simplicity (which took a lot of effort to achieve). ASSUMING YOU APPLY THIS LICENSING. Otherwise it is no better than proprietary software, and Microsoft has plenty of its own to play with; it's not welcome to play with other people's software while not abiding by the legal conditions.

No better than the people who ignore Creative Commons terms because their own legal status is an utter mess and they expect others to be just as bad. That's not an honest mistake once they've been told repeatedly that it's not the case. They can stomp on their own licensing, not on that of others.

Why is Copilot so bad?

Posted Jul 5, 2022 6:48 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

(Also, we *do* remember that Microsoft filed amicus briefs on Oracle's side when Oracle spent years suing Google over nine lines of a rangeCheck implementation, and now wants us to accept that copying FLOSS code on an industrial scale is not protected provided it's mediated by a black-box ML model.)

Why is Copilot so bad?

Posted Jul 5, 2022 10:14 UTC (Tue) by SLi (subscriber, #53131) [Link] (10 responses)

So are you seriously telling me distributions like Debian do not find out regularly that they have been distributing something which is a copyright violation? Because if you are, you clearly just do not know.

There is no such thing as a truly massive corpus of code with a known license and guaranteed freedom from copyright issues. There just isn't. It's not an excuse.

Why is Copilot so bad?

Posted Jul 5, 2022 10:28 UTC (Tue) by amacater (subscriber, #790) [Link]

Debian and licenses - yes, licensing and copyright checking is one of the things Debian maintainers do. If software is found whose licence has problems, it's removed. It's also one of the things that goes into Debian packaging checks, SPDX, reproducible builds ... there's a good-faith effort to do this for every Debian package. Jokingly, I refer to Debian licence "fascism" as one of the saving graces of Debian, because you _can_ be as sure as feasible that someone has checked.

This is not necessarily the case for other distributions - which may have other priorities / commercial pressures or whatever - but that's their world. Disclaimer: I have been a Debian developer since about 1998, but I don't currently package software, though I do keep note of the tools and processes that do.

Why is Copilot so bad?

Posted Jul 5, 2022 10:51 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (7 responses)

Irrelevant

The law does not require perfection, it deals with the real world.

The law requires good faith efforts, ie you do not get a free pass to appropriate stuff clearly labeled under someone else’s license, and you make efforts to fix things once you’re notified the labeling was in error.

Nothing more AND NOTHING LESS.

Why is Copilot so bad?

Posted Jul 5, 2022 10:58 UTC (Tue) by SLi (subscriber, #53131) [Link] (5 responses)

So are you saying that, if using models like Copilot is a copyright violation, the law would still not require you to stop using a model trained on Debian's source code once you have realized you trained it with unlicensed material? Because they made a good enough effort? Even though they could, at a significant cost, retrain it?

Why is Copilot so bad?

Posted Jul 5, 2022 11:58 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (4 responses)

It is definitely NOT a copyright violation if you apply the licensing terms of the code you are copying.

If you ignore those terms, it MAY be a copyright violation, depending on the extent and originality of the copying and depending on how much it is linked to overall program structure (ie the more accurate the model will be, the more likely it will be to infringe).

The instrument you use for the copying (Ctrl+C or fancy ML) is pretty much irrelevant in the eyes of the law. The law cares about effects (you killed your neighbor), not the instrument used (a knife like your distant ancestors, a printed gun, a fancy sci-fi laser, or Harry Potter's magic wand). But tech people keep thinking they will fool a judge just by using an instrument never encountered before.

Also, the law deals with the real world, not absolutes, so infringing accidentally in good faith (the code was mis-labelled) is not the same thing as deliberately ignoring the code license prominently displayed on the GitHub project landing page. In one case you are condemned to pay one symbolic dollar (provided you did due diligence to fix your mistake); in the other it can reach billions.

As for the “significant cost of retraining”, just try that in front of a judge and the peanut gallery; we all know those models are periodically retrained for lots of different reasons (including mistakes in the data set, and licensing mistakes are no less worthy than other mistakes).

Notwithstanding the fact that Microsoft operates one of the world's biggest clouds, which the judge will find hard to ignore.

Why is Copilot so bad?

Posted Jul 5, 2022 12:20 UTC (Tue) by SLi (subscriber, #53131) [Link] (3 responses)

Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

It may be, barely, possible for a large corporation like Google or Microsoft with their internal code bases which tend to be better curated (but still it will be hard).

You do realize that training a model on the scale of Copilot costs a few million dollars every time you do it?

Good luck getting funding for retraining the free model every time Debian finds a copyright violation. I could see public or donated funding for a single training, but not for that.

So, if the law is what you claim it is, we can possibly still have proprietary models, but it's quite unlikely to have significant models trained on free software.

I think your rhetoric about tech people trying to fool judges is a bit misplaced and incendiary. It's safe to guess that Microsoft's lawyers have found Copilot to be legally safe enough. And it's not like this is some device designed purely to circumvent the law.

Why is Copilot so bad?

Posted Jul 5, 2022 12:50 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (2 responses)

> Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

First, computing power is dirt cheap; what was prohibitively expensive yesterday is wasted on ad processing and crypto mining today.

Second, the law does not deal with absolutes it deals with the real world and proportionality.

It does not require instantaneous, systematic compliance. That would be pretty much impossible to achieve in the material world. It requires speedy, realistic compliance (as soon as you can, not as soon as it is convenient or cheap for you).

Periodic retraining would be fine, as long as you do not delay it unduly to avoid any consequence. And you *will* retrain periodically if only because computing languages keep evolving and you will need to make the model aware of new variants.

In the meantime, it is computationally cheap to filter the output to ignore suggestions found in code you've been informed is tainted.
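Such an output filter could be sketched as follows: fingerprint known-tainted code with hashed windows of normalised lines, then suppress any suggestion that shares a fingerprint. The window size, function names, and the rangeCheck example are illustrative, not a description of any real system:

```python
# Sketch: cheap post-hoc filter that suppresses model suggestions
# overlapping code known to be tainted.
import hashlib

WINDOW = 3  # consecutive normalised lines per fingerprint

def fingerprints(code):
    """Yield SHA-256 hashes of each WINDOW-line run of stripped,
    non-empty lines. Snippets shorter than WINDOW yield nothing,
    so very short suggestions are never suppressed by this sketch."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    for i in range(len(lines) - WINDOW + 1):
        chunk = "\n".join(lines[i:i + WINDOW]).encode()
        yield hashlib.sha256(chunk).hexdigest()

def build_blocklist(tainted_sources):
    """Collect fingerprints of every source reported as tainted."""
    block = set()
    for src in tainted_sources:
        block.update(fingerprints(src))
    return block

def is_suppressed(suggestion, blocklist):
    """True if the suggestion shares any window with tainted code."""
    return any(fp in blocklist for fp in fingerprints(suggestion))

# Hypothetical tainted snippet reported by a rights holder:
tainted = ["int rangeCheck(int a, int b) {\n  if (a > b)\n    throw;\n}"]
blocklist = build_blocklist(tainted)
```

Building the blocklist is a one-time scan of the reported code, and checking each suggestion is a handful of hash lookups, which is the sense in which this is cheap compared to retraining.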

And if you are convinced the amount of tainted code will largely exceed your capacity to filter, and you proceed with your ML project anyway, it will be hard to take it as anything but willful copyright infringement.

And it is all terribly inconvenient, I know. The law is not about your individual convenience.

> I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough.

”Even for copyrightable platforms and software packages, the determination whether infringement has occurred must take into account doctrines like fair use that protect the legitimate interests of follow-on users to innovate. But the promise of some threshold copyright protection for […] elements of computer software generally is a critically important driver of research and investment by companies like amici and rescinding that promise would have sweeping and harmful effects throughout the software industry”

Gregory G. Garre, Counsel for Microsoft Corporation, BRIEF FOR AMICI CURIAE MICROSOFT CORPORATION […] IN SUPPORT OF APPELLANT

That’s what Microsoft thinks when the code in question is not produced by Joe Nobody on Github

Why is Copilot so bad?

Posted Jul 5, 2022 14:16 UTC (Tue) by SLi (subscriber, #53131) [Link] (1 responses)

Computing power dirt cheap? You clearly haven't moved into the world of AI yet. Seriously, training those models costs millions in electricity and computer time alone, per training run.

In the future, it's possible that you may be able to train, for less, the models people today train for millions, but even that is a bit speculative (I think the biggest advancements are likely to come from algorithmic development, though it's probably still possible to squeeze out some more computation per watt). You still won't be able to train the better models of that future for dirt cheap.

Why is Copilot so bad?

Posted Jul 5, 2022 14:32 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Then it was utterly foolish to spend those millions before writing the small amount of code needed to check the legal metadata. Behaving foolishly is a general consequence of thinking the rules apply to others, not to you.

Why is Copilot so bad?

Posted Jul 5, 2022 11:01 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Also, the default state of something you find somewhere on the street or on the web is not "free to use"; it's protected. You are not allowed to steal the pile of furniture lying in the street during a relocation just because every single table is not tagged off-limits.

Why is Copilot so bad?

Posted Jul 5, 2022 12:15 UTC (Tue) by pabs (subscriber, #43278) [Link]

You are correct that Debian does have to remove code fairly regularly that was found to be non-free or even non-redistributable. Most instances are caught by maintainers before they enter Debian, but sometimes mistakes are made.

https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=ftp.deb...
https://snapshot.debian.org/removal/


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds