
Class action against GitHub Copilot

Posted Nov 11, 2022 17:42 UTC (Fri) by mathstuf (subscriber, #69389)
In reply to: Class action against GitHub Copilot by bluca
Parent article: Class action against GitHub Copilot

I eagerly await Microsoft's addition to Copilot's training set of the Windows and Office codebases if there's no such issue.



Class action against GitHub Copilot

Posted Nov 11, 2022 19:36 UTC (Fri) by bluca (subscriber, #118303) [Link] (11 responses)

Those are not hosted on GitHub (not even in the private section) but in a completely separate, pre-existing git forge, so I'm afraid you'll be waiting for a long time.

Class action against GitHub Copilot

Posted Nov 11, 2022 20:07 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (10 responses)

So? If there's no worry about contributory infringement, why not train on it? Why limit yourselves to public code and not any code Microsoft has access to?

Class action against GitHub Copilot

Posted Nov 11, 2022 20:44 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

The danger is accidental leaks of secret credentials hard-coded in config files or code, which should not happen but often does in private code.

Please note that this is not a question of copyright.

Class action against GitHub Copilot

Posted Nov 11, 2022 21:39 UTC (Fri) by bluca (subscriber, #118303) [Link] (8 responses)

Because it's built by GitHub, on GitHub? It's also not scraping GitLab instances and so on. Also, believe me, just trying to get access to those instances would be such a major PITA that anybody sane would just give up, leave and go fishing. There's nothing to gain anyway, so why bother?

Class action against GitHub Copilot

Posted Nov 12, 2022 3:12 UTC (Sat) by pabs (subscriber, #43278) [Link] (7 responses)

Software Heritage have managed to ingest lots of different software sources; I'm sure Microsoft could easily manage to do the same for training GitHub Copilot, or even just get a copy from SWH.

https://www.softwareheritage.org/

Class action against GitHub Copilot

Posted Nov 12, 2022 11:29 UTC (Sat) by bluca (subscriber, #118303) [Link] (6 responses)

Why bother? Accessing gigatons of data from your own on-prem infrastructure is cheap and easy. The same volume of data from third parties is going to cost an arm and a leg in bandwidth alone. Is there any evidence that spending all that money would significantly improve the quality of the models in any way?

Class action against GitHub Copilot

Posted Nov 12, 2022 15:39 UTC (Sat) by farnz (subscriber, #17727) [Link] (5 responses)

It would have been wise for Microsoft to train Copilot against their crown jewels (Office and Windows) for two reasons:

  1. It makes their assertion that Copilot does not infringe anyone's copyright easier to defend if they're saying that it's safe to train it against their crown jewel codebases. The fact that MS haven't done this means that there's room to argue that they won't do it ever because of the risk of accidentally publishing parts of Windows or Office source code, and not just because of the difficulty of moving data from one business unit to another.
  2. There are still a lot of people working on Windows API codebases; having Copilot trained on what are presumably the "best" codebases in the world (on average) would help those people out.

Class action against GitHub Copilot

Posted Nov 12, 2022 18:31 UTC (Sat) by bluca (subscriber, #118303) [Link] (3 responses)

1) Nah, naysayers will never, ever be happy; it would help in no way whatsoever while costing a boatload of money and effort.
2) [citation needed]

Class action against GitHub Copilot

Posted Nov 12, 2022 20:33 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (1 responses)

Sure, 100% satisfaction is not feasible for anything, but I think it'd make a *lot* of the skepticism subside (including mine). Why would it cost any more than Copilot has already cost? Or is ingesting new code not done anymore and Copilot "frozen"? If it isn't frozen, what's the marginal cost of a few hundred million lines on top of the billions already ingested?

How the hell do you think anyone would get a citation for that? Are you saying that Microsoft doesn't have useful Win32 API usage to train on for Windows developers? Or are you saying that even Microsoft doesn't use it well enough to bother training anything on it?

Class action against GitHub Copilot

Posted Nov 12, 2022 22:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Modern neural networks are often trained in stages, so just ingesting an additional corpus of code might indeed require retraining everything. But they'll have to do it eventually anyway.

Class action against GitHub Copilot

Posted Nov 13, 2022 16:56 UTC (Sun) by farnz (subscriber, #17727) [Link]

For 1, it's not about the naysayers, it's about what you can say in court to convince a judge (or jury in some US civil cases) that the naysayers are overreacting. The statement "we trained this against our crown jewels, the Windows and Office codebases, because we are completely certain that its output cannot contain enough of our original code to infringe copyright" is a very convincing statement to a judge or jury. Even if the court finds that Copilot engages in contributory infringement of people's copyright (having seen a demo of it doing so), the court is likely to be lenient on Microsoft as a result: having trained it against their core business codebases is helpful evidence that any infringement by Copilot's output is unintentional and something Microsoft would fix, because it puts their core business at risk.

And for 2, which part do you want a citation on? That Office and Windows are a big Win32 codebase written by good developers? That people still write code for Win32? That there's boilerplate in Win32 that would be simplified with an AI assistant helping you write the code?

Class action against GitHub Copilot

Posted Nov 12, 2022 22:28 UTC (Sat) by anselm (subscriber, #2796) [Link]

OTOH, it could be the case that the source code for Windows and Office is so atrociously horrible that they don't want to contaminate their ML model with it -- especially if there's a chance that recognisable bits of it could leak out for everyone to see.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds