
Not trained on MS closed source


Posted Jun 24, 2022 22:09 UTC (Fri) by glenn (subscriber, #102223)
Parent article: DeVault: GitHub Copilot and open source laundering

From the Copilot FAQ:

GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

It's not trained on Microsoft's internal closed source projects. Gee, I wonder why this is.



Not trained on MS closed source

Posted Jun 24, 2022 22:27 UTC (Fri) by bluca (subscriber, #118303) [Link] (10 responses)

Because those are not on GitHub; they are on a completely separate and older system which is a massive pain in the backside to even gain access to, let alone use. I'm not surprised they wouldn't touch that with a barge pole, to be honest.

Not trained on MS closed source

Posted Jun 24, 2022 22:46 UTC (Fri) by glenn (subscriber, #102223) [Link] (2 responses)

Maybe we can take Copilot seriously once Microsoft pledges to indemnify their users from lawsuits related to the code Copilot generates.

Not trained on MS closed source

Posted Jun 25, 2022 9:18 UTC (Sat) by bluca (subscriber, #118303) [Link] (1 responses)

What lawsuits?

Not trained on MS closed source

Posted Nov 11, 2022 22:16 UTC (Fri) by glenn (subscriber, #102223) [Link]

It took some time, but there's this one: https://lwn.net/Articles/914150

It's not unthinkable that users who rely on the tool for commercial purposes could also be sued.

Not trained on MS closed source

Posted Jun 26, 2022 22:27 UTC (Sun) by LtWorf (subscriber, #124958) [Link] (6 responses)

There's plenty of proprietary stuff on GitHub that they didn't dare to use.

I think this alone shows that they are not very sure about the legality of what they are doing, but trust that developers won't be able to do anything about it (unlike the paying customers).

Not trained on MS closed source

Posted Jun 27, 2022 0:31 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

Exactly which proprietary stuff is publicly available and wasn't used?

Not trained on MS closed source

Posted Jun 27, 2022 5:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

GitHub does not require uploaded code to be FOSS, as long as it allows the use of GitHub's "fork" button (loosely equivalent to git clone), and one or two other pieces of site functionality. In theory, they could have limited their training set to only include FOSS repositories, and not proprietary or no-license repositories (in most jurisdictions, no license means the same thing as "all rights reserved").

But I couldn't find any statement in their FAQ one way or the other - it just refers to "public repositories on GitHub," a category including both FOSS and proprietary code. It's entirely possible that they are using all of that code, and IMHO that seems like the most straightforward way to read the sentence (which doesn't mean that it is the intended meaning, of course).

Not trained on MS closed source

Posted Jun 27, 2022 12:45 UTC (Mon) by nim-nim (subscriber, #34454) [Link] (2 responses)

Or more accurately, they are *very* sure of the (il)legality of their approach and only dare to use it against projects that they feel won’t fight back.

Microsoft has access to plenty of proprietary code to train a model on; that they chose to use other people’s code instead speaks volumes.

Not trained on MS closed source

Posted Jun 27, 2022 18:31 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

Or more accurately, it is trained on publicly (legally) available corpora, as the law requires for the use to be exempt from copyright restrictions.

Not trained on MS closed source

Posted Jun 30, 2022 13:45 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

FOSS is not the same thing as public domain; it is all copyrighted.

Not trained on MS closed source

Posted Jun 27, 2022 14:44 UTC (Mon) by excors (subscriber, #95769) [Link]

I think there are significant non-copyright reasons to exclude non-public code. For example, the Copilot FAQ says:

> Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be rare that GitHub Copilot suggestions included personal data verbatim from the training set. [...] We have implemented a filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough. We will keep improving the filter system to be more intelligent to detect and remove more personal data from the suggestions.

"rare" != "never", and if someone stores sensitive personal data in a private GitHub repository then they absolutely don't want Copilot to reveal that information publicly to anyone who tries hard enough.

I expect the same applies to other confidential information, like yet-to-be-announced product names that companies might store in private repositories, or secret keys, or algorithms that they're protecting as trade secrets, etc.

Since the Copilot training data apparently includes public repositories even if they have a restrictive license, but excludes private repositories even if they have a very permissive license, it sounds like GitHub is confident that there are no copyright issues but is concerned about those other privacy issues.

Not trained on MS closed source

Posted Jun 27, 2022 14:26 UTC (Mon) by geert (subscriber, #98403) [Link]

"Trained on A, including B" does not mean that it was trained on B only.
So it may have been trained on any source code that is publicly available. We don't know what exactly; the description in the FAQ is very vague (deliberately?).

FWIW, it might have been trained on whatever proprietary code was ever leaked to the Internet, which might even include the sources of some version of Microsoft Windows ;-)

