Not trained on MS closed source
Posted Jun 24, 2022 22:09 UTC (Fri) by glenn (subscriber, #102223)
Parent article: DeVault: GitHub Copilot and open source laundering
From the Copilot FAQ:
GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.
It's not trained on Microsoft's internal closed source projects. Gee, I wonder why this is.
Posted Jun 24, 2022 22:27 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (10 responses)
Posted Jun 24, 2022 22:46 UTC (Fri)
by glenn (subscriber, #102223)
[Link] (2 responses)
Posted Jun 25, 2022 9:18 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Nov 11, 2022 22:16 UTC (Fri)
by glenn (subscriber, #102223)
[Link]
It's not unthinkable that users of the tool who use it for commercial purposes could also be sued.
Posted Jun 26, 2022 22:27 UTC (Sun)
by LtWorf (subscriber, #124958)
[Link] (6 responses)
I think this alone shows that they are not very sure about the legality of what they are doing, but trust that developers won't be able to do anything about it (unlike the paying customers).
Posted Jun 27, 2022 0:31 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Jun 27, 2022 5:41 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
But I couldn't find any statement in their FAQ one way or the other - it just refers to "public repositories on GitHub," a category including both FOSS and proprietary code. It's entirely possible that they are using all of that code, and IMHO that seems like the most straightforward way to read the sentence (which doesn't mean that it is the intended meaning, of course).
Posted Jun 27, 2022 12:45 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
Microsoft has access to plenty of proprietary code to train a model on; that they chose to use other people’s code instead speaks volumes.
Posted Jun 27, 2022 18:31 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Jun 30, 2022 13:45 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
Posted Jun 27, 2022 14:44 UTC (Mon)
by excors (subscriber, #95769)
[Link]
> Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be rare that GitHub Copilot suggestions included personal data verbatim from the training set. [...] We have implemented a filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough. We will keep improving the filter system to be more intelligent to detect and remove more personal data from the suggestions.
"rare" != "never", and if someone stores sensitive personal data in a private GitHub repository then they absolutely don't want Copilot to reveal that information publicly to anyone who tries hard enough.
I expect the same applies to other confidential information, like yet-to-be-announced product names that companies might store in private repositories, or secret keys, or algorithms that they're protecting as trade secrets, etc.
Since the Copilot training data apparently includes public repositories even if they have a restrictive license, but excludes private repositories even if they have a very permissive license, it sounds like GitHub is confident that there are no copyright issues but is concerned about those other privacy issues.
Posted Jun 27, 2022 14:26 UTC (Mon)
by geert (subscriber, #98403)
[Link]
FWIW, it might have been trained on whatever proprietary code that was ever leaked to the Internet, which might even include the sources of some version of Microsoft Windows ;-)
So it may have been trained on any source code that is publicly available. We don't know exactly what, because the description in the FAQ is very vague (deliberately?).
