Not trained on MS closed source
Posted Jun 27, 2022 14:44 UTC (Mon) by excors (subscriber, #95769)
In reply to: Not trained on MS closed source by LtWorf
Parent article: DeVault: GitHub Copilot and open source laundering
> Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be rare that GitHub Copilot suggestions included personal data verbatim from the training set. [...] We have implemented a filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough. We will keep improving the filter system to be more intelligent to detect and remove more personal data from the suggestions.
"rare" != "never", and if someone stores sensitive personal data in a private GitHub repository, they absolutely don't want Copilot to reveal that information to anyone who tries hard enough.
I expect the same applies to other confidential information, like yet-to-be-announced product names that companies might store in private repositories, or secret keys, or algorithms that they're protecting as trade secrets, etc.
Since the Copilot training data apparently includes public repositories even if they have a restrictive license, but excludes private repositories even if they have a very permissive license, it sounds like GitHub is confident that there are no copyright issues but is concerned about those other privacy issues.
