Software Freedom Conservancy: Give Up GitHub: The Time Has Come!
Specifically, we at Software Freedom Conservancy have been actively communicating with Microsoft and their GitHub subsidiary about our concerns with "Copilot" since they first launched it almost exactly a year ago. Our initial video chat call (in July 2021) with Microsoft and GitHub representatives resulted in several questions which they said they could not answer at that time, but would "answer soon". [...] Last week, after we reminded GitHub of (a) the pending questions that we'd waited a year for them to answer and (b) of their refusal to join public discussion on the topic, they responded a week later, saying they would not join any public nor private discussion on this matter because "a broader conversation [about the ethics of AI-assisted software] seemed unlikely to alter your [SFC's] stance, which is why we [GitHub] have not responded to your [SFC's] detailed questions". In other words, GitHub's final position on Copilot is: if you disagree with GitHub about policy matters related to Copilot, then you don't deserve a reply from Microsoft or GitHub. They only will bother to reply if they think they can immediately change your policy position to theirs. But, Microsoft and GitHub will leave you hanging for a year before they'll tell you that!
Posted Jun 30, 2022 21:21 UTC (Thu)
by bluca (subscriber, #118303)
[Link] (29 responses)
Posted Jun 30, 2022 22:20 UTC (Thu)
by Trelane (subscriber, #56877)
[Link] (14 responses)
Posted Jun 30, 2022 22:51 UTC (Thu)
by Karellen (subscriber, #67644)
[Link] (13 responses)
Posted Jun 30, 2022 23:23 UTC (Thu)
by scientes (guest, #83068)
[Link] (1 responses)
One issue is that NAT means you need to rent a VPS, since it makes it much harder to just share from your personal computer.
Also, it is sort of a semantic web thing.
Posted Jul 1, 2022 12:41 UTC (Fri)
by tchernobog (guest, #73595)
[Link]
Git does not offer issue tracking, CI pipelines, discoverability of branches, a code review UI, etc. out of the box - all things that most software projects out there, open source or not, find highly desirable. The fact that your code is hosted in a safe environment, is easy to search across repositories, etc. is something many people are willing to pay for (either in money or in liberty, as is often the case).
Posted Jul 1, 2022 13:23 UTC (Fri)
by bkuhn (subscriber, #58642)
[Link] (10 responses)
Karellen, I feel you could have made your point with less sarcasm, but raising the question of SFC's organizational use of Twitter is a reasonable thing to ask about. But I would encourage you in future to frame your inquiry with something like: “I do feel it is somewhat hypocritical that SFC has called for folks to give up GitHub, but they aren't calling for folks to give up Twitter — and in fact SFC is using Twitter actively!” That would be a respectful way to raise your inquiry.

Speaking as the person at SFC whose primary job (as Policy Fellow) is to analyze, consider, and recommend policy on how we approach these proprietary software situations, I'll note that we now live in a difficult and complex world where it has become increasingly difficult (at least in industrialized countries) to engage with communities and pursue the normal functions of life without interacting with proprietary software. Personally (outside of my work at SFC), I refuse to use Twitter as well. It was a difficult decision for SFC to continue using Twitter (which, BTW, I prefer to call Agrawaland — and I used to call Dorseyville (and I guess I'll be calling Musktown soon?) — all to note that Twitter is not a democratic platform; it is a for-profit company's property under the autocratic control of its CEO).

My colleague Karen Sandler and I gave two keynotes (at FOSDEM 2019 and 2020, respectively) about the challenges FOSS activists face in choosing when to use or refuse to use proprietary software. These are hard issues to decide. In fact, while planning the Give Up GitHub campaign, we talked quite a bit internally to determine whether GitHub had crossed enough lines that they are substantially worse in their behavior than other proprietary software companies. We believe they are, which is why we launched the campaign, but we understand that you may have a different opinion.

Meanwhile, I'll put it on the agenda for future blog posts that I should write about how SFC came to the decision to keep participating in Agrawaland — particularly after the previous regime (Dorsey's) abruptly cut off the federation features (which led to identi.ca's demise). Thanks so much for your inquiry; this will make a useful blog post. I can't promise a timeline, as we have a lot of writing in the pipeline, but I will look into it!
Posted Jul 1, 2022 18:08 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (2 responses)
Um, the original inquiry was not mine? I'm a bit confused which parts of your response are directed at me, and which at bluca!
Posted Jul 1, 2022 19:24 UTC (Fri)
by bkuhn (subscriber, #58642)
[Link] (1 responses)
Yes, I'm so sorry for using your name, Karellen. I grabbed the wrong post. I was replying to bluca when I wrote this:
>> I feel you could have made your point with less sarcasm, but raising the question of SFC's organizational use of Twitter is a reasonable thing to ask about. But I would encourage you in future to frame your inquiry with something like: “I do feel it is somewhat hypocritical that SFC has called for folks to give up GitHub, but they aren't calling for folks to give up Twitter — and in fact SFC is using Twitter actively!” That would be a respectful way to raise your inquiry.
Posted Jul 1, 2022 22:58 UTC (Fri)
by Karellen (subscriber, #67644)
[Link]
Ha! No problem, I've replied to the wrong person by accident myself plenty of times in the past, on various fora.
Thanks, and keep up the great work.
Posted Jul 2, 2022 7:00 UTC (Sat)
by oldtomas (guest, #72579)
[Link]
You write, and very correctly:
> I'll note that we now live in a difficult and complex world where it has become increasingly difficult (at least in industrialized countries) to engage with communities and pursue the normal functions of life without interacting with proprietary software.
(I'd disagree with the "industrialized" part: people in poorer countries are even more dependent on the "pay with your data" model)
Surveillance capitalism has learnt to interpose itself in our communication channels with other people and with the world in general, be it perception (Google Glass [1]), hospitality (AirBnB), personal relationships (Facebook), small markets (Amazon, eBay), public communications (Dorseyland -- uh -- Twitter), you name it.
They just insert themselves into the channels to strip-mine and monetize all that huge potential which was "going to waste" before. Wild west, claims, land that didn't belong to anyone: all over again, yay! (Of course, we know that society as a whole pays some price. It ain't a zero-sum game. The dead Rohingya could tell you something about that.)
As mpldr notes elsewhere in this comment section, what Github pulled off (in its pre-Microsoft phase) was to cast a social network over collaborative software development. The parallels to Facebook are chilling. That's what Microsoft shelled out ~$7.5B for. They are drowning in cash, sure, but this is a significant amount, even for them. They didn't do this out of the goodness of their hearts.
Personally, I'm far more worried by this than by the questions about the license status of software snippets shovelled around by some NLP AI. Although this latter question is also quite important (and thorny), and I'm happy SFC is taking it on.
Keep up the good work!
[1] Some might interject that that one's dead. This instance is, but the breed ain't.
Posted Jul 3, 2022 22:39 UTC (Sun)
by alfille (subscriber, #1631)
[Link] (4 responses)
Posted Jul 4, 2022 4:51 UTC (Mon)
by oldtomas (guest, #72579)
[Link]
Posted Jul 4, 2022 16:03 UTC (Mon)
by ttuttle (subscriber, #51118)
[Link] (2 responses)
I hate this. Whether or not I care about someone disrespecting the product, it's obnoxious: In a conversation about the merits of the product, it's lazy -- it's a way to smear the product without giving a proper explanation. In a conversation about something else, it's rude -- it's a way to push the speaker's opinion about the product even when it's irrelevant or distracting.
Posted Jul 5, 2022 22:34 UTC (Tue)
by hummassa (guest, #307)
[Link]
Maybe your annoyance with such discourse comes from not applying the https://en.wikipedia.org/wiki/Principle_of_charity -- a principle that is very useful in respectful and productive dialogue. "Be strict in what you produce and lax in what you consume", like Unix :-) ...
Posted Jul 8, 2022 5:42 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
It depends. Making a pun with the name "Fox News" is what you wrote. Calling it "Murdoch TV" is an extremely important reminder that it's "free" only because _you_ are the product - without making fun of the name or smearing anything.
I find "Musktown" closer to the latter.
Posted Jul 8, 2022 5:36 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
I bet considering how much time you spend gifting trolls with incredibly detailed and professional answers. Amazing... return on investment for them!
I (sincerely!) hope a fair amount of copy/paste was involved at least? As you explained, it did not look like a brand new question.
Posted Jun 30, 2022 22:50 UTC (Thu)
by josh (subscriber, #17465)
[Link]
https://i.kym-cdn.com/photos/images/original/001/259/257/...
Posted Jun 30, 2022 23:27 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
The primary focus of this blog post (which is itself freely licensed, aggregated on a proprietary platform, which is permissible in their view) is not the proprietary nature of the platform itself but the implications of Copilot. I suspect you don't agree with the criticism of Copilot; however, deflecting from it in this way is unhelpful.
Posted Jul 1, 2022 0:10 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (3 responses)
Posted Jul 1, 2022 3:46 UTC (Fri)
by Trelane (subscriber, #56877)
[Link]
Posted Jul 1, 2022 6:55 UTC (Fri)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Jul 1, 2022 10:05 UTC (Fri)
by bluca (subscriber, #118303)
[Link]
Posted Jul 4, 2022 8:58 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (7 responses)
It seems like something you should disclose, since you are all over the comment section defending Copilot (with what I think are fundamentally wrong arguments).
Posted Jul 4, 2022 11:07 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (6 responses)
Posted Jul 4, 2022 12:34 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (5 responses)
You have 24 comments on this page defending Copilot.
It is not a legal requirement that you disclose that you work for Microsoft, but it would be more honest.
Posted Jul 4, 2022 12:46 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
Posted Jul 4, 2022 15:36 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (3 responses)
Posted Jul 4, 2022 16:30 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
Posted Jul 10, 2022 1:03 UTC (Sun)
by k8to (guest, #15413)
[Link]
Surely, in low-quality environments this is no longer done, but it's reasonable: it informs discussion and protects you.
Posted Jul 11, 2022 21:27 UTC (Mon)
by jschrod (subscriber, #1646)
[Link]
This is highly relevant.
Posted Jun 30, 2022 21:36 UTC (Thu)
by mpldr (guest, #154861)
[Link] (22 responses)
I would love to see people moving away from that platform to something more interested in actually being free (and, just a completely crazy idea: maybe built on an open ecosystem). My projects have only been mirrored to GitHub for discoverability, but I am seriously reconsidering whether I should waste CI seconds on it for no benefit to anyone except Microsoft.
Posted Jun 30, 2022 22:09 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
I have never heard of any project ever putting its changelog in an annotated git tag. It's just not what anyone ever does, whether their projects use github or not. Changelogs go in git log | git shortlog or in files in the repository itself. (And github is perfectly happy to make annotated tags into releases, or was last time I tried -- my objection there is that you have to form your release tag names in a particular, highly stereotyped, frankly unusual way or you get ridiculously named files in your release tarballs.)
Posted Jun 30, 2022 22:23 UTC (Thu)
by mpldr (guest, #154861)
[Link]
Posted Jul 1, 2022 7:46 UTC (Fri)
by Sesse (subscriber, #53779)
[Link]
Posted Jul 1, 2022 5:29 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (18 responses)
Who will bell the cat?
More prosaically: What open platform are you proposing we use instead of GitHub?
Posted Jul 1, 2022 10:36 UTC (Fri)
by mpldr (guest, #154861)
[Link] (17 responses)
As always: try before you buy; maybe that approach is not for you, which is completely fine. In that case, check out Codeberg and git.disroot.org
Posted Jul 1, 2022 12:32 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (13 responses)
Posted Jul 1, 2022 17:32 UTC (Fri)
by mpldr (guest, #154861)
[Link] (12 responses)
Posted Jul 1, 2022 17:51 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (11 responses)
Posted Jul 1, 2022 22:34 UTC (Fri)
by mpldr (guest, #154861)
[Link] (10 responses)
The message ID is usually in the URL, but (at least nowadays) there's usually a "Reply to thread" button.
> tracking status of things
That's kind of a "you" problem. Some people do, and some people don't. And those who don't can usually follow better through the web UI and use the aforementioned "Reply to thread" button.
> mountains of spam
Reject text/html and you don't have spam (see the sketch at the end of this comment).
> corporate email servers mangling plain text emails
What in the everloving bleep?! No mailserver should modify an email's content unless there's a good reason (like a virus), because that will mess up PGP or – in the corporate world – S/MIME signatures, so I somewhat doubt there's a lot of this behaviour.
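Coming back to the text/html point: such a reject filter is a few lines with Python's stdlib email package. A minimal sketch (the spool file name and bounce text are made up for illustration):

    import email
    from email.policy import default

    def html_only(raw: bytes) -> bool:
        """True when the message carries no text/plain body part at all."""
        msg = email.message_from_bytes(raw, policy=default)
        return msg.get_body(preferencelist=("plain",)) is None

    # A list server could bounce such mail before it reaches subscribers:
    with open("incoming.eml", "rb") as f:  # hypothetical spool file
        if html_only(f.read()):
            print("550 HTML-only mail is not accepted on this list")

Note this only rejects mail with no plain-text alternative at all; multipart messages that include a text/plain part still get through.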
Posted Jul 1, 2022 22:56 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (7 responses)
Most mailing list archives I see have neither.
> That's kind of a "you" problem. Some people do, and some people don't. And those who don't can usually follow better through the WebUI and use the aforementioned Reply to thread button
Well no, it's not a 'me' problem. It's blindingly obvious if you open a PR on Github whether it has been merged or not. Instead, open a random patch email on a mailing list archive and try to guess.
> Reject text/html and you don't have spam.
Have you actually ever seen a mailing list? I'm starting to doubt it, with statements like that.
> What in the everloving bleep?! No mailserver should modify an emails content unless there's good reason (like a virus) because that will mess up PGP or – in corpo world – S/MIME signatures, so I somewhat doubt there's a lot of this behaviour.
What they 'should' do according to you is irrelevant - the vast majority of corporate email servers do exactly that, and there's diddly squat you can do about it as an employee.
Posted Jul 2, 2022 9:16 UTC (Sat)
by ddevault (subscriber, #99589)
[Link] (4 responses)
SourceHut also tracks the review status of patches:
https://lists.sr.ht/~sircmpwn/hare-dev/patches/
We also take responsibility for managing spam across the whole site, and remove that burden from list maintainers. Spam is exceedingly rare on SourceHut -- I think I've only ever seen one spam email make it past our filters in the past 3 years.
Posted Jul 3, 2022 13:45 UTC (Sun)
by vimpostor (guest, #159442)
[Link] (3 responses)
That looks really nice, but I wonder how Sourcehut actually knows whether a patch has been applied.
However, I like to apply patches from the command line, and I sometimes edit a patch before applying it to make minor code-style changes. Obviously this changes the patch, so how would Sourcehut know in that case that the patch was applied?
Posted Jul 3, 2022 13:55 UTC (Sun)
by ddevault (subscriber, #99589)
[Link] (2 responses)
But we intend to make it automatic. The essential heuristic is the commit date, which matches the Date header and survives amending and rebasing.
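A minimal sketch of one way such a heuristic could work (hypothetical code, not SourceHut's implementation): match the mail's Date header, which git format-patch fills from the author date, against the author dates in the target repository (see also the follow-up below about author vs. commit date).

    import subprocess
    from datetime import datetime
    from email import message_from_string
    from email.utils import parsedate_to_datetime

    def looks_applied(patch_mail: str, repo: str) -> bool:
        """Match the patch email's Date header against author dates in the repo."""
        sent = parsedate_to_datetime(message_from_string(patch_mail)["Date"])
        author_dates = subprocess.run(
            ["git", "-C", repo, "log", "--format=%aI"],  # %aI: author date, ISO 8601
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        return any(datetime.fromisoformat(d) == sent for d in author_dates)

The author date survives both `git commit --amend` and `git rebase` by default, which is what makes it usable as a fingerprint even after maintainer edits.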
Posted Jul 3, 2022 15:17 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Wouldn't the author date be the one to trust?
Posted Jul 3, 2022 15:35 UTC (Sun)
by ddevault (subscriber, #99589)
[Link]
Posted Jul 2, 2022 23:48 UTC (Sat)
by mpldr (guest, #154861)
[Link] (1 responses)
I'd say the Linux kernel, Git, various GNU projects, and pretty much all Sourcehut projects tell a different story.
> Have you actually ever seen a mailing list? I'm starting to doubt it, with statements like that.
Sourcehut lists (chef's kiss), Mailman, Google Groups… I've gotten around a bit, and I may not have seen all possible solutions, but at least some.
> What they 'should' do according to you is irrelevant - vast majority of corporate email servers do exactly that, and there's diddly squat you can do about it as an employee.
Then I'm just glad that I did not have the displeasure of experiencing this… it was mostly Outlook/O365 and Gmail with custom domains so far.
Posted Jul 3, 2022 12:20 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
I beg to differ. I didn't know *my own patches* had been pulled until I manually did a `log --author=` check out of curiosity and found my patches had finally made it. This is with Linux and Git at least.
Posted Jul 3, 2022 13:01 UTC (Sun)
by brunowolff (guest, #71160)
[Link] (1 responses)
I complain about this every couple of months. We use O365 and have enabled Safelinks. Safelinks corrupts email messages by replacing anything that appears to be a URL with a proxy URL for the same resource. It does this in at least text/html and text/plain parts. The replacement follows a pattern that you can use to undo it in your mail reader (with a preprocessing script) with a low false-positive rate.

Another broken feature O365 has is temporarily replacing attachments with dummy ones while the original attachments are being scanned for viruses. If you notice this in time, you can go back and undelete the message (even if it was expunged) and get the attachment after it has been cleared.

There are other broken features of this service not related to corrupting messages as well. I think they intentionally support only limited uses for email, which they think are common, and don't care much about whatever they break for less common cases. Good luck trying to get an exception from your security people to opt out of this brokenness, even if, in your case, the threat is extremely small and the brokenness causes more grief on average.

There is more brokenness coming. They recently notified people about an attachment-blocking feature that purports to block attachments of types on a blocklist, without providing a definition of how they determine the types of attachments. They don't say whether they use Content-Type, Content-Disposition (from the filename), and/or actually scan the attachment to determine the type, nor how that data actually maps to their list, which does not use standard MIME type names. For people using the web interface there is more brokenness related to charsets and no support for format=flowed.
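As an illustration, the undo script can be a few lines of Python (a sketch; the hostname pattern is an assumption based on Safelinks URLs I have seen, and a greedy regex like this can swallow trailing punctuation, hence "low" rather than zero false positives):

    import re
    from urllib.parse import parse_qs, urlsplit

    # Safelinks rewrites look like:
    # https://<region>NN.safelinks.protection.outlook.com/?url=<encoded original>&data=...
    SAFELINK = re.compile(r"https://\w+\.safelinks\.protection\.outlook\.com/\?\S+")

    def unwrap(match: re.Match) -> str:
        query = parse_qs(urlsplit(match.group(0)).query)
        return query.get("url", [match.group(0)])[0]  # fall back to the proxy URL

    def clean(text: str) -> str:
        """Replace each Safelinks proxy URL in 'text' with its original target."""
        return SAFELINK.sub(unwrap, text)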
Posted Jul 8, 2022 5:47 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
It became very clear a long time ago when Outlook killed bottom-posting.
Posted Jul 2, 2022 23:29 UTC (Sat)
by ceplm (subscriber, #41334)
[Link] (2 responses)
Could anybody confirm?
Posted Jul 2, 2022 23:39 UTC (Sat)
by mpldr (guest, #154861)
[Link] (1 responses)
Posted Jul 3, 2022 6:55 UTC (Sun)
by ceplm (subscriber, #41334)
[Link]
Posted Jun 30, 2022 22:10 UTC (Thu)
by bpearlmutter (subscriber, #14693)
[Link] (4 responses)
Posted Jun 30, 2022 23:16 UTC (Thu)
by dullfire (guest, #111432)
[Link] (1 responses)
Posted Jul 1, 2022 13:07 UTC (Fri)
by bpearlmutter (subscriber, #14693)
[Link]
Not our fault if people use it for other purposes.
Posted Jun 30, 2022 23:48 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jul 1, 2022 1:31 UTC (Fri)
by sfeam (subscriber, #2841)
[Link]
Posted Jul 1, 2022 0:48 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (48 responses)
> 1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.
I assume this must be born of a US-centric view - there's no need to invoke fair use here in Europe; data mining on publicly available text and data bodies is exempt from copyright rules as per the copyright directive from a couple of years back. Whether a repository is proprietary or under a FOSS license is completely irrelevant; anyone can data mine all day long, as long as it's publicly and legally accessible.
> 2. If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
It is not trained only on FOSS. It's been stated many times that it's trained on what is publicly available on Github (which includes proprietary repositories with no license) because... that's what the law allows. Moreover, it is common knowledge that Windows and Office are not stored on Github, so they aren't even accessible to Github.
> 3. Can you provide a list of licenses, including names of copyright holders and/or names of Git repositories, that were in the training set used for Copilot? If not, why are you withholding this information from the community?
This makes no sense as a question. The license, if any, is in each repository on Github. What does it mean that it is being "withheld"? It's all public repositories that anyone can freely clone, without even an account...
All in all, Github provides an absolutely fantastic service, for free for OSS maintainers, including CI time, and now with fancy autocomplete as an extra feature, wrapped in a very nice interface - although that's obviously subjective. Gitlab's interface comes very close (and for some things it's even better), but the huge fragmentation (having to set up and use dozens of accounts, one for each project that has an instance, is just a pain) coupled with no free CI for OSS projects means it's just not enough. Everything else outside of these two is just atrocious to use for anybody who still hasn't got used to it (and sometimes not even then - looking at you, Gerrit).
Given all of this, personally I find these appeals very unconvincing. It's a proprietary service... so what? So is my internet connection, my phone service, my bank's website, the local tram ticket machine... these are all running on somebody else's machines and providing a service for external users, and if the owners are fine with running proprietary software on them, I certainly won't lose any sleep over it. It's my machines that I care about, so that I can tinker with them, fix issues that bother me, and so on.
Posted Jul 1, 2022 5:32 UTC (Fri)
by rsidd (subscriber, #2582)
[Link] (4 responses)
For developers and creators, this is a legal minefield and in my opinion people should stay away from copilot in any work that is going to be shared with others. But it is not a reason to boycott github.
Posted Jul 1, 2022 5:52 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Jul 1, 2022 13:36 UTC (Fri)
by dan_a (guest, #5325)
[Link]
An out of court settlement suggests it might be: https://www.billboard.com/music/music-news/musician-settl...
Posted Jul 1, 2022 7:31 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (1 responses)
I'm just going to note that I'd expect caselaw around humans remembering code snippets and reproducing them to be relevant here: it's not the worst analogy for what such a machine learning model does, and it's reasonably likely that at some point, there have been copyright cases based around the human ability to remember something they've seen before and reproduce it.
Posted Jul 1, 2022 8:07 UTC (Fri)
by rsidd (subscriber, #2582)
[Link]
Posted Jul 1, 2022 8:33 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (39 responses)
I note that you (along with many other CoPilot defenders) always focus heavily on the data mining (or "model training") side of the legal implications, and tend to ignore or gloss over the code generation side of things.
I have no issues at all with anyone gathering, analysing and performing computations on whatever FOSS source code they can get their hands on. It's out there with a license that explicitly states you're free to read it, analyse it and learn from it, for your own benefit.
Where I do have issues is where CoPilot outputs source code which is distributed to others. I fail to understand how the source code it produces can not be considered a "derivative work" of its source code inputs, as without those inputs it would produce no output at all. And producing and distributing a derivative work does require a license - or (as SFC ask for) some kind of explanation why the distributor feels a license is not needed.
It is strange that CoPilot's authors invoke the comparison with a compiler, where the output is owned by the operator, because that's only true if the inputs are owned by the operator. You can't run someone else's source code through a compiler and then claim copyright ownership of the object code just because you invoked the compiler. I am not a copyright lawyer, but I know that isn't how copyright law works.
Posted Jul 1, 2022 10:12 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (26 responses)
To me it seems pretty obvious: the work is not consumed under the terms of the license, whatever it might be, so the license doesn't apply to anything that is produced from it. If there's a dual licensed project GPL+commercial (as it's quite common), and I buy the commercial license, anything I do with it is not affected by the terms of the GPL, because that's not how I got the project. In the same way, TDM copyright exceptions are what allow me to train a model on anything publicly accessible, which means I do not see how any claims about the output of the model being subject to the original licenses of the input hold water. The original license is irrelevant, because the law gives me an exception. That is a good thing by the way, we need more exceptions to our ever-more-draconian copyright laws.
Now on the question on whether the output of the model is a derived work - under copyright law, and not under the terms of whatever the original license was - that sounds complicated but it definitely does not seem as clear cut as "Infringement!" as some maximalist takes make it sound. When Copilot was first announced, Felix Reda (who was actually a MEP when these laws were written) wrote an excellent article that touched on that, and it still applies today:
https://felixreda.eu/2021/07/github-copilot-is-not-infrin...
Posted Jul 1, 2022 11:06 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (2 responses)
Thanks for the link, it's a very interesting read. Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out. Firstly: From the classic 2004 essay (written by a lawyer) What Colour are Your Bits?:
> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
>
> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. *It matters where the bits came from.* The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn't get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work, you Commie Mutant Traitor.

Emphasis in original - it matters where the bits came from. But the whole thing is worth reading, if you've not seen it already.

Secondly: (Emphasis mine.)

Going back to the compiler analogy, this paragraph seems to imply that the output of compilers does not qualify for copyright protection - which is clearly absurd. And just because CoPilot doesn't produce output which corresponds to all of its input, that shouldn't matter either. Compilers throw away comments. And dead code. And redundant instructions. (Given suitably clever optimisation passes.) But that machine-generated output still qualifies for copyright protection.
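As an aside, the Monolith operation described in that quote is tiny. A minimal sketch (hypothetical code, not Monolith's actual implementation) of the XOR mixing and its inversion:

    from itertools import cycle

    def monolith_mix(data: bytes, pad: bytes) -> bytes:
        """XOR 'data' against 'pad', repeating the pad as needed."""
        return bytes(a ^ b for a, b in zip(data, cycle(pad)))

    copyrighted = b"int isqrt(int n) { /* ... */ }"  # stand-in for a protected work
    public = b"bytes of some public file"            # stand-in for the public file

    scrambled = monolith_mix(copyrighted, public)  # "mixed-up garbage"
    restored = monolith_mix(scrambled, public)     # XOR is its own inverse
    assert restored == copyrighted                 # bit-for-bit identical again

The scrambled bytes look like noise, yet on the essay's argument they keep the Colour of the copyrighted input, because that is where they came from.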
Posted Jul 1, 2022 11:23 UTC (Fri)
by Karellen (subscriber, #67644)
[Link]
Posted Jul 1, 2022 11:41 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
> Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out.
Assuming the law were applied correctly (which it usually isn't :-( ), machine-generated code is a derivative work (it's a translation) of the original input. The transformation applied by the machine does not create or destroy copyright. So the machine-generated output is, FOR COPYRIGHT PURPOSES, IDENTICAL to the original input.
Cheers,
Posted Jul 1, 2022 11:37 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (21 responses)
But as I keep hearing repeated, to consume the work outwith the licence IT HAS TO BE FOR ACADEMIC / LEARNING PURPOSES.
So if Copilot is used to produce academic research papers or teaching material (ie books etc), then that's fine.
But if it's used to provide programming prompts and snippets of code to copy and use, THEN THE EXCEPTION DOES NOT APPLY, AND COPYRIGHT DOES APPLY.
So ANY AND ALL code supplied to the general populace is suspect. In other words, if I work for any programming shop, be it a software house or an end user, and I incorporate code from Copilot into my work, the copyright in that code belongs to the original author. And that author's copyright applies! Which means I damn well better know where Copilot got it from!!! Okay, many snippets may be too small for copyright to apply, but that's a completely different argument.
tldr; if you're using Copilot to help you WRITE code (as opposed to providing you with study material), you are almost certainly breaking Copyright Law.
And if you're using Copilot to provide study material you're an idiot. It's teaching you the consensus method, not the correct method.
So just don't use copilot :-)
Cheers,
Posted Jul 1, 2022 12:37 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (20 responses)
That is factually wrong, and using all caps doesn't make it right. Consuming legally accessible public corpora for TDM is allowed for any purpose under the EU directive. The only difference is that academic institutions are allowed to ignore generic opt-outs.
There is currently no mechanism to express such an opt-out, like you can for scrapers with a robots.txt. The W3C is working on a common spec for that: https://www.w3.org/2022/tdmrep/
Posted Jul 1, 2022 14:50 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (19 responses)
> TITLE II
https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790&from=EN
Posted Jul 1, 2022 15:05 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
Which search engines like Google will love. It (and I'm quite happy with this) makes it legal for them to have huge search databases.
But there's a very big difference between using that mined data to direct people back to the original source document, and outputting something based on that source (essentially creating a derived document) to be passed on to a third party without the first party knowing anything about it.
Maybe the grounds for feeling that way have changed, but I still feel that actually *using* the output from Copilot for pretty much anything other than study is a very dangerous occupation, and maybe even using it for study ...
Cheers,
Posted Jul 1, 2022 15:10 UTC (Fri)
by bluca (subscriber, #118303)
[Link]
Posted Jul 1, 2022 15:18 UTC (Fri)
by ldearquer (guest, #137451)
[Link] (16 responses)
As an example, if you have a neural network that identifies the music style of an input song, I understand you may use copyrighted stuff for training your system. In real-world usage, a user may input some song, and your system may respond "that's likely country music". But the inverse, where the user's input is "country music" and your system starts giving away excerpts of copyright-protected songs...
Posted Jul 1, 2022 22:58 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (15 responses)
Posted Jul 2, 2022 9:58 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (14 responses)
Posted Jul 2, 2022 13:12 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (13 responses)
Posted Jul 2, 2022 17:09 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (2 responses)
Posted Jul 2, 2022 21:45 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (1 responses)
That's ridiculous. You can't prove a negative. It's the same as asking someone to prove they're not beating their spouse.
If Microsoft came out with a statement that as far as they can tell it doesn't happen, people will just claim they're lying. The only relevant evidence is if someone comes up with actual examples.
This is even leaving aside code formatters like Black, which are so opinionated it's almost to the point that, for any piece of code, there is only one way it can be formatted, so you couldn't even tell the difference between an actual copy and an accidental one if you wanted to (see the sketch at the end of this comment).
If you take a step back and think about what it would take to build such an AI model, if the model has any understanding of the structure of code, there's no reason at all to think that it will randomly copy entire blocks of text literally from the input. It's going to be working at a completely different level.
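To illustrate that point (a toy example using Black's format_str API in its default mode; the two variants are invented): two independently written versions that differ only in layout normalize to byte-identical output, so formatting alone cannot distinguish a copy from a coincidence.

    import black  # pip install black

    version_a = "def is_even(n):\n    return n%2==0\n"
    version_b = "def is_even( n ) :  return n % 2==0\n"

    mode = black.Mode()
    # Both normalize to exactly "def is_even(n):\n    return n % 2 == 0\n"
    assert black.format_str(version_a, mode=mode) == black.format_str(version_b, mode=mode)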
Posted Jul 3, 2022 15:13 UTC (Sun)
by ballombe (subscriber, #9523)
[Link]
Nobody outside MS really knows how Copilot actually works, so you cannot make any claims about it. 'AI model' is just a buzzword.
I do not see how the Math.isPrime example can occur without literal copying.
Posted Jul 4, 2022 12:38 UTC (Mon)
by nye (subscriber, #51576)
[Link] (9 responses)
This back and forth kind of misses the point IMO. If copilot outputs something which is a verbatim copy of a substantial piece of code, then *of course* it shouldn't magically have had its copyright removed. Similarly, if a person with an exceptionally accurate memory writes down some copyrighted code that they memorised last year, the fact that they didn't literally copy/paste it has no real bearing on its copyright status. It feels like this shouldn't be controversial.
It seems you assert that it isn't, or shouldn't be, possible for Copilot to do this, but however accurate that is, I don't think it's particularly important - partly because it's hard to prove and partly because it could be subject to change.
All of the talk about verbatim outputs seems like a largely pointless distraction from the important part: the infinite set of outputs which are *not* a verbatim copy of a substantial piece of code, and which the copyright maximalists argue must be considered a derivative of all of its training inputs.
Here is what it boils down to: if I, as a programmer, either A) perform a sequence of steps, or B) write a program to perform a sequence of steps, then assuming that all inputs and outputs are the same, does the choice of A vs B affect the legality of the outcome? I don't believe that there's a logically coherent argument for the answer being "yes".
Posted Jul 4, 2022 12:48 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (5 responses)
Posted Jul 4, 2022 13:35 UTC (Mon)
by nye (subscriber, #51576)
[Link]
I am definitely not assuming that. If we're talking about "a verbatim copy of a substantial piece of code", then that's essentially my definition of "substantial", but I specifically said "if" in that section, and my point was that IMO it's not at all the important part of the discussion; it's just a distraction (this is why I considered it unimportant to define "substantial" in that context).
FWIW, while we're further entertaining the distraction anyway, I'm not even convinced that the repeatedly-cited fast inverse square root should be eligible for copyright protection - on the grounds that the only bit of creative work in it is the choice of a magic constant, which isn't typically something that would be considered copyrightable. It would be interesting to see if a court is ever asked to rule on this specific piece of code (although I think it's basically always a sad day when we get to the point that a court is required to rule on anything, so "interesting" should not be construed as "good").
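For readers who haven't seen it: the routine in question is only a handful of lines. Here is a Python transcription of the well-known bit-level trick (the original is C, from Quake III Arena); the magic constant is the creative choice being discussed.

    import struct

    def fast_inverse_sqrt(x: float) -> float:
        """Approximate 1/sqrt(x) with the famous bit hack plus one Newton step."""
        i = struct.unpack("<I", struct.pack("<f", x))[0]  # float bits as uint32
        i = 0x5F3759DF - (i >> 1)                         # the magic constant
        y = struct.unpack("<f", struct.pack("<I", i))[0]  # back to a float
        return y * (1.5 - 0.5 * x * y * y)                # one Newton-Raphson step

    print(fast_inverse_sqrt(4.0))  # ~0.4992, vs. the exact 0.5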
Posted Jul 5, 2022 11:27 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
But even if they were, because the model is asked to predict what someone else would have written given the same program structure, they are far less independent from this very same structure than random snippets found on the web.
And, general structure is one of the things that distinguish fair use from plagiarism.
You cannot have it both ways: accurately mimic what others would have done, and pretend you are not deriving from their work. (This is especially striking where people have used ML to complete damaged works of art, more accurately than the best forger. Who cares that the forgery was done one stroke at a time?)
Posted Jul 5, 2022 13:37 UTC (Tue)
by bluca (subscriber, #118303)
[Link] (2 responses)
Posted Jul 5, 2022 16:47 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
And if you did not, how can you claim the tool never outputs anything original?
Posted Jul 5, 2022 17:55 UTC (Tue)
by bluca (subscriber, #118303)
[Link]
Posted Jul 4, 2022 12:56 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
Not just the maximalists. Taking the word "derivative" at face value, all the output is derivative of the training data.
The question isn't whether it's derivative, the question is whether it's sufficiently *trivial* not to be copyright, or sufficiently complex and derived from just one or two training items to be a blatant copyright violation. And that will probably have to be determined on a case-by-case basis.
tldr; don't assume because it comes from Copilot that it's copyright-free... (don't assume that it isn't, either).
Cheers,
Posted Jul 4, 2022 13:42 UTC (Mon)
by nye (subscriber, #51576)
[Link] (1 responses)
That is the maximal possible interpretation, so yes, just the maximalists, by definition. You haven't even added so much as any vague handwaving about transformative use!
Posted Jul 4, 2022 14:19 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Except the quote I was replying to said COPYRIGHT maximalists.
And I certainly didn't claim that the output was - or even should be - copyright. I just said that it was - BY DEFINITION OF THE WORD - derivative.
If I openly said that *some* output is too trivial to copyright, how does that make me a copyright maximalist? And again, isn't "transformative use" - by definition - derivative? FFS, it's a *transformation* - it's the same thing but altered ...
Cheers,
Posted Jul 2, 2022 16:06 UTC (Sat)
by NAR (subscriber, #1313)
[Link]
Posted Jul 1, 2022 13:18 UTC (Fri)
by eduperez (guest, #11232)
[Link] (11 responses)
> It is strange if CoPilot's authors invoke the comparison with a compiler, where the output is owned by the operator. Because that's only true if the inputs are owned by the operator. You can't run someone else's source code through a compiler and then claim copyright ownership of the object code just because you invoked the compiler. I am not a copyright lawyer, but I know that isn't how copyright law works.
You do not need to go that far: there are proofs that show how CoPilot can output code that is a verbatim copy of one of its sources; they cannot pretend that a verbatim copy of some code is not a copyright infringement just because it passed through some AI algorithm.
Posted Jul 1, 2022 13:40 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (2 responses)
Large verbatim copies are certainly less common and should not happen; the team was adding some checks for that, IIRC. There might even be a config option to prevent that from happening now?
Posted Jul 1, 2022 14:09 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link] (1 responses)
https://news.ycombinator.com/item?id=27710287
So talking about "a single word" or "a single byte" or "a ⅓ of a bit" is just misleading. Copilot just copies entire functions… and you have no way of knowing whether it copy-pasted an entire module from somewhere or "created" something original.
That it "should not happen" doesn't really matter. It has been shown that it does happen.
Posted Jul 1, 2022 14:15 UTC (Fri)
by bluca (subscriber, #118303)
[Link]
Posted Jul 1, 2022 13:42 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (7 responses)
Posted Jul 1, 2022 14:14 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (6 responses)
But nor can one conclude that the particular output in front of one is necessarily free of any copyright infringement.
As far as I'm concerned this lack of provenance is one major problem with the approach. The other major problem with the approach is that it is by no means guaranteed (AFAIK) that Copilot output actually does what it is supposed to do. I wonder whether it is usually less work to validate, debug, and clean up something that came out of Copilot than it is to come up with the same thing from scratch and avoid the entire minefield in the first place – i.e., whether Copilot is “worth it” in daily practice.
Posted Jul 2, 2022 13:13 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (5 responses)
Posted Jul 2, 2022 22:31 UTC (Sat)
by anselm (subscriber, #2796)
[Link] (4 responses)
Doesn't really matter. According to its own FAQ (and not just according to common sense), Copilot output code should be “rigorously tested”, “reviewed and vetted”, and “checked for security vulnerabilities”. (The Copilot FAQ also says that Copilot's output “may contain insecure coding patterns, bugs, or references to outdated APIs or idioms” and that it “may not always work, or even make sense”. Yep. Sounds just what we need. Bring it on.)
As a programmer I probably spend more time writing tests for my code and ensuring that it does what it is supposed to do than I do to come up with the code in the first place; if I need to write the tests and debug the code, anyway, then having to write it first is really the least of my worries, and if I write the code myself then at least copyright is much less likely to be an issue. Also, writing the code from scratch will probably be more creative and fun than having to bang dubious Copilot output into shape if it contains “insecure coding patterns” or “references to outdated APIs”, let alone subtle errors that render it inappropriate for the actual use case at hand.
Posted Jul 2, 2022 23:20 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (3 responses)
Posted Jul 2, 2022 23:51 UTC (Sat)
by anselm (subscriber, #2796)
[Link] (2 responses)
Whatever. The examples on their web site leave me underimpressed. E.g., the “IsPrimeTime.java” example takes a comment that reads
and completes that to
which is obviously an impressive blob of code but fails completely at its stated purpose. If that is really the best Copilot can do, to a point where they feel they must put it out as an advertisement, then please explain to me again why I should want to pay for drivel like that.
Posted Jul 3, 2022 10:47 UTC (Sun)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Jul 3, 2022 15:45 UTC (Sun)
by anselm (subscriber, #2796)
[Link]
If what Copilot does is worth $100/year to you (or your employer), then more power to you. From the examples on the web site – which I presume are showing Copilot at its best, because why else pick them as examples? –, I personally don't see that for me, and in any case my favourite editor is not among the ones Copilot supports, so getting to where I could actually use Copilot in the first place would be too much of a hassle as far as I'm concerned, so I think I'll pass.
Posted Jul 5, 2022 17:12 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
And does anyone here think that will survive the first person who trains an ML model on the output of some first-rank singer, makes it output a song, and tries to make some money out of it? Especially if it ends up a success? That's not science fiction: people are already succeeding at generating pretty nice forgeries of paintings from museum archives.
Hollywood would have none of it.
"Data" will quickly be redefined to exclude anything copyrightable. Or some other clause will clarify that the exception applies to training the model, not to using the model to create the same form of work.
Exactly like the “we are only neutral publishers” clause that cloud providers fought for, which stood no longer than it took YouTube to start making big money from other people’s creations.
Lawmakers may not understand tech but they understand the money trail plenty fine.
Is Copilot free (as in beer) to use? No? I thought so. The usual unethical behavior of big companies that knowingly cash in on shady behavior, hoping the law takes its time to catch up with them.
Posted Jul 5, 2022 17:18 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
> And does anyone here think that will survive the first person who trains an ML model on the output of some first-rank singer, makes it output a song, and tries to make some money out of it? Especially if it ends up a success? That's not science fiction: people are already succeeding at generating pretty nice forgeries of paintings from museum archives.
Data MINING. I.e. putting copyrighted materials *IN*to the model. And using the output for bug hunting, learning, that sort of stuff. But the output is a work, it's copyrightable, and it inherits the parent copyright.
The GP needs to stop confusing *IN*put with *OUT*put. Otherwise he's likely to spend the rest of his life paying off the lawyers who couldn't defend him ...
Cheers,
Posted Jul 5, 2022 17:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
The laws as written today in the EU do not cover the output of a model - only the data mining for training, and the model itself.
Therefore, I'd expect that the output of a model can infringe copyright - just as a human who reads something isn't infringing copyright, nor are they caught up in copyright if they merely remember chunks of what they read, but copyright does kick in if they reproduce something protected by copyright verbatim from memory.
Posted Jul 1, 2022 8:23 UTC (Fri)
by flussence (guest, #85566)
[Link] (2 responses)
You can *not* convince people to care about minutiae like this unless they were already predisposed to it. Even on a site like GitHub, because it's less about Social Coding™ these days and more of a forum where people go to complain that their freeware isn't working while others roleplay as Nick Krause every October to win promotional merchandise.
Microsoft's empire was built on pandering to those who think of the computer as a nuisance first and would prefer not to think about it at all, and in selling them products that reinforce that apathy. GitHub fits right into that market, and suggesting alternatives like GitLab or SourceHut misses the point of why people use this and not those in the first place (it's not because they're technically or morally superior in some way - GitLab isn't even either of those).
I mean, just look at everything that's happened to Twitter in the last 6 years that some people consider crossing an unacceptable boundary. There are a billion people still there (for reasons other than using it as an advertising space), who just don't care enough to leave, and nothing will make them.
Stuff like this is just Evangelical Christian preaching about sinners going to hell with some of the words changed. That's all it's ever been.
Actually after writing all that maybe I do begrudge them for trying. This campaign won't work, there's four decades of documentation of it not working, please try something different already.
Posted Jul 1, 2022 8:42 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
That's a pretty healthy attitude in general, actually. You don't think about the plastic used in your light switch every time you turn on the lights, do you? Unless you're a light switch enthusiast, you just want it to work and not bother you.
Posted Jul 1, 2022 9:15 UTC (Fri)
by flussence (guest, #85566)
[Link]
And I never said it wasn't.
My continued existence on this earth is mostly paid for by the work of sparing others from having to think too hard about the computers they're using. When I say this isn't going to move the needle one bit, it's because I've seen into enough real people's lives to understand why.
If Copilot is automated piracy, then maybe this is a good time for a reminder that piracy is a service problem. By pretending things like this are effective, we're failing to learn not just from GNU's historic mistakes, but from those of the likes of SCO, the RIAA, and Microsoft itself. People want better service, not lectures and scaremongering. The free-as-in-freedom is just a side effect.
Posted Jul 1, 2022 8:24 UTC (Fri)
by vegard (subscriber, #52330)
[Link]
Posted Jul 1, 2022 9:28 UTC (Fri)
by danpb (subscriber, #4831)
[Link] (1 responses)
The debate over Copilot is important to have, as the answer is not entirely clear either way. There is no practical way for any open source project to avoid being imported into Copilot, though, and moving project hosting makes no difference to this. Using Copilot as a justification for moving hosting services is at most a statement of unhappiness with GitHub's approach.
Posted Jul 5, 2022 0:02 UTC (Tue)
by abartlet (subscriber, #3928)
[Link]
Given that, I don't see what can be done.
Posted Jul 1, 2022 23:04 UTC (Fri)
by SLi (subscriber, #53131)
[Link] (87 responses)
To me, the ideological background for free software is that the world would be a better place if software was free and there would not be restrictive copyright laws. Because this is not the world we live in, the decision was made to use the tools of the copyright law itself to work against the world order that uses it to suppress software freedom. I think it has worked reasonably well.
From this perspective, it rubs me the wrong way when people seem to argue that copyright should restrict the creation and use of AI models. It seems antithetical to what I thought were the goals of the movement. Granted, Copilot is not an open model, and that is not a good thing for free software. But please, for the sake of humanity, don't try to expand the copyright madness to AI models, which (to me) anyway seem closely equivalent to a human having read the code in question and producing code based on what they learned.
For once I feel that copyright law seems (based on what I judge to be the most believable expert opinions) to be in the state where I want it to be. Of course, as an AI practitioner, I might also be biased... But I really think it would be a very silly world where you couldn't use publicly available data to train models without copyright ruining any hope of technological progress.
Posted Jul 2, 2022 1:14 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (24 responses)
Posted Jul 2, 2022 9:09 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (23 responses)
In fact, this feels somewhat like a knee-jerk reaction based on a hated company being behind this. Among the masses, I think that's a large part of it. I'm not arrogant enough to think that genuinely knowledgeable and philosophical people like those at SFC have that attitude, though, so that's what leaves me confused.
To me, it seems that the argument is essentially that it should not be realistically legally possible to create good AI models (for which you need at least hundreds of gigabytes of source code) because of copyright reasons and the impossibility of vetting a copyright-safe set of such code. And I think this is a very counterproductive argument.
Training even one such model is expensive enough (and bad enough for the environment) that it really, really should not be done separately for each mutually incompatible free software license, let alone retrained every time you discover that there are license unclarities with some small part of the input (I know, unheard of in the free software world...).
Posted Jul 2, 2022 11:49 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (11 responses)
There's nothing wrong with the MODEL. But there's everything wrong with the USES THAT MAY BE MADE of the output.
The output is - must be - a derivative work of the inputs used to create it. That's what the word "derivative" means.
This then brings copyright into play. The output may be a conglomerate of multiple similar works, in which case the copyright status is probably "too trivial to be eligible". Or the output may be the sole match for a complex piece of code someone is trying to write, tempting them just to take Copilot's output verbatim as the solution to their problem. In that case the copyright status is "blatant piracy". And there are all the other points on the spectrum in between.
Mining publicly available code and using it for education is fine - why else would it be publicly available? It's generally accepted that stuff is out there for people to read and learn from.
But it's NOT generally acceptable that stuff is put out there for others to make a fast buck from. Using Copilot output for commercial purposes is NOT an acceptable default status - a lot of it has been shared on a "share and share alike" basis and people who don't "play the game" are Stealers of the Commons. Dunno about other countries, but "Finders Keepers" could land you in jail for theft over here (unlikely, but perfectly possible - you have a duty to try and find the rightful owner).
Cheers,
Posted Jul 2, 2022 12:40 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (10 responses)
The model is pretty much useless if you cannot use it for anything without falling foul of copyright. I really think that would be a very harmful development.
Luckily, I also think that that understanding of what a derivative work means in the copyright context is pretty wild and likely incorrect. Well, whatever the law turns out to mean, I wish people would stop advocating for such harmful interpretations. It may be that the law ends up preventing any real AI code models, but it definitely should not.
Posted Jul 2, 2022 13:10 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (8 responses)
Posted Jul 4, 2022 9:23 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (7 responses)
Copyleft only exists because copyright exists. I think we can agree on this point.
People use copyleft licenses because they want their work to remain free.
IF copyright didn't exist and all software source was public domain, we'd all be very glad that Copilot was there to help write code that would be free.
However copyright does exist, and copilot is going to be used mostly to write copyrighted proprietary software, using copyleft software. This is clearly something that the authors of copyleft software didn't want.
Not using GitHub is not a solution, because anyone (including Microsoft itself) has every right to just mirror whatever they like onto GitHub.
Now you claim (and have an economic interest in claiming so) that Copilot does not infringe. However, you aren't a judge. And while I do agree that creating the model does not infringe, the output generated from the model is another thing entirely, and that might be infringing.
In any case people who wrote GPL code know that their work is going to be used in proprietary code, which goes against the license and against their wishes when they chose that license.
You are just betting that a future lawsuit will say that you are right. But even if you are wrong, it will be the users of Copilot who are in violation, so Microsoft is betting that it will be very hard to find whom to sue and no lawsuit will ever happen.
To respond to your comment, no, having your license terms respected is not "bleak". Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
The law allowing the training of an ML model doesn't say anything about using that model to generate new content.
Posted Jul 4, 2022 11:47 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (6 responses)
Haha, if that were an actual issue, I'd have been fired a long time ago, I can assure you.
> (and have an economical interest in claiming so)
I am not in GH and I am not a shareholder, so you can park this nonsensical tinfoil-hattery straight away - I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here, who have obviously never seen it outside of a couple of memes, I might add.
> To respond to your comment, no, having your license terms respected is not "bleak".
It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software besides some boring indexing or suchlike, as it would be de facto impossible to compile a legal training corpus unless you had a metric ton of private code available to you. That would be dreadful, and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
> Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.
It's not interesting at all; in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code, namely all of it that is publicly available on GitHub (there's loads), as the team has said multiple times in public, because that's where the training data comes from. If code is on different systems (external or internal), it wasn't used, it's as simple as that - I don't even know if the GH org can access other systems, but from my own experience, I'm pretty sure they cannot even if they wanted to.
> The law to train a ML model doesn't say anything about using that model to generate new content.
Lawmakers were clearly and openly talking about AI applications, not just indexing applications or other such activities. A giant chunk of AI R&D is in the field of generating content, like GPT and so on. It seems like a bold assumption to think that the lawmakers weren't aware of all that.
Posted Jul 4, 2022 13:03 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (5 responses)
You claim that, but here you are with 26 comments defending Microsoft's actions.
> I am not in GH and I am not a shareholder
I'm sure you have vested or will vest stock. It's common practice. And you do get a salary, I hope?
> I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.
Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, handing over a credit card number). So it's not like it's easy to test and form an opinion.
> It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software
Uhm… Microsoft is a major corporation building AI/ML software that violates the licenses of probably millions of smaller fishes. It's happening now.
> That would be dreadful
It is dreadful indeed. I'm not sure why you are considering microsoft to be this little innocent startup company.
> and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.
That's your personal opinion that you keep repeating but there is no agreement. And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
> It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code
The open-source code… not the proprietary code… Be intellectually honest, please. I talked about proprietary code and you replied with something entirely off-topic.
> If it's on different systems (external or internal), it wasn't used
And why is that? Why didn't Microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output, perhaps?
> Lawmakers were clearly and openly talking about AI applications
Generating code is not the only ML application that can exist. Classifiers are ML.
I'm sure the lawmakers were aware, and that's why they talked about "training data" but not about "spitting out the training data verbatim".
You are reading what you would like to be written rather than what is actually written.
Posted Jul 4, 2022 18:16 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
And...?
> Most people would give it a try but getting it to work is non trivial (using a specific proprietary editor, setting up a vm to isolate said editor, giving up the credit card number). So it's not like it's easy to test and form an opinion.
You forgot hand-carving new silicon behind a blast door in a hazmat suit. Also, TIL that Neovim is a proprietary editor. And there's no need for credit cards if you are an open source maintainer; you get it for free.
> Uhm… Microsoft is a major corporation building an AI/ML software violating the licenses of probably millions of smaller fishes. It's happening now.
You are both failing to see the point (major corporations would be fine if the law worked like the maximalists want it to; it's everyone else who would be worse off) and talking nonsense: there is no license violation anywhere. Feel free to point to the court cases if not. Just because a few trolls and edgy teenagers shout "violation!" it doesn't mean it's actually happening; you need to prove it. Can you?
> And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.
The fact that some are complaining doesn't mean the alternative, if the law were different, would be better. There are plenty of anti-vaxxers complaining about vaccination programs worldwide; it doesn't mean we'd be better off without vaccines.
> And why is that? Why microsoft didn't use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output perhaps?
It's because of the aliens trapped in those repos, duh! Now if you take off your tin foil hat for a moment and go read other replies, I've already given my uninformed guess on why only public repos on Github are used.
> You are reading what you would like to be written rather than what is actually written.
I'm not the one claiming that training a model violates copyright when it's explicitly allowed by law.
Posted Jul 4, 2022 18:30 UTC (Mon)
by corbet (editor, #1)
[Link] (3 responses)
Thank you.
Posted Jul 5, 2022 14:35 UTC (Tue)
by nye (subscriber, #51576)
[Link] (2 responses)
Posted Jul 5, 2022 14:52 UTC (Tue)
by corbet (editor, #1)
[Link]
Posted Jul 6, 2022 9:34 UTC (Wed)
by sdalley (subscriber, #18550)
[Link]
But why argue at all? C'mon now, let's give Jon the respect he's entitled to as owner of this site...
Posted Jul 4, 2022 9:01 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link]
For example, in my ML course at university, we trained a thing to recognise handwriting. We didn't use it to generate a new font.
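That distinction is easy to make concrete. Here is a minimal sketch of such a classifier, assuming scikit-learn and its bundled digits dataset (purely illustrative; not the setup any particular course used). The model maps images to labels and has no generative pathway at all, so "produce a new font" is not even expressible through its interface:

    # Train a handwriting *classifier*: images in, labels out.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # 8x8 images of handwritten digits, flattened to 64 values
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))
    print("prediction:", clf.predict(X_test[:1]))  # a digit label, never an image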
Posted Jul 3, 2022 4:28 UTC (Sun)
by pabs (subscriber, #43278)
[Link] (9 responses)
https://salsa.debian.org/deeplearning-team/ml-policy
Of course, the prohibitively large sizes of most of the training data sets and the prohibitively large costs of training make this scenario infeasible for various actually useful models, but maybe if there were a group working on and funding libre ML with training data storage, compute and reproducible training, then it would become feasible to have actually libre ML.
I don't believe that ToxicCandy models, nor proprietary models, are a good idea. I also believe that the purposes many ML models are put to are very unethical, and that ML researchers need to think carefully about what the model they are creating will enable.
I haven't thought about Copilot enough to comment on the rest of your post.
Posted Jul 3, 2022 6:27 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
We need to be careful with the usage of that phrase. "Data" can refer to any set of information under the sun. But many information sets are not subject to copyright protection in the US (see Feist v. Rural), and are subject to sui generis database rights (similar but not identical to copyright) in the EU. This is further complicated by the fact that US law allows you to copyright the "selection or arrangement" of data that would otherwise not be subject to copyright.
In the case of Copilot, the inputs are, of course, subject to copyright. But, if you'll excuse my use of US law (it's the legal system I know best), there are a whole bunch of unanswered questions:
Your legal system will probably have a different set of unanswered questions, which may in turn have different answers. Regardless, trying to make strong claims about what is or is not legal is a fool's errand at this point.
Posted Jul 3, 2022 6:32 UTC (Sun)
by pabs (subscriber, #43278)
[Link]
Posted Jul 4, 2022 13:15 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (3 responses)
If you were a company… would you buy Copilot knowing that, afterwards, every single GitHub user could hit you with an infringement lawsuit?
I'm sure there are patent trolls interested in acquiring the rights to some GitHub projects and going around pressing claims :)
If I were a CTO in charge of a company, I'd just not buy into it, because the potential cost in legal fees and outright bankruptcy seems to greatly outweigh the time we could save.
Posted Jul 5, 2022 8:19 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
Posted Jul 5, 2022 8:49 UTC (Tue)
by geert (subscriber, #98403)
[Link] (1 responses)
Posted Jul 12, 2022 8:27 UTC (Tue)
by cortana (subscriber, #24596)
[Link]
Posted Jul 3, 2022 23:21 UTC (Sun)
by SLi (subscriber, #53131)
[Link] (2 responses)
You observed correctly that training such a model costs millions. It may be possible in the future that training will become less expensive due to algorithmic or hardware improvements. In practice what you can expect to happen is that people (and companies) will train larger and more useful models than the current ones.
So, assume you have trained such a model on, say, all the code in Debian. Now it turns out there's a small piece of code there that is actually not free software, perhaps not even distributable (happens all the time, I think?). What are you going to do, retrain it from scratch?
Posted Jul 4, 2022 3:36 UTC (Mon)
by pabs (subscriber, #43278)
[Link] (1 responses)
I hadn't thought of license incompatibility, but presumably it would indeed be a concern.
I know approximately zero about ML, but AFAIK retraining is the only option when it comes to deficiencies in a model due to bad input data. For example, if a model is indirectly biased against certain groups of people, the procedure is presumably to analyse the bias in the input data, then discard some subset of that data or add more data, and then retrain the model from scratch. If an ML chatbot is racist because it was trained on internet comments from various sites, you either just delete all the ones from 4chan and hope there are no racist comments on Twitter etc. :), or manually comb through all the millions of comments and delete the racist ones. Or just give up on the internet as a source of input data :) So yeah, retraining is the only option in the face of non-free or non-redistributable code input.
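As a rough sketch of what the data-side fix looks like in practice (the LICENSE.spdx convention and the allowlist below are hypothetical placeholders; real license identification needs proper tooling and human review), the cheap part is filtering the corpus; the expensive part is the retraining that must follow:

    import os

    # Hypothetical allowlist of SPDX identifiers acceptable for this corpus.
    ALLOWED = {"MIT", "Apache-2.0", "GPL-2.0-or-later", "GPL-3.0-or-later"}

    def license_of(repo_path):
        """Naive stand-in: read a one-line SPDX identifier from LICENSE.spdx."""
        try:
            with open(os.path.join(repo_path, "LICENSE.spdx")) as f:
                return f.read().strip()
        except OSError:
            return None  # unknown license: exclude it, don't guess

    def build_corpus(repo_root):
        """Split repositories into those kept for training and those dropped."""
        kept, dropped = [], []
        for name in sorted(os.listdir(repo_root)):
            path = os.path.join(repo_root, name)
            (kept if license_of(path) in ALLOWED else dropped).append(path)
        return kept, dropped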
Posted Jul 4, 2022 9:49 UTC (Mon)
by SLi (subscriber, #53131)
[Link]
I have a very hard time seeing that happen repeatedly, whenever someone discovers there were a few kilobytes of non-free code in the input.
Posted Jul 3, 2022 5:39 UTC (Sun)
by oldtomas (guest, #72579)
[Link]
Posted Jul 2, 2022 1:20 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (36 responses)
Posted Jul 2, 2022 13:16 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (35 responses)
Posted Jul 2, 2022 14:03 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (34 responses)
But the world we have right now is one where a giant corporation is using public domain (fine), permissively licensed (fine-ish) and copyleft (not so fine) code, rather than its own proprietary code, to train its AI model.
Posted Jul 2, 2022 14:22 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (23 responses)
Now it sounds like some want to throw the baby out with the bathwater and prevent all AI code models, apart from giants like Google or Microsoft training their own models on their own code (possibly for in-house use only, if they are scared of information leaks).
Society shouldn't adopt a copyright-maximalist stance and stifle uses of AI merely because the first available models were proprietary.
The idea that "any use" of copyrighted works should require a license is a typical maximalist idea, and I expect to hear it more from the entertainment industry than from free software proponents. Training an AI and using it to produce code is, rather clearly to me, one of those things that are only extremely tangential to any traditional purpose of the copyright system. It's fundamentally not at all different from a human reading publicly available code and using the memories formed that way to write more. I don't think even the craziest copyright maximalists claim that the products of that would typically be derivative works.
Posted Jul 2, 2022 16:02 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (2 responses)
If a corporation is allowed to bleach the copyleft off of your code by using it as feedstock for an incomprehensibly complex computer algorithm and then asking the algorithm to solve that problem, copyleft is gravely wounded.
Posted Jul 2, 2022 22:12 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (1 responses)
How is this different to anyone looking at copylefted code in Github for inspiration to solve a problem they're having, and then using that idea, written in their own way, in their own program? Copyright is focussed on the copying of expression, not the copying of ideas. As long as you can argue the model is copying the idea, not the expression, copyright is completely irrelevant.
The whole issue comes down to the distinction we've made in copyright law between what compilers do (which is considered pure manipulation having no effect on copyright), and what people do (which is looking at pieces of source code to learn and use that to make more source code). Isn't the rule of thumb: if you're copying from one source it's plagiarism, if you're copying from two it's research?
I don't really see how a model built by examining lots of source, some of it copylefted, and producing code reduces the value of the input code. If a computer model can actually come up with code that does something you've typed, perhaps it wasn't so original and it's the kind of thing we want to automate away anyway.
TBH, the idea of a model writing code for you to solve a problem sounds nice. But what would be really valuable is something that could see where many programs are solving a similar problem, that it makes a library for that and refactors all the other programs to use that.
Posted Jul 5, 2022 9:38 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
Treating it as comparable to a human is what I suspect the courts will do, and rsidd has pointed out that music precedent in the case of George Harrison's "My Sweet Lord" suggests that if Copilot does output snippets of its training data unchanged, then unless that snippet is "purely functional", it'll be found to be a copyright infringement by the user of Copilot.
That's a risk for any user of Copilot to assess - are they OK about a possible infringement suit caused by the fact that Copilot has access to code owned by Alphabet, Meta, Microsoft and other entities whose code is on GitHub?
Posted Jul 4, 2022 8:48 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link] (19 responses)
There are not a thousand different FLOSS licenses.
There are not a thousand different combination rules.
Compared to the number of files the model ingests to output suggestions, determining the license of a project, which other licenses it can be combined with, and what licenses the result may carry is *TRIVIAL*. No need for special magic research exemptions, no need to anger people: just apply the original licensing, legally safe by design in all jurisdictions.
Pretending everything is public domain is not just laziness, it’s *opinionated* laziness, that tries to blur the lines so everything not “protected” by bigcorp lawyers is free to pillage, and everything produced by this pillaging can be safely put out of bounds.
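A toy sketch of the bookkeeping being described, assuming a hand-curated table of combination outcomes (the entries below are illustrative and deliberately tiny, not legal advice; a real table would cover a few dozen SPDX identifiers and their edge cases):

    # Which licenses may a combined result carry, given the set of
    # licenses on its inputs? Hand-curated, illustrative, NOT legal advice.
    COMPATIBLE_OUTPUT = {
        frozenset({"MIT"}): {"MIT", "Apache-2.0", "GPL-2.0-or-later", "GPL-3.0-or-later"},
        frozenset({"MIT", "Apache-2.0"}): {"Apache-2.0", "GPL-3.0-or-later"},
        frozenset({"MIT", "GPL-2.0-or-later"}): {"GPL-2.0-or-later", "GPL-3.0-or-later"},
        frozenset({"GPL-3.0-or-later"}): {"GPL-3.0-or-later"},
    }

    def output_licenses(source_licenses):
        """Empty set means: combination not in the table, ask a human."""
        return COMPATIBLE_OUTPUT.get(frozenset(source_licenses), set())

    print(output_licenses({"MIT", "GPL-2.0-or-later"}))
    # e.g. {'GPL-2.0-or-later', 'GPL-3.0-or-later'}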
Posted Jul 4, 2022 10:27 UTC (Mon)
by SLi (subscriber, #53131)
[Link] (17 responses)
Posted Jul 4, 2022 14:10 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link] (16 responses)
“Mister judge, some goods in that store are probably mislabeled, therefore I decided that paying for what i picked up was unnecessary”. How do you think that would work out ?
The law does not let you off the hook because others may have made mistakes. Everyone makes mistakes. There's a difference between making an honest mistake (trying and failing to achieve perfection) and not trying at all.
Posted Jul 4, 2022 16:17 UTC (Mon)
by SLi (subscriber, #53131)
[Link] (15 responses)
As a practical matter, no such large corpus of code without any copyright violations to be discovered exists. I suspect the large corporations come closest. For the free software world, this idea would kill the last hope of training such models.
I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.
Posted Jul 4, 2022 16:37 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (1 responses)
I'm not in GH so I don't know, but if I had to take a wild guess I'd say it's much simpler than that. The non-GH internal SCM systems are such a horrendous pain in the back to use, and even to get access to, that I'm willing to bet the team working on the model, even if given permission to use those sources, would "nope" the heck out very, very fast and never look back.
Posted Jul 4, 2022 17:53 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Jul 5, 2022 6:39 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (12 responses)
Not at all.
JUST APPLY THE ORIGINAL LICENSING
We spent decades streamlining FLOSS licensing to make sure the number of actual licenses in play is small and their effects are clearly understood. That is the kind of legal effort proprietary companies skimped on, as shown every single time some software giant tries to relicense its own (?) code and takes years clearing up the effects of not having done due diligence before.
THERE IS NO VALID EXCUSE TO IGNORE FLOSS LICENSES.
Easily 60% of GitHub content is governed by a dozen or so FLOSS licenses. This is more than enough to train a model on. Distinguishing between a dozen different sets of terms is not hard.
This is especially galling from an extra-wealthy company which had the means for years to clear the legal status of its own code (but did not), spent years mocking people who wasted time arguing about the exact effect of FLOSS license terms, and then started pillaging this very code without even trying to comply with the hard-won, simple licensing state.
This is especially galling from a division (GitHub) that has been asked for years to help committers navigate legalities, made half-assed efforts, and then proceeded to ignore the result of those efforts.
Stop finding ridiculous excuses. FLOSS is about the only software trove where ML can work legally *because* of its licensing simplicity (which took a lot of effort to achieve), ASSUMING YOU APPLY THIS LICENSING. Otherwise it is no better than proprietary software, of which Microsoft has plenty of its own to play with; it is not welcome to play with other people's software while not abiding by the legal conditions.
No better than the people who ignore Creative Commons terms because their own legal status is an utter mess and they expect others to be just as bad. Not an honest mistake once they've been told repeatedly that it's not the case. They can stomp on their own licensing, not on other people's.
Posted Jul 5, 2022 6:48 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 10:14 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (10 responses)
There is no such thing as a truly massive corpus of code with a known license and guaranteed freedom from copyright issues. There just isn't. It's not an excuse.
Posted Jul 5, 2022 10:28 UTC (Tue)
by amacater (subscriber, #790)
[Link]
This is not necessarily the case for other distributions - which may have other priorities / commercial pressures or whatever - but that's their world. Disclaimer: I am a Debian developer since about 1998 but don't currently package software, though I do keep note of the tools and processes that do.
Posted Jul 5, 2022 10:51 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (7 responses)
The law does not require perfection, it deals with the real world.
The law requires good faith efforts, ie you do not get a free pass to appropriate stuff clearly labeled under someone else’s license, and you make efforts to fix things once you’re notified the labeling was in error.
Nothing more AND NOTHING LESS.
Posted Jul 5, 2022 10:58 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (5 responses)
Posted Jul 5, 2022 11:58 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (4 responses)
If you ignore those terms, it MAY be a copyright violation, depending on the extent and originality of the copying and depending on how much it is linked to overall program structure (ie the more accurate the model will be, the more likely it will be to infringe).
The instrument you use for this copying (Ctrl+C or fancy ML) is pretty much irrelevant in the eyes of the law. The law cares about effects (you killed your neighbor), not the instrument used (a knife like your distant ancestors used, a printed gun, a fancy sci-fi laser, or Harry Potter's magic wand). But tech people keep thinking they will fool a judge just by using an instrument never encountered before.
Also, the law deals with the real world, not absolutes, so infringing accidentally in good faith (the code was mislabelled) is not the same thing as deliberately ignoring the code license prominently displayed on the GitHub project landing page. In one case you get condemned to pay one symbolic dollar (provided you did due diligence to fix your mistake); in the other, it can reach billions.
As for the "significant cost of retraining", just try that in front of a judge and the peanut gallery. We all know here that those models are periodically retrained for lots of different reasons, and licensing mistakes in the data set are no less worthy of a retrain than other kinds.
Notwithstanding the fact that Microsoft operates one of the world's biggest clouds, which the judge will find hard to ignore.
Posted Jul 5, 2022 12:20 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (3 responses)
It may be, barely, possible for a large corporation like Google or Microsoft with their internal code bases which tend to be better curated (but still it will be hard).
You do realize that training a model on the scale of Copilot costs a few million dollars every time you do it?
Good luck getting funding for retraining the free model every time Debian finds a copyright violation. I could see public or donated funding for a single training, but not for that.
So, if the law is what you claim it is, we can possibly still have proprietary models, but it's quite unlikely to have significant models trained on free software.
I think your rhetoric about tech people trying to fool judges is a bit misplaced and incendiary. I think it's safe to guess that Microsoft's lawyers have found Copilot to be legally safe enough. And it's not like this is some device designed purely to try to circumvent the law.
Posted Jul 5, 2022 12:50 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
First, computing power is dirt cheap: what was prohibitively expensive yesterday is wasted on ad processing and crypto mining today.
Second, the law does not deal with absolutes it deals with the real world and proportionality.
It does not require instantaneous systematic compliance. That, would be pretty much impossible to achieve in the material world. It requires speedy realistic compliance (as soon as you can, not as soon as it is convenient or cheap for you).
Periodic retraining would be fine, as long as you do not delay it unduly to avoid any consequence. And you *will* retrain periodically if only because computing languages keep evolving and you will need to make the model aware of new variants.
In the meanwhile, it is computationally cheap to filter output to ignore suggestions found in code you’ve been informed is tainted.
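As a rough illustration of how cheap that output-side filtering can be, here is a sketch that fingerprints overlapping 10-token windows of known-tainted code and rejects any suggestion sharing one (the window size, whitespace tokenizer, and file name are arbitrary assumptions; a production system would use something sturdier, such as winnowing fingerprints):

    import hashlib

    def fingerprints(text, n=10):
        """SHA-256 of every overlapping n-token window of the text."""
        toks = text.split()
        return {hashlib.sha256(" ".join(toks[i:i + n]).encode()).hexdigest()
                for i in range(len(toks) - n + 1)}

    # Built once from code you have been informed is tainted.
    blocklist = fingerprints(open("reported_tainted_snippet.c").read())

    def suggestion_ok(suggestion):
        """Reject any suggestion sharing a 10-token window with tainted code."""
        return fingerprints(suggestion).isdisjoint(blocklist)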
And if you are convinced the amount of tainted code will largely exceed your capacity to filter, and you proceed with your ML project anyway, it will be hard to take it as anything but willful copyright infringement.
It is all terribly inconvenient, I know. The law is not about your individual convenience.
> I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough.
”Even for copyrightable platforms and software packages, the determination whether infringement has occurred must take into account doctrines like fair use that protect the legitimate interests of follow-on users to innovate. But the promise of some threshold copyright protection for […] elements of computer software generally is a critically important driver of research and investment by companies like amici and rescinding that promise would have sweeping and harmful effects throughout the software industry”
Gregory G. Garre, Counsel for Microsoft Corporation, BRIEF FOR AMICI CURIAE MICROSOFT CORPORATION […] IN SUPPORT OF APPELLANT
That’s what Microsoft thinks when the code in question is not produced by Joe Nobody on Github
Posted Jul 5, 2022 14:16 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (1 responses)
In the future you may be able to train, for less, the models people today train for millions, but even that is a bit speculative (I think the biggest advances are likely to come from algorithmic development, though it's probably still possible to squeeze out some more computation per watt). You still won't be able to train, for dirt cheap, the better models they will by then be training.
Posted Jul 5, 2022 14:32 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 11:01 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
Posted Jul 5, 2022 12:15 UTC (Tue)
by pabs (subscriber, #43278)
[Link]
https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=ftp.deb...
Posted Jul 4, 2022 12:08 UTC (Mon)
by nye (subscriber, #51576)
[Link]
Nobody ever claimed that.
Posted Jul 2, 2022 14:23 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (9 responses)
Posted Jul 2, 2022 15:59 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Because the big capitalist proprietors still have more money and smart people to throw at the exercise than any other non-state actor.
Posted Jul 2, 2022 16:41 UTC (Sat)
by bluca (subscriber, #118303)
[Link]
Posted Jul 5, 2022 9:47 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (6 responses)
Building the model isn't subject to copyright restriction (which I agree is right and proper - we don't place copyright restrictions on people picking up information from code they read), but using it might be, just as I might be infringing copyright if I accidentally type in a byte-for-byte identical copy of something I read during code review at a past job.
There's precedent for this in human creativity: former Beatle George Harrison lost a case for "subconscious plagiarism" (hat tip to rsidd) because he had listened to a song several years before writing a song that happened to have almost exactly the same melody. No copyright restrictions applied to George Harrison listening to the song he later infringed copyright on, but they did come into play once he created a "new" work that happened to be too similar to an existing work he knew about.
The same could well apply to Copilot - creating the model is OK (human analogy is consuming media), holding the model itself is OK (human analogy is having a memory of past work), but using the output of the model is infringement if it's regurgitated copyrightable code from its input ("subconscious plagiarism" in the Harrison case).
Posted Jul 5, 2022 10:54 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (5 responses)
The copyright violation would have to be in the parts that remain to be filled in once you have copied the parts not protected by copyright—for example:
- the purpose ("what the code does", for example "reciprocal square root")
So, in practice, for a small enough snippet that such an accident is plausible, what might remain:
- variable names—but if they are "normal" and not very creative (using "i" for a loop counter or "number" for a number), it doesn't contribute a whole lot
The threshold for originality (in the US) is "low", but not nonexistent. Some things that have been deemed to not meet the threshold are (and remember that with code you need to meet it with what is left once you remove the substantial unprotected elements):
- Simple enough logos, even when there clearly is *some* creativity involved: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...
Posted Jul 5, 2022 11:08 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (4 responses)
It's unlikely to happen with a human coding, simply because I'm not going to copy any copyright-significant decisions from a colleague - I may have a very similar snippet, but the details will change, because that's the nature of a human copying out code from memory. It's more likely to happen with Copilot, since it sometimes regurgitates complete snippets of its input, unchanged, and in a very literal manner.
This is why I suspect the legality of Copilot is currently a lot greyer than either side would like us to think; where it copies code that's not eligible for copyright protection, it may be obvious that it's copied something, but not an infringement because there's no protection to infringe (just as me copying #define U8_MAX ((u8)~0U) from the Linux kernel is not infringing, because there's nothing in there to protect). The risk, however, comes in when the snippet is something that's eligible for copyright protection; I note, for example, that Copilot sometimes outputs the comments that go with a code snippet from its input, which are more likely to be protected than the code itself.
My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.
So on the GitHub side, the thing they're skating over is that the training process and the tool can be non-infringing without guaranteeing that the output is also non-infringing. On the SFC side, they're skating over the fact that a direct copy does not guarantee infringement, since not all code is eligible for protection. The truth all depends on what a judge says if such a case comes before them - and I'd expect to see that appealed to the highest legal authorities (Supreme Court in the USA).
Posted Jul 5, 2022 11:52 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (2 responses)
My more important point is that it *should* be legal, as a matter of sane policy that also would be the result that benefits free software, just like most pushback against copyright maximalism.
Posted Jul 5, 2022 15:26 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (1 responses)
I disagree that it should be legal - taking that position to an absurd extreme, if I train an ML model on Linux kernel versions alone, I could have an ML model that's cost me a few million dollars but that outputs proprietary kernels that are Linux-compatible and work on the hardware I care about. Effectively, copyright becomes non-existent for big companies who can afford to do this.
My position therefore depends strongly on what the tool actually outputs; if the snippets are such that they are not protected by copyright in their own right, and the tool only outputs unprotected snippets, then I'm OK with it; this probably needs some filtering on the output of the tool to remove known infringing snippets, which I'm also fine with ensuring is legal (it should not be infringement to include content purely for the purpose of ensuring that that content is not output by the tool - fair use sort of argument).
I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data. It's only the use you put them to that needs care - you could end up infringing by using a tool that is capable of outputting protected material, and it's on the tool user to watch for that and not accept infringing outputs from their tools.
Posted Jul 5, 2022 17:37 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
I suspect that would very much depend on whether someone manages to find a business model where a model, trained on someone else's copyrighted production, makes a lot of money on its own (not via the output of copycats of the original works). People and lawmakers tend to take a dim view of someone making a lot of money from other people's belongings without those people getting a cut.
I doubt, for example, that the pharmaceutical companies will manage to escape forever paying back the countries whose fauna/flora they sampled to create medicines. The pressure will only grow with climate change and such natural products becoming harder to preserve.
Posted Jul 5, 2022 15:35 UTC (Tue)
by nye (subscriber, #51576)
[Link]
This seems eminently reasonable and appears (from the outside, of course) to be the same conclusion that Microsoft's lawyers have reached. So far as I'm aware they haven't made an explicit statement on the matter, but I think it's reasonable to infer the first part (training process and model) from the fact that they approved the release of the software, and the second part (the status of the output) from the fact that they recommend "IP scanning" of any output that you use.
At least in the EU it's clearer; it's hard to see how there could really be any other possible interpretation. I'm not sure if we have the same laws regarding ML data collection here in Brexit Britain, or if that came too late.
Posted Jul 2, 2022 18:41 UTC (Sat)
by Karellen (subscriber, #67644)
[Link] (24 responses)
It's not the creation of AI models that I have a problem with. It's the creation of non-Free software, which is based on AI models, which is based on Free Software, that I have a problem with.
If people use Free Software to train AI models, which they then use to do things other than create more code (e.g. to find bugs, and report them), I'm fine with that. If people use Free Software to train AI models to create new code which is licensed under terms compatible with the licenses of the original code, as is required for other derivative works, I'm fine with that too.
An AI model isn't magic. Even if the people who wrote it and trained it don't entirely understand all the internal connections and weightings, it's still just a bit of software that takes a big pile of code as input, does a bunch of processing on it, and spits out some more code as output. Like a compiler, or a transpiler, or a linter. Requiring that the code output by a bit of software called "an AI model" has to follow the same rules as code output by any of those other software tools, in terms of respecting the licenses of its inputs, is not "expanding copyright madness to AI models", it's just copyright.
Posted Jul 2, 2022 19:03 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (23 responses)
But I believe your understanding of "the rules" is wrong. Copyright does not, in general, work such that everything that takes a work as input and produces an output necessarily 1) needs a specific license to do so, or 2) produces outputs that are legally derivative works of, or require a license to, the input, even if the outputs are both complex and useful.
Posted Jul 2, 2022 22:57 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (6 responses)
The output of Copilot is derived from its inputs. Therefore, by the definition of the word "derive", any and all output is a derivative of the input that was used to create it.
The only question is, to what extent does copyright either consider it a legal derivative work and hence subject to licence, or trivial and hence not subject to licence.
Any attempt to argue otherwise is basically playing Humpty Dumpty. The law does not define the word "derivative" as far as I know, so it means (approximately) what it means in common English. To argue that the output is not a derivative work is to argue that the English language is meaningless ...
(Oh, and while I don't know what the legal implications are, remember that the EU treats "works" and "data" separately. Saying that it's perfectly acceptable to treat works in public view as data fits nicely into the EU directive saying you can *train* an AI on public "works" by treating them as data. But if you then treat the output as a work, you are promptly putting it back under copyright rules ...)
Cheers,
Posted Jul 3, 2022 6:38 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
> A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.
Do not ask why they offer two completely different definitions back-to-back, because I have no idea. That's how it is in the statute book, and (presumably) how Congress wrote it.
Unfortunately, they do not define the word "work," and that's really the sticking point here. If I count the number of "E"s in a novel, and publish that number on a website, surely the number is not a derivative work of the novel, despite the fact that it has been "transformed" from the novel. A number is not a (creative) work, so it can't be a derivative work. But where do you draw the line? This definition does not tell us.
Posted Jul 4, 2022 9:06 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
You’re trying to nitpick by claiming that if you add a sufficient number of indirections and hair-splittings in the derivation steps, it’s not (legally) a derivation.
But even if a judge agreed to follow this kind of reasoning (most would reject it out of hand; it’s basically muddying the waters, and a judge’s core job is to un-muddle what the parties present in order to reach a verdict), it also works the other way:
“if splitting steps till a suggestion is too small to be considered a derivation in law works, how many of those tidbits can you combine the other way till you reach the critical mass and the end result is definitely protected?”
You can’t exempt yourself from legal obligations via technical foolery. It does not work that way law-side.
Posted Jul 6, 2022 19:48 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
Of course, even if the model *is* a derivative work, you still can't sue GitHub for creating the model, because the GPL places no restrictions on creating derivative works unless you distribute them to other people. You could allege an AGPL violation, if you can find a specific AGPL codebase and prove that GitHub actually used it as an input, but that's much harder. Or you can allege that the output of the model is a derivative work, at which point you don't have to worry about whether the model itself is a derivative work (copyright law doesn't care), but then you really do need to identify specific inputs that are "substantially similar" to one particular output, and that might be difficult as well.
Posted Jul 20, 2022 14:49 UTC (Wed)
by ghane (guest, #1805)
[Link] (2 responses)
I have a question, sparked by your comment above.
1. Would an index of a work be a Derivative?
2. Would a concordance of a work be a Derivative?
In my mind, this is a single question :-)
Note that both of these can be done by a human or a program, with the exact same output.
Surely this must have been litigated somewhere already.
ISTR that because of the delays in publishing the Dead Sea Scrolls, a group in the 1990s published a complete concordance, thus making the texts substantially available to researchers.
Posted Jul 20, 2022 15:13 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
Failure to publish is another major problem, because I believe most copyright laws apply to *legally* *published* work, from *the date of publication*.
There are plenty of cases of unpublished works being kept out of the public eye, and I can think of at least one where somebody published excerpts of a 200-yr-old work. But because he owned the original, nobody could get their hands on the complete work.
Another famous example of this sort of thing is Queen Victoria's diaries. She published the early ones unexpurgated. But when she died, her daughter and literary executor published "sanitised" and heavily edited versions, destroying the originals. Her nephew, George V, was horrified at such vandalism but was powerless. So the later diaries are missing roughly two-thirds of their original content :-(
But certainly as far as Co-Pilot is concerned, I think your references to indices and concordances miss the point. They may be new works of scholarship, but they are intended to direct you back to the original. Co-Pilot, it seems, hides the original from you, so you are "quoting blind" if you use its output.
Cheers,
Posted Jul 20, 2022 16:10 UTC (Wed)
by ghane (guest, #1805)
[Link]
I have found a reference to what I remembered: https://www.nytimes.com/1991/09/05/world/computer-breaks-...
I was specifically not referring to Copilot, but asking in general. However,
> But certainly as far as Co-Pilot is concerned, I think your references to indices and concordances misses the point. They may be new works of scholarship, but they are intended to direct you back to the original. Co-Pilot, it seems, hides the original from you so are "quoting blind" if you use its output.
Note that the reconstructed text from mining the concordance and indexes was useful precisely because the original was not available. If it had been, the reconstructed text would have been useless. They specifically claimed that the reconstructed text was not new in any way; it had been done by a computer(!). The original-text guys, paradoxically, while calling them "pirates", claimed the reconstructed text was not the same as the original, and hence had no value.
Posted Jul 2, 2022 23:11 UTC (Sat)
by Karellen (subscriber, #67644)
[Link] (2 responses)
I would say that the human brain is legally distinct from computer software, in a way that calling the computer software "AI" does not change.
I thought I specifically pointed out that I do not think this when I said that a program which takes code as input and produces bug reports as output would not have issues. (Because "code" and "bug reports" are different things.) A tool that takes code as input and outputs lines-of-code metrics, or cyclomatic complexity metrics, would also not need a specific license to do so. (Even though some proprietary software licenses attempt to deny the rights to that sort of activity!)
Not in general, or necessarily, no. But in the case of Copilot, which takes code as inputs, processes it, and then outputs more code which is generated from the processing of those inputs, what possible definition of "derived work" could there be that excludes it? The outputs are a result of the inputs - change the inputs and you get different outputs. Take the "training" inputs out, and you get no output at all. And, the outputs are the same type of thing as the inputs - code. The output code is generated from - derived from - the input code.
Posted Jul 3, 2022 0:15 UTC (Sun)
by SLi (subscriber, #53131)
[Link] (1 responses)
I believe there are quite a few very possible ways of it not being a derivative work—in the legal sense (I don't really care about the common meaning of the word). For example, if the copying of expression, as opposed to ideas, from any single work protected by copyright is de minimis, then the new work is not a derivative work of the original work. So, some amount of copying of expression can happen without copyright implications.
Another way is where essentially no copying of expression, but only copying of ideas, is going on.
Posted Jul 3, 2022 22:41 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Just to clarify for other commenters, this is a very complicated and jurisdiction-dependent legal analysis. As an example, in the 2nd Circuit of the US, they would do this: https://en.wikipedia.org/wiki/Abstraction-Filtration-Comp...
But in other jurisdictions, other tests will be used instead.
Posted Jul 2, 2022 23:15 UTC (Sat)
by anselm (subscriber, #2796)
[Link] (12 responses)
No. But if the device produces output that can be identified as a nontrivial part of a copyrighted work (e.g., a function definition), then the fact that it used an “AI model” does not mean it is somehow magically exempt from infringing on the copyright of that work.
In other words, if I produced that output myself by cutting and pasting the part in question from the original copyrighted work, I would obviously be infringing on its copyright. If Copilot produced the same output by passing the original copyrighted work through an AI model, why should that not be a copyright issue?
Posted Jul 4, 2022 12:06 UTC (Mon)
by nye (subscriber, #51576)
[Link] (11 responses)
Did anyone claim that it wouldn't?
Posted Jul 4, 2022 21:24 UTC (Mon)
by anselm (subscriber, #2796)
[Link] (7 responses)
Microsoft seems to think so. (They also claim that it doesn't happen very often, as if that was a valid excuse.)
Posted Jul 5, 2022 15:12 UTC (Tue)
by nye (subscriber, #51576)
[Link] (6 responses)
No they do not. They have not claimed that. They will not claim that. This straw man is *ridiculous* and seeing it repeated so often makes me want to scream.
The fundamental assertion that they're implicitly making by publishing copilot is that output from copilot is not automatically, ipso facto, an infringement of the license on its training data.
You[0] seem to be claiming that this further implies an assertion that the output from copilot is automatically, ipso facto, not an infringement of the license on its training data. Rather like claiming that "not all people are men" implies "all people are not men".
But not only are Github/MS not saying that, they are saying the opposite. In fact, what they *actually* say is this:
> You should take the same precautions as you would with any code you write that uses material you did not independently originate.
[0] In the plural sense. I imagine you *personally* have just been misled by the people making up this straw man, since it's so common.
Posted Jul 5, 2022 15:58 UTC (Tue)
by anselm (subscriber, #2796)
[Link] (5 responses)
In other words, they want us to perform the due diligence that they're not prepared to do themselves. This does not detract from the fact that they're misleading Copilot users about the copyright status of the code that Copilot emits, so they're potentially violating licenses such as the GPL or BSD license which stipulate that code covered by them can only be passed on if the license grant is also passed on.
Posted Jul 6, 2022 11:17 UTC (Wed)
by nye (subscriber, #51576)
[Link] (4 responses)
That directly contradicts the part of my comment that you quoted! Where are you getting this? Why do you think that you can tell such blatant lies and not get called out? I'm... well actually I'm just speechless at this point. I guess there's not much point continuing any further.
Posted Jul 6, 2022 13:12 UTC (Wed)
by anselm (subscriber, #2796)
[Link] (3 responses)
From what I've seen, Copilot does not annotate its suggestions with information about the status of the material it derives them from. That is, Copilot is “misleading” recipients of code snippets by saying nothing about their copyright status at all, and instead requiring the recipients to figure out for themselves whether the snippets are copyrighted (and if so, under what license, if any, they may be used). This may be justified from Github's/Microsoft's POV because many of the suggestions Copilot makes may be too trivial, or too much like very obvious boilerplate, to qualify for copyright protection in the first place, but there is no guarantee of that. Accidentally including, e.g., GPL material from Copilot output in their own non-GPL projects is a risk that Copilot users need to deal with somehow.
Nobody would have a problem with Copilot if Copilot said, where appropriate, “This code snippet derives from code licensed under the GPL”, because anyone receiving such a code snippet could then decide for themselves whether they wanted to accept it on those terms and act accordingly. (It would depend on the nature of the snippet in question whether this is an actual problem; e.g., three lines of schematic boilerplate from a GPL project are probably fairly innocuous to take over even for non-GPL code, but a nontrivial piece of nonobvious code might be more of an issue.) It would certainly suggest more effectively that the Copilot project is acting in good faith than simply sticking one's head in the sand.
Posted Jul 6, 2022 20:00 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
The whole "we'll tell you if your code looks similar to input data" thing is a search engine layered on top of Copilot, but that's really only going to be useful for very close matches. It doesn't have the smarts to say "well, this actually came from codebase X, even though it looks completely different to X."
Posted Jul 7, 2022 7:26 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Posted Jul 7, 2022 17:34 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted Jul 4, 2022 22:42 UTC (Mon)
by sfeam (subscriber, #2841)
[Link] (2 responses)
This starts to sound very close to the classic "infinite number of monkeys typing at random" scenario. Are the monkeys inevitably guilty of copyright violation?
The stereotypical madly-typing monkeys generate text strings where each character c is generated with uniform probability P(c) and accepted into the output text independent of that monkey's previous typing history. What if we bias P(c) to favor more readable text (give the monkeys Dvorak keyboards?). What if we filter acceptance by previous history (Markov filters? trained monkeys?). What if we house the monkeys in a black box and label it "Copilot"? What if we replace the monkeys with a neural net? Where in this process of refining the scenario does the possibility of copyright violation creep in, if anywhere?
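That progression is easy to demonstrate. A character-level Markov chain is the simplest "trained monkey": at a low order it emits readable-ish noise, and as the order grows it can only replay longer and longer verbatim runs of its training text, which is exactly where the copyright question stops being hypothetical. A minimal sketch (the corpus file name is a placeholder):

    import random
    from collections import defaultdict

    def train(text, order=2):
        """Map each `order`-character state to the characters that followed it."""
        model = defaultdict(list)
        for i in range(len(text) - order):
            model[text[i:i + order]].append(text[i + order])
        return model

    def monkey_type(model, order=2, length=200):
        state = random.choice(list(model))
        out = state
        while len(out) < length:
            out += random.choice(model.get(state, [" "]))
            state = out[-order:]
        return out

    corpus = open("whatever_the_monkeys_read.txt").read()
    print(monkey_type(train(corpus)))          # order 2: mostly gibberish
    print(monkey_type(train(corpus, 40), 40))  # order 40: near-verbatim corpus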
Posted Jul 5, 2022 0:57 UTC (Tue)
by anselm (subscriber, #2796)
[Link]
It doesn't really matter exactly how the monkeys came up with the copy. Your copyright problem starts where you take the monkeys' output, which is demonstrably identical to a preexisting copyrighted work, and pass it off as something you're entitled to dispose of as you please, because the original copyright holder's claim that – never mind those monkeys – you just ripped off their stuff will be difficult for you to refute. (In the case of Copilot, this is, if anything, more difficult, because you effectively showed the monkeys the original copyrighted work first, so their coming up with a verbatim copy eventually will surprise nobody.)
Posted Jul 5, 2022 22:19 UTC (Tue)
by hummassa (guest, #307)
[Link]
2. the answer to "Are the monkeys inevitably guilty of copyright violation?" is: the monkeys are never guilty (monkeys are not people, and only people can be guilty)... but if you copy, distribute, or perform the work received from the monkeys in public, then you are.
3. clarifying the last part: if the monkeys (or the ML model) produce a copyrightable piece of a copyrighted work, then the monkeys are nothing other than another medium in which the copyrighted work is fixed. The monkeys, or the ML model, are just like an HD, a DVD-RW, or a physical, printed book.
4. so: if a random number generator generates a number whose binary representation is the same as the image of a Blu-Ray of "Avengers: Endgame", that does ABSOLUTELY NOTHING to the copyright on the work. You can't burn it to a Blu-Ray and play it in a public setting. You can't even copy it. Only Disney/Marvel and their licensees can.
Posted Jul 2, 2022 2:44 UTC (Sat)
by developer122 (guest, #152928)
[Link] (1 responses)
Posted Jul 2, 2022 2:54 UTC (Sat)
by developer122 (guest, #152928)
[Link]
1) github has a services contract with ICE?
2) github is proprietary software
3) github discredits copyleft?
4) github is owned by microsoft
1) is the only one above that I can really find motivating. The others 2) and 4) are pretty much "well duh, it's a modern website" while 3) might be plausible but smells of conspiracy theory. TBH, I default to permissive licences wherever possible so maybe I'm just not the target audience.
Posted Jul 2, 2022 7:11 UTC (Sat)
by gdt (subscriber, #6284)
[Link] (1 responses)
Coming from Australia -- a "fair dealing" country rather than a "fair use" country -- I can't see how Copilot is not making unauthorised reproductions of my work. Fair dealing is a 'black-letter law' list of uses of works for which you do not need a copyright license, and training an AI simply is not on that list. Without fair dealing, the only way to reproduce the code to train the AI is via a license, and if that is a license like the GPL then the AI output must meet the terms of the license.
Posted Jul 6, 2022 3:02 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link]
Copilot isn't relying on "fair use" in countries where that applies either; it's relying on the fact that it isn't designed to reproduce the works it was trained on at all. Despite a few noteworthy corner-cases where Copilot did not work as intended, which are being addressed with filters, the goal of the system is extraction of common (i.e., not creative) elements and synthesis from many sources—not storage and retrieval of specific works.
Posted Jul 2, 2022 9:57 UTC (Sat)
by ale2018 (guest, #128727)
[Link]
Posted Jul 7, 2022 20:16 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link]
Poettering’s gone to work for MS on systemd…
SFC's use of Twitter.
> Um, the original inquiry was not mine? I'm a bit confused which parts of your response are directed at me, and which at bluca!
I grabbed the wrong post.
people, courtesy of Facebook is one particularly bitter example).
Coding is an inherently social endeavour.
Surely it can only reliably detect this if you apply using the web interface.
These are all fair points. GitHub and its servers are based in the US, so US law, not EU law, would apply. But the reality is that copyright claims on AI-derived works have not been tested. If Copilot emits a code snippet that is identical or highly similar to a copyrighted work, is that copyright infringement? If you ask an AI art generator for artwork of a soup can and it outputs something close to an Andy Warhol (because that's in the training set), is that infringement? How similar is too similar?
Not in the software domain, but George Harrison was famously found guilty of "subconscious plagiarism" in "My Sweet Lord", which resembled the Chiffons' "He's So Fine", a song he had undoubtedly heard. That case was quite blatant, so the ruling went against him; a short tune fragment, like a code fragment, may escape consequences (jazz musicians consciously "quote" tune fragments all the time and aren't expected to pay royalties on the quoted bits).
1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.
I assume this must be born of a US-centric view. There's no need to invoke fair use here in Europe: data mining on publicly available text and data bodies is exempt from copyright rules, per the copyright directive from a couple of years back. Whether a repository is proprietary or under a FOSS license is completely irrelevant; anyone can data mine all day long, as long as the material is publicly and legally accessible.
Copyright conflicts would constantly arise when two authors use the same trivial statement independently of each other, such as "Bucks beat Hawks and advance to the NBA finals" or "i = i+1". The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality.
I think there's an important difference here. Obviously, it's possible for two small snippets of work to be identical and still have been generated independently, with neither snippet having its origins in the other. But the output of Copilot does have its origins in the code it is trained on. It has colour.

On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations; where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection: it is in the public domain. That is good news for the open movement and not something that needs fixing.
Of course it's a generic opt-out; you can't pick and choose the parsers you don't like.
>
> MEASURES TO ADAPT EXCEPTIONS AND LIMITATIONS TO THE DIGITAL AND CROSS-BORDER ENVIRONMENT
>
> <...>
>
> Article 4
>
> Exception or limitation for text and data mining
>
> 1. Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.
>
> 2. Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.
>
> 3. The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.
>
> 4. This Article shall not affect the application of Article 3 of this Directive.
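Note paragraph 3: the exception evaporates if the rightsholder reserves the use "in an appropriate manner, such as machine-readable means". The directive doesn't say what form that takes; for publicly hosted code, one can imagine (hypothetically, no standard is implied here) something as mundane as a robots.txt-style declaration:

# Hypothetical machine-readable TDM reservation for a code-hosting site.
# The directive names no concrete format; this is illustration only.
User-agent: *
Disallow: /my-project/      # keep ordinary crawlers out, and/or:
# TDM-Reservation: 1        # invented directive expressing an Art. 4(3) opt-out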
For training it seems OK to use whatever is lawfully accessible, and retain copies for as long as training lasts for this purpose.
But I don't see how that adds any exception on copyright for real world usage of your trained system.
Copilot seems to be less transformative than a C compiler generating machine code, and so far binaries have always been considered derivative works of the source.
Even for larger chunks, the usual example being the fast inverse square root - but that snippet has been copied so many times in so many places that it's almost folklore now. And in all the cases I've seen, someone was _intentionally_ steering the autocomplete engine toward that answer. Would it pop up in a completely unrelated case when not actively trying to make it appear? And would a copyright lawsuit from the original author, based solely on that snippet, win in court? I don't know, I'm not a judge, but it's really not as clear-cut.
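(For anyone who hasn't run into it: the trick is a bit-level hack from the Quake III Arena source. A rough Java transliteration of the C original, so treat the details as approximate:)

// The famous "fast inverse square root" trick, transliterated from the
// C original in Quake III Arena; the magic constant is the well-known one.
static float fastInvSqrt(float x) {
    int i = Float.floatToIntBits(x);      // reinterpret the float's bit pattern
    i = 0x5f3759df - (i >> 1);            // magic-constant initial guess
    float y = Float.intBitsToFloat(i);
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement step
}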
To be fair to CoPilot, I think that argument is only solid for output which is a verbatim copy of one specific input. Given that most of CoPilot's suggestions are not verbatim copies of one specific input, I'm not sure how useful it is in the general case. Just because a few of its outputs are blatant copyright infringement, I don't think that it necessarily follows that all of its outputs are.
// Math.isPrime(int) returns whether the given number is prime or not
@Test
public void testIsPrime() {
assertTrue(Math.isPrime(2));
assertTrue(Math.isPrime(3));
assertTrue(Math.isPrime(5));
assertTrue(Math.isPrime(7));
assertTrue(Math.isPrime(11));
assertTrue(Math.isPrime(13));
assertTrue(Math.isPrime(17));
assertTrue(Math.isPrime(19));
assertTrue(Math.isPrime(23));
assertTrue(Math.isPrime(29));
}
GiveUpGitHub misses the practical details, like the need to maintain an official mirror
Why is Copilot so bad?
I'm thinking that perhaps this particular subthread has gone as far as it needs to; let's stop it here.
Can we stop here?
Perhaps you have the time to watch an out-of-control comment thread - on a holiday - to find the perfect point at which to intervene. I apologize, but I lack that time.
free and open data
https://snapshot.debian.org/removal/
- how it does it, especially if that is the best or one of a limited number of good ways to do it (so, yes, perhaps counterintuitively, the expression in a particularly clever code snippet might enjoy less protection)
- whatever is dictated by external factors (the magic numbers in the reciprocal square root code? there are probably other reasons why they are not protected, but one is that they have to be exactly those numbers to work, as dictated by mathematical law); this also applies to whatever the coding style dictates
and what must pass the originality threshold to attain copyright protection is things like:
- Stylistic choices that do not come directly from coding style or the way things are commonly done: how you group your code, perhaps the order of some lines, where you insert blank lines (in cases where it would be unlikely for two coders to do it the same way), etc.
- Comments. Short, purely technically descriptive snippets alone are probably unlikely to meet the originality threshold, but if you accumulate enough similar technical prose, even as multiple short comments that individually aren't original enough, I think this might be your best bet for a copyright violation.
- Blank forms
- Typefaces
- This vodka bottle: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...
From this perspective, it rubs me the wrong way when people seem to argue that copyright should restrict the creation and use of AI models.
Would you say that a human brain is, in some sense, more magic than that?
> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so
> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would [...] produce outputs that are legally derived works [...] of the input
Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so, or 2) produce outputs that are legally derived works or require a license of the input, even if the outputs are both complex and useful.
> These include rigorous testing, *IP scanning*, and checking for security vulnerabilities
(emphasis mine)
Can a text generated by random generation of sequential characters constitute copyright violation? An unequivocal "yes" answer seems to sound a death-knell for clean-room implementations. An unequivocal "no" answer lets Copilot off the hook.