DeVault: GitHub Copilot and open source laundering
GitHub’s Copilot is trained on software governed by these terms, and it fails to uphold them, and enables customers to accidentally fail to uphold these terms themselves. Some argue about the risks of a “copyleft surprise”, wherein someone incorporates a GPL licensed work into their product and is surprised to find that they are obligated to release their product under the terms of the GPL as well. Copilot institutionalizes this risk and any user who wishes to use it to develop non-free software would be well-advised not to do so, else they may find themselves legally liable to uphold these terms, perhaps ultimately being required to release their works under the terms of a license which is undesirable for their goals.
Chances are that many people will disagree with DeVault's reasoning, but
this is an issue that merits some discussion still.
      Posted Jun 23, 2022 16:26 UTC (Thu)
                               by mpldr (guest, #154861)
                              [Link] (24 responses)
       
     
    
      Posted Jun 23, 2022 18:31 UTC (Thu)
                               by NYKevin (subscriber, #129325)
                              [Link] (10 responses)
       
1. It's not clear to me whether this claim is actually correct. A model is ultimately "just" a big bag of statistical information, and I honestly don't know whether (US) copyright law attaches to such things in the first place, but I'm skeptical (see e.g. Feist v. Rural). 
2. It's not relevant. What matters is whether the output of the model is a derivative work of the original, which is a completely different legal question. Derivative works are not subject to some sort of magical "transitive property" that requires the model to also be a derivative work; you can argue that the output is derivative while taking no position on the status of the model itself. Similarly, you could argue that the output is *not* derivative, again taking no position on the model. The status of the model is not relevant to the question, unless you're going to allege an AGPL** violation. 
* The kernel of truth here is that, in practice, clean-room engineering is often a good idea for the avoidance of legal risk. But there's nothing in either the GPL or the copyright statute that says you have to do it. Because that would be stupid. Imagine if novelists couldn't read books without running into copyright issues. 
** The AGPL is the only widely-used license whose obligations attach on creation of a derivative work, rather than on distribution of that work. As far as I know, GitHub has no intention of distributing the model itself to anyone, so if you want to sue GitHub just for creating the model, you'd have to claim an AGPL violation specifically. 
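A toy illustration of why the two questions come apart (an invented example, nothing like Copilot's actual architecture): even a model that is nothing but token counts can hand back its training text verbatim once the context is specific enough.

from collections import Counter, defaultdict

# Toy "model": for each token, count which token follows it. Purely
# statistical -- yet a sufficiently specific prompt walks the statistics
# straight back to the training data. Invented example, illustrative only.
training_code = "static inline int foo_bar(int x) { return x * 42; }".split()

model = defaultdict(Counter)  # token -> Counter of observed next tokens
for prev, nxt in zip(training_code, training_code[1:]):
    model[prev][nxt] += 1

def complete(token, length):
    out = [token]
    for _ in range(length):
        followers = model[out[-1]]
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])  # most likely next token
    return " ".join(out)

# Prints the training snippet back, verbatim:
print(complete("static", len(training_code)))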
     
    
      Posted Jun 23, 2022 19:26 UTC (Thu)
                               by ballombe (subscriber, #9523)
                              [Link] (1 responses)
       
Maybe there is some specific request that led copilot to return the whole body of some GPL files - for example, by looking for certain patterns that occur in a single piece of software, etc. That would strengthen the case. 
     
    
      Posted Jun 23, 2022 19:52 UTC (Thu)
                               by Gaelan (guest, #145108)
                              [Link] 
       
Amusingly, it also autocompleted a BSD license onto that code. 
     
      Posted Jun 25, 2022 1:00 UTC (Sat)
                               by gerdesj (subscriber, #5446)
                              [Link] (2 responses)
       
Quite.  Also, many putative authorities on the GPL seem to forget that there are many legal systems.  If you are going to dive in and be authoritative on the GPL then you really should present an argument that works for all legal systems that the GPL attempts to work within.  Quite a job!  
Legal is as legal does: some legal systems have a concept of "reasonable", or what a "reasonable" person would do, and I think that is what the GPL is riffing off.  There's also the concept of being able to "quietly enjoy [something]".  I'm a Brit, so my local legal system informs my knowledge here.  Not all legal systems work like that.  
I think it is fair to say that we all have strange ideas about how the GPL works.  There's no need to call out end users. 
     
    
      Posted Jun 25, 2022 11:47 UTC (Sat)
                               by Wol (subscriber, #4433)
                              [Link] 
       
And far too many authorities read what they want to see, not what's actually there. I've sure been guilty of that. I think my knowledge of the GPL now is pretty good, precisely because I've had plenty of people call me out on my mistakes. 
How many "experts" have NOT been through that learning experience? The majority of them? 
Cheers, 
Wol 
     
      Posted Jun 26, 2022 3:22 UTC (Sun)
                               by gdt (subscriber, #6284)
                              [Link] 
       The deeper point about differing copyright laws is not so much interpretation of copyright licenses, but the distinction between "fair use" and "fair dealing". GitHub claims Copilot's actions are fair use, and therefore that the license is irrelevant. However, in fair dealing jurisdictions Copilot's use of the program source must either meet the copyright license or one of the black-letter list of allowed uses in the fair dealing exceptions of that jurisdiction's copyright law. 
     
      Posted Jun 26, 2022 9:40 UTC (Sun)
                               by gspr (subscriber, #91542)
                              [Link] (4 responses)
       
Surely it does apply in one extreme, namely that of a really good model! If I take a copyrighted picture and create a model that very accurately reproduces said picture, it seems entirely plausible that my model runs afoul of the original's copyright. 
In the other extreme—that of a really terrible model—it probably doesn't, but we probably shouldn't write off the models in-between those extremes. 
     
    
      Posted Jun 27, 2022 16:51 UTC (Mon)
                               by Wol (subscriber, #4433)
                              [Link] (3 responses)
       
If, however, you're referring to a model like most people here are - computer science, otherwise known as maths - then equally copyright should NOT apply, because it's maths. Or "sweat of the brow". Or a whole bunch of other doctrines that lawyers do their best to misunderstand, but which state quite clearly that it is not copyrightable material. 
Cheers, 
Wol 
     
    
      Posted Jun 28, 2022 1:26 UTC (Tue)
                               by hummassa (subscriber, #307)
                              [Link] (1 responses)
       
If a mathematical model produces as its output a perfect reproduction of a copyrightable and copyrighted work (something novel produced by the human mind), then said mathematical model is nothing but a copying apparatus. There is no difference between the neural model of Copilot and a big HD containing all the works it's seen, just well indexed. The output is just a copy of the copyrighted work, subject to the same protections under the laws and treaties. 
     
    
      Posted Jun 28, 2022 8:00 UTC (Tue)
                               by Wol (subscriber, #4433)
                              [Link] 
       
Oh the joys of the ambiguity of English ... 
As I read it "copyright attaches to the model" - in other words there is no copyright *in* the model. But if there is copyright in the *original*, then that applies *to* the model as well ... 
I think we're talking at cross purposes ... :-) 
Cheers, 
Wol 
     
      Posted Jun 28, 2022 6:37 UTC (Tue)
                               by gspr (subscriber, #91542)
                              [Link] 
       
Clearly absurd. 
     
      Posted Jun 23, 2022 20:28 UTC (Thu)
                               by mrugiero (guest, #153040)
                              [Link] (2 responses)
Impartiality is for the judge, not for the litigants. 
       
     
    
      Posted Jun 23, 2022 20:58 UTC (Thu)
                               by mpldr (guest, #154861)
                              [Link] (1 responses)
       
There's nothing wrong with that – quite the opposite – but it's not helpful if what you want is a legal review. You may get some interesting points from them, sure; but it's not exactly helpful when trying to find out what the law actually is (which is, after all, for a court to decide). 
     
    
      Posted Jun 24, 2022 2:33 UTC (Fri)
                               by scientes (guest, #83068)
                              [Link] 
       
/Almost not sarcastic 
     
      Posted Jun 24, 2022 11:27 UTC (Fri)
                               by flussence (guest, #85566)
                              [Link] (9 responses)
       
The GPL2 didn't make nVidia *or* AMD play nice with Linux (key phrase: "preferred form for modification"), the GPL3 didn't stop TiVoization (they trivially routed around it; especially Apple), the AGPL3 didn't stop SaaS vendor lock-in (instead they weaponised it against each other), and no current or future iteration of it will stop Microsoft committing automated for-profit piracy at global scale as is happening here. 
The only thing the GPL *is*, clearly, is weak DRM powered by magical thinking and a bunch of weird elitist old men clinging to a power fantasy dreamed up half a century ago, which they refuse to grow out of. People who try to actually play by the stated rules get a worse experience; corporations engage in automated piracy with neural networks with little to no legal repercussions (or often just old-fashioned copyright infringement if they're peddling white-label ARM devices); and any attempt to resist this toothless status quo from within the system gets you ostracised. Most users of GPLed software have never and will never know it exists, never mind read or understand it, and if they did they wouldn't be able to meaningfully exercise the rights granted under it. But it sure makes some people feel smug about themselves on their moral high horse. 
I feel like at this point the only way to stop this cancer of trillionaires strip-mining the creative output of individuals is to stop giving away any legal rights to that work in the first place. Make the code utterly radioactive to anyone who takes license texts seriously, especially corporate lawyers: All Rights Reserved, free for personal use only, the software shall be used for good not evil, and with a written threat to DMCA anyone found uploading to github or any other platform of similar size and motive. Piracy is going to happen anyway, but we can still choose who feels safe and comfortable doing it. 
Thought experiment: if you train a neural network on the text of the GPL itself and coax output from the machine that superficially resembles the input, but with manually chosen tweaks that change its meaning, are you exempt from the copyright header in the original, as MSFT seems to think? If so, that's the final nail in the coffin for software copyright as a whole; the words of the legal document and the colour of the bits don't mean anything any more. 
     
    
      Posted Jun 26, 2022 0:07 UTC (Sun)
                               by salimma (subscriber, #34460)
                              [Link] (8 responses)
       
Any specific example for AMD here? They seem to be a much better citizen when it comes to the GPL, at least compared to nVidia (and even nVidia is finally open-sourcing its kernel drivers). 
     
    
      Posted Jun 26, 2022 0:23 UTC (Sun)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (7 responses)
       
     
    
      Posted Jun 26, 2022 10:00 UTC (Sun)
                               by flussence (guest, #85566)
                              [Link] (6 responses)
       
Rumour has it that AMD middle management wanted the FOSS option they were due to announce to be kept slightly inferior to fglrx for Reasons, and having an independent effort that didn't depend on firmware blobs or the decrepit x86-only int10 VBIOS like that did was severely embarrassing them. 
Not many people may remember this now, but the R200/(reverse-engineered)R300 driver also used to be blob-free. Strangely that stopped being the case after AMD took over, even though it was feature-complete. 
     
    
      Posted Jun 26, 2022 18:49 UTC (Sun)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
Because it made no sense to reimplement the critical power management and link training code multiple times, instead of doing it once in AtomBIOS.  
     
      Posted Jun 26, 2022 19:40 UTC (Sun)
                               by mjg59 (subscriber, #23239)
                              [Link] (3 responses)
       
The DRM side of the R2/3/400 driver always required a firmware blob - but for a long time it was just embedded inside the kernel driver, so wasn't user visible. My recollection (which seems to be supported by the driver, but it's been a long time since I looked at this properly so I could be wrong) is that a bunch of the 2D acceleration in that driver depended on DRM, so effectively the 2D driver also had a blob dependency if you wanted it to work properly. 
The difference between -radeonhd and -ati as far as reliance on firmware goes was that the defined interface to various pieces of card functionality was to execute interpreted scripts present in the card flash. These scripts didn't do anything that the driver couldn't, so you could absolutely reimplement that functionality in the driver - the problem is that card vendors used these scripts as a way to abstract hardware differences (eg, using RAM from different vendors with different timing constraints), and ignoring Atom would mean having to have card-specific data in the driver before that card would work correctly. 
A hybrid approach is to use Atom for data but not for code, but even then there are still risks due to the fact that the defined interface is the scripts and not the data tables. A card vendor could modify the way the script interpreted the tables (or even hardcode stuff directly into the script) and again you'd need card-specific knowledge to avoid that. 
-radeonhd spent a while trying to avoid executing any Atom code, but effectively relied on it anyway - it couldn't program the card from cold and so depended on the system firmware having executed the scripts before it ran. In any case, support for executing Atom code (including running the ASIC init function) was added to -radeonhd by September of 2007. 
Looking at the initial commits to support r500 in the -ati driver, I think the only time it would ever call int10 is if the card was entirely uninitialised. -radeonhd would do exactly the same if it was configured without support for doing Atom-based init. 
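To make the failure mode concrete, a deliberately invented sketch (toy structures, nothing to do with real AtomBIOS): a native driver that reads the vendor's data table correctly can still miss a quirk that lives only in the vendor's script, because the script, not the table, is the defined interface.

# Invented structures, illustrative only -- not real AtomBIOS.
DATA_TABLE = {"ram_vendor_a": 7, "ram_vendor_b": 9}   # per-board timing data

def vendor_script(board):
    # What the card flash effectively does: read the table, then apply a
    # board-specific quirk that exists only inside the script.
    delay = DATA_TABLE[board["ram"]]
    if board["revision"] > 1:   # quirk hardcoded by one card vendor
        delay += 2
    return delay

def native_driver(board):
    # Hybrid approach: use the data table but reimplement the logic natively.
    # Correct on the boards the author tested...
    return DATA_TABLE[board["ram"]]

board = {"ram": "ram_vendor_b", "revision": 2}
assert vendor_script(board) == 11
assert native_driver(board) == 9   # ...subtly wrong on this board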
     
    
      Posted Jun 29, 2022 10:12 UTC (Wed)
                               by flussence (guest, #85566)
                              [Link] (2 responses)
       
I learned something today. Thanks for explaining all that to my dumb ass. 
This is the kind of thing that makes me keep my subscription to the site going. 
     
    
      Posted Jun 29, 2022 10:31 UTC (Wed)
                               by mjg59 (subscriber, #23239)
                              [Link] (1 responses)
       
There are tradeoffs. I'd love to avoid having to rely on non-free code to make hardware work, and I'm not going to criticise people for writing drivers that avoid doing so. But the reality is that any such driver is going to work less well than a driver that uses the defined interface to call non-free firmware (in much the same way that we call into non-free UEFI code to set boot variables these days), and so there's value in the driver that calls non-free code existing, and also it's unsurprising that distros would pick the one that works with more hardware. 
Luc's priorities on -radeonhd were probably based on his experience with the VIA chipsets that were extremely limited by what the BIOS permitted (and yeah it turns out that not being able to set any modes other than those that are hardcoded in the BIOS is not good!), but the outcome was also that his driver for those chipsets simply didn't work on all hardware - I had a VIA-based laptop that would just give a black screen with his driver, because the BIOS didn't match his expectations. To be completely fair, on Radeon I hit some similar constraints when I was researching reclocking the RAM for power management - the Atom scripts simply took too long, so I took out a bunch of the wait statements and hardcoded those into the kernel and it was great, and then after a couple of days of uptime the card would wedge during a reclock and also it didn't work on all hardware, so it turns out there was a reason that those were there in the first place. So I absolutely understand the desire to have native code for all of this, but also in the absence of vendors providing explicit contracts about hardware behaviour, a driver that doesn't use the defined interfaces is inherently going to break things. 
     
    
      Posted Jun 29, 2022 11:31 UTC (Wed)
                               by farnz (subscriber, #17727)
                              [Link] 
       It's also worth noting in this context that part of the reason to have vendor scripts of some form (be they AtomBIOS, ACPI or others) is that power delivery changes can't happen instantly - and different board manufacturers will have done different transient analysis to determine what their hardware can reliably support. The results of that analysis (whether it's a "rule of thumb" assessment or a proper calculation) need communicating to the driver somehow - and a small scripting language is as good as any other way to deal with it, especially since the edge cases get complex if you're doing a per-board calculation based on measuring the final system during post-manufacture testing of a board.
 That said, given the quality of some vendor code, I understand Luc's reluctance to trust it - I've encountered one vendor who asserted that the CPU would detect an OUT 0xCF8, EAX instruction in userspace and then ensure that nothing else accessed PCI configuration space until the same userspace process executed either OUT 0xCFC, EAX or IN EAX, 0xCFC later, on the basis that if the CPU didn't do that, it would be possible for their userspace driver to crash. I'm not even sure how this could work under Linux…
      
           
     
      Posted Jun 26, 2022 20:16 UTC (Sun)
                               by sjj (guest, #2020)
                              [Link] 
       
     
      Posted Jun 23, 2022 18:21 UTC (Thu)
                               by bluca (subscriber, #118303)
                              [Link] (2 responses)
       
     
    
      Posted Jun 23, 2022 18:43 UTC (Thu)
                               by bluss (guest, #47454)
                              [Link] (1 responses)
       
     
    
      Posted Jun 23, 2022 19:38 UTC (Thu)
                               by bluca (subscriber, #118303)
                              [Link] 
       
     
      Posted Jun 23, 2022 18:26 UTC (Thu)
                               by Deleted user 129183 (guest, #129183)
                              [Link] (31 responses)
       
This would be a good recommendation, but unfortunately… 
The way software freedom is defined, it means that everyone is free to redistribute the source code as they wish, _including uploading it verbatim to G*tHub_ even if it doesn't come from there. You cannot forbid it in a licence, because it would make the software non-free (and potentially introduce a licence incompatibility). Sure, you can just ask people not to, but not everybody listens, and when you notice that somebody has uploaded your software to G*tHub against your wishes, it may already be too late: the code may already have been stolen and incorporated into the machine learning dataset. 
The only winning move is not to play: don't publish the source code anywhere, and don't show it to anyone. Which will obviously make your software proprietary, which may not be something that you want. 
> Instead, I would update your licenses to clarify that incorporating the code into a machine learning model is considered a form of derived work, and that your license terms apply to the model and any works produced with that model. 
…which would be also just disregarded by G*tHub. Again, the only way to win is not to play. 
     
    
      Posted Jun 23, 2022 18:49 UTC (Thu)
                               by NYKevin (subscriber, #129325)
                              [Link] (28 responses)
       
Aha, I missed that line in DeVault's post. 
This is not a thing you can do. The judge decides what counts as a derivative work. You don't. The license cannot override the applicable copyright law. If copyright law does not say that a license is required in order to create the model in the first place, then no provision of the license can prohibit it. Not even "All rights reserved, do not redistribute." OTOH, if the applicable copyright law says that a license is required to create such a model, then GitHub has to comply with the terms of the license, and it doesn't matter whether the license explicitly calls this out or not. 
     
    
      Posted Jun 23, 2022 23:07 UTC (Thu)
                               by Wol (subscriber, #4433)
                              [Link] (27 responses)
       
If your licence makes it clear that you consider putting your source into a ML algorithm creates a derivative work, then the Judge is likely to agree with you. 
If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous". Fair enough. But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not). 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 1:13 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] (26 responses)
       
> If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous". 
That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of. At best, the defendant might be able to raise the defense of "innocent infringement" in some jurisdictions. But under US law, that does not relieve the defendant of liability, it merely reduces the monetary amount of their damages, which can still be quite substantial if many copies were made. Also, a valid copyright notice often defeats or greatly weakens this defense (see e.g. 17 USC 401(d)), depending on jurisdiction. 
Seriously, if anyone in this thread is contemplating acting on this suggestion, I would strongly urge that person to consult an attorney who specializes in copyright law. This is not how the law works at all. You cannot go before a judge and say "the law was ambiguous so I just did it anyway," and expect to automatically win. 
> But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not). 
You have it backwards. Either the defendant is arguing that no license is required (and so it doesn't matter whether the license is valid or invalid), or the defendant is arguing that the license is valid and its actions fall within the scope of the license. Arguing that the license is invalid is something the plaintiff might do in order to defeat the latter defense; it never makes sense for the defendant to raise such an argument. 
     
    
      Posted Jun 24, 2022 2:30 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (25 responses)
       
And the legislature will not examine YOUR code, and decide YOUR case ... It  is the Judge who *decides* whether it is a derivative work or not. 
> That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of. 
Who was talking about the *law*? I was talking about the *licence*. 
If the licence makes it absolutely clear that the licensor considers ML to create derivative code, then the licensee cannot claim an innocent mistake. The licensee MUST claim that a licence is not required and copyright does not apply. 
At the end of the day, it's down to the Judge to *apply* the law. And if it is clear to the Judge that the defendant "knew or should have known" that they were acting against the wishes of the licensor, then there is no defence of estoppel, or "innocent infringement", or "but I thought it was okay". 
And, faced with the choice of siding with the plaintiff and saying to the defendant "you knew the plaintiff did not permit that", or CREATING NEW LAW by explicitly defining ML into the Public Domain or whatever, which do you think a Judge is going to choose? 
At the end of the day, putting this stuff into your licence does not change the law. But it makes it a damn sight more likely that the Judge is going to side with your interpretation of the law. 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 2:55 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] (24 responses)
       
That is exactly my point. The judge will make this decision, based on the facts of the case and what the legislature wrote in the statute. Not based on the license. The license has zero to do with what is or is not a derivative work. 
     
    
      Posted Jun 24, 2022 8:04 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (23 responses)
       
If the Judge has to decide whether ML is a derivative work or not (and create new law in the process!!!), then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor. 
If the license says "I consider this to be a derivative work", the Judge will not want to create new law by disagreeing - you're effectively twisting the Judge's arm. How far he lets you twist it is down to him :-) 
Remember PJ - Judges try to upset the apple-cart as little as possible. If you give the Judge an out, he will take it ... 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 9:15 UTC (Fri)
                               by kleptog (subscriber, #1183)
                              [Link] (7 responses)
       
In particular, the EU Copyright Directive states that Text and Data Mining for the purpose of research and education is permitted. You can write whatever you like in your license; it has no effect. Now, GitHub is making a commercial product here, so they don't get to claim a broad exemption. So it comes down to the individual member states to regulate as they see fit. 
Which basically makes the conclusion: It depends. 
     
    
      Posted Jun 24, 2022 11:01 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] (3 responses)
       
     
    
      Posted Jun 24, 2022 11:03 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] 
       
     
      Posted Jun 24, 2022 12:20 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (1 responses)
       
That's easily got round - opt out by default and grant people you DO like a licence. In other words, the default is "mining is permitted", and the law says you have to explicitly change THE DEFAULT if you don't like it. Pretty sensible, imho. 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 12:37 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] 
       
     
      Posted Jun 24, 2022 13:49 UTC (Fri)
                               by edeloget (subscriber, #88392)
                              [Link] (2 responses)
       
The same thing goes for books, photographs and so on. You can train a language model on copyrighted books for research and education. But if you want to do it for the purpose of offering a commercial service, you have to get the proper authorization (this can be costly, but it's not out of reach for a company like Microsoft). 
I haven't read all the comments below, so maybe I'll state a point that has already been proposed. The "model is a derivative work" question is interesting, yet I don't think this is a real issue. The problem I see (and I think it's a problem at the moment I read the "we can do it" rationale by Github) is that I don't believe them: contrary to what they say, the code written by the machine is a derivative work, no matter how hard they press on this. Not only can you easily obtain code that is a direct copy-paste of existing code (with comments, if needed :)) but even if you don't, you'll end up with code that 1) has a striking similarity with existing code (having used it for a while, I don't envision Copilot magically imagining new algorithms) and 2) is directly inspired by the input code (Copilot is unable to code a solution to a new problem; for example, it cannot propose that you use an API it does not already know). 
So, as a conclusion, I would not go by the "a model created using this code is a derivative work" clause. I would go by an "any code created by a ML model trained with this code is a derivative work" clause, which I find both more logical and more satisfactory. As a consequence, the tool itself can exist, but cannot be used to create anything but free software - as I see it, this would be a win-win situation (although it might be tough to market to software shops :)) 
     
    
      Posted Jun 24, 2022 15:57 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] 
       
The EU directive allows TDM for commercial programs too. It adds an opt-out provision for that case. 
     
      Posted Jun 24, 2022 16:36 UTC (Fri)
                               by rgmoore (✭ supporter ✭, #75)
                              [Link] 
       I absolutely agree that any code produced by Copilot that is a verbatim copy of anything from its training corpus would be a copyright violation, unless that piece fell under one of the established limits on copyright, such as purely functional material that has a restricted number of ways it can be expressed or a snippet that's too small to be considered expressive.  Material that's suspiciously similar to something in the training corpus would be at least deeply suspect.  But that would be true whether it comes from Copilot or from a human programmer.  If your code is a copy of someone else's, it's a copyright violation regardless of how it got that way unless it isn't eligible for copyright in the first place.
 The problem with trying to restrict Copilot (and similar programs) is threefold:
 Again, this applies only to training the model.  The output of the model is a different thing and may be a copyright violation even if the model itself isn't.
      
           
     
      Posted Jun 24, 2022 10:00 UTC (Fri)
                               by mjg59 (subscriber, #23239)
                              [Link] (9 responses)
       
Why? Do you have examples of this occurring? 
     
    
      Posted Jun 24, 2022 12:25 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (8 responses)
       
PJ was very clear on this - Judges are very reluctant to rock the boat. If "is this a derivative work" is not clear then, given the choice of a NARROW interpretation of the licence that says "the licence denies permission, I'll side with the licence", or a BROAD interpretation that says "all licences like that are invalid", which one are they going to choose? 
Especially when the defendant has "known or should have known" the plaintiff's express wish and ignored it. 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 17:45 UTC (Fri)
                               by rgmoore (✭ supporter ✭, #75)
                              [Link] (3 responses)
       The point is that isn't how licenses work.  The idea behind a copyright license is that the licensor grants the licensee some rights they would normally be denied by copyright law in exchange for a consideration.  For example, if copyright law would normally deny me the right to use your program to train my ML model, you can write a license that would grant me that right.
 But in practice a license can't prevent someone from doing something they would otherwise have the right to do under copyright law.  It is possible to write a license that requires the licensee to give up some rights under copyright law as part of the consideration they get for receiving some other rights.  But nobody is forced to agree to the license!  If they simply refuse to accept the license, they can continue doing anything they normally had the right to do under copyright law.  Refusing the licensing terms would deny them whatever rights the license would grant them, but if they weren't intending to do those things it's an empty threat.
      
           
     
    
      Posted Jun 24, 2022 20:25 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (2 responses)
       
But if copyright LAW is not clear on the matter? 
That is what everybody is ignoring - it is down to the Judge to decide what the law IS. If the licence explicitly refuses permission, does the Judge make a NARROW ruling that says the licensor's explicit wishes rule, or a BROAD ruling that all such clauses are invalid. 
PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time. 
And I don't know which case it was, but there was a discussion about a pro-software-patent Judge some while back, who ruled "In THIS case, the software is clearly non-patentable. I can't conceive of a scenario where any software is patentable". Note he didn't even attempt to say software isn't patentable. He was pro-patents. But he stated, in a ruling, "I don't think it is possible for software to pass the patentability bar". He made a very narrow ruling, but accepted that the consequences would probably be wide. 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 21:21 UTC (Fri)
                               by rgmoore (✭ supporter ✭, #75)
                              [Link] (1 responses)
       > PJ was quite clear that given a choice between a broad or narrow ruling, the Judge would opt for the narrow ruling every time.
 But that applies only if the broad and narrow ruling turn out the same way.  In that case, the judge will usually rule on the narrowest possible grounds that results in the outcome they think is right for the case.  If the broad and narrow grounds for the ruling have opposite results, the judge has to go based on which one seems to be a better reading of the law and situation, not just on narrow versus broad.  More generally, narrow vs broad is something that's more true of low-level judges than of higher-level ones.  Even if individual judges make narrow rulings, it's likely that different judges will rule differently.  That will create uncertainty and force a higher court to rule on the matter, creating a broader ruling.  That's the way these things usually go.
      
           
     
    
      Posted Jun 25, 2022 17:05 UTC (Sat)
                               by khim (subscriber, #9252)
                              [Link] 
       > But that applies only if the broad and narrow ruling turn out the same way.
 Where does that idea come from? A narrow ruling is used precisely to ensure the possibility of a broad ruling (made in a different case, by a different judge, later) proclaiming the opposite outcome! Narrow is almost always better. Because, well, it's narrow. It describes the situation more precisely. The only out you have is to proclaim that the narrow reading is so narrow it's not applicable to your case at all. That often happens with patents: the judge is presented with half a dozen patents which could, theoretically, be treated as prior art and eliminate the patent completely, but nine times out of ten the judge doesn't do that, and instead just proclaims that yes, the patent is still valid, just not applicable to this case.
 > That's the way these things usually go.
 I would say it's the way these things usually don't go. 99% of the time the decision doesn't reach high enough courts to decide anything definitively. Usually it takes dozens of cases and decades of litigation for that to happen. 
     
      Posted Jun 24, 2022 19:30 UTC (Fri)
                               by mjg59 (subscriber, #23239)
                              [Link] (3 responses)
       
     
    
      Posted Jun 24, 2022 20:43 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (2 responses)
       
And that is exactly the argument in front of the Judge. *IS* it a derived work? And if the law is unclear, and the licensor is explicit that he considers it a derived work, then the only safe option for the Judge is to rule that it IS a derived work and let the legislators sort it out. 
And this is why you can NOT "defer to the law" in this argument. The question at issue is not "is this a derivative work according to the law?", but "what is the law?". THAT is the argument in front of the Judge. 
The legislators can choose to let the genie out the bottle. The Judge will not be happy about letting the genie out the bottle off his own bat. 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 21:13 UTC (Fri)
                               by mjg59 (subscriber, #23239)
                              [Link] (1 responses)
       
     
    
      Posted Jul 1, 2022 8:32 UTC (Fri)
                               by nim-nim (subscriber, #34454)
                              [Link] 
       
Despite years of commercial pretense to the contrary, professionals know "smart" systems are no smarter than the humans who coded them. 
     
      Posted Jun 24, 2022 18:23 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] (4 responses)
       
Nope, that's not how the law works. Stating it over and over again does not make it true. 
For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else. 
     
    
      Posted Jun 24, 2022 18:28 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] 
       
     
      Posted Jun 24, 2022 20:50 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (1 responses)
       
And if that's not enough for him to make up his mind? 
That *SHOULD* be all that's needed. But if that IS all that's needed, why can't we all make our own minds up? Surely it's obvious? Why do we need Judges? It can't be THAT hard ... ? 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 21:59 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] 
       
     
      Posted Jun 27, 2022 14:18 UTC (Mon)
                               by anselm (subscriber, #2796)
                              [Link] 
       
> For the purposes of determining whether X is a derivative work of Y, the judge looks at X, Y (its contents, not its license), and the copyright statute. Nothing else.
 I think it would still be of some interest whether X resulted from Y through “cp Y X” or through a query to Copilot whose model was trained on Y. In the first case, X is pretty clearly a derived work of Y. In the second case, Microsoft, at least, would probably like to claim it isn't.
 
     
      Posted Jun 26, 2022 0:11 UTC (Sun)
                               by salimma (subscriber, #34460)
                              [Link] (1 responses)
       
It will probably also curtail your network effect even further though. 
     
    
      Posted Jun 27, 2022 9:22 UTC (Mon)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
     
      Posted Jun 23, 2022 22:26 UTC (Thu)
                               by jmspeex (subscriber, #51639)
                              [Link] (16 responses)
       
     
    
      Posted Jun 24, 2022 1:07 UTC (Fri)
                               by developer122 (guest, #152928)
                              [Link] (15 responses)
       
There was a raft of people arguing that all code from copilot should be GPL due to ingested GPL code, to which I reply: what gives the GPL priority? 
There is code on github under lots of licences, e.g. the CDDL, which is a copyleft free software licence but incompatible with the GPL. As there is significant CDDL code on github, under the same reasoning perhaps all the output should be CDDL? 
And then there's all the code on github which specifies no licence at all. Just because it is publicly available does not mean any and all rights have been granted to you. Plenty of this code is proprietary. 
The end result is an unlicensable pile of legal mush that no sane lawyer should go anywhere near. 
     
    
      Posted Jun 24, 2022 8:16 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] (14 responses)
       
     
    
      Posted Jun 24, 2022 9:19 UTC (Fri)
                               by Vipketsh (guest, #134480)
                              [Link] (9 responses)
       
I can see it being difficult to dispel an argument that Copilot does something more than just mining when it is able to regurgitate numerous lines of code verbatim, including comments.  Is it text/data mining if the representation of some code is changed to a set of statistical weights?  Is that any different to, say, compressing the code?  Is it different to encrypting the code, requiring a password to gain back the original representation? 
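To make the compression analogy concrete, a minimal sketch (the licensed text is an invented stand-in): the intermediate representation bears no visible resemblance to the work, yet the work comes back byte for byte.

import zlib

# Invented stand-in for a licensed source file.
source = (b"/* SPDX-License-Identifier: GPL-2.0 */\n"
          b"int add(int a, int b) { return a + b; }\n") * 20

blob = zlib.compress(source)            # the "representation": opaque bytes
assert zlib.decompress(blob) == source  # ...yet the round trip is exact
print(f"{len(source)} bytes -> {len(blob)} bytes, reproduced verbatim")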
 
     
    
      Posted Jun 24, 2022 11:09 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] (8 responses)
       
     
    
      Posted Jun 24, 2022 11:52 UTC (Fri)
                               by Vipketsh (guest, #134480)
                              [Link] (7 responses)
       
The major deciding factor in this case for me (IANAL) is that if "AI models" are allowed to reproduce code verbatim, then there is a legally sound method to de-license any and all code: take something similar to Copilot, train it with the code you would like to de-license, and then get your device to reproduce it.  That is clearly not acceptable, yet it is what Copilot does in some cases. 
In the end this is a new situation which touches on various concepts and courts and/or law makers need to figure out where this and similar situations stand. 
     
    
      Posted Jun 24, 2022 13:01 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] 
       
     
      Posted Jun 24, 2022 18:37 UTC (Fri)
                               by NYKevin (subscriber, #129325)
                              [Link] 
       
IMHO we don't need to figure this out, unless somebody wants to allege that GitHub is violating the AGPL by not providing the model's source code, parameters, etc. Such a lawsuit would be complicated, messy, and bring in all sorts of difficult legal issues.
 
For literally any other legal claim under the sun, you can just point directly at the output and say "Regardless of how it was created, that output is substantially similar to some portion of the XYZ codebase, and therefore it is a derivative work of XYZ and is in violation of [whatever license applies]." Judges have been dealing with the "we can't directly prove that the defendant copied the plaintiff's work" problem for a very long time, and are perfectly capable of finding for the plaintiff anyway.
      
           
     
      Posted Jun 24, 2022 19:05 UTC (Fri)
                               by samth (guest, #1290)
                              [Link] (2 responses)
       
     
    
      Posted Jun 24, 2022 21:16 UTC (Fri)
                               by Wol (subscriber, #4433)
                              [Link] (1 responses)
       
The problem is when the law says "black is white", or "true is false". What was that about the ?Pennsylvania legislature trying to define "pi = 3"? 
Lawyers are supposed to understand logic. Maths is a subset of logic. Yet how many legal screw-ups do we have because lawyers want to legislate maths out of existence? 
And then  we get people here who think law is "black and white". PJ was quite clear - "law is squishy". BECAUSE lawyers don't understand black and white! Or because life mostly ISN'T black and white! 
It would be lovely if things were that simple. But what answer do you get when you ask the question "what's 6 times 9?" When you ask the question "What is reality?" Because even a Physicist will turn round and say "the Universe doesn't know"! YOUR reality is different from MINE, and we have no way of telling if there even IS a correct version of events. 
For the most part, law tends to fix (or forget) its worst mistakes, but isn't that true of Computer Scientists too? 
You need to try and twist reality to your view, or you'll find other people WILL twist it to theirs. Quite possibly without even trying or intending to ... 
Cheers, 
Wol 
     
    
      Posted Jun 24, 2022 22:52 UTC (Fri)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
Hey, PA has its problems. But this one came from Indiana. Luckily there was a school teacher that was there to help teach some math. 
https://www.straightdope.com/21341975/did-a-state-legisla... 
     
      Posted Jun 24, 2022 21:12 UTC (Fri)
                               by rgmoore (✭ supporter ✭, #75)
                              [Link] (1 responses)
       > The major deciding factor in this case for me (IANAL) is that if "AI models" are allowed to reproduce code verbatim, then there is a legally sound method to de-license any and all code.
 No, because copyright is fundamentally path independent.  All that is needed is to show that an identifiable part of copyrighted work A shows up in work B.  Once the author of work A has shown that, a copyright violation is assumed, and it's up to the author of B to provide a specific defense for why it isn't a copyright violation.  There are valid defenses, but laundering the code through an AI is not one of them.
      
           
     
    
      Posted Jun 25, 2022 6:00 UTC (Sat)
                               by NYKevin (subscriber, #129325)
                              [Link] 
       
     
      Posted Jun 24, 2022 16:56 UTC (Fri)
                               by Lennie (subscriber, #49641)
                              [Link] 
       
     
      Posted Jun 25, 2022 14:23 UTC (Sat)
                               by eduperez (guest, #11232)
                              [Link] (2 responses)
       
I do not think anybody is arguing about the model training, but the output from the model once it has been trained. 
If the model produces an output that is a verbatim copy of some GPL'd code (see https://twitter.com/mitsuhiko/status/1410886329924194309 for an example), is that code free now, just because it was produced by some AI? Or is it still protected by the GPL, because it is a derivative work? When can we consider that the output has been produced by the AI, and when can we consider it is still a derived work? This is the legal mush. 
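As a sketch of the kind of mechanical evidence one could point at (a hypothetical helper, names invented): normalise whitespace and look for long verbatim token runs shared between a suggestion and the known file. The snippet below echoes the fast inverse square root fragment from that tweet.

from difflib import SequenceMatcher

def longest_shared_run(suggestion: str, licensed: str) -> str:
    # Compare token sequences so formatting differences don't hide copying.
    a, b = suggestion.split(), licensed.split()
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return " ".join(a[m.a:m.a + m.size])

licensed = "long i = *(long *) &y; i = 0x5f3759df - (i >> 1);"
suggestion = "// fast inverse sqrt\nlong i = *(long *) &y; i = 0x5f3759df - (i >> 1);"

run = longest_shared_run(suggestion, licensed)
print(f"{len(run.split())} tokens verbatim: {run}")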
     
    
      Posted Jun 25, 2022 21:27 UTC (Sat)
                               by Vipketsh (guest, #134480)
                              [Link] (1 responses)
       
Even so, I still think that the legal status of the "AI model" warrants further examination.  When talking about open source code the "AI model" isn't much of a consideration, because at worst it stores code in some cryptic way that is available elsewhere in a much more easily digested form.  But that changes a whole lot if the training material is some proprietary code stolen from somewhere.  If your "AI model" is able to reproduce the stolen code verbatim (or sufficient parts for copyright to apply), and training of the "AI model" is "give an exception on copyright rules when doing text and data mining, for any purpose" (bluca's words), that should mean that this trained "AI model" is fully legal and thus a legal distribution mechanism for the stolen code.  Surely there is a legal principle to prevent things working out this way?  At what point does an "AI model" turn into a "distribution mechanism"? 
 
     
    
      Posted Jun 26, 2022 9:46 UTC (Sun)
                               by bluca (subscriber, #118303)
                              [Link] 
       
     
      Posted Jun 24, 2022 22:09 UTC (Fri)
                               by glenn (subscriber, #102223)
                              [Link] (12 responses)
       
From the Copilot FAQ: 
> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub. 
It's not trained on Microsoft's internal closed source projects.  Gee, I wonder why this is.
      
           
     
    
      Posted Jun 24, 2022 22:27 UTC (Fri)
                               by bluca (subscriber, #118303)
                              [Link] (10 responses)
       
     
    
      Posted Jun 24, 2022 22:46 UTC (Fri)
                               by glenn (subscriber, #102223)
                              [Link] (2 responses)
       
     
    
      Posted Jun 25, 2022 9:18 UTC (Sat)
                               by bluca (subscriber, #118303)
                              [Link] (1 responses)
       
     
    
      Posted Nov 11, 2022 22:16 UTC (Fri)
                               by glenn (subscriber, #102223)
                              [Link] 
       
It's not unthinkable that users of the tool who use it for commercial purposes could also be sued. 
     
      Posted Jun 26, 2022 22:27 UTC (Sun)
                               by LtWorf (subscriber, #124958)
                              [Link] (6 responses)
       
I think this alone shows that they are not very sure about the legality of what they are doing, but trust that developers won't be able to do anything about it (unlike the paying customers). 
     
    
      Posted Jun 27, 2022 0:31 UTC (Mon)
                               by bluca (subscriber, #118303)
                              [Link] (1 responses)
       
     
    
      Posted Jun 27, 2022 5:41 UTC (Mon)
                               by NYKevin (subscriber, #129325)
                              [Link] 
       
But I couldn't find any statement in their FAQ one way or the other - it just refers to "public repositories on GitHub," a category including both FOSS and proprietary code. It's entirely possible that they are using all of that code, and IMHO that seems like the most straightforward way to read the sentence (which doesn't mean that it is the intended meaning, of course). 
     
      Posted Jun 27, 2022 12:45 UTC (Mon)
                               by nim-nim (subscriber, #34454)
                              [Link] (2 responses)
       
Microsoft has access to plenty of proprietary code to train a model on; that they chose to use other people's code instead speaks volumes. 
     
    
      Posted Jun 27, 2022 18:31 UTC (Mon)
                               by bluca (subscriber, #118303)
                              [Link] (1 responses)
       
     
    
      Posted Jun 30, 2022 13:45 UTC (Thu)
                               by nim-nim (subscriber, #34454)
                              [Link] 
       
     
      Posted Jun 27, 2022 14:44 UTC (Mon)
                               by excors (subscriber, #95769)
                              [Link] 
       
> Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be rare that GitHub Copilot suggestions included personal data verbatim from the training set. [...] We have implemented a filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough. We will keep improving the filter system to be more intelligent to detect and remove more personal data from the suggestions. 
"rare" != "never", and if someone stores sensitive personal data in a private GitHub repository then they absolutely don't want Copilot to reveal that information publicly to anyone who tries hard enough. 
I expect the same applies to other confidential information, like yet-to-be-announced product names that companies might store in private repositories, or secret keys, or algorithms that they're protecting as trade secrets, etc. 
Since the Copilot training data apparently includes public repositories even if they have a restrictive license, but excludes private repositories even if they have a very permissive license, it sounds like GitHub is confident that there are no copyright issues but is concerned about those other privacy issues. 
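The filter GitHub describes presumably looks, in spirit, something like this post-processing pass (a guess at the general shape, not GitHub's actual code), which also shows why "standard formats" is doing a lot of work: trivial reformatting evades it.

import re

# A guess at the general shape of such a filter -- not GitHub's code:
# scan a suggestion for email-like strings and block it on a match.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def filter_suggestion(text: str) -> str | None:
    """Return the suggestion, or None if it leaks an email address."""
    if EMAIL_RE.search(text):
        return None
    return text

print(filter_suggestion("contact admin@example.com for the key"))  # blocked: None
print(filter_suggestion("contact admin at example dot com"))       # sails through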
     
      Posted Jun 27, 2022 14:26 UTC (Mon)
                               by geert (subscriber, #98403)
                              [Link] 
       
So it may have been trained on any source code that is publicly available; we don't know what exactly, the description in the FAQ is very vague (deliberately?). 
FWIW, it might have been trained on whatever proprietary code was ever leaked to the Internet, which might even include the sources of some version of Microsoft Windows ;-) 
 
     
      Posted Aug 30, 2022 13:27 UTC (Tue)
                               by scientes (guest, #83068)
                              [Link] (1 responses)
       
What I noticed is that github has serious rate-limiting on their search functionality, but google's codesearch project (a nerdy project, for which the RE2 regex library was built) embraced these types of searches. It looks like Microsoft is hostile to technical proficiency and just wants control and classical Microsoft things. 
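For contrast, the trick that made Google Code Search cheap to query (as documented in Russ Cox's write-up of how it worked) was to consult a trigram index before the regex ever runs; a toy sketch of the idea, with invented documents:

import re
from collections import defaultdict

docs = {
    "a.c": "int fast_inverse_sqrt(float x);",
    "b.c": "void unrelated(void);",
}

index = defaultdict(set)            # trigram -> documents containing it
for name, text in docs.items():
    for i in range(len(text) - 2):
        index[text[i:i + 3]].add(name)

def search(literal: str, pattern: str):
    # Intersect posting lists for the query's trigrams...
    candidates = set(docs)
    for i in range(len(literal) - 2):
        candidates &= index[literal[i:i + 3]]
    # ...then confirm with the actual regex on the few survivors.
    return [n for n in candidates if re.search(pattern, docs[n])]

print(search("inverse", r"fast_\w+_sqrt"))   # ['a.c']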
     
    
      Posted Aug 30, 2022 15:22 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
     