Amazon's CodeWhisperer [LWN.net]

Amazon's CodeWhisperer

Posted Jul 5, 2022 16:16 UTC (Tue) by dskoll (subscriber, #1630) [Link] (3 responses)

Also not a lawyer, but I know a little about patents from a previous job. A patent is different from copyright. To infringe copyright, you have to distribute a work contrary to the terms of its license, or derive a work from a copyrighted work and distribute it contrary to the original work's license.

For a patent, the only thing that matters is what you do, not how you got there. So for example, when the LZW compression algorithm was patented, it wouldn't matter if you copied a reference implementation, created a brand-new implementation on your own, or used a Copilot-derived implementation... you'd still be infringing the patent.

If you do infringe on a patent, it's sometimes better not to know, because willful infringement carries a much higher penalty than inadvertent infringement.

I doubt Amazon or MSFT would be responsible for notifying users of their AI code-generating software about potential patent infringement... that risk lies entirely with the users.

Amazon's CodeWhisperer

Posted Jul 5, 2022 17:22 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

CodeWhisperer looks better than Copilot here - as I read it (I could be wrong) it does not present you with a suggested *completion* of what you're doing, it presents you with *examples* of what seem to be the same thing - WITH PROVENANCE.

So when CodeWhisperer makes a suggestion, it looks like it tells you where it got it from, and you have the information you need to do due diligence.

It seems Copilot doesn't bother ...

Cheers,
Wol

Amazon's CodeWhisperer

Posted Jul 6, 2022 14:59 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

If what you say is true then what it does is both more useful and safer.

Sounds pretty nice in theory, at least. More like a search engine than a tool to write garbage.

Amazon's CodeWhisperer

Posted Jul 6, 2022 21:02 UTC (Wed) by KJ7RRV (subscriber, #153595) [Link]

I won't use it because it's proprietary, but CodeWhisperer does sound legal and more useful than Copilot.

Amazon's CodeWhisperer

Posted Jul 5, 2022 16:21 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (2 responses)

It’s a problem for the services if they commingle code with different legal terms and do not present you the terms attached to whatever they suggest.

That service aspect apart, it changes nothing for you as a consumer or publisher of code. The service can be sued as accessory to copyright infringement, but the infringement is still yours (unless the service promises legal insurance as part of its terms of use).

As a consumer, you’re still supposed to perform legal due diligence on the third party code you integrate.

As a publisher, you’re supposed to make sure your legal terms are clearly written and clearly notified.

Copyright is still the same dangerous hairball as when AT&T published Unix (Lions book and all) and everyone involved ended up in court due to general carelessness.

Amazon's CodeWhisperer

Posted Jul 7, 2022 2:38 UTC (Thu) by dvdeug (guest, #10998) [Link] (1 responses)

US copyright law is a quite different hairball than when AT&T published Unix, actually. Then it was a lot easier to lose copyright, for example by publishing without proper copyright notices, which apparently is a trap AT&T fell into. Now copyright is automatic, which is a lot better if your code is being copied, and means there's a lot less public domain code out there.

Amazon's CodeWhisperer

Posted Jul 7, 2022 7:13 UTC (Thu) by Wol (subscriber, #4433) [Link]

The other massive trap AT&T fell into - a rather bigger trap actually - was intentionally removing copyright notices.

They published Unix without any copyright notices, including removing any copyright notices from public code they incorporated. I think they officially relied on trade secret law.

Then, when they sued Berkeley they claimed copyright over the lot. That's why they were so desperate to keep the settlement quiet, because Berkeley's defence was "Hey, *you* removed our copyright notices, and now you're suing us for copying our own code!". At which point they were asked to *prove* which code was theirs and they were forced to respond "we don't actually have a clue".

Sadly, that's not the only case I know of where the copyright pirate has threatened action against the author.

Cheers,
Wol

Amazon's CodeWhisperer

Posted Jul 5, 2022 18:30 UTC (Tue) by nickodell (subscriber, #125165) [Link] (19 responses)

The patent issue is clearer than the copyright issue.

For a patent, if you invent the same thing as a previous patent, then you're infringing on that patent. It doesn't matter if you invented it independently. (However, the penalties for willful infringement are higher.)

For copyright, if you come up with the same idea, the way you came up with it matters. One interpretation is that language models are doing some form of reasoning, so a similar work appearing in the training data isn't necessarily proof that the language model is copying that previous work. Another interpretation is that a language model is just copying part of its input and changing a few things.

There are awkward effects for both possible interpretations. If you accept the first interpretation, then how do you measure whether a model is doing "enough" reasoning? If you accept the second interpretation, that implies that the output of e.g. GPT-3 is jointly owned by every person who's written anything on the internet. Practically speaking, it would become illegal to train an AI on common crawl data.

I don't think any court has ruled on it either way.

Amazon's CodeWhisperer

Posted Jul 5, 2022 18:46 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (18 responses)

One thing that people have problems understanding is that copyright is not just about literal copying.

You can take all the words in a text, and arrange them in sentences meaning something else, and the result will be non infringing.

You can take the same text, and replace every single word with a synonym, and the result will be definitely infringing. None of the words survived but the structure is still the same.

That makes models that analyze the structure of the code being written, and suggest bits to bring it closer to someone else’s structure, especially problematic.

Amazon's CodeWhisperer

Posted Jul 5, 2022 21:11 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (16 responses)

The classic example with writing is that you can change the medium or genre of a work and it can still be a derivative. All those comic book movies are still derivatives of the original comics, even if they don't directly swipe story lines. Similarly, The Magnificent Seven is still a derivative of Seven Samurai even though the setting, character names, and even the language all changed.

That said, the functional nature of code makes it a more difficult case than something purely expressive like fiction or poetry. If there are few enough ways of achieving the same purpose efficiently, it's possible to argue the code is determined purely by functional constraints and therefore isn't expressive. This is especially true if the code is implementing a published algorithm, like quicksort or the sieve of Eratosthenes.

Amazon's CodeWhisperer

Posted Jul 6, 2022 5:48 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (3 responses)

However, the model will make sure that, out of the few ways to achieve a purpose, you select one that others already chose. Hard to see how that can square with copyright if you intend to ignore original licensing.

Amazon's CodeWhisperer

Posted Jul 6, 2022 11:15 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

Copyright is intended to protect creative expression. If there are only a limited number of ways to express something, then the resulting expression may not qualify for copyright protection - e.g. in C++11 code, there are only a few plausible ways[1] to check that a std::string is empty, and using one of those is unlikely to be protected by copyright even if it's a direct copy and paste from another code base.

[1] Two ways to get the size of the string natively (size() or length()) and compare it to 0. The empty() method. Using c_str() or data() and then checking whether the pointer points to a NUL character, or using strlen() to check the C string length. Comparing for equality to an empty string constant via either operator==() or compare().

And of course, there's the implausible ways that might conceivably be enough to get copyright protection depending on context - using find_first_not_of or find_last_not_of to find a character not in the empty string.
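For concreteness, here is a rough sketch of the checks listed above, written as standalone C++11 functions. The function names (by_size, by_empty, all_checks_agree, and so on) are made up for illustration and do not come from any real code base.

```cpp
#include <cstring>
#include <string>

// Plausible checks: each returns true when s holds no characters.
bool by_size(const std::string& s)    { return s.size() == 0; }
bool by_length(const std::string& s)  { return s.length() == 0; }
bool by_empty(const std::string& s)   { return s.empty(); }
bool by_equals(const std::string& s)  { return s == ""; }
bool by_compare(const std::string& s) { return s.compare("") == 0; }

// C-string style checks: these stop at the first NUL character, so they
// report "empty" for a std::string that starts with an embedded '\0'.
bool by_first_char(const std::string& s) { return *s.c_str() == '\0'; }
bool by_strlen(const std::string& s)     { return std::strlen(s.c_str()) == 0; }

// The implausible one: look for any character that is not in the empty set.
bool by_find_first_not_of(const std::string& s) {
    return s.find_first_not_of("") == std::string::npos;
}

// Hypothetical helper: true when every variant gives the same answer as empty().
bool all_checks_agree(const std::string& s) {
    const bool expected = s.empty();
    return by_size(s) == expected && by_length(s) == expected &&
           by_equals(s) == expected && by_compare(s) == expected &&
           by_first_char(s) == expected && by_strlen(s) == expected &&
           by_find_first_not_of(s) == expected;
}
```

For ordinary strings all of these agree, which is the point: there is very little room for creative expression. The C-string variants differ only in the corner case of a string beginning with an embedded NUL.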

Amazon's CodeWhisperer

Posted Jul 6, 2022 15:09 UTC (Wed) by khim (subscriber, #9252) [Link]

Code tends to be at odds with copyrightability: if what you are writing is convoluted enough to warrant copyright protection then very often it's convoluted enough to perform badly.

There are always some exceptions like 0x5f3759df: if you are using that then you may be violating copyright, but if you use 0x5f375a86 (which is simply the best constant for that algorithm) then you are not violating.

But what if you want to be compatible with some existing code? Quake, e.g.? Then you need 0x5f3759df to stay compatible!

Ultimately problems like that mean that for every line of code you need court decision… which is not practical, obviously.

Amazon's CodeWhisperer

Posted Jul 6, 2022 17:32 UTC (Wed) by rgmoore (✭ supporter ✭, #75) [Link]

At least with CodeWhisperer it sounds as if you're shown the original code and the licensing terms, so you have a chance to comply with the terms if you want to. If you intend to violate the original licensing, that's on you, not on the software that shows you how others have done it.

Amazon's CodeWhisperer

Posted Jul 6, 2022 7:18 UTC (Wed) by LtWorf (subscriber, #124958) [Link] (10 responses)

Hollywood constantly rips off old sci-fi stories passing them off as original. I guess since most of the original authors are dead they don't fear repercussions.

Amazon's CodeWhisperer

Posted Jul 6, 2022 11:37 UTC (Wed) by farnz (subscriber, #17727) [Link] (7 responses)

Depends when the original author died, and whether the new work is similar enough to the old work to qualify as a derived work.

For the death date side, if the original author died before 1950, then copyright term is over anyway, and no protection applies.

The "is it similar enough" side is more complex - the rules on whether a work is actually derived from another are complex and require a degree of human judgement - and this is the bit that Copilot and Hollywood both depend upon, in that something may be a copy, but not rise to the level of infringing copyright.

Amazon's CodeWhisperer

Posted Jul 6, 2022 17:08 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> For the death date side, if the original author died before 1950, then copyright term is over anyway, and no protection applies.

This is true, but there are a rather surprisingly large number of additional "outs" in US copyright law (and generally *not* in the copyright law of any other country, because the US is weird):

* Published (only) outside the US: It's complicated, but probably not copyrighted if it was out of copyright in its home country on January 1, 1996.
* Published (in the US) more than 95 years ago, and before 1978 (i.e. 1927 and earlier): Out of copyright per https://www.law.cornell.edu/uscode/text/17/304
* Published (in the US) before 1964, and copyright not manually renewed: Out of copyright (renewal is now automatic). Many older works fall under this exception, particularly anything that was seen as ephemeral or low-value. This includes quite a few pulp magazines, which you can now read on the Internet Archive for free.
* Published (in the US) before March 1, 1989, and no copyright notice or registration: Never copyrighted.
* Sound recording, published (in the US) before February 15, 1972: This used to be a complicated morass of state laws, but Congress fixed it in 2018, see https://www.law.cornell.edu/uscode/text/17/1401

Amazon's CodeWhisperer

Posted Jul 7, 2022 2:27 UTC (Thu) by dvdeug (guest, #10998) [Link] (5 responses)

> For the death date side, if the original author died before 1950, then copyright term is over anyway, and no protection applies.

Life + 70 is only true in part of the world, mainly Europe. The US copyright laws are very hairy, but anything published more than 95 years ago is in the public domain, and anything published since then may not be, with author death dates only mattering right now for works first published after 2002. Lots of the rest of the world is life+50 (e.g. China) and only a couple of nations are life+60, but that includes India, which has more people than the EU.

Amazon's CodeWhisperer

Posted Jul 7, 2022 7:54 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

Note that you're talking about a whole load of ways in which, depending on where you are, something is out of copyright at life + 70 but also out of copyright before that. Hollywood tries to sell its movies globally, and thus wants to be on the upper limit of copyright - and even in the US, a story is out of copyright by that point.

Amazon's CodeWhisperer

Posted Jul 7, 2022 22:46 UTC (Thu) by dvdeug (guest, #10998) [Link]

Yes, Hollywood has to worry about more or less global copyright. However, just being life+70 does not make a work PD in the US, and there are a lot of authors, like George Orwell, whose works are in the public domain in the EU, but won't be PD in the US until 95 years from publication, or 2044 for _1984_.

Amazon's CodeWhisperer

Posted Jul 8, 2022 12:23 UTC (Fri) by Ross (guest, #4065) [Link] (2 responses)

I don't follow what you are saying about 2002. The US definitely also follows the life + 70 rule (they are constantly encouraging other countries to do so as well).

This treatment in the US goes back to everything published in 1978 or later by individuals (not corporations), so it is definitely relevant and can very easily extend the term beyond 95 years. It will extend the duration of copyright for most works which would otherwise expire starting in 2048.

Amazon's CodeWhisperer

Posted Jul 8, 2022 13:38 UTC (Fri) by dvdeug (guest, #10998) [Link]

https://guides.library.cornell.edu/copyright/publicdomain . The US copyright law shows where it's been patched over and over to bring conformity with new standards while being mostly bug compatible with the way things were. The US only went life plus 70 for works made since 1978 or published since 2002. Since US law is a mess and other countries don't need bug compatibility with US law, US copyright maximalists are pushing for life+70.

Amazon's CodeWhisperer

Posted Jul 8, 2022 13:47 UTC (Fri) by dvdeug (guest, #10998) [Link]

Sorry, I didn't read carefully enough. Works first published between 1978 and 2001 are treated as if the author died in 1978 if the author died earlier. So 2002 is the first year that any work first published in that year is life+70. For example, all of Mark Twain's unpublished works were "published" in 2001, so the copyright owners are claiming copyright through 2048.

Amazon's CodeWhisperer

Posted Jul 6, 2022 11:52 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (1 responses)

It's not unusual for a movie to credit (not always very prominently if the author and work are obscure) works from which a script was historically derived. Hollywood would much rather have a sign-off and give away points on the net (which are often close to worthless) than risk a lawsuit with unknowable damages some day in the future.

However of course if the scriptwriter is dishonest they might not tell anybody where the original idea came from.

Realistically it's only going to be short works anyway. Condensing a novel into a movie loses almost everything except the outline plot - and even then it might take two or three movies to get it on film. There are a lot of old SF shorts with interesting ideas in them, but many would need a lot of work to sell a movie in the 21st century. It's notable how many of the original "Dangerous Visions" seem pretty tame now, or are outrageous for very different reasons. "If All Men Were Brothers, Would You Let One Marry Your Sister?" is the sort of thing you could probably do (but Hollywood wouldn't give you money for it) but it would go unnoticed, who cares? Likewise "Eutopia" in which the big deal is homosexuality. Or e.g. the mediocre Dick short "Faith of Our Fathers" which I'm sure modern people would guess is Phil Dick because of all the hallucinogenics, but is no "The Man in the High Castle".

I'd like it if the average "Sci Fi" movie I saw was as clever as "Raft of the Titanic" (what if the Titanic doesn't quite sink and many aboard survive...), never mind "Golem XIV" (instead of taking ages to discover that optimal play in Tic-tac-toe is a draw like in "Wargames", what if the machine the Americans built to plan World War III is categorically smarter than us) or "Orphanogenesis" (if we are just software, what happens if you just randomize the parameters and execute the resulting software in a virtual machine?).

Amazon's CodeWhisperer

Posted Jul 7, 2022 2:31 UTC (Thu) by dvdeug (guest, #10998) [Link]

I'm still confused by the new Top Gun; the original licensed a magazine article. The remake apparently failed to relicense it, so they're in court. How that was wise or profitable, I don't know; they may have a good case it's not legally derivative, but having licensed it for the first movie isn't going to help their case.

Amazon's CodeWhisperer

Posted Jul 8, 2022 9:31 UTC (Fri) by NAR (subscriber, #1313) [Link]

So did that Mandalorian episode infringe on the Magnificent Seven/Seven Samurai copyrights?

Amazon's CodeWhisperer

Posted Jul 7, 2022 15:10 UTC (Thu) by esemwy (guest, #83963) [Link]

Not exactly the case. You can’t copyright facts. If I’m reading a text and restate facts in my own words, I’m free and clear as far as copyright is concerned. This is how multiple news sites publish the same story. This is also the essence of what makes copyrighting code difficult.