Gentoo bans AI-created contributions [LWN.net]

Gentoo bans AI-created contributions

Posted Apr 18, 2024 21:04 UTC (Thu) by mb (subscriber, #50428) [Link] (20 responses)

>but in a few years it will be seamlessly integrated into all sorts of products

Right. That won't resolve the open questions, though.

Just processing copyrighted material through some sort of "AI" filter should not make the Copyright go away.
Or alternatively, any program processing any data shall be allowed to remove Copyright.
Cannot choose both.

Gentoo bans AI-created contributions

Posted Apr 18, 2024 21:14 UTC (Thu) by snajpa (subscriber, #73467) [Link]

> Right. That won't resolve the open questions, though.

I have a feeling that trend is going to accelerate. Open questions kinda rendered "obsolete" by even more pressing new open questions :-D

Gentoo bans AI-created contributions

Posted Apr 19, 2024 9:55 UTC (Fri) by kleptog (subscriber, #1183) [Link] (18 responses)

>Just processing copyrighted material through some sort of "AI" filter should not make the Copyright go away.

Well, if you're processing 1TB of data into a 1GB model, it's very questionable whether you can really consider it a derived work any more.

As a human I have consumed enormous amounts of copyrighted data, and the responsibility to respect copyright lies with me, not the tools I use. Similarly, the responsibility for respecting copyright lies with the person using the LLM. An LLM is not going to produce something that resembles an existing copyrighted work without explicit prompting. I find it hard to believe it's going to happen by accident.

> Or alternatively, any program processing any data shall be allowed to remove Copyright.

Or, the person using the program is responsible for complying with any relevant laws.

(I'm getting strong "colour of your bits" vibes. The tools you use are not relevant to the discussion of copyright.)

Gentoo bans AI-created contributions

Posted Apr 19, 2024 13:41 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (11 responses)

> Well, if you're processing 1TB of data into a 1GB model, it's very questionable whether you can really consider it a derived work any more.

So you're saying that if I rip a music CD that is ~700MiB of data, but then use lossy compression and make it into 50MiB of data, I'm actually allowed to do that?

Gentoo bans AI-created contributions

Posted Apr 19, 2024 14:36 UTC (Fri) by farnz (subscriber, #17727) [Link] (10 responses)

14:1 compression like that is well within the expected bounds of today's psychoacoustically lossless techniques. 1000:1 is not, so the argument is that if you rip a music CD and get ~700 MiB PCM data, and compress that down to 700 KiB, the result of decompressing it back to a human-listenable form is going to be so radically different to the original that this use is transformative, not derivative.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 16:17 UTC (Fri) by samlh (subscriber, #56788) [Link] (9 responses)

If you turn the music into MIDI, that could achieve such compression, and would still be a derivative work.

The same argument may reasonably apply for LLMs given how much verbatim input can be extracted in practice.

Gentoo bans AI-created contributions

Posted Apr 20, 2024 15:36 UTC (Sat) by Paf (subscriber, #91811) [Link] (8 responses)

A lot of verbatim input can be extracted from *me* in practice, surely enough that I could violate copyright from memory.

So uh what about the other stuff I create? I know what good data visualization looks like because I have read many data viz based articles over the years. Etc

Gentoo bans AI-created contributions

Posted Apr 20, 2024 15:59 UTC (Sat) by LtWorf (subscriber, #124958) [Link] (7 responses)

Well if you learn the whole divina commedia and then write it down you won't become the author :)

Humans extrapolate in a way that machines cannot. So the comparison doesn't hold.

A human can write functioning code in whatever programming language after reading the manual from that language. A text generator needs terabytes worth of examples before it can start producing something that is approximatively correct.

I don't think comparing a brain with a server farm makes sense.

Gentoo bans AI-created contributions

Posted Apr 22, 2024 9:25 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

Humans extrapolate in a way that machines cannot.

This isn't, as far as I can tell, true. The problem with AI is not that it can't extrapolate, it's that the only thing it can do is extrapolate. A human can extrapolate, but we can also switch to inductive reasoning, deduction, cause-and-effect, and most importantly a human is able to combine multiple forms of reasoning to get results quickly and efficiently.

Note that a human writer has also had terabytes worth of language as examples before they start producing things that are correct - we spend years in "childhood" where we're learning from examples. Dismissing AI for needing a huge amount of examples, when humans need literal years between birth and writing something approximately correct is not advancing the conversation any.

Gentoo bans AI-created contributions

Posted Apr 22, 2024 14:44 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (5 responses)

Note that a human writer has also had terabytes worth of language as examples before they start producing things that are correct

It's not terabytes, though. A really fast reader might be able to read a kilobyte per minute. If they read at that speed for 16 hours a day, they might be able to manage a megabyte per day. That would mean a gigabyte every 3 years of solid, fast reading doing nothing else every day. So a truly dedicated reader could manage at most a few tens of GB over a lifetime. Most people probably manage at most a few GB. Speaking isn't a whole lot faster. That means most humans are able to learn their native languages using orders of magnitude fewer examples than LLMs are.

To me, this is a sign the LLM stuff, at least the way we're doing it, is probably a side track. It's a neat way to get something that produces competent text, and because it has been trained on a huge range of texts it will be able to interact in just about any area. But it's a very inefficient way of learning language compared to the way humans do it. If we want something more like AGI, we need to think more about the way humans learn and try to teach our AI that way, rather than just throwing more texts at the problem.

Gentoo bans AI-created contributions

Posted Apr 22, 2024 14:50 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

That carries with it the assumption that text is a complete representation of what people use to learn to speak and listen, before they move onto reading and writing. It also assumes that we have no pre-prepared pathways to assist with language acquisition.

Once you add in full-fidelity video and audio at the quality that a child can see and hear, you get to terabytes of data input before a human can read. Now, there's a good chance that a lot of that is unnecessary, but you've not shown that - merely asserted that it's false.

Gentoo bans AI-created contributions

Posted Apr 22, 2024 21:34 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

That carries with it the assumption that text is a complete representation of what people use to learn to speak and listen, before they move onto reading and writing.

To the contrary, I think the different learning environment is part of what we need to reproduce if we want more human-like AI. A huge problem with LLM is that they are largely fed a single kind of input. It's no wonder chatbots have problems interacting with the world; they've read about it but never dealt with it firsthand. If we want an AI that can deal with the world as we do, it needs a full set of senses and probably a body so it can do something more than chat or paint.

Gentoo bans AI-created contributions

Posted Apr 23, 2024 9:26 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

Right, but you were claiming that because the input to a child can be summarised in a small amount of text, the child's neural network is clearly learning from that small amount of data, and not from the extra signals carried in spoken work and in body language as well.

This is what makes the "training data is so big" argument unreasonable; it involves a lot of assumptions about the training data needed to make a human capable of what we do, and then says "if my assumptions are correct, AI is data-inefficient", without justifying the assumptions.

Personally, I think the next big step we need to take is to get Machine Learning to a point where training and inference happen at the same time; right now, there's a separation between training (teaching the computer) and inference (using the trained model), such that no learning can take place during inference, and no useful output can be extracted during training. And that's not the way any natural intelligence (from something very stupid like a chicken, to something very clever like a Nobel Prize winner) works; we naturally train our neural networks as we use them to make inferences, and don't have this particular mode switch.

Gentoo bans AI-created contributions

Posted Apr 23, 2024 10:51 UTC (Tue) by Wol (subscriber, #4433) [Link]

Plus, it's well known that our neural networks have dedicated sub-systems for, eg recognising faces, that can be rapidly trained.

For example, baby learns what mum sounds like in the womb, and that is re-inforced by mum hugging new-born. My grand-daughter was prem, and while there don't appear to be any lasting effects, it's well known that separating mother and child at birth has very noticeable impacts in the short term. Not all of them repairable ...

We're spending far too much effort throwing brute force at these problems without trying to understand what's actually going on. I'm amazed at how much has been forgotten about how capable the systems of the 70's and 80's were - the prolog "AI Doctor" running on a Tandy or Pet that could out-perform a GP in diagnosis skills. The robot crab that could play in the surf-zone powered by a 6502. I'm sure there are plenty more examples, where our super-duper AI "more power than sent a man to the moon" would find it impossible to compete with that ancient tech ...

Modern man thinks he's so clever, because he's lost touch with the achievements of the past ...

Cheers,
Wol

Gentoo bans AI-created contributions

Posted Apr 23, 2024 15:20 UTC (Tue) by Wol (subscriber, #4433) [Link]

> we naturally train our neural networks as we use them to make inferences, and don't have this particular mode switch.

We also don't feed back to our AIs "this is wrong, this is right". So it's free to spout garbage (hallucinate) with no way of correcting it.

Cheers,
Wol

Gentoo bans AI-created contributions

Posted Apr 19, 2024 17:09 UTC (Fri) by ballombe (subscriber, #9523) [Link] (5 responses)

> Well, if you're processing 1TB of data into a 1GB model, it's very questionable whether you can really consider it a derived work any more.

But this is not uniform compression. The most relevant parts are kept mostly verbatim, the least relevant part are ignored. The AI trick is that there is no easy way to find which parts are kept verbatim, and there is no easy way to find the source.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 17:28 UTC (Fri) by mb (subscriber, #50428) [Link] (4 responses)

> there is no easy way to find the source.

This is not a technical problem at all.
It's not relevant how you compressed the data to be a derived work.
And there is no single right or wrong answer to whether something is a derived works. Always been like that.

But. Machine learning breaks Copyright in a fundamental way in that it is very similar to human learning so that one can use human learning reasoning, but at the same time it is fast and cheap.

While you need to put significant effort into your work when human-learning from others and creating new non-derived work, with MI this is just a click of a button.
This is where Copyright breaks. It's hard for a human to create new non-derived work.
But it's cheap for machine learning to do the same thing.

A human "filter" processing ("learning") work set "A" into non-derived work "B" is expensive. So it's almost never done just for copying and Copyright erasing.
A machine learning filter, however, is cheap and is easy to erase Copyright in that way.

Copyright is fundamentally broken. It's not a technical problem.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 20:17 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)

> This is where Copyright breaks. It's hard for a human to create new non-derived work.

This is absurd. The value of a copyrighted work is not dependant on the amount of effort that went into it.

If anything LLMs are a great equaliser. It used to be that to be a great writer you needed to have a great idea for a story and the skill to execute it. Now people with a great idea but not quite as good writing skills get a chance they might not otherwise.

Copyright protects the economic value and moral value of a work. The fact that other people can now also create new works easier does not reduce the value of copyright at all (or break it). Copyright does not protect all uses of your work, only those that reduce the economic value of the original.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 20:49 UTC (Fri) by mb (subscriber, #50428) [Link]

>The value of a copyrighted work is not dependant on the amount of effort that went into it.

I did never claim that.

Gentoo bans AI-created contributions

Posted Apr 21, 2024 22:37 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> And there is no single right or wrong answer to whether something is a derived works. Always been like that.

This is prosaically true in the sense that it's ultimately a judgment call on the part of the trier of fact, but as a matter of law, there absolutely is an answer, and the law generally expects you to know it. You can wave your hands about copyright "breaking" all you want, but the legal system is not going to be impressed.

You are correct, however, that AI does put the legal system in a bit of an awkward spot. Up until now, derivative works have been decided by the trier of fact (judge or jury) looking at the original and the allegedly infringing work side by side, and seeing if they're close enough that copying can be inferred. The legal term used in the US is "substantially similar" (or "strikingly similar"), but most countries are going to use a similar method in their courts.

That works fine when you have one original. When you have two billion originals, and an unbounded set of potentially infringing works, it's a bit impractical. Right now, the unspoken expectation is that the plaintiff has to do the leg work of figuring out which images to put side by side in this comparison (US courts would say the plaintiff is "master of their complaint" and thus responsible for deciding exactly what is and is not in scope). That's not easy in the case of AI, and it's the main reason (or at least, a major reason) that artists have struggled to sue image generators successfully.

But that does not imply that an image generator "erases" copyright as you have put it. If an artist is able to find a specific infringing output that closely resembles their art, and their art was used as a training input, then the artist might have a case. Saying "the AI breaks copyright" is not going to be an effective defense.

Gentoo bans AI-created contributions

Posted Apr 21, 2024 22:39 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

(Also, everything I said about image generators applies to LLMs, except for the "struggled to sue" part. The New York Times did sue OpenAI and alleged very specific, word-for-word copying of whole paragraphs of text.)

Gentoo bans AI-created contributions

Posted Apr 18, 2024 22:22 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> Honestly, this feels like a rerun of the "you can't use a spell/grammer checker on your school assignment because that's cheating".

My feeling in all of this is IFF you use an AI to help you write a valid report (of whatever sort) that's fine. The AI is the *assistant*. If, however, the AI is the *author* then you don't want to go near it with a barge pole.

In other words, if there is a *human* involved, who has sanity checked it for hallucinations, accuracy, what-have-you, then that's fine. If the human sending it can't be bothered, then why should the human receiving it bother, either? And if it's the AI bot that's sending it, then you REALLY don't want to know!

Cheers,
Wol

Gentoo bans AI-created contributions

Posted Apr 20, 2024 11:20 UTC (Sat) by Baughn (subscriber, #124425) [Link]

That would make sense, but it isn’t what the policy states?

Gentoo bans AI-created contributions

Posted Apr 19, 2024 8:03 UTC (Fri) by atnot (subscriber, #124910) [Link] (5 responses)

> Right now people are using prompts in chatbots, but in a few years it will be seamlessly integrated into all sorts of products. It's only going to get faster and cheaper as time goes on.

There's no real reason to believe this will happen. For one, the big inherent problem of this type of system, just making shit up, really makes it unsuitable to use for anything but optional autocomplete, especially compared to the more specialized models that already exist. Secondly, while the AI boosters keep talking about exponential improvements, that hasn't actually happened. GPT-4 was released a year ago and the best openai can do is a few percentage point improvement on some benchmarks. Which is not what you'd expect in a field where improvements are supposedly so low hanging that people predict a doubling of capability (unspecified) every 18 months. Hardware has been looking a bit better with the usual dozen percentage points perf/w we've come to expect every 2-3 years, but it's hardly revolutionary either.

It also remains to be seen how much development effort will remain with this technology when investors realize it's been severely overhyped in two or three quarters or however long it'll take them.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 10:44 UTC (Fri) by snajpa (subscriber, #73467) [Link] (3 responses)

well, the LLMs available *today*, from what I’ve tried, are pretty close for me to be able to replace Grek KH and his stable picks, while I haven’t even tried to fine tune the models, all zero-shot classification... It’s not that far, people will absolutely use it for way more than an autocomplete.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 10:54 UTC (Fri) by snajpa (subscriber, #73467) [Link] (1 responses)

Gre*g* K. H., my apologies - I have nothing but respect for the guy for the record, it’s just that it seemed like a good case to test out the LLMs abilities, got the idea from reading the discussions here under the “all users must upgrade” posts ;)) I can highly recommend, if you have some spare time, play around with these things. It really can be a huge productivity booster, that’s what it can be used for _today_. I wonder what world we’re going to wake up to tomorrow… of course we’ll inflate it into an enormous bubble, but so far IMHO we aren’t there, with our collective expectations, we (“the markets”) are still pretty close to what the tech can actually do.

Gentoo bans AI-created contributions

Posted Apr 23, 2024 21:02 UTC (Tue) by flussence (guest, #85566) [Link]

> we (“the markets”)

Quite a revealing slip.

Gentoo bans AI-created contributions

Posted Apr 19, 2024 13:45 UTC (Fri) by LtWorf (subscriber, #124958) [Link]

This is a very extraordinary claim that is going to require a peer reviewed paper or some other proof to substantiate it.

Gentoo bans AI-created contributions

Posted Apr 20, 2024 15:37 UTC (Sat) by Paf (subscriber, #91811) [Link]

You really really don’t know what’s possible or isn’t. The recent advancements were enormous and came suddenly, remember.