Gentoo bans AI-created contributions
Posted Apr 18, 2024 20:50 UTC (Thu) by kleptog (subscriber, #1183)
Parent article: Gentoo bans AI-created contributions
Posted Apr 18, 2024 21:04 UTC (Thu)
by mb (subscriber, #50428)
[Link] (20 responses)
Right. That won't resolve the open questions, though.
Just processing copyrighted material through some sort of "AI" filter should not make the Copyright go away.
Or alternatively, any program processing any data shall be allowed to remove Copyright.
Cannot choose both.
Posted Apr 18, 2024 21:14 UTC (Thu)
by snajpa (subscriber, #73467)
[Link]
I have a feeling that trend is going to accelerate. Open questions kinda rendered "obsolete" by even more pressing new open questions :-D
Posted Apr 19, 2024 9:55 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (18 responses)
Well, if you're processing 1TB of data into a 1GB model, it's very questionable whether you can really consider it a derived work any more.
As a human I have consumed enormous amounts of copyrighted data, and the responsibility to respect copyright lies with me, not the tools I use. Similarly, the responsibility for respecting copyright lies with the person using the LLM. An LLM is not going to produce something that resembles an existing copyrighted work without explicit prompting. I find it hard to believe it's going to happen by accident.
> Or alternatively, any program processing any data shall be allowed to remove Copyright.
Or, the person using the program is responsible for complying with any relevant laws.
(I'm getting strong "colour of your bits" vibes. The tools you use are not relevant to the discussion of copyright.)
Posted Apr 19, 2024 13:41 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link] (11 responses)
So you're saying that if I rip a music CD that is ~700MiB of data, but then use lossy compression and make it into 50MiB of data, I'm actually allowed to do that?
Posted Apr 19, 2024 14:36 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (10 responses)
14:1 compression like that is well within the expected bounds of today's psychoacoustically lossless techniques. 1000:1 is not, so the argument is that if you rip a music CD and get ~700 MiB PCM data, and compress that down to 700 KiB, the result of decompressing it back to a human-listenable form is going to be so radically different to the original that this use is transformative, not derivative.
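For reference, the arithmetic behind those two ratios (using the sizes quoted in this thread, so this is only a rough sketch, not exact CD figures):

    # Compression ratios for the sizes quoted above.
    cd_rip_mib = 700                      # ~700 MiB of PCM from a ripped CD
    lossy_mib = 50                        # the 50 MiB lossy encode mentioned earlier
    tiny_kib = 700                        # the hypothetical 700 KiB version
    print(cd_rip_mib / lossy_mib)         # 14.0   -> ~14:1, within lossy-audio norms
    print(cd_rip_mib * 1024 / tiny_kib)   # 1024.0 -> ~1000:1, far beyond that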
Posted Apr 19, 2024 16:17 UTC (Fri)
by samlh (subscriber, #56788)
[Link] (9 responses)
The same argument may reasonably apply to LLMs, given how much verbatim input can be extracted from them in practice.
Posted Apr 20, 2024 15:36 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (8 responses)
So uh what about the other stuff I create? I know what good data visualization looks like because I have read many data-viz articles over the years. Etc.
Posted Apr 20, 2024 15:59 UTC (Sat)
by LtWorf (subscriber, #124958)
[Link] (7 responses)
Humans extrapolate in a way that machines cannot. So the comparison doesn't hold.
A human can write functioning code in whatever programming language after reading that language's manual. A text generator needs terabytes' worth of examples before it can start producing something that is approximately correct.
I don't think comparing a brain with a server farm makes sense.
Posted Apr 22, 2024 9:25 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (6 responses)
> Humans extrapolate in a way that machines cannot.
This isn't, as far as I can tell, true. The problem with AI is not that it can't extrapolate; it's that the only thing it can do is extrapolate. A human can extrapolate, but we can also switch to inductive reasoning, deduction, and cause-and-effect, and most importantly a human is able to combine multiple forms of reasoning to get results quickly and efficiently.
Note that a human writer has also had terabytes' worth of language as examples before they start producing things that are correct - we spend years in "childhood" where we're learning from examples. Dismissing AI for needing a huge number of examples, when humans need literal years between birth and writing something approximately correct, is not advancing the conversation any.
Posted Apr 22, 2024 14:44 UTC (Mon)
by rgmoore (✭ supporter ✭, #75)
[Link] (5 responses)
> Note that a human writer has also had terabytes' worth of language as examples before they start producing things that are correct
It's not terabytes, though. A really fast reader might be able to read a kilobyte per minute. If they read at that speed for 16 hours a day, they might manage a megabyte per day. That would mean a gigabyte every 3 years of solid, fast reading, doing nothing else every day. So a truly dedicated reader could manage at most a few tens of GB over a lifetime; most people probably manage at most a few GB. Speaking isn't a whole lot faster. That means most humans are able to learn their native languages from orders of magnitude fewer examples than LLMs need.
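For what it's worth, the arithmetic does work out, using the reading rate assumed above:

    # Back-of-the-envelope reading volume, using the rate assumed above.
    kib_per_minute = 1                                   # very fast reader
    kib_per_day = kib_per_minute * 60 * 16               # 16 h/day -> ~960 KiB, i.e. ~1 MiB/day
    gib_per_3_years = kib_per_day * 365 * 3 / 1024**2    # ~1.0 GiB every 3 years
    gib_per_60_years = kib_per_day * 365 * 60 / 1024**2  # ~20 GiB over a reading lifetime
    print(gib_per_3_years, gib_per_60_years)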
To me, this is a sign the LLM stuff, at least the way we're doing it, is probably a side track. It's a neat way to get something that produces competent text, and because it has been trained on a huge range of texts it will be able to interact in just about any area. But it's a very inefficient way of learning language compared to the way humans do it. If we want something more like AGI, we need to think more about the way humans learn and try to teach our AI that way, rather than just throwing more texts at the problem.
Posted Apr 22, 2024 14:50 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (4 responses)
That carries with it the assumption that text is a complete representation of what people use to learn to speak and listen, before they move onto reading and writing. It also assumes that we have no pre-prepared pathways to assist with language acquisition.
Once you add in full-fidelity video and audio at the quality that a child can see and hear, you get to terabytes of data input before a human can read. Now, there's a good chance that a lot of that is unnecessary, but you've not shown that - merely asserted that it's false.
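As a very rough illustration of how that gets to terabytes (the input rate here is purely an assumed figure for illustration, not a number from this thread; real estimates of human sensory bandwidth vary widely):

    # Illustrative sensory-input estimate for the first ~5 years, before reading.
    megabits_per_second = 10                   # assumed combined audio + video input rate
    seconds_awake = 12 * 3600 * 365 * 5        # ~12 waking hours/day for 5 years
    terabytes = megabits_per_second / 8 * seconds_awake / 1e6
    print(terabytes)                           # ~100 TB under these assumptions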
Posted Apr 22, 2024 21:34 UTC (Mon)
by rgmoore (✭ supporter ✭, #75)
[Link] (3 responses)
Posted Apr 23, 2024 9:26 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (2 responses)
Right, but you were claiming that because the input to a child can be summarised in a small amount of text, the child's neural network is clearly learning from that small amount of data, and not from the extra signals carried in the spoken word and in body language as well.
This is what makes the "training data is so big" argument unreasonable; it involves a lot of assumptions about the training data needed to make a human capable of what we do, and then says "if my assumptions are correct, AI is data-inefficient", without justifying the assumptions.
Personally, I think the next big step we need to take is to get Machine Learning to a point where training and inference happen at the same time; right now, there's a separation between training (teaching the computer) and inference (using the trained model), such that no learning can take place during inference, and no useful output can be extracted during training. And that's not the way any natural intelligence (from something very stupid like a chicken, to something very clever like a Nobel Prize winner) works; we naturally train our neural networks as we use them to make inferences, and don't have this particular mode switch.
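To make that mode switch concrete, here is a minimal sketch (the tiny linear model and random data are placeholders, just to illustrate the separation being described, not any particular system):

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Training phase: weights are updated, but no useful output is consumed.
    model.train()
    for x, y in [(torch.randn(4), torch.randn(1))]:
        loss = (model(x) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Inference phase: outputs are consumed, but nothing is learned.
    model.eval()
    with torch.no_grad():
        prediction = model(torch.randn(4))

    # A natural intelligence, by contrast, would nudge its "weights" a little
    # on every inference, with no separate modes at all.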
Posted Apr 23, 2024 10:51 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
For example, baby learns what mum sounds like in the womb, and that is reinforced by mum hugging the new-born. My grand-daughter was premature, and while there don't appear to be any lasting effects, it's well known that separating mother and child at birth has very noticeable impacts in the short term. Not all of them repairable ...
We're spending far too much effort throwing brute force at these problems without trying to understand what's actually going on. I'm amazed at how much has been forgotten about how capable the systems of the '70s and '80s were - the Prolog "AI doctor" running on a Tandy or Pet that could out-perform a GP in diagnostic skills, or the robot crab that could play in the surf zone, powered by a 6502. I'm sure there are plenty more examples where our super-duper AI, with "more power than sent a man to the moon", would find it impossible to compete with that ancient tech ...
Modern man thinks he's so clever, because he's lost touch with the achievements of the past ...
Cheers,
Wol
Posted Apr 23, 2024 15:20 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
We also don't feed back to our AIs "this is wrong, this is right". So they're free to spout garbage (hallucinate) with no way of correcting it.
Cheers,
Wol
Posted Apr 19, 2024 17:09 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (5 responses)
But this is not uniform compression. The most relevant parts are kept mostly verbatim, and the least relevant parts are ignored. The AI trick is that there is no easy way to find which parts are kept verbatim, and no easy way to find the source.
Posted Apr 19, 2024 17:28 UTC (Fri)
by mb (subscriber, #50428)
[Link] (4 responses)
This is not a technical problem at all.
It's not relevant how you compressed the data to be a derived work. And there is no single right or wrong answer to whether something is a derived work; it has always been like that.
But machine learning breaks Copyright in a fundamental way: it is very similar to human learning, so one can apply human-learning reasoning to it, yet at the same time it is fast and cheap. It's hard for a human to create new non-derived work, but it's cheap for machine learning to do the same thing. While you need to put significant effort into your work when human-learning from others and creating new non-derived work, with ML this is just a click of a button.
A human "filter" processing ("learning") work set "A" into non-derived work "B" is expensive, so it's almost never done just for copying and erasing Copyright. A machine learning filter, however, is cheap, and it is easy to erase Copyright in that way.
This is where Copyright fundamentally breaks. It's not a technical problem.
Posted Apr 19, 2024 20:17 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (1 responses)
This is absurd. The value of a copyrighted work is not dependent on the amount of effort that went into it.
If anything, LLMs are a great equaliser. It used to be that to be a great writer you needed both a great idea for a story and the skill to execute it. Now people with a great idea but weaker writing skills get a chance they might not otherwise have.
Copyright protects the economic value and moral value of a work. The fact that other people can now also create new works easier does not reduce the value of copyright at all (or break it). Copyright does not protect all uses of your work, only those that reduce the economic value of the original.
Posted Apr 19, 2024 20:49 UTC (Fri)
by mb (subscriber, #50428)
[Link]
I never claimed that.
Posted Apr 21, 2024 22:37 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
This is prosaically true in the sense that it's ultimately a judgment call on the part of the trier of fact, but as a matter of law, there absolutely is an answer, and the law generally expects you to know it. You can wave your hands about copyright "breaking" all you want, but the legal system is not going to be impressed.
You are correct, however, that AI does put the legal system in a bit of an awkward spot. Up until now, derivative works have been decided by the trier of fact (judge or jury) looking at the original and the allegedly infringing work side by side, and seeing if they're close enough that copying can be inferred. The legal term used in the US is "substantially similar" (or "strikingly similar"), but most countries are going to use a similar method in their courts.
That works fine when you have one original. When you have two billion originals, and an unbounded set of potentially infringing works, it's a bit impractical. Right now, the unspoken expectation is that the plaintiff has to do the leg work of figuring out which images to put side by side in this comparison (US courts would say the plaintiff is "master of their complaint" and thus responsible for deciding exactly what is and is not in scope). That's not easy in the case of AI, and it's the main reason (or at least, a major reason) that artists have struggled to sue image generators successfully.
But that does not imply that an image generator "erases" copyright as you have put it. If an artist is able to find a specific infringing output that closely resembles their art, and their art was used as a training input, then the artist might have a case. Saying "the AI breaks copyright" is not going to be an effective defense.
Posted Apr 21, 2024 22:39 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Posted Apr 18, 2024 22:22 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
My feeling in all of this is that IFF you use an AI to help you write a valid report (of whatever sort), that's fine. The AI is the *assistant*. If, however, the AI is the *author*, then you don't want to go near it with a barge pole.
In other words, if there is a *human* involved, who has sanity checked it for hallucinations, accuracy, what-have-you, then that's fine. If the human sending it can't be bothered, then why should the human receiving it bother, either? And if it's the AI bot that's sending it, then you REALLY don't want to know!
Cheers,
Wol
Posted Apr 20, 2024 11:20 UTC (Sat)
by Baughn (subscriber, #124425)
[Link]
Posted Apr 19, 2024 8:03 UTC (Fri)
by atnot (subscriber, #124910)
[Link] (5 responses)
There's no real reason to believe this will happen. For one, the big inherent problem of this type of system, just making shit up, really makes it unsuitable for anything but optional autocomplete, especially compared to the more specialized models that already exist. Secondly, while the AI boosters keep talking about exponential improvements, that hasn't actually happened. GPT-4 was released a year ago and the best OpenAI can do is a few percentage points of improvement on some benchmarks. That is not what you'd expect in a field where improvements are supposedly such low-hanging fruit that people predict a doubling of capability (unspecified) every 18 months. Hardware has been looking a bit better, with the usual dozen percentage points of perf/W we've come to expect every 2-3 years, but it's hardly revolutionary either.
It also remains to be seen how much development effort will remain with this technology once investors realize it has been severely overhyped, in two or three quarters or however long that takes them.
Posted Apr 19, 2024 10:44 UTC (Fri)
by snajpa (subscriber, #73467)
[Link] (3 responses)
Posted Apr 19, 2024 10:54 UTC (Fri)
by snajpa (subscriber, #73467)
[Link] (1 responses)
Posted Apr 23, 2024 21:02 UTC (Tue)
by flussence (guest, #85566)
[Link]
Quite a revealing slip.
Posted Apr 19, 2024 13:45 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link]
Posted Apr 20, 2024 15:37 UTC (Sat)
by Paf (subscriber, #91811)
[Link]