LWN: Comments on "Democratizing AI with open-source language models" https://lwn.net/Articles/931853/ This is a special feed containing comments posted to the individual LWN article titled "Democratizing AI with open-source language models". en-us Tue, 30 Sep 2025 09:36:54 +0000 Tue, 30 Sep 2025 09:36:54 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Democratizing AI with open-source language models https://lwn.net/Articles/937925/ https://lwn.net/Articles/937925/ mpr22 <div class="FormattedComment"> <span class="QuotedText">&gt; What probably makes humans unique is we have a voice-box, with which we can make a massive range of sounds.</span><br> <p> It's all the bits and pieces above the voice box that allow human speech to feature such a bewildering array of sounds, from open back unrounded vowels to bilabial clicks by way of velar ejectives, palatal approximants, voiced alveolar lateral affricates, and dental plosives.<br> </div> Tue, 11 Jul 2023 18:42:46 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937862/ https://lwn.net/Articles/937862/ kleptog <div class="FormattedComment"> Agreed, even OpenAI agrees LLMs aren't going to get better just by throwing more data at them. Further advancement will require linking them with other things.<br> <p> However, I would argue that LLMs understand grammar just as well as most people: they can produce grammatically correct sentences nearly 100% of the time, but would not be able to describe the difference between an adjective and an adverb, or explain why it's a "big brown dog" and not a "brown big dog".<br> <p> ISTM though, that LLMs do solve one of the biggest stumbling blocks for general AI. Before now, getting a computer to understand complex sentences and context and resolve some of the inherent ambiguity in language was really hard. Now we're much, much closer. 
That said, what exactly the new "hard problem" is won't be clear for a few years and in the meantime, open-source models give a whole generation of students a level playing field to start from.<br> <p> I think what's mostly needed is linking an LLM with some kind of reliable memory so it doesn't just make stuff up. And some kind of super-ego that monitors its sub-modules for odd behaviour and corrects them. The "shit, what I'm about to say will have serious long-term consequences, close mouth now" feature.<br> </div> Tue, 11 Jul 2023 11:38:23 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937848/ https://lwn.net/Articles/937848/ smurf <div class="FormattedComment"> That's not a counter-argument. The key word is "grammar" and there seems to be no indication any animal can learn that.<br> <p> Without grammar, you can't talk about talking. You can't even talk about interacting with the world: you don't teach your child "if you see a fire, drop everything, shout 'Fire!' and run to the exit" by demonstration, do you? You simply tell them, or you show them a picture/movie. The AI doesn't and indeed cannot distinguish between the picture and the world, as multiple prompt jailbreaks have demonstrated.<br> <p> Input to the LLMs contains a lot of examples of how to spell, yet they didn't learn the concept of "spelling". And that's the lowest-level example. You don't learn that from one training run, no matter how large the model is. You learn that by evolution, i.e. you iterate on the *structure* of the model, not just its parameters, just like evolution did. (And, like evolution, you need to run a million iterations across a million samples each, in order to get anywhere.)<br> <p> That, or somebody designs a neural network architecture, or (I suspect) several interconnected ones, that might be able to understand grammar, from first principles. 
Good luck; we still have no idea what something like that should look like.<br> <p> Meanwhile, the current crop of models is a local optimum. You can play around with them however you like, but they are unlikely to ever extract understanding of sentence structure from a billion example sentences, or of spatial relationships from a billion example 2D pictures. It's all just random bits. In fact that's how AI image generation works – you transform a heap of random bits into something not quite as random and repeat the process until you get a nice but meaningless-to-the-AI image.<br> </div> Tue, 11 Jul 2023 04:51:30 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937834/ https://lwn.net/Articles/937834/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; There is thus no guarantee that non-human non-children can do whatever it is humans do, and there is even less guarantee that if they *do* produce something that looks like language, that they are using the same mechanisms to learn or generate it that people do.</span><br> <p> Chances are they do use the same mechanisms, given that gene commonality is pretty high across species ... Birds and mammals all communicate with sound, so we all have mechanisms for making sounds, and for identifying them. What probably makes humans unique is we have a voice-box, with which we can make a massive range of sounds.<br> <p> Also there's no guarantee that the ability to make assorted sounds evolves together with the ability to recognise and associate meaning to those sounds. Although there is clear evidence that that has happened in other species. Two examples I'm aware of - whales can communicate over ocean distances - when we don't make noises that interfere whales are known to send messages across the Atlantic. And starlings - one of the excellent mimics of the bird world - are known to associate certain sounds with human behaviour! 
Go back 40 years and the trimphone was very popular in the UK. There are thousands of recorded instances of starlings mimicking said phone - clearly KNOWING that the human in the garden will go running back into the house. They were obviously doing it for the amusement (to them) factor. Who's to say that won't evolve into something we would call language (if they haven't got it already)?<br> <p> Cheers,<br> Wol<br> </div> Mon, 10 Jul 2023 21:03:28 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937809/ https://lwn.net/Articles/937809/ nix <blockquote> Besides mirror neurons we humans have a heap of more-or-less-special-purpose structures in our brains. </blockquote> And whether or not you think humans have special-purpose structures in our brains for any purpose (yes, it should be dead obvious to anyone that we do, but some people particularly in machine learning persist in claiming that there is none and that our brains are undifferentiated in any way that matters: so let's take that as read), it is clear that human children are better at learning language than other organisms with brains that grow up around a lot of it (say, domestic dogs): even if a few of them can pick up a few words, nothing else has managed to pick up grammar: the most you get is disordered word salad, the like of which you only see from human first-language learners if brain-damaged. So there is *something* different about what human children are or do that makes them especially good at learning language compared to every other animal on Earth -- and one thing that is definitely true is that all human languages are created almost entirely by generations of children and are thus optimized to be learned by children. 
<p> There is thus no guarantee that non-human non-children can do whatever it is humans do, and there is even less guarantee that if they *do* produce something that looks like language, that they are using the same mechanisms to learn or generate it that people do. I would venture to suggest that the presence of "glitch tokens" and their utterly bizarre behaviour makes it quite clear that LLMs don't know what language is, don't know what words are, and don't even know what it means to spell something or repeat it back to you, so their actual level of knowledge is lower than a two-year-old human child's, even if they're very good at faking it. No human child who knew what spelling was would reply to 'spell " petertodd" back to me' with 'L-E-I-L-A-N'. GPT-4 doesn't even realise it's done anything wrong, because it can't realise anything at all. Mon, 10 Jul 2023 15:22:06 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937440/ https://lwn.net/Articles/937440/ smurf <div class="FormattedComment"> Besides mirror neurons we humans have a heap of more-or-less-special-purpose structures in our brains. For instance there's one in our vision system that routes human faces to one place in the brain and not-faces to someplace else. If that's damaged you can't recognize faces, period.<br> <p> The same thing seems to apply to true/false, reality/fiction, fair/unfair, and a number of other concepts which babies understand even before they properly learn language. The same thing applies to some mammals, and (with variations) even some non-mammals.<br> <p> My point is: if you want to teach any of that to an AI, you need to have parts that recognize those special-purpose concepts in its neural structure and THEN train it to apply the language model TO that structure.<br> <p> One useful analog here can be seen in last year's AI day video from Tesla, where they reported how their driving model works. 
They added specific structures for the AI to "remember" things like occluded cars or people because, surprise, just throwing a bunch of labelled images at it doesn't cause it to come up with that mechanism on its own.<br> <p> Evolution did (and we play peek-a-boo with our kids so that they link that evolutionary structure to the real world), but it has its own way of remembering and optimizing basic neuronal network architecture – DNA. Training some random GPT model or two, or a thousand, on random data from the 'net is unlikely to replicate that even if it's labelled appropriately, which it currently mostly isn't.<br> </div> Thu, 06 Jul 2023 05:28:19 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937428/ https://lwn.net/Articles/937428/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; Another problem is that though the models clearly know what text is statistically likely to follow other text, that doesn't mean they understand anything about grammar.</span><br> <p> This is not really a problem (yet). Humans learning their first language learn to use it correctly without knowing anything about grammar. The grammar of a language is after all defined by "how people actually use the words" and not some list of rules someone drew up somewhere. Ergo, an LLM is learning language exactly the same way humans do.<br> <p> Of course grammar evolves over time as people use it and the risk is that if LLMs start talking to each other a lot they might start evolving a grammar that is distinct from what most people use. While linguistically interesting, this would not be useful for the purpose of interacting with most people.<br> <p> The main problem is that LLMs have no real world references, so things like first, tallest, north, left, scale, distance, weight, etc. are all a mystery to them. Leading to weird conversations where one can correctly give you the altitudes of two cities, but then get wrong which is higher. 
At some point though someone is going to successfully couple an LLM with an analytic engine that does understand these things.<br> <p> Though it does touch on a question I asked my AI course lecturer years ago: is it possible for an AI to truly understand human language if it doesn't live for years in a corporeal body interacting with the real world? How can it otherwise understand what heavy, light, bright, dark, pain, hunger or fear are? And related: would humans still be as empathic if we didn't have mirror neurons?<br> </div> Wed, 05 Jul 2023 21:47:37 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937422/ https://lwn.net/Articles/937422/ nix <div class="FormattedComment"> See also the sort-of-sequel _Driver_, in _Valuable Humans in Transit and Other Stories_. Why is that the title? That would be telling.<br> </div> Wed, 05 Jul 2023 20:46:28 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/937421/ https://lwn.net/Articles/937421/ nix <blockquote>I suspect there are a lot of causes for that and some of them are inherent in the structure of the model, but it seems likely that part of the problem is that shoveling everything on the Internet, true or false, into the model produces a model that is not weighted for truth over falsity.</blockquote> Another problem is that though the models clearly know what text is statistically likely to follow other text, that doesn't mean they understand anything about grammar. In particular it is trivial to construct examples showing that even the largest current models don't know what the word "not" means. If they don't understand what it means to say something is the case as opposed to saying something is not the case -- if they are happy to consider each equally likely and the word "not" just as another token rather than as something with substantial semantic effect -- I don't see how they can possibly ever <i>not</i> bullshit. 
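The point above about "not" being just another token can be made concrete with a deliberately tiny sketch: a toy bigram counter over a handful of sentences, nowhere near a real LLM, but trained on the same kind of signal. The counts record how often "not" follows "is"; nothing anywhere records that it inverts the meaning of whatever comes after it.

```python
# Toy illustration (a bigram counter, not any real LLM): "not" is recorded
# as just one more token that sometimes follows "is". Nothing in the
# statistics captures that it negates the meaning of the next token.
from collections import Counter, defaultdict

corpus = (
    "the sky is blue . the sky is not green . "
    "grass is green . grass is not blue ."
).split()

# Count bigram frequencies: an estimate of which token follows which.
bigrams = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    bigrams[cur][nxt] += 1

def continuations(token):
    """Most likely next tokens after `token`, by raw count."""
    return bigrams[token].most_common()

# "not" shows up simply as the most frequent continuation of "is", on a par
# with content words like "blue" and "green"; its negating force is invisible.
print(continuations("is"))  # [('not', 2), ('blue', 1), ('green', 1)]
```

A real transformer is vastly more capable than this sketch, but its training signal is of the same kind: token co-occurrence, with no separate channel for logical operators.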
Wed, 05 Jul 2023 20:42:19 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/934154/ https://lwn.net/Articles/934154/ JohnFA <div class="FormattedComment"> I agree with your prediction re the effect of LLMs on those motivated primarily by itch scratching. Seems like it should be mostly helpful there.<br> <p> I was thinking more of Open Source projects that come out of the commercial sector. Apache 2.0 seems to be a favourite there due to the patent license. Without copyright protection, there's no way to enforce a patent license clause. <br> <p> (n.b. I should have said avoiding copyright is the problem that LLMs bring, as I see it, as this affects not only copyleft but other contractual things such as patent licenses. So in other words there could be a problem not just with non-permissive licenses, but also with certain permissive ones such as Apache 2.0) <br> <p> </div> Wed, 07 Jun 2023 14:29:12 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/934137/ https://lwn.net/Articles/934137/ geert <div class="FormattedComment"> Patent it?<br> </div> Wed, 07 Jun 2023 10:31:56 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/934056/ https://lwn.net/Articles/934056/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride. </span><br> <p> Let me see if I'm understanding this correctly. 
Your argument is that if you write a program to scratch your own itch, you would be less likely to make it open-source because you think other people with the same itch could use an LLM to solve the problem instead?<br> <p> I don't see how that could be true; instead, I'm imagining that you'd scratch your itch, using an LLM to get there faster, and open-source it so other people can also use it. And they in turn use LLMs to make it even better, faster, and scratch even more itches. Seems to me LLMs and Open Source complement each other nicely, the combination beating each individually.<br> <p> Frankly, I think LLMs work better for open source because the number of people with itches that need scratching but currently cannot code is vastly larger than businesses with proprietary software can possibly support.<br> <p> (When they get good enough, that is; LLMs are good at making convincing-looking code snippets, but they're not there yet for larger projects.)<br> <p> It feels a bit like the argument: if we ever invent replicators then people will stop making art and that would be bad. Dude, we'd have *replicators*, the possibilities for art would explode.<br> <p> <span class="QuotedText">&gt; Through copyleft, copyright has been a huge net positive for Open Source (though perhaps not a net positive over all, but that is debatable, because Open Source is so important.)</span><br> <p> Copyleft has played its part, but there is a huge amount of non-copylefted open source code out there. I'm in the "only running code has value" camp: if it's not solving actual problems, it's just a bunch of oddly low-entropy bits. At my work the bean-counters assign value to the code I write, but in my opinion it is valueless to anyone outside the business. 
It has value only because of the organisation it is embedded in, which uses it to solve real-world problems.<br> <p> And it only works because of the mountains of open source code it uses.<br> <p> Also, just because an LLM may be colourless, the prompts that go into it are not. You can make it produce any output you want by asking it to repeat what you say. The output is definitely a derived work of the prompt. I don't see how the current situation changes much for copyright of source code. Only the expression was ever covered, not the ideas themselves. If we get to the situation where merely expressing an idea is enough to get a functional implementation, we'll have invented magic.<br> <p> </div> Tue, 06 Jun 2023 16:29:39 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/934055/ https://lwn.net/Articles/934055/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride.</span><br> <p> That's not the future, that's today.<br> <p> I can see your copyrighted work, have a lightbulb moment, and re-implement it much better than you did.<br> <p> The sad fact (from the position of copyright maximalists) is that having the idea is the hard part. 
With novels, that's fine, but with functional software as soon as someone sees what it can do, they can copy the functionality at far less cost than it took you to have the original idea, and there's NOTHING you can do about it.<br> <p> Cheers,<br> Wol<br> </div> Tue, 06 Jun 2023 15:36:14 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933980/ https://lwn.net/Articles/933980/ JohnFA <div class="FormattedComment"> Quite possibly you are correct, but if this is the case, I don't see why this is good for software openness.<br> <p> In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride. <br> <p> I do not think supporters of Open Source should be celebrating this. <br> <p> Through copyleft, copyright has been a huge net positive for Open Source (though perhaps not a net positive overall, but that is debatable, because Open Source is so important.)<br> <p> Without copyright, the situation for Source Code is in danger of degenerating into something like we have for data sets: <a rel="nofollow" href="https://lu.is/blog/2016/09/21/copyleft-attribution-and-data-other-considerations/">https://lu.is/blog/2016/09/21/copyleft-attribution-and-da...</a><br> </div> Tue, 06 Jun 2023 10:58:26 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933872/ https://lwn.net/Articles/933872/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; You're right in that the output's copyright status is not determined. 
However, that's a different problem which doesn't seem to have a good commonly-accepted answer yet – though IMHO if you manage to convince an AI model to output something that looks like art, you're the artist – and thus you get to hold the copyright.</span><br> <p> I'm not clear why this is controversial: if you convince Photoshop to produce something that looks like art, you're the artist and you get to hold the copyright (assuming it's original). It's a tool, you manipulate the tool to produce art, you're an artist. That it's an ML model is irrelevant.<br> <p> Now, if an ML model only produces minor variations of existing artwork, then it's not a useful model. (Or it might be the world's greatest compression algorithm.) If you manipulate Photoshop to produce an exact replica of a Mondriaan, then that's your problem, not the tool's.<br> <p> <span class="QuotedText">&gt; Back to the topic: The point here is that the output's status is expressly NOT determined by – and/or does not depend on – the copyright status of its input, or any part thereof.</span><br> <p> Well sure, but that doesn't mean the output is free of copyright. Colourless models can produce coloured output. There is no "copyright-erasing" going on because, as you say, the processing done does not determine the copyright status.<br> </div> Mon, 05 Jun 2023 14:26:13 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933862/ https://lwn.net/Articles/933862/ smurf <div class="FormattedComment"> You're right in that the output's copyright status is not determined. 
However, that's a different problem which doesn't seem to have a good commonly-accepted answer yet – though IMHO if you manage to convince an AI model to output something that looks like art, you're the artist – and thus you get to hold the copyright.<br> <p> Back to the topic: The point here is that the output's status is expressly NOT determined by – and/or does not depend on – the copyright status of its input, or any part thereof.<br> </div> Mon, 05 Jun 2023 13:31:01 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933846/ https://lwn.net/Articles/933846/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; false. Japan just declared, by fiat of law, that "learning isn't stealing", and thus anything reproduced by an AI isn't copyrighted – at least not by the owners of the copyrights of whatever the AI learned from.</span><br> <p> <span class="QuotedText">&gt; https://m-cacm.acm.org/news/273479-japan-goes-all-in-copyright-doesnt-apply-to-ai-training/fulltext</span><br> <p> The way I'm reading it is that it's similar to the EU position: training on public data isn't a priori copyright violation, just like browsing the web isn't violating anyone's copyright. It doesn't (AFAICT) say anything about the copyright status of the *output* of the model. Just like the law doesn't state that anything typed by a human is automatically free of copyright violations.<br> </div> Mon, 05 Jun 2023 10:07:43 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933834/ https://lwn.net/Articles/933834/ smurf <div class="FormattedComment"> <span class="QuotedText">&gt; The point is that the copyright status (ie colour) of something cannot be determined just by looking at the bits</span><br> <p> true<br> <p> <span class="QuotedText">&gt; so any processing you do cannot "erase" any copyright either.</span><br> <p> false. 
Japan just declared, by fiat of law, that "learning isn't stealing", and thus anything reproduced by an AI isn't copyrighted – at least not by the owners of the copyrights of whatever the AI learned from.<br> <p> https://m-cacm.acm.org/news/273479-japan-goes-all-in-copyright-doesnt-apply-to-ai-training/fulltext<br> </div> Mon, 05 Jun 2023 05:00:43 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933822/ https://lwn.net/Articles/933822/ mb <div class="FormattedComment"> <span class="QuotedText">&gt; so any processing you do cannot "erase" any copyright either.</span><br> <p> I do not think this is what the big players think.<br> Currently large language models are trained on Open Source code and they emit source code without a license transfer. That is what all those "programming AIs" do.<br> <p> <span class="QuotedText">&gt;It's like those people who claimed to make a copyright remover by running the bits through an obfuscation algorithm and then reversing it. That's just not how it works.</span><br> <p> I fully agree. But the big large language model developers apparently disagree.<br> </div> Sun, 04 Jun 2023 20:51:39 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933817/ https://lwn.net/Articles/933817/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; You put your colored bits into the model and they come out in a transformed way on the other end. If that processing erases copyright from those bits, then copyright law has essentially become useless.</span><br> <p> The point is that the copyright status (ie colour) of something cannot be determined just by looking at the bits, so any processing you do cannot "erase" any copyright either. 
Any argument ending with "so copyright law becomes useless" is bunk, because courts will simply make it not useless.<br> <p> It's like those people who claimed to make a copyright remover by running the bits through an obfuscation algorithm and then reversing it. That's just not how it works.<br> </div> Sun, 04 Jun 2023 20:27:34 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933784/ https://lwn.net/Articles/933784/ mb <div class="FormattedComment"> Yes, that's essentially what I said.<br> You put your colored bits into the model and they come out in a transformed way on the other end.<br> If that processing erases copyright from those bits, then copyright law has essentially become useless.<br> </div> Sun, 04 Jun 2023 15:51:12 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933777/ https://lwn.net/Articles/933777/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; The problem is the other way around: You can train on Open Source software source code and create equivalent proprietary software from it. That is where the question whether this process would erase copyright comes up.</span><br> <p> Copyright is about copying. Either the output looks substantially like the input, or it doesn't. Whether it was generated by an LLM or by a person typing isn't relevant. This is a "colour of your bits" thing.<br> </div> Sun, 04 Jun 2023 14:55:34 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933766/ https://lwn.net/Articles/933766/ mb <div class="FormattedComment"> You don't have the Windows source code as training material.<br> Just writing (or in this case generating) a compatible OS is not a derived work (as in copyright), of course. See ReactOS.<br> <p> The problem is the other way around: You can train on Open Source software source code and create equivalent proprietary software from it. 
That is where the question of whether this process would erase copyright comes up.<br> </div> Sun, 04 Jun 2023 09:54:01 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933764/ https://lwn.net/Articles/933764/ joib <div class="FormattedComment"> If the legal interpretation is going to be that LLM-produced content (code, prose, whatever) is not a derivative work of the training material, that would more or less make copyright irrelevant. Does it matter that Windows is proprietary if you can tell an LLM to "make me an OS with a flashy GUI and a win32 compatible ABI"?<br> </div> Sun, 04 Jun 2023 09:21:13 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933705/ https://lwn.net/Articles/933705/ JohnFA <div class="FormattedComment"> Can you elaborate on how AI might be beneficial to software freedom? <br> <p> Perhaps by making it possible for more people to usefully contribute to Open Source projects? <br> </div> Sat, 03 Jun 2023 16:39:44 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933493/ https://lwn.net/Articles/933493/ JohnFA <div class="FormattedComment"> Sounds like it'll be legal in Japan: <a rel="nofollow" href="https://technomancers.ai/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/">https://technomancers.ai/japan-goes-all-in-copyright-does...</a><br> </div> Fri, 02 Jun 2023 04:13:31 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933483/ https://lwn.net/Articles/933483/ floppus <div class="FormattedComment"> It's certainly an issue that has been discussed quite a bit, including here on LWN, in the context of GitHub's Copilot.<br> <p> There's certainly no consensus about whether this sort of practice (laundering algorithms via large-scale AI models to escape copyright) is legal, nor any consensus about whether it ought to be legal. 
In practice it seems many people are doing it and not worrying about whether it's legal or not.<br> <p> Does AI threaten the concept of copyleft? Yes, clearly. But copyleft was never (or shouldn't have been) the end goal. Will AI ultimately be harmful or beneficial to *software freedom*? That remains to be seen.<br> </div> Thu, 01 Jun 2023 20:46:39 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933367/ https://lwn.net/Articles/933367/ JohnFA <div class="FormattedComment"> This seems like a potentially very significant issue for Open Source? <br> <p> If code can be reimplemented by LLMs and escape copyright protection, this would neutralise copyleft.<br> <p> Are you aware of any other discussions around this issue? Is it as serious as it sounds?<br> </div> Wed, 31 May 2023 12:27:01 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/933372/ https://lwn.net/Articles/933372/ Rudd-O <div class="FormattedComment"> <span class="QuotedText">&gt; But now, if you google for "open source language model", it seems to have quickly become a widespread term that means that at least the large data set is not under a free software license or open source license or free culture license. The reason for this seems to basically be that freely licensed datasets are much smaller and less useful, so the people making these models have decided to promote a new definition that is different and degraded in an important way, and they seem to be doing it successfully. Am I understanding this right?</span><br> <p> Yes, and this is profoundly tragic.<br> <p> I would feel confident pinning the blame for the *origin* of this epistemic degeneration on OpenAI — whose stated intentions were to be an open project for building AI... 
then promptly reversed course.<br> </div> Wed, 31 May 2023 11:49:40 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932863/ https://lwn.net/Articles/932863/ jezuch <div class="FormattedComment"> I'm late to the party but I have to note that reading this, especially the last part, eerily reminded me of <a href="https://qntm.org/mmacevedo">https://qntm.org/mmacevedo</a><br> </div> Wed, 24 May 2023 16:24:26 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932783/ https://lwn.net/Articles/932783/ farnz <p>If you're interested in the background to this comment, look up the "Statute of Anne", the first copyright law. As paulj says, it wasn't about the authors at all - it was a consequence of the Stationers' Company trying to recreate their monopoly on printing after it became legal for anyone to own a printing press. Wed, 24 May 2023 11:37:00 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932772/ https://lwn.net/Articles/932772/ paulj <div class="FormattedComment"> The original point of copyright was for publishers - note *not* authors - to re-assert the oligopoly they previously enjoyed over the printing of books.<br> <p> The argument "won't someone think of the poor authors?" was made by the *printers*, but for completely self-serving reasons. The printers never cared for the authors before when they had their oligopoly by guild and royal decree, nor did they once they got copyright. They just made authors sign over the rights.<br> </div> Wed, 24 May 2023 08:42:41 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932769/ https://lwn.net/Articles/932769/ mb <div class="FormattedComment"> If I can remove Copyright from a program by clicking a button in a Copyright-Remover-3000 AI app, then the rip-off protection is useless and ineffective.<br> Rewriting by a human takes significant amounts of resources. 
That is the only reason why the rip-off protection works today.<br> </div> Wed, 24 May 2023 07:24:44 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932768/ https://lwn.net/Articles/932768/ smurf <div class="FormattedComment"> Huh.<br> <p> Copyright has, or originally had, a point. The point was not to allow publishers to rip off authors by printing verbatim copies of their work without paying them.<br> <p> The point is not, and never was, to prevent me from reading your work and then writing something in the same style, or using the general concepts, or whatever.<br> <p> So what the heck is the problem? Sure, a machine could rewrite Stephen King's latest blockbuster until it's no longer immediately recognizable as a copy. So could I. Approximately nobody would buy that, however.<br> <p> So what's the problem we're trying to solve here?<br> </div> Wed, 24 May 2023 05:38:42 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932749/ https://lwn.net/Articles/932749/ Lennie <div class="FormattedComment"> <span class="QuotedText">&gt; Because if we just let machine learning models remove copyright and be ok with that, it renders copyright law completely useless. I could then always just take foreign copyrighted A and make my copyrighted B out of it very quickly and cheaply. What's copyright good for then?</span><br> <p> That was exactly what I'm implying... did we just figure out that in a year copyright's current roles will be useless?<br> </div> Tue, 23 May 2023 19:24:02 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932746/ https://lwn.net/Articles/932746/ mb <div class="FormattedComment"> <span class="QuotedText">&gt; Or are you referring to something else ?</span><br> <p> Yes.
The problem is that the creators of copyright law never had AI in mind.<br> So it's not made for it.<br> <p> It's time to either think of a way to adapt copyright law, or to get rid of it entirely. Because if we just let machine learning models remove copyright and be ok with that, it renders copyright law completely useless. I could then always just take foreign copyrighted A and make my copyrighted B out of it very quickly and cheaply. What's copyright good for then?<br> <p> <span class="QuotedText">&gt; So I guess the lawyer step was missing ?</span><br> <p> Yes. If you keep a non-trivial human step, I would say it's probably fine then. The machine steps can be seen as (de-)compilation then.<br> </div> Tue, 23 May 2023 18:47:30 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932669/ https://lwn.net/Articles/932669/ Lennie <div class="FormattedComment"> Wikipedia defines: "Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners."<br> <p> So I guess the lawyer step was missing?<br> <p> Or are you referring to something else?<br> <p> The clean room implementation would get far enough away from the original copyrighted work assuming it was done correctly, but maybe you are referring to the situation that the output might be a re-implementation of the copyright of someone else (because it was trained on other works).<br> </div> Tue, 23 May 2023 06:50:00 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932534/ https://lwn.net/Articles/932534/ smurf <div class="FormattedComment"> Well, they have done plenty in the past.<br> <p> The Disney example will serve.
After all, while another extension seems to be off the table, nobody is actually working on reverting (some of) this atrocity either.<br> <p> NB, according to Wikipedia the worst example is Mexico. Life+100 years. Ugh.<br> </div> Sun, 21 May 2023 07:52:28 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932532/ https://lwn.net/Articles/932532/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; the copyright to The Mouse! shall!! not!!! expire!!!1!</span><br> <p> PSA: The (oldest) copyright to The Mouse will expire on January 1, 2024, and neither Disney nor Congress have shown the slightest interest in doing anything about it. Disney has even publicly acknowledged[1] that it will happen. It was a useful example for a while, but we will soon need to start using a new one.<br> <p> [1]: <a href="https://www.nytimes.com/2022/12/27/business/mickey-mouse-disney-public-domain.html">https://www.nytimes.com/2022/12/27/business/mickey-mouse-...</a><br> </div> Sun, 21 May 2023 04:46:24 +0000 Democratizing AI with open-source language models https://lwn.net/Articles/932530/ https://lwn.net/Articles/932530/ NightMonkey <div class="FormattedComment"> Don't miss that there are also lots of poorly paid humans, hired in countries without a history of worker protection, assisting the proprietary "A.I." shops to trim the rough edges of their bots' output...<br> <p> <a href="https://time.com/6247678/openai-chatgpt-kenya-workers/">https://time.com/6247678/openai-chatgpt-kenya-workers/</a><br> <p> Can open frameworks and corpora help here? Not so sure myself... But I think this should be considered in comparing the two licensing and "intellectual property" models.<br> </div> Sun, 21 May 2023 01:24:54 +0000