Why not two licences?
Posted Oct 27, 2024 8:00 UTC (Sun) by burki99 (subscriber, #17149)
In reply to: Why not two licences? by josh
Parent article: OSI readies controversial Open AI definition
Posted Oct 27, 2024 23:12 UTC (Sun) by neggles (subscriber, #153254)
Contrary to popular belief, merely having access to the dataset doesn't actually give you the ability to reproduce a model. For example, the dataset for Stable Diffusion 1.5 is known and publicly available, the preprocessing steps taken are known, and the training process is known, but any attempt to exactly recreate it is unlikely to succeed due to changes in hardware, specific step counts, underlying library bug fixes and behaviour changes, specific corrupted images, etc.
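As a rough illustration (a minimal PyTorch sketch of my own, not anything from the actual Stable Diffusion training code), even pinning every obvious source of randomness only promises bit-identical results on the same hardware, driver, and library versions:

import random
import numpy as np
import torch

def pin_determinism(seed: int = 0) -> None:
    random.seed(seed)                         # Python's RNG
    np.random.seed(seed)                      # NumPy's RNG
    torch.manual_seed(seed)                   # CPU and CUDA RNGs
    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # no autotuned kernel selection

pin_determinism()
# Even a plain float32 matmul can differ in the low bits between GPU
# generations or BLAS versions, because the reduction order differs:
x = torch.randn(1024, 1024)
print((x @ x).sum())  # value can drift across hardware/library upgrades

Multiply that last-bit drift by a few hundred thousand optimizer steps and the resulting weights diverge measurably, which is why exact reproduction attempts keep failing.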
Meta and OpenAI and MidJourney and many others have made newer, better variants of the same concept/using similar architectures, but nobody's made a reproduction, and not for lack of trying.
That said, why would you want to?
If you want to significantly improve the model in a general sense, you'll need to adjust the architecture, which necessitates retraining it from scratch anyway - and you'll likely want to replace the dataset with one better suited to your desired use case.
If you just want to make it good at generating some specific type of image, finetuning on top of the existing base model weights requires orders of magnitude less compute and a completely different domain-specific dataset anyway.
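To make the compute difference concrete, here is a minimal sketch (plain PyTorch, with made-up layer sizes standing in for a real base model - actual Stable Diffusion finetunes use the diffusion training loop, and often LoRA adapters) of why finetuning is so much cheaper: the expensive pretrained weights stay frozen, and only a small new piece is trained on the domain dataset.

import torch
from torch import nn

# Hypothetical stand-in for pretrained base-model weights.
base = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 10)  # small new task-specific piece

for p in base.parameters():
    p.requires_grad = False  # freeze the expensive pretrained part

opt = torch.optim.AdamW(head.parameters(), lr=1e-4)  # optimize only the head
loss_fn = nn.CrossEntropyLoss()

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    with torch.no_grad():
        feats = base(x)  # frozen features: no gradients stored for these
    loss = loss_fn(head(feats), y)
    opt.zero_grad()
    loss.backward()  # gradients flow only into the head's parameters
    opt.step()
    return loss.item()

# Example batch from the hypothetical domain-specific dataset:
print(train_step(torch.randn(32, 512), torch.randint(0, 10, (32,))))

Gradient memory and optimizer state scale with the trainable parameters only, which is where the orders-of-magnitude saving comes from.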
This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.
I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people made their judgement long before the courts have made theirs, and in general give no quarter for discussion or room to reassess that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.
There is no real benefit to requiring open datasets, and there are massive downsides: it would effectively kill open source AI, because no company would risk the legal exposure of publishing its training data - even if the data *doesn't* have copyright issues, proving that in court could be slow and expensive. Copyright trolling is just as real a thing as patent trolling - just ask anyone who's been subject to frivolous DMCA takedown requests.
Posted Oct 28, 2024 20:00 UTC (Mon) by NYKevin (subscriber, #129325)
I generally agree that the anti-AI artists are on the wrong side here (or at least, they have not demonstrated to my satisfaction that their absolutist position holds water in every possible circumstance).
But I think it would be helpful to take a step back and avoid vilifying a group of people whose industry is currently undergoing a serious level of tech-induced upheaval. People are scared and worried for their economic future, and I think they do have a right to be upset and to advocate for their self interest, even as I disagree with them.
IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify whether and how the existing "derivative work" right applies to AI training. The most plausible result, to me, is a fact-specific inquiry that requires individualized analysis of both the model and the specific training work whose copyright is allegedly infringed.* Frankly, I'm not entirely convinced that any version of open source AI would be viable in such a regime - but the point is that artists posting angry messages on Xitter are probably not going to change that outcome one way or the other.
Disclaimer: I work for Google, which does some GenAI stuff that I'm not involved with. Opinions are my own, speculation about the future is quite possibly wrong.
* For example, The New York Times provided evidence (in their lawsuit against OpenAI) that ChatGPT can sometimes produce verbatim or near-verbatim copies of their columns. I could imagine courts wanting to see evidence similar to that before entering a finding of infringement.
Civil vs Common law
Posted Oct 29, 2024 10:03 UTC (Tue) by kleptog (subscriber, #1183)
That might work for Common Law systems like the US, but in Civil Law systems (most of Europe, at least) the courts generally can't make these kinds of major policy changes and will punt the problem to the legislature to sort out. The EU AI Act 2024 covers the things everyone could already agree on (e.g. training for non-commercial use is fine). What to do about training data for commercial use is the subject of much political debate. The EU executive and legislature are currently in wait-and-see mode, hoping businesses can work out an agreement amongst themselves, and are only likely to step in if it looks like someone is using their market power illegitimately, or we start seeing serious negative public policy impacts.