ChatGPT says...

Posted Oct 25, 2024 22:31 UTC (Fri) by leromarinvit (subscriber, #56850)
Parent article: OSI readies controversial Open AI definition

Given the topic, asking an AI for its thoughts on the matter seemed apt. So I prompted ChatGPT (which is, of course, about as far removed from "Open Source AI" as seems possible):

> Given the following article, write a comment about whether or not training data should be included in the OSAID. Explain why, either way.
> [...]

It seems it wants to see training data included in the definition:

> Whether training data should be included in the Open Source AI Definition (OSAID) is a crucial issue, and I believe it should be required, not optional. Leaving training data out of the definition weakens the foundational principles of open source, particularly the ability to fully understand, modify, and build upon AI systems.

> Training data is central to how AI models are developed, and it plays a significant role in shaping the behavior and biases of the system. Without access to the data—only providing "detailed information"—users are restricted in their ability to recreate or meaningfully modify the system. This exclusion limits transparency and undermines the goal of fostering collaboration and innovation, which are core values of the open-source community.

> While there are valid concerns about the ethical and legal challenges of releasing training data, such as privacy issues and proprietary data restrictions, these challenges should be addressed case-by-case rather than lowering the bar for what qualifies as open source. By allowing AI systems to be considered "open" without sharing the training data, the OSAID risks diluting the meaning of open source and setting a precedent where essential components of AI systems can remain closed while still benefiting from the open-source label.

> In short, if the goal is to preserve the spirit of open source in AI, training data should not be optional, as omitting it compromises transparency, accountability, and the ability to fully exercise the four freedoms that open source aims to protect.

;-)

While I happen to agree with the conclusion, the reasoning seems more like a summary of the article to me, instead of actual independent thought (well, duh - it's an LLM, it can't think). In particular, what stands out to me is that with source code, I can not only add new features or modify existing ones, but also remove parts entirely. This seems impossible with the kind of iterative modification that seems to be the only kind taken into account here (to my admittedly very limited understanding of the matter).

Or, more to the point:
> training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.

Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

ChatGPT says...

Posted Oct 26, 2024 0:01 UTC (Sat) by rettichschnidi (subscriber, #93261) [Link] (2 responses)

> Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

True for LLMs, at least as of today. But there are many (smaller) ML/"AI" applications that can be trained by SME and even enthusiasts already today.

ChatGPT says...

Posted Oct 26, 2024 4:36 UTC (Sat) by jmspeex (subscriber, #51639) [Link]

Absolutely. All of the ML models that ended up in Opus 1.5 (https://opus-codec.org/demo/opus-1.5/) can be trained on a single GPU. And there's still people using the original RNNoise I trained in 2017 on a laptop *CPU*. There's tons of useful things you can do with DNNs that are 1M weights or smaller.

ChatGPT says...

Posted Oct 26, 2024 7:02 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

Long as they participate in Reproducible Builds and an independent party has reproduced the exact same model from the same inputs once, it’s at least not totally unpalatable if not everyone has the pockets to regenerate it from scratch in practice.

Cost of training a model

Posted Oct 28, 2024 8:54 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

I'm leery of arguments based on the cost of training a model, because they generalise to "no software is open source unless you're willing to donate a working system to anyone who wants to build it". Instead, I'd prefer to see arguments around both the reproducibility of training (since if training is not reproducible, a skilled hacker is unable to test whether their optimized training process produces the same results in under 1% of the cost of the original), or the incompleteness of training data.

Put another way; if it currently costs $100m in cloud GPU costs to train a model, but a clever person is able to work out how to do the same training on a $500 laptop for $100 in electricity, I want that clever person to be able to show that their technique works exactly as well as the original, and not to be held back by the lack of open models and training sets they can use to show off their technique.

Cost of training a model

Posted Oct 29, 2024 17:16 UTC (Tue) by paulj (subscriber, #341) [Link]

The human brain is capable of some pretty sophisticated training with just thousands of Joules, and really profound learning with MWs, so.... the day will come.

(We may run into some profound ethical issues with ML before that day, but that's another topic).

ChatGPT says...

Posted Oct 29, 2024 17:13 UTC (Tue) by paulj (subscriber, #341) [Link]

> Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.

This isn't quite true, depending on what you mean.

a) You can start from scratch, just start with random parameters again, and train with your own data. I.e., just don't use the parameter set.

b) If there are specific things in the trained parameters that you don't like/want, you can /keep/ training the model but apply your own objective function to penalise the thing you don't like. Essentially "correcting" whatever it is you don't like.

The challenge with b may that you don't know what is in there that you don't want. However, you will probably have the same problem if you have the full input data. The model is basically given you the training data in a digested form that at least lets you query it though.

That "correcting" process is something all the big players with public AI tools surely must have /big/ teams of people working on: just filtering out whatever "fruity" content happens to be discovered in public LLMs and generative AIs. Teams manually looking for "dodgy" content, to then come up with the widest set of prompts that spit out that dodgy content (in all its combinations), to then pass on to other teams to create refinement training cases and test-cases. So that we get the safely sanitised LLMs / models suitable for public use.