OSI readies controversial Open AI definition
The Open Source Initiative (OSI) has been working for well over a year on defining Open Source AI—that is, what constitutes an AI system that can be used, studied, modified, and shared for any purpose. Its board will vote on the Open Source AI Definition (OSAID) on Sunday, October 27, with the 1.0 version slated to be published on October 28. It is never possible to please everyone in such an endeavor, and it would be folly to make that a goal. However, a number of prominent figures in the open-source community have voiced concerns that the OSI is setting the bar too low with the OSAID—and that doing so will undo decades of community work to cajole vendors into adhering to, or at least respecting, the original Open Source Definition (OSD).
Defining Open Source AI
OSI executive director Stefano Maffulli announced the organization's intent to provide a definition for open-source AI in June 2023. He took exception to announcements of "large language models, foundational models, tooling, services all claiming to be 'open' or 'Open Source'", while adding restrictions which run afoul of the OSD. A survey of large-language-model (LLM) systems in 2023 found that ostensibly open-source LLMs did not live up to the name.
The problem is not quite as simple as saying "use an OSD-compliant license" for LLMs, because there are many more components to consider. The original OSD is understood to apply to the source code of a program in "the preferred form in which a programmer would modify the program". A program is not open source if a developer cannot study, use, modify, and share it, and a license is not OSD-compliant if it does not preserve those freedoms. A program can depend on non-free data, however, and still be open source. For example, the game Quake III Arena (Q3A) is available under the GPLv2. That distribution does not include the pak files that contain the maps, textures, and other content required to actually play the commercial game. Despite that, others can still use the Q3A code to create their own games, such as Tremulous.
When discussing an "AI system", however, things are much more complicated. There is more to it than the code used to run the models, and the data cannot be separated from the system as cleanly as it can be with a game. When looking at, say, LLMs, there is the model architecture, the code used to train models, the model parameters, the techniques and methodologies used for training, the procedures for labeling training data, the supporting libraries, and (of course) the data used to train the models.
OSI has been working on its definition since last year. It held a kickoff meeting on June 21, 2023 at the Mozilla headquarters in San Francisco. It invited participation afterward via a regular series of in-person and online sessions, and with a forum for online discussions. LWN covered one of the sessions, held at FOSDEM 2024, in February.
The current draft of the OSAID takes its definition of an AI system from the Organisation for Economic Co-operation and Development (OECD) Recommendation of the Council on Artificial Intelligence:
A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
This includes source code for training and running the system, model parameters "such as weights or other configuration settings", as well as "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".
Preferred form to make modifications
Those elements must all be available under OSI-approved licenses, according to the proposed definition, which seems perfectly in line with what we've come to expect when something is called "open source". There is an exception, though, for things like the data information and model parameters, which must instead be available under "OSI-approved terms". The definition of OSI-approved terms has not yet been supplied.
There is no requirement to make the training data available. To be compliant with the current draft of the OSAID, an AI system need only provide "detailed information" about the data, not the data itself.
The OSI published version 0.0.9 on August 22. It acknowledged then that "training data is one of the most hotly debated parts of the definition". However, the OSI was choosing not to require training data:
After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.
Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
As it stands, some feel that the OSAID falls short of delivering the four freedoms it is supposed to ensure. For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. "They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that's it."
Tom Callaway has written at length on LinkedIn about why open data should be a requirement. He acknowledges that there are good reasons why distributors of an AI system may not want to, or be able to, distribute training data. The data itself may have high monetary value, and a vendor may be unwilling or unable to share it. Acme Corp might license a data set and have permission to create an AI system using it, but not the right to distribute the data itself. The data might also raise legal issues, ranging from confidentiality (e.g., medical data sets) to a desire to avoid lawsuits over the use of copyrighted data.
All of those are understandable reasons for not distributing data with an AI system, he said, but they don't argue for crafting a definition that allows companies to call their system open:
If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts. While there are lots of companies who would like to see open source mean less, I think it's critical that we not compromise here, even if it means there are less Open Source AI systems at first.
Objections to the lack of training data are about more than an attachment to the original meaning of open source. Giacomo Tesio posted a list of issues he considered unaddressed in the RC2 version of the OSAID, including a claim that there is inherent insecurity due to the ability to plant undetectable backdoors in machine-learning models.
Others weigh in
The Free Software Foundation (FSF) announced that it is working on "a statement of criteria for free machine learning applications", which is what it will use to decide whether to call something a free (or libre) machine-learning application. The FSF says that it is close to a definition and is working on the exact text. However, it adds that "we believe that we cannot say a ML application 'is free' unless all its training data and the related scripts for processing it respect all users, following the four freedoms".
However, the FSF makes a distinction between non-free and unethical in this case:
It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.
The Software Freedom Conservancy has announced an "aspirational statement" about LLM-backed generative AI for programming, called "Machine-Learning-Assisted Programming that Respects User Freedom". Unlike the OSAID, this target focuses solely on computer-assisted programming, and was developed in response to GitHub Copilot. The announcement did not directly name the OSI or the OSAID effort, but said "we have avoided any process that effectively auto-endorses the problematic practices of companies whose proprietary products are already widely deployed". It describes an ideal LLM system built only with FOSS, with all components available, and only for the creation of FOSS.
Response to criticisms
I emailed Maffulli about some of the criticisms of the current OSAID draft, and asked why OSI appears to be "lowering the bar" when the OSI has never budged on source availability and use restrictions. He replied:
I'll be blunt: you mention "source redistribution" in your question and that's what leads people like [Callaway] into a mental trap [...]
There are some groups believing that more components are required to guarantee more transparency. Other groups instead believe that model parameters and architecture are enough to modify AI. The Open Source AI Definition, developed publicly with a wide variety of stakeholders worldwide, with deep expertise on building AI (see the list of endorsers), found that while those approaches are legitimate, neither is optimal. The OSAID grants users the rights (with licenses) and the tools (with the list of required components) to meaningfully collaborate and innovate on (and fork, if required) AI systems. We have not compromised on our principles: we learned many new things from actual AI experts along the way.
Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems is what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment". He added that OSI had synthesized input from "AI builders, users, and deployers, content creators, unions, ethicists, lawyers, software developers from all over the world" to arrive at the definition. A "simple translation" of the OSD, he said, would not work.
Stephen O'Grady, founder of the RedMonk analyst firm, also makes the case that the OSD does not easily translate to AI projects. But he does not believe that the term open source "can or should be extended into the AI world", as he wrote in a blog post on October 22:
At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated future set of artifacts.
O'Grady makes the case that the OSI has set out on a pragmatic path to define open-source AI, which requires nuance. Open source has succeeded, in part, because the OSD removes nuance. Does a license comply with the OSD or doesn't it? It's pretty easy to determine. Less so with the OSAID. The pragmatic path, he said:
Involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: "If you're explaining, you're losing."
It would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate. That seems unlikely now, he said, after two years of trying to "thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable" definition.
Indeed, it seems likely that the OSI board will move forward with the current draft of the OSAID or something close to it. The impact that will have is much less certain.
Posted Oct 25, 2024 22:15 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (1 responses)
Where does the CMU comment say anything about 'the preferred form for modifying ML systems'?
Posted Oct 26, 2024 22:48 UTC (Sat)
by somlo (subscriber, #92421)
[Link]
Conspicuously absent are any of the AI/ML "heavy hitters", so I'm not really sure what conclusion one should draw from that...
Posted Oct 25, 2024 22:17 UTC (Fri)
by shironeko (subscriber, #159952)
[Link]
Personally it would seem that, unfortunately for some, Stallman was right yet again.
Posted Oct 25, 2024 22:31 UTC (Fri)
by leromarinvit (subscriber, #56850)
[Link] (6 responses)
> Given the following article, write a comment about whether or not training data should be included in the OSAID. Explain why, either way.
It seems it wants to see training data included in the definition:
> Whether training data should be included in the Open Source AI Definition (OSAID) is a crucial issue, and I believe it should be required, not optional. Leaving training data out of the definition weakens the foundational principles of open source, particularly the ability to fully understand, modify, and build upon AI systems.
> Training data is central to how AI models are developed, and it plays a significant role in shaping the behavior and biases of the system. Without access to the data—only providing "detailed information"—users are restricted in their ability to recreate or meaningfully modify the system. This exclusion limits transparency and undermines the goal of fostering collaboration and innovation, which are core values of the open-source community.
> While there are valid concerns about the ethical and legal challenges of releasing training data, such as privacy issues and proprietary data restrictions, these challenges should be addressed case-by-case rather than lowering the bar for what qualifies as open source. By allowing AI systems to be considered "open" without sharing the training data, the OSAID risks diluting the meaning of open source and setting a precedent where essential components of AI systems can remain closed while still benefiting from the open-source label.
> In short, if the goal is to preserve the spirit of open source in AI, training data should not be optional, as omitting it compromises transparency, accountability, and the ability to fully exercise the four freedoms that open source aims to protect.
;-)
While I happen to agree with the conclusion, the reasoning seems more like a summary of the article to me, instead of actual independent thought (well, duh - it's an LLM, it can't think). In particular, what stands out to me is that with source code, I can not only add new features or modify existing ones, but also remove parts entirely. This seems impossible with the kind of iterative modification that seems to be the only kind taken into account here (to my admittedly very limited understanding of the matter).
Or, more to the point:
> [...]
> training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.
Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.
Posted Oct 26, 2024 0:01 UTC (Sat)
by rettichschnidi (subscriber, #93261)
[Link] (2 responses)
True for LLMs, at least as of today. But there are many (smaller) ML/"AI" applications that can be trained by SMEs and even enthusiasts already today.
Posted Oct 26, 2024 4:36 UTC (Sat)
by jmspeex (subscriber, #51639)
[Link]
Posted Oct 26, 2024 7:02 UTC (Sat)
by mirabilos (subscriber, #84359)
[Link]
Posted Oct 28, 2024 8:54 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
I'm leery of arguments based on the cost of training a model, because they generalise to "no software is open source unless you're willing to donate a working system to anyone who wants to build it". Instead, I'd prefer to see arguments around the reproducibility of training (since if training is not reproducible, a skilled hacker is unable to test whether their optimized training process produces the same results at under 1% of the cost of the original), or around the incompleteness of training data.
Put another way; if it currently costs $100m in cloud GPU costs to train a model, but a clever person is able to work out how to do the same training on a $500 laptop for $100 in electricity, I want that clever person to be able to show that their technique works exactly as well as the original, and not to be held back by the lack of open models and training sets they can use to show off their technique.
Posted Oct 29, 2024 17:16 UTC (Tue)
by paulj (subscriber, #341)
[Link]
(We may run into some profound ethical issues with ML before that day, but that's another topic).
Posted Oct 29, 2024 17:13 UTC (Tue)
by paulj (subscriber, #341)
[Link]
This isn't quite true, depending on what you mean.
a) You can start from scratch, just start with random parameters again, and train with your own data. I.e., just don't use the parameter set.
b) If there are specific things in the trained parameters that you don't like/want, you can /keep/ training the model but apply your own objective function to penalise the thing you don't like. Essentially "correcting" whatever it is you don't like.
The challenge with b may be that you don't know what is in there that you don't want. However, you will probably have the same problem if you have the full input data. The model basically gives you the training data in a digested form that at least lets you query it, though.
That "correcting" process is something all the big players with public AI tools surely must have /big/ teams of people working on: just filtering out whatever "fruity" content happens to be discovered in public LLMs and generative AIs. Teams manually looking for "dodgy" content, to then come up with the widest set of prompts that spit out that dodgy content (in all its combinations), to then pass on to other teams to create refinement training cases and test-cases. So that we get the safely sanitised LLMs / models suitable for public use.
Posted Oct 26, 2024 1:40 UTC (Sat)
by chuckwolber (subscriber, #138181)
[Link] (4 responses)
If I listen to music that ultimately influences my style as a musician, I am no less free when I play in my heavily influenced style and the music that influenced me is no less encumbered by copyright. The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.
Today those qualia are small, so it is easy to dismiss their subjectivity. Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.
By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society. The very idea is absurd.
We invented a shortcut to that problem a long time ago - it is called a "social contract". We cannot possibly know everything that is going on in your head, but we can set forth expectations and apply relevant penalties.
I propose we rethink the OSAI in the same way.
Posted Oct 26, 2024 5:12 UTC (Sat)
by shironeko (subscriber, #159952)
[Link]
Posted Oct 26, 2024 13:11 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (1 responses)
And herein lies the entire problem with AI, imho.
The qualia (whatever that means) of the human brain is much LESS than we give it credit for. But it is QUALITATIVELY DIFFERENT from what we call AI and LLM and that junk. AI and LLMs are bayesian inference machines. Human brains have external forces imposing a (not necessarily correct) definition of "right" and "wrong".
When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever. When a computer bayesian machine messes up, it's too expensive to retrain.
I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank ... (I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...)
Cheers,
Wol
Posted Oct 26, 2024 17:31 UTC (Sat)
by khim (subscriber, #9252)
[Link]
> When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever.

That's an extremely rare occurrence.

> When a computer bayesian machine messes up, it's too expensive to retrain.

That could be true for LLMs, today, but many smaller-scale AI models are retrained from scratch routinely. Pretty soon there will be more LLMs retrained from scratch than the 100 or 200 billion people that have ever lived on this planet. Would that mean that LLMs would achieve parity with humans, when that happens? Hard to predict… my gut feeling is no, that wouldn't happen – but not because encounters with lions or wildebeests made humans that much better, but because nature invented a lot of tricks over billions of years that we don't know how to replicate… yet.

> why can't a million-pound AI tell the difference between a car and a tank ...

It could. In fact, today AI already makes fewer mistakes than humans on such tasks. Yes, AI makes different mistakes, but average precision is better. We just tend to dismiss the issues with "optical illusions" and exaggerate AI mistakes. To feel better about ourselves, maybe? When you are shown a picture where the exact same color looks "white" in one place and "black" in another, you don't even need AI to reveal the difference; a simple sheet of paper is enough – but that's not perceived as a human brain deficiency because we are all built the same and fall prey to the exact same illusion. AIs are built differently and are fooled by different illusions than the ones humans misperceive – and that difference is, in our arrogance, treated as "human superiority". It's only when AI starts doing things so much better than humans, beats humans so supremely decisively, that there is no doubt that AI is "head and shoulders" above humans for this or that task… only then do humans redefine that particular task as "not really important". Like it happened with go and chess, among other things.
Posted Oct 27, 2024 16:10 UTC (Sun)
by ghodgkins (subscriber, #157257)
[Link]
In my understanding, training a neural net is deterministic in the sense that matters for reproducibility. If you train the same model architecture in the same environment with the same data, you'll get the same final weights. This is true even if you draw from random distributions during training, as long as you choose the same seed(s) for the PRNG.
The input-output mapping of the trained model is usually also deterministic, except for some special-purpose stochastic models. Even those you may be able to make reproducible by fixing the PRNG seed, as above.
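A toy, CPU-only sketch of that determinism claim (NumPy, synthetic data; real large-scale GPU training has additional sources of nondeterminism, as another comment in this thread notes):

    # Toy sketch: with a fixed PRNG seed, repeating the same training run
    # produces bit-identical weights. NumPy only, synthetic data.
    import numpy as np

    def train(seed):
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(64, 4))
        y = X @ np.array([[1.0], [-2.0], [0.5], [3.0]])
        W = rng.normal(size=(4, 1))               # random init drawn from the seed
        for _ in range(200):                      # plain gradient descent
            W -= 0.05 * (2 * X.T @ (X @ W - y) / len(X))
        return W

    print(np.array_equal(train(42), train(42)))   # True: identical weights
    print(np.array_equal(train(42), train(7)))    # False: different seed, different weights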
> The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.

It is not true that the weights "owe nothing" to the training data. As mentioned above, for a fixed PRNG seed, they are in fact a very complex closed-form function of that data - certainly "dependent" in the probability sense.
> By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society.
I think it's reasonable to have different expectations for software tools and the people that use them, and honestly kind of absurd not to.
> The very idea is absurd.
For humans, certainly. One key difference between humans and NNs here is that NNs have a thing called "training" with well-defined inputs and output, in a consistent and well-defined format, which makes enumerating the training data entirely feasible.
> We cannot possibly know everything that is going on in your head
But we can know everything that is going on inside a NN, although we may not be able to interpret it with respect to the inputs and outputs.
Posted Oct 26, 2024 6:57 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Oct 31, 2024 13:21 UTC (Thu)
by danpb (subscriber, #4831)
[Link]
None the less, I do like to see documents like this. The industry is awash with opinions/policies that are almost entirely driven by corporate vested interests who want to push & sell use of AI everywhere. There is not nearly enough analysis & input from groups who are looking at it from an objective & independent POV, without their judgement being clouded by a need to sell AI.
Posted Oct 26, 2024 6:59 UTC (Sat)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
It certainly has lost any and all credibility with these actions and their whole procedure, which was designed to cater to corporate interests over open source principles right from the beginning (leading questions, lack of survey response options for those who disagree, etc).
Many people certainly don’t want those theft machines blessed by something they identify with.
(I will, for one, *not* enable reply notifications, nor engage further here. I’ve written up on my webservers and discussed on the OSI MLs and in Fedi all that needs to be said.)
Posted Oct 26, 2024 7:53 UTC (Sat)
by rettichschnidi (subscriber, #93261)
[Link]
Can confirm. While my posts (https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20O...) might have been below their quality expectation, having them marked as "Pending" (for review) and then silently dropped (deleted) after ~2 days seems not ok to me. At least some kind of feedback/notification would have been appreciated.
Posted Oct 26, 2024 14:39 UTC (Sat)
by IanKelling (subscriber, #89418)
[Link] (2 responses)
The CMU statement, which calls it a "participatory co-design process" that was "inclusive", makes me roll my eyes. It was drafted privately by the OSI board in some undisclosed way.
Posted Oct 28, 2024 7:23 UTC (Mon)
by WolfWings (subscriber, #56790)
[Link] (1 responses)
Or do you just mean it required javascript enabled at all to function and post comments?
Posted Oct 28, 2024 7:45 UTC (Mon)
by mjg59 (subscriber, #23239)
[Link]
Posted Oct 26, 2024 15:26 UTC (Sat)
by burki99 (subscriber, #17149)
[Link] (8 responses)
Therefore I would suggest coming up with two definitions:
a) Pragmatic Open Source AI
b) Strict Open Source AI
a) would be the one suggested by OSI, b) would additionally require the training data to be available (which implies an open content license that allows redistribution of the training data).
Posted Oct 26, 2024 15:55 UTC (Sat)
by jzb (editor, #7867)
[Link]
License discussions are no place for reasonable suggestions. :) In seriousness, though - that's not a terrible idea. I'm not sure I'd go with "Pragmatic" and "Strict" as the names, but something that indicates "everything but the data is available" and "this is the gold standard of openness, including open data" with the naming reflecting that might satisfy some of the folks unhappy with tagging anything less than fully open as "open source".
Posted Oct 26, 2024 21:26 UTC (Sat)
by josh (subscriber, #17465)
[Link] (5 responses)
Posted Oct 27, 2024 8:00 UTC (Sun)
by burki99 (subscriber, #17149)
[Link] (3 responses)
Posted Oct 27, 2024 23:12 UTC (Sun)
by neggles (subscriber, #153254)
[Link] (2 responses)
Contrary to popular belief, merely having access to the dataset doesn't actually give you the ability to reproduce a model. For example, the dataset for Stable Diffusion 1.5 is known and publicly available, the preprocessing steps taken are known, and the training process is known, but any attempt to exactly recreate it is unlikely to succeed due to changes in hardware, specific step counts, underlying library bug fixes and behaviour changes, specific corrupted images, etc.
Meta and OpenAI and MidJourney and many others have made newer, better variants of the same concept/using similar architectures, but nobody's made a reproduction, and not for lack of trying.
That said, why would you want to?
If you want to significantly improve the model in a general sense you'll need to adjust the architecture, which necessitates retraining it from clean anyway - and you'll likely want to replace the dataset with one more suited to your desired use case.
If you just want to make it good at generating some specific type of image, finetuning on top of the existing base model weights requires orders of magnitude less compute and a completely different domain-specific dataset anyway.
This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement is attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.
I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.
There is no real benefit to requiring open datasets, and a lot of massive downsides; it would effectively kill open source AI because no company would risk the potential legal exposure from doing such a thing - even if the data *doesn't* have copyright issues, proving that in court could be rather difficult and time-consuming. Copyright trolling is just as real of a thing as patent trolling - just ask anyone who's been subject to frivolous DMCA takedown requests.
Posted Oct 28, 2024 20:00 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I generally agree that the anti-AI artists are on the wrong side here (or at least, they have not demonstrated to my satisfaction that their absolutist position holds water in every possible circumstance).
But I think it would be helpful to take a step back and avoid vilifying a group of people whose industry is currently undergoing a serious level of tech-induced upheaval. People are scared and worried for their economic future, and I think they do have a right to be upset and to advocate for their self interest, even as I disagree with them.
IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training. It is most plausible to me that that will end up being a fact-specific inquiry which requires individualized analysis of both the model and the specific training work whose copyright is allegedly infringed.* Frankly, I'm not entirely convinced that any version of open source AI would be viable in such a regime - but the point is that artists posting angry messages on Xitter are probably not going to change that outcome one way or the other.
Disclaimer: I work for Google, which does some GenAI stuff that I'm not involved with. Opinions are my own, speculation about the future is quite possibly wrong.
* For example, The New York Times provided evidence (in their lawsuit against OpenAI) that ChatGPT can sometimes produce verbatim or near-verbatim copies of their columns. I could imagine courts wanting to see evidence similar to that before entering a finding of infringement.
Posted Oct 29, 2024 10:03 UTC (Tue)
by kleptog (subscriber, #1183)
[Link]
That might work for Common Law systems like the US, but in Civil Law systems (most of Europe at least) the courts generally can't make these kinds of major policy changes and will punt the problem to the legislature to sort out. The EU AI Act 2024 covers all the stuff that everyone could agree on already (e.g. training for non-commercial use is fine). The question of what to do about training data for commercial use is subject of much political debate. The EU executive/legislature is currently in a look and see mode to see if businesses can work out an agreement amongst themselves and are only likely to step in if it looks like someone is using their market power illegitimately, or we start seeing serious negative public policy impacts.
Posted Oct 28, 2024 13:45 UTC (Mon)
by bud (subscriber, #5327)
[Link]
Posted Oct 26, 2024 21:26 UTC (Sat)
by tzafrir (subscriber, #11501)
[Link]
Posted Oct 27, 2024 2:50 UTC (Sun)
by NightMonkey (subscriber, #23051)
[Link]
FYI, that "old political adage" was from U.S. President Ronald Reagan. Seems odd to see that used here.
Posted Oct 29, 2024 15:27 UTC (Tue)
by paulj (subscriber, #341)
[Link] (3 responses)
1. Provide public funding for projects to gather, collate, store and make available data for public use (in various arenas);
2. Regulate the use of large-scale corporate data-sets for training, potentially making provisions for fair-use access under reasonable terms.
That said, I have to take issue with a common theme in some of the objections, which suggest that a piece of software cannot be considered open if some significant data inputs to that software are not open, e.g. as per Tom Callaway's suggestion:
> "If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts."
Or the suggestion that ANN software is not open, even if different sets of weights are available, if input data is not open, e.g.:
> For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system.
The former suggestion, that at least all significant possible input data to some software must be open, for the software to be considered open, is not something the Free Software world has stood fast on before. We did not require that every possible C programme must be open for GCC to be considered open. There are a number of Free Software game engines, for which game-data packs exist that are proprietary - e.g., often the original game that spawned the game engine, along with other proprietary games using the same engine. We don't consider a GPL Quake or Descent game engine to not be Free Software because there are game packs that are non-free, do we?
The latter suggestion, that an ANN is not trainable without the original input data, is technically wrong. The ANN is eminently trainable. At worst, you start with random weights. If you've been given a set of weights from a previous training run, you've actually got a _lossily compressed form_ of the original data-set, which will be a much, much better starting point than otherwise (given you appear to be interested in the original data set).
We need to be clear there are a number of different components here:
1. The ANN software
2. The model specification (if not hard-coded in 1)
3a. The model state (the parameters and activation states)
3b. The input training data that created 3a
3a is a lossily compressed form of 3b.
1 is useful on its own.
2 is useful on its own (either described in some DSL to be used by some instance of 1, or described in literature).
1 + 2 are very useful.
1 + 2 + 3a are very very useful, and you can add your own input data set and train further from there. I.e., you are in a much better position starting your training from here, than from 3b, given you are demanding 3b to be open to you. 3a saves you a *lot* of compute!
3b will ultimately be required in many contexts for social reasons, as I mention at the start, but I see that as a different issue and one that will be solved by public means.
I think it will be extremely hard to /require/ that 3b is made available, just for logistical reasons, given the vast volume of data - even if the input data is already public. And if the data is already public (e.g. web crawls), you can assemble it yourself. Obviously 1 and 2 can and should be publishable under Free Software licences, without demanding that all cases of 3b must be made available under a Free Software licence.
So it comes down to whether someone should be able to publish a 3a, while not publishing 3b, and still be able to claim some "Open Source" label. If 3b is merely collated from public data, and the methodology to replicate that collation is obvious (web scraping, say) or otherwise described, perhaps there should be some accommodation that allows such 3a data-sets to be distributed along with Free Software by Free Software entities?
For other cases, where 3a is a distillation of some other data-set that cannot be replicated from public data, should this be treated as code or data? Data is what it is really. What if 3a is a distillation of personal data that could not be distributed otherwise under other laws (e.g. GDPR)? Should we say that no Free Software entity should be able to distribute a useful data-set, that could be used to run some Free Software LLM and do useful things for users?
What if the 3a lossy-compressed data is the /only/ way that the 3b data-set could be legally distributed anywhere in the world? E.g., 3b is personal medical data, and 3a is the model state of an ANN that uses homomorphic-encryption precisely to ensure that the very sensitive data of 3b can still be distilled to a compressed data set of 3a that can then be used by ANNs around the world? Is Free Software to be cut-off from such advances forever?
Posted Oct 29, 2024 15:34 UTC (Tue)
by paulj (subscriber, #341)
[Link]
I was trying to think of some scenario where the training data would be too sensitive to distribute, but then query access has the same sensitivity, so it's unclear whether that scenario would ever plausibly interact with Free Software issues. I can't think of a good example anyway.
Posted Oct 30, 2024 12:45 UTC (Wed)
by kleptog (subscriber, #1183)
[Link] (1 responses)
The case for training on private datasets (like medical data) is a different issue. Obviously you can only train an AI to detect lung cancer if you have access to datasets with information about people who may have lung cancer. Ideally you'd like such datasets to be available on a Reasonable and Non-Discriminatory basis. This sort of access is already available for other purposes; I don't see how AI requires anything new here.
(Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)
Posted Oct 30, 2024 12:54 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link]
Asimov had "robopsychologists" doing things just like that.