LWN: Comments on "OSI readies controversial Open AI definition" https://lwn.net/Articles/995159/ This is a special feed containing comments posted to the individual LWN article titled "OSI readies controversial Open AI definition". en-us Wed, 24 Sep 2025 22:01:56 +0000 Wed, 24 Sep 2025 22:01:56 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Debian policy https://lwn.net/Articles/996480/ https://lwn.net/Articles/996480/ danpb <div class="FormattedComment"> The document in that repo has this text in at the top of it "Disclaimer: This draft is UNOFFICIAL and does not represent the point of view of the Debian Project.". So it doesn't actually seem to be an actual Debian policy, unless this text has been moved somewhere else and this repo with the disclaimer is an outdated copy ?<br> <p> None the less, I do like to see documents like this. The industry is awash with opinions/policies that are almost entirely driven by corporate vested interests whom want to push &amp; sell use of AI everywhere. There is not nearly enough analysis &amp; input from groups who are looking at it from an objective &amp; independent POV, without their judgement being clouded by a need to sell AI.<br> <p> </div> Thu, 31 Oct 2024 13:21:11 +0000 Training data, weights and model https://lwn.net/Articles/996284/ https://lwn.net/Articles/996284/ mathstuf <div class="FormattedComment"> <span class="QuotedText">&gt; (Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)</span><br> <p> Asimov had "robopsychologists" doing things just like that.<br> </div> Wed, 30 Oct 2024 12:54:49 +0000 Training data, weights and model https://lwn.net/Articles/996282/ https://lwn.net/Articles/996282/ kleptog <div class="FormattedComment"> I think that for most people, 3a (the model state) is the "preferred form of modification" anyway. Having all the training data is a nice idea, but useless in practice because you can't meaningfully review all the data anyway. And just retraining with the same data doesn't give you anything new. Just about anything you might want to do you can do with just the weights: fine tuning for the results you want.<br> <p> The case for training on private datasets (like medical data) is a different issue. Obviously you can only train an AI to detect lung cancer if you have access to datasets with information about people who may have lung cancer. Ideally you'd like such datasets to be available on a Reasonable and Non-Discriminatory basis. This sort of access is already available for other purposes, I don't see how AI requires anything new here.<br> <p> (Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)<br> </div> Wed, 30 Oct 2024 12:45:40 +0000 Cost of training a model https://lwn.net/Articles/996217/ https://lwn.net/Articles/996217/ paulj <div class="FormattedComment"> The human brain is capable of some pretty sophisticated training with just thousands of Joules, and really profound learning with MWs, so.... the day will come.<br> <p> (We may run into some profound ethical issues with ML before that day, but that's another topic).<br> </div> Tue, 29 Oct 2024 17:16:59 +0000 ChatGPT says... 
https://lwn.net/Articles/996215/ https://lwn.net/Articles/996215/ paulj <div class="FormattedComment"> <span class="QuotedText">&gt; Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.</span><br> <p> This isn't quite true, depending on what you mean.<br> <p> a) You can start from scratch, just start with random parameters again, and train with your own data. I.e., just don't use the parameter set.<br> <p> b) If there are specific things in the trained parameters that you don't like/want, you can /keep/ training the model but apply your own objective function to penalise the thing you don't like. Essentially "correcting" whatever it is you don't like.<br> <p> The challenge with b may be that you don't know what is in there that you don't want. However, you will probably have the same problem if you have the full input data. The model is basically giving you the training data in a digested form that at least lets you query it, though.<br> <p> That "correcting" process is something all the big players with public AI tools surely must have /big/ teams of people working on: just filtering out whatever "fruity" content happens to be discovered in public LLMs and generative AIs. Teams manually looking for "dodgy" content, to then come up with the widest set of prompts that spit out that dodgy content (in all its combinations), to then pass on to other teams to create refinement training cases and test-cases. So that we get the safely sanitised LLMs / models suitable for public use.<br> </div> Tue, 29 Oct 2024 17:13:57 +0000 Training data, weights and model https://lwn.net/Articles/996210/ https://lwn.net/Articles/996210/ paulj <div class="FormattedComment"> Wish I could edit. Strike the last question: if I can query it I can (slowly) access the data obviously. ;)<br> <p> I was trying to think of some scenario where the training data would be too sensitive to distribute, but then... query access has the same sensitivity, so.. unclear if that scenario would ever plausibly interact with Free Software issues. I can't think of a good example anyway.<br> </div> Tue, 29 Oct 2024 15:34:28 +0000 Training data, weights and model https://lwn.net/Articles/996207/ https://lwn.net/Articles/996207/ paulj <div class="FormattedComment"> To start off with, I think data-sets for training are obviously incredibly important. I think having open data sets available for training fully-reproducible and open "AI" models will be incredibly important to society, sooner rather than later. I think it will be so important that ultimately societies will do 1 or both of: <br> <p> 1. Provide public funding for projects to gather, collate, store and make available data for public use (in various arenas); <br> 2. Regulate the use of large-scale corporate data-sets for training, potentially making provisions for fair-use access under reasonable terms<br> <p> That said, I have to take issue with a common theme in some of the objections that suggest that a piece of software can not be considered open if some significant data inputs to that software are not open, e.g.
as per Tom Callaway's suggestion:<br> <p> <span class="QuotedText">&gt; "If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts."</span><br> <p> Or the suggestion that ANN software is not open, even if different sets of weights are available, if input data is not open, e.g.:<br> <p> <span class="QuotedText">&gt; For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. </span><br> <p> The former suggestion, that at least all significant possible input data to some software must be open, for the software to be considered open, is not something the Free Software world has stood fast on before. We did not require that every possible C programme must be open for GCC to be considered open. There are a number of Free Software game engines, for which game-data packs exist that are proprietary - e.g., often the original game that spawned the game engine, along with other proprietary games using the same engine. We don't consider a GPL Quake or Descent game engine to not be Free Software because there are game packs that are non-free, do we?<br> <p> The latter suggestion, that an ANN is not trainable without the original input data, is technically wrong. The ANN is eminently trainable. At worst, you start with random weights. If you've been given a set of weights from a previous training run, you've actually got a _lossily compressed form_ of the original data-set, that will be a much much better starting point than otherwise (given you appear to be interested in the original data set).<br> <p> We need to be clear there are a number of different components here:<br> <p> 1. The ANN software<br> 2. The model specification (if not hard-coded in 1)<br> 3a. The model state (the parameters and activation states)<br> 3b. The input training data that created 3a<br> <p> 3a is a lossily compressed form of 3b. <br> <p> 1 is useful on its own.<br> 2 is useful on its own (either described in some DSL to be used by some instance of 1, or described in literature)<br> 1 + 2 are very useful.<br> <p> 1 + 2 + 3a are very very useful, and you can add your own input data set and train further from there. I.e., you are in a much better position starting your training from here, than from 3b, given you are demanding 3b to be open to you. 3a saves you a *lot* of compute!<br> <p> 3b will ultimately be required in many contexts for social reasons, as I mention at the start, but I see that as a different issue and one that will be solved by public means. <br> <p> I think it will be extremely hard to /require/ that 3b is made available, just for logistical reasons, given the vast volume of data - even if the input data is already public. And if the data is already public (e.g. web crawls), you can assemble it yourself. Obviously 1 and 2 can and should be publishable under Free Software licences, without demanding that all cases of 3b must be made available under a Free Software licence. <br> <p> So it comes down to whether someone should be able to publish a 3a, while not publishing 3b, and still be able to claim some "Open Source" label. If 3b is merely collated from public data, and the methodology to replicate that collation is obvious (web scraping, say) or otherwise described, perhaps there should be some accommodation that allows such 3a data-sets to be distributed along with Free Software by Free Software entities?
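<br> <p> To make the 1 / 2 / 3a / 3b split concrete, here is a minimal, purely illustrative sketch - assuming PyTorch, with a toy TinyNet architecture, made-up file names, and random tensors standing in for real data, none of it taken from any actual release - of loading a published 3a and training further on your own data, versus starting from random parameters:<br>
<pre>
import torch
import torch.nn as nn

# 1. The ANN software: PyTorch plus this script.
# 2. The model specification: the architecture defined below.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.layers(x)

model = TinyNet()

# 3a. The model state: without it you start from random weights; with it you
# load the published parameters and continue from there.
# model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file

# 3b (or your own substitute data): anything you can feed the optimiser.
# Random tensors stand in for a real data set in this sketch.
inputs, targets = torch.randn(64, 16), torch.randint(0, 2, (64,))

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):  # "train further from there"
    optimiser.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimiser.step()

torch.save(model.state_dict(), "my_weights.pt")  # your own 3a
</pre>
The published 3a is simply a better starting point; the training loop runs the same whether or not you ever see 3b.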
<br> <p> For other cases, where 3a is a distillation of some other data-set that cannot be replicated from public data, should this be treated as code or data? Data is what it is really. What if 3a is a distillation of personal data that could not be distributed otherwise under other laws (e.g. GDPR)? Should we say that no Free Software entity should be able to distribute a useful data-set, that could be used to run some Free Software LLM and do useful things for users?<br> <p> What if the 3a lossy-compressed data is the /only/ way that the 3b data-set could be legally distributed anywhere in the world? E.g., 3b is personal medical data, and 3a is the model state of an ANN that uses homomorphic-encryption precisely to ensure that the very sensitive data of 3b can still be distilled to a compressed data set of 3a that can then be used by ANNs around the world? Is Free Software to be cut off from such advances forever?<br> </div> Tue, 29 Oct 2024 15:27:43 +0000 Civil vs Common law https://lwn.net/Articles/996170/ https://lwn.net/Articles/996170/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training.</span><br> <p> That might work for Common Law systems like the US, but in Civil Law systems (most of Europe at least) the courts generally can't make these kinds of major policy changes and will punt the problem to the legislature to sort out. The EU AI Act 2024 covers all the stuff that everyone could agree on already (e.g. training for non-commercial use is fine). The question of what to do about training data for commercial use is the subject of much political debate. The EU executive/legislature is currently in a look and see mode to see if businesses can work out an agreement amongst themselves and are only likely to step in if it looks like someone is using their market power illegitimately, or we start seeing serious negative public policy impacts.<br> </div> Tue, 29 Oct 2024 10:03:25 +0000 Why not two licences? https://lwn.net/Articles/996100/ https://lwn.net/Articles/996100/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous.
It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.</span><br> <span class="QuotedText">&gt;</span><br> <span class="QuotedText">&gt; I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.</span><br> <p> I generally agree that the anti-AI artists are on the wrong side here (or at least, they have not demonstrated to my satisfaction that their absolutist position holds water in every possible circumstance).<br> <p> But I think it would be helpful to take a step back and avoid vilifying a group of people whose industry is currently undergoing a serious level of tech-induced upheaval. People are scared and worried for their economic future, and I think they do have a right to be upset and to advocate for their self-interest, even as I disagree with them.<br> <p> IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training. It is most plausible to me that that will end up being a fact-specific inquiry which requires individualized analysis of both the model and the specific training work whose copyright is allegedly infringed.* Frankly, I'm not entirely convinced that any version of open source AI would be viable in such a regime - but the point is that artists posting angry messages on Xitter are probably not going to change that outcome one way or the other.<br> <p> Disclaimer: I work for Google, which does some GenAI stuff that I'm not involved with. Opinions are my own, speculation about the future is quite possibly wrong.<br> <p> * For example, The New York Times provided evidence (in their lawsuit against OpenAI) that ChatGPT can sometimes produce verbatim or near-verbatim copies of their columns. I could imagine courts wanting to see evidence similar to that before entering a finding of infringement.<br> </div> Mon, 28 Oct 2024 20:00:24 +0000 Why not two licences? https://lwn.net/Articles/996010/ https://lwn.net/Articles/996010/ bud <div class="FormattedComment"> This is exactly what I thought. The most important thing seems to be clarity/transparency. And a binary state (open source or not) may not be expressive enough. An interesting analogy may be the CC licenses. And they even use icons to create transparency at first sight..<br> </div> Mon, 28 Oct 2024 13:45:36 +0000 Cost of training a model https://lwn.net/Articles/995997/ https://lwn.net/Articles/995997/ farnz <blockquote> Of course, then there's still the practical matter of actually being able to run the (full) training.
This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially. </blockquote> <p>I'm leery of arguments based on the cost of training a model, because they generalise to "no software is open source unless you're willing to donate a working system to anyone who wants to build it". Instead, I'd prefer to see arguments around both the reproducibility of training (since if training is not reproducible, a skilled hacker is unable to test whether their optimized training process produces the same results in under 1% of the cost of the original) and the incompleteness of training data. <p>Put another way: if it currently costs $100m in cloud GPU costs to train a model, but a clever person is able to work out how to do the same training on a $500 laptop for $100 in electricity, I want that clever person to be able to show that their technique works exactly as well as the original, and not to be held back by the lack of open models and training sets they can use to show off their technique. Mon, 28 Oct 2024 08:54:10 +0000 Re: Response to criticisms https://lwn.net/Articles/995994/ https://lwn.net/Articles/995994/ mjg59 <div class="FormattedComment"> It's code that you have no permission to modify or redistribute, which is the generally accepted definition of proprietary software.<br> </div> Mon, 28 Oct 2024 07:45:31 +0000 Re: Response to criticisms https://lwn.net/Articles/995991/ https://lwn.net/Articles/995991/ WolfWings <div class="FormattedComment"> ...proprietary javascript? Sorry but how does that even occur? It's plain text, downloaded to your browser, at worst a little obfuscated and often minified but still readable generally with basic filters.<br> <p> Or do you just mean it required javascript enabled at all to function and post comments?<br> </div> Mon, 28 Oct 2024 07:23:29 +0000 Why not two licences? https://lwn.net/Articles/995972/ https://lwn.net/Articles/995972/ neggles <div class="FormattedComment"> The AI/ML industry already has names for these two types of model: Open Weights for the former, and Open Data for the latter. Both are subtypes of "Open Source", just with differences in which components are openly available.<br> <p> Contrary to popular belief, merely having access to the dataset doesn't actually give you the ability to reproduce a model.
For example, the dataset for Stable Diffusion 1.5 is known and publicly available, the preprocessing steps taken are known, and the training process is known, but any attempt to exactly recreate it is unlikely to succeed due to changes in hardware, specific step counts, underlying library bug fixes and behaviour changes, specific corrupted images, etc.<br> <p> Meta and OpenAI and MidJourney and many others have made newer, better variants of the same concept/using similar architectures, but nobody's made a reproduction, and not for lack of trying.<br> <p> That said, why would you want to?<br> <p> If you want to significantly improve the model in a general sense you'll need to adjust the architecture, which necessitates retraining it from scratch anyway - and you'll likely want to replace the dataset with one more suited to your desired use case.<br> <p> If you just want to make it good at generating some specific type of image, finetuning on top of the existing base model weights requires orders of magnitude less compute and a completely different domain-specific dataset anyway.<br> <p> This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.<br> <p> I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.<br> <p> There is no real benefit to requiring open datasets, and a lot of massive downsides; it would effectively kill open source AI because no company would risk the potential legal exposure from doing such a thing - even if the data *doesn't* have copyright issues, proving that in court could be rather difficult and time-consuming. Copyright trolling is just as real of a thing as patent trolling - just ask anyone who's been subject to frivolous DMCA takedown requests.<br> </div> Sun, 27 Oct 2024 23:12:00 +0000 NN training is deterministic (enough) https://lwn.net/Articles/995957/ https://lwn.net/Articles/995957/ ghodgkins <div class="FormattedComment"> <span class="QuotedText">&gt; a non-deterministic system (neural nets)</span><br> <p> In my understanding, training a neural net is deterministic in the sense that matters for reproducibility. If you train the same model architecture in the same environment with the same data, you'll get the same final weights. This is true even if you draw from random distributions during training, as long as you choose the same seed(s) for the PRNG.<br> <p> The input-output mapping of the trained model is usually also deterministic, except for some special-purpose stochastic models.
Even those you may be able to make reproducible by fixing the PRNG seed, as above.<br> <p> <span class="QuotedText">&gt; The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.</span><br> <br> It is not true that the weights "owe nothing" to the training data. As mentioned above, for a fixed PRNG seed, they are in fact a very complex closed-form function of that data - certainly "dependent" in the probability sense. <br> <p> <span class="QuotedText">&gt; By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society.</span><br> <p> I think it's reasonable to have different expectations for software tools and the people that use them, and honestly kind of absurd not to. <br> <p> <span class="QuotedText">&gt; The very idea is absurd.</span><br> <p> For humans, certainly. One key difference between humans and NNs here is that NNs have a thing called "training" with well-defined inputs and output, in a consistent and well-defined format, which makes enumerating the training data entirely feasible.<br> <p> <span class="QuotedText">&gt; We cannot possibly know everything that is going on in your head</span><br> <p> But we can know everything that is going on inside a NN, although we may not be able to interpret it with respect to the inputs and outputs.<br> </div> Sun, 27 Oct 2024 16:10:10 +0000 Why not two licences? https://lwn.net/Articles/995948/ https://lwn.net/Articles/995948/ burki99 <div class="FormattedComment"> How about Open Code AI vs Open Data AI. That would keep the ambiguous term Source out of labels.<br> </div> Sun, 27 Oct 2024 08:00:20 +0000 "If you're explaining, you're losing." https://lwn.net/Articles/995939/ https://lwn.net/Articles/995939/ NightMonkey <div class="FormattedComment"> "If you're explaining, you're losing."<br> <p> FYI, that "old political adage" was from U.S. President Ronald Reagan. Seems odd to see that used here.<br> </div> Sun, 27 Oct 2024 02:50:06 +0000 The preferred form for modifying ML systems' https://lwn.net/Articles/995925/ https://lwn.net/Articles/995925/ somlo <div class="FormattedComment"> Speaking of the CMU "endorsement", note that the largest percentage of signatories seem to be librarians, followed by business school faculty (Tepper), and a single member of the School of Computer Science faculty (specializing in "Ethics and Computational Technologies").<br> <p> Conspicuously absent are any of the AI/ML "heavy hitters", so I'm not really sure what conclusion one should draw from that...<br> </div> Sat, 26 Oct 2024 22:48:27 +0000 Why not two licences? https://lwn.net/Articles/995922/ https://lwn.net/Articles/995922/ tzafrir <div class="FormattedComment"> The Debian ML policy (linked above) uses the name "ToxicCandy" for what you name here "Pragmatic".<br> </div> Sat, 26 Oct 2024 21:26:42 +0000 Why not two licences? https://lwn.net/Articles/995923/ https://lwn.net/Articles/995923/ josh <div class="FormattedComment"> Sounds great. For accuracy and consistency with widespread industry understanding of the term "Open Source", let's call the first one "proprietary AI", and the second one "Open Source AI".<br> </div> Sat, 26 Oct 2024 21:26:29 +0000 Everything is not a nail... 
https://lwn.net/Articles/995920/ https://lwn.net/Articles/995920/ khim <font class="QuotedText">&gt; When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever.</font> <p>That's an extremely rare occurrence.</p> <font class="QuotedText">&gt; When a computer bayesian machine messes up, it's too expensive to retrain.</font> <p>That could be true for LLMs, today, but many smaller-scale AI models are retrained from scratch routinely.</p> <p>Pretty soon there will be more LLMs retrained from scratch than the 100 or 200 billion people that ever lived on this planet.</p> <p>Would that mean that LLMs would achieve parity with humans, when that happens? Hard to predict… my gut feeling is that no, that wouldn't happen – but not because encounters with lions or wildebeests made humans that much better, but because nature invented a lot of tricks over billions of years that we don't know how to replicate… yet.</p> <font class="QuotedText">&gt; I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank</font> <p>It could. In fact, today AI already makes fewer mistakes than humans on such tasks.</p> <p>Yes, AI makes <b>different</b> mistakes, but its average precision is better.</p> <font class="QuotedText">&gt; I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...</font> <p>We just tend to dismiss the issues with “optical illusions” and exaggerate AI mistakes. To feel better about ourselves, maybe?</p> <p>When you are shown a picture where the exact same color looks “white” in one place and “black” in another place, you don't even need AI to reveal the difference, a simple sheet of paper is enough – but that's not perceived as a human brain deficiency because we all are built the same and all fall prey to <b>the exact same illusion</b>.</p> <p>AIs are built differently and they are fooled by <b>different</b> illusions than what humans misperceive – and that difference is, in our arrogance, treated as “human superiority”.</p> <p>It's only when AI starts doing things so much better than humans, beats humans so supremely decisively, when there is no doubt that AI is “head and shoulders” above humans for this or that task… only then do humans… redefine this particular task as “not really important”. Like it happened with go and chess, among other things.</p> Sat, 26 Oct 2024 17:31:59 +0000 Why not two licences? https://lwn.net/Articles/995919/ https://lwn.net/Articles/995919/ jzb <p>License discussions are no place for reasonable suggestions. :)</p> <p>In seriousness, though - that's not a terrible idea. I'm not sure I'd go with "Pragmatic" and "Strict" as the names, but something that indicates "everything but the data is available" and "this is the gold standard of openness, including open data" with the naming reflecting that <em>might</em> satisfy some of the folks unhappy with tagging anything less than fully open as "open source".</p> Sat, 26 Oct 2024 15:55:23 +0000 Why not two licences?
https://lwn.net/Articles/995916/ https://lwn.net/Articles/995916/ burki99 <div class="FormattedComment"> It seems rather obvious: There are two views on this topic, both reasonable, but fundamentally incompatible.<br> <p> Therefore I would suggest coming up with <br> a) Pragmatic Open Source AI<br> b) Strict Open Source AI<br> <p> a) would be the one suggested by OSI, b) would additionally require the training data to be available (which implies an open content license that allows redistribution of the training data).<br> <p> <p> </div> Sat, 26 Oct 2024 15:26:19 +0000 Re: Response to criticisms https://lwn.net/Articles/995912/ https://lwn.net/Articles/995912/ IanKelling <div class="FormattedComment"> A discussion forum opened many months after they already published a draft definition. For a long time, it was just a web comment system requiring proprietary javascript, which is still there. And yes, I made sure OSI knew from the beginning that they published a draft definition which suggested only one way to comment publicly, which excluded people who strongly believed in principles which were part of the definition.<br> <p> The CMU statement which calls it a "participatory co-design process" which was "inclusive" makes me roll my eyes. It was drafted privately by the OSI board in some undisclosed way.<br> </div> Sat, 26 Oct 2024 14:39:47 +0000 Everything is not a nail... https://lwn.net/Articles/995910/ https://lwn.net/Articles/995910/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; Today those qualia are small, so it is easy to dismiss their subjectivity. Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.</span><br> <p> And herein lies the entire problem with AI, imho.<br> <p> The qualia (whatever that means) of the human brain is much LESS than we give it credit for. But it is QUALITATIVELY DIFFERENT from what we call AI and LLM and that junk. AI and LLMs are bayesian inference machines. Human brains have external forces imposing a (not necessarily correct) definition of "right" and "wrong".<br> <p> When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever. When a computer bayesian machine messes up, it's too expensive to retrain.<br> <p> I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank ... (I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...)<br> <p> Cheers,<br> Wol<br> </div> Sat, 26 Oct 2024 13:11:20 +0000 Re: Response to criticisms https://lwn.net/Articles/995902/ https://lwn.net/Articles/995902/ rettichschnidi <div class="FormattedComment"> <span class="QuotedText">&gt; and post deletion</span><br> <p> Can confirm. While my posts (<a href="https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20OSI%20Screenshot.png">https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20O...</a>) might have been below their quality expectation, having them marked as "Pending" (for review) and then silently dropped (deleted) after ~2 days seems not ok to me. At least some kind of feedback/notification would have been appreciated.<br> </div> Sat, 26 Oct 2024 07:53:10 +0000 ChatGPT says...
https://lwn.net/Articles/995901/ https://lwn.net/Articles/995901/ mirabilos <div class="FormattedComment"> As long as they participate in Reproducible Builds and an independent party has reproduced the exact same model from the same inputs once, it’s at least not totally unpalatable if not everyone has the pockets to regenerate it from scratch in practice.<br> </div> Sat, 26 Oct 2024 07:02:28 +0000 Re: Response to criticisms https://lwn.net/Articles/995898/ https://lwn.net/Articles/995898/ mirabilos <div class="FormattedComment"> The OSI has actively ignored (including at least zack, who even was an OSI board director) or even silenced some criticism, including discussion forum bans and post deletion.<br> <p> It certainly has lost any and all credibility with these actions and their whole procedere, which was designed to cater to corporate interests over open source principles right from the beginning (leading questions, lack of survey response possibilities that disagree, etc).<br> <p> Many people certainly don’t want those theft machines blessed by something they identify with.<br> <p> (I will, for one, *not* enable reply notifications, nor engage further here. I’ve written up on my webservers and discussed on the OSI MLs and in Fedi all that needs to be said.)<br> </div> Sat, 26 Oct 2024 06:59:03 +0000 Debian policy https://lwn.net/Articles/995899/ https://lwn.net/Articles/995899/ pabs <div class="FormattedComment"> I prefer the Debian policy about libre AI:<br> <p> <a href="https://salsa.debian.org/deeplearning-team/ml-policy">https://salsa.debian.org/deeplearning-team/ml-policy</a><br> </div> Sat, 26 Oct 2024 06:57:58 +0000 Everything is not a nail... https://lwn.net/Articles/995895/ https://lwn.net/Articles/995895/ shironeko <div class="FormattedComment"> I hope you are not letting others modify and redistribute your brain. By any definition, your mind is proprietary to you.<br> </div> Sat, 26 Oct 2024 05:12:17 +0000 ChatGPT says... https://lwn.net/Articles/995892/ https://lwn.net/Articles/995892/ jmspeex <div class="FormattedComment"> Absolutely. All of the ML models that ended up in Opus 1.5 (https://opus-codec.org/demo/opus-1.5/) can be trained on a single GPU. And there's still people using the original RNNoise I trained in 2017 on a laptop *CPU*. There's tons of useful things you can do with DNNs that are 1M weights or smaller. <br> </div> Sat, 26 Oct 2024 04:36:24 +0000 Everything is not a nail... https://lwn.net/Articles/995883/ https://lwn.net/Articles/995883/ chuckwolber <div class="FormattedComment"> It was relatively easy to create a definition of openness that applies cleanly to deterministic systems like compiled software. It seems like we are trying to apply the same definition to a non-deterministic system (neural nets) and we keep failing to notice the real problem.<br> <p> If I listen to music that ultimately influences my style as a musician, I am no less free when I play in my heavily influenced style and the music that influenced me is no less encumbered by copyright. The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.<br> <p> Today those qualia are small, so it is easy to dismiss their subjectivity.
Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.<br> <p> By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society. The very idea is absurd.<br> <p> We invented a shortcut to that problem a long time ago - it is called a "social contract". We cannot possibly know everything that is going on in your head, but we can set forth expectations and apply relevant penalties.<br> <p> I propose we rethink the OSAI in the same way.<br> </div> Sat, 26 Oct 2024 01:40:50 +0000 ChatGPT says... https://lwn.net/Articles/995872/ https://lwn.net/Articles/995872/ rettichschnidi <div class="FormattedComment"> <span class="QuotedText">&gt; Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.</span><br> <p> True for LLMs, at least as of today. But there are many (smaller) ML/"AI" applications that can be trained by SME and even enthusiasts already today.<br> </div> Sat, 26 Oct 2024 00:01:51 +0000 ChatGPT says... https://lwn.net/Articles/995860/ https://lwn.net/Articles/995860/ leromarinvit <div class="FormattedComment"> Given the topic, asking an AI for its thoughts on the matter seemed apt. So I prompted ChatGPT (which is, of course, about as far removed from "Open Source AI" as seems possible):<br> <p> <span class="QuotedText">&gt; Given the following article, write a comment about whether or not training data should be included in the OSAID. Explain why, either way.</span><br> <span class="QuotedText">&gt; [...]</span><br> <p> It seems it wants to see training data included in the definition:<br> <p> <span class="QuotedText">&gt; Whether training data should be included in the Open Source AI Definition (OSAID) is a crucial issue, and I believe it should be required, not optional. Leaving training data out of the definition weakens the foundational principles of open source, particularly the ability to fully understand, modify, and build upon AI systems.</span><br> <p> <span class="QuotedText">&gt; Training data is central to how AI models are developed, and it plays a significant role in shaping the behavior and biases of the system. Without access to the data—only providing "detailed information"—users are restricted in their ability to recreate or meaningfully modify the system. This exclusion limits transparency and undermines the goal of fostering collaboration and innovation, which are core values of the open-source community.</span><br> <p> <span class="QuotedText">&gt; While there are valid concerns about the ethical and legal challenges of releasing training data, such as privacy issues and proprietary data restrictions, these challenges should be addressed case-by-case rather than lowering the bar for what qualifies as open source. 
By allowing AI systems to be considered "open" without sharing the training data, the OSAID risks diluting the meaning of open source and setting a precedent where essential components of AI systems can remain closed while still benefiting from the open-source label.</span><br> <p> <span class="QuotedText">&gt; In short, if the goal is to preserve the spirit of open source in AI, training data should not be optional, as omitting it compromises transparency, accountability, and the ability to fully exercise the four freedoms that open source aims to protect.</span><br> <p> ;-)<br> <p> While I happen to agree with the conclusion, the reasoning seems more like a summary of the article to me, instead of actual independent thought (well, duh - it's an LLM, it can't think). In particular, what stands out to me is that with source code, I can not only add new features or modify existing ones, but also remove parts entirely. This seems impossible with the kind of iterative modification that seems to be the only kind taken into account here (to my admittedly very limited understanding of the matter).<br> <p> Or, more to the point:<br> <span class="QuotedText">&gt; training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.</span><br> <p> Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.<br> <p> Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.<br> </div> Fri, 25 Oct 2024 22:31:24 +0000 modification https://lwn.net/Articles/995862/ https://lwn.net/Articles/995862/ shironeko <div class="FormattedComment"> Maybe the supporters of the definition could demonstrate practically modifying an ML model without using the original training data, and show that it is just as easy as with the original data and it does not limit what you can do with it (e.g. demonstrate it can unlearn any parts of the original data as if they were not used).<br> <p> Personally, it would seem that, unfortunately for some, Stallman was right yet again.<br> </div> Fri, 25 Oct 2024 22:17:17 +0000 The preferred form for modifying ML systems' https://lwn.net/Articles/995864/ https://lwn.net/Articles/995864/ ballombe <div class="FormattedComment"> <span class="QuotedText">&gt; Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: ""it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment"</span><br> <p> Where does the CMU comment say anything about 'the preferred form for modifying ML systems'?<br> </div> Fri, 25 Oct 2024 22:15:24 +0000