
Distributions quote of the week

Maybe we've been ethical hypocrites all along about machine learning applications packaged in Debian, and the current LLM craze is a good opportunity to clean house and reaffirm a strict free software policy including training data. I'm rather sympathetic to that argument, frankly, just because the simplicity of the "source code for everything, no exceptions" position is comfortable in my brain. But we should be fairly sure about what we're agreeing to before making that decision.
Russ Allbery


What if the training data never made it out of RAM?

Posted May 1, 2025 3:06 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

> 4. As you discovered, finding the training data, even when upstream has
> retained it (which I suspect will not always be the case, since I
> expect in at least some cases upstream would just start over if they
> wanted to retrain the model and therefore would view at least some of
> the training data as equivalent to ephemeral object files they would
> discard), is not going to be easy since almost no one cares. This is of
> course not a new problem in free software, and we have long experience
> with telling upstreams that no, we really do care about all of the
> source code, but it is incrementally more work of a type that most
> Debian packagers truly dislike doing.

For a personal project (one that was never publicly distributed, and most likely never will be, for various irrelevant reasons), I once trained a poker AI by having it play many simulated rounds against itself. As it happens, I didn't really need machine learning at all: I was only using this "model" to estimate the win probability, which can be computed exactly. But I was lazy and didn't feel like working through all the complicated conditional probability formulas, so I just wrote a big loop and had the computer pre-calculate win/loss outcomes against a set of heuristics, which turned out to work perfectly well in practice. I didn't even properly optimize it; it was written in pure Python (and did not use NumPy or anything resembling NumPy).

Until I read this message, I would not have *considered* retaining a log of exactly what every single one of those simulated games looked like. It simply would not have occurred to me that anyone, myself included, could possibly want or need such a thing. My code updated a bunch of heuristic win counts in an SQLite database, and made no attempt to persist the cards that were dealt. I figured that if I ever needed it to be more precise than that, I would sit down at the whiteboard, work it all out symbolically, and throw the entire database away.
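
A minimal sketch of what such a loop might look like, in spirit only (the details here are invented for illustration: a toy high-card game stands in for poker, and the schema, file name, and bucketing heuristic are all made up). The point is that only aggregate win/loss counts per heuristic bucket ever reach the SQLite database; the individual deals are never persisted.

    # Illustrative sketch only: a toy high-card game stands in for poker.
    # Aggregate win/loss counts per heuristic bucket go into SQLite;
    # the individual simulated deals are never recorded.
    import random
    import sqlite3

    RANKS = list(range(2, 15))                 # 2..14, where 14 is the ace
    DECK = [(rank, suit) for rank in RANKS for suit in "CDHS"]

    def simulate(db_path="winrates.db", rounds=100_000):
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS outcomes ("
            " bucket INTEGER PRIMARY KEY,"     # heuristic key: the hero's rank
            " wins   INTEGER NOT NULL,"
            " losses INTEGER NOT NULL)"
        )
        for _ in range(rounds):
            hero, villain = random.sample(DECK, 2)
            bucket = hero[0]                   # collapse the deal to its heuristic bucket
            # Ties count as losses, arbitrarily; column is always one of two
            # fixed names, so interpolating it into the SQL is safe here.
            column = "wins" if hero[0] > villain[0] else "losses"
            conn.execute(
                "INSERT OR IGNORE INTO outcomes (bucket, wins, losses) VALUES (?, 0, 0)",
                (bucket,),
            )
            conn.execute(
                f"UPDATE outcomes SET {column} = {column} + 1 WHERE bucket = ?",
                (bucket,),
            )
        conn.commit()
        conn.close()

    def win_probability(bucket, db_path="winrates.db"):
        conn = sqlite3.connect(db_path)
        row = conn.execute(
            "SELECT wins, losses FROM outcomes WHERE bucket = ?", (bucket,)
        ).fetchone()
        conn.close()
        wins, losses = row if row else (0, 0)
        return wins / (wins + losses) if wins + losses else 0.5

    if __name__ == "__main__":
        simulate()
        print(win_probability(14))             # estimated win rate when holding an ace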

Almost always copyrighted ???

Posted May 1, 2025 7:06 UTC (Thu) by Wol (subscriber, #4433) [Link] (9 responses)

> Right now, people are mostly thinking about LLMs, which
> are trained on large amounts of writing, which is almost always
> copyrighted because it's one of the core types of artistic creativity
> recognized by copyright laws.

Except that HUGE amounts of such material are NOT copyrighted, because they predate modern copyright law. The status of material from the last century is murky, but everything before that is not under copyright, to the best of my knowledge ...

Cheers,
Wol

Almost always copyrighted ???

Posted May 1, 2025 7:20 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (6 responses)

Do you really want an LLM that talks like it's from the 19th century? There are probably some use cases for that, but it's not what the average user has in mind.

Almost always copyrighted ???

Posted May 1, 2025 8:20 UTC (Thu) by Wol (subscriber, #4433) [Link] (4 responses)

Dickens? Jane Austen (18th century)? Dracula? Shakespeare (16/17th century)? Do I really want an LLM that's never heard of classic English Literature?

(Okay, with computing maybe, but even then mechanical computers (Babbage et al) go back many centuries ...)

And while the vernacular may change over a span of 20 years or so (1940s English has a recognisable style, I'm sure other eras do too), the last MAJOR shift in the English language came with Shakespeare - he is easily understood by modern ears, while anything pre-Elizabethan feels noticeably strange.

Cheers,
Wol

Almost always copyrighted ???

Posted May 1, 2025 10:33 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

It's not just a matter of language. It's a matter of cultural and social context.

Imagine asking such an LLM about women's role in the workplace, the political structure of the British Commonwealth, or literally any celebrity who became famous after the 1930s or thereabouts. At best, its answers will be hilariously out of date, if they are even coherent.

Taking the Commonwealth as an example, the (relevant) Balfour Declaration was in 1926, the US copyright cutoff year is currently 1930, and the UK does not have a relevant cutoff date because it moved to life+50 in 1911. So if we restrict ourselves to public domain training materials, we have at most four years' worth of source material, mostly written by Americans in the late 1920s, when the British Empire was still a thing and World War II was a decade away. That is not a recipe for accurate information about the state of the British Commonwealth in 2025, or for that matter 1945.

Almost always copyrighted ???

Posted May 1, 2025 11:20 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

> It's not just a matter of language. It's a matter of cultural and social context.

And?

The initial comment was along the lines of "all the material is copyrighted". Yes, most of the modern material is, but there is a HUGE corpus that isn't. And any decent LLM needs both.

(And it was you who asked "why would we want old material?" Because, just as you said that questions about today would be hilariously inaccurate without the modern material, they are also likely to be hilariously inaccurate without the old material, for lack of information and historical context.)

"Unread comments" is a great tool in many ways, but when the original post disappears, the thread can go off the rails because context is lost, as appears to be the case here ...

Cheers,
Wol

Almost always copyrighted ???

Posted May 2, 2025 1:18 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

I do not use "unread comments," I simply felt that your first comment (at the top of the thread) was a complete non-sequitur. The quote you originally replied to was about general-purpose LLMs, not some obscure niche use case where you want an LLM role-playing as somebody from the 19th century.

Almost always copyrighted ???

Posted May 2, 2025 8:58 UTC (Fri) by Wol (subscriber, #4433) [Link]

The tl;dr of my first comment was "Just because something is copyrightABLE material doesn't mean it's copyrightED material".

Russ's quote basically assumed that the two were the same.

Cheers,
Wol

Almost always copyrighted ???

Posted May 1, 2025 22:34 UTC (Thu) by ejr (subscriber, #51652) [Link]

Steam punk goes LLM.

Almost always copyrighted ???

Posted May 3, 2025 1:08 UTC (Sat) by rra (subscriber, #99804) [Link] (1 responses)

Yeah, sorry, that's not what I meant, and I could have used a more precise phrasing.

My point was not that all writing is copyrighted, which is clearly not true. My point is that writing, the art form, is subject to copyright law, whereas things like records of chess games are not, because that's how the copyright system works. This means that not all machine learning models are created equal when one wants to analyze the copyright status of their training data.

There are some types of models you can train on repositories of pure facts that are generally not copyrighted, but the types of models that are trained on writing require at least some thought. It's certainly possible to construct copyright-free repositories of text, but that is not what you get by default. You have to put some effort into it (except, I guess, in some narrow cases such as public legal records).

Almost always copyrighted ???

Posted May 3, 2025 8:57 UTC (Sat) by Wol (subscriber, #4433) [Link]

Pedantics are us. :-)

I guess sticking in something like "liable to copyright" is the pedantic version :-)

Cheers,
Wol

