Almost always copyrighted ???

Posted May 1, 2025 8:20 UTC (Thu) by Wol (subscriber, #4433)
In reply to: Almost always copyrighted ??? by NYKevin
Parent article: Distributions quote of the week

Dickens? Jane Austen (early 19th century)? Dracula? Shakespeare (16/17th century)? Do I really want an LLM that's never heard of classic English Literature?

(Okay, with computing maybe, but even then mechanical computers (Babbage et al) go back many centuries ...)

And while the vernacular may change over a span of 20 years or so (1940s English has a recognisable style, I'm sure other eras do too), the last MAJOR shift in the English language came with Shakespeare - he is easily understood by modern ears, while anything pre-Elizabethan feels noticeably strange.

Cheers,
Wol



Almost always copyrighted ???

Posted May 1, 2025 10:33 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

It's not just a matter of language. It's a matter of cultural and social context.

Imagine asking such an LLM about women's role in the workplace, the political structure of the British Commonwealth, or literally any celebrity who became famous after the 1930s or thereabouts. At best, its answers will be hilariously out of date, if they are even coherent.

Taking the Commonwealth as an example, the (relevant) Balfour Declaration was in 1926, the US copyright cutoff year is currently 1930, and the UK does not have a relevant cutoff date because it moved to life+50 in 1911. So if we restrict ourselves to public domain training materials, we have at most four years' worth of source material, mostly written by Americans, in the late 1920s, when the British Empire was still a thing and World War II was a decade away. That is not a recipe for accurate information about the state of the British Commonwealth in 2025, or for that matter 1945.

Almost always copyrighted ???

Posted May 1, 2025 11:20 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

> It's not just a matter of language. It's a matter of cultural and social context.

And?

The initial comment was along the lines of "all the material is copyrighted". Yes, most of the modern material is, but there is a HUGE corpus that isn't. And any decent LLM needs both.

(And it was you who said "why would we want old material?" Because, just as you said that without modern material answers about today would be hilariously inaccurate for lack of information, without the old material answers about today are likely to be hilariously inaccurate for lack of information and historical context.)

"Unread comments" is a great tool in many ways, but when the original post disappears, the thread can go off the rails because context is lost, as appears to be the case here ...

Cheers,
Wol

Almost always copyrighted ???

Posted May 2, 2025 1:18 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 response)

I do not use "unread comments"; I simply felt that your first comment (at the top of the thread) was a complete non sequitur. The quote you originally replied to was about general-purpose LLMs, not some obscure niche use case where you want an LLM role-playing as somebody from the 19th century.

Almost always copyrighted ???

Posted May 2, 2025 8:58 UTC (Fri) by Wol (subscriber, #4433) [Link]

The tl;dr of my first comment was "Just because something is copyrightABLE material doesn't mean it's copyrightED material".

Russ's quote basically assumed that the two were the same.

Cheers,
Wol


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds