
Filesystems and case-insensitivity

Posted May 30, 2019 14:18 UTC (Thu) by smurf (subscriber, #17840)
In reply to: Filesystems and case-insensitivity by Serentty
Parent article: Filesystems and case-insensitivity

Hmm. If that rule had actually been followed, we'd still have room on the base plane (i.e. codepoints below 65536).

(How many primitives would you need for Chinese?)

On the other hand, in that case we wouldn't all be using UTF-8 by now – simply because it would have required roughly twice the storage space for Chinese text. Nowadays that doesn't really matter, but at the time it was a problem.
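To put a rough number on the storage argument, here is a minimal Python sketch. It uses an Ideographic Description Sequence (IDS) purely as a stand-in for a hypothetical "encode by primitives" scheme; no such scheme was ever standardized as the primary encoding, and the helper names `utf8_size` and `utf16_size` are made up for this illustration.

```python
# Illustration only: an IDS is used as a stand-in for a hypothetical
# primitive-based encoding of Chinese characters.
PRECOMPOSED = "好"      # U+597D, a single codepoint
DECOMPOSED = "⿰女子"   # U+2FF0 (left-to-right operator) + U+5973 + U+5B50


def utf8_size(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


def utf16_size(text: str) -> int:
    """Number of bytes the text occupies in UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le"))


for label, text in (("precomposed", PRECOMPOSED), ("decomposed", DECOMPOSED)):
    print(label, len(text), "codepoints,",
          utf8_size(text), "bytes UTF-8,",
          utf16_size(text), "bytes UTF-16")
```

Every codepoint involved is in the base plane, so each one costs 3 bytes in UTF-8 and 2 bytes in UTF-16; the cost grows with the number of codepoints per character. The exact factor would depend on how a real scheme chose its primitives, so treat this as an order-of-magnitude illustration, not a measurement.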



Filesystems and case-insensitivity

Posted May 30, 2019 14:54 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

https://en.wikipedia.org/wiki/Template:CJK_ideographs_in_... has a helpful list with the numbers of CJK codepoints, and I assume only the earliest ones were needed for legacy compatibility - the rest were presumably added because they couldn't already be represented. Recently Unicode 10.0 added "CJK Extension F" (7473 codepoints) so it seems they're still not finished. Then there's all the other scripts being added, like Tangut ("a major historic script of China") with another ~7000 codepoints. And about 1700 emojis (https://unicode.org/emoji/charts/emoji-list.html).

Maybe the 64K limit could have lasted for many more years if they had made some different design choices early on, but given the goal of being a universal standard for all text, it seems inevitable the limit would be broken eventually. It's better to have broken it earlier than later.
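What breaking the 64K limit costs in practice can be sketched in a few lines of Python. The function name `encoding_cost` is invented for this example; U+2CEB0 is taken as the first codepoint of CJK Extension F, and U+1F600 as a typical emoji.

```python
def encoding_cost(codepoint: int) -> dict:
    """Report whether a codepoint fits in the base plane and what it costs to encode."""
    char = chr(codepoint)
    return {
        "in_bmp": codepoint < 0x10000,                       # below 65536
        "utf8_bytes": len(char.encode("utf-8")),
        "utf16_units": len(char.encode("utf-16-le")) // 2,   # 2 means a surrogate pair
    }


for cp in (0x4E2D, 0x2CEB0, 0x1F600):
    print(f"U+{cp:04X}", encoding_cost(cp))
```

Anything at or above 65536 takes 4 bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16, while a base-plane ideograph such as U+4E2D takes 3 bytes and a single unit.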

Filesystems and case-insensitivity

Posted May 31, 2019 15:06 UTC (Fri) by smurf (subscriber, #17840) [Link]

True, that.

It seems that quite a few Chinese people with interesting names (i.e. names using archaic characters) suddenly couldn't get an official document any more because, surprise, their name wasn't in the "official" charset …

Filesystems and case-insensitivity

Posted May 31, 2019 18:37 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Technically, most CJK characters can be decomposed into simpler characters. About 70% of Chinese characters follow the "radical-phonetic" model and can theoretically be composed from those two parts.

Filesystems and case-insensitivity

Posted Jun 6, 2019 5:09 UTC (Thu) by Serentty (guest, #132335) [Link]

Encoding Chinese text based on primitives is the Holy Grail of Chinese text encoding, but no one has actually been able to come up with a workable solution for it, and it's probably just not realistic. Korean text, on the other hand, is really easy to encode based on primitives, as it's just 24 basic letters combined in predictable ways.
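The Korean case can be shown concretely. Unicode defines precomposed Hangul syllables arithmetically from conjoining jamo (19 leading consonants, 21 vowels, 27 optional trailing consonants), so decomposition is just integer division. The sketch below follows that arithmetic; the function name `decompose_hangul` is made up for this example, and the result is cross-checked against Python's own `unicodedata.normalize`.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if trail:                      # trail == 0 means no trailing consonant
        jamo += chr(T_BASE + trail)
    return jamo


syllable = "한"  # U+D55C
parts = decompose_hangul(syllable)
print([f"U+{ord(c):04X}" for c in parts])
print(parts == unicodedata.normalize("NFD", syllable))
```

Nothing comparable exists for Chinese: there is no fixed, small set of slots that every character fills, which is a large part of why the same trick doesn't carry over.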


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds