
Filesystems and case-insensitivity

Posted Nov 29, 2018 8:30 UTC (Thu) by nim-nim (subscriber, #34454)
Parent article: Filesystems and case-insensitivity

It's pretty sad how encoding problems have to be presented via a case-sensitivity bias before US devs even consider them. F-up all non-English languages of the world: NOT A PROBLEM. Mistreating English casing: HUGE PROBLEM.

Anyway, hope this gets fixed. The transition to UTF-8 was awful for *x filesystems; I sure hope there won't be a v2, with wide-encoding problems added to the mix, whenever UTF-8 gets deprecated in favour of something better.



Filesystems and case-insensitivity

Posted Nov 29, 2018 9:15 UTC (Thu) by dgm (subscriber, #49227) [Link] (8 responses)

Adding casing to the kernel is a sure recipe for intense pain *when* (not if) the next transition happens. And all for solving a non-existent problem. Oy vey.

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:38 UTC (Thu) by eru (subscriber, #2753) [Link] (7 responses)

> *when* (not if) the next transition happens

I would hope that is never. UTF-8 can represent all characters now in practical use. The main risk is emoji design going totally out of hand, with the insistence that each of them should have a Unicode code point... oh wait...

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:41 UTC (Thu) by chithanh (guest, #52801) [Link] (6 responses)

> I would hope that is never. UTF-8 can represent all characters now in practical use.

That is not correct. In particular, Unicode (and by extension UTF-8) is deficient regarding some characters in African languages, due to the Unicode consortium's policy regarding precomposed characters vs. combining diacritics. They don't want to introduce new equivalences.
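To make the policy concrete, here is a minimal Python sketch (assuming CPython's standard unicodedata module) of a fully accented Yoruba-style vowel. NFC can only compose pairs that Unicode already defines, so the letter never collapses to a single codepoint, and under the current policy no precomposed form will be added:

    import unicodedata

    # o + combining dot below + combining acute accent, as used in
    # Yoruba (an illustrative example of the African-language gap).
    s = "o\u0323\u0301"

    nfc = unicodedata.normalize("NFC", s)
    # o + dot below composes to the existing U+1ECD, but no single
    # codepoint exists for the full letter, so the acute accent
    # remains a separate combining character.
    print([hex(ord(c)) for c in nfc])   # ['0x1ecd', '0x301']
    print(len(nfc))                     # 2 -- no fully precomposed form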

Filesystems and case-insensitivity

Posted May 29, 2019 23:00 UTC (Wed) by Serentty (guest, #132335) [Link] (5 responses)

This is not a deficiency in Unicode. Precomposed characters such as É have only ever been encoded in Unicode as a matter of compatibility with legacy encodings, and wouldn't have been included otherwise. They continue to be used because they save a few bytes, which you might as well take even if compression makes it moot in the end.

Combining diacritics have always been the preferred method, as they are much more flexible and allow users to compose arbitrary characters without needing to constantly update their software or risk mojibake. Many scripts in Unicode work entirely through combining marks, and it works just fine; the Indic scripts are good examples. It should be noted that the legacy encodings for those scripts usually worked that way as well.

Conformant implementations treat composed and decomposed characters identically, so going down the rabbit hole of trying to provide every precomposed character anyone might ever want isn't really worth it when composition works just as well. If you notice that combining diacritics aren't giving you the nice hand-tweaked glyphs that precomposed characters do, and you end up with the diacritic looking all wrong, take it up with the developer of the text renderer or the font, because that's not how Unicode is supposed to work.
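The "conformant implementations treat them identically" point is easy to check in Python with the standard unicodedata module:

    import unicodedata

    precomposed = "\u00c9"   # É as a single legacy-compatible codepoint
    combining = "E\u0301"    # E followed by U+0301 COMBINING ACUTE ACCENT

    # Raw codepoint comparison sees two different strings...
    print(precomposed == combining)                                # False
    # ...but both normalization forms agree they are the same character.
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == combining)  # True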

Filesystems and case-insensitivity

Posted May 30, 2019 14:18 UTC (Thu) by smurf (subscriber, #17840) [Link] (4 responses)

Hmm. If that rule had actually been followed, we'd still have room in the Basic Multilingual Plane (i.e. codepoints below 65536).

(How many primitives would you need for Chinese?)

On the other hand, in that case we wouldn't all use UTF-8 by now – simply because it would require twice the storage space for Chinese text, more or less. Nowadays that doesn't really matter, but at the time it was a problem.
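For a rough check of the storage argument, a small Python comparison (common CJK characters sit in the three-byte UTF-8 range):

    s = "\u6c49\u5b57"  # 汉字, "Chinese characters"
    print(len(s.encode("utf-8")))      # 6 bytes: 3 per character
    print(len(s.encode("utf-16-be")))  # 4 bytes: 2 per character
    # So UTF-8 costs roughly 1.5x a 16-bit encoding for CJK text --
    # a real overhead at the time, if somewhat less than "twice".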

Filesystems and case-insensitivity

Posted May 30, 2019 14:54 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

https://en.wikipedia.org/wiki/Template:CJK_ideographs_in_... has a helpful list with the numbers of CJK codepoints, and I assume only the earliest ones were needed for legacy compatibility - the rest were presumably added because they couldn't already be represented. Recently Unicode 10.0 added "CJK Extension F" (7473 codepoints) so it seems they're still not finished. Then there's all the other scripts being added, like Tangut ("a major historic script of China") with another ~7000 codepoints. And about 1700 emojis (https://unicode.org/emoji/charts/emoji-list.html).

Maybe the 64K limit could have lasted for many more years if they had made some different design choices early on, but given the goal of being a universal standard for all text, it seems inevitable the limit would be broken eventually. It's better to have broken it earlier than later.

Filesystems and case-insensitivity

Posted May 31, 2019 15:06 UTC (Fri) by smurf (subscriber, #17840) [Link]

True, that.

Seems that quite a few Chinese people with interesting names (i.e. ones using archaic characters) suddenly couldn't get official documents any more because, surprise, their names weren't in the "official" charset …

Filesystems and case-insensitivity

Posted May 31, 2019 18:37 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Technically, most CJK characters can be decomposed into simpler components. About 70% of Mandarin characters follow the "radical-phonetic" model and could theoretically be composed.
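Unicode does in fact include Ideographic Description Characters (U+2FF0..U+2FFB) that can describe, though not encode, such compositions. A small Python sketch:

    # An Ideographic Description Sequence describing 清 ("clear") as the
    # water radical 氵 placed to the left of the phonetic component 青:
    ids = "\u2ff0\u6c35\u9752"   # ⿰氵青
    composed = "\u6e05"          # 清, the actually encoded character

    # Renderers are not required to draw the described composition, and
    # the IDS never compares equal to the encoded character -- it is a
    # description mechanism, not an encoding mechanism.
    print(ids, "describes", composed)
    print(ids == composed)       # False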

Filesystems and case-insensitivity

Posted Jun 6, 2019 5:09 UTC (Thu) by Serentty (guest, #132335) [Link]

Encoding Chinese text based on primitives is the Holy Grail of Chinese text encoding, but no one has actually been able to come up with a workable scheme, and it's probably just not realistic. Korean text, on the other hand, is really easy to encode from primitives, as it's just 24 letters combined in predictable ways.
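Hangul is the one place where Unicode does compose algorithmically: every precomposed syllable decomposes into its jamo and recomposes losslessly. A short Python sketch with the standard unicodedata module:

    import unicodedata

    syllable = "\ud55c"  # 한, a single precomposed Hangul syllable
    jamo = unicodedata.normalize("NFD", syllable)

    # Leading consonant + vowel + trailing consonant, derived by formula:
    print([hex(ord(c)) for c in jamo])  # ['0x1112', '0x1161', '0x11ab']
    # NFC recomposes the syllable exactly, so the round trip is lossless:
    print(unicodedata.normalize("NFC", jamo) == syllable)   # True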

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:29 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

> F-up all non-English languages of the world: NOT A PROBLEM.

Before UTF-8, there never was an encoding that could represent "all non-English languages". At most it could store one other language, or a handful (Windows and its brain-dead decision to use 16-bit characters), and even that is a subset of Unicode/UTF-8.
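The point is easy to demonstrate in Python: a legacy single-byte encoding covers one script, while UTF-8 handles them all (the sample strings here are arbitrary):

    s = "Grüße, Привет, こんにちは"  # Latin, Cyrillic, and Japanese mixed

    s.encode("utf-8")        # fine: UTF-8 covers all of Unicode
    try:
        s.encode("latin-1")  # a legacy single-byte encoding
    except UnicodeEncodeError as e:
        print(e)             # can't encode the Cyrillic or Japanese parts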

> whenever UTF-8 gets deprecated

It won't be. There's no reason at all to do that, and several billion reasons not to.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:43 UTC (Thu) by eru (subscriber, #2753) [Link]

> (Windows and its brain-dead decision to use 16-bit characters), and that is a subset of Unicode/utf-8.

To be fair, that was the Unicode spec at the time. Similarly, Java originally used 16-bit characters (and the char type is still 16 bits wide there). Java now internally encodes strings as UTF-16 in order to support the expansion of Unicode beyond 16 bits.
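The practical consequence of that history: a character outside the BMP occupies two 16-bit code units (a surrogate pair), which is why Java's String.length() counts code units rather than characters. A Python sketch of what a Java char[] would hold:

    s = "\U0001f600"  # 😀, U+1F600, outside the Basic Multilingual Plane

    utf16 = s.encode("utf-16-be")
    units = [hex(int.from_bytes(utf16[i:i + 2], "big"))
             for i in range(0, len(utf16), 2)]
    print(units)  # ['0xd83d', '0xde00'] -- the surrogate pair that a
                  # Java char[] (and String.length()) sees as two units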

