
Working with UTF-8 in the kernel


Posted Mar 30, 2019 21:44 UTC (Sat) by mirabilos (subscriber, #84359)
In reply to: Working with UTF-8 in the kernel by foom
Parent article: Working with UTF-8 in the kernel

Yes, it must be consistent. What if a new release of Unicode comes out? Boom.

Another reason why this belongs in userspace.

And no, the Turkish case is not theoretical. Turkish has words that differ only in the dot above the i; in one case, one of the two words is ordinary and the other is a rather crass insult, which led to (IIRC) a knife attack (well, some kind of real-life attack on the person) because the sender had no dotless i on their keyboard when texting.
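To make the dotted/dotless distinction concrete, here is a minimal Python sketch. It contrasts the language-independent mappings built into `str` with a Turkish-aware lowercasing. `turkish_lower` is a hypothetical helper written for this illustration only; it is not a standard library function and not anything the kernel patches define.

```python
DOTTED_UPPER = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_LOWER = "\u0131"  # ı  LATIN SMALL LETTER DOTLESS I


def turkish_lower(s: str) -> str:
    """Hypothetical helper: lowercase with the Turkish pairing I/ı and İ/i."""
    # Handle the two special pairs first, then fall back to the default rules.
    return s.replace("I", DOTLESS_LOWER).replace(DOTTED_UPPER, "i").lower()


name = "ILIK"

# The default, language-independent mapping turns I into dotted i.
default_result = name.lower()

# The Turkish mapping turns I into dotless ı, giving a different string.
turkish_result = turkish_lower(name)

print(default_result == turkish_result)  # the two results differ

# Uppercasing loses the distinction: both i and ı map to plain I.
print("i".upper() == DOTLESS_LOWER.upper())

# Python's default case folding, by contrast, keeps ı distinct from i.
print(DOTLESS_LOWER.casefold() == "i".casefold())
```

Whether two names "match" therefore depends on which mapping is chosen and in which direction it is applied, which is the point under discussion.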

I’ll quote someone else: just because your Latin alphabet has 26 letters doesn’t mean everyone else’s does. Imagine if we *always* (independent of the word it’s in) made “oo” compare the same as “u”, for example.



Working with UTF-8 in the kernel

Posted Mar 30, 2019 21:51 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)

> And no, the Turkish case is not theoretical. Turkish has words that differ only in the dot above the i; in one case, one of the two words is ordinary and the other is a rather crass insult, which led to (IIRC) a knife attack (well, some kind of real-life attack on the person) because the sender had no dotless i on their keyboard when texting.

This is the story: https://gizmodo.com/a-cellphones-missing-dot-kills-two-pe...

Although I personally wouldn't blame the cellphone here.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 22:43 UTC (Sat) by mpr22 (subscriber, #60784)

The situation was bad.

Bad technology made it worse.

The cellphone doesn't get off scot-free here.

Working with UTF-8 in the kernel

Posted Mar 31, 2019 1:46 UTC (Sun) by foom (subscriber, #14868)

Hopefully the filesystem records what mapping it was created with, like NTFS does. Otherwise, some of your files may become inaccessible when a new mapping is switched to (which, IIRC, did happen on HFS+ before; that's not good...).

Re: Turkish swears -- you can name your files either word just fine -- the filesystem does not change your chosen filename to the other name! Only if you try to make files with both names, in the same directory, will you run into an issue. I still claim that is *highly* unlikely.
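The behavior described above (names stored as chosen, compared after folding, and a conflict only when two names in one directory fold to the same key) can be sketched roughly as follows. `FoldingDirectory` and `merge_dotless_i` are hypothetical names invented for this sketch; the fold that merges ı with i is a deliberately aggressive assumption and not what any particular filesystem is known to do.

```python
class FoldingDirectory:
    """Toy directory: names are stored as given but compared after folding."""

    def __init__(self, fold=str.casefold):
        self._fold = fold
        self._entries = {}  # folded key -> name exactly as the user chose it

    def create(self, name: str) -> None:
        key = self._fold(name)
        if key in self._entries:
            # Only a second name with the same folded key is a problem.
            raise FileExistsError(
                f"{name!r} collides with existing {self._entries[key]!r}"
            )
        self._entries[key] = name

    def lookup(self, name: str) -> str:
        """Return the stored spelling of the entry that matches `name`."""
        return self._entries[self._fold(name)]


def merge_dotless_i(name: str) -> str:
    """Hypothetical aggressive fold that also ignores the dot on i."""
    return name.casefold().replace("\u0131", "i")


directory = FoldingDirectory(fold=merge_dotless_i)
directory.create("\u0131l\u0131k.txt")         # stored with dotless ı, unchanged
print(directory.lookup("ILIK.TXT"))            # found; original spelling returned

try:
    directory.create("ilik.txt")               # folds to the same key
except FileExistsError as err:
    print("second name rejected:", err)
```

With Python's default `str.casefold` as the fold, the two names in this example do not collide at all, so both files could coexist.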

If we treated oo and u as the same for filename comparison purposes, because that was a very common language's policy, I rather suspect that also wouldn't be a huge problem. (It'd be weird to have such behavior, as that isn't a common policy, however.)

Working with UTF-8 in the kernel

Posted Mar 31, 2019 19:17 UTC (Sun) by naptastic (guest, #60139)

> because that was a very common language's policy

Which one‽ I've never heard of this and I am dying to know! MY BRAIN IS HUNGRY

Working with UTF-8 in the kernel

Posted Apr 4, 2019 5:37 UTC (Thu) by rgmoore (✭ supporter ✭, #75)

> Hopefully the filesystem records what mapping it was created with, like NTFS does. Otherwise, some of your files may become inaccessible when a new mapping is switched to (which, IIRC, did happen on HFS+ before; that's not good...).

This seems like the key to me. If the case folding rules can change, there's no way to guarantee that the same file will always be accessible the same way, and that's true whether the case folding happens in the kernel or in userspace.
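A rough Python sketch of that failure mode, and of how recording the mapping with the filesystem avoids it. Everything here is an assumption made for illustration: `fold_v1` and `fold_v2` are stand-ins for "old" and "new" folding rules (they are not real Unicode releases), and `Volume` is a hypothetical toy, not the on-disk format of any filesystem.

```python
def fold_v1(name: str) -> str:
    """Stand-in for an older rule: simple lowercasing."""
    return name.lower()


def fold_v2(name: str) -> str:
    """Stand-in for a newer rule: full case folding (ß becomes ss)."""
    return name.casefold()


FOLDS = {1: fold_v1, 2: fold_v2}


class Volume:
    """Toy volume whose index is keyed by folded names."""

    def __init__(self, fold_version: int):
        self.fold_version = fold_version  # recorded once, at creation time
        self.index = {}                   # folded key -> stored name

    def create(self, name: str) -> None:
        self.index[FOLDS[self.fold_version](name)] = name

    def lookup(self, name: str, fold_version=None):
        """Look up `name`, by default with the rule recorded on the volume."""
        version = self.fold_version if fold_version is None else fold_version
        return self.index.get(FOLDS[version](name))


volume = Volume(fold_version=1)
volume.create("Straße.txt")

# Using the recorded rule, the file is found under its own name.
print(volume.lookup("Straße.txt"))

# Applying the newer rule to an index built with the older one misses it.
print(volume.lookup("Straße.txt", fold_version=2))
```

The second lookup fails even though the name is spelled exactly as it was created, because the key stored in the index was computed under the other rule.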

Working with UTF-8 in the kernel

Posted Apr 4, 2019 12:28 UTC (Thu) by bosyber (guest, #84963)

> If we treated oo and u as the same for filename comparison purposes, because that was a very common language's policy

Is it? I know that it might effectively be that way in German, but in Dutch it absolutely is not; they are completely different sounds. (The German u sound is closer to the Dutch "oe" digraph, not to "oo", which is a loong vowel in Dutch.)

