Working with UTF-8 in the kernel
Posted Mar 29, 2019 10:05 UTC (Fri) by smurf (subscriber, #17840)In reply to: Working with UTF-8 in the kernel by ikm
Parent article: Working with UTF-8 in the kernel
There's another problem here. Correct case folding is locale dependent. One example: Turkish has an i and an ı (i without the dot). Unicode helpfully has an İ (capital I with a dot) right next to it. Guess what happens when you case-fold these in Turkey vs. everywhere else.
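A minimal sketch of the difference, in Python. The helper names `default_fold` and `turkic_fold` are made up for illustration. `turkic_fold` assumes the two Turkic ("T") mappings listed in Unicode's CaseFolding.txt (I to ı, İ to i); nothing here is taken from the kernel patches.

```python
def default_fold(name: str) -> str:
    """Locale-independent full case folding, as done by str.casefold()."""
    return name.casefold()


def turkic_fold(name: str) -> str:
    """Hypothetical Turkish/Azeri tailoring: map I -> ı and İ -> i first."""
    tailored = name.replace("I", "\u0131").replace("\u0130", "i")
    return tailored.casefold()


# "DIZIN" and "dizin" fold to the same name under the default rules...
print(default_fold("DIZIN") == default_fold("dizin"))
# ...but not under the Turkic tailoring, where I lowercases to dotless ı.
print(turkic_fold("DIZIN") == turkic_fold("dizin"))
# İ (U+0130) pairs with plain i only under the tailoring; the default
# folding turns it into i followed by a combining dot above (U+0307).
print(default_fold("\u0130") == "i", turkic_fold("\u0130") == "i")
```

So the same pair of names collides in one locale and not in the other, which is exactly what a filesystem has to pick a side on.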
Posted Mar 29, 2019 11:21 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (40 responses)
Also, I don't think you can blame Unicode for the fact that Turkish and English have different alphabets.
Posted Mar 29, 2019 13:32 UTC (Fri)
by drag (guest, #31333)
[Link] (37 responses)
What it does mean is that your case-insensitive filesystem lookups are actually case-sensitive with some insensitive elements. How sensitive they end up being depends on which language the user uses.
Unless, of course, the kernel is aware of the user's locale and changes the responses accordingly.
Posted Mar 29, 2019 22:53 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link] (36 responses)
Posted Mar 30, 2019 13:17 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (6 responses)
Posted Apr 4, 2019 8:33 UTC (Thu)
by dvdeug (guest, #10998)
[Link] (5 responses)
Posted Apr 5, 2019 8:11 UTC (Fri)
by dgm (subscriber, #49227)
[Link] (4 responses)
So Linux cannot exchange data with MacOS and Windows?! PANIC!
Or put another way: if I show you that less than 1% of the population really wants or needs a case-insensitive filesystem, can I disregard your claims?
Posted Apr 8, 2019 2:02 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (3 responses)
Posted Apr 8, 2019 21:18 UTC (Mon)
by foom (subscriber, #14868)
[Link] (2 responses)
That may be, but FAT, exFAT, and NTFS don't use the Unicode case-folding rules. If the justification is compatibility with those systems, do we actually need the (rather complex) Unicode rules?
Posted Apr 8, 2019 23:30 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (1 responses)
In what way are the Unicode case-folding rules rather complex? They are for the most part fairly simple one-to-one mappings of characters, with a few exceptions that you just have to deal with. The German ß and the various titlecase characters in Unicode are there and are going to have to be dealt with.
Posted Apr 9, 2019 15:35 UTC (Tue)
by foom (subscriber, #14868)
[Link]
You say that other cases "have to be dealt with"...but we have widely used examples showing that to not actually be the case.
Posted Mar 30, 2019 13:59 UTC (Sat)
by SLi (subscriber, #53131)
[Link] (28 responses)
Posted Mar 30, 2019 16:45 UTC (Sat)
by foom (subscriber, #14868)
[Link] (27 responses)
Neither Mac nor windows filesystems' case folding is locale sensitive, either. (NTFS does write a file during filesystem creation containing the case folding rules for that drive, so you _could_ make them be whatever you like, at the risk of breaking everything...)
Everyone likes to bring up this example, but I rather expect that normal Turkish users noticing and caring that they can't create two such files in the same directory is a theoretical problem, not an actual one.
Posted Mar 30, 2019 18:19 UTC (Sat)
by nybble41 (subscriber, #55106)
[Link] (4 responses)
Posted Mar 30, 2019 21:41 UTC (Sat)
by mirabilos (subscriber, #84359)
[Link] (3 responses)
So, the kernel should have nothing to do with this *at all*.
Posted Apr 1, 2019 9:45 UTC (Mon)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
Posted Apr 5, 2019 1:24 UTC (Fri)
by xtifr (guest, #143)
[Link] (1 responses)
This means *all* the overheads will truly need to be present only for those who actively *use* the system.
I honestly don't know how this is all going forward without *at least* a user-space proof-of-concept system.
Posted Apr 6, 2019 21:56 UTC (Sat)
by foom (subscriber, #14868)
[Link]
But that behavior would be pretty awful -- which files you can access depending upon your current locale? There's a reason filesystems (including this ext4 proposal) store the mapping used when creating the filesystem...
> without *at least* a user-space proof-of-concept system.
Two have been mentioned already. Android has an overlay filesystem for local access, and samba implements it when exporting the filesystem over the network.
Posted Mar 30, 2019 21:44 UTC (Sat)
by mirabilos (subscriber, #84359)
[Link] (6 responses)
Another reason why this belongs in userspace.
And no, the Turkish case is not theoretical. They have words that differ only in the dot above the i, and in one case one of the two words is normal and the other a rather crass insult. That led to (IIRC) a knife attack (well, some kind of real-life attack on the person) because they had no dotless i on their keyboard when texting.
I’ll quote someone else: just because your Latin alphabet has 26 letters doesn't mean everyone else’s does. Imagine if we *always* (regardless of which word it’s in) made “oo” compare the same as “u”, for example.
Posted Mar 30, 2019 21:51 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Although I personally wouldn't blame the cellphone here.
Posted Mar 30, 2019 22:43 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link]
Bad technology made it worse.
The cellphone doesn't get off scot-free here.
Posted Mar 31, 2019 1:46 UTC (Sun)
by foom (subscriber, #14868)
[Link] (3 responses)
Re: Turkish swears -- you can name your files either word just fine -- the filesystem does not change your chosen filename to the other name! Only if you try to make files named both, in the same directory, will you run into an issue. I still claim that is *highly* unlikely.
If we treated oo and u as the same for filename comparison purposes, because that was a very common language's policy, I rather suspect that also wouldn't be a huge problem. (It'd be weird to have such behavior, as that isn't a common policy, however.)
Posted Mar 31, 2019 19:17 UTC (Sun)
by naptastic (guest, #60139)
[Link]
Which one‽ I've never heard of this and I am dying to know! MY BRAIN IS HUNGRY
Posted Apr 4, 2019 5:37 UTC (Thu)
by rgmoore (✭ supporter ✭, #75)
[Link]
This seems like the key to me. If the case folding rules can change, there's no way to guarantee that the same file will always be accessible the same way, and that's true whether the case folding happens in the kernel or in userspace.
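A toy illustration of that point, in Python. The `Directory` class and the two fold functions are hypothetical; `str.lower()` and `str.casefold()` merely stand in for an "old" and a "new" folding table.

```python
def fold_v1(name: str) -> str:
    """Stand-in for the folding rules in force when the file was created."""
    return name.lower()


def fold_v2(name: str) -> str:
    """Stand-in for a later, different set of folding rules."""
    return name.casefold()


class Directory:
    """Toy directory index keyed by the folded form of each name."""

    def __init__(self, fold):
        self.fold = fold
        self.entries = {}

    def create(self, name: str) -> None:
        key = self.fold(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = name

    def lookup(self, name: str):
        return self.entries.get(self.fold(name))


d = Directory(fold_v1)
d.create("Straße.txt")
print(d.lookup("STRASSE.TXT"))  # not found under v1: ß is kept as-is
print(d.lookup("Straße.txt"))   # found under v1

d.fold = fold_v2                # the rules change underneath the index
print(d.lookup("Straße.txt"))   # the exact original name no longer resolves
```

The stored key was computed with the old rules, so after the switch even the original spelling misses. Recording which mapping the filesystem was created with avoids this, wherever the folding code lives.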
Posted Apr 4, 2019 12:28 UTC (Thu)
by bosyber (guest, #84963)
[Link]
Posted Apr 1, 2019 6:46 UTC (Mon)
by marcH (subscriber, #57642)
[Link] (14 responses)
Check the numerous real-world examples and references given in the comments on the previous LWN article: https://lwn.net/Articles/784041/ It's not just Turkish: like any other natural-language topic, case sensitivity is very complex and (among other things) locale-specific, not just in theory but in practice.
foom wrote:
Interesting, references?
nybble41 wrote:
Wait... should Linux be "bug for bug" compatible or linguistically correct?
Posted Apr 3, 2019 2:51 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (8 responses)
Posted Apr 3, 2019 5:07 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (7 responses)
Posted Apr 3, 2019 6:18 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (6 responses)
Posted Apr 8, 2019 17:53 UTC (Mon)
by hkario (subscriber, #94864)
[Link] (5 responses)
Posted Apr 8, 2019 20:30 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (4 responses)
Posted Apr 9, 2019 18:30 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (3 responses)
Posted Apr 10, 2019 0:50 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (2 responses)
Posted Apr 17, 2019 22:15 UTC (Wed)
by chithanh (guest, #52801)
[Link]
ß (U+00DF) indeed has no single-character uppercase mapping in Unicode; its full uppercase mapping is SS.
So if you start with ẞ and then convert to lowercase and then to uppercase again you might end up with SS.
Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
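For what it's worth, here is how those mappings come out in Python, whose `str` methods use the default (locale-independent) Unicode case mappings. The three `match_by_*` helpers are hypothetical comparison strategies, not what any particular filesystem does.

```python
SHARP_S = "\u00df"          # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ

# Lowercasing the capital form gives ß; uppercasing that gives "SS".
round_trip = CAPITAL_SHARP_S.lower().upper()
print(round_trip)


def match_by_upper(a: str, b: str) -> bool:
    return a.upper() == b.upper()


def match_by_lower(a: str, b: str) -> bool:
    return a.lower() == b.lower()


def match_by_casefold(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


# Whether ß and ẞ are "the same name" depends on the operation chosen.
print(match_by_upper(SHARP_S, CAPITAL_SHARP_S))
print(match_by_lower(SHARP_S, CAPITAL_SHARP_S))
print(match_by_casefold(SHARP_S, CAPITAL_SHARP_S))
# Full case folding additionally makes both of them match "ss".
print(match_by_casefold(SHARP_S, "ss"))
```

Uppercase-based comparison keeps the two apart, while lowercase-based comparison and full case folding treat them as equal, so the answer to "does a lookup for one find the other" is a design choice.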
Posted Apr 17, 2019 22:40 UTC (Wed)
by marcH (subscriber, #57642)
[Link]
Just for fun, some more "real-world" case insensitivity (from comments in the previous LWN thread)
Good luck supporting this in your filesystem.
> If the German speakers really wanted a change,...
Thanks, you just confirmed case sensitivity is not "hard science" no matter how hard Unicode tries to pretend it is. What a surprise, considering it's part of natural languages. That's why it definitely has a place in high-level user interfaces like file explorers, choosers and maybe even interactive command lines (with some autocorrection), but certainly not "hardwired" at a very low level in filesystems, where it has already been seen causing damage.
Posted Apr 4, 2019 13:00 UTC (Thu)
by foom (subscriber, #14868)
[Link] (4 responses)
Search for $upcase -- the name of the 128KB pseudo file stored in every NTFS filesystem. You can also look at the NTFS filesystem driver for Linux.
This file contains the corresponding uppercased character (2 bytes) for each of the 65536 possible 16-bit Unicode characters. When Windows wants to compare filenames, it simply maps each character of each string through this table to make an uppercase string before doing the comparison.
When you reformat a drive it writes the newest mapping to the file, and that partition will use the same mapping as long as you keep it.
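A rough sketch of that table-driven comparison, in Python. The table built here is an assumption: it is derived from Python's own `str.upper()` wherever the result is a single 16-bit value, so its contents will not match the table any real Windows version writes. Only the mechanism is the point.

```python
import struct


def build_upcase_table() -> list:
    """One 16-bit entry per 16-bit code unit; identity unless a 1:1 uppercase exists."""
    table = list(range(0x10000))
    for unit in range(0x10000):
        if 0xD800 <= unit <= 0xDFFF:
            continue  # surrogate code units map to themselves
        upper = chr(unit).upper()
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[unit] = ord(upper)
    return table


def code_units(name: str) -> tuple:
    """Split a name into UTF-16 code units, allowing lone surrogates."""
    raw = name.encode("utf-16-le", "surrogatepass")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)


def upcase_key(name: str, table: list) -> tuple:
    """Comparison key: every code unit looked up in the table."""
    return tuple(table[unit] for unit in code_units(name))


table = build_upcase_table()
print(len(table) * 2)  # table size in bytes, at 2 bytes per entry
print(upcase_key("readme.TXT", table) == upcase_key("README.txt", table))
# A per-unit table cannot express ß -> SS, so these stay distinct.
print(upcase_key("straße", table) == upcase_key("STRASSE", table))
```

There is no normalization and no multi-character mapping; the comparison is a fixed lookup per 16-bit unit.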
And, yes, I am quite aware that everyone who knows anything about unicode is crying out in distress at the utter WRONGness of what I said above...
But of course, the secret is that users aren't really the ones who care about case insensitive comparisons... They are using gui file pickers and such higher level tools where the filesystem case behavior doesn't matter.
Note the primary use cases given for Linux (samba exports, Android emulating a FAT filesystem on top of ext4) are all about *software* expectations, not humans. Software that was written with hardcoded filenames of the wrong case. That's why ntfs's braindead case folding is not really a problem.
Which does rather bring into question whether implementing "correct" normalization and case folding in Linux even has a point... It won't make it more compatible with the legacy software to do that...
Posted Apr 8, 2019 6:21 UTC (Mon)
by cpitrat (subscriber, #116459)
[Link] (3 responses)
Posted Apr 8, 2019 21:49 UTC (Mon)
by foom (subscriber, #14868)
[Link]
It does seem rather incongruous to me to justify the feature by pointing to samba's emulation of NTFS case folding and Android's emulation of FAT file name lookup rules, but then to implement Unicode normalization and correct Unicode case folding... which those don't do.
Posted Apr 11, 2019 20:49 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Forcing all filenames to be valid utf-16 will break quite a lot elsewhere ... I think that if you want to implement the utf universe properly in utf-16, you end up back with the 8-bit codeset mess, only bigger ...
Cheers,
Posted Apr 11, 2019 23:15 UTC (Thu)
by foom (subscriber, #14868)
[Link]
It stores filenames as arbitrary sequences of 16-bit values. There are a few tens of values you cannot use (ascii control characters 0-31, and some ascii punctuation), but everything else is fair game. In particular, invalid utf16 containing broken surrogate pairs is perfectly fine.
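To make "broken surrogate pairs" concrete, here is a small Python check for whether a sequence of 16-bit values is well-formed UTF-16. The function name is made up for this sketch; the ranges are the standard surrogate ranges.

```python
def is_well_formed_utf16(units) -> bool:
    """True if every high surrogate is immediately followed by a low surrogate
    and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high (leading) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:  # low (trailing) surrogate without a lead
            return False
        i += 1
    return True


paired = [0xD83D, 0xDE00]       # a proper surrogate pair
lone = [0x0061, 0xD800, 0x0062]  # "a", unpaired high surrogate, "b"
print(is_well_formed_utf16(paired))
print(is_well_formed_utf16(lone))
```

A name like `lone` is acceptable as a sequence of 16-bit values but has no valid UTF-8 form, which is where a UTF-8-only kernel interface and such a filesystem would disagree.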
Posted Apr 7, 2019 22:40 UTC (Sun)
by jschrod (subscriber, #1646)
[Link]
I think a case could be made that one could blame Unicode for *not* representing these different alphabets. After all, the same code point is used for "different" characters in the alphabets - if one agrees to your statement that these are really different alphabets.
But, in real life, too much water has flowed under this bridge to discuss it outside an evening in a wine bar with some friends who are encoding freaks. I have to admit I would be part of such a discussion... ;-)
Cheers, Joachim
Posted Apr 11, 2019 13:12 UTC (Thu)
by robbe (guest, #16131)
[Link]
Posted Mar 29, 2019 14:54 UTC (Fri)
by mina86 (guest, #68442)
[Link] (4 responses)
Posted Mar 30, 2019 4:27 UTC (Sat)
by gps (subscriber, #45638)
[Link] (1 responses)
What's one more? Some of the above could even go away after this in some system designs.
Posted Mar 31, 2019 14:41 UTC (Sun)
by mina86 (guest, #68442)
[Link]
That may be so, but I'd rather my Samba server crashed than my kernel oopsed or executed malicious code because Unicode was handled incorrectly. Just as putting an HTTP server inside the kernel wasn't a good idea, I'm not yet convinced that putting Unicode handling there is.
Posted Apr 8, 2019 6:24 UTC (Mon)
by cpitrat (subscriber, #116459)
[Link] (1 responses)
Posted Apr 11, 2019 20:57 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Cheers,
This is the story: https://gizmodo.com/a-cellphones-missing-dot-kills-two-pe...
Hopefully the filesystem records what mapping it was created with, like NTFS does. Otherwise, some of your files may become inaccessible when a new mapping is switched to. IIRC, that did happen on HFS+ before, and that's not good...
Is it? I know that it might effectively be that way in German, but in Dutch it absolutely is not; they are completely different sounds (the German u sound is closer to the Dutch oe double sound, but not to oo, which is a loong vowel in Dutch).
> Neither Mac nor windows filesystems' case folding is locale sensitive, either. (NTFS does write a file during filesystem creation containing the case folding rules for that drive, so you _could_ make them be whatever you like, at the risk of breaking everything...)
> The fact that case folding is broken everywhere else it's been implemented offers a good argument against implementing it in Linux.
But ẞ (U+1E9E) has a lowercase mapping of ß.
But a case-insensitive filename match for ß will not return a file named ẞ.
a file named ß.
> But a case-insensitive filename match for ß will not return a file named ẞ.
https://www.google.com/search?q=FRANCAIS
https://www.google.com/search?q=FRANÇAIS
Wol
Wol
