
Working with UTF-8 in the kernel


Posted Apr 1, 2019 6:46 UTC (Mon) by marcH (subscriber, #57642)
In reply to: Working with UTF-8 in the kernel by foom
Parent article: Working with UTF-8 in the kernel

> Everyone likes to bring up this example, but I rather expect the likelihood of normal Turkish users noticing and caring that they can't create two such files in the same directory is rather a theoretical problem, not an actual one.

Check the numerous real-world examples and references given in the comment to the previous LWN article: https://lwn.net/Articles/784041/ It's not just Turkish: like any other natural-language topic, case sensitivity is very complex and (among other things) locale-specific, not just in theory but in practice.

foom wrote:
> Neither Mac nor windows filesystems' case folding is locale sensitive, either. (NTFS does write a file during filesystem creation containing the case folding rules for that drive, so you _could_ make them be whatever you like, at the risk of breaking everything...)

Interesting, references?

nybble41 wrote:
> The fact that case folding is broken everywhere else it's been implemented offers a good argument against implementing it in Linux.

Wait... should Linux be "bug for bug" compatible or linguistically correct?



Working with UTF-8 in the kernel

Posted Apr 3, 2019 2:51 UTC (Wed) by dvdeug (subscriber, #10998) [Link] (8 responses)

Linux should implement the standard rules for case folding, and not worry about the linguistic details.

Working with UTF-8 in the kernel

Posted Apr 3, 2019 5:07 UTC (Wed) by marcH (subscriber, #57642) [Link] (7 responses)

What are these "standard rules" and which locale are they based on? US English?

Working with UTF-8 in the kernel

Posted Apr 3, 2019 6:18 UTC (Wed) by dvdeug (subscriber, #10998) [Link] (6 responses)

http://www.unicode.org/versions/Unicode12.0.0/ch05.pdf offers basically the standard rules, though for some details you might have to refer to the technical reports. It's based on all locales; Turkish and some Lithuanian dictionary usage are about the only locales that have a case exception from standard case folding.
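To make the Turkish exception concrete, here is a minimal sketch using Python's built-in string methods, which apply the default, locale-independent Unicode case mappings. It only illustrates those default mappings; the helper name `round_trip` is made up for this example, and nothing here is what the kernel patches actually do.

```python
# Default (locale-independent) Unicode case mappings, as exposed by
# Python's str methods. round_trip is a hypothetical helper name.

DOTLESS_I = "\u0131"     # ı LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE


def round_trip(s):
    """Uppercase, then lowercase, using the default mappings."""
    return s.upper().lower()


# Turkish pairs i with İ and ı with I. The default rules pair i with I
# instead, so a Turkish dotless ı does not survive a round trip.
print(round_trip(DOTLESS_I) == "i")
print("i".upper() == "I")

# Lowercasing İ by the default rules gives "i" followed by U+0307
# COMBINING DOT ABOVE, i.e. two code points.
print(DOTTED_CAP_I.lower() == "i\u0307")
```

All three comparisons print True, which is exactly what a Turkish-locale user would consider wrong for the first two.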

Working with UTF-8 in the kernel

Posted Apr 8, 2019 17:53 UTC (Mon) by hkario (subscriber, #94864) [Link] (5 responses)

and German... SS being upper case of ß, but ss being the lower case of SS...

Working with UTF-8 in the kernel

Posted Apr 8, 2019 20:30 UTC (Mon) by dvdeug (subscriber, #10998) [Link] (4 responses)

That's not an exception. In all locales, ß uppercases to SS.

Working with UTF-8 in the kernel

Posted Apr 9, 2019 18:30 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (3 responses)

Except it doesn’t, any more, in German.

Working with UTF-8 in the kernel

Posted Apr 10, 2019 0:50 UTC (Wed) by dvdeug (subscriber, #10998) [Link] (2 responses)

Citation needed. If you're talking about the Capital Eszett, it has been explicitly excluded from case-folding because it is non-standard. If the German speakers really wanted a change, given they are the only modern language group using it, it would change in all locales.

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:15 UTC (Wed) by chithanh (guest, #52801) [Link]

I think it is more complex than that.

ß (U+00DF) indeed has no single-character uppercase mapping in Unicode.
But ẞ (U+1E9E) has a lowercase mapping of ß.

So if you start with ẞ and then convert to lowercase and then to uppercase again you might end up with SS.

Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
But a case-insensitive filename match for ß will not return a file named ẞ.
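The mappings are easy to check in Python, whose str methods use the full Unicode case mappings (so ß uppercases to the two-letter SS). This is only a sketch of what the Unicode data says: `folds_equal` is a made-up helper, and a filesystem that compares via simple lowercase or uppercase tables rather than full case folding may well behave differently from the last two lines.

```python
# Case mappings for the two sharp-s characters, via Python's str methods.
# folds_equal is a hypothetical helper, not a filesystem API.

SHARP_S = "\u00df"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S


def folds_equal(a, b):
    """Compare two names using full Unicode case folding."""
    return a.casefold() == b.casefold()


print(SHARP_S.upper() == "SS")                  # full mapping: two letters
print(CAPITAL_SHARP_S.lower() == SHARP_S)       # ẞ lowercases to ß
print(CAPITAL_SHARP_S.lower().upper() == "SS")  # the round trip loses ẞ

# With full case folding, both characters fold to "ss".
print(folds_equal(SHARP_S, CAPITAL_SHARP_S))
print(folds_equal("STRASSE", "stra\u00dfe"))
```

Every line prints True. Whether a lookup is symmetric therefore depends on which operation the implementation picks: folding both sides, or lowercasing only one.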

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:40 UTC (Wed) by marcH (subscriber, #57642) [Link]

> Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
> But a case-insensitive filename match for ß will not return a file named ẞ.

Just for fun, some more "real-world" case insensitivity (from comments in the previous LWN thread)
https://www.google.com/search?q=FRANCAIS
https://www.google.com/search?q=FRANÇAIS

Good luck supporting this in your filesystem.

> If the German speakers really wanted a change,...

Thanks, you just confirmed case sensitivity is not "hard science" no matter how hard Unicode tries to pretend it is. What a surprise considering it's part of natural languages. That's why it definitely has a place in high-level user interfaces like file explorers, choosers, and maybe even interactive command lines (with some autocorrection), but certainly not "hardwired" at a very low level in filesystems, where it has already been seen causing damage.

Working with UTF-8 in the kernel

Posted Apr 4, 2019 13:00 UTC (Thu) by foom (subscriber, #14868) [Link] (4 responses)

> Interesting, references?

Search for $upcase -- the name of the 128KB pseudo file stored in every NTFS filesystem. You can also look at the NTFS filesystem driver for Linux.

This file contains the corresponding uppercased character (2 bytes) for each one of the 65536 possible 16-bit character values. When Windows wants to compare filenames, it simply indexes each character in each string through this table, to make an uppercase string, before doing the comparison.
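A rough sketch of that table-lookup comparison, in Python. Big assumption: the table here is derived from Python's own `str.upper()` and is not the real contents of $UpCase; `build_upcase_table`, `to_units` and `names_equal` are names invented for this illustration.

```python
# Sketch of a per-code-unit uppercase table in the style described above.
# ASSUMPTION: the table is built from Python's str.upper(), not from the
# actual $UpCase file. All function names are hypothetical.


def build_upcase_table():
    """Return 65536 entries: index = 16-bit unit, value = uppercased unit."""
    table = list(range(0x10000))
    for unit in range(0x10000):
        if 0xD800 <= unit <= 0xDFFF:
            continue  # surrogate halves map to themselves
        upper = chr(unit).upper()
        # Keep only one-to-one mappings that still fit in 16 bits.
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[unit] = ord(upper)
    return table


UPCASE = build_upcase_table()


def to_units(name):
    """Split a name into 16-bit code units (unpaired surrogates allowed)."""
    data = name.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def names_equal(a, b):
    """Compare two names after mapping every unit through the table."""
    return ([UPCASE[u] for u in to_units(a)]
            == [UPCASE[u] for u in to_units(b)])


print(len(UPCASE) * 2)                           # table size in bytes
print(names_equal("ReadMe.TXT", "README.txt"))
print(names_equal("stra\u00dfe", "STRASSE"))
```

At 2 bytes per entry the table comes to 131072 bytes, i.e. the 128KB mentioned above. A one-to-one table like this cannot express ß to SS, so the last comparison is False.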

When you reformat a drive it writes the newest mapping to the file, and that partition will use the same mapping as long as you keep it.

And, yes, I am quite aware that everyone who knows anything about unicode is crying out in distress at the utter WRONGness of what I said above...

But of course, the secret is that users aren't really the ones who care about case insensitive comparisons... They are using gui file pickers and such higher level tools where the filesystem case behavior doesn't matter.

Note the primary use cases given for Linux (samba exports, Android emulating a FAT filesystem on top of ext4) are all about *software* expectations, not humans. Software that was written with hardcoded filenames of the wrong case. That's why ntfs's braindead case folding is not really a problem.

Which does rather bring into question whether implementing "correct" normalization and case folding in Linux even has a point... It won't make it more compatible with the legacy software to do that...

Working with UTF-8 in the kernel

Posted Apr 8, 2019 6:21 UTC (Mon) by cpitrat (subscriber, #116459) [Link] (3 responses)

If the primary use case is to be compatible with NTFS, why not implement it the same way? As I understand it, NTFS support will require a fake Unicode version?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 21:49 UTC (Mon) by foom (subscriber, #14868) [Link]

I don't know.

It does seem rather incongruous to me to justify the feature by pointing to samba's emulation of NTFS case folding, and Android's emulation of FAT file name lookup rules, but then implementing unicode normalization and correct unicode case folding...which those don't do.

Working with UTF-8 in the kernel

Posted Apr 11, 2019 20:49 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

Because, as I understand it, utf-16 is now seen to have been a mistake.

Forcing all filenames to be valid utf-16 will break quite a lot elsewhere ... I think that if you want to implement the utf universe properly in utf-16, you end up back with the 8-bit codeset mess, only bigger ...

Cheers,
Wol

Working with UTF-8 in the kernel

Posted Apr 11, 2019 23:15 UTC (Thu) by foom (subscriber, #14868) [Link]

Er what? I don't really understand your comment, but NTFS doesn't implement utf-16.

It stores filenames as arbitrary sequences of 16-bit values. There are a few tens of values you cannot use (ascii control characters 0-31, and some ascii punctuation), but everything else is fair game. In particular, invalid UTF-16 containing unpaired surrogates is perfectly fine.
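For anyone wondering what "arbitrary 16-bit values" versus "valid UTF-16" means in practice, here is a small Python sketch. It says nothing about the NTFS on-disk format itself; it only shows that a lone surrogate is a perfectly good 16-bit value while being rejected by a strict UTF-16 encoder. `is_valid_utf16` is a made-up helper name.

```python
# A lone surrogate: fine as a raw 16-bit unit, invalid as UTF-16 text.
# is_valid_utf16 is a hypothetical helper for this illustration.

LONE_SURROGATE = "\ud800"  # high surrogate with no low surrogate after it


def is_valid_utf16(text):
    """True if a strict UTF-16 encoder accepts the string."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


name = "a" + LONE_SURROGATE

print(is_valid_utf16(name))          # strict encoding refuses the name
print(is_valid_utf16("a\U0001F600")) # a properly paired surrogate is fine

# The "surrogatepass" error handler writes the unit out as-is, which is
# roughly what "arbitrary sequence of 16-bit values" amounts to.
raw = name.encode("utf-16-le", "surrogatepass")
print(raw == b"a\x00\x00\xd8")
```

A filesystem that insists on valid UTF-8 names has no lossless way to represent the first name, which is one more place where the two models diverge.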

