
Working with UTF-8 in the kernel

Posted Apr 3, 2019 2:51 UTC (Wed) by dvdeug (subscriber, #10998)
In reply to: Working with UTF-8 in the kernel by marcH
Parent article: Working with UTF-8 in the kernel

Linux should implement the standard rules for case folding, and not worry about the linguistic details.



Working with UTF-8 in the kernel

Posted Apr 3, 2019 5:07 UTC (Wed) by marcH (subscriber, #57642) [Link] (7 responses)

What are these "standard rules" and which locale are they based on? US English?

Working with UTF-8 in the kernel

Posted Apr 3, 2019 6:18 UTC (Wed) by dvdeug (subscriber, #10998) [Link] (6 responses)

http://www.unicode.org/versions/Unicode12.0.0/ch05.pdf offers basically the standard rules, though for some details you might have to refer to the technical reports. They are based on all locales; Turkish and some Lithuanian dictionary usage are about the only locales that have a case exception from standard case folding.
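
To make the Turkish exception concrete, here is a minimal sketch using Python's `str.casefold()`, which applies the default (locale-independent) Unicode folding and has no Turkish tailoring. The helper name `default_fold_equal` is made up for illustration; it is not taken from the kernel patches under discussion.

```python
def default_fold_equal(a: str, b: str) -> bool:
    """Compare two strings under default (untailored) full case folding."""
    return a.casefold() == b.casefold()

# Default rules pair I with i, whatever the user's locale is.
print(default_fold_equal("I", "i"))        # True

# Turkish pairs I with dotless ı (U+0131); the default folding does not.
print(default_fold_equal("I", "\u0131"))   # False

# İ (U+0130) folds to "i" followed by U+0307 (combining dot above),
# so under the default rules it does not match a plain "i" either.
print(default_fold_equal("\u0130", "i"))   # False
```

A Turkish-tailored comparison would flip the last two results, which is exactly the kind of locale dependence the default folding leaves out.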

Working with UTF-8 in the kernel

Posted Apr 8, 2019 17:53 UTC (Mon) by hkario (subscriber, #94864) [Link] (5 responses)

and German... SS being upper case of ß, but ss being the lower case of SS...

Working with UTF-8 in the kernel

Posted Apr 8, 2019 20:30 UTC (Mon) by dvdeug (subscriber, #10998) [Link] (4 responses)

That's not an exception. In all locales, ß uppercases to SS.

Working with UTF-8 in the kernel

Posted Apr 9, 2019 18:30 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (3 responses)

Except it doesn’t, any more, in German.

Working with UTF-8 in the kernel

Posted Apr 10, 2019 0:50 UTC (Wed) by dvdeug (subscriber, #10998) [Link] (2 responses)

Citation needed. If you're talking about the Capital Eszett, it has been explicitly excluded from case-folding because it is non-standard. If the German speakers really wanted a change, given they are the only modern language group using it, it would change in all locales.

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:15 UTC (Wed) by chithanh (guest, #52801) [Link]

I think it is more complex than that.

ß (U+00DF) indeed has no single-character uppercase mapping in Unicode; its full uppercase mapping is the two-letter sequence SS.
But ẞ (U+1E9E) has a lowercase mapping of ß.

So if you start with ẞ and then convert to lowercase and then to uppercase again you might end up with SS.

Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
But a case-insensitive filename match for ß will not return a file named ẞ.
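
The round trip and the one-way match can be reproduced with Python's built-in string methods, which follow the default Unicode case mappings. This is only a sketch: `naive_match` is a hypothetical matcher that lowercases the query and compares it with the stored name, which is one way to get the asymmetry described above; it is not how any particular filesystem is known to behave. `folded_match` shows that full case folding of both sides gives a symmetric result instead.

```python
SHARP_S = "\u00df"          # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ

# Round trip: ẞ -> lowercase -> uppercase does not come back to ẞ.
lowered = CAPITAL_SHARP_S.lower()
round_trip = lowered.upper()
print(lowered == SHARP_S)   # True
print(round_trip)           # SS

def naive_match(query: str, filename: str) -> bool:
    """Hypothetical matcher: lowercase the query only, compare to the stored name."""
    return query.lower() == filename

print(naive_match(CAPITAL_SHARP_S, SHARP_S))  # True
print(naive_match(SHARP_S, CAPITAL_SHARP_S))  # False

def folded_match(a: str, b: str) -> bool:
    """Fold both sides with default full case folding before comparing."""
    return a.casefold() == b.casefold()

# Both ß and ẞ fold to "ss", so the match works in both directions,
# at the price of also matching a file literally named "SS".
print(folded_match(SHARP_S, CAPITAL_SHARP_S))  # True
print(folded_match(SHARP_S, "SS"))             # True
```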

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:40 UTC (Wed) by marcH (subscriber, #57642) [Link]

> Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
> But a case-insensitive filename match for ß will not return a file named ẞ.

Just for fun, some more "real-world" case insensitivity (from comments in the previous LWN thread)
https://www.google.com/search?q=FRANCAIS
https://www.google.com/search?q=FRANÇAIS

Good luck supporting this in your filesystem.
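
For illustration, treating FRANCAIS and FRANÇAIS as equal goes beyond case folding: the cedilla has to be dropped as well. A rough sketch with Python's standard `unicodedata` module follows; the function names are invented here, and a real search engine presumably does far more than this.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (accents, cedillas, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def loose_equal(a: str, b: str) -> bool:
    """Hypothetical 'search-style' comparison: ignore both case and marks."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()

# Case folding alone keeps the two spellings distinct.
print("FRANCAIS".casefold() == "FRANÇAIS".casefold())  # False

# Dropping combining marks first makes them compare equal.
print(loose_equal("FRANCAIS", "FRANÇAIS"))             # True
print(loose_equal("français", "FRANCAIS"))             # True
```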

> If the German speakers really wanted a change,...

Thanks, you just confirmed that case insensitivity is not "hard science", no matter how hard Unicode tries to pretend it is. What a surprise, considering it's part of natural languages. That's why it definitely has a place in high-level user interfaces like file explorers, file choosers, and maybe even interactive command lines (with some autocorrection), but certainly not "hardwired" at a very low level in filesystems, where it has already been seen causing damage.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds