Working with UTF-8 in the kernel

Posted Apr 4, 2019 8:33 UTC (Thu) by dvdeug (guest, #10998)
In reply to: Working with UTF-8 in the kernel by SLi
Parent article: Working with UTF-8 in the kernel

The rules for doing it are virtually language-independent. Turkish and a small set of related languages do have a problematic difference with the dotted i, but the rest of the Latin-script languages all agree, and there seem to be no disagreements among the other languages that use casing scripts. It's unfortunate that 1% of the world's population won't get proper casing, but at this point practical compatibility with other operating systems seems more important.
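To make the dotted-i problem concrete, here is a minimal Python sketch. The `turkish_lower` helper and its two-entry table are hypothetical and written only for illustration; they are not part of any kernel or library API. It contrasts Python's locale-independent `str.lower()` with the Turkish convention, where `I` lowercases to dotless `ı` and `İ` lowercases to `i`.

```python
# Hypothetical illustration only: not a kernel interface or a library API.
# Turkish (and a few related languages) pair the letters as I/ı and İ/i.
TURKISH_LOWER = {
    "I": "\u0131",   # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    "\u0130": "i",   # LATIN CAPITAL LETTER I WITH DOT ABOVE -> plain i
}


def default_lower(text: str) -> str:
    """Locale-independent lowercasing, as Python's str.lower() does it."""
    return text.lower()


def turkish_lower(text: str) -> str:
    """Per-character lowercasing with the two Turkish exceptions applied."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in text)


if __name__ == "__main__":
    for word in ["ISPARTA", "\u0130ZM\u0130R"]:
        print(repr(word), repr(default_lower(word)), repr(turkish_lower(word)))
```

A filesystem has to pick one of these behaviours for everybody, since a file name carries no locale, which is why a single locale-independent rule is the practical choice discussed here.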

Working with UTF-8 in the kernel

Posted Apr 5, 2019 8:11 UTC (Fri) by dgm (subscriber, #49227)

> practical compatibility with other operating systems seems more important.

So Linux cannot exchange data with macOS and Windows?! PANIC!

Or put another way: if I show you that less than 1% of the population really wants or needs a case-insensitive filesystem, can I disregard your claims?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 2:02 UTC (Mon) by dvdeug (guest, #10998)

If you want to support FAT or NTFS, you need to support case-insensitive filesystems. You can half-ass it and write out potentially corrupt filesystems, but I think most of the users of these filesystems with Windows don't want that. Fortunately, there are rules for locale-insensitive case-folding, and they aren't random or arbitrary.

Working with UTF-8 in the kernel

Posted Apr 8, 2019 21:18 UTC (Mon) by foom (subscriber, #14868)

> Fortunately, there are rules for locale-insensitive case-folding, and they aren't random or arbitrary.

That may be, but FAT, exFAT, and NTFS don't use the Unicode case-folding rules. If the justification is to make something compatible with those systems, do we actually need the (rather complex) Unicode rules?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 23:30 UTC (Mon) by dvdeug (guest, #10998)

What rules do they use?

In what way are the Unicode case-folding rules rather complex? They are for the most part fairly simple one-to-one mappings of characters, with a few exceptions that you just have to deal with. The German ß and the various titlecase characters in Unicode are there and are going to have to be dealt with.
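A short Python sketch of that claim, using `str.casefold()`, which applies Unicode full case folding. The `same_name` helper is a hypothetical name for illustration. Most characters fold to exactly one other character, a titlecase digraph such as ǅ folds to its lowercase form, and ß is one of the exceptions that folds to two characters.

```python
def same_name(a: str, b: str) -> bool:
    """Compare two names under Unicode full case folding.

    Hypothetical helper: no normalization is applied, only case folding.
    """
    return a.casefold() == b.casefold()


def fold_length(ch: str) -> int:
    """Number of code points that a single character folds to."""
    return len(ch.casefold())


if __name__ == "__main__":
    samples = {
        "A": "Latin capital A",
        "\u03a3": "Greek capital sigma",
        "\u03c2": "Greek final sigma",
        "\u01c5": "titlecase digraph D with small z with caron",
        "\u00df": "German sharp s",
        "\ufb01": "Latin small ligature fi",
    }
    for ch, label in samples.items():
        print(f"{label}: {ch!r} -> {ch.casefold()!r} ({fold_length(ch)} code point(s))")
    print(same_name("Stra\u00dfe", "STRASSE"))
```

The one-to-many entries are what force a comparison routine to handle strings whose lengths differ before folding.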

Working with UTF-8 in the kernel

Posted Apr 9, 2019 15:35 UTC (Tue) by foom (subscriber, #14868) [Link]

NTFS and exFAT only map a single UTF-16 code unit to another single UTF-16 code unit, via a lookup table written to disk during filesystem creation. No Unicode normalization, no multicharacter equivalences, and no folding for any characters above U+FFFF.
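A rough Python sketch of that style of comparison follows. The table here is an assumption: it is derived from Python's `str.upper()` as a stand-in, and is not the actual table that NTFS or exFAT writes to disk. The names `build_upcase_table`, `fold_units`, and `names_equal` are hypothetical. What it shows is the shape of the scheme: one 16-bit code unit in, one 16-bit code unit out, and nothing else.

```python
import struct


def build_upcase_table() -> list:
    """Build a 65536-entry table mapping each UTF-16 code unit to one code unit.

    Assumption: entries come from str.upper() as a stand-in for the real
    on-disk table. Anything without a single-unit uppercase maps to itself.
    """
    table = list(range(0x10000))
    for unit in range(0x10000):
        if 0xD800 <= unit <= 0xDFFF:
            continue  # surrogate halves are never folded
        upper = chr(unit).upper()
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[unit] = ord(upper)
    return table


UPCASE = build_upcase_table()


def fold_units(name: str) -> tuple:
    """Encode to UTF-16 and push every code unit through the table."""
    data = name.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return tuple(UPCASE[u] for u in units)


def names_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison, one code unit at a time."""
    return fold_units(a) == fold_units(b)


if __name__ == "__main__":
    print(names_equal("readme.TXT", "README.txt"))      # plain ASCII case
    print(names_equal("stra\u00dfe", "STRASSE"))        # multicharacter case
    print(names_equal("\u00e9", "e\u0301"))             # no normalization
    print(names_equal("\U00010428", "\U00010400"))      # above U+FFFF
```

Under these assumptions only the first comparison matches. The other three differ because ß has no single-unit uppercase, precomposed and decomposed é are different unit sequences, and the Deseret letters are encoded as surrogate pairs that the table leaves alone.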

You say that other cases "have to be dealt with"... but we have widely used examples showing that this is not actually the case.