
Working with UTF-8 in the kernel

By Jonathan Corbet
March 28, 2019
In the real world, text is expressed in many languages using a wide variety of character sets; those character sets can be encoded in a lot of different ways. In the kernel, life has always been simpler; file names and other string data are just opaque streams of bytes. In the few cases where the kernel must interpret text, nothing more than ASCII is required. The proposed addition of case-insensitive file-name lookups to the ext4 filesystem changes things, though; now some kernel code must deal with the full complexity of Unicode. A look at the API being provided to handle encodings illustrates nicely just how complicated this task is.

The Unicode standard, of course, defines "code points"; to a first approximation, each code point represents a specific character in a specific language group. How those code points are represented in a stream of bytes — the encoding — is a separate question. Dealing with encodings has challenges of its own, but over the years the UTF-8 encoding has emerged as the preferred way of representing code points in many settings. UTF-8 has the advantages of representing the entire Unicode space while being compatible with ASCII — a valid ASCII string is also valid UTF-8. The developers implementing case independence in the kernel decided to limit it to the UTF-8 encoding, presumably in the hope of solving the problem without going entirely insane.
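
The ASCII-compatibility property is easy to see in a small user-space sketch of the encoding rules. This is not code from the patch set, and the function name utf8_encode_example() is made up for illustration; it only shows how a code point is spread over one to four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into out[] (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or a value beyond U+10FFFF).
 * Hypothetical helper for illustration; not part of the kernel API.
 */
size_t utf8_encode_example(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* ASCII range: a single byte, identical to the ASCII value. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* surrogates are not valid in UTF-8 */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;       /* beyond the Unicode code space */
}
```

With this sketch, U+0041 ("A") comes out as the single byte 0x41, while U+00E4 ("ä") becomes the two bytes 0xC3 0xA4; every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character.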

The API that resulted has two layers: a relatively straightforward set of higher-level operations and the primitives that are used to implement them. We'll start at the top and work our way down.

The high-level UTF-8 API

At a high level, the operations that will be needed can be described fairly simply: validate a string, normalize a string, and compare two strings (perhaps with case folding). There is, though, a catch: the Unicode standard comes in multiple versions (version 12.0.0 was released in early March), and each version is different. The normalization and case-folding rules can change between versions, and not all code points exist in all versions. So, before any of the other operations can be used, a "map" must be loaded for the Unicode version of interest:

    struct unicode_map *utf8_load(const char *version);

The given version number can be NULL, in which case the latest supported version will be used and a warning will be emitted. In the ext4 implementation, the Unicode version used with any given filesystem is stored in the superblock. The latest version can be explicitly requested by obtaining its name from utf8version_latest(), which takes no parameters. The return value from utf8_load() is a map pointer that can be used with other operations, or an error-pointer value if something goes wrong. The returned pointer should be freed with utf8_unload() when it is no longer needed.

UTF-8 strings are represented in this interface using the qstr structure defined in <linux/dcache.h>. That reveals an apparent assumption that the use of this API will be limited to filesystem code; that is true for now, but could change in the future.

The single-string operations provided are:

    int utf8_validate(const struct unicode_map *um, const struct qstr *str);
    int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
		       unsigned char *dest, size_t dlen);
    int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
		      unsigned char *dest, size_t dlen);

All of the functions require the map pointer (um) and the string to be operated upon (str). utf8_validate() returns zero if str is a valid UTF-8 string, non-zero otherwise. A call to utf8_normalize() will store a normalized version of str in dest and return the length of the result; utf8_casefold() does case folding as well as normalization. Both utf8_normalize() and utf8_casefold() will return -EINVAL if the input string is invalid or if the result would be longer than dlen.
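
To give a feel for what "valid" means at the byte level, here is a user-space sketch of a structural UTF-8 check. It is an assumption-laden stand-in, not the kernel's utf8_validate(): the name utf8_wellformed_example() is invented, it takes a raw buffer rather than a qstr, and it knows nothing about Unicode versions, so it cannot reject code points that are unassigned in a given version. It does follow the same "zero if valid, non-zero otherwise" convention.

```c
#include <stddef.h>

/*
 * Return 0 if s[0..len) is structurally well-formed UTF-8, -1 otherwise.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates, and values above U+10FFFF.
 * Hypothetical helper for illustration; not part of the kernel API.
 */
int utf8_wellformed_example(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (b < 0x80) {             /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return -1;              /* continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return -1;              /* sequence runs off the end */

        for (k = 1; k <= need; k++) {
            unsigned char c = s[i + k];

            if ((c & 0xC0) != 0x80)
                return -1;          /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Overlong form, surrogate, or outside the code space. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        i += need + 1;
    }
    return 0;
}
```

The overlong check matters for file names in particular: without it, the two-byte sequence 0xC0 0x80 would sneak past as an alternative spelling of a NUL byte.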

Comparisons are done with:

    int utf8_strncmp(const struct unicode_map *um,
		     const struct qstr *s1, const struct qstr *s2);
    int utf8_strncasecmp(const struct unicode_map *um,
		     const struct qstr *s1, const struct qstr *s2);

Both functions will compare the normalized versions of s1 and s2; utf8_strncasecmp() will do a case-independent comparison. The return value is zero if the strings are the same, one if they differ, and -EINVAL for errors. These functions only test for equality; there is no "greater than" or "less than" testing.

Moving down

Normalization and case folding require the kernel to gain detailed knowledge of the entire Unicode code point space. There are a lot of rules, and there are multiple ways of representing many code points. The good news is that these rules are packaged, in machine-readable form, with the Unicode standard itself. The bad news is that they take up several megabytes of space.

The UTF-8 patches incorporate these rules by processing the provided files into a data structure in a C header file. A fair amount of space is then regained by removing the information for decomposing Hangul (Korean) code points into their base components, since this is a task that can be done algorithmically as well. There is still a lot of data that has to go into kernel space, though, and it's naturally different for each version of Unicode.
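
The Hangul case shows why a table is unnecessary there: the Unicode standard defines the decomposition of precomposed Hangul syllables arithmetically. The following user-space sketch follows that published algorithm; the function and constant names are chosen for this example and are not taken from the kernel patches.

```c
#include <stdint.h>

/* Constants from the Unicode standard's Hangul decomposition algorithm. */
enum {
    S_BASE  = 0xAC00,               /* first precomposed syllable */
    L_BASE  = 0x1100,               /* first leading consonant    */
    V_BASE  = 0x1161,               /* first vowel                */
    T_BASE  = 0x11A7,               /* one before first trailing consonant */
    L_COUNT = 19,
    V_COUNT = 21,
    T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,    /* 588   */
    S_COUNT = L_COUNT * N_COUNT     /* 11172 */
};

/*
 * Decompose a precomposed Hangul syllable into its jamo.
 * Returns the number of code points stored in out[] (2 or 3),
 * or 0 if cp is not a precomposed Hangul syllable.
 * Hypothetical helper for illustration.
 */
int hangul_decompose_example(uint32_t cp, uint32_t out[3])
{
    uint32_t s_index, t;

    if (cp < S_BASE || cp >= S_BASE + S_COUNT)
        return 0;

    s_index = cp - S_BASE;
    t = s_index % T_COUNT;

    out[0] = L_BASE + s_index / N_COUNT;
    out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;
    if (t == 0)
        return 2;                   /* no trailing consonant */

    out[2] = T_BASE + t;
    return 3;
}
```

A handful of integer operations thus replaces 11,172 table entries, which is where the space saving comes from.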

The first step for code wanting to use the lower-level API is to get a pointer to this database for the Unicode version in use. That is done with one of:

    struct utf8data *utf8nfdi(unsigned int maxage);
    struct utf8data *utf8nfdicf(unsigned int maxage);

Here, maxage is the version number of interest, encoded in an integer form from the major, minor, and revision numbers using the UNICODE_AGE() macro. If only normalization is needed, utf8nfdi() should be called; use utf8nfdicf() if case folding is also required. The return value will be an opaque pointer, or NULL if the given version cannot be supported.
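
Packing a three-part version into one integer makes "is this version older than that one?" a plain integer comparison. The sketch below illustrates the idea in user space. The bit layout (major shifted left by 16, minor by 8) is an assumption made for this example and may not match the patch; EXAMPLE_AGE() and example_parse_age() are invented names, not the kernel's UNICODE_AGE() macro or any kernel function.

```c
#include <stdio.h>

/*
 * Pack major.minor.revision into one integer so that versions order
 * correctly under integer comparison. The 16/8-bit layout is an
 * assumption for this example.
 */
#define EXAMPLE_AGE(maj, min, rev)              \
    (((unsigned int)(maj) << 16) |              \
     ((unsigned int)(min) << 8)  |              \
     ((unsigned int)(rev)))

/*
 * Parse a string such as "12.0.0" into a packed age.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 * Hypothetical helper for illustration.
 */
int example_parse_age(const char *version, unsigned int *age)
{
    unsigned int maj, min, rev;
    char extra;

    /* Exactly three fields; a fourth conversion means trailing junk. */
    if (sscanf(version, "%u.%u.%u%c", &maj, &min, &rev, &extra) != 3)
        return -1;
    if (maj > 0xFFFF || min > 0xFF || rev > 0xFF)
        return -1;

    *age = EXAMPLE_AGE(maj, min, rev);
    return 0;
}
```

Under this layout, EXAMPLE_AGE(11, 0, 0) compares as smaller than EXAMPLE_AGE(12, 0, 0), so a "maximum age" can be checked against each code point's introduction version with a single comparison.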

Next, a cursor should be set up to track progress working through the string of interest:

    int utf8cursor(struct utf8cursor *cursor, const struct utf8data *data,
	           const char *s);
    int utf8ncursor(struct utf8cursor *cursor, const struct utf8data *data,
		    const char *s, size_t len);

The cursor structure must be provided by the caller, but is otherwise opaque; data is the database pointer obtained above. If the length of the string (in bytes) is known, utf8ncursor() should be used; utf8cursor() can be used when the length is not known but the string is null-terminated. These functions return zero on success, nonzero otherwise.

Working through the string is then accomplished by repeated calls to:

    int utf8byte(struct utf8cursor *u8c);

This function will return the next byte in the normalized (and possibly case-folded) string, or zero at the end; a negative value signals an error. UTF-8-encoded code points can take more than one byte, of course, so individual bytes do not, on their own, represent code points. Due to decomposition, the returned string may be longer than the one passed in.

As an example of how these pieces fit together, here is the full implementation of utf8_strncasecmp():

    int utf8_strncasecmp(const struct unicode_map *um,
		         const struct qstr *s1, const struct qstr *s2)
    {
	const struct utf8data *data = utf8nfdicf(um->version);
	struct utf8cursor cur1, cur2;
	int c1, c2;

	if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
	    return -EINVAL;

	if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
	    return -EINVAL;

	do {
	    c1 = utf8byte(&cur1);
	    c2 = utf8byte(&cur2);

	    if (c1 < 0 || c2 < 0)
		return -EINVAL;
	    if (c1 != c2)
		return 1;
	} while (c1);

	return 0;
    }

There are other functions in the low-level API for testing validity, getting the length of strings, and so on, but the above captures the core of it. Those interested in the details can find them in this patch.
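
The shape of that comparison loop can be tried outside the kernel with a toy model. In the sketch below, the cursor and the byte-fetching function are deliberately simplistic stand-ins: they fold ASCII letters only and treat any non-ASCII byte as an error, whereas the real utf8byte() consults the Unicode tables. All of the toy_* names are invented for this example.

```c
#include <errno.h>
#include <stddef.h>

/* Toy cursor: tracks a position in a byte string of known length. */
struct toy_cursor {
    const char *s;
    size_t len;
    size_t pos;
};

/* Set up a cursor; returns 0 on success, negative on error. */
static int toy_ncursor(struct toy_cursor *c, const char *s, size_t len)
{
    if (!s)
        return -1;
    c->s = s;
    c->len = len;
    c->pos = 0;
    return 0;
}

/*
 * Return the next case-folded byte, 0 at the end of the string, or
 * -1 for input this toy cannot handle (any non-ASCII byte).
 */
static int toy_byte(struct toy_cursor *c)
{
    unsigned char b;

    if (c->pos >= c->len)
        return 0;
    b = (unsigned char)c->s[c->pos++];
    if (b >= 0x80)
        return -1;
    if (b >= 'A' && b <= 'Z')
        b = (unsigned char)(b + ('a' - 'A'));
    return b;
}

/* Same structure as utf8_strncasecmp(): 0 if equal, 1 if not, -EINVAL on error. */
int toy_strncasecmp(const char *s1, size_t len1, const char *s2, size_t len2)
{
    struct toy_cursor cur1, cur2;
    int c1, c2;

    if (toy_ncursor(&cur1, s1, len1) < 0)
        return -EINVAL;
    if (toy_ncursor(&cur2, s2, len2) < 0)
        return -EINVAL;

    do {
        c1 = toy_byte(&cur1);
        c2 = toy_byte(&cur2);

        if (c1 < 0 || c2 < 0)
            return -EINVAL;
        if (c1 != c2)
            return 1;
    } while (c1);

    return 0;
}
```

Because both cursors produce already-folded bytes, the comparison itself never needs to know anything about case or normalization; all of that complexity is hidden behind the byte-at-a-time interface.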

That is quite a bit of complexity when one considers that it is all there just to compare strings; we are now far removed from the simple string functions found in Kernighan & Ritchie. But that, it seems, is the world that we live in now. At least we get some nice emoji for all of that complexity 👍.


Working with UTF-8 in the kernel

Posted Mar 28, 2019 18:29 UTC (Thu) by Sesse (subscriber, #53779) [Link] (1 responses)

Does utf8byte() actually return the next _byte_? That's a strange decision. Wouldn't the right choice normally be either to return a _code point_, or, if implementing the Unicode Collation Algorithm (UCA), the next _weight_?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 9:32 UTC (Fri) by grawity (subscriber, #80596) [Link]

I'm somewhat disappointed it wasn't called utf8rune().

Working with UTF-8 in the kernel

Posted Mar 28, 2019 21:48 UTC (Thu) by ikm (guest, #493) [Link] (55 responses)

> But that, it seems, is the world that we live in now

Is case-insensitive file name support the only user of this code? Then being able to handle Unicode is hardly the requirement of the world we live in - we can already have emojis in file names, after all, and I doubt those are case sensitive.

Working with UTF-8 in the kernel

Posted Mar 28, 2019 23:35 UTC (Thu) by gdt (subscriber, #6284) [Link] (7 responses)

The essential requirement is efficient case-insensitive comparison of file names. At present the provided API is not efficient; there are also races between checking that the filename is not in use and creating a new file with that filename. The kernel design choices are: (1) the kernel supports UTF-8, (2) the kernel gives an efficient race-free user-space API to allow a directory to be listed, and changes to that directory locked whilst the user space handles UTF-8. Choice (2) is scary enough that choice (1) looks better.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 3:08 UTC (Fri) by zlynx (guest, #2285) [Link] (5 responses)

I may not understand something here. But if you read a directory and assume that just because there's no file there, you are free to make a new one, that's a bad assumption. And always has been. That's the source of several /tmp vulnerabilities in the past.

Always assume someone stole your filename. It isn't yours until you hold a handle to it.

So how is this case normalization system helping anyone?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 6:09 UTC (Fri) by khim (subscriber, #9252) [Link] (4 responses)

Case normalization removes the need for the whole thing. To implement case-insensitive semantics in userspace you must check if SoMeFiLeNaMe.txt is there and then create SomeFilename.txt atomically. If the kernel is asked to create SomeFilename.txt and returns a reference to SoMeFiLeNaMe.txt then this atomicity would be handled in the kernel.

P.S. I wonder if these tables (without code) could be exposed to userspace. Userspace guys ALSO often need to deal with Unicode and if kernel already has all these tables... why not use them?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 6:35 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

The overhead of cross-address access will probably make it impractical for userspace.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 8:26 UTC (Fri) by felix.s (guest, #104710) [Link] (2 responses)

It seems to work fine for vDSO, doesn't it?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 8:28 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

That would work for basically static data. At this point a special file in /proc might work just as well.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 9:31 UTC (Fri) by dezgeg (subscriber, #92243) [Link]

Having the data tables readable from /proc sounds unattractive due to this part from the article:

"The UTF-8 patches incorporate these rules by processing the provided files into a data structure in a C header file. A fair amount of space is then regained by removing the information for decomposing Hangul (Korean) code points into their base components, since this is a task that can be done algorithmically as well."

Exporting these non-standard tables to userspace would lock in this custom format implementation detail forever.

Working with UTF-8 in the kernel

Posted Apr 1, 2019 14:12 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

Nothing scary about that: Open directory (or use an already open descriptor), acquire lock which prevents adding/ removing entries, process accumulated change notifications, create/ remove entry, unlock.

Such a lock must already exist, BTW; it may be sufficient to expose that. Advisory locking would probably be ok as UNIX processes are usually supposed to cooperate and not fight with each other. Vastly simpler in the kernel than 'hard-coding' a specific, known-to-be-broken/deficient case translation mechanism into certain filesystems. Considering cases like "vertical bar plus combining overline" (aka T, not going to happen as that's an ASCII codepoint), I consider "kernel supports UTF-8" much more 'scary'.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 10:05 UTC (Fri) by smurf (subscriber, #17840) [Link] (46 responses)

There's also the problem of composites. Unicode, in its infinite wisdom(*), has multiple ways to store the same character (an 'ä' is either a single latin1 character, or an 'a' followed by a combining diaeresis – any sane designer would have stored the modifiers first, but I digress). You need to agree on one form with which to represent file names because the user typically can't easily generate the other, and even copy+paste tends to get mangled.

There's another problem here. Correct case folding is locale dependent. One example: Turkish has an i and an ı (i without the dot). Unicode helpfully has an İ (capital I with a dot) right next to it. Guess what happens when you case-fold these in Turkey vs. everywhere else.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 11:21 UTC (Fri) by Sesse (subscriber, #53779) [Link] (40 responses)

Normalization is already dealt with in the patch.

Also, I don't think you can blame Unicode for the fact that Turkish and English have different alphabets.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 13:32 UTC (Fri) by drag (guest, #31333) [Link] (37 responses)

> Also, I don't think you can blame Unicode for the fact that Turkish and English have different alphabets.

What it does mean is that your case insensitive lookups for the file system are actually case sensitive with insensitive elements. How sensitive it ends up being depends on what language a user uses.

Unless, of course, the kernel is aware of the user's locale and changes the responses accordingly.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 22:53 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (36 responses)

The kernel is never aware of the user’s locale, so this belongs into userspace.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 13:17 UTC (Sat) by SLi (subscriber, #53131) [Link] (6 responses)

But the point is, there is no "case folding" independent of language; the rules for doing it are inherently language dependent.

Working with UTF-8 in the kernel

Posted Apr 4, 2019 8:33 UTC (Thu) by dvdeug (guest, #10998) [Link] (5 responses)

The rules for doing it are virtually language independent. Turkish and a small set of related languages do have a problematic difference with the dotted i, but the rest of the Latin-script languages all agree, and there seems to be no disagreements among the other languages that use casing scripts. It's unfortunate that 1% of the world's population won't get proper casing, but at this point practical compatibility with other operating systems seems more important.

Working with UTF-8 in the kernel

Posted Apr 5, 2019 8:11 UTC (Fri) by dgm (subscriber, #49227) [Link] (4 responses)

> practical compatibility with other operating systems seems more important.

So Linux cannot exchange data with MacOS and Windows?! PANIC!

Or put another way: if I show you that less than 1% of the population really wants or needs a case-insensitive filesystem, can I disregard your claims?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 2:02 UTC (Mon) by dvdeug (guest, #10998) [Link] (3 responses)

If you want to support FAT or NTFS, you need to support case-insensitive filesystems. You can half-ass it and write out potentially corrupt filesystems, but I think most of the users of these filesystems with Windows don't want that. Fortunately, there are rules for locale-insensitive case-folding, and they aren't random or arbitrary.

Working with UTF-8 in the kernel

Posted Apr 8, 2019 21:18 UTC (Mon) by foom (subscriber, #14868) [Link] (2 responses)

> Fortunately, there are rules for locale-insensitive case-folding, and they aren't random or arbitrary.

That may be, but FAT, exFAT, and NTFS don't use the unicode case folding rules. If the justification is to make something compatible with those systems, do we actually need the (rather complex) unicode rules?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 23:30 UTC (Mon) by dvdeug (guest, #10998) [Link] (1 responses)

What rules do they use?

In what way are the Unicode case-folding rules rather complex? They are for the most part fairly simple, one to one matchings of characters, with a few exceptions that you just have to deal with. The German ß and the various titlecase characters in Unicode are there and are going to have to be dealt with.

Working with UTF-8 in the kernel

Posted Apr 9, 2019 15:35 UTC (Tue) by foom (subscriber, #14868) [Link]

NTFS and exFAT only map a single UTF-16 code unit to another single UTF-16 code unit, via a lookup table written to disk during filesystem creation. No Unicode normalization, no multicharacter equivalencies, and no folding for any characters above FFFF.

You say that other cases "have to be dealt with"...but we have widely used examples showing that to not actually be the case.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 13:59 UTC (Sat) by SLi (subscriber, #53131) [Link] (28 responses)

Can you explain how the user space could help here? For a Turkish user, any system that considers "I" and "i" the same filename/string modulo case is broken. Both are letters and both have upper and lower forms, but they are different letters.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 16:45 UTC (Sat) by foom (subscriber, #14868) [Link] (27 responses)

The case folding for a filesystem must be consistent, so it can't have such locale-specific rules.

Neither Mac nor windows filesystems' case folding is locale sensitive, either. (NTFS does write a file during filesystem creation containing the case folding rules for that drive, so you _could_ make them be whatever you like, at the risk of breaking everything...)

Everyone likes to bring up this example, but I rather expect the likelihood of normal Turkish users noticing and caring that they can't create two such files in the same directory is rather a theoretical problem, not an actual one.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 18:19 UTC (Sat) by nybble41 (subscriber, #55106) [Link] (4 responses)

The fact that case folding is broken everywhere else it's been implemented offers a good argument against implementing it in Linux.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 21:41 UTC (Sat) by mirabilos (subscriber, #84359) [Link] (3 responses)

case-insensitive filename comparison belongs into userspace, because there, if you setlocale(3) correctly, things like toupper(3), tolower(3) and strcoll(3) work properly even for Turkish.

So, the kernel should have nothing to do with this *at all*.

Working with UTF-8 in the kernel

Posted Apr 1, 2019 9:45 UTC (Mon) by nim-nim (subscriber, #34454) [Link] (2 responses)

The kernel needs to provide the encoding and normalisation userspace can work from. You can’t push those to userspace, because userspace then ends up guessing (badly) the filesystem encoding.

Working with UTF-8 in the kernel

Posted Apr 5, 2019 1:24 UTC (Fri) by xtifr (guest, #143) [Link] (1 responses)

If the case-insensitive FS is a user-space overlay on an existing filesystem, then you don't *need* to guess anything. It will do case-folding based on the locale of the user who mounted the overlay.

This means *all* the overheads will truly need to be present only for those who actively *use* the system.

I honestly don't know how this is all going forward without *at least* a user-space proof-of-concept system.

Working with UTF-8 in the kernel

Posted Apr 6, 2019 21:56 UTC (Sat) by foom (subscriber, #14868) [Link]

> It will do case-folding based on the locale of the user who mounted the overlay.

But that behavior would be pretty awful -- which files you can access depending upon your current locale? There's a reason filesystems (including this ext4 proposal) store the mapping used when creating the filesystem...

> without *at least* a user-space proof-of-concept system.

Two have been mentioned already. Android has an overlay filesystem for local access, and samba implements it when exporting the filesystem over the network.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 21:44 UTC (Sat) by mirabilos (subscriber, #84359) [Link] (6 responses)

Yes, it must be consistent. What if a new release of Unicode comes out? Boom.

Another reason why this belongs into userspace.

And no, the turkish case is not theoretical. They have words which only differ in the dot above the i, and in one case, one of the two words is normal and one a rather crass insult, which led to (IIRC) a knife attack (well, some kind of real-life attack at the person) because they had no dotless i on their keyboard when texting.

I’ll quote someone else: just because your latin alphabet has 26 letters, not everyone else’s does. Imagine if we’d *always* (independent on what word it’s in) make “oo” compare the same as “u”, for example.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 21:51 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> And no, the turkish case is not theoretical. They have words which only differ in the dot above the i, and in one case, one of the two words is normal and one a rather crass insult, which led to (IIRC) a knife attack (well, some kind of real-life attack at the person) because they had no dotless i on their keyboard when texting.
This is the story: https://gizmodo.com/a-cellphones-missing-dot-kills-two-pe...

Although I personally wouldn't blame the cellphone here.

Working with UTF-8 in the kernel

Posted Mar 30, 2019 22:43 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

The situation was bad.

Bad technology made it worse.

The cellphone doesn't get off scot-free here.

Working with UTF-8 in the kernel

Posted Mar 31, 2019 1:46 UTC (Sun) by foom (subscriber, #14868) [Link] (3 responses)

Hopefully the filesystem records what mapping it was created with, like NTFS does. Otherwise, some of your files may become inaccessible when a new mapping is switched to (which, iirc, did happen on HFS+ before. That's not good...)

Re: Turkish swears -- you can name your files either word just fine -- the filesystem does not change your chosen filename to the other name! Only if you try to make files named both, in the same directory, will you run into an issue. I still claim that is *highly* unlikely.

If we treated oo and u as the same for filename comparison purposes, because that was a very common language's policy, I rather suspect that also wouldn't be a huge problem. (It'd be weird to have such behavior, as that isn't a common policy, however.)

Working with UTF-8 in the kernel

Posted Mar 31, 2019 19:17 UTC (Sun) by naptastic (guest, #60139) [Link]

> because that was a very common language's policy

Which one‽ I've never heard of this and I am dying to know! MY BRAIN IS HUNGRY

Working with UTF-8 in the kernel

Posted Apr 4, 2019 5:37 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

> Hopefully the filesystem records what mapping it was created with, like NTFS does. Otherwise, some of your files may become inaccessible when a new mapping is switched to (which, iirc, did happen on HFS+ before. That's not good...)

This seems like the key to me. If the case folding rules can change, there's no way to guarantee that the same file will always be accessible the same way, and that's true whether the case folding happens in the kernel or in userspace.

Working with UTF-8 in the kernel

Posted Apr 4, 2019 12:28 UTC (Thu) by bosyber (guest, #84963) [Link]

> If we treated oo and u as the same for filename comparison purposes, because that was a very common language's policy
Is it? I know that it might be that way effectively in German, but in Dutch it is absolutely not, they are completely different sounds (the german u sound is closer to Dutch oe double sound, but not oo which is a loong vowel in Dutch.).

Working with UTF-8 in the kernel

Posted Apr 1, 2019 6:46 UTC (Mon) by marcH (subscriber, #57642) [Link] (14 responses)

> Everyone likes to bring up this example, but I rather expect the likelihood of normal Turkish users noticing and caring that they can't create two such files in the same directory is rather a theoretical problem, not an actual one.

Check the numerous, real-world examples and references given in the comments on the previous LWN article: https://lwn.net/Articles/784041/ It's not just Turkish: like any other natural-language topic, case sensitivity is very complex and (among other things) locale-specific - not just in theory but in practice.

foom wrote:
> Neither Mac nor windows filesystems' case folding is locale sensitive, either. (NTFS does write a file during filesystem creation containing the case folding rules for that drive, so you _could_ make them be whatever you like, at the risk of breaking everything...)

Interesting, references?

nybble41 wrote:
> The fact that case folding is broken everywhere else it's been implemented offers a good argument against implementing it in Linux.

Wait... should Linux be "bug for bug" compatible or linguistically correct?

Working with UTF-8 in the kernel

Posted Apr 3, 2019 2:51 UTC (Wed) by dvdeug (guest, #10998) [Link] (8 responses)

Linux should implement the standard rules for case folding, and not worry about the linguistic details.

Working with UTF-8 in the kernel

Posted Apr 3, 2019 5:07 UTC (Wed) by marcH (subscriber, #57642) [Link] (7 responses)

What are these "standard rules" and which locale are they based on? US English?

Working with UTF-8 in the kernel

Posted Apr 3, 2019 6:18 UTC (Wed) by dvdeug (guest, #10998) [Link] (6 responses)

http://www.unicode.org/versions/Unicode12.0.0/ch05.pdf offers basically the standard rules, though some details you might have to refer to the technical reports. It's based on all locales; Turkish and some Lithuanian dictionary usage are about the only locales that have a case exception from standard case folding.

Working with UTF-8 in the kernel

Posted Apr 8, 2019 17:53 UTC (Mon) by hkario (subscriber, #94864) [Link] (5 responses)

and German... SS being upper case of ß, but ss being the lower case of SS...

Working with UTF-8 in the kernel

Posted Apr 8, 2019 20:30 UTC (Mon) by dvdeug (guest, #10998) [Link] (4 responses)

That's not an exception. In all locales, ß uppercases to SS.

Working with UTF-8 in the kernel

Posted Apr 9, 2019 18:30 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (3 responses)

Except it doesn’t, any more, in German.

Working with UTF-8 in the kernel

Posted Apr 10, 2019 0:50 UTC (Wed) by dvdeug (guest, #10998) [Link] (2 responses)

Citation needed. If you're talking about the Capital Eszett, it has been explicitly excluded from case-folding because it is non-standard. If the German speakers really wanted a change, given they are the only modern language group using it, it would change in all locales.

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:15 UTC (Wed) by chithanh (guest, #52801) [Link]

I think it is more complex than that.

ß (U+00DF) indeed has no uppercase mapping in Unicode.
But ẞ (U+1E9E) has a lowercase mapping of ß.

So if you start with ẞ and then convert to lowercase and then to uppercase again you might end up with SS.

Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
But a case-insensitive filename match for ß will not return a file named ẞ.

Working with UTF-8 in the kernel

Posted Apr 17, 2019 22:40 UTC (Wed) by marcH (subscriber, #57642) [Link]

> Also, if you perform a case-insensitive filename match for ẞ it will return a file named ß.
> But a case-insensitive filename match for ß will not return a file named ẞ.

Just for fun, some more "real-world" case insensitivity (from comments in the previous LWN thread)
https://www.google.com/search?q=FRANCAIS
https://www.google.com/search?q=FRANÇAIS

Good luck supporting this in your filesystem.

> If the German speakers really wanted a change,...

Thanks, you just confirmed case sensitivity is not "hard science" no matter how hard Unicode tries to pretend it is. What a surprise considering it's part of natural languages. That's why it definitely has a place in high-level user interfaces like file explorers, choosers and maybe even interactive command lines (with some autocorrection) but certainly not "hardwired" at a very low level in filesystems where it has already been seen causing damage.

Working with UTF-8 in the kernel

Posted Apr 4, 2019 13:00 UTC (Thu) by foom (subscriber, #14868) [Link] (4 responses)

> Interesting, references?

Search for $upcase -- the name of the 128KB pseudo file stored in every NTFS filesystem. You can also look at the NTFS filesystem driver for Linux.

This file contains the corresponding uppercased character (2 bytes) for each one of the 65536 unicode characters. When windows wants to compare filenames, it simply indexes each character in each string through this table, to make an uppercase string, before doing the comparison.

When you reformat a drive it writes the newest mapping to the file, and that partition will use the same mapping as long as you keep it.

And, yes, I am quite aware that everyone who knows anything about unicode is crying out in distress at the utter WRONGness of what I said above...

But of course, the secret is that users aren't really the ones who care about case insensitive comparisons... They are using gui file pickers and such higher level tools where the filesystem case behavior doesn't matter.

Note the primary use cases given for Linux (samba exports, Android emulating a FAT filesystem on top of ext4) are all about *software* expectations, not humans. Software that was written with hardcoded filenames of the wrong case. That's why ntfs's braindead case folding is not really a problem.

Which does rather bring into question whether implementing "correct" normalization and case folding in Linux even has a point... It won't make it more compatible with the legacy software to do that...

Working with UTF-8 in the kernel

Posted Apr 8, 2019 6:21 UTC (Mon) by cpitrat (subscriber, #116459) [Link] (3 responses)

If the primary use case is to be compatible with NTFS, why not implement it the same way ? As I understand it, NTFS support will require a fake unicode version ?

Working with UTF-8 in the kernel

Posted Apr 8, 2019 21:49 UTC (Mon) by foom (subscriber, #14868) [Link]

I don't know.

It does seem rather incongruous to me to justify the feature by pointing to samba's emulation of NTFS case folding, and Android's emulation of FAT file name lookup rules, but then implementing Unicode normalization and correct Unicode case folding...which those don't do.

Working with UTF-8 in the kernel

Posted Apr 11, 2019 20:49 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

Because, as I understand it, utf-16 is now seen to have been a mistake.

Forcing all filenames to be valid utf-16 will break quite a lot elsewhere ... I think that if you want to implement the utf universe properly in utf-16, you end up back with the 8-bit codeset mess, only bigger ...

Cheers,
Wol

Working with UTF-8 in the kernel

Posted Apr 11, 2019 23:15 UTC (Thu) by foom (subscriber, #14868) [Link]

Er what? I don't really understand your comment, but NTFS doesn't implement utf-16.

It stores filenames as arbitrary sequences of 16-bit values. There are a few tens of values you cannot use (ascii control characters 0-31, and some ascii punctuation), but everything else is fair game. In particular, invalid utf16 containing broken surrogate pairs is perfectly fine.
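To make the "broken surrogate pairs" point concrete, here is a small sketch of what a strict UTF-16 validator would reject and a store of arbitrary 16-bit values would accept. The function name `has_unpaired_surrogate` is hypothetical, and the check covers only surrogate pairing, not the forbidden-character rules mentioned above.

```python
def has_unpaired_surrogate(units):
    """Return True if a sequence of 16-bit values is not well-formed UTF-16."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


well_formed = [0x0041, 0xD83D, 0xDCA9]  # "A" followed by U+1F4A9
broken = [0xD83D, 0x0041]               # high surrogate followed by "A"
print(has_unpaired_surrogate(well_formed))
print(has_unpaired_surrogate(broken))
```

A filesystem that treats names as opaque 16-bit values never runs a check like this, so both sequences are equally acceptable to it.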

Working with UTF-8 in the kernel

Posted Apr 7, 2019 22:40 UTC (Sun) by jschrod (subscriber, #1646) [Link]

Ahem...

I think a case could be made that one could blame Unicode for *not* representing these different alphabets. After all, the same code point is used for "different" characters in the alphabets - if one agrees to your statement that these are really different alphabets.

But, in real life, too much water has flowed under this bridge to discuss it outside an evening in a wine bar with some friends who are encoding freaks. I have to admit I would be part of such a discussion... ;-)

Cheers, Joachim

Working with UTF-8 in the kernel

Posted Apr 11, 2019 13:12 UTC (Thu) by robbe (guest, #16131) [Link]

I blame Unicode for the fact that they did not spend a codepoint for LATIN SMALL LETTER I WITH DOT ABOVE.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 14:54 UTC (Fri) by mina86 (guest, #68442) [Link] (4 responses)

There's also the curious case of lc(ẞ) = ß but uc(ß) = SS. Or sigma having two lower case forms. I anticipate a world of pain and kernel devs will have only themselves to blame. ;)
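These asymmetries are easy to reproduce in user space. The sketch below uses Python's built-in string methods, which follow the Unicode case mappings. It only illustrates the behaviour and says nothing about how the kernel code handles these characters.

```python
capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # ß LATIN SMALL LETTER SHARP S
sigma = "\u03c3"            # σ GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"      # ς GREEK SMALL LETTER FINAL SIGMA
capital_sigma = "\u03a3"    # Σ GREEK CAPITAL LETTER SIGMA

# Lowercasing the capital sharp s gives ß, but uppercasing ß gives "SS".
lowered = capital_sharp_s.lower()
uppered = small_sharp_s.upper()

# Upper then lower does not round-trip: ß -> "SS" -> "ss".
round_trip = small_sharp_s.upper().lower()

# Both lowercase sigmas share one uppercase form.
sigma_upper = sigma.upper()
final_sigma_upper = final_sigma.upper()

# Case folding, meant for caseless comparison, collapses each pair.
folded_sharp_s = small_sharp_s.casefold()
folded_final_sigma = final_sigma.casefold()

print(lowered, uppered, round_trip)
print(sigma_upper, final_sigma_upper)
print(folded_sharp_s, folded_final_sigma)
```

Because upper() and lower() are not inverses, a caseless comparison has to use a dedicated folding operation rather than converting both strings to one case.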

Working with UTF-8 in the kernel

Posted Mar 30, 2019 4:27 UTC (Sat) by gps (subscriber, #45638) [Link] (1 responses)

Nah, that world of pain already exists. Today it is distributed among umpteen different application and library implementations that differ.

What's one more? Some of the above could even go away after this in some system designs.

Working with UTF-8 in the kernel

Posted Mar 31, 2019 14:41 UTC (Sun) by mina86 (guest, #68442) [Link]

> Nah, that world of pain already exists. Today it is distributed among umpteen different application and library implementations that differ.

That may be so, but I'd rather my Samba server crashed than my kernel oopsed or executed malicious code because Unicode was handled incorrectly. Just like putting an HTTP server inside the kernel wasn't a good idea, I'm not yet convinced that putting Unicode handling there is.

Working with UTF-8 in the kernel

Posted Apr 8, 2019 6:24 UTC (Mon) by cpitrat (subscriber, #116459) [Link] (1 responses)

If only more people were as reasonable as the Irish, who changed the Gaelic alphabet to be compatible with typewriters (replacing their accented letters with combinations of plain letters, like gh) and therefore now don't have any problem like this ...

Working with UTF-8 in the kernel

Posted Apr 11, 2019 20:57 UTC (Thu) by Wol (subscriber, #4433) [Link]

Well if you do this, aren't you going eventually to end up with just one letter in your alphabet? Okay, I'm being facetious, but this happened to the Roman alphabet in the ?1400s, when printing arrived. Pre-printing, the written English alphabet had a fair few more letters than 26. I'm not sure of the details, but at least one example is the replacement of thorn (looks like the Yen symbol) with Y, hence all the signs "Ye Olde Coffee Shop".

Cheers,
Wol

Working with UTF-8 in the kernel

Posted Apr 2, 2019 21:10 UTC (Tue) by daniels (subscriber, #16193) [Link] (2 responses)

Is emoji in LWN article body text one of the four horsemen?

Working with UTF-8 in the kernel

Posted Apr 2, 2019 23:31 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

One of the four horsemen of the a-PILE-OF-POO-calypse, perhaps? (AU+1F4A9calypse for short, but that's harder to pronounce.)

Working with UTF-8 in the kernel

Posted Apr 11, 2019 5:01 UTC (Thu) by lysse (guest, #3190) [Link]

Crapocalypse?

Working with UTF-8 in the kernel

Posted Apr 12, 2019 20:12 UTC (Fri) by donbarry (guest, #10485) [Link]

I'm really astonished that this userspace library is in the kernel to allow several application servers intended to bridge compatibility with broken operating systems to be somewhat simpler.

At this rate, and aided by its UTF library, by Zawinski's law soon the kernel will contain an email client.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds