Working with UTF-8 in the kernel

Posted Mar 28, 2019 23:35 UTC (Thu) by gdt (subscriber, #6284)
In reply to: Working with UTF-8 in the kernel by ikm
Parent article: Working with UTF-8 in the kernel

The essential requirement is efficient case-insensitive comparison of file names. At present the provided API is not efficient; there are also races between checking that the filename is not in use and creating a new file with that filename. The kernel design choices are: (1) the kernel supports UTF-8, or (2) the kernel gives user space an efficient, race-free API that allows a directory to be listed and changes to that directory to be locked out while user space handles UTF-8. Choice (2) is scary enough that choice (1) looks better.



Working with UTF-8 in the kernel

Posted Mar 29, 2019 3:08 UTC (Fri) by zlynx (guest, #2285) [Link] (5 responses)

I may not understand something here. But if you read a directory and assume that just because there's no file there, you are free to make a new one, that's a bad assumption. And always has been. That's the source of several /tmp vulnerabilities in the past.

Always assume someone stole your filename. It isn't yours until you hold a handle to it.
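To make the "hold a handle" point concrete, here is a minimal sketch of claiming a name atomically with `O_CREAT | O_EXCL`, so that the existence check and the creation happen in a single step. The helper name `claim_name` and the file name `report.txt` are made up for illustration.

```python
import os
import tempfile


def claim_name(path):
    """Try to create `path` atomically.

    Returns an open file descriptor if we created the file,
    or None if somebody else already owns the name.
    """
    try:
        # O_EXCL makes open() fail if the name exists, so there is no
        # window between "check" and "create".
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return None


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "report.txt")

first = claim_name(target)   # we get a descriptor: the name is ours
second = claim_name(target)  # the name is already taken

if first is not None:
    os.close(first)
```

Note that this only protects an exact byte-for-byte name; it does nothing for names that differ only in case.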

So how is this case normalization system helping anyone?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 6:09 UTC (Fri) by khim (subscriber, #9252) [Link] (4 responses)

Case normalization removes the need for the whole thing. To implement case-insensitive semantics in userspace you must check whether SoMeFiLeNaMe.txt is there and then create SomeFilename.txt, and do both atomically. If the kernel is asked to create SomeFilename.txt and returns a reference to SoMeFiLeNaMe.txt, then this atomicity is handled in the kernel.
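A rough sketch of what the userspace version looks like, and where it races. It assumes Python's `str.casefold()` as a stand-in for whatever comparison the application wants (this is not the kernel's normalization), and the function names are invented for the example.

```python
import os
import tempfile


def find_case_insensitive(directory, name):
    """Return an existing entry that matches `name` ignoring case, or None."""
    wanted = name.casefold()
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            return entry
    return None


def open_case_insensitive(directory, name):
    """Open `name`, reusing a differently-cased existing entry if present."""
    existing = find_case_insensitive(directory, name)
    # RACE WINDOW: between the listing above and the open() below, another
    # process may create a differently-cased variant of the same name.
    chosen = existing if existing is not None else name
    fd = os.open(os.path.join(directory, chosen), os.O_RDWR | os.O_CREAT, 0o600)
    return chosen, fd


workdir = tempfile.mkdtemp()
os.close(os.open(os.path.join(workdir, "SoMeFiLeNaMe.txt"),
                 os.O_WRONLY | os.O_CREAT, 0o600))

chosen, fd = open_case_insensitive(workdir, "SomeFilename.txt")
os.close(fd)
```

In the single-process run above `chosen` is the existing `SoMeFiLeNaMe.txt`, but nothing prevents two processes from both seeing "no match" and creating two variants. `O_EXCL` does not help here, because the two names are different as far as the filesystem is concerned.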

P.S. I wonder if these tables (without code) could be exposed to userspace. Userspace guys ALSO often need to deal with Unicode, and if the kernel already has all these tables... why not use them?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 6:35 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

The overhead of cross-address access will probably make it impractical for userspace.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 8:26 UTC (Fri) by felix.s (guest, #104710) [Link] (2 responses)

It seems to work fine for vDSO, doesn't it?

Working with UTF-8 in the kernel

Posted Mar 29, 2019 8:28 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

That would work for basically static data. At this point a special file in /proc might work just as well.

Working with UTF-8 in the kernel

Posted Mar 29, 2019 9:31 UTC (Fri) by dezgeg (guest, #92243) [Link]

Having the data tables readable from /proc sounds unattractive due to this part from the article:

"The UTF-8 patches incorporate these rules by processing the provided files into a data structure in a C header file. A fair amount of space is then regained by removing the information for decomposing Hangul (Korean) code points into their base components, since this is a task that can be done algorithmically as well."

Exporting these non-standard tables to userspace would lock in this custom format implementation detail forever.
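For reference, the algorithmic Hangul decomposition mentioned in the quoted passage is the arithmetic given in the Unicode Standard. A small sketch follows; the function name `decompose_hangul` is made up here, and this is not the kernel patch's code.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one before the first trailing consonant
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose_hangul(ch):
    """Decompose one precomposed Hangul syllable into its jamo.

    Characters outside the precomposed syllable block are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # t_index == 0 means the syllable has no trailing consonant.
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


# Cross-check against Python's own NFD implementation for every syllable.
mismatches = [
    cp for cp in range(S_BASE, S_BASE + S_COUNT)
    if decompose_hangul(chr(cp)) != unicodedata.normalize("NFD", chr(cp))
]
```

Since all 11172 syllables fall out of a few divisions, dropping them from the table loses nothing, but it also means the remaining table is not a plain copy of the Unicode data files.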

Working with UTF-8 in the kernel

Posted Apr 1, 2019 14:12 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

Nothing scary about that: Open directory (or use an already open descriptor), acquire lock which prevents adding/removing entries, process accumulated change notifications, create/remove entry, unlock.

Such a lock must already exist, BTW; it may be sufficient to expose that. Advisory locking would probably be OK, as UNIX processes are usually supposed to cooperate and not fight with each other. That would be vastly simpler in the kernel than 'hard-coding' a specific, known-to-be-broken/deficient case translation mechanism into certain filesystems. Considering cases like "vertical bar plus combining overline" (aka T, not going to happen as that's an ASCII codepoint), I consider "kernel supports UTF-8" much more 'scary'.
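As a rough illustration of the open/lock/check/create/unlock sequence, here is a sketch that uses an advisory `flock()` on the directory descriptor. This is an assumption-laden stand-in: it is not the in-kernel directory lock referred to above, it only excludes processes that take the same lock, it re-lists the directory instead of processing change notifications, and `str.casefold()` is just a placeholder comparison. The function name is invented.

```python
import fcntl
import os
import tempfile


def create_case_insensitive_locked(directory, name):
    """Create `name` unless a case-insensitive match already exists.

    Returns (entry_name, created). Only safe against other processes
    that take the same advisory lock on the directory.
    """
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        # Advisory exclusive lock on the directory itself.
        fcntl.flock(dir_fd, fcntl.LOCK_EX)
        wanted = name.casefold()
        for entry in os.listdir(directory):
            if entry.casefold() == wanted:
                return entry, False
        # Create relative to the locked directory descriptor.
        fd = os.open(name, os.O_WRONLY | os.O_CREAT | os.O_EXCL,
                     0o600, dir_fd=dir_fd)
        os.close(fd)
        return name, True
    finally:
        fcntl.flock(dir_fd, fcntl.LOCK_UN)
        os.close(dir_fd)


workdir = tempfile.mkdtemp()
first = create_case_insensitive_locked(workdir, "Notes.TXT")
second = create_case_insensitive_locked(workdir, "notes.txt")
```

A process that ignores the lock and calls `open()` directly can still slip in a second variant, which is the cooperation assumption spelled out.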


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds