User: Password:
Subscribe / Log in / New account

The kernel and character set encodings

The kernel and character set encodings

Posted Feb 20, 2004 22:19 UTC (Fri) by spitzak (guest, #4593)
In reply to: The kernel and character set encodings by Cato
Parent article: The kernel and character set encodings

There is no problem with UTF-8 filenames. The bytes should be stored
unchanged, and unchanged bytes should be used to look up the file. It
does not matter if those bytes are a legal UTF-8 string or not, to say
nothing of what normalization form they are.

Unfortunately there are hordes of people out there who think dumb ideas
like case-insensitivity should be applied at low levels to stuff that
really is binary data. This kind of thinking is what causes complexity,
and complexity causes bugs and security holes.

Any program that takes a string it thinks is UTF-8 and does
<i>ANYTHING</i> other than pass the exact bytes unchanged to another
interface that wants UTF-8 is by definition broken. This simple rule will
completely eliminate all ambiguity about UTF-8.

(Log in to post comments)

The kernel and character set encodings

Posted Feb 21, 2004 7:49 UTC (Sat) by Cato (subscriber, #7643) [Link]

This problem needs to be addressed somewhere, though not necessarily in the kernel (perhaps in glibc or the GUI layer): two users create identical looking filenames using Vietnamese accented characters (letter + 2 accents in different order, 3 Unicode characters altogher). Then, there are two identical-looking filenames and you don't know how to type the 'right' one. Even if there is only one file involved, without Unicode normalisation you wouldn't be able to use bash filename completion, since you might type the accents in a different order to that used in the filename, though there would be no visual clue as to your mistake.

Given these issues, which affect command line tools as much as GUIs, it may be sensible to put NFC normalisation in glibc or the kernel, despite the complexity. Files created from another system on a Linux NFS filesystem would of course bypass glibc, so the alternatives are batch renormalisation (always an option, convmv may do this) or putting NFC in the kernel.

It's not good enough to say 'case-insensitivity should not be in the kernel' - you need to address these use cases and say how and where you would solve them.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds