
The kernel and character set encodings

Posted Feb 19, 2004 9:24 UTC (Thu) by Cato (subscriber, #7643)
Parent article: The kernel and character set encodings

You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL or '/' for anything other than their ASCII semantics). For a start there are ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP and Shift-JIS (both popular in Japan), and so on.

Getting the character encoding right is difficult, and with UTF-8 there is an additional complication: Unicode normalisation. The issue is that in certain languages a single symbol on the page can be encoded as three Unicode characters: the letter followed by accent 1 then accent 2 in one string, and the letter followed by accent 2 then accent 1 in another. The two strings render identically on screen, yet a byte comparison says they differ. Unicode normalisation defines a specific order for such 'combining character' sequences, but unfortunately there is more than one normalisation form: Linux and the W3C use NFC, while Darwin and MacOS X use NFD, even on UFS filesystems.
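
A minimal Python sketch of the reordering problem (the specific letter and accents are illustrative, not taken from the comment):

    import unicodedata

    # One visual symbol built from three code points, accents in two orders:
    a = "e\u0302\u0323"   # e + combining circumflex + combining dot below
    b = "e\u0323\u0302"   # e + combining dot below + combining circumflex

    print(a == b)         # False: the code point sequences differ

    # NFC reorders the combining marks canonically and composes them, so
    # both spellings collapse to the single code point U+1EC7:
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))    # True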

Unicode makes life more complicated for everyone, and it's likely some of this needs to be in the kernel, or at least glibc, for uniformity. For more links on Unicode, from a Perl/wiki-oriented perspective, see the plan for TWiki support of UTF-8 and this Unicode normalisation page.



The kernel and character set encodings

Posted Feb 19, 2004 9:54 UTC (Thu) by one2team (guest, #7316) [Link]

« You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL or '/' for anything other than their ASCII semantics). For a start there are ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP and Shift-JIS (both popular in Japan), and so on. »

These encodings are mostly useless on a true multi-user system. Why? Because they are all mutually incompatible, so there is no way for a user working in encoding A to read material (including filenames) created by another user in encoding B. This is true even for closely related pairs (KOI8-U and KOI8-R, for example), to say nothing of users who want to quote another language (French + Russian, Welsh + Greek, etc.).
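
A small Python illustration of the incompatibility (the word is an arbitrary example): the same bytes read back under a different encoding silently turn into mojibake:

    # A Russian filename as written by a KOI8-R user:
    raw = "привет".encode("koi8_r")

    print(raw.decode("koi8_r"))      # привет - what the author sees
    print(raw.decode("iso8859_1"))   # ÐÒÉ×ÅÔ - what a Latin-1 user sees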

The only thing all those encodings are compatible with is English, which restricts one's second language to English and English only.

One could argue that userspace just has to use a Greek encoding for Greek filenames, a Russian one for Russian filenames, and so on. But the crux of the problem is that userspace has no way to request or guess which encoding was used to write a filename, since the kernel neither enforces a particular encoding nor provides encoding information to userspace.

One additional problem is that some byte strings are invalid UTF-8 and cause applications to barf when they try to decode them.
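
For instance (a Python sketch; the byte string is an arbitrary example), a filename written by an ISO-8859-1 user simply refuses to decode as UTF-8:

    raw = b"caf\xe9"          # 'café' as written in ISO-8859-1
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)              # 0xe9 is not valid UTF-8 in this position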

The kernel and character set encodings

Posted Feb 19, 2004 13:12 UTC (Thu) by Cato (subscriber, #7643) [Link]

These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where there is some external data describing the character set of the file (e.g. a directory or file name including 'koi8-r'). I am well aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away - there are still gaps in the user-level tools available, though they are improving.

What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII, and in particular with *nix filesystems. So far I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list; then at least Linux users would know which character sets to avoid for filenames.
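
As an illustration of why HZ-* in particular is unsafe (a Python sketch; the character is chosen because its HZ byte pair happens to contain 0x2F): HZ carries Chinese text using only printable ASCII bytes, so a perfectly valid encoded name can contain a literal '/' byte:

    # HZ-GB-2312 wraps GB2312 text in the ASCII shift sequences ~{ ... ~}
    raw = "隘".encode("hz")

    print(raw)           # b'~{0/~}'
    print(b"/" in raw)   # True: the path separator shows up mid-character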

The issue of invalid UTF-8 strings is no different from any other mis-encoded characters. Still, it would be good if glibc, or perhaps the kernel, checked UTF-8 for overlong sequences: these are a well-known security hole, and rejecting them is not hard.
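
For example, the overlong two-byte encoding of '/' (0xC0 0xAF) was a classic way to smuggle a path separator past naive validators; a strict decoder such as Python's rejects it outright (a sketch):

    overlong = b"\xc0\xaf"        # overlong 2-byte encoding of '/' (U+002F)
    try:
        overlong.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                  # 0xc0 is never a valid UTF-8 start byte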

The kernel and character set encodings

Posted Feb 19, 2004 11:18 UTC (Thu) by ibukanov (guest, #3942) [Link]

> These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison.

You do not even need Unicode normalization for that. In most fonts the following two lines have exactly the same visual presentation (you have to view the page with UTF-8 encoding, as LWN does not allow entering РОТ in HTML comments due to bugs in its recognition of &#code; escapes):
POT
РОТ
yet the first uses pure ASCII and the second uses only Cyrillic characters (it means 'mouth' in Russian).
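
A quick Python check of the two strings (a sketch) shows they share no code points, and no normalisation form will ever make them equal:

    import unicodedata

    latin    = "POT"                   # U+0050 U+004F U+0054
    cyrillic = "\u0420\u041e\u0422"    # РОТ: Cyrillic ER, O, TE

    print(latin == cyrillic)                          # False
    print(unicodedata.normalize("NFC", latin) ==
          unicodedata.normalize("NFC", cyrillic))     # still False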

IMHO such examples support the notion that the kernel should not impose any policy on filename encoding: in practice there is always more than one way to encode the same visual presentation, and UTF-8 with Unicode does not help here.

The kernel and character set encodings

Posted Feb 19, 2004 12:14 UTC (Thu) by mwh (subscriber, #582) [Link]

> Unicode makes life more complicated for everyone
  If Unicode is a horde of zombies with flaming dung sticks, 
  the hideous intricacies of JIS, Chinese Big-5, Chinese 
  Traditional, KOI-8, et cetera are at least an army of ogres 
  with salt and flensing knives.        -- Eric S. Raymond, python-dev
Unicode isn't that hard to deal with, although I'd admit to not having any intuition for what the right answer is in this situation.

The kernel and character set encodings

Posted Feb 20, 2004 22:19 UTC (Fri) by spitzak (guest, #4593) [Link]

There is no problem with UTF-8 filenames. The bytes should be stored
unchanged, and the unchanged bytes should be used to look up the file. It
does not matter whether those bytes form a legal UTF-8 string, to say
nothing of what normalization form they are in.
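
In Python terms (a sketch of the bytes-are-opaque approach), passing bytes rather than strings to the filename APIs round-trips every name untouched, valid UTF-8 or not:

    import os

    # Ask for raw bytes and hand the same bytes straight back: no decoding,
    # no normalisation, no opinion on whether the name is valid UTF-8.
    for raw in os.listdir(b"."):     # bytes in, bytes out
        info = os.stat(raw)          # the exact bytes look the file up again
        print(raw, info.st_size)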

Unfortunately there are hordes of people out there who think dumb ideas
like case-insensitivity should be applied at low levels to stuff that
really is binary data. This kind of thinking is what causes complexity,
and complexity causes bugs and security holes.

Any program that takes a string it thinks is UTF-8 and does ANYTHING
other than pass the exact bytes unchanged to another interface that wants
UTF-8 is by definition broken. This simple rule will completely eliminate
all ambiguity about UTF-8.


The kernel and character set encodings

Posted Feb 21, 2004 7:49 UTC (Sat) by Cato (subscriber, #7643) [Link]

This problem needs to be addressed somewhere, though not necessarily in the kernel (perhaps in glibc or the GUI layer): two users create identical-looking filenames using Vietnamese accented characters (a letter plus two accents in different orders, three Unicode characters altogether). Now there are two identical-looking filenames and you don't know how to type the 'right' one. Even with only one file involved, without Unicode normalisation you couldn't rely on bash filename completion, since you might type the accents in a different order from the one used in the filename, with no visual clue as to your mistake.
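
A Python sketch of the collision (the filenames are illustrative): two byte-distinct names that render identically, which a single NFC pass at creation time would have merged:

    import os, tempfile, unicodedata

    d = tempfile.mkdtemp()
    # The same accented letter with its two combining marks in both orders:
    for name in ("e\u0302\u0323.txt", "e\u0323\u0302.txt"):
        open(os.path.join(d, name), "w").close()

    print(len(os.listdir(d)))    # 2: two distinct byte strings were stored

    # Normalising to NFC would have made them collide into one name:
    print(len({unicodedata.normalize("NFC", n)
               for n in os.listdir(d)}))    # 1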

Given these issues, which affect command-line tools as much as GUIs, it may be sensible to put NFC normalisation in glibc or the kernel, despite the complexity. Files created from another system on a Linux NFS filesystem would of course bypass glibc, so the alternatives are batch renormalisation (always an option; convmv may do this) or putting NFC in the kernel.

It's not good enough to say 'case-insensitivity should not be in the kernel' - you need to address these use cases and say how and where you would solve them.

