
The kernel and character set encodings


Posted Feb 19, 2004 9:54 UTC (Thu) by one2team (guest, #7316)
In reply to: The kernel and character set encodings by Cato
Parent article: The kernel and character set encodings

« You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on. »

These encodings are mostly useless in a true multi-user system. Why? Because they are all incompatible with one another. A user working in encoding A has no way to read text (filenames included) written by another user working in encoding B. This holds even for closely related encodings (KOI8-U and KOI8-R, for example). Not to mention the poor users who may want to quote text in another language (French + Russian, Welsh + Greek, etc.).
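To make the incompatibility concrete, here is a small Python sketch. The filenames are made up for illustration, and `read_as` is a hypothetical helper, not anything the kernel or libc provides. It only shows that the same bytes come out as different text depending on which encoding the reader assumes.

```python
# Illustration only: the sample names below are made up for this sketch.

def read_as(raw: bytes, encoding: str) -> str:
    """Decode raw filename bytes the way a user of `encoding` would see them."""
    return raw.decode(encoding, errors="replace")

# A Russian user with a KOI8-R locale creates a file called "файл".
russian_raw = "файл".encode("koi8_r")
seen_by_koi8r_user = read_as(russian_raw, "koi8_r")        # the intended name
seen_by_latin1_user = read_as(russian_raw, "iso8859_1")    # accented Latin letters
seen_by_iso_cyrillic = read_as(russian_raw, "iso8859_5")   # Cyrillic, but the wrong letters

# A Ukrainian user with a KOI8-U locale creates "їжак".
ukrainian_raw = "їжак".encode("koi8_u")
seen_by_russian_user = read_as(ukrainian_raw, "koi8_r")    # first character is mangled

for label, text in [
    ("KOI8-R reader", seen_by_koi8r_user),
    ("ISO-8859-1 reader", seen_by_latin1_user),
    ("ISO-8859-5 reader", seen_by_iso_cyrillic),
    ("KOI8-R reader of KOI8-U name", seen_by_russian_user),
]:
    print(f"{label}: {text}")
```

Nothing in the byte string itself says which of these readings is the right one.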

The only thing all those encodings are compatible with is English, which restricts the second language to English and English only.

One could argue that userspace would just have to use a Greek encoding for Greek filenames, a Russian one for Russian filenames, and so on. But the crux of the problem is that userspace has no way to request or guess which encoding was used to write a filename, since the kernel neither enforces any particular encoding nor provides encoding information to userspace.

One additional problem is that some byte strings are not valid UTF-8 and cause applications to barf if they try to decode them.
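A rough Python sketch of that failure, assuming a made-up Latin-1 filename: `try_utf8` is a hypothetical helper for this example. The `surrogateescape` error handler is Python's own way of carrying undecodable bytes through a string; it is shown only as one possible coping strategy, not something the kernel offers.

```python
from typing import Optional

# A filename written by a user with an ISO-8859-1 locale (sample data).
raw_name = "café".encode("iso8859_1")   # b'caf\xe9'

def try_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded name, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

strict_result = try_utf8(raw_name)      # 0xE9 starts a sequence that never completes

# One way to avoid barfing: smuggle the bad bytes through as lone surrogates,
# then turn them back into the original bytes when talking to the filesystem.
smuggled = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = smuggled.encode("utf-8", errors="surrogateescape")
```

This keeps the application alive and the bytes intact, but it still cannot tell the user what the name was meant to say.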



The kernel and character set encodings

Posted Feb 19, 2004 13:12 UTC (Thu) by Cato (subscriber, #7643)

These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where there is some external data (e.g. the directory name or file name including 'koi8-r') describing the character set of the file. I am very aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away - there are still gaps in the user-level tools available, though they are improving.

What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII and in particular *nix filesystems - so far, I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list. Then at least Linux users would know which character sets to avoid for filenames.
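As a starting point for such a list, one could probe encodings with sample text and see whether a NUL or '/' byte turns up where the text contains neither. The sketch below is only that, a probe: `clashes_with_path_syntax` is a hypothetical helper, the sample character is arbitrary, and a clean result for one sample proves nothing about the whole character set.

```python
RESERVED = {0x00, 0x2F}   # NUL and '/', the two bytes *nix pathnames treat specially

def clashes_with_path_syntax(text: str, encoding: str) -> bool:
    """True if encoding `text` yields a NUL or '/' byte the text itself does not contain."""
    if "\0" in text or "/" in text:
        raise ValueError("sample text must not contain NUL or '/' itself")
    return any(byte in RESERVED for byte in text.encode(encoding))

sample = "く"   # hiragana KU, chosen because its 7-bit JIS code ends in 0x2F

for encoding in ["utf_8", "euc_jp", "shift_jis", "iso2022_jp", "utf_16_le"]:
    verdict = "clashes" if clashes_with_path_syntax(sample, encoding) else "ok for this sample"
    print(f"{encoding}: {verdict}")
```

A definitive list would have to come from the encoding specifications (which byte values are allowed in which positions), not from sampling like this.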

The issue of invalid UTF-8 strings is no different to any other mis-encoded characters - it would be good if glibc or perhaps the kernel checked UTF-8 for overlong sequences, as this is a well-known security hole and it's not hard to do this.
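To show roughly how small such a check is, here is a sketch in Python. It is not glibc or kernel code; `has_overlong` is a hypothetical function that only looks for overlong forms (a code point encoded in more bytes than it needs, such as '/' written as 0xC0 0xAF) and skips over other kinds of malformed input.

```python
def has_overlong(data: bytes) -> bool:
    """Return True if `data` contains an overlong UTF-8 sequence."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            i += 1                      # plain ASCII, cannot be overlong
            continue
        if 0xC0 <= lead <= 0xDF:
            length, minimum, cp = 2, 0x80, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, minimum, cp = 3, 0x800, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, minimum, cp = 4, 0x10000, lead & 0x07
        else:
            i += 1                      # stray continuation or invalid lead byte
            continue
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(not 0x80 <= t <= 0xBF for t in tail):
            i += 1                      # truncated or malformed, not our concern here
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:                # value would have fit in a shorter sequence
            return True
        i += length
    return False

print(has_overlong(b"etc\xc0\xafpasswd"))        # '/' hidden as a two-byte form
print(has_overlong("etc/passwd".encode("utf-8")))
```

Python's own strict UTF-8 decoder already rejects these forms, so in Python the hand-rolled check is redundant; it is here only to show the size of the job.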
