The kernel and character set encodings
Posted Feb 19, 2004 13:12 UTC (Thu) by Cato
In reply to: The kernel and character set encodings
Parent article: The kernel and character set encodings
These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where some external data describes the character set of the file (e.g. a directory name or file name that includes 'koi8-r'). I am well aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away: there are still gaps in the user-level tools available, though they are improving.
What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII, and in particular with *nix filesystems. So far, I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list. Then at least Linux users would know which character sets to avoid for filenames.
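As a rough starting point for such a list, here is a minimal Python sketch that reports which ASCII-range bytes show up when non-ASCII text is encoded in a given character set. The helper name `ascii_bytes_in` is made up for illustration, and the codec names are the ones shipped with CPython; this only samples the characters you feed it, so it can show that an encoding is unsafe but cannot prove that one is safe.

```python
def ascii_bytes_in(text: str, codec: str) -> set:
    """Return the byte values below 0x80 produced by encoding the
    non-ASCII characters of `text` with `codec`.

    A non-empty result means the encoding reuses ASCII byte values
    inside non-ASCII characters (or in escape sequences around them),
    so byte-oriented code that scans for '/', '\\' or similar may
    misread the name.
    """
    non_ascii = "".join(ch for ch in text if ord(ch) > 0x7F)
    return {b for b in non_ascii.encode(codec) if b < 0x80}


if __name__ == "__main__":
    # Sample characters chosen for illustration only.
    samples = [
        ("\u529f", "big5"),        # a CJK character whose trail byte is 0x5C
        ("\u4e2d", "hz"),          # HZ is a 7-bit encoding
        ("\u4e2d", "iso2022_jp"),  # ISO-2022 uses ASCII-range escape sequences
        ("\u042f", "koi8-r"),      # Cyrillic letters sit in the high half
        ("\u4e2d", "utf-8"),       # UTF-8 never reuses ASCII bytes
    ]
    for text, codec in samples:
        found = sorted(hex(b) for b in ascii_bytes_in(text, codec))
        print(f"{codec:12} {found}")
```

Any encoding that yields a non-empty set here is a candidate for the "avoid in filenames" list, although whether it can actually produce a '/' (0x2F) or NUL byte needs checking against the encoding's specification.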
The issue of invalid UTF-8 strings is no different from that of any other mis-encoded characters. It would be good if glibc or perhaps the kernel checked UTF-8 for overlong sequences, as this is a well-known security hole and the check is not hard to do.
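To show how small the overlong check is, here is a sketch in Python (not glibc or kernel code; the name `find_overlong` is hypothetical). It decodes each multi-byte sequence and compares the result with the smallest code point that legitimately needs that many bytes. It deliberately ignores other validity rules such as surrogates and values above U+10FFFF.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000}


def find_overlong(data: bytes):
    """Return the offset of the first overlong UTF-8 sequence, or None.

    Raises ValueError on input that is malformed in other ways
    (bad lead byte, truncated sequence, missing continuation byte).
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # plain ASCII, one byte
            i += 1
            continue
        if (lead & 0xE0) == 0xC0:       # 110xxxxx: two-byte sequence
            length, value = 2, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:     # 1110xxxx: three-byte sequence
            length, value = 3, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:     # 11110xxx: four-byte sequence
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"malformed sequence at offset {i}")
        for b in tail:                  # append 6 payload bits per byte
            value = (value << 6) | (b & 0x3F)
        if value < MIN_CODE_POINT[length]:
            return i                    # a shorter form exists: overlong
        i += length
    return None
```

For example, the two bytes C0 AF decode to 0x2F ('/') and are flagged, which is exactly the case that lets a '/' slip past a byte-level path check. Python's own strict decoder already rejects such input: `b"\xc0\xaf".decode("utf-8")` raises `UnicodeDecodeError`.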