The kernel and character set encodings
Posted Feb 19, 2004 9:24 UTC (Thu) by Cato
Parent article: The kernel and character set encodings
You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, a vast range of encodings will work: basically any encoding in which the NUL and '/' bytes never appear with anything other than their ASCII meaning. For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP and Shift-JIS (both popular in Japan), and so on.
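To make that criterion concrete, here is a rough Python sketch of the check. The function name is_filename_safe is my own invention for illustration, and it only tests the particular string it is given, so it proves nothing about an encoding as a whole. It also encodes one character at a time, which assumes a stateless codec (hence "utf-16-le" rather than "utf-16", which would emit a byte-order mark on every call).

```python
def is_filename_safe(text, encoding):
    """Return True if, for this text, the bytes 0x00 and 0x2F appear in the
    encoded form only where the text itself has NUL or '/'.

    Hypothetical helper for illustration; not part of any library.
    """
    for ch in text:
        raw = ch.encode(encoding)
        if ch in "\x00/":
            # NUL and '/' must keep their single-byte ASCII form.
            if raw != ch.encode("ascii"):
                return False
        elif 0x00 in raw or 0x2F in raw:
            # Some other character produced a byte the kernel would
            # treat as a terminator or a path separator.
            return False
    return True


utf8_ok = is_filename_safe("naïve/файл", "utf-8")
koi8_ok = is_filename_safe("файл", "koi8_r")
eucjp_ok = is_filename_safe("日本語", "euc_jp")
utf16_ok = is_filename_safe("a", "utf-16-le")          # 'a' becomes 61 00
utf16_slash_ok = is_filename_safe("\u012f", "utf-16-le")  # U+012F becomes 2F 01
```

UTF-16 fails on both counts: plain ASCII letters carry a NUL byte, and a character such as U+012F contains a stray 0x2F.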
Getting the character encoding right is difficult, and UTF-8 brings an additional complication: Unicode normalisation. In certain languages, a single symbol on the page may be encoded as three Unicode characters, for example the letter followed by accent 1 then accent 2 in one string, and the letter followed by accent 2 then accent 1 in another. The two strings look exactly the same on screen, yet a byte comparison says they differ. Unicode normalisation defines a specific order for all such 'combining character' sequences, but unfortunately there is more than one normalisation form: Linux and the W3C use NFC, while Darwin and MacOS X use NFD, even on UFS filesystems.
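The problem is easy to reproduce with Python's standard unicodedata module. This is only a sketch; the choice of letter and accents (a, circumflex U+0302, dot below U+0323) is mine, picked because the pair happens to have a precomposed form.

```python
import unicodedata

# The same visible glyph, with the two combining accents in different orders.
s1 = "a\u0302\u0323"  # a + circumflex + dot below
s2 = "a\u0323\u0302"  # a + dot below + circumflex

# A byte comparison treats them as different names.
same_bytes = s1.encode("utf-8") == s2.encode("utf-8")

# Normalising both to the same form makes them compare equal.
nfc1 = unicodedata.normalize("NFC", s1)
nfc2 = unicodedata.normalize("NFC", s2)
nfd1 = unicodedata.normalize("NFD", s1)
nfd2 = unicodedata.normalize("NFD", s2)

same_nfc = nfc1 == nfc2
same_nfd = nfd1 == nfd2

# But the two normalisation forms do not agree with each other:
# NFC composes to one code point, NFD keeps three.
forms_agree = nfc1 == nfd1
```

So a name written by an NFC system and the "same" name written by an NFD system still differ byte for byte unless one side converts.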
Unicode makes life more complicated for everyone, and it's likely that some of this handling needs to be in the kernel, or at least glibc, for uniformity. For more links on Unicode, from a Perl/Wiki oriented perspective, see the plan for TWiki support of UTF-8 and this Unicode normalisation page.