
Setting up international character support (Linux.com)

Linux.com looks at support for international characters. "Created in 1992 by Ken Thompson on a placemat in a New Jersey diner, UTF-8 has today become a computing standard. Most recent Linux distributions support UTF-8, although many, including Debian, give users the option of using legacy locales that contain only the characters needed for a specific language."


Riddled with errors

Posted Feb 7, 2006 9:41 UTC (Tue) by dractyl (subscriber, #26334) [Link] (12 responses)

As a person who has to deal with many languages simultaneously in a production environment, I was glad to see an article on UTF-8. I'm moderately horrified that distros still ship with something other than UTF-8 as the default. In any case, I was disheartened to read the article itself, which is riddled with errors and fuzzy thinking.

Let's start (I'm skipping nitpicks and I'm sure I'm missing a lot):

1. "Unicode" is not backwards compatible with ASCII. Unicode defines a number of encodings (the most common of which are UTF-8 and UTF-16/UCS-2). Of all the encodings, UTF-8 at least is backwards compatible. I suspect that CESU-8, UTF-7 and the bastardized Java UTF-8 are all also compatible. That being said, thinking that UTF-8 and Unicode are effectively synonymous in this day and age is wrong as Windows NT/2000/XP all use UTF-16 as their default.

2. UTF-8 is not called UTF-8 because "it uses four octets per character". Number 1: It encodes characters in *UP TO* 4 bytes. It encodes all of ISO 8859-1 in 1 byte/character for example. Besides, that's 4 bytes or 32 bits. Not sure where the 8 is coming from. It's called UTF-8 because its basic unit has *SHOCK*, 8 bits.
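If it helps to see that concretely, here is a quick Python 3 sketch (my own illustration, not from the article) of the variable-length encoding built from 8-bit units:

    # ASCII bytes pass through unchanged; everything else becomes a
    # multi-byte sequence of 8-bit code units, up to 4 bytes long
    print("abc".encode("utf-8"))          # b'abc'
    print("\u3059".encode("utf-8"))       # b'\xe3\x81\x99'  (3 bytes for U+3059)
    print("\U0001D11E".encode("utf-8"))   # b'\xf0\x9d\x84\x9e'  (4 bytes, the maximum)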

3. Lack of glyphs in fonts is not UTF-8's problem. In fact, it's kind of a benefit, because you can use fonts which only have the character sets you need. If Unicode specified requirements for fonts (well outside its authority, I might add), it would be something of a mess because all of our fonts would be 32MB in size. If I only speak English, why do I want to waste memory storing glyph information for Icelandic, Japanese and Bengali?

4. Still on the subject of fonts, if he's such a king of Unicode, why didn't he point us to one of the Unicode-complete fonts that are available? MS Office's version of Arial is pretty close to Unicode complete. The free CODE2000 font is also pretty close to Unicode complete. Many Unix systems allow for a kind of font fallback, where the system tries to render the glyph in the font you chose; if it can't, it falls back to the next font on the list and tries to get the glyph from there, and so on until it finds it or gives up. Put CODE2000 at the bottom of the list, and no matter what font you're using, everything will still display.

5. From TFA, "But if all you want is the ability to type the occasional accent over a letter, then your best choice is a UTF-8 locale". No, taking advantage of deadkey keymaps is the best choice. That doesn't have anything to do with Unicode per se, only the input subsystem. Given that he's probably talking about English speakers who occasionally want to type an ümlaut, ISO-8859-1 will also work just fine. Not that I'm advocating you use something other than UTF-8, mind you.

6. The author says you need to append ".utf8" to the locale name. Then he proceeds to use ".utf-8" and ".UTF-8" throughout the rest of the article. I work with a large number of Linux distros and "UTF-8" always seems to work; avoid the others. As for appending it in the XkbLayout config line, I'm pretty sure this gets discarded. I have never used it, yet everything I do still ends up as UTF-8. I have also made some of my own layouts and don't remember ever seeing that in the config files. Frankly, it's not the keyboard layout handler's job to deal with encoding; it's only concerned with what keystroke generates what character. In any case, the format in which X sends the bits to a program is going to be defined by the system locale, so that's probably why I've never had to specify it. It just gets pulled from the environment.

7. I can't speak to the veracity of his statements regarding Gnome as I'm a KDE user.

8. In the section regarding dead keys, the author continues blurring the line between locale and input method. He says, "Both UTF-8 and many legacy locales support deadkeys". It is not the character encoding that supports the deadkeys, it's the keymaps. Allow me to state how this works: what you can type in, and how you can type it, is governed by keymaps. What you can display and store is governed by locale. You may be able to type in an acute accented E because your keymap allows you to, but you get a question mark or a square because either the locale or the font cannot deal with the character. If the locale is failing you (the question mark), the data is lost because the computer is incapable of storing it. If the font is failing you (the square), the data is still intact but simply cannot be displayed.

I realize he sort of points this all out, but he makes it sound like they are all the same thing, rather than 3 discrete systems with their own responsibilities. It's just sloppy, sloppy thinking.

In any case, I have a splitting headache right now and anime is starting so I'm going to leave it here. Suffice it to say my issues with this article keep right on going and going. Some clearer thinking and some better research/understanding would have resulted in a better article. Thank goodness he didn't start in about CJK input.

The last point I'd like to make is regarding his statement "probably the greatest annoyance is the need to press the apostrophe key followed by the space bar to type a straight quotation mark". If you use AltGr to move the deadkeys up into the 3rd shift group, then you can type just as though it were a normal US keyboard but still have access to all your extended characters.

To finish that point, allow me to demonstrate. My keyboard is a standard US keyboard using a slightly modified keymap and an IM. Anyone using my keyboard would be totally unaware that it wasn't standard. I can type these without composing:

äåé®þüúíóö«»¬áßðfghjkø¶æ©ñµçÄÅÉ®ÞÜÚÍÓÖÁ§ÐذƢѵÇ

With composing (via AltGr, which is mapped to right-alt on my keyboard), there's a lot more. And then of course I can bring the IM into the picture and do tens of thousands of CJK symbols. I'd type some, but LWN seems to filter them out (which is to say, convert them to things like す, which is the Japanese hiragana "su").

Anyway, take the article with a grain of salt (or maybe a salt lick), but be sure to always use UTF-8 when you get the chance.

P.S. Sorry if there are any grammatical or spelling errors in this one. I need a tylenol and don't feel like proofreading for another 20 minutes before I get it. Besides, there's anime afoot!

Riddled with errors, indeed

Posted Feb 7, 2006 14:23 UTC (Tue) by eru (subscriber, #2753) [Link] (1 responses)

> 2. UTF-8 is not called UTF-8 because "it uses four octets per character". Number 1: It encodes characters in *UP TO* 4 bytes. It encodes all of ISO 8859-1 in 1 byte/character for example.

Actually, it encodes only the old 7-bit ASCII in 1 byte/character. The accented and umlauted versions of Latin letters in European languages usually need 2 bytes in UTF-8, and sometimes more. This makes converting into UTF-8 a real pain for those users whose languages have so far been adequately handled by ISO 8859-1: the 8-bit text files are suddenly riddled with random-looking garbage, and similarly for UTF-8 files imported into a system with a "legacy" encoding. (Personal experience: my native language is Finnish.) Only the 7-bit ASCII users will see the upgrade to UTF-8 as painless.
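A small Python 3 illustration of both halves of that, as a sketch rather than anything from the article:

    # 7-bit ASCII stays at 1 byte, but a Latin-1 accented letter becomes 2 bytes
    print(len("e".encode("utf-8")), len("ä".encode("utf-8")))    # 1 2
    # and UTF-8 text viewed under a legacy ISO 8859-1 locale shows the familiar garbage
    print("ä".encode("utf-8").decode("iso-8859-1"))              # Ã¤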

I guess we will have to grin and bear it, in the hope that this is the last character set transition, ever!

Riddled with errors, indeed

Posted Feb 7, 2006 17:46 UTC (Tue) by dractyl (subscriber, #26334) [Link]

Quite right on the ASCII/ISO-8859-1 point. My apologies for that. In any case, I definitely agree that it will be nice to see this transition come to pass, but I think legacy encodings will haunt us for some time. Just the other day I found an old backup with filenames encoded in *something*. SJIS or EUC-JP I should think...

Riddled with errors

Posted Feb 7, 2006 14:45 UTC (Tue) by job (guest, #670) [Link]

Seriously, it's Newsforge/Linux.com. What did you expect? When did you last see an article published there which was well researched (and not from some well known person, like Perens or Stallman)?

Riddled with errors

Posted Feb 7, 2006 22:49 UTC (Tue) by blecoint (✭ supporter ✭, #131) [Link] (7 responses)

Well, actually, the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF (rather than U+10FFFF), which means a character can be encoded in up to 6 bytes. Quite annoying when you expect 4 bytes max.

Riddled with errors

Posted Feb 8, 2006 3:45 UTC (Wed) by dractyl (subscriber, #26334) [Link] (6 responses)

Since the Unicode standard itself (which is separate from ISO/IEC 10646) defines a 4 byte maximum, I'll consider 5 and 6 byte UTF-8 an aberration born of a problem synchronizing the standards. They track each other closely in terms of code points, but Unicode defines rather a lot more than the mere number/character mappings and is the main standard.

In fact, the Unicode standard specifically references this issue in Appendix C and clearly states that 5 and 6 byte sequences are illegal.

Even if this weren't the case, it's still academic: the complete Unicode space tops out at U+10FFFF, which needs 21 bits, and 4 bytes of UTF-8 supply exactly 21 bits. That space also includes ample room for expansion, even if we discover a previously unknown, huge (multiple tens of thousands of characters) alphabet. So even within the ISO/IEC 10646 standard, it isn't actually possible to get past 4 bytes.

If you do come across a 5 or 6 byte encoding, throw it away; it's illegal. Incidentally, the existing bastardizations of Unicode are a much more serious, and much less academic, problem that you'll run into.
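For what it's worth, a quick Python 3 check (mine, not from the thread) shows the same arithmetic:

    # U+10FFFF, the top of the Unicode range, needs exactly 4 UTF-8 bytes
    print(len(chr(0x10FFFF).encode("utf-8")))    # 4
    # and anything beyond that range is refused outright
    try:
        chr(0x110000)
    except ValueError as err:
        print(err)                               # chr() arg not in range(0x110000)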

---

Unicode Standard Section 3 - 3.9 defines UTF-8 as 4 bytes
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

Unicode Standard Appendix C - Talks about the differences with ISO/IEC 10646 including the byte differences
http://www.unicode.org/versions/Unicode4.0.0/appC.pdf

RFC 3629 Also follows the Unicode 4 byte standard
http://www.ietf.org/rfc/rfc3629.txt

Riddled with errors

Posted Feb 8, 2006 22:28 UTC (Wed) by blecoint (✭ supporter ✭, #131) [Link] (5 responses)

The Unicode standard claims to be "the official way to implement ISO/IEC 10646", and yet it decided not to implement the full ISO/IEC 10646 range. While I agree with you that it's an academic question today, nothing prevents the Unicode standard from implementing the full ISO/IEC 10646 tomorrow. This is the reason why I try to write code that can handle 6 byte UTF-8 characters. There are many well known projects that do the same: both Xerces and ICU support 6 byte UTF-8 characters.

Funny that you mention RFC 3629 because it clearly states in section 10 "Security considerations":
"Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences."
Which I liberally quoted in my original comment.

Anyway, the Unicode space is more than enough for today's needs. I was just trying to spread the knowledge that a UTF-8 character can take up to 6 bytes, even though only 4 byte characters are used today; the 4 byte limit seems to me a common misconception about UTF-8.

Finally, I totally agree with you that the existing bastardizations of Unicode are a serious problem for any application that has no control over its input data.

Riddled with errors

Posted Feb 9, 2006 3:13 UTC (Thu) by dractyl (subscriber, #26334) [Link] (4 responses)

I agree with you. While all 5 and 6 byte UTF-8 encodings are currently illegal, people should undoubtedly be checking for them for the security reasons you mentioned.

That also goes for any overlong representation such as using 0xE0 0x80 0x8A for U+000A. It occurs to me that a nice string like 0xFC 0x80 0x80 0x80 0x80 0x80 0x80 0x8A would screw up some of the stupider UTF-8 engines...
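For the curious, a strict decoder refuses both of those. Here is what Python 3's built-in codec (which follows the RFC 3629 rules) does with them, just as a sketch:

    for bogus in (b"\xe0\x80\x8a",                          # overlong encoding of U+000A
                  b"\xfc\x80\x80\x80\x80\x80\x80\x8a"):     # the 0xFC string above
        try:
            bogus.decode("utf-8")
        except UnicodeDecodeError as err:
            print("rejected:", err)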

While RFC 3629 does consider the implications of ISO/IEC 10646's notion of UTF-8, RFC 3629 itself subscribes to the idea of the 4 byte limit along with Unicode.

The fact of the matter is that we have 3 independent standards; two say "4 bytes" and one says "6 bytes". Which is "right"? I don't know. I'm inclined to go with Unicode, as it's the dominant and quite frankly more useful standard (despite the fact that it doesn't allow for style variations in kanji). That being said, I'm still forced to deal with ISO/IEC 10646's concept of the world for security reasons.

At the end of the day I'll just shrug my shoulders and wait for the day such a question actually matters, which looks very much like never where I'm standing.

BTW, thank you for your comments. The ensuing discussion has been rather more interesting than the unfortunate excuse for an article which spawned it.

Riddled with errors

Posted Feb 11, 2006 4:40 UTC (Sat) by pimlott (guest, #1535) [Link] (1 responses)

> That being said, I'm still forced to deal with ISO/IEC 10646's concept of the world for security reasons.
Sorry if I'm misinterpreting what you mean here, but I can't help responding to that statement with alarm. If considering that your input might be ISO-10646 makes your code more secure, you must be using slap-dash programming practices (and an unsafe language) to begin with. Ill-formed input is ill-formed input, regardless of whether some related standard thinks it's valid.

Riddled with errors

Posted Feb 11, 2006 9:23 UTC (Sat) by dractyl (subscriber, #26334) [Link]

Allow me to clarify. It seems to me I can approach writing a decoder two different ways:

1. Decide that ISO/IEC 10646 is valid or common enough and play along. In this case I will decode the 6 byte sequences.

2. Decide that I'd rather follow the Unicode format strictly, in which case I'll drop character sequences over 4 bytes long as invalid.

In either case, I'll have to discard codepoints higher than U+10FFFF. Even 4 byte sequences can still produce illegal codepoints, since they encode 21 bits of data while Unicode stops at U+10FFFF. Other illegal codepoints, such as the surrogates U+D800 to U+DFFF, and U+FFFE and U+FFFF, as well as any other bugaboos, still need to be stripped out. As for overlong sequences, I'm more inclined to normalize them than anything else.
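To make that checklist concrete, here is a minimal sketch (mine, in Python, names hypothetical) of the codepoint-level checks described above, applied after the byte-level decoding has been done:

    def acceptable(cp):
        # Reject codepoints a strict decoder should never emit.
        if cp > 0x10FFFF:                   # beyond the Unicode range
            return False
        if 0xD800 <= cp <= 0xDFFF:          # UTF-16 surrogates, never valid in a UTF-8 stream
            return False
        if cp in (0xFFFE, 0xFFFF):          # the noncharacters mentioned above
            return False
        return True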

It was not my intention to suggest detecting which standard a given piece of data uses, as that seems unnecessarily fragile and error-prone in a decoder; merely that we must consider the existence of ISO/IEC 10646 when writing our software.

At the end of the day, it hardly seems to matter which you choose.

High code-points vs. invalid representations of low ones

Posted Feb 13, 2006 4:28 UTC (Mon) by xoddam (subscriber, #2322) [Link] (1 responses)

> a nice string like 0xFC 0x80 0x80 0x80 0x80 0x80 0x80 0x8A would
> screw up some of the stupider UTF-8 engines.

You're absolutely right, and for that reason only the shortest possible representation of a code point is considered a valid UTF-8 character. Your code should NEVER produce byte sequences like the above, and you should consider such input to be an attempt to bluff either your program or its user. The best policy (I'm not sure whether it's a MUST or a SHOULD in the relevant standards) is to discard either the character or the entire input on these grounds.

None of which decisively determines whether you should accept code points so high that they haven't been defined yet. The worst that can happen is an integer overflow (signed Unicode characters, anyone?); otherwise you just have a weird character. Throw it away or draw a box; I doubt anyone really cares which.

High code-points vs. invalid representations of low ones

Posted Feb 13, 2006 5:13 UTC (Mon) by dractyl (subscriber, #26334) [Link]

> Your code should NEVER produce byte sequences like the above

Unless of course I'm trying to break or subvert a UTF-8 decoder, in which case that's exactly the kind of thing I want my code to produce. ;) That's what I originally had in mind when I suggested the byte sequence.

FYI, the RFC version of the standard says you should not accept codepoints beyond U+10FFFF. From the RFC: "There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences." Also, an integer overflow can be a very bad thing. I could see an exploitable attack arising out of that (or at least a crash), so it's best to protect against it.

Riddled with errors (Addendum)

Posted Feb 9, 2006 4:50 UTC (Thu) by dractyl (subscriber, #26334) [Link]

Well now that I don't have a splitting headache, I should correct some minor points, just for the sake of completeness.

1. As was already pointed out, UTF-8 encodes all of ASCII in 1 byte per character, not all of ISO 8859-1. It should be noted, however, that Unicode's first 256 codepoints are identical to ISO 8859-1's; they are just encoded differently.
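A quick Python 3 check of that, if you want to convince yourself (my illustration, not part of the original comment):

    # every ISO 8859-1 byte value maps to the Unicode codepoint with the same number
    print(all(ord(bytes([i]).decode("iso-8859-1")) == i for i in range(256)))   # True
    # but the UTF-8 encoding of those codepoints differs once you pass 0x7F
    print("é".encode("iso-8859-1"), "é".encode("utf-8"))    # b'\xe9' b'\xc3\xa9'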

2. CODE2000 is *not* free but shareware. In fact, I just paid my shareware fee last night when I discovered my mistake. You can find the font at:

http://home.att.net/~jameskass/index.htm

3. In my response, I gave the author a hard time about not pointing out any fonts, which is not quite accurate. While the author did manage to point out the Gentium Unicode font, that font only has 1500 or so glyphs, which makes it suitable only for Latin and Greek (and soon Cyrillic) alphabets. It's nowhere near Unicode complete and is not suitable for non-Western users.

4. I stated that I never came across "UTF-8" in the X config files. This is true as far as the keymaps are concerned, but it does show up in the locales, which affect input via the Compose file, which defines what sequences of characters you can type to get a given character. As usual in Unix, there is more than one way to do it. For example, if you want to type an Ò you can do it any of these ways:

<dead_grave> <O> : "Ò" U00D2 # LATIN CAPITAL LETTER O WITH GRAVE
<Multi_key> <grave> <O> : "Ò" U00D2 # LATIN CAPITAL LETTER O WITH GRAVE
<combining_grave> <O> : "Ò" U00D2 # LATIN CAPITAL LETTER O WITH GRAVE

I used <dead_grave> <O> because I use dead_keys rather than a Compose key (known here as Multi_key). In my keymap I have <dead_grave> set to AltGr+` where AltGr is mapped to the right alt key.

In any case, your locale can affect what sequence of characters you need to type in to get a specific character. For example, tt_RU.TATAR-CYR, tt_RU.KOI8-C, and tt_RU.UTF-8 all have different compose sequences from each other. Thankfully, all UTF-8 locales use the same compose sequences except pt_BR.UTF-8 which changes the compose sequence for a few characters.

Once again, this does not actually have anything to do with UTF-8 or even unicode for that matter. These compose combinations are based on custom and the X11R6 standard rather than Unicode.

5. I stated deadkey keymaps were the best way to input characters such as ü or ç. The Compose key or any special keys you may have on your keyboard are good too. It's really up to personal choice. The point I was making was that this isn't a UTF-8 issue.

6. I stated that you get a square if your font is failing you. This is true, but you may also get a small black diamond with a question mark in it or indeed nothing at all if it's a very bad font.

Setting up international character support (Linux.com)

Posted Feb 7, 2006 11:41 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

This is a pretty awful article: it gets a few facts right and a lot of things wrong, and it manages to make the whole thing sound very complicated, so that users (especially those with little interest in foreign languages and cultures) will be inclined to avoid it altogether.

It confuses a locale (information about the user's culture and language, e.g. paper sizes, currency, calendar) with a character set (which is just a bunch of characters, possibly an ordered set), and both of those with a character encoding (a way to turn characters into byte sequences). The last bit is understandable; even early IETF documents confuse character encodings with character sets, because back then the two were often synonymous.

en_GB is the name of a UK locale, with English language, A4 paper, £ sterling currency, the first month is called "January" and decimals are separated with a point not a comma.
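Those same facts are what the C library hands to programs for that locale. A quick Python sketch of the idea (mine, not the poster's), assuming the en_GB.UTF-8 locale has actually been generated on the system:

    import locale
    locale.setlocale(locale.LC_ALL, "en_GB.UTF-8")           # assumes the locale is installed
    conv = locale.localeconv()
    print(conv["currency_symbol"], conv["decimal_point"])    # £ .
    print(locale.nl_langinfo(locale.MON_1))                  # January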

UTF-8 is a character encoding, and it encodes the character set ISO 10646 which is sometimes called "Unicode" since it's identical to the Unicode character set, although Unicode.org standardises many things beyond the ISO 10646 character set.

A previous poster pointed out numerous additional errors. If you don't know anything about i18n, Unicode or UTF-8, stay clear of this article. If you do know something, be prepared to be annoyed by it. Next time we see an article which says Linux is "not for commercial use" or that the GPL is "untried, and probably illegal" we should consider this article and remember that incompetence is a more common explanation than maliciousness.

Better explanation?

Posted Feb 7, 2006 17:01 UTC (Tue) by cgorac (guest, #35767) [Link] (2 responses)

Would some of you knowledgeable guys care to explain it better, preferably in terms of configuration file changes? I'm primarily interested in having xterm/rxvt able to display UTF-8 text (uxterm?) and accept Unicode input ("cat >unicode.txt"), in being able to switch keyboards (as is possible now by typing a "setxkbmap" command) from a specific list of several layouts with a hot-key combination (say Alt-Shift), and in having Emacs's input method follow the currently selected keyboard...

Better explanation?

Posted Feb 11, 2006 5:08 UTC (Sat) by pimlott (guest, #1535) [Link] (1 responses)

For a unicode-based environment, you should have only to set your LANG environment variable to en_US.UTF-8 (assuming the en_US part is right for you). (Be sure it's set when X starts, not just when your shell starts, say in .xsession if you use a display manager.) xterm and I hope other modern terminal programs will recognize the locale, and so encode/decode the terminal streams as utf-8 and use a unicode font. While some instructions advise using a special terminal program or passing it special flags, I never had to do any of that (thank goodness). Then cat some utf-8 text and see the pretty symbols. :-) Any other well-written program should likewise honor LANG (although many allow further configuration as well, in case you need to deal with other encodings sometimes).
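A quick way to check that a program really is picking the locale up from the environment is to ask the C library (or any language runtime sitting on top of it); a small Python sketch, mine rather than the poster's, with the values you would expect under LANG=en_US.UTF-8:

    import locale
    locale.setlocale(locale.LC_ALL, "")        # adopt LANG / LC_* from the environment
    print(locale.getlocale())                  # e.g. ('en_US', 'UTF-8')
    print(locale.getpreferredencoding())       # 'UTF-8' under a UTF-8 locale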

I can't tell you anything about the rest of your question.

Better explanation?

Posted Feb 14, 2006 3:05 UTC (Tue) by roelofs (guest, #2599) [Link]

> xterm and I hope other modern terminal programs will recognize the locale, and so encode/decode the terminal streams as utf-8 and use a unicode font.

xterm may also require the -u8 option. Both LC_CTYPE=en_US.UTF-8 and -u8 are already part of the uxterm script, at least on Slackware. The default font seems to be missing a fair number of CJK characters, however (not to mention virtually all Indic ones, Thai, etc.).

Greg

Modified US symbol map

Posted Feb 10, 2006 2:46 UTC (Fri) by dractyl (subscriber, #26334) [Link]

If it's helpful, here is a diff of the changes I made to my /etc/X11/xkb/symbols/us file to get my very happy keyboard (outside of scim for CJK support of course):

*** ./us.orig Fri Feb 10 11:23:46 2006
--- ./us.nrb Fri Feb 10 11:12:51 2006
***************
*** 71,88 ****
partial alphanumeric_keys
xkb_symbols "intl" {

! name[Group1]= "U.S. English - International (with dead keys)";

include "us(basic)"

// Alphanumeric section
! key <TLDE> { [dead_grave, dead_tilde, grave, asciitilde ] };
key <AE01> { [ 1, exclam, exclamdown, onesuperior ] };
key <AE02> { [ 2, at, twosuperior, dead_doubleacute ] };
key <AE03> { [ 3, numbersign, threesuperior, dead_macron ] };
key <AE04> { [ 4, dollar, currency, sterling ] };
key <AE05> { [ 5, percent, EuroSign ] };
! key <AE06> { [ 6, dead_circumflex, onequarter, asciicircum ] };
key <AE07> { [ 7, ampersand, onehalf, dead_horn ] };
key <AE08> { [ 8, asterisk, threequarters, dead_ogonek ] };
key <AE09> { [ 9, parenleft, leftsinglequotemark, dead_breve ] };
--- 71,88 ----
partial alphanumeric_keys
xkb_symbols "intl" {

! name[Group1]= "U.S. English - International (with shifted dead keys)";

include "us(basic)"

// Alphanumeric section
! key <TLDE> { [ grave, asciitilde, dead_grave, dead_tilde ] };
key <AE01> { [ 1, exclam, exclamdown, onesuperior ] };
key <AE02> { [ 2, at, twosuperior, dead_doubleacute ] };
key <AE03> { [ 3, numbersign, threesuperior, dead_macron ] };
key <AE04> { [ 4, dollar, currency, sterling ] };
key <AE05> { [ 5, percent, EuroSign ] };
! key <AE06> { [ 6,asciicircum, onequarter, dead_circumflex ] };
key <AE07> { [ 7, ampersand, onehalf, dead_horn ] };
key <AE08> { [ 8, asterisk, threequarters, dead_ogonek ] };
key <AE09> { [ 9, parenleft, leftsinglequotemark, dead_breve ] };
***************
*** 109,115 ****

key <AC09> { [ l, L, oslash, Ooblique ] };
key <AC10> { [ semicolon, colon, paragraph, degree ] };
! key <AC11> { [dead_acute, dead_diaeresis, apostrophe, quotedbl ] };

key <AB01> { [ z, Z, ae, AE ] };
key <AB03> { [ c, C, copyright, cent ] };
--- 109,115 ----

key <AC09> { [ l, L, oslash, Ooblique ] };
key <AC10> { [ semicolon, colon, paragraph, degree ] };
! key <AC11> { [apostrophe, quotedbl, dead_acute, dead_diaeresis ] };

key <AB01> { [ z, Z, ae, AE ] };
key <AB03> { [ c, C, copyright, cent ] };

That's it. I just moved the dead keys into the upper shift group so you have to use AltGr to type them. I did this because I totally agree with the author that having to contend with dead keys *all the time*, when you only want accented characters once in a while, is highly annoying.

