Thoughts from LWN's UTF8 conversion
Later machines, of course, standardized on eight-bit bytes and the ASCII character set. Having a standard meant that nobody had to worry about character set issues anymore; the fact that it was ill-suited for use outside of the United States didn't seem to matter. Even as computers spread worldwide, usage of ASCII stuck around for a long time. Thus, your editor has a ready-made excuse for not thinking much about character sets when he set out to write the "new LWN site code" in 2002. Additionally, the programming languages and web platforms available at the time did not exactly encourage generality in this area. Anything that wasn't ASCII by then was Latin-1 - for anybody with a sufficiently limited world view.
Getting past the Latin-1 limitation took a long time and a lot of work, but that seems to be accomplished and stable at this point. In the process, your editor observed a couple of things that were not immediately obvious to him. Perhaps those observations will prove useful to anybody else who has had a similarly sheltered upbringing.
Now, too, we have a standard for character representation; it is called "Unicode." In theory, all one needs to do is to work in Unicode, and all of those unpleasant character set problems will go away. Which is a nice idea, but there's a little detail that is easy to skip over: Unicode is not actually a standard for the representation of characters. It is, instead, a mapping between integer character numbers ("code points") and the characters themselves. Nobody deals directly with Unicode; they always work with some specific representation of the Unicode code points.
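To make the distinction concrete, here is a small Python 3 sketch (an illustration only, not taken from LWN's site code). It shows a single code point being turned into three different byte sequences, depending on which encoding is chosen; the variable names are made up for the example.

```python
# One character, one code point, several byte representations.
ch = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE

# The code point is just an integer; it says nothing about bytes.
code_point = ord(ch)          # 0xE9, i.e. 233

# The same code point under three different encodings:
as_utf8 = ch.encode("utf-8")        # b'\xc3\xa9'  (two bytes)
as_latin1 = ch.encode("latin-1")    # b'\xe9'      (one byte)
as_utf16 = ch.encode("utf-16-le")   # b'\xe9\x00'  (two bytes, little-endian)

print(hex(code_point), as_utf8, as_latin1, as_utf16)
```

None of those byte sequences is "Unicode"; each is one possible representation of the same code point.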
Suitably enlightened programming languages may well have a specific type for dealing with Unicode strings. How the language represents those strings varies; many use an integer type large enough to hold any code point value, but there are exceptions. The abortive PHP6 attempt used a variable-width encoding based on 16-bit values, for example. With luck, the programmer need not actually know how Unicode is handled internally to a given language; it should Just Work.
But the use of a language-specific internal representation implies that any string obtained from the world outside a given program is not going to be represented in the same way. Of course, there are standards for string representations too - quite a few standards. The encoding used by LWN now - UTF8 - is a good choice for representing a wide range of code points while being efficient in LWN's still mostly-ASCII world. There are many other choices but, importantly, they are all encodings; they are not "Unicode."
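The efficiency argument for mostly-ASCII text can be checked with a few lines of Python 3. This is only a sketch; the helper name encoded_sizes and the sample string are assumptions made for the example.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in several encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

sample = "Linux kernel"          # twelve ASCII characters
sizes = encoded_sizes(sample)

# For pure-ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = sample.encode("utf-8") == sample.encode("ascii")

print(sizes, same_as_ascii)
```

For this ASCII-only sample, UTF-8 needs one byte per character, while the 16-bit and 32-bit encodings need two and four bytes per character respectively.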
So programs dealing in Unicode text must know how outside-world strings are represented and convert those strings to the internal format before operating on them. Any program which does anything more complicated to text than copying it cannot safely do so if it does not fully understand how that text is represented; any general solution almost certainly involves decoding external text to a canonical internal form first.
This is an interesting evolution of the computing environment. Unix-like systems are supposed to be oriented around plain text whenever possible; everything should be human-readable. We still have the human-readable part - better than before for those humans whose languages are not well served by ASCII - but there is no such thing as "plain text" anymore. There is only text in a specific encoding. In a very real sense, text has become a sort of binary blob that must be decoded into something the program understands before it can be operated on, then re-encoded before going back out into the world. A lot of Unicode-related misery comes from a failure to understand (and act on) that fundamental point.
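The decode, operate, re-encode cycle might look like the following in Python 3. This is a minimal sketch under the assumption that the caller knows the encoding of the incoming bytes; the function name upper_text is hypothetical.

```python
def upper_text(raw, encoding):
    """Decode raw bytes, operate on them as text, and re-encode the result."""
    text = raw.decode(encoding)      # binary blob -> Unicode text
    text = text.upper()              # operate on characters, not bytes
    return text.encode(encoding)     # Unicode text -> binary blob

raw = b"caf\xc3\xa9"                 # "cafe" with an accented e, in UTF-8

# Operating on the undecoded bytes only touches the ASCII letters;
# the two bytes of the accented character are left alone.
naive = raw.upper()                  # b'CAF\xc3\xa9'

# Decoding first lets the operation see the real character.
correct = upper_text(raw, "utf-8")   # b'CAF\xc3\x89'

# Even something as simple as a length differs: 5 bytes, 4 characters.
byte_len = len(raw)
char_len = len(raw.decode("utf-8"))
```

The naive version does not raise an error; it quietly produces the wrong answer, which is arguably worse.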
LWN's site code is written in Python 2. Version 2.x of the language is entirely able to handle Unicode, especially for relatively large values of x. To that end, it has a unicode string type, but this type is clearly a retrofit. It is not used by default when dealing with strings; even literal strings must be marked explicitly as Unicode, or they are just plain strings.
When Unicode was added to Python 2, the developers tried very hard to make
it Just Work. Any sort of mixture between Unicode and "plain strings"
involves an automatic promotion of those strings to Unicode. It is a nice
idea, in that it allows the programmer to avoid thinking about whether a
given string is Unicode or "just a string." But if the programmer does not
know what is in a string - including its encoding - nobody does. The
resulting confusion can lead to corrupted text or Python exceptions; as
Guido van Rossum put it in the
introduction to Python 3, "This value-specific behavior has
caused numerous sad faces over the years." Your editor's
experience, involving a few sad faces for sure, agrees with this; trying to
make strings "just work" leads to code containing booby traps that may not
spring until some truly inopportune time far in the future.
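The booby trap can be imitated in Python 3. The sketch below is an emulation, not Python 2 code: it assumes the implicit promotion decodes the plain string as ASCII, which was Python 2's default encoding, and the function name py2_style_concat is made up for the example.

```python
def py2_style_concat(text, raw):
    """Imitate Python 2's implicit promotion by decoding raw as ASCII."""
    return text + raw.decode("ascii")

# Works for as long as the byte string happens to contain only ASCII...
ok = py2_style_concat("menu: ", b"cafe")

# ...and blows up the first time real-world data shows up.
try:
    py2_style_concat("menu: ", b"caf\xc3\xa9")
    failed = False
except UnicodeDecodeError:
    failed = True

print(ok, failed)
```

The same line of code succeeds or fails depending only on the value passed in, which is the "value-specific behavior" in question.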
That is why Python 3 changed the rules. There are no "strings" anymore in the language; instead, one works with either Unicode text or binary bytes. As a general rule, data coming into a program from a file, socket, or other source is binary bytes; if the program needs to operate on that data as text, it must explicitly decode it into Unicode. This requirement is, frankly, a pain; there is a lot of explicit encoding and decoding to be done that didn't have to happen in a Python 2 program. But experience says that it is the only rational way; otherwise the program (and programmer) never really knows what is in a given string.
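A short Python 3 sketch of the stricter rules follows. The sample data and variable names are assumptions for illustration; the point is only that mixing text and bytes is refused outright, whatever the bytes contain.

```python
raw = b"caf\xc3\xa9"        # bytes as they might arrive from a file or socket

# Mixing text and bytes is an error regardless of the byte values.
try:
    "menu: " + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The program must say which encoding it believes the bytes are in...
text = "menu: " + raw.decode("utf-8")

# ...and must choose an encoding again on the way out.
out = text.encode("utf-8")

print(mixing_allowed, text, out)
```

The failure here is immediate and does not depend on the data, so it turns up in testing rather than in production.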
In summary: Unicode is not UTF8 (or any other encoding), and encoded text
is essentially binary data. Once those little details get into a
programmer's mind (quite a lengthy process, in your editor's case), most of
the difficulties involved in dealing with Unicode go away.
Much of the above is certainly obvious to anybody who has dealt with
multiple character encodings for any period of time. But it is a bit of a
foreign mind set to developers who have spent their time in specialized
environments or with languages that don't recognize Unicode - kernel
developers, for example. In the end, writing programs that are able to
function in a multiple-encoding world is not hard; it's just one more thing
to think about.
