By Jonathan Corbet
February 1, 2012
There are a lot of things that one does not learn in engineering school.
In your editor's case, anything related to character encodings has to be
put onto that list. That despite the fact that your editor's first
programs were written on a system with a six-bit character size; a special
"shift out" mechanism was needed to represent some of the more obscure
characters - like lower case letters. Text was not portable to machines
with any other architecture, but the absence of a network meant that one
rarely ran into such problems. And when one did, that was what EBCDIC
conversion utilities were for.
Later machines, of course, standardized on eight-bit bytes and the ASCII
character set. Having a standard meant that nobody had to worry about
character set issues anymore; the fact that it was ill-suited for use
outside of the United States didn't seem to matter. Even as computers
spread worldwide, usage of ASCII stuck around for a long time. Thus, your
editor has a ready-made excuse for not thinking much about character sets
when he set out to write the "new LWN site code" in 2002. Additionally,
the programming languages and web platforms available at the time did not
exactly encourage generality in this area. Anything that
wasn't ASCII by then was Latin-1 - for anybody with a sufficiently limited
world view.
Getting past the Latin-1 limitation took a long time and a lot of work, but
that seems to be accomplished and stable at this point. In the process,
your editor observed a couple of things that were not immediately obvious
to him. Perhaps those observations will prove useful to anybody else who
has had a similarly sheltered upbringing.
Now, too, we have a standard for character representation; it is called "Unicode."
In theory, all one needs to do is to work in Unicode, and all of those
unpleasant character set problems will go away. Which is a nice idea, but
there's a little detail that is easy to skip over: Unicode is not actually
a standard for the representation of characters. It is, instead, a mapping
between integer character numbers ("code points") and the characters
themselves. Nobody deals directly with Unicode; they always work with some
specific representation of the Unicode code points.
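The distinction between a code point and its representation can be made concrete with a small Python 3 sketch. The character and the four encodings below are chosen purely for illustration; the point is only that one code point number turns into different byte sequences depending on the encoding.

```python
# One code point, several possible byte representations.
ch = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)   # the integer Unicode assigns to this character

# Encode the same character four different ways.
encodings = {
    name: ch.encode(name)
    for name in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}

for name, raw in encodings.items():
    print(f"U+{code_point:04X} as {name}: {raw.hex()}")
```

None of those byte sequences is "Unicode"; each is one encoding of the same code point.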
Suitably enlightened programming languages may well have a specific type
for dealing with Unicode strings. How the language represents those
strings is variable; many use an integer type large enough to hold any code
point value, but there are exceptions. The abortive PHP6 attempt used a variable-width
encoding based on 16-bit values, for example. With luck, the programmer
need not actually know how Unicode is handled internally to a given
language; it should Just Work.
But the use of a language-specific internal representation implies that any
string obtained from the world outside a given
program is not going to be represented in the same way. Of course, there
are standards for string representations too - quite a few standards. The
encoding used by LWN now - UTF8 - is a good choice for representing a wide
range of code points while being efficient in LWN's still mostly-ASCII
world. There are many other choices but, importantly, they are all
encodings; they are not "Unicode."
So programs
dealing in Unicode text must know how outside-world strings are represented
and convert those strings to the internal format before operating on them.
Any program which does anything more complicated to text than copying it
cannot safely do so if it does not fully understand how that text is
represented; any general solution almost certainly involves decoding
external text to a canonical internal form first.
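A minimal Python 3 sketch of that decode-operate-encode pattern might look like the following. The shout() helper is hypothetical, and the example assumes the external text is known to be UTF-8; a real program would have to learn the encoding from somewhere (a header, a configuration setting, a convention).

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Upper-case encoded text: decode, operate on the text, re-encode."""
    text = raw.decode(encoding)      # bytes from outside -> internal text
    result = text.upper()            # operate on characters, not bytes
    return result.encode(encoding)   # internal text -> bytes for outside


raw = "caf\u00e9".encode("utf-8")    # four characters, five bytes

# Operating on the encoded bytes directly only touches the ASCII letters;
# the two bytes that make up the accented character are left alone.
naive = raw.upper()

print(shout(raw).decode("utf-8"))
print(naive.decode("utf-8"))
```

The naive version does not fail; it quietly produces a different answer, which is arguably worse.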
This is an interesting evolution of the computing environment. Unix-like
systems are supposed to be oriented around plain text whenever possible;
everything should be human-readable. We still have the human-readable part
- better than before for those humans whose languages are not well served
by ASCII - but there is no such thing as "plain text" anymore. There is
only text in a specific encoding. In a very real sense, text has become a
sort of binary blob that must be decoded into something the program
understands before it can be operated on, then re-encoded before going back
out into the world. A lot of Unicode-related misery comes from a failure
to understand (and act on) that fundamental point.
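As a small illustration of why the encoding has to be known, the Python 3 sketch below decodes the same five bytes two ways. The byte string is an assumed example; nothing in the bytes themselves says which reading is the right one.

```python
raw = b"caf\xc3\xa9"   # the word "cafe" with an accented e, encoded as UTF-8

as_utf8 = raw.decode("utf-8")      # four characters, as intended
as_latin1 = raw.decode("latin-1")  # five characters; every byte is valid
                                   # Latin-1, so no error is raised

# Going the other way is noisier: Latin-1 bytes are usually not valid UTF-8.
try:
    b"caf\xe9".decode("utf-8")
    latin1_as_utf8_failed = False
except UnicodeDecodeError:
    latin1_as_utf8_failed = True

print(repr(as_utf8), repr(as_latin1), latin1_as_utf8_failed)
```

A wrong guess in one direction yields silently corrupted text; in the other, an exception.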
LWN's site code is written in Python 2. Version 2.x of the language is
entirely able to handle Unicode, especially for relatively large values
of x. To that end, it has a unicode string type, but this
type is clearly a retrofit. It is not used by default when dealing with
strings; even literal strings must be marked explicitly as Unicode, or they
are just plain strings.
When Unicode was added to Python 2, the developers tried very hard to make
it Just Work. Any sort of mixture between Unicode and "plain strings"
involves an automatic promotion of those strings to Unicode. It is a nice
idea, in that it allows the programmer to avoid thinking about whether a
given string is Unicode or "just a string." But if the programmer does not
know what is in a string - including its encoding - nobody does. The
resulting confusion can lead to corrupted text or Python exceptions; as
Guido van Rossum put it in the
introduction to Python 3, "This value-specific behavior has
caused numerous sad faces over the years." Your editor's
experience, involving a few sad faces for sure, agrees with this; trying to
make strings "just work" leads to code containing booby traps that may not
spring until some truly inopportune time far in the future.
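The booby trap can be approximated in Python 3, which no longer behaves this way on its own. The py2_style_concat() helper below is hypothetical; it imitates Python 2's implicit promotion by decoding the byte string as ASCII, on the assumption that the Python 2 default encoding has not been changed.

```python
def py2_style_concat(text: str, raw: bytes) -> str:
    """Hypothetical helper imitating Python 2's implicit promotion:
    the byte string is silently decoded as ASCII before concatenation."""
    return text + raw.decode("ascii")


# Works for as long as the data happens to be pure ASCII...
ok = py2_style_concat("Name: ", b"Jon")

# ...and blows up in the same line of code once it is not.
try:
    py2_style_concat("Name: ", b"Andr\xc3\xa9")
    failed = False
except UnicodeDecodeError:
    failed = True

print(ok, failed)
```

Whether the code works depends on the value of the data rather than on its type, so testing with ASCII-only input will never find the problem.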
That is why Python 3 changed the rules. There are no "strings" anymore in
the language; instead, one works with either Unicode text or binary bytes.
As a general
rule, data coming into a program from a file, socket, or other source is
binary bytes; if the program needs to operate on that data as text, it must
explicitly decode it into Unicode. This requirement is, frankly, a pain;
there is a lot of explicit encoding and decoding to be done that didn't
have to happen in a Python 2 program. But experience says that it is
the only rational way; otherwise neither the program nor the programmer
ever really knows what is in a given string.
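In Python 3 terms, the rule looks roughly like the sketch below. The byte string stands in for data read from a file or socket, and UTF-8 is an assumption about how that data was encoded.

```python
raw = b"na\xc3\xafve"   # bytes as they might arrive from a file or socket

# Mixing text and bytes is refused outright, whatever the bytes contain.
try:
    "prefix: " + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The explicit version: decode on the way in, encode on the way out.
text = raw.decode("utf-8")
greeting = "prefix: " + text
wire = greeting.encode("utf-8")

print(mixed_ok, greeting, wire)
```

The failure is immediate and independent of the data, so the mistake shows up during the first test run instead of at some inopportune time later.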
In summary: Unicode is not UTF8 (or any other encoding), and encoded text
is essentially binary data. Once those little details get into a
programmer's mind (quite a lengthy process, in your editor's case), most of
the difficulties involved in dealing with Unicode go away.
Much of the above is certainly obvious to anybody who has dealt with
multiple character encodings for any period of time. But it is a bit of a
foreign mind set to developers who have spent their time in specialized
environments or with languages that don't recognize Unicode - kernel
developers, for example. In the end, writing programs that are able to
function in a multiple-encoding world is not hard; it's just one more thing
to think about.