Thoughts from LWN's UTF8 conversion
There are a lot of things that one does not learn in engineering school. In your editor's case, anything related to character encodings has to be put onto that list. That despite the fact that your editor's first programs were written on a system with a six-bit character size; a special "shift out" mechanism was needed to represent some of the more obscure characters - like lower case letters. Text was not portable to machines with any other architecture, but the absence of a network meant that one rarely ran into such problems. And when one did, that was what EBCDIC conversion utilities were for.
Later machines, of course, standardized on eight-bit bytes and the ASCII character set. Having a standard meant that nobody had to worry about character set issues anymore; the fact that it was ill-suited for use outside of the United States didn't seem to matter. Even as computers spread worldwide, usage of ASCII stuck around for a long time. Thus, your editor has a ready-made excuse for not thinking much about character sets when he set out to write the "new LWN site code" in 2002. Additionally, the programming languages and web platforms available at the time did not exactly encourage generality in this area. Anything that wasn't ASCII by then was Latin-1 - for anybody with a sufficiently limited world view.
Getting past the Latin-1 limitation took a long time and a lot of work, but that seems to be accomplished and stable at this point. In the process, your editor observed a couple of things that were not immediately obvious to him. Perhaps those observations will prove useful to anybody else who has had a similarly sheltered upbringing.
Now, too, we have a standard for character representation; it is called "Unicode." In theory, all one needs to do is to work in Unicode, and all of those unpleasant character set problems will go away. Which is a nice idea, but there's a little detail that is easy to skip over: Unicode is not actually a standard for the representation of characters. It is, instead, a mapping between integer character numbers ("code points") and the characters themselves. Nobody deals directly with Unicode; they always work with some specific representation of the Unicode code points.
Suitably enlightened programming languages may well have a specific type for dealing with Unicode strings. How the language represents those strings is variable; many use an integer type large enough to hold any code point value, but there are exceptions. The abortive PHP6 attempt used a variable-width encoding based on 16-bit values, for example. With luck, the programmer need not actually know how Unicode is handled internally by a given language; it should Just Work.
But the use of a language-specific internal representation implies that any string obtained from the world outside a given program is not going to be represented in the same way. Of course, there are standards for string representations too - quite a few standards. The encoding used by LWN now - UTF8 - is a good choice for representing a wide range of code points while being efficient in LWN's still mostly-ASCII world. There are many other choices but, importantly, they are all encodings; they are not "Unicode."
So programs dealing in Unicode text must know how outside-world strings are represented and convert those strings to the internal format before operating on them. Any program which does anything more complicated to text than copying it cannot safely do so if it does not fully understand how that text is represented; any general solution almost certainly involves decoding external text to a canonical internal form first.
This is an interesting evolution of the computing environment. Unix-like systems are supposed to be oriented around plain text whenever possible; everything should be human-readable. We still have the human-readable part - better than before for those humans whose languages are not well served by ASCII - but there is no such thing as "plain text" anymore. There is only text in a specific encoding. In a very real sense, text has become a sort of binary blob that must be decoded into something the program understands before it can be operated on, then re-encoded before going back out into the world. A lot of Unicode-related misery comes from a failure to understand (and act on) that fundamental point.
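A tiny sketch of that decode/operate/re-encode cycle (Python 3 syntax; the file name and the edit are made up for the example):

    with open('comment.txt', 'rb') as f:
        raw = f.read()                    # bytes in whatever encoding arrived
    text = raw.decode('utf-8')            # the blob becomes Unicode text
    text = text.replace('\u201c', '"')    # operate on characters, not bytes
    with open('comment.txt', 'wb') as f:
        f.write(text.encode('utf-8'))     # re-encode on the way back out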
LWN's site code is written in Python 2. Version 2.x of the language is entirely able to handle Unicode, especially for relatively large values of x. To that end, it has a unicode string type, but this type is clearly a retrofit. It is not used by default when dealing with strings; even literal strings must be marked explicitly as Unicode, or they are just plain strings.
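A quick illustration of the difference (Python 2):

    s = 'caf\xc3\xa9'       # a plain str: really just eight-bit bytes (UTF-8 here)
    u = u'caf\xe9'          # a unicode string: a sequence of code points
    print type(s), len(s)   # <type 'str'> 5   - five bytes
    print type(u), len(u)   # <type 'unicode'> 4   - four characters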
When Unicode was added to Python 2, the developers tried very hard to make it Just Work. Any sort of mixture between Unicode and "plain strings" involves an automatic promotion of those strings to Unicode. It is a nice idea, in that it allows the programmer to avoid thinking about whether a given string is Unicode or "just a string." But if the programmer does not know what is in a string - including its encoding - nobody does. The resulting confusion can lead to corrupted text or Python exceptions; as Guido van Rossum put it in the introduction to Python 3, "This value-specific behavior has caused numerous sad faces over the years." Your editor's experience, involving a few sad faces for sure, agrees with this; trying to make strings "just work" leads to code containing booby traps that may not spring until some truly inopportune time far in the future.
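A minimal sketch of such a booby trap (Python 2; not code from the LWN site):

    greeting = u'Hello, '
    print greeting + 'Joe'            # fine: ASCII bytes promote silently
    print greeting + 'Jos\xc3\xa9'    # UnicodeDecodeError: 'ascii' codec
                                      # can't decode byte 0xc3 in position 3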
That is why Python 3 changed the rules. There are no "strings" anymore in the language; instead, one works with either Unicode text or binary bytes. As a general rule, data coming into a program from a file, socket, or other source is binary bytes; if the program needs to operate on that data as text, it must explicitly decode it into Unicode. This requirement is, frankly, a pain; there is a lot of explicit encoding and decoding to be done that didn't have to happen in a Python 2 program. But experience says that it is the only rational way; otherwise the program (and programmer) never really know what is in a given string.
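A small, self-contained sketch of the Python 3 discipline; the socket pair stands in for any outside data source:

    import socket

    a, b = socket.socketpair()
    a.sendall('Hello, José'.encode('utf-8'))   # text is encoded before it leaves
    data = b.recv(4096)                        # what arrives is bytes
    text = data.decode('utf-8')                # decode explicitly to get a str
    print('Re: ' + text)                       # string operations work on str
    # 'Re: ' + data would raise a TypeError: bytes and str do not mix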
In summary: Unicode is not UTF8 (or any other encoding), and encoded text is essentially binary data. Once those little details get into a programmer's mind (quite a lengthy process, in your editor's case), most of the difficulties involved in dealing with Unicode go away.
Much of the above is certainly obvious to anybody who has dealt with multiple character encodings for any period of time. But it is a bit of a foreign mind set to developers who have spent their time in specialized environments or with languages that don't recognize Unicode - kernel developers, for example. In the end, writing programs that are able to function in a multiple-encoding world is not hard; it's just one more thing to think about.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 3:12 UTC (Thu) by allesfresser (guest, #216) [Link]
LWN as open source
Posted Feb 2, 2012 4:00 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]
LWN as open source
Posted Feb 2, 2012 4:04 UTC (Thu) by corbet (editor, #1) [Link]
That has not slipped our mind. The UTF8 change - and a whole lot of invisible work that was needed to get to where we could do that change - was part of the process. It must remain a lower priority than, say, writing a pile of articles every week, but, as incredible as it must seem, it is something we want to do.
LWN as open source
Posted Feb 2, 2012 9:56 UTC (Thu) by Da_Blitz (subscriber, #50583) [Link]
is there any possibility of seeing more articles like this?
LWN as open source
Posted Feb 2, 2012 14:39 UTC (Thu) by fuhchee (guest, #40059) [Link]
Have you considered simply throwing the code over the fence into github? That'd be just a few minutes.
LWN as open source
Posted Feb 5, 2012 20:35 UTC (Sun) by musicon (guest, #4739) [Link]
Like LWN, I have written a fairly large web application that I use to run my business. I'm intending (eventually) to release it as open source, but I'm panicked there are gaping holes that would cause me to lose my livelihood.
Additionally, due to the simple fact that the code has grown as the business has expanded, there are still hard-coded sections left over from when the business was much smaller that aren't suitable for release into the 'wild'.
LWN as open source
Posted Feb 6, 2012 3:57 UTC (Mon) by raven667 (guest, #5198) [Link]
LWN as open source
Posted Feb 2, 2012 15:34 UTC (Thu) by nhasan (guest, #1699) [Link]
LWN as open source
Posted Feb 2, 2012 15:44 UTC (Thu) by jake (editor, #205) [Link]
> software/open source is itself based on proprietary code.
You know, we hear this periodically, but it is a) incorrect and b) not really helping get the code out there. It's not like we are out there selling the code to others in some closed-source form. It is the code that we run our business on, and I daresay that there are few companies in the world that don't have some code they use internally and don't release; this is no different.
Except that it is different because we do want to get it out there, and will eventually. The sad thing is that it will likely be the biggest non-event there is because the code is tied tightly to what we do here, not generalized for other uses, and there are much more plausible solutions than a 10-ish year old code base that is targeted at one particular use-case.
We certainly realize that we have been making the promise for a long time and prodding us about it is certainly fair game. Calling the code 'proprietary' is not.
ymmv,
jake
LWN as open source
Posted Feb 2, 2012 17:45 UTC (Thu) by jeremiah (guest, #1221) [Link]
LWN as open source
Posted Feb 3, 2012 15:56 UTC (Fri) by felixfix (subscriber, #242) [Link]
The alternative is to dump it periodically and not take any contributions, and no one wants to do that because they do want some kind of feedback.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 4:24 UTC (Thu) by Tara_Li (guest, #26706) [Link]
As with many things done by committee, I have to say I think Unicode was far over-complicated. I seem to remember that one of the earliest drafts was a 16-bit character set, and the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit. So, in the interest of going bat-**** insane, they went up to a 32-bit code space, and started including control characters to allow the drawing of these syllables, as well as to control left-to-right and right-to-left printing (Do they allow for columnar printing? I haven't seen whining and moaning about *that*, even if it would be nice for encoding manga in Japanese.) And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols, Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)
And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 5:56 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]
the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit.
It's a nitpicky point, but syllabaries are writing systems that represent each syllable as a single glyph and aren't the big problem. For example, Japanese Hiragana has only about 100 symbols, as does Cherokee (a non-oriental syllabary). Syllabic writing is mostly used with languages that have fairly simple rules for constructing syllables, so the number of glyphs required is still reasonable. The big problem comes with ideographic writing, where glyphs map to concepts rather than sounds. Those are the cases like Chinese writing or Kanji that can have thousands of glyphs.
And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols, Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)
Honestly, what's the problem with offering those character sets? If you have the space for them in your code table- and you ought to if you have 32 bits to work with- I don't see the harm in including a few oddball symbols, characters for made up languages, and the like. Consider the alternative of having a committee somewhere that judges which languages deserve to have their characters included in Unicode based on whether they're sufficiently serious. That opens the process up to all kinds of political pressure because questions of language are inevitably going to involve culture, ethnicity, and rights of oppressed minorities to use their own languages. I'd rather just leave the process open to anyone who is willing to go to the trouble of putting through a properly formatted application to the appropriate standards body.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 6:43 UTC (Thu) by AnthonyJBentley (guest, #71227) [Link]
As with many things done by committee, I have to say I think Unicode was far over-complicated. I seem to remember that one of the earliest drafts was a 16-bit character set, and the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit. So, in the interest of going bat-**** insane, they went up to a 32-bit code space
Do you think Unicode shouldn’t have increased the character space, and as a result been useless for Asian languages? (And technically it’s 21-bit, but that’s not important.)
And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols
Yes, because these exist in other character encodings. One of the design goals of Unicode is that any existing encoding can be losslessly converted to Unicode.
Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)
No, Tolkien script and Ultima runes are not in Unicode proper. Unicode happens to provide an undefined character space for private use, and some people happened to start using that area for those characters. As far as I know, all attempts to get Tolkien scripts added to Unicode have been rejected by the Unicode Consortium.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 7:34 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Besides, natural languages are tricky in any case. The number of glyphs is not really a problem (it was clear from the start that 16 bits are not enough). Languages with complex scripts are much trickier.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 8:07 UTC (Thu) by keeperofdakeys (guest, #82635) [Link]
And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.
The real distinction is that 'plain text' has one byte per character and code points map directly to bytes; there is no encoding or decoding done. Unicode added a layer of indirection, so that you have different possible encodings. It may all be bytes in the end, but it is the interpretation of those bytes that is important.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 18:22 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]
The real distinction is that 'plain text' has one byte per character and code points map directly to bytes; there is no encoding or decoding done.
Of course there's still encoding and decoding being done. What do you think the "C"s in ASCII and EBCDIC stand for? The 1:1 relationship between bytes and code points is still an encoding, it's just simple enough that people tend to ignore it. The illusion that there's no encoding going on will disappear the moment you have to worry about different native encodings, like using EBCDIC data on an ASCII machine or vice versa.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 9:53 UTC (Thu) by nim-nim (subscriber, #34454) [Link]
> miscellaneous symbols
It may seem stupid cruft to you, but that's the only way to make documents that include dingbats (pretty much everything produced by an office suite nowadays) not depend on a specific proprietary font available on a single specific operating system.
To this day users continue to open bugs on various free software apps because they 'misrender' dingbats ✱●▶, smileys ☹☺☻, form checks ☐☒☑✓✔✕✖✗✘, weather symbols ☀☂☃ (much loved by businesses to sum up a project state) which have been inserted in documents by software using pre-unicode fonts (either the windows wingdings* or the Adobe dingbats)
Indeed, DejaVu's support for a large number of unicode dingbats and symbols did a lot more for its popularity than its support for some human scripts. And users regularly ask for new symbol support.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 14:19 UTC (Thu) by mpr22 (subscriber, #60784) [Link]
Incidentally, the "Runic" block at U+16A0-16FF is nothing to do with video game ultrafans. It's the actual runic alphabet used to write various Germanic languages during the first millennium AD, which unquestionably has legitimate scholarly interest. (Ultima's runic alphabet is a slightly mangled version of the Anglo-Saxon variant.)
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 18:37 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]
Nor are those runes the only obsolete character set. A quick look shows that Unicode also contains Ogham, Cuneiform, Egyptian hieroglyphs, Linear B, and the undeciphered characters from the Phaistos Disk(!). And those are just the ones I recognize immediately as being only historical. I'm sure that some of the languages I'm not as familiar with are also strictly historical.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 19:14 UTC (Thu) by raven667 (guest, #5198) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 21:42 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]
Sure*. And I'm confident that the main reason they have Imperial Aramaic is for Biblical scholars. Looking at their list of supported scripts, I'm a bit surprised they don't have any of the Mesoamerican writing systems for the same general reason. My understanding is that Mayan, at least, was well enough developed and deciphered to be worth including.
*Actually, I'm not so sure for the Phaistos Disk. AFAIK, it's a solitary object with no known ties to any other writing system in existence. Why include it in Unicode when there's little hope of translating it without additional writing in the same script, which would probably add new characters and require an expansion of unknown size? Adding it to Unicode seems like premature optimization.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 13:06 UTC (Mon) by mpr22 (subscriber, #60784) [Link]
The Wikipedia article on Unicode suggests that there has not yet been a formal proposal for Mayan script due to there not yet being firm agreement in the user community for the script over what should go into it and how.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 20:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 7:59 UTC (Fri) by mitchskin (guest, #32405) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 22:49 UTC (Thu) by robert_s (subscriber, #42402) [Link]
If a user decides to input bizarre characters, that's up to them. If you don't care about it or do anything about it, it will likely get spat out in exactly the same way.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 3:56 UTC (Fri) by lambda (subscriber, #40735) [Link]
This is true of anything which has to deal with the real world; the real world is messy. Yes, any standards based design will probably come out more complicated than it needs to be; but there really is a fairly complicated problem to tackle here.
It may be useful to design one (or more) subsets of Unicode tailored for particular use cases, excising all of the cruft that was added on for obscure uses or things that people just don't do any more (like many control characters which aren't actually supported by most modern rendering engines).
> I seem to remember that one of the earliest drafts was a 16-bit character set
Yes, and this was a mistake, as it ensured that the CJK community would never be willing to adopt Unicode; plus the 16 bit encoding, UCS-2, was a bad idea as it was a simplification that attempted to preserve the "one character is one fixed-width code unit" mapping, which doesn't actually work in the general case of trying to represent the world's writing systems.
> they went up to a 32-bit code space
Actually, no. It's 17 planes, each of which is a 16 bit space; or the equivalent of a bit more than 20 bits worth of space. This is due to the fact that they made the initial mistake of thinking that 16 bits would be enough, and had to add a backwards-compatible hack, UTF-16, to extend the code space while still being mostly compatible with the initial implementations of the 16 bit code space.
> (Do they allow for columnar printing? I haven't seen whining and moaning about *that*, even if it would be nice for encoding manga in Japanese.)
I think the idea is that Unicode should be sufficient for representing "plain text," and that higher level markup languages or protocols are supposed to handle rich text, page layout, and the like. Bidirectional text is considered in-scope for plain text, as it's common (these days) to mix writing systems, while columnar text is considered out of scope as it requires specialized page layout features and doesn't really mix with horizontal text the same way that bidirectional text mixes.
> dingbats and miscellaneous symbols,
Yeah, it's picked up some of those, but really, so what? Once you have the code space for it, adding a few more random symbols doesn't really hurt. It helps to be able to unify random proprietary encodings like all of the various Japanese text-message carrier's emoji, and the additional burden is not all that great.
> Tolkien's elvish script
This has never actually been added to Unicode. It was proposed, once, a long time ago, but it has languished in limbo. Klingon was flat out rejected, because the Klingon community doesn't actually use the Klingon script.
> the runes from the Ultima series of games
No, they added actual real-world runes that have been used in historical inscriptions.
> And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.
Sure, it's just an encoding. But it can be nice to know that the encoded characters you send to someone else will appear as something legible to them, not some horrible garbage because their software doesn't support your character set or interprets it as a different one.
Unicode may seem complex, but remember, much of that complexity already existed, just within individual incompatible character encodings, or is an inherent complexity of trying to integrate different writing systems into one, single, universal encoding.
Thoughts from LWN's UTF8 conversion
Posted Feb 5, 2012 23:09 UTC (Sun) by cmccabe (guest, #60281) [Link]
Instead we got horrible kludges like UCS-2, "BOMs," and half a dozen incompatible encodings. There really should just be one canonical way to serialize unicode. Having N different ways defeats the point of standardization. And it is kind of obvious that the encoding needs to be variable length. People aren't going to stop inventing new languages just because you ran out of bits in your 16-bit integer.
In contrast, having runes or dingbats in syllabery doesn't really bother me.
Thoughts from LWN's UTF8 conversion
Posted Feb 5, 2012 23:26 UTC (Sun) by cmccabe (guest, #60281) [Link]
Basically, the committee decided that a lot of Chinese, Japanese, and other east Asian characters were just slightly different versions of each other, and so should be represented by the same code point. (This is about as politically savvy as telling French people that French is just a corrupted form of German.) It also means that you have to reintroduce the concept of code pages, since the way the Japanese characters are drawn is different than the way the Chinese ones are drawn, etc.
So basically... yeah... huge mistake. However, I guess in practice, people have found ways to work around these issues.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 1:22 UTC (Mon) by anselm (subscriber, #2796) [Link]
This is about as politically savvy as telling French people that French is just a corrupted form of German.
The logic behind Han unification is the logic that says that a French »e« looks pretty much the same as a German »e«, most of the time (some of the French ones tend to come with various kinds of squiggles on top while the German ones mostly don't) even though they are phonetically different. Hence, the French »e« and German »e« share a Unicode code point and there are a few extra code points for versions of the French »e« with squiggles on top.
The same logic concludes that many Han characters do indeed look pretty much the same in Chinese as they do in Japanese, even though they are phonetically different. This does not come as a big surprise given that the Japanese borrowed the Han characters from the Chinese some 1,000 years ago. To a certain degree the languages have since gone their separate ways but there is still a very large overlap. (The drawing differences can presumably be addressed by picking a Chinese-style font over a Japanese-style one, just like you can use a German Fraktur font in place of a »Roman« one to get a very noticeably different style of glyph for an »e«). The fact that Han characters aren't conceptually identical to letters in Western languages complicates things somewhat but not to a degree where it would have been compellingly necessary to have a few tens of thousands of characters three times over.
In any case, when the Han unification issue came up, the people in charge of Unicode still sort-of thought they could cram everything into 16 bits, and not doing Han unification would obviously have precluded that. We're past that point now, and with hindsight we could likely have lived without the Han unification – mostly to make the Japanese happy –, but calling it »the biggest mistake« in Unicode is probably taking things a bit far.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 4:48 UTC (Mon) by cmccabe (guest, #60281) [Link]
The fact is, whatever it might have been 1000 years ago, Japanese is a different language from Mandarin or Cantonese today. It ought to have its own code points rather than having to borrow someone else's. There are code points assigned for hieroglyphics and Linear B, but not for a language that people actually speak? Something went wrong here.
I also don't understand the thinking behind the 16 bit limitation. There are more than 65,535 Chinese characters in existence, so right off the bat they should have realized that 16 bits would not be enough.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 8:15 UTC (Mon) by anselm (subscriber, #2796) [Link]
The fact is, whatever it might have been 1000 years ago, Japanese is a different language from Mandarin or Cantonese today. It ought to have its own code points rather than having to borrow someone else's.
The fact is that, 1000 years of language history notwithstanding, most of the tens of thousands of Japanese characters still look substantially the same as their Chinese equivalents (stylistic, i.e., »font«, differences notwithstanding). There are some characters that Chinese has that Japanese doesn't have, and vice-versa, and these of course should have their own code points, much like »e« and »é« and »è« and »ê« have their own codepoints in Latin (»Western«) script.
However, you seem to be arguing, in effect, that »English e« and »German e« ought to have their own code points because, whatever it might have been 1000 years ago, English is a different language from German today (when 1000 years ago they were really quite similar, linguistically – much more so than at present, by virtue of the fact that the Saxon population of England had immigrated from what is now Germany a few centuries before), even though the German and English writing systems go back to the same roots, much like the Chinese and Japanese writing systems go back to the same roots as far as Han script is concerned. (This also glosses over the fact that, unlike German and English, the Chinese and Japanese languages really aren't similar at all, and hadn't been even when the Japanese took on the Han script over a millennium ago – but that is neither here nor there.)
When Unicode was first proposed, nobody was opposed to »Latin script unification« because that would have been just plain silly. With Han unification, the situation was less clear-cut because, while there are tens of thousands of Han characters (and new ones are being made up all the time), the vast majority of them only occur in names. Literate Japanese are expected to know somewhat over 2,000 kanji, not all the upwards of 50,000 ones that are in the kanji dictionary. The original 16-bit Unicode of 1991 reserved about 20,000 code points for Han characters, and that, given Han unification, would have been more than enough for most applications. We need more space to cover all the obscure characters, and it makes sense to do so considering that Unicode is supposed to be comprehensive, but that doesn't automatically mean all the obscure Han characters should be there three times instead of once.
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 6:49 UTC (Wed) by cmccabe (guest, #60281) [Link]
Who cares about the overhead of the so-called Han characters being repeated 3x? Honestly, could you locate anyone who would care? I have 6 terabytes of hard disk space. I doubt even 1% of that is text. And most of the text would be ASCII even if I lived in another country (that is the reality of text configuration files, etc.)
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 8:21 UTC (Wed) by anselm (subscriber, #2796) [Link]
I just spent an entirely unwarranted amount of time looking at web pages on Han unification. One would expect that if the current situation was so horrible then people (especially from China, Japan, and Korea) would visibly argue for a different setup where all languages using Han characters had their own code point ranges. It's not as if we didn't have the space in UCS. This is apparently not the case – instead, the relevant committees within ISO are trying to improve the actual Han unification.
It may be true that it is difficult to find people who mind the overhead of having every Han character in the UCS table three times. However, it seems to be just as difficult to find people who actually care enough to want this done.
As a former student of Japanese and a person interested in language in general, I do disagree with your notion that »China, Korea and Japan do not use a unified alphabet« but that sort of discussion is probably not germane to LWN.
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 19:36 UTC (Wed) by cmccabe (guest, #60281) [Link]
More info at
http://www.ibm.com/developerworks/java/library/u-secret.html
Hindsight is always 20/20, and no doubt some of the criticisms are unfair. But the criticism of Han unification seems fair to me. Anyway, I don't use any of these languages so I can just pretend that things are perfect in standards-land.
Except for that delete/backspace confusion. I still hate that.
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 10:01 UTC (Wed) by etienne (guest, #25256) [Link]
Using different codepoints for (nearly) the same symbol generates problems for comparison of words/sentences.
Look at the Web "attack" replacing the "a" of well known http addresses with a near identical symbol...
Thoughts from LWN's UTF8 conversion
Posted Feb 14, 2012 21:49 UTC (Tue) by cmccabe (guest, #60281) [Link]
There are a few strategies to deal with the visual equivalence problem. I think web browsers have started displaying certain URLs as punycode if the locale doesn't match the user's locale. The other way is just to not click on URLs given to you by untrusted sources. If Bank of America wants to communicate with me, I can type out their website by hand, not click on a link in an email.
This isn't a Unicode problem. It's a language problem. The languages are fundamentally broken in that they have lots of similar looking glyphs. The CJK scripts are probably the worst in this regard. However, even good old English has a lot of ambiguity. I, l, and 1 all look very similar visually, as do O, and 0. A clever engineer might try to "unify" those letters, but 1 assume that y0u w0u1d not be supportive 0f th1s.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 20:36 UTC (Mon) by BenHutchings (subscriber, #37955) [Link]
I think the original requirement for inclusion in Unicode was that there must be a one-to-one mapping from each 'legacy' character set to Unicode code points, allowing for efficient and lossless round-trip conversion. AFAIK the existing CJK character sets did not cover multiple languages and thus Han unification did not break this property. However the various ISO 8859 scripts included accented (precomposed) characters and so those were assigned their own code points.
I also don't understand the thinking behind the 16 bit limitation. There are more than 65,535 Chinese characters in existence, so right off the bat they should have realized that 16 bits would not be enough.
As I understand it, the existing character sets didn't cover all of those either. The aim of encoding 'new' characters came later as a result of the merge with ISO 10646 (the Universal Character Set). Of course, it might have been sensible to allow for expansion from the start.
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 6:24 UTC (Thu) by lordsutch (guest, #53) [Link]
As a general rule, data coming into a program from a file, socket, or other source is binary bytes; if the program needs to operate on that data as text, it must explicitly decode it into Unicode.
In casual use, I find that the more irritating issue is that output requires encoding, which interacts very badly with print() in Python 3.x. print() takes a string, not bytes, but there doesn't seem to be any non-hackish way to declare what encoding stdout and stderr should use, which means different behavior when you execute a script from the command line (say with a UTF-8 locale, making everything smooth) and as a CGI* (seemingly running in pure ASCII, and doing strict conversion, meaning every accented character throws an exception).
* Yes, I know CGI is obsolete... but for a small bit of code that may see 100 hits a day, I'm not going to the hassle of setting up whatever CGI replacement is fashionable these days.
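One workaround, sketched here as a suggestion rather than the blessed way, is to rewrap sys.stdout with an explicit encoding; setting PYTHONIOENCODING=utf-8 in the CGI environment is another option:

    import io
    import sys

    # Force UTF-8 output regardless of what locale the CGI environment provides.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('café')   # no UnicodeEncodeError even under a plain-ASCII locale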
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 6:35 UTC (Thu) by thedevil (guest, #32913) [Link]
*does* have a way to change the encoding away from the ambient locale.
See
http://www.haskell.org/ghc/docs/latest/html/libraries/bas...
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 10:25 UTC (Thu) by angdraug (subscriber, #7487) [Link]
Another one is Ruby 1.9. The only reasonable complaint I've heard about it so far is that its internal representation of strings doesn't stick to UCS4 (32-bit characters), but I'm not yet convinced it is a problem.
The only real problem that I've encountered is with some library authors insisting that their library doesn't have to work in non-UTF8 locales, but that's a general problem with Ruby's ecosystem. Remember when everyone complained about poor code quality in CPAN?
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 23:20 UTC (Thu) by Tobu (subscriber, #24111) [Link]
Same complaint with Python; its internal representation can be UCS2 (UTF-16 without some of its consistency requirements), which is a thing of the devil. Plop one character into a string, occasionally get back two (and these things aren't really characters now). That representation is being phased out on Linux thankfully, and will disappear entirely in Python 3.3. There's a lot more that could be done, but the rest isn't a problem of internal representation and can be left to libraries.
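A concrete illustration on a narrow (UCS2) build of Python 2, such as the default Windows build; the particular character is just an example:

    s = u'\U0001D11E'     # MUSICAL SYMBOL G CLEF, outside the BMP
    print len(s)          # 2 on a narrow build, 1 on a wide (UCS4) build
    print repr(s[0])      # u'\ud834' - a lone surrogate, not a real character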
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 3:44 UTC (Mon) by josh (subscriber, #17465) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 19:19 UTC (Fri) by cmccabe (guest, #60281) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 23:12 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Russian alone has: KOI8-R, KOI8-U, Win1251 and Cp855 - they are different and they are widely used (still!). There are also like 10 historic encodings (ISO, GOST, old GOST, ZX-Spectrum encoding and so on).
So it's still common to receive files with names and/or contents in a wrong encoding (especially on FAT-formatted USB thumb drives). For additional fun, sometimes automatic transcoders (in email, for example) assume incorrect input encoding, so it's possible to get a KOI-8 letter transcoded to UTF-8 as if it was written in Win1251. Fun, fun, fun!
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 11:56 UTC (Thu) by intgr (subscriber, #39733) [Link]
> which interacts very badly with print() in Python 3.x. print() takes a
> string, not bytes
No surprise there; that's because print() is supposed to output human-readable text.
If you want to output binary data, apparently the "right way" is to bypass the encoding layer using sys.stdout.buffer.write()
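For example (Python 3):

    import sys
    sys.stdout.buffer.write(b'\xff\xfe raw bytes, below the text layer\n')
    sys.stdout.buffer.flush()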
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 20:56 UTC (Thu) by kunitz (subscriber, #3965) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 13:19 UTC (Fri) by cortana (subscriber, #24596) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 15:29 UTC (Thu) by marduk (subscriber, #3831) [Link]
Recently, I was converting a small script from Python2 to Python3... I had several issues:
* If you use os.listdir('.'), you get back a list of unicode filenames.. so all filenames are automagically decoded from ascii or utf-8 or whatever your system thinks they should be. If filenames are not in any particular encoding, which *nix allows, then you get encoding errors. The solution appears to be to use os.listdir(b'.')
* If you use the re module, and you are using a regex on the afore-mentioned filenames, the regex must be the same type as the string-or-bytes object. E.g. if listdir() is returning bytes then your regex needs to be b'something' but if listdir is returning string then just use 'something'
* Same with other "string" functions (e.g. *.endswith(b'.0001') vs *.endswith('.0001'))
* low-level functions (e.g. fdopen) just don't behave as they did in Python2... before they always returned "byte" strings... now, you have to pass flags or guess what they're going to return. If you're doing an fdopen on stdin, which is an already-opened file, then you have to re-create stdin if you want bytes instead of unicode. I still have a program I'd like to port to Python3, but it uses fdopen and, although it runs without errors, it behaves completely differently, and I have yet to determine if it is a bug in my conversion or a bug in Python3.
In a way, it's really too bad bytes look/act like "strings" and vice-versa. Sometimes I wish they were completely incompatible, e.g. os.listdir() always took bytes as an argument and always returned a list of bytes. And I wish any operation external to Python itself (e.g filesystem functions, network functions, etc.) always worked with bytes and not unicode(strings). That would clear the confusion as to what should be passed and what to expect to be returned. Let the programmer deal with explicitly decoding/encoding external data.
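A small sketch of that all-bytes style under current Python 3, reusing the suffix from the example above:

    import os
    import re

    pattern = re.compile(br'\.0001$')    # bytes pattern to match bytes names
    for name in os.listdir(b'.'):        # bytes argument -> undecoded bytes results
        if pattern.search(name):
            print(name)                  # e.g. b'run.0001'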
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 20:53 UTC (Thu) by cmccabe (guest, #60281) [Link]
If you do choose to accept them, you are playing with fire. It's not even safe to print out an arbitrary non-UTF8 filename to stdout or stderr. It may contain control characters that tell the terminal to do something malicious.
Windows and MacOS don't allow non-unicode filenames. Only Linux and the other *NIXes still do. It's really something that ought to be fixed at some point because of all the security vulnerabilities it creates, and the non-existent use cases for binary blobs as filenames.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 12:14 UTC (Fri) by mgedmin (subscriber, #34497) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 19:22 UTC (Fri) by cmccabe (guest, #60281) [Link]
A quick Google search reveals that Python 3.x doesn't protect you from control characters in utf8 file names, either. Sigh... computer security is such a joke sometimes.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 19:33 UTC (Fri) by marduk (subscriber, #3831) [Link]
In the same manner, we don't want our programming languages "protecting" us from using names like "Robert'); DROP TABLE Students; --" ;-)
Thoughts from LWN's UTF8 conversion
Posted Feb 4, 2012 21:34 UTC (Sat) by cmccabe (guest, #60281) [Link]
You can argue that it was a bad decision to have control characters in the first place, but that ship has already sailed. Now it's just important to separate code and data. Filenames are the latter.
Thoughts from LWN's UTF8 conversion
Posted Feb 5, 2012 13:14 UTC (Sun) by marduk (subscriber, #3831) [Link]
When you have a non-technically oriented client with a system that has produced thousands of files you need to process, and those files have worked fine for the client for many many years, and everyone else the client has shared those files with doesn't appear to have problems with them, it is a tough sell to tell the client that they need to rename all those thousands of files because you can't open them with those names. You'll find that they'll quickly find someone else to do the job.
The other issues I mentioned have nothing to do with filenames. I've developed a screen-scraping program that uses forkpty() and fdopen(). The Python2 version is "intuitive" and easy to follow. The Python3 version, which still doesn't behave exactly like the 2 version, is much worse. The 2to3 script did nothing for it. One must add/change a lot of stuff manually, and when one takes a step back and actually looks at the code, it is obvious that one is doing more work "fighting" the programming language than actually getting stuff done. There is one part where there is a regex that I still haven't gotten to work with Python3. Prepending "b" to the regex doesn't just work for it. It is processing a regex on a byte stream, oh and this byte stream *does* have control characters ;-). I did do some searching on Python3 and bytes and regexes, but nothing came back that was very helpful. Aside from a few remarks that one *should not* be using regular expressions on bytes... but this *is* a screen-scraping program after all. I eventually gave up, thinking that Python2 will be around for a while, and if I need to abandon Python2 I'll probably just rewrite it in another programming language.
Don't get me wrong. I like Python3, which is one reason that I want to convert all my Python2 programs to 3 if/when possible. But the whole bytes vs. unicode string thing: GvR's main rationale was that in Python2 people were confusing byte streams with "text". One of the goals of Python3 was to "fix" that problem. My argument is that it merely exchanges one set of problems for another. One way that might have made it less confusing is to only allow some functions to accept/return byte strings or unicode strings. If os.listdir(), e.g., only accepted unicode and only returned unicode, and there were another function, say os.blistdir(), that only accepted byte strings and only returned bytestrings, that *may* help. Also, low-level os functions like fdopen should never, in my opinion, "automagically" encode/decode data passed to them or returned from them. They should work with bytes only. If you are using a low-level OS function, the training wheels should be taken off.
But my opinion is that if you are doing something outside the language (reading/listing files/pipes, network operations, any kind of work on "external" data), then the language should err on the side of caution, saying "this is some external data, I'm just going to return a bytestring" and let the programmer manually deal with if/how the data should be decoded on the way in or encoded on the way out.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 12:36 UTC (Fri) by marduk (subscriber, #3831) [Link]
And there are still environments where UTF-8 is far from universal, e.g. some countries in Asia especially when using older versions of Windows or MacOS. So even if a filename is encoded, one must not assume it is always going to be UTF-8. It's a nice dream, but I live in reality.
I can't for the life of me see any security vulnerabilities created by linux filenames, at least none that don't also exist for UTF-8. There are, however, safety issues when one assumes the character encoding of a byte stream (which is basically what a Unix filename is).
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 13:56 UTC (Fri) by cortana (subscriber, #24596) [Link]
The Windows case is worse. The Windows API does not, and will never, support UTF-8-encoded strings in any of its functions, save for MultiByteToWideChar/WideCharToMultiByte. If you want to use any Windows API facilities that use strings, you either call the native "Unicode" API, which always uses sequences of wchar_t (16 bits wide) representing little-endian UTF-16 code units, or you use the legacy "ANSI" API, which uses sequences of char representing code units in the encoding of the "System Locale" (which may be a single-byte encoding such as Windows-1252, or a multibyte encoding such as Shift-JIS, but is never UTF-8).
This restriction is annoying, but you can see where it came from - Windows NT made the switch to UCS-2 internally in the days before Unicode expanded past 16 bits, therefore it was convenient to use wchar_t, and later extend by adopting UTF-16. The problem is that too much software is still written to only use the legacy ANSI APIs, and incorrectly assumes that char* is an acceptable external representation for filenames.
I've run into this a lot while writing software that runs on Windows; I want to use a library that does something with the filesystem, but that library makes the above assumptions, and hence will only open files via a function like foo_open(char* filename). These are reasonable assumptions, since such libraries usually also operate on UNIX and Mac OS X systems where all filesystem paths use char*, and the Windows ports will probably only have been tested on US-English Windows installations. The assumptions could even be said to have been inherited from the C and C++ language standards, despite efforts to the contrary.
I'm coming round to the opinion that libraries should not use filenames at all, but have a typedef that resolves to int on POSIX, HANDLE on Windows, and something else on Mac OS X (FSRef? NSURL/NSString? int?).
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 15:59 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 16:18 UTC (Fri) by cortana (subscriber, #24596) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 16:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
(oh, and C++ iostreams are pile of $SWEAR_WORD)
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 16:40 UTC (Fri) by cortana (subscriber, #24596) [Link]
iostreams are only an example from the standard library. The C standard fopen function is another--it takes char* and not wchar_t*, hence if you use it then you're screwed. Working around this by, say, calling _wfopen if _WIN32 is defined only gets you so far--as soon as you hit a library that has a foo_open function taking char* and not wchar_t*, you hit the same problem.
Summary: if you write a library that deals with files, you are not allowed to take filesystem paths as arguments to your functions unless you've ported the library to several different platforms, and made sure it can deal with Chinese and Runic filenames at the same time. :)
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 16:50 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Yeah, so I mostly use libraries that can accept file descriptors or FILE* instead of file names. That actually covers quite a lot of functionality.
>Summary: if you write a library that deals with files, you are not allowed to take filesystem paths as arguments to your functions unless you've ported the library to several different platforms, and made sure it can deal with Chinese and Runic filenames at the same time. :)
Well, no arguments here. I'd also add working with filenames encoded in an encoding that is 8-bit and not the same as the system encoding.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 17:19 UTC (Fri) by cortana (subscriber, #24596) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 19:32 UTC (Fri) by cmccabe (guest, #60281) [Link]
Of course, this approach basically forces you to bundle all your libraries with your application. But I was under the impression that this was standard operating procedure on Windows anyway, because some evil guy could overwrite your shared copy of the shared library with an older version, etc.
Does that make sense at all? I've never developed on Windows, so it might be nonsense.
Thoughts from LWN's UTF8 conversion
Posted Feb 4, 2012 0:15 UTC (Sat) by cortana (subscriber, #24596) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 17:02 UTC (Mon) by jwakely (subscriber, #60262) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 4, 2012 7:09 UTC (Sat) by foom (subscriber, #14868) [Link]
But it's not actually incorrect. On the windows implementation of foo_open, the char* should be passed in as a utf8-encoded string, and decoded to utf-16 on the way to the windows wide-char API (e.g. _wfopen or whatever you want to use).
There's really no reason not to do that...
Thoughts from LWN's UTF8 conversion
Posted Feb 10, 2012 19:34 UTC (Fri) by jrw (subscriber, #69959) [Link]
> ...
> I can't for the life of me see any security vulnerabilities created by linux filenames
See David Wheeler's Fixing Unix/Linux/POSIX Filenames: www.dwheeler.com/essays/fixing-unix-linux-filenames.html
Thoughts from LWN's UTF8 conversion
Posted Feb 5, 2012 21:16 UTC (Sun) by k8to (subscriber, #15413) [Link]
Maybe the sane libraries don't?
The "user should fix his problem" is kind of broken when having a nonutf8 filename is not necessarily a problem.
For a looong time, portable software has had to deal with the fact that the encoding of byte strings from filesystem calls has all kinds of exceptions. For the case that you have to render a filename as text (rare), handle conversion failures and support specifiers, such as LANG, LC_CTYPE, and -- if needed for your application domain -- explicit support for declaring name encodings on a file or pattern basis.
In case anyone is bored...
Posted Feb 2, 2012 15:31 UTC (Thu) by tstover (guest, #56283) [Link]
In case anyone is bored...
Posted Feb 3, 2012 2:53 UTC (Fri) by acolin (subscriber, #61859) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 15:32 UTC (Thu) by JEFFREY (guest, #79095) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 17:06 UTC (Thu) by skvidal (guest, #3094) [Link]
▄██████████████▄▐█▄▄▄▄█▌
██████▌▄▌▄▐▐▌███▌▀▀██▀▀
████▄█▌▄▌▄▐▐▌▀███▄▄█▌
▄▄▄▄▄██████████████
6-bit characters
Posted Feb 2, 2012 17:10 UTC (Thu) by alfille (subscriber, #1631) [Link]
6-bit characters
Posted Feb 2, 2012 17:11 UTC (Thu) by corbet (editor, #1) [Link]
A CDC is exactly what it was, though there was no PLATO involved (that came later).
Thoughts from LWN's UTF8 conversion
Posted Feb 2, 2012 21:18 UTC (Thu) by iabervon (subscriber, #722) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 1:57 UTC (Fri) by ras (subscriber, #33059) [Link]
+1.
Python3 consists of a lot of welcome cleanups, a couple of new features you can choose to use if you wish, and one bad design mistake everyone is forced to confront: using a bastardisation of UCS2 to represent Unicode.
Life would have been much simpler if they had just abandoned Python2 Unicode strings entirely and reverted to the Python1.5 situation where there was one string type and the programmer manually handled the encoding, if necessary. The point being it often isn't necessary - most of the time both ends are using compatible encodings, and the parts in the text you do care about, such as HTML encoding, are ASCII. Since ASCII's code points are identical in all popular encodings, normal string manipulation and regexes just work regardless of encoding.
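A quick sketch of that point - HTML-escaping UTF-8 text as plain bytes, never decoding it (Python 2):

    raw = '<b>caf\xc3\xa9 & friends</b>'   # UTF-8 bytes, never decoded
    escaped = raw.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
    print escaped   # &lt;b&gt;café &amp; friends&lt;/b&gt; on a UTF-8 terminal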
None of my brushes with Python2's Unicode have been pleasant. Conversion between Unicode encodings is fragile, to say the least, but often you could avoid it. Python3 forcing you to do a conversion to and from UCS2 means that fragility has infected my python programs, making the situation worse than Python2. For that one reason I'll be sticking to Python2.x for as long as possible.
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 12:06 UTC (Fri) by dag- (guest, #30207) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 3, 2012 12:20 UTC (Fri) by mgedmin (subscriber, #34497) [Link]
Python 3.3 will use a flexible internal representation, storing each string as one of:
* 8-bit ASCII string
* 16-bit UCS2 string
* 32-bit UTF-32 string
depending on the characters contained within. So you're not limited to the Basic Multilingual Plane, or have to deal with UTF-16 surrogates any more.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 12:27 UTC (Mon) by alankila (guest, #47141) [Link]
People have tried this way of treating text as binary as far as possible, and it just doesn't seem to work.
Thoughts from LWN's UTF8 conversion
Posted Feb 7, 2012 9:32 UTC (Tue) by jezuch (subscriber, #52988) [Link]
And between InputStream/OutputStream and Reader/Writer (the former are for byte streams, the latter are for character streams). The conversion between bytes and text needs to be explicit. As another person who lives in a non-ASCII locale I definitely have to say that any language that conflates text and bytes causes brain damage, sorry ;)
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 6:09 UTC (Wed) by ras (subscriber, #33059) [Link]
That aside, most of the complaints here would be addressed if bytes() was just the Python2 str() renamed. The rest would be resolved by replacing Python's UCS2 with something that is a standard and can actually represent all of Unicode - either UCS4 or UTF8. UTF8 has the nice advantage that it could be just a subclass of bytes() that was guaranteed to contain valid utf8. All the string methods could be inherited from bytes() and should just work. Things that care whether they are passed a human readable string could insist on getting utf8.
That would have been the right solution for Python 3.0. It would have meant the bulk of the Python2 code remained compatible, while actually simplifying the str/unicode mess Python2 created into a nice class hierarchy. As it is, each new revision of Python 3 seems to be trying to solve the complex programming interface created by UCS2 with new layers of complexity, and the comment from @mgedmin seems to imply that trend is continuing with Python 3.3. Attempting to simplify things by adding more complexity underneath rarely works as a design strategy.
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 19:04 UTC (Wed) by raven667 (guest, #5198) [Link]
Conversion issues aren't new to Unicode.
Posted Feb 3, 2012 16:48 UTC (Fri) by dwmw2 (subscriber, #2063) [Link]
"So programs dealing in Unicode text must know how outside-world strings are represented and convert those strings to the internal format before operating on them."That isn't a new feature with Unicode. That was necessary with legacy character sets too. Strings always had to have an associated label which indicated the character set in use. You couldn't just accept a byte-stream in ISO8859-1 and pass it to someone else who expects ISO8859-8 (or EBCDIC!), and expect sane things to happen.
The real problem is that labelling was hard. Strings would often lose their labels in transit, so you end up having to guess what encoding is in use. See my comments on HTML5 for a discussion of how well that works out.
And even if you do manage to preserve the labelling, converting to your internal format was hard because you couldn't represent all of the characters from one 8-bit character set in another one. So any conversion would be lossy.
Using UTF-8 largely solves the above issues. By using UTF-8 internally you can represent anything you receive. And it makes the labelling problem a whole lot easier — you just make sure everything is converted to UTF-8 as you receive it, and forget the onerous task of carrying labels around with each string. You know that everything, everywhere within your system, is all UTF-8.
As more and more people move to UTF-8, it even makes the labelling problem easier when someone feeds you text that's unlabelled. It's more and more valid to just assume that it's UTF-8. And if it is some legacy 8-bit nonsense instead, that's usually something you can detect because it'll be an invalid UTF-8 byte-stream. So you can validate your assumption too.
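The validation is cheap; a minimal sketch (the Latin-1 fallback here is just an assumption about what the legacy data might be):

    def to_text(raw):
        try:
            return raw.decode('utf-8')        # strict: rejects most legacy 8-bit data
        except UnicodeDecodeError:
            return raw.decode('iso8859-1')    # assumed fallback; never fails

    print(to_text(b'caf\xc3\xa9'))   # valid UTF-8 -> 'café'
    print(to_text(b'caf\xe9'))       # not UTF-8 -> treated as Latin-1, also 'café'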
In those respects, UTF-8 makes the whole thing easier, not harder.
The issue that UTF-8 does introduce, however, is that characters now take a variable number of bytes. A certain amount of work had to be done, in order to cope with that, but it should mostly be a solved issue these days.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 7:05 UTC (Mon) by C.Gherardi (subscriber, #4233) [Link]
Utilities like grep still fail to work on utf-16 files, making searching a directory of mixed utf-8 and utf-16 files a grep pain.
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 9:47 UTC (Mon) by mpr22 (subscriber, #60784) [Link]
UTF-16 isn't a multibyte encoding. It's a wide (and multi-word) encoding. It's also not cheaply-and-reliably distinguishable from a headerless binary blob. (And I feel compelled to add that my first reaction was "why on Earth would you have such a directory in the first place?")
Thoughts from LWN's UTF8 conversion
Posted Feb 7, 2012 5:53 UTC (Tue) by C.Gherardi (subscriber, #4233) [Link]
I contribute to a program that parses text files produced by commercial poker programs. One of these clients started several years ago using ascii, then cp-1252, had a brief interlude with utf8, then in their infinite wisdom switched to utf16. The files are all text (imho), and long-term players have all of these files mixed in the same directory.
It would be nice to be able to grep this directory for specific lines without the gymnastics required with iconv at the moment.
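A rough sketch of the sort of wrapper that avoids the iconv gymnastics, assuming the utf-16 files carry the BOM that Windows tools usually write (Python):

    import io
    import re
    import sys

    pattern = re.compile(sys.argv[1])
    for path in sys.argv[2:]:
        text = None
        for enc in ('utf-8', 'utf-16'):       # strict UTF-8 first, then UTF-16
            try:
                with io.open(path, encoding=enc) as f:
                    text = f.read()
                break
            except UnicodeError:
                continue
        if text is None:
            continue                          # neither encoding fits; skip the file
        for number, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                print('%s:%d:%s' % (path, number, line))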
</derail>
Thoughts from LWN's UTF8 conversion
Posted Feb 8, 2012 2:37 UTC (Wed) by cmccabe (guest, #60281) [Link]
I wrote a similar script for pdfs that allows you to effectively grep a PDF from the command line.
http://www.club.cc.cmu.edu/~cmccabe/cgi-bin/gitweb.cgi?p=...
Thoughts from LWN's UTF8 conversion
Posted Feb 6, 2012 15:08 UTC (Mon) by dvdeug (subscriber, #10998) [Link]
Thoughts from LWN's UTF8 conversion
Posted Feb 13, 2012 13:32 UTC (Mon) by ekj (guest, #1524) [Link]
Might want to look at that, it's annoying when you cannot find people who have names with non-ascii characters in them.
