
Thoughts from LWN's UTF8 conversion


By Jonathan Corbet
February 1, 2012
There are a lot of things that one does not learn in engineering school. In your editor's case, anything related to character encodings has to be put onto that list. That despite the fact that your editor's first programs were written on a system with a six-bit character size; a special "shift out" mechanism was needed to represent some of the more obscure characters - like lower case letters. Text was not portable to machines with any other architecture, but the absence of a network meant that one rarely ran into such problems. And when one did, that was what EBCDIC conversion utilities were for.

Later machines, of course, standardized on eight-bit bytes and the ASCII character set. Having a standard meant that nobody had to worry about character set issues anymore; the fact that it was ill-suited for use outside of the United States didn't seem to matter. Even as computers spread worldwide, usage of ASCII stuck around for a long time. Thus, your editor has a ready-made excuse for not thinking much about character sets when he set out to write the "new LWN site code" in 2002. Additionally, the programming languages and web platforms available at the time did not exactly encourage generality in this area. Anything that wasn't ASCII by then was Latin-1 - for anybody with a sufficiently limited world view.

Getting past the Latin-1 limitation took a long time and a lot of work, but that seems to be accomplished and stable at this point. In the process, your editor observed a couple of things that were not immediately obvious to him. Perhaps those observations will prove useful to anybody else who has had a similarly sheltered upbringing.

Now, too, we have a standard for character representation; it is called "Unicode." In theory, all one needs to do is to work in Unicode, and all of those unpleasant character set problems will go away. Which is a nice idea, but there's a little detail that is easy to skip over: Unicode is not actually a standard for the representation of characters. It is, instead, a mapping between integer character numbers ("code points") and the characters themselves. Nobody deals directly with Unicode; they always work with some specific representation of the Unicode code points.
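
A couple of lines of Python - the language LWN's code happens to be written in - make the distinction concrete (the snowman is just an arbitrary example character):

    snowman = u'\u2603'        # U+2603 SNOWMAN: a single code point
    print(ord(snowman))        # 9731 - the integer behind the character
    # Nothing above says how that integer will become bytes; that is the
    # job of an encoding, which is chosen separately.

A Unicode string is, in effect, a sequence of those integers; it carries no notion of bytes at all.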

Suitably enlightened programming languages may well have a specific type for dealing with Unicode strings. How the language represents those strings is variable; many use an integer type large enough to hold any code point value, but there are exceptions. The abortive PHP6 attempt used a variable-width encoding based on 16-bit values, for example. With luck, the programmer need not actually know how Unicode is handled internally by a given language; it should Just Work.

But the use of a language-specific internal representation implies that any string obtained from the world outside a given program is not going to be represented in the same way. Of course, there are standards for string representations too - quite a few standards. The encoding used by LWN now - UTF8 - is a good choice for representing a wide range of code points while being efficient in LWN's still mostly-ASCII world. There are many other choices but, importantly, they are all encodings; they are not "Unicode."
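
An interactive Python session shows the difference quickly; the same four code points serialize to entirely different byte strings depending on the encoding asked for:

    text = u'caf\u00e9'                    # "café": four code points
    print(repr(text.encode('utf-8')))      # 5 bytes: 'caf\xc3\xa9'
    print(repr(text.encode('latin-1')))    # 4 bytes: 'caf\xe9'
    print(repr(text.encode('utf-16-le')))  # 8 bytes: 'c\x00a\x00f\x00\xe9\x00'
    # Three valid serializations of the same text; none of them "is" Unicode.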

So programs dealing in Unicode text must know how outside-world strings are represented and convert those strings to the internal format before operating on them. Any program which does anything more complicated to text than copying it cannot safely do so if it does not fully understand how that text is represented; any general solution almost certainly involves decoding external text to a canonical internal form first.

This is an interesting evolution of the computing environment. Unix-like systems are supposed to be oriented around plain text whenever possible; everything should be human-readable. We still have the human-readable part - better than before for those humans whose languages are not well served by ASCII - but there is no such thing as "plain text" anymore. There is only text in a specific encoding. In a very real sense, text has become a sort of binary blob that must be decoded into something the program understands before it can be operated on, then re-encoded before going back out into the world. A lot of Unicode-related misery comes from a failure to understand (and act on) that fundamental point.

LWN's site code is written in Python 2. Version 2.x of the language is entirely able to handle Unicode, especially for relatively large values of x. To that end, it has a unicode string type, but this type is clearly a retrofit. It is not used by default when dealing with strings; even literal strings must be marked explicitly as Unicode, or they are just plain strings.

When Unicode was added to Python 2, the developers tried very hard to make it Just Work. Any sort of mixture between Unicode and "plain strings" involves an automatic promotion of those strings to Unicode. It is a nice idea, in that it allows the programmer to avoid thinking about whether a given string is Unicode or "just a string." But if the programmer does not know what is in a string - including its encoding - nobody does. The resulting confusion can lead to corrupted text or Python exceptions; as Guido van Rossum put it in the introduction to Python 3, "This value-specific behavior has caused numerous sad faces over the years." Your editor's experience, involving a few sad faces for sure, agrees with this; trying to make strings "just work" leads to code containing booby traps that may not spring until some truly inopportune time far in the future.
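
A contrived couple of lines of Python 2 (not taken from LWN's code) shows the shape of the trap:

    greeting = u'Hello, '
    name = 'Jonathan'           # a plain (byte) string that happens to be ASCII
    print(greeting + name)      # works: the bytes decode cleanly as ASCII
    name = 'Jos\xc3\xa9'        # the UTF-8 bytes for "José"
    print(greeting + name)      # UnicodeDecodeError: the implicit promotion
                                # assumes ASCII, and these bytes are not ASCII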

That is why Python 3 changed the rules. There are no "strings" anymore in the language; instead, one works with either Unicode text or binary bytes. As a general rule, data coming into a program from a file, socket, or other source is binary bytes; if the program needs to operate on that data as text, it must explicitly decode it into Unicode. This requirement is, frankly, a pain; there is a lot of explicit encoding and decoding to be done that didn't have to happen in a Python 2 program. But experience says that it is the only rational way; otherwise the program (and programmer) never really know what is in a given string.
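
In Python 3 terms the resulting pattern looks something like the following sketch; the file names and the choice of UTF-8 are placeholders, not anything specific to LWN:

    with open('comment.txt', 'rb') as f:     # 'rb' yields bytes, not text
        raw = f.read()
    text = raw.decode('utf-8')               # the encoding must be named explicitly
    text = text.replace('\u2014', ' - ')     # now we are operating on characters
    with open('comment.out', 'wb') as f:
        f.write(text.encode('utf-8'))        # and back to bytes on the way out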

In summary: Unicode is not UTF8 (or any other encoding), and encoded text is essentially binary data. Once those little details get into a programmer's mind (quite a lengthy process, in your editor's case), most of the difficulties involved in dealing with Unicode go away. Much of the above is certainly obvious to anybody who has dealt with multiple character encodings for any period of time. But it is a bit of a foreign mind set to developers who have spent their time in specialized environments or with languages that don't recognize Unicode - kernel developers, for example. In the end, writing programs that are able to function in a multiple-encoding world is not hard; it's just one more thing to think about.



Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 3:12 UTC (Thu) by allesfresser (guest, #216) [Link]

I for one heartily thank you for taking the time to do the conversion to UTF8.

LWN as open source

Posted Feb 2, 2012 4:00 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

Just a gentle reminder to the editor that he could have released the site code as open source and gotten community help in the process. I hope that happens at some point in the near future.

LWN as open source

Posted Feb 2, 2012 4:04 UTC (Thu) by corbet (editor, #1) [Link]

That has not slipped our mind. The UTF8 change - and a whole lot of invisible work that was needed to get to where we could do that change - was part of the process. It must remain a lower priority than, say, writing a pile of articles every week, but, as incredible as it must seem, it is something we want to do.

LWN as open source

Posted Feb 2, 2012 9:56 UTC (Thu) by Da_Blitz (subscriber, #50583) [Link]

I actually wouldn't mind seeing some more articles about how the site runs; I have been enjoying posts like this and another where you found that PyPy sped up some data processing (for the site, I assume) for you.

Is there any possibility of seeing more articles like this?

LWN as open source

Posted Feb 3, 2012 16:53 UTC (Fri) by union (guest, #36393) [Link]

+1 Agree.

LWN as open source

Posted Feb 2, 2012 14:39 UTC (Thu) by fuhchee (guest, #40059) [Link]

"It must remain a lower priority ..."

Have you considered simply throwing the code over the fence into github? That'd be just a few minutes.

LWN as open source

Posted Feb 5, 2012 20:35 UTC (Sun) by musicon (guest, #4739) [Link]

The problem with that approach is you never know if there are bugs in the code that could lead to an exploit of your business or your customers.

Like LWN, I have written a fairly large web application that I use to run my business. I'm intending (eventually) to release it as open source, but I'm panicked there are gaping holes that would cause me to lose my livelihood.

Additionally, due to the simple fact that the code has grown as the business has expanded, there are still hard-coded sections left over from when the business was much smaller that aren't suitable for release into the 'wild'.

LWN as open source

Posted Feb 6, 2012 3:57 UTC (Mon) by raven667 (guest, #5198) [Link]

While source code is nice, security issues are often found in web apps with no code available and there are plenty of tools for doing automated security testing making this sometimes easier than a code audit. So I'm not sure that as a practical matter the release of code drives the discovery of security bugs.

LWN as open source

Posted Feb 2, 2012 15:34 UTC (Thu) by nhasan (guest, #1699) [Link]

I do find it ironic that the site which makes so much noise about free software/open source is itself based on proprietary code. I have been a regular reader for 10+ years and I have seen editors take others to task for not releasing their code.

LWN as open source

Posted Feb 2, 2012 15:44 UTC (Thu) by jake (editor, #205) [Link]

> I do find it ironic that the site which makes so much noise about free
> software/open source is itself based on proprietary code.

You know, we hear this periodically, but it is a) incorrect and b) not really helping get the code out there. It's not like we are out there selling the code to others in some closed-source form. It is the code that we run our business on, and I daresay that there are few companies in the world that don't have some code they use internally and don't release; this is no different.

Except that it is different because we do want to get it out there, and will eventually. The sad thing is that it will likely be the biggest non-event there is because the code is tied tightly to what we do here, not generalized for other uses, and there are much more plausible solutions than a 10-ish year old code base that is targeted at one particular use-case.

We certainly realize that we have been making the promise for a long time and prodding us about it is certainly fair game. Calling the code 'proprietary' is not.

ymmv,

jake

LWN as open source

Posted Feb 2, 2012 17:45 UTC (Thu) by jeremiah (guest, #1221) [Link]

I think a lot of code in the world stays internal because it is potentially embarrassing. I look at my code and say "That'd be nice to throw over the fence," then I really look at it and think, "My god, what will people think about the stupid way I did that when there must be 10 billion libraries out there already doing it better, more efficiently, and more elegantly." Do people really want to see my hare-brained attempts to re-invent the wheel because I didn't do enough research to know that Pirelli was already out there?

LWN as open source

Posted Feb 3, 2012 15:56 UTC (Fri) by felixfix (subscriber, #242) [Link]

I think there's also a hassle factor: while others might find the code useful, most would want to change it, fix bugs or add features, and try to send them back, which raises the pain factor considerably. While you might be appreciative of most bug fixes and a few new features, most of the contributions would not be useful, and how do you tell people that? Perhaps more importantly, how do you deal with the contributions in the first place? You can't just take them blindly. You have to look at each one, see what it does, possibly revise it to match changes you have made on your own, and all that is a diversion from your real job.

The alternative is to dump it periodically and not take any contributions, and no one wants to do that because they do want some kind of feedback.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 4:24 UTC (Thu) by Tara_Li (guest, #26706) [Link]

As with many things done by committee, I have to say I think Unicode was far over-complicated. I seem to remember that one of the earliest drafts was a 16-bit character set, and the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit. So, in the interest of going bat-**** insane, they went up to a 32-bit code space, and started including control characters to allow the drawing of these syllables, as well as to control left-to-right and right-to-left printing (Do they allow for columnar printing? I haven't seen whining and moaning about *that*, even if it would be nice for encoding manga in Japanese.) And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols, Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)

And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 5:56 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit.

It's a nitpicky point, but syllabaries are writing systems that represent each syllable as a single glyph and aren't the big problem. For example, Japanese Hiragana has only about 100 symbols, as does Cherokee (a non-oriental syllabary). Syllabic writing is mostly used with languages that have fairly simple rules for constructing syllables, so the number of glyphs required is still reasonable. The big problem comes with ideographic writing, where glyphs map to concepts rather than sounds. Those are the cases like Chinese writing or Kanji that can have thousands of glyphs.

And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols, Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)

Honestly, what's the problem with offering those character sets? If you have the space for them in your code table - and you ought to if you have 32 bits to work with - I don't see the harm in including a few oddball symbols, characters for made-up languages, and the like. Consider the alternative of having a committee somewhere that judges which languages deserve to have their characters included in Unicode based on whether they're sufficiently serious. That opens the process up to all kinds of political pressure because questions of language are inevitably going to involve culture, ethnicity, and rights of oppressed minorities to use their own languages. I'd rather just leave the process open to anyone who is willing to go to the trouble of putting through a properly formatted application to the appropriate standards body.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 6:43 UTC (Thu) by AnthonyJBentley (guest, #71227) [Link]

As with many things done by committee, I have to say I think Unicode was far over-complicated. I seem to remember that one of the earliest drafts was a 16-bit character set, and the only problem with it was that if you started including Chinese and other oriental syllabaries, it would not fit. So, in the interest of going bat-**** insane, they went up to a 32-bit code space

Do you think Unicode shouldn’t have increased the character space, and as a result been useless for Asian languages? (And technically it’s 21-bit, but that’s not important.)

And along the way it's picked up a lot of stupid cruft - dingbats and miscellaneous symbols

Yes, because these exist in other character encodings. One of the design goals of Unicode is that any existing encoding can be losslessly converted to Unicode.

Tolkien's elvish script (seriously - there's something to be said for geeking out, but really...) and the runes from the Ultima series of games (Ok, you know the difference between Trekkers and Trekkies? People, you're looking at Trekkie territory here!)

No, Tolkien script and Ultima runes are not in Unicode proper. Unicode happens to provide an undefined character space for private use, and some people happened to start using that area for those characters. As far as I know, all attempts to get Tolkien scripts added to Unicode have been rejected by the Unicode Consortium.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 7:34 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

As someone who has worked on typesetting history books about old languages, I really appreciate the amount of work that goes into Unicode.

Besides, natural languages are tricky in any case. The number of glyphs is not really a problem (it was clear from the start that 16 bits were not enough). Languages with complex scripts are much trickier.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 8:07 UTC (Thu) by keeperofdakeys (guest, #82635) [Link]

And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.

The real distinction is that 'plain text' has one byte per character and code points map directly to bytes; there is no encoding or decoding done. Unicode added a layer of indirection, so that you have different possible encodings. It may all be bytes in the end, but it is the interpretation of those bytes that is important.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 18:22 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

The real distinction is that 'plain text' has one byte per character and code points map directly to bytes; there is no encoding or decoding done.
Of course there's still encoding and decoding being done. What do you think the "C"s in ASCII and EBCDIC stand for? The 1:1 relationship between bytes and code points is still an encoding, it's just simple enough that people tend to ignore it. The illusion that there's no encoding going on will disappear the moment you have to worry about different native encodings, like using EBCDIC data on an ASCII machine or vice versa.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 9:53 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> And along the way it's picked up a lot of stupid cruft - dingbats and
> miscellaneous symbols

It may seem like stupid cruft to you, but that's the only way to make documents that include dingbats (pretty much everything produced by an office suite nowadays) not depend on a specific proprietary font available on a single specific operating system.

To this day users continue to open bugs on various free software apps because they 'misrender' dingbats ✱●▶, smileys ☹☺☻, form checks ☐☒☑✓✔✕✖✗✘, and weather symbols ☀☂☃ (much loved by businesses to sum up a project state) which have been inserted in documents by software using pre-Unicode fonts (either the Windows Wingdings* fonts or the Adobe dingbats).

Indeed, DejaVu's support for a large number of Unicode dingbats and symbols did a lot more for its popularity than its support for some human scripts. And users regularly ask for new symbol support.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 14:19 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Incidentally, the "Runic" block at U+16A0-16FF is nothing to do with video game ultrafans. It's the actual runic alphabet used to write various Germanic languages during the first millennium AD, which unquestionably has legitimate scholarly interest. (Ultima's runic alphabet is a slightly mangled version of the Anglo-Saxon variant.)

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 18:37 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

Nor are those runes the only obsolete character set. A quick look shows that Unicode also contains Ogham, Cuneiform, Egyptian hieroglyphs, Linear B, and the undeciphered characters from the Phaistos Disk(!). And those are just the ones I recognize immediately as being only historical. I'm sure that some of the languages I'm not as familiar with are also strictly historical.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 19:14 UTC (Thu) by raven667 (guest, #5198) [Link]

And those are all useful to have in Unicode so that we can transcribe old texts in those character sets onto modern data storage in its native format. There is plenty of written text in those languages that could conceivably be stored on a computer system.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 21:42 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

Sure*. And I'm confident that the main reason they have Imperial Aramaic is for Biblical scholars. Looking at their list of supported scripts, I'm a bit surprised they don't have any of the Mesoamerican writing systems for the same general reason. My understanding is that Mayan, at least, was well enough developed and deciphered to be worth including.

*Actually, I'm not so sure for the Phaistos Disk. AFAIK, it's a solitary object with no known ties to any other writing system in existence. Why include it in Unicode when there's little hope of translating it without additional writing in the same script, which would probably add new characters and require an expansion of unknown size? Adding it to Unicode seems like premature optimization.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 13:06 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

The Wikipedia article on Unicode suggests that there has not yet been a formal proposal for the Mayan script because there is not yet firm agreement within the script's user community over what should go into it and how.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 20:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

A long time ago I was forced to draw a font for Old Slavonic for Metafont because the existing fonts had a few missing characters. I have really appreciated Unicode ever since.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 7:59 UTC (Fri) by mitchskin (guest, #32405) [Link]

Just as long as they don't start in on the Voynich manuscript :)

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 22:49 UTC (Thu) by robert_s (subscriber, #42402) [Link]

But the great thing about it is you don't have to worry about it. You don't have to worry about escape sequences, code pages, etc. You can just use the standardized encoding and decoding functions and treat the result as an opaque object. Most of the time, the odd unicode content you come into contact with will be from user input, and it's usually good practice to handle user input strings as opaque content anyway.

If a user decides to input bizarre characters, that's up to them. If you don't care about it or do anything about it, it will likely get spat out in exactly the same way.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 3:56 UTC (Fri) by lambda (subscriber, #40735) [Link]

> As with many things done by committee, I have to say I think Unicode was far over-complicated

This is true of anything which has to deal with the real world; the real world is messy. Yes, any standards based design will probably come out more complicated than it needs to be; but there really is a fairly complicated problem to tackle here.

It may be useful to design one (or more) subsets of Unicode tailored for particular use cases, excising all of the cruft that was added on for obscure uses or things that people just don't do any more (like many control characters which aren't actually supported by most modern rendering engines).

> I seem to remember that one of the earliest drafts was a 16-bit character set

Yes, and this was a mistake, as it ensured that the CJK community would never be willing to adopt Unicode; plus the 16 bit encoding, UCS-2, was a bad idea as it was a simplification that attempted to preserve the "one character is one fixed-width code unit" mapping, which doesn't actually work in the general case of trying to represent the world's writing systems.

> they went up to a 32-bit code space

Actually, no. It's 17 planes, each of which is a 16 bit space; or the equivalent of a bit more than 20 bits worth of space. This is due to the fact that they made the initial mistake of thinking that 16 bits would be enough, and had to add a backwards-compatible hack, UTF-16, to extend the code space while still being mostly compatible with the initial implementations of the 16 bit code space.

> (Do they allow for columnar printing? I haven't seen whining and moaning about *that*, even if it would be nice for encoding manga in Japanese.)

I think the idea is that Unicode should be sufficient for representing "plain text," and that higher level markup languages or protocols are supposed to handle rich text, page layout, and the like. Bidirectional text is considered in-scope for plain text, as it's common (these days) to mix writing systems, while columnar text is considered out of scope as it requires specialized page layout features and doesn't really mix with horizontal text the same way that bidirectional text mixes.

> dingbats and miscellaneous symbols,

Yeah, it's picked up some of those, but really, so what? Once you have the code space for it, adding a few more random symbols doesn't really hurt. It helps to be able to unify random proprietary encodings like all of the various Japanese text-message carriers' emoji, and the additional burden is not all that great.

> Tolkien's elvish script

This has never actually been added to Unicode. It was proposed, once, a long time ago, but it has languished in limbo. Klingon was flat out rejected, because the klingon community doesn't actually use the klingon script.

> the runes from the Ultima series of games

No, they added actual real-world runes that have been used in historical inscriptions.

> And of course, any character set is really just binary data until it hits the character generator, whether such generator is in hardware or part of the font software. No matter how much we talk about the semantic web, and natural language processing - it all gets reduced to ones and zeros when it hits the silicon.

Sure, it's just an encoding. But it can be nice to know that the encoded characters you send to someone else will appear as something legible to them, not some horrible garbage because their software doesn't support your character set or interprets it as a different one.

Unicode may seem complex, but remember, much of that complexity already existed, just within individual incompatible character encodings, or is an inherent complexity of trying to integrate different writing systems into one, single, universal encoding.

Thoughts from LWN's UTF8 conversion

Posted Feb 5, 2012 23:09 UTC (Sun) by cmccabe (guest, #60281) [Link]

Things would have been a lot better if the committee had just standardized on UTF-8 in the first place. I'm aware that UTF-8 wasn't created until later, but the concept behind it isn't really that complicated.

Instead we got horrible kludges like UCS-2, "BOMs," and half a dozen incompatible encodings. There really should just be one canonical way to serialize unicode. Having N different ways defeats the point of standardization. And it is kind of obvious that the encoding needs to be variable length. People aren't going to stop inventing new languages just because you ran out of bits in your 16-bit integer.

In contrast, having runes or dingbats in the syllabary doesn't really bother me.

Thoughts from LWN's UTF8 conversion

Posted Feb 5, 2012 23:26 UTC (Sun) by cmccabe (guest, #60281) [Link]

I guess I also should have mentioned the Han unification in the list of Unicode mistakes. Actually the Han unification is probably the biggest mistake of all.

Basically, the committee decided that a lot of Chinese, Japanese, and other east Asian characters were just slightly different versions of each other, and so should be represented by the same code point. (This is about as politically savvy as telling French people that French is just a corrupted form of German.) It also means that you have to reintroduce the concept of code pages, since the way the Japanese characters are drawn is different than the way the Chinese ones are drawn, etc.

So basically... yeah... huge mistake. However, I guess in practice, people have found ways to work around these issues.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 1:22 UTC (Mon) by anselm (subscriber, #2796) [Link]

This is about as politically savvy as telling French people that French is just a corrupted form of German.

The logic behind Han unification is the logic that says that a French »e« looks pretty much the same as a German »e«, most of the time (some of the French ones tend to come with various kinds of squiggles on top while the German ones mostly don't) even though they are phonetically different. Hence, the French »e« and German »e« share a Unicode code point and there are a few extra code points for versions of the French »e« with squiggles on top.

The same logic concludes that many Han characters do indeed look pretty much the same in Chinese as they do in Japanese, even though they are phonetically different. This does not come as a big surprise given that the Japanese borrowed the Han characters from the Chinese some 1,000 years ago. To a certain degree the languages have since gone their separate ways but there is still a very large overlap. (The drawing differences can presumably be addressed by picking a Chinese-style font over a Japanese-style one, just like you can use a German Fraktur font in place of a »Roman« one to get a very noticeably different style of glyph for an »e«). The fact that Han characters aren't conceptually identical to letters in Western languages complicates things somewhat but not to a degree where it would have been compellingly necessary to have a few tens of thousands of characters three times over.

In any case, when the Han unification issue came up, the people in charge of Unicode still sort-of thought they could cram everything into 16 bits, and not doing Han unification would obviously have precluded that. We're past that point now, and with hindsight we could likely have lived without the Han unification – mostly to make the Japanese happy –, but calling it »the biggest mistake« in Unicode is probably taking things a bit far.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 4:48 UTC (Mon) by cmccabe (guest, #60281) [Link]

Your example undermines the point you were trying to make. You yourself admit that "there are a few extra code points for versions of the French »e« with squiggles on top." Why do those extra code points exist, when they could just tell French people to use a different code page... er, I mean font? Because it's Unicode, not one-of-many code.

The fact is, whatever it might have been 1000 years ago, Japanese is a different language from Mandarin or Cantonese today. It ought to have its own code points rather than having to borrow someone else's. There are code points assigned for hieroglyphics and Linear B, but not for a language that people actually speak? Something went wrong here.

I also don't understand the thinking behind the 16 bit limitation. There are more than 65,535 Chinese characters in existence, so right off the bat they should have realized that 16 bits would not be enough.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 8:15 UTC (Mon) by anselm (subscriber, #2796) [Link]

The fact is, whatever it might have been 1000 years ago, Japanese is a different language from Mandarin or Cantonese today. It ought to have its own code points rather than having to borrow someone else's.

The fact is that, 1000 years of language history notwithstanding, most of the tens of thousands of Japanese characters still look substantially the same as their Chinese equivalents (stylistic, i.e., »font«, differences notwithstanding). There are some characters that Chinese has that Japanese doesn't have, and vice-versa, and these of course should have their own code points, much like »e« and »é« and »è« and »ê« have their own codepoints in Latin (»Western«) script.

However, you seem to be arguing, in effect, that »English e« and »German e« ought to have their own code points because, whatever it might have been 1000 years ago, English is a different language from German today (when 1000 years ago they were really quite similar, linguistically – much more so than at present, by virtue of the fact that the Saxon population of England had immigrated from what is now Germany a few centuries before), even though the German and English writing systems go back to the same roots, much like the Chinese and Japanese writing systems go back to the same roots as far as Han script is concerned. (This also glosses over the fact that, unlike German and English, the Chinese and Japanese languages really aren't similar at all, and hadn't been even when the Japanese took on the Han script over a millennium ago – but that is neither here nor there.)

When Unicode was first proposed, nobody was opposed to »Latin script unification« because that would have been just plain silly. With Han unification, the situation was less clear-cut because, while there are tens of thousands of Han characters (and new ones are being made up all the time), the vast majority of them only occur in names. Literate Japanese are expected to know somewhat over 2,000 kanji, not all the upwards of 50,000 ones that are in the kanji dictionary. The original 16-bit Unicode of 1991 reserved about 20,000 code points for Han characters, and that, given Han unification, would have been more than enough for most applications. We need more space to cover all the obscure characters, and it makes sense to do so considering that Unicode is supposed to be comprehensive, but that doesn't automatically mean all the obscure Han characters should be there three times instead of once.

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 6:49 UTC (Wed) by cmccabe (guest, #60281) [Link]

The western European languages already use a unified alphabet, modulo a few accent symbols here and there. However, China, Korea, and Japan do not use a unified alphabet. It's not our responsibility to "fix" their culture so that they do. We just need to create a system that the people who live in those countries can actually use without unpleasantness.

Who cares about the overhead of the so-called Han characters being repeated 3x? Honestly, could you locate anyone who would care? I have 6 terabytes of hard disk space. I doubt even 1% of that is text. And most of the text would be ASCII even if I lived in another country (that is the reality of text configuration files, etc.)

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 8:21 UTC (Wed) by anselm (subscriber, #2796) [Link]

I just spent an entirely unwarranted amount of time looking at web pages on Han unification. One would expect that if the current situation was so horrible then people (especially from China, Japan, and Korea) would visibly argue for a different setup where all languages using Han characters had their own code point ranges. It's not as if we didn't have the space in UCS. This is apparently not the case – instead, the relevant committees within ISO are trying to improve the actual Han unification.

It may be true that it is difficult to find people who mind the overhead of having every Han character in the UCS table three times. However, it seems to be just as difficult to find people who actually care enough to want this done.

As a former student of Japanese and a person interested in language in general, I do disagree with your notion that »China, Korea and Japan do not use a unified alphabet« but that sort of discussion is probably not germane to LWN.

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 19:36 UTC (Wed) by cmccabe (guest, #60281) [Link]

Well, in Japan, there are still a lot of other character sets floating around. People have even invented some new ones like GCS, TRON, and UTF-2000. That's the practical effect of these problems.

More info at
http://www.ibm.com/developerworks/java/library/u-secret.html

Hindsight is always 20/20, and no doubt some of the criticisms are unfair. But the criticism of Han unification seems fair to me. Anyway, I don't use any of these languages so I can just pretend that things are perfect in standards-land.

Except for that delete/backspace confusion. I still hate that.

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 10:01 UTC (Wed) by etienne (guest, #25256) [Link]

> Honestly, could you locate anyone who would care?

Using different codepoints for (nearly) the same symbol generates problems for the comparison of words and sentences.
Look at the web "attack" that replaces the "a" of well-known http addresses with a near-identical symbol...
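
A quick Python illustration (the spoofed name is made up):

    # the 'а's below are U+0430 CYRILLIC SMALL LETTER A, not U+0061 LATIN SMALL LETTER A
    real  = u'paypal.com'
    spoof = u'p\u0430yp\u0430l.com'
    print(real == spoof)    # False, even though the two render almost identically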

Thoughts from LWN's UTF8 conversion

Posted Feb 14, 2012 21:49 UTC (Tue) by cmccabe (guest, #60281) [Link]

There are already TONS of glyphs in unicode that look almost alike visually. You are trying to close the barn door, but to say the horse has bolted would be an understatement. It would be more appropriate to say that the horse ran away 15 years ago, and relocated to another continent.

There are a few strategies to deal with the visual equivalence problem. I think web browsers have started displaying certain URLs as punycode if the locale doesn't match the user's locale. The other way is just to not click on URLs given to you by untrusted sources. If Bank of America wants to communicate with me, I can type out their website by hand, not click on a link in an email.

This isn't a Unicode problem. It's a language problem. The languages are fundamentally broken in that they have lots of similar looking glyphs. The CJK scripts are probably the worst in this regard. However, even good old English has a lot of ambiguity. I, l, and 1 all look very similar visually, as do O, and 0. A clever engineer might try to "unify" those letters, but 1 assume that y0u w0u1d not be supportive 0f th1s.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 20:36 UTC (Mon) by BenHutchings (subscriber, #37955) [Link]

I think the original requirement for inclusion in Unicode was that there must be a one-to-one mapping from each 'legacy' character set to Unicode code points, allowing for efficient and lossless round-trip conversion. AFAIK the existing CJK character sets did not cover multiple languages and thus Han unification did not break this property. However the various ISO 8859 scripts included accented (precomposed) characters and so those were assigned their own code points.

I also don't understand the thinking behind the 16 bit limitation. There are more than 65,535 Chinese characters in existence, so right off the bat they should have realized that 16 bits would not be enough.

As I understand it, the existing character sets didn't cover all of those either. The aim of encoding 'new' characters came later as a result of the merge with ISO 10646 (the Universal Character Set). Of course, it might have been sensible to allow for expansion from the start.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 6:24 UTC (Thu) by lordsutch (guest, #53) [Link]

As a general rule, data coming into a program from a file, socket, or other source is binary bytes; if the program needs to operate on that data as text, it must explicitly decode it into Unicode.

In casual use, I find that the more irritating issue is that output requires encoding, which interacts very badly with print() in Python 3.x. print() takes a string, not bytes, but there doesn't seem to be any non-hackish way to declare what encoding stdout and stderr should use, which means different behavior when you execute a script from the command line (say with a UTF-8 locale, making everything smooth) and as a CGI* (seemingly running in pure ASCII, and doing strict conversion, meaning every accented character throws an exception).

* Yes, I know CGI is obsolete... but for a small bit of code that may see 100 hits a day, I'm not going to the hassle of setting up whatever CGI replacement is fashionable these days.
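
The workaround I've ended up with for the encoding issue is to rewrap stdout explicitly, which is exactly the sort of hack I mean (a sketch; the hard-coded UTF-8 is just what my pages happen to use):

    import io
    import sys
    # Python 3: force a known encoding on stdout, whatever the locale says
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    print('caf\u00e9')      # now comes out as UTF-8 even under a C/ASCII locale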

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 6:35 UTC (Thu) by thedevil (guest, #32913) [Link]

The other language that seems to get this right is Haskell, and it *does* have a way to change the encoding away from the ambient locale. See

http://www.haskell.org/ghc/docs/latest/html/libraries/bas...

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 10:25 UTC (Thu) by angdraug (subscriber, #7487) [Link]

Another one is Ruby 1.9. The only reasonable complaint I've heard about it so far is that its internal representation of strings doesn't stick to UCS4 (32-bit characters), but I'm not yet convinced it is a problem.

The only real problem that I've encountered is with some library authors insisting that their library doesn't have to work in non-UTF8 locales, but that's a general problem with Ruby's ecosystem. Remember when everyone complained about poor code quality in CPAN?

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 23:20 UTC (Thu) by Tobu (subscriber, #24111) [Link]

Same complaint with Python; its internal representation can be UCS2 (UTF-16 without some of its consistency requirements), which is a thing of the devil. Plop one character into a string, occasionally get back two (and these things aren't really characters now). That representation is being phased out on Linux thankfully, and will disappear entirely in Python 3.3. There's a lot more that could be done, but the rest isn't a problem of internal representation and can be left to libraries.
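
A narrow build makes the problem easy to demonstrate (a sketch; the character is arbitrary, anything outside the Basic Multilingual Plane will do):

    # Python 2 "narrow" (UCS-2-ish) build:
    s = u'\U00010400'       # one character, DESERET CAPITAL LETTER LONG I
    print(len(s))           # 2 - stored as a surrogate pair
    print(repr(s[0]))       # u'\ud801' - half a character, meaningless on its own
    # A "wide" (UCS-4) build answers 1 and the real character instead, so the
    # same code behaves differently from one interpreter build to the next.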

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 3:44 UTC (Mon) by josh (subscriber, #17465) [Link]

Seems perfectly reasonable to me for new code to deal exclusively in Unicode, unless it has a specific reason to need to handle random crazy character sets.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 19:19 UTC (Fri) by cmccabe (guest, #60281) [Link]

Why would you want to ignore the system-wide locale that the user has specifically chosen? Sounds like a misfeature to me.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 23:12 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Heh. That's the beauty of 8-bit encodings. There are so many of them to choose from!

Russian alone has: KOI8-R, KOI8-U, Win1251 and Cp855 - they are different and they are widely used (still!). There are also like 10 historic encodings (ISO, GOST, old GOST, ZX-Spectrum encoding and so on).

So it's still common to receive files with names and/or contents in a wrong encoding (especially on FAT-formatted USB thumb drives). For additional fun, sometimes automatic transcoders (in email, for example) assume incorrect input encoding, so it's possible to get a KOI-8 letter transcoded to UTF-8 as if it was written in Win1251. Fun, fun, fun!
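
That kind of double-transcoding damage is easy to reproduce; a rough Python sketch:

    # "Привет" serialized as KOI8-R, then wrongly decoded as Windows-1251:
    text = u'\u041f\u0440\u0438\u0432\u0435\u0442'
    mangled = text.encode('koi8-r').decode('cp1251')
    print(repr(mangled))    # Cyrillic gibberish; re-encode that to UTF-8 and
                            # the damage is baked in for good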

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 11:56 UTC (Thu) by intgr (subscriber, #39733) [Link]

> I find that the more irritating issue is that output requires encoding,
> which interacts very badly with print() in Python 3.x. print() takes a
> string, not bytes

No surprise there; that's because print() is supposed to output human-readable text.

If you want to output binary data, apparently the "right way" is to bypass the encoding layer using sys.stdout.buffer.write()
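
i.e. something along these lines (a sketch):

    import sys
    # Python 3: write raw bytes, skipping the text/encoding layer entirely
    sys.stdout.buffer.write(u'caf\u00e9\n'.encode('utf-8'))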

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 20:56 UTC (Thu) by kunitz (subscriber, #3965) [Link]

If you have a UTF-8 locale, setting the environment variable PYTHONIOENCODING to utf-8 makes a lot of sense.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 13:19 UTC (Fri) by cortana (subscriber, #24596) [Link]

It's always bugged me that you have to do this. I don't have to set HASKELLENCODING, RUBYENCODING, LUAENCODING, etc... Python should use LC_CTYPE like everything else seems to!

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 15:29 UTC (Thu) by marduk (subscriber, #3831) [Link]

Sadly, Python3 doesn't really fix the "string"/unicode issue, but merely exchanges one set of problems for another.

Recently, I was converting a small script from Python2 to Python3... I had several issues:

* If you use os.listdir('.'), you get back a list of unicode filenames... so all filenames are automagically decoded from ASCII or UTF-8 or whatever your system thinks they should be. If filenames are not in any particular encoding, which *nix allows, then you get encoding errors. The solution appears to be to use os.listdir(b'.')

* If you use the re module, and you are using a regex on the afore-mentioned filenames, the regex must be the same type as the string-or-bytes object. E.g. if listdir() is returning bytes then your regex needs to be b'something' but if listdir is returning string then just use 'something'

* Same with other "string" functions (e.g. *.endswith(b'.0001') vs *.endswith('.0001'))

* low-level functions (e.g. fdopen) just don't behave as they did in Python2... before, they always returned "byte" strings... now, you have to pass flags or guess what they're going to return. If you're doing an fdopen on stdin, which is an already-opened file, then you have to re-create stdin if you want bytes instead of unicode. I still have a program I'd like to port to Python3, but it uses fdopen and, although it runs without errors, it behaves completely differently, and I have yet to determine if it is a bug in my conversion or a bug in Python3.

In a way, it's really too bad bytes look/act like "strings" and vice-versa. Sometimes I wish they were completely incompatible, e.g. os.listdir() always took bytes as an argument and always returned a list of bytes. And I wish any operation external to Python itself (e.g filesystem functions, network functions, etc.) always worked with bytes and not unicode(strings). That would clear the confusion as to what should be passed and what to expect to be returned. Let the programmer deal with explicitly decoding/encoding external data.
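
To illustrate, the bytes-everywhere style I have in mind looks something like this (a Python 3 sketch; the pattern and names are made up):

    import os
    import re

    # ask for bytes, get bytes back, and keep the regex in bytes too
    for name in os.listdir(b'.'):             # bytes argument -> bytes results
        if re.search(rb'\.0001$', name):      # the pattern must also be bytes
            print(os.fsdecode(name))          # decode only when displaying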

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 20:53 UTC (Thu) by cmccabe (guest, #60281) [Link]

99.9999% of the time, you really *shouldn't* handle non-UTF8 filenames. Just let the script fail, and let the user fix his problem.

If you do choose to accept them, you are playing with fire. It's not even safe to print out an arbitrary non-UTF8 filename to stdout or stderr. It may contain control characters that tell the terminal to do something malicious.

Windows and MacOS don't allow non-unicode filenames. Only Linux and the other *NIXes still do. It's really something that ought to be fixed at some point because of all the security vulnerabilities it creates, and the non-existent use cases for binary blobs as filenames.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 12:14 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

You can put terminal control characters into valid UTF-8 file names too.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 19:22 UTC (Fri) by cmccabe (guest, #60281) [Link]

I was kind of assuming that when banning non-utf8 characters, we'd also ban control characters in filenames as well. But you're right; that assumption ought to have been spelled out.

A quick Google search reveals that Python 3.x doesn't protect you from control characters in utf8 file names, either. Sigh... computer security is such a joke sometimes.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 19:33 UTC (Fri) by marduk (subscriber, #3831) [Link]

That's because they are *valid* filenames (and valid UTF-8) on certain operating systems.

In the same manner, we don't want our programming languages "protecting" us from using names like "Robert'); DROP TABLE Students; --" ;-)

Thoughts from LWN's UTF8 conversion

Posted Feb 4, 2012 21:34 UTC (Sat) by cmccabe (guest, #60281) [Link]

This conversation has been had many times before. There just isn't any valid use-case for control characters in filenames. They're purely a black hat tool for breaking security. You can write your filename in any language in the world-- even ancient Sumerian-- without using control characters.

You can argue that it was a bad decision to have control characters in the first place, but that ship has already sailed. Now it's just important to separate code and data. Filenames are the latter.

Thoughts from LWN's UTF8 conversion

Posted Feb 5, 2012 13:14 UTC (Sun) by marduk (subscriber, #3831) [Link]

The (my) original post in this thread said nothing of so-called "control characters" which, by the way, *are* valid utf-8. I merely stated that in Python3 os.listdir('.') assumes filenames encoded in whatever your system encoding is, when in fact the file names could be in another encoding, or no encoding at all. Whether or not they contain control characters is not the issue. If your purpose is just to get a key — a hash value so that you may later get a handle for that file so you can process data — then you should not need to worry about filenames and encodings. Indeed, Python3 *does* let you do this (os.listdir(b'.')) but it's a change that one might not have expected and is easy to forget.

When you have a non-technically oriented client with a system that has produced thousands of files you need to process, and those files have worked fine for the client for many, many years, and everyone else the client has shared those files with doesn't appear to have problems with them, it is a tough sell to tell the client that they need to rename all those thousands of files because you can't open them with those names. You'll find that they'll quickly find someone else to do the job.

The other issues I mentioned have nothing to do with filenames. I've developed a screen-scraping program that uses forkpty() and fdopen(). The Python2 version is "intuitive" and easy to follow. The Python3 version, which still doesn't behave exactly as the 2 version does, is much worse. The 2to3 script did nothing for it. One must add/change a lot of stuff manually, and when one takes a step back and actually looks at the code, it is obvious that one is doing more work "fighting" the programming language than actually getting stuff done. There is one part where there is a regex that I still haven't gotten to work with Python3. Prepending "b" to the regex doesn't just work for it. It is processing a regex on a byte stream, oh and this byte stream *does* have control characters ;-). I did do some searching on Python3 and bytes and regexes, but nothing came back that was very helpful. Aside from a few remarks that one *should not* be using regular expressions on bytes... but this *is* a screen-scraping program after all. I eventually gave up, thinking that Python2 will be around for a while, and if I need to abandon Python2 I'll probably just rewrite it in another programming language.

Don't get me wrong. I like Python3, which is one reason that I want to convert all my Python2 programs to 3 if/when possible. But the whole bytes vs. unicode string thing: GvR's main rationale was that in Python2 people were confusing byte streams with "text". One of the goals of Python3 was to "fix" that problem. My argument is that it merely exchanges one set of problems for another. One way that might have made it less confusing is to only allow some functions to accept/return byte strings or unicode strings. If os.listdir(), e.g., only accepted unicode and only returned unicode, and there were another function, say os.blistdir(), that only accepted byte strings and only returned byte strings, that *may* help. Also, low-level os functions like fdopen should never, in my opinion, "automagically" encode/decode data passed to them or returned from them. They should work with bytes only. If you are using a low-level OS function, the training wheels should be taken off.

But my opinion is that if you are doing something outside the language (reading/listing files/pipes, network operations, any kind of work on "external" data), then the language should err on the side of caution, saying "this is some external data, I'm just going to return a bytestring", and let the programmer manually deal with if/how the data should be decoded on the way in or encoded on the way out.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 12:36 UTC (Fri) by marduk (subscriber, #3831) [Link]

There are times when you have no choice in the file name. This isn't a "desktop" issue where you can just tell the user to "please change your filename". The point is that this has been in place for years, they *are* legal *nix file names, and every other *nix programming language I know of has no problem with them. Sometimes there are other consumers/creators of the file, and it is not feasible or even possible that they can be changed. Ignoring them or telling the user to rename them is just burying one's head in the sand.

And there are still environments where UTF-8 is far from universal, e.g. some countries in Asia especially when using older versions of Windows or MacOS. So even if a filename is encoded, one must not assume it is always going to be UTF-8. It's a nice dream, but I live in reality.

I can't for the life of me see any security vulnerabilities created by linux filenames, at least none that don't also exist for UTF-8. There are, however, safety issues when one assumes the character encoding of a byte stream (which is basically what a Unix filename is).

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 13:56 UTC (Fri) by cortana (subscriber, #24596) [Link]

The Windows case is worse. The Windows API does not, and will never, support UTF-8-encoded strings in any of its functions, save for MultiByteToWideChar/WideCharToMultiByte. If you want to use any Windows API facilities that use strings, you either call the native "Unicode" API, which always uses sequences of wchar_t (16 bits wide) representing little-endian UTF-16 code units, or you use the legacy "ANSI" API, which uses sequences of char representing code units in the encoding of the "System Locale" (which may be a single-byte encoding such as Windows-1252, or a multibyte encoding such as Shift-JIS, but is never UTF-8).

This restriction is annoying, but you can see where it came from--Windows NT made the switch to UCS-2 internally in the days before Unicode expanded past 16 bits, therefore it was convenient to use wchar_t, and later to extend by adopting UTF-16. The problem is that too much software is still written to use only the legacy ANSI APIs, and incorrectly assumes that char* is an acceptable external representation for filenames.

I've run into this a lot while writing software that runs on Windows; I want to use a library that does something with the filesystem, but that library makes the above assumptions, and hence will only open files via a function like foo_open(char* filename). These are reasonable assumptions, since such libraries usually also operate on UNIX and Mac OS X systems where all filesystem paths use char*, and the Windows ports will probably only have been tested on US-English Windows installations. The assumptions could even be said to have been inherited from the C and C++ language standards, despite efforts to the contrary.

I'm coming round to the opinion that libraries should not use filenames at all, but have a typedef that resolves to int on POSIX, HANDLE on Windows, and something else on Mac OS X (FSRef? NSURL/NSString? int?).

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 15:59 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, Boost has the wonderful Boost.Filesystem library, which abstracts paths in a special class that can do transcoding to/from UTF-8 if necessary.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 16:18 UTC (Fri) by cortana (subscriber, #24596) [Link]

It's wonderful indeed (at least in its v3 incarnation)--but it's not sufficient. You can't open an ifstream with a boost::filesystem::path, only with a char*, and the fundamental problem we have with Windows is that its API takes char* only as a convenience for backwards-compatibility; there is no way to represent all possible valid paths with char* strings.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 16:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

I seem to remember that there's a wrapper somewhere in Boost that does that. Personally, I have a simple wrapper that does fopen or wfopen depending on whether WIN32 is defined.

(oh, and C++ iostreams are a pile of $SWEAR_WORD)

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 16:40 UTC (Fri) by cortana (subscriber, #24596) [Link]

I think you're thinking of boost::filesystem::{i,o,}fstream; however, you can only use it with wchar_t* paths if you are building with MS Visual C++, because it relies on std::{i,o,}fstream having overloads that take wchar_t*, as well as the standard char* functions, and only MSVC provides those non-standard extensions.

iostreams are only an example from the standard library. The C standard fopen function is another--it takes char* and not wchar_t*, hence if you use it then you're screwed. Working around this by, say, calling _wfopen if _WIN32 is defined only gets you so far--as soon as you hit a library that has a foo_open function taking char* and not wchar_t*, you hit the same problem.

Summary: if you write a library that deals with files, you are only allowed to take filesystem paths as arguments to your functions if you've ported the library to several different platforms, and made sure it can deal with Chinese and Runic filenames at the same time. :)

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 16:50 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

>Working around this by, say, calling _wfopen if _WIN32 is defined only gets you so far--as soon as you hit a library that has a foo_open function taking char* and not wchar_t*, you hit the same problem.

Yeah, so I mostly use libraries that can accept file descriptors or FILE* instead of file names. That actually covers quite a lot of functionality.

>Summary: if you write a library that deals with files, you are only allowed to take filesystem paths as arguments to your functions if you've ported the library to several different platforms, and made sure it can deal with Chinese and Runic filenames at the same time. :)

Well, no arguments here. I'd also add: make sure it works with filenames in an 8-bit encoding that is not the same as the system encoding.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 17:19 UTC (Fri) by cortana (subscriber, #24596) [Link]

Be careful with FILE*. Your application might use a different C Runtime from the DLL you're passing the FILE* to (Windows! It's great!). Best stick with HANDLE I think. :)

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 19:32 UTC (Fri) by cmccabe (guest, #60281) [Link]

Maybe what ought to happen is that you write some kind of shim library that has all the C standard library functions in it: open, fopen, etc. You have some code in the shim library that translates all the UTF8 filenames to UTF-16 or whatever Windows is using. Then you rebuild all the libraries you're using against this shim library.

Of course, this approach basically forces you to bundle all your libraries with your application. But I was under the impression that this was standard operating procedure on Windows anyway, because some evil guy could overwrite your shared copy of the library with an older version, etc.

Does that make sense at all? I've never developed on Windows, so it might be nonsense.

Thoughts from LWN's UTF8 conversion

Posted Feb 4, 2012 0:15 UTC (Sat) by cortana (subscriber, #24596) [Link]

It can work and it's one way to do it. You will run into problems if you need to use a DLL that isn't linked against your replacement library itself though (think binary-only third-party DLLs).

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 17:02 UTC (Mon) by jwakely (subscriber, #60262) [Link]

Boost.Filesystem will most likely be in the next C++ standard, at which point std::fstream will get new constructors taking std::path objects. I realise that's not much use to anyone right now, unless you're planning to go into a coma for a few years.

Thoughts from LWN's UTF8 conversion

Posted Feb 4, 2012 7:09 UTC (Sat) by foom (subscriber, #14868) [Link]

> incorrectly assumes that char* is an acceptable external representation for filenames

But it's not actually incorrect. On the windows implementation of foo_open, the char* should be passed in as a utf8-encoded string, and decoded to utf-16 on the way to the windows wide-char API (e.g. _wfopen or whatever you want to use).

There's really no reason not to do that...
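
A sketch of that convention in Python terms (foo_open here is hypothetical; CPython itself already routes str paths through the wide-character calls on Windows):

    def foo_open(filename_utf8, mode="rb"):
        # Portable boundary: callers always pass UTF-8 bytes; decode once and
        # hand the text to the platform API, which uses UTF-16 on Windows.
        return open(filename_utf8.decode("utf-8"), mode)

    f = foo_open("Telef\u00f8nbok.txt".encode("utf-8"))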

Thoughts from LWN's UTF8 conversion

Posted Feb 10, 2012 19:34 UTC (Fri) by jrw (subscriber, #69959) [Link]

> every other *nix programming language I know of has no problem with them
> ...
> I can't for the life of me see any security vulnerabilities created by linux filenames

See David Wheeler's Fixing Unix/Linux/POSIX Filenames: www.dwheeler.com/essays/fixing-unix-linux-filenames.html

Thoughts from LWN's UTF8 conversion

Posted Feb 5, 2012 21:16 UTC (Sun) by k8to (subscriber, #15413) [Link]

I have created a non-utf8 filename on OS X. It's part of a test suite for filename handling.

Maybe the sane libraries don't?

The "user should fix his problem" is kind of broken when having a nonutf8 filename is not necessarily a problem.

For a looong time, portable software has had to deal with the fact that the encoding of byte strings from filesystem calls has all kinds of exceptions. For the case where you have to render a filename as text (rare), handle conversion failures, honor locale specifiers such as LANG and LC_CTYPE, and, if needed for your application domain, provide explicit support for declaring name encodings on a file or pattern basis.
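
As a sketch of the "render a filename as text, handling conversion failures" part in Python 3 (on most systems the filesystem encoding is driven by the locale, i.e. LANG/LC_CTYPE):

    import os, sys

    def display_name(raw):
        # For display only: decode with the filesystem encoding, substituting
        # a marker for bytes that do not decode rather than raising.
        return raw.decode(sys.getfilesystemencoding(), errors="replace")

    for name in os.listdir(b"."):
        print(display_name(name))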

In case anyone is bored...

Posted Feb 2, 2012 15:31 UTC (Thu) by tstover (guest, #56283) [Link]

here are some slides from a talk I once gave

http://www.thomasstover.com/unicodepresentation.pdf

In case anyone is bored...

Posted Feb 3, 2012 2:53 UTC (Fri) by acolin (subscriber, #61859) [Link]

Thanks for posting the slides -- an informative read, particularly the slides on rendering.

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 15:32 UTC (Thu) by JEFFREY (guest, #79095) [Link]

I think it’s great!

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 17:06 UTC (Thu) by skvidal (guest, #3094) [Link]

As a general rule I believe all things unicode and python result in:

▄██████████████▄▐█▄▄▄▄█▌
██████▌▄▌▄▐▐▌███▌▀▀██▀▀
████▄█▌▄▌▄▐▐▌▀███▄▄█▌
▄▄▄▄▄██████████████

6-bit characters

Posted Feb 2, 2012 17:10 UTC (Thu) by alfille (subscriber, #1631) [Link]

Your description of 6-bit characters with shift key sounds like the old PLATO system running on CDC hardware with 60-bit words (really large back then) and 10 characters/word.

6-bit characters

Posted Feb 2, 2012 17:11 UTC (Thu) by corbet (editor, #1) [Link]

A CDC is exactly what it was, though there was no PLATO involved (that came later).

Thoughts from LWN's UTF8 conversion

Posted Feb 2, 2012 21:18 UTC (Thu) by iabervon (subscriber, #722) [Link]

Python 2's Unicode handling resulted in sad faces, but most systems' Unicode handling instead resulted in "☹"s. Actually, Python's Unicode handling tends to result in exceptions, because it's devilishly hard to get Python to believe that some library's byte-sequence output ought to be in UTF-8, not ASCII, when the only person who knows for sure is the programmer calling the library (rather than either the library itself or the user running the program).
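
A minimal Python 2 illustration, assuming a library that hands back UTF-8 bytes:

    # Python 2
    from_library = 'Telef\xc3\xb8n'        # UTF-8 bytes, but typed as plain str

    u'name: ' + from_library               # implicit .decode('ascii') ->
                                           #   UnicodeDecodeError
    u'name: ' + from_library.decode('utf-8')   # works, but only the caller
                                               #   knew the right encoding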

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 1:57 UTC (Fri) by ras (subscriber, #33059) [Link]

> if the program needs to operate on that data as text, it must explicitly decode it into Unicode. This requirement is, frankly, a pain; there is a lot of explicit encoding and decoding to be done that didn't have to happen in a Python 2 program

+1.

Python3 consists of a lot of welcome cleanups, a couple of new features you can choose to use if you wish, and one bad design mistake everyone is forced to confront: using a bastardisation of UCS-2 to represent Unicode.

Life would have been much simpler if they had just abandoned Python2 Unicode strings entirely and reverted to the Python1.5 situation, where there was one string type and the programmer manually handled the encoding, if necessary. The point being it often isn't necessary: most of the time both ends are using compatible encodings, and the parts of the text you do care about, such as HTML encoding, are ASCII. Since ASCII's code points are identical in all popular encodings, normal string manipulation and regexes just work regardless of encoding.
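
For example, this kind of byte-level manipulation works no matter which encoding the payload is in, because the markup is ASCII (a sketch):

    import re

    # One cell is UTF-8, the other Latin-1; the <td> markup is plain ASCII.
    page = b"<td>Telef\xc3\xb8n</td><td>Telef\xf8n</td>"
    cells = re.findall(b"<td>(.*?)</td>", page)
    # [b'Telef\xc3\xb8n', b'Telef\xf8n'] -- the payload bytes pass through untouched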

None of my brushes with Python2's Unicode have been pleasant. Conversion between Unicode encodings is fragile, to say the least, but often you could avoid it. Python3 forcing you to do a conversion to and from UCS-2 means that fragility has infected my python programs, making the situation worse than in Python2. For that one reason I'll be sticking to Python2.x for as long as possible.

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 12:06 UTC (Fri) by dag- (guest, #30207) [Link]

I concur with the last paragraph; despite all the good intentions, forcing this upon all python3 users is painful :-/

Thoughts from LWN's UTF8 conversion

Posted Feb 3, 2012 12:20 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

IIRC Python 3.2 (or 3.3?) string objects pick one of three internal representations:

* 8-bit ASCII string
* 16-bit UCS2 string
* 32-bit UTF-32 string

depending on the characters contained within. So you're no longer limited to the Basic Multilingual Plane, nor do you have to deal with UTF-16 surrogates.
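
(It's 3.3, if PEP 393 is what I'm remembering.) A quick way to see the difference:

    s = "\U0001F600"     # a character outside the Basic Multilingual Plane
    len(s)               # 1 on Python 3.3; 2 on a narrow build of 3.2
    s[0]                 # the whole character, not half of a surrogate pair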

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 12:27 UTC (Mon) by alankila (guest, #47141) [Link]

As a guy who has had to deal with encodings for a long time, I absolutely think that python's split between string and bytes is the right thing to do -- in fact it reminds me of java's corresponding split between String and byte[].

People have tried this way of treating text as binary as far as possible, and it just doesn't seem to work.

Thoughts from LWN's UTF8 conversion

Posted Feb 7, 2012 9:32 UTC (Tue) by jezuch (subscriber, #52988) [Link]

> As a guy who has had to deal with encodings for a long time, I absolutely think that python's split between string and bytes is the right thing to do -- in fact it reminds me of java's corresponding split between String and byte[].

And between InputStream/OutputStream and Reader/Writer (the former are for byte streams, the latter are for character streams). The conversion between bytes and text needs to be explicit. As another person who lives in a non-ASCII locale I definitely have to say that any language that conflates text and bytes causes brain damage, sorry ;)
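
Python 3's io module draws the same line, for what it's worth; a small sketch (the filename is just a placeholder):

    import io

    raw = io.open("data.bin", "rb")                  # byte stream, like InputStream
    text = io.TextIOWrapper(raw, encoding="utf-8")   # character stream, like Reader
    first_line = text.readline()                     # returns str; the bytes->text
                                                     # conversion was explicit above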

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 6:09 UTC (Wed) by ras (subscriber, #33059) [Link]

Maybe you could give some illustrations of when not splitting strings and bytes causes problems. Most of us haven't seen them, and I can't see any listed in this thread. Problems caused by splitting them have been listed here, which makes the bland assertion look a little weak.

That aside, most of the complaints here would be addressed if bytes() were just the Python2 str() renamed. The rest would be resolved by replacing Python's UCS-2 with something that is a standard and can actually represent all of Unicode: either UCS-4 or UTF-8. UTF-8 has the nice advantage that it could be just a subclass of bytes() that was guaranteed to contain valid utf8. All the string methods could be inherited from bytes() and should just work. Things that care whether they are passed a human-readable string could insist on getting utf8.

That would have been the right solution for Python 3.0. It would have meant the bulk of the Python2 code remained compatible, while actually simplifying the str/unicode mess Python2 created into a nice class hierarchy. As it is, each new revision of Python 3 seems to be trying to solve the complex programming interface created by UCS-2 with new layers of complexity, and the comment from mgedmin seems to imply that trend is continuing with Python 3.3. Attempting to simplify things by adding more complexity underneath rarely works as a design strategy.
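
Something along these lines is what I have in mind (only a sketch):

    class utf8(bytes):
        """Bytes guaranteed to be well-formed UTF-8."""
        def __new__(cls, data=b""):
            data = bytes(data)
            data.decode("utf-8")            # raises UnicodeDecodeError if invalid
            return super(utf8, cls).__new__(cls, data)

    s = utf8("Telef\u00f8nbok".encode("utf-8"))   # string methods come from bytes()
    s.upper()                                     # works like any other bytes
    utf8(b"\xff\xfe")                             # rejected: not valid UTF-8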

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 19:04 UTC (Wed) by raven667 (guest, #5198) [Link]

I may be misunderstanding your statement, but if you want to do any string manipulation you will want to be working on a normalized abstract string rather than whatever binary representation it's encoded with. They needed to pick something and they picked UCS-2, a debatable decision, but there just isn't a 1:1 relationship between bytes and character strings, so a separation between abstract string manipulation and the underlying binary encoding seems sensible to me. The length in bytes and the length in characters are often going to be different, for example. Handling all strings in a consistent manner, rather than just the US-ASCII subset, is vastly more complex, which is just the nature of human existence.
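
For example:

    s = "Telef\u00f8nsystem"          # 13 characters
    len(s)                            # 13
    len(s.encode("utf-8"))            # 14 bytes: the 'ø' takes two bytes
    len(s.encode("utf-16-le"))        # 26 bytes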

Conversion issues aren't new to Unicode.

Posted Feb 3, 2012 16:48 UTC (Fri) by dwmw2 (subscriber, #2063) [Link]

"So programs dealing in Unicode text must know how outside-world strings are represented and convert those strings to the internal format before operating on them."
That isn't a new feature with Unicode. That was necessary with legacy character sets too. Strings always had to have an associated label which indicated the character set in use. You couldn't just accept a byte-stream in ISO8859-1 and pass it to someone else who expects ISO8859-8 (or EBCDIC!), and expect sane things to happen.

The real problem is that labelling was hard. Strings would often lose their labels in transit, so you end up having to guess what encoding is in use. See my comments on HTML5 for a discussion of how well that works out.

And even if you do manage to preserve the labelling, converting to your internal format was hard because you couldn't represent all of the characters from one 8-bit character set in another one. So any conversion would be lossy.

Using UTF-8 largely solves the above issues. By using UTF-8 internally you can represent anything you receive. And it makes the labelling problem a whole lot easier — you just make sure everything is converted to UTF-8 as you receive it, and forget the onerous task of carrying labels around with each string. You know that everything, everywhere within your system, is all UTF-8.

As more and more people move to UTF-8, it even makes the labelling problem easier when someone feeds you text that's unlabelled. It's more and more valid to just assume that it's UTF-8. And if it is some legacy 8-bit nonsense instead, that's usually something you can detect because it'll be an invalid UTF-8 byte-stream. So you can validate your assumption too.
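
In code, the validation is about this cheap (the Latin-1 fallback is just one example of handling "legacy 8-bit nonsense"):

    def to_text(raw):
        try:
            return raw.decode("utf-8")      # assume UTF-8, and validate it
        except UnicodeDecodeError:
            return raw.decode("latin-1")    # legacy fallback: never fails, since
                                            # every byte maps to some character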

In those respects, UTF-8 makes the whole thing easier, not harder.

The issue that UTF-8 does introduce, however, is that characters now take a variable number of bytes. A certain amount of work had to be done, in order to cope with that, but it should mostly be a solved issue these days.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 7:05 UTC (Mon) by C.Gherardi (subscriber, #4233) [Link]

On a mostly unrelated note, the Unix utility suites have a long way to go with regard to 'just working' with multi-byte character sets.

Utilities like grep still fail to work on utf-16 files, making searching a directory of mixed utf-8 and utf-16 files a real pain.

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 9:47 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

UTF-16 isn't a multibyte encoding. It's a wide (and multi-word) encoding. It's also not cheaply-and-reliably distinguishable from a headerless binary blob. (And I feel compelled to add that my first reaction was "why on Earth would you have such a directory in the first place?")

Thoughts from LWN's UTF8 conversion

Posted Feb 7, 2012 5:53 UTC (Tue) by C.Gherardi (subscriber, #4233) [Link]

I'm not 100% sure of the difference between multi-byte and wide encoding, and tbh not that interested.

I contribute to a program that parses text files produced by commercial poker programs. One of these clients started several years ago using ascii, then cp-1252, had a brief interlude with utf8, then in their infinite wisdom switched to utf16. The files are all text (imho), and a long-term player will have all of these files mixed in the same directory.

It would be nice to be able to grep this directory for specific lines without the gymnastics required with iconv at the moment.
</derail>

Thoughts from LWN's UTF8 conversion

Posted Feb 8, 2012 2:37 UTC (Wed) by cmccabe (guest, #60281) [Link]

You can write a script that decodes your crazy binary format into normal UTF-8 text and runs grep on it.
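
A sketch of such a script in Python (it assumes the UTF-16 files carry a BOM, which is how most Windows software writes them):

    #!/usr/bin/env python
    import codecs, sys

    pattern, path = sys.argv[1], sys.argv[2]
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")             # the BOM gives the byte order
    else:
        text = data.decode("utf-8", "replace")   # otherwise assume UTF-8
    for line in text.splitlines():
        if pattern in line:
            print(line)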

I wrote a similar script for pdfs that allows you to effectively grep a PDF from the command line.

http://www.club.cc.cmu.edu/~cmccabe/cgi-bin/gitweb.cgi?p=...

Thoughts from LWN's UTF8 conversion

Posted Feb 6, 2012 15:08 UTC (Mon) by dvdeug (subscriber, #10998) [Link]

Is there any intent that they work on UTF-16? That's just not an acceptable text encoding on Unix; you may as well have a directory with mixed UTF-8 and DOC or RTF files.

Thoughts from LWN's UTF8 conversion

Posted Feb 13, 2012 13:32 UTC (Mon) by ekj (guest, #1524) [Link]

Your search function still complains bitterly about "unsupported characters" if one dares to search for something that's not US-ascii, though. Like searching for Telephønesystem.

Might want to look at that; it's annoying when you cannot find people who have names with non-ascii characters in them.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds