UTF-16
Posted Mar 24, 2010 21:07 UTC (Wed) by ikm (guest, #493)
In reply to: UTF-16 by wahern
Parent article: Resetting PHP 6
Would you elaborate? To the best of my knowledge, UCS-2 has always been a fixed-width BMP representation. Even its name says so.
> useless for anything but European languages and perhaps a handful of some Asian languages
Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.
> billions of people either can't use your application, or don't expect it to work for them well
Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios. In your interpretation, though, it feels just like the opposite: Unicode is impossible and never works, half the world just can't use it at all. That's not true.
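A minimal Python sketch (mine, purely illustrative; the thread itself contains no code) of the distinction being argued about here: UCS-2 is the fixed-width, BMP-only view, while UTF-16 reaches beyond the BMP with surrogate pairs.

    # One character inside the BMP and one outside it.
    bmp_char = "\u4e2d"           # U+4E2D, a common CJK ideograph in the BMP
    non_bmp_char = "\U00020000"   # U+20000, CJK Unified Ideographs Extension B

    print(bmp_char.encode("utf-16-be").hex())      # "4e2d"     -- one 16-bit code unit
    print(non_bmp_char.encode("utf-16-be").hex())  # "d840dc00" -- a surrogate pair

    # UCS-2 is the historical fixed-width subset: exactly one 16-bit unit per
    # character, which is why it can only ever cover the BMP.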
Posted Mar 25, 2010 5:20 UTC (Thu)
by wahern (subscriber, #37304)
[Link] (25 responses)
The problem here is conflating low-level codepoints with textual semantics. There are more dimensions than just bare codepoints and combining codepoints (where w/ the BMP during the heyday of UCS-2 you could always find, I think, a single codepoint alternative for any combining pair).
Take Indic scripts, for example. You can have multiple codepoints which, while not technically combining characters, require that certain rules be followed; together with other semantic forms they are collectively called graphemes and grapheme clusters. If you split a "string" between these graphemes and stitch them back together ad hoc, you may end up with a nonsense segment that might not even display properly. In this sense, the fixed width of the codepoints is illusory when you're attempting to logically manipulate the text. Unicode does more than define codepoints; it also defines a slew of semantic devices intended to abstract text manipulation, and these are at a much higher level than slicing and dicing an array of codepoints. (As noted elsethread, these can be provided as iterators and special string operators.)
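A minimal Python sketch of the codepoint-versus-grapheme gap (illustrative only; it uses a Latin combining mark rather than an Indic cluster for brevity):

    import unicodedata

    # "café" spelled with a combining acute accent: 5 codepoints, 4 user-visible characters.
    s = "cafe\u0301"

    print(len(s))       # 5 -- the codepoint count
    print(s[:4])        # "cafe" -- slicing by codepoint index has split the grapheme
    print(unicodedata.normalize("NFC", s))  # a precomposed codepoint exists here,
                                            # but many grapheme clusters have no such fallback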
It's been a few years since I've worked with Chinese or Japanese scripts, but there are similar issues. Though because supporting those scripts is a far more common exercise for American and European programmers, there are common tricks employed--lots of if's and then's littering legacy code--to do the right things in common cases to silence the Q/A department fielding calls from sales reps in Asia.
> Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.
"Majority" and "pretty much". That's the retrospective problem that still afflicts the technical community today. Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP. What happens when somebody wants to create a web application around these texts in their traditional script? With an imagination one could imagine all manner of new requirements just around the corner that will continue to erode the analysis that the BMP is "good enough". For example the phenomenon may reverse of simplifying scripts in various regions, begun in part because of perceived complexity viz-a-viz the limitations of modern computing hardware and/or inherited [racist] notions of cultural suitability. Mao's Simplified Chinese project may turn out to be ahistorical, like so many other modernist projects to fix thousands of years of cultural development around the world.
Of course, the whole idea that the BMP is "good enough" is nonsensical from the get-go. In order to intelligently handle graphemes and grapheme clusters you have to throw out the notion of fixed-width anything, period.
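As a rough illustration of what throwing out fixed-width means in practice, here is a toy Python grapheme grouper (mine, not from the comment); it only handles the simple combining-mark case, whereas real segmentation follows UAX #29, which is what ICU's break iterators implement:

    import unicodedata

    def rough_graphemes(text):
        # Toy grouping: attach combining marks to the preceding base character.
        # Full grapheme-cluster segmentation (UAX #29) has many more rules; this
        # only shows why indexing by fixed-width code units is the wrong model.
        cluster = ""
        for ch in text:
            if cluster and unicodedata.combining(ch):
                cluster += ch          # extend the current cluster
            else:
                if cluster:
                    yield cluster
                cluster = ch           # start a new cluster
        if cluster:
            yield cluster

    print(list(rough_graphemes("cafe\u0301")))  # ['c', 'a', 'f', 'e\u0301'] -- 4 units, not 5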
> Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios.
I don't think I exaggerate. First of all, as far as I know, Unicode is sufficient. But I've never actually seen an open source application--other than ICU--that does anything more effectively with Unicode than use wchar_t or similar concessions. (Pango and other libraries can handle many text flow issues, but the real problems today lie in document processing.)
I think it's a fair observation that the parsing and display of most non-European scripts exact more of a burden than European scripts do. For example (and this is more about display than parsing), I'm sure it rarely if ever crosses the mind of a Chinese or Japanese script reader that most of the text they read online will be displayed horizontally rather than vertically. But if you go into a Chinese restaurant, the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode? (Even if neither is wrong per se.) I think the answer is that programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs. It's an unjustifiable intransigence. And we haven't been able to move beyond it because the solutions tried so far simply attempt to reconcile historical programming practice with a handful of convenient Unicode concepts. Thus this obsession with codepoints, when what should really be driving syntax and library development isn't these low-level concerns but how to simplify the task of manipulating graphemes and higher-level script elements.
Posted Mar 25, 2010 11:04 UTC (Thu)
by tetromino (guest, #33846)
[Link] (2 responses)
I am not sure if I understand you. Would you mind giving a specific example of what you are talking about? (You will need to select HTML format to use Unicode on lwn.net.)
> Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP.
25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?
> But if you go into a Chinese restaurant the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode?
Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?
Posted Mar 26, 2010 19:31 UTC (Fri)
by spacehunt (guest, #1037)
[Link] (1 responses)
> 25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?
A lot of Chinese characters in modern usage are outside of the BMP:
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00...
> Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?
It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.
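For anyone who wants to check their own data, a small Python helper (purely illustrative, not part of the original comment) that reports which characters in a string fall outside the BMP:

    def outside_bmp(text):
        # Return the characters of text whose codepoints lie beyond U+FFFF.
        return [ch for ch in text if ord(ch) > 0xFFFF]

    print(outside_bmp("香港"))         # [] -- both of these characters are inside the BMP
    print(outside_bmp("\U00020000"))   # one CJK Extension B character, outside the BMP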
Posted Mar 31, 2010 4:35 UTC (Wed)
by j16sdiz (guest, #57302)
[Link]
> It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan.
As a Chinese living in Hong Kong I can tell you this:
Most of the Chinese characters are in BMP. Some of those outside BMP are used in Hong Kong, but they are not as important as you think -- most of them can be replaced with something in BMP (and that's how we have been doing this before the HKSCS standard).
And yes, you can have Confucius in BMP. (Just like how you have KJV bible in latin1 -- replace those long-S with th, and stuff like that)
Posted Mar 25, 2010 11:41 UTC (Thu)
by ikm (guest, #493)
[Link] (4 responses)
Me: Here's a list of scripts BMP supports. That's the majority of all scripts in Unicode.
You: "Majority" and "pretty much". Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP.
So, BMP without classic Chinese is largely useless? Nice. You know what, enough of this nonsense. Your position basically boils down to "if you can't support all the languages in the world, both extinct and in existence, 100.0% correct, and all the features of Unicode 5.0, too, your effort is largely useless". But it's not; the world isn't black and white.
Posted Mar 25, 2010 12:48 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
And a few years later, the pressure has mounted enough that you do need to process the real thing, not a simplified model, and you need to do the work you didn't want to do in the first place *and* handle all the weird cases your previous shortcuts generated.
The "good enough" i18n school has been a major waste of development effort so far. It has proven again and again to be shortsighted.
Posted Mar 25, 2010 13:04 UTC (Thu)
by ikm (guest, #493)
[Link] (2 responses)
You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.
Posted Mar 25, 2010 16:48 UTC (Thu)
by JoeF (guest, #4486)
[Link] (1 responses)
> You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.
And what exactly makes you the final authority on using UTF?
While you may have no need to represent ancient Chinese characters, others may. Just because you don't need it doesn't mean that others won't have use for it.
Your argument smacks of "640K should be enough for anyone" (misattributed to BillG).
Posted Mar 25, 2010 17:06 UTC (Thu)
by ikm (guest, #493)
[Link]
p.s. And btw, you *can* represent ancient Chinese with UTF... The original post was probably referring to some much more esoteric stuff.
Posted Mar 25, 2010 15:13 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (16 responses)
The invention of alphabets was a major breakthrough - because they are inherently simpler than logographies. It's not just about computers: compare how much time a child typically needs to learn one versus the other.
> I think the answer is because programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs.
Of course, yes, what did you expect? This problem will be naturally solved when countries with complicated writing systems stop waiting for the Western world to solve problems only they have.
> It's an unjustifiable intransigence.
Yeah, software developers are racists since most of them do not bother about foreign languages...
Posted Mar 25, 2010 16:17 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link] (5 responses)
* Nobody uses logograms. In a logographic system you have a 1:1 correspondence between graphemes and words. Invent a new word, and you need a new grapheme. Given how readily humans (everywhere) invent new words, this is quickly overwhelming. So, as with the ancient Egyptian system, the Chinese system is clearly influenced by logographic ideas, but it is not a logographic system, a native writer of Chinese can write down words of Chinese they have never seen, based on hearing them and inferring the correct "spelling", just as you might in English.
Posted Mar 25, 2010 19:24 UTC (Thu)
by atai (subscriber, #10977)
[Link] (4 responses)
The ideas that China is backwards because of the language and written characters should now go bankrupt.
Posted Mar 25, 2010 22:26 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
nameless Phoenicians, Mesopotamians and Egyptians 2500-odd years ago, *everyone* else was backwards. It's an awesome piece of technology. (btw, the Chinese characters have had numerous major revisions, simplifications and complexifications over the last two millennia, the most recent being the traditional/simplified split: any claim that the characters are unchanged is laughable. They have certainly changed much more than the Roman alphabet.)
Posted Mar 25, 2010 23:21 UTC (Thu)
by atai (subscriber, #10977)
[Link] (1 responses)
But if you say Chinese characters changed more than the Latin alphabet, then you are clearly wrong; the "traditional" Chinese characters have certainly stayed mostly the same since 105 BC (what happened in Korea, Japan or Vietnam does not apply because those are not Chinese).
I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?
Posted Mar 26, 2010 11:16 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link]
> I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?
13th Century English (i.e. what linguists call "Middle English") should be readable-for-meaning by an educated speaker of Modern English with a few marginal glosses. Reading-for-sound is almost as easy (95% of it is covered by "Don't silence the silent-in-Modern-English consonants. Pronounce the vowels like Latin / Italian / Spanish instead of like Modern English"). My understanding is that the Greek of 2000 years ago is similarly readable to fluent Modern Greek users. (The phonological issues are a bit trickier in that case.) In both cases - and, I'm sure, in the case of classical Chinese - it would take more than just knowing the words and grammar to receive the full meaning of the text. Metaphors and cultural assumptions are tricky things.
Posted Apr 15, 2010 9:27 UTC (Thu)
by qu1j0t3 (guest, #25786)
[Link]
Anyone who wants to explore the topic of comparative alphabets further may find McLuhan's works, such as The Gutenberg Galaxy, rewarding.
Posted Mar 25, 2010 16:21 UTC (Thu)
by paulj (subscriber, #341)
[Link] (8 responses)
roman alphabet encoding of mandarin), and learn hanzi logography building on their knowledge of pinyin.
Posted Mar 25, 2010 19:33 UTC (Thu)
by atai (subscriber, #10977)
[Link] (3 responses)
Posted Mar 26, 2010 2:49 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
and it is indexed by pinyin. From what I have seen of (mainland) chinese, pinyin appears to be their primary way of writing chinese (i.e. most writing these days is done electronically, and pinyin is used as the input encoding).
Posted Mar 26, 2010 15:37 UTC (Fri)
by chuckles (guest, #41964)
[Link] (1 responses)
While pinyin is nice, there are no tone markers. So you have a 1 in 5 chance (4 tones plus neutral) of getting it right.
You are correct that pinyin is the input system on computers, cell phones, everything electronic, in mainland China. Taiwan has its own system. Also, the Chinese are a very proud people; characters aren't going anywhere for a LONG time.
Posted Mar 26, 2010 21:24 UTC (Fri)
by paulj (subscriber, #341)
[Link]
computer you just enter the roman chars and the computer gives you an appropriate list of glyphs to pick (with arrow key or number).
And yes they are. Shame there's much misunderstanding (in both directions) though. Anyway, OT.. ;)
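For readers who have never used such an input method, a toy Python sketch of the pick-from-a-candidate-list flow described above; the candidate table is a tiny hand-made sample, not real IME data:

    # Toy pinyin input-method lookup: type roman letters, pick a glyph from a list.
    CANDIDATES = {
        "ma": ["妈", "马", "吗", "麻", "骂"],   # one toneless syllable, several characters
        "zhong": ["中", "种", "重"],
    }

    def candidates_for(pinyin):
        # Return the glyph choices a real IME might offer for a toneless syllable.
        return CANDIDATES.get(pinyin, [])

    for i, glyph in enumerate(candidates_for("ma"), start=1):
        print(f"{i}. {glyph}")   # the user picks one by number or arrow key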
Posted Mar 25, 2010 20:23 UTC (Thu)
by khc (guest, #45209)
[Link] (3 responses)
Posted Mar 26, 2010 2:44 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Posted Mar 27, 2010 22:39 UTC (Sat)
by man_ls (guest, #15091)
[Link] (1 responses)
China has 1,325,639,982 inhabitants, according to Google. That is more than the whole of Europe, Russia, US, Canada and Australia combined. Even if there is a central government, we can assume a certain cultural diversity.
Posted Mar 28, 2010 4:22 UTC (Sun)
by paulj (subscriber, #341)
[Link]
This was a Han chinese person from north-eastern China, i.e. someone from the dominant cultural group in China, from the more developed part of China. I don't know how representative their education was, but I suspect there's at least some standardisation and uniformity.
Posted Dec 27, 2010 2:01 UTC (Mon)
by dvdeug (guest, #10998)
[Link]
Not only that, some of these scripts you're not supporting are wonders. Just because Arabic is always written in cursive and thus needs complex script support doesn't mean that it's not an alphabet that's perfectly suited to its language, one that is in fact easier for children to learn than the English alphabet is for English speakers.
Supporting Chinese or Arabic is like any other feature. You can refuse to support it, but if your program is important, patches or forks are going to float around to fix that. Since Debian and other distributions are committed to supporting those languages, the version of the program that ends up in the distributions will be the forked version. If there is no fork, they may just not include your program at all. That's the cost you'll have to pay for ignoring the features they want.