Resetting PHP 6 [LWN.net]

UTF8 vs UTF32

Posted Mar 24, 2010 16:52 UTC (Wed) by dlang (guest, #313) [Link] (15 responses)

which is better depends on what you expect to do with the strings.

if what you are doing with the strings is mostly storing, matching, concatenating and outputting them, UTF8 is better due to it's smaller size. In these cases the program doesn't really care what the data is, it's just a string of bytes to work with, the fact that it happens to be human readable is a side effect.

If you are searching the string, parsing the string with variable-length items in the string, there is still no real penalty to the variable length encoding of UTF8 as you have to start at the beginning of the string and walk it anyway.

If you have text with fixed size items in it that you need to access by character position, then UTF32 is better as every character is a fixed size, but there is a significant size penalty in manipulating and copying these strings around, so unless you are doing this a lot, you may find you are better off just walking the string. Remember that on modern CPUs a smaller cache footprint usually translates directly to higher performance. Walking a small string that fits into the cache a dozen times will be faster than accessing a larger string that doesn't fit in cache once (completely ignoring the executable code that gets pushed out of the cache by the larger string)

UTF32 carries a large memory size penalty, plus the need to convert when reading/writing.

what it is good for is if you are doing a lot of string manipulation by position (i.e. you know that the field you want starts at position 50)

UTF8 vs UTF32

Posted Mar 24, 2010 17:57 UTC (Wed) by foom (subscriber, #14868) [Link] (14 responses)

> what it is good for is if you are doing a lot of string manipulation by position (i.e. you know that
> the field you want starts at position 50)

I've never actually heard of a good reason for anyone to want that. Remember that in Unicode,
what the user thinks of a character might in fact be a composition of many unicode codepoints (e.g.
base character + accent). Accessing the Nth codepoint is an almost entirely useless thing to
optimize your storage for.

And yet, most programming languages have made this mistake, because they treat strings as an
array of codepoints, rather than as a unique data structure. All you really want is forward-and-
backward iterable and random access given an iterator.

UTF8 vs UTF32

Posted Mar 24, 2010 19:49 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (3 responses)

>I've never actually heard of a good reason for anyone to want that.

If you have to translate from an old, COBOL-style fixed-width data format into something modern and/or at least reasonably sane, that's a very good reason to want just that.

UTF8 vs UTF32

Posted Mar 24, 2010 20:44 UTC (Wed) by foom (subscriber, #14868) [Link] (2 responses)

I'm gonna bet the old COBOL stuff is not actually outputting Unicode. :)

UTF8 vs UTF32

Posted Mar 24, 2010 20:45 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (1 responses)

Quite. But that IS a case where you want to do string manipulation by position.

UTF8 vs UTF32

Posted Mar 25, 2010 14:14 UTC (Thu) by dgm (subscriber, #49227) [Link]

Fine then. If you limit yourself to the ASCII subset, UTF-8 offers you that. Is that enough for COBOL?

Iterators vs indices

Posted Mar 25, 2010 23:08 UTC (Thu) by butlerm (subscriber, #13312) [Link] (9 responses)

I've never actually heard of a good reason for anyone to want that.

Most languages (and SQL in particular) work exclusively using string indexes. You cannot use an iterator in a functional context. SQL doesn't have string iterators and never will. Iterators are a procedural thing.

Numeric indices are the only language independent option available for specifying and extracting parts of strings. That is why they are universal. I wouldn't use a language that didn't support them, in large part due to the difficulty of translating idioms from languages that do.

As a consequence, anything that needs to be done to efficiently present a string of any reasonable size as a linear array of _characters_ (or at least code points) is the language's problem, not the programmer's. That is the approach SQL takes, and from an application programmer's point of view it works very well.

That is not to say that an iterator interface shouldn't be provided (in procedural languages) as well, but of the two an index based interface is more fundamental.

Iterators vs indices

Posted Mar 26, 2010 13:49 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

I think you must be using a funny definition of functional. There is absolutely nothing that prevents an iterator-based API from working in a functional language. And of course most functional languages have many such APIs.

Let's take a traditional example: singly-linked-lists are a quite common data-structure in functional (or mostly-functional) languages like Haskell, Scheme, etc. Yet, you don't index them by position (that of course is available if you need it, but it's time O(n), so you don't normally want to use it). Instead, you use an iterator, which in this case is a pointer to the current element.

If anyone suggested that the primary access method for a singly linked list should be by integer position, they'd be rightly told that's insane -- iterating over the list would take O(n^2)!

Now, maybe your real point was simply that existing languages already have a poorly-designed Unicode String API that they have to keep compatiblity with -- and that API doesn't include iterators. So, they therefore have constraints they need to preserve, such as O(1) access by character index, because existing programs require it.

I won't argue with that, but I still assert it's not actually a useful feature for a unicode string API, in absence of the API-compatibility requirement.

Iterators vs indices

Posted Mar 26, 2010 22:09 UTC (Fri) by spitzak (guest, #4593) [Link]

If you really can't get away from the integer index, a solution is to have a string format that stores the most recent index computed and where it was in the string. Then when asked for a new index it will move from that previous position to the new one if the previous position is less that 2x the new one.

For the vast majority of cases where each integer starting from zero is used to get the "character" this would put the implementation back to O(1). And it would allow more complex accessors, such as "what error is here".

Iterators vs indices

Posted Mar 30, 2010 6:57 UTC (Tue) by njs (subscriber, #40338) [Link] (6 responses)

Another option is to store a string as a tree structure, where the leaves are some reasonable-sized chunks of bytes (to amortize storage overhead), and the tree nodes are annotated with the number of characters/bytes/code points/lines/whatever that occur underneath them. This allows random O(log n) access by character/byte/... offset. (You can maintain several different sorts of counts, and get fast access for all of them in the same data structure.) You also get cheap random insertion/deletion, which is an important operation for some tasks (e.g., editor buffers!) but horrendously slow for arrays.

For some reason nobody does this, though.

Iterators vs indices

Posted Mar 30, 2010 7:11 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

besides, as noted earlier in this thread, most uses of strings really don't care how they break apart, they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input) as such, anything more than the most compact representation ends up costing significantly more in memory size (and therefor cache space) than you gain with any string manipulation that you do

Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.

Iterators vs indices

Posted Mar 30, 2010 7:46 UTC (Tue) by njs (subscriber, #40338) [Link] (2 responses)

> the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

The memory overhead is certainly not as high as UCS-32 (at least for strings where UTF-8 has lower overhead than UCS-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk-size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad, IIUC on modern CPUs linear-time is not much worse than constant-time when it comes to accessing short arrays.

Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.

But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.

I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.

Iterators vs indices

Posted Mar 31, 2010 2:05 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

google wave uses the jabber protocol, but in it's documents it doesn't store words, it stores the letters individually, grouped togeather so that they can be changed individually (or so it was explained by the google rep giving the presentation I was at)

Iterators vs indices

Posted Mar 31, 2010 4:30 UTC (Wed) by njs (subscriber, #40338) [Link]

> I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?

Iterators vs indices

Posted Mar 30, 2010 8:12 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Didn't the GNU C++ ext/rope work in exactly this way?

Iterators vs indices

Posted Mar 30, 2010 17:06 UTC (Tue) by njs (subscriber, #40338) [Link]

No, interestingly -- they are more complicated and less like a conventional tree structure than one would think: http://www.sgi.com/tech/stl/ropeimpl.html

The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.

UTF-16

Posted Mar 24, 2010 17:16 UTC (Wed) by clugstj (subscriber, #4020) [Link] (40 responses)

What were the Java people smoking when they picked UTF-16?

UTF-16

Posted Mar 24, 2010 17:47 UTC (Wed) by Nahor (subscriber, #51583) [Link] (32 responses)

When Java was invented, Unicode was 16 bits only, UTF-16 didn't exist and UCS-2 was the encoding of choice. So it all made sense at the time.

Shortly after, Unicode was extended to 32 bits (Unicode 2.0).
Java became UTF-16 only with Java 5.0 (JDK/JRE 1.5) and UTF-8 was not much of an option anymore if they wanted to stay compatible with older Java code.

It's the same thing with the Unicode support in Windows by the way (except that they are still UCS-2 AFAIK)

UTF-16

Posted Mar 24, 2010 20:35 UTC (Wed) by wahern (subscriber, #37304) [Link] (31 responses)

Unicode was never 16-bits, nor is it now 32-bits. It was never just that simple.

Nor was UCS-2 fixed-width per se. It was fixed-width for the BMP, and only with several caveats which made it largely useless for anything but European languages and perhaps a handful of some Asian languages. (Disregarding, even, American languages and others around the world.)

Sun's and Microsoft's decision to wed themselves to, effectively, this restricted UCS-2 model (even if nominally they advertise UTF-16 now), was short-sighted, plain and simple. Not only short-sighted, but stupid. It merely replaced one problem--a multitude of character set schemes--with another--a fixed scheme that still only supported the most basic semantics of a handful of scripts from technologically advanced nations. All the work was retrospective, not at all prospective.

None of UTF-8, UTF-16, or UTF-32 make functionally manipulating Unicode text any easier than the other. Combining characters are just the tip of the iceberg. Old notions such as word splitting are radically different. Security issues abound regarding normalization modes. All the hoops programmers jump through to "support" Unicode a la Win32 or Java doesn't actually provide much if any multilingual benefit beyond Latin and Cyrillic scripts, with superficial support for some others. (In other words, if you think your application is truly multilingual, you're sheltered and deluded; but the fact remains that billions of people either can't use your application, or don't expect it to work for them well anyhow--many other aspects of modern computers are biased toward European script).

To properly support Unicode--including the South and Southeast Asian languages which have billions of readers--applications need to drop all their assumptions about string manipulation. Even simple operations such as concatenation have caveats.

What languages should do is to leave intact existing notions of "strings", which are irreconcilably burdened by decades of ingrained habit, and create new syntax, new APIs, and new libraries which are thoroughly Unicode oriented.

And none of this even considers display, which poses a slew of other issues.

UTF-16

Posted Mar 24, 2010 20:38 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (2 responses)

>What languages should do is to leave intact existing notions of "strings", which are irreconcilably burdened by decades of ingrained habit, and create new syntax, new APIs, and new libraries which are thoroughly Unicode oriented.

So, Unicode text would have to be represented by some other data structure?

Intriguing. What would you suggest? Obviously a simple array would not cut it, that's what we have now and that's what you're arguing against. So, what would we use instead? Linked lists of entities? Trees of some kind?

UTF-16

Posted Mar 24, 2010 21:53 UTC (Wed) by elanthis (guest, #6227) [Link] (1 responses)

Don't be facetious. That a string is internally represented as an array of
bytes or codepoints is an entirely different thing than its client API being
purely iterator based. The problem he was talking about was that the client
API to strings in most languages is to expose it as an array of characters
even though internally it _isn't_ an array of characters, it's an array of
bytes or codepoints. The accessors for strings really don't make much sense
as an array, either, because an array is something indexable by offset, which
makes no sense: what exactly is the offset support to represent? Bytes?
Codepoints? Characters? You can provide a firm answer to this question, of
course, but the answer is only going to be the one the user wants some of the
time. A purely iterator based approach would allow the client to ask for a
byte iterator, a codepoint iterator, or even a character iterator, and get
exactly the behaviour they expect/need.

UTF-16

Posted Mar 24, 2010 21:56 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

I was not being facetious. The idea of representing strings using a different data structure was actually something I was thinking was an interesting idea.

But, you're right, there's no need for internal and external representations to match. At least, on some levels. At some level you do have to be able to get at the array of bytes directly.

UTF-16

Posted Mar 24, 2010 21:07 UTC (Wed) by ikm (guest, #493) [Link] (26 responses)

> Nor was UCS-2 fixed-width per se. It was fixed-width for the BMP

Would you elaborate? To the best of my knowledge, UCS-2 has always been a fixed-width BMP representation. Even its name says so.

> useless for anything but European languages and perhaps a handful of some Asian languages

Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.

> billions of people either can't use your application, or don't expect it to work for them well

Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios. In your interpretation, though, it feels just like the opposite: Unicode is impossible and never works, half the world just can't use it at all. That's not true.

UTF-16

Posted Mar 25, 2010 5:20 UTC (Thu) by wahern (subscriber, #37304) [Link] (25 responses)

> Would you elaborate? To the best of my knowledge, UCS-2 has always been a fixed-width BMP representation. Even its name says so.

The problem here is conflating low-level codepoints with textual semantics. There are more dimensions than just bare codepoints and combining codepoints (where w/ the BMP during the heyday of UCS-2 you could always find, I think, a single codepoint alternative for any combining pair).

Take Indic scripts for example. You could have multiple codepoints which while not technically combining characters, require that certain rules are followed; together with other semantic forms collectively called graphemes and grapheme clusters. If you split a "string" between these graphemes and stitch them back together ad hoc, you may end up w/ a non-sense segment that might not even display properly. In this sense, the fixed-width of the codepoints is illusory when you're attempting to logically manipulate the text. Unicode does more than define codepoints; it also defines a slew of semantic devices intended to abstract text manipulation, and these are at a much higher level than slicing and dicing an array of codepoints. (As noted elsethread, these can be provided as iterators and special string operators.)

It's been a few years since I've worked with Chinese or Japanese scripts, but there are similar issues. Though because supporting those scripts is a far more common exercise for American and European programmers, there are common tricks employed--lots of if's and then's littering legacy code--to do the right things in common cases to silence the Q/A department fielding calls from sales reps in Asia.

> Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.

"Majority" and "pretty much". That's the retrospective problem that still afflicts the technical community today. Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP. What happens when somebody wants to create a web application around these texts in their traditional script? With an imagination one could imagine all manner of new requirements just around the corner that will continue to erode the analysis that the BMP is "good enough". For example the phenomenon may reverse of simplifying scripts in various regions, begun in part because of perceived complexity viz-a-viz the limitations of modern computing hardware and/or inherited [racist] notions of cultural suitability. Mao's Simplified Chinese project may turn out to be ahistorical, like so many other modernist projects to fix thousands of years of cultural development around the world.

Of course, the whole idea that the BMP is "good enough" is nonsensical from the get go. in order to intelligently handle graphemes and grapheme clusters you have to throw out the notion of fixed-width anything, period.

> Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios.

I don't think I exaggerate. First of all, as far as I know, Unicode is sufficient. But I've never actually seen an open source application--other than ICU--that does anything more effectively with Unicode than use wchar_t or similar concessions. (Pango and other libraries can handle many text flow issues, but the real problems today lie in document processing.)

I think it's a fair observation that the parsing and display of most non-European scripts exacts more of a burden than for European scripts. For example (and this is more about display than parsing) I'm sure it rarely if ever crosses the mind of a Chinese or Japanese script reader that most of the text they read online will be displayed horizontally rather than vertically. But if go into a Chinese restaurant the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode? (Even if neither is wrong per se.) I think the answer is because programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs. It's an unjustifiable intransigence. And we haven't been able to move beyond it because the solutions so far tried simply attempt to reconcile historical programming practice w/ a handful of convenient Unicode concepts. Thus this obsession with codepoints, when what should really be driving syntax and library development aren't these low-level concerns but how to simplify the task of manipulating graphemes and higher-level script elements.

UTF-16

Posted Mar 25, 2010 11:04 UTC (Thu) by tetromino (guest, #33846) [Link] (2 responses)

> You could have multiple codepoints which while not technically combining characters, require that certain rules are followed; together with other semantic forms collectively called graphemes and grapheme clusters. If you split a "string" between these graphemes and stitch them back together ad hoc, you may end up w/ a non-sense segment that might not even display properly.
I am not sure if I understand you. Would you mind giving a specific example of what you are talking about? (You will need to select HTML format to use Unicode on lwn.net.)

> Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP.
25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?

> But if go into a Chinese restaurant the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode?
Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

UTF-16

Posted Mar 26, 2010 19:31 UTC (Fri) by spacehunt (guest, #1037) [Link] (1 responses)

I'm a native Cantonese speaker in Hong Kong, hopefully my observations would serve as useful reference...

> 25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?

A lot of Chinese characters in modern usage are outside of the BMP:
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00...

> Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.

UTF-16

Posted Mar 31, 2010 4:35 UTC (Wed) by j16sdiz (guest, #57302) [Link]

> > Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese
speakers because that's what you saw being used in a restaurant menu?

> It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan.
Just go to any bookstore or newspaper stand in these three places and see for yourself.

As a Chinese living in Hong Kong I can tell you this:
Most of the Chinese characters are in BMP. Some of those outside BMP are used in Hong Kong, but they are not
as important as you think -- most of them can be replaced with something in BMP (and that's how we have been
doing this before the HKSCS standard)

And yes, you can have Confucius in BMP. (Just like how you have KJV bible in latin1 -- replace those long-S
with th, and stuff like that)

UTF-16

Posted Mar 25, 2010 11:41 UTC (Thu) by ikm (guest, #493) [Link] (4 responses)

You: BMP [..is..] largely useless for anything but European languages and perhaps a handful of some Asian language

Me: Here's a list of scripts BMP supports. That's the majority of all scripts in Unicode.

You: "Majority" and "pretty much". Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP.

So, BMP without classic Chinese is largely useless? Nice. You know what, enough of this nonsense. Your position basically boils down to "if you can't support all the languages in the world, both extinct and in existence, 100.0% correct, and all the features of Unicode 5.0, too, your effort is largely useless". But it's not; the world isn't black and white.

UTF-16

Posted Mar 25, 2010 12:48 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (3 responses)

Unicode and i18n is a long history of "but this thing people write in real life is hard, can't I simplify it to make my latin-oriented code simpler?"

And a few years later the pressure has mounted enough you do need to process the real thing, not a simplified model, and you need to do the work you didn't want to do in the first place *and* handle all the weird cases your previous shortcuts generated.

The "good enough" i18n school has been a major waste of development so far. It has proven again and again to be shortsighted

UTF-16

Posted Mar 25, 2010 13:04 UTC (Thu) by ikm (guest, #493) [Link] (2 responses)

> you do need to process the real thing, not a simplified model

You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.

UTF-16

Posted Mar 25, 2010 16:48 UTC (Thu) by JoeF (guest, #4486) [Link] (1 responses)

You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.

And what exactly makes you the final authority on using UTF?
While you may have no need to represent ancient Chinese characters, others may.
Just because you don't need it doesn't mean that others won't have use for it.
Your argument smacks of "640K should be enough for anyone" (misattributed to BillG).

UTF-16

Posted Mar 25, 2010 17:06 UTC (Thu) by ikm (guest, #493) [Link]

Oh, no no, I was only referring to my own private decisions. The original post stated the necessity for each and every one, and I disagreed. Others decide for themselves, of course.

p.s. And btw, you *can* represent ancient Chinese with UTF... The original post was probably referring to some much more esoteric stuff.

UTF-16

Posted Mar 25, 2010 15:13 UTC (Thu) by marcH (subscriber, #57642) [Link] (16 responses)

> ... because of perceived complexity viz-a-viz the limitations of modern computing hardware and/or inherited [racist] notions of cultural suitability.

The invention of alphabets was a major breakthrough - because they are inherently simpler than logographies. It's not just about computers: compare how much time a child typically needs to learn one versus the other.

> I think the answer is because programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs.

Of course yes, what did you expect? This problem will be naturally solved when countries with complicated writing systems will stop waiting for the western world to solve problems only they have.

> It's an unjustifiable intransigence.

Yeah, software developers are racists since most of them do not bother about foreign languages...

UTF-16

Posted Mar 25, 2010 16:17 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (5 responses)

The Han characters aren't logograms*, but you're right that alphabetic writing systems are better and it isn't even hard to find native Chinese linguists who agree. Some believe that the future of Chinese as a spoken language (or several languages, depending on your politics) depends now on accepting that it will be written in an existing alphabet - probably the Latin alphabet - and the Han characters will in a few generations become a historical curiosity, like runes. For what it's worth your example of teaching children is very apt, I'm told that Chinese schools have begun using the Latin alphabet to teach (young) children in some places already.

* Nobody uses logograms. In a logographic system you have a 1:1 correspondence between graphemes and words. Invent a new word, and you need a new grapheme. Given how readily humans (everywhere) invent new words, this is quickly overwhelming. So, as with the ancient Egyptian system, the Chinese system is clearly influenced by logographic ideas, but it is not a logographic system, a native writer of Chinese can write down words of Chinese they have never seen, based on hearing them and inferring the correct "spelling", just as you might in English.

UTF-16

Posted Mar 25, 2010 19:24 UTC (Thu) by atai (subscriber, #10977) [Link] (4 responses)

As a Chinese, I can tell you that the Chinese characters are not going anywhere. The Chinese characters will stay and be used for Chinese writings, for the next 2000 years just as in the previous 2000 years.

The ideas that China is backwards because of the language and written characters should now go bankrupt.

UTF-16

Posted Mar 25, 2010 22:26 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

Well, after the singular invention of alphabetic writing systems by some
nameless Phoenicians, Mesopotamians and Egyptians 2500-odd years ago,
*everyone* else was backwards. It's an awesome piece of technology. (btw,
the Chinese characters have had numerous major revisions, simplifications
and complexifications over the last two millennia, the most recent being
the traditional/simplified split: any claim that the characters are
unchanged is laughable. They have certainly changed much more than the
Roman alphabet.)

UTF-16

Posted Mar 25, 2010 23:21 UTC (Thu) by atai (subscriber, #10977) [Link] (1 responses)

I don't know if alphabetic writing is forwards or backwards.

But if you say Chinese characters changed more than the Latin alphabet, then you are clearly wrong; the "traditional" Chinese characters certainly stayed mostly the same since 105 BC (What happened in Korea, Japan or Vietnam do not apply because these are not Chinese).

I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?

UTF-16

Posted Mar 26, 2010 11:16 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?

13th Century English (i.e. what linguists call "Middle English") should be readable-for-meaning by an educated speaker of Modern English with a few marginal glosses. Reading-for-sound is almost as easy (95% of it is covered by "Don't silence the silent-in-Modern-English consonants. Pronounce the vowels like Latin / Italian / Spanish instead of like Modern English").

My understanding is that the Greek of 2000 years ago is similarly readable to fluent Modern Greek users. (The phonological issues are a bit trickier in that case.)

In both cases - and, I'm sure, in the case of classical Chinese - it would take more than just knowing the words and grammar to receive the full meaning of the text. Metaphors and cultural assumptions are tricky things.

McLuhan

Posted Apr 15, 2010 9:27 UTC (Thu) by qu1j0t3 (guest, #25786) [Link]

Anyone who wants to explore the topic of comparative alphabets further may find McLuhan's works, such as The Gutenberg Galaxy, rewarding.

UTF-16

Posted Mar 25, 2010 16:21 UTC (Thu) by paulj (subscriber, #341) [Link] (8 responses)

As a data-point, I believe children in China are first taught pinyin (i.e.
roman alphabet encoding of mandarin), and learn hanzi logography buiding on
their knowledge of pinyin.

UTF-16

Posted Mar 25, 2010 19:33 UTC (Thu) by atai (subscriber, #10977) [Link] (3 responses)

But pinyin is not a writing system for Chinese. It helps with teaching pronunciation.

UTF-16

Posted Mar 26, 2010 2:49 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

I have a (mainland chinese) chinese dictionary here, intended for kids,
and it is indexed by pinyin. From what I have seen of (mainland) chinese,
pinyin appears to be their primary way of writing chinese (i.e. most writing
these days is done electronically, and pinyin is used as the input
encoding).

UTF-16

Posted Mar 26, 2010 15:37 UTC (Fri) by chuckles (guest, #41964) [Link] (1 responses)

I'm in China right now learning Mandarin so I can comment on this. Children learn pinyin at the same time as the characters. The Pinyin is printed over the characters and is used to help with pronunciation. While dictionaries targeted towards little children and foreigners are indexed by pinyin, normal dictionaries used by adults are not. Dictionaries used by adults are indexed by the radicals.
While pinyin is nice, there are no tone markers. So you have a 1 in 5 chance (4 tones plus neutral) of getting it right.
You are correct that pinyin is the input system on computers, cell phones, everything electronic, in mainland china. Taiwan has its own system. Also, Chinese are very proud people, Characters aren't going anywhere for a LONG time.

UTF-16

Posted Mar 26, 2010 21:24 UTC (Fri) by paulj (subscriber, #341) [Link]

Yes, I gather formal pinyin has accents to differentiate the tones, but on a
computer you just enter the roman chars and the computer gives you an
appropriate list of glyphs to pick (with arrow key or number).

And yes they are. Shame there's much misunderstanding (in both directions)
though. Anyway, OT.. ;)

UTF-16

Posted Mar 25, 2010 20:23 UTC (Thu) by khc (guest, #45209) [Link] (3 responses)

I was raised in Hong Kong and not in mainland China, but I do have relatives in China. I've never heard that kids learn pinyin before the characters.

UTF-16

Posted Mar 26, 2010 2:44 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

This is what someone who was raised in China has told me.

UTF-16

Posted Mar 27, 2010 22:39 UTC (Sat) by man_ls (guest, #15091) [Link] (1 responses)

China has 1,325,639,982 inhabitants, according to Google. That is more than the whole of Europe, Russia, US, Canada and Australia combined. Even if there is a central government, we can assume a certain cultural diversity.

UTF-16

Posted Mar 28, 2010 4:22 UTC (Sun) by paulj (subscriber, #341) [Link]

Good point. :)

This was a Han chinese person from north-eastern China, i.e. someone from
the dominant cultural group in China, from the more developed part of China.
I don't know how representative their education was, but I suspect there's
at least some standardisation and uniformity.

UTF-16

Posted Dec 27, 2010 2:01 UTC (Mon) by dvdeug (guest, #10998) [Link]

You're writing in a language with one of the most screwed up orthographies in existence. Convince English speakers to use a reasonable orthography, and then you can start complaining about the rest of the world.

Not only that, some of these scripts you're not supporting are wonders. Just because Arabic is always written in cursive and thus needs complex script support, doesn't mean that it's not an alphabet that's perfectly suited to its language, that is in fact easier to learn for children, then the English alphabet is for English speakers.

Supporting Chinese or Arabic is like any other feature. You can refuse to support it, but if your program is important, patches or forks are going to float around to fix. Since Debian and other distributions are committed to supporting those languages, the version of the program that will be in the distributions will be the forked version. If there is no fork, they may just not include it. That's the cost you'll have to pay for ignoring the features they want.

UTF-16

Posted Mar 25, 2010 2:27 UTC (Thu) by Nahor (subscriber, #51583) [Link]

> Unicode was never 16-bits

http://en.wikipedia.org/wiki/Unicode#History:
Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits [...]
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits.

> nor is it now 32-bits

Indeed it's less (http://en.wikipedia.org/wiki/Unicode#Architecture_and_ter...):
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF

> Nor was UCS-2 fixed-width per se

http://en.wikipedia.org/wiki/Universal_Character_Set:
UCS-2, uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes to represent that value.

> [...]

Unicode/UTF-16/UCS-2/... may not be perfect, it's still better than what we had before. At least now we have a universal way of display foreign alphabets.
Byte arrays to represent a string may not be ideal but it's not worse than before. Features like word splitting may not be easy but they never were. And not all applications need such features. A lot of them just want to able to display Asian characters on an English OS.

UTF-16

Posted Mar 24, 2010 17:49 UTC (Wed) by jrn (subscriber, #64214) [Link] (6 responses)

I can guess a few reasons:

- For many human languages, UTF-16 is more compact than UTF-8.
- UTF-16 is hard to confuse with ISO 8859 and other old encodings.
- Java 1.0 preceded Unicode 2.0. There was no UCS-4 back then.

HTH, Jonathan

UTF-16

Posted Mar 24, 2010 18:25 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (5 responses)

"For many human languages, UTF-16 is more compact than UTF-8."

Care to cite any real evidence for actual text that's routinely processed by computers? It ought to be easy to instrument a system to measure this, but when the claim is made nobody seems to have collected such evidence (or, cynically, they have and it does not support their thesis).

See, people tend to make this claim by looking at a single character, say U+4E56

and they say, this is just one UTF-16 code unit, 2 bytes, but it is 3 bytes in UTF-8, so therefore using UTF-8 costs 50% extra overhead.

But wait a minute, why is the computer processing this character? Is it, perhaps, as part of a larger document? Each ASCII character used in the document costs 100% more in UTF-16 than UTF-8. It is common for documents to include U+0020 (the space character, even some languages which did not use spacing traditionally tend to introduce it when they're computerised) and line separators at least.

And then there's non-human formatting. Sure, maybe the document is written in Chinese, but if it's an HTML document, or a TeX document, or a man page, or a Postscript file, or... then it will be full of English text or latin character abbreviations created by English-using programmers.

So, I don't think the position is so overwhelmingly in favour of UTF-8 that existing, working systems should urgently migrate, but I would definitely recommend against using UTF-16 in new systems.

UTF-16

Posted Mar 25, 2010 0:04 UTC (Thu) by tetromino (guest, #33846) [Link] (4 responses)

> Care to cite any real evidence for actual text that's routinely processed by computers?

I selected 20 random articles in the Japanese Wikipedia (using http://ja.wikipedia.org/wiki/特別:おまかせ表示) and compared the size of their source code (wikitext) in UTF-8 and UTF-16. For 6 of the articles, UTF-8 was more compact; for the remaining 14, UTF-16 was better. The UTF-16 wikitext size ranged from +32% to -16% in size relative to UTF-8, depending on how much of the article's source consisted of wiki syntax, numbers, English words, filenames in the Latin alphabet, etc.

On average, UTF-16 was 2.3% more compact than UTF-8. And concatenating all the articles together, the UTF-16 version would be 3.2% more compact.

So as long as you know that your system's users will be mostly Japanese, it seems that migrating from UTF-8 to UTF-16 for string storage would be a small win.

UTF-16

Posted Mar 25, 2010 14:17 UTC (Thu) by Simetrical (guest, #53439) [Link] (2 responses)

Unless there's surrounding markup of any kind. If you looked at the HTML
instead of the wikitext, UTF-8 would win overwhelmingly.

In general: UTF-8 is at most 50% larger than UTF-16, while UTF-16 is at most
100% larger than UTF-8; and the latter case is *much* more likely than the
former in practice. There's no reason to use UTF-16 in any general-purpose
app -- UTF-8 is clearly superior overall, even if you can come up with
special cases where UTF-16 is somewhat better.

UTF-16

Posted Mar 25, 2010 15:46 UTC (Thu) by liljencrantz (guest, #28458) [Link] (1 responses)

Are you saying wikitext is not markup?

UTF-16

Posted Mar 25, 2010 18:05 UTC (Thu) by Simetrical (guest, #53439) [Link]

Yes, sorry, of course wikitext is markup. But it's (by design) very lightweight markup that only accounts for a small fraction of the text in most cases. If you're using something like HTML, let alone a programming language, UTF-8 is a clear win for East Asian text. tetromino's data suggests that even in a case with a huge advantage for UTF-16 (CJK with only light markup), it's still only a few percent smaller. That's just not worth it, when UTF-8 is *much* smaller in so many real-world cases.

UTF-16

Posted Mar 25, 2010 16:03 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Thankyou for actually stepping up and measuring!

Resetting PHP 6

Posted Mar 24, 2010 18:09 UTC (Wed) by wingo (guest, #26929) [Link]

There are many possible string representations.

https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresen...

Guile internally uses latin-1 when possible, and utf-32 otherwise.

Resetting PHP 6

Posted Mar 26, 2010 4:11 UTC (Fri) by spitzak (guest, #4593) [Link] (2 responses)

No, UTF-8 is preferable.

The truly unavoidable technical reason is that only UTF-8 can safely encode UTF-8 errors. Lossless transmission of data is a requirement for safe and bug-free computing.

Other reasons:

1. Much faster due to no need to translate on input/output

2. Able to use existing apis to name files and parse text, rather than having to make an all-new api that takes "wide characters".

3. Often enormously simpler as error detection can be deferred until the string is interpreted.

4. If errors are preserved until display, they can be replaced with more user-friendly replacements (such as the ISO-8859-1 for each byte). This is not safe if errors must be replaced as part of data processing.

5. High-speed byte-based search algorithms work. Tables used by these would go up in size by a factor of 256^3 or more if they were rewritten to use 16-bit units.

5. For almost all real text files UTF-8 is shorter than UTF-16. This is not a big deal but some people think it is important.

Resetting PHP 6

Posted Mar 26, 2010 12:29 UTC (Fri) by ringerc (subscriber, #3071) [Link] (1 responses)

1. Much faster due to no need to translate on input/output

... if the surrounding systems to which I/O is done (the file system, other library APIs, network hosts, etc) are in fact using a utf-8 encoding themselves. Alas, even on many modern systems non-utf-8 encodings are very common.

2. Able to use existing apis to name files and parse text, rather than having to make an all-new api that takes "wide characters".

Not safely. The use of existing APIs with new encodings is a HUGE source of bugs in software. I've wasted vast amounts of time tracking down and fixing cases where software fails to do external->internal encoding conversion on input, fails to do internal->external encoding conversion on output, converts already-converted data (mangling it horribly by re-interpreting it as being in the wrong encoding), etc. Using utf-8 with existing encoding-agnostic APIs is a blight on software engineering. Any API should take either a properly typed argument that's specified to ONLY hold text of a known encoding - possibly single fixed encoding like utf-8, or possibly a bytes+encoding tuple structure. If it takes a raw "byte string" it should take a second argument specifying what encoding that data is in.

The fact that POSIX file systems and APIs don't care about "text" with known encoding, only "strings of bytes", is an incredible PITA. Ever had the fun of backing up a network share used by multiple hosts each of which like to use different text encodings? Ever then had to find and restore a single file within that share without knowing what encoding it was in and thus what the byte sequence of the file name was, only the "text" of the file name? ARGH.

"wide" APIs are painful, but they're more than worth it in the bugs and data corruption they prevent.

That's not to say that UTF-16 is better than UTF-8 or vice versa. Rather, "single known encoding enforced" is better than "it's just some bytes".

Resetting PHP 6

Posted Mar 26, 2010 14:41 UTC (Fri) by marcH (subscriber, #57642) [Link]

Yes: UTF-8 is a brilliant backward-compatibility hack that allows software developers to offload their homework to someone else later down the road. It's a truly admirable hack.

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt