
wchar_t

Posted Apr 21, 2009 1:19 UTC (Tue) by xoddam (guest, #2322)
In reply to: wchar_t by proski
Parent article: What's coming in glibc 2.10

On the other hand, pinyin is pretty much universal on the mainland, often as a way of specifying regional or personal pronunciation.

Vietnam has successfully switched (almost) entirely from Hán tự (Chinese characters) to the Latin alphabet, albeit with a forest of diacritics. Chinese might one day do the same, but it's unlikely, since there is such a variety of Chinese speech and usage. Unlike Vietnamese, Chinese national identity has never been a matter of shared pronunciation.

Both pinyin and bopomofo have some chance of evolving to the point where pronunciation and semantics can both be written reproducibly in the same representation, but neither is likely to become a universal replacement for hanzi: they would lose the advantage (not meaningfully damaged by the Simplified/Traditional split) that the several very different Chinese languages become mutually intelligible when written down.

Universal alphabetisation of Chinese won't be possible until the regional differences are better acknowledged, so that people learn literacy both in their first dialect and in the "standard" language(s).

As for the relatively low count of "unique" symbols -- the whole idea of Han unification, merging hanzi with the Japanese and Korean versions on the basis of their common semantics to reduce the required code space and "assist" translations and text searches, has met with great resistance, especially in Japan, and even so there are now nearly 100,000 distinct characters defined in Unicode. 16 bits (65,536 code points) was always a pipe dream.

It is ultimately necessary (i.e. required by users) to represent distinct glyphs uniquely. Unicode still doesn't satisfy many users precisely because it tries not to have too many distinct code points, and probably it never will.

I expect one day the idea of choosing a font based on national context will be abandoned, and the code point count will finally explode, defining one Unicode character per glyph.



wchar_t

Posted Apr 30, 2009 17:27 UTC (Thu) by pixelpapst (guest, #55301)

> I expect one day the idea of choosing a font based on national context will be abandoned, and the code point count will finally explode, defining one Unicode character per glyph.

I agree. And I think that when this happens, we just *might* see a revival of UTF-16 in Asia -- in a modal form, where you wouldn't repeat the high-order surrogate when it is the same as that of the previous non-BMP character.

This would pack such texts a bit tighter than UTF-8 or UCS-4 (each low-order surrogate can encode 10 bits), while being a bit easier to parse than the escape-sequence modal encodings.
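
To make the idea concrete, here is a minimal decoder sketch in C. It is purely hypothetical -- it implements the modal rule described above (a high surrogate stays in effect for the lone low surrogates that follow), not any standard encoding, and the function name is made up:

    /* Hypothetical decoder for the modal UTF-16 variant described above.
     * A high surrogate (0xD800-0xDBFF) sets a "plane prefix" that stays
     * in effect; each following lone low surrogate (0xDC00-0xDFFF)
     * reuses it, contributing 10 fresh bits per code unit. */
    #include <stdint.h>
    #include <stdio.h>

    static void decode_modal_utf16(const uint16_t *in, size_t n)
    {
        uint32_t hi = 0;  /* last high surrogate seen; 0 means none yet */

        for (size_t i = 0; i < n; i++) {
            uint16_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF) {
                hi = u;  /* new prefix for the low surrogates that follow */
            } else if (u >= 0xDC00 && u <= 0xDFFF && hi != 0) {
                uint32_t cp = 0x10000
                            + (((uint32_t)hi - 0xD800) << 10)
                            + (u - 0xDC00);
                printf("U+%04X\n", (unsigned)cp);
                /* unlike standard UTF-16, 'hi' is NOT cleared here */
            } else {
                printf("U+%04X\n", (unsigned)u);  /* ordinary BMP unit */
            }
        }
    }

    int main(void)
    {
        /* One high surrogate followed by three low surrogates sharing it:
         * decodes to U+20000, U+20001, U+20002 (CJK Extension B). */
        uint16_t text[] = { 0xD840, 0xDC00, 0xDC01, 0xDC02 };
        decode_modal_utf16(text, sizeof text / sizeof text[0]);
        return 0;
    }

In this example, three consecutive non-BMP characters cost four code units (8 bytes), versus six units (12 bytes) in standard UTF-16 and 12 bytes in UTF-8 -- that is where the packing advantage would come from.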

Anyway, let's see.

