
wchar_t

Posted Apr 20, 2009 8:37 UTC (Mon) by tialaramex (subscriber, #21167)
In reply to: Why the vehement objections to decimal floating-point? by ringerc
Parent article: What's coming in glibc 2.10

You don't need a Unicode string type in C, and it's probably a mistake to ask for one to be built into C++ (but I can't tell, maybe C++ standardisation is about trying to /collect/ mistakes at this point).

wchar_t is a legacy of the mistaken belief that Unicode was (as some documents from a decade or more ago declared) the encoding of all the world's symbols into a 16-bit value. Once UCS-2 was obsolete, wchar_t was obsolete too; don't use it. Use UTF-8 on the wire, on disk and even in memory, except when you're doing heavyweight character processing; then use UTF-32, i.e. uint32_t or, at a pinch (since the top bits are unused anyway), int.
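
To make the "UTF-8 everywhere, UTF-32 for heavy lifting" suggestion concrete, here is a minimal sketch of a decoder that walks a NUL-terminated UTF-8 buffer and yields uint32_t code points. The decode_utf8 name is mine, and full validation (overlong forms, surrogates, out-of-range values) is deliberately omitted:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: decode one UTF-8 sequence starting at s (assumed
     * NUL-terminated), store the code point in *out, and return the
     * number of bytes consumed (0 on malformed input).  Overlong
     * forms and surrogate code points are not rejected here. */
    static size_t decode_utf8(const unsigned char *s, uint32_t *out)
    {
        if (s[0] < 0x80) {                      /* 1 byte: U+0000..U+007F   */
            *out = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {     /* 2 bytes: U+0080..U+07FF  */
            if ((s[1] & 0xC0) != 0x80) return 0;
            *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {     /* 3 bytes: U+0800..U+FFFF  */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
            *out = ((uint32_t)(s[0] & 0x0F) << 12) |
                   ((uint32_t)(s[1] & 0x3F) << 6)  | (s[2] & 0x3F);
            return 3;
        } else if ((s[0] & 0xF8) == 0xF0) {     /* 4 bytes: U+10000..U+10FFFF */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80 ||
                (s[3] & 0xC0) != 0x80) return 0;
            *out = ((uint32_t)(s[0] & 0x07) << 18) |
                   ((uint32_t)(s[1] & 0x3F) << 12) |
                   ((uint32_t)(s[2] & 0x3F) << 6)  | (s[3] & 0x3F);
            return 4;
        }
        return 0;                               /* invalid lead byte */
    }

In real code you'd more likely hand this job to iconv() (or, on toolchains that have them, the C11 <uchar.h> conversion functions) rather than hand-rolling the decoder.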

The only real non-legacy argument for UTF-16 was that it takes fewer bytes than UTF-8 for texts in some writing systems, notably Chinese. But the evidence of the last couple of decades is that the alphabetic and syllabic writing systems will eat the others alive. The majority of the world's population may yet be speaking Chinese in our lifetimes, but if so they'll write it mostly in Roman script, destroying UTF-16's size advantage.
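
For what it's worth, the size difference is easy to demonstrate; a tiny example (assuming a compiler with C11 Unicode string literals) comparing the encoded size of two han characters:

    #include <stdio.h>
    #include <uchar.h>

    int main(void)
    {
        /* "zhongwen" (Chinese writing): each han character is 3 bytes in UTF-8. */
        const char     utf8[]  = u8"\u4E2D\u6587";
        /* The same two characters in UTF-16: 2 bytes each (both are in the BMP). */
        const char16_t utf16[] = u"\u4E2D\u6587";

        printf("UTF-8:  %zu bytes\n", sizeof utf8 - 1);                  /* 6 */
        printf("UTF-16: %zu bytes\n", sizeof utf16 - sizeof(char16_t));  /* 4 */
        return 0;
    }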



wchar_t

Posted Apr 20, 2009 13:33 UTC (Mon) by mrshiny (guest, #4266) [Link] (3 responses)

Chinese has only about 410 unique syllables (about 2050 if you include tones). There are thousands of words and many sentences which, even if tones are properly conveyed, are ambiguous. I would be surprised if the current romanizations replaced Chinese characters. I would be less surprised to see a syllabary arise instead, much like the Japanese kana.

wchar_t

Posted Apr 20, 2009 18:55 UTC (Mon) by proski (subscriber, #104) [Link] (2 responses)

It's already happening in Taiwan: http://en.wikipedia.org/wiki/Bopomofo

wchar_t

Posted Apr 21, 2009 1:19 UTC (Tue) by xoddam (guest, #2322) [Link] (1 response)

On the other hand, pinyin is pretty much universal on the mainland, often as a way of specifying regional or personal pronunciation.

Vietnam has successfully switched (almost) entirely from han tu to the Latin alphabet, albeit with a forest of diacritics. Chinese might one day do the same, but it's unlikely since there is such a variety of Chinese speech and usage. Unlike Vietnamese, Chinese national identity has never been a matter of shared pronunciation.

Both pinyin and bopomofo have some chance of evolving to make it possible to write both pronunciation and semantics reproducibly in the same representation, but neither is likely to become a universal replacement for hanzi, since they lose the advantage (not meaningfully damaged by the Simplified/Traditional split) that the several very different Chinese languages become mutually intelligible when written down.

Universal alphabetisation of Chinese won't be possible until the regional differences become better acknowledged, so that people learn literacy both in their first dialect and in the "standard" language(s).

As for the relatively low count of "unique" symbols -- the whole idea of unifying hanzi and the Japanese and Korean versions using the common semantics to reduce the required code space and "assist" translations and text searches has met great resistance, especially in Japan, and despite it there are now nearly 100,000 distinct characters defined in Unicode. 16 bits was always a pipe dream.

It is ultimately necessary (i.e. required by users) to represent distinct glyphs uniquely; Unicode still doesn't satisfy many users precisely because it tries not to have too many distinct code points, and probably it never will.

I expect one day the idea of choosing a font based on national context will be abandoned, and the code point count will finally explode, defining one Unicode character per glyph.

wchar_t

Posted Apr 30, 2009 17:27 UTC (Thu) by pixelpapst (guest, #55301) [Link]

> I expect one day the idea of choosing a font based on national context will be abandoned, and the code point count will finally explode, defining one Unicode character per glyph.

I agree. And I think when this happens, we just *might* see a revival of UTF-16 in Asia, in a modal form: you wouldn't repeat the high-order surrogate when it is the same as that of the previous non-BMP character.

This would pack such texts a bit tighter than UTF-8 or UCS-4 (each low-order surrogate carries 10 bits of payload), while being a bit easier to parse than the escape-sequence style of modal encodings.
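
Just to illustrate the idea (this is a sketch of the hypothetical scheme described above, not any real encoding): since a lone low surrogate is invalid in ordinary UTF-16, a modal decoder could take it to mean "reuse the previous high surrogate". An encoder might look like this:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the hypothetical "modal UTF-16": encode an array of
     * code points into 16-bit units, omitting the high surrogate when
     * it matches that of the previous supplementary character.  A bare
     * low surrogate then means "same high surrogate as last time". */
    static size_t encode_modal_utf16(const uint32_t *cp, size_t n, uint16_t *out)
    {
        uint16_t prev_high = 0;     /* 0 = no high surrogate seen yet */
        size_t len = 0;

        for (size_t i = 0; i < n; i++) {
            if (cp[i] < 0x10000) {
                out[len++] = (uint16_t)cp[i];   /* BMP: one unit, mode unchanged */
            } else {
                uint32_t v = cp[i] - 0x10000;
                uint16_t high = 0xD800 | (uint16_t)(v >> 10);
                uint16_t low  = 0xDC00 | (uint16_t)(v & 0x3FF);
                if (high != prev_high)
                    out[len++] = high;          /* switch to a new "page" */
                out[len++] = low;               /* 10 payload bits per unit */
                prev_high = high;
            }
        }
        return len;                             /* number of 16-bit units written */
    }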

IMHO, let's see.

wchar_t

Posted Apr 21, 2009 0:25 UTC (Tue) by xoddam (guest, #2322) [Link]

wchar_t is 32-bit by default in g++ and in the stddef.h usually used by gcc.

There is a g++ compiler option -fshort-wchar to change the intrinsic type in C++, and for C you can use alternative headers or pre-define "-D__WCHAR_TYPE__=uint16_t", but this is pretty unusual on Linux except when cross-compiling for another platform (or building WINE).
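
An easy way to check what a particular toolchain is doing (the __SIZEOF_WCHAR_T__ predefine is available in reasonably recent GCC):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* On Linux/glibc with default gcc/g++ options this prints 4;
         * with -fshort-wchar it prints 2 (as on Windows ABIs). */
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    #ifdef __SIZEOF_WCHAR_T__
        printf("__SIZEOF_WCHAR_T__ = %d\n", __SIZEOF_WCHAR_T__);
    #endif
        return 0;
    }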

