UTF-16

Posted Mar 24, 2010 18:25 UTC (Wed) by tialaramex (subscriber, #21167)
In reply to: UTF-16 by jrn
Parent article: Resetting PHP 6

"For many human languages, UTF-16 is more compact than UTF-8."

Care to cite any real evidence for actual text that's routinely processed by computers? It ought to be easy to instrument a system to measure this, but when the claim is made nobody seems to have collected such evidence (or, cynically, they have and it does not support their thesis).

See, people tend to make this claim by looking at a single character, say U+4E56

and they say, this is just one UTF-16 code unit, 2 bytes, but it is 3 bytes in UTF-8, so therefore using UTF-8 costs 50% extra overhead.

But wait a minute, why is the computer processing this character? Is it, perhaps, as part of a larger document? Each ASCII character used in the document costs 100% more in UTF-16 than UTF-8. It is common for documents to include U+0020 (the space character, even some languages which did not use spacing traditionally tend to introduce it when they're computerised) and line separators at least.

And then there's non-human formatting. Sure, maybe the document is written in Chinese, but if it's an HTML document, or a TeX document, or a man page, or a Postscript file, or... then it will be full of English text or latin character abbreviations created by English-using programmers.

So, I don't think the position is so overwhelmingly in favour of UTF-8 that existing, working systems should urgently migrate, but I would definitely recommend against using UTF-16 in new systems.

UTF-16

Posted Mar 25, 2010 0:04 UTC (Thu) by tetromino (guest, #33846) [Link] (4 responses)

> Care to cite any real evidence for actual text that's routinely processed by computers?

I selected 20 random articles in the Japanese Wikipedia (using http://ja.wikipedia.org/wiki/特別:おまかせ表示) and compared the size of their source code (wikitext) in UTF-8 and UTF-16. For 6 of the articles, UTF-8 was more compact; for the remaining 14, UTF-16 was better. The UTF-16 wikitext size ranged from +32% to -16% in size relative to UTF-8, depending on how much of the article's source consisted of wiki syntax, numbers, English words, filenames in the Latin alphabet, etc.

On average, UTF-16 was 2.3% more compact than UTF-8. And concatenating all the articles together, the UTF-16 version would be 3.2% more compact.

So as long as you know that your system's users will be mostly Japanese, it seems that migrating from UTF-8 to UTF-16 for string storage would be a small win.

UTF-16

Posted Mar 25, 2010 14:17 UTC (Thu) by Simetrical (guest, #53439) [Link] (2 responses)

Unless there's surrounding markup of any kind. If you looked at the HTML
instead of the wikitext, UTF-8 would win overwhelmingly.

In general: UTF-8 is at most 50% larger than UTF-16, while UTF-16 is at most
100% larger than UTF-8; and the latter case is *much* more likely than the
former in practice. There's no reason to use UTF-16 in any general-purpose
app -- UTF-8 is clearly superior overall, even if you can come up with
special cases where UTF-16 is somewhat better.

UTF-16

Posted Mar 25, 2010 15:46 UTC (Thu) by liljencrantz (guest, #28458) [Link] (1 responses)

Are you saying wikitext is not markup?

UTF-16

Posted Mar 25, 2010 18:05 UTC (Thu) by Simetrical (guest, #53439) [Link]

Yes, sorry, of course wikitext is markup. But it's (by design) very lightweight markup that only accounts for a small fraction of the text in most cases. If you're using something like HTML, let alone a programming language, UTF-8 is a clear win for East Asian text. tetromino's data suggests that even in a case with a huge advantage for UTF-16 (CJK with only light markup), it's still only a few percent smaller. That's just not worth it, when UTF-8 is *much* smaller in so many real-world cases.

UTF-16

Posted Mar 25, 2010 16:03 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Thankyou for actually stepping up and measuring!