UTF-16
Posted Mar 24, 2010 18:25 UTC (Wed) by tialaramex (subscriber, #21167)
In reply to: UTF-16 by jrn
Parent article: Resetting PHP 6
Care to cite any real evidence for actual text that's routinely processed by computers? It ought to be easy to instrument a system to measure this, but when the claim is made nobody seems to have collected such evidence (or, cynically, they have and it does not support their thesis).
See, people tend to make this claim by looking at a single character, say U+4E56, and they say: this is just one UTF-16 code unit, 2 bytes, but it is 3 bytes in UTF-8, so using UTF-8 costs 50% extra overhead.
But wait a minute: why is the computer processing this character? Is it, perhaps, part of a larger document? Each ASCII character used in the document costs 100% more in UTF-16 than in UTF-8, and it is common for documents to include at least U+0020 (the space character; even languages which did not traditionally use spacing tend to introduce it once they are computerised) and line separators.
And then there's non-human formatting. Sure, maybe the document is written in Chinese, but if it's an HTML document, or a TeX document, or a man page, or a PostScript file, or... then it will be full of English text or Latin-character abbreviations created by English-speaking programmers.
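To check the arithmetic, here is a minimal sketch in Python; the sample string and the example.com URL are made up, but the byte counts are what any conformant encoder produces:

    cjk_only = "\u4e56" * 10                  # pure CJK: 3 bytes/char in UTF-8, 2 in UTF-16
    marked_up = '<a href="http://example.com/">' + cjk_only + '</a>\n'

    for label, text in (("CJK only", cjk_only), ("with markup", marked_up)):
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))   # -le: no BOM, 2 bytes per BMP code unit
        print(label, "- UTF-8:", u8, "bytes, UTF-16:", u16, "bytes")

The pure CJK string is 30 bytes in UTF-8 against 20 in UTF-16, but once the ASCII markup is added the totals are 65 against 90, so UTF-8 comes out ahead as soon as a document contains more ASCII characters than CJK ones.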
So, I don't think the position is so overwhelmingly in favour of UTF-8 that existing, working systems should urgently migrate, but I would definitely recommend against using UTF-16 in new systems.
Posted Mar 25, 2010 0:04 UTC (Thu) by tetromino (guest, #33846)
I selected 20 random articles from the Japanese Wikipedia (using http://ja.wikipedia.org/wiki/特別:おまかせ表示) and compared the size of their source code (wikitext) in UTF-8 and UTF-16. For 6 of the articles, UTF-8 was more compact; for the remaining 14, UTF-16 was better. The UTF-16 wikitext ranged from +32% to -16% in size relative to UTF-8, depending on how much of the article's source consisted of wiki syntax, numbers, English words, filenames in the Latin alphabet, etc. On average, UTF-16 was 2.3% more compact than UTF-8, and concatenating all the articles together, the UTF-16 version would be 3.2% more compact. So as long as you know that your system's users will be mostly Japanese, it seems that migrating from UTF-8 to UTF-16 for string storage would be a small win.
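For anyone who wants to repeat the experiment, a rough sketch in Python; it assumes you have already saved each article's wikitext as a UTF-8 file, with hypothetical names like article01.txt:

    import glob

    total8 = total16 = 0
    for path in sorted(glob.glob("article*.txt")):
        text = open(path, encoding="utf-8").read()
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))    # omit the BOM for a fair comparison
        total8 += u8
        total16 += u16
        print(path, "UTF-16 vs UTF-8:", round(100.0 * (u16 - u8) / u8, 1), "%")

    print("concatenated:", round(100.0 * (total16 - total8) / total8, 1), "%")

A negative percentage means UTF-16 is more compact for that article.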
Posted Mar 25, 2010 14:17 UTC (Thu) by Simetrical (guest, #53439)
If you compared the rendered HTML pages instead of the wikitext, UTF-8 would win overwhelmingly.

In general: UTF-8 is at most 50% larger than UTF-16, while UTF-16 is at most 100% larger than UTF-8; and the latter case is *much* more likely than the former in practice. There's no reason to use UTF-16 in any general-purpose app -- UTF-8 is clearly superior overall, even if you can come up with special cases where UTF-16 is somewhat better.
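The per-character arithmetic behind those bounds can be checked with a few arbitrary sample characters in Python:

    for ch in ("A", "\u00e9", "\u4e56", "\U0001F600"):   # ASCII, Latin-1, BMP CJK, astral plane
        u8 = len(ch.encode("utf-8"))
        u16 = len(ch.encode("utf-16-le"))
        print("U+%04X" % ord(ch), "UTF-8:", u8, "UTF-16:", u16)

ASCII is 1 byte against 2 (UTF-16 is 100% larger), U+0080..U+07FF is 2 against 2, U+0800..U+FFFF is 3 against 2 (UTF-8 is 50% larger), and everything above U+FFFF is 4 against 4, which is where the 100% and 50% worst cases come from.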
Posted Mar 25, 2010 15:46 UTC (Thu) by liljencrantz (guest, #28458)
Posted Mar 25, 2010 18:05 UTC (Thu) by Simetrical (guest, #53439)
Posted Mar 25, 2010 16:03 UTC (Thu) by tialaramex (subscriber, #21167)
> Care to cite any real evidence for actual text that's routinely processed by computers?