
Moving to Python 3

Posted Feb 10, 2011 7:26 UTC (Thu) by peregrin (guest, #56601)
In reply to: Moving to Python 3 by ras
Parent article: Moving to Python 3

"UCS2 is almost never found in the real world."

Windows 2000/XP/2003/Vista/2008/7 is almost never found in the real world?

Moving to Python 3

Posted Feb 10, 2011 11:52 UTC (Thu) by tialaramex (subscriber, #21167)

The Win32 Unicode APIs were retconned to be UTF-16, with older versions (I think NT 4 and perhaps Windows 2000) simply "not supporting" planes other than the BMP (i.e. characters U+10000 and beyond).

So, no, Windows isn't an example of UCS2, and hasn't been for many years.
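To make the BMP distinction concrete, here is a rough Python 3 sketch (not taken from any Windows API; the helper name utf16_code_units is made up for illustration). It shows that a BMP character fits in one 16-bit code unit, while a character at U+10000 or beyond needs a surrogate pair in UTF-16, which is exactly what plain UCS-2 cannot express.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no BOM is emitted
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u20ac"         # EURO SIGN, inside the BMP
astral_char = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0x20ac']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd834', '0xdd1e']
```

Code that treats every 16-bit unit as a whole character sees two "characters" in the second case, neither of which means anything on its own.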

Moving to Python 3

Posted Feb 10, 2011 17:10 UTC (Thu) by marcH (subscriber, #57642)

I think a lot of programs support UCS-2 only. I mean they would fail in various ways as soon as a supplementary character comes along. How many Java programs do you expect to use java.lang.String.codePointCount()?

In this sense, UCS-2 is extremely often found in the real world.
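A small Python 3 sketch of the kind of failure meant here, under the assumption that the program counts and cuts text in 16-bit units (as Java's String.length() does) rather than in code points. The function names are invented for the example.

```python
def utf16_length(text):
    """Number of UTF-16 code units, comparable to Java's String.length()."""
    return len(text.encode("utf-16-le")) // 2


def code_point_count(text):
    """Number of code points; a Python 3 str is a sequence of code points."""
    return len(text)


def naive_truncate(text, max_units):
    """Cut after max_units UTF-16 code units, ignoring surrogate pairs."""
    data = text.encode("utf-16-le")[: max_units * 2]
    # errors="replace" turns a dangling half of a surrogate pair into U+FFFD
    return data.decode("utf-16-le", errors="replace")


sample = "a\U0001f600b"  # 'a', one supplementary character, 'b'

print(utf16_length(sample))             # 4
print(code_point_count(sample))         # 3
print(ascii(naive_truncate(sample, 2))) # 'a\ufffd'
```

The two counts disagree as soon as one supplementary character appears, and truncating on a code unit boundary leaves half a surrogate pair behind.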

UTF family

Posted Feb 11, 2011 4:01 UTC (Fri) by tialaramex (subscriber, #21167)

I expect a lot of Java programs (and other programs) work fine with supplementary characters and myriad other things so long as they leave anything clever to software written by someone else (or more likely a team of somebody elses) who actually knows lots about text.

What were you imagining they should be using java.lang.String.codePointCount() for? Text is hard, like I said, and a count of Unicode code points is rarely what you need.

Examples of things which are assigned one or more Unicode code points: A harmless, invisible and ignorable marker; indication that subsequent neutral text is intended to be displayed right-to-left; the cedilla accent on a character; a lowercase x; a vertical tab; indication that a non-fatal error occurred in some previous processing.
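For the curious, a Python 3 sketch using the standard unicodedata module. The specific code points chosen below are my guesses at what the list above refers to (for instance U+2060 for the invisible marker and U+FFFD for the error indication); the comment does not name them. The second part shows why a code point count is a poor proxy for "number of characters".

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, name) for one code point."""
    name = unicodedata.name(ch, "<no name>")  # control characters have no name
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), name)


# Assumed candidates for the items listed above, in the same order
examples = ["\u2060", "\u200f", "\u0327", "x", "\x0b", "\ufffd"]
for ch in examples:
    print(describe(ch))

# The same visible letter, spelled with one code point or with two
decomposed = "c\u0327"  # 'c' followed by COMBINING CEDILLA
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

Both spellings render as the same single letter, yet their code point counts differ, and several of the listed code points have no visible width at all.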

Moving to Python 3

Posted Feb 10, 2011 11:57 UTC (Thu) by tialaramex (subscriber, #21167)

Oh, and in terms of interoperability, both UCS2 and UTF-16 are a big problem. Nobody wants to add BOMs everywhere, but if you don't you have no idea what you're looking at. So you end up with even products built by Microsoft entirely with Microsoft technologies (and thus heavily invested in 16-bit code units) communicating in UTF-8 anyway.
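A quick Python 3 illustration of the byte-order problem, using only the standard codecs. The payload is arbitrary; the point is that the same bytes decode "successfully" under either byte order unless a BOM says which one was meant, whereas UTF-8 has no byte order to get wrong.

```python
import codecs

payload = "hi".encode("utf-16-le")  # b'h\x00i\x00', no BOM

as_le = payload.decode("utf-16-le")  # what the sender meant
as_be = payload.decode("utf-16-be")  # also decodes without error, but to CJK ideographs

print(ascii(as_le))  # 'hi'
print(ascii(as_be))  # '\u6800\u6900'

# With a BOM in front, the generic "utf-16" codec can pick the byte order itself
with_bom = codecs.BOM_UTF16_LE + payload
print(ascii(with_bom.decode("utf-16")))  # 'hi'

# UTF-8 is a byte stream, so there is nothing to mark
print("hi".encode("utf-8"))  # b'hi'
```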

As the original poster said (even if their terminology is wrong in a bunch of places), UCS-2 looked like it might be clever in the mid-1990s. Once it became clear that Unicode's hyperspace would be populated, and UCS-2 wasn't capable of handling that, the choice was no longer between UCS-2 and UTF-8 (where UCS-2 delivers some intuitive-seeming properties, although not as many as sometimes claimed) but between UTF-8 and UTF-16, where UTF-16 is completely horrible.

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds