Posted Feb 10, 2011 6:39 UTC (Thu) by ras (subscriber, #33059)
Parent article: Moving to Python 3
I'll be avoiding Python 3 as long as I can. The adoption of UCS2 for strings is a major stuff-up. There was a good way to improve string handling in Python 3. Drop UCS2 strings entirely. Their introduction was a mistake.
I think the mistake arose from a common misconception. It seems popular to equate UCS2 with unicode support. This is an abuse of terminology. The old strings represented unicode perfectly well as UTF-8. The new-style strings use UCS2 instead. That may have been a good idea when Java introduced it, because back then unicode only occupied one code plane in UCS2. Now UCS2, just like UTF-8, must use multibyte sequences for some unicode code points. So the one good point is gone.
The major downside remains, however: UCS2 is almost never found in the real world. So you spend half your time converting between whatever the outside world is using and UCS2, and then back again. The lines of code increase, memory requirements almost double, the execution time increases and in my experience the bugs skyrocket.
Posted Feb 10, 2011 7:26 UTC (Thu) by peregrin (subscriber, #56601)
"UCS2 is almost never found in the real world."
Windows 2000/XP/2003/Vista/2008/7 is almost never found in the real world?
Moving to Python 3
Posted Feb 10, 2011 11:52 UTC (Thu) by tialaramex (subscriber, #21167)
The Win32 Unicode APIs were retconned to be UTF-16, with older versions (I think NT 4 and perhaps Windows 2000) simply "not supporting" planes other than the BMP (i.e. characters U+10000 and beyond).
So, no, Windows isn't an example of UCS2, and hasn't been for many years.
Moving to Python 3
Posted Feb 10, 2011 17:10 UTC (Thu) by marcH (subscriber, #57642)
I think a lot of programs support UCS-2 only. I mean they would fail in various ways as soon as a supplementary character comes along. How many Java programs do you expect to use java.lang.String.codePointCount()?
In this sense, UCS-2 is extremely often found in the real world.
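To make the code unit versus code point distinction concrete, here is a minimal Python sketch. It assumes Python 3.3 or later, where len() on a str counts code points; the sample string is made up for illustration.

```python
# U+1D11E MUSICAL SYMBOL G CLEF is a supplementary character (outside the BMP).
s = "a\U0001D11E"

# Assumes Python 3.3+, where len() on a str counts code points.
code_points = len(s)

# Number of 16-bit code units needed in UTF-16: the clef needs a surrogate pair.
utf16_units = len(s.encode("utf-16-le")) // 2

# Number of bytes needed in UTF-8: 1 for "a", 4 for the clef.
utf8_bytes = len(s.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

Code that treats every 16-bit unit as one character would see three "characters" here rather than two.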
UTF family
Posted Feb 11, 2011 4:01 UTC (Fri) by tialaramex (subscriber, #21167)
I expect a lot of Java programs (and other programs) work fine with supplementary characters and myriad other things so long as they leave anything clever to software written by someone else (or more likely a team of somebody elses) who actually knows lots about text.
What were you imagining they should be using java.lang.String.codePointCount() for? Text is hard, like I said, and a count of Unicode code points is rarely what you need.
Examples of things which are assigned one or more Unicode code points: A harmless, invisible and ignorable marker; indication that subsequent neutral text is intended to be displayed right-to-left; the cedilla accent on a character; a lowercase x; a vertical tab; indication that a non-fatal error occurred in some previous processing.
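As a rough illustration of the list above, the sketch below looks up one code point per item using Python's unicodedata module. The specific code points are my own picks to match each description, not ones named in the comment.

```python
import unicodedata

# One illustrative code point per item in the list above (assumed choices):
examples = [
    "\u2060",  # invisible, ignorable marker (WORD JOINER)
    "\u200f",  # right-to-left indication (RIGHT-TO-LEFT MARK)
    "\u0327",  # the cedilla accent (COMBINING CEDILLA)
    "x",       # a lowercase x
    "\x0b",    # a vertical tab (a control character with no formal name)
    "\ufffd",  # an earlier, non-fatal processing error (REPLACEMENT CHARACTER)
]

for ch in examples:
    # Control characters have no name, so supply a default instead of raising.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")

# "c" followed by a combining cedilla displays as one glyph but is two code points.
c_cedilla = "c\u0327"
print(len(c_cedilla))
```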
Moving to Python 3
Posted Feb 10, 2011 11:57 UTC (Thu) by tialaramex (subscriber, #21167)
Oh, and in terms of interoperability, both UCS2 and UTF-16 are a big problem. Nobody wants to add BOMs everywhere, but if you don't you have no idea what you're looking at. So you end up with even products built by Microsoft entirely with Microsoft technologies (and thus heavily invested in 16-bit code units) communicating in UTF-8 anyway.
As the original poster said (even if their terminology is wrong in a bunch of places) UCS-2 looked like it might be clever in the mid-1990s. Once it became clear that Unicode's hyperspace would be populated, and UCS2 wasn't capable of handling that, the choice was no longer between UCS2 and UTF8 (where UCS2 delivers some intuitive-seeming properties, although not as many as sometimes claimed) but between UTF8 and UTF16, where UTF16 is completely horrible.
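The BOM problem mentioned above can be sketched in a few lines of Python. This only uses the standard codecs; the sample text is made up.

```python
import codecs

text = "hi"

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# The explicit-endian codecs write no BOM at all.
little = text.encode("utf-16-le")   # b'h\x00i\x00'
big = text.encode("utf-16-be")      # b'\x00h\x00i'

# Without a BOM, a wrong guess about byte order still decodes "successfully",
# just to the wrong text: here two CJK ideographs instead of "hi".
misread = little.decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in misread])

# UTF-8 has a single byte order, so there is nothing to guess.
utf8 = text.encode("utf-8")
print(utf8)
```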
Moving to Python 3
Posted Feb 10, 2011 9:33 UTC (Thu) by rweir (subscriber, #24833)
>The adoption of UCS2 for strings is a major stuff-up.
???
all python3 did was switch what the 'str' type refers to, from 'bytes' to 'abstract sequence of unicode codepoints'. as far as I know, python 2 and python 3 both support ucs-2 or ucs-4 as the concrete-you-almost-never-have-to-care representation for unicode strings.
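A minimal Python 3 sketch of that split between 'str' and 'bytes' (the sample word is arbitrary):

```python
# str: an abstract sequence of code points, whatever the internal storage is.
text = "na\u00efve"            # "naive" with U+00EF in the middle

# bytes: what actually goes to files, sockets and other programs.
data = text.encode("utf-8")

print(type(text).__name__, len(text))   # counts code points
print(type(data).__name__, len(data))   # counts bytes

# Converting back is explicit; Python 3 does not mix the two types implicitly.
round_trip = data.decode("utf-8")
print(round_trip == text)
```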
Moving to Python 3
Posted Feb 10, 2011 11:03 UTC (Thu) by cortana (subscriber, #24596)
Who cares how Python 3 encodes its unicode strings internally? If you care about the extra memory overhead of storing strings in UTF-16 vs. UTF-8 (which is not always a net overhead, BTW), and the extra time taken to convert from UTF-8 to UTF-16, then you are free to keep using the 'bytes' type and store UTF-8 data in it. But if you really cared that much, surely you would be using C, C++, D, etc.
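The "not always a net overhead" point can be checked by comparing encoded sizes. This sketch measures encoded byte lengths only, not Python's actual in-memory representation; the helper name `sizes` and the sample strings are made up for illustration.

```python
def sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for s."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

samples = {
    "ascii": "hello world",
    "greek": "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1",
    "cjk": "\u65e5\u672c\u8a9e",
}

for label, s in samples.items():
    utf8_len, utf16_len = sizes(s)
    print(f"{label}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

ASCII text doubles in UTF-16, the Greek sample comes out even, and the CJK sample is smaller in UTF-16 than in UTF-8.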
Leaky abstractions
Posted Feb 10, 2011 12:05 UTC (Thu) by tialaramex (subscriber, #21167)
So long as the programmer never knows, I don't care.
But in my experience it's surprisingly hard to prevent this abstraction from leaking. Text is really tricky; in fact, one of the main lessons from the Unicode project is that text is way trickier than anyone had really thought before.
For example, what happens with canonicalisation in Python?
(You will not be surprised to know that the answer in C is generally "C does not care about canonicalisation, it's all byte strings to us".)
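For the Python side of the canonicalisation question, a short sketch: plain str comparison works code point by code point, and normalisation is an explicit step through the standard unicodedata module.

```python
import unicodedata

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT, two code points

# Plain comparison does no canonicalisation, so these differ.
print(composed == decomposed)

# Normalising both sides to the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```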
Leaky abstractions
Posted Feb 10, 2011 17:11 UTC (Thu) by marcH (subscriber, #57642)
Except for wchar_t?
Leaky abstractions
Posted Feb 11, 2011 4:06 UTC (Fri) by tialaramex (subscriber, #21167)
The standard conveniently permits that wchar_t can be char, allowing you to ignore it altogether :D
Moving to Python 3
Posted Feb 17, 2011 4:14 UTC (Thu) by spitzak (guest, #4593)
In the real world, text that is "UTF-8" can contain ERRORS (i.e. byte sequences that are not valid UTF-8). Such text CANNOT be converted to UCS-2 or UTF-16 or UTF-32 or whatever Python thinks should be used. Any such conversion will either throw an error (resulting in a denial-of-service bug) or will be lossy (resulting in who knows what security or functionality loss bug).
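The two outcomes described above (an exception or a lossy result) correspond to Python's "strict" and "replace" error handlers. A small sketch, using a made-up filename whose last letter is a Latin-1 byte rather than valid UTF-8:

```python
raw = b"caf\xe9.txt"   # 0xE9 is Latin-1 e-acute, not a valid UTF-8 sequence here

# Option 1: strict decoding raises.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)

# Option 2: "replace" succeeds but substitutes U+FFFD, losing the original byte.
lossy = raw.decode("utf-8", errors="replace")
print(ascii(lossy))

# The original bytes cannot be recovered from the lossy string.
print(lossy.encode("utf-8") == raw)
```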
The real result, in Python 3 and 2 and on Windows and virtually everywhere else where the "wchar" madness infects designers, is that programmers working with text where the UTF-8 might contain an error resort to destroying the UTF-8 support by saying the text is actually ASCII or ISO-8859-1 or whatever (sometimes they double-UTF-8 encode it, which is the same as ISO-8859-1). Basically the question is whether to eliminate the ability to see even the ASCII letters in the filenames versus the ability to see some rarely-used foreign letters in the cases where they happen to be encoded correctly. If you don't believe me then you have not looked at any recent applications that read text filenames, even on Windows. Or just look at the idiotic behavior of Python 2, described right here in this article!
Congratulations, your belief in new encodings has set I18N back 20 years. We will never see filenames that work across systems and support Unicode. Never ever ever, because of your stubborn belief that you are "right".
The real answer:
Text is a stream of 8 bit bytes. In about 1% of the cases you will care about any characters other than a tiny number of ASCII ones such as NUL and CR. You will then have to decode it, using an ITERATOR that steps through the string, and is capable of returning Unicode code points, Unicode composed characters, and clear lossless indications of encoding errors.
Strings in source files should assume UTF-8 encoding. If the source file itself is UTF-8 this is trivial. But "\u1234" should produce the 3-byte UTF-8 encoding of U+1234. "\xNN" should produce a byte with that value, despite the fact that this can produce an invalid UTF-8 encoding. Printing UTF-8 should never throw an error, it should produce error boxes for encoding errors, one for each byte. On backwards systems where some idiot thought "wchar" was a hot idea, you may need to convert to it, in which case encoding errors should translate to U+DCxx where xx is the byte's value (these are errors in UTF-16 as well), but conversion back from UTF-16 will be lossy as these will turn back into 3 UTF-8 bytes.
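One way to picture the iterator proposed above is the rough Python sketch below. The name `iter_utf8` and the ("code point", n) / ("error", byte) tuple shape are hypothetical, chosen only for this illustration; it reports one error per offending byte and does not attempt composed characters.

```python
def iter_utf8(data):
    """Yield ("code point", n) or ("error", byte) for each step through data.

    Hypothetical helper for illustration: the bytes are never modified, and
    every byte that is not part of a valid UTF-8 sequence is reported alone.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        # Expected sequence length from the lead byte (0 means invalid lead).
        if lead < 0x80:
            n = 1
        elif 0xC2 <= lead <= 0xDF:
            n = 2
        elif 0xE0 <= lead <= 0xEF:
            n = 3
        elif 0xF0 <= lead <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = data[i:i + n]
        ch = None
        if n and len(chunk) == n:
            try:
                # Strict decoding also rejects overlong forms and surrogates.
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is None:
            yield ("error", lead)
            i += 1
        else:
            yield ("code point", ord(ch))
            i += n

# "\u1234" encodes to three UTF-8 bytes, as the comment says.
three_bytes = "\u1234".encode("utf-8")

# A stray 0xFF in the middle is reported without disturbing its neighbours.
steps = list(iter_utf8(b"a" + three_bytes + b"\xffz"))
print(steps)
```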
Moving to Python 3
Posted Feb 17, 2011 14:27 UTC (Thu) by foom (subscriber, #14868)
+1000 to the sentiment. Decoding bytes to codepoints is a total waste of time, almost always, and just adds unnecessary complication to the system.
However, as I said in http://lwn.net/Articles/426906/ python3 *does* do non-lossy decoding/encoding for filenames with random bytes in them.
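The non-lossy behaviour can be seen with Python 3's "surrogateescape" error handler (PEP 383), which as far as I know is what the filename functions use on POSIX systems. The sketch below applies it directly to a made-up byte string rather than going through the filesystem:

```python
raw = b"caf\xe9.txt"   # contains a byte that is not valid UTF-8

# Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of raising.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict encode refuses the lone surrogate, so the escape cannot leak silently.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```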
Moving to Python 3
Posted Feb 11, 2011 0:00 UTC (Fri) by dave_malcolm (subscriber, #15013)
FWIW, PEP 393 may ameliorate some of these issues (still at the planning stage, though)