Moving to Python 3

Posted Feb 10, 2011 9:33 UTC (Thu) by rweir (subscriber, #24833)
In reply to: Moving to Python 3 by ras
Parent article: Moving to Python 3

>The adoption of UCS2 for strings is a major stuff-up.

???

all python3 did was switch what the 'str' type refers to, from 'bytes' to 'abstract sequence of unicode codepoints'. as far as I know, python 2 and python 3 both support ucs-2 or ucs-4 as the concrete-you-almost-never-have-to-care representation for unicode strings.
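A minimal Python 3 sketch of that distinction, assuming nothing beyond the standard library. The variable names are made up for illustration. Note that `sys.maxunicode` is how older builds revealed the narrow (UCS-2) or wide (UCS-4) choice; since Python 3.3 the internal representation is flexible and the value no longer varies.

```python
import sys

# 'str' is a sequence of code points; 'bytes' is a sequence of octets.
text = "na\u00efve"            # 5 code points, one of them non-ASCII
data = text.encode("utf-8")    # the same text as concrete bytes

code_points = len(text)        # counts code points
octets = len(data)             # counts bytes; the i-diaeresis takes two in UTF-8

# On old narrow builds this was 0xFFFF, on wide builds 0x10FFFF.
print(code_points, octets, hex(sys.maxunicode))
```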


Moving to Python 3

Posted Feb 10, 2011 11:03 UTC (Thu) by cortana (subscriber, #24596) [Link]

Who cares how Python 3 encodes its unicode strings internally? If you care about the extra memory overhead of storing strings in UTF-16 vs. UTF-8 (which is not always a net overhead, BTW), and the extra time taken to convert from UTF-8 to UTF-16, then you are free to keep using the 'bytes' type and store UTF-8 data in it. But if you really cared that much, surely you would be using C, C++, D, etc.
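To make the "not always a net overhead" point concrete, here is a small sketch comparing encoded sizes. The sample strings are arbitrary choices for illustration, and this measures encoded byte counts only, not what any particular interpreter uses in memory.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16 (no BOM).
ascii_text = "hello"
cjk_text = "\u65e5\u672c\u8a9e"   # three CJK characters

ascii_utf8 = len(ascii_text.encode("utf-8"))       # 1 byte per character
ascii_utf16 = len(ascii_text.encode("utf-16-le"))  # 2 bytes per character

cjk_utf8 = len(cjk_text.encode("utf-8"))           # 3 bytes per character
cjk_utf16 = len(cjk_text.encode("utf-16-le"))      # 2 bytes per character

print(ascii_utf8, ascii_utf16, cjk_utf8, cjk_utf16)
```

For ASCII-heavy text UTF-16 doubles the size; for text in this range of CJK characters it is the smaller of the two.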

Leaky abstractions

Posted Feb 10, 2011 12:05 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

So long as the programmer never knows, I don't care.

But in my experience it's surprisingly hard to prevent this abstraction from leaking. Text is really tricky; in fact, one of the main lessons from the Unicode project is that text is way trickier than anyone had really thought before.

For example, what happens with canonicalisation in Python?

(You will not be surprised to know that the answer in C is generally "C does not care about canonicalisation, it's all byte strings to us")
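For what it's worth, a short sketch of how the question plays out in Python 3: `str` comparison works on code points and does no canonicalisation by itself, so equivalent sequences compare unequal until they are normalised explicitly with the standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent

# Plain comparison sees two different code point sequences.
same_raw = composed == decomposed

# After normalising both to NFC they compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, same_nfc, len(composed), len(decomposed))
```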

Leaky abstractions

Posted Feb 10, 2011 17:11 UTC (Thu) by marcH (subscriber, #57642) [Link]

Except for wchar_t?

Leaky abstractions

Posted Feb 11, 2011 4:06 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

The standard conveniently permits that wchar_t can be char, allowing you to ignore it altogether :D

Moving to Python 3

Posted Feb 17, 2011 4:14 UTC (Thu) by spitzak (guest, #4593) [Link]

In the real world, text that is "UTF-8" can contain ERRORS (i.e. byte sequences that are not valid UTF-8). They CANNOT be converted to UCS-2 or UTF-16 or UTF-32 or whatever Python thinks should be used. Any such conversion will either throw an error (resulting in a denial-of-service bug) or will be lossy (resulting in who knows what security or functionality loss bug).
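Both failure modes are easy to reproduce with Python 3's default error handlers. This is only a sketch; the byte string is an invented example of a filename containing a stray 0xFF byte.

```python
raw = b"report-\xff.txt"   # 0xFF can never appear in valid UTF-8

# Failure mode 1: strict decoding raises.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Failure mode 2: "replace" succeeds but loses the original byte,
# substituting U+FFFD, so the round trip gives different bytes.
lossy = raw.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")

print(strict_ok, repr(lossy), round_trip == raw)
```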

The real result, in Python 3 and 2 and on Windows and virtually everywhere else where the "wchar" madness infects designers, is that programmers working with text where the UTF-8 might contain an error resort to destroying the UTF-8 support by saying the text is actually ASCII or ISO-8859-1 or whatever (sometimes they double-UTF-8 encode it, which is the same as ISO-8859-1). Basically the question is whether to eliminate the ability to see even the ASCII letters in the filenames versus the ability to see some rarely-used foreign letters in the cases where they happen to be encoded correctly. If you don't believe me then you have not looked at any recent applications that read text filenames, even on Windows. Or just look at the idiotic behavior of Python 2, described right here in this article!

Congratulations, your belief in new encodings has set I18N back 20 years. We will never see filenames that work across systems and support Unicode. Never ever ever, because of your stubborn belief that you are "right".

The real answer:

Text is a stream of 8-bit bytes. In about 1% of the cases you will care about any characters other than a tiny number of ASCII ones such as NUL and CR. You will then have to decode it, using an ITERATOR that steps through the string, and is capable of returning Unicode code points, Unicode composed characters, and clear lossless indications of encoding errors.
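A rough sketch of what such an iterator could look like, written in Python for illustration. The function name `iter_utf8` and the `("char", ...)` / `("error", ...)` tuple shape are invented here, not an existing API, and this version reports code points and per-byte errors only; it does not group composed characters.

```python
def iter_utf8(data: bytes):
    """Yield ("char", code_point) or ("error", byte_value) for each item.

    Hypothetical helper: leans on Python's strict UTF-8 decoder to decide
    what is valid, and reports every undecodable byte individually.
    """
    i = 0
    while i < len(data):
        # A valid UTF-8 sequence is 1 to 4 bytes long; try each width.
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            yield ("char", ord(ch))
            i += len(chunk)
            break
        else:
            # No width decoded: report this byte losslessly and move on.
            yield ("error", data[i])
            i += 1


items = list(iter_utf8(b"a\xc3\xa9\xff"))
print(items)
```

Because errors carry the raw byte value, the original byte stream can be rebuilt from the yielded items.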

Strings in source files should assume UTF-8 encoding. If the source file itself is UTF-8 this is trivial. But "\u1234" should produce the 3-byte UTF-8 encoding of U+1234. "\xNN" should produce a byte with that value, despite the fact that this can produce an invalid UTF-8 encoding. Printing UTF-8 should never throw an error, it should produce error boxes for encoding errors, one for each byte. On backwards systems where some idiot thought "wchar" was a hot idea, you may need to convert to it, in which case encoding errors should translate to U+DCxx where xx is the byte's value (these are errors in UTF-16 as well), but conversion back from UTF-16 will be lossy as these will turn back into 3 UTF-8 bytes.

Moving to Python 3

Posted Feb 17, 2011 14:27 UTC (Thu) by foom (subscriber, #14868) [Link]

+1000 to the sentiment. Decoding bytes to codepoints is a total waste of time, almost always, and just adds unnecessary complication to the system.

However, as I said in http://lwn.net/Articles/426906/ python3 *does* do non-lossy decoding/encoding for filenames with random bytes in them.
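The mechanism behind this is, as far as I know, the `surrogateescape` error handler from PEP 383, which maps each undecodable byte to a lone surrogate in the U+DC80..U+DCFF range and maps it back on encoding. A minimal sketch using the codec directly, with an invented filename:

```python
raw = b"report-\xff.txt"   # not valid UTF-8

# Undecodable byte 0xFF becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = name.encode("utf-8", errors="surrogateescape")

print(ascii(name), back == raw)
```

On POSIX systems `os.fsdecode()` and `os.fsencode()` apply this handler with the filesystem encoding, which is what makes filenames with arbitrary bytes round-trip.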

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds