Moving to Python 3

Posted Feb 17, 2011 4:14 UTC (Thu) by spitzak (guest, #4593)
In reply to: Moving to Python 3 by cortana
Parent article: Moving to Python 3

In the real world, text that is "UTF-8" can contain ERRORS (i.e. byte sequences that are not valid UTF-8). These CANNOT be converted to UCS-2 or UTF-16 or UTF-32 or whatever Python thinks should be used. Any such conversion will either throw an error (resulting in a denial-of-service bug) or be lossy (resulting in who knows what security or functionality bug).
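To make the two failure modes concrete, here is a minimal Python 3 sketch. The filename bytes and the helper name `strict_decode_fails` are made up for illustration; only the standard `bytes.decode` call is used.

```python
# Hypothetical filename containing one byte (0xFF) that can never appear in valid UTF-8.
bad = b"report-\xff.txt"


def strict_decode_fails(data: bytes) -> bool:
    """Return True if the default (strict) UTF-8 decode raises."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Failure mode 1: the strict decode throws.
print(strict_decode_fails(bad))

# Failure mode 2: errors="replace" does not throw, but it is lossy.
# Two different byte strings collapse to the same text, so the
# original bytes cannot be recovered from the decoded string.
lossy = bad.decode("utf-8", errors="replace")
other = b"report-\xfe.txt".decode("utf-8", errors="replace")
print(lossy == other)
```

Both behaviours are the ones complained about above: an exception the caller must handle, or a string that no longer names the same file.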

The real result, in Python 3 and 2 and on Windows and virtually everywhere else the "wchar" madness infects designers, is that programmers working with text where the UTF-8 might contain an error resort to destroying the UTF-8 support by saying the text is actually ASCII or ISO-8859-1 or whatever (sometimes they double-UTF-8 encode it, which is the same as ISO-8859-1). Basically the choice is between losing the ability to see even the ASCII letters in the filenames and losing the ability to see some rarely-used foreign letters in the cases where they happen to be encoded correctly. If you don't believe me then you have not looked at any recent applications that read text filenames, even on Windows. Or just look at the idiotic behavior of Python 2, described right here in this article!

Congratulations, your belief in new encodings has set I18N back 20 years. We will never see filenames that work across systems and support Unicode. Never ever ever, because of your stubborn belief that you are "right".

The real answer:

Text is a stream of 8-bit bytes. In about 1% of the cases you will care about any characters other than a tiny number of ASCII ones such as NUL and CR. You will then have to decode it, using an ITERATOR that steps through the string and is capable of returning Unicode code points, Unicode composed characters, and clear lossless indications of encoding errors.
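One way such an iterator could look in Python 3 is sketched below. The names `iter_utf8` and `_decode_one` and the `("char", ...)` / `("error", ...)` tuple shape are assumptions for illustration, not an existing API, and the sketch covers code points and per-byte errors only, not composed characters.

```python
from typing import Iterator, Optional, Tuple


def _decode_one(data: bytes, i: int) -> Optional[Tuple[int, int]]:
    """Try to decode one code point starting at offset i.

    Returns (code_point, byte_length), or None if no valid UTF-8
    sequence of 1 to 4 bytes starts at this offset.
    """
    for length in range(1, 5):
        chunk = data[i:i + length]
        try:
            return ord(chunk.decode("utf-8")), len(chunk)
        except UnicodeDecodeError:
            continue
    return None


def iter_utf8(data: bytes) -> Iterator[Tuple[str, int]]:
    """Step through bytes, yielding ("char", code_point) or ("error", byte)."""
    i = 0
    while i < len(data):
        decoded = _decode_one(data, i)
        if decoded is None:
            # Lossless error indication: report the offending byte itself.
            yield ("error", data[i])
            i += 1
        else:
            code_point, length = decoded
            yield ("char", code_point)
            i += length


for kind, value in iter_utf8(b"a\xffb\xe1\x88\xb4"):
    print(kind, hex(value))
```

The underlying bytes are never rewritten; a caller that does not care about code points can keep passing the original byte string around untouched.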

Strings in source files should assume UTF-8 encoding. If the source file itself is UTF-8 this is trivial. But "\u1234" should produce the 3-byte UTF-8 encoding of U+1234. "\xNN" should produce a byte with that value, despite the fact that this can produce an invalid UTF-8 encoding. Printing UTF-8 should never throw an error; it should produce error boxes for encoding errors, one for each byte. On backwards systems where some idiot thought "wchar" was a hot idea, you may need to convert to it, in which case encoding errors should translate to U+DCxx where xx is the byte's value (these are errors in UTF-16 as well), but conversion back from UTF-16 will be lossy as these will turn back into 3 UTF-8 bytes.
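A rough Python 3 sketch of the literal and printing rules described above. The helper name `printable` is made up, U+FFFD stands in for the "error box", and the per-byte behaviour is obtained here by going through Python's `surrogateescape` error handler, which is one possible approach rather than the only one.

```python
# "\u1234" encoded as UTF-8 is the three bytes E1 88 B4.
print("\u1234".encode("utf-8"))

# A bytes literal with \xNN holds exactly that byte, valid UTF-8 or not.
print(b"\xff" == bytes([0xFF]))


def printable(data: bytes) -> str:
    """Decode for display without ever raising.

    Every undecodable byte becomes its own U+FFFD "error box".
    """
    # surrogateescape maps each bad byte 0xNN to the lone surrogate U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")
    return "".join(
        "\ufffd" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


# A truncated 3-byte sequence shows up as two boxes, one per byte.
print(printable(b"ok\xe1\x88"))
```

The result of `printable` is for display only; the original bytes should be kept if the name has to be handed back to the system.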

Moving to Python 3

Posted Feb 17, 2011 14:27 UTC (Thu) by foom (subscriber, #14868)

+1000 to the sentiment. Decoding bytes to codepoints is a total waste of time, almost always, and just adds unnecessary complication to the system.

However, as I said elsewhere, Python 3 *does* do non-lossy decoding/encoding for filenames with random bytes in them.
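For reference, the mechanism is the `surrogateescape` error handler from PEP 383, which `os.fsdecode` and `os.fsencode` apply on POSIX systems. The sketch below calls the handler directly so it does not depend on the platform's filesystem encoding; the filename is a made-up example.

```python
# Hypothetical filename in Latin-1: 0xE9 on its own is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding: the bad byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler turns U+DCE9 back into the single byte 0xE9,
# so the round trip restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```

Note that the string still cannot be encoded with the default strict handler, so code that prints or logs such a name has to deal with the lone surrogate itself.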
