Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 17:51 UTC (Fri) by anselm (subscriber, #2796)
In reply to: Python 3, ASCII, and UTF-8 by Cyberax
Parent article: Python 3, ASCII, and UTF-8

As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7, where you had to jump through your choice of flaming hoops (involving, e.g., the codecs module) to be able to accomplish the same thing. (People who would rather read binary data from a file can still open it in binary mode and see the raw bytes.)

So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point). It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course. Having the concept of Unicode strings in the first place, however, is a prerequisite for being able to deal with these cases at all.
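To make the code point versus byte distinction concrete, here is a minimal sketch in Python 3. The sample word is an arbitrary choice for illustration. The last step shows one case where code points still differ from what a reader would call characters.

```python
import unicodedata

# Illustrative sample text; "ï" here is the single precomposed code point U+00EF.
text = "naïve"
data = text.encode("utf-8")

code_point_count = len(text)   # counts code points
byte_count = len(data)         # counts bytes; "ï" needs two in UTF-8

# Slicing the str keeps whole code points.
prefix_text = text[:3]

# Slicing the bytes at the same index cuts "ï" in half,
# so the result is no longer valid UTF-8.
try:
    data[:3].decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

# Code points are not the whole story: in decomposed form (NFD),
# "ï" becomes "i" followed by a combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", text)
decomposed_count = len(decomposed)
```

The same five-letter word is five code points precomposed, six code points decomposed, and six bytes in UTF-8, which is roughly the kind of case that code point strings alone do not settle.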

It may well be the case that Perl 6's handling of Unicode data is currently better than Python's. They certainly took their own sweet time to figure it out, after all. But given the way things work between the Python and Perl communities, if the Perl people have in fact caught on to something worthwhile, chances are that the same – or very similar – functionality will pop up in Python in due course (and vice versa). So I wouldn't consider the Unicode functionality in current Python to be the last word on the topic.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:47 UTC (Fri) by jwilk (subscriber, #63328)

open() by default uses locale encoding, which is rarely what you want.
If you want UTF-8, you need to specify the encoding explicitly.
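A short sketch of the difference, assuming Python 3 as discussed in this thread. The temporary file and its contents are made up for the example.

```python
import locale
import os
import tempfile

# What open() would use if no encoding is given; this depends on the
# environment the program runs in, not on the file.
default_encoding = locale.getpreferredencoding(False)

# Hypothetical example file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Passing encoding= explicitly makes the result independent of the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write("Grüße\n")

with open(path, encoding="utf-8") as f:
    content = f.read()
```

Leaving out `encoding="utf-8"` on either call would make the outcome depend on whatever `default_encoding` happens to be on the machine at hand.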

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file.
I can do so in Py2 as well: 'content = open("file").read().decode("utf-8")'. I still don't see how that justifies huge breaking changes that require a mass re-audit of a huge amount of code.
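For comparison, the same bytes-first pattern can be sketched in Python 3, where the file has to be opened in binary mode to get bytes back. The file name and contents below are invented for the example.

```python
import os
import tempfile

# Hypothetical example file containing UTF-8-encoded text.
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "wb") as f:
    f.write("héllo".encode("utf-8"))

# Read raw bytes first, then decode explicitly, as in the Py2 one-liner.
with open(path, "rb") as f:
    raw = f.read()

content = raw.decode("utf-8")
```

Here `raw` is a `bytes` object and `content` is a `str`; the decoding step is explicit rather than done by open().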

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points
I've yet to hear why I would want to deal with Unicode code points instead of bytes everywhere by default.

> It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course.
No, they won't, not without yet another round of incompatible changes. Py3 is stuck with Uselesscode strings.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 23:46 UTC (Fri) by lsl (subscriber, #86508)

> As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7,

I can do that using the raw open and read syscalls. Why would I need magic Unicode strings for that?

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point).

Except that arrays of Unicode code points aren't a terribly useful representation most of the time. Byte strings can carry UTF-8-encoded Unicode text just as well, in addition to being able to hold everything else users might want to feed into my program and expect to round-trip correctly. On the odd day when I really need to apply some Unicode-aware text transformation that works on code points, I can convert the data there and then. Or I can just package the transformation up so that it accepts and returns UTF-8 strings.
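A rough sketch of that approach in Python 3, under some assumptions: the record format and the helper name `upper_utf8` are hypothetical, chosen only to illustrate keeping data as bytes and decoding at the one point where a Unicode-aware operation is needed.

```python
def upper_utf8(data: bytes) -> bytes:
    """Hypothetical helper: accepts and returns UTF-8 bytes,
    converting to code points only for the transformation itself."""
    return data.decode("utf-8").upper().encode("utf-8")


# Made-up input: the first field is UTF-8 text, the second is arbitrary
# binary data that is not valid UTF-8.
record = b"name=caf\xc3\xa9;blob=\xff\xfe"

# Splitting and joining on ASCII delimiters works directly on bytes,
# so the binary field passes through untouched.
fields = dict(part.split(b"=", 1) for part in record.split(b";"))
rebuilt = b";".join(key + b"=" + value for key, value in fields.items())

# Decode only the field that is known to be text, and only when needed.
shouted_name = upper_utf8(fields[b"name"])

# The binary field cannot be decoded as UTF-8 at all.
try:
    fields[b"blob"].decode("utf-8")
    blob_is_text = True
except UnicodeDecodeError:
    blob_is_text = False
```

The record round-trips byte for byte even though part of it is not text, and only `upper_utf8` ever deals with code points.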