Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 19:28 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
In reply to: Python 3, ASCII, and UTF-8 by anselm
Parent article: Python 3, ASCII, and UTF-8

> The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded.
As I said, they are magic strings that solve everything.

As a result, they are strictly not better than simple byte sequences. You can't even concatenate them without getting strange results (just play around with combining characters and RTL switching). Pretty much the only safe thing you can do with Py3 strings is to spit them out verbatim. At which point it's just easier to deal with good old honest raw byte sequences.
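A minimal Python 3 sketch of the concatenation point, using a combining character. The sample strings are made up for illustration; only the standard unicodedata module is used.

```python
import unicodedata

left = "cafe"          # ends with a plain "e" (U+0065)
right = "\u0301!"      # starts with COMBINING ACUTE ACCENT (U+0301)

joined = left + right

# The accent attaches to the preceding "e", so the joined text displays an
# accented letter that neither input contained on its own.
print(len(left), len(right), len(joined))   # 4 2 6 (code points)
print(unicodedata.name(joined[4]))          # COMBINING ACUTE ACCENT

# Code point by code point, the result differs from the precomposed spelling.
precomposed = "caf\u00e9!"
print(joined == precomposed)                # False
```

The code point counts add up, but the visible text at the seam changes, which is the kind of strange result described above.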

And it doesn't have to be this way: Perl 6 does Unicode the correct way. They most definitely specify the encoding and normalize the text to make indexing and regexes sane. In particular, combining characters and languages with complex scripts are handled well by normalizing the text into a series of graphemes.
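For comparison, a rough sketch of what normalization looks like in Python 3. This is an assumption-laden approximation, not Perl 6's mechanism: NFC via the standard unicodedata module only composes sequences that have a precomposed form, and rough_clusters is a hypothetical helper that groups by combining class rather than implementing full grapheme segmentation.

```python
import unicodedata

def nfc(text):
    """Return text in NFC, composing base letters with combining marks."""
    return unicodedata.normalize("NFC", text)

def rough_clusters(text):
    """Crudely group each base code point with the combining marks after it.

    Hypothetical helper: it looks only at the canonical combining class and
    is not a full grapheme segmentation algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

decomposed = "cafe\u0301"
print(len(decomposed))                        # 5
print(len(nfc(decomposed)))                   # 4
print(nfc(decomposed) == "caf\u00e9")         # True

# "q" with two combining dots has no precomposed form, so NFC cannot reduce
# it to one code point, but it still forms a single cluster.
print(len(rough_clusters("q\u0307\u0323")))   # 1
```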

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 17:51 UTC (Fri) by anselm (subscriber, #2796)

As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7, where you had to jump through your choice of flaming hoops (involving, e.g., the codecs module) to be able to accomplish the same thing. (People who would rather read binary data from a file can still open it in binary mode and see the raw bytes.)

So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point). It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course. Having the concept of Unicode strings in the first place, however, is a prerequisite for being able to deal with these cases at all.
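The difference between indexing code points and indexing bytes can be sketched as follows. The sample word is made up for illustration.

```python
text = "na\u00efve"                 # 5 code points; U+00EF is "i" with diaeresis
data = text.encode("utf-8")         # U+00EF takes two bytes in UTF-8

print(len(text), len(data))         # 5 6

# Slicing the str always cuts between code points.
head = text[:3]
print(len(head))                    # 3

# Slicing the bytes can cut through the middle of an encoded code point.
truncated = data[:3]                # b"na\xc3"
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("truncated bytes are not valid UTF-8")
```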

It may well be the case that Perl 6's handling of Unicode data is currently better than Python's. They certainly took their own sweet time to figure it out, after all. But given the way things work between the Python and Perl communities, if the Perl people have in fact caught on to something worthwhile, chances are that the same – or very similar – functionality will pop up in Python in due course (and vice versa). So I wouldn't consider the Unicode functionality in current Python to be the last word on the topic.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:47 UTC (Fri) by jwilk (subscriber, #63328)

open() by default uses locale encoding, which is rarely what you want.
If you want UTF-8, you need to specify the encoding explicitly.
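A short sketch of the explicit form. The temporary file and its contents are placeholders so the example is self-contained.

```python
import locale
import os
import tempfile

# Hypothetical sample file, created only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Gr\u00fc\u00dfe\n")

# The locale's preferred encoding, which a bare open(path) would normally use.
print(locale.getpreferredencoding(False))

# Passing encoding= makes the result independent of the locale.
with open(path, encoding="utf-8") as f:
    content = f.read()

print(content == "Gr\u00fc\u00dfe\n")   # True
```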

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file.
I can do so in Py2 as well: 'content = open("file").read().decode("utf-8")'. I still don't see why it justifies huge breaking changes that require a mass re-audit of a huge amount of code.
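For reference, the same read-then-decode shape can still be written in Python 3 by opening the file in binary mode. A sketch, with a throwaway file standing in for "file":

```python
import os
import tempfile

# Hypothetical file, written as raw UTF-8 bytes to keep the sketch runnable.
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "wb") as f:
    f.write("z\u00fcrich".encode("utf-8"))

# Read bytes first, then decode explicitly, as in the Python 2 one-liner.
with open(path, "rb") as f:
    raw = f.read()
content = raw.decode("utf-8")

print(type(raw).__name__, type(content).__name__)   # bytes str
```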

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points
I've yet to hear why I would want to deal with Unicode codepoints instead of bytes everywhere by default.

> It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course.
No, they won't, not without yet another round of incompatible changes. Py3 is stuck with Uselesscode strings.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 23:46 UTC (Fri) by lsl (subscriber, #86508)

> As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7,

I can do that using the raw open and read syscalls. Why would I need magic Unicode strings for that?

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point).

Except for the fact that arrays of Unicode code points aren't a terribly useful representation most of the time. Byte strings can carry UTF-8-encoded Unicode text just as well, in addition to being able to hold everything else users might want to feed into my program and expect it to roundtrip correctly. On the odd day when I really need to apply some Unicode-aware text transformation that works on code points, I can convert the data there and then. Or I can just package the transformation up so that it accepts and returns UTF-8 strings.
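A sketch of that approach in Python 3, under the assumption that the data stays in bytes throughout and is decoded only inside the one helper that needs code points. The helper name upper_utf8 and the sample data are hypothetical.

```python
def upper_utf8(data: bytes) -> bytes:
    """Uppercase UTF-8 text, accepting and returning bytes.

    Hypothetical helper: decoding happens only inside this function.
    """
    return data.decode("utf-8").upper().encode("utf-8")

# Arbitrary input, including bytes that are not valid UTF-8, survives a
# bytes-only code path unchanged.
blob = b"name=\xff\xfe\nnext\n"
roundtrip = b"\n".join(blob.split(b"\n"))
print(roundtrip == blob)                    # True

# Text that needs a Unicode-aware operation is converted on the spot.
result = upper_utf8("stra\u00dfe".encode("utf-8"))
print(result.decode("utf-8"))               # STRASSE
```

Passing blob itself to upper_utf8 would raise UnicodeDecodeError, so the helper is only suitable for data already known to be UTF-8.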