
Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 13:08 UTC (Tue) by foom (subscriber, #14868)
In reply to: Python 3, ASCII, and UTF-8 by peter-b
Parent article: Python 3, ASCII, and UTF-8

Yes, the point is that a "codepoint" sequence is effectively no more useful as an adequate representation of text than a byte sequence is. All those concepts you mention, like glyph breaks and word breaks, are just as easy to implement and use on top of a UTF-8-ish byte buffer as they are on top of a codepoint sequence.

Python 3 goes to a lot of effort to convert just about everything, whether it is actually in a valid known encoding or not, into a codepoint sequence, in the name of Unicode support. But that is unfortunately mostly wasted effort.

Imagine if, instead of a Unicode string type, Python just had functions to iterate over and otherwise manipulate a UTF-8 buffer by whichever grouping you like: byte, codepoint, glyph or word?

It could have ended up with something more like golang, where "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text."

And that would've been good.
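
A minimal sketch of what such helpers might look like (iter_codepoints is a hypothetical name, and this assumes the buffer holds valid UTF-8):

    # Walk a UTF-8 byte buffer code point by code point,
    # without ever building a separate string type.
    def iter_codepoints(buf):
        i = 0
        while i < len(buf):
            b = buf[i]
            # Sequence length follows from the lead byte.
            n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            yield buf[i:i + n]      # still bytes; decode only if you must
            i += n

    for cp in iter_codepoints('héllo'.encode('utf-8')):
        print(cp)                   # b'h', b'\xc3\xa9', b'l', b'l', b'o'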



Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 23:24 UTC (Wed) by togga (guest, #53103) [Link]

"But that is unfortunately mostly wasted effort."

Not only that, it also wastes its users' time and effort. It trades making rare complex tasks look easy on the surface for making common simple tasks annoyingly complex.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 4:36 UTC (Thu) by ras (subscriber, #33059) [Link] (17 responses)

> Yes, the point is that a "codepoint" sequence is effectively no more useful as an adequate representation of text than a byte sequence is.

I'm not a huge fan of Python 3's Unicode implementation, but this over-eggs the issue. We have had byte sequences representing text for as long as I can remember (which is longer than I care to admit). The issue was that if it really was text, you had to convert those byte sequences into text. There were standards, code pages and whatnot that told you how to do it, of course. But there were so many to choose from.

ISO 10646 solved the problem by dividing it in half. It assigned each unique character a number it called a code point. Then, as a separate exercise, it proposed some ways of encoding those numbers into bytes. Everyone seems happy enough to accept the ISO 10646 mapping from characters to numbers, meaning everyone is just as happy accepting that the only true code point for € is 0x20ac as they are accepting that the only true code point for 'a' is 0x61. (Well, everyone except Cyberax apparently, who wants a code point for an Italian i, and presumably '\r\n' as well.) But maybe even Cyberax would agree 10646 was a major step forward over code pages. The encoding into bytes will remain a blight until everyone agrees to just use UTF-8. Given UCS-2 is hard-wired into Windows and JavaScript, that might be never. Sigh.
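
The split is easy to see from a Python 3 prompt:

    >>> hex(ord('€'))            # the ISO 10646 code point
    '0x20ac'
    >>> '€'.encode('utf-8')      # one way of encoding that number into bytes
    b'\xe2\x82\xac'
    >>> '€'.encode('cp1252')     # a legacy code page picks a different byte entirely
    b'\x80'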

My only issue with Python 3 is that it is far too eager to convert bytes into text. stdin, stdout, os.listdir(), the default return of open() - they should all be byte streams. I don't know why they thought they could safely convert them to text. Doing so in a 'nix environment looked to me to be a recipe for disaster, because LANG= could be different on each run, and that's pretty much how it turned out. They had languages to copy from that did it mostly right, like Java for instance, and they still got it wrong.
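
To be fair, the byte-oriented versions are all still there in Python 3 if you ask for them; a quick sketch:

    import os, sys

    names = os.listdir(b'.')                 # bytes in -> bytes filenames out
    data = open('somefile', 'rb').read()     # binary mode: raw bytes, LANG never consulted
    sys.stdout.buffer.write(data)            # the binary stream underneath stdout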

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 6:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

One small comment - I'm not against Unicode or UCS.

I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 9:47 UTC (Thu) by anselm (subscriber, #2796) [Link] (5 responses)

I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.

The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded. When you read text from a file, socket, … (where it tends to be encoded according to UTF-8, UCS-2, or whatever, and that encoding is an attribute of the communication channel in question), it is decoded into a Unicode string for internal processing, and when you write it back it is again encoded according to UTF-8, UCS-2, or whatever, depending on where it goes. The system assumes that people writing Python programs are more likely to deal with Unicode text than with sequences of arbitrary bytes, and therefore leans in the former direction by default.
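
In code, the boundary looks something like this (file names made up):

    # Decode at the input boundary...
    with open('in.txt', encoding='utf-8') as f:
        text = f.read()           # 'text' is a str: a sequence of code points

    # ...work in code-point land, re-encode at the output boundary.
    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(text)             # written back out as UTF-16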

On the whole this is not an entirely unreasonable approach to take. There are problems with (inadvertently) feeding this mechanism streams of arbitrary bytes, which tends to fail due to decoding issues, with fuzzy boundaries between what should be treated as text vs. arbitrary bytes (such as path names on Linux), and with the general quagmire that Unicode quickly becomes when it comes to more complicated scripts. But saying that the way Python 3 handles Unicode is much worse in general than the way Python 2 handled it doesn't ring true. My personal experience after mostly moving over to Python 3 is that I now seem to have far fewer issues with Unicode handling than I used to when I was mostly using Python 2, so as far as I'm concerned, Python 3 is an improvement in this area – but then again I'm one of those programmers who have to deal with Unicode text much more often than with streams of arbitrary bytes.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 19:28 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded.
As I said, they are magic strings that solve everything.

As a result, they are strictly not better than simple byte sequences. You can't even concatenate them without getting strange results (just play around with combining characters and RTL switching). Pretty much the only safe thing you can do with Py3 strings is to spit them out verbatim. At which point it's just easier to deal with good old honest raw byte sequences.
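
A quick illustration with combining characters (RTL marks misbehave in similar ways):

    s = 'e' + '\u0301'    # 'e' + COMBINING ACUTE ACCENT
    print(s)              # renders as 'é'
    print(len(s))         # 2: two code points, one visible character
    print(s[::-1])        # reversing by code point tears the grapheme apart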

And it doesn't have to be so - Perl 6 does Unicode the correct way. They most definitely specify the encoding and normalize the text to make indexing and regexes sane. In particular, combining characters and languages with complex scripts are treated well by normalizing the text into a series of graphemes.
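
Python can get partway there with the standard unicodedata module, but only if you remember to ask:

    import unicodedata

    s = 'e\u0301'                                        # decomposed: two code points
    print(len(unicodedata.normalize('NFC', s)))          # 1: composed into U+00E9
    print(unicodedata.normalize('NFC', s) == '\u00e9')   # True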

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 17:51 UTC (Fri) by anselm (subscriber, #2796) [Link] (3 responses)

As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7, where you had to jump through your choice of flaming hoops (involving, e.g., the codecs module) to be able to accomplish the same thing. (People who would rather read binary data from a file can still open it in binary mode and see the raw bytes.)

So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point). It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course. Having the concept of Unicode strings in the first place, however, is a prerequisite for being able to deal with these cases at all.

It may well be the case that Perl 6's handling of Unicode data is currently better than Python's. They certainly took their own sweet time to figure it out, after all. But given the way things work between the Python and Perl communities, if the Perl people have in fact caught on to something worthwhile, chances are that the same – or very similar – functionality will pop up in Python in due course (and vice versa). So I wouldn't consider the Unicode functionality in current Python to be the last word on the topic.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:47 UTC (Fri) by jwilk (subscriber, #63328) [Link]

open() by default uses locale encoding, which is rarely what you want.
If you want UTF-8, you need to specify the encoding explicitly.
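
That is:

    # Relying on the locale (varies per machine and per run):
    f = open('data.txt')
    # Asking for UTF-8 explicitly:
    f = open('data.txt', encoding='utf-8')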

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file.
I can do so in Py2 as well: 'content = open("file").read().decode("utf-8")'. I still don't see why it justifies huge breaking changes that require a mass re-audit of huge amounts of code.

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points
I've yet to hear why I would want to deal with Unicode codepoints instead of bytes everywhere by default.

> It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course.
No they won't, without yet another round of incompatible changes. Py3 is stuck with Uselesscode strings.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 23:46 UTC (Fri) by lsl (subscriber, #86508) [Link]

> As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7,

I can do that using the raw open and read syscalls. Why would I need magic Unicode strings for that?

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point).

Except that arrays of Unicode code points aren't a terribly useful representation most of the time. Byte strings can carry UTF-8-encoded Unicode text just as well, in addition to being able to hold everything else users might want to feed into my program and expect it to round-trip correctly. On the odd day when I really need to apply some Unicode-aware text transformation that works on code points, I can convert there and then. Or just package it up such that it accepts and returns UTF-8 strings.
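
A sketch of that style (names made up):

    raw = open('input.bin', 'rb').read()     # bytes all the way through
    # ... shuttle 'raw' around; it round-trips untouched ...
    text = raw.decode('utf-8')               # decode only at the point of need
    out = text.casefold().encode('utf-8')    # one code-point-aware step, then back to bytes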

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 3:03 UTC (Wed) by HelloWorld (guest, #56129) [Link] (9 responses)

> It assigned each unique character a number it called a code point.
A code point is *not* a character. Umlauts like ä, ö and ü can be written in two ways in Unicode: either with a single code point for the whole thing, or with one code point for the underlying vowel and one code point for the diacritic. Such a sequence of code points is called a grapheme cluster. There are entire Unicode blocks that contain only code points that only make sense as part of grapheme clusters, like Hangul Jamo for Korean. Many variations of emoji are also implemented this way. For this reason I don't think it makes sense to treat strings as sequences of Unicode code points; it should be grapheme clusters, and that's what Perl 6 does while Python, as always, fucked it up (and Java did even worse, because its strings are sequences of 16-bit "char" values which are not even code points, since there are more than 65k of those by now).
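
The two spellings are easy to produce in Python, which treats them as distinct strings unless you explicitly normalize:

    import unicodedata

    composed = '\u00e4'      # ä as a single code point
    decomposed = 'a\u0308'   # 'a' + COMBINING DIAERESIS: two code points, one grapheme cluster
    print(composed == decomposed)                                 # False
    print(unicodedata.normalize('NFC', decomposed) == composed)   # True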

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 8:29 UTC (Wed) by jem (subscriber, #24231) [Link] (1 responses)

>Java did even worse, because its strings are sequences of 16-Bit "char" values which are not even code points, because there are more than 65k of those by now

Java moved from UCS-2 (64k code points) to UTF-16 (1M+ code points) in version 1.5 (2004). Of course, the transition is not completely transparent to applications, which can still think they are dealing with UCS-2.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 11:30 UTC (Wed) by HelloWorld (guest, #56129) [Link]

Yeah, so? The problem is that a “string” is defined to be a sequence of things (“code units”) that have no semantic meaning whatsoever. It's basically the same as treating UTF-8 strings as a sequence of bytes (except with 16 bits) and having the user sort out all that pesky code point/grapheme cluster/normalisation etc. stuff. The language simply doesn't help at all.
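
The code unit/code point split is visible even from Python, if you look at the UTF-16 encoding of a character outside the BMP:

    s = '\U0001F600'                          # GRINNING FACE, beyond the BMP
    print(len(s))                             # 1 code point in Python 3
    print(len(s.encode('utf-16-le')) // 2)    # 2 UTF-16 code units -- what Java's String.length() counts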

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

And don't forget about Unihan and variant selectors: https://en.wikipedia.org/wiki/Han_unification#Examples_of... - yet ANOTHER can of worms.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:20 UTC (Wed) by ras (subscriber, #33059) [Link] (5 responses)

> yet ANOTHER can of worms.

Having spent my life in a country where ASCII covers every character we use, plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

A desire to group multiple graphical variants of one character into one code point. It also makes fallbacks possible - variant-unaware fonts can just include one variant, which will (probably) be kinda understood by most speakers. With separate code points you'd just get boxes in place of missing variants.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:15 UTC (Wed) by ras (subscriber, #33059) [Link] (3 responses)

> A desire to group multiple graphical variants of one character into one code point.

Sounds almost reasonable.

I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test's "Hello World" displays properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi-code-point graphemes.
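
For what it's worth, the choke is easy to reproduce in Python:

    import re

    word = 'cafe\u0301'                  # 'café' with a decomposed é
    print(re.findall('café', word))      # []: the composed pattern never matches
    print(re.findall('.', 'e\u0301'))    # ['e', '́']: '.' matches code points, not graphemes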

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Complex graphemes are a foregone conclusion anyway, so why not add more?

It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:40 UTC (Thu) by ras (subscriber, #33059) [Link]

> देवनागरी लिपि

I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.

I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about it (and the ones who do care are, I suspect, a tiny few, and only some of the time), you have to make code point == grapheme.

It is nice to be able to fall back from a grapheme you can't display to its root, but that could also be handled by the libraries that do the displaying. There are really only a few programs that care overly - browsers, email clients and word processors spring to mind. ls, cat and the hundreds of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python 3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is a '0' in the string, albeit followed by a code point that makes it not a '0'. Who is going to check for that?
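
A contrived Python example (the combining mark is just one of many that would do the trick):

    s = 'R' + '0\u0336' + 'OT'   # '0' + COMBINING LONG STROKE OVERLAY
    print(s)                     # displays with the zero struck through
    print('0' in s)              # True: the raw digit really is in the string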

What I would class as a good reason for not making code point == grapheme would be if the number of code points needed exceeded 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:44 UTC (Thu) by ras (subscriber, #33059) [Link]

Oh, and thanks for responding. I've learned a lot through this exchange.

