Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 3:03 UTC (Wed) by HelloWorld (guest, #56129)
In reply to: Python 3, ASCII, and UTF-8 by ras
Parent article: Python 3, ASCII, and UTF-8

> It assigned each unique character a number it called a code point.
A code point is *not* a character. Umlauts like ä, ö and ü can be written in two ways in Unicode: either with a single code point for the whole thing, or with one code point for the underlying vowel and one code point for the diacritic. A sequence of code points that displays as a single character like this is called a grapheme cluster. There are entire Unicode blocks that contain only code points that only make sense as part of grapheme clusters, like Hangul Jamo for Korean. Many variations of emoji are also implemented this way. For this reason I don't think it makes sense to treat strings as sequences of Unicode code points. It should be grapheme clusters, and that's what Perl 6 does, while Python, like always, fucked it up (and Java did even worse, because its strings are sequences of 16-bit "char" values which are not even code points, because there are more than 65k of those by now).
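To make the two spellings concrete, here is a minimal Python 3 sketch using only the standard unicodedata module. The variable names are just for illustration; the point is that both strings render as "ä" but differ in length and compare unequal until they are normalised.

```python
import unicodedata

# "ä" as a single precomposed code point (U+00E4)
composed = "\u00e4"
# "ä" as "a" followed by COMBINING DIAERESIS (U+0308)
decomposed = "a\u0308"

# Python's len() counts code points, not grapheme clusters.
print(len(composed), len(decomposed))

# The two spellings are different strings as far as == is concerned.
print(composed == decomposed)

# Normalising to NFC composes the pair into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```

Normalisation only helps where a precomposed form exists; it is not a general way of counting grapheme clusters.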

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 8:29 UTC (Wed) by jem (subscriber, #24231)

>Java did even worse, because its strings are sequences of 16-Bit "char" values which are not even code points, because there are more than 65k of those by now

Java moved from UCS-2 (64k code points) to UTF-16 (1M+ code points) in version 1.5 (2004). Of course, the transition is not completely transparent to applications, which can still think they are dealing with UCS-2.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 11:30 UTC (Wed) by HelloWorld (guest, #56129)

Yeah, so? The problem is that a "string" is defined to be a sequence of things ("code units") that have no semantic meaning whatsoever. It's basically the same as treating UTF-8 strings as a sequence of bytes (except with 16 bits) and having the user sort out all that pesky code point/grapheme cluster/normalisation etc. stuff. The language simply doesn't help at all.
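The code unit versus code point distinction can be sketched in Python 3, which can encode to UTF-16 and UTF-8 even though its own strings are indexed by code point. The helper name utf16_code_units is made up for this example; the number it returns for the emoji is the same count a UTF-16 based string type would report as its length.

```python
def utf16_code_units(text: str) -> int:
    """Count 16-bit code units in the UTF-16 encoding of text (hypothetical helper)."""
    # "utf-16-le" avoids the byte order mark that plain "utf-16" would prepend.
    return len(text.encode("utf-16-le")) // 2

# U+1F600 GRINNING FACE lies outside the Basic Multilingual Plane.
emoji = "\U0001F600"

print(len(emoji))                  # code points, as Python 3 sees the string
print(utf16_code_units(emoji))     # UTF-16 code units (a surrogate pair)
print(len(emoji.encode("utf-8")))  # UTF-8 bytes
```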

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

And don't forget about Unihan and variant selectors: https://en.wikipedia.org/wiki/Han_unification#Examples_of... - yet ANOTHER can of worms.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:20 UTC (Wed) by ras (subscriber, #33059)

> yet ANOTHER can of worms.

Having spent my life in a country where ASCII covers every character we use plus some, this is all news to me. It sounds like a right royal balls up. Was there a good reason for not making code point == grapheme?

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

A desire to group multiple graphical variants of one character into one code point. It also gives a possibility of fallbacks - variant-unaware fonts can just have one variant which will (probably) be kinda understood by most speakers. With separate code points you'll just get boxes in place of missing variants.
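A rough Python 3 sketch of that fallback idea, under some assumptions: the base ideograph U+845B is picked only for illustration (nothing here checks whether that particular sequence is registered), and strip_variation_selectors is a made-up helper, not a standard library function. It shows that a variant is the base code point plus a separate selector code point, and that dropping the selector leaves the base character a variant-unaware font could still draw.

```python
import unicodedata

def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, keeping base characters (hypothetical helper)."""
    def is_selector(ch: str) -> bool:
        cp = ord(ch)
        # U+FE00..U+FE0F and U+E0100..U+E01EF are the variation selector blocks.
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
    return "".join(ch for ch in text if not is_selector(ch))

base = "\u845b"                  # a CJK unified ideograph, chosen as an example
variant = base + "\U000E0100"    # the same ideograph followed by a selector

print(len(base), len(variant))   # the selector is a code point of its own
print(unicodedata.name(variant[1]))
print(strip_variation_selectors(variant) == base)
```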

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:15 UTC (Wed) by ras (subscriber, #33059)

> A desire to group multiple graphical variants of one character into one code point.

Sounds almost reasonable.

I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test displays "Hello World" properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi code point graphemes.
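For anyone wanting to be that renegade, a minimal Python 3 test might look like the sketch below. It assumes the standard re module, where "." matches exactly one code point; the word and pattern are made-up examples.

```python
import re
import unicodedata

# "Mädchen" with the ä spelled as "a" + COMBINING DIAERESIS (two code points)
word = "Ma\u0308dchen"

# The author of this pattern expected "." to cover the one visible letter "ä".
pattern = re.compile(r"M.dchen")

# "." consumes only the "a", so the combining mark breaks the match.
print(pattern.fullmatch(word))

# After NFC normalisation the ä is one code point and the pattern matches.
normalised = unicodedata.normalize("NFC", word)
print(pattern.fullmatch(normalised) is not None)
```

Normalising the input first papers over this particular case, but only because a precomposed ä exists; clusters with no precomposed form would still span several code points.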

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Complex graphemes are a foregone conclusion anyway, so why not add more?

It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:40 UTC (Thu) by ras (subscriber, #33059)

> देवनागरी लिपि

I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.

I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about this stuff (and I suspect those who do care are a tiny few, and only some of the time), the way to get there is to make code point == grapheme.

It is nice to have a fallback for a grapheme you can't display to its root, but that could also be handled by the libraries that do the displaying. There are really only a few programs that care overly - browsers, email clients and word processors spring to mind. ls, cat and the hundreds of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is a '0' in the string, albeit accompanied by a code point that makes it not a '0'. Who is going to check for that?
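One partial answer to "who is going to check for that?", as a Python 3 sketch only: the standard unicodedata.combining() function returns a non-zero combining class for most combining marks, so a script can at least flag them. The helper name has_combining_marks and the choice of U+0338 are assumptions for the example, and the check is not complete: code points with combining class 0, such as variation selectors, slip through.

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    """Report whether any code point has a non-zero combining class (hypothetical helper)."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

plain = "0"
# "0" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY
disguised = "0\u0338"

# A naive substring test still finds the digit inside the disguised string.
print("0" in disguised)

# The two strings are nevertheless not equal.
print(plain == disguised)

# Flagging combining marks catches this particular disguise.
print(has_combining_marks(plain), has_combining_marks(disguised))
```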

What I would class as a good reason for not making code point == grapheme would be the number of code points exceeding 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:44 UTC (Thu) by ras (subscriber, #33059)

Oh, and thanks for responding. I've learned a lot through this exchange.