Python 3, ASCII, and UTF-8
Posted Jan 10, 2018 3:03 UTC (Wed) by HelloWorld (guest, #56129) In reply to: Python 3, ASCII, and UTF-8 by ras
Parent article: Python 3, ASCII, and UTF-8
A code point is *not* a character. Umlauts like ä, ö and ü can be written in two ways in Unicode: either with a single code point for the whole thing, or with one code point for the underlying vowel and one for the diacritic. Such a sequence of two code points forms a single grapheme cluster. There are entire Unicode blocks containing code points that only make sense as part of grapheme clusters, like Hangul Jamo for Korean. Many emoji variations are also implemented this way. For this reason I don't think it makes sense to treat strings as sequences of Unicode code points. They should be sequences of grapheme clusters, and that's what Perl 6 does, while Python, like always, fucked it up (and Java did even worse, because its strings are sequences of 16-bit "char" values, which are not even code points, because there are more than 65k of those by now).
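To make the two spellings of ä concrete, here is a minimal Python 3 sketch using only the standard `unicodedata` module. The variable and function names are illustrative assumptions, not anything from the article.

```python
import unicodedata

# Two spellings of the same user-perceived character "ä".
composed = "\u00e4"      # single precomposed code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS


def code_point_count(text: str) -> int:
    """Python 3 strings are sequences of code points, so len() counts those."""
    return len(text)


def equal_after_nfc(left: str, right: str) -> bool:
    """Compare two strings after normalising both to NFC (composed form)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(code_point_count(composed), code_point_count(decomposed))
print(composed == decomposed, equal_after_nfc(composed, decomposed))
```

Normalisation only helps where a precomposed code point exists. Counting grapheme clusters in general needs the segmentation rules from the Unicode standard, which the Python standard library does not provide.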
Posted Jan 10, 2018 8:29 UTC (Wed)
by jem (subscriber, #24231)
Java moved from UCS-2 (64k code points) to UTF-16 (1M+ code points) in version 1.5 (2004). Of course, the transition is not completely transparent to applications, which can still think they are dealing with UCS-2.
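The difference between a 16-bit unit and a code point can be sketched without Java at all. The Python 3 snippet below encodes text as UTF-16 and counts 16-bit code units; the helper name `utf16_code_units` is an assumption made for this illustration.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e4"         # inside the Basic Multilingual Plane: one unit
astral_char = "\U0001F600"  # outside the BMP: needs a surrogate pair

print(utf16_code_units(bmp_char), utf16_code_units(astral_char))
# One code point, but two 16-bit units (0xD83D, 0xDE00) in UTF-16.
print(len(astral_char), astral_char.encode("utf-16-be").hex())
```

Code that treats each 16-bit unit as a character, as UCS-2-era code does, sees two "characters" where there is one code point.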
Posted Jan 10, 2018 11:30 UTC (Wed)
by HelloWorld (guest, #56129)
Posted Jan 10, 2018 17:40 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Posted Jan 10, 2018 22:20 UTC (Wed)
by ras (subscriber, #33059)
Having spent my life in a country where ASCII covers every character we use plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?
Posted Jan 10, 2018 22:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Posted Jan 10, 2018 23:15 UTC (Wed)
by ras (subscriber, #33059)
Sounds almost reasonable.
I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test displays "Hello World" properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi code point graphemes.
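For anyone who wants to be that renegade, a small Python 3 check along these lines shows the problem. It is a sketch using the standard `re` module; the pattern and the function name are assumptions chosen for the example.

```python
import re
import unicodedata

# In Python's re module "." matches one code point, not one grapheme.
single = re.compile(r".")


def matches_as_one_char(text: str) -> bool:
    """True if the whole string is matched by a single '.'."""
    return single.fullmatch(text) is not None


composed = "\u00e4"      # ä as one code point
decomposed = "a\u0308"   # ä as 'a' plus COMBINING DIAERESIS

print(matches_as_one_char(composed))
print(matches_as_one_char(decomposed))
print(matches_as_one_char(unicodedata.normalize("NFC", decomposed)))
```

Normalising input to NFC before matching hides the issue for characters that have a precomposed form, but not for clusters that have none.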
Posted Jan 10, 2018 23:31 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.
Posted Jan 11, 2018 0:40 UTC (Thu)
by ras (subscriber, #33059)
I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.
I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about it (I suspect the ones who do care are a tiny few, and only some of the time), the only way is to make code point == grapheme.
It is nice to have a fallback to its root for a grapheme you can't display, but that could also be handled by the libraries that do the displaying. There are really only a few programs that care overly: browsers, email clients and word processors spring to mind. ls, cat and the hundreds of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python 3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is an '0' in the string, albeit accompanied by a code point that makes it not an '0'. Who is going to check for that?
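A rough Python 3 sketch of the kind of check nobody writes: the string below really does contain a '0', but a combining mark after it changes what the reader sees. The function name and the choice of U+0308 as the mark are assumptions for illustration, and testing the general category for "M" is one possible heuristic, not a complete confusables check.

```python
import unicodedata


def has_combining_marks(text: str) -> bool:
    """True if any code point is in a Unicode Mark category (Mn, Mc, Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


plain = "0"
decorated = "0\u0308"  # '0' plus COMBINING DIAERESIS: renders as one glyph

# A naive substring test still finds the digit in both strings.
print("0" in plain, "0" in decorated)
# Only an explicit check reveals that the second one is not a bare '0'.
print(has_combining_marks(plain), has_combining_marks(decorated))
```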
What I would class as a good reason for not making code point == grapheme would be the number of code points exceeding 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.
Posted Jan 11, 2018 0:44 UTC (Thu)
by ras (subscriber, #33059)