Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
In reply to: Python 3, ASCII, and UTF-8 by HelloWorld
Parent article: Python 3, ASCII, and UTF-8

And don't forget about Unihan and variation selectors: https://en.wikipedia.org/wiki/Han_unification#Examples_of... - yet ANOTHER can of worms.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:20 UTC (Wed) by ras (subscriber, #33059)

> yet ANOTHER can of worms.

Having spent my life in a country where ASCII covers every character we use plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

A desire to group multiple graphical variants of one character into one code point. It also gives a possibility of fallbacks - variant-unaware fonts can just have one variant which will (probably) be kinda understood by most speakers. With separate code points you'll just get boxes in place of missing variants.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:15 UTC (Wed) by ras (subscriber, #33059)

> A desire to group multiple graphical variants of one character into one code point.

Sounds almost reasonable.

I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test displays "Hello World" properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi-code-point graphemes.
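For anyone who wants to be that renegade, here is a minimal sketch of the kind of test meant above. It assumes Python 3's standard `re` and `unicodedata` modules; the helper name `is_single_dot_match` is made up for illustration. It shows that `.` matches one code point, not one grapheme, and that NFC normalization only helps when a precomposed form exists.

```python
import re
import unicodedata

def is_single_dot_match(text):
    """True if the regex '.' consumes the whole string, i.e. one code point."""
    return re.fullmatch(r".", text) is not None

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two code points.
decomposed = "e\u0301"
# NFC folds the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# 'q' followed by U+0303 COMBINING TILDE has no precomposed form,
# so normalization leaves it as two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0303")

print(len(decomposed), is_single_dot_match(decomposed))
print(len(composed), is_single_dot_match(composed))
print(len(no_precomposed), is_single_dot_match(no_precomposed))
```

Normalizing before matching papers over the common Latin cases, but it is not a general fix.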

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Complex graphemes are a foregone conclusion anyway, so why not add more?

It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.
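To put rough numbers on that example, the sketch below counts code points and then groups each combining mark with the character before it. This is only an approximation: real grapheme segmentation follows the Unicode text segmentation rules (UAX #29), which the Python standard library does not implement, and the helper `rough_clusters` is a hypothetical name. The string is written with escapes so the code points are explicit.

```python
import unicodedata

# The Devanagari phrase from the comment above, spelled out as code points.
text = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940 \u0932\u093f\u092a\u093f"

def rough_clusters(s):
    """Attach every mark (general category M*) to the preceding character.

    A crude stand-in for grapheme clusters; it ignores viramas, joiners
    and the other rules a full implementation would handle.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

clusters = rough_clusters(text)
print(len(text), "code points")
print(len(clusters), "approximate clusters")
```

Even with this simple rule, the two counts differ, which is why cursor movement and deletion in an editor get interesting.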

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:40 UTC (Thu) by ras (subscriber, #33059)

> देवनागरी लिपि

I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.

I guess the point I was trying to make is that the only way this stuff has any chance of just working in a program written by a programmer who doesn't care much about it (which I suspect is all but a tiny few, and even those care only some of the time) is to make code point == grapheme.

It is nice to have a fallback to its root for a grapheme you can't display, but that could also be handled by the libraries that do the displaying. There are really only a few programs that care overly: browsers, email clients and word processors spring to mind. ls, cat and the hundreds of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python 3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is an '0' in the string, albeit preceded by a code point that makes it not an '0'. Who is going to check for that?
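As a rough illustration of what such a check could look like, the sketch below flags ASCII characters that are immediately followed by a combining mark or variation selector (in Unicode's encoding the modifier comes after the base character). The function name `modified_ascii_positions` and the sample string are made up for illustration, and this is nowhere near a complete confusables check.

```python
import unicodedata

def modified_ascii_positions(s):
    """Return indexes of ASCII characters directly followed by a mark.

    Marks are code points whose general category starts with 'M', which
    covers combining accents, enclosing marks and variation selectors.
    """
    hits = []
    for i in range(len(s) - 1):
        if ord(s[i]) < 128 and unicodedata.category(s[i + 1]).startswith("M"):
            hits.append(i)
    return hits

# '1', then '0' + U+0338 COMBINING LONG SOLIDUS OVERLAY, then a plain '0'.
suspect = "10\u03380"

print("0" in suspect)                      # a naive substring test still sees '0'
print(modified_ascii_positions(suspect))   # indexes of the altered characters
print(modified_ascii_positions("100"))     # plain ASCII yields no hits
```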

What I would class as a good reason for not making code point == grapheme would be the number of code points exceeding 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:44 UTC (Thu) by ras (subscriber, #33059)

Oh, and thanks for responding. I've learned a lot through this exchange.