Python 3, ASCII, and UTF-8
Posted Jan 10, 2018 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
In reply to: Python 3, ASCII, and UTF-8 by HelloWorld
Parent article: Python 3, ASCII, and UTF-8
Posted Jan 10, 2018 22:20 UTC (Wed)
by ras (subscriber, #33059)
[Link] (5 responses)
Having spent my life in a country where ASCII covers every character we use plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?
Posted Jan 10, 2018 22:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Jan 10, 2018 23:15 UTC (Wed)
by ras (subscriber, #33059)
[Link] (3 responses)
Sounds almost reasonable.
I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test displays "Hello World" properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi code point graphemes.
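For anyone who wants to be that renegade, here is a minimal sketch of the kind of test meant above. It assumes nothing beyond the standard library; the names COMPOSED, DECOMPOSED, first_dot_match and to_nfc are made up for illustration. It shows that a regex "." in Python's re module matches one code point, so it splits a grapheme written as base letter plus combining accent.

```python
import re
import unicodedata

# Two spellings of the same visible character "é".
COMPOSED = "\u00e9"      # single precomposed code point
DECOMPOSED = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def first_dot_match(text: str) -> str:
    """Return what the regex '.' matches at the start of text."""
    match = re.match(r".", text)
    return match.group(0) if match else ""


def to_nfc(text: str) -> str:
    """Compose base + combining sequences where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # len() counts code points, not graphemes.
    print(len(COMPOSED), len(DECOMPOSED))
    # '.' grabs only the bare "e" and leaves the accent behind.
    print(first_dot_match(DECOMPOSED) == "e")
    # Normalising to NFC hides the problem for this one character.
    print(to_nfc(DECOMPOSED) == COMPOSED)
```

Normalising to NFC only helps where a precomposed code point exists, so it is a partial workaround rather than a fix.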
Posted Jan 10, 2018 23:31 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.
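To see why editing that text is awkward, a rough sketch using only the standard library: it counts how many code points in the phrase are marks that attach to a neighbouring letter. The phrase is spelled with escapes so the code points are explicit, and SAMPLE and count_marks are illustrative names. Counting general category "M" is only an approximation of grapheme boundaries; it ignores conjuncts, joiners and the rest of the Unicode segmentation rules.

```python
import unicodedata

# The Devanagari phrase from the comment above, one escape per code point.
SAMPLE = (
    "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"  # first word
    " "
    "\u0932\u093f\u092a\u093f"                          # second word
)


def count_marks(text: str) -> int:
    """Count code points whose general category is a mark (Mn, Mc or Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))


if __name__ == "__main__":
    total = len(SAMPLE)
    marks = count_marks(SAMPLE)
    # Vowel signs are separate code points that render attached to a consonant.
    print(f"{total} code points, {marks} of them are marks")
```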
Posted Jan 11, 2018 0:40 UTC (Thu)
by ras (subscriber, #33059)
[Link]
I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.
I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about it (I suspect those who do care are a tiny few, and only some of the time), the way to do it is to make code point == grapheme.
It is nice to have a fallback for a grapheme you can't display to its root, but that could also be handled by libraries that do the displaying. There are really only a few programs that care overly - browsers, email clients and word processors spring to mind. ls, cat and the 100s of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is an '0' in the string, albeit preceded by a code point that makes it not an '0'. Who is going to check for that?
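As a sketch of what "checking for that" might look like, assuming the concern is a plain digit whose appearance is changed by a combining code point attached to it: a substring test still finds the '0', so the code has to look at the neighbouring code points as well. TAMPERED and mark_positions are hypothetical names, and U+0338 is just one example of a combining mark.

```python
import unicodedata

# A zero with U+0338 COMBINING LONG SOLIDUS OVERLAY attached to it.
TAMPERED = "0\u0338"


def mark_positions(text: str) -> list[int]:
    """Return the indexes of code points that are combining marks."""
    return [
        index
        for index, ch in enumerate(text)
        if unicodedata.category(ch).startswith("M")
    ]


if __name__ == "__main__":
    # A naive check still sees a plain zero in the string.
    print("0" in TAMPERED)
    # Looking at the categories reveals the extra code point.
    print(mark_positions(TAMPERED))
```

Rejecting every combining mark would also reject legitimate text in many scripts, so a real check would need a policy about which marks are acceptable where.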
What I would class as a good reason for not making code point == grapheme would be the number of code points exceeding 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.
Posted Jan 11, 2018 0:44 UTC (Thu)
by ras (subscriber, #33059)
[Link]