Cannon: Why Python 3 exists
Posted Dec 19, 2015 0:28 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)In reply to: Cannon: Why Python 3 exists by barryascott
Parent article: Cannon: Why Python 3 exists
Use utf-8 strings. Duh.
> In Python2 it is far too easy to get the unicode strings and bytes messed up.
Don't use "unicode" strings. Simply use utf-8 encoded bytes.
That's it. You don't need anything more, really. Just validate utf-8 on the edges.
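As a rough illustration of "validate utf-8 on the edges", here is a minimal Python 3 sketch. The helper name `accept_utf8` is hypothetical and not from the thread; the idea is only that input is checked once at the boundary and then passed around as plain bytes.

```python
def accept_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is well-formed UTF-8, else raise ValueError.

    Hypothetical boundary check: the decoded result is thrown away, the
    program keeps working with the original bytes.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc
    return raw


# Well-formed input passes through untouched.
name = accept_utf8("naïve".encode("utf-8"))

# Malformed input is rejected at the edge instead of deep inside the program.
try:
    accept_utf8(b"\xff\xfe")
except ValueError as err:
    print("rejected:", err)
```

Whether this is enough depends on what the program later does with the text; the sketch only covers the validation step.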
Posted Dec 20, 2015 23:27 UTC (Sun)
by barryascott (subscriber, #80640)
[Link] (4 responses)
Given that Python 3 stores each string in the narrowest fixed-width representation that fits its code points, there is no space reason to use utf-8.
Barry
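The space claim can be checked with a small sketch. It compares encoded lengths only; `utf-16-le` is used here as a stand-in for a 2-bytes-per-code-point layout (an assumption of this example, valid for text within the Basic Multilingual Plane), and the sample string is arbitrary.

```python
# Five CJK code points, all within the Basic Multilingual Plane.
text = "中文字符串"

utf8_size = len(text.encode("utf-8"))       # 3 bytes per code point for these characters
two_byte_size = len(text.encode("utf-16-le"))  # 2 bytes per code point for these characters

print(utf8_size, two_byte_size)             # 15 10
print(utf8_size / two_byte_size)            # 1.5, i.e. about 50% larger in utf-8
```

For ASCII-only text the comparison goes the other way: utf-8 needs 1 byte per code point and the 2-byte layout needs 2.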
Posted Dec 21, 2015 2:57 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
And no, Python unicode strings do not, I repeat DO NOT, give you O(1) character indexing.
Perl's NFG strings do, but they have other issues. But at least Perl developers _tried_ to do something useful with the string type.
Posted Dec 22, 2015 11:21 UTC (Tue)
by sorokin (guest, #88478)
[Link]
Posted Dec 30, 2015 10:45 UTC (Wed)
by barryascott (subscriber, #80640)
[Link] (1 responses)
The pain is having to scan the string from the start to find the code point boundaries.
> And no, Python unicode strings do not, I repeat DO NOT, give you O(1) character indexing.
I just read the python 3.5.1 code and, unless I missed something in the macros, it is clearly O(1).
Posted Dec 30, 2015 11:12 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> I just read the python 3.5.1 code and, unless I missed something in the macros, it is clearly O(1).
Codepoints are NOT characters!!!
Read about it: https://en.wikipedia.org/wiki/Combining_character
Thanks for providing yet another confirmation of Python3's boneheaded design.
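The codepoint-versus-character distinction is easy to see in Python 3 itself. A minimal sketch, using only the standard `unicodedata` module; the example string is chosen for illustration:

```python
import unicodedata

# One user-visible character written as two code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))   # 2
print(len(composed))     # 1
print(decomposed[0])     # e  (indexing splits the accent off its base letter)
```

So `s[i]` is O(1) per code point, but a code point index does not necessarily land on a character boundary, and not every combining sequence has a precomposed form that normalization could fold it into.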
Posted Dec 21, 2015 5:06 UTC (Mon)
by MattJD (subscriber, #91390)
[Link] (1 responses)
I don't understand the difference you imply between utf-8 strings and unicode. utf-8 is just a method of storing unicode code points that is endian-independent and efficient for ASCII (among various other benefits). How does using utf-8 solve any of the unicode problems?
Go stores byte strings, which can be parsed as utf-8 returning unicode code points (in runes). That solves some conversion problems, but that doesn't explain how it solves any of the language handling parts. That is the hard part, as can be seen in the conversation over Perl 6's implementation.
> And no, Python unicode strings do not, I repeat DO NOT, give you O(1) character indexing.
Yes, I agree with that. But how does utf-8 solve this? utf-8 is still variable length, which has the same problem.
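To make the variable-length point concrete, here is a sketch of what finding the n-th code point in utf-8 bytes involves. The function `codepoint_at` is hypothetical and assumes its input is already well-formed utf-8; it has to walk the bytes from the start, so the lookup is O(n) rather than O(1).

```python
def codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th code point of well-formed UTF-8 by linear scan."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    seen = -1      # number of code point starts passed so far, minus one
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


sample = "aé中".encode("utf-8")   # 1-byte, 2-byte and 3-byte sequences
print(codepoint_at(sample, 1))    # é
```

This only locates code points. Grouping code points into user-perceived characters is a separate problem that neither representation solves by itself.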
Posted Dec 21, 2015 6:56 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
In Python2 a 'string' is just a sequence of bytes, without any particular encoding attached. Python3 went to great lengths to make a separate "unicode" string type and break existing code, since all the strings by default are now "unicode".
This was done, ostensibly, to infuse Python with some magical property of "unicodeness". And in the process of doing that the Python developers broke a LOT of existing software, sometimes forcing people to use ugly workarounds for Python3.
My problem with all of this is that this "unicode" transition did not produce anything of value. In some cases Py2 is actually more i18n-friendly than Python3!
I think the only advantage of Py3 regarding multi-language support is that it now treats the source code as UTF-8 by default (Py2 treats it as ASCII) and it also permits non-ASCII identifiers. And that's it, for the price of an untold number of breakages.
Indeed, for languages like Chinese, utf-8 will expand the memory needed by about 50%.
Just a useful link
It is far easier when you can just index into a vector of code points.
clearly O(1).
> It is far easier when you can just index into vector of code points.
Why would you NEED to index codepoints?
No, it's NOT. It gives O(1) indexing for _codepoints_, which is pretty much a useless feature.
> Use utf-8 strings. Duh.