Fedora and Python 2
Posted Apr 5, 2018 20:19 UTC (Thu) by hsivonen (subscriber, #91034)
Parent article: Fedora and Python 2
Posted Apr 5, 2018 23:09 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (24 responses)
Frankly, I don't see the appeal. You need Python 3 features, you use Python 3.
Posted Apr 6, 2018 3:38 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Apr 6, 2018 13:09 UTC (Fri)
by smurf (subscriber, #17840)
[Link]
Posted Apr 11, 2018 21:19 UTC (Wed)
by togga (guest, #53103)
[Link] (21 responses)
What if you need some Python 3 features, but without the Python 3 encode/decode string-hell?
Except for the GIL and threading, Python 2 was quite a productive language. Tauthon would be a nice starting point for distros wanting to support and maintain legacy code.
My understanding is that most of the Py2-to-Py3 conversions out there have been a waste of time. It is sad to see NumPy in the middle of this.
Posted Apr 12, 2018 14:26 UTC (Thu)
by ceplm (subscriber, #41334)
[Link] (19 responses)
There is no encode/decode hell; there are only programmers who should peel onions in a submarine (https://wp.me/p83KNI-eH).
Posted Apr 12, 2018 19:41 UTC (Thu)
by togga (guest, #53103)
[Link]
Posted Apr 12, 2018 20:29 UTC (Thu)
by peniblec (subscriber, #111147)
[Link] (17 responses)
Correct me if I’m wrong, but Joel’s point in this article is that: “It does not make sense to have a string without knowing what encoding it uses. […] If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.”

To paraphrase, if you have to display any kind of text to a human user, you (or your programming environment) must explicitly know what encoding to use to translate the byte streams you carry around into intelligible characters.

Now AFAIU, when people complain about the “encode/decode string-hell” they are not really disputing this. From what I gather, these people deplore that by default, various parts of Python 3’s standard library expect their inputs to be Unicode characters, in contexts where there is no reason for them to be.

Personally, while I enjoy Python 3 overall, I agree that the decision to have streams default to meat-world characters rather than bytes is debatable. Not every program has to deal with human-readable strings.

Let’s say, though, that we all collectively agreed that Python having a bias toward human text is a good thing: let’s assume that dealing with byte-streams that do not map to Unicode characters is so rare that having to sprinkle a few b's and .buffer's here and there is not a deal-breaker.

Even then, Python’s approach to human text feels somewhat naive: lengths, indexing, iteration and comparison are all based on code points, which AFAIU do not really represent anything meaningful in meat-space. For example, Python 3 thinks that 'é' != 'é', because one is 'e' + '\N{COMBINING ACUTE ACCENT}' and the other is '\N{LATIN SMALL LETTER E WITH ACUTE}'. My French AZERTY keyboard makes typing the latter straightforward; I understand that GTK applications make it easy to type the former with “e Control-Shift-U 301”. I can’t think of a program geared toward human interaction that should consider these two strings different. Python does offer unicodedata.normalize() to solve this specific problem; must we rely on every text-handling Python program out there to make its input go through this function? Arguably, shouldn’t the language abstract these minutiae away from us?

tl;dr: While Joel’s article is a classic and a must-read, I’m not sure it addresses the problems raised by Python 3’s critics: the language’s preference toward meat-space characters adds hoops to jump through when dealing with genuine byte-streams; the language’s naive handling of these meat-space characters adds hoops to jump through when dealing with those too.
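A minimal sketch of the comparison behaviour described above, using only the standard library:

    import unicodedata

    composed = '\N{LATIN SMALL LETTER E WITH ACUTE}'   # 'é' as one code point
    decomposed = 'e\N{COMBINING ACUTE ACCENT}'         # 'é' as two code points

    print(composed == decomposed)          # False: equality compares code points
    print(len(composed), len(decomposed))  # 1 2

    # Normalizing both sides (here to NFC) makes them compare equal:
    print(unicodedata.normalize('NFC', composed) ==
          unicodedata.normalize('NFC', decomposed))    # True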
Posted Apr 12, 2018 22:45 UTC (Thu)
by dvdeug (guest, #10998)
[Link] (3 responses)
Posted Apr 13, 2018 6:05 UTC (Fri)
by peniblec (subscriber, #111147)
[Link] (2 responses)
Couldn’t the editor normalize tokens only for some operations (e.g. character-count, searching) and otherwise preserve the file's content, only effectively changing the parts the user actually edited?
Posted Apr 14, 2018 6:26 UTC (Sat)
by dvdeug (guest, #10998)
[Link] (1 responses)
Posted Apr 14, 2018 11:48 UTC (Sat)
by peniblec (subscriber, #111147)
[Link]
OK. Let’s say Python’s string type uses normalization/grapheme clusters/nanomachines to correctly compare sequences of Unicode characters. Would that necessarily make a text editor overzealously normalize your whole file, thus polluting your patch?

I don’t know how actual text editors do it, but I imagine that their representation of your file’s content is more nuanced than simply “whatever open(filename) returned”. I would assume that they represent a “file” as sequences of opaque “word” or “line” objects, each of those objects having methods to:

- get their position in the file’s byte-stream (start and end offset, cached once decoded), so that the editor knows where to apply changes;
- get their “canonical” Unicode representation, so that the editor can do whatever an editor is supposed to do with meat-space characters (comparison for search-and-replace, length computation for line-wrapping).

So with such a design, I don’t think “Python’s str canonicalizing behind your back” would necessarily lead to “OMG this commit is full of extraneous crap introduced by this dumb Python text editor”. Again, I might not have thought enough about this; maybe the above does nothing to solve the problem.

(Congratulations, you’ve nerd-sniped me into designing a text editor ;) )

Alternative workaround: teach our diffing tools to normalize text before computing differences :D They do already let us skip whitespace changes, for example, which is a subclass of the more general category of “things computers care about despite being mostly irrelevant to meatbags”.
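To make the hand-waving concrete, here is a hypothetical sketch of such a token object; the Token name, its fields, and the NFC choice are all assumptions for illustration, not how any real editor works:

    import unicodedata

    class Token:
        """Hypothetical editor token: remembers the exact bytes it came
        from, and only normalizes copies for text-level operations."""

        def __init__(self, raw: bytes, start: int, end: int, encoding: str = 'utf-8'):
            self.raw = raw                       # written back verbatim on save
            self.start, self.end = start, end    # offsets in the file's byte-stream
            self.text = raw.decode(encoding)     # decoded but *not* normalized

        def matches(self, needle: str) -> bool:
            # Search compares NFC-normalized copies, so composed and
            # decomposed 'é' both match, while self.raw stays untouched.
            return (unicodedata.normalize('NFC', needle)
                    in unicodedata.normalize('NFC', self.text))

    # Example: search matches regardless of composed/decomposed input.
    t = Token('caf\u00e9'.encode('utf-8'), start=0, end=5)
    print(t.matches('cafe\u0301'))   # True

Only tokens the user actually edits would be re-encoded; everything else is written back from raw, so a diff would only show real changes.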
Posted Apr 12, 2018 23:07 UTC (Thu)
by HelloWorld (guest, #56129)
[Link]
Posted Apr 13, 2018 1:38 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (7 responses)
You need to normalize Unicode before doing meaningful things with it. That's a given in any programming language.
You might find fault with the people who invented Unicode. Blaming your (non-)choice of programming language isn't going to help, except that I can think of lots of ways to make it worse. Just look at Java.
Posted Apr 13, 2018 6:23 UTC (Fri)
by peniblec (subscriber, #111147)
[Link] (1 responses)
If normalization is so obviously needed before dealing with Unicode strings, wouldn’t it make sense for languages to take care of it by default? For example, a language’s string-comparison function could automatically make normalized copies of its operands and compare these; users who actually want to compare codepoints could use something like list(s1.codepoints()) == list(s2.codepoints()).

(Not sure what iteration should produce by default, though. Grapheme clusters?)

Maybe performance would take such a hit that it makes sense to let the user ask for normalization explicitly.

Disclaimer: I don’t actually know any language which deals with Unicode strings this way; then again, I don’t actually know many languages.
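One way to picture this in Python itself: a toy normalizing string type (the NormStr name and the codepoints() method are invented for illustration):

    import unicodedata

    class NormStr(str):
        """Toy string type whose equality is based on normalized form."""

        def __eq__(self, other):
            return (unicodedata.normalize('NFC', str(self)) ==
                    unicodedata.normalize('NFC', str(other)))

        def __hash__(self):
            return hash(unicodedata.normalize('NFC', str(self)))

        def codepoints(self):
            # Escape hatch for callers who really want raw code points.
            return iter(str(self))

    s1, s2 = NormStr('e\u0301'), NormStr('\u00e9')
    print(s1 == s2)                                        # True
    print(list(s1.codepoints()) == list(s2.codepoints()))  # False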
Posted Apr 13, 2018 6:47 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I personally wouldn't have been so opposed to Py3 if it had done the same. Unicode is a hard problem, and full support for it might require compromises.
Posted Apr 13, 2018 12:44 UTC (Fri)
by HelloWorld (guest, #56129)
[Link] (4 responses)
https://mortoray.com/2013/11/27/the-string-type-is-broken...

The essential bit: “Unfortunately, the standard normalisation forms are buggy, and under the current stability policy, cannot be fixed. One example of this that I know is U+387 GREEK ANO TELEIA, which wrongly decomposes canonically (!) into U+00B7 MIDDLE DOT (the Greek name even means literally “upper dot”). This means that some processes may choose to avoid normalisation, because, even the canonical forms risk losing important information.”
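The decomposition in question is easy to check against Python's own Unicode database:

    import unicodedata

    # U+0387 GREEK ANO TELEIA carries a canonical (singleton) decomposition
    # to U+00B7 MIDDLE DOT, so every normalization form erases the distinction:
    print(unicodedata.decomposition('\u0387'))                # '00B7'
    print(unicodedata.normalize('NFC', '\u0387') == '\u00b7') # True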
Posted Apr 13, 2018 16:39 UTC (Fri)
by ceplm (subscriber, #41334)
[Link]
Posted Apr 13, 2018 17:48 UTC (Fri)
by sfeam (subscriber, #2841)
[Link] (2 responses)
U+0387 doesn't "decompose" into anything. It's not a combining form. It is an example of a character in one alphabet whose common written form happens to look like a character from some other alphabet or set of conventional symbols. Because they look similar [in typical fonts] people tend to type whichever is more convenient. But neither one is the "canonical form" of the other. A more familiar pair would be the Greek letter "mu" (U+03BC) and the scientific prefix "micro" (U+00B5). The existence of such pairs can be a problem, but it's a different problem than canonicalization. While it might make sense to be suspicious of micro signs appearing in what is otherwise a Greek-alphabet URL, it would be a bad idea to replace all micro signs with "mu" (or vice versa) in a document that happened to include both Greek text and quantities in SI units.
Posted Apr 13, 2018 10:30 UTC (Fri)
by ceplm (subscriber, #41334)
[Link] (3 responses)
And yes, I agree that the implementation in py3k is not perfect; conversion from on-wire eight-bit-per-character data to a proper str is sometimes problematic. But a lot of work has been spent on it already, and the situation is not so bleak that I would call it any kind of hell.

Certainly, compared to the disaster py2k was, py3k is a huge improvement.
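For instance, the usual advice for that on-wire case is to decode once, at the boundary, with an explicit error policy (the payload below is made up):

    # Hypothetical wire payload; real code would read this from a socket.
    payload = b'caf\xc3\xa9 \xff'   # mostly UTF-8, plus one stray byte

    # Decode once at the edge, choosing an error policy up front instead
    # of letting a UnicodeDecodeError surface deep inside the program.
    text = payload.decode('utf-8', errors='replace')
    print(text)                     # 'café \ufffd'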
Posted Apr 14, 2018 13:38 UTC (Sat)
by peniblec (subscriber, #111147)
[Link] (2 responses)
Fair enough. I’ve mostly only worked in Python 3 codebases, and the only place where I hear people debate the str-vs-bytes business is on LWN. That restricts my sample of arguments against Python 3 to the high-level design issues I mentioned; I have not been “in the trenches” migrating sloppy code to Python 3. In my imagination the “characters are bytes” camp (and their code) had been dissolved during the noughties; I guess that was wishful thinking :)
Posted Apr 14, 2018 15:15 UTC (Sat)
by excors (subscriber, #95769)
[Link] (1 responses)
People in the first category might understand Unicode perfectly well, but they often need to deal with e.g. filenames (which aren't really Unicode on Linux or Windows), or with e.g. HTTP headers (where the encoding is unclearly specified and real data often violates the specification anyway), and they want a language that makes it easy and natural to process data like that. Python 3 makes it less easy and less natural than Python 2, since the language and the libraries tend to default to Unicode strings, so those people are unhappy. Meanwhile people in the second category prefer having everything be Unicode by default, since that's all they use anyway. Neither side is wrong or ignorant, they just have different use cases and different requirements, and Python failed to find a way to satisfy both groups.
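The filename point is easy to reproduce (assuming a Linux filesystem, ideally with a non-UTF-8 name in the current directory):

    import os

    # On Linux a filename is a byte string and need not be valid UTF-8.
    # Passing bytes to os.listdir() keeps everything as bytes:
    for name in os.listdir(b'.'):
        print(name)

    # Passing str makes Python decode names with the filesystem encoding,
    # smuggling any undecodable bytes through as lone surrogates
    # (the 'surrogateescape' error handler):
    for name in os.listdir('.'):
        print(repr(name))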
Posted Apr 14, 2018 16:48 UTC (Sat)
by SiB (subscriber, #4048)
[Link]
In our department (physics) we use Python for data analysis and for instrumentation control (including space flight). Python 3 is perfectly fine for the data analysis. Instrumentation control uses the Python REPL as a commanding interface, where Python 2 is still ahead.
Posted Apr 28, 2018 20:22 UTC (Sat)
by RooTer (guest, #91640)
[Link]
Having developed Python apps in both Python 2 and 3 for years, I would say the encode/decode hell exists in the Python 2 realm, not in 3.
Posted Apr 6, 2018 11:33 UTC (Fri)
by Otus (subscriber, #67685)
[Link]
It seems a stupid `UnicodeDecodeError` plagues almost every Python 2 project, and a switch to Python 3 would be a good idea just for the clear str/bytes distinction.