
Fedora and Python 2


Posted Apr 5, 2018 20:19 UTC (Thu) by hsivonen (subscriber, #91034)
Parent article: Fedora and Python 2

It would be good if the distros pooled effort to maintain Tauthon (https://github.com/naftaliharris/tauthon/blob/master/READ...) and shipped it as the package providing the /usr/bin/python program.



Fedora and Python 2

Posted Apr 5, 2018 23:09 UTC (Thu) by smurf (subscriber, #17840) [Link] (24 responses)

Cute name, but the declining rate of changes since September 2017 seems to indicate that some work would be required to revive it.

Frankly, I don't see the appeal. You need Python 3 features, you use Python 3.

Fedora and Python 2

Posted Apr 6, 2018 3:38 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (1 responses)

What does that provide that `future` doesn't?

Fedora and Python 2

Posted Apr 6, 2018 13:09 UTC (Fri) by smurf (subscriber, #17840) [Link]

A whole damn lot. async/await, "yield from", type annotations, … read the linked-to web page.

Fedora and Python 2

Posted Apr 11, 2018 21:19 UTC (Wed) by togga (guest, #53103) [Link] (21 responses)

"You need Python 3 features, you use Python 3."

What if you need some Python 3 features but without the Python 3 encode/decode string-hell?

Except for the GIL and threading, Python 2 was quite a productive language. Tauthon would be a nice starting point for distros wanting to support and maintain legacy code.

My understanding is that most of the Py2-to-Py3 conversions out there have been a waste of time. It is sad to see NumPy in the middle of this.

Fedora and Python 2

Posted Apr 12, 2018 14:26 UTC (Thu) by ceplm (subscriber, #41334) [Link] (19 responses)

> What if you need some Python 3 features but without the Python3 encode/decode string-hell?

There is no encode/decode hell, there are only programmers who should be peeling onions in a submarine (https://wp.me/p83KNI-eH).

Fedora and Python 2

Posted Apr 12, 2018 19:41 UTC (Thu) by togga (guest, #53103) [Link]

I get it. When peeling onions in a submarine, all the encode/decode issues don't feel like hell anymore. The article you referenced made about as much progress on the Py3 design issues as the onions did.

Fedora and Python 2

Posted Apr 12, 2018 20:29 UTC (Thu) by peniblec (subscriber, #111147) [Link] (17 responses)

Correct me if I’m wrong, but Joel’s point in this article is that:

It does not make sense to have a string without knowing what encoding it uses. […] If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

To paraphrase, if you have to display any kind of text to a human user, you (or your programming environment) must explicitly know what encoding to use to translate the byte streams you carry around into intelligible characters.

Now AFAIU, when people complain about the “encode/decode string-hell” they are not really disputing this. From what I gather, these people deplore that by default, various parts of Python 3’s standard library expect their inputs to be Unicode characters, in contexts where there is no reason for them to be.

Personally, while I enjoy Python 3 overall, I agree that the decision to have streams default to meat-world characters rather than bytes is debatable. Not every program has to deal with human-readable strings.

Let’s say though that we all collectively agreed that Python having a bias toward human text is a good thing: let’s assume that dealing with byte-streams that do not map to Unicode characters is so rare that having to sprinkle a few bs and .buffers here and there is not a deal-breaker.

Even then, Python’s approach to human text feels somewhat naive: lengths, indexing, iteration and comparison are all based on code points, which AFAIU do not really represent anything meaningful in meat-space.

For example, Python 3 thinks that 'é' != 'é' because one is 'e'+'\N{COMBINING ACUTE ACCENT}' and the other is '\N{LATIN SMALL LETTER E WITH ACUTE}'. My French AZERTY keyboard makes typing the latter straightforward; I understand that GTK applications make it easy to type the former with “e Control-Shift-U 301”.

I can’t think of a program geared toward human interaction that should consider these two strings different. Python does offer unicodedata.normalize() to solve this specific problem; must we rely on every text-handling Python program out there to route its input through this function? Arguably, shouldn’t the language abstract these minutiae away from us?
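
For the record, the two spellings above really do compare unequal until you normalize them; a quick illustration using only the standard library:

```python
import unicodedata

composed = '\N{LATIN SMALL LETTER E WITH ACUTE}'   # 'é' as one code point
decomposed = 'e\N{COMBINING ACUTE ACCENT}'         # 'é' as two code points

# Code-point comparison says they differ...
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)

# ...but NFC normalization makes them compare equal.
nfc = lambda s: unicodedata.normalize('NFC', s)
assert nfc(composed) == nfc(decomposed)
```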

tl;dr: While Joel’s article is a classic and a must-read, I’m not sure it addresses the problems raised by Python 3’s critics:

  • the language’s preference toward meat-space characters adds hoops to jump through when dealing with genuine byte-streams;

  • the language’s naive handling of these meat-space characters adds hoops to jump through when dealing with those too.

Fedora and Python 2

Posted Apr 12, 2018 22:45 UTC (Thu) by dvdeug (guest, #10998) [Link] (3 responses)

There's certainly an argument for normalization, but every person annoyed by Python 3 would likely be more pissed off if, by default, it silently changed text when reading it in. Imagine a text editor where you opened your new novel, "Nous étions à l’étude, quand le Proviseur entra, ..." and changed it to "Nous étudiions, quand le Directeur entra, ..." and fed it back to git to discover a diff that changed every single line in the file.

Fedora and Python 2

Posted Apr 13, 2018 6:05 UTC (Fri) by peniblec (subscriber, #111147) [Link] (2 responses)

I may not have thought enough about this, but couldn't this text editor normalize tokens only for some operations (e.g. character-count, searching) and otherwise preserve the file's content, only effectively changing the parts the user actually edited?

Fedora and Python 2

Posted Apr 14, 2018 6:26 UTC (Sat) by dvdeug (guest, #10998) [Link] (1 responses)

If Python normalized text by default, the text editor would have a hard time doing that.

Fedora and Python 2

Posted Apr 14, 2018 11:48 UTC (Sat) by peniblec (subscriber, #111147) [Link]

OK. Let’s say Python’s string type uses normalization/grapheme clusters/nanomachines to correctly compare sequences of Unicode characters. Would that necessarily make a text editor overzealously normalize your whole file, thus polluting your patch?

I don’t know how actual text editors do it, but I imagine that their representation of your file’s content is more nuanced than simply “whatever open(filename) returned”. I would assume that they represent a “file” as sequences of opaque “word” or “line” objects, each of those objects having methods to

  • get their position in the file’s byte-stream (start and end offset, cached once decoded), so that the editor knows where to apply changes;

  • get their “canonical” Unicode representation, so that the editor can do whatever an editor is supposed to do with meat-space characters (comparison for search-and-replace, length computation for line-wrapping).

So with such a design, I don’t think “Python’s str canonicalizing behind your back” would necessarily lead to “OMG this commit is full of extraneous crap introduced by this dumb Python text editor”. Again, I might not have thought enough about this, maybe the above does nothing to solve the problem.

(Congratulations, you’ve nerd-sniped me into designing a text editor ;) )

Alternative workaround: teach our diffing tools to normalize text before computing differences :D

They do already let us skip whitespace changes, for example, which is a subclass of the more general category of “things computers care about despite being mostly irrelevant to meatbags”.
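
The "normalize before diffing" workaround is easy enough to sketch with the standard library; `normalized_diff` here is a hypothetical helper for illustration, not something any diff tool actually ships:

```python
import difflib
import unicodedata

def normalized_diff(a_lines, b_lines):
    """Unified diff of two line lists, NFC-normalizing each line first
    (hypothetical helper, for illustration only)."""
    nfc = lambda lines: [unicodedata.normalize('NFC', line) for line in lines]
    return list(difflib.unified_diff(nfc(a_lines), nfc(b_lines), lineterm=''))

old = ['Nous e\u0301tions a\u0300 l\u2019e\u0301tude']   # combining accents
new = ['Nous \u00e9tions \u00e0 l\u2019\u00e9tude']      # precomposed accents

# A byte-wise diff sees every accented word as changed;
# the normalized one reports no difference at all.
assert old != new
assert normalized_diff(old, new) == []
```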

Fedora and Python 2

Posted Apr 12, 2018 23:07 UTC (Thu) by HelloWorld (guest, #56129) [Link]

It's interesting how many languages get this wrong. For instance, Java doesn't even give you code points, much less grapheme clusters. Instead, it gives you 16-bit “char” values (“code units” in Unicode-speak), plus methods like codePointAt that return the code point at some (char-based) position in the string. And when you want to iterate over the code points in a string, it's unclear how you're supposed to get from one index to the next, i.e. whether you need to increase the index by one or by two. It might be that you need to compare against Character.MAX_LOW_SURROGATE, but I'm not sure… Needless to say, it doesn't help you deal with grapheme clusters at all; apparently you're supposed to use third-party libraries like icu4j. All in all, it's a clusterfuck (ba-dum tss!)
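
For contrast, CPython (since 3.3's flexible string representation) indexes by code point rather than by UTF-16 code unit, so the surrogate-pair bookkeeping described above doesn't arise; grapheme clusters, though, still need a third-party library there too. A small illustration:

```python
# U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
s = '\N{GRINNING FACE}'

assert len(s) == 1                       # one code point, one index
assert ord(s) == 0x1F600
assert len(s.encode('utf-16-le')) == 4   # but two 16-bit code units
```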

Fedora and Python 2

Posted Apr 13, 2018 1:38 UTC (Fri) by smurf (subscriber, #17840) [Link] (7 responses)

> the language’s naive handling of these meat-space characters adds hoops to jump through when dealing with those too.

You need to normalize Unicode before doing meaningful things with it. That's a given in any programming language.

You might find fault with the people who invented Unicode. Blaming your (non-)choice of programming language isn't going to help, except that I can think of lots of ways to make it worse. Just look at Java.

Fedora and Python 2

Posted Apr 13, 2018 6:23 UTC (Fri) by peniblec (subscriber, #111147) [Link] (1 responses)

If normalization is so obviously needed before dealing with Unicode strings, wouldn’t it make sense for languages to take care of it by default?

For example, a language’s string-comparison function could automatically make normalized copies of its operands and compare these; users who actually want to compare codepoints could use something like list(s1.codepoints()) == list(s2.codepoints()).

(Not sure what iteration should produce by default, though. Grapheme clusters?)

Maybe performance would take such a hit that it makes sense to let the user ask for normalization explicitly.

Disclaimer: I don’t actually know any language which deals with Unicode strings this way; then again, I don’t actually know many languages.
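
A sketch of the comparison semantics proposed above, expressed in current Python; `NormStr` and its `codepoints()` escape hatch are invented names for illustration, not a real library API:

```python
import unicodedata

class NormStr(str):
    """Hypothetical string type whose == compares NFC-normalized forms."""

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        nfc = unicodedata.normalize
        return nfc('NFC', str(self)) == nfc('NFC', str(other))

    def __hash__(self):
        # Equal-by-normalization strings must hash alike.
        return hash(unicodedata.normalize('NFC', str(self)))

    def codepoints(self):
        """Escape hatch for raw code-point comparison."""
        return [ord(c) for c in str(self)]

# Normalization-aware equality by default...
assert NormStr('e\u0301') == NormStr('\u00e9')
# ...while raw code points remain distinguishable on request.
assert NormStr('e\u0301').codepoints() != NormStr('\u00e9').codepoints()
```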

Fedora and Python 2

Posted Apr 13, 2018 6:47 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes, that's exactly what Perl 6 did. It encodes the text into grapheme clusters, the stuff that people think about as characters. They can be directly indexed, used in splits and so on. As far as I know, that's the only mainstream(-ish) language that does this.

I personally wouldn't have been so opposed to Py3 if it were to do the same. Unicode is a hard problem and full support of it might require compromises.

Fedora and Python 2

Posted Apr 13, 2018 12:44 UTC (Fri) by HelloWorld (guest, #56129) [Link] (4 responses)

Apparently canonicalisation isn't the solution either. I found this interesting comment elsewhere:
https://mortoray.com/2013/11/27/the-string-type-is-broken...

The essential bit: “Unfortunately, the standard normalisation forms are buggy, and under the current stability policy, cannot be fixed. One example of this that I know is U+387 GREEK ANO TELEIA, which wrongly decomposes canonically (!) into U+00B7 MIDDLE DOT (the Greek name even means literally “upper dot”). This means that some processes may choose to avoid normalisation, because, even the canonical forms risk losing important information.”

Fedora and Python 2

Posted Apr 13, 2018 16:39 UTC (Fri) by ceplm (subscriber, #41334) [Link]

The standard reply to every "foo is known to be buggy" is "and what's the bug number?" Also, I would ask the author of the comment why the bug cannot be fixed. It doesn't make sense to me.

Fedora and Python 2

Posted Apr 13, 2018 17:48 UTC (Fri) by sfeam (subscriber, #2841) [Link] (2 responses)

U+0387 doesn't "decompose" into anything. It's not a combining form. It is an example of a character in one alphabet whose common written form happens to look like a character from some other alphabet or set of conventional symbols. Because they look similar [in typical fonts] people tend to type whichever is more convenient. But neither one is the "canonical form" of the other. A more familiar pair would be Greek letter "mu" (U+03BC) and the scientific prefix "micro" (U+00B5). The existence of such pairs can be a problem, but it's a different problem than canonicalization. While it might make sense to be suspicious of micro signs appearing in what is otherwise a Greek alphabet URL, it would be a bad idea to replace all micro signs with "mu" (or vice versa) in a document that happened to include both Greek text and quantities in SI units.

Fedora and Python 2

Posted Apr 13, 2018 19:38 UTC (Fri) by jwilk (subscriber, #63328) [Link] (1 responses)

>>> unicodedata.normalize('NFD', u'\u0387') == u'\xB7'
True

Fedora and Python 2

Posted Apr 13, 2018 21:37 UTC (Fri) by sfeam (subscriber, #2841) [Link]

Well that's a bug then, isn't it.

Fedora and Python 2

Posted Apr 13, 2018 10:30 UTC (Fri) by ceplm (subscriber, #41334) [Link] (3 responses)

My point with Joel's article was that, in my experience, a large part of the people complaining about the Python 3 encoding/decoding hell are those who still believe that "a string is a bunch of bytes" is enough, because they live in the bubble of languages where it is enough (i.e., English and Western European languages). I recently converted M2Crypto to be py2k/py3k-straddling and I had no problems with Unicode encoding/decoding. What I had problems with, and plenty of them, was that the completely messy py2k str/unicode/bytes situation completely confused the real one. Instead of blaming py3k, I keep blaming py2k and those programmers for their cuckoo-land "a character is one byte" delusion.

And yes, I agree that the implementation in py3k is not perfect, and that conversion between on-wire eight-bit-per-character data and a proper str is sometimes problematic, but a lot of work has been spent on it already and the situation is not so bleak that I would call it any kind of hell.

Certainly, compared to the disaster py2k was, py3k is a huge improvement.

Fedora and Python 2

Posted Apr 14, 2018 13:38 UTC (Sat) by peniblec (subscriber, #111147) [Link] (2 responses)

Fair enough. I’ve mostly only worked in Python 3 codebases, and the only place where I hear people debate the str-vs-bytes business is on LWN.

That restricts my sample of arguments against Python 3 to the high-level design issues I mentioned; I have not been “in the trenches” migrating sloppy code to Python 3. In my imagination the “characters are bytes” camp (and their code) had been dissolved during the noughties; I guess that was wishful thinking :)

Fedora and Python 2

Posted Apr 14, 2018 15:15 UTC (Sat) by excors (subscriber, #95769) [Link] (1 responses)

Perhaps the issue is that some people (particularly people on LWN) are more interested in systems programming, and their programs typically deal with protocols and file formats and APIs that are primarily byte-based and occasionally contain human-readable text, whereas other people are more interested in e.g. web programming where their data is primarily human-readable inputs and outputs and modern Unicode-based file formats (HTML, CSS, etc).

People in the first category might understand Unicode perfectly well, but they often need to deal with e.g. filenames (which aren't really Unicode on Linux or Windows), or with e.g. HTTP headers (where the encoding is unclearly specified and real data often violates the specification anyway), and they want a language that makes it easy and natural to process data like that. Python 3 makes it less easy and less natural than Python 2, since the language and the libraries tend to default to Unicode strings, so those people are unhappy. Meanwhile people in the second category prefer having everything be Unicode by default, since that's all they use anyway. Neither side is wrong or ignorant, they just have different use cases and different requirements, and Python failed to find a way to satisfy both groups.
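
The filename case is the one Python 3 eventually papered over with the `surrogateescape` error handler (PEP 383): byte sequences that aren't valid in the filesystem encoding round-trip through `str` as lone surrogates. For instance:

```python
raw = b'report-\xff.txt'   # not valid UTF-8, but a legal Linux filename

# os.fsdecode() does essentially this under the hood on Linux:
name = raw.decode('utf-8', errors='surrogateescape')

assert '\udcff' in name                                       # stray byte became a lone surrogate
assert name.encode('utf-8', errors='surrogateescape') == raw  # lossless round-trip
```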

Fedora and Python 2

Posted Apr 14, 2018 16:48 UTC (Sat) by SiB (subscriber, #4048) [Link]

Exactly!

In our department (physics) we use Python for data analysis and for instrumentation control (including space flight). Python 3 is perfectly fine for the data analysis. Instrumentation control uses the Python REPL as its commanding interface, where Python 2 is still ahead.

Fedora and Python 2

Posted Apr 28, 2018 20:22 UTC (Sat) by RooTer (guest, #91640) [Link]

> What if you need some Python 3 features but without the Python3 encode/decode string-hell?

Having developed Python apps in both Python 2 and 3 for years, I would say the encode/decode hell exists in the Python 2 realm, not in 3. It seems like stupid `UnicodeDecodeError`s plague almost every Python 2 project, and switching to Python 3 would be a good idea just for the clear str/bytes distinction.
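
That clear distinction shows up as an immediate TypeError in Python 3, instead of Python 2's implicit ASCII coercion that only blew up later, on the first non-ASCII input. A minimal illustration:

```python
# Python 2 would silently coerce b'abc' + u'def' via ASCII and fail much
# later, on non-ASCII data. Python 3 refuses to mix the types at all:
try:
    b'abc' + 'def'
    raise AssertionError('unreachable: mixing bytes and str must fail')
except TypeError:
    pass   # the error surfaces immediately, at the point of mixing
```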

Fedora and Python 2

Posted Apr 6, 2018 11:33 UTC (Fri) by Otus (subscriber, #67685) [Link]

Or perhaps pypy would be a better alternative?


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds